
Best local LLMs for private document work

The useful question is not which local LLM sounds smartest in chat. It is which one can survive private desktop work: scans, CSVs, messy folders, revised invoices, evidence paths, proof codes, and exact final files.

[Infographic: Local document work benchmark note. Labels: guide / Practical score; current suite / Resolved; resolved + core / Core pass; City Plan SVG separate / Visual sample.]

If I had to pick one reason to run a local LLM, it would be document work.

Not because invoices are exciting. They are not. But they are private, repetitive, full of small traps, and exactly the kind of thing people hesitate to send to a cloud model.

That is why ordinary model leaderboards are not enough. A model can sound excellent in chat and still fail a paperwork workflow because it chooses the old attachment, writes malformed JSON, misses a duplicate invoice, or invents an evidence path.

Short answer

As of the current public Local Model Bench suite, the best local row for private document work is Qwen3.6 27B. It is not a magic model and it does not close every case, but it has the best practical balance we have seen locally across generated invoice scans and messy workflow folders.

Gemma 4 26B-A4B remains the useful local baseline. It is fast enough to work with, often understands the broad situation, and is a better everyday comparison point than many smaller models. But it is no longer the local row to beat.

The dense Gemma 4 31B result is the useful warning. Bigger and denser did not automatically mean better in this workflow. It often looked close, then lost the run on exact closure.

  • First local model to try here: Qwen3.6 27B
  • Useful local baseline: Gemma 4 26B-A4B
  • Do not judge by chat feel alone
  • Do not judge by parameter count alone
  • Check the final files

The benchmark angle

Local Model Bench is built around a narrow practical question: can local models handle private desktop work? The current suite uses synthetic invoice cases, messy intake folders, generated scans, CSV exports, hidden oracles (answer keys that are not published beside the tasks), and visible run outputs. It is not a universal intelligence ranking.

The benchmark deliberately favors boring work. That is the point. A useful desktop assistant has to finish exact tasks, not just produce plausible commentary. It has to read sources, choose active files, ignore stale drafts, produce structured artifacts, cite evidence paths, and close the final proof step.

A model that looks polished can still lose because the file is in the wrong directory or the proof code is wrong.
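To make "exact closure" concrete, here is a minimal sketch of the kind of check that can fail a polished-looking run. It is not the benchmark's actual checker. The file location `final/audit_report.json` and the field names `evidence_paths` and `proof_code` are hypothetical, chosen only for illustration.

```python
import json
from pathlib import Path


def check_closure(run_dir: Path, expected_proof_code: str) -> list[str]:
    """Return closure problems for one run; an empty list means it closed cleanly.

    Assumed layout (hypothetical): the model must write final/audit_report.json
    containing 'evidence_paths' (paths relative to run_dir) and 'proof_code'.
    """
    report_path = run_dir / "final" / "audit_report.json"
    if not report_path.is_file():
        return ["final artifact missing or in the wrong directory"]

    try:
        report = json.loads(report_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return ["final artifact is not valid JSON"]
    if not isinstance(report, dict):
        return ["final artifact is not a JSON object"]

    problems = []
    # An invented evidence path is a failure even if the audit decision is right.
    for rel in report.get("evidence_paths", []):
        if not isinstance(rel, str) or not (run_dir / rel).is_file():
            problems.append(f"evidence path does not exist: {rel}")
    if report.get("proof_code") != expected_proof_code:
        problems.append("proof code mismatch")
    return problems
```

None of these checks care how well the commentary reads. A run with a correct decision and a wrong proof code still returns a non-empty list.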

Current practical readout

The strongest local row in the current public suite is Qwen3.6 27B. It is not perfect, but it handled the combined paperwork and workflow cases better than the other local models tested so far.

Gemma 4 26B-A4B remains useful, especially as a practical local MoE candidate. I would keep it in the comparison set when testing any new local candidate.

Smaller models can still be interesting. Some are fast, cheap to run, and decent at text-only diagnostic tasks. But in the main multimodal paperwork suite, the hard failures are usually not style problems. They are workflow failures: wrong document selection, missing evidence, proof-code errors, duplicate-risk misses, or incomplete artifacts.

  • Best current local practical result: Qwen3.6 27B
  • Useful local MoE baseline: Gemma 4 26B-A4B
  • Good diagnostic split: Paperwork Text-Only
  • Still hard for most models: agentic folder workflows
  • Separate sanity check: City Plan SVG

Vision is only one part

Document work is often described as a vision problem, but that is only partly true. A model must read the scan, yes. It also has to reconcile the scan against bank rows, vendor status, purchase orders, previous invoices, notes from accounts payable, and revision hints in messy folders.

That is why the site keeps text-only diagnostics separate. If a model improves sharply when normalized text extracts replace images, the bottleneck may be OCR or layout understanding. If it still fails text-only, the issue is deeper: bookkeeping logic, schema discipline, or final workflow closure.
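The comparison described above can be sketched per case. This is an illustrative reading of the idea, not the site's own tooling: the labels `vision_bottleneck` and `logic_bottleneck`, and the `(scan_passed, text_passed)` input shape, are assumptions made for the example.

```python
from collections import Counter


def diagnose_case(scan_passed: bool, text_passed: bool) -> str:
    """Classify one case from its result on scans and on normalized text extracts."""
    if scan_passed and text_passed:
        return "ok"
    if text_passed:
        # Fails only when it has to read the image: OCR or layout is suspect.
        return "vision_bottleneck"
    if scan_passed:
        # Passes on scans but fails on text: unusual, so flag it for manual review.
        return "inconsistent"
    # Fails both ways: bookkeeping logic, schema discipline, or closure.
    return "logic_bottleneck"


def summarize(results: dict[str, tuple[bool, bool]]) -> Counter:
    """Count diagnoses across cases; values are (scan_passed, text_passed)."""
    return Counter(diagnose_case(scan, text) for scan, text in results.values())
```

For example, `summarize({"case_a": (False, True), "case_b": (False, False)})` counts one `vision_bottleneck` and one `logic_bottleneck`. Swapping in a better OCR step would only be expected to help the first.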

A text-only model can still be useful if another tool extracts the document text first. It just should not be mixed into the same score as a model that handled the original scans.

What I would actually run

For private document work on a Mac mini-class machine, I would start with three lanes rather than one winner.

First, run the strongest local multimodal/document candidate you can fit comfortably. In our current table that means Qwen3.6 27B. Second, keep Gemma 4 26B-A4B as a practical baseline because it is useful and not wildly expensive to run. Third, keep a text-only lane for non-vision models so you can tell whether a failure is image reading or workflow logic.

That setup is less glamorous than a single winner. It is also more honest. The question is not which model wins a chart. The question is which setup gets your documents into a correct final artifact with the least human cleanup.

  • Multimodal lane: generated scans and messy folder cases
  • Text-only lane: normalized extracts and same oracle logic
  • Runtime lane: speed, timeout, visible output, token budget
  • Visual sanity lane: constrained SVG or diagram output
  • Manual review lane: inspect raw outputs before trusting automation

How to read the scores

The main practical score combines two signals. Resolved means the whole case passed, including the hidden oracle, proof code, final artifacts, and protected-file rules. Core pass means the central audit facts were right even if exact closure failed. That split matters because a proof-code miss should not look identical to a wrong invoice decision.
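A small sketch of why the two signals are kept apart. The `CaseResult` type and the `tally` helper are hypothetical, and the sketch deliberately reports the counts side by side rather than guessing how the site weights them into a single number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    resolved: bool   # whole case passed, including closure
    core_pass: bool  # central audit facts were right


def tally(results: list[CaseResult]) -> dict[str, int]:
    """Split cases into resolved, core-only (right facts, failed closure), and failed."""
    total = len(results)
    resolved = sum(r.resolved for r in results)
    core_only = sum(r.core_pass and not r.resolved for r in results)
    return {
        "total": total,
        "resolved": resolved,
        "core_only": core_only,
        "failed": total - resolved - core_only,
    }
```

A model with many `core_only` cases is getting the bookkeeping right and losing on closure. A model with many `failed` cases has a deeper problem. A single pass rate would hide that difference.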

The site also shows failure types. They are diagnostics, not extra score penalties. A `proof_code_error`, `duplicate_risk_missed`, or `missing_or_wrong_evidence` label tells you why a model lost the case. It does not secretly add more failed checks.
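To illustrate "diagnostics, not penalties", the sketch below counts failure labels separately from the pass count. The run record shape (`resolved`, `failure_labels`) is an assumption for the example. Only the label names come from the text above.

```python
from collections import Counter


def resolved_count(runs: list[dict]) -> int:
    """Number of resolved cases; failure labels play no part in this number."""
    return sum(1 for run in runs if run["resolved"])


def failure_histogram(runs: list[dict]) -> Counter:
    """How often each failure label appears across unresolved cases."""
    return Counter(
        label
        for run in runs
        if not run["resolved"]
        for label in run["failure_labels"]
    )


runs = [
    {"case_id": "a", "resolved": True, "failure_labels": []},
    {"case_id": "b", "resolved": False,
     "failure_labels": ["proof_code_error", "missing_or_wrong_evidence"]},
    {"case_id": "c", "resolved": False, "failure_labels": ["proof_code_error"]},
]
```

Here case `b` carries two labels but still counts as one unresolved case. The histogram explains the losses without changing the score.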

What not to over-read

This is a small benchmark. It should not be read as a universal claim about model intelligence, all document tasks, or every possible runtime setting.

It is also not financial, tax, or legal advice. The cases are synthetic bookkeeping tasks designed to test extraction, reconciliation, artifact discipline, and hidden-oracle closure.

The useful claim is narrower: if a local model cannot close these small synthetic workflows, you should be careful before trusting it with messier real private folders.

What should be tested next

The next useful expansion is not a bigger generic prompt set. It is more private-work tasks: contracts with conflicting amendments, insurance forms, synthetic appointment packets, messy support exports, local repo repair, and inbox attachment cleanup. Each task needs visible artifacts and hidden oracles.

The benchmark should also become easier to reproduce. A public runner and documented case format would make the site less like a blog and more like infrastructure.

Bottom line

For private document work, the best local LLM is not necessarily the model with the biggest context window or the prettiest chat answer. It is the model that can finish boring local work without leaking files, hallucinating fields, or producing an artifact that fails the next step.

That is the current bar: not magic, not AGI, not another leaderboard badge. Just local models doing private desktop work well enough that a human can verify instead of repair.

Model Context

Topic: Local LLMs for private document work
Main benchmark: Paperwork Trial + Paperwork Workflow
Diagnostic mode: Paperwork Text-Only
Local hardware: Mac mini M4, 64 GB unified memory
Current local leader: Qwen3.6 27B in the public suite
Last updated: May 31, 2026

Positioned As

  • This is an evergreen guide to reading Local Model Bench results for document workflows.
  • The project is not trying to replace broad LLM leaderboards.
  • The narrow claim is practical: local models should be tested on private desktop work with visible outputs and hidden oracles.

What We Actually Tested

  • The current public score combines generated invoice scans and agentic paperwork workflow folders.
  • The site also shows a text-only diagnostic benchmark for separating document vision from bookkeeping logic.
  • Run pages expose prompts, outputs, checks, failure labels, and model artifacts.
  • Case pages show synthetic inputs but do not publish hidden answer keys beside the tasks.

What Worked

  • Explains why local document workflows need a different benchmark shape.
  • Links model ranking to practical artifacts instead of broad model vibes.
  • Separates multimodal document work from text-only bookkeeping diagnostics.

Where It Broke

  • The current suite is still small.
  • Manual review is still needed for interpreting near misses.
  • More case families are needed before making broad claims about desktop agents.

Readout

The practical benchmark for local LLMs is private desktop work: scans, folders, CSVs, revisions, warnings, evidence, and exact artifacts. That is where local inference has a real reason to exist.