
Best local LLMs for private document work

The useful question is not which local LLM sounds smartest in chat. It is which model can survive private desktop work: generated scans, messy folders, CSVs, invoice revisions, evidence paths, proof codes, and exact final files.

[Infographic: Local document work benchmark note. Panels: Practical score (guide); Resolved (current suite); Core pass (resolved + core); Visual sample (City Plan SVG, separate).]

Local language models are attractive for document work because the input is often private. Invoices, bank exports, vendor lists, purchase orders, support notes, screenshots, and scanned PDFs are exactly the files many people do not want to upload to a cloud model.

That makes ordinary leaderboards incomplete. A model can score well on broad reasoning tests and still fail a private paperwork workflow because it chooses the wrong attachment, writes malformed JSON, misses a duplicate invoice, or changes a protected source folder.

The benchmark angle

Local Model Bench is built around a narrow practical question: can local models handle private desktop work? The current suite uses synthetic invoice cases, messy intake folders, generated scans, CSV exports, hidden oracles, and visible run outputs. It is not a universal intelligence ranking.

The benchmark deliberately favors boring work. That is the point. A useful desktop assistant needs to finish exact tasks, not just produce plausible commentary. It has to read sources, choose active files, ignore stale drafts, produce structured artifacts, cite evidence paths, and close the final proof step.
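As a rough illustration of what "exact tasks" means in practice, the sketch below checks one model-written artifact the way a harness might: the JSON must parse, required fields must be present, and every cited evidence path must point at a real file inside the case folder. The field names (`decision`, `evidence_paths`, `proof_code`) and the function `check_artifact` are assumptions made up for this example, not the Local Model Bench format.

```python
import json
from pathlib import Path

# Hypothetical required fields; the real artifact schema is not documented here.
REQUIRED_FIELDS = ("decision", "evidence_paths", "proof_code")


def check_artifact(artifact_text: str, case_dir: Path) -> list[str]:
    """Return a list of problems found in one model-written artifact."""
    try:
        data = json.loads(artifact_text)
    except json.JSONDecodeError as exc:
        return [f"malformed JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["artifact is not a JSON object"]

    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in data]

    paths = data.get("evidence_paths", [])
    if not isinstance(paths, list):
        problems.append("evidence_paths is not a list")
        paths = []

    root = case_dir.resolve()
    for rel in paths:
        target = (root / str(rel)).resolve()
        # Reject paths that escape the case folder or point at nothing.
        if not target.is_relative_to(root):
            problems.append(f"evidence path escapes case folder: {rel}")
        elif not target.is_file():
            problems.append(f"evidence path not found: {rel}")
    return problems
```

An empty list means the artifact passed these surface checks. It says nothing about whether the decision itself was right; that is the job of a hidden oracle.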

Current practical readout

The strongest local model in the current public suite is Qwen3.6 27B. It is not perfect, but it handled the combined paperwork and workflow cases better than the other local models tested so far. Gemma 4 26B A4B remains useful, especially as a practical local MoE candidate, but it is no longer the local model to beat.

Smaller models can still be interesting. Some are fast, cheap to run, and decent at text-only diagnostic tasks. But in the main multimodal paperwork suite, the hard failures are usually not style problems. They are workflow failures: wrong document selection, missing evidence, proof-code errors, duplicate-risk misses, or incomplete artifacts.

  • Best current local practical result: Qwen3.6 27B
  • Useful local MoE baseline: Gemma 4 26B A4B
  • Good diagnostic split: Paperwork Text-Only
  • Still hard for most models: agentic folder workflows
  • Separate sanity check: City Plan SVG

Vision is only one part

Document work is often described as a vision problem, but that is only partly true. A model must read the scan, yes. It also has to reconcile the scan against bank rows, vendor status, purchase orders, previous invoices, notes from accounts payable, and revision hints in messy folders.

That is why the site keeps text-only diagnostics separate. If a model improves sharply when normalized text extracts replace images, the bottleneck may be OCR or layout understanding. If it still fails text-only, the issue is deeper: bookkeeping logic, schema discipline, or final workflow closure.
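The diagnostic reading above can be sketched as a comparison of two pass rates measured on the same cases. The function name and both thresholds (`min_gain`, `good_enough`) are arbitrary assumptions chosen for illustration; the site does not publish cutoffs like these.

```python
def diagnose_bottleneck(
    multimodal_rate: float,
    text_only_rate: float,
    min_gain: float = 0.2,
    good_enough: float = 0.8,
) -> str:
    """Suggest where a model's failures probably originate.

    Rates are pass fractions in [0, 1] on the same cases. Both thresholds
    are illustrative assumptions, not values used by the benchmark.
    """
    for rate in (multimodal_rate, text_only_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must be between 0 and 1")

    gain = text_only_rate - multimodal_rate
    if gain >= min_gain:
        # Replacing images with normalized text helped a lot.
        return "likely OCR or layout understanding"
    if text_only_rate < good_enough:
        # Clean text did not rescue the model.
        return "likely bookkeeping logic, schema discipline, or workflow closure"
    return "no clear bottleneck from this split"
```

For example, `diagnose_bottleneck(0.4, 0.9)` points at the vision side, while `diagnose_bottleneck(0.4, 0.45)` points at the deeper logic. A model can have both problems at once, so a large gain does not rule out logic failures in the cases that still fail.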

How to read the scores

The main practical score combines two signals. Resolved means the whole case passed, including the hidden oracle, proof code, final artifacts, and protected-file rules. Core pass means the central audit facts were right even if exact closure failed. That split matters because a proof-code miss should not look identical to a wrong invoice decision.
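A minimal sketch of that split, assuming each case result carries two booleans. The class, the field names, and the choice to report the rates side by side are assumptions; this guide does not state how the site weights the two signals into the practical score, so no combined number is computed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    core_pass: bool  # central audit facts were right
    resolved: bool   # everything passed: oracle, proof code, artifacts, protected files


def summarize(results: list[CaseResult]) -> dict[str, float]:
    """Report resolved and core-pass rates side by side for one model."""
    if not results:
        raise ValueError("no results to summarize")
    total = len(results)
    resolved = sum(r.resolved for r in results)
    # Assumption: a resolved case implies its core facts were right,
    # so it counts toward the core-pass rate as well.
    core = sum(r.core_pass or r.resolved for r in results)
    return {
        "resolved_rate": resolved / total,
        "core_pass_rate": core / total,
        "core_only_rate": (core - resolved) / total,
    }
```

The `core_only_rate` is the interesting gap: cases where the model made the right audit call but failed exact closure, such as a wrong proof code.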

The site also shows failure types. They are diagnostics, not extra score penalties. A `proof_code_error`, `duplicate_risk_missed`, or `missing_or_wrong_evidence` label tells you why a model lost the case. It does not secretly add more failed checks.
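To make the "diagnostics, not penalties" point concrete, here is a small sketch that tallies failure labels across cases without touching any score. The input shape (a mapping from case ID to a list of labels) and the function name are assumptions for illustration; the label strings are the ones quoted above.

```python
from collections import Counter


def tally_failures(case_labels: dict[str, list[str]]) -> Counter:
    """Count how many cases carry each failure label.

    A label is counted at most once per case, and the tally never changes
    a pass/fail result: it only explains it.
    """
    counts: Counter = Counter()
    for labels in case_labels.values():
        counts.update(set(labels))
    return counts


labels_by_case = {
    "case-01": ["proof_code_error"],
    "case-02": ["proof_code_error", "duplicate_risk_missed"],
    "case-03": [],  # passed, no labels
}
tally = tally_failures(labels_by_case)
print(tally.most_common(1))  # [('proof_code_error', 2)]
```

Counting per case rather than per check keeps one noisy case from dominating the picture.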

What to test next

The next useful expansion is not a bigger generic prompt set. It is more private-work tasks: contracts with conflicting amendments, insurance forms, synthetic appointment packets, messy support exports, local repo repair, and inbox attachment cleanup. Each task needs visible artifacts and hidden oracles.

The benchmark should also become easier to reproduce. A public runner and a documented case format would make the site less like a blog and more like infrastructure. That is the direction Local Model Bench is heading.

Bottom line

For private document work, the best local LLM is not necessarily the model with the biggest context window or the prettiest chat answer. It is the model that can finish boring local work without leaking files, hallucinating fields, or producing an artifact that fails the next step.

That is the current bar: not magic, not AGI, not another leaderboard badge. Just local models doing private desktop work well enough to be trusted with verification.

Model Context

  • Topic: Local LLMs for private document work
  • Main benchmark: Paperwork Trial + Paperwork Workflow
  • Diagnostic mode: Paperwork Text-Only
  • Local hardware: Mac mini M4, 64 GB unified memory
  • Current local leader: Qwen3.6 27B in the public suite

Positioned As

  • This is an evergreen guide to reading Local Model Bench results for document workflows.
  • The project is not trying to replace broad LLM leaderboards.
  • The narrow claim is practical: local models should be tested on private desktop work with visible outputs and hidden oracles.

What We Actually Tested

  • The current public score combines generated invoice scans and agentic paperwork workflow folders.
  • The site also shows a text-only diagnostic benchmark for separating document vision from bookkeeping logic.
  • Run pages expose prompts, outputs, checks, failure labels, and model artifacts.
  • Case pages show synthetic inputs but do not publish hidden answer keys beside the tasks.

What Worked

  • Explains why local document workflows need a different benchmark shape.
  • Links model ranking to practical artifacts instead of broad model vibes.
  • Separates multimodal document work from text-only bookkeeping diagnostics.

Where It Broke

  • The current suite is still small.
  • Manual review is still needed for interpreting near misses.
  • More case families are needed before making broad claims about desktop agents.

Readout

The practical benchmark for local LLMs is private desktop work: scans, folders, CSVs, revisions, warnings, evidence, and exact artifacts. That is where local inference has a real reason to exist.