benchmark note

Bigger was not always better in the paperwork benchmark

In the current nine-case paperwork suite, two larger-sounding local model choices produced fewer finished cases than leaner alternatives. That does not prove small models are better. It shows that parameter count and model labels are poor substitutes for checking whether the final work actually got done.

[Infographic: paperwork suite comparison. Labels: practical score, resolved (9 cases), core pass.]

The easiest story would be simple: bigger model, better result.

That is not what happened in the current Local Model Bench paperwork suite. The benchmark asks models to work through synthetic private-document tasks: generated invoice images, vendor tables, bank exports, purchase orders, messy folders, revised attachments, protected source files, and hidden oracle checks.

The final score is not a vibe score. Half of it is the share of cases strictly resolved, and half is the share of cases that pass the core oracle. That split matters, because some models understood much of the case and still failed to finish the job.
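The scoring rule is simple enough to write down. The sketch below is one reading of it, assuming both halves are plain fractions of the nine cases; the function name practical_score and its parameters are illustrative and are not taken from the benchmark's own code.

```python
def practical_score(resolved: int, core_passes: int, total_cases: int) -> float:
    """Half strict resolved rate, half core-oracle pass rate, as a percentage.

    Assumption: each half is a simple fraction of total_cases.
    """
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    resolved_rate = resolved / total_cases
    core_rate = core_passes / total_cases
    return 100 * (0.5 * resolved_rate + 0.5 * core_rate)


# A run with 5/9 resolved cases and 8/9 core passes.
print(f"{practical_score(5, 8, 9):.1f}%")  # 72.2%
```

Under this reading, a run that passes the core oracle on every case but resolves none of them strictly tops out at 50%.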

The surprising comparison

The current local result to beat is qwen3.6-27b. It scored 72.2%, with 5/9 resolved cases and 8/9 core passes.

The larger-sounding qwen3.6-35b-a3b did not beat it. It scored 38.9%, with 1/9 resolved cases and 6/9 core passes.

That is the interesting part. The 35B-A3B run was not simply useless. It had six core passes. But it finished only one case strictly. In this benchmark, that gap between mostly understanding the case and actually completing it is the whole point.

The Gemma comparison showed the same pattern. gemma-4-26b-a4b scored 61.1%, with 4/9 resolved cases and 7/9 core passes. gemma-4-31b-it scored 27.8%, with 0/9 resolved cases and 5/9 core passes.
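One way to see the pattern is to count, for each row, the cases that passed the core oracle but were not strictly resolved. The sketch below uses only the counts quoted above. The name closure_gap is made up for this note, and the subtraction assumes that every strictly resolved case also passed the core oracle, which the suite description does not state explicitly.

```python
# Counts copied from the four rows discussed above; every row is out of 9 cases.
rows = {
    "qwen3.6-27b": {"resolved": 5, "core": 8},
    "qwen3.6-35b-a3b": {"resolved": 1, "core": 6},
    "gemma-4-26b-a4b": {"resolved": 4, "core": 7},
    "gemma-4-31b-it": {"resolved": 0, "core": 5},
}


def closure_gap(row: dict) -> int:
    """Core passes that did not become strict resolutions.

    Assumption: resolved cases are a subset of core-pass cases.
    """
    return row["core"] - row["resolved"]


gaps = {name: closure_gap(row) for name, row in rows.items()}
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    row = rows[name]
    print(f"{name:<16} core={row['core']}/9 resolved={row['resolved']}/9 gap={gap}")
```

On these counts the two leaner rows each leave three core passes unfinished, while the two larger-sounding rows each leave five.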

Where the bigger rows lost points

The weaker runs tended to fail in unglamorous places.

For qwen3.6-35b-a3b, the recurring problems were evidence path formatting, manifest errors, missing or wrong evidence, and one case where the final audit result landed in the wrong location.

For gemma-4-31b-it, the recurring problems included proof-code errors, duplicate-risk misses, invoice classification mistakes, total calculation errors, and normalized text issues in workflow cases.

Those are not abstract intelligence failures. They are desktop-work failures. A local document model has to do more than read the page. It has to choose the right source files, ignore stale or irrelevant documents, preserve protected folders, write the expected artifacts, keep evidence paths consistent, and finish the final proof step.
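To make those closure requirements concrete, here is a minimal sketch of the kind of post-run check a harness could apply to a finished case folder. It is not Local Model Bench's checker. The folder layout, the "protected" directory name, and the functions snapshot and closure_problems are all hypothetical, and only three of the requirements are covered: artifact location, evidence paths, and untouched protected files.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file's bytes, used to detect edits to protected sources."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(folder: Path) -> dict[str, str]:
    """Map each file under `folder` (relative POSIX path) to its digest."""
    return {
        p.relative_to(folder).as_posix(): file_digest(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def closure_problems(case_dir: Path, result_path: str,
                     evidence_paths: list[str],
                     protected_before: dict[str, str],
                     protected_dir: str = "protected") -> list[str]:
    """Return human-readable closure failures; an empty list means clean.

    All names and the folder layout are hypothetical, for illustration only.
    """
    problems = []
    # 1. The final artifact must exist at the expected location.
    if not (case_dir / result_path).is_file():
        problems.append(f"missing result: {result_path}")
    # 2. Every cited evidence path must be relative and point at a real file.
    for ev in evidence_paths:
        if Path(ev).is_absolute() or not (case_dir / ev).is_file():
            problems.append(f"bad evidence path: {ev}")
    # 3. Protected source files must be byte-identical to the earlier snapshot.
    if snapshot(case_dir / protected_dir) != protected_before:
        problems.append("protected folder changed")
    return problems
```

In use, a harness would call snapshot on the protected folder before the model runs and pass that result to closure_problems afterwards. A model can describe the case correctly in its explanation and still trip any of these three checks.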

This is not a small-model victory lap

This result should not be overread. The benchmark does not prove that smaller models are generally better. It does not prove that qwen3.6-27b is always better than qwen3.6-35b-a3b, or that gemma-4-26b-a4b is always better than gemma-4-31b-it.

Different prompts, runtimes, quantizations, context settings, or task types could change the ranking. A larger model may still be better at long-form writing, code reasoning, broad knowledge, or other benchmark categories.

There is also a useful control: in the Paperwork Text-Only diagnostic, qwen3.6-27b and qwen3.6-35b-a3b both reached 80.0%. That suggests the full paperwork suite is not just testing language reasoning. It is also testing OCR and vision behavior, file selection, workflow discipline, and exact final artifact handling.

The practical readout

For local private-document work, do not buy the model label. Test the workflow.

The model that looks stronger on paper may still be worse at finishing a messy folder task. The model that produces a convincing explanation may still miss the proof code. The model that gets the invoice total right may still write the result into the wrong place.

Local Model Bench is trying to measure that boring last mile. The question is not which model sounds biggest. The question is which model can survive the actual folder.

Right now, on this nine-case paperwork suite, bigger was not always better.

Model Context

Benchmark: Current nine-case paperwork suite
Scoring: 50% strict resolved cases, 50% core-oracle pass
Best local row: qwen3.6-27b, 72.2%
Main comparison: Qwen3.6 27B vs 35B-A3B; Gemma 4 26B-A4B vs 31B-IT
Hardware for local runs: Mac mini M4, 64 GB unified memory

Positioned As

  • This is a benchmark interpretation note, not a universal model ranking.
  • The result is strongest as evidence that practical workflow closure does not map cleanly to model size.
  • The text-only diagnostic is included as a control because it separates part of the language-reasoning problem from the full document-workflow problem.

What We Actually Tested

  • Compared current local rows in the nine-case paperwork suite.
  • Checked strict resolved counts, core-oracle passes, and recurring failure types.
  • Compared the same model families against the Paperwork Text-Only diagnostic where available.
  • Kept failure types diagnostic rather than adding them as extra score checks.

What Worked

  • qwen3.6-27b remained the strongest local paperwork row: 72.2%, 5/9 resolved, 8/9 core.
  • gemma-4-26b-a4b remained the second strongest local paperwork row: 61.1%, 4/9 resolved, 7/9 core.
  • The text-only control showed that the larger-sounding qwen3.6-35b-a3b did not fail everywhere: it matched qwen3.6-27b at 80.0% on the Paperwork Text-Only diagnostic, and the drop appeared in the full document workflow.

Where It Broke

  • qwen3.6-35b-a3b had six core passes but only one strict resolved case.
  • gemma-4-31b-it had five core passes but zero strict resolved cases.
  • Recurring failures were mostly practical closure problems: evidence paths, manifests, proof codes, totals, and final artifact placement.

Readout

The current paperwork suite does not support a simple bigger-is-better reading. The stronger practical rows were the models that finished more cases, not the models with the largest-sounding labels.