Paperwork Text-Only: separating logic from vision

The text-only paperwork benchmark gives models normalized document extracts instead of generated scans. It is not the main leaderboard, but it answers a useful question: can the model close the bookkeeping logic once OCR and vision are removed from the problem?

[Infographic: Paperwork Text-Only benchmark note. Practical score: diagnostic · Resolved: 5 cases · Core pass: same oracle · Visual sample: not used]

The main Local Model Bench score uses generated invoice images and messy workflow folders. That is the right default for private paperwork because real documents usually arrive as scans, screenshots, attachments, and badly named files.

But when a model fails the visual benchmark, the next question is obvious: did it fail because it could not read the image, or because it could not finish the bookkeeping logic? The text-only benchmark exists to split those two problems.

The wrong question

A lot of local model testing turns into one vague question: is this model good? That is too broad to be useful. A model can be good at chat, acceptable at code, poor at OCR, surprisingly strong at JSON, and hopeless at closing a multi-file workflow. Local Model Bench tries to split those abilities apart.

The Paperwork Text-Only benchmark exists because image-based document work has two very different failure modes. One model may fail because it cannot read the scanned invoice. Another may read the invoice correctly and still fail the bookkeeping logic. Those are not the same problem.

What changes in text-only mode

In the main paperwork benchmark, models receive generated invoice images, bank exports, vendor records, purchase orders, and sometimes messy workflow folders. That is closer to real private paperwork because documents usually arrive as scans, screenshots, PDFs, and badly named attachments.

In text-only mode, the image reading step is removed. The model receives normalized extracts from the same generated-invoice cases. It still sees the relevant invoice fields, bank rows, vendor data, and task instructions. It still has to produce `audit_result.json`. It still has to satisfy the same oracle logic.

  • Same generated-invoice case logic
  • No raw image input
  • Same final JSON contract
  • Same evidence and warning expectations
  • Same proof-code closure requirement

Why non-vision models belong here

A text-only diagnostic mode also lets non-vision models enter the conversation fairly. It would be pointless to give a text-only model an invoice PNG and then score it as a document worker. But it is useful to ask whether that same model can complete the audit once the document text is provided.

That gives the leaderboard a cleaner split. The image benchmark says something about multimodal document handling. The text-only benchmark says something about audit logic, schema discipline, and final artifact closure.

Why a high text-only score is not enough

Text-only success does not mean a model is ready for private paperwork. A real user does not usually have perfect normalized extracts. They have screenshots, PDF scans, invoice revisions, strange file names, and notes buried in email threads. A strong text-only result is a useful diagnostic signal, not a full deployment recommendation.

The best use of the text-only result is comparative. If a model fails both image and text-only modes, the issue is probably not just OCR. If it jumps sharply in text-only mode, then the next work is image reading, layout understanding, and source-file selection.
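The comparison above can be written down as a small decision rule. This is a minimal sketch, not the benchmark's own tooling: the `ModeResult` structure, the `jump` threshold of 0.4, and the 0.5 cutoff for "fails" are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ModeResult:
    """Pass counts for one model in one input mode (hypothetical structure)."""
    resolved: int  # cases that closed fully
    total: int     # cases attempted


def diagnose(image: ModeResult, text_only: ModeResult, jump: float = 0.4) -> str:
    """Compare pass rates across the two modes.

    `jump` and the 0.5 failure cutoff are arbitrary illustrative thresholds,
    not values defined by the benchmark.
    """
    image_rate = image.resolved / image.total
    text_rate = text_only.resolved / text_only.total
    if text_rate - image_rate >= jump:
        # Sharp improvement once vision is removed: look at image reading,
        # layout understanding, and source-file selection next.
        return "vision-limited"
    if image_rate < 0.5 and text_rate < 0.5:
        # Failing in both modes: the issue is probably not just OCR.
        return "logic-limited"
    return "inconclusive"
```

For example, `diagnose(ModeResult(0, 5), ModeResult(4, 5))` returns `"vision-limited"`, while a model that resolves almost nothing in either mode comes back as `"logic-limited"`.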

What we saw in practice

The diagnostic split has already been informative. Some models that looked weak in image mode scored better once normalized text was provided. Others still failed on proof codes, warning codes, or exact JSON output. Chrome Gemini Nano, for example, reached one core pass but no strict resolved cases in text-only mode.

That is useful information. It says the problem was not only image input. It was also final closure: turning an apparently understood case into a machine-usable result.

How to read the score

The text-only benchmark should not be merged into the main practical score. It is a diagnostic companion. The main score stays focused on the current paperwork suite with generated scans and workflow folders. Text-only mode answers a narrower question: once the documents are readable, can the model finish the job?

That narrowness is a feature. It helps avoid lazy conclusions like 'bad at invoices' or 'good at paperwork' when the real failure might be OCR, revision tracking, proof-code arithmetic, JSON formatting, or evidence selection.

Model Context

  • Benchmark: Paperwork Text-Only
  • Input mode: Normalized text extracts, CSVs, and task instructions
  • Case count: 5 generated-invoice cases
  • Scoring: 50% resolved pass + 50% core-oracle pass
  • Benchmark role: Diagnostic companion to the multimodal paperwork suite
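The scoring rule can be illustrated with a short calculation. This sketch assumes that "resolved pass" and "core-oracle pass" are each measured as a pass rate over the five cases and that the result is reported on a 0 to 100 scale; the function name and parameters are hypothetical.

```python
def text_only_score(resolved: int, core_passes: int, total_cases: int = 5) -> float:
    """Weight the resolved rate and the core-oracle rate equally (50% each).

    Assumes both components are simple pass rates over `total_cases`;
    the benchmark may aggregate them differently.
    """
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    if not (0 <= resolved <= total_cases and 0 <= core_passes <= total_cases):
        raise ValueError("pass counts must be between 0 and total_cases")
    # Equivalent to 100 * (0.5 * resolved/total + 0.5 * core/total).
    return 50 * (resolved + core_passes) / total_cases
```

Under this reading, a run with one core pass and no resolved cases out of five gives `text_only_score(0, 1)`, which evaluates to `10.0`. A run that resolves every case and passes every core oracle gives `100.0`.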

Positioned As

  • This is not a replacement for the main generated-image paperwork benchmark.
  • It is a diagnostic mode: remove document vision and see whether the audit logic, evidence selection, warnings, totals, and proof codes still close.
  • Text-only runs are especially useful for non-vision models and for separating OCR failures from reasoning or workflow failures.

What We Actually Tested

  • The same five generated-invoice cases are used, but the model receives normalized text extracts rather than image input.
  • The final output still has to be `audit_result.json` with the expected fields.
  • Visible checks, core oracle, hidden oracle, proof code, evidence, invoice classification, warning codes, and totals are still enforced.
  • A model cannot win just by summarizing the documents. It has to close the case.
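The difference between summarizing and closing a case can be sketched as a strict comparison against expected values. The field names below are hypothetical, since the real `audit_result.json` schema and oracle are not described here. Treating list order as irrelevant is also an assumption.

```python
import json

# Hypothetical field names; the actual audit_result.json contract may differ.
REQUIRED_FIELDS = (
    "invoice_classification",
    "total",
    "warning_codes",
    "evidence",
    "proof_code",
)


def closes_case(raw_output: str, expected: dict) -> bool:
    """Return True only if the output is valid JSON and every field matches."""
    try:
        result = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # prose or malformed JSON never closes a case
    if not isinstance(result, dict):
        return False
    for field in REQUIRED_FIELDS:
        if field not in result:
            return False
        got, want = result[field], expected[field]
        if isinstance(want, list):
            # Lists (codes, evidence) are compared without regard to order.
            if not isinstance(got, list):
                return False
            if sorted(map(str, got)) != sorted(map(str, want)):
                return False
        elif got != want:
            return False
    return True
```

A fluent paragraph describing the invoice fails this check at the first step, and so does a JSON object that omits the proof code. That is the sense in which a summary alone cannot pass.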

What Worked

  • Makes non-vision models comparable on the bookkeeping part of the task.
  • Shows whether multimodal failures are really visual failures or workflow-logic failures.
  • Keeps the same hidden-oracle discipline as the image benchmark.

Where It Broke

  • It is less realistic than the main scan-based benchmark because OCR and layout understanding are removed.
  • It should not be mixed into the main multimodal score.
  • A high text-only score does not mean the model can handle raw private paperwork.

Readout

The text-only benchmark is a diagnostic lens. If a model fails the image suite and the text-only suite, the problem is not just OCR. If it improves sharply in text-only mode, the next work is image reading, layout, and attachment handling. That distinction is useful before declaring a model good or bad at paperwork.