The wrong question
A lot of local model testing turns into one vague question: is this model good? That is too broad to be useful. A model can be good at chat, acceptable at code, poor at OCR, surprisingly strong at JSON, and hopeless at closing a multi-file workflow. Local Model Bench tries to split those abilities apart.
The Paperwork Text-Only benchmark exists because image-based document work has two very different failure modes. One model may fail because it cannot read the scanned invoice. Another may read the invoice correctly and still fail the bookkeeping logic. Those are not the same problem.
What changes in text-only mode
In the main paperwork benchmark, models receive generated invoice images, bank exports, vendor records, purchase orders, and sometimes messy workflow folders. That is closer to real private paperwork because documents usually arrive as scans, screenshots, PDFs, and badly named attachments.
In text-only mode, the image reading step is removed. The model receives normalized extracts from the same generated-invoice cases. It still sees the relevant invoice fields, bank rows, vendor data, and task instructions. It still has to produce `audit_result.json`. It still has to satisfy the same oracle logic.
- Same generated-invoice case logic
- No raw image input
- Same final JSON contract
- Same evidence and warning expectations
- Same proof-code closure requirement
Why non-vision models belong here
A text-only diagnostic mode also lets non-vision models enter the conversation fairly. It would be pointless to give a text-only model an invoice PNG and then score it as a document worker. But it is useful to ask whether that same model can complete the audit once the document text is provided.
That gives the leaderboard a cleaner split. The image benchmark says something about multimodal document handling. The text-only benchmark says something about audit logic, schema discipline, and final artifact closure.
Why a high text-only score is not enough
Text-only success does not mean a model is ready for private paperwork. A real user does not usually have perfect normalized extracts. They have screenshots, PDF scans, invoice revisions, strange file names, and notes buried in email threads. A strong text-only result is a useful diagnostic signal, not a full deployment recommendation.
The best use of the text-only result is comparative. If a model fails both image and text-only modes, the issue is probably not just OCR. If it jumps sharply in text-only mode, then the next work is image reading, layout understanding, and source-file selection.
What we saw in practice
The diagnostic split already produced interesting results. Some models that looked weak in image mode were better once normalized text was provided. Others still failed proof codes, warning codes, or exact JSON output. Chrome Gemini Nano, for example, reached one core pass but no strict resolved cases in text-only mode.
That is useful information. It says the problem was not only image input. It was also final closure: turning an apparently understood case into a machine-usable result.
How to read the score
The text-only benchmark should not be merged into the main practical score. It is a diagnostic companion. The main score stays focused on the current paperwork suite with generated scans and workflow folders. Text-only mode answers a narrower question: once the documents are readable, can the model finish the job?
That narrowness is a feature. It helps avoid lazy conclusions like 'bad at invoices' or 'good at paperwork' when the real failure might be OCR, revision tracking, proof-code arithmetic, JSON formatting, or evidence selection.