Start with the job, not the model
A local benchmark should begin with a task a person would actually delegate. For Local Model Bench, that task is intentionally boring: inspect synthetic paperwork, select the right source files, reconcile invoice details against bank and vendor data, and produce a final audit JSON.
That job is narrower than general intelligence and broader than one prompt. It includes reading documents, following instructions, preserving input files, creating intermediate artifacts, and finishing the exact output contract.
Use synthetic data
Private-document benchmarks should not use real private documents. The clean solution is synthetic data with realistic structure: invoice scans, payment exports, vendor records, purchase orders, email notes, duplicate files, and old drafts. The data can be public, repeatable, and inspectable without leaking anyone's actual finances or workplace material.
Synthetic does not have to mean easy. A case can include withdrawn invoices, revised attachments, stale bank exports, inactive vendors, short payments, or a note that says which file is final. The important part is that the expected answer is knowable and objective.
Separate visible checks from hidden oracles
Visible checks catch the obvious problems: did the model create `audit_result.json`, is the JSON parseable, are required keys present, did it create `proof.txt`, did it leave the incoming folder unchanged? These checks are useful during development because models and runners fail in very ordinary ways.
Hidden oracles catch the actual task logic. They decide whether the model selected the corrected invoice, ignored the draft, mapped the payment to the revised document, computed the right total, and cited acceptable evidence. Without hidden oracles, a benchmark rewards neat-looking answers rather than correct outcomes.
- Visible checks: file exists, JSON parses, required fields present
- Hidden oracles: correct invoice set, warnings, totals, evidence, proof code
- Failure types: diagnostics only, not extra score inflation
- Run outputs: public enough for inspection, without exposing ground truth
Score resolved and core separately
The most important scoring split is resolved versus core. Resolved means the whole case passed: final artifact, hidden oracle, proof code, evidence, protected-folder rules, and required intermediate files. Core pass means the central audit facts were right even if exact closure failed.
That split prevents two bad interpretations. It avoids calling every near miss a total failure. It also avoids giving a model a clean win when it understood the case but produced an unusable final artifact.
Make the model handle files
A local work benchmark should not only ask for a chat answer. Real private work involves files. The model should inspect a folder, identify active sources, ignore drafts, write normalized manifests, and keep protected input folders unchanged.
This is where agentic tasks matter. A model can be good at one-shot extraction and still weak at multi-step workflow closure. The folder task tests whether the model can move from reading to organizing to final output.
Keep text-only as a diagnostic mode
Image-based document benchmarks are realistic, but they mix two problems: vision/OCR and reasoning. A text-only companion benchmark removes raw image input and gives the model normalized extracts. If a model improves sharply, the issue was probably document reading. If it still fails, the issue is workflow logic or output discipline.
That diagnostic split also gives text-only local models a fair place in the project. They should not be scored as image readers, but they can still be scored as bookkeeping and structured-output workers.
What to publish
A useful public benchmark should show more than a score. It should show the prompt, the model answer, generated artifacts, visible checks, failure labels, model/runtime settings, hardware, and enough case material for readers to understand the task. It should not publish hidden solutions directly beside the prompt.
This is the balance Local Model Bench is aiming for: transparent enough to inspect, strict enough to be meaningful, and boring enough that passing the benchmark feels like real work rather than benchmark theater.