Methodology · May 17, 2026

Why “looks right” is not enough

A model can read the invoice, name the right vendor, and still fail the job. Local Model Bench separates rough understanding from clean workflow closure because real paperwork ends with correct files, evidence, and proof.

Practical score: 50/50 split
Resolved: strict pass
Core pass: near-miss signal
Visual sample: separate sample

A lot of model outputs look plausible on first read. That is exactly the trap this benchmark is trying to avoid.

Local Model Bench does not ask whether a model produced a confident answer. It asks whether the final artifacts match the case: correct JSON, correct source evidence, protected input folders, required intermediate files, and a proof code that closes the loop.

The problem with plausible answers

A model answer can look good while still being unusable. It can name the right vendor, repeat the right invoice total, and write a confident summary. Then it can put the wrong warning code in JSON, use a filename where the case asks for a document ID, forget an intermediate artifact, or calculate a proof code from the wrong numbers.

That kind of failure is easy to miss in a chat transcript. It is harder to miss when the task has a final artifact and an oracle. Local Model Bench is built around that difference.

What resolved means

A resolved case means the model finished the whole job. It did not merely understand the broad situation. It returned the required files, kept protected source folders unchanged, matched the hidden oracle, used acceptable evidence, and closed the final proof-code check.

This makes the score harsher than a human vibe check. That is intentional. A local model used for private paperwork is not just writing an opinion. It is supposed to produce something a user or script can rely on.
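One of the resolved conditions above, that protected source folders stay unchanged, is easy to check mechanically. The following is a minimal sketch, assuming the check is done by hashing every file before and after the run. The function names `snapshot_folder` and `folder_unchanged` are hypothetical and are not taken from the benchmark's own harness.

```python
import hashlib
from pathlib import Path


def snapshot_folder(root: Path) -> dict[str, str]:
    """Map each file's relative path to the SHA-256 of its contents."""
    snapshot = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            snapshot[path.relative_to(root).as_posix()] = digest
    return snapshot


def folder_unchanged(before: dict[str, str], after: dict[str, str]) -> bool:
    """True only if no file was added, removed, or modified."""
    return before == after
```

Taking one snapshot before the model runs and one after, then comparing them, would flag any edit to a protected folder regardless of how good the final answer looks.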

Why core pass exists

The benchmark also tracks core pass because not all failures are equally bad. If a model identifies the right invoice, vendor, payment status, and review decision but misses the proof code, that is different from selecting the wrong document entirely.

Core pass is the near-miss signal. It tells us when the model understood the central audit facts but failed exact closure. That distinction matters for model selection. A model with many core passes may be useful with human review. A model with neither resolved nor core passes is just not doing the work.
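The split between resolved, core pass, and plain failure can be sketched as a small classifier. This is an illustration under assumptions, not the benchmark's scoring code: the `RunChecks` fields are hypothetical names for the conditions described in this article, and the three labels are treated as mutually exclusive for simplicity.

```python
from dataclasses import dataclass


@dataclass
class RunChecks:
    # Hypothetical field names; each mirrors one condition named in the text.
    files_returned: bool
    protected_unchanged: bool
    core_facts_match: bool  # invoice, vendor, payment status, review decision
    evidence_ok: bool
    proof_code_ok: bool


def classify(run: RunChecks) -> str:
    """Return 'resolved', 'core_pass', or 'fail' for a single run."""
    everything_closed = all((
        run.files_returned,
        run.protected_unchanged,
        run.core_facts_match,
        run.evidence_ok,
        run.proof_code_ok,
    ))
    if everything_closed:
        return "resolved"
    # Assumption: the central audit facts alone are enough for a core pass.
    if run.core_facts_match:
        return "core_pass"
    return "fail"
```

A run that gets every audit fact right but misses the proof code lands in `core_pass`, which is the near-miss case the paragraph above describes.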

The checksum idea

The proof code is deliberately boring. It is a small checksum-style field derived from visible final values in the case. It is not a secret trick and it is not meant to be clever. It checks whether the model can keep its own output internally consistent.

This is where many strong-looking answers fail. The model can describe the case correctly, then put the wrong final integer in `proof_code` or leave it as an arithmetic expression. That is a real failure if the task demands a usable JSON artifact.
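The failure mode above, a `proof_code` that is wrong or left as an unevaluated expression, can be caught with a strict type-and-value check. The sketch below is hedged in two ways: the checksum formula in `expected_proof_code` is invented purely for illustration, since this article does not state how the real proof code is derived, and the helper names are hypothetical.

```python
import json


def expected_proof_code(total_cents: int, line_count: int) -> int:
    # Invented formula for illustration only; the real derivation is not given here.
    return (total_cents + 7 * line_count) % 9973


def proof_code_ok(raw_json: str, expected: int) -> bool:
    """Accept only a plain JSON integer equal to the expected value."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    value = data.get("proof_code")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return value == expected
```

Under this check, a string such as `"125000 + 7 * 3"` fails even though a human reader could evaluate it, because a downstream script cannot use it as a number.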

Why hidden oracles matter

Visible checks are useful but insufficient. They can confirm that a file exists, JSON parses, or required keys are present. They cannot tell whether the model chose the revised invoice instead of the withdrawn one, correctly ignored the distractor, or mapped a payment reference to the right final document.
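The three visible checks named above (file exists, JSON parses, required keys present) might look like the following. This is a minimal sketch: apart from `proof_code`, the key names in `REQUIRED_KEYS` are hypothetical, and the failure labels are made up for the example.

```python
import json
from pathlib import Path

# Hypothetical key names, chosen for illustration only.
REQUIRED_KEYS = {"document_id", "vendor", "total", "proof_code"}


def visible_checks(result_path: Path) -> list[str]:
    """Return visible-check failures; an empty list means all checks passed."""
    if not result_path.is_file():
        return ["missing_file"]
    try:
        data = json.loads(result_path.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        return ["invalid_json"]
    if not isinstance(data, dict):
        return ["not_an_object"]
    missing = sorted(REQUIRED_KEYS - set(data))
    return [f"missing_key:{key}" for key in missing]
```

An empty result here only says the output has the right shape. Nothing in this function knows which invoice was the correct one, which is the gap the hidden oracle is meant to cover.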

Hidden oracles let the public case remain inspectable while preventing a model from passing through superficial formatting alone. They also make failure analysis more honest: the site can show whether a failure was format, evidence, protected-file, classification, warning, or proof related.

What this says about local models

The current results suggest a practical pattern. Many local models can produce something that looks useful. Fewer can resolve the full case. The difference is not academic. It is the difference between a tool that helps a person review paperwork and a tool that can safely complete a workflow.

That is why Local Model Bench should stay focused on finished artifacts. The world already has enough benchmarks where models answer clean prompts. The more interesting local question is whether they can survive messy, private, ordinary work.

Model Context

Topic: Scoring methodology
Primary signal: Resolved cases
Secondary signal: Core oracle pass
Diagnostics: Visible checks and failure types
Benchmark role: Avoid plausible-but-wrong wins

Positioned As

Resolved means the model completed the whole case contract, not just the obvious document-reading part.
Core pass means the main business logic was mostly right, but some final closure detail failed.
Failure types explain what went wrong. They are diagnostic labels and do not change the score as extra hidden penalties.

What We Actually Tested

The paperwork cases use synthetic scans, bank exports, vendor records, purchase orders, notes, and messy folders.
The model must produce final artifacts, not just a chat answer.
Visible checks catch basic output shape. Hidden oracles check the exact case logic.
A proof code acts like a small checksum: if the model cannot close its own numbers, the run is not fully resolved.

What Worked

Makes near misses visible instead of pretending every failure is equally bad.
Prevents models from winning by sounding correct while missing final artifacts.
Keeps visible checks, hidden oracles, and diagnostic failure labels separate.

Where It Broke

The score can look harsh when a model understood the gist but failed the final proof.
A small case set should be read as a practical signal, not a universal model ranking.
Manual review still matters for interpreting strange near misses.

Readout

The useful distinction is not smart versus dumb. It is finished versus plausible. Local work often fails in the last ten percent: wrong file, wrong evidence path, wrong checksum, modified source folder, or a JSON answer that cannot be used. That is why resolved beats “looks right.”
