Why local?
Paperwork is a natural use case for local models because the input often contains private data. The benchmark uses fully synthetic documents, but the workflow is modeled after real tasks: match records, pick the right file version, produce evidence, and avoid touching protected folders.
What is scored?
The main score combines resolved cases and core passes. A resolved case means the final result, hidden oracle, proof code, and workflow artifacts pass. A core pass means the central audit facts are right, even if the closure is imperfect.
What this is not
This is not a general intelligence ranking, a tax or financial tool, or a claim that one model is universally better. It is a practical signal for local-document workflows under reproducible prompts and visible outputs.