Local Model Bench

Can local models survive real work?

A practical benchmark for synthetic paperwork, messy document workflows, and inspectable model outputs.

Top practical score: 83.3%
Top local model: 72.2%
Scored cases: 9
Active runs: 44
[Infographic: the Local Model Bench workflow, from synthetic documents to model artifacts and resolved, core pass, or fail outcomes]

The Paperwork Trial (scored)

Synthetic invoice PNG scans plus bank exports, vendor records, purchase orders, and exact audit-result oracles.

8 runs · 8 models
Paperwork Workflow (scored)

Synthetic messy intake and email-attachment workflows with generated scans, protected sources, normalized artifacts, payment remapping, and hidden oracles.

32 runs · 8 models
City Plan SVG (visual sample)

A city-plan SVG prompt with roads, blocks, and 3D or isometric buildings. Valid vector output, no Markdown excuses.

4 runs · not part of overall score

Overall Leaderboard

Practical Score = 50% resolved cases + 50% core passes across the current calibrated paperwork suite. Local LM Studio runs were executed on a Mac mini M4 with 64 GB unified memory; reference/API rows are marked separately.
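
The formula above can be checked against the leaderboard rows with a short sketch. The function name `practical_score` and the one-decimal rounding are assumptions for illustration, not the benchmark's actual scoring code; the inputs are the Resolved and Core counts as they appear in the table.

```python
def practical_score(resolved: int, core_passes: int, total_cases: int) -> float:
    """Return 50% resolved rate + 50% core-pass rate as a percentage.

    Hypothetical helper: the name and the rounding to one decimal are
    assumptions made for this sketch, not taken from the benchmark code.
    """
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    for count in (resolved, core_passes):
        if not 0 <= count <= total_cases:
            raise ValueError("counts must lie between 0 and total_cases")

    resolved_rate = resolved / total_cases
    core_rate = core_passes / total_cases
    return round(100 * (0.5 * resolved_rate + 0.5 * core_rate), 1)


# Counts taken from the leaderboard: (resolved, core passes) out of 9 cases.
rows = [(7, 8), (5, 8), (4, 7), (1, 6), (2, 3), (0, 5), (0, 0)]
scores = [practical_score(r, c, 9) for r, c in rows]
print(scores)
```

Note that two rows with different profiles (2 resolved and 3 core passes, versus 0 resolved and 5 core passes) land on the same score, because the two halves are weighted equally.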

Case matrix legend: OK · near miss / core pass · fail · not run

| Rank | Model | Type | Practical | Resolved | Core | Tried | Case Matrix | Main misses |
|------|-------|------|-----------|----------|------|-------|-------------|-------------|
| 1 | | reference | 83.3% | 7/9 | 8/9 | 9/9 | | ignored_document_id_error, proof_code_error |
| 2 | | local | 72.2% | 5/9 | 8/9 | 9/9 | | duplicate_risk_missed, evidence_path_format, +5 |
| 3 | | local | 61.1% | 4/9 | 7/9 | 9/9 | | duplicate_risk_missed, evidence_path_format, +9 |
| 4 | | local | 38.9% | 1/9 | 6/9 | 9/9 | | audit_result_wrong_location, evidence_path_format, +2 |
| 5 | | local | 27.8% | 2/9 | 3/9 | 9/9 | | attachment_index_error, document_index_error, +15 |
| 6 | | local | 27.8% | 0/9 | 5/9 | 9/9 | | duplicate_risk_missed, final_document_set_error, +6 |
| 7 | | local | 0.0% | 0/9 | 0/9 | 9/9 | | document_index_error, duplicate_risk_missed, +14 |
| 8 | | local | 0.0% | 0/9 | 0/9 | 9/9 | | attachment_index_error, document_index_error, +12 |

Case Sets

The overall score uses the two paperwork sets. The visual sample is kept separate.

The Paperwork Trial (in overall score)

Synthetic invoice PNG scans plus bank exports, vendor records, purchase orders, and exact audit-result oracles.

8 runs · 8 models · generated invoice images · CSV cross-checks · evidence fields · proof code oracle

Paperwork Workflow (in overall score)

Synthetic messy intake and email-attachment workflows with generated scans, protected sources, normalized artifacts, payment remapping, and hidden oracles.

32 runs · 8 models · source selection · generated images · protected folder · proof.txt oracle

Next Calibration Work

Add a few more genuinely agentic paperwork cases, then lock the scoring rules before running many models. The target is not a huge leaderboard yet; it is a small suite where failures are interpretable.

More generated document folders · Manual review labels for near misses · Cleaner public run pages