The Paperwork Trial
A practical benchmark for synthetic paperwork, messy document workflows, and inspectable model outputs.
Synthetic invoice PNG scans plus bank exports, vendor records, purchase orders, and exact audit-result oracles.
8 runs · 8 models scored

Paperwork Workflow
Synthetic messy intake and email-attachment workflows with generated scans, protected sources, normalized artifacts, payment remapping, and hidden oracles.
32 runs · 8 models

City Plan SVG (visual sample)
A city-plan SVG prompt with roads, blocks, and 3D or isometric buildings. Valid vector output, no Markdown excuses.
4 runs · not part of overall score

Practical Score = 50% resolved cases + 50% core passes across the current calibrated paperwork suite. Local LM Studio runs were executed on a Mac mini M4 with 64 GB unified memory; reference/API rows are marked separately.
| Rank | Model | Type | Practical | Resolved | Core | Tried | Main misses |
|---|---|---|---|---|---|---|---|
| 1 | codex-default | reference | 83.3% | 7/9 | 8/9 | 9/9 | ignored_document_id_error, proof_code_error |
| 2 | qwen3.6-27b | local | 72.2% | 5/9 | 8/9 | 9/9 | duplicate_risk_missed, evidence_path_format, +5 |
| 3 | gemma-4-26b-a4b | local | 61.1% | 4/9 | 7/9 | 9/9 | duplicate_risk_missed, evidence_path_format, +9 |
| 4 | qwen3.6-35b-a3b | local | 38.9% | 1/9 | 6/9 | 9/9 | audit_result_wrong_location, evidence_path_format, +2 |
| 5 | gemma-4-e4b | local | 27.8% | 2/9 | 3/9 | 9/9 | attachment_index_error, document_index_error, +15 |
| 6 | gemma-4-31b-it | local | 27.8% | 0/9 | 5/9 | 9/9 | duplicate_risk_missed, final_document_set_error, +6 |
| 7 | gemma-4-e2b | local | 0.0% | 0/9 | 0/9 | 9/9 | document_index_error, duplicate_risk_missed, +14 |
| 8 | ministral-3-3b | local | 0.0% | 0/9 | 0/9 | 9/9 | attachment_index_error, document_index_error, +12 |
The overall score uses the two paperwork sets. The visual sample is kept separate.
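
As a sanity check on the Practical Score formula, here is a minimal sketch that recomputes the Practical column from the Resolved and Core counts in the table. The function name `practical_score` and its parameters are illustrative and not taken from the benchmark's own code. It also assumes that both fractions share the same denominator (the 9 cases in the Tried column).

```python
def practical_score(resolved: int, core: int, total: int) -> float:
    """Return a percentage: 50% resolved-case rate plus 50% core-pass rate.

    Assumption: resolved and core are both counted over the same `total` cases.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not (0 <= resolved <= total and 0 <= core <= total):
        raise ValueError("counts must lie between 0 and total")
    resolved_rate = resolved / total
    core_rate = core / total
    return 100 * (0.5 * resolved_rate + 0.5 * core_rate)


# (resolved, core, total) copied from a few rows of the table.
rows = {
    "codex-default": (7, 8, 9),
    "qwen3.6-27b": (5, 8, 9),
    "gemma-4-e4b": (2, 3, 9),
    "gemma-4-31b-it": (0, 5, 9),
}

for model, (resolved, core, total) in rows.items():
    print(f"{model}: {practical_score(resolved, core, total):.1f}%")
```

Under these assumptions the sketch reproduces the published values for those rows (83.3%, 72.2%, 27.8%, 27.8%). The last two rows tie on Practical Score; the page does not state how ties are ranked.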