One model score, many case sets.
The overall score comes from calibrated practical work, while each case set keeps its own outputs, checks, and failure types visible.
The Paperwork Trial
Synthetic invoice PNG scans plus bank exports, vendor records, purchase orders, and exact audit-result oracles.
Proof model: visible checks are available during the run, but the final score is decided after the run finishes, by protected-file checks plus hidden oracles.
The leaderboard ranks this benchmark by Practical Score: the equally weighted average of the resolved pass@1 rate and the core-oracle pass rate. Common checks and runner-specific workflow checks remain visible as diagnostics.
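As a quick sanity check on the Practical Score definition, the sketch below recomputes the score from the Resolved and Core counts shown in the tables. The function name `practical_score` and the `rows` dictionary are illustrative assumptions, not part of the benchmark's own tooling; the counts are copied from the leaderboard rows.

```python
from fractions import Fraction


def practical_score(resolved: int, core: int, cases: int) -> float:
    """Practical Score in percent: half resolved pass@1 rate, half core-oracle pass rate.

    Hypothetical helper for illustration; the benchmark's real scorer is not shown in the source.
    """
    if cases <= 0:
        raise ValueError("cases must be positive")
    resolved_rate = Fraction(resolved, cases)
    core_rate = Fraction(core, cases)
    # Equal weighting of the two rates, expressed as a percentage.
    return float(100 * (resolved_rate + core_rate) / 2)


# (resolved, core, cases) copied from the leaderboard tables.
rows = {
    "gemma-4-26b-a4b (Paperwork Trial)": (4, 4, 5),
    "codex-default (Paperwork Trial)": (3, 4, 5),
    "qwen3.6-27b (Paperwork Workflow)": (2, 4, 4),
}

for model, (resolved, core, cases) in rows.items():
    print(f"{model}: {practical_score(resolved, core, cases):.1f}%")
```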
| Model | Type | Practical | Resolved | Near miss | Core | Visible | Failure types | Common checks | Run |
|---|---|---|---|---|---|---|---|---|---|
| gemma-4-26b-a4b | local | 80.0% | 4/5 (80%) | 0/5 | 4/5 (80%) | 5/5 | duplicate_risk_missed, evidence_path_format, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 18/20 (90%) | details |
| codex-default | reference | 70.0% | 3/5 (60%) | 1/5 | 4/5 (80%) | 5/5 | ignored_document_id_error, proof_code_error | 17/20 (85%) | details |
| qwen3.6-27b | local | 70.0% | 3/5 (60%) | 1/5 | 4/5 (80%) | 5/5 | duplicate_risk_missed, evidence_path_format, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 17/20 (85%) | details |
| gemma-4-e4b | local | 50.0% | 2/5 (40%) | 1/5 | 3/5 (60%) | 4/5 | duplicate_risk_missed, evidence_path_format, ignored_document_id_error, invoice_classification_error, invoice_id_format_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 14/20 (70%) | details |
| qwen3.6-35b-a3b | local | 40.0% | 1/5 (20%) | 2/5 | 3/5 (60%) | 3/5 | evidence_path_format | 10/20 (50%) | details |
| gemma-4-31b-it | local | 20.0% | 0/5 (0%) | 2/5 | 2/5 (40%) | 5/5 | duplicate_risk_missed, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 12/20 (60%) | details |
| gemma-4-e2b | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 5/5 | duplicate_risk_missed, evidence_path_format, ignored_document_id_error, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 10/20 (50%) | details |
| ministral-3-3b | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 0/5 | duplicate_risk_missed, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 1/20 (5%) | details |
Paperwork Workflow
Synthetic messy intake and email-attachment workflows with generated scans, protected sources, normalized artifacts, payment remapping, and hidden oracles.
Proof model: visible checks are available during the run, but the final score is decided after the run finishes, by protected-file checks plus hidden oracles.
The leaderboard ranks this benchmark by Practical Score: the equally weighted average of the resolved pass@1 rate and the core-oracle pass rate. Common checks and runner-specific workflow checks remain visible as diagnostics.
| Model | Type | Practical | Resolved | Near miss | Core | Visible | Failure types | Common checks | Run |
|---|---|---|---|---|---|---|---|---|---|
| codex-default | reference | 100.0% | 4/4 (100%) | 0/4 | 4/4 (100%) | 4/4 | none | 16/16 (100%) | details (best of 4) |
| qwen3.6-27b | local | 75.0% | 2/4 (50%) | 2/4 | 4/4 (100%) | 4/4 | final_document_set_error | 14/16 (88%) | details (best of 4) |
| gemma-4-26b-a4b | local | 37.5% | 0/4 (0%) | 3/4 | 3/4 (75%) | 4/4 | final_document_set_error, manifest_error, missing_or_wrong_evidence, normalized_text_error, proof_code_error, proof_txt_error, warning_code_error | 11/16 (69%) | details (best of 4) |
| gemma-4-31b-it | local | 37.5% | 0/4 (0%) | 3/4 | 3/4 (75%) | 4/4 | final_document_set_error, normalized_text_error, proof_code_error, proof_txt_error, warning_code_error | 11/16 (69%) | details (best of 4) |
| qwen3.6-35b-a3b | local | 37.5% | 0/4 (0%) | 3/4 | 3/4 (75%) | 4/4 | audit_result_wrong_location, manifest_error, missing_or_wrong_evidence | 10/16 (63%) | details (best of 4) |
| gemma-4-e2b | local | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 4/4 | document_index_error, final_document_set_error, format_failure, manifest_error, missing_or_wrong_evidence, no_output, normalized_text_error, payment_reconciliation_error, proof_code_error, proof_txt_error, total_calculation_error, warning_code_error | 5/16 (31%) | details (best of 4) |
| gemma-4-e4b | local | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 3/4 | attachment_index_error, document_index_error, final_document_set_error, format_failure, manifest_error, missing_or_wrong_evidence, normalized_text_error, proof_code_error, proof_txt_error, required_artifact_missing, total_calculation_error, warning_code_error | 5/16 (31%) | details (best of 4) |
| ministral-3-3b | local | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 0/4 | attachment_index_error, document_index_error, final_document_set_error, manifest_error, no_output, normalized_text_error, payment_reconciliation_error, proof_txt_error, required_artifact_missing | 0/16 (0%) | details (best of 4) |
City Plan SVG
A city-plan SVG prompt with roads, blocks, and 3D or isometric buildings. Valid vector output, no Markdown excuses.
Visual sample: this is one constrained SVG prompt, not a statistical benchmark. It is shown as pass/review/fail with checks and the generated artifact for manual inspection.
A pass means only that the output met the automated SVG and constraint checks. Visual quality still needs a human look.
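The page does not list the three automated checks, so the following is only a sketch of what "valid vector output, no Markdown" checks could look like. The function name `check_svg_output` and the three check names are hypothetical assumptions; the parsing uses Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def check_svg_output(text: str) -> dict:
    """Run three hypothetical checks on a model's raw output.

    These checks are assumptions for illustration, not the benchmark's actual checks.
    """
    fence = "`" * 3  # Markdown code-fence marker
    checks = {
        "no_markdown_fence": fence not in text,
        "well_formed_xml": False,
        "svg_root": False,
    }
    try:
        root = ET.fromstring(text.strip())
    except ET.ParseError:
        # Not parseable as XML, so the remaining checks stay False.
        return checks
    checks["well_formed_xml"] = True
    # Accept both a namespaced and a bare <svg> root element.
    checks["svg_root"] = root.tag in ("svg", f"{{{SVG_NS}}}svg")
    return checks


sample = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10">'
    '<rect width="10" height="10"/></svg>'
)
result = check_svg_output(sample)
print(f"{sum(result.values())}/{len(result)} checks passed")
```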
| Model | Type | Result | Checks | SVG preview | Run |
|---|---|---|---|---|---|
| gemma-4-31b-it | local | pass | 3/3 | | details |
| codex-default | reference | pass | 3/3 | | details |
| gpt-oss-20b:free | reference | pass | 3/3 | | details |
| gemma-4-e4b | local | pass | 3/3 | | details |
| ministral-3-3b | local | review | 2/3 | | details |
| gemma-4-e2b | local | review | 2/3 | | details |
| qwen3.6-35b-a3b | local | fail | 2/3 | No SVG output | details (best of 2) |
| qwen3.6-27b | local | fail | 1/3 | No SVG output | details (best of 2) |
| gemma-4-26b-a4b | local | fail | 1/3 | No SVG output | details (best of 2) |