One model score, many case sets.
The overall score comes from calibrated practical work, while each case set keeps its own outputs, checks, and failure types visible.
The Paperwork Trial
Synthetic invoice PNG scans plus bank exports, vendor records, purchase orders, and exact audit-result oracles.
Proof model: visible checks are available during the run, but the final score is decided after the run finishes, by protected-file checks plus hidden oracles.
The leaderboard ranks this benchmark by Practical Score: the equally weighted average of the resolved pass@1 rate and the core-oracle pass rate. Common checks and runner-specific workflow checks remain visible as diagnostics.
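As a worked example, the Practical Score can be reproduced from the Resolved and Core columns of the table. This is a minimal sketch, assuming both rates are taken over the same number of cases; the function name `practical_score` is illustrative and not part of the benchmark's tooling.

```python
def practical_score(resolved: int, core: int, cases: int) -> float:
    """Return the Practical Score as a percentage.

    Half of the score is the resolved pass@1 rate and half is the
    core-oracle pass rate, so the result is the mean of the two rates.
    """
    if cases <= 0:
        raise ValueError("cases must be positive")
    # (resolved/cases + core/cases) / 2, expressed as a percentage.
    return 100.0 * (resolved + core) / (2 * cases)


# 3/5 resolved and 4/5 core, as in the OpenAI GPT-5.5 row.
print(practical_score(3, 4, 5))  # 70.0
```

A model can therefore earn up to half of the score from core-oracle passes alone, which is why a row with 0/5 resolved and 4/5 core still shows 40.0%.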
| Model | Type | Practical | Resolved | Near miss | Core | Visible | Failure types | Common checks | Run |
|---|---|---|---|---|---|---|---|---|---|
| OpenAI GPT-5.4 Mini (Codex CLI) | reference | 100.0% | 5/5 (100%) | 0/5 | 5/5 (100%) | 5/5 | none | 20/20 (100%) | details |
| gemma-4-26b-a4b | local | 80.0% | 4/5 (80%) | 0/5 | 4/5 (80%) | 5/5 | duplicate_risk_missed, evidence_path_format, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 18/20 (90%) | details |
| opencode/minimax-m3-free | api cheap | 80.0% | 4/5 (80%) | 0/5 | 4/5 (80%) | 5/5 | duplicate_risk_missed, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 18/20 (90%) | details |
| OpenAI GPT-5.5 (Codex CLI) | reference | 70.0% | 3/5 (60%) | 1/5 | 4/5 (80%) | 5/5 | ignored_document_id_error, proof_code_error | 17/20 (85%) | details |
| qwen3.6-27b | local | 70.0% | 3/5 (60%) | 1/5 | 4/5 (80%) | 5/5 | duplicate_risk_missed, evidence_path_format, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 17/20 (85%) | details |
| gemma-4-e4b | local | 50.0% | 2/5 (40%) | 1/5 | 3/5 (60%) | 4/5 | duplicate_risk_missed, evidence_path_format, ignored_document_id_error, invoice_classification_error, invoice_id_format_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 14/20 (70%) | details |
| qwen3.6-35b-a3b | local | 40.0% | 1/5 (20%) | 2/5 | 3/5 (60%) | 3/5 | evidence_path_format | 10/20 (50%) | details |
| qwen3.6-flash | api cheap | 40.0% | 0/5 (0%) | 4/5 | 4/5 (80%) | 5/5 | duplicate_risk_missed, evidence_path_format, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 14/20 (70%) | details |
| mistral-small-3.2 | local | 30.0% | 0/5 (0%) | 3/5 | 3/5 (60%) | 4/5 | duplicate_risk_missed, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 12/20 (60%) | details |
| Mistral Small 4 | api cheap | 30.0% | 0/5 (0%) | 3/5 | 3/5 (60%) | 4/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 12/20 (60%) | details |
| gemma-4-31b-it | local | 20.0% | 0/5 (0%) | 2/5 | 2/5 (40%) | 5/5 | duplicate_risk_missed, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 12/20 (60%) | details |
| Seed 2.0 Mini | api cheap | 20.0% | 0/5 (0%) | 2/5 | 2/5 (40%) | 5/5 | duplicate_risk_missed, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 12/20 (60%) | details |
| gemini-3.1-flash-lite | api cheap | 20.0% | 0/5 (0%) | 2/5 | 2/5 (40%) | 5/5 | duplicate_risk_missed, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 12/20 (60%) | details |
| qwen3-vl-32b-instruct | api cheap | 20.0% | 0/5 (0%) | 2/5 | 2/5 (40%) | 5/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 12/20 (60%) | details |
| ministral-3-14b | local | 20.0% | 0/5 (0%) | 2/5 | 2/5 (40%) | 4/5 | duplicate_risk_missed, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 11/20 (55%) | details |
| gemini-2.5-flash | api cheap | 20.0% | 0/5 (0%) | 2/5 | 2/5 (40%) | 4/5 | duplicate_risk_missed, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 10/20 (50%) | details |
| Qwen3 VL 30B A3B | api cheap | 20.0% | 0/5 (0%) | 2/5 | 2/5 (40%) | 4/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 10/20 (50%) | details |
| gemma-4-12b | local | 10.0% | 0/5 (0%) | 1/5 | 1/5 (20%) | 4/5 | ignored_document_id_error, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 9/20 (45%) | details |
| gemma-4-e2b | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 5/5 | duplicate_risk_missed, evidence_path_format, ignored_document_id_error, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 10/20 (50%) | details |
| nemotron-3-nano-omni-30b-a3b-reasoning:free | api cheap | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 3/5 | duplicate_risk_missed, ignored_document_id_error, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 7/20 (35%) | details |
| qwen3-vl-8b-instruct | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 2/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 7/20 (35%) | details |
| qwen3-14b | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 1/5 | ignored_document_id_error, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 2/20 (10%) | details |
| ministral-3-3b | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 0/5 | duplicate_risk_missed, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 1/20 (5%) | details |
Paperwork Workflow
Synthetic messy intake and email-attachment workflows with generated scans, protected sources, normalized artifacts, payment remapping, and hidden oracles.
Proof model: visible checks are available during the run, but the final score is decided after the run finishes, by protected-file checks plus hidden oracles.
The leaderboard ranks this benchmark by Practical Score: the equally weighted average of the resolved pass@1 rate and the core-oracle pass rate. Common checks and runner-specific workflow checks remain visible as diagnostics.
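The page does not say how the protected-file checks are implemented. One common way to verify that protected sources were left untouched is to compare content hashes taken before and after the run. The sketch below illustrates that idea under this assumption; the helper names `snapshot` and `changed_files` are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return paths that were added, removed, or modified between snapshots."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )
```

In this sketch a harness would call `snapshot` on the protected directory before the model starts and again after it finishes, and treat any non-empty `changed_files` result as a failed check.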
| Model | Type | Practical | Resolved | Near miss | Core | Visible | Failure types | Common checks | Run |
|---|---|---|---|---|---|---|---|---|---|
| opencode/minimax-m3-free | api cheap | 100.0% | 4/4 (100%) | 0/4 | 4/4 (100%) | 4/4 | none | 16/16 (100%) | details best of 4 |
| OpenAI GPT-5.5 (Codex CLI) | reference | 100.0% | 4/4 (100%) | 0/4 | 4/4 (100%) | 4/4 | none | 16/16 (100%) | details best of 4 |
| qwen3.6-27b | local | 75.0% | 2/4 (50%) | 2/4 | 4/4 (100%) | 4/4 | final_document_set_error | 14/16 (88%) | details best of 4 |
| OpenAI GPT-5.4 Mini (Codex CLI) | reference | 50.0% | 2/4 (50%) | 0/4 | 2/4 (50%) | 4/4 | invoice_classification_error, proof_code_error, proof_txt_error, warning_code_error, wrong_document_selected | 12/16 (75%) | details best of 4 |
| gemini-3.1-flash-lite | api cheap | 37.5% | 0/4 (0%) | 3/4 | 3/4 (75%) | 4/4 | attachment_index_error, missing_or_wrong_evidence, normalized_text_error, payment_reconciliation_error, proof_code_error, proof_txt_error, warning_code_error | 11/16 (69%) | details best of 4 |
| gemma-4-26b-a4b | local | 37.5% | 0/4 (0%) | 3/4 | 3/4 (75%) | 4/4 | final_document_set_error, manifest_error, missing_or_wrong_evidence, normalized_text_error, proof_code_error, proof_txt_error, warning_code_error | 11/16 (69%) | details best of 4 |
| gemma-4-31b-it | local | 37.5% | 0/4 (0%) | 3/4 | 3/4 (75%) | 4/4 | final_document_set_error, normalized_text_error, proof_code_error, proof_txt_error, warning_code_error | 11/16 (69%) | details best of 4 |
| gemini-2.5-flash | api cheap | 37.5% | 0/4 (0%) | 3/4 | 3/4 (75%) | 4/4 | final_document_set_error, format_failure, manifest_error, normalized_text_error, payment_reconciliation_error, proof_code_error, proof_txt_error | 10/16 (63%) | details best of 4 |
| qwen3.6-35b-a3b | local | 37.5% | 0/4 (0%) | 3/4 | 3/4 (75%) | 4/4 | audit_result_wrong_location, manifest_error, missing_or_wrong_evidence | 10/16 (63%) | details best of 4 |
| Qwen3 VL 30B A3B | api cheap | 37.5% | 0/4 (0%) | 2/4 | 3/4 (75%) | 3/4 | audit_result_wrong_location, document_index_error, final_document_set_error, manifest_error, proof_code_error, proof_txt_error, required_artifact_missing | 9/16 (56%) | details best of 4 |
| qwen3-vl-32b-instruct | api cheap | 25.0% | 0/4 (0%) | 2/4 | 2/4 (50%) | 4/4 | final_document_set_error, invoice_classification_error, missing_or_wrong_evidence, payment_reconciliation_error, proof_code_error, proof_txt_error, total_calculation_error, warning_code_error | 10/16 (63%) | details best of 4 |
| Seed 2.0 Mini | api cheap | 25.0% | 0/4 (0%) | 2/4 | 2/4 (50%) | 4/4 | attachment_index_error, final_document_set_error, format_failure, invoice_classification_error, manifest_error, proforma_not_ignored, proof_code_error, proof_txt_error, total_calculation_error, warning_code_error | 9/16 (56%) | details best of 4 |
| qwen3.6-flash | api cheap | 25.0% | 0/4 (0%) | 2/4 | 2/4 (50%) | 4/4 | final_document_set_error, format_failure, manifest_error, proof_code_error, proof_txt_error | 8/16 (50%) | details best of 4 |
| Mistral Small 4 | api cheap | 12.5% | 0/4 (0%) | 1/4 | 1/4 (25%) | 2/4 | audit_result_wrong_location, final_document_set_error, manifest_error, payment_reconciliation_error, proof_code_error, proof_txt_error, required_artifact_missing, warning_code_error | 5/16 (31%) | details best of 4 |
| ministral-3-14b | local | 12.5% | 0/4 (0%) | 1/4 | 1/4 (25%) | 2/4 | attachment_index_error, final_document_set_error, invoice_classification_error, manifest_error, missing_or_wrong_evidence, no_output, normalized_text_error, proof_code_error, proof_txt_error, required_artifact_missing, total_calculation_error, warning_code_error | 5/16 (31%) | details best of 4 |
| qwen3-vl-8b-instruct | local | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 4/4 | attachment_index_error, document_index_error, final_document_set_error, format_failure, ignored_document_id_error, invoice_classification_error, manifest_error, missing_or_wrong_evidence, normalized_text_error, payment_reconciliation_error, proof_code_error, proof_txt_error, total_calculation_error, warning_code_error | 7/16 (44%) | details best of 4 |
| qwen3-14b | local | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 3/4 | document_index_error, final_document_set_error, format_failure, ignored_document_id_error, invoice_classification_error, manifest_error, missing_or_wrong_evidence, normalized_text_error, payment_reconciliation_error, proof_code_error, proof_txt_error, required_artifact_missing, total_calculation_error, warning_code_error, wrong_document_selected | 6/16 (38%) | details best of 4 |
| mistral-small-3.2 | local | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 4/4 | attachment_index_error, cancelled_po_not_rejected, credit_memo_counted_as_invoice, duplicate_scan_counted, final_document_set_error, format_failure, invoice_classification_error, manifest_error, normalized_text_error, payment_reconciliation_error, proof_code_error, proof_txt_error, total_calculation_error, warning_code_error | 6/16 (38%) | details best of 4 |
| gemma-4-e2b | local | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 4/4 | document_index_error, final_document_set_error, format_failure, manifest_error, missing_or_wrong_evidence, no_output, normalized_text_error, payment_reconciliation_error, proof_code_error, proof_txt_error, total_calculation_error, warning_code_error | 5/16 (31%) | details best of 4 |
| gemma-4-e4b | local | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 3/4 | attachment_index_error, document_index_error, final_document_set_error, format_failure, manifest_error, missing_or_wrong_evidence, normalized_text_error, proof_code_error, proof_txt_error, required_artifact_missing, total_calculation_error, warning_code_error | 5/16 (31%) | details best of 4 |
| gemma-4-12b | local | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 1/4 | attachment_index_error, document_index_error, final_document_set_error, format_failure, manifest_error, no_output, normalized_text_error, payment_reconciliation_error, proof_txt_error, required_artifact_missing | 1/16 (6%) | details best of 4 |
| nemotron-3-nano-omni-30b-a3b-reasoning:free | api cheap | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 0/4 | attachment_index_error, document_index_error, final_document_set_error, manifest_error, no_output, normalized_text_error, payment_reconciliation_error, proof_txt_error, required_artifact_missing | 0/16 (0%) | details best of 4 |
| ministral-3-3b | local | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 0/4 | attachment_index_error, document_index_error, final_document_set_error, manifest_error, no_output, normalized_text_error, payment_reconciliation_error, proof_txt_error, required_artifact_missing | 0/16 (0%) | details best of 4 |
Paperwork Text-Only
The same generated invoice cases, but with normalized text extracts instead of image input. This separates bookkeeping logic from document vision.
Proof model: visible checks are available during the run, but the final score is decided after the run finishes, by protected-file checks plus hidden oracles.
The leaderboard ranks this benchmark by Practical Score: the equally weighted average of the resolved pass@1 rate and the core-oracle pass rate. Common checks and runner-specific workflow checks remain visible as diagnostics.
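To see how the text-only track separates bookkeeping logic from document vision, compare a model's Practical Score here with its score on the image-based Paperwork Trial. The sketch below uses a few Practical Score values copied from the two tables on this page; the function name `vision_gap` is illustrative, and with only five cases per track the differences should be read as rough signals, not measurements.

```python
# Practical Score (%) per model, copied from the leaderboard tables.
image_track = {
    "OpenAI GPT-5.4 Mini (Codex CLI)": 100.0,
    "gemma-4-26b-a4b": 80.0,
    "OpenAI GPT-5.5 (Codex CLI)": 70.0,
    "qwen3.6-35b-a3b": 40.0,
    "mistral-small-3.2": 30.0,
}
text_track = {
    "OpenAI GPT-5.4 Mini (Codex CLI)": 60.0,
    "gemma-4-26b-a4b": 70.0,
    "OpenAI GPT-5.5 (Codex CLI)": 60.0,
    "qwen3.6-35b-a3b": 80.0,
    "mistral-small-3.2": 10.0,
}


def vision_gap(image: dict[str, float], text: dict[str, float]) -> dict[str, float]:
    """Return text-only score minus image score for models present in both tracks."""
    return {model: text[model] - image[model] for model in image.keys() & text.keys()}


for model, gap in sorted(vision_gap(image_track, text_track).items()):
    print(f"{model}: {gap:+.1f} points")
```

A positive gap suggests the image input was the limiting factor for that model. A negative gap suggests the model did not benefit from the normalized text, so reading the scans is unlikely to be its main problem.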
| Model | Type | Practical | Resolved | Near miss | Core | Visible | Failure types | Common checks | Run |
|---|---|---|---|---|---|---|---|---|---|
| qwen3.6-27b-mtp | local | 100.0% | 5/5 (100%) | 0/5 | 5/5 (100%) | 5/5 | none | 20/20 (100%) | details |
| qwen3.6-27b | local | 90.0% | 4/5 (80%) | 1/5 | 5/5 (100%) | 5/5 | proof_code_error | 19/20 (95%) | details best of 2 |
| qwen3.6-35b-a3b | local | 80.0% | 4/5 (80%) | 0/5 | 4/5 (80%) | 4/5 | format_failure, reasoning_no_visible_output, token_limit_exhausted | 16/20 (80%) | details |
| gemma-4-26b-a4b | local | 70.0% | 3/5 (60%) | 1/5 | 4/5 (80%) | 5/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 17/20 (85%) | details |
| OpenAI GPT-5.5 (Codex CLI) | reference | 60.0% | 2/5 (40%) | 2/5 | 4/5 (80%) | 5/5 | proof_code_error, warning_code_error | 16/20 (80%) | details |
| OpenAI GPT-5.4 Mini (Codex CLI) | reference | 60.0% | 3/5 (60%) | 0/5 | 3/5 (60%) | 5/5 | missing_or_wrong_evidence, proof_code_error, warning_code_error | 16/20 (80%) | details |
| gemma-4-e2b | local | 60.0% | 3/5 (60%) | 0/5 | 3/5 (60%) | 4/5 | format_failure, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, reasoning_no_visible_output, token_limit_exhausted, total_calculation_error, warning_code_error | 14/20 (70%) | details |
| gemini-3.5-flash | local | 50.0% | 2/5 (40%) | 1/5 | 3/5 (60%) | 5/5 | invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 15/20 (75%) | details |
| gemma-4-e4b | local | 50.0% | 0/5 (0%) | 5/5 | 5/5 (100%) | 5/5 | missing_or_wrong_evidence, proof_code_error | 15/20 (75%) | details |
| microsoft/phi-4-reasoning-plus | local | 40.0% | 2/5 (40%) | 0/5 | 2/5 (40%) | 2/5 | format_failure, reasoning_no_visible_output, token_limit_exhausted | 8/20 (40%) | details |
| Qwen3.7 Max | api cheap | 40.0% | 0/5 (0%) | 4/5 | 4/5 (80%) | 5/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 14/20 (70%) | details |
| gpt-oss-20b | local | 30.0% | 0/5 (0%) | 3/5 | 3/5 (60%) | 4/5 | duplicate_risk_missed, invoice_classification_error, invoice_id_format_error, missing_or_wrong_evidence, proof_code_error, warning_code_error | 12/20 (60%) | details |
| ollama-gpt-oss-20b | local | 30.0% | 0/5 (0%) | 3/5 | 3/5 (60%) | 4/5 | format_failure, missing_or_wrong_evidence, token_limit_exhausted, warning_code_error | 11/20 (55%) | details |
| gemma-4-31b-it | local | 30.0% | 0/5 (0%) | 3/5 | 3/5 (60%) | 4/5 | format_failure, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error | 11/20 (55%) | details |
| qwen3.6-flash | api cheap | 20.0% | 0/5 (0%) | 2/5 | 2/5 (40%) | 5/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 12/20 (60%) | details |
| Granite 4.1 8B | api cheap | 10.0% | 0/5 (0%) | 1/5 | 1/5 (20%) | 5/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 11/20 (55%) | details |
| ollama-mistral-small-24b | local | 10.0% | 0/5 (0%) | 1/5 | 1/5 (20%) | 4/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 10/20 (50%) | details |
| ministral-3-14b | local | 10.0% | 0/5 (0%) | 1/5 | 1/5 (20%) | 4/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 10/20 (50%) | details |
| mistral-small-3.2 | local | 10.0% | 0/5 (0%) | 1/5 | 1/5 (20%) | 3/5 | duplicate_risk_missed, format_failure, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 8/20 (40%) | details |
| Chrome Gemini Nano | browser | 10.0% | 0/5 (0%) | 1/5 | 1/5 (20%) | 3/5 | duplicate_risk_missed, format_failure, ignored_document_id_error, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 7/20 (35%) | details |
| Apple Foundation Model | system | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 2/5 | duplicate_risk_missed, format_failure, ignored_document_id_error, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 6/20 (30%) | details |
| liquid/lfm2-24b-a2b | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 3/5 | format_failure, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 6/20 (30%) | details |
| qwen3-vl-4b | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 1/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 6/20 (30%) | details |
| qwen3-vl-8b-instruct | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 2/5 | duplicate_risk_missed, format_failure, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 5/20 (25%) | details |
| qwen3-14b | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 2/5 | duplicate_risk_missed, format_failure, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 4/20 (20%) | details |
| gemma-3n-e4b | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 1/5 | duplicate_risk_missed, format_failure, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 3/20 (15%) | details |
| microsoft/phi-4 | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 0/5 | format_failure | 0/20 (0%) | details |
City Plan SVG
A city-plan SVG prompt with roads, blocks, and 3D or isometric buildings. Valid vector output, no Markdown excuses.
Visual sample: this is one constrained SVG prompt, not a statistical benchmark. It is shown as pass/review/fail with checks and the generated artifact for manual inspection.
A pass only means the output met the automated SVG and constraint checks. Visual quality still needs a human look.
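The three automated checks are not listed on this page. As a rough illustration of what checking for valid vector output and rejecting Markdown wrappers could look like, the sketch below tests a model response for a Markdown code fence, well-formed XML, and an `svg` root element. The check names and the function `check_svg` are assumptions, not the benchmark's actual checks.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def check_svg(output: str) -> dict[str, bool]:
    """Run three hypothetical checks on a raw model response."""
    text = output.strip()
    fence = "`" * 3  # Markdown code-fence marker
    results = {
        "no_markdown_fence": fence not in text,
        "well_formed_xml": False,
        "svg_root": False,
    }
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Not parseable as XML, so the remaining checks stay False.
        return results
    results["well_formed_xml"] = True
    # Accept both namespaced and bare <svg> root elements.
    results["svg_root"] = root.tag in (SVG_NS + "svg", "svg")
    return results


sample = '<svg xmlns="http://www.w3.org/2000/svg"><rect width="10" height="10"/></svg>'
print(check_svg(sample))
```

Checks like these only confirm that the response is parseable SVG. They say nothing about whether the roads, blocks, and buildings look right, which is why the rendered artifact is kept for manual inspection.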
| Model | Type | Result | Checks | SVG preview | Run |
|---|---|---|---|---|---|
| gemini-3.5-flash | local | pass | 3/3 | | details |
| gpt-oss-20b | local | pass | 3/3 | | details |
| OpenAI GPT-5.4 Mini (Codex CLI) | reference | pass | 3/3 | | details |
| qwen3.6-flash | api cheap | pass | 3/3 | | details |
| poolside/laguna-xs.2:free | api cheap | pass | 3/3 | | details |
| qwen3-14b | local | pass | 3/3 | | details |
| gemini-3.1-flash-lite | api cheap | pass | 3/3 | | details |
| gemini-2.5-flash | api cheap | pass | 3/3 | | details |
| nemotron-3-super-120b-a12b:free | api cheap | pass | 3/3 | | details |
| qwen3-vl-32b-instruct | api cheap | pass | 3/3 | | details |
| gemma-4-31b-it | local | pass | 3/3 | | details |
| OpenAI GPT-5.5 (Codex CLI) | reference | pass | 3/3 | | details |
| gpt-oss-20b:free | api cheap | pass | 3/3 | | details |
| gemma-4-e4b | local | pass | 3/3 | | details |
| Mistral Small 4 | api cheap | review | 2/3 | | details |
| Chrome Gemini Nano | browser | review | 2/3 | | details |
| liquid/lfm2-24b-a2b | local | review | 2/3 | | details |
| ministral-3-14b | local | review | 2/3 | | details |
| mistral-small-3.2 | local | review | 2/3 | | details |
| ministral-3-3b | local | review | 2/3 | | details |
| gemma-4-e2b | local | review | 2/3 | | details |
| Seed 2.0 Mini | api cheap | fail | 2/3 | No SVG output | details |
| Qwen3.7 Max | api cheap | fail | 2/3 | No SVG output | details |
| Granite 4.1 8B | api cheap | fail | 2/3 | No SVG output | details |
| microsoft/phi-4-reasoning-plus | local | fail | 2/3 | No SVG output | details |
| qwen3.6-35b-a3b | local | fail | 2/3 | No SVG output | details best of 2 |
| gemma-4-12b | local | fail | 1/3 | No SVG output | details |
| Qwen3 VL 30B A3B | api cheap | fail | 1/3 | No SVG output | details |
| qwen3.6-27b | local | fail | 1/3 | No SVG output | details best of 2 |
| gemma-4-26b-a4b | local | fail | 1/3 | No SVG output | details best of 2 |
| opencode/minimax-m3-free | api cheap | fail | 0/3 | No SVG output | details |