One model score, many case sets.

The overall score comes from calibrated practical work, while each case set keeps its own outputs, checks, and failure types visible.

4 live benchmarks
0 planned benchmarks
104 complete runs
Scanned paperwork

The Paperwork Trial

Synthetic invoice PNG scans plus bank exports, vendor records, purchase orders, and exact audit-result oracles.

23 complete runs · 23 models
generated invoice images · CSV cross-checks · evidence fields · proof code oracle

Proof model: visible checks are available during the run, but the final score is decided after finish by protected-file checks plus hidden oracles.
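One way to picture a protected-file check is a hash snapshot taken before the run and compared after it finishes. The sketch below is an assumption about how such a check could work, not the benchmark's actual implementation; `snapshot` and `changed_files` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def snapshot(folder: Path) -> dict[str, str]:
    """Map each file's path (relative to folder) to its SHA-256 digest."""
    return {
        p.relative_to(folder).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return files that were added, removed, or modified between snapshots."""
    names = before.keys() | after.keys()
    return sorted(n for n in names if before.get(n) != after.get(n))


# Hypothetical usage: take one snapshot before the model starts, one after it
# finishes, and fail the run if any protected file differs.
# before = snapshot(Path("protected"))
# ... run the model ...
# after = snapshot(Path("protected"))
# run_is_clean = not changed_files(before, after)
```

Under this assumption, a run that edits, deletes, or adds anything inside the protected folder is flagged regardless of what the visible checks reported during the run.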

The leaderboard ranks this benchmark by Practical Score: half resolved pass@1, half core-oracle pass. Common checks and runner-specific workflow checks remain visible as diagnostics.
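Read literally, "half resolved pass@1, half core-oracle pass" is the mean of two pass rates over the same case set. A minimal sketch under that reading follows; the function name and argument names are illustrative and do not come from the benchmark's code.

```python
def practical_score(resolved: int, core_passed: int, cases: int) -> float:
    """Practical Score as a percentage: mean of the resolved and core pass rates."""
    if cases <= 0:
        raise ValueError("cases must be positive")
    if not (0 <= resolved <= cases and 0 <= core_passed <= cases):
        raise ValueError("pass counts must lie between 0 and cases")
    # 0.5 * resolved/cases + 0.5 * core_passed/cases, written so the
    # arithmetic stays exact for small integer counts.
    return 100 * (resolved + core_passed) / (2 * cases)


# 3/5 resolved and 4/5 core-oracle passes
print(practical_score(3, 4, 5))  # 70.0
```

This reproduces rows in the leaderboards on this page: 3/5 resolved with 4/5 core gives 70.0%, and 0/4 resolved with 3/4 core gives 37.5%.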

| Model | Type | Practical | Resolved | Near miss | Core | Visible | Failure types | Common checks |
|---|---|---|---|---|---|---|---|---|
| OpenAI GPT-5.4 Mini (Codex CLI) | reference | 100.0% | 5/5 (100%) | 0/5 | 5/5 (100%) | 5/5 | none | 20/20 (100%) |
| gemma-4-26b-a4b | local | 80.0% | 4/5 (80%) | 0/5 | 4/5 (80%) | 5/5 | duplicate_risk_missed, evidence_path_format, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 18/20 (90%) |
| opencode/minimax-m3-free | api cheap | 80.0% | 4/5 (80%) | 0/5 | 4/5 (80%) | 5/5 | duplicate_risk_missed, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 18/20 (90%) |
| OpenAI GPT-5.5 (Codex CLI) | reference | 70.0% | 3/5 (60%) | 1/5 | 4/5 (80%) | 5/5 | ignored_document_id_error, proof_code_error | 17/20 (85%) |
| qwen3.6-27b | local | 70.0% | 3/5 (60%) | 1/5 | 4/5 (80%) | 5/5 | duplicate_risk_missed, evidence_path_format, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 17/20 (85%) |
| gemma-4-e4b | local | 50.0% | 2/5 (40%) | 1/5 | 3/5 (60%) | 4/5 | duplicate_risk_missed, evidence_path_format, ignored_document_id_error, invoice_classification_error, invoice_id_format_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 14/20 (70%) |
| qwen3.6-35b-a3b | local | 40.0% | 1/5 (20%) | 2/5 | 3/5 (60%) | 3/5 | evidence_path_format | 10/20 (50%) |
| qwen3.6-flash | api cheap | 40.0% | 0/5 (0%) | 4/5 | 4/5 (80%) | 5/5 | duplicate_risk_missed, evidence_path_format, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 14/20 (70%) |
| mistral-small-3.2 | local | 30.0% | 0/5 (0%) | 3/5 | 3/5 (60%) | 4/5 | duplicate_risk_missed, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 12/20 (60%) |
| Mistral Small 4 | api cheap | 30.0% | 0/5 (0%) | 3/5 | 3/5 (60%) | 4/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 12/20 (60%) |
| gemma-4-31b-it | local | 20.0% | 0/5 (0%) | 2/5 | 2/5 (40%) | 5/5 | duplicate_risk_missed, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 12/20 (60%) |
| Seed 2.0 Mini | api cheap | 20.0% | 0/5 (0%) | 2/5 | 2/5 (40%) | 5/5 | duplicate_risk_missed, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 12/20 (60%) |
| gemini-3.1-flash-lite | api cheap | 20.0% | 0/5 (0%) | 2/5 | 2/5 (40%) | 5/5 | duplicate_risk_missed, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 12/20 (60%) |
| qwen3-vl-32b-instruct | api cheap | 20.0% | 0/5 (0%) | 2/5 | 2/5 (40%) | 5/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 12/20 (60%) |
| ministral-3-14b | local | 20.0% | 0/5 (0%) | 2/5 | 2/5 (40%) | 4/5 | duplicate_risk_missed, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 11/20 (55%) |
| gemini-2.5-flash | api cheap | 20.0% | 0/5 (0%) | 2/5 | 2/5 (40%) | 4/5 | duplicate_risk_missed, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 10/20 (50%) |
| Qwen3 VL 30B A3B | api cheap | 20.0% | 0/5 (0%) | 2/5 | 2/5 (40%) | 4/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 10/20 (50%) |
| gemma-4-12b | local | 10.0% | 0/5 (0%) | 1/5 | 1/5 (20%) | 4/5 | ignored_document_id_error, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 9/20 (45%) |
| gemma-4-e2b | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 5/5 | duplicate_risk_missed, evidence_path_format, ignored_document_id_error, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 10/20 (50%) |
| nemotron-3-nano-omni-30b-a3b-reasoning:free | api cheap | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 3/5 | duplicate_risk_missed, ignored_document_id_error, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 7/20 (35%) |
| qwen3-vl-8b-instruct | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 2/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 7/20 (35%) |
| qwen3-14b | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 1/5 | ignored_document_id_error, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 2/20 (10%) |
| ministral-3-3b | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 0/5 | duplicate_risk_missed, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 1/20 (5%) |

Agentic paperwork folders

Paperwork Workflow

Synthetic messy intake and email-attachment workflows with generated scans, protected sources, normalized artifacts, payment remapping, and hidden oracles.

23 complete runs · 23 models
source selection · generated images · protected folder · proof.txt oracle

Proof model: visible checks are available during the run, but the final score is decided after finish by protected-file checks plus hidden oracles.

The leaderboard ranks this benchmark by Practical Score: half resolved pass@1, half core-oracle pass. Common checks and runner-specific workflow checks remain visible as diagnostics.

| Model | Type | Practical | Resolved | Near miss | Core | Visible | Failure types | Common checks | Run |
|---|---|---|---|---|---|---|---|---|---|
| opencode/minimax-m3-free | api cheap | 100.0% | 4/4 (100%) | 0/4 | 4/4 (100%) | 4/4 | none | 16/16 (100%) | best of 4 |
| OpenAI GPT-5.5 (Codex CLI) | reference | 100.0% | 4/4 (100%) | 0/4 | 4/4 (100%) | 4/4 | none | 16/16 (100%) | best of 4 |
| qwen3.6-27b | local | 75.0% | 2/4 (50%) | 2/4 | 4/4 (100%) | 4/4 | final_document_set_error | 14/16 (88%) | best of 4 |
| OpenAI GPT-5.4 Mini (Codex CLI) | reference | 50.0% | 2/4 (50%) | 0/4 | 2/4 (50%) | 4/4 | invoice_classification_error, proof_code_error, proof_txt_error, warning_code_error, wrong_document_selected | 12/16 (75%) | best of 4 |
| gemini-3.1-flash-lite | api cheap | 37.5% | 0/4 (0%) | 3/4 | 3/4 (75%) | 4/4 | attachment_index_error, missing_or_wrong_evidence, normalized_text_error, payment_reconciliation_error, proof_code_error, proof_txt_error, warning_code_error | 11/16 (69%) | best of 4 |
| gemma-4-26b-a4b | local | 37.5% | 0/4 (0%) | 3/4 | 3/4 (75%) | 4/4 | final_document_set_error, manifest_error, missing_or_wrong_evidence, normalized_text_error, proof_code_error, proof_txt_error, warning_code_error | 11/16 (69%) | best of 4 |
| gemma-4-31b-it | local | 37.5% | 0/4 (0%) | 3/4 | 3/4 (75%) | 4/4 | final_document_set_error, normalized_text_error, proof_code_error, proof_txt_error, warning_code_error | 11/16 (69%) | best of 4 |
| gemini-2.5-flash | api cheap | 37.5% | 0/4 (0%) | 3/4 | 3/4 (75%) | 4/4 | final_document_set_error, format_failure, manifest_error, normalized_text_error, payment_reconciliation_error, proof_code_error, proof_txt_error | 10/16 (63%) | best of 4 |
| qwen3.6-35b-a3b | local | 37.5% | 0/4 (0%) | 3/4 | 3/4 (75%) | 4/4 | audit_result_wrong_location, manifest_error, missing_or_wrong_evidence | 10/16 (63%) | best of 4 |
| Qwen3 VL 30B A3B | api cheap | 37.5% | 0/4 (0%) | 2/4 | 3/4 (75%) | 3/4 | audit_result_wrong_location, document_index_error, final_document_set_error, manifest_error, proof_code_error, proof_txt_error, required_artifact_missing | 9/16 (56%) | best of 4 |
| qwen3-vl-32b-instruct | api cheap | 25.0% | 0/4 (0%) | 2/4 | 2/4 (50%) | 4/4 | final_document_set_error, invoice_classification_error, missing_or_wrong_evidence, payment_reconciliation_error, proof_code_error, proof_txt_error, total_calculation_error, warning_code_error | 10/16 (63%) | best of 4 |
| Seed 2.0 Mini | api cheap | 25.0% | 0/4 (0%) | 2/4 | 2/4 (50%) | 4/4 | attachment_index_error, final_document_set_error, format_failure, invoice_classification_error, manifest_error, proforma_not_ignored, proof_code_error, proof_txt_error, total_calculation_error, warning_code_error | 9/16 (56%) | best of 4 |
| qwen3.6-flash | api cheap | 25.0% | 0/4 (0%) | 2/4 | 2/4 (50%) | 4/4 | final_document_set_error, format_failure, manifest_error, proof_code_error, proof_txt_error | 8/16 (50%) | best of 4 |
| Mistral Small 4 | api cheap | 12.5% | 0/4 (0%) | 1/4 | 1/4 (25%) | 2/4 | audit_result_wrong_location, final_document_set_error, manifest_error, payment_reconciliation_error, proof_code_error, proof_txt_error, required_artifact_missing, warning_code_error | 5/16 (31%) | best of 4 |
| ministral-3-14b | local | 12.5% | 0/4 (0%) | 1/4 | 1/4 (25%) | 2/4 | attachment_index_error, final_document_set_error, invoice_classification_error, manifest_error, missing_or_wrong_evidence, no_output, normalized_text_error, proof_code_error, proof_txt_error, required_artifact_missing, total_calculation_error, warning_code_error | 5/16 (31%) | best of 4 |
| qwen3-vl-8b-instruct | local | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 4/4 | attachment_index_error, document_index_error, final_document_set_error, format_failure, ignored_document_id_error, invoice_classification_error, manifest_error, missing_or_wrong_evidence, normalized_text_error, payment_reconciliation_error, proof_code_error, proof_txt_error, total_calculation_error, warning_code_error | 7/16 (44%) | best of 4 |
| qwen3-14b | local | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 3/4 | document_index_error, final_document_set_error, format_failure, ignored_document_id_error, invoice_classification_error, manifest_error, missing_or_wrong_evidence, normalized_text_error, payment_reconciliation_error, proof_code_error, proof_txt_error, required_artifact_missing, total_calculation_error, warning_code_error, wrong_document_selected | 6/16 (38%) | best of 4 |
| mistral-small-3.2 | local | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 4/4 | attachment_index_error, cancelled_po_not_rejected, credit_memo_counted_as_invoice, duplicate_scan_counted, final_document_set_error, format_failure, invoice_classification_error, manifest_error, normalized_text_error, payment_reconciliation_error, proof_code_error, proof_txt_error, total_calculation_error, warning_code_error | 6/16 (38%) | best of 4 |
| gemma-4-e2b | local | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 4/4 | document_index_error, final_document_set_error, format_failure, manifest_error, missing_or_wrong_evidence, no_output, normalized_text_error, payment_reconciliation_error, proof_code_error, proof_txt_error, total_calculation_error, warning_code_error | 5/16 (31%) | best of 4 |
| gemma-4-e4b | local | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 3/4 | attachment_index_error, document_index_error, final_document_set_error, format_failure, manifest_error, missing_or_wrong_evidence, normalized_text_error, proof_code_error, proof_txt_error, required_artifact_missing, total_calculation_error, warning_code_error | 5/16 (31%) | best of 4 |
| gemma-4-12b | local | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 1/4 | attachment_index_error, document_index_error, final_document_set_error, format_failure, manifest_error, no_output, normalized_text_error, payment_reconciliation_error, proof_txt_error, required_artifact_missing | 1/16 (6%) | best of 4 |
| nemotron-3-nano-omni-30b-a3b-reasoning:free | api cheap | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 0/4 | attachment_index_error, document_index_error, final_document_set_error, manifest_error, no_output, normalized_text_error, payment_reconciliation_error, proof_txt_error, required_artifact_missing | 0/16 (0%) | best of 4 |
| ministral-3-3b | local | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 0/4 | attachment_index_error, document_index_error, final_document_set_error, manifest_error, no_output, normalized_text_error, payment_reconciliation_error, proof_txt_error, required_artifact_missing | 0/16 (0%) | best of 4 |

OCR text extracts

Paperwork Text-Only

The same generated invoice cases, but with normalized text extracts instead of image input. This separates bookkeeping logic from document vision.
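To make the separation concrete, one can subtract a model's Practical Score on the scanned Paperwork Trial from its score on this text-only variant. The sketch below uses three score pairs copied from the two leaderboards on this page; the helper name `text_minus_scan` is hypothetical, and with only five cases per benchmark (and one of these runs reported as best of 2) the gaps are indicative at most.

```python
# Practical Scores (percent) copied from the two leaderboards on this page.
scanned = {
    "qwen3.6-27b": 70.0,
    "gemma-4-26b-a4b": 80.0,
    "OpenAI GPT-5.5 (Codex CLI)": 70.0,
}
text_only = {
    "qwen3.6-27b": 90.0,
    "gemma-4-26b-a4b": 70.0,
    "OpenAI GPT-5.5 (Codex CLI)": 60.0,
}


def text_minus_scan(scan: dict[str, float], text: dict[str, float]) -> dict[str, float]:
    """Text-only score minus scanned score for models present in both tables."""
    shared = scan.keys() & text.keys()
    return {model: text[model] - scan[model] for model in sorted(shared)}


for model, gap in text_minus_scan(scanned, text_only).items():
    print(f"{model}: {gap:+.1f} points")
```

A positive gap suggests the model does better once the document-vision step is removed; a negative gap suggests image reading was not the limiting factor in those runs.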

27 complete runs · 27 models
no image input · same hidden oracles · OCR-normalized documents · text-only model friendly

Proof model: visible checks are available during the run, but the final score is decided after finish by protected-file checks plus hidden oracles.

The leaderboard ranks this benchmark by Practical Score: half resolved pass@1, half core-oracle pass. Common checks and runner-specific workflow checks remain visible as diagnostics.

| Model | Type | Practical | Resolved | Near miss | Core | Visible | Failure types | Common checks | Run |
|---|---|---|---|---|---|---|---|---|---|
| qwen3.6-27b-mtp | local | 100.0% | 5/5 (100%) | 0/5 | 5/5 (100%) | 5/5 | none | 20/20 (100%) | |
| qwen3.6-27b | local | 90.0% | 4/5 (80%) | 1/5 | 5/5 (100%) | 5/5 | proof_code_error | 19/20 (95%) | best of 2 |
| qwen3.6-35b-a3b | local | 80.0% | 4/5 (80%) | 0/5 | 4/5 (80%) | 4/5 | format_failure, reasoning_no_visible_output, token_limit_exhausted | 16/20 (80%) | |
| gemma-4-26b-a4b | local | 70.0% | 3/5 (60%) | 1/5 | 4/5 (80%) | 5/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 17/20 (85%) | |
| OpenAI GPT-5.5 (Codex CLI) | reference | 60.0% | 2/5 (40%) | 2/5 | 4/5 (80%) | 5/5 | proof_code_error, warning_code_error | 16/20 (80%) | |
| OpenAI GPT-5.4 Mini (Codex CLI) | reference | 60.0% | 3/5 (60%) | 0/5 | 3/5 (60%) | 5/5 | missing_or_wrong_evidence, proof_code_error, warning_code_error | 16/20 (80%) | |
| gemma-4-e2b | local | 60.0% | 3/5 (60%) | 0/5 | 3/5 (60%) | 4/5 | format_failure, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, reasoning_no_visible_output, token_limit_exhausted, total_calculation_error, warning_code_error | 14/20 (70%) | |
| gemini-3.5-flash | local | 50.0% | 2/5 (40%) | 1/5 | 3/5 (60%) | 5/5 | invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 15/20 (75%) | |
| gemma-4-e4b | local | 50.0% | 0/5 (0%) | 5/5 | 5/5 (100%) | 5/5 | missing_or_wrong_evidence, proof_code_error | 15/20 (75%) | |
| microsoft/phi-4-reasoning-plus | local | 40.0% | 2/5 (40%) | 0/5 | 2/5 (40%) | 2/5 | format_failure, reasoning_no_visible_output, token_limit_exhausted | 8/20 (40%) | |
| Qwen3.7 Max | api cheap | 40.0% | 0/5 (0%) | 4/5 | 4/5 (80%) | 5/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 14/20 (70%) | |
| gpt-oss-20b | local | 30.0% | 0/5 (0%) | 3/5 | 3/5 (60%) | 4/5 | duplicate_risk_missed, invoice_classification_error, invoice_id_format_error, missing_or_wrong_evidence, proof_code_error, warning_code_error | 12/20 (60%) | |
| ollama-gpt-oss-20b | local | 30.0% | 0/5 (0%) | 3/5 | 3/5 (60%) | 4/5 | format_failure, missing_or_wrong_evidence, token_limit_exhausted, warning_code_error | 11/20 (55%) | |
| gemma-4-31b-it | local | 30.0% | 0/5 (0%) | 3/5 | 3/5 (60%) | 4/5 | format_failure, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error | 11/20 (55%) | |
| qwen3.6-flash | api cheap | 20.0% | 0/5 (0%) | 2/5 | 2/5 (40%) | 5/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 12/20 (60%) | |
| Granite 4.1 8B | api cheap | 10.0% | 0/5 (0%) | 1/5 | 1/5 (20%) | 5/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 11/20 (55%) | |
| ollama-mistral-small-24b | local | 10.0% | 0/5 (0%) | 1/5 | 1/5 (20%) | 4/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 10/20 (50%) | |
| ministral-3-14b | local | 10.0% | 0/5 (0%) | 1/5 | 1/5 (20%) | 4/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 10/20 (50%) | |
| mistral-small-3.2 | local | 10.0% | 0/5 (0%) | 1/5 | 1/5 (20%) | 3/5 | duplicate_risk_missed, format_failure, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 8/20 (40%) | |
| Chrome Gemini Nano | browser | 10.0% | 0/5 (0%) | 1/5 | 1/5 (20%) | 3/5 | duplicate_risk_missed, format_failure, ignored_document_id_error, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 7/20 (35%) | |
| Apple Foundation Model | system | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 2/5 | duplicate_risk_missed, format_failure, ignored_document_id_error, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 6/20 (30%) | |
| liquid/lfm2-24b-a2b | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 3/5 | format_failure, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 6/20 (30%) | |
| qwen3-vl-4b | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 1/5 | duplicate_risk_missed, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 6/20 (30%) | |
| qwen3-vl-8b-instruct | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 2/5 | duplicate_risk_missed, format_failure, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 5/20 (25%) | |
| qwen3-14b | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 2/5 | duplicate_risk_missed, format_failure, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 4/20 (20%) | |
| gemma-3n-e4b | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 1/5 | duplicate_risk_missed, format_failure, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 3/20 (15%) | |
| microsoft/phi-4 | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 0/5 | format_failure | 0/20 (0%) | |

Constrained SVG visual

City Plan SVG

A city-plan SVG prompt with roads, blocks, and 3D or isometric buildings. Valid vector output, no Markdown excuses.

31 complete runs · 31 models
valid SVG · city-plan constraints · shareable artifact

Visual sample: this is one constrained SVG prompt, not a statistical benchmark. It is shown as pass/review/fail with checks and the generated artifact for manual inspection.

A pass only means the output met the automated SVG and constraint checks. Visual quality still needs a human look.
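The page does not list the three automated checks, so the sketch below is a guess at what checks of this kind could look like: no Markdown fence in the output, well-formed XML, and an `svg` root element. The check names and the `check_svg` function are hypothetical and are not the benchmark's real checks.

```python
import xml.etree.ElementTree as ET

SVG_NAMESPACE = "{http://www.w3.org/2000/svg}"
FENCE = "`" * 3  # Markdown code-fence marker, built here to keep this example fence-safe


def check_svg(output: str) -> dict[str, bool]:
    """Run three illustrative checks on a model's raw output."""
    text = output.strip()
    checks = {
        "no_markdown_fence": FENCE not in text,
        "parses_as_xml": False,
        "root_is_svg": False,
    }
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return checks  # not well-formed XML, so the root check cannot pass
    checks["parses_as_xml"] = True
    # Accept both a namespaced <svg> root and a bare one.
    checks["root_is_svg"] = root.tag in (SVG_NAMESPACE + "svg", "svg")
    return checks


sample = '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10"><rect width="10" height="10"/></svg>'
result = check_svg(sample)
print(f"{sum(result.values())}/{len(result)} checks")  # 3/3 checks
```

Checks like these confirm structure only. They cannot tell whether the roads, blocks, and buildings look like a city plan, which is why the artifact is still shown for manual inspection.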

| # | Model | Type | Result | Checks | SVG output | Run |
|---|---|---|---|---|---|---|
| 1 | gemini-3.5-flash | local | pass | 3/3 | 1 SVG | |
| 2 | gpt-oss-20b | local | pass | 3/3 | 1 SVG | |
| 3 | OpenAI GPT-5.4 Mini (Codex CLI) | reference | pass | 3/3 | 1 SVG | |
| 4 | qwen3.6-flash | api cheap | pass | 3/3 | 1 SVG | |
| 5 | poolside/laguna-xs.2:free | api cheap | pass | 3/3 | 1 SVG | |
| 6 | qwen3-14b | local | pass | 3/3 | 1 SVG | |
| 7 | gemini-3.1-flash-lite | api cheap | pass | 3/3 | 1 SVG | |
| 8 | gemini-2.5-flash | api cheap | pass | 3/3 | 1 SVG | |
| 9 | nemotron-3-super-120b-a12b:free | api cheap | pass | 3/3 | 1 SVG | |
| 10 | qwen3-vl-32b-instruct | api cheap | pass | 3/3 | 1 SVG | |
| 11 | gemma-4-31b-it | local | pass | 3/3 | 1 SVG | |
| 12 | OpenAI GPT-5.5 (Codex CLI) | reference | pass | 3/3 | 1 SVG | |
| 13 | gpt-oss-20b:free | api cheap | pass | 3/3 | 1 SVG | |
| 14 | gemma-4-e4b | local | pass | 3/3 | 1 SVG | |
| 15 | Mistral Small 4 | api cheap | review | 2/3 | 1 SVG | |
| 16 | Chrome Gemini Nano | browser | review | 2/3 | 1 SVG | |
| 17 | liquid/lfm2-24b-a2b | local | review | 2/3 | 1 SVG | |
| 18 | ministral-3-14b | local | review | 2/3 | 1 SVG | |
| 19 | mistral-small-3.2 | local | review | 2/3 | 1 SVG | |
| 20 | ministral-3-3b | local | review | 2/3 | 1 SVG | |
| 21 | gemma-4-e2b | local | review | 2/3 | 1 SVG | |
| 22 | Seed 2.0 Mini | api cheap | fail | 2/3 | No SVG output | |
| 23 | Qwen3.7 Max | api cheap | fail | 2/3 | No SVG output | |
| 24 | Granite 4.1 8B | api cheap | fail | 2/3 | No SVG output | |
| 25 | microsoft/phi-4-reasoning-plus | local | fail | 2/3 | No SVG output | |
| 26 | qwen3.6-35b-a3b | local | fail | 2/3 | No SVG output | best of 2 |
| 27 | gemma-4-12b | local | fail | 1/3 | No SVG output | |
| 28 | Qwen3 VL 30B A3B | api cheap | fail | 1/3 | No SVG output | |
| 29 | qwen3.6-27b | local | fail | 1/3 | No SVG output | best of 2 |
| 30 | gemma-4-26b-a4b | local | fail | 1/3 | No SVG output | best of 2 |
| 31 | opencode/minimax-m3-free | api cheap | fail | 0/3 | No SVG output | |