Local Model Bench

Can local models survive private desktop work?

The practical benchmark for local LLMs: synthetic paperwork, messy local folders, hidden oracles, visible outputs, and one constrained City Plan SVG sanity check.

88.9% Top practical
72.2% Top local model
9 Scored cases
31 City SVG runs
Infographic explaining the Local Model Bench workflow from synthetic documents to model artifacts and resolved, core pass, or fail outcomes
scored · The Paperwork Trial

Synthetic invoice PNG scans plus bank exports, vendor records, purchase orders, and exact audit-result oracles.

23 runs · 23 models
scored · Paperwork Workflow

Synthetic messy intake and email-attachment workflows with generated scans, protected sources, normalized artifacts, payment remapping, and hidden oracles.

23 runs · 23 models
diagnostic · Paperwork Text-Only

The same generated invoice cases, but with normalized text extracts instead of image input. This separates bookkeeping logic from document vision.

27 runs · shown as leaderboard mode
visual sample · City Plan SVG

A city-plan SVG prompt with roads, blocks, and 3D or isometric buildings. Valid vector output, no Markdown excuses.

31 runs · not part of overall score

City Plan SVG Outputs

A small visual sanity check: standalone SVG only, city blocks, roads, and 3D or isometric buildings.
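The "standalone SVG only" constraint can be illustrated with a minimal sketch. This is not the benchmark's actual checker: the function name `is_standalone_svg` and the specific rules (no surrounding prose or Markdown fences, well-formed XML, an `svg` root element) are assumptions chosen for illustration, and the sketch does not attempt to judge roads, blocks, or buildings.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def is_standalone_svg(output: str) -> bool:
    """Hypothetical check: True if the output is a single well-formed SVG document."""
    text = output.strip()
    # Anything before the SVG (prose, Markdown fences) fails the check.
    if not (text.startswith("<svg") or text.startswith("<?xml")):
        return False
    # Anything after the closing tag fails as well.
    if not text.endswith("</svg>"):
        return False
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Not well-formed XML, so not valid vector output.
        return False
    # Accept the root with or without the SVG namespace declared.
    return root.tag in (f"{{{SVG_NS}}}svg", "svg")


good = '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10"><rect width="10" height="10"/></svg>'
print(is_standalone_svg(good))
print(is_standalone_svg("```svg\n" + good + "\n```"))
```

Under these assumptions the first call passes and the fenced variant fails, which mirrors the "no Markdown excuses" rule.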


Latest Model Notes

Test reports that compare model positioning with observed benchmark behavior.

Gemma 4 12B Unified via LM Studio note infographic
run log · Gemma 4 12B did not close

10% practical · 0/5 resolved · 1/5 core

MiniMax M3, MiniMax M2.7 note infographic
field note · MiniMax M3: long-loop failure

MiniMax M3 agent-loop failure · M2.7 builder / M3 reviewer workaround

MiniMax M3 Free via OpenCode Zen note infographic
run log · MiniMax M3 Free leads paperwork

88.9% practical · 8/9 resolved · 8/9 core

Qwen3.6 27B MTP note infographic
runtime note · Qwen3.6 MTP: speed versus artifacts

Methodology guide and benchmark context

Overall Leaderboard

Top public comparison rows. Practical Score = 50% resolved cases + 50% core passes across the current v1 paperwork suite. Local LM Studio runs were executed on a Mac mini M4 with 64 GB unified memory.
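As a worked example of the Practical Score formula, here is a minimal sketch. The function name `practical_score` and the rounding to one decimal place are assumptions for illustration; only the 50/50 weighting comes from the definition above.

```python
from fractions import Fraction


def practical_score(resolved: int, core: int, cases: int) -> float:
    """Practical Score as a percentage: 50% resolved rate + 50% core-pass rate."""
    if cases <= 0:
        raise ValueError("cases must be positive")
    half = Fraction(1, 2)
    # Equal weighting of the two pass rates, computed exactly before rounding.
    score = half * Fraction(resolved, cases) + half * Fraction(core, cases)
    return round(float(score * 100), 1)


print(practical_score(8, 8, 9))  # 88.9, the top practical score
print(practical_score(5, 8, 9))  # 72.2, the top local model
print(practical_score(0, 1, 5))  # 10.0, the Gemma 4 12B run log (5 cases)
```

The three calls reproduce figures published on this page, so the sketch agrees with the stated weighting for those rows.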

Generated scans and messy workflow folders. This is the main public comparison score.

10 of 23 shown
OK · near miss / core pass · fail


Rank  Model  Type       Practical  Resolved  Core  Tried  Case Matrix
1            api cheap  88.9%      8/9       8/9   9/9
2            reference  83.3%      7/9       8/9   9/9
3            reference  77.8%      7/9       7/9   9/9
4            local      72.2%      5/9       8/9   9/9
5            local      61.1%      4/9       7/9   9/9
6            local      38.9%      1/9       6/9   9/9
7            api cheap  33.3%      0/9       6/9   9/9
8            local      27.8%      2/9       3/9   9/9
9            local      27.8%      0/9       5/9   9/9
10           api cheap  27.8%      0/9       5/9   9/9
11           api cheap  27.8%      0/9       5/9   9/9
12           api cheap  27.8%      0/9       5/9   9/9
13           api cheap  22.2%      0/9       4/9   9/9
14           api cheap  22.2%      0/9       4/9   9/9
15           api cheap  22.2%      0/9       4/9   9/9
16           local      16.7%      0/9       3/9   9/9
17           local      16.7%      0/9       3/9   9/9
18           local      5.6%       0/9       1/9   9/9
19           local      0.0%       0/9       0/9   9/9
20           local      0.0%       0/9       0/9   9/9
21           local      0.0%       0/9       0/9   9/9
22           api cheap  0.0%       0/9       0/9   9/9
23           local      0.0%       0/9       0/9   9/9