Model Notes

What the scores hide.

Short notes for selected runs. These are not universal model reviews, just observed failure patterns on this benchmark.

How to read these

Each note separates strict resolution from core task understanding. A model can understand the documents and still fail the workflow contract.

Local runs and reference/API runs are marked separately.
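The practical percentages in the cards below are consistent with combining the two counts over 9 cases each, i.e. (resolved + core) / 18. A minimal sketch of that arithmetic, assuming this inferred formula (it matches every listed score but is not stated anywhere in the source):

```python
def practical_score(resolved: int, core: int, cases: int = 9) -> float:
    """Combine strict-resolution and core-understanding counts into one
    percentage. Inferred from the published numbers: each of the `cases`
    public cases contributes one resolution point and one core point."""
    return round(100 * (resolved + core) / (2 * cases), 1)
```

For example, 4/9 resolved and 7/9 core gives `practical_score(4, 7)` = 61.1, matching the Gemma 4 26B A4B card.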

qwen3-vl-32b-instruct · reference

Qwen3 VL 32B: good reading, weak closure

A paid OpenRouter vision reference run that read many document facts correctly, then repeatedly lost the benchmark at proof codes, duplicate-risk logic, and workflow closure.

22.2% practical · 0/9 resolved · 4/9 core
gemma-4-26b-a4b · local

Gemma 4 26B A4B: the strongest local baseline so far

The best local LM Studio result in the current public table: not perfect, but unusually solid across both the scanned-invoice cases and the agentic paperwork folders.

61.1% practical · 4/9 resolved · 7/9 core
gemma-4-31b-it · local

Gemma 4 31B: bigger did not mean cleaner

The 31B run scores worse than expected. The interesting signal is not that the model is useless, but that strict workflow closure punished it hard despite decent core understanding.

27.8% practical · 0/9 resolved · 5/9 core
codex-default · reference

Codex reference: strong workflow closure

Codex is not a local LM Studio run. It is kept as a reference line showing what stronger agentic tooling achieves on the same public cases.

83.3% practical · 7/9 resolved · 8/9 core