run log

New model candidates, same paperwork problem

Three newer candidates were added to Local Model Bench. The useful signal was not that every model failed. It was where they failed: proof codes, evidence paths, workflow closure, and provider constraints that should not be confused with model capability.

[Infographic: benchmark note for Mistral Small 4, Qwen3.7 Max, and Granite 4.1 8B. Panels: Practical score (mixed); Resolved (Mistral full run); Core pass (Qwen stronger than strict score); Visual sample (SVG remained strict).]

I added three new candidates to Local Model Bench: Mistral Small 4, Qwen3.7 Max, and Granite 4.1 8B.

The results were a useful reminder of what this benchmark is trying to measure.

The hard part is not sounding competent about paperwork. The hard part is closing the folder: selecting the right files, writing the right artifacts, citing evidence that exists, and passing the hidden oracle at the end.
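To make "citing evidence that exists" concrete, here is a minimal sketch of the kind of check a grader could run. It is an illustration, not the benchmark's actual oracle: the function name, the idea that evidence is cited as paths relative to the case folder, and the example file names are all assumptions.

```python
from pathlib import Path


def missing_evidence(case_dir: Path, cited: list[str]) -> list[str]:
    """Return the cited evidence paths that do not resolve to a file in the case folder.

    Hypothetical helper: assumes evidence is cited as paths relative to case_dir.
    """
    root = case_dir.resolve()
    missing = []
    for rel in cited:
        target = (root / rel).resolve()
        # Reject paths that escape the case folder as well as paths that do not exist.
        if root not in target.parents or not target.is_file():
            missing.append(rel)
    return missing
```

An empty return value means every citation points at a real file inside the folder. Anything else is a citation that sounds plausible but cannot be checked.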

What was tested

Mistral Small 4 was run through the current practical suite: generated invoice images, agentic paperwork workflow cases, and the City Plan SVG sanity check.

Qwen3.7 Max and Granite 4.1 8B were tested only in the Paperwork Text-Only diagnostic and the City Plan SVG check. They were not given the full vision workflow footprint that the complete local candidates ran, so their results cover a narrower slice of the suite.

Two free OpenRouter candidates were also probed but not added as scored model failures. Kimi K2.6 free hit a provider image-count limit, and DeepSeek V4 Flash free hit upstream rate limits. Those are useful operational notes, not fair benchmark outcomes.

Mistral Small 4 was not a clean win

Mistral Small 4 completed the full current practical footprint, but the result was not a strong launch.

In the Paperwork Trial it produced near misses: enough structure to look close, but not enough closure to resolve the cases strictly. In the workflow cases it struggled with artifact placement, manifests, proof files, and hidden-oracle details.

That is exactly the kind of failure this benchmark is designed to expose. The model can be useful in parts of the task while still being unreliable as an unattended desktop worker.

Qwen3.7 Max looked competent but unfinished

Qwen3.7 Max was more interesting in the Text-Only diagnostic.

It often got the core bookkeeping structure right, but strict passes were blocked by proof-code and evidence issues. That distinction matters. A model that understands the case but fails the final output contract is not useless, but it is not resolved either.

For practical automation, core understanding is only half the story. The file still has to close correctly.
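The gap between a core pass and a strict pass can be sketched as a two-stage grade. This is a hedged illustration only: the field names and the three labels are assumptions chosen to mirror the distinction described above, not the benchmark's real scoring code.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    # All field names are hypothetical; the real oracle's checks are not shown in this post.
    core_ok: bool        # bookkeeping structure matches the expected answer
    proof_code_ok: bool  # proof code matches what the oracle expects
    evidence_ok: bool    # every cited evidence path points at a real file


def grade(result: CaseResult) -> str:
    """Return 'resolved', 'core pass', or 'fail' for a single case."""
    if not result.core_ok:
        return "fail"
    if result.proof_code_ok and result.evidence_ok:
        return "resolved"
    # The model understood the case but did not meet the final output contract.
    return "core pass"
```

Under this sketch, a run with correct bookkeeping and a wrong proof code grades as "core pass": better than a failure, but not counted as resolved.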

Granite 4.1 8B did not stand out here

Granite 4.1 8B was added as a smaller, cheap API candidate.

In this benchmark shape, it did not produce a strong signal. The failures were not exotic. They were the usual practical ones: missing or weak evidence, proof-code problems, classification misses, and calculation details.

That does not make it a bad model in every use case. It means this particular workflow is not where it looked competitive.

Why provider failures stay out of the leaderboard

There is a difference between a model failing the case and a provider failing the run.

If a provider cannot accept enough images for the prompt, or a free endpoint is rate-limited before the task runs, that is not a resolved benchmark result. It is operational friction.

Local Model Bench should be blunt about model failures, but it should not pretend infrastructure limits are cognitive failures.
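One way to keep the two kinds of failure apart is to classify each run before it reaches the leaderboard. The sketch below assumes a harness that tags runs with an error string when the provider rejects them; the tag names, the record layout, and the function names are hypothetical.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    RESOLVED = "resolved"
    MODEL_FAILURE = "model_failure"
    PROVIDER_ERROR = "provider_error"


# Hypothetical tags for runs that never reached grading.
PROVIDER_ERRORS = {"image_count_limit", "rate_limited"}


def classify(error_tag: Optional[str], passed: bool) -> Outcome:
    """Separate infrastructure problems from genuine model results."""
    if error_tag in PROVIDER_ERRORS:
        return Outcome.PROVIDER_ERROR
    return Outcome.RESOLVED if passed else Outcome.MODEL_FAILURE


def leaderboard_rows(runs: list) -> list:
    """Keep only runs where the model actually got to attempt the case."""
    return [
        run for run in runs
        if classify(run["error"], run["passed"]) is not Outcome.PROVIDER_ERROR
    ]
```

Provider errors are still worth logging as operational notes. They just do not count for or against the model.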

Practical readout

The new candidates mostly reinforce the existing pattern.

Models are getting good at appearing close. The remaining gap is workflow closure.

For private desktop work, the question is not only whether a model can read, reason, or summarize. It is whether it can finish the artifact, preserve the source folder, cite usable evidence, and pass the hidden check without a human cleaning up behind it.

Readout

The newest candidates did not change the main lesson: practical document work is less about sounding right and more about closing the case. Evidence, proof codes, artifact placement, and hidden oracles still separate near misses from resolved runs.