run log Jun 1, 2026

MiniMax M3 Free leads the paperwork benchmark, with one ugly SVG caveat

MiniMax M3 Free holds the top Practical Score on Local Model Bench: 8/9 resolved, or 88.9%, across the current paperwork suite. That does not make it a universal agent winner. It is a provider-routed benchmark result, and it still produced no parseable City Plan SVG before a 600-second timeout.

Practical score: 88.9%
Resolved: 8/9
Core pass: 8/9
Visual sample: City Plan SVG failed

MiniMax M3 Free finally showed up in the OpenCode Zen model list.

That detail matters. The model is not currently available to this setup as an OpenRouter `:free` endpoint. `minimax/minimax-m3` exists on OpenRouter, but `minimax/minimax-m3:free` returned no endpoint. The free run here used `opencode/minimax-m3-free`.

The result is no longer just interesting. It is the current top practical paperwork/workflow result on the site. The catch is equally important: this is a provider-routed `api cheap` row, not a local Mac mini model, and it failed completely on the constrained SVG sanity check.

Why this does not contradict the long-loop field note

There is a second MiniMax M3 note on the site that calls out long autonomous agent-loop failures. That note should not be read as a correction of this benchmark result.

The distinction is the test shape. Here, M3 ran the defined Local Model Bench paperwork cases and closed eight of nine practical cases. In the field note, M3 was used as a broader autonomous builder across many generated files, assets, validators, and reports.

Those are related but not identical capabilities. The clean statement is: M3 Free currently leads this paperwork benchmark, while separate long-loop probes make it look risky for open-ended productive agent work.

What ran

The model was first smoke-tested through OpenCode Zen. It returned the expected short answer and reported zero cost in the OpenCode trace.

Then it ran the five generated-image Paperwork Trial cases. Those cases use synthetic invoice scans, bank exports, vendor records, purchase orders, previous-invoice traps, and an exact `audit_result.json` oracle.

After that it ran the four agentic Paperwork Workflow cases. Those require selecting active files from a messy intake folder, preserving protected sources, writing intermediate artifacts, producing `audit_result.json`, and closing with `proof.txt`.
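The two mechanical parts of that requirement, untouched sources and present final artifacts, can be sketched as a hash snapshot taken before the run and compared afterwards. This is a minimal illustration, not the bench's actual harness: the folder names `incoming` and `out`, the helper names `snapshot` and `check_closure`, and the demo file are all assumptions. Only `audit_result.json` and `proof.txt` come from the workflow description.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(folder):
    """Map each file's relative path to the SHA-256 of its contents."""
    return {
        str(p.relative_to(folder)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def check_closure(before, incoming, out,
                  required=("audit_result.json", "proof.txt")):
    """Hypothetical closure check: sources untouched, final artifacts present."""
    return {
        "sources_unchanged": snapshot(incoming) == before,
        "artifacts_present": all((out / name).is_file() for name in required),
    }


# Demo on a throwaway folder layout (names are assumptions, not the bench's).
with tempfile.TemporaryDirectory() as tmp:
    incoming, out = Path(tmp, "incoming"), Path(tmp, "out")
    incoming.mkdir()
    out.mkdir()
    (incoming / "invoice_001.txt").write_text("total: 100.00")
    before = snapshot(incoming)

    # A well-behaved run writes only to the output folder.
    (out / "audit_result.json").write_text("{}")
    (out / "proof.txt").write_text("done")
    clean_run = check_closure(before, incoming, out)

    # A run that edits a protected source is caught by the hash comparison.
    (incoming / "invoice_001.txt").write_text("total: 999.00")
    tampered_run = check_closure(before, incoming, out)
```

Hashing contents rather than comparing modification times means a rewrite with identical bytes is not flagged, while any added, deleted, or altered source file is.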

Finally it ran the City Plan SVG prompt: standalone SVG only, roads, blocks, and 3D or isometric buildings.

Paperwork was strong

On the generated-image Paperwork Trial, MiniMax M3 Free resolved four of five cases. The only strict failure was generated case 03.

That failure was not a formatting accident. It counted an old duplicate-risk invoice as approved, which inflated the approved total and broke the proof code. The failure labels were exactly the practical ones: `duplicate_risk_missed`, `invoice_classification_error`, `total_calculation_error`, `warning_code_error`, and `proof_code_error`.
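To see why one misclassified invoice cascades into several labels, here is a toy reconstruction. Everything in it is hypothetical: the invoice IDs and amounts are invented, and the proof code is modeled as a short hash of the approved total purely as a stand-in, since the benchmark's real proof scheme is not described here. Only the label names are taken from the run.

```python
import hashlib

# Invented invoices for illustration; INV-077 plays the old duplicate-risk trap.
INVOICES = [
    {"id": "INV-101", "amount": 1200.00, "duplicate_risk": False},
    {"id": "INV-102", "amount": 450.00, "duplicate_risk": False},
    {"id": "INV-077", "amount": 450.00, "duplicate_risk": True},
]


def audit(invoices, honor_duplicate_flag=True):
    """Build a toy audit result; the proof-code scheme is an assumption."""
    approved = [
        inv for inv in invoices
        if not (honor_duplicate_flag and inv["duplicate_risk"])
    ]
    total = round(sum(inv["amount"] for inv in approved), 2)
    proof = hashlib.sha256(f"{total:.2f}".encode()).hexdigest()[:8]
    return {
        "approved_ids": [inv["id"] for inv in approved],
        "approved_total": total,
        "proof_code": proof,
    }


def failure_labels(expected, actual):
    """Compare an audit against the oracle and collect mismatch labels."""
    labels = []
    if actual["approved_ids"] != expected["approved_ids"]:
        labels.append("invoice_classification_error")
    if actual["approved_total"] != expected["approved_total"]:
        labels.append("total_calculation_error")
    if actual["proof_code"] != expected["proof_code"]:
        labels.append("proof_code_error")
    return labels


expected = audit(INVOICES)                              # duplicate excluded
actual = audit(INVOICES, honor_duplicate_flag=False)    # duplicate approved
labels = failure_labels(expected, actual)
```

In this sketch the single classification mistake moves the approved total from 1650.00 to 2100.00, and anything derived from the total fails with it, which matches the shape of the failure described above.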

The workflow side was cleaner. W04, W05, W06, and W07 all passed: the required artifacts existed, the incoming sources were unchanged, the core oracle passed, and the hidden oracle passed.

SVG did not merely look bad. It did not arrive.

The City Plan SVG run timed out after 600 seconds without a parseable SVG artifact.

That is a different failure from a crude drawing. Some models return a valid but visually weak SVG; MiniMax M3 Free did not return the artifact at all in this run.

For the site, that is counted as a visual sample failure: 0/3 automated checks, no SVG artifact, and no image preview.
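The difference between "no parseable SVG" and "a weak SVG" can be made concrete with a parse gate. The sketch below is an assumption about how such a gate might look, not the site's actual checker, and it does not attempt to reproduce the three automated checks. It only asks whether the output is well-formed XML with an `<svg>` root, using Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def is_parseable_svg(text):
    """True if text is well-formed XML whose root element is <svg>."""
    if not text or not text.strip():
        return False  # a timeout with no output lands here
    try:
        root = ET.fromstring(text.strip())
    except ET.ParseError:
        return False  # truncated or malformed markup
    # Accept both namespaced and bare <svg> roots.
    return root.tag in (f"{SVG_NS}svg", "svg")


crude_but_valid = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">'
    '<rect width="10" height="10"/></svg>'
)
truncated = '<svg xmlns="http://www.w3.org/2000/svg"><rect width="10"'
```

A crude drawing such as `crude_but_valid` passes this gate and can then be judged visually. An empty response or a truncated one fails before any visual judgment is possible, which is the category this run fell into.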

Why this is more interesting than a single score

The headline score is high because the main benchmark is practical paperwork, not vector art. Across the current scored paperwork suite, MiniMax M3 Free lands at 8/9 resolved and 8/9 core pass, or an 88.9% Practical Score. That makes it the current overall leader on the main public leaderboard.
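The arithmetic behind the headline is simple enough to check by hand. The sketch below assumes the Practical Score is just resolved cases over total scored paperwork cases, which is consistent with the reported figures, and that the SVG sample sits outside that count. The function name and dictionary keys are illustrative.

```python
from fractions import Fraction


def practical_score(results):
    """Return (resolved, total, percent) from per-suite (resolved, total) pairs."""
    resolved = sum(r for r, _ in results.values())
    total = sum(t for _, t in results.values())
    percent = round(float(Fraction(resolved, total)) * 100, 1)
    return resolved, total, percent


# Counts taken from this run: 4 of 5 Paperwork Trial, 4 of 4 Paperwork Workflow.
run = {"paperwork_trial": (4, 5), "paperwork_workflow": (4, 4)}
resolved, total, percent = practical_score(run)
```

With nine scored cases, each case is worth about 11.1 points, so the single miss on generated case 03 is the entire gap between 88.9% and a perfect score.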

That does not mean it is broadly reliable. It means it was unusually good at this specific work shape: documents, folders, evidence, artifacts, and hidden checks.

The SVG timeout keeps the result honest. Capability is not one blob. A model can be solid at workflow closure and still be unusable for a constrained visual artifact prompt.

Practical readout

If the question is who currently leads the main Practical Score, the answer is MiniMax M3 Free via OpenCode Zen.

If the question is whether this route deserves a benchmark row, yes. It completed the full current paperwork footprint and failed in a clearly documented place.

If the question is whether MiniMax M3 Free should be described as a general local-model winner, no. It is not local, it is provider-routed, and the SVG behavior was a hard failure.

The useful sentence is narrower and stronger: MiniMax M3 Free is the current paperwork/workflow leader on Local Model Bench, while also showing a hard failure on constrained standalone SVG generation.

Readout

MiniMax M3 Free is the current Local Model Bench Practical Score leader, with a very explicit caveat: this is an `api cheap` provider route, not a local Mac mini run. It looked strong as a document-workflow model and completely failed the SVG artifact test.
