Why this does not contradict the long-loop field note
There is a second MiniMax M3 note on the site that calls out long autonomous agent-loop failures. That note should not be read as a correction of this benchmark result.
The distinction is the test shape. Here, M3 ran the defined Local Model Bench paperwork cases and closed eight of nine practical cases. In the field note, M3 was used as a broader autonomous builder across many generated files, assets, validators, and reports.
Those are related but not identical capabilities. The clean statement is: M3 Free currently leads this paperwork benchmark, while separate long-loop probes make it look risky for open-ended productive agent work.
What ran
The model was first smoke-tested through OpenCode Zen. It returned the expected short answer and reported zero cost in the OpenCode trace.
Then it ran the five generated-image Paperwork Trial cases. Those cases use synthetic invoice scans, bank exports, vendor records, purchase orders, previous-invoice traps, and an exact `audit_result.json` oracle.
After that it ran the four agentic Paperwork Workflow cases. Those require selecting active files from a messy intake folder, preserving protected sources, writing intermediate artifacts, producing `audit_result.json`, and closing with `proof.txt`.
Finally it ran the City Plan SVG prompt: standalone SVG only, roads, blocks, and 3D or isometric buildings.
Paperwork was strong
On the generated-image Paperwork Trial, MiniMax M3 Free resolved four of five cases. The only strict failure was generated case 03.
That failure was not a formatting accident. The model counted an old duplicate-risk invoice as approved, which inflated the approved total and broke the proof code. The failure labels were exactly the practical ones: `duplicate_risk_missed`, `invoice_classification_error`, `total_calculation_error`, `warning_code_error`, and `proof_code_error`.
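To make the mechanism concrete, here is a minimal sketch of how one missed duplicate inflates an approved total. The invoice ids, amounts, and the `duplicate_of_previous` field are invented for illustration; they are not the benchmark's real data or schema, and the proof-code step is left out because its derivation is not described here.

```python
from decimal import Decimal

# Hypothetical invoices: ids, amounts, and field names are illustrative only.
invoices = [
    {"id": "INV-A", "amount": Decimal("120.00"), "duplicate_of_previous": False},
    {"id": "INV-B", "amount": Decimal("80.50"), "duplicate_of_previous": False},
    {"id": "INV-C", "amount": Decimal("120.00"), "duplicate_of_previous": True},
]


def approved_total(records, skip_duplicates):
    """Sum invoice amounts, optionally leaving out duplicate-risk records."""
    return sum(
        (
            r["amount"]
            for r in records
            if not (skip_duplicates and r["duplicate_of_previous"])
        ),
        Decimal("0"),
    )


correct = approved_total(invoices, skip_duplicates=True)
inflated = approved_total(invoices, skip_duplicates=False)
print(correct, inflated)  # 200.50 320.50
```

Any value derived from the total would then be wrong as well, which is consistent with a single misclassification producing several failure labels at once.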
The workflow side was cleaner. W04, W05, W06, and W07 all passed: the required artifacts existed, the incoming sources were unchanged, the core oracle passed, and the hidden oracle passed.
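The first two of those checks can be sketched in a few lines. This is an assumed approach, not the benchmark's actual harness: the `snapshot` and `check_run` helpers are hypothetical, and only `audit_result.json` and `proof.txt` are artifact names taken from the case description.

```python
import hashlib
from pathlib import Path

# Only these two artifact names appear in the case description.
REQUIRED_ARTIFACTS = ("audit_result.json", "proof.txt")


def snapshot(folder: Path) -> dict:
    """Map each file's relative path to the SHA-256 of its contents."""
    return {
        str(p.relative_to(folder)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def check_run(before: dict, incoming: Path, output: Path) -> dict:
    """Compare the pre-run snapshot with the folder now and look for outputs."""
    return {
        # Any edited, added, or deleted source file changes the snapshot.
        "sources_unchanged": snapshot(incoming) == before,
        "artifacts_exist": all(
            (output / name).is_file() for name in REQUIRED_ARTIFACTS
        ),
    }
```

Taking the snapshot before the agent starts and comparing afterwards catches edits, deletions, and stray files in the intake folder with the same comparison.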
SVG did not merely look bad. It did not arrive.
The City Plan SVG run timed out after 600 seconds without a parseable SVG artifact.
That is a different failure from a crude drawing. Some models return a valid but visually weak SVG; MiniMax M3 Free did not return the artifact at all in this run.
For the site, that is counted as a visual sample failure: 0/3 automated checks, no SVG artifact, and no image preview.
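The difference between "no parseable SVG" and "a weak SVG" can be drawn with a very small test. The sketch below is one plausible way to decide whether an output counts as an SVG artifact at all; the function name is hypothetical, and it is not a description of the site's three automated checks.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def is_parseable_svg(text: str) -> bool:
    """Return True if text is well-formed XML whose root element is <svg>."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Empty output (for example after a timeout) or truncated markup lands here.
        return False
    # ElementTree reports namespaced tags as "{namespace}svg"; also accept a bare "svg".
    return root.tag in (f"{{{SVG_NS}}}svg", "svg")
```

Under this check, a crude but complete drawing still passes, while an empty or truncated response from a timed-out run does not.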
Why this is more interesting than a single score
The headline score is high because the main benchmark is practical paperwork, not vector art. Across the current scored paperwork suite, MiniMax M3 Free lands at 8/9 resolved and 8/9 core, for a Practical Score of 88.9%. That makes it the current overall leader on the main public leaderboard.
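The arithmetic behind that number is short. The sketch assumes the Practical Score is simply resolved cases divided by scored cases, which matches the reported figures but is not a documented formula; the function name is hypothetical.

```python
def practical_score(resolved: int, total: int) -> float:
    """Percentage of scored cases resolved, rounded to one decimal place."""
    return round(100 * resolved / total, 1)


# (resolved, total) per suite, as reported in this note.
trial = (4, 5)      # generated-image Paperwork Trial
workflow = (4, 4)   # agentic Paperwork Workflow

resolved = trial[0] + workflow[0]
total = trial[1] + workflow[1]
print(resolved, total, practical_score(resolved, total))  # 8 9 88.9
```

The SVG run does not appear in either tuple, which is why a hard visual failure leaves the headline score untouched.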
That does not mean it is broadly reliable. It means it was unusually good at this specific work shape: documents, folders, evidence, artifacts, and hidden checks.
The SVG timeout keeps the result honest. Capability is not one blob. A model can be solid at workflow closure and still be unusable for a constrained visual artifact prompt.
Practical readout
If the question is who currently leads the main Practical Score, the answer is MiniMax M3 Free via OpenCode Zen.
If the question is whether this route deserves a benchmark row, yes. It completed the full current paperwork footprint and failed in a clearly documented place.
If the question is whether MiniMax M3 Free should be described as a general local-model winner, no. It is not local; it is provider-routed, and the SVG behavior was a hard failure.
The useful sentence is narrower and stronger: MiniMax M3 Free is the current paperwork/workflow leader on Local Model Bench, while also showing a hard failure on constrained standalone SVG generation.