field note

MiniMax M3 looks broken in long agent loops

MiniMax M3 can score well in bounded benchmark work and still break in long autonomous agent loops. This field note is about that second failure mode: many files, generated assets, validation, and a final report that never cleanly arrives.

MiniMax M3, MiniMax M2.7 benchmark note infographic
field note Practical score
broken Resolved
role split Core pass
n/a Visual sample

This is a field note from testing MiniMax M3 in real local agent workflows, not the same thing as the public paperwork leaderboard row.

The short version is uncomfortable: M3 is not generally useless, and it can do well in bounded benchmark runs. But it currently looks broken for productive long-running agent work.

It can produce good writing, review code, and handle bounded diagnostic tasks. The failure mode appears when it is asked to run as an autonomous worker across many files, assets, validation steps, and a final report.

Why this does not contradict the leaderboard

Local Model Bench has a separate MiniMax M3 Free run log where the provider-routed model leads the current Practical Score. That result is real and should stay visible.

This note is about a different work shape: long autonomous OpenCode/MiniMax-Code-style sessions with many generated files, images, validators, and a final report. In that shape, the model repeatedly did most of the work and then failed to close.

So the useful readout is not "M3 is good" or "M3 is bad". It is narrower: M3 can be strong in bounded paperwork benchmark cases, while still being unsafe as a long-running autonomous builder.

What worked

M3 completed short server/SRE-style diagnosis tasks.

It handled a large-context multi-turn probe up to roughly 135k input tokens.

In an article-quality probe, M3 produced the cleaner reporter-style article compared with M2.7-highspeed. M2.7 was faster, but invented unnamed quotes, which is a serious problem for journalism-style work.

Where it broke

The breakage showed up in long autonomous local work.

In heavy site/content generation, M3 scaffolded or produced partial artifacts and then stalled before the dataset, final assets, validator, or report existed.

In a 30-article workflow, M2.7-highspeed completed all 30 articles plus validator/report in about 671 seconds. M3 timed out at 900 seconds with 23 markdown files and no validator/report.

In a five-article fair test, M3 passed without images in about 332 seconds. Add locally generated SVG images and it timed out at 600 seconds: five articles and four of five SVGs existed, but no validator/report/contact sheet.

The diagnosis

This does not look like a basic intelligence failure. It looks like a tool-loop, reasoning-loop, or finish-condition failure.

M3 often completes most of the real work, then fails to terminate cleanly or fails to produce the final validator/report.

Images and assets make the problem worse, but they are not the only cause. The common trigger is long serial work with many local file operations and a finalization step.

The practical workaround

Use M2.7-highspeed as the builder: bulk writing, repetitive edits, many files, boilerplate, and mechanical implementation.

Use M3 as the planner, reviewer, editor, and final consistency pass.

Do not hand M3 a huge all-in-one prompt with images, many generated files, validation, and a final report. That is exactly where the current behavior becomes unfit for productive use.

Readout

MiniMax M3 is useful as a planner, reviewer, bounded writer, and benchmark paperwork runner. The current concern is long autonomous builder work: there it repeatedly stalls near finalization. The safer workflow is to let faster builder models work in small batches and use M3 for planning, review, and final editorial judgment.