What went wrong
The first City Plan SVG run used Google's OpenAI-compatible Gemini endpoint. The prompt asked for a valid standalone SVG with no Markdown, external images, or JavaScript, showing roads or city blocks, multiple buildings, and some 3D or isometric building shapes.
The model did not simply produce a bad drawing. It returned planning-like fragments and stopped because the completion hit a length limit. A direct diagnostic run showed why: without a native thinking configuration, Gemini spent most of the available budget on hidden thoughts instead of the visible SVG artifact.
The diagnostic
A direct native Gemini API call made the issue visible. With no thinking configuration, the response used more than eleven thousand thought tokens and produced an incomplete visible artifact. With `thinkingBudget: 512`, it still ran out of output budget before closing the SVG.
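One way to make this kind of failure visible is to read the token accounting on the response. The sketch below assumes the native response JSON carries `finishReason` on the first candidate and `thoughtsTokenCount` / `candidatesTokenCount` under `usageMetadata`. The helper name `summarize_usage` is hypothetical, and the sample values are invented for illustration, not taken from the actual run.

```python
def summarize_usage(response: dict) -> dict:
    """Pull out the fields that reveal a thinking-budget truncation.

    Field names are assumed from the native Gemini response shape;
    missing fields fall back to neutral defaults.
    """
    candidates = response.get("candidates") or [{}]
    usage = response.get("usageMetadata", {})
    finish_reason = candidates[0].get("finishReason", "UNKNOWN")
    thought_tokens = usage.get("thoughtsTokenCount", 0)
    visible_tokens = usage.get("candidatesTokenCount", 0)
    return {
        "finish_reason": finish_reason,
        "thought_tokens": thought_tokens,
        "visible_tokens": visible_tokens,
        # The completion stopped because it hit the length limit.
        "truncated": finish_reason == "MAX_TOKENS",
        # Hidden thoughts used more tokens than the visible output.
        "thoughts_dominate": thought_tokens > visible_tokens,
    }


# Invented response for illustration only (not data from the real run).
sample = {
    "candidates": [{"finishReason": "MAX_TOKENS"}],
    "usageMetadata": {"thoughtsTokenCount": 900, "candidatesTokenCount": 100},
}
print(summarize_usage(sample))
```

A run that is both `truncated` and `thoughts_dominate` is a candidate for a runner problem rather than a model problem.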
With `thinkingBudget: 0`, the same model produced a complete SVG: it started with `<svg`, ended with `</svg>`, parsed as XML, avoided scripts and external images, and included roads, blocks, polygons, paths, and building shapes.
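The structural checks described here can be approximated with the Python standard library. This is a minimal sketch, not the benchmark's actual checker: the function name `check_svg` and the check names are hypothetical, and it takes the stricter shortcut of rejecting any `<image>` element rather than inspecting where it points.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop an XML namespace prefix such as '{http://www.w3.org/2000/svg}'."""
    return tag.rsplit("}", 1)[-1]


def check_svg(text: str) -> dict:
    stripped = text.strip()
    checks = {
        "starts_with_svg": stripped.startswith("<svg"),
        "ends_with_svg": stripped.endswith("</svg>"),
        "parses_as_xml": False,
        "no_scripts": False,
        "no_images": False,
    }
    try:
        root = ET.fromstring(stripped)
    except ET.ParseError:
        # Checks that need a parsed tree stay False when parsing fails.
        return checks
    checks["parses_as_xml"] = True
    names = [local_name(el.tag) for el in root.iter()]
    checks["no_scripts"] = "script" not in names
    # Simplification: any <image> element fails, external or not.
    checks["no_images"] = "image" not in names
    return checks


complete = ('<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10">'
            '<polygon points="0,0 10,0 5,5"/></svg>')
truncated = '<svg xmlns="http://www.w3.org/2000/svg"><polygon points="0,0'

print(check_svg(complete))
print(check_svg(truncated))
```

A truncated completion like the second sample fails the closing-tag and parse checks even though it begins correctly, which matches the failure pattern seen before the thinking budget was set to zero.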
- OpenAI-compatible path: unfair SVG artifact run
- Native Gemini path: explicit thinking budget control
- `thinkingBudget: 0`: complete visible SVG
- Final native run: 3/3 City Plan SVG checks
Why this matters
Artifact benchmarks are especially sensitive to output-budget behavior. If the task is to write SVG, JSON, CSV, code, or a final file, hidden reasoning can become a liability when it consumes the budget needed for the artifact itself.
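The arithmetic is simple once written out. The sketch below assumes hidden thought tokens and visible output tokens draw from a single output cap, which is the behavior the diagnostic suggests. The function name and the numbers are illustrative only.

```python
def visible_budget(max_output_tokens: int, thought_tokens: int) -> int:
    """Tokens left for the visible artifact under a shared output cap."""
    return max(0, max_output_tokens - thought_tokens)


# Illustrative numbers, not measurements from the benchmark.
print(visible_budget(4096, 4000))  # almost nothing left for the SVG
print(visible_budget(4096, 0))     # the whole cap is available for the SVG
```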
This is not an argument against reasoning models. It is an argument for matching the runner to the task. For a proof-heavy reasoning task, a thinking budget may help. For constrained artifact generation, the benchmark needs the final artifact more than it needs private deliberation.
What changed
Local Model Bench now has a separate native Gemini SVG runner. It calls `models/:generateContent` directly and sets `thinkingConfig` explicitly. For the City Plan SVG sanity check, the fair setting is `thinkingBudget: 0`.
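A request of this shape might look like the following sketch. It only builds the URL and JSON body and sends nothing. The base URL, the `v1beta` version, the `maxOutputTokens` value, and the `MODEL_ID` placeholder are assumptions for illustration, not the runner's actual settings; the function name is hypothetical.

```python
import json

# Assumed base URL for the native Gemini API; adjust to your setup.
BASE_URL = "https://generativelanguage.googleapis.com/v1beta"


def build_generate_content_request(model: str, prompt: str,
                                   thinking_budget: int = 0,
                                   max_output_tokens: int = 8192):
    """Return (url, body) for a native generateContent call."""
    url = f"{BASE_URL}/models/{model}:generateContent"
    body = {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            # Illustrative cap, not the benchmark's real value.
            "maxOutputTokens": max_output_tokens,
            # Explicit thinking control; 0 disables hidden thinking.
            "thinkingConfig": {"thinkingBudget": thinking_budget},
        },
    }
    return url, body


# "MODEL_ID" is a placeholder, not a real model identifier.
url, body = build_generate_content_request(
    "MODEL_ID", "Draw a city plan as a standalone SVG."
)
print(url)
print(json.dumps(body, indent=2))
```

Keeping the thinking budget as an explicit parameter means the run settings can be recorded next to the generated artifact.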
The earlier Gemini 3.5 Flash SVG failure from the OpenAI-compatible path was removed from the public data. The corrected native run remains visible, including the generated SVG and the run settings.
Practical readout
Gemini 3.5 Flash can generate the City Plan SVG under fair settings. That does not change its paperwork result: it still resolved three of five generated-image invoice cases strictly and failed two harder cases. But it does change how the SVG result should be read.
The useful lesson is boring and important: benchmark infrastructure is part of the benchmark. If the runner hides, truncates, or misallocates output budget, the score can become a runner score instead of a model score.
Text-only follow-up
The same runner issue also matters for text-only artifact tasks. Local Model Bench added a native Gemini text-only runner with `thinkingBudget: 0`, then reran the Paperwork Text-Only diagnostic against normalized document extracts instead of images.
Gemini 3.5 Flash finished the text-only suite with 15/20 checks, 2/5 strict resolved cases, and 3/5 core passes. That is a useful diagnostic split: the model handled some bookkeeping facts cleanly, but exact closure still failed on the harder cases, mostly through calculation and proof-code errors.
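To show how one run can yield three different tallies, here is a small sketch. It assumes a "strict resolved" case passes every check and a "core pass" case passes a designated subset. Those definitions, the check names, and the per-case results are all invented for illustration and are not the benchmark's real records.

```python
def tally(cases: dict, core_checks: set) -> dict:
    """Summarize per-case boolean check results into three scores."""
    results = list(cases.values())
    total_checks = sum(len(r) for r in results)
    passed_checks = sum(sum(r.values()) for r in results)
    # Assumed definition: strict means every check in the case passed.
    strict = sum(all(r.values()) for r in results)
    # Assumed definition: core means every designated core check passed.
    core = sum(all(r[name] for name in core_checks) for r in results)
    return {
        "checks": f"{passed_checks}/{total_checks}",
        "strict_resolved": f"{strict}/{len(results)}",
        "core_passes": f"{core}/{len(results)}",
    }


# Hypothetical check names and invented results.
cases = {
    "case_a": {"vendor": True, "total": True, "calculation": True, "proof_code": True},
    "case_b": {"vendor": True, "total": True, "calculation": False, "proof_code": False},
    "case_c": {"vendor": True, "total": False, "calculation": False, "proof_code": True},
}
print(tally(cases, core_checks={"vendor", "total"}))
```

In this toy data, `case_b` gets the core facts right but fails exact closure, so it counts as a core pass without being strictly resolved.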