benchmark note

Gemini 3.5 Flash can draw SVG. The runner was the problem.

A first Gemini 3.5 Flash SVG run looked like a model failure. It was not. The OpenAI-compatible API path let hidden thinking consume the output budget, so the model returned planning fragments instead of a complete SVG. The native Gemini API with thinkingBudget: 0 produced a valid City Plan SVG and passed 3/3 checks. A later native text-only paperwork run landed at 2/5 strict cases, 3/5 core passes, and 15/20 checks.

gemini-3.5-flash benchmark note infographic
80.0% image checks / 75.0% text-only checks Practical score
3/5 image cases / 2/5 text-only cases Resolved
3/5 text-only core passes Core pass
3/3 SVG checks Visual sample

This is a small but important benchmark correction. A failed artifact is not always a failed model. Sometimes the runner is giving the model the wrong operating conditions.

Gemini 3.5 Flash initially appeared to fail the City Plan SVG prompt. The output did not start as a complete SVG and the run scored poorly. That result looked suspicious because even much smaller models can usually produce a basic standalone SVG.

What went wrong

The first City Plan SVG run used Google's OpenAI-compatible Gemini endpoint. The prompt asked for a valid standalone SVG, no Markdown, no external images, no JavaScript, roads or city blocks, multiple buildings, and some 3D or isometric building shapes.

The model did not simply produce a bad drawing. It returned planning-like fragments and stopped because the completion hit a length limit. A direct diagnostic run showed why: without a native thinking configuration, Gemini spent most of the available budget on hidden thoughts instead of the visible SVG artifact.

The diagnostic

A direct native Gemini API call made the issue visible. With no thinking configuration, the response used more than eleven thousand thought tokens and produced an incomplete visible artifact. With `thinkingBudget: 512`, it still ran out before closing the SVG.

With `thinkingBudget: 0`, the same model produced a complete SVG: it started with `<svg`, ended with `</svg>`, parsed as XML, avoided scripts and external images, and included roads, blocks, polygons, paths, and building shapes.

  • OpenAI-compatible path: unfair SVG artifact run
  • Native Gemini path: explicit thinking budget control
  • `thinkingBudget: 0`: complete visible SVG
  • Final native run: 3/3 City Plan SVG checks

Why this matters

Artifact benchmarks are especially sensitive to output-budget behavior. If the task is to write SVG, JSON, CSV, code, or a final file, hidden reasoning can become a liability when it consumes the budget needed for the artifact itself.

This is not an argument against reasoning models. It is an argument for matching the runner to the task. For a proof-heavy reasoning task, a thinking budget may help. For constrained artifact generation, the benchmark needs the final artifact more than it needs private deliberation.

What changed

Local Model Bench now has a separate native Gemini SVG runner. It calls `models/:generateContent` directly and sets `thinkingConfig` explicitly. For the City Plan SVG sanity check, the fair setting is `thinkingBudget: 0`.

The earlier Gemini 3.5 Flash SVG failure from the OpenAI-compatible path was removed from the public data. The corrected native run remains visible, including the generated SVG and the run settings.

Practical readout

Gemini 3.5 Flash can generate the City Plan SVG under fair settings. That does not change its paperwork result: it still resolved three of five generated-image invoice cases strictly and failed two harder cases. But it does change the SVG interpretation.

The useful lesson is boring and important: benchmark infrastructure is part of the benchmark. If the runner hides, truncates, or misallocates output budget, the score can become a runner score instead of a model score.

Text-only follow-up

The same runner issue also matters for text-only artifact tasks. Local Model Bench added a native Gemini text-only runner with `thinkingBudget: 0`, then reran the Paperwork Text-Only diagnostic against normalized document extracts instead of images.

Gemini 3.5 Flash finished the text-only suite with 15/20 checks, 2/5 strict resolved cases, and 3/5 core passes. That is a useful diagnostic split: the model handled some bookkeeping facts cleanly, but exact closure still failed on the harder cases, mostly through calculation and proof-code errors.

visual sample

Gemini 3.5 Flash City Plan SVG

Native Gemini API run with thinkingBudget set to 0. The same model passed the City Plan SVG checks once hidden thinking stopped consuming the output budget.

gemini-3.5-flash City Plan SVG output

Model Context

Model
gemini-3.5-flash
Provider path
Google Gemini native API
Corrected setting
thinkingBudget: 0
SVG run
City Plan SVG, 3/3 checks
Paperwork image run
3/5 strict, 16/20 checks
Paperwork text-only run
2/5 strict, 3/5 core, 15/20 checks

Positioned As

  • This is a benchmark correction, not a model hype note.
  • The original SVG fail came from an unfair runner/API configuration.
  • Gemini SVG runs should use the native Gemini API when thinking budget matters.

What We Actually Tested

  • The initial OpenAI-compatible run failed to produce a complete SVG artifact.
  • A native Gemini diagnostic showed hidden thought tokens consuming the output budget.
  • `thinkingBudget: 0` produced a parseable standalone City Plan SVG.
  • The corrected SVG run passed all three automated SVG checks.
  • The native text-only paperwork run completed with 15/20 checks and 3/5 core passes.
  • The unfair run was removed from the public data.

What Worked

  • Produces a valid City Plan SVG when hidden thinking is disabled.
  • Handles the simpler generated-image paperwork cases cleanly.
  • The native API exposes enough control to make artifact generation fair.

Where It Broke

  • Failed two harder generated-image paperwork cases.
  • Failed exact closure on two harder text-only cases and one proof-code near miss.
  • OpenAI-compatible Gemini path can be misleading for artifact benchmarks.
  • Pro models were not retested because the current account hit quota limits.

Readout

The corrected result is simple: Gemini 3.5 Flash can draw the SVG, and it can complete the text-only paperwork diagnostic under the native runner. The model still misses exact closure on harder cases, but the earlier SVG failure was a runner problem, not a fair model result.