api cheap May 17, 2026

Gemini 2.5 Flash: good at reading, sloppy at closure

Gemini 2.5 Flash understood many of the synthetic paperwork cases, but repeatedly failed the exact final contract: valid JSON, proof codes, and workflow checksum closure.

Practical score: 27.8%
Resolved: 0/9
Core pass: 5/9
Visual sample: City Plan SVG passed

Gemini 2.5 Flash is not a local model. It is included as a cheap API comparison point, so the local rows have something current and capable to measure against.

The result is useful because it separates document understanding from finished work. Gemini often got the central facts close enough for a human reviewer, then failed strict resolution on the proof step or output format.

Model Context

Model family: Google Gemini 2.5
Run type: OpenRouter API run
Leaderboard group: api cheap
Local hardware: None (API run, not a Mac mini LM Studio run)
Benchmark role: Cheap current API comparison

Positioned As

Google describes Gemini 2.5 Flash as the price-performance model in the Gemini 2.5 family, with multimodal input and thinking support.
That makes it a reasonable comparison for document-style work, but not a local privacy-preserving deployment option.
Local Model Bench treats it as a reference pressure test: if a cheap API model still misses exact workflow closure, the closure requirement is doing real work.

What We Actually Tested

The run covered all five generated-invoice Paperwork Trial cases and all four agentic Paperwork Workflow cases.
The model also ran the separate City Plan SVG prompt, which is shown outside the overall practical score.
The overall score uses the same 50/50 Practical Score: half strict resolved pass rate, half core-oracle pass rate, both counted over the nine practical cases.
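The headline number can be reproduced from the two pass counts. The sketch below assumes each case counts equally and that the blend is a plain average of the two rates; the function name `practical_score` is ours for illustration, not part of the benchmark tooling.

```python
def practical_score(resolved: int, core_pass: int, total: int) -> float:
    """Return a 50/50 blend of the strict resolved rate and the
    core-oracle pass rate, expressed as a percentage.

    Assumption: every case carries equal weight.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    resolved_rate = resolved / total
    core_rate = core_pass / total
    return 100 * (0.5 * resolved_rate + 0.5 * core_rate)


# Gemini 2.5 Flash in this run: 0/9 resolved, 5/9 core pass.
score = practical_score(resolved=0, core_pass=5, total=9)
print(f"{score:.1f}%")  # 27.8%
```

With zero resolved cases, the whole score comes from the core-pass half: 5/9 is about 55.6%, and half of that is 27.8%.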

What Worked

Reached core-pass level on five of nine practical cases.
Often identified the right document situation before failing the final checksum.
Generated a valid standalone City Plan SVG sample.

Where It Broke

Zero strict resolved cases.
Repeatedly emitted proof codes as expressions rather than final values, or as wrong final numbers.
A few near misses failed on invalid JSON or an incorrect proof.txt closure rather than on broad comprehension.
This row should not be read as a local privacy-preserving model result.
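The failure modes above can be pictured as a strict closure check. The benchmark's real output schema is not shown on this page, so everything here is a hypothetical illustration: the `proof_code` field, the digits-only rule, and the `closure_failures` helper are all assumptions.

```python
import json
import re

# Hypothetical rule: a proof code must be a plain run of digits,
# not an arithmetic expression such as "400 + 12".
FINAL_NUMBER = re.compile(r"\d+")


def closure_failures(raw_output: str, expected_code: str) -> list[str]:
    """Return the reasons a final answer fails strict closure.

    An empty list means the answer closes. The "proof_code" field
    name is an assumption made for this sketch.
    """
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    if not isinstance(payload, dict):
        return ["top-level value is not an object"]
    code = payload.get("proof_code")
    if not isinstance(code, str) or not FINAL_NUMBER.fullmatch(code):
        return ["proof code is not a final number"]
    if code != expected_code:
        return ["wrong proof code"]
    return []


print(closure_failures('{"proof_code": "400 + 12"}', "412"))
print(closure_failures('{"proof_code": "412"}', "412"))
```

A check of this shape gives no partial credit: an answer that names the right document situation but leaves the code as an expression still fails, which is the gap between core pass and resolved.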

Readout

Gemini 2.5 Flash is a useful cheap comparison row precisely because it did not simply crush the benchmark. It often read the case, but did not reliably finish the job. For real local paperwork workflows, that last step is the product.

Sources

Google Gemini model docs
Gemini 2.5 family announcement

Run Outputs

Paperwork run
Workflow W07
SVG sample
all notes