api cheap

Gemini 2.5 Flash: good at reading, sloppy at closure

Gemini 2.5 Flash understood many of the synthetic paperwork cases, but repeatedly failed the exact final contract: valid JSON, proof codes, and workflow checksum closure.

gemini-2.5-flash benchmark note infographic
27.8% Practical score
0/9 Resolved
5/9 Core pass
City Plan SVG passed Visual sample

Gemini 2.5 Flash is not a local model. It is included as a cheap API comparison point, so the local rows have something current and capable to measure against.

The result is useful because it separates document understanding from finished work. Gemini often got the central facts close enough for a human reviewer, then failed strict resolution on the proof step or output format.

Model Context

Model family
Google Gemini 2.5
Run type
OpenRouter API run
Leaderboard group
api cheap
Local hardware
Not a Mac mini LM Studio run
Benchmark role
Cheap current API comparison

Positioned As

  • Google describes Gemini 2.5 Flash as the price-performance model in the Gemini 2.5 family, with multimodal input and thinking support.
  • That makes it a reasonable comparison for document-style work, but not a local privacy-preserving deployment option.
  • Local Model Bench treats it as a reference pressure test: if a cheap API model still misses exact workflow closure, the closure requirement is doing real work.

What We Actually Tested

  • The run covered all five generated-invoice Paperwork Trial cases and all four agentic Paperwork Workflow cases.
  • The model also ran the separate City Plan SVG prompt, which is shown outside the overall practical score.
  • The overall score uses the same 50/50 Practical Score: half strict resolved pass, half core-oracle pass.

What Worked

  • Reached core-pass level on five of nine practical cases.
  • Often identified the right document situation before failing the final checksum.
  • Generated a valid standalone City Plan SVG sample.

Where It Broke

  • Zero strict resolved cases.
  • Repeatedly emitted proof codes as expressions or wrong final numbers.
  • A few near misses were invalid JSON or wrong proof.txt closure rather than broad comprehension failures.
  • It should not be read as a local privacy-preserving model result.

Readout

Gemini 2.5 Flash is a useful cheap comparison row precisely because it did not simply crush the benchmark. It often read the case, but did not reliably finish the job. For real local paperwork workflows, that last step is the product.