Gemini 2.5 Flash: good at reading, sloppy at closure
Gemini 2.5 Flash understood many of the synthetic paperwork cases, but repeatedly failed the exact final contract: valid JSON, proof codes, and workflow checksum closure.
Gemini 2.5 Flash is not a local model. It is included as a cheap API comparison point, so the local rows have something current and capable to measure against.
The result is useful because it separates document understanding from finished work. Gemini often got the central facts close enough for a human reviewer, then failed strict resolution on the proof step or output format.
Model Context
- Model family
- Google Gemini 2.5
- Run type
- OpenRouter API run
- Leaderboard group
- api cheap
- Local hardware
- Not a Mac mini LM Studio run
- Benchmark role
- Cheap current API comparison
Positioned As
- Google describes Gemini 2.5 Flash as the price-performance model in the Gemini 2.5 family, with multimodal input and thinking support.
- That makes it a reasonable comparison for document-style work, but not a local privacy-preserving deployment option.
- Local Model Bench treats it as a reference pressure test: if a cheap API model still misses exact workflow closure, the closure requirement is doing real work.
What We Actually Tested
- The run covered all five generated-invoice Paperwork Trial cases and all four agentic Paperwork Workflow cases.
- The model also ran the separate City Plan SVG prompt, which is shown outside the overall practical score.
- The overall score uses the same 50/50 Practical Score: half strict resolved pass, half core-oracle pass.
What Worked
- Reached core-pass level on five of nine practical cases.
- Often identified the right document situation before failing the final checksum.
- Generated a valid standalone City Plan SVG sample.
Where It Broke
- Zero strict resolved cases.
- Repeatedly emitted proof codes as expressions or wrong final numbers.
- A few near misses were invalid JSON or wrong proof.txt closure rather than broad comprehension failures.
- It should not be read as a local privacy-preserving model result.
Readout
Gemini 2.5 Flash is a useful cheap comparison row precisely because it did not simply crush the benchmark. It often read the case, but did not reliably finish the job. For real local paperwork workflows, that last step is the product.