local

Gemma 4 31B: bigger did not mean cleaner

The 31B run scored lower than expected. The interesting signal is not that the model is useless, but that the requirement to close each workflow exactly penalized it heavily.

gemma-4-31b-it benchmark summary
Practical score: 27.8%
Resolved: 0/9
Core pass: 5/9
Visual sample: City Plan SVG passed
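
The note does not say how the practical score is derived. One weighting that happens to reproduce 27.8% from the 0/9 and 5/9 counts is half credit for a core pass that was not strictly resolved. This is purely an assumption, the function name and the `core_credit` parameter are hypothetical, and other formulas could yield the same figure.

```python
def practical_score(resolved: int, core_pass: int, total: int,
                    core_credit: float = 0.5) -> float:
    """Percentage score under an ASSUMED weighting (not stated in the source)."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Assumption: every resolved case also counts as a core pass.
    core_only = core_pass - resolved
    return 100.0 * (resolved + core_credit * core_only) / total


# Counts taken from the summary above: 0 resolved, 5 core passes, 9 cases.
print(round(practical_score(resolved=0, core_pass=5, total=9), 1))  # 27.8
```

Under this reading, each core-only case would contribute half of what a strictly resolved case contributes, which would explain why five partial successes still leave the headline score below 30%.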

What Worked

  • Reached core-pass level on five cases despite resolving none strictly.
  • Produced a valid City Plan SVG sample.
  • Often understood the broad document situation, but missed the exact benchmark contract.

Where It Broke

  • No strictly resolved cases in the current run (0/9).
  • Format, evidence, and exact artifact expectations caused many failures.
  • The result is a reminder that larger local models can still be brittle in constrained workflows.
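
The gap between a core pass and a strictly resolved case can be pictured as a two-tier grading rule. The sketch below is an illustration only: the `CaseResult` fields and the `grade` function are hypothetical names, and the benchmark's real checks are not described in this note.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Outcome of one benchmark case (all fields are hypothetical)."""
    core_ok: bool      # the substance of the task is right
    format_ok: bool    # output matches the required format
    evidence_ok: bool  # required supporting evidence is present
    artifact_ok: bool  # the exact expected artifact was produced


def grade(result: CaseResult) -> str:
    """Return 'resolved', 'core_pass', or 'fail' under the assumed two-tier rule."""
    if not result.core_ok:
        return "fail"
    # Strict resolution requires every contract check to hold at once.
    strict_ok = result.format_ok and result.evidence_ok and result.artifact_ok
    return "resolved" if strict_ok else "core_pass"


# A case that gets the substance right but misses the format requirement.
example = CaseResult(core_ok=True, format_ok=False, evidence_ok=True, artifact_ok=True)
print(grade(example))  # core_pass
```

Because the strict tier is a conjunction, a single missed contract item is enough to drop a case from resolved to core pass, which fits a run with five core passes and no resolved cases.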

Readout

This should be read as a workflow-contract failure pattern, not a universal verdict on the model.