Gemma 4 31B: bigger did not mean cleaner
The 31B run scored lower than expected. The interesting signal is not that the model is useless, but that strict workflow closure punished it hard: it reached core pass on 5 of 9 cases yet strictly resolved none.
Practical score: 27.8%
Resolved: 0/9
Core pass: 5/9
Visual sample: City Plan SVG passed
What Worked
- Reached core-pass level on five cases despite resolving none strictly.
- Produced a valid City Plan SVG sample.
- Often grasped the broad document situation but missed the exact benchmark contract.
Where It Broke
- No strictly resolved cases in the current run.
- Format, evidence, and exact artifact expectations caused many failures.
- The result is a reminder that larger local models can still be brittle in constrained workflows.
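To make the gap between "core pass" and "resolved" concrete, here is a minimal sketch of how a strict grader could turn partial successes into zero resolved cases. Everything in it is an assumption for illustration: the `CaseResult` fields, the idea that a case is resolved only when every check holds, and the made-up per-case outcomes, which are arranged only to mirror the reported 0/9 and 5/9 tallies. It is not the benchmark's actual grading code.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Hypothetical per-case check outcomes; field names are illustrative."""
    core_ok: bool      # broad document situation handled
    format_ok: bool    # output matches the required format
    evidence_ok: bool  # required evidence is present
    artifact_ok: bool  # exact artifact expectation is met

    @property
    def core_pass(self) -> bool:
        return self.core_ok

    @property
    def resolved(self) -> bool:
        # Assumed strict closure: every check must hold, not just the core one.
        return (self.core_ok and self.format_ok
                and self.evidence_ok and self.artifact_ok)


def tally(results):
    """Return (resolved_count, core_pass_count, total_cases)."""
    resolved = sum(r.resolved for r in results)
    core = sum(r.core_pass for r in results)
    return resolved, core, len(results)


# Made-up outcomes, NOT the real per-case data: five cases get the core
# right but each misses one strict check; four miss the core entirely.
results = (
    [CaseResult(True, False, True, True)] * 2    # format miss
    + [CaseResult(True, True, False, True)] * 2  # evidence miss
    + [CaseResult(True, True, True, False)]      # artifact miss
    + [CaseResult(False, False, False, False)] * 4
)

print(tally(results))  # (0, 5, 9)
```

Under this assumed all-or-nothing rule, a single missed format, evidence, or artifact check is enough to move a case from resolved to merely core pass, which is the kind of brittleness the bullets above describe.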
Readout
Read this as a workflow-contract failure pattern, not a universal verdict on the model.