Gemma 4 31B: dense model, brittle closure
Gemma 4 31B is positioned as the dense, all-parameters-active sibling of the Gemma 4 family. It often understood the broad document situation, but the strict benchmark contract punished it hard.

The dense 31B run is useful because it breaks a simple assumption: larger local model, better workflow result. The model is listed as the dense Gemma 4 counterpart, with all parameters active during inference and support for vision, structured output, and agentic workflows.
On this benchmark, that did not translate into clean closure. The model reached core-pass level often enough to show understanding, but exact final artifacts, proof codes, and evidence formatting kept it from resolving any case strictly.
Model Context
- Model family
- Gemma 4
- Run type
- Local LM Studio run
- Local hardware
- Mac mini M4, 64 GB unified memory
- Architecture listing
- Dense 31B / all active
- Benchmark role
- Larger local comparison point
Positioned As
- Vercel's AI Gateway model page describes Gemma 4 31B IT as Google's open-weight dense model with 31B parameters active during inference, targeting higher output quality than the MoE sibling.
- That same listing describes support for text and image input, structured JSON output, function calling, system instructions, and agentic workflows.
What We Actually Tested
- The run tested exact local paperwork workflow completion, not broad model quality.
- A core pass counted when central audit facts were correct. A resolved case required the full contract: hidden oracle, proof code, artifacts, evidence, and protected-source checks.
- This distinction matters: the model was not blank, but it still failed the thing a user would need completed.
What Worked
- Reached core-pass level on five cases despite resolving none strictly.
- Produced a valid City Plan SVG sample.
- Often got the broad document situation, but lost the exact benchmark contract.
Where It Broke
- No strict resolved cases in the current run.
- Format, evidence, and exact artifact expectations caused many failures.
- The result is a reminder that larger local models can still be brittle in constrained workflows.
Readout
This should be read as a workflow-contract failure pattern, not a universal verdict on the model. If a user only needs rough document triage, the score looks harsh; if a user needs final artifacts they can trust, the score is doing its job.