local

Gemma 4 31B: dense model, brittle closure

Gemma 4 31B is positioned as the dense, all-parameters-active sibling of the Gemma 4 family. It often understood the broad document situation, but the strict benchmark contract punished it hard.

gemma-4-31b-it benchmark note infographic
27.8%Practical score
0/9Resolved
5/9Core pass
City Plan SVG passedVisual sample

The dense 31B run is useful because it breaks a simple assumption: larger local model, better workflow result. The model is listed as the dense Gemma 4 counterpart, with all parameters active during inference and support for vision, structured output, and agentic workflows.

On this benchmark, that did not translate into clean closure. The model reached core-pass level often enough to show understanding, but exact final artifacts, proof codes, and evidence formatting kept it from resolving any case strictly.

Model Context

Model family
Gemma 4
Run type
Local LM Studio run
Local hardware
Mac mini M4, 64 GB unified memory
Architecture listing
Dense 31B / all active
Benchmark role
Larger local comparison point

Positioned As

  • Vercel's AI Gateway model page describes Gemma 4 31B IT as Google's open-weight dense model with 31B parameters active during inference, targeting higher output quality than the MoE sibling.
  • That same listing describes support for text and image input, structured JSON output, function calling, system instructions, and agentic workflows.

What We Actually Tested

  • The run tested exact local paperwork workflow completion, not broad model quality.
  • A core pass counted when central audit facts were correct. A resolved case required the full contract: hidden oracle, proof code, artifacts, evidence, and protected-source checks.
  • This distinction matters: the model was not blank, but it still failed the thing a user would need completed.

What Worked

  • Reached core-pass level on five cases despite resolving none strictly.
  • Produced a valid City Plan SVG sample.
  • Often got the broad document situation, but lost the exact benchmark contract.

Where It Broke

  • No strict resolved cases in the current run.
  • Format, evidence, and exact artifact expectations caused many failures.
  • The result is a reminder that larger local models can still be brittle in constrained workflows.

Readout

This should be read as a workflow-contract failure pattern, not a universal verdict on the model. If a user only needs rough document triage, the score looks harsh; if a user needs final artifacts they can trust, the score is doing its job.