localMay 16, 2026

Gemma 4 31B: dense model, brittle closure

Gemma 4 31B is positioned as the dense, all-parameters-active sibling of the Gemma 4 family. It often understood the broad document situation, but the strict benchmark contract punished it hard.

27.8%Practical score

0/9Resolved

5/9Core pass

City Plan SVG passedVisual sample

The dense 31B run is useful because it breaks a simple assumption: larger local model, better workflow result. The model is listed as the dense Gemma 4 counterpart, with all parameters active during inference and support for vision, structured output, and agentic workflows.

On this benchmark, that did not translate into clean closure. The model reached core-pass level often enough to show understanding, but exact final artifacts, proof codes, and evidence formatting kept it from resolving any case strictly.

Model Context

Model family: Gemma 4
Run type: Local LM Studio run
Local hardware: Mac mini M4, 64 GB unified memory
Architecture listing: Dense 31B / all active
Benchmark role: Larger local comparison point

Positioned As

Vercel's AI Gateway model page describes Gemma 4 31B IT as Google's open-weight dense model with 31B parameters active during inference, targeting higher output quality than the MoE sibling.
That same listing describes support for text and image input, structured JSON output, function calling, system instructions, and agentic workflows.

What We Actually Tested

The run tested exact local paperwork workflow completion, not broad model quality.
A core pass counted when central audit facts were correct. A resolved case required the full contract: hidden oracle, proof code, artifacts, evidence, and protected-source checks.
This distinction matters: the model was not blank, but it still failed the thing a user would need completed.

What Worked

Reached core-pass level on five cases despite resolving none strictly.
Produced a valid City Plan SVG sample.
Often got the broad document situation, but lost the exact benchmark contract.

Where It Broke

No strict resolved cases in the current run.
Format, evidence, and exact artifact expectations caused many failures.
The result is a reminder that larger local models can still be brittle in constrained workflows.

Readout

This should be read as a workflow-contract failure pattern, not a universal verdict on the model. If a user only needs rough document triage, the score looks harsh; if a user needs final artifacts they can trust, the score is doing its job.

Sources

Vercel AI Gateway model page Gemma 4 family page

Run Outputs

Paperwork run SVG sample all notes