
Qwen3 VL 32B: good reading, weak closure

A paid vision reference run through OpenRouter. The model read many document facts correctly, then repeatedly failed benchmark cases on proof codes, duplicate-risk logic, and workflow closure.

qwen3-vl-32b-instruct benchmark note

  • Practical score: 22.2%
  • Resolved: 0/9
  • Core pass: 4/9
  • Visual sample: City Plan SVG passed

What Worked

  • Extracted core fields correctly in the easier invoice-image cases.
  • Handled visible invoice IDs, warning labels, and document types better than its strict score suggests.
  • Generated a valid City Plan SVG sample.

Where It Broke

  • Repeated proof-code failures even when the business facts were otherwise correct.
  • Missed duplicate-risk and revision-style traps.
  • Produced required workflow files, but several intermediate artifacts did not match the hidden oracle.
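
The gap between the core-pass and resolved counts can be pictured with a small sketch. This is not the benchmark's actual grader: the `CaseResult` type, the three checks, and the per-case values are all hypothetical, chosen only so the totals line up with the reported 4/9 core pass and 0/9 resolved. It assumes a case counts as resolved only when the extracted facts, the proof code, and the workflow artifacts all match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    # All three fields are hypothetical checks, not the benchmark's real schema.
    core_fields_ok: bool  # extracted business facts match
    proof_code_ok: bool   # proof code matches the expected value
    artifacts_ok: bool    # intermediate workflow files match the oracle


def core_pass(result: CaseResult) -> bool:
    """Lenient view: only the extracted facts have to be right."""
    return result.core_fields_ok


def resolved(result: CaseResult) -> bool:
    """Strict view: facts, proof code, and artifacts must all be right."""
    return result.core_fields_ok and result.proof_code_ok and result.artifacts_ok


def summarize(results: list[CaseResult]) -> dict[str, str]:
    total = len(results)
    return {
        "core_pass": f"{sum(core_pass(r) for r in results)}/{total}",
        "resolved": f"{sum(resolved(r) for r in results)}/{total}",
    }


# Illustrative data only: four cases read correctly but miss the proof code,
# five cases miss the core fields as well.
cases = [CaseResult(True, False, True)] * 4 + [CaseResult(False, False, False)] * 5
print(summarize(cases))
```

Under this assumed rule, one failed proof code is enough to keep an otherwise correct case out of the resolved column, which is the pattern described in the list above.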

Readout

Treat this as a useful vision/document extraction candidate, not as a reliable autonomous paperwork worker on this benchmark.