Qwen3 VL 32B: good reading, weak closure
A paid OpenRouter vision reference run that read many document facts correctly, then repeatedly failed the benchmark on proof codes, duplicate-risk logic, and workflow closure.
Practical score: 22.2%
Resolved: 0/9
Core pass: 4/9
Visual sample: City Plan SVG passed
What Worked
- Extracted core fields correctly in the easier invoice-image cases.
- Handled visible invoice IDs, warning labels, and document types better than its strict score suggests.
- Generated a valid City Plan SVG sample.
Where It Broke
- Failed proof codes repeatedly, even when the business facts were otherwise correct.
- Missed duplicate-risk and revision-style traps.
- Produced required workflow files, but several intermediate artifacts did not match the hidden oracle.
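
The failure pattern above is easier to see with a small sketch of all-or-nothing scoring. Everything here is an assumption made for illustration: the check names (`fields_ok`, `proof_code_ok`, `duplicate_risk_ok`, `workflow_ok`), the two pass rules, and the sample case are hypothetical and are not taken from the benchmark's actual scorer or from this run's data.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    # Hypothetical per-case checks; the benchmark's real checks are not documented here.
    fields_ok: bool          # extracted business facts match
    proof_code_ok: bool      # proof code matches the expected value
    duplicate_risk_ok: bool  # duplicate or revision trap handled
    workflow_ok: bool        # workflow artifacts match the oracle


def core_pass(case: CaseResult) -> bool:
    # Assumed "core" rule: only the extracted facts need to be right.
    return case.fields_ok


def resolved(case: CaseResult) -> bool:
    # Assumed "resolved" rule: every check must pass, so one miss fails the case.
    return all(
        (case.fields_ok, case.proof_code_ok, case.duplicate_risk_ok, case.workflow_ok)
    )


# Illustrative case: correct facts, wrong proof code.
example = CaseResult(
    fields_ok=True, proof_code_ok=False, duplicate_risk_ok=True, workflow_ok=True
)
print(core_pass(example), resolved(example))
```

Under these assumed rules, a case with correct facts and a wrong proof code counts toward the core pass tally but not toward the resolved tally, which is one way a run can post a nonzero core pass rate alongside 0/9 resolved.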
Readout
Treat this as a useful vision/document extraction candidate, not as a reliable autonomous paperwork worker on this benchmark.