Qwen3 VL 32B in a paperwork workflow test
Qwen3-VL is positioned as a strong vision-language model for long-context image reasoning. In this benchmark it read many document facts correctly, then repeatedly lost the run at proof codes, duplicate-risk logic, and workflow closure.

Qwen3-VL-32B-Instruct is a plausible candidate for local or semi-local document work: it is a 33B-parameter vision-language model with image-text input, and its model card positions Qwen3-VL as the strongest vision-language generation in the Qwen line so far.
That makes it tempting to treat the model as a document assistant. Local Model Bench tested something narrower and less glamorous: can it turn synthetic invoice scans and messy intake folders into exact final artifacts under hidden-oracle checks?
Model Context
- Model family
- Qwen3-VL
- Run type
- OpenRouter reference/API run
- Public model size
- 33B params on Hugging Face
- Input class
- Image + text
- Benchmark role
- Vision/document reference, not local LM Studio
Positioned As
- The Hugging Face model card describes Qwen3-VL as a vision-language model with upgraded text generation, visual perception and reasoning, long-context handling, spatial/video understanding, and stronger agent interaction capabilities.
- The Qwen3-VL technical report frames the family as supporting interleaved text, image, and video contexts up to 256K tokens, with dense 32B and MoE variants for different latency-quality tradeoffs.
What We Actually Tested
- The run did not test broad visual intelligence, video understanding, or generic chat quality.
- It tested synthetic paperwork: invoice images, bank exports, vendor lists, revised documents, final JSON artifacts, evidence paths, and proof-code closure.
- The visual City Plan SVG sample is reported separately because one SVG prompt is not the same thing as a statistical vision benchmark.
What Worked
- Extracted core fields correctly in the easier invoice-image cases.
- Handled visible invoice IDs, warning labels, and document types better than its strict score suggests.
- Generated a valid City Plan SVG sample.
Where It Broke
- Repeated proof-code failures even when the business facts were otherwise correct.
- Missed duplicate-risk and revision-style traps.
- Produced required workflow files, but several intermediate artifacts did not match the hidden oracle.
Readout
The model looked useful for reading documents, but not reliable enough to close a private paperwork workflow without review. Its score is not a broad verdict on Qwen3-VL; it is a warning about document-understanding claims meeting exact workflow contracts.