reference

Qwen3 VL 32B in a paperwork workflow test

Qwen3-VL is positioned as a strong vision-language model for long-context image reasoning. In this benchmark it read many document facts correctly, then repeatedly lost the run at proof codes, duplicate-risk logic, and workflow closure.

qwen3-vl-32b-instruct benchmark note infographic
22.2%Practical score
0/9Resolved
4/9Core pass
City Plan SVG passedVisual sample

Qwen3-VL-32B-Instruct is a plausible candidate for local or semi-local document work: it is a 33B-parameter vision-language model with image-text input, and its model card positions Qwen3-VL as the strongest vision-language generation in the Qwen line so far.

That makes it tempting to treat the model as a document assistant. Local Model Bench tested something narrower and less glamorous: can it turn synthetic invoice scans and messy intake folders into exact final artifacts under hidden-oracle checks?

Model Context

Model family
Qwen3-VL
Run type
OpenRouter reference/API run
Public model size
33B params on Hugging Face
Input class
Image + text
Benchmark role
Vision/document reference, not local LM Studio

Positioned As

  • The Hugging Face model card describes Qwen3-VL as a vision-language model with upgraded text generation, visual perception and reasoning, long-context handling, spatial/video understanding, and stronger agent interaction capabilities.
  • The Qwen3-VL technical report frames the family as supporting interleaved text, image, and video contexts up to 256K tokens, with dense 32B and MoE variants for different latency-quality tradeoffs.

What We Actually Tested

  • The run did not test broad visual intelligence, video understanding, or generic chat quality.
  • It tested synthetic paperwork: invoice images, bank exports, vendor lists, revised documents, final JSON artifacts, evidence paths, and proof-code closure.
  • The visual City Plan SVG sample is reported separately because one SVG prompt is not the same thing as a statistical vision benchmark.

What Worked

  • Extracted core fields correctly in the easier invoice-image cases.
  • Handled visible invoice IDs, warning labels, and document types better than its strict score suggests.
  • Generated a valid City Plan SVG sample.

Where It Broke

  • Repeated proof-code failures even when the business facts were otherwise correct.
  • Missed duplicate-risk and revision-style traps.
  • Produced required workflow files, but several intermediate artifacts did not match the hidden oracle.

Readout

The model looked useful for reading documents, but not reliable enough to close a private paperwork workflow without review. Its score is not a broad verdict on Qwen3-VL; it is a warning about document-understanding claims meeting exact workflow contracts.