What changed in LM Studio
LM Studio's 0.4.14 release notes describe stable MTP speculative decoding for models that include built-in multi-token prediction heads. The same release also fixed a `lms get gemma4` issue, which we verified locally.
Speculative decoding is not a quality feature by itself. It is a runtime feature. The runtime tries to predict multiple future tokens, the model accepts or rejects those draft tokens, and the useful question becomes whether that makes real output faster without breaking the artifact.
What we tested
We first loaded the normal local `qwen/qwen3.6-27b` build in LM Studio and ran a small OpenAI-compatible API prompt asking for exactly twelve short bullet points about local LLM runtime speed. Then we downloaded and loaded `qwen3.6-27b-mtp`, a Qwen3.6 27B GGUF build with MTP support.
The test was deliberately small. This was not a Local Model Bench ranking run. It was a runtime smoke test: does MTP engage, what does the log say, and does the final answer actually appear?
- Runtime: LM Studio 0.4.14
- Hardware: Mac mini M4, 64 GB unified memory
- Model: Qwen3.6 27B MTP GGUF
- Prompt type: short visible-output artifact
- API path: LM Studio OpenAI-compatible chat completions
MTP was really active
The server log showed that LM Studio created an MTP draft context against the target model and initialized a `draft-mtp` speculative decoding implementation. That is the important part: this was not just a model name containing MTP.
During the smoke test, the log reported a draft acceptance rate of about 0.68: 172 accepted draft tokens out of 252 generated draft tokens. That proves the runtime feature was engaged.
- `creating MTP draft context` appeared in the LM Studio log
- `draft-mtp` was initialized
- Draft acceptance: 172 accepted / 252 generated
- Reported draft acceptance: about 0.68
Where it broke
The final answer did not appear. The MTP run generated 300 completion tokens, but 299 of them were counted as reasoning tokens. The visible answer channel was empty. The request asked for twelve bullet points. The model returned no user-visible bullet points before hitting the length limit.
This also happened with the normal Qwen3.6 27B build. Adding `/no_think`, `reasoning: none`, `enable_thinking: false`, and `chat_template_kwargs: { enable_thinking: false }` through LM Studio's OpenAI-compatible API did not change the practical outcome in this test.
- MTP run: about 9.6 tokens per second
- Completion: 300 tokens
- Reasoning tokens: 299
- Visible output: 0 characters
- Finish reason: length
How other runtimes handle this
vLLM and SGLang document Qwen thinking control at the chat-template layer. In those runtimes, disabling thinking is not merely a normal OpenAI-style request field. It is passed through `chat_template_kwargs`, or configured at server start.
That distinction matters for artifact benchmarks. If the task is to write JSON, CSV, SVG, code, or a final file, hidden reasoning can consume the budget needed for the artifact itself. A benchmark runner needs to prove the final artifact exists, not only count total tokens.
Practical readout
This does not mean MTP is useless. It means MTP is not the whole story. The runtime can accept draft tokens and still fail the user's actual work if the model spends the completion budget in a non-visible reasoning channel.
For Local Model Bench, this is exactly why runtime notes stay separate from the leaderboard. Speed features are interesting. But the practical benchmark still asks the boring question: did the model produce the correct visible artifact?