runtime note

LM Studio MTP worked. The visible answer did not.

LM Studio 0.4.14 can activate MTP speculative decoding on an MTP-capable Qwen3.6 build. Our smoke test confirmed accepted draft tokens in the runtime log, but also caught the practical failure: almost the entire completion was hidden reasoning, with no visible final answer.

Qwen3.6 27B MTP benchmark note infographic
runtime note Practical score
not scored Resolved
MTP active Core pass
no visible answer Visual sample

LM Studio 0.4.14 shipped stable support for MTP speculative decoding. That sounds like a simple speed story: download a model with multi-token prediction heads, turn on the runtime, get faster generation.

The smoke test was useful because it did not stay that simple. MTP was active. Draft tokens were accepted. But the model still failed the actual user-facing task because the response budget was consumed by reasoning tokens instead of visible output.

What changed in LM Studio

LM Studio's 0.4.14 release notes describe stable MTP speculative decoding for models that include built-in multi-token prediction heads. The same release also fixed a `lms get gemma4` issue, which we verified locally.

Speculative decoding is not a quality feature by itself. It is a runtime feature. The runtime tries to predict multiple future tokens, the model accepts or rejects those draft tokens, and the useful question becomes whether that makes real output faster without breaking the artifact.

What we tested

We first loaded the normal local `qwen/qwen3.6-27b` build in LM Studio and ran a small OpenAI-compatible API prompt asking for exactly twelve short bullet points about local LLM runtime speed. Then we downloaded and loaded `qwen3.6-27b-mtp`, a Qwen3.6 27B GGUF build with MTP support.

The test was deliberately small. This was not a Local Model Bench ranking run. It was a runtime smoke test: does MTP engage, what does the log say, and does the final answer actually appear?

  • Runtime: LM Studio 0.4.14
  • Hardware: Mac mini M4, 64 GB unified memory
  • Model: Qwen3.6 27B MTP GGUF
  • Prompt type: short visible-output artifact
  • API path: LM Studio OpenAI-compatible chat completions

MTP was really active

The server log showed that LM Studio created an MTP draft context against the target model and initialized a `draft-mtp` speculative decoding implementation. That is the important part: this was not just a model name containing MTP.

During the smoke test, the log reported a draft acceptance rate of about 0.68: 172 accepted draft tokens out of 252 generated draft tokens. That proves the runtime feature was engaged.

  • `creating MTP draft context` appeared in the LM Studio log
  • `draft-mtp` was initialized
  • Draft acceptance: 172 accepted / 252 generated
  • Reported draft acceptance: about 0.68

Where it broke

The final answer did not appear. The MTP run generated 300 completion tokens, but 299 of them were counted as reasoning tokens. The visible answer channel was empty. The request asked for twelve bullet points. The model returned no user-visible bullet points before hitting the length limit.

This also happened with the normal Qwen3.6 27B build. Adding `/no_think`, `reasoning: none`, `enable_thinking: false`, and `chat_template_kwargs: { enable_thinking: false }` through LM Studio's OpenAI-compatible API did not change the practical outcome in this test.

  • MTP run: about 9.6 tokens per second
  • Completion: 300 tokens
  • Reasoning tokens: 299
  • Visible output: 0 characters
  • Finish reason: length

How other runtimes handle this

vLLM and SGLang document Qwen thinking control at the chat-template layer. In those runtimes, disabling thinking is not merely a normal OpenAI-style request field. It is passed through `chat_template_kwargs`, or configured at server start.

That distinction matters for artifact benchmarks. If the task is to write JSON, CSV, SVG, code, or a final file, hidden reasoning can consume the budget needed for the artifact itself. A benchmark runner needs to prove the final artifact exists, not only count total tokens.

Practical readout

This does not mean MTP is useless. It means MTP is not the whole story. The runtime can accept draft tokens and still fail the user's actual work if the model spends the completion budget in a non-visible reasoning channel.

For Local Model Bench, this is exactly why runtime notes stay separate from the leaderboard. Speed features are interesting. But the practical benchmark still asks the boring question: did the model produce the correct visible artifact?

Model Context

Runtime
LM Studio 0.4.14
Hardware
Mac mini M4, 64 GB unified memory
MTP model
qwen3.6-27b-mtp, Q4_K_S GGUF
Control models
qwen/qwen3.6-27b and qwen/qwen3-14b
MTP evidence
draft-mtp initialized, 172/252 draft tokens accepted
Leaderboard status
runtime smoke test only, not a scored benchmark row

Positioned As

  • This is a runtime lab note, not a model ranking.
  • The useful result is the split between technical MTP activation and practical visible-output failure.
  • The note should be read as a runner and runtime caution, not as a full Qwen3.6 capability review.

What We Actually Tested

  • Downloaded an MTP-capable Qwen3.6 27B GGUF build through LM Studio.
  • Loaded the model with GPU offload and a 4096-token context.
  • Ran a short visible-output prompt through the OpenAI-compatible API.
  • Compared behavior against a normal Qwen3.6 27B build and a Qwen3 14B control run.
  • Inspected LM Studio logs for MTP draft context, draft acceptance, and visible output behavior.

What Worked

  • LM Studio loaded the MTP-capable GGUF cleanly.
  • The server log confirmed `draft-mtp` speculative decoding.
  • The run accepted 172 draft tokens out of 252 generated draft tokens.
  • `lms get gemma4` returned the expected Gemma 4 model list after the update.

Where It Broke

  • The Qwen3.6 MTP run produced no visible final answer in the smoke test.
  • Almost the entire completion budget was spent on reasoning tokens.
  • Thinking suppression did not work through the tested LM Studio OpenAI-compatible request fields.
  • The MTP run was not faster than the ordinary Qwen3.6 27B smoke result in practical wall-clock terms.

Readout

LM Studio 0.4.14 really did activate MTP speculative decoding for the Qwen3.6 MTP build. But the user-facing artifact still failed: accepted draft tokens are not the same as a visible final answer. For practical local work, visible output remains the metric that matters.