studio: re-emit prefilled <think> opener on the safetensors stream (fixes #6815)#6832
studio: re-emit prefilled <think> opener on the safetensors stream (fixes #6815)#6832pjordanandrsn wants to merge 2 commits into
Conversation
Qwen3.5/3.6-style hybrid-thinking templates prefill '<think>\n' in the generation prompt, so the model's output stream carries no opening tag and Studio's tag-based splitting rendered all reasoning as answer text (unslothai#6815). Yield the opener before generate_stream when the rendered prompt ends with it; enable_thinking=False renders a closed empty block and is untouched. GGUF path unaffected (llama-server owns that convention). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request updates the chat response generation in studio/backend/core/inference/inference.py to handle hybrid-thinking templates (like Qwen3.5/3.6) that prefill the reasoning opener. It yields a <think> tag if the formatted prompt ends with <think>, ensuring the reasoning stream is correctly parsed on the safetensors path. There are no review comments, so no feedback is provided.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b7d3dcdd6e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| # safetensors path (#6815). enable_thinking=False prompts end with an *empty* | ||
| # `<think>\n\n</think>` block, not the bare opener, so they are unaffected. | ||
| if formatted_prompt.rstrip().endswith("<think>"): | ||
| yield "<think>" |
There was a problem hiding this comment.
Keep the think opener in cumulative snapshots
When the template pre-fills a bare <think>, this yields one synthetic snapshot and then delegates to generate_stream, whose later snapshots start from only the model-generated text. The OpenAI/safetensors consumers treat every value from this generator as cumulative and diff against the previous snapshot (for example routes/inference.py slices cumulative[len(prev_text):]), so after prev_text is "<think>" the next reasoning snapshot loses its first characters and the final/non-streaming result still omits the opener. The synthetic prefix needs to be part of each subsequent cumulative value instead of emitted once separately.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — confirmed and fixed in e8db9ce. generate_stream yields cumulative snapshots (output += new_token), and the consumers diff with cumulative[len(prev_text):], so the one-shot opener made the next snapshot lose its first 7 characters and dropped the opener from the final value — a consumer-diff simulation reproduces exactly that ('<think>ing...'). The opener is now carried as a prefix on every snapshot: streamed reconstruction and final snapshot both come out as <think>reasoning...</think>answer, and the no-prefill path is byte-identical to before (empty prefix).
The stream protocol yields cumulative snapshots that consumers diff against
the previous value (routes/inference.py: cumulative[len(prev_text):]).
Yielding the opener once as its own snapshot made the next snapshot lose
its first len('<think>') characters and dropped the opener from the final
value. Prefix every generate_stream snapshot instead; the no-prefill path
is unchanged (empty prefix).
Addresses the Codex P1 review finding on unslothai#6832.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Fixes #6815 (Qwen3.6-A3B "thinking broken" on the safetensors path).
Root cause
Qwen3.6-style hybrid-thinking templates prefill the reasoning opener in the generation prompt —
the rendered prompt ends
<|im_start|>assistant\n<think>\n(verified on the realQwen/Qwen3.6-35B-A3Btemplate). The model's own output stream is thereforereasoning…</think>answerwith no opening tag, so Studio's tag-based stream splitting rendersall reasoning as answer text. The GGUF path is unaffected because llama-server handles the
prefilled-opener convention natively — which is why the issue only reproduces on safetensors.
Fix
In
_generate_chat_response_inner: when the rendered promptrstrip().endswith("<think>"), theopener is carried as a prefix on every cumulative snapshot from
generate_stream(the streamprotocol is cumulative -- consumers diff against the previous value; per the Codex review
finding, a one-shot yield would drop characters from the diff). This is consistent with the existing stream protocol (the
harmony parser already emits literal tags for gpt-oss).
Predicate verified on real templates:
…assistant\n<think>\nenable_thinking=False…<think>\n\n</think>\n\n…assistant\nNote
mlx_inference.pylikely needs the same guard on its stream path — happy to extend this PR orfollow up separately, maintainer preference.