Studio: render thinking blocks for safetensors inference with prefilled <think> templates #6816
shimmyshimmer wants to merge 3 commits into
…ed <think> templates Reasoning templates like Qwen3.6 end the generation prompt with an open <think> tag. skip_prompt streaming drops it, so the frontend never sees the opening tag and shows reasoning as plain text. Detect the prefill and re-emit it at the start of the stream on the transformers and MLX paths. Also stop stripping think tags in _clean_generated_text when a tokenizer marks them special.
Code Review
This pull request introduces a helper function detect_think_prefill to identify trailing open <think> tags in rendered prompts, which are often swallowed by skip_prompt during streaming. It updates the inference engines (including MLX) to re-emit this prefix at the start of the generated stream, ensuring the frontend can properly render thinking blocks. Unit tests are also added to verify this behavior. The reviewer feedback suggests yielding this <think> prefix immediately before the generation loop begins across all streaming paths. This would allow the frontend to render the collapsible thinking block during the prompt prefill phase, significantly improving perceived latency for the user.
      thread.start()

    - output = ""
    + output = think_prefix

Yielding the prefilled think_prefix immediately before the generation loop starts is a great UX improvement. Currently, the generator yields nothing until the first token is generated. Since the prompt prefill/processing phase can take a few seconds (especially for larger models), yielding the <think> tag immediately allows the frontend to render the collapsible thinking block right away, significantly reducing perceived latency.

Suggested change:

    - output = think_prefix
    + output = think_prefix
    + if think_prefix:
    +     yield think_prefix
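The effect of the suggested change can be illustrated with a minimal, self-contained sketch. It assumes the generator yields cumulative text (which the output and cumulative variables in the hunks suggest) and uses a plain iterator as a stand-in for the real streamer. The names stream_with_prefix and fake_streamer are hypothetical and do not come from the PR.

```python
from typing import Iterable, Iterator


def stream_with_prefix(think_prefix: str, pieces: Iterable[str]) -> Iterator[str]:
    """Yield cumulative text, emitting the prefilled prefix before any token."""
    output = think_prefix
    if think_prefix:
        # Emitted before the first generated piece arrives, so a frontend can
        # open the thinking block while the prompt is still being processed.
        yield output
    for piece in pieces:
        output += piece
        yield output


def fake_streamer() -> Iterator[str]:
    # Stand-in for a real token streamer; the pieces are illustrative only.
    yield from ["Let me think.", "</think>", "\n\nHello!"]


chunks = list(stream_with_prefix("<think>\n", fake_streamer()))
```

With an empty prefix the generator behaves as before and yields nothing until the first piece arrives.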
    # An open <think> prefilled by the template lives in the prompt, not
    # the generated tokens; re-emit it so the frontend renders the block.
    think_prefix = detect_think_prefill(prompt)

Yielding the prefilled think_prefix immediately before the generation loop starts is a great UX improvement. Currently, the generator yields nothing until the first token is generated. Since the prompt prefill/processing phase can take a few seconds (especially for larger models), yielding the <think> tag immediately allows the frontend to render the collapsible thinking block right away, significantly reducing perceived latency.

Suggested change:

    think_prefix = detect_think_prefill(prompt)
    if think_prefix:
        yield think_prefix

    from core.inference.chat_template_helpers import detect_think_prefill

    # Re-emit an open <think> prefill from the prompt (see _generate_text).
    cumulative = detect_think_prefill(prompt)

Yielding the prefilled cumulative (which contains the prefilled <think> tag) immediately before the generation loop starts is a great UX improvement. Currently, the generator yields nothing until the first token is generated. Since the prompt prefill/processing phase can take a few seconds (especially for larger models), yielding the <think> tag immediately allows the frontend to render the collapsible thinking block right away, significantly reducing perceived latency.

Suggested change:

    cumulative = detect_think_prefill(prompt)
    if cumulative:
        yield cumulative
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1665762b38
    # skip_prompt swallows an open <think> prefilled by the template;
    # re-emit it so the frontend can render the thinking block.
    # gpt-oss emits its own tags via HarmonyTextStreamer.
    think_prefix = "" if self._is_gpt_oss_model() else detect_think_prefill(prompt)

Preserve closing think tags before adding the prefix

Please don't rely on _clean_generated_text to keep these tags after setting think_prefix. In the transformers paths, the TextIteratorStreamer below is still constructed with skip_special_tokens=True (and the MLX text path also decodes with skip_special_tokens=True), so for any tokenizer that lists </think> as a special token, the generated close tag is removed before the cleaner sees it. When the prompt prefill opens <think>, those streams become <think>...final answer with no close tag, so parseAssistantContent keeps the visible answer in the reasoning block. Decode/stream with special tokens preserved and strip only non-think specials when a prefix is re-emitted.
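The failure mode described in this comment can be shown with a small sketch. The function below is a simplified Python stand-in for a tag-based parser; it is not the frontend's parseAssistantContent, whose actual logic is not shown here, and the name split_reasoning is hypothetical. It assumes that an open tag without a matching close tag captures everything after it as reasoning.

```python
from typing import Tuple

OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer) for text that may contain one think block."""
    start = text.find(OPEN_TAG)
    if start == -1:
        return "", text
    body_start = start + len(OPEN_TAG)
    end = text.find(CLOSE_TAG, body_start)
    if end == -1:
        # No close tag: everything after the open tag is treated as reasoning,
        # so the visible answer ends up inside the thinking block.
        return text[body_start:].strip(), text[:start].strip()
    answer = text[:start] + text[end + len(CLOSE_TAG):]
    return text[body_start:end].strip(), answer.strip()


closed = split_reasoning("<think>\nplan</think>\n\nHi")
unclosed = split_reasoning("<think>\nplan\n\nHi")  # close tag stripped on decode
```

In the closed case the answer is separated from the reasoning; in the unclosed case the answer string is empty because the whole reply was absorbed into the reasoning.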
Address review feedback:
- Guard: skip re-emitting the open <think> when the tokenizer marks </think> as a special token, since skip_special_tokens would strip the model's close tag and leave an unclosed block that swallows the answer. Falls back to plain text (pre-fix behaviour) for those tokenizers.
- Yield the prefilled <think> before the first token so the thinking block renders during prompt prefill instead of after the first generated token.
- Drop the now-unnecessary _clean_generated_text think-tag exemption; the guard handles the special-token case at the source.

No mainstream reasoning model (Qwen3.6, Qwen3, DeepSeek-R1, QwQ, GLM-4.6) marks think tags special, so behaviour is unchanged for them.
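A guard of this kind might look roughly like the sketch below. It is an illustration under stated assumptions, not the code pushed in the commit: close_tag_is_special and guarded_prefix are hypothetical names, and the check relies on the all_special_tokens attribute that Hugging Face tokenizers expose. Whether that attribute covers every token that skip_special_tokens strips for a given tokenizer is also an assumption. SimpleNamespace objects stand in for real tokenizers.

```python
from types import SimpleNamespace


def close_tag_is_special(tokenizer, close_tag: str = "</think>") -> bool:
    """True if the tokenizer lists close_tag among its special tokens."""
    # all_special_tokens is a list of strings on Hugging Face tokenizers;
    # getattr keeps the sketch tolerant of objects that lack the attribute.
    specials = getattr(tokenizer, "all_special_tokens", None) or []
    return close_tag in specials


def guarded_prefix(tokenizer, think_prefix: str) -> str:
    """Drop the prefix when the model's close tag would be stripped on decode."""
    if think_prefix and close_tag_is_special(tokenizer):
        return ""
    return think_prefix


# Illustrative stand-ins for tokenizers, not real model configurations.
plain_tok = SimpleNamespace(all_special_tokens=["<|im_end|>"])
special_tok = SimpleNamespace(all_special_tokens=["<|im_end|>", "</think>"])

kept = guarded_prefix(plain_tok, "<think>\n")
dropped = guarded_prefix(special_tok, "<think>\n")
```

Returning an empty prefix keeps the pre-fix plain-text behaviour for such tokenizers instead of producing an unclosed block.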
Thanks both, pushed edb516b to address the feedback.

Codex (special close tag): confirmed the mechanism. I checked with real tokenizers: when […]

Gemini (yield prefix early): done on all four paths (transformers text + vision, MLX text + VLM). The prefilled […]

Added unit tests for the guard (special vs non-special close tag, default/empty passthrough).
Problem
Running a reasoning model from safetensors (transformers or MLX backend) in Studio Chat shows the model's reasoning as plain text in the response instead of a collapsible thinking block. The same model as GGUF renders the block correctly.
Repro: load Qwen/Qwen3.6-35B-A3B from safetensors, send any message. The reply starts with the raw chain of thought ("Thinking Process: 1. Analyze the Input...") followed by the actual answer.
Root cause
Qwen3.6-style chat templates end the generation prompt with an open <think>\n tag so the model starts reasoning immediately.

The frontend builds thinking blocks by parsing literal <think>/</think> tags out of the streamed text (parse-assistant-content.ts). On the GGUF path this works because llama-server's reasoning parser accounts for the prompt prefill and returns reasoning_content, which we re-wrap in think tags. On the safetensors paths the streamers run with skip_prompt=True, and since the opening <think> is part of the prompt it is never emitted. The frontend receives bare reasoning text plus a stray </think> and cannot build the block.

Fix
- detect_think_prefill() in chat_template_helpers.py returns the trailing open <think> prefill of a rendered prompt, or an empty string. It ignores the closed <think>\n\n</think> prefill from enable_thinking=False and closed think blocks in prior turns (preserve_thinking).
- The transformers paths (inference.py) and the MLX text and VLM paths (mlx_inference.py) re-emit that prefix at the start of the generated stream. gpt-oss is untouched since HarmonyTextStreamer emits its own tags.
- _clean_generated_text no longer strips <think>/</think> for tokenizers that mark them as special tokens.

No frontend changes needed. The agentic tool loop already expects think tags in cumulative text (same protocol as the GGUF path).
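The detection described above could be sketched as follows. This is not the PR's implementation, which is not shown here; detect_think_prefill_sketch is a hypothetical name, and the sketch assumes a prefill counts as open only when <think> is followed by nothing but whitespace up to the end of the prompt. How the real helper treats partial content after the tag is not stated, so this sketch simply returns an empty string in that case.

```python
import re
from typing import Optional

# An open <think> tag followed only by whitespace at the very end of the prompt.
_TRAILING_OPEN_THINK = re.compile(r"<think>\s*\Z")


def detect_think_prefill_sketch(prompt: Optional[str]) -> str:
    """Return the trailing open <think> prefill of a rendered prompt, else ""."""
    if not prompt:
        return ""
    match = _TRAILING_OPEN_THINK.search(prompt)
    return match.group(0) if match else ""


# Illustrative prompts modelled on the cases named in the description.
open_prefill = detect_think_prefill_sketch("<|im_start|>assistant\n<think>\n")
closed_prefill = detect_think_prefill_sketch(
    "<|im_start|>assistant\n<think>\n\n</think>\n\n"
)
history_then_open = detect_think_prefill_sketch(
    "<think>\nold</think>\nanswer<|im_end|>\n<|im_start|>assistant\n<think>\n"
)
```

Anchoring the pattern at the end of the string is what lets closed blocks from earlier turns pass through unmatched.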
Testing
- studio/backend/tests/test_think_prefill_reemit.py (8 tests): open prefill, closed prefill, no prefill, historical think blocks with and without a fresh prefill, partial content after the tag, empty and None prompts.
- apply_chat_template_for_generation:
  - '<|im_start|>assistant\n<think>\n', re-emit prefix '<think>\n', frontend receives <think>\n...reasoning...</think>\n\nanswer and renders the block
  - '<think>\n\n</think>\n\n', no prefix, output untouched