Studio: render thinking blocks for safetensors inference with prefilled <think> templates by shimmyshimmer · Pull Request #6816 · unslothai/unsloth

shimmyshimmer · 2026-07-02T14:02:53Z

Problem

Running a reasoning model from safetensors (transformers or MLX backend) in Studio Chat shows the model's reasoning as plain text in the response instead of a collapsible thinking block. The same model as GGUF renders the block correctly.

Repro: load Qwen/Qwen3.6-35B-A3B from safetensors, send any message. The reply starts with the raw chain of thought ("Thinking Process: 1. Analyze the Input...") followed by the actual answer.

Root cause

Qwen3.6-style chat templates end the generation prompt with an open <think>\n tag so the model starts reasoning immediately:

{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
    {%- if enable_thinking is defined and enable_thinking is false %}
        {{- '<think>\n\n</think>\n\n' }}
    {%- else %}
        {{- '<think>\n' }}
    {%- endif %}
{%- endif %}

The frontend builds thinking blocks by parsing literal <think>/</think> tags out of the streamed text (parse-assistant-content.ts). On the GGUF path this works because llama-server's reasoning parser accounts for the prompt prefill and returns reasoning_content, which we re-wrap in think tags. On the safetensors paths the streamers run with skip_prompt=True, and since the opening <think> is part of the prompt it is never emitted. The frontend receives bare reasoning text plus a stray </think> and cannot build the block.
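To make the failure concrete, here is a minimal Python sketch of tag-based splitting. It is only an illustration: the real parser is TypeScript (parse-assistant-content.ts) and its code is not shown in this thread, so `split_reasoning` and its regex are hypothetical stand-ins that assume the parser keys off a literal opening tag.

```python
import re

# Hypothetical stand-in for the frontend parser (assumption, not the real code).
# An unclosed block runs to the end of the text.
THINK_BLOCK = re.compile(r"<think>(.*?)(?:</think>|\Z)", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) by parsing literal think tags."""
    match = THINK_BLOCK.search(text)
    if match is None:
        # No opening tag: everything is treated as visible answer text.
        return "", text
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


# GGUF-like stream: the opening tag is present, so the block can be built.
with_tag = "<think>\nAnalyze the input.\n</think>\n\nHello!"
# safetensors stream with skip_prompt=True: the opening tag stayed in the prompt.
without_tag = "Analyze the input.\n</think>\n\nHello!"

print(split_reasoning(with_tag))
print(split_reasoning(without_tag))
```

With the opening tag the reasoning and the answer separate cleanly; without it the whole reply, stray </think> included, ends up as answer text.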

Fix

- New detect_think_prefill() in chat_template_helpers.py returns the trailing open <think> prefill of a rendered prompt, or an empty string. It ignores the closed <think>\n\n</think> prefill from enable_thinking=False and closed think blocks in prior turns (preserve_thinking).
- The transformers text and vision streaming paths (inference.py) and the MLX text and VLM paths (mlx_inference.py) re-emit that prefix at the start of the generated stream. gpt-oss is untouched since HarmonyTextStreamer emits its own tags.
- _clean_generated_text no longer strips <think>/</think> for tokenizers that mark them as special tokens.

No frontend changes needed. The agentic tool loop already expects think tags in cumulative text (same protocol as the GGUF path).
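The detection rule described above can be sketched roughly as follows. This is not the helper from chat_template_helpers.py, whose body is not shown here; `detect_think_prefill_sketch` is a hypothetical name, and the handling of text that follows the open tag is an assumption of this sketch.

```python
OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def detect_think_prefill_sketch(prompt):
    """Return the trailing open <think> prefill of a rendered prompt, or ""."""
    if not prompt:
        return ""  # covers both None and the empty string
    start = prompt.rfind(OPEN_TAG)
    if start == -1:
        return ""  # no think tag anywhere in the prompt
    tail = prompt[start:]
    if CLOSE_TAG in tail:
        # The last block is closed: enable_thinking=False or a prior turn.
        return ""
    if tail[len(OPEN_TAG):].strip():
        # Assumption of this sketch: text after the tag means it is not a
        # bare prefill. The real helper may treat this case differently.
        return ""
    return tail


thinking_on = "<|im_start|>assistant\n<think>\n"
thinking_off = "<|im_start|>assistant\n<think>\n\n</think>\n\n"
print(repr(detect_think_prefill_sketch(thinking_on)))
print(repr(detect_think_prefill_sketch(thinking_off)))
```

Searching from the end of the prompt is what lets closed think blocks from earlier turns be ignored: only the last tag decides the result.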

Testing

Added studio/backend/tests/test_think_prefill_reemit.py (8 tests): open prefill, closed prefill, no prefill, historical think blocks with and without a fresh prefill, partial content after the tag, empty and None prompts.
Full backend suite: no new failures against main (the pre-existing failures on a macOS box without CUDA are identical before and after).
End-to-end with the real Qwen3.6-35B-A3B tokenizer through apply_chat_template_for_generation:
- thinking on: prompt ends '<|im_start|>assistant\n<think>\n', re-emit prefix '<think>\n', frontend receives <think>\n...reasoning...</think>\n\nanswer and renders the block
- thinking off: prompt ends '<think>\n\n</think>\n\n', no prefix, output untouched

…ed <think> templates

Reasoning templates like Qwen3.6 end the generation prompt with an open <think> tag. skip_prompt streaming drops it, so the frontend never sees the opening tag and shows reasoning as plain text. Detect the prefill and re-emit it at the start of the stream on the transformers and MLX paths. Also stop stripping think tags in _clean_generated_text when a tokenizer marks them special.

gemini-code-assist

Code Review

This pull request introduces a helper function detect_think_prefill to identify trailing open <think> tags in rendered prompts, which are often swallowed by skip_prompt during streaming. It updates the inference engines (including MLX) to re-emit this prefix at the start of the generated stream, ensuring the frontend can properly render thinking blocks. Unit tests are also added to verify this behavior. The reviewer feedback suggests yielding this <think> prefix immediately before the generation loop begins across all streaming paths. This would allow the frontend to render the collapsible thinking block during the prompt prefill phase, significantly improving perceived latency for the user.

gemini-code-assist · 2026-07-02T14:05:08Z

            thread.start()

-            output = ""
+            output = think_prefix


Yielding the prefilled think_prefix immediately before the generation loop starts is a great UX improvement. Currently, the generator yields nothing until the first token is generated. Since the prompt prefill/processing phase can take a few seconds (especially for larger models), yielding the <think> tag immediately allows the frontend to render the collapsible thinking block right away, significantly reducing perceived latency.

Suggested change

-            output = think_prefix
+            output = think_prefix
+            if think_prefix:
+                yield think_prefix

gemini-code-assist · 2026-07-02T14:05:08Z

            thread.start()

-            output = ""
+            output = think_prefix


Yielding the prefilled think_prefix immediately before the generation loop starts is a great UX improvement. Currently, the generator yields nothing until the first token is generated. Since the prompt prefill/processing phase can take a few seconds (especially for larger models), yielding the <think> tag immediately allows the frontend to render the collapsible thinking block right away, significantly reducing perceived latency.

Suggested change

-            output = think_prefix
+            output = think_prefix
+            if think_prefix:
+                yield think_prefix

gemini-code-assist · 2026-07-02T14:05:08Z

+
+        # An open <think> prefilled by the template lives in the prompt, not
+        # the generated tokens; re-emit it so the frontend renders the block.
+        think_prefix = detect_think_prefill(prompt)


Yielding the prefilled think_prefix immediately before the generation loop starts is a great UX improvement. Currently, the generator yields nothing until the first token is generated. Since the prompt prefill/processing phase can take a few seconds (especially for larger models), yielding the <think> tag immediately allows the frontend to render the collapsible thinking block right away, significantly reducing perceived latency.

        think_prefix = detect_think_prefill(prompt)
        if think_prefix:
            yield think_prefix

gemini-code-assist · 2026-07-02T14:05:08Z

+        from core.inference.chat_template_helpers import detect_think_prefill
+
+        # Re-emit an open <think> prefill from the prompt (see _generate_text).
+        cumulative = detect_think_prefill(prompt)


Yielding the prefilled cumulative (which contains the prefilled <think> tag) immediately before the generation loop starts is a great UX improvement. Currently, the generator yields nothing until the first token is generated. Since the prompt prefill/processing phase can take a few seconds (especially for larger models), yielding the <think> tag immediately allows the frontend to render the collapsible thinking block right away, significantly reducing perceived latency.

        cumulative = detect_think_prefill(prompt)
        if cumulative:
            yield cumulative

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1665762b38

chatgpt-codex-connector · 2026-07-02T14:08:52Z

+            # skip_prompt swallows an open <think> prefilled by the template;
+            # re-emit it so the frontend can render the thinking block.
+            # gpt-oss emits its own tags via HarmonyTextStreamer.
+            think_prefix = "" if self._is_gpt_oss_model() else detect_think_prefill(prompt)


Preserve closing think tags before adding the prefix

Please don't rely on _clean_generated_text to keep these tags after setting think_prefix. In the transformers paths the TextIteratorStreamer below is still constructed with skip_special_tokens=True (and the MLX text path also decodes with skip_special_tokens=True), so for any tokenizer that lists </think> as a special token the generated close is removed before the cleaner sees it. When the prompt prefill opens <think>, those streams become <think>...final answer with no close, so parseAssistantContent keeps the visible answer in the reasoning block. Instead, decode/stream with special tokens preserved and strip only non-think specials when a prefix is re-emitted.

Address review feedback:
- Guard: skip re-emitting the open <think> when the tokenizer marks </think> as a special token, since skip_special_tokens would strip the model's close tag and leave an unclosed block that swallows the answer. Falls back to plain text (pre-fix behaviour) for those tokenizers.
- Yield the prefilled <think> before the first token so the thinking block renders during prompt prefill instead of after the first generated token.
- Drop the now-unnecessary _clean_generated_text think-tag exemption; the guard handles the special-token case at the source.

No mainstream reasoning model (Qwen3.6, Qwen3, DeepSeek-R1, QwQ, GLM-4.6) marks think tags special, so behaviour is unchanged for them.

shimmyshimmer · 2026-07-03T13:05:25Z

Thanks both, pushed edb516b to address the feedback.

Codex (special close tag): confirmed the mechanism. I checked with real tokenizers: when <think>/</think> are added special tokens, skip_special_tokens=True strips both at decode time (<think>reason</think>answer -> reasonanswer), before _clean_generated_text runs, so relying on the cleaner could not work. Rather than switch every model to skip_special_tokens=False (a broad change I could not fully verify for streaming), I guard at the source: detect_think_prefill now takes the tokenizer's special-token list and returns "" when </think> is special, so we do not open a block whose close will be stripped. Those tokenizers fall back to the pre-fix plain-text behaviour (answer stays visible, never swallowed). The _clean_generated_text exemption is dropped since it is no longer needed. In practice no mainstream reasoning model marks think tags special (verified: Qwen3.6, Qwen3, DeepSeek-R1, QwQ-32B, GLM-4.6), so real models are unaffected.
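The guard can be pictured with a short sketch. The updated signature is not quoted in this thread, so the `special_tokens` parameter name, both function names, and the simplified detector below are assumptions for illustration; how the real code collects the tokenizer's special-token list is also not shown.

```python
def detect_open_think(prompt):
    """Simplified detector: a trailing '<think>' plus whitespace, else ''."""
    if not prompt:
        return ""
    start = prompt.rfind("<think>")
    if start == -1 or prompt[start + len("<think>"):].strip():
        # No tag, or something other than whitespace follows it
        # (for example a closing tag from a closed prefill).
        return ""
    return prompt[start:]


def guarded_think_prefill(prompt, special_tokens=None):
    """Return the open prefill unless '</think>' would be stripped on decode."""
    if special_tokens and "</think>" in special_tokens:
        # skip_special_tokens=True would drop the model's closing tag, so
        # opening a block here would leave it unclosed. Fall back to "".
        return ""
    return detect_open_think(prompt)


prompt = "<|im_start|>assistant\n<think>\n"
print(repr(guarded_think_prefill(prompt, ["<|im_end|>"])))
print(repr(guarded_think_prefill(prompt, ["<think>", "</think>"])))
```

Passing nothing, or an empty list, leaves the detection result unchanged, which corresponds to the default/empty passthrough case mentioned for the tests.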

Gemini (yield prefix early): done on all four paths (transformers text + vision, MLX text + VLM). The prefilled <think> is now yielded before the first token so the block renders during prompt prefill. The stream stays cumulative, so the consumer's delta = cumulative[len(prev):] reconstruction is unchanged.
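A rough sketch of why the early yield is safe for consumers, assuming the stream carries the full text so far on every chunk as described above. `stream_cumulative` and `to_deltas` are hypothetical names, and the token list stands in for the real streamer.

```python
from typing import Iterable, Iterator


def stream_cumulative(think_prefix: str, tokens: Iterable[str]) -> Iterator[str]:
    """Yield the growing output; the prefill goes out before any token."""
    output = think_prefix
    if think_prefix:
        # Lets the UI open the thinking block during prompt prefill.
        yield output
    for token in tokens:
        output += token
        yield output


def to_deltas(cumulative_chunks: Iterable[str]) -> Iterator[str]:
    """Consumer side: recover each increment from the cumulative text."""
    prev = ""
    for cumulative in cumulative_chunks:
        delta = cumulative[len(prev):]
        prev = cumulative
        if delta:
            yield delta


chunks = list(stream_cumulative("<think>\n", ["Plan.", "</think>\n\n", "Hi"]))
print(repr(chunks[0]))
print(list(to_deltas(chunks)))
```

Because each chunk is a prefix of the next, the extra first chunk simply becomes the first delta and nothing downstream needs to change.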

Added unit tests for the guard (special vs non-special close tag, default/empty passthrough).

shimmyshimmer requested a review from danielhanchen as a code owner July 2, 2026 14:02

gemini-code-assist Bot reviewed Jul 2, 2026

chatgpt-codex-connector Bot reviewed Jul 2, 2026

[pre-commit.ci] auto fixes from pre-commit.com hooks

766ceb3

for more information, see https://pre-commit.ci

shimmyshimmer wants to merge 3 commits into main from fix/studio-safetensors-think-prefill
