Studio: render thinking blocks for safetensors inference with prefilled <think> templates#6816

Open
shimmyshimmer wants to merge 3 commits into
mainfrom
fix/studio-safetensors-think-prefill

Conversation

@shimmyshimmer

Member

Problem

Running a reasoning model from safetensors (transformers or MLX backend) in Studio Chat shows the model's reasoning as plain text in the response instead of a collapsible thinking block. The same model as GGUF renders the block correctly.

Repro: load Qwen/Qwen3.6-35B-A3B from safetensors, send any message. The reply starts with the raw chain of thought ("Thinking Process: 1. Analyze the Input...") followed by the actual answer.

Root cause

Qwen3.6-style chat templates end the generation prompt with an open <think>\n tag so the model starts reasoning immediately:

{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
    {%- if enable_thinking is defined and enable_thinking is false %}
        {{- '<think>\n\n</think>\n\n' }}
    {%- else %}
        {{- '<think>\n' }}
    {%- endif %}
{%- endif %}

The frontend builds thinking blocks by parsing literal <think>/</think> tags out of the streamed text (parse-assistant-content.ts). On the GGUF path this works because llama-server's reasoning parser accounts for the prompt prefill and returns reasoning_content, which we re-wrap in think tags. On the safetensors paths the streamers run with skip_prompt=True, and since the opening <think> is part of the prompt it is never emitted. The frontend receives bare reasoning text plus a stray </think> and cannot build the block.
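To make the failure mode concrete, here is a minimal Python sketch of a tag-based splitter. It is not the code in parse-assistant-content.ts: `split_thinking` is a hypothetical stand-in that assumes the parser only opens a reasoning block when it sees a literal `<think>`, and the sample strings are made up for illustration.

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Hypothetical stand-in for a tag-based parser: returns (reasoning, answer)."""
    open_tag, close_tag = "<think>", "</think>"
    start = text.find(open_tag)
    if start == -1:
        # No opening tag: nothing is treated as reasoning, the text stays as-is.
        return "", text
    body_start = start + len(open_tag)
    end = text.find(close_tag, body_start)
    if end == -1:
        # Block still open (mid-stream): everything after the tag is reasoning.
        return text[body_start:].strip(), text[:start].strip()
    reasoning = text[body_start:end].strip()
    answer = (text[:start] + text[end + len(close_tag):]).strip()
    return reasoning, answer


# What the model generates after a prompt that already ends in "<think>\n".
generated = "Analyze the input...\n</think>\n\nHello!"

# skip_prompt=True: the prefilled "<think>\n" never reaches the stream.
print(split_thinking(generated))

# With the prefill re-emitted ahead of the generated text.
print(split_thinking("<think>\n" + generated))
```

In the first call the reasoning part is empty and the stray `</think>` stays in the visible text. In the second call the reasoning and the answer separate cleanly.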

Fix

  • New detect_think_prefill() in chat_template_helpers.py returns the trailing open <think> prefill of a rendered prompt, or an empty string. It ignores the closed <think>\n\n</think> prefill from enable_thinking=False and closed think blocks in prior turns (preserve_thinking).
  • The transformers text and vision streaming paths (inference.py) and the MLX text and VLM paths (mlx_inference.py) re-emit that prefix at the start of the generated stream. gpt-oss is untouched since HarmonyTextStreamer emits its own tags.
  • _clean_generated_text no longer strips <think>/</think> for tokenizers that mark them as special tokens.
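A rough sketch of the detection rule described in the first bullet, for readers who want to see the cases side by side. This is not the code in chat_template_helpers.py: the name `detect_think_prefill_sketch` is hypothetical, and returning an empty string when non-whitespace text follows the tag is an assumption of this sketch, not something the PR states.

```python
from typing import Optional

OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def detect_think_prefill_sketch(prompt: Optional[str]) -> str:
    """Return the trailing open <think> prefill of a rendered prompt, or ""."""
    if not prompt:
        # Covers both None and the empty string.
        return ""
    idx = prompt.rfind(OPEN_TAG)
    if idx == -1:
        return ""
    tail = prompt[idx + len(OPEN_TAG):]
    if CLOSE_TAG in tail:
        # The last <think> is already closed: either the enable_thinking=False
        # prefill or a think block preserved from an earlier turn.
        return ""
    if tail.strip():
        # Assumption of this sketch: text after the tag means the prompt does
        # not end in a bare prefill, so nothing is re-emitted.
        return ""
    return prompt[idx:]


open_prompt = "<|im_start|>assistant\n<think>\n"
closed_prompt = "<|im_start|>assistant\n<think>\n\n</think>\n\n"
history = "<|im_start|>assistant\n<think>\nold\n</think>\n\nHi<|im_end|>\n"

print(repr(detect_think_prefill_sketch(open_prompt)))
print(repr(detect_think_prefill_sketch(closed_prompt)))
print(repr(detect_think_prefill_sketch(history + open_prompt)))
```

Searching from the end with `rfind` is what lets closed blocks in prior turns be ignored: only the last opening tag matters, and it counts as a prefill only if nothing closes it.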

No frontend changes needed. The agentic tool loop already expects think tags in cumulative text (same protocol as the GGUF path).

Testing

  • Added studio/backend/tests/test_think_prefill_reemit.py (8 tests): open prefill, closed prefill, no prefill, historical think blocks with and without a fresh prefill, partial content after the tag, empty and None prompts.
  • Full backend suite: no new failures against main (the pre-existing failures on a macOS box without CUDA are identical before and after).
  • End-to-end with the real Qwen3.6-35B-A3B tokenizer through apply_chat_template_for_generation:
    • thinking on: prompt ends '<|im_start|>assistant\n<think>\n', re-emit prefix '<think>\n', frontend receives <think>\n...reasoning...</think>\n\nanswer and renders the block
    • thinking off: prompt ends '<think>\n\n</think>\n\n', no prefix, output untouched

…ed <think> templates

Reasoning templates like Qwen3.6 end the generation prompt with an open
<think> tag. skip_prompt streaming drops it, so the frontend never sees
the opening tag and shows reasoning as plain text. Detect the prefill
and re-emit it at the start of the stream on the transformers and MLX
paths. Also stop stripping think tags in _clean_generated_text when a
tokenizer marks them special.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a helper function detect_think_prefill to identify trailing open <think> tags in rendered prompts, which are often swallowed by skip_prompt during streaming. It updates the inference engines (including MLX) to re-emit this prefix at the start of the generated stream, ensuring the frontend can properly render thinking blocks. Unit tests are also added to verify this behavior. The reviewer feedback suggests yielding this <think> prefix immediately before the generation loop begins across all streaming paths. This would allow the frontend to render the collapsible thinking block during the prompt prefill phase, significantly improving perceived latency for the user.

  thread.start()

- output = ""
+ output = think_prefix

medium

Yielding the prefilled think_prefix immediately before the generation loop starts is a great UX improvement. Currently, the generator yields nothing until the first token is generated. Since the prompt prefill/processing phase can take a few seconds (especially for larger models), yielding the <think> tag immediately allows the frontend to render the collapsible thinking block right away, significantly reducing perceived latency.

Suggested change
- output = think_prefix
+ output = think_prefix
+ if think_prefix:
+     yield think_prefix

  thread.start()

- output = ""
+ output = think_prefix

medium

Yielding the prefilled think_prefix immediately before the generation loop starts is a great UX improvement. Currently, the generator yields nothing until the first token is generated. Since the prompt prefill/processing phase can take a few seconds (especially for larger models), yielding the <think> tag immediately allows the frontend to render the collapsible thinking block right away, significantly reducing perceived latency.

Suggested change
- output = think_prefix
+ output = think_prefix
+ if think_prefix:
+     yield think_prefix


# An open <think> prefilled by the template lives in the prompt, not
# the generated tokens; re-emit it so the frontend renders the block.
think_prefix = detect_think_prefill(prompt)

medium

Yielding the prefilled think_prefix immediately before the generation loop starts is a great UX improvement. Currently, the generator yields nothing until the first token is generated. Since the prompt prefill/processing phase can take a few seconds (especially for larger models), yielding the <think> tag immediately allows the frontend to render the collapsible thinking block right away, significantly reducing perceived latency.

        think_prefix = detect_think_prefill(prompt)
        if think_prefix:
            yield think_prefix

from core.inference.chat_template_helpers import detect_think_prefill

# Re-emit an open <think> prefill from the prompt (see _generate_text).
cumulative = detect_think_prefill(prompt)

medium

Yielding the prefilled cumulative (which contains the prefilled <think> tag) immediately before the generation loop starts is a great UX improvement. Currently, the generator yields nothing until the first token is generated. Since the prompt prefill/processing phase can take a few seconds (especially for larger models), yielding the <think> tag immediately allows the frontend to render the collapsible thinking block right away, significantly reducing perceived latency.

        cumulative = detect_think_prefill(prompt)
        if cumulative:
            yield cumulative

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1665762b38

# skip_prompt swallows an open <think> prefilled by the template;
# re-emit it so the frontend can render the thinking block.
# gpt-oss emits its own tags via HarmonyTextStreamer.
think_prefix = "" if self._is_gpt_oss_model() else detect_think_prefill(prompt)

[P2] Preserve closing think tags before adding the prefix

Please don't rely on _clean_generated_text to keep these tags after setting think_prefix. In the transformers paths the TextIteratorStreamer below is still constructed with skip_special_tokens=True (and the MLX text path also decodes with skip_special_tokens=True), so for any tokenizer that lists </think> as a special token the generated close tag is removed before the cleaner sees it. When the prompt prefill opens <think>, those streams become <think>...final answer with no close, so parseAssistantContent keeps the visible answer in the reasoning block. Instead, decode/stream with special tokens preserved and strip only non-think specials when a prefix is re-emitted.

Address review feedback:
- Guard: skip re-emitting the open <think> when the tokenizer marks </think>
  as a special token, since skip_special_tokens would strip the model's close
  tag and leave an unclosed block that swallows the answer. Falls back to
  plain text (pre-fix behaviour) for those tokenizers.
- Yield the prefilled <think> before the first token so the thinking block
  renders during prompt prefill instead of after the first generated token.
- Drop the now-unnecessary _clean_generated_text think-tag exemption; the
  guard handles the special-token case at the source.

No mainstream reasoning model (Qwen3.6, Qwen3, DeepSeek-R1, QwQ, GLM-4.6)
marks think tags special, so behaviour is unchanged for them.
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

@shimmyshimmer

Member Author

Thanks both, pushed edb516b to address the feedback.

Codex (special close tag): confirmed the mechanism. I checked with real tokenizers: when <think>/</think> are added special tokens, skip_special_tokens=True strips both at decode time (<think>reason</think>answer -> reasonanswer), before _clean_generated_text runs, so relying on the cleaner could not work. Rather than switch every model to skip_special_tokens=False (a broad change I could not fully verify for streaming), I guard at the source: detect_think_prefill now takes the tokenizer's special-token list and returns "" when </think> is special, so we do not open a block whose close will be stripped. Those tokenizers fall back to the pre-fix plain-text behaviour (answer stays visible, never swallowed). The _clean_generated_text exemption is dropped since it is no longer needed. In practice no mainstream reasoning model marks think tags special (verified: Qwen3.6, Qwen3, DeepSeek-R1, QwQ-32B, GLM-4.6), so real models are unaffected.
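As an illustration of why the guard sits at the source, here is a small pure-Python sketch. It does not call transformers: `strip_special` is a toy model of skip_special_tokens=True that works on strings rather than token ids, and `guarded_prefix` and its `special_tokens` parameter are hypothetical names, not the updated detect_think_prefill signature.

```python
from typing import Optional, Sequence


def strip_special(text: str, special_tokens: Sequence[str]) -> str:
    """Toy model of skip_special_tokens=True: drop every special-token string."""
    for tok in special_tokens:
        text = text.replace(tok, "")
    return text


def guarded_prefix(prefix: str, special_tokens: Optional[Sequence[str]] = None) -> str:
    """Return the prefix to re-emit, or "" when the close tag would be stripped."""
    if special_tokens and "</think>" in special_tokens:
        return ""
    return prefix


# Hypothetical tokenizer that registers the think tags as special tokens.
specials = ["<think>", "</think>", "<|im_end|>"]
decoded = strip_special("reason</think>answer", specials)

# Without the guard the block is opened but never closed.
unguarded = "<think>\n" + decoded
# With the guard the output falls back to plain text.
guarded = guarded_prefix("<think>\n", specials) + decoded

print(repr(unguarded))
print(repr(guarded))
```

With the guard, a tokenizer like this one produces the same plain-text output as before the PR, while tokenizers that do not mark `</think>` as special still get the prefix.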

Gemini (yield prefix early): done on all four paths (transformers text + vision, MLX text + VLM). The prefilled <think> is now yielded before the first token so the block renders during prompt prefill. The stream stays cumulative, so the consumer's delta = cumulative[len(prev):] reconstruction is unchanged.
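For reference, a self-contained sketch of the cumulative protocol with the early yield. The function names `stream_cumulative` and `to_deltas` are hypothetical and the token list is invented; only the `delta = cumulative[len(prev):]` reconstruction is taken from the description above.

```python
from typing import Iterable, Iterator


def stream_cumulative(think_prefix: str, tokens: Iterable[str]) -> Iterator[str]:
    """Yield cumulative snapshots, emitting the prefix before any token arrives."""
    output = think_prefix
    if think_prefix:
        # Early yield: the consumer can open the thinking block during prefill.
        yield output
    for tok in tokens:
        output += tok
        yield output


def to_deltas(snapshots: Iterable[str]) -> Iterator[str]:
    """Consumer side: rebuild per-chunk deltas from cumulative snapshots."""
    prev = ""
    for cumulative in snapshots:
        delta = cumulative[len(prev):]
        prev = cumulative
        if delta:
            yield delta


tokens = ["Plan", "\n</think>\n\n", "Hi"]
deltas = list(to_deltas(stream_cumulative("<think>\n", tokens)))
print(deltas)
```

Because every snapshot extends the previous one, the early prefix simply becomes the first delta, and the consumer needs no special case for it.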

Added unit tests for the guard (special vs non-special close tag, default/empty passthrough).
