studio: re-emit prefilled <think> opener on the safetensors stream (fixes #6815) by pjordanandrsn · Pull Request #6832 · unslothai/unsloth

pjordanandrsn · 2026-07-03T02:14:55Z

Fixes #6815 (Qwen3.6-A3B "thinking broken" on the safetensors path).

Root cause

Qwen3.6-style hybrid-thinking templates prefill the reasoning opener in the generation prompt —
the rendered prompt ends <|im_start|>assistant\n<think>\n (verified on the real
Qwen/Qwen3.6-35B-A3B template). The model's own output stream is therefore
reasoning…</think>answer with no opening tag, so Studio's tag-based stream splitting renders
all reasoning as answer text. The GGUF path is unaffected because llama-server handles the
prefilled-opener convention natively — which is why the issue only reproduces on safetensors.

Fix

In _generate_chat_response_inner: when the rendered prompt rstrip().endswith("<think>"), the
opener is carried as a prefix on every cumulative snapshot from generate_stream (the stream
protocol is cumulative -- consumers diff against the previous value; per the Codex review
finding, a one-shot yield would drop characters from the diff). This is consistent with the existing stream protocol (the
harmony parser already emits literal tags for gpt-oss).

Predicate verified on real templates:

template	prompt tail	predicate
Qwen3.6-35B-A3B, default	`…assistant\n<think>\n`	True (opener re-emitted)
Qwen3.6-35B-A3B, `enable_thinking=False`	`…<think>\n\n</think>\n\n`	False (closed empty block — untouched)
Qwen3-0.6B (older, no prefill)	`…assistant\n`	False (unchanged behavior)

Note

mlx_inference.py likely needs the same guard on its stream path — happy to extend this PR or
follow up separately, maintainer preference.

Qwen3.5/3.6-style hybrid-thinking templates prefill '<think>\n' in the generation prompt, so the model's output stream carries no opening tag and Studio's tag-based splitting rendered all reasoning as answer text (unslothai#6815). Yield the opener before generate_stream when the rendered prompt ends with it; enable_thinking=False renders a closed empty block and is untouched. GGUF path unaffected (llama-server owns that convention). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request updates the chat response generation in studio/backend/core/inference/inference.py to handle hybrid-thinking templates (like Qwen3.5/3.6) that prefill the reasoning opener. It yields a <think> tag if the formatted prompt ends with <think>, ensuring the reasoning stream is correctly parsed on the safetensors path. There are no review comments, so no feedback is provided.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b7d3dcdd6e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-07-03T02:18:25Z

+        # safetensors path (#6815). enable_thinking=False prompts end with an *empty*
+        # `<think>\n\n</think>` block, not the bare opener, so they are unaffected.
+        if formatted_prompt.rstrip().endswith("<think>"):
+            yield "<think>"


Keep the think opener in cumulative snapshots

When the template pre-fills a bare <think>, this yields one synthetic snapshot and then delegates to generate_stream, whose later snapshots start from only the model-generated text. The OpenAI/safetensors consumers treat every value from this generator as cumulative and diff against the previous snapshot (for example routes/inference.py slices cumulative[len(prev_text):]), so after prev_text is "<think>" the next reasoning snapshot loses its first characters and the final/non-streaming result still omits the opener. The synthetic prefix needs to be part of each subsequent cumulative value instead of emitted once separately.

Useful? React with 👍 / 👎.

Good catch — confirmed and fixed in e8db9ce. generate_stream yields cumulative snapshots (output += new_token), and the consumers diff with cumulative[len(prev_text):], so the one-shot opener made the next snapshot lose its first 7 characters and dropped the opener from the final value — a consumer-diff simulation reproduces exactly that ('<think>ing...'). The opener is now carried as a prefix on every snapshot: streamed reconstruction and final snapshot both come out as <think>reasoning...</think>answer, and the no-prefill path is byte-identical to before (empty prefix).

The stream protocol yields cumulative snapshots that consumers diff against the previous value (routes/inference.py: cumulative[len(prev_text):]). Yielding the opener once as its own snapshot made the next snapshot lose its first len('<think>') characters and dropped the opener from the final value. Prefix every generate_stream snapshot instead; the no-prefill path is unchanged (empty prefix). Addresses the Codex P1 review finding on unslothai#6832. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

pjordanandrsn mentioned this pull request Jul 3, 2026

[Bug] Qwen3.6-35B-A3B thinking broken in .safetensors #6815

Open

gemini-code-assist Bot reviewed Jul 3, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jul 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

studio: re-emit prefilled <think> opener on the safetensors stream (fixes #6815)#6832

studio: re-emit prefilled <think> opener on the safetensors stream (fixes #6815)#6832
pjordanandrsn wants to merge 2 commits into
unslothai:mainfrom
pjordanandrsn:fix/6815-think-prefill

pjordanandrsn commented Jul 3, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jul 3, 2026

Uh oh!

pjordanandrsn Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Uh oh!

Conversation

pjordanandrsn commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Root cause

Fix

Note

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

pjordanandrsn Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

pjordanandrsn commented Jul 3, 2026 •

edited

Loading