Skip to content

studio: re-emit prefilled <think> opener on the safetensors stream (fixes #6815)#6832

Open
pjordanandrsn wants to merge 2 commits into
unslothai:mainfrom
pjordanandrsn:fix/6815-think-prefill
Open

studio: re-emit prefilled <think> opener on the safetensors stream (fixes #6815)#6832
pjordanandrsn wants to merge 2 commits into
unslothai:mainfrom
pjordanandrsn:fix/6815-think-prefill

Conversation

@pjordanandrsn

@pjordanandrsn pjordanandrsn commented Jul 3, 2026

Copy link
Copy Markdown

Fixes #6815 (Qwen3.6-A3B "thinking broken" on the safetensors path).

Root cause

Qwen3.6-style hybrid-thinking templates prefill the reasoning opener in the generation prompt —
the rendered prompt ends <|im_start|>assistant\n<think>\n (verified on the real
Qwen/Qwen3.6-35B-A3B template). The model's own output stream is therefore
reasoning…</think>answer with no opening tag, so Studio's tag-based stream splitting renders
all reasoning as answer text. The GGUF path is unaffected because llama-server handles the
prefilled-opener convention natively — which is why the issue only reproduces on safetensors.

Fix

In _generate_chat_response_inner: when the rendered prompt rstrip().endswith("<think>"), the
opener is carried as a prefix on every cumulative snapshot from generate_stream (the stream
protocol is cumulative -- consumers diff against the previous value; per the Codex review
finding, a one-shot yield would drop characters from the diff). This is consistent with the existing stream protocol (the
harmony parser already emits literal tags for gpt-oss).

Predicate verified on real templates:

template prompt tail predicate
Qwen3.6-35B-A3B, default …assistant\n<think>\n True (opener re-emitted)
Qwen3.6-35B-A3B, enable_thinking=False …<think>\n\n</think>\n\n False (closed empty block — untouched)
Qwen3-0.6B (older, no prefill) …assistant\n False (unchanged behavior)

Note

mlx_inference.py likely needs the same guard on its stream path — happy to extend this PR or
follow up separately, maintainer preference.

Qwen3.5/3.6-style hybrid-thinking templates prefill '<think>\n' in the
generation prompt, so the model's output stream carries no opening tag and
Studio's tag-based splitting rendered all reasoning as answer text
(unslothai#6815). Yield the opener before generate_stream when the
rendered prompt ends with it; enable_thinking=False renders a closed empty
block and is untouched. GGUF path unaffected (llama-server owns that
convention).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the chat response generation in studio/backend/core/inference/inference.py to handle hybrid-thinking templates (like Qwen3.5/3.6) that prefill the reasoning opener. It yields a <think> tag if the formatted prompt ends with <think>, ensuring the reasoning stream is correctly parsed on the safetensors path. There are no review comments, so no feedback is provided.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b7d3dcdd6e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

# safetensors path (#6815). enable_thinking=False prompts end with an *empty*
# `<think>\n\n</think>` block, not the bare opener, so they are unaffected.
if formatted_prompt.rstrip().endswith("<think>"):
yield "<think>"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep the think opener in cumulative snapshots

When the template pre-fills a bare <think>, this yields one synthetic snapshot and then delegates to generate_stream, whose later snapshots start from only the model-generated text. The OpenAI/safetensors consumers treat every value from this generator as cumulative and diff against the previous snapshot (for example routes/inference.py slices cumulative[len(prev_text):]), so after prev_text is "<think>" the next reasoning snapshot loses its first characters and the final/non-streaming result still omits the opener. The synthetic prefix needs to be part of each subsequent cumulative value instead of emitted once separately.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — confirmed and fixed in e8db9ce. generate_stream yields cumulative snapshots (output += new_token), and the consumers diff with cumulative[len(prev_text):], so the one-shot opener made the next snapshot lose its first 7 characters and dropped the opener from the final value — a consumer-diff simulation reproduces exactly that ('<think>ing...'). The opener is now carried as a prefix on every snapshot: streamed reconstruction and final snapshot both come out as <think>reasoning...</think>answer, and the no-prefill path is byte-identical to before (empty prefix).

The stream protocol yields cumulative snapshots that consumers diff against
the previous value (routes/inference.py: cumulative[len(prev_text):]).
Yielding the opener once as its own snapshot made the next snapshot lose
its first len('<think>') characters and dropped the opener from the final
value. Prefix every generate_stream snapshot instead; the no-prefill path
is unchanged (empty prefix).

Addresses the Codex P1 review finding on unslothai#6832.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Qwen3.6-35B-A3B thinking broken in .safetensors

1 participant