Fix MLX notebook generate input/output compatibility by Lyxot · Pull Request #855 · unslothai/unsloth-zoo

Lyxot · 2026-07-03T15:55:05Z

Summary

This PR fixes the MLX-side compatibility gaps that prevented existing supported Unsloth notebooks from running their inference cells unchanged after importing Unsloth.

The changes are intentionally limited to the notebook-facing input/output contracts in the MLX loader:

return text and VLM generate(...) results as a batched generated-id sequence
prefer a torch.long tensor when torch is importable, with a NumPy int64 fallback when torch is unavailable
keep the returned sequence shaped as (1, prompt_length + generated_length) so existing notebook slicing and decode code works
make the patched mlx-lm TokenizerWrapper support apply_chat_template(..., return_dict=True) by returning a Hugging Face BatchEncoding
preserve the existing callable-tokenizer shim for mlx-lm wrappers that do not define __call__
add focused regression coverage for text generation, VLM generation, and chat-template return_dict=True expansion
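The output contract listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual helper: the name `build_generate_output_sketch` is hypothetical, and it assumes the token ids arrive as plain Python integers.

```python
def build_generate_output_sketch(prompt_ids, generated_ids):
    """Return token ids shaped (1, prompt_length + generated_length).

    Hypothetical sketch: prefers a torch.long tensor and falls back to a
    NumPy int64 array when torch cannot be imported.
    """
    # One batch row holding the prompt ids followed by the generated ids.
    sequence = [list(prompt_ids) + list(generated_ids)]
    try:
        import torch  # preferred container when importable
    except ImportError:
        import numpy as np  # fallback for torch-free installs

        return np.asarray(sequence, dtype=np.int64)
    return torch.tensor(sequence, dtype=torch.long)


outputs = build_generate_output_sketch([101, 7, 8], [42, 43])
prompt_length = 3
# Notebook-style slicing: drop the prompt, keep only the new tokens.
new_tokens = outputs[:, prompt_length:]
print(tuple(outputs.shape), new_tokens.tolist())  # (1, 5) [[42, 43]]
```

Both containers support `.shape`, two-dimensional slicing, and `.tolist()`, which is why the same notebook slicing code can work with either one.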

Why

Some existing notebooks use standard Hugging Face / CUDA-style inference patterns such as:

inputs = tokenizer.apply_chat_template(..., return_tensors="pt", return_dict=True).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=...)
response = tokenizer.batch_decode(outputs[:, inputs.input_ids.shape[-1]:])

On the MLX path, the model and tokenizer objects are backed by mlx-lm / mlx-vlm rather than Transformers. Two notebook-visible mismatches showed up:

generate(...) returned an MLX array. That array can be sliced, but Transformers utilities such as to_py_obj(...) do not recognize the mlx array type. Returning torch when available gives notebooks the CUDA-like generated-id container they expect, while the NumPy fallback keeps torch-free MLX installs usable.
mlx-lm's TokenizerWrapper.apply_chat_template(..., return_dict=True) did not consistently return a mapping-like object that can be moved with .to(...) and expanded into model.generate(**inputs).

This PR normalizes those two surfaces without changing notebook code.
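To illustrate the tokenizer-side contract, here is a small stand-in showing what "mapping-like, movable with `.to(...)`, and expandable with `**inputs`" requires. The class `EncodingSketch` and the function `fake_generate` are hypothetical names used only for this example. The PR itself returns a Hugging Face `BatchEncoding`, and the no-op `.to(...)` below is an assumption of the sketch rather than something the PR states.

```python
from collections import UserDict


class EncodingSketch(UserDict):
    """Hypothetical stand-in for a BatchEncoding-like mapping."""

    def __getattr__(self, name):
        # Allow inputs.input_ids in addition to inputs["input_ids"].
        data = self.__dict__.get("data", {})
        if name in data:
            return data[name]
        raise AttributeError(name)

    def to(self, device):
        # Assumption: nothing needs moving here, so return self for chaining.
        return self


def fake_generate(input_ids, attention_mask=None, max_new_tokens=2):
    # Placeholder for model.generate: echo the prompt, append dummy ids.
    return [input_ids[0] + [0] * max_new_tokens]


inputs = EncodingSketch(
    {"input_ids": [[5, 6, 7]], "attention_mask": [[1, 1, 1]]}
).to("cuda")
outputs = fake_generate(**inputs, max_new_tokens=2)
prompt_length = len(inputs.input_ids[0])
print(outputs[0][prompt_length:])  # [0, 0]
```

A plain dict would cover the `**inputs` expansion but not `.to(...)` or attribute access such as `inputs.input_ids`, which is the gap the notebook pattern runs into.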

Scope

This is deliberately not a broad Transformers generation compatibility layer.

Not included in this PR:

no GenerationConfig merge or precedence support
no HF logits_processor compatibility policy
no return_dict_in_generate output mode
no changes to notebook files

The goal is only to make the currently supported notebook inference patterns work unchanged on MLX while keeping the diff small and reviewable. CUDA/ROCm paths are unaffected because these changes are isolated to unsloth_zoo/mlx/loader.py.

Validation

Focused checks run locally:

PYTHONPATH=/Users/long/Github/unsloth/unsloth:/Users/long/Github/unsloth/worktrees/unsloth-zoo-mlx-inference-notebooks:$PYTHONPATH \
  python -m pytest tests/test_mlx_save_export_regressions.py -q

Result:

35 passed

Additional checks:

python -m py_compile unsloth_zoo/mlx/loader.py tests/test_mlx_save_export_regressions.py
git diff --check

Both passed.

A manual blocked-torch smoke check also verified that _mlx_generate_output(...) falls back to a NumPy int64 array when torch import is unavailable.
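One common way to simulate a missing torch for such a check is to place `None` in `sys.modules`, which makes any later `import torch` raise `ImportError`. The sketch below shows only that blocking technique. The helper `torch_is_importable` is a hypothetical name, and this is not necessarily how the manual check described above was performed.

```python
import importlib
import sys
from unittest import mock


def torch_is_importable():
    """Return True when `import torch` would succeed in this process."""
    try:
        importlib.import_module("torch")
    except ImportError:
        return False
    return True


# A None entry in sys.modules makes the import machinery raise
# ModuleNotFoundError (a subclass of ImportError) for that name.
with mock.patch.dict(sys.modules, {"torch": None}):
    blocked = torch_is_importable()

print(blocked)  # False
```

`mock.patch.dict` restores the original `sys.modules` contents on exit, so the block does not leak into other tests.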

The regression tests cover:

text generate(...) returns a batched torch generated-id tensor with working .shape, slicing, .tolist(), and transformers.to_py_obj(...)
VLM generate(...) returns the same notebook-friendly torch sequence shape
the MLX-on-torch regression module skips cleanly when torch is unavailable
apply_chat_template(..., return_dict=True) returns a BatchEncoding that supports .to(...) and model.generate(**inputs) expansion


Copilot

Pull request overview

Note

Copilot couldn't run its full agentic review because no GitHub Actions runner was available. Make sure your repository has a runner available to run Copilot's review, or add a copilot-setup-steps.yml file specifying one with the runs-on attribute. See the docs for more details.

Fixes MLX notebook-facing inference compatibility by normalizing generate(...) return types and improving mlx-lm TokenizerWrapper.apply_chat_template(..., return_dict=True) behavior to better match common Hugging Face notebook patterns.

Changes:

Add a shared _mlx_generate_output(...) helper and use it for both text and VLM generate(...) shims.
Patch mlx-lm TokenizerWrapper to (a) remain callable when needed and (b) return a BatchEncoding for apply_chat_template(..., return_dict=True).
Add regression tests for generation output shape/type and chat-template return_dict=True expansion.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
unsloth_zoo/mlx/loader.py	Normalizes MLX `generate(...)` outputs and patches mlx-lm tokenizer wrapper to return HF-like `BatchEncoding` for chat templates.
tests/test_mlx_save_export_regressions.py	Adds regression assertions for generate output behavior and a new test covering chat-template `return_dict=True` expansion.


+def _mlx_generate_output(prompt_ids, generated_ids):
+    """Build a Transformers-friendly batched generate return value."""
+    import numpy as np
+
+    return np.asarray([list(prompt_ids) + list(generated_ids)], dtype=np.int64)
+
+
 def _mlx_eos_token_id_set(eos_token_id):
    """Normalize HF-style eos_token_id values into a set of token ids."""




 @pytest.fixture(autouse=True, scope="module")
 def _install_mlx_torch_shim():
    from mlx_simulation import simulate_mlx_on_torch



fix(mlx): support notebook generate outputs

0ed77ee

Copilot AI review requested due to automatic review settings July 3, 2026 15:55

fix(mlx): prefer torch generate outputs

7119f19

Copilot AI reviewed Jul 3, 2026

Copilot started reviewing on behalf of Lyxot July 3, 2026 20:08

Lyxot wants to merge 2 commits into unslothai:main from Lyxot:fix/mlx-inference-notebook-compat
