Fix MLX notebook generate input/output compatibility#855
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
Warning Gemini encountered an error creating the review. You can try again by commenting |
There was a problem hiding this comment.
Pull request overview
Note
Copilot couldn't run its full agentic review because no GitHub Actions runner was available. Make sure your repository has a runner available to run Copilot's review, or add a copilot-setup-steps.yml file specifying one with the runs-on attribute. See the docs for more details.
Fixes MLX notebook-facing inference compatibility by normalizing generate(...) return types and improving mlx-lm TokenizerWrapper.apply_chat_template(..., return_dict=True) behavior to better match common Hugging Face notebook patterns.
Changes:
- Add a shared
_mlx_generate_output(...)helper and use it for both text and VLMgenerate(...)shims. - Patch mlx-lm
TokenizerWrapperto (a) remain callable when needed and (b) return aBatchEncodingforapply_chat_template(..., return_dict=True). - Add regression tests for generation output shape/type and chat-template
return_dict=Trueexpansion.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| unsloth_zoo/mlx/loader.py | Normalizes MLX generate(...) outputs and patches mlx-lm tokenizer wrapper to return HF-like BatchEncoding for chat templates. |
| tests/test_mlx_save_export_regressions.py | Adds regression assertions for generate output behavior and a new test covering chat-template return_dict=True expansion. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def _mlx_generate_output(prompt_ids, generated_ids): | ||
| """Build a Transformers-friendly batched generate return value.""" | ||
| import numpy as np | ||
|
|
||
| return np.asarray([list(prompt_ids) + list(generated_ids)], dtype=np.int64) | ||
|
|
||
|
|
||
| def _mlx_eos_token_id_set(eos_token_id): | ||
| """Normalize HF-style eos_token_id values into a set of token ids.""" |
|
|
||
| return np.asarray([list(prompt_ids) + list(generated_ids)], dtype=np.int64) | ||
|
|
||
|
|
||
| def _mlx_eos_token_id_set(eos_token_id): | ||
| """Normalize HF-style eos_token_id values into a set of token ids.""" |
| @pytest.fixture(autouse=True, scope="module") | ||
| def _install_mlx_torch_shim(): | ||
| from mlx_simulation import simulate_mlx_on_torch | ||
|
|
Summary
This PR fixes the MLX-side compatibility gaps that prevented existing supported Unsloth notebooks from running their inference cells unchanged after importing Unsloth.
The changes are intentionally limited to the notebook-facing input/output contracts in the MLX loader:
generate(...)results as a batched generated-id sequencetorch.longtensor when torch is importable, with a NumPyint64fallback when torch is unavailable(1, prompt_length + generated_length)so existing notebook slicing and decode code worksTokenizerWrappersupportapply_chat_template(..., return_dict=True)by returning a Hugging FaceBatchEncoding__call__return_dict=TrueexpansionWhy
Some existing notebooks use standard Hugging Face / CUDA-style inference patterns such as:
On the MLX path, the model and tokenizer objects are backed by mlx-lm / mlx-vlm rather than Transformers. Two notebook-visible mismatches showed up:
generate(...)returned an MLX array. That array can be sliced, but Transformers utilities such asto_py_obj(...)do not recognize themlxarray type. Returning torch when available gives notebooks the CUDA-like generated-id container they expect, while the NumPy fallback keeps torch-free MLX installs usable.TokenizerWrapper.apply_chat_template(..., return_dict=True)did not consistently return a mapping-like object that can be moved with.to(...)and expanded intomodel.generate(**inputs).This PR normalizes those two surfaces without changing notebook code.
Scope
This is deliberately not a broad Transformers generation compatibility layer.
Not included in this PR:
GenerationConfigmerge or precedence supportlogits_processorcompatibility policyreturn_dict_in_generateoutput modeThe goal is only to make the currently supported notebook inference patterns work unchanged on MLX while keeping the diff small and reviewable. CUDA/ROCm paths are unaffected because these changes are isolated to
unsloth_zoo/mlx/loader.py.Validation
Focused checks run locally:
PYTHONPATH=/Users/long/Github/unsloth/unsloth:/Users/long/Github/unsloth/worktrees/unsloth-zoo-mlx-inference-notebooks:$PYTHONPATH \ python -m pytest tests/test_mlx_save_export_regressions.py -qResult:
Additional checks:
Both passed.
A manual blocked-torch smoke check also verified that
_mlx_generate_output(...)falls back to a NumPyint64array when torch import is unavailable.The regression tests cover:
generate(...)returns a batched torch generated-id tensor with working.shape, slicing,.tolist(), andtransformers.to_py_obj(...)generate(...)returns the same notebook-friendly torch sequence shapeapply_chat_template(..., return_dict=True)returns aBatchEncodingthat supports.to(...)andmodel.generate(**inputs)expansion