
adding Support for Dream Architecture #270

Draft
Goekdeniz-Guelmez wants to merge 15 commits into ml-explore:main from Goekdeniz-Guelmez:adding-DiffuCoder

Conversation

@Goekdeniz-Guelmez
Contributor

No description provided.

@Goekdeniz-Guelmez marked this pull request as draft July 3, 2025 08:26
@Goekdeniz-Guelmez
Contributor Author

python -m mlx_lm.generate --model /Users/gokdenizgulmez/Desktop/dream_grpo-4bit --prompt "write quick sort in c++"
<frozen runpy>:128: RuntimeWarning: 'mlx_lm.generate' found in sys.modules after import of package 'mlx_lm', but prior to execution of 'mlx_lm.generate'; this may result in unpredictable behaviour
Calling `python -m mlx_lm.generate...` directly is deprecated. Use `mlx_lm.generate...` or `python -m mlx_lm generate ...` instead.
==========
#include <iostream>
#include <vector>

using namespace std;

// Function to perform quick sort on a vector of integers
void quickSort(vector<int>& arr, int low, int high) {
    int pi = partition(arr, low, high);
    // Recursively sort elements before and after partition
    quickSort(arr, low, pi - 1);
    quickSort(arr, pi + 1, high);
}

// Function to partition the array
int partition(vector<int>& arr,
==========
Prompt: 25 tokens, 46.264 tokens-per-sec
Generation: 100 tokens, 12.091 tokens-per-sec
Peak memory: 4.365 GB

@Goekdeniz-Guelmez changed the title from "adding Support for Apples DiffuCoder Dream Architecture" to "adding Support for Dream Architecture" Jul 3, 2025
@Goekdeniz-Guelmez
Contributor Author

python -m mlx_lm.lora \
--model /Users/gokdenizgulmez/Desktop/dream_grpo-4bit \
--train \
--data /Users/gokdenizgulmez/Library/Mobile\ Documents/com\~apple\~CloudDocs/Datastes/MLX/data_smoll \
--fine-tune-type lora \
--num-layers 2 \
--batch-size 1 \
--iters 5 \
--val-batches 1 \
--steps-per-report 1 \
--steps-per-eval 5 \
--adapter-path /Users/gokdenizgulmez/Library/Mobile\ Documents/com\~apple\~CloudDocs/Datastes/MLX/test_dream \
--save-every 500 \
--max-seq-length 128 \
--grad-checkpoint
Calling `python -m mlx_lm.lora...` directly is deprecated. Use `mlx_lm.lora...` or `python -m mlx_lm lora ...` instead.
Loading pretrained model
The repository for /Users/gokdenizgulmez/Desktop/dream_grpo-4bit contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co//Users/gokdenizgulmez/Desktop/dream_grpo-4bit.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y
Loading datasets
Training
Trainable parameters: 0.002% (0.180M/7615.617M)
Starting training..., iters: 5
[WARNING] Some sequences are longer than 128 tokens. The longest sentence 1263 will be truncated to 128. Consider pre-splitting your data to save memory.
Calculating loss...:   0%|                                                                              | 0/1 [00:00<?, ?it/s][WARNING] Some sequences are longer than 128 tokens. The longest sentence 2047 will be truncated to 128. Consider pre-splitting your data to save memory.
Calculating loss...: 100%|██████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.91s/it]
Iter 1: Val loss 5.289, Val took 1.937s
Iter 1: Train loss 6.134, Learning Rate 1.000e-05, It/sec 0.173, Tokens/sec 21.929, Trained Tokens 127, Peak mem 4.569 GB
[WARNING] Some sequences are longer than 128 tokens. The longest sentence 605 will be truncated to 128. Consider pre-splitting your data to save memory.
Iter 2: Train loss 5.653, Learning Rate 1.000e-05, It/sec 0.471, Tokens/sec 59.792, Trained Tokens 254, Peak mem 4.670 GB
[WARNING] Some sequences are longer than 128 tokens. The longest sentence 2035 will be truncated to 128. Consider pre-splitting your data to save memory.
Iter 3: Train loss 5.062, Learning Rate 1.000e-05, It/sec 0.486, Tokens/sec 61.665, Trained Tokens 381, Peak mem 4.670 GB
[WARNING] Some sequences are longer than 128 tokens. The longest sentence 1806 will be truncated to 128. Consider pre-splitting your data to save memory.
Iter 4: Train loss 5.426, Learning Rate 1.000e-05, It/sec 0.482, Tokens/sec 61.209, Trained Tokens 508, Peak mem 4.670 GB
[WARNING] Some sequences are longer than 128 tokens. The longest sentence 1607 will be truncated to 128. Consider pre-splitting your data to save memory.
Calculating loss...:   0%|                                                                              | 0/1 [00:00<?, ?it/s][WARNING] Some sequences are longer than 128 tokens. The longest sentence 1171 will be truncated to 128. Consider pre-splitting your data to save memory.
Calculating loss...: 100%|██████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.77s/it]
Iter 5: Val loss 5.226, Val took 1.789s
Iter 5: Train loss 4.596, Learning Rate 1.000e-05, It/sec 0.486, Tokens/sec 61.702, Trained Tokens 635, Peak mem 4.670 GB
Saved final weights to /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test_dream/adapters.safetensors.

@Goekdeniz-Guelmez marked this pull request as ready for review July 3, 2025 09:17
@awni
Member

awni commented Jul 3, 2025

This is a diffusion model, right? I'm not sure it makes sense to do autoregressive decoding with it?

@Goekdeniz-Guelmez
Contributor Author

Yes, I started implementing the diffusion generation; that run was just to test the model implementation. Or would it be better to wait until LLaDA has been merged before continuing?

@awni
Member

awni commented Jul 3, 2025

Yes, I started implementing the diffusion generation; that run was just to test the model implementation. Or would it be better to wait until LLaDA has been merged before continuing?

There's no need to wait for that. My recommendation for diffusion models, though, is to write a new generate_step (and possibly stream_generate / generate). They are so different that I think we should have a separate path entirely, both to avoid cluttering the code and to keep them easy to change as the diffusion model APIs converge. The models can of course still go in models/.
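
For context, a minimal sketch of what such a separate diffusion path could look like, using iterative confidence-based unmasking in the style of Dream/LLaDA decoding. Everything below is hypothetical: diffusion_generate_step, mask_token_id, and num_steps are illustrative names, not part of mlx-lm's current API, and the model is assumed to return per-position logits over the full sequence.

import mlx.core as mx

def diffusion_generate_step(prompt, model, *, mask_token_id, max_tokens=128, num_steps=16):
    # The prompt stays fixed; the completion starts out fully masked.
    completion = mx.full((max_tokens,), mask_token_id, dtype=prompt.dtype)
    tokens = mx.concatenate([prompt, completion])
    prompt_len = prompt.shape[0]

    for step in range(num_steps):
        logits = model(tokens[None])[0]           # (seq_len, vocab_size)
        probs = mx.softmax(logits, axis=-1)
        predictions = mx.argmax(probs, axis=-1)
        confidence = mx.max(probs, axis=-1)

        masked = tokens == mask_token_id
        num_masked = int(masked.sum().item())
        if num_masked == 0:
            break

        # Unmask a fraction of the remaining masked positions each step,
        # taking the most confident predictions first.
        k = max(1, num_masked // (num_steps - step))
        masked_confidence = mx.where(masked, confidence, -1.0)
        order = mx.argsort(-masked_confidence)    # descending confidence
        to_unmask = order[:k]
        tokens[to_unmask] = predictions[to_unmask].astype(tokens.dtype)

        yield tokens[prompt_len:], step           # intermediate completion

Keeping this as its own generator (mirroring the shape of generate_step but not its signature) would make it easy to add a stream_generate-style wrapper later without touching the autoregressive code path.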

@Goekdeniz-Guelmez
Contributor Author

Yes, that's what I thought too. Maybe even a different terminal command like mlx_lm.generate.diffusion ...? What are your thoughts?

@awni
Member

awni commented Jul 3, 2025

Yes, that's what I thought too. Maybe even a different terminal command like mlx_lm.generate.diffusion ...? What are your thoughts?

Yes, possibly. How about starting with the API, and we can see what makes sense for the CLI based on what it looks like?
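
Purely as a starting point for that discussion, one possible shape for a dedicated entry point kept separate from mlx_lm.generate. The module name and every flag below are hypothetical placeholders, not an agreed interface.

import argparse

def main():
    parser = argparse.ArgumentParser(description="Text generation with diffusion LMs (hypothetical CLI)")
    parser.add_argument("--model", required=True, help="Local path or Hugging Face repo")
    parser.add_argument("--prompt", required=True, help="Prompt text")
    parser.add_argument("--max-tokens", type=int, default=128)
    parser.add_argument("--num-steps", type=int, default=16, help="Number of denoising / unmasking steps")
    args = parser.parse_args()
    # The body would call the diffusion API once it exists, e.g. a
    # diffusion_generate(model, tokenizer, args.prompt, ...) helper (hypothetical).
    print(args)

if __name__ == "__main__":
    main()

Whether this ends up exposed as mlx_lm.generate.diffusion, a flag on the existing command, or a standalone script can follow from how the Python API settles, as discussed above.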

@Goekdeniz-Guelmez
Contributor Author

Sure, that's a good plan. Once I get this running, I'll ping you.

@ccckblaze

Any updates?

@Goekdeniz-Guelmez
Contributor Author

Not at the moment. Diffusion-based text-to-text is generally slower on Apple Silicon compared to token-wise autoregression, and I haven’t seen adoption of this model on other platforms yet. A separate port might be more suitable in my opinion. That said, if there’s genuine interest, I’m happy to continue and prioritize it. What do you think, @angeloskath @awni?

@awni
Member

awni commented Aug 18, 2025

I agree so far there isn't a ton of interest for that specific model. I think it's fine to deprioritize this for now. I do think we should keep an eye on diffusion LLMs in general. As they improve it may make more sense to support them (either here or elsewhere), but we haven't reached that point yet and I haven't seen a ton of progress recently.

@Goekdeniz-Guelmez
Contributor Author

I agree so far there isn't a ton of interest for that specific model. I think it's fine to deprioritize this for now. I do think we should keep an eye on diffusion LLMs in general. As they improve it may make more sense to support them (either here or elsewhere), but we haven't reached that point yet and I haven't seen a ton of progress recently.

Agreed! I’ll leave the PR open but move the implementation into here, so it’s easier to maintain and revisit once they gain more traction. That way, we don’t lose the work, and we’re ready if adoption picks up.
