I've made a fused Qwen3 MoE layer for faster fine-tuning #2890
Replies: 2 comments 9 replies
Hey @woct0rdho! Nice work! We haven't started integrating the MoE kernels yet. We first wanted to compartmentalize things, which is why there is some code for MoE kernels that isn't enabled yet. I took a look at your repo - fantastic work! Would you be interested in making a PR? In fact, are you interested in joining Unsloth full time / part time to work on this? :)
Hi @danielhanchen, just came across this thread -- is the integration of the MoE kernels still in progress? It would be a game-changer for getting performant fine-tuning on the latest and greatest MoE models :)
https://github.com/woct0rdho/transformers-qwen3-moe-fused
A few months ago there was a PR to introduce the fused MoE kernels (#2465), but if I understand correctly, they are not actually used when we fine-tune an MoE model in Unsloth. So I set out to actually use them while staying compatible with the HF Transformers ecosystem.
I now provide an example of fine-tuning the fused Qwen3-30B-A3B with LoRA and 4-bit quantization. On a single GPU with 24GB VRAM, it reaches 100% GPU usage and a 5x speedup compared to the unfused model. Unsloth optimizations such as fast attention, fast LoRA (on the non-MoE linear layers), RMSNorm, and gradient checkpointing can be applied automatically.
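For readers unfamiliar with the term, here is a rough conceptual sketch of the difference between an unfused and a fused-style MoE layer. It is not the code from the repo: it assumes top-1 routing, uses plain Python lists instead of GPU tensors, and every name in it (`matvec`, `moe_unfused`, `moe_grouped`) is made up for illustration. The idea it shows is that sorting tokens by expert makes each expert's tokens contiguous, which is what lets a grouped kernel process all experts in one launch instead of looping over them.

```python
# Conceptual sketch only: top-1 routing, plain Python lists, no GPU kernels.
# All names here are hypothetical and not taken from the repository.

def matvec(weight, x):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def moe_unfused(tokens, expert_ids, expert_weights):
    """Loop over experts; each expert scans all tokens for the ones routed to it."""
    out = [None] * len(tokens)
    for e, weight in enumerate(expert_weights):
        for i, x in enumerate(tokens):
            if expert_ids[i] == e:
                out[i] = matvec(weight, x)
    return out


def moe_grouped(tokens, expert_ids, expert_weights):
    """Sort tokens by expert so each expert processes one contiguous segment."""
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
    # counts[e] = number of tokens routed to expert e
    counts = [0] * len(expert_weights)
    for e in expert_ids:
        counts[e] += 1
    out = [None] * len(tokens)
    start = 0
    for e, n in enumerate(counts):
        # A fused kernel would handle all segments in a single launch;
        # here the segments are simply processed one after another.
        for i in order[start:start + n]:
            out[i] = matvec(expert_weights[e], tokens[i])
        start += n
    return out


tokens = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 4.0]]
expert_ids = [1, 0, 1, 0]  # which expert each token is routed to
expert_weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: scale by 2
]

# Both layouts must produce the same per-token outputs.
assert moe_unfused(tokens, expert_ids, expert_weights) == moe_grouped(
    tokens, expert_ids, expert_weights
)
```

In a real implementation the per-segment loop would be replaced by a grouped GEMM on the GPU, and top-k routing would duplicate each token once per selected expert before sorting.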
There is still room for further optimization, such as supporting fast LoRA on the MoE layer. (Update: This is done!)
Do you have any idea how this can be integrated into Unsloth? I guess the MoE kernels can get some visibility only if we enable them by default.