This issue tracks follow-up enhancements after initial support for the Deepseek V3 model. Please feel free to chime in and contribute!

- [x] Follow-up #11523: enhance testing with shapes of production models and run it regularly on H100.
  - Solving via CUTLASS blockwise quantization kernels.
- [x] Follow-up #11502:
  - [x] Test and enable torch.compile (see the sketch after this list)
  - [ ] ~Refactor MoEMethodBase to unify and clean up the extra arguments of `scoring_func` and `e_correction_bias`~
- [x] Kernel tuning for 8xH200, MI300X, H100 (TP16 and TP8PP2 cases)
  - Use https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_moe.py, but adapt it for the w8a8 fused MoE kernel.
- [x] CUDA Graph support
- [x] MLA #10927 @simon-mo
- [ ] Support nextn prediction heads ([EAGLE](https://arxiv.org/abs/2401.15077)-style prediction heads)
  - Original PR for EAGLE support: #6830; perf: #9565; discussion: #11126; docs: #11417
- [ ] Support expert parallelism for MoE.
- [ ] Support data parallelism for MLA.
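For the torch.compile item above, here is a minimal, self-contained sketch of what enabling compilation on a module looks like in plain PyTorch. It is illustrative only and does not reflect vLLM's actual integration; the `Mlp` module, its names, and its sizes are invented for the example.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy module standing in for a transformer MLP block; the
# names and sizes here are made up for illustration, not taken from vLLM.
class Mlp(torch.nn.Module):
    def __init__(self, hidden: int = 128) -> None:
        super().__init__()
        self.up = torch.nn.Linear(hidden, 4 * hidden)
        self.down = torch.nn.Linear(4 * hidden, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU activation between the up- and down-projections.
        return self.down(F.silu(self.up(x)))

mlp = Mlp()
# torch.compile traces the module and JIT-compiles fused kernels.
# "reduce-overhead" additionally captures CUDA graphs where possible to
# cut kernel-launch overhead, which ties into the CUDA Graph item above.
compiled_mlp = torch.compile(mlp, mode="reduce-overhead")
out = compiled_mlp(torch.randn(2, 128))
print(out.shape)  # torch.Size([2, 128])
```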