Motivation
While hybrid CPU/GPU inference alleviates memory constraints by leveraging CPU DRAM capacity alongside GPU VRAM bandwidth, achieving high throughput remains challenging due to synchronization overheads and limited CPU compute efficiency. This PR upstreams the KTransformers approach (SOSP ’25), enabling GPU Tensor Parallelism + CPU/GPU Hybrid Expert Parallelism for MoE models—supporting hybrid prefill and decode that use AMX-optimized CPU kernels (kt-kernel) together with GPUs. With this design, dense layers benefit from high-throughput multi-GPU execution, while experts are flexibly scheduled across both CPUs and GPUs, maximizing hardware utilization and reducing bottlenecks.
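The hybrid expert-parallel idea above can be sketched as follows. This is a minimal illustrative model, not the PR's actual code: the expert set, the `GPU_EXPERTS` pinning, and the helper names are assumptions, and both "devices" here are plain NumPy matmuls standing in for CUDA and AMX kernels.

```python
import numpy as np

NUM_EXPERTS = 8
GPU_EXPERTS = {0, 1}   # experts pinned in VRAM (illustrative assumption)
HIDDEN = 16

rng = np.random.default_rng(0)
# one tiny weight matrix per expert; real MoE experts are gated FFNs
W = [rng.standard_normal((HIDDEN, HIDDEN)) for _ in range(NUM_EXPERTS)]

def run_expert(x, eid):
    # In the real system this dispatches to a CUDA kernel (GPU-resident
    # experts) or an AMX-optimized CPU kernel; here both are matmuls.
    return x @ W[eid]

def moe_forward(x, router_logits, top_k=2):
    topk = np.argsort(router_logits)[-top_k:]   # top-k routed experts
    gates = np.exp(router_logits[topk])
    gates /= gates.sum()                        # softmax over selected experts
    out = np.zeros_like(x)
    for eid, g in zip(topk, gates):
        # device choice is per-expert, so CPU and GPU work can overlap
        device = "gpu" if eid in GPU_EXPERTS else "cpu"
        out += g * run_expert(x, eid)
    return out

x = rng.standard_normal(HIDDEN)
y = moe_forward(x, rng.standard_normal(NUM_EXPERTS))
```

The key property the sketch shows is that expert placement is a per-expert scheduling decision, so dense layers and GPU-resident experts keep the GPUs busy while the remaining experts run on CPU.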
KTransformers will be incorporated into SGLang as a library backend. Building on this backend, SGLang generalizes the design to support multi-GPU tensor parallelism and CPU/GPU hybrid expert parallelism, while broadening coverage to additional models and weight formats.
Benchmark Results (Preview)
These are preliminary single-GPU results for KTransformers; more detailed benchmark data will be provided in follow-up updates. The figures below show the throughput of KTransformers on a dual-socket server with Intel® Xeon® Platinum 8452Y CPUs (36 cores × 2, 1 TB DDR5), equipped with an NVIDIA A100 (40 GB) for full-precision models and an NVIDIA RTX 4080 (16 GB) for quantized models. We evaluate on DeepSeek-V3-0324 (DS-3), DeepSeek-V2.5-1210 (DS-2), and Qwen2-57B-A14B (QW-2), comparing KTransformers against Llama.cpp and Fiddler across both the prefill and decode phases.
In the prefill phase, KTransformers consistently outperforms both baselines across all prompt lengths. While Llama.cpp shows advantages in short-prompt scenarios through aggressive operator fusion, and Fiddler benefits from AMX acceleration for long prompts, KTransformers surpasses both by leveraging AMX-optimized CPU kernels and improved CPU/GPU coordination. For example, our CPU MoE kernel achieves 21.3 TFLOPS on DS-3, a 3.98× improvement over the PyTorch baseline.
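For context on the TFLOPS figure above, kernel throughput for a GEMM follows from its FLOP count divided by wall time. The sketch below is a back-of-envelope helper; the matrix shapes and timing in the example are illustrative, not re-measured numbers from the paper.

```python
def gemm_tflops(m, n, k, seconds):
    """Throughput of an (m x k) @ (k x n) GEMM in TFLOPS."""
    flops = 2 * m * n * k        # each multiply-add counts as 2 FLOPs
    return flops / seconds / 1e12

# e.g. 4096 tokens through a hypothetical 7168x2048 expert projection
# taking 12 ms on the CPU:
t = gemm_tflops(4096, 2048, 7168, 12e-3)   # ≈ 10.0 TFLOPS
```

Reaching the reported 21.3 TFLOPS on DS-3 therefore requires the AMX kernel to sustain roughly 4× the arithmetic rate of the PyTorch CPU baseline on the same shapes, which matches the stated 3.98× improvement.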
In the decode phase, KTransformers (without Expert Deferral) achieves 2.42×–4.09× speedups over Fiddler and 1.25×–1.76× over Llama.cpp on full-precision models. With quantized models, the gains are even larger (1.77×–1.93× vs. Llama.cpp), primarily due to reduced kernel execution time and our efficient CUDA Graph-based scheduling, which reduces GPU launch overhead from over 20% to nearly zero.
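The launch-overhead claim can be illustrated with a toy cost model: eager decode pays a CPU-side launch cost per kernel, while replaying a pre-captured CUDA Graph pays a single launch for the whole step. All numbers below (kernel count, per-kernel time, launch latency) are illustrative assumptions, not measurements from this PR.

```python
def step_time_us(kernels, t_kernel_us, t_launch_us, use_graph):
    """Decode-step time: compute plus launch overhead (toy model)."""
    # A captured graph replays with one launch; eager mode launches
    # every kernel individually from the CPU.
    launch = t_launch_us if use_graph else kernels * t_launch_us
    return kernels * t_kernel_us + launch

eager = step_time_us(kernels=600, t_kernel_us=5.0, t_launch_us=1.5,
                     use_graph=False)
graph = step_time_us(kernels=600, t_kernel_us=5.0, t_launch_us=1.5,
                     use_graph=True)
# fraction of the eager step spent launching rather than computing
overhead_eager = 1 - (600 * 5.0) / eager
```

With these assumed numbers, launch overhead is about 23% of the eager step and essentially zero with graph replay, consistent with the "over 20% to nearly zero" reduction described above.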
Roadmap
Related resources
Repo: https://github.com/kvcache-ai/ktransformers
SOSP ’25 Paper: https://madsys.cs.tsinghua.edu.cn/publication/ktransformers-unleashing-the-full-potential-of-cpu/gpu-hybrid-inference-for-moe-models/
CC: @Atream @ovowei @chenht2022 @Azure-Tang @ErvinXie