
[Feature] KTransformers Integration to Support CPU/GPU Hybrid Inference for MoE Models #11425

@Atream

Description

Motivation

While hybrid CPU/GPU inference alleviates memory constraints by leveraging CPU DRAM capacity alongside GPU VRAM bandwidth, achieving high throughput remains challenging due to synchronization overheads and limited CPU compute efficiency. This PR upstreams the KTransformers approach (SOSP ’25), enabling GPU Tensor Parallelism + CPU/GPU Hybrid Expert Parallelism for MoE models, supporting hybrid prefilling and decoding that use AMX-optimized CPU kernels from kt-kernel together with GPUs. With this design, dense layers benefit from high-throughput multi-GPU execution, while experts are flexibly scheduled across both CPUs and GPUs, maximizing hardware utilization and reducing bottlenecks.
KTransformers will be incorporated into SGLang as a library backend. Building on this backend, SGLang generalizes the design to support multi-GPU tensor parallelism and CPU/GPU hybrid expert parallelism, while broadening coverage to additional models and weight formats.
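To make the expert-scheduling idea concrete, here is a minimal sketch of how top-k routing decisions might be split into GPU and CPU work queues. All names and the structure are illustrative assumptions, not the actual KTransformers or SGLang API:

```python
# Hypothetical sketch of hybrid expert dispatch; names and structure are
# illustrative only, not the KTransformers implementation.

GPU_EXPERTS = {0, 1, 2, 3}   # assume experts 0-3 are resident in GPU VRAM
NUM_EXPERTS = 8              # assume experts 4-7 stay in CPU DRAM

def dispatch(token_topk):
    """Split (token_id, expert_id) pairs into GPU and CPU work queues."""
    gpu_work, cpu_work = [], []
    for token_id, experts in enumerate(token_topk):
        for expert_id in experts:
            target = gpu_work if expert_id in GPU_EXPERTS else cpu_work
            target.append((token_id, expert_id))
    return gpu_work, cpu_work

# Each token is routed to its top-2 experts; the two queues can then be
# executed concurrently by GPU kernels and (e.g. AMX-optimized) CPU kernels.
gpu_work, cpu_work = dispatch([[0, 5], [2, 3], [6, 7]])
```

The point of the split is that the two queues run concurrently, so CPU expert compute overlaps with GPU dense-layer and GPU-expert compute rather than serializing behind it.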

Benchmark Results (Preview)

These are preliminary results of KTransformers with a single GPU, reflecting the performance of this feature in a single-card setting. More detailed benchmark data will be provided in follow-up updates. The figures below show the throughput performance of KTransformers on a dual-socket server with Intel® Xeon® Platinum 8452Y CPUs (36 cores × 2, 1 TB DDR5), equipped with an NVIDIA A100 (40 GB) for full-precision models and an NVIDIA RTX 4080 (16 GB) for quantized models. We evaluate on DeepSeek-V3-0324 (DS-3), DeepSeek-V2.5-1210 (DS-2), and Qwen2-57B-A14B (QW-2), comparing KTransformers against Llama.cpp and Fiddler across both the prefill and decode phases.

[Figure: prefill-phase throughput of KTransformers vs. Llama.cpp and Fiddler]

In the prefill phase, KTransformers consistently outperforms both baselines across all prompt lengths. While Llama.cpp shows advantages in short-prompt scenarios through aggressive operator fusion, and Fiddler benefits from AMX acceleration for long prompts, KTransformers surpasses both by leveraging AMX-optimized CPU kernels and improved CPU/GPU coordination. For example, our CPU MoE kernel achieves 21.3 TFLOPS on DS-3, a 3.98× improvement over the PyTorch baseline.

[Figure: decode-phase throughput of KTransformers vs. Llama.cpp and Fiddler]

In the decode phase, KTransformers (without Expert Deferral) achieves 2.42×–4.09× speedups over Fiddler and 1.25×–1.76× over Llama.cpp on full-precision models. With quantized models, the gains are even larger (1.77×–1.93× vs. Llama.cpp), primarily due to reduced kernel execution time and our efficient CUDA Graph-based scheduling, which reduces GPU launch overhead from over 20% to nearly zero.

Roadmap

  1. Hybrid inference with compressed tensor format, AMX kernel integration, and CUDA Graph support (init support for KTransformers Heterogeneous Computing #11487).
  2. Support hybrid quantization configs.
  3. Support more weight formats (e.g., GPTQ, AWQ).
  4. Refactor to use experts_map instead of num_gpu_experts.
  5. Avoid padding when using DP Attention.
  6. Hotness-aware expert distribution.
  7. Expert deferral (Support Expert Deferral Mechanism in KTransformers #12586).
  8. Add unit tests.
  9. Add tutorial and deployment guide.
  10. Support speculative decoding.
  11. Support more models (e.g., Qwen3, GLM4.5, Kimi-K2).
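Roadmap items 4 and 6 can be sketched together: replacing a scalar num_gpu_experts with an explicit per-expert placement map allows hotness-aware placement. The following is purely illustrative; the real config keys and semantics are not defined in this issue:

```python
# Illustrative contrast between a scalar num_gpu_experts and an explicit
# per-expert placement map (roadmap items 4 and 6). Hypothetical names,
# not the actual SGLang/KTransformers configuration.

def placement_from_count(num_experts, num_gpu_experts):
    """Old-style: the first num_gpu_experts experts go to GPU, rest to CPU."""
    return {e: ("gpu" if e < num_gpu_experts else "cpu")
            for e in range(num_experts)}

def placement_from_hotness(num_experts, num_gpu_experts, hotness):
    """Hotness-aware: place the most frequently routed experts on GPU."""
    hot_order = sorted(range(num_experts), key=lambda e: -hotness[e])
    gpu = set(hot_order[:num_gpu_experts])
    return {e: ("gpu" if e in gpu else "cpu") for e in range(num_experts)}

# With routing statistics available, the hottest experts land in VRAM
# instead of simply the first N by index.
experts_map = placement_from_hotness(8, 4, [3, 9, 1, 7, 5, 8, 2, 6])
```

The map form also leaves room for duplicating hot experts on both devices or migrating experts at runtime, which a single count cannot express.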

Related resources

Repo: https://github.com/kvcache-ai/ktransformers
SOSP ’25 paper: https://madsys.cs.tsinghua.edu.cn/publication/ktransformers-unleashing-the-full-potential-of-cpu/gpu-hybrid-inference-for-moe-models/

CC: @Atream @ovowei @chenht2022 @Azure-Tang @ErvinXie
