Motivation
While hybrid CPU/GPU inference alleviates memory constraints by leveraging CPU DRAM capacity alongside GPU VRAM bandwidth, achieving high throughput remains challenging due to synchronization overheads and limited CPU compute efficiency. This PR upstreams the KTransformers approach (SOSP ’25), enabling GPU Tensor Parallelism + CPU/GPU Hybrid Expert Parallelism for MoE models—supporting hybrid prefill and decode that use AMX-optimized CPU kernels (kt-kernel) together with GPUs. With this design, dense layers benefit from high-throughput multi-GPU execution, while experts are flexibly scheduled across both CPUs and GPUs, maximizing hardware utilization and reducing bottlenecks.
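The hybrid expert-parallel idea above can be sketched as follows. This is a minimal illustrative model, not the PR's actual code: the expert set, the `GPU_EXPERTS` pinning, and the helper names are assumptions, and both "devices" here are plain NumPy matmuls standing in for CUDA and AMX kernels.

```python
import numpy as np

NUM_EXPERTS = 8
GPU_EXPERTS = {0, 1}   # experts pinned in VRAM (illustrative assumption)
HIDDEN = 16

rng = np.random.default_rng(0)
# one tiny weight matrix per expert; real MoE experts are gated FFNs
W = [rng.standard_normal((HIDDEN, HIDDEN)) for _ in range(NUM_EXPERTS)]

def run_expert(x, eid):
    # In the real system this dispatches to a CUDA kernel (GPU-resident
    # experts) or an AMX-optimized CPU kernel; here both are matmuls.
    return x @ W[eid]

def moe_forward(x, router_logits, top_k=2):
    topk = np.argsort(router_logits)[-top_k:]   # top-k routed experts
    gates = np.exp(router_logits[topk])
    gates /= gates.sum()                        # softmax over selected experts
    out = np.zeros_like(x)
    for eid, g in zip(topk, gates):
        # device choice is per-expert, so CPU and GPU work can overlap
        device = "gpu" if eid in GPU_EXPERTS else "cpu"
        out += g * run_expert(x, eid)
    return out

x = rng.standard_normal(HIDDEN)
y = moe_forward(x, rng.standard_normal(NUM_EXPERTS))
```

The key property the sketch shows is that expert placement is a per-expert scheduling decision, so dense layers and GPU-resident experts keep the GPUs busy while the remaining experts run on CPU.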
KTransformers will be incorporated into SGLang as a library backend. Building on this backend, SGLang generalizes the design to support multi-GPU tensor parallelism and CPU/GPU hybrid expert parallelism, while broadening coverage to additional models and weight formats.
Benchmark Results (Preview)
These are preliminary single-GPU results for KTransformers; more detailed benchmark data will be provided in follow-up updates. The figures below show the throughput of KTransformers on a dual-socket server with Intel® Xeon® Platinum 8452Y CPUs (36 cores × 2, 1 TB DDR5), equipped with an NVIDIA A100 (40 GB) for full-precision models and an NVIDIA RTX 4080 (16 GB) for quantized models. We evaluate on DeepSeek-V3-0324 (DS-3), DeepSeek-V2.5-1210 (DS-2), and Qwen2-57B-A14B (QW-2), comparing KTransformers against Llama.cpp and Fiddler across both the prefill and decode phases.
In the prefill phase, KTransformers consistently outperforms both baselines across all prompt lengths. While Llama.cpp shows advantages in short-prompt scenarios through aggressive operator fusion, and Fiddler benefits from AMX acceleration for long prompts, KTransformers surpasses both by leveraging AMX-optimized CPU kernels and improved CPU/GPU coordination. For example, our CPU MoE kernel achieves 21.3 TFLOPS on DS-3, a 3.98× improvement over the PyTorch baseline.
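For context on the TFLOPS figure above, kernel throughput for a GEMM follows from its FLOP count divided by wall time. The sketch below is a back-of-envelope helper; the matrix shapes and timing in the example are illustrative, not re-measured numbers from the paper.

```python
def gemm_tflops(m, n, k, seconds):
    """Throughput of an (m x k) @ (k x n) GEMM in TFLOPS."""
    flops = 2 * m * n * k        # each multiply-add counts as 2 FLOPs
    return flops / seconds / 1e12

# e.g. 4096 tokens through a hypothetical 7168x2048 expert projection
# taking 12 ms on the CPU:
t = gemm_tflops(4096, 2048, 7168, 12e-3)   # ≈ 10.0 TFLOPS
```

Reaching the reported 21.3 TFLOPS on DS-3 therefore requires the AMX kernel to sustain roughly 4× the arithmetic rate of the PyTorch CPU baseline on the same shapes, which matches the stated 3.98× improvement.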
In the decode phase, KTransformers (without Expert Deferral) achieves 2.42×–4.09× speedups over Fiddler and 1.25×–1.76× over Llama.cpp on full-precision models. With quantized models, the gains are even larger (1.77×–1.93× vs. Llama.cpp), primarily due to reduced kernel execution time and our efficient CUDA Graph-based scheduling, which reduces GPU launch overhead from over 20% to nearly zero.
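The launch-overhead claim can be illustrated with a toy cost model: eager decode pays a CPU-side launch cost per kernel, while replaying a pre-captured CUDA Graph pays a single launch for the whole step. All numbers below (kernel count, per-kernel time, launch latency) are illustrative assumptions, not measurements from this PR.

```python
def step_time_us(kernels, t_kernel_us, t_launch_us, use_graph):
    """Decode-step time: compute plus launch overhead (toy model)."""
    # A captured graph replays with one launch; eager mode launches
    # every kernel individually from the CPU.
    launch = t_launch_us if use_graph else kernels * t_launch_us
    return kernels * t_kernel_us + launch

eager = step_time_us(kernels=600, t_kernel_us=5.0, t_launch_us=1.5,
                     use_graph=False)
graph = step_time_us(kernels=600, t_kernel_us=5.0, t_launch_us=1.5,
                     use_graph=True)
# fraction of the eager step spent launching rather than computing
overhead_eager = 1 - (600 * 5.0) / eager
```

With these assumed numbers, launch overhead is about 23% of the eager step and essentially zero with graph replay, consistent with the "over 20% to nearly zero" reduction described above.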
Roadmap
Related resources
Repo: https://github.com/kvcache-ai/ktransformers
SOSP ’25 Paper: https://madsys.cs.tsinghua.edu.cn/publication/ktransformers-unleashing-the-full-potential-of-cpu/gpu-hybrid-inference-for-moe-models/
CC: @Atream @ovowei @chenht2022 @Azure-Tang @ErvinXie