We are very pleased to announce that Ktransformers now supports Kimi-K2 and Kimi-K2-0905.
On a single-socket CPU with one consumer-grade GPU, running the Q4_K_M model yields roughly 10 TPS and requires about 600 GB of DRAM.
With a dual-socket CPU and sufficient system memory, enabling NUMA optimizations increases performance to about 14 TPS.
- Official Kimi-K2 Release:
- GGUF Format(quantized models):
- Official Kimi-K2-0905 Release:
- GGUF Format(quantized models):
The model running with 384 Experts requires approximately 600 GB of memory and 14 GB of GPU memory.
# download gguf
huggingface-cli download --resume-download KVCache-ai/Kimi-K2-Instruct-GGUF
To install KTransformers, follow the official Installation Guide.
python ktransformers/server/main.py \
--port 10002 \
--model_path <path_to_safetensor_config> \
--gguf_path <path_to_gguf_files> \
--optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml \
--max_new_tokens 1024 \
--cache_lens 32768 \
--chunk_size 256 \
--max_batch_size 4 \
--backend_type balance_serve \curl -X POST http://localhost:10002/v1/chat/completions \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "hello"}
],
"model": "Kimi-K2",
"temperature": 0.3,
"top_p": 1.0,
"stream": true
}'