Skip to content

Latest commit

 

History

History
188 lines (144 loc) · 5.15 KB

File metadata and controls

188 lines (144 loc) · 5.15 KB

Running KTransformers on AVX2 CPUs

This tutorial explains how to run KTransformers on machines that only support AVX2 (without AVX512 or AMX).

Table of Contents

Supported Precision Formats

--kt-method Precision Description
BF16 BF16 native precision Zero precision loss, uses BF16 weights directly
FP8 FP8 block quantization
GPTQ_INT4 INT4 GPTQ

Hardware Requirements

  • CPU: x86-64 + AVX2 + FMA (Intel Haswell 2013+ / AMD Zen+)
  • GPU: NVIDIA 24GB+ VRAM (RTX 3090/4090/5090, etc.)
  • Memory: At least the size of the model weights (e.g., Qwen3-30B-A3B BF16 requires 64GB+)
  • OS: Linux

Installation

Build and install from source (one-click install for kt-kernel + SGLang):

git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule update --init --recursive

# One-click install
./install.sh

On AVX512 or AMX machines, you can also manually force AVX2 compilation:

export CPUINFER_CPU_INSTRUCT=AVX2
export CPUINFER_ENABLE_AMX=OFF
./install.sh kt-kernel --manual

Verification

# Check if the CPU supports AVX2
lscpu | grep -i avx2

# Check the loaded kt-kernel variant
python -c "import kt_kernel; print(kt_kernel.__cpu_variant__)"
# Expected output: avx2

# System diagnostics
kt doctor

Starting the Inference Server

Use --kt-method BF16, FP8, or GPTQ_INT4. KT-Kernel will automatically detect the CPU and fall back to the AVX2 backend when AVX512/AMX is unavailable.

Example: Qwen3-30B-A3B (BF16)

# Download the model
huggingface-cli download Qwen/Qwen3-30B-A3B --local-dir /path/to/Qwen3-30B-A3B

# Check physical core count and NUMA node count
lscpu | grep -E "^CPU\(s\)|Thread\(s\) per core|NUMA node\(s\)"

# Start the server (adjust kt-cpuinfer and kt-threadpool-count based on your hardware)
python -m sglang.launch_server \
  --host 0.0.0.0 --port 30000 \
  --model /path/to/Qwen3-30B-A3B \
  --kt-weight-path /path/to/Qwen3-30B-A3B \
  --kt-cpuinfer 16 \
  --kt-threadpool-count 1 \
  --kt-num-gpu-experts 32 \
  --kt-method BF16 \
  --attention-backend flashinfer \
  --trust-remote-code \
  --mem-fraction-static 0.80 \
  --chunked-prefill-size 8192 \
  --max-running-requests 2 \
  --served-model-name Qwen3 \
  --enable-mixed-chunk \
  --tensor-parallel-size 1 \
  --enable-p2p-check \
  --disable-shared-experts-fusion

Example: Qwen3.5-35B-A3B-FP8 (FP8)

# Download the model
huggingface-cli download Qwen/Qwen3.5-35B-A3B-FP8 --local-dir /path/to/Qwen3.5-35B-A3B-FP8

# Start the server
python -m sglang.launch_server \
  --host 0.0.0.0 --port 30000 \
  --model /path/to/Qwen3.5-35B-A3B-FP8 \
  --kt-weight-path /path/to/Qwen3.5-35B-A3B-FP8 \
  --kt-cpuinfer 16 \
  --kt-threadpool-count 1 \
  --kt-num-gpu-experts 2 \
  --kt-method FP8 \
  --kt-gpu-prefill-token-threshold 400 \
  --attention-backend triton \
  --trust-remote-code \
  --mem-fraction-static 0.85 \
  --chunked-prefill-size 4096 \
  --max-running-requests 1 \
  --max-total-tokens 32000 \
  --enable-mixed-chunk \
  --tensor-parallel-size 1 \
  --disable-shared-experts-fusion

Example: Qwen3-30B-A3B-GPTQ-Int4 (GPTQ_INT4)

# Download the model
huggingface-cli download Qwen/Qwen3-30B-A3B-GPTQ-Int4 --local-dir /path/to/Qwen3-30B-A3B-GPTQ-Int4

# Start the server
python -m sglang.launch_server \
  --host 0.0.0.0 --port 30000 \
  --model /path/to/Qwen3-30B-A3B-GPTQ-Int4 \
  --kt-weight-path /path/to/Qwen3-30B-A3B-GPTQ-Int4 \
  --kt-cpuinfer 16 \
  --kt-threadpool-count 1 \
  --kt-num-gpu-experts 2 \
  --kt-method GPTQ_INT4 \
  --attention-backend triton \
  --trust-remote-code \
  --mem-fraction-static 0.85 \
  --chunked-prefill-size 4096 \
  --max-running-requests 1 \
  --max-total-tokens 32000 \
  --enable-mixed-chunk \
  --tensor-parallel-size 1 \
  --disable-shared-experts-fusion

Sending Requests

# Interactive chat
kt chat

# OpenAI-compatible API
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen3","messages":[{"role":"user","content":"Hello"}],"stream":true}'

Performance Tuning

  • --kt-cpuinfer: set to the number of physical cores
  • --kt-threadpool-count: set to the number of NUMA nodes
  • --kt-num-gpu-experts: higher values reduce CPU load but increase GPU VRAM usage
  • Memory bandwidth is often the bottleneck; high-frequency DDR5 memory helps significantly

FAQ

GPU OOM

  • Reduce --kt-num-gpu-experts, --chunked-prefill-size, --max-total-tokens
  • Lower --mem-fraction-static

For more questions, see FAQ.