用户实测 Evaluation by Users #104

DrTangxc · 2025-08-29T06:01:46Z

DrTangxc
Aug 29, 2025
Maintainer

We appreciate your valuable evaluation results.
Please provide the following information:

hardware and software configuration (GPU, Model, OS, Network, etc.)
chitu version (and PR link if any)
evaluation methodology (input and output length, batch size, etc.)
performance and/or accuracy data

English description is mandatory and Chinese is optional.

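We look forward to your measured results. Please provide the following information:

Hardware and software configuration, including GPU model, test model, operating system, network configuration, etc.
Chitu version (and a link to the PR if you made related changes)
Test methodology (input and output length, concurrency, etc.)
Performance data and accuracy data

An English description is required; a Chinese description is also welcome.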

Ethkuil · 2025-09-02T01:43:36Z

Ethkuil
Sep 2, 2025

GPU: NVIDIA H20
OS: Ubuntu 22.04

chitu version: v0.4.2

Model: Qwen3-32B
bf16, TP=1
input_len=128, output_len=1024

bs	TPS	TTFT	TPOT	Total Token throughput
1	44.39	199.49	22.35	50.29
2	85.02	198.68	23.35	96.33
4	165.74	397.50	23.77	187.78
8	307.56	615.70	25.43	348.45
16	559.22	1184.05	27.47	633.57
32	945.59	2287.28	31.62	1071.29
64	1301.04	4090.94	45.17	1474.00
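
The columns in this table are linked, so one can be cross-checked against the others. The sketch below rebuilds the output throughput from the two latency columns. It assumes TTFT and TPOT are in milliseconds and that every request in a batch generates exactly output_len tokens; neither is stated in the post. The function name is illustrative and not part of chitu.

```python
def estimate_output_tps(batch_size, ttft_ms, tpot_ms, output_len=1024):
    """Estimate output tokens/s from per-request latency figures.

    Assumes each request takes TTFT for the first token plus TPOT for each
    of the remaining (output_len - 1) tokens, and that all requests in the
    batch run concurrently.
    """
    latency_s = (ttft_ms + (output_len - 1) * tpot_ms) / 1000.0
    return batch_size * output_len / latency_s


# Rows taken from the table above: (bs, reported TPS, TTFT, TPOT)
rows = [
    (1, 44.39, 199.49, 22.35),
    (16, 559.22, 1184.05, 27.47),
    (64, 1301.04, 4090.94, 45.17),
]
for bs, reported, ttft, tpot in rows:
    est = estimate_output_tps(bs, ttft, tpot)
    print(f"bs={bs}: reported {reported:.2f}, estimated {est:.2f}")
```

For these rows the estimate lands within about 1% of the reported TPS, which suggests the columns are internally consistent under the stated assumptions.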

0 replies

dyedd · 2025-09-02T11:50:19Z

dyedd
Sep 2, 2025

device: 910B3
os: 5.10.0-60.18.0.50.r865_35.hce2.aarch64
chitu version: v0.4.2

Model: Qwen3-32B
bf16, TP=8
input_len=128, output_len=1024
cache_type=skew
cuda_graph=True

bs	TPS	TTFT	TPOT	Total Token throughput
1	42.20	201.18	23.52	47.81
2	79.92	186.94	24.86	90.54
4	150.46	192.03	26.42	170.46
8	268.34	290.14	29.54	304.01
16	355.82	356.03	44.63	403.12
32	599.16	526.60	52.89	678.82
64	1024.38	888.38	61.55	1160.56
128	1582.14	1591.90	79.18	1792.48
256	2106.91	2997.18	118.15	2387.01
512	2530.81	5538.06	195.93	2867.26

0 replies

cyk2018 · 2025-09-04T11:31:45Z

cyk2018
Sep 4, 2025

device: 910B2 (64GB) * 2
os: Ubuntu 22.04.4 LTS
docker image: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend:v0.4.2
chitu version: v0.4.2

Model: Qwen3-32B
cache_type: skew
tp 2
cuda_graph: True
warmup: 2
iteration: 1

batch size	TPS	TTFT	TPOT	Total Token throughput
1	19.33	299.56	51.5	21.9
2	37.96	299.35	52.44	43.01
4	72.44	298.63	54.98	82.07
8	137.43	460.14	57.81	155.7
16	250.59	694.53	63.22	283.9
32	440.9	1215.38	71.44	499.51
64	751.04	2242.56	83.07	850.88
128	1128.31	4291.87	109.24	1278.31

Others:
When following development.md, we found that the --seq-len parameter it uses has been removed; --input-len and --output-len should be used instead.

0 replies

johanvx · 2025-09-06T17:35:12Z

johanvx
Sep 6, 2025

Benchmark configuration

Model: Qwen/Qwen3-8B
iterations: 10
warmup: 3
input_len: 128
output_len: 1024

torchrun options

cache_type: paged
cuda_graph: True

Result set 1

OS: Ubuntu 22.04
GPU: NVIDIA A10 (1 * 24 GB)
chitu version: aa78c4d40fec78a1f218854f0f9e21208075a823
TORCH_CUDA_ARCH_LIST: 8.6
CUDA version: 12.8

Batch Size	TPS	TTFT	TPOT	Total Token Throughput
1	33.12	93.44	30.12	38.20
2	62.56	125.77	31.86	72.16
4	122.27	192.09	32.53	141.01
8	233.57	320.48	33.92	269.39
16	429.06	505.84	36.76	494.85
32	743.34	831.06	42.15	857.31
64	925.04	1375.56	54.78	1066.87

Result set 2

OS: Ubuntu 22.04.5 LTS
GPU: NVIDIA GeForce RTX 3090 (1 * 24 GB)
Docker Image: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia:v0.4.2

Batch Size	TPS	TTFT	TPOT	Total Token Throughput
1	45.42	158.89	21.86	52.38
2	90.46	199.10	21.91	104.33
4	176.74	297.84	22.32	203.84

P.S.: Results for batch sizes >= 8 are currently unavailable due to device issues. I may add them later.

0 replies

RuibaiXu · 2025-09-07T06:41:43Z

RuibaiXu
Sep 7, 2025

Environment Config:

GPU: RTX 4090 (SM89)
OS: EndeavourOS 2024.04.20
Python: 3.11
CUDA: 12.8
chitu: aa78c4d40fec78a1f218854f0f9e21208075a823

Inference Config:

Model: Qwen3-8B
bf16
max_reqs : 1
max_seq_len : 4096
max_new_tokens : 2048
cache_type: paged
cuda_graph: True

Benchmark Config:

Model: Qwen3-8B
warmup : 3
iterations: 3
input-len : 128
output-len : 1024

Results:

Batch Size	TPS	Total Token Throughput	TTFT	TPOT	ITL
1	58.54	66.32	50.04	17.05	17.65
2	58.53	66.31	64.97	25.60	26.63
4	58.50	66.28	89.54	42.77	44.15

0 replies

nyanyanyanyamii · 2025-09-07T09:49:10Z

nyanyanyanyamii
Sep 7, 2025

Device: NVIDIA H20
OS: Ubuntu 22.04
CUDA version: 12.4
Chitu version: v0.4.2
Model: Qwen3-8B
Precision: FP16
TP: 1
Input len: 128
Output len: 1024
Warm up: 2
Iteration: 1

Batch Size	TPS	TTFT	TPOT	Total Token throughput
1	150.2	95.1	18.3	171.5
2	290.4	97.3	18.9	333.0
4	560.7	188.6	19.5	643.5
8	1045.9	295.8	20.7	1199.2
16	1895.3	580.6	22.9	2176.8
32	3302.8	1100.7	28.4	3791.6

0 replies

RouteTrace · 2025-09-08T02:49:14Z

RouteTrace
Sep 8, 2025

Performance Test Report for gpt-oss-20b-BF16 on Hygon DCU

1. Test Environment

Hardware Platform: Hygon DCU (64G) x 2
Model: gpt-oss-20b-BF16
OS / Image: pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10
Inference Framework: chitu v0.4.2

2. Test Configuration

Input Length: 128 tokens
Output Length: 1024 tokens
Tensor Parallelism: TP = 2

3. Performance Data

The table below shows the model's performance metrics at different batch sizes.

Batch Size (BS)	Throughput (TPS, Tokens/s)	Time To First Token (TTFT, ms)	Time Per Output Token (TPOT, ms)	Tensor Parallelism (TP)
1	6.87	599.48	145.06	2
2	17.15	1027.38	115.69	2
4	24.50	1296.26	162.10	2
8	56.88	1582.04	139.24	2
16	87.14	2862.30	180.98	2

4. Problem Description

When running the test with a Batch Size of 32, the program encountered a bug and terminated, reporting the following error: IndexError: list index out of range

0 replies

BestKuan · 2025-09-12T07:28:32Z

BestKuan
Sep 12, 2025

Device: NVIDIA L40S x 4
OS: Ubuntu 22.04
CUDA version: 12.4
Chitu version: v0.4.2
Model: Qwen2.5-32B
Precision: FP16
TP: 4
Input len: 128
Output len: 1024
Warm up: 2
Iteration: 8

batch-size	TTFT	TTFT (P99)	TPOT	TPOT (P99)	Total Throughput
1	0.167	0.172	0.026	0.026	51
2	0.207	0.598	0.029	0.03	86
4	0.201	0.608	0.03	0.033	147
8	0.202	0.581	0.032	0.038	263
16	0.225	0.78	0.037	0.049	505
32	0.24	0.611	0.048	0.08	755

0 replies

confused666 · 2025-09-12T07:37:59Z

confused666
Sep 12, 2025

Environment Config:

GPU: RTX 4090 (24GB)
OS: Ubuntu 22.04
Python: 3.12.9
CUDA: 12.4
chitu: 0.4.2

Inference Config:

Model: Qwen2.5-3B
max_reqs : 1
max_seq_len : 4096
max_new_tokens : 2048
cache_type: paged
cuda_graph: True

Benchmark Config:

Model: Qwen2.5-3B
warmup : 3
iterations: 10
input-len : 128
output-len : 1024

Results:

Batch Size (BS)	Output throughput (TPS, Tokens/s)	Time to First Token (TTFT, ms)	Time per Output Token (TPOT, ms)
1	114.61	65.18	8.67
2	113.98	110.74	13.08
4	114.94	166.17	21.68
8	114.63	266.98	39.20

0 replies

Yicooong · 2025-09-12T08:52:32Z

Yicooong
Sep 12, 2025

System and Hardware Configuration

Category	Details
CPU	Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
Memory	Total: 187Gi, Used: 53Gi, Free: 8.6Gi, Buff/Cache: 125Gi, Available: 132Gi
GPU	2 × Tesla V100S-PCIE-32GB
GPU Memory	GPU0: 408MiB / 32768MiB GPU1: 1664MiB / 32768MiB
GPU Driver	NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2
Temp/Power	GPU0: 39°C, 38W / 250W GPU1: 38°C, 37W / 250W

Software and Frameworks

Category	Version / Info
OS	Ubuntu 22.04.3 LTS
Chitu	0.4.2
Test Model	Qwen2.5-7B-Instruct

Startup Command

export WORLD_SIZE=1
torchrun --nnodes 1 \
    --nproc_per_node 1 \
    --master_port=22525 \
    -m chitu \
    serve.port=21002 \
    infer.cache_type=paged \
    infer.pp_size=1 \
    infer.tp_size=1 \
    models=Qwen2.5-7B-Instruct \
    models.ckpt_dir=/home/liuyicong/lyc_workdir/models/Qwen2.5-7B-Instruct \
    infer.max_reqs=16 \
    infer.max_seq_len=4096 \
    request.max_new_tokens=100

Runtime Error

RuntimeError: Internal Triton PTX codegen error
`ptxas` stderr:
ptxas /tmp/tmp3s3yr8ba.ptx, line 122; error   : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/tmp3s3yr8ba.ptx, line 122; error   : Feature 'cvt with .f32.bf16' requires .target sm_80 or higher

Explanation

My GPUs are Tesla V100 (compute capability 7.0, sm_70), which do not support BF16 instructions.
The Triton kernel tried to emit BF16 PTX (.bf16, cvt.f32.bf16), but sm_80 (Ampere or newer, e.g. A100, H100) is required.
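
A guard of the following kind can catch this before any kernel is compiled. It is a minimal sketch: the function names are hypothetical, and whether chitu exposes a setting that switches it to float16 is not stated in this post. The only fact it relies on is the one in the error message, namely that bf16 instructions need sm_80 or newer.

```python
def supports_bf16(compute_capability):
    """Return True if a CUDA device with this (major, minor) capability
    has native bf16 instructions, i.e. is sm_80 or newer."""
    major, _minor = compute_capability
    return major >= 8


def pick_dtype(compute_capability):
    """Choose a 16-bit float type name that the device can execute."""
    return "bfloat16" if supports_bf16(compute_capability) else "float16"


# With PyTorch installed, the tuple can be obtained from
# torch.cuda.get_device_capability(0); it is hard-coded here so the
# sketch runs without a GPU.
print(pick_dtype((7, 0)))  # Tesla V100
print(pick_dtype((8, 0)))  # A100
```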

0 replies

luliuliu12138 · 2025-09-14T04:54:57Z

luliuliu12138
Sep 14, 2025

Environment Config:

GPU: RTX 4090 (24GB)
Python: 3.13.7
CUDA: 12.4
chitu: v0.4.2

Inference Config:

Model: Qwen2.5-7B
max_reqs: 4
max_seq_len: 4096
max_new_tokens: 1024
cache_type: paged

Benchmark Config:

Model: Qwen2.5-7B
warmup: 3
iterations: 10
input-len: 128
output-len: 1024

Result:

Batch Size	TTFT (ms)	TPOT (ms)	Throughput (tok/s)
1	68.62	28.29	35.29
2	78.19	29.98	66.60
4	115.70	31.21	127.81
8	168.51	46.09	129.86
16	253.39	76.59	130.49
32	402.44	137.30	131.09
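
Throughput in this table stops growing after batch size 4, which coincides with max_reqs: 4 in the inference config. A plausible reading, not confirmed in the post, is that the server runs at most max_reqs requests concurrently and the rest wait. The helper below is a hypothetical sketch for locating such a plateau in a batch-size sweep; the 5% threshold is an arbitrary choice.

```python
def saturation_batch_size(rows, min_gain=0.05):
    """Return the first batch size after which throughput stops growing.

    rows: list of (batch_size, throughput) sorted by batch size.
    min_gain: relative improvement below which the next step counts as flat.
    Returns None if throughput keeps growing over the whole sweep.
    """
    for (bs, tput), (_next_bs, next_tput) in zip(rows, rows[1:]):
        if (next_tput - tput) / tput < min_gain:
            return bs
    return None


# (batch size, throughput in tok/s) from the table above
rows = [(1, 35.29), (2, 66.60), (4, 127.81),
        (8, 129.86), (16, 130.49), (32, 131.09)]
print(saturation_batch_size(rows))
```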

0 replies

qingwanruojun · 2025-09-18T12:31:34Z

qingwanruojun
Sep 18, 2025

Qwen3-0.6B Performance Benchmark on RTX 5090

Environment

OS: Ubuntu 22.04
GPU: NVIDIA RTX 5090 (32GB) × 1
CPU: 25 vCPU Intel(R) Xeon(R) Platinum 8470Q
Memory: 90GB
Python: 3.12
PyTorch: 2.8.0
CUDA: 12.8

Benchmark Configuration

Model: Qwen3-0.6B
Iterations: 10
Warmup: 3
Input Length: 128 tokens
Output Length: 1024 tokens

Results

Batch Size	TPS (Tokens/s)	TTFT (ms)	TPOT (ms)	Total Token Throughput (Tokens/s)
1	44.87	56.19	22.25	50.84
2	83.32	67.08	23.96	94.39
4	165.59	72.65	24.10	187.61
8	326.17	104.90	24.44	369.53
16	324.91	12706.86	24.53	368.10
32	320.65	38458.72	24.87	363.28
64	324.72	88506.51	24.56	367.89

0 replies

cranechu0131 · 2025-09-21T13:54:43Z

cranechu0131
Sep 21, 2025

Qwen3-8B Performance Benchmark on RTX 3090

Environment

OS: Ubuntu 22.04
GPU: NVIDIA RTX 3090 (24GB) × 1
CPU: Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
Memory: 256GB
Python: 3.11
PyTorch: 2.7.0+cu126
CUDA: 12.6

Benchmark Configuration

Model: Qwen3-8B
Iterations: 10
Warmup: 3
Input Length: 128 tokens
Output Length: 1024 tokens

Results

Batch Size	TPS (Tokens/s)	TTFT (ms)	TPOT (ms)	Total Token Throughput (Tokens/s)
1	19.11	124.06	52.25	21.65
2	36.86	165.49	54.15	41.76
4	72.47	241.84	55.01	82.11
8	138.49	376.79	57.45	156.90
16	277.27	622.64	57.14	314.13
32	275.20	1084.99	86.45	311.78

0 replies

jma85448-del · 2025-09-25T03:39:32Z

jma85448-del
Sep 25, 2025

Qwen3-8B Serving Benchmark on RTX 4090

Environment

OS: Ubuntu 22.04
GPU: NVIDIA RTX 4090 (24GB) × 1
CUDA: 11.8
Python: 3.10
PyTorch: 2.1.2
Chitu Version: 562e19680b6cf331992e14f8b6ef8bf839cf5535

Benchmark Configuration

Model: Qwen3-8B
Batch Size: 1
Input Length: 128 tokens
Output Length: 1024 tokens
Iterations: 10
Warmup Iterations: 3
Base URL: http://localhost:21002

Launch Command

torchrun --nnodes 1 \
    --nproc_per_node 1 \
    --master_port=22525 \
    -m chitu \
    serve.port=21002 \
    infer.cache_type=paged \
    infer.pp_size=1 \
    infer.tp_size=1 \
    models=Qwen3-8B \
    models.ckpt_dir=/root/autodl-tmp/model/Qwen3-8B \
    infer.mla_absorb=absorb-without-precomp \
    infer.raise_lower_bit_float_to=bfloat16 \
    infer.max_reqs=4 \
    infer.max_seq_len=1024 \
    request.max_new_tokens=100 \
    infer.use_cuda_graph=True

Results

Metric	Value
Successful requests	10
Benchmark duration (s)	162.99
Total input tokens	128
Total generated tokens	8880
Request throughput (req/s)	0.06
Output token throughput (tok/s)	54.48
Total token throughput (tok/s)	62.83

Time to First Token (TTFT)

Stat	ms
Mean	72.16
Median	74.92
P99	78.31

Time per Output Token (TPOT, excl. 1st)

Stat	ms
Mean	18.29
Median	18.29
P99	18.31

Inter-token Latency (ITL)

Stat	ms
Mean	18.80
Median	18.27
P99	36.67

0 replies

Yhorm26 · 2025-09-30T10:07:37Z

Yhorm26
Sep 30, 2025

Test environment
Device: NVIDIA H800 × 8
OS: Ubuntu 22.04
CUDA version: 12.4
Chitu version: v0.4.3
Model: DeepSeek-R1-Distill-Qwen-14B
Precision: FP16
TP: 8
Input len: 128
Output len: 1024
Warm up: 3
Iteration: 10

batch-size	TTFT	TTFT (P99)	TPOT	TPOT (P99)	Total Throughput
1	68.04	70.07	6.25	6.36	179.14
2	73.87	99.50	7.13	7.21	313.95
4	70.90	98.22	7.53	7.61	595.14
8	79.33	90.09	7.84	7.95	1141.99

1 reply

Yhorm26 Sep 30, 2025

batch-size	TTFT	TTFT (P99)	TPOT	TPOT (P99)	Total Throughput
1	68.04	70.07	6.25	6.36	179.14
2	73.87	99.50	7.13	7.21	313.95
4	70.90	98.22	7.53	7.61	595.14
8	79.33	90.09	7.84	7.95	1141.99
16	136.44	155.68	11.77	15.75	1143.57
32	232.91	342.09	19.78	31.69	1136.88

Artlesbol · 2025-12-16T02:44:06Z

Artlesbol
Dec 16, 2025

Device Information

Hardware

CPU: INTEL(R) XEON(R) PLATINUM 8575C
GPU: NVIDIA A100-SXM4-40GB

Software

Docker Image: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia:latest
Python: Python 3.11.12
PyTorch: 2.7.0+cu126
Chitu: 0.5.0
Model: Qwen3-0.6B

Server Config

export WORLD_SIZE=1
torchrun --nnodes 1 \
  --nproc_per_node 1 \
  --master_port=22525 -m chitu serve.port=21002 \
  infer.cache_type=paged \
  infer.pp_size=1 \
  infer.tp_size=1 \
  models=Qwen3-0.6B \
  models.ckpt_dir=/workspace/qwen3 \
  infer.max_reqs=16 \
  infer.max_seq_len=4096 \
  request.max_new_tokens=100

Benchmark Config

iteration=10

input_len=128

output_len=1024

warmup=3

Benchmark Summary

Batch Size	Req/s	Out Tok/s	Total Tok/s	Mean TTFT (ms)	P99 TTFT (ms)	Mean TPOT (ms)	P99 TPOT (ms)	Mean ITL (ms)	P99 ITL (ms)
1	0.28	290.08	328.60	35.42	37.03	3.42	3.42	3.52	7.27
2	0.40	407.30	461.40	35.80	37.52	4.88	4.89	5.03	10.21
4	0.76	778.52	881.91	35.56	37.21	5.11	5.12	5.23	10.49
8	1.41	1442.86	1634.49	45.73	47.94	5.50	5.51	5.63	11.30
16	2.45	2505.19	2837.91	56.17	68.58	6.33	6.34	6.48	13.14
32	2.45	2508.19	2841.31	3340.76	6594.42	6.33	6.36	6.48	13.15
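
Between batch sizes 16 and 32 the throughput is unchanged while mean TTFT jumps from about 56 ms to about 3341 ms. This is consistent with infer.max_reqs=16 in the server config if the extra 16 requests have to wait for the first 16 to finish. The sketch below tests that reading with a simple two-wave model. The model itself is an assumption, not a description of how chitu schedules requests, and the function name is hypothetical.

```python
def two_wave_mean_ttft(ttft_ms, tpot_ms, output_len=1024):
    """Mean and worst-case TTFT when twice as many requests arrive as the
    server admits at once.

    Assumed model: the first half is served immediately; the second half
    waits for the first wave to finish generating before it is prefilled.
    """
    first_wave = ttft_ms
    wave_duration = ttft_ms + (output_len - 1) * tpot_ms
    second_wave = wave_duration + ttft_ms
    return (first_wave + second_wave) / 2, second_wave


# Mean TTFT and mean TPOT measured at batch size 16 in the table above
mean_ttft, worst_ttft = two_wave_mean_ttft(ttft_ms=56.17, tpot_ms=6.33)
print(f"predicted mean TTFT {mean_ttft:.0f} ms, worst case {worst_ttft:.0f} ms")
```

The predicted values fall within 2% of the mean and P99 TTFT reported for batch size 32, so the two-wave reading fits this row well.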

0 replies

BochaoLi · 2025-12-23T05:45:38Z

BochaoLi
Dec 23, 2025

Hardware:

nvidia v100

software:

ubuntu 20.04
python 3.11.12
pytorch 2.7.0+cu126
cuda 12.6.3

image:

qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia:latest

docker command:

docker run -it --rm --gpus=all --shm-size=1g \
    -v ./qwen2.5:/mnt/qwen-model/qwen2.5-7b-instruct \
    qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia:latest

server config:

torchrun --nnodes 1 --nproc_per_node 1 --master_port=22525 -m chitu \
    serve.port=21002 infer.cache_type=paged infer.pp_size=1 infer.tp_size=1 \
    models=Qwen2.5-7B-Instruct models.ckpt_dir=/mnt/qwen-model/qwen2.5-7b-instruct \
    infer.max_reqs=16 infer.max_seq_len=4096 request.max_new_tokens=100

result:

Encountered a problem and could not continue.

0 replies

XzZZzX02 · 2025-12-24T06:16:06Z

XzZZzX02
Dec 24, 2025

Qwen3-32B Performance Benchmark on Ascend 910B3

Environment

Device: Huawei Ascend 910B3 (64GB) × 8
OS: Ubuntu 22.04.4 LTS
Docker Image: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend:v0.5.0
Chitu Version: v0.5.0
Model: Qwen3-32B
Graph Mode: True (NPU Graph enabled)
Network Mode: Host

Benchmark Configuration

Iterations: 10
Warmup: 1
Input Length: 128 tokens
Output Length: 1024 tokens

Results

Batch Size	TPS (Output tokens/s)	TTFT (ms)	TPOT (ms)	Total Token Throughput (tokens/s)
1	32.82	222.68	30.28	37.18
2	64.33	223.92	30.90	72.88
4	122.39	248.21	32.47	138.65
8	239.89	267.14	33.11	271.75
16	430.23	378.24	36.85	487.38
32	743.23	846.68	42.25	841.94
64	1300.74	1619.35	47.64	1473.49
128	2054.73	3082.37	59.30	2327.62

0 replies

Smallhucaptain · 2025-12-25T16:33:40Z

Smallhucaptain
Dec 25, 2025

Benchmark - Qwen3-8B on Nvidia A100 40GB

Environment

OS: Ubuntu 22.04.5 LTS
GPU: NVIDIA A100-SXM4-40GB (40GB) × 1
CPU: Intel(R) Xeon(R) Gold 6430 (Socket×2, 32 cores/socket, 2 threads/core, CPU(s)=128)
Memory: 251GiB
Python: 3.11.12
PyTorch: 2.7.0+cu126
CUDA: 12.6
Chitu: 0.5.0

Benchmark Configuration

Model: Qwen3-8B
Iterations: 10
Warmup: 3
Input Length: 128 tokens
Output Length: 1024 tokens

Batch Size	TPS (tok/s)	TTFT (ms)	TPOT (ms)	Total Token throughput (tok/s)
1	73.03	51.09	13.66	82.72
2	140.24	60.49	14.21	158.86
4	273.42	72.57	14.57	309.73
8	525.41	116.91	15.12	595.19
16	526.77	7955.03	15.09	596.74
32	526.06	23643.49	15.11	595.92

0 replies

kjuuii · 2026-01-06T12:17:39Z

kjuuii
Jan 6, 2026

[Evaluation] Qwen2.5-7B-Instruct Performance Benchmark on NVIDIA RTX 5090

Environment

Device: NVIDIA RTX 5090 (32GB VRAM) x 1
OS: Ubuntu 22.04 LTS (AutoDL)
Driver/CUDA: Driver 580.76 / CUDA 13.0
Chitu Version: v0.5.0

Model: Qwen2.5-7B-Instruct
bf16, TP=1
input_len=512, output_len=512

bs	TPS	TTFT	TPOT	Total Token throughput
1	99.19	65.29	9.97	204.00
4	370.48	172.79	10.47	761.94
8	706.03	348.99	10.66	1452.05
16	1225.32	544.29	12.00	2520.05
32	2057.94	891.02	13.80	4232.44
64	2069.09	1521.45	21.44	4255.37

0 replies

choson777 · 2026-01-19T04:20:15Z

choson777
Jan 19, 2026

Here are the benchmark results for Qwen3-0.6B running on Chitu.

Environment

GPU: NVIDIA RTX 4090 (24GB) x 2
TP Size: 2
Chitu Version: v0.4.2
CUDA Graph: Enabled
OS: Ubuntu 20.04
Python: 3.10
CUDA: 12.1

Benchmark Configuration

Model: Qwen3-0.6B
Input Length: 128 tokens
Output Length: 1024 tokens
Iterations: 3
Warmup: 3

Results

Batch Size	TPS (tok/s)	TTFT (ms)	TPOT (ms)	Total Token Throughput (tok/s)
1	234.98	99.60	4.16	266.22
2	397.06	133.37	4.91	449.84
4	709.80	130.40	5.51	804.16
8	1153.33	197.99	6.74	1306.65
16	1878.26	192.65	8.33	2127.96
32	2935.17	327.09	10.57	3325.38

0 replies

Crzax · 2026-01-19T17:44:05Z

Crzax
Jan 19, 2026

Performance Test Report for Qwen3-0.6B-BF16 on Muxi GPU

1. Test Environment

Hardware Platform: Muxi (MetaX) C500 GPU (16G Slice)
Model: Qwen3-0.6B (BF16)
OS / Image: Linux / PyTorch 2.8.0+metax (Muxi specialized)
Inference Framework: Chitu v0.5.0

2. Test Command

CHITU_MUXI_BUILD=1 torchrun --nproc_per_node 1 benchmarks/benchmark_offline.py \
    models=Qwen3-0.6B \
    models.ckpt_dir=/data/models/Qwen3-0.6B \
    infer.use_cuda_graph=True \
    infer.max_reqs=64 \
    infer.max_seq_len=1280 \
    benchmark.input_len=128 \
    benchmark.output_len=1024 \
    benchmark.num_reqs_list="[1, 8, 16, 32, 64]" \
    benchmark.iters=3 \
    2>&1 | tee benchmark_report.log

3. Performance Data

This table includes the real-time metrics captured by the Chitu throughput monitor during the stable execution phase (Iter 3).

bs	TPS (Output)	Avg prompt throughput	Avg generation throughput	TPOT (ms)	Total Token throughput
1	242.27	-	242.27	4.13	272.55
8	1241.60	204.6	1234.2	6.44	1396.80
16	2195.47	204.6	2173.7	7.29	2469.90
32	3379.67	409.2	3388.7	9.47	3802.13
64	4618.90	818.5	4585.2	13.86	5196.26

4. Problem Description

Tested on a 16GB VRAM partition. Using the default max_seq_len resulted in a CUDA Out of Memory error as the system attempted to reserve 17.50 GiB for the KV Cache.

Resolution: Set infer.max_seq_len=1280. This reduced the memory reservation to fit within the 15.22 GiB capacity of the GPU slice, enabling successful execution at BS 64 with ~84% KV cache utilization.
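
The effect of max_seq_len on the reservation can be estimated with a back-of-the-envelope calculation. The sketch below assumes the engine reserves room for max_reqs × max_seq_len tokens and uses model dimensions believed to match the public Qwen3-0.6B config (28 layers, 8 KV heads, head dimension 128, bf16). Both are assumptions to verify, and the function name is hypothetical.

```python
def kv_cache_gib(max_reqs, max_seq_len, num_layers, num_kv_heads,
                 head_dim, bytes_per_elem=2):
    """Size in GiB of a KV cache holding max_reqs * max_seq_len tokens.

    Each token stores one key and one value vector per layer and KV head.
    """
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return max_reqs * max_seq_len * bytes_per_token / 2**30


# Assumed Qwen3-0.6B shape (28 layers, 8 KV heads, head_dim 128); check
# these against the checkpoint's config.json before relying on them.
size = kv_cache_gib(max_reqs=64, max_seq_len=1280,
                    num_layers=28, num_kv_heads=8, head_dim=128)
print(f"{size:.2f} GiB")
```

Under these assumptions the cache for infer.max_reqs=64 and infer.max_seq_len=1280 comes to 8.75 GiB, which fits inside the 15.22 GiB slice. The same formula with max_seq_len=2560 gives 17.50 GiB, the figure quoted in the error, although the post does not say what the default max_seq_len is.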

0 replies

polim0227 · 2026-01-20T03:50:10Z

polim0227
Jan 20, 2026

CUDA 12.4, GPU: NVIDIA A40
GPU 3 and GPU 6 are used, each with 45 GB of memory
export CUDA_VISIBLE_DEVICES=3,6
torchrun --nnodes 1 --nproc_per_node 2 --master_port=22525 -m chitu \
    serve.port=35992 \
    infer.cache_type=paged \
    models=Qwen3-32B \
    models.ckpt_dir=/aiproject/data/llm/qwen/Qwen3-32B \
    models.tokenizer_path=/aiproject/data/llm/qwen/Qwen3-32B \
    infer.tp_size=2 \
    infer.pp_size=1 \
    infer.max_reqs=20 \
    float_16bit_variant=bfloat16 \
    infer.attn_type=auto \
    infer.use_cuda_graph=True \
    infer.raise_lower_bit_float_to=bfloat16
1/19 12:06:31 - AISBench - INFO - Performance Results of task: vllm-api-general-chat/gsm8kdataset:

┌──────────────────────────────┬───────┬───────────────────┬───────────────────┬───────────────────┬───────────────────┬───────────────────┬───────────────────┬───────────────────┬────┐
│ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │
├──────────────────────────────┼───────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼────┤
│ E2EL │ total │ 66463.882 ms │ 66314.674 ms │ 66548.1314 ms │ 66470.791 ms │ 66532.9642 ms │ 66545.2447 ms │ 66547.8428 ms │ 10 │
│ InputTokens │ total │ 1445.9 │ 1443.0 │ 1448.0 │ 1446.5 │ 1448.0 │ 1448.0 │ 1448.0 │ 10 │
│ OutputTokens │ total │ 514.3 │ 132.0 │ 658.0 │ 559.5 │ 612.75 │ 637.3 │ 655.93 │ 10 │
│ OutputTokenThroughput │ total │ 7.738 token/s │ 1.9881 token/s │ 9.8876 token/s │ 8.4158 token/s │ 9.2239 token/s │ 9.6068 token/s │ 9.8595 token/s │ 10 │
└──────────────────────────────┴───────┴───────────────────┴───────────────────┴───────────────────┴───────────────────┴───────────────────┴───────────────────┴───────────────────┴────┘

┌──────────────────────────────┬───────┬───────────────────┐
│ Common Metric │ Stage │ Value │
├──────────────────────────────┼───────┼───────────────────┤
│ Benchmark Duration │ total │ 664645.3573 ms │
│ Total Requests │ total │ 10 │
│ Failed Requests │ total │ 0 │
│ Success Requests │ total │ 10 │
│ Concurrency │ total │ 1.0 │
│ Max Concurrency │ total │ 1 │
│ Request Throughput │ total │ 0.015 req/s │
│ Total Input Tokens │ total │ 14459 │
│ Total generated tokens │ total │ 5143 │
│ Input Token Throughput │ total │ 21.7545 token/s │
│ Output Token Throughput │ total │ 7.738 token/s │
│ Total Token Throughput │ total │ 29.4924 token/s │
└──────────────────────────────┴───────┴───────────────────┘

0 replies

Siiiiion · 2026-01-28T07:49:16Z

Siiiiion
Jan 28, 2026

Performance Test Report for Qwen3-0.6B on RTX 4090

Environment

Hardware Platform: NVIDIA RTX 4090 (24GB) x 4
Model: Qwen3-0.6B
Chitu version: v0.4.2
OS: ubuntu 20.04
Python: 3.12
CUDA: 12.1

Server Config

torchrun --nnodes 1 \
    --nproc_per_node 4 \
    --master_port=22525 \
    -m chitu \
    serve.port=21002 \
    infer.cache_type=paged \
    infer.pp_size=1 \
    infer.tp_size=4 \
    models=Qwen3-0.6B \
    models.ckpt_dir=/data3/qsy/models/qwen3-0.6b/ \
    infer.mla_absorb=absorb-without-precomp \
    infer.raise_lower_bit_float_to=bfloat16 \
    infer.max_reqs=4 \
    infer.max_seq_len=4096 \
    request.max_new_tokens=100 \
    infer.use_cuda_graph=True

Result

Batch Size	TPS (tok/s)	TTFT (ms)	TPOT (ms)	Total Token Throughput (tok/s)
1	235.34	99.80	4.15	266.62
2	394.89	99.55	4.96	447.38
4	723.76	131.55	5.39	819.97
8	727.46	198.21	8.05	824.12
16	731.96	248.85	13.50	829.19
32	736.24	381.24	24.35	834.04

0 replies

Lowbeee · 2026-02-05T04:09:16Z

Lowbeee
Feb 5, 2026

Qwen3-32B Performance Benchmark on Ascend 910B2

Environment

Device: Huawei Ascend 910B2
OS: Ubuntu 18.04.4 LTS
Docker Image: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend:v0.5.0
Chitu Version: v0.5.0
Model: Qwen3-32B
Graph Mode: True (NPU Graph enabled)
Network Mode: Host

Benchmark Configuration

Iterations: 10
Warmup: 1
Input Length: 128 tokens
Output Length: 1024 tokens

Results

Batch Size	TPS (Output tokens/s)	TTFT (ms)	TPOT (ms)	Total Token Throughput (tokens/s)
1	33.24	221.24	31.67	34.13
2	64.64	219.54	28.34	75.75
4	121.67	244.26	31.24	135.45
8	238.68	264.35	31.17	270.86
16	429.21	376.65	34.98	484.58
32	740.32	844.98	41.42	840.54
64	1310.53	1615.76	43.34	1471.79
128	2034.13	3081.56	53.78	2323.92

0 replies

mushang0 · 2026-02-22T06:55:45Z

mushang0
Feb 22, 2026

Chitu Engine Performance Evaluation Report

1. Hardware and Software Configuration

GPU: 1 × NVIDIA GeForce RTX 5090 (32GB VRAM, sm_90a architecture).
Model: Qwen/Qwen2.5-7B-Instruct.
OS: Linux/Ubuntu (AutoDL Container Environment).
Software Stack: Python 3.12, PyTorch 2.8.0, CUDA 12.8.
Network: AutoDL internal network with academic proxy acceleration for dependency retrieval.

2. Chitu Version

Version: Built from source (latest master branch, chitu==0.5.1). Due to cloud container restrictions (Docker-in-Docker limits), the engine was compiled locally from source to fully optimize underlying CUDA operators and FlashAttention-2 for the sm_90a architecture.
PR Link: N/A (Tested on the official, unmodified codebase).

3. Evaluation Methodology

Benchmark Tool: benchmarks/benchmark_serving.py.
Server Configuration: infer.max_reqs=128, infer.max_seq_len=4096, infer.use_cuda_graph=True, infer.cache_type=paged.
Input / Output Length: Input Length = 128, Output Length = 1024.
Batch Size: Tested incrementally at 1, 2, 4, 8, 16, 32, and 64.
Iterations: 3 Warmup Iterations, followed by 10 Benchmark Iterations.

4. Performance and Accuracy Data

[Performance Data]
The system demonstrated excellent scalability on the RTX 5090, peaking at over 3000 tokens/second at a batch size of 64 without encountering Out-Of-Memory (OOM) errors. The specific metrics are as follows:

Batch Size	Output Token Throughput (tok/s)	Mean TTFT (ms)	Mean TPOT (ms)
1	99.40	37.29	10.01
2	191.55	54.59	10.36
4	375.85	93.87	10.50
8	735.04	156.16	10.64
16	1326.89	207.45	11.69
32	2348.16	364.44	12.94
64	3013.92	543.76	19.94

[Accuracy Data]

Note: This evaluation specifically focused on AI infrastructure-level serving throughput and latency.
Conclusion: Standard accuracy evaluation benchmarks (e.g., MMLU, GSM8K) were not executed. Since no quantization (e.g., INT8/INT4) or lossy optimizations were applied during the engine compilation, the inference accuracy is expected to align with the original HuggingFace Transformers implementation, although this was not measured.

0 replies

mty1996 · 2026-03-01T09:33:39Z

mty1996
Mar 1, 2026

Hardware and Software Configuration:
●GPU: 1x NVIDIA Bxx
●Model:Qwen3-32B
●OS: Ubuntu 24.04.2 LTS
●CUDA / PyTorch: PyTorch: 2.7.0a0+79aa17489c.nv25.04, CUDA: 12.9
●Other: numpy downgraded to 1.26.4 to avoid ABI conflicts during compilation.
Chitu Version:
public-main branch as of Feb 2026

Evaluation Methodology:
●Script: benchmarks/benchmark_serving.py
●Input Length: 128 tokens
●Output Length: 1024 tokens
●Batch Size: 1-64
●Engine Config: tp_size=1, pp_size=1, infer.use_cuda_graph=True
python benchmarks/benchmark_serving.py \
    --model "Qwen3-32B" \
    --batch-size 1 \
    --iterations 1 \
    --input-len 128 \
    --output-len 1024 \
    --warmup 1 \
    --base-url "http://localhost:21015" \
    --tokenizer-path /ssd1/models/Qwen3-32B

📝 Description (originally in Chinese)
Hardware and software configuration:
●GPU: 1x NVIDIA Bxx (single-GPU test)
●Test model: Qwen3-32B
●OS: Ubuntu 24.04.2 LTS
●Framework: PyTorch: 2.7.0a0+79aa17489c.nv25.04, CUDA: 12.9
●Notes: when compiling the C++ operators, compatibility problems with numpy 2.0+ were worked around manually, and the metrics port was left empty to resolve an "Address already in use" conflict.
Chitu version:
public-main branch as of Feb 2026
Test methodology:
●The official benchmark_serving.py script was used, with CUDA Graph enabled.
●Input length 128, output length 1024, tensor parallelism TP=1.

bs	TPS	TTFT	TPOT	Total Token throughput
1	71.93	63.43	13.85	81.49
2	134.29	67.44	14.84	152.15
4	265.09	109.77	14.99	300.34
8	519.81	120.93	15.28	588.91
16	1039.43	155.35	15.24	1177.55
32	1910.23	249.96	16.5	2164.3
64	3341.47	521.26	18.62	3785.75
128	5126.77	804.17	24.12	5808.81

0 replies

chatJohn · 2026-03-06T03:21:28Z

chatJohn
Mar 6, 2026

Performance of the Qwen2.5-7B-Instruct Model in `chitu`

Hardware && Environment

GPU RTX 5090(32GB) * 1

CPU 25 vCPU Intel(R) Xeon(R) Platinum 8470Q

Memory 90GB

torch 2.10.0+cu128

Python 3.12(ubuntu22.04)

Other dependencies: same as requirement_build.txt

chitu v0.5.1 latest

Benchmark Configuration

Model: Qwen2.5-7B-Instruct
Batch Size: 1
Input Length: 128
Output Length: 1024
Iterations: 10
Warmup Iterations: 3
Base URL: http://localhost:21002

Launch Command

torchrun --nnodes 1 \
   --nproc_per_node 1 \
   --master_port=22525 \
   -m chitu \
   serve.port=21002 \
   infer.cache_type=paged \
   infer.pp_size=1 \
   infer.tp_size=1 \
   models=Qwen2.5-7B-Instruct \
   models.ckpt_dir=/model/qwen/qwen2.5 \
   infer.mla_absorb=absorb-without-precomp \
   infer.raise_lower_bit_float_to=bfloat16 \
   infer.max_reqs=4 \
   infer.max_seq_len=1024 \
   request.max_new_tokens=100 \
   infer.use_cuda_graph=True

Experiment Results

Serving Benchmark Result

Metrics	Successful requests	Benchmark duration (s)	Total input tokens	Total generated tokens	Request throughput (req/s)	Output token throughput (tok/s)	Total Token throughput (tok/s)
Value	10	87.42	1570	8670	0.11	99.17	117.13

Time to First Token

Metrics	Mean TTFT (ms)	Median TTFT (ms)	P99 TTFT (ms)
Value	37.82	35.12	52.09

Time per Output Token (excl. 1st token)

Metrics	Mean TPOT (ms)	Median TPOT (ms)	P99 TPOT (ms)
Value	10.03	10.03	10.06

Inter-token Latency

Metrics	Mean ITL (ms)	Median ITL (ms)	P99 ITL (ms)
Value	12.12	10.01	12.44
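
The two throughput figures in the Serving Benchmark Result table follow directly from the token totals and the benchmark duration. The sketch below recomputes them as a sanity check. The function name is hypothetical, and the formula is the usual definition of these metrics rather than something taken from the benchmark_serving.py source.

```python
def serving_throughput(total_input_tokens, total_generated_tokens, duration_s):
    """Return (output tok/s, total tok/s) for a benchmark run."""
    output_tps = total_generated_tokens / duration_s
    total_tps = (total_input_tokens + total_generated_tokens) / duration_s
    return output_tps, total_tps


# Figures from the Serving Benchmark Result table above
output_tps, total_tps = serving_throughput(1570, 8670, 87.42)
print(f"output {output_tps:.2f} tok/s, total {total_tps:.2f} tok/s")
```

Both values agree with the reported 99.17 and 117.13 tok/s to within rounding of the duration.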

0 replies

pandaman176 · 2026-03-14T02:29:12Z

pandaman176
Mar 14, 2026

Environment

8 × NVIDIA GeForce RTX 4090 D (24 GB)
PCIe Gen4 x16, no NVLink
Dual-NUMA topology (4 GPUs per socket)
OS: Ubuntu 20.04.5 LTS (Kernel 5.4.0-208)
Model: Qwen2.5-7B-Instruct
chitu Version: v0.5.2

Server Config

tp-size=4, dp-size=2
cache_type=paged
use_cuda_graph=True
max_reqs=32
max_seq_len=4096
max_new_tokens=100

Bench Config

batch-size 1->64
iterations 10
input-len 128
output-len 1024
warmup 3

Result

bs	TPS (tok/s)	TTFT (ms)	TPOT (ms)	Total Token throughput (tok/s)
1	109.41	41.98	9.10	126.22
2	198.13	52.81	10.04	228.54
4	390.53	68.35	10.17	450.49
8	589.33	2071.20	11.53	679.77
16	1470.21	145.22	10.69	1695.86
32	2899.27	165.83	10.76	3344.23
64	2909.76	5722.86	10.74	3356.39

0 replies

Van2003319 · 2026-03-20T08:31:31Z

Van2003319
Mar 20, 2026

GPU: NVIDIA H100
OS: Ubuntu 22.04
chitu version: v0.4.2

Model: Qwen3-32B
Precision: bf16
input_len: 128
output_len: 1024

bs	TPS	TTFT	TPOT	Total Token Throughput
1	37.10	185.00	26.80	42.03
2	71.04	184.00	28.00	80.49
4	138.25	370.00	28.60	156.65
8	262.02	575.00	30.00	296.87
16	475.45	1110.00	32.60	538.59
32	806.40	2170.00	37.60	913.65
64	1109.14	4050.00	53.80	1256.65

0 replies

Replies: 46 comments · 1 reply
