
Frequently Asked Questions (FAQ)

Q1: How to make Chitu support our model?

If you're a large model developer seeking Chitu compatibility for your model:

  1. Submit a Pull Request - our team will review and merge after confirmation (see CONTRIBUTING)
  2. For technical difficulties, contact our support team at solution@chitu.ai

Q2: How to make Chitu support our chip?

If you're developing or using an unsupported chip architecture:

  1. Submit a Pull Request for review (see CONTRIBUTING)
  2. For adaptation challenges, email solution@chitu.ai

Q3: How to run FP4/FP8 models without native FP4/FP8 compute units?

Solution: store the weights in FP8 but run the computation in BF16, similar to w8a16 quantization except that the "8" here refers to float8 rather than int8.
Note: floating-point conversion is technically more involved than integer conversion; the details are explained in this Zhihu article.
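
For illustration, here is a minimal PyTorch sketch of this w8a16 pattern. The function name and the per-tensor scale layout are assumptions for the example, not Chitu's actual implementation, and it requires a PyTorch version with float8 dtypes (2.1+):

```python
import torch

# "w8a16"-style execution: weights live in FP8 (float8_e4m3fn) to cut
# memory and bandwidth, but the matmul itself runs in BF16, so no native
# FP8 compute units are needed. Names and the per-tensor scale layout
# are illustrative, not Chitu's actual implementation.

def linear_w8a16(x_bf16: torch.Tensor,
                 w_fp8: torch.Tensor,
                 scale: torch.Tensor) -> torch.Tensor:
    # Up-cast the FP8 weights to BF16 and undo the quantization scale,
    # then compute entirely in BF16.
    w_bf16 = w_fp8.to(torch.bfloat16) * scale
    return x_bf16 @ w_bf16.t()

# Quantize a BF16 weight matrix to FP8 with a per-tensor scale.
w = torch.randn(128, 256, dtype=torch.bfloat16)
scale = (w.abs().max() / torch.finfo(torch.float8_e4m3fn).max).to(torch.bfloat16)
w_fp8 = (w / scale).to(torch.float8_e4m3fn)

x = torch.randn(4, 256, dtype=torch.bfloat16)
print(linear_w8a16(x, w_fp8, scale).shape)  # torch.Size([4, 128])
```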

Q4: Why does FP4/FP8 sometimes accelerate performance beyond just compute savings?

Quantization typically improves the cost-performance ratio rather than raw speed, but in some cases it delivers both compute savings and a genuine speedup. This Zhihu analysis explains when such cases occur.
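
A back-of-envelope illustration of one way this can happen: during decoding, each generated token must stream the full weight set from memory, so when inference is memory-bandwidth bound, halving the bytes per weight roughly halves the per-token lower bound. The model size and bandwidth below are assumed numbers, not measurements:

```python
# Hypothetical figures: a 70B-parameter model on an accelerator with
# 2 TB/s of memory bandwidth. When decoding is memory-bound, per-token
# latency is bounded below by (weight bytes) / (bandwidth).
params = 70e9          # assumed parameter count
bandwidth = 2.0e12     # assumed memory bandwidth, bytes/s

for fmt, bytes_per_param in [("BF16", 2), ("FP8", 1)]:
    latency_ms = params * bytes_per_param / bandwidth * 1e3
    print(f"{fmt}: ~{latency_ms:.0f} ms/token lower bound")
# FP8 halves the bytes moved, so a memory-bound decode step can speed up
# by up to ~2x even though the arithmetic itself still runs in BF16.
```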

Q5: How does Chitu differ from vLLM/SGLang/llama.cpp?

Chitu complements rather than replicates existing solutions by focusing on:

  1. Native support for non-NVIDIA chips (e.g., Ascend/Muxi/Hygon)
  2. Seamless scalability from minimal to large-scale deployments

Q6: What are the ideal use cases for Chitu?

Consider Chitu if you:

  1. Use non-NVIDIA chips (Ascend/Muxi/Hygon/etc.)
  2. Employ heterogeneous computing (mixed chips)
  3. Require high-performance inference
  4. Seek cost-efficient deployment
  5. Conduct research on inference frameworks

Q7: Does Chitu support CPU-only or CPU+GPU inference?

CPU+GPU heterogeneous inference: supported since v0.2.2
CPU-only inference: planned