TurboMind is one of the inference engines of LMDeploy. When using it for model inference, you need to convert the input model into a TurboMind model. Besides the model weight files, the TurboMind model folder also contains a few other files, the most important of which is the configuration file `triton_models/weights/config.ini`, since it is closely related to inference performance.
If you are using LMDeploy version 0.0.x, please refer to the turbomind 1.0 config section to learn the relevant content in the configuration. Otherwise, please read turbomind 2.0 config to familiarize yourself with the configuration details.
Take the `llama-2-7b-chat` model as an example. In TurboMind 2.x, its `config.ini` content is as follows:

```ini
[llama]
model_name = "llama2"
tensor_para_size = 1
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
session_len = 4104
weight_type = "fp16"
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
group_size = 0
max_batch_size = 64
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.5
cache_block_seq_len = 128
cache_chunk_size = 1
enable_prefix_caching = false
quant_policy = 0
max_position_embeddings = 2048
rope_scaling_factor = 0.0
use_logn_attn = 0
```

These parameters are composed of model attributes and inference parameters. Model attributes include the number of layers, the number of heads, the dimensions, and so on; they cannot be modified.

```ini
model_name = "llama2"
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
```

Compared with TurboMind 1.0, the model attributes in the config remain the same, while the inference parameters have changed. In the following sections, we will focus on introducing the inference parameters.
`weight_type` and `group_size` are the parameters related to quantization; they cannot be modified.
`weight_type` specifies the data type of the weights. Currently, `fp16` and `int4` are supported, where `int4` denotes 4-bit quantized weights. When `weight_type` is `int4`, `group_size` means the group size used when quantizing the weights with AWQ. The LMDeploy prebuilt packages include kernels for `group_size = 128`.
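If you want to check these fields programmatically, here is a minimal sketch using Python's standard `configparser`; the workspace path `./workspace` is an assumed example and should be replaced with your converted model folder.

```python
import configparser

# Assumed example path: replace with your converted TurboMind model folder.
config_path = "./workspace/triton_models/weights/config.ini"

parser = configparser.ConfigParser()
parser.read(config_path)
llama = parser["llama"]

weight_type = llama["weight_type"].strip('"')  # values are quoted in config.ini
group_size = int(llama["group_size"])

print("weight_type:", weight_type)  # fp16 or int4
if weight_type == "int4":
    # group_size is the AWQ quantization group size; the LMDeploy
    # prebuilt packages ship kernels for group_size = 128.
    print("group_size:", group_size)
```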
The maximum batch size is still set through `max_batch_size`. However, its default value has changed from 32 to 64, and `max_batch_size` is no longer tied to `cache_max_entry_count`.
The k/v cache memory is determined by `cache_block_seq_len` and `cache_max_entry_count`.

TurboMind 2.x implements Paged Attention, managing the k/v cache in blocks.

`cache_block_seq_len` represents the length of the token sequence in a k/v block, with a default value of 128. TurboMind calculates the memory size of a k/v block according to the following formula:

`cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 * sizeof(kv_data_type)`
For the llama2-7b model, when storing k/v as the `half` type, the memory of one k/v block is `128 * 32 * 32 * 128 * 2 * sizeof(half) = 64MB`.
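The arithmetic can be double-checked with a few lines of Python; the numbers below come from the llama-2-7b config shown earlier.

```python
# Memory footprint of one k/v block for llama2-7b (half-precision k/v).
cache_block_seq_len = 128
num_layer = 32
kv_head_num = 32
size_per_head = 128
kv_factor = 2     # one tensor for k, one for v
sizeof_half = 2   # bytes per half-precision value

block_bytes = (cache_block_seq_len * num_layer * kv_head_num
               * size_per_head * kv_factor * sizeof_half)
print(block_bytes / 2**20, "MB")  # -> 64.0 MB
```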
The meaning of `cache_max_entry_count` varies depending on its value (the arithmetic is sketched after this list):

- When it is a decimal between (0, 1), `cache_max_entry_count` represents the percentage of total GPU memory used by k/v blocks. For example, if TurboMind launches on an A100-80G GPU with `cache_max_entry_count` being `0.5`, the total memory used by the k/v blocks is `80 * 0.5 = 40G`.
  - When LMDeploy is greater than v0.2.1, `cache_max_entry_count` determines the percentage of free GPU memory used for k/v blocks, defaulting to `0.8`. For example, with TurboMind on an A100-80G GPU running a 13B model, the memory for k/v blocks would be `(80 - 26) * 0.8 = 43.2G`, utilizing 80% of the free 54G.
- When it is an integer > 0, it represents the total number of k/v blocks.
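Continuing with the 64 MB block size computed above, this sketch illustrates how the two percentage interpretations translate into a k/v block budget; the GPU and model sizes are the ones from the examples.

```python
block_mb = 64  # one llama2-7b k/v block, computed above

# Before v0.2.1: a decimal means a fraction of *total* GPU memory.
total_gb = 80                                 # A100-80G
kv_budget_gb = total_gb * 0.5                 # cache_max_entry_count = 0.5
print(kv_budget_gb, "GB ->", int(kv_budget_gb * 1024 / block_mb), "blocks")

# Since v0.2.1: a decimal means a fraction of *free* GPU memory.
free_gb = 80 - 26                             # 13B fp16 weights take ~26 GB
kv_budget_gb = free_gb * 0.8                  # default 0.8
print(round(kv_budget_gb, 1), "GB for k/v blocks")  # -> 43.2 GB
```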
`cache_chunk_size` indicates the number of k/v cache blocks to allocate each time new blocks are needed. Different values have different meanings (see the sketch after this list):

- When it is an integer > 0, `cache_chunk_size` k/v cache blocks are allocated.
- When the value is `-1`, `cache_max_entry_count` k/v cache blocks are allocated.
- When the value is `0`, `sqrt(cache_max_entry_count)` k/v cache blocks are allocated.
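The three cases can be made explicit with a small helper; this is an illustrative sketch of the rule above, not TurboMind's actual allocator code, and it assumes `cache_max_entry_count` has already been resolved to a block count.

```python
import math

def blocks_to_allocate(cache_chunk_size: int, cache_max_entry_count: int) -> int:
    """How many k/v cache blocks to allocate when new blocks are needed."""
    if cache_chunk_size > 0:
        return cache_chunk_size
    if cache_chunk_size == -1:
        return cache_max_entry_count
    if cache_chunk_size == 0:
        return int(math.sqrt(cache_max_entry_count))
    raise ValueError(f"unexpected cache_chunk_size: {cache_chunk_size}")

print(blocks_to_allocate(1, 640))   # -> 1 block per allocation (the default)
print(blocks_to_allocate(-1, 640))  # -> all 640 blocks at once
print(blocks_to_allocate(0, 640))   # -> 25 blocks (sqrt(640) rounded down)
```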
The prefix caching feature is controlled by the `enable_prefix_caching` parameter: `true` enables the feature and `false` disables it. The default value is `false`.

Prefix caching mainly applies to scenarios where multiple requests share the same prompt prefix (such as a system prompt). The k/v blocks of this identical prefix are cached and reused by multiple requests, saving the overhead of redundant computation and improving inference performance. The longer the identical prompt prefix, the greater the performance improvement.

Since a k/v block is the smallest granularity of reuse in prefix caching, there will be no improvement in inference performance if the identical prompt prefix is shorter than one block (prefix length < `cache_block_seq_len`).
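Because reuse happens at block granularity, the number of blocks a shared prefix can contribute is simply its length divided by `cache_block_seq_len`, rounded down, as the following sketch shows.

```python
def reusable_blocks(prefix_len: int, cache_block_seq_len: int = 128) -> int:
    """k/v blocks that prefix caching can reuse for a shared prompt prefix."""
    return prefix_len // cache_block_seq_len

print(reusable_blocks(100))  # -> 0: prefix shorter than one block, no gain
print(reusable_blocks(500))  # -> 3: 384 of the 500 prefix tokens are reused
```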
- `quant_policy = 4` means 4-bit k/v quantization and inference
- `quant_policy = 8` means 8-bit k/v quantization and inference

Please refer to kv quant for a detailed guide.
By setting `rope_scaling_factor = 1.0`, you can enable the Dynamic NTK option of RoPE, which allows the model to handle long-text input and output (a numeric sketch follows the reference links below).
Regarding the principle of Dynamic NTK, please refer to:
- https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases
- https://kexue.fm/archives/9675
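For intuition, one common formulation of Dynamic NTK (the one described in the links above) rescales the RoPE base once the sequence outgrows the trained context. The sketch below follows that formulation with the llama-2-7b values from the config; TurboMind's internal implementation may differ in details.

```python
def dynamic_ntk_base(seq_len: int,
                     scaling_factor: float = 1.0,
                     rope_theta: float = 10000.0,
                     max_position_embeddings: int = 2048,
                     dim: int = 128) -> float:
    """RoPE base rescaled by the Dynamic NTK rule (one common formulation)."""
    if seq_len <= max_position_embeddings:
        return rope_theta
    alpha = (scaling_factor * seq_len / max_position_embeddings
             - (scaling_factor - 1))
    return rope_theta * alpha ** (dim / (dim - 2))

print(dynamic_ntk_base(2048))  # -> 10000.0, within the trained context
print(dynamic_ntk_base(4096))  # -> ~20221, base grows to stretch frequencies
```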
You can also turn on LogN attention scaling by setting `use_logn_attn = 1`.
Taking the `llama-2-7b-chat` model as an example, in TurboMind 1.0, its `config.ini` content is as follows:

```ini
[llama]
model_name = "llama2"
tensor_para_size = 1
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
session_len = 4104
weight_type = "fp16"
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
group_size = 0
max_batch_size = 32
max_context_token_num = 4
step_length = 1
cache_max_entry_count = 48
cache_chunk_size = 1
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 2048
use_dynamic_ntk = 0
use_logn_attn = 0
```

These parameters are composed of model attributes and inference parameters. Model attributes include the number of layers, the number of heads, the dimensions, and so on; they cannot be modified.

```ini
model_name = "llama2"
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
```

In the following sections, we will focus on introducing the inference parameters.
`weight_type` and `group_size` are the parameters related to quantization; they cannot be modified.
`weight_type` specifies the data type of the weights. Currently, `fp16` and `int4` are supported, where `int4` denotes 4-bit quantized weights. When `weight_type` is `int4`, `group_size` means the group size used when quantizing the weights with AWQ. The LMDeploy prebuilt packages include kernels for `group_size = 128`.
`max_batch_size` determines the maximum size of a batch during inference. In general, the larger the batch size, the higher the throughput. But make sure that `max_batch_size <= cache_max_entry_count`.
TurboMind allocates k/v cache memory based on `session_len`, `cache_chunk_size`, and `cache_max_entry_count` (a sizing sketch follows the list below):

- `session_len` denotes the maximum length of a sequence, i.e., the size of the context window.
- `cache_chunk_size` indicates the number of k/v sequence slots to allocate when new sequences are added.
- `cache_max_entry_count` signifies the maximum number of k/v sequences that can be cached.
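If each cached sequence reserves k/v memory for the full `session_len` (the pre-paged-attention layout), the per-sequence footprint can be estimated as follows; the numbers are the llama-2-7b values from the config above, assuming half-precision k/v.

```python
# Approximate k/v memory reserved per cached sequence in TurboMind 1.0.
session_len = 4104
num_layer = 32
kv_head_num = 32
size_per_head = 128
kv_factor = 2     # k and v
sizeof_half = 2   # bytes per half-precision value

per_seq_bytes = (session_len * num_layer * kv_head_num
                 * size_per_head * kv_factor * sizeof_half)
print(round(per_seq_bytes / 2**30, 2), "GB per sequence")  # -> ~2.0 GB
```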
When initiating 8-bit k/v inference, change `quant_policy = 4` and `use_context_fmha = 0`. Please refer to kv int8 for a guide.
By setting `use_dynamic_ntk = 1`, you can enable the Dynamic NTK option of RoPE, which allows the model to handle long-text input and output.
Regarding the principle of Dynamic NTK, please refer to:
- https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases
- https://kexue.fm/archives/9675
You can also turn on LogN attention scaling by setting `use_logn_attn = 1`.