Commit 2ed960b

Add warning about truncation to ML documentation (#11981)
* Add warning about truncation to ML documentation

  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Apply suggestions from code review

  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* Apply suggestion from @kolchfa-aws

  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
1 parent e04712e commit 2ed960b

File tree

3 files changed: +10 -1 lines changed


_ingest-pipelines/processors/text-embedding.md

Lines changed: 3 additions & 0 deletions

````diff
@@ -15,6 +15,9 @@ The `text_embedding` processor is used to generate vector embeddings from text f
 Before using the `text_embedding` processor, you must set up a machine learning (ML) model. For more information, see [Choosing a model]({{site.url}}{{site.baseurl}}/ml-commons-plugin/integrating-ml-models/#choosing-a-model).
 {: .note}
 
+**Token limits and truncation**: Text embedding models have maximum token limits (typically 512 tokens for BERT-based models). When a document exceeds this limit, the model automatically truncates the text, and the truncated content is not represented in the embeddings. This can significantly impact search relevance because documents may not be returned in search results if the relevant content was truncated. To avoid this issue, split long documents into smaller chunks before generating embeddings.
+{: .warning}
+
 The following is the syntax for the `text_embedding` processor:
 
 ```json
````
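The chunking advice in the warning above can be applied directly in an ingest pipeline by placing a `text_chunking` processor before the `text_embedding` processor, so each chunk stays under the model's token limit before it is embedded. The following is a sketch, not taken from this commit: the pipeline name and field names are placeholders, `<your_model_id>` must be replaced with a deployed model ID, and the `token_limit` of 384 is an illustrative value chosen to stay below a typical 512-token limit.

```json
PUT /_ingest/pipeline/chunk-then-embed
{
  "description": "Split long text into chunks, then embed each chunk",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 384,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "<your_model_id>",
        "field_map": {
          "passage_chunk": "passage_chunk_embedding"
        }
      }
    }
  ]
}
```

Because the chunks overlap slightly (`overlap_rate`), content near a chunk boundary is represented in at least one embedding rather than being lost to truncation.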

_ml-commons-plugin/pretrained-models.md

Lines changed: 4 additions & 1 deletion

```diff
@@ -23,7 +23,10 @@ Running local models on the CentOS 7 operating system is not supported. Moreover
 
 Sentence transformer models map sentences and paragraphs across a dimensional dense vector space. The number of vectors depends on the type of model. You can use these models for use cases such as clustering or semantic search.
 
-The following table provides a list of sentence transformer models and artifact links you can use to download them. Note that you must prefix the model name with `huggingface/`, as shown in the **Model name** column.
+The following table provides a list of sentence transformer models and artifact links you can use to download them. Note that you must prefix the model name with `huggingface/`, as shown in the **Model name** column.
+
+**Token limits and truncation**: Text embedding models have maximum token limits (typically 512 tokens for BERT-based models). When a document exceeds this limit, the model automatically truncates the text, and the truncated content is not represented in the embeddings. This can significantly impact search relevance because documents may not be returned in search results if the relevant content was truncated. To avoid this issue, split long documents into smaller chunks before generating embeddings.
+{: .warning}
 
 | Model name | Version | Vector dimensions | Auto-truncation | TorchScript artifact | ONNX artifact |
 |:---|:---|:---|:---|:---|:---|
```
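The `huggingface/` prefix mentioned in the changed paragraph is used when registering a pretrained model through the ML Commons API. The following is an illustrative sketch, not part of this commit: the model name and format follow the registration API's conventions, but the version shown is an example, so check it against the version listed in the pretrained models table before using it.

```json
POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/all-MiniLM-L6-v2",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}
```

The response contains a task ID that you can poll to obtain the registered model's ID once the registration completes.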

_vector-search/getting-started/auto-generated-embeddings.md

Lines changed: 3 additions & 0 deletions

```diff
@@ -46,6 +46,9 @@ In this example, you'll use the [DistilBERT](https://huggingface.co/docs/transfo
 Take note of the dimensionality of the model because you'll need it when you set up a vector index.
 {: .important}
 
+**Token limits and truncation**: Text embedding models have maximum token limits (typically 512 tokens for BERT-based models). When a document exceeds this limit, the model automatically truncates the text, and the truncated content is not represented in the embeddings. This can significantly impact search relevance because documents may not be returned in search results if the relevant content was truncated. To avoid this issue, split long documents into smaller chunks before generating embeddings.
+{: .warning}
+
 ## Manual setup
 
 For more control over the configuration, you can set up each component manually using the following steps.
```
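The surrounding context notes that you must take note of the model's dimensionality when setting up a vector index. As a hedged sketch, not part of this commit: the index name, field names, and pipeline name below are placeholders, and the `dimension` of 768 is only an example (many DistilBERT-based sentence transformers produce 768-dimensional vectors, but you should verify the value against your specific model).

```json
PUT /my-vector-index
{
  "settings": {
    "index.knn": true,
    "default_pipeline": "<your_ingest_pipeline>"
  },
  "mappings": {
    "properties": {
      "passage_text": {
        "type": "text"
      },
      "passage_embedding": {
        "type": "knn_vector",
        "dimension": 768
      }
    }
  }
}
```

If the `dimension` here does not match the embedding model's output dimensionality, indexing documents through the embedding pipeline will fail, which is why the documentation flags it as important.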
