Commit 2ed960b

Add warning about truncation to ML documentation (#11981)
* Add warning about truncation to ML documentation

  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Apply suggestions from code review

  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* Apply suggestion from @kolchfa-aws

  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
1 parent e04712e commit 2ed960b

File tree

3 files changed: +10 -1 lines changed


_ingest-pipelines/processors/text-embedding.md

Lines changed: 3 additions & 0 deletions

````diff
@@ -15,6 +15,9 @@ The `text_embedding` processor is used to generate vector embeddings from text f
 Before using the `text_embedding` processor, you must set up a machine learning (ML) model. For more information, see [Choosing a model]({{site.url}}{{site.baseurl}}/ml-commons-plugin/integrating-ml-models/#choosing-a-model).
 {: .note}
 
+**Token limits and truncation**: Text embedding models have maximum token limits (typically 512 tokens for BERT-based models). When a document exceeds this limit, the model automatically truncates the text, and the truncated content is not represented in the embeddings. This can significantly impact search relevance because documents may not be returned in search results if the relevant content was truncated. To avoid this issue, split long documents into smaller chunks before generating embeddings.
+{: .warning}
+
 The following is the syntax for the `text_embedding` processor:
 
 ```json
````
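The chunking advice in the warning above can be applied directly in an ingest pipeline by placing a `text_chunking` processor before the `text_embedding` processor, so each chunk stays under the model's token limit before it is embedded. The following is a sketch, not taken from this commit: the pipeline name and field names are placeholders, `<your_model_id>` must be replaced with a deployed model ID, and the `token_limit` of 384 is an illustrative value chosen to stay below a typical 512-token limit.

```json
PUT /_ingest/pipeline/chunk-then-embed
{
  "description": "Split long text into chunks, then embed each chunk",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 384,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "<your_model_id>",
        "field_map": {
          "passage_chunk": "passage_chunk_embedding"
        }
      }
    }
  ]
}
```

Because the chunks overlap slightly (`overlap_rate`), content near a chunk boundary is represented in at least one embedding rather than being lost to truncation.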

_ml-commons-plugin/pretrained-models.md

Lines changed: 4 additions & 1 deletion

```diff
@@ -23,7 +23,10 @@ Running local models on the CentOS 7 operating system is not supported. Moreover
 
 Sentence transformer models map sentences and paragraphs across a dimensional dense vector space. The number of vectors depends on the type of model. You can use these models for use cases such as clustering or semantic search.
 
-The following table provides a list of sentence transformer models and artifact links you can use to download them. Note that you must prefix the model name with `huggingface/`, as shown in the **Model name** column.
+The following table provides a list of sentence transformer models and artifact links you can use to download them. Note that you must prefix the model name with `huggingface/`, as shown in the **Model name** column.
+
+**Token limits and truncation**: Text embedding models have maximum token limits (typically 512 tokens for BERT-based models). When a document exceeds this limit, the model automatically truncates the text, and the truncated content is not represented in the embeddings. This can significantly impact search relevance because documents may not be returned in search results if the relevant content was truncated. To avoid this issue, split long documents into smaller chunks before generating embeddings.
+{: .warning}
 
 | Model name | Version | Vector dimensions | Auto-truncation | TorchScript artifact | ONNX artifact |
 |:---|:---|:---|:---|:---|:---|
```
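The `huggingface/` prefix mentioned in the changed paragraph is used when registering a pretrained model through the ML Commons API. The following is an illustrative sketch, not part of this commit: the model name and format follow the registration API's conventions, but the version shown is an example, so check it against the version listed in the pretrained models table before using it.

```json
POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/all-MiniLM-L6-v2",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}
```

The response contains a task ID that you can poll to obtain the registered model's ID once the registration completes.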

_vector-search/getting-started/auto-generated-embeddings.md

Lines changed: 3 additions & 0 deletions

```diff
@@ -46,6 +46,9 @@ In this example, you'll use the [DistilBERT](https://huggingface.co/docs/transfo
 Take note of the dimensionality of the model because you'll need it when you set up a vector index.
 {: .important}
 
+**Token limits and truncation**: Text embedding models have maximum token limits (typically 512 tokens for BERT-based models). When a document exceeds this limit, the model automatically truncates the text, and the truncated content is not represented in the embeddings. This can significantly impact search relevance because documents may not be returned in search results if the relevant content was truncated. To avoid this issue, split long documents into smaller chunks before generating embeddings.
+{: .warning}
+
 ## Manual setup
 
 For more control over the configuration, you can set up each component manually using the following steps.
```
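The surrounding context notes that you must take note of the model's dimensionality when setting up a vector index. As a hedged sketch, not part of this commit: the index name, field names, and pipeline name below are placeholders, and the `dimension` of 768 is only an example (many DistilBERT-based sentence transformers produce 768-dimensional vectors, but you should verify the value against your specific model).

```json
PUT /my-vector-index
{
  "settings": {
    "index.knn": true,
    "default_pipeline": "<your_ingest_pipeline>"
  },
  "mappings": {
    "properties": {
      "passage_text": {
        "type": "text"
      },
      "passage_embedding": {
        "type": "knn_vector",
        "dimension": 768
      }
    }
  }
}
```

If the `dimension` here does not match the embedding model's output dimensionality, indexing documents through the embedding pipeline will fail, which is why the documentation flags it as important.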
