3 changes: 3 additions & 0 deletions _ingest-pipelines/processors/text-embedding.md
@@ -15,6 +15,9 @@ The `text_embedding` processor is used to generate vector embeddings from text f
Before using the `text_embedding` processor, you must set up a machine learning (ML) model. For more information, see [Choosing a model]({{site.url}}{{site.baseurl}}/ml-commons-plugin/integrating-ml-models/#choosing-a-model).
{: .note}

**Token limits and truncation**: Text embedding models have maximum token limits (typically 512 tokens for BERT-based models). When a document exceeds this limit, the model automatically truncates the text, and the truncated content is not represented in the embeddings. This can significantly impact search relevance because documents may not be returned in search results if the relevant content was truncated. To avoid this issue, split long documents into smaller chunks before generating embeddings.
{: .warning}
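To stay under the model's token limit, long text can be chunked in the same ingest pipeline before embedding. The following is a minimal sketch (pipeline, field names, and token limit are illustrative; the `text_chunking` processor is available in OpenSearch 2.13 and later, and `<model_id>` is a placeholder for your deployed model):

```json
PUT /_ingest/pipeline/chunk-and-embed-pipeline
{
  "description": "Splits long text into chunks, then generates an embedding for each chunk",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 384,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunks"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "<model_id>",
        "field_map": {
          "passage_chunks": "passage_embeddings"
        }
      }
    }
  ]
}
```

Because each chunk stays within the model's token limit, no content is silently dropped from the embeddings.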

The following is the syntax for the `text_embedding` processor:

```json
5 changes: 4 additions & 1 deletion _ml-commons-plugin/pretrained-models.md
@@ -23,7 +23,10 @@ Running local models on the CentOS 7 operating system is not supported. Moreover

Sentence transformer models map sentences and paragraphs to a dense vector space. The number of dimensions depends on the type of model. You can use these models for use cases such as clustering or semantic search.

The following table provides a list of sentence transformer models and artifact links you can use to download them. Note that you must prefix the model name with `huggingface/`, as shown in the **Model name** column.

**Token limits and truncation**: Text embedding models have maximum token limits (typically 512 tokens for BERT-based models). When a document exceeds this limit, the model automatically truncates the text, and the truncated content is not represented in the embeddings. This can significantly impact search relevance because documents may not be returned in search results if the relevant content was truncated. To avoid this issue, split long documents into smaller chunks before generating embeddings.
{: .warning}
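For example, a model from the table below can be registered by prefixing its name with `huggingface/` (the model name, version, and format shown here are illustrative; substitute values from the table):

```json
POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/all-MiniLM-L6-v2",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}
```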

| Model name | Version | Vector dimensions | Auto-truncation | TorchScript artifact | ONNX artifact |
|:---|:---|:---|:---|:---|:---|
3 changes: 3 additions & 0 deletions _vector-search/getting-started/auto-generated-embeddings.md
@@ -46,6 +46,9 @@ In this example, you'll use the [DistilBERT](https://huggingface.co/docs/transfo
Take note of the dimensionality of the model because you'll need it when you set up a vector index.
{: .important}

**Token limits and truncation**: Text embedding models have maximum token limits (typically 512 tokens for BERT-based models). When a document exceeds this limit, the model automatically truncates the text, and the truncated content is not represented in the embeddings. This can significantly impact search relevance because documents may not be returned in search results if the relevant content was truncated. To avoid this issue, split long documents into smaller chunks before generating embeddings.
{: .warning}
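When you later create the vector index, the `dimension` of the `knn_vector` field must match the model's output dimensionality. A sketch, assuming a 768-dimensional model such as the DistilBERT model used in this example (index and field names are illustrative):

```json
PUT /my-nlp-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "passage_embedding": {
        "type": "knn_vector",
        "dimension": 768
      }
    }
  }
}
```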

## Manual setup

For more control over the configuration, you can set up each component manually using the following steps.