diff --git a/en/rag/embedding.html b/en/rag/embedding.html
index 06cbaca9b0..eb530e4025 100644
--- a/en/rag/embedding.html
+++ b/en/rag/embedding.html
@@ -576,93 +576,77 @@
 For production deployments, we recommend configuring separate embedder components for feed and search operations.
-This architectural pattern provides two key benefits - cost optimization and rate limit isolation.
-In Vespa Cloud, it's best practice to configure these embedders in separate container clusters for feed and search.
+The voyage-4-nano model is available as an ONNX model for use with the
+Hugging Face embedder.
+Since it shares the same embedding space as the larger Voyage 4 models,
+it can be used for query embeddings with local inference, trading some accuracy for lower cost
+by eliminating API usage for queries entirely.
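+A minimal sketch of such a search-side configuration, using the standard Hugging Face embedder
+elements (transformer-model and tokenizer-model); the artifact URLs below are placeholders,
+not verified model locations:
+{% highlight xml %}
+<container id="search" version="1.0">
+  <component id="nano" type="hugging-face-embedder">
+    <!-- Placeholder URLs: point these at the actual voyage-4-nano ONNX and tokenizer artifacts -->
+    <transformer-model url="https://example.com/voyage-4-nano/model.onnx"/>
+    <tokenizer-model url="https://example.com/voyage-4-nano/tokenizer.json"/>
+  </component>
+</container>
+{% endhighlight %}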
-{% highlight xml %}
-<container id="feed" version="1.0">
-  <component id="voyage" type="voyage-ai-embedder">
-    <model>voyage-4-large</model>
-    <dimensions>1024</dimensions>
-    <api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
-  </component>
-</container>
-
+OpenAI Embedder
-<container id="search" version="1.0">
-  <component id="voyage" type="voyage-ai-embedder">
-    <model>voyage-4-lite</model>
-    <dimensions>1024</dimensions>
-    <api-key-secret-ref>voyage_search_api_key</api-key-secret-ref>
+Available since {% include version.html version="8.678" %}
+
+An embedder that uses the OpenAI embeddings API
+to generate embeddings for semantic search. The embedder can target any OpenAI-compatible API.
+
+{% highlight xml %}
+<container id="default" version="1.0">
+  <component id="openai" type="openai-embedder">
+    <model>text-embedding-3-small</model>
+    <api-key-secret-ref>openai_api_key</api-key-secret-ref>
+    <dimensions>1536</dimensions>
+  </component>
+</container>
-  </component>
-</container>
{% endhighlight %}
-Cost optimization with model variants
-The Voyage 4 model family features a shared embedding space
-across different model sizes. This enables a cost-effective strategy where you can use a more powerful (and expensive) model
-for document embeddings, while using a smaller, cheaper model for query embeddings.
-Since document embedding happens once during indexing but query embedding occurs on every search request,
-this approach can significantly reduce operational costs while maintaining quality.
-
-The voyage-4-nano
- model is available as an ONNX model for use with the
- Hugging Face embedder.
- Since it shares the same embedding space as the larger Voyage 4 models,
- it can be used for query embeddings with local inference, trading some accuracy for lower cost
- by eliminating API usage for queries entirely.
-
-Rate limit isolation
-Separating feed and search operations is particularly important for managing VoyageAI API rate limits.
-Bursty document feeding operations can consume significant API quota, potentially causing rate limit errors
-that affect search queries. By using separate API keys for feed and search embedders,
-you ensure that feeding bursts don't negatively impact search.
+
+- The model element specifies which OpenAI model to use.
+- The api-key-secret-ref element references a secret in Vespa's
+  secret store containing your OpenAI API key.
+  For self-hosted OpenAI-compatible endpoints that do not require authentication, this element can be omitted.
+
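+To use the embedder, reference its id in a schema field whose tensor dimension matches the
+configured dimensions. A minimal sketch with hypothetical document and field names:
+{% highlight txt %}
+schema doc {
+    document doc {
+        field text type string {
+            indexing: summary | index
+        }
+    }
+    # Cell type must be float or bfloat16; x[1536] must match <dimensions>
+    field embedding type tensor<float>(x[1536]) {
+        indexing: input text | embed openai | attribute | index
+    }
+}
+{% endhighlight %}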
-Thread pool tuning
-When using the VoyageAI embedder, container feed throughput is primarily limited by VoyageAI API latency
- combined with the document processing thread pool size, not by CPU. Each document being fed blocks a thread
- while waiting for the VoyageAI API response. To improve throughput, you likely have to increase the
- document processing thread pool size,
- assuming the content cluster is not the bottleneck.
+See the reference
+for all configuration parameters.
-For example, consider a container cluster with 2 nodes, each with 8 vCPUs. With the default document processing
- thread pool size of 1 thread per vCPU, you have 16 total threads. If the average VoyageAI API latency is 200ms,
- the maximum throughput is approximately 16 / 0.2 = 80 documents/second.
- See container tuning for more on container tuning.
+Mistral Embedder
-Note that the effective throughput can never exceed the rate limit of your VoyageAI API key.
- Use the embedder metrics
- to determine embedder latency and throughput.
- For additional throughput improvements, consider enabling dynamic batching.
+Available since {% include version.html version="8.678" %}
-Dynamic batching
-Dynamic batching combines multiple concurrent embedding requests into a single VoyageAI API call.
- This is useful when throughput is constrained by VoyageAI's
- RPM (requests per minute) limit
- rather than the TPM (tokens per minute) limit.
- Batching reduces RPM usage by combining requests; TPM usage is unaffected.
+An embedder that uses the Mistral
+embeddings API to generate embeddings for semantic search.
{% highlight xml %}
-<container id="feed" version="1.0">
-  <component id="voyage" type="voyage-ai-embedder">
-    <model>voyage-4-large</model>
+<container id="default" version="1.0">
+  <component id="mistral" type="mistral-embedder">
+    <model>mistral-embed</model>
+    <api-key-secret-ref>mistral_api_key</api-key-secret-ref>
     <dimensions>1024</dimensions>
-    <api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
-  </component>
-</container>
+  </component>
+</container>
 {% endhighlight %}
-The max-size attribute sets the maximum number of requests in a single batch,
- and max-delay sets the maximum time to wait for a full batch before sending a partial one.
- Batching is disabled by default.
+
+- The model element specifies which Mistral model to use.
+- The api-key-secret-ref element references a secret in Vespa's
+  secret store containing your Mistral API key.
+  This is required for authentication.
+
-The document processing thread pool size
- should be at least max-size, since each thread contributes one request to the batch.
+Mistral supports output quantization on models that offer it, such as codestral-embed.
+ See the reference
+ for all configuration parameters.
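+For example, binary quantization packs the returned embedding into int8 cells, so the destination
+field must use a packed tensor layout. A hypothetical sketch (the 1024-dimension value is
+illustrative, not a confirmed codestral-embed dimension):
+{% highlight xml %}
+<component id="mistral" type="mistral-embedder">
+  <model>codestral-embed</model>
+  <api-key-secret-ref>mistral_api_key</api-key-secret-ref>
+  <dimensions>1024</dimensions>
+  <quantization>binary</quantization>
+</component>
+{% endhighlight %}
+With 1024 binary dimensions packed 8 bits per cell, the matching schema field would be
+tensor<int8>(x[128]).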
Embedder performance
@@ -675,14 +659,21 @@ Embedder performance
The number of inputs to the embed call. When encoding arrays, consider how many inputs a single document can have.
- For CPU inference, increasing feed timeout settings
+ For local CPU inference, increasing feed timeout settings
 might be required when documents have many embed inputs.
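+For instance, embedding an array field issues one embed input per element, so a document with
+hundreds of chunks multiplies embedding work accordingly. A sketch with hypothetical field names:
+{% highlight txt %}
+schema doc {
+    document doc {
+        field chunks type array<string> {}
+    }
+    # One embed input per array element; the mapped "chunk" dimension addresses each element
+    field chunk_embeddings type tensor<float>(chunk{}, x[1536]) {
+        indexing: input chunks | embed openai | attribute
+    }
+}
+{% endhighlight %}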
-Using GPU, especially for longer sequence lengths (documents),
+
+For local ONNX-based embedders (such as the Hugging Face,
+Bert, ColBERT, and SPLADE embedders),
+using GPU, especially for longer sequence lengths (documents),
can dramatically improve performance and reduce cost.
See the blog post on GPU-accelerated ML inference in Vespa Cloud.
 With GPU-accelerated instances, using fp16 models instead of fp32 can increase throughput by as much as 3x.
+For cloud embedders that call an external API
+(VoyageAI, OpenAI, Mistral),
+throughput is bound by API latency and rate limits rather than local hardware.
+See Thread pool tuning for cloud embedders and
+dynamic batching for tuning guidance.
Refer to binarizing vectors for how to reduce vector size.
@@ -843,6 +834,99 @@ Combining with foreach
 See Indexing language execution value for details.
+Separate feed and search embedders
+
+In Vespa Cloud, it is best practice to configure separate container clusters for feed and search, so that
+bursty feed load cannot affect query latency. When using HTTP-based cloud embedders
+(VoyageAI, OpenAI, Mistral),
+configure a separate embedder component in each cluster. This lets you pick different models and API keys per workload,
+and gives two additional benefits: cost optimization (via model variants) and
+rate limit isolation.
+
+{% highlight xml %}
+<container id="feed" version="1.0">
+  <component id="voyage" type="voyage-ai-embedder">
+    <model>voyage-4-large</model>
+    <dimensions>1024</dimensions>
+    <api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
+  </component>
+</container>
+
+<container id="search" version="1.0">
+  <component id="voyage" type="voyage-ai-embedder">
+    <model>voyage-4-lite</model>
+    <dimensions>1024</dimensions>
+    <api-key-secret-ref>voyage_search_api_key</api-key-secret-ref>
+  </component>
+</container>
+{% endhighlight %}
+
+Cost optimization with model variants
+When a provider offers multiple model sizes that share the same embedding space, you can use a more powerful
+ (and more expensive) model for document embeddings while using a smaller, cheaper model for query embeddings.
+ Since document embedding happens once during indexing but query embedding occurs on every search request,
+ this can significantly reduce operational costs while maintaining retrieval quality.
+
+For example, the Voyage 4 model family shares a
+ vector space across sizes, making it a natural fit for this pattern:
+ use voyage-4-large in the feed cluster and voyage-4-lite in the search cluster as shown above.
+ See also Using voyage-4-nano for local query inference
+ for an even more cost-effective query-side option.
+
+Rate limit isolation
+Separating feed and search operations is particularly important for managing API rate limits.
+ Bursty document feeding operations can consume significant API quota, potentially causing rate limit errors
+ that affect search queries. By using separate API keys for feed and search embedders,
+ you ensure that feeding bursts don't negatively impact search.
+
+Thread pool tuning for cloud embedders
+When using an HTTP-based cloud embedder (VoyageAI, OpenAI, Mistral), container feed throughput is primarily
+ limited by embedding API latency combined with the document processing thread pool size, not by CPU.
+ Each document being fed blocks a thread while waiting for the embedding API response. To improve throughput,
+ you likely have to increase the
+ document processing thread pool size,
+ assuming the content cluster is not the bottleneck.
+
+For example, consider a container cluster with 2 nodes, each with 8 vCPUs. With the default document processing
+ thread pool size of 1 thread per vCPU, you have 16 total threads. If the average embedding API latency is 200ms,
+ the maximum throughput is approximately 16 / 0.2 = 80 documents/second. Conversely, a target of 400 documents/second
+ at that latency would require roughly 400 × 0.2 = 80 threads.
+ See container tuning for details.
+
+Note that the effective throughput can never exceed the rate limit of your API key.
+ Use the embedder metrics
+ to determine embedder latency and throughput.
+ For additional throughput improvements, consider enabling dynamic batching.
+
+Dynamic batching
+Dynamic batching combines multiple concurrent embedding requests into a single embedding API call.
+ This is useful when throughput is constrained by the provider's
+ requests-per-minute (RPM) limit rather than the tokens-per-minute (TPM) limit.
+ Batching reduces RPM usage by combining requests; TPM usage is unaffected.
+
+Dynamic batching is supported by the VoyageAI,
+ OpenAI, and Mistral embedders.
+
+{% highlight xml %}
+<container id="feed" version="1.0">
+  <component id="voyage" type="voyage-ai-embedder">
+    <model>voyage-4-large</model>
+    <dimensions>1024</dimensions>
+    <api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
+    <batching max-size="16" max-delay="10ms"/>
+  </component>
+</container>
+{% endhighlight %}
+
+The max-size attribute sets the maximum number of requests in a single batch,
+ and max-delay sets the maximum time to wait for a full batch before sending a partial one.
+ Batching is disabled by default.
+
+The document processing thread pool size
+ should be at least max-size, since each thread contributes one request to the batch.
+
Troubleshooting
diff --git a/en/reference/rag/embedding.html b/en/reference/rag/embedding.html
index ce25f0d36c..0285259427 100644
--- a/en/reference/rag/embedding.html
+++ b/en/reference/rag/embedding.html
@@ -580,6 +580,182 @@ VoyageAI embedder reference config
+OpenAI Embedder
+
+Available since {% include version.html version="8.678" %}
+
+ An embedder that uses the OpenAI embeddings API
+ to generate embeddings.
+
+
+ The OpenAI embedder is configured in services.xml,
+ within the container tag:
+
+{% highlight xml %}
+<container id="default" version="1.0">
+  <component id="openai" type="openai-embedder">
+    <model>text-embedding-3-small</model>
+    <api-key-secret-ref>openai_api_key</api-key-secret-ref>
+    <dimensions>1536</dimensions>
+    <endpoint>https://api.openai.com/v1/embeddings</endpoint>
+  </component>
+</container>
+{% endhighlight %}
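+For a self-hosted OpenAI-compatible server that requires no authentication, omit
+api-key-secret-ref and point endpoint at the server. A hypothetical sketch (the URL, model
+name, and dimensions depend entirely on the server you run):
+{% highlight xml %}
+<component id="local" type="openai-embedder">
+  <!-- Hypothetical local OpenAI-compatible endpoint, e.g. a vLLM or llama.cpp server -->
+  <model>nomic-embed-text</model>
+  <dimensions>768</dimensions>
+  <endpoint>http://localhost:8000/v1/embeddings</endpoint>
+</component>
+{% endhighlight %}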
+
+OpenAI embedder reference config
+
+
+
+Name | Occurrence | Description | Type | Default
+---- | ---------- | ----------- | ---- | -------
+model | One | Required. The OpenAI model to use, for example text-embedding-3-small or text-embedding-3-large. See the OpenAI embeddings documentation for the complete list of available models. | string | N/A
+dimensions | One | Required. The number of dimensions for the output embedding vectors. Must match the tensor field definition in your schema. The destination tensor field must use float or bfloat16 cell type; the OpenAI API does not support quantization. | integer | N/A
+api-key-secret-ref | Optional | Reference to the secret in Vespa's secret store containing the OpenAI API key. When unset, requests are sent without an Authorization header. | string | "" (no auth)
+endpoint | Optional | OpenAI API endpoint URL. Set this to target a specific OpenAI-compatible API. | string | https://api.openai.com/v1/embeddings
+batching | Optional | Enables dynamic batching of concurrent embedding requests into single OpenAI API calls. When enabled, the embedder collects concurrent requests and sends them as a single batch, reducing the number of API calls and improving throughput. Attributes: max-size sets the maximum number of requests in a single batch; max-delay sets the maximum time to wait for a full batch before sending a partial one (e.g., 200ms). | element | disabled
+
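+For example, batching could be enabled as follows (the attribute values are illustrative,
+not recommended defaults):
+{% highlight xml %}
+<component id="openai" type="openai-embedder">
+  <model>text-embedding-3-small</model>
+  <api-key-secret-ref>openai_api_key</api-key-secret-ref>
+  <dimensions>1536</dimensions>
+  <!-- Combine up to 16 concurrent requests; wait at most 200ms for a full batch -->
+  <batching max-size="16" max-delay="200ms"/>
+</component>
+{% endhighlight %}
+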
+Mistral Embedder
+Available since {% include version.html version="8.678" %}
+
+ An embedder that uses the Mistral
+ embeddings API to generate embeddings.
+
+
+ The Mistral embedder is configured in services.xml,
+ within the container tag:
+
+{% highlight xml %}
+<container id="default" version="1.0">
+  <component id="mistral" type="mistral-embedder">
+    <model>mistral-embed</model>
+    <api-key-secret-ref>mistral_api_key</api-key-secret-ref>
+    <dimensions>1024</dimensions>
+    <quantization>auto</quantization>
+  </component>
+</container>
+{% endhighlight %}
+
+Mistral embedder reference config
+
+
+
+Name | Occurrence | Description | Type | Default
+---- | ---------- | ----------- | ---- | -------
+model | One | Required. The Mistral model to use, for example mistral-embed or codestral-embed. See the Mistral embeddings documentation for the complete list of available models. | string | N/A
+api-key-secret-ref | One | Required. Reference to the secret in Vespa's secret store containing the Mistral API key. | string | N/A
+dimensions | One | Required. The number of dimensions for the output embedding vectors. Must match the tensor field definition in your schema. See the Mistral embeddings documentation for model-specific dimension support. | integer | N/A
+quantization | Optional | Output quantization format for embedding vectors. Valid values are auto, float, int8, or binary. See the quantization row of the VoyageAI embedder reference config for details on auto resolution and the destination tensor layout required for int8 and binary. Note that not all Mistral models support int8 and binary quantization; see the Mistral embeddings documentation for per-model support. | string | auto
+batching | Optional | Enables dynamic batching of concurrent embedding requests into single Mistral API calls. When enabled, the embedder collects concurrent requests and sends them as a single batch, reducing the number of API calls and improving throughput. Attributes: max-size sets the maximum number of requests in a single batch; max-delay sets the maximum time to wait for a full batch before sending a partial one (e.g., 200ms). | element | disabled
+
+
Huggingface tokenizer embedder
The Huggingface tokenizer embedder is configured in services.xml,