diff --git a/en/rag/embedding.html b/en/rag/embedding.html
index 06cbaca9b0..eb530e4025 100644
--- a/en/rag/embedding.html
+++ b/en/rag/embedding.html
@@ -576,93 +576,77 @@

Input type detection

Embedder.Context when calling the embedder from Java code.

-

-Best practices

-

For production deployments, we recommend configuring separate embedder components for feed and search operations. -This architectural pattern provides two key benefits - cost optimization and rate limit isolation. -In Vespa Cloud, it's best practice to configure these embedders in separate container clusters for feed and search.

+

Using voyage-4-nano for local query inference

+

The voyage-4-nano + model is available as an ONNX model for use with the + Hugging Face embedder. + Since it shares the same embedding space as the larger + Voyage 4 models, + it can be used for query embeddings with local inference — trading some accuracy for lower cost + by eliminating API usage for queries entirely.

-
-{% highlight xml %}
-<!-- Component id/type values are assumed. -->
-<container id="feed" version="1.0">
-    <component id="voyage" type="voyageai-embedder">
-        <model>voyage-4-large</model>
-        <dimensions>1024</dimensions>
-        <api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
-    </component>
-</container>
-
-<container id="search" version="1.0">
-    <component id="voyage" type="voyageai-embedder">
-        <model>voyage-4-lite</model>
-        <dimensions>1024</dimensions>
-        <api-key-secret-ref>voyage_search_api_key</api-key-secret-ref>
-    </component>
-</container>
-{% endhighlight %}

OpenAI Embedder

Available since {% include version.html version="8.678" %}

An embedder that uses the OpenAI embeddings API to generate embeddings for semantic search. The embedder can target any OpenAI-compatible API.

{% highlight xml %}
<!-- Component id/type values are assumed. -->
<container version="1.0">
    <component id="openai" type="openai-embedder">
        <model>text-embedding-3-small</model>
        <api-key-secret-ref>openai_api_key</api-key-secret-ref>
        <dimensions>1536</dimensions>
    </component>
</container>
{% endhighlight %}
-
-Cost optimization with model variants
-

The Voyage 4 model family features a shared embedding space -across different model sizes. This enables a cost-effective strategy where you can use a more powerful (and expensive) model -for document embeddings, while using a smaller, cheaper model for query embeddings. -Since document embedding happens once during indexing but query embedding occurs on every search request, -this approach can significantly reduce operational costs while maintaining quality.

- -

The voyage-4-nano - model is available as an ONNX model for use with the - Hugging Face embedder. - Since it shares the same embedding space as the larger Voyage 4 models, - it can be used for query embeddings with local inference, trading some accuracy for lower cost - by eliminating API usage for queries entirely.

- -
-Rate limit isolation
-

Separating feed and search operations is particularly important for managing VoyageAI API rate limits. -Bursty document feeding operations can consume significant API quota, potentially causing rate limit errors -that affect search queries. By using separate API keys for feed and search embedders, -you ensure that feeding bursts don't negatively impact search.

+
    +
• The model specifies which OpenAI model to use.
• The api-key-secret-ref references a secret in Vespa's secret store containing your OpenAI API key. For self-hosted OpenAI-compatible endpoints that do not require authentication, this element can be omitted; see the sketch after this list.
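
As a sketch, the referenced secret could be defined in the same container in services.xml, assuming a Vespa Cloud vault (the vault and secret names are placeholders):

{% highlight xml %}
<!-- Sketch: the vault and secret names are placeholders. -->
<secrets>
    <openai_api_key vault="my-vault" name="openai-api-key"/>
</secrets>
{% endhighlight %}

The api-key-secret-ref element then refers to the secret by its element name, here openai_api_key.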
-
-Thread pool tuning
-

When using the VoyageAI embedder, container feed throughput is primarily limited by VoyageAI API latency - combined with the document processing thread pool size, not by CPU. Each document being fed blocks a thread - while waiting for the VoyageAI API response. To improve throughput, you likely have to increase the - document processing thread pool size, - assuming the content cluster is not the bottleneck.

+

See the reference +for all configuration parameters.

-

For example, consider a container cluster with 2 nodes, each with 8 vCPUs. With the default document processing - thread pool size of 1 thread per vCPU, you have 16 total threads. If the average VoyageAI API latency is 200ms, - the maximum throughput is approximately 16 / 0.2 = 80 documents/second. - See container tuning for more on container tuning.

+

Mistral Embedder

-

Note that the effective throughput can never exceed the rate limit of your VoyageAI API key. - Use the embedder metrics - to determine embedder latency and throughput. - For additional throughput improvements, consider enabling dynamic batching.

+

Available since {% include version.html version="8.678" %}

-
-Dynamic batching
-

Dynamic batching combines multiple concurrent embedding requests into a single VoyageAI API call. - This is useful when throughput is constrained by VoyageAI's - RPM (requests per minute) limit - rather than the TPM (tokens per minute) limit. - Batching reduces RPM usage by combining requests; TPM usage is unaffected.

+

An embedder that uses the Mistral +embeddings API to generate embeddings for semantic search.

-{% highlight xml %}
-<!-- Component id/type values and batching attribute values are assumed. -->
-<container id="feed" version="1.0">
-    <component id="voyage" type="voyageai-embedder">
-        <model>voyage-4-large</model>
-        <dimensions>1024</dimensions>
-        <api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
-        <batching max-size="16" max-delay="200ms"/>
-    </component>
-</container>
-{% endhighlight %}

{% highlight xml %}
<!-- Component id/type values are assumed. -->
<container version="1.0">
    <component id="mistral" type="mistral-embedder">
        <model>mistral-embed</model>
        <api-key-secret-ref>mistral_api_key</api-key-secret-ref>
        <dimensions>1024</dimensions>
    </component>
</container>
{% endhighlight %}
-

The max-size attribute sets the maximum number of requests in a single batch, - and max-delay sets the maximum time to wait for a full batch before sending a partial one. - Batching is disabled by default.

+
    +
• The model specifies which Mistral model to use.
• The api-key-secret-ref references a secret in Vespa's secret store containing your Mistral API key. This is required for authentication.
-

The document processing thread pool size - should be at least max-size, since each thread contributes one request to the batch.

+

Mistral supports output quantization on models that offer it, such as codestral-embed. + See the reference + for all configuration parameters.

Embedder performance

@@ -675,14 +659,21 @@

Embedder performance

• The number of inputs to the embed call. When encoding arrays, consider how many inputs a single document can have.
-  For CPU inference, increasing feed timeout settings
+  For local CPU inference, increasing feed timeout settings
   might be required when documents have many embed inputs.
  • -

-Using GPU, especially for longer sequence lengths (documents),

    For local ONNX-based embedders (such as the Hugging Face, +Bert, ColBERT, and SPLADE embedders), +using GPU, especially for longer sequence lengths (documents), can dramatically improve performance and reduce cost. See the blog post on GPU-accelerated ML inference in Vespa Cloud. With GPU-accelerated instances, using fp16 models instead of fp32 can increase throughput by as much as 3x compared to fp32.

    +

    For cloud embedders that call an external API +(VoyageAI, OpenAI, Mistral), +throughput is bound by API latency and rate limits rather than local hardware. +See Thread pool tuning for cloud embedders and +dynamic batching for tuning guidance.

    Refer to binarizing vectors for how to reduce vector size.

    @@ -843,6 +834,99 @@

    Combining with foreach

See Indexing language execution value for details.

    +

    Separate feed and search embedders

    + +

    In Vespa Cloud, it is general practice to configure separate container clusters for feed and search, so that +bursty feed load cannot affect query latency. When using HTTP-based cloud embedders +(VoyageAI, OpenAI, Mistral), +configure a separate embedder component in each cluster. This lets you pick different models and API keys per workload, +and gives two additional benefits: cost optimization (via model variants) and +rate limit isolation.

    + +
{% highlight xml %}
<!-- Component id/type values are assumed. -->
<container id="feed" version="1.0">
    <component id="voyage" type="voyageai-embedder">
        <model>voyage-4-large</model>
        <dimensions>1024</dimensions>
        <api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
    </component>
</container>

<container id="search" version="1.0">
    <component id="voyage" type="voyageai-embedder">
        <model>voyage-4-lite</model>
        <dimensions>1024</dimensions>
        <api-key-secret-ref>voyage_search_api_key</api-key-secret-ref>
    </component>
</container>
{% endhighlight %}
    + +

    Cost optimization with model variants

    +

    When a provider offers multiple model sizes that share the same embedding space, you can use a more powerful + (and more expensive) model for document embeddings while using a smaller, cheaper model for query embeddings. + Since document embedding happens once during indexing but query embedding occurs on every search request, + this can significantly reduce operational costs while maintaining retrieval quality.

    + +

    For example, the Voyage 4 model family shares a + vector space across sizes, making it a natural fit for this pattern: + use voyage-4-large in the feed cluster and voyage-4-lite in the search cluster as shown above. + See also Using voyage-4-nano for local query inference + for an even more cost-effective query-side option.

    + +

    Rate limit isolation

    +

    Separating feed and search operations is particularly important for managing API rate limits. + Bursty document feeding operations can consume significant API quota, potentially causing rate limit errors + that affect search queries. By using separate API keys for feed and search embedders, + you ensure that feeding bursts don't negatively impact search.

    + +

    Thread pool tuning for cloud embedders

    +

    When using an HTTP-based cloud embedder (VoyageAI, OpenAI, Mistral), container feed throughput is primarily + limited by embedding API latency combined with the document processing thread pool size, not by CPU. + Each document being fed blocks a thread while waiting for the embedding API response. To improve throughput, + you likely have to increase the + document processing thread pool size, + assuming the content cluster is not the bottleneck.

    + +

    For example, consider a container cluster with 2 nodes, each with 8 vCPUs. With the default document processing + thread pool size of 1 thread per vCPU, you have 16 total threads. If the average embedding API latency is 200ms, + the maximum throughput is approximately 16 / 0.2 = 80 documents/second. + See container tuning for more on container tuning.

    + +

    Note that the effective throughput can never exceed the rate limit of your API key. + Use the embedder metrics + to determine embedder latency and throughput. + For additional throughput improvements, consider enabling dynamic batching.

    + +

    Dynamic batching

    +

    Dynamic batching combines multiple concurrent embedding requests into a single embedding invocation. + This is useful when throughput is constrained by the provider's + requests-per-minute (RPM) limit rather than the tokens-per-minute (TPM) limit. + Batching reduces RPM usage by combining requests; TPM usage is unaffected.

    + +

    Dynamic batching is supported by the VoyageAI, + OpenAI, and Mistral embedders.

    + +
{% highlight xml %}
<!-- Component id/type values and batching attribute values are assumed. -->
<container id="feed" version="1.0">
    <component id="voyage" type="voyageai-embedder">
        <model>voyage-4-large</model>
        <dimensions>1024</dimensions>
        <api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
        <batching max-size="16" max-delay="200ms"/>
    </component>
</container>
{% endhighlight %}
    + +

    The max-size attribute sets the maximum number of requests in a single batch, + and max-delay sets the maximum time to wait for a full batch before sending a partial one. + Batching is disabled by default.

    + +

    The document processing thread pool size + should be at least max-size, since each thread contributes one request to the batch.

    +

    Troubleshooting

diff --git a/en/reference/rag/embedding.html b/en/reference/rag/embedding.html
index ce25f0d36c..0285259427 100644
--- a/en/reference/rag/embedding.html
+++ b/en/reference/rag/embedding.html
@@ -580,6 +580,182 @@

    VoyageAI embedder reference config +

    OpenAI Embedder

    +

    Available since {% include version.html version="8.678" %}

    +

    + An embedder that uses the OpenAI embeddings API + to generate embeddings. +

    +

    + The OpenAI embedder is configured in services.xml, + within the container tag: +

    +
{% highlight xml %}
<!-- Component id/type values are assumed. -->
<container version="1.0">
    <component id="openai" type="openai-embedder">
        <model>text-embedding-3-small</model>
        <api-key-secret-ref>openai_api_key</api-key-secret-ref>
        <dimensions>1536</dimensions>
        <endpoint>https://api.openai.com/v1/embeddings</endpoint>
    </component>
</container>
{% endhighlight %}
    + +

    OpenAI embedder reference config

• model (one, string, required): The OpenAI model to use, for example text-embedding-3-small or text-embedding-3-large. See the OpenAI embeddings documentation for the complete list of available models.
• dimensions (one, integer, required): The number of dimensions for the output embedding vectors. Must match the tensor field definition in your schema. The destination tensor field must use float or bfloat16 cell type; the OpenAI API does not support quantization.
• api-key-secret-ref (optional, string, default "" for no auth): Reference to the secret in Vespa's secret store containing the OpenAI API key. When unset, requests are sent without an Authorization header.
• endpoint (optional, string, default https://api.openai.com/v1/embeddings): OpenAI API endpoint URL. Set this to target a specific OpenAI-compatible API.
• batching (optional element, disabled by default): Enables dynamic batching of concurrent embedding requests into single OpenAI API calls. When enabled, the embedder collects concurrent requests and sends them as a single batch, reducing the number of API calls and improving throughput.
    • max-size: Maximum number of requests to include in a single batch.
    • max-delay: Maximum time to wait for a full batch before sending a partial one (e.g., 200ms).
    + +

    Mistral Embedder

    +

    Available since {% include version.html version="8.678" %}

    +

    + An embedder that uses the Mistral + embeddings API to generate embeddings. +

    +

    + The Mistral embedder is configured in services.xml, + within the container tag: +

    +
{% highlight xml %}
<!-- Component id/type values are assumed. -->
<container version="1.0">
    <component id="mistral" type="mistral-embedder">
        <model>mistral-embed</model>
        <api-key-secret-ref>mistral_api_key</api-key-secret-ref>
        <dimensions>1024</dimensions>
        <quantization>auto</quantization>
    </component>
</container>
{% endhighlight %}
    + +

    Mistral embedder reference config

• model (one, string, required): The Mistral model to use, for example mistral-embed or codestral-embed. See the Mistral embeddings documentation for the complete list of available models.
• api-key-secret-ref (one, string, required): Reference to the secret in Vespa's secret store containing the Mistral API key.
• dimensions (one, integer, required): The number of dimensions for the output embedding vectors. Must match the tensor field definition in your schema. See the Mistral embeddings documentation for model-specific dimension support.
• quantization (optional, string, default auto): Output quantization format for embedding vectors. Valid values are auto, float, int8, or binary. See the quantization row of the VoyageAI embedder reference config for details on auto resolution and the destination tensor layout required for int8 and binary. Note that not all Mistral models support int8 and binary quantization; see the Mistral embeddings documentation for per-model support.
• batching (optional element, disabled by default): Enables dynamic batching of concurrent embedding requests into single Mistral API calls. When enabled, the embedder collects concurrent requests and sends them as a single batch, reducing the number of API calls and improving throughput.
    • max-size: Maximum number of requests to include in a single batch.
    • max-delay: Maximum time to wait for a full batch before sending a partial one (e.g., 200ms).
    +

    Huggingface tokenizer embedder

    The Huggingface tokenizer embedder is configured in services.xml,