Document OpenAI and Mistral embedders #4655
@@ -576,93 +576,73 @@ <h4 id="voyageai-input-types">Input type detection</h4> | |
| <a href="https://javadoc.io/static/com.yahoo.vespa/linguistics/8.620.35/com/yahoo/language/process/Embedder.Context.html">Embedder.Context</a> | ||
| when calling the embedder from Java code.</p> | ||
|
|
||
| <h4 id="voyageai-best-practices">Best practices</h4> | ||
| <p>For production deployments, we recommend configuring <strong>separate embedder components for feed and search operations</strong>. | ||
| This architectural pattern provides two key benefits - cost optimization and rate limit isolation. | ||
| In Vespa Cloud, it's best practice to configure these embedders in separate container clusters for feed and search.</p> | ||
|
|
||
| <pre>{% highlight xml %} | ||
| <container id="feed" version="1.0"> | ||
| <component id="voyage" type="voyage-ai-embedder"> | ||
| <model>voyage-4-large</model> | ||
| <dimensions>1024</dimensions> | ||
| <api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref> | ||
| </component> | ||
| <document-api/> | ||
| </container> | ||
|
|
||
| <container id="search" version="1.0"> | ||
| <component id="voyage" type="voyage-ai-embedder"> | ||
| <model>voyage-4-lite</model> | ||
| <dimensions>1024</dimensions> | ||
| <api-key-secret-ref>voyage_search_api_key</api-key-secret-ref> | ||
| </component> | ||
| <search/> | ||
| </container> | ||
| {% endhighlight %}</pre> | ||
|
|
||
| <h5 id="voyageai-cost-optimization">Cost optimization with model variants</h5> | ||
| <p>The <a href="https://blog.voyageai.com/2026/01/15/voyage-4/">Voyage 4 model family</a> features a shared embedding space | ||
| across different model sizes. This enables a cost-effective strategy where you can use a more powerful (and expensive) model | ||
| for document embeddings, while using a smaller, cheaper model for query embeddings. | ||
| Since document embedding happens once during indexing but query embedding occurs on every search request, | ||
| this approach can significantly reduce operational costs while maintaining quality.</p> | ||
|
|
||
| <h4 id="voyageai-local-query-inference">Using voyage-4-nano for local query inference</h4> | ||
| <p>The <a href="model-hub.html#voyage-4-nano">voyage-4-nano</a> | ||
| model is available as an ONNX model for use with the | ||
| <a href="#huggingface-embedder">Hugging Face embedder</a>. | ||
| Since it shares the same embedding space as the larger Voyage 4 models, | ||
| it can be used for query embeddings with local inference, trading some accuracy for lower cost | ||
| Since it shares the same embedding space as the larger | ||
| <a href="https://blog.voyageai.com/2026/01/15/voyage-4/">Voyage 4</a> models, | ||
| it can be used for query embeddings with local inference — trading some accuracy for lower cost | ||
| by eliminating API usage for queries entirely.</p> | ||
|
|
||
| <h5 id="voyageai-rate-limit-isolation">Rate limit isolation</h5> | ||
| <p>Separating feed and search operations is particularly important for managing VoyageAI API rate limits. | ||
| Bursty document feeding operations can consume significant API quota, potentially causing rate limit errors | ||
| that affect search queries. By using <strong>separate API keys</strong> for feed and search embedders, | ||
| you ensure that feeding bursts don't negatively impact search.</p> | ||
| <h3 id="openai-embedder">OpenAI Embedder</h3> | ||
|
|
||
| <h5 id="voyageai-document-processing-concurrency">Thread pool tuning</h5> | ||
| <p>When using the VoyageAI embedder, container feed throughput is primarily limited by VoyageAI API latency | ||
| combined with the document processing thread pool size, not by CPU. Each document being fed blocks a thread | ||
| while waiting for the VoyageAI API response. To improve throughput, you likely have to increase the | ||
| <a href="../reference/applications/services/docproc.html#threadpool">document processing thread pool size</a>, | ||
| assuming the content cluster is not the bottleneck.</p> | ||
| <p>An embedder that uses the <a href="https://platform.openai.com/docs/guides/embeddings">OpenAI</a> embeddings API | ||
| to generate embeddings for semantic search. The embedder can also target self-hosted OpenAI-compatible APIs.</p> | ||
|
|
||
| <p>For example, consider a container cluster with 2 nodes, each with 8 vCPUs. With the default document processing | ||
| thread pool size of 1 thread per vCPU, you have 16 total threads. If the average VoyageAI API latency is 200ms, | ||
| the maximum throughput is approximately 16 / 0.2 = 80 documents/second. | ||
| See <a href="../performance/container-tuning.html">container tuning</a> for more details.</p> | ||
| <pre>{% highlight xml %} | ||
| <container version="1.0"> | ||
| <component id="openai" type="openai-embedder"> | ||
| <model>text-embedding-3-small</model> | ||
|
Contributor: add URL to make it easy to see how to use custom URLs?

Member (Author): I prefer having a simplified example here, and rather let the user discover how to override the endpoint by reading the reference documentation.
||
| <api-key-secret-ref>openai_api_key</api-key-secret-ref> | ||
| <dimensions>1536</dimensions> | ||
| </component> | ||
| </container> | ||
| {% endhighlight %}</pre> | ||
|
|
||
| <p>Note that the effective throughput can never exceed the rate limit of your VoyageAI API key. | ||
| Use the <a href="https://docs.vespa.ai/en/reference/operations/metrics/container.html">embedder metrics</a> | ||
| to determine embedder latency and throughput. | ||
| For additional throughput improvements, consider enabling <a href="#voyageai-dynamic-batching">dynamic batching</a>.</p> | ||
| <ul> | ||
| <li> | ||
| The <code>model</code> specifies which OpenAI model to use. | ||
| </li> | ||
| <li> | ||
| The <code>api-key-secret-ref</code> references a secret in Vespa's | ||
| <a href="/en/cloud/security/secret-store.html">secret store</a> containing your OpenAI API key. | ||
| For self-hosted OpenAI-compatible endpoints that do not require authentication, this element can be omitted. | ||
| </li> | ||
| </ul> | ||
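The embedder talks to the standard OpenAI embeddings endpoint (`POST /v1/embeddings`). As an illustrative sketch of the request body the configuration above corresponds to (the field names follow the public OpenAI API; this is not the embedder's actual implementation):

```python
import json

def build_embeddings_request(model, texts, dimensions=None):
    """Build the JSON body for a POST to /v1/embeddings on an
    OpenAI-compatible endpoint. Illustration only; the Vespa embedder
    constructs and sends this request internally."""
    body = {"model": model, "input": texts}
    if dimensions is not None:
        # Only valid for models that support shortened embeddings,
        # such as text-embedding-3-small.
        body["dimensions"] = dimensions
    return json.dumps(body)

payload = build_embeddings_request("text-embedding-3-small",
                                   ["first text", "second text"], 1536)
print(payload)
```

A self-hosted OpenAI-compatible server only needs to accept this same request shape; authentication (the `Authorization: Bearer` header derived from `api-key-secret-ref`) is added separately when configured.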
|
|
||
| <h5 id="voyageai-dynamic-batching">Dynamic batching</h5> | ||
| <p>Dynamic batching combines multiple concurrent embedding requests into a single VoyageAI API call. | ||
| This is useful when throughput is constrained by VoyageAI's | ||
| <a href="https://docs.voyageai.com/docs/rate-limits">RPM (requests per minute) limit</a> | ||
| rather than the TPM (tokens per minute) limit. | ||
| Batching reduces RPM usage by combining requests; TPM usage is unaffected.</p> | ||
| <p>See the <a href="../reference/rag/embedding.html#openai-embedder-reference-config">reference</a> | ||
| for all configuration parameters.</p> | ||
|
|
||
| <h3 id="mistral-embedder">Mistral Embedder</h3> | ||
|
|
||
| <p>An embedder that uses the <a href="https://docs.mistral.ai/capabilities/embeddings/overview/">Mistral</a> | ||
| embeddings API to generate embeddings for semantic search.</p> | ||
|
|
||
| <pre>{% highlight xml %} | ||
| <container id="feed" version="1.0"> | ||
| <component id="voyage" type="voyage-ai-embedder"> | ||
| <model>voyage-4-large</model> | ||
| <container version="1.0"> | ||
| <component id="mistral" type="mistral-embedder"> | ||
| <model>mistral-embed</model> | ||
| <api-key-secret-ref>mistral_api_key</api-key-secret-ref> | ||
| <dimensions>1024</dimensions> | ||
| <api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref> | ||
| <batching max-size="16" max-delay="200ms"/> | ||
| </component> | ||
| <document-api/> | ||
| </container> | ||
| {% endhighlight %}</pre> | ||
|
|
||
| <p>The <code>max-size</code> attribute sets the maximum number of requests in a single batch, | ||
| and <code>max-delay</code> sets the maximum time to wait for a full batch before sending a partial one. | ||
| Batching is disabled by default.</p> | ||
| <ul> | ||
| <li> | ||
| The <code>model</code> specifies which Mistral model to use. | ||
| </li> | ||
| <li> | ||
| The <code>api-key-secret-ref</code> references a secret in Vespa's | ||
| <a href="/en/cloud/security/secret-store.html">secret store</a> containing your Mistral API key. | ||
| This is required for authentication. | ||
| </li> | ||
| </ul> | ||
|
|
||
| <p>The <a href="../reference/applications/services/docproc.html#threadpool">document processing thread pool size</a> | ||
| should be at least <code>max-size</code>, since each thread contributes one request to the batch.</p> | ||
| <p>Mistral supports output quantization on models that offer it, such as <code>codestral-embed</code>. | ||
| See the <a href="../reference/rag/embedding.html#mistral-embedder-reference-config">reference</a> | ||
| for all configuration parameters.</p> | ||
|
|
||
| <h2 id="embedder-performance">Embedder performance</h2> | ||
|
|
||
|
|
||
|
|
@@ -843,6 +823,99 @@ <h3 id="combining-with-foreach">Combining with foreach</h3> | |
| <p>See <a href="../writing/indexing.html#execution-value-example">Indexing language execution value</a> for details.</p> | ||
|
|
||
|
|
||
| <h3 id="separate-feed-and-search-embedders">Separate feed and search embedders</h3> | ||
|
|
||
| <p>In Vespa Cloud, it is standard practice to configure separate container clusters for feed and search, so that | ||
| bursty feed load cannot affect query latency. When using HTTP-based cloud embedders | ||
| (<a href="#voyageai-embedder">VoyageAI</a>, <a href="#openai-embedder">OpenAI</a>, <a href="#mistral-embedder">Mistral</a>), | ||
| configure a separate embedder component in each cluster. This lets you pick different models and API keys per workload, | ||
| and gives two additional benefits: <strong>cost optimization</strong> (via model variants) and | ||
| <strong>rate limit isolation</strong>.</p> | ||
|
|
||
| <pre>{% highlight xml %} | ||
| <container id="feed" version="1.0"> | ||
| <component id="voyage" type="voyage-ai-embedder"> | ||
| <model>voyage-4-large</model> | ||
| <dimensions>1024</dimensions> | ||
| <api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref> | ||
| </component> | ||
| <document-api/> | ||
| </container> | ||
|
|
||
| <container id="search" version="1.0"> | ||
| <component id="voyage" type="voyage-ai-embedder"> | ||
| <model>voyage-4-lite</model> | ||
| <dimensions>1024</dimensions> | ||
| <api-key-secret-ref>voyage_search_api_key</api-key-secret-ref> | ||
| </component> | ||
| <search/> | ||
| </container> | ||
| {% endhighlight %}</pre> | ||
|
|
||
| <h4 id="cost-optimization-with-model-variants">Cost optimization with model variants</h4> | ||
| <p>When a provider offers multiple model sizes that share the same embedding space, you can use a more powerful | ||
| (and more expensive) model for document embeddings while using a smaller, cheaper model for query embeddings. | ||
| Since document embedding happens once during indexing but query embedding occurs on every search request, | ||
| this can significantly reduce operational costs while maintaining retrieval quality.</p> | ||
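A back-of-the-envelope calculation makes the effect concrete. All prices and volumes below are made-up assumptions for illustration, not actual provider pricing:

```python
# Hypothetical prices in $ per 1M tokens (assumptions, not real pricing).
PRICE_LARGE = 0.18  # document-side model
PRICE_SMALL = 0.07  # query-side model

doc_tokens = 500_000_000                # embedded once, during indexing
query_tokens_per_month = 2_000_000_000  # recurring, on every search request

# Using the large model for both documents and queries:
all_large = (doc_tokens + query_tokens_per_month) / 1e6 * PRICE_LARGE

# Large model for documents (one-time), small model for queries (recurring):
split = (doc_tokens / 1e6 * PRICE_LARGE
         + query_tokens_per_month / 1e6 * PRICE_SMALL)

print(f"all-large: ${all_large:,.0f}, split: ${split:,.0f}")
```

With these assumed numbers, the split configuration costs roughly half, and the gap widens as query volume grows relative to corpus size.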
|
|
||
| <p>For example, the <a href="https://blog.voyageai.com/2026/01/15/voyage-4/">Voyage 4 model family</a> shares a | ||
| vector space across sizes, making it a natural fit for this pattern: | ||
| use <code>voyage-4-large</code> in the feed cluster and <code>voyage-4-lite</code> in the search cluster as shown above. | ||
| See also <a href="#voyageai-local-query-inference">Using voyage-4-nano for local query inference</a> | ||
| for an even more cost-effective query-side option.</p> | ||
|
|
||
| <h4 id="rate-limit-isolation">Rate limit isolation</h4> | ||
| <p>Separating feed and search operations is particularly important for managing API rate limits. | ||
| Bursty document feeding operations can consume significant API quota, potentially causing rate limit errors | ||
| that affect search queries. By using <strong>separate API keys</strong> for feed and search embedders, | ||
| you ensure that feeding bursts don't negatively impact search.</p> | ||
|
|
||
| <h3 id="thread-pool-tuning">Thread pool tuning for cloud embedders</h3> | ||
| <p>When using an HTTP-based cloud embedder (VoyageAI, OpenAI, Mistral), container feed throughput is primarily | ||
| limited by embedding API latency combined with the document processing thread pool size, not by CPU. | ||
| Each document being fed blocks a thread while waiting for the embedding API response. To improve throughput, | ||
| you likely have to increase the | ||
| <a href="../reference/applications/services/docproc.html#threadpool">document processing thread pool size</a>, | ||
| assuming the content cluster is not the bottleneck.</p> | ||
|
|
||
| <p>For example, consider a container cluster with 2 nodes, each with 8 vCPUs. With the default document processing | ||
| thread pool size of 1 thread per vCPU, you have 16 total threads. If the average embedding API latency is 200ms, | ||
| the maximum throughput is approximately 16 / 0.2 = 80 documents/second. | ||
| See <a href="../performance/container-tuning.html">container tuning</a> for more details.</p> | ||
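The arithmetic generalizes to a simple upper bound; a minimal sketch, assuming each thread blocks on exactly one embedding API call at a time:

```python
def max_embed_throughput(nodes, vcpus_per_node, threads_per_vcpu, api_latency_s):
    """Upper bound on feed throughput (documents/second) when each
    document processing thread blocks on one embedding API call."""
    total_threads = nodes * vcpus_per_node * threads_per_vcpu
    return total_threads / api_latency_s

# The example from the text: 2 nodes x 8 vCPUs, 1 thread per vCPU,
# 200 ms average API latency -> 16 / 0.2 = 80 documents/second.
print(max_embed_throughput(2, 8, 1, 0.2))
```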
|
|
||
| <p>Note that the effective throughput can never exceed the rate limit of your API key. | ||
| Use the <a href="https://docs.vespa.ai/en/reference/operations/metrics/container.html">embedder metrics</a> | ||
|
|
||
| to determine embedder latency and throughput. | ||
| For additional throughput improvements, consider enabling <a href="#dynamic-batching">dynamic batching</a>.</p> | ||
|
|
||
| <h3 id="dynamic-batching">Dynamic batching</h3> | ||
| <p>Dynamic batching combines multiple concurrent embedding requests into a single embedding invocation. | ||
| This is useful when throughput is constrained by the provider's | ||
| requests-per-minute (RPM) limit rather than the tokens-per-minute (TPM) limit. | ||
| Batching reduces RPM usage by combining requests; TPM usage is unaffected.</p> | ||
|
|
||
| <p>Dynamic batching is supported by the <a href="#voyageai-embedder">VoyageAI</a>, | ||
| <a href="#openai-embedder">OpenAI</a>, and <a href="#mistral-embedder">Mistral</a> embedders.</p> | ||
|
|
||
|
|
||
| <pre>{% highlight xml %} | ||
| <container id="feed" version="1.0"> | ||
| <component id="voyage" type="voyage-ai-embedder"> | ||
| <model>voyage-4-large</model> | ||
| <dimensions>1024</dimensions> | ||
| <api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref> | ||
| <batching max-size="16" max-delay="200ms"/> | ||
| </component> | ||
| <document-api/> | ||
| </container> | ||
| {% endhighlight %}</pre> | ||
|
|
||
| <p>The <code>max-size</code> attribute sets the maximum number of requests in a single batch, | ||
| and <code>max-delay</code> sets the maximum time to wait for a full batch before sending a partial one. | ||
| Batching is disabled by default.</p> | ||
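Conceptually, the batcher collects concurrent requests and flushes on whichever of `max-size` or `max-delay` is reached first. A minimal sketch in plain Python (an illustration of the idea, not Vespa's implementation):

```python
import threading

class DynamicBatcher:
    """Merge concurrent embed() calls into one backend call, flushing
    when the batch reaches max_size or after max_delay seconds."""

    def __init__(self, backend, max_size=16, max_delay=0.2):
        self.backend = backend      # callable: list[str] -> list of vectors
        self.max_size = max_size
        self.max_delay = max_delay
        self.lock = threading.Lock()
        self.pending = []           # (text, done-event, result-slot) triples

    def embed(self, text):
        done, slot = threading.Event(), []
        with self.lock:
            self.pending.append((text, done, slot))
            if len(self.pending) == 1:
                # First request of a new batch: arm the max_delay timer.
                threading.Timer(self.max_delay, self._flush).start()
            if len(self.pending) >= self.max_size:
                self._flush_locked()
        done.wait()                 # block until the batch has been sent
        return slot[0]

    def _flush(self):
        with self.lock:
            self._flush_locked()

    def _flush_locked(self):
        batch, self.pending = self.pending, []
        if not batch:
            return                  # timer fired after a size-based flush
        vectors = self.backend([text for text, _, _ in batch])  # one API call
        for (_, done, slot), vector in zip(batch, vectors):
            slot.append(vector)
            done.set()
```

With `max_size=16`, a burst of 1000 concurrent requests needs at most ceil(1000/16) = 63 backend calls instead of 1000, cutting RPM usage accordingly.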
|
|
||
| <p>The <a href="../reference/applications/services/docproc.html#threadpool">document processing thread pool size</a> | ||
| should be at least <code>max-size</code>, since each thread contributes one request to the batch.</p> | ||
|
|
||
|
|
||
| <h2 id="troubleshooting">Troubleshooting</h2> | ||
|
|
||
|
|
||