213 changes: 143 additions & 70 deletions en/rag/embedding.html
<h4 id="voyageai-input-types">Input type detection</h4>
<a href="https://javadoc.io/static/com.yahoo.vespa/linguistics/8.620.35/com/yahoo/language/process/Embedder.Context.html">Embedder.Context</a>
when calling the embedder from Java code.</p>


<h4 id="voyageai-local-query-inference">Using voyage-4-nano for local query inference</h4>
<p>The <a href="model-hub.html#voyage-4-nano">voyage-4-nano</a>
model is available as an ONNX model for use with the
<a href="#huggingface-embedder">Hugging Face embedder</a>.
Since it shares the same embedding space as the larger
<a href="https://blog.voyageai.com/2026/01/15/voyage-4/">Voyage 4</a> models,
it can be used for query embeddings with local inference — trading some accuracy for lower cost
by eliminating API usage for queries entirely.</p>
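
<p>A minimal configuration sketch for this setup. The <code>model-id</code> value and the assumption that it
also resolves the tokenizer are illustrative; see the <a href="#huggingface-embedder">Hugging Face embedder</a>
section and the <a href="model-hub.html#voyage-4-nano">model hub</a> for the exact options:</p>

<pre>{% highlight xml %}
<container id="search" version="1.0">
<component id="voyage-nano" type="hugging-face-embedder">
<transformer-model model-id="voyage-4-nano"/>
</component>
<search/>
</container>
{% endhighlight %}</pre>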

<h3 id="openai-embedder">OpenAI Embedder</h3>

<p>An embedder that uses the <a href="https://platform.openai.com/docs/guides/embeddings">OpenAI</a> embeddings API
to generate embeddings for semantic search. The embedder can also target self-hosted OpenAI-compatible APIs.</p>

<pre>{% highlight xml %}
<container version="1.0">
<component id="openai" type="openai-embedder">
<model>text-embedding-3-small</model>
<api-key-secret-ref>openai_api_key</api-key-secret-ref>
<dimensions>1536</dimensions>
</component>
</container>
{% endhighlight %}</pre>

<ul>
<li>
The <code>model</code> specifies which OpenAI model to use.
</li>
<li>
The <code>api-key-secret-ref</code> references a secret in Vespa's
<a href="/en/cloud/security/secret-store.html">secret store</a> containing your OpenAI API key.
For self-hosted OpenAI-compatible endpoints that do not require authentication, this element can be omitted.
</li>
</ul>
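
<p>To target a self-hosted OpenAI-compatible API, set the <code>endpoint</code> element explicitly.
A sketch where the model name, dimensions, and URL are placeholders for your own deployment;
<code>api-key-secret-ref</code> is omitted here on the assumption that the endpoint requires no authentication:</p>

<pre>{% highlight xml %}
<container version="1.0">
<component id="openai" type="openai-embedder">
<model>nomic-embed-text</model>
<dimensions>768</dimensions>
<endpoint>http://localhost:8080/v1/embeddings</endpoint>
</component>
</container>
{% endhighlight %}</pre>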

<p>See the <a href="../reference/rag/embedding.html#openai-embedder-reference-config">reference</a>
for all configuration parameters.</p>

<h3 id="mistral-embedder">Mistral Embedder</h3>

<p>An embedder that uses the <a href="https://docs.mistral.ai/capabilities/embeddings/overview/">Mistral</a>
embeddings API to generate embeddings for semantic search.</p>

<pre>{% highlight xml %}
<container version="1.0">
<component id="mistral" type="mistral-embedder">
<model>mistral-embed</model>
<api-key-secret-ref>mistral_api_key</api-key-secret-ref>
<dimensions>1024</dimensions>
</component>
</container>
{% endhighlight %}</pre>

<ul>
<li>
The <code>model</code> specifies which Mistral model to use.
</li>
<li>
The <code>api-key-secret-ref</code> references a secret in Vespa's
<a href="/en/cloud/security/secret-store.html">secret store</a> containing your Mistral API key.
This is required for authentication.
</li>
</ul>

<p>Some Mistral models, such as <code>codestral-embed</code>, support output quantization.
See the <a href="../reference/rag/embedding.html#mistral-embedder-reference-config">reference</a>
for all configuration parameters.</p>

<h2 id="embedder-performance">Embedder performance</h2>

<h3 id="combining-with-foreach">Combining with foreach</h3>
<p>See <a href="../writing/indexing.html#execution-value-example">Indexing language execution value</a> for details.</p>


<h3 id="separate-feed-and-search-embedders">Separate feed and search embedders</h3>

<p>In Vespa Cloud, it is standard practice to configure separate container clusters for feed and search, so that
bursty feed load cannot affect query latency. When using HTTP-based cloud embedders
(<a href="#voyageai-embedder">VoyageAI</a>, <a href="#openai-embedder">OpenAI</a>, <a href="#mistral-embedder">Mistral</a>),
configure a separate embedder component in each cluster. This lets you pick different models and API keys per workload,
and gives two additional benefits: <strong>cost optimization</strong> (via model variants) and
<strong>rate limit isolation</strong>.</p>

<pre>{% highlight xml %}
<container id="feed" version="1.0">
<component id="voyage" type="voyage-ai-embedder">
<model>voyage-4-large</model>
<dimensions>1024</dimensions>
<api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
</component>
<document-api/>
</container>

<container id="search" version="1.0">
<component id="voyage" type="voyage-ai-embedder">
<model>voyage-4-lite</model>
<dimensions>1024</dimensions>
<api-key-secret-ref>voyage_search_api_key</api-key-secret-ref>
</component>
<search/>
</container>
{% endhighlight %}</pre>

<h4 id="cost-optimization-with-model-variants">Cost optimization with model variants</h4>
<p>When a provider offers multiple model sizes that share the same embedding space, you can use a more powerful
(and more expensive) model for document embeddings while using a smaller, cheaper model for query embeddings.
Since document embedding happens once during indexing but query embedding occurs on every search request,
this can significantly reduce operational costs while maintaining retrieval quality.</p>

<p>For example, the <a href="https://blog.voyageai.com/2026/01/15/voyage-4/">Voyage 4 model family</a> shares a
vector space across sizes, making it a natural fit for this pattern:
use <code>voyage-4-large</code> in the feed cluster and <code>voyage-4-lite</code> in the search cluster as shown above.
See also <a href="#voyageai-local-query-inference">Using voyage-4-nano for local query inference</a>
for an even more cost-effective query-side option.</p>

<h4 id="rate-limit-isolation">Rate limit isolation</h4>
<p>Separating feed and search operations is particularly important for managing API rate limits.
Bursty document feeding operations can consume significant API quota, potentially causing rate limit errors
that affect search queries. By using <strong>separate API keys</strong> for feed and search embedders,
you ensure that feeding bursts don't negatively impact search.</p>

<h3 id="thread-pool-tuning">Thread pool tuning for cloud embedders</h3>
<p>When using an HTTP-based cloud embedder (VoyageAI, OpenAI, Mistral), container feed throughput is primarily
limited by embedding API latency combined with the document processing thread pool size, not by CPU.
Each document being fed blocks a thread while waiting for the embedding API response. To improve throughput,
you likely have to increase the
<a href="../reference/applications/services/docproc.html#threadpool">document processing thread pool size</a>,
assuming the content cluster is not the bottleneck.</p>

<p>For example, consider a container cluster with 2 nodes, each with 8 vCPUs. With the default document processing
thread pool size of 1 thread per vCPU, you have 16 total threads. If the average embedding API latency is 200ms,
the maximum throughput is approximately 16 / 0.2 = 80 documents/second.
See <a href="../performance/container-tuning.html">container tuning</a> for more on container tuning.</p>
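
The bound above can be sketched as a quick calculation (an illustrative helper, not a Vespa API):

```python
def max_feed_throughput(nodes, vcpus_per_node, threads_per_vcpu, api_latency_s):
    """Upper bound on documents/second when each in-flight document blocks
    one document processing thread for one embedding API round trip."""
    total_threads = nodes * vcpus_per_node * threads_per_vcpu
    return total_threads / api_latency_s

# 2 nodes x 8 vCPUs x 1 thread/vCPU, at 200 ms average API latency:
print(max_feed_throughput(2, 8, 1, 0.2))  # 80.0
```

Doubling the thread pool size (or halving API latency) doubles this bound, until the content cluster or the API rate limit becomes the bottleneck.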

<p>Note that the effective throughput can never exceed the rate limit of your API key.
Use the <a href="https://docs.vespa.ai/en/reference/operations/metrics/container.html">embedder metrics</a>
to determine embedder latency and throughput.
For additional throughput improvements, consider enabling <a href="#dynamic-batching">dynamic batching</a>.</p>

<h3 id="dynamic-batching">Dynamic batching</h3>
<p>Dynamic batching combines multiple concurrent embedding requests into a single embedding API call.
This is useful when throughput is constrained by the provider's
requests-per-minute (RPM) limit rather than the tokens-per-minute (TPM) limit.
Batching reduces RPM usage by combining requests; TPM usage is unaffected.</p>
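
A back-of-the-envelope sketch of the effect on RPM (an illustrative helper, not a Vespa API):

```python
import math

def batched_requests_per_min(embed_calls_per_min, max_batch_size):
    """API requests/minute when up to max_batch_size concurrent embedding
    requests are coalesced into one API call; token usage is unchanged."""
    return math.ceil(embed_calls_per_min / max_batch_size)

# 4800 embedding requests/min with batches of up to 16:
print(batched_requests_per_min(4800, 16))  # 300
```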

<p>Dynamic batching is supported by the <a href="#voyageai-embedder">VoyageAI</a>,
<a href="#openai-embedder">OpenAI</a>, and <a href="#mistral-embedder">Mistral</a> embedders.</p>

<pre>{% highlight xml %}
<container id="feed" version="1.0">
<component id="voyage" type="voyage-ai-embedder">
<model>voyage-4-large</model>
<dimensions>1024</dimensions>
<api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
<batching max-size="16" max-delay="200ms"/>
</component>
<document-api/>
</container>
{% endhighlight %}</pre>

<p>The <code>max-size</code> attribute sets the maximum number of requests in a single batch,
and <code>max-delay</code> sets the maximum time to wait for a full batch before sending a partial one.
Batching is disabled by default.</p>

<p>The <a href="../reference/applications/services/docproc.html#threadpool">document processing thread pool size</a>
should be at least <code>max-size</code>, since each thread contributes one request to the batch.</p>


<h2 id="troubleshooting">Troubleshooting</h2>

146 changes: 146 additions & 0 deletions en/reference/rag/embedding.html
<h3 id="voyageai-embedder-reference-config">VoyageAI embedder reference config</h3>
</tbody>
</table>

<h2 id="openai-embedder">OpenAI Embedder</h2>
<p>
An embedder that uses the <a href="https://platform.openai.com/docs/guides/embeddings">OpenAI</a> embeddings API
to generate embeddings.
</p>
<p>
The OpenAI embedder is configured in <a href="../applications/services/services.html">services.xml</a>,
within the <code>container</code> tag:
</p>
<pre>{% highlight xml %}
<container id="default" version="1.0">
<component id="openai" type="openai-embedder">
<model>text-embedding-3-small</model>
<api-key-secret-ref>openai_api_key</api-key-secret-ref>
<dimensions>1536</dimensions>
<endpoint>https://api.openai.com/v1/embeddings</endpoint>
</component>
</container>
{% endhighlight %}</pre>

<h3 id="openai-embedder-reference-config">OpenAI embedder reference config</h3>
<table class="table">
<thead>
<tr>
<th>Name</th>
<th>Occurrence</th>
<th>Description</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td>model</td>
<td>One</td>
<td><strong>Required</strong>. The OpenAI model to use, for example <code>text-embedding-3-small</code> or
<code>text-embedding-3-large</code>. See the
<a href="https://platform.openai.com/docs/guides/embeddings">OpenAI embeddings documentation</a>
for the complete list of available models.</td>
<td>string</td>
<td>N/A</td>
</tr>
<tr>
<td>dimensions</td>
<td>One</td>
<td><strong>Required</strong>. The number of dimensions for the output embedding vectors. Must match the
tensor field definition in your schema. The destination tensor field must use <code>float</code> or
<code>bfloat16</code> cell type — the OpenAI API does not support quantization.</td>
<td>integer</td>
<td>N/A</td>
</tr>
<tr>
<td>api-key-secret-ref</td>
<td>Optional</td>
<td>Reference to the secret in Vespa's <a href="/en/cloud/security/secret-store.html">secret store</a>
containing the OpenAI API key. When unset, requests are sent without an <code>Authorization</code> header.</td>
<td>string</td>
<td>"" (no auth)</td>
</tr>
<tr>
<td>endpoint</td>
<td>Optional</td>
<td>OpenAI API endpoint URL. Set this to target a specific OpenAI-compatible API.</td>
<td>string</td>
<td>https://api.openai.com/v1/embeddings</td>
</tr>
</tbody>
</table>
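
<p>The destination tensor field for the configuration example above could look like the following schema sketch,
assuming a source field named <code>text</code> and the embedder id <code>openai</code>:</p>

<pre>
field embedding type tensor<float>(x[1536]) {
    indexing: input text | embed openai | attribute | index
}
</pre>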

<h2 id="mistral-embedder">Mistral Embedder</h2>
<p>
An embedder that uses the <a href="https://docs.mistral.ai/capabilities/embeddings/overview/">Mistral</a>
embeddings API to generate embeddings.
</p>
<p>
The Mistral embedder is configured in <a href="../applications/services/services.html">services.xml</a>,
within the <code>container</code> tag:
</p>
<pre>{% highlight xml %}
<container id="default" version="1.0">
<component id="mistral" type="mistral-embedder">
<model>mistral-embed</model>
<api-key-secret-ref>mistral_api_key</api-key-secret-ref>
<dimensions>1024</dimensions>
<quantization>auto</quantization>
</component>
</container>
{% endhighlight %}</pre>

<h3 id="mistral-embedder-reference-config">Mistral embedder reference config</h3>
<table class="table">
<thead>
<tr>
<th>Name</th>
<th>Occurrence</th>
<th>Description</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td>model</td>
<td>One</td>
<td><strong>Required</strong>. The Mistral model to use, for example <code>mistral-embed</code> or
<code>codestral-embed</code>. See the
<a href="https://docs.mistral.ai/capabilities/embeddings/overview/">Mistral embeddings documentation</a>
for the complete list of available models.</td>
<td>string</td>
<td>N/A</td>
</tr>
<tr>
<td>api-key-secret-ref</td>
<td>One</td>
<td><strong>Required</strong>. Reference to the secret in Vespa's
<a href="/en/cloud/security/secret-store.html">secret store</a> containing the Mistral API key.</td>
<td>string</td>
<td>N/A</td>
</tr>
<tr>
<td>dimensions</td>
<td>One</td>
<td><strong>Required</strong>. The number of dimensions for the output embedding vectors. Must match the
tensor field definition in your schema. See the
<a href="https://docs.mistral.ai/capabilities/embeddings/overview/">Mistral embeddings documentation</a>
for model-specific dimension support.</td>
<td>integer</td>
<td>N/A</td>
</tr>
<tr>
<td>quantization</td>
<td>Optional</td>
<td>Output quantization format for embedding vectors. Valid values are <code>auto</code>, <code>float</code>,
<code>int8</code>, or <code>binary</code>. See the <code>quantization</code> row of the
<a href="#voyageai-embedder-reference-config">VoyageAI embedder reference config</a>
for details on <code>auto</code> resolution and the destination tensor layout required for <code>int8</code>
and <code>binary</code>. Note that not all Mistral models support <code>int8</code> and <code>binary</code>
quantization — see the
<a href="https://docs.mistral.ai/capabilities/embeddings/overview/">Mistral embeddings documentation</a>
for per-model support.</td>
<td>string</td>
<td>auto</td>
</tr>
</tbody>
</table>
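
<p>For <code>binary</code> quantization, bits are packed into <code>int8</code> cells, so a 1024-dimension
embedding becomes 128 <code>int8</code> cells. A schema sketch, assuming the embedder id <code>mistral</code>
and a source field named <code>text</code>:</p>

<pre>
field embedding type tensor<int8>(x[128]) {
    indexing: input text | embed mistral | attribute
}
</pre>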

<h2 id="huggingface-tokenizer-embedder">Huggingface tokenizer embedder</h2>
<p>
The Huggingface tokenizer embedder is configured in <a href="../applications/services/services.html">services.xml</a>,