213 changes: 143 additions & 70 deletions en/rag/embedding.html
<h4 id="voyageai-input-types">Input type detection</h4>
<a href="https://javadoc.io/static/com.yahoo.vespa/linguistics/8.620.35/com/yahoo/language/process/Embedder.Context.html">Embedder.Context</a>
when calling the embedder from Java code.</p>


<h4 id="voyageai-local-query-inference">Using voyage-4-nano for local query inference</h4>
<p>The <a href="model-hub.html#voyage-4-nano">voyage-4-nano</a>
model is available as an ONNX model for use with the
<a href="#huggingface-embedder">Hugging Face embedder</a>.
Since it shares the same embedding space as the larger
<a href="https://blog.voyageai.com/2026/01/15/voyage-4/">Voyage 4</a> models,
it can be used for query embeddings with local inference — trading some accuracy for lower cost
by eliminating API usage for queries entirely.</p>
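
<p>A minimal configuration sketch for this setup. The <code>model-id</code> value and the assumption that it
also resolves the tokenizer are illustrative; see the <a href="#huggingface-embedder">Hugging Face embedder</a>
section and the <a href="model-hub.html#voyage-4-nano">model hub</a> for the exact options:</p>

<pre>{% highlight xml %}
<container id="search" version="1.0">
<component id="voyage-nano" type="hugging-face-embedder">
<transformer-model model-id="voyage-4-nano"/>
</component>
<search/>
</container>
{% endhighlight %}</pre>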

<h3 id="openai-embedder">OpenAI Embedder</h3>

<p>An embedder that uses the <a href="https://platform.openai.com/docs/guides/embeddings">OpenAI</a> embeddings API
to generate embeddings for semantic search. The embedder can also target self-hosted OpenAI-compatible APIs.</p>

<pre>{% highlight xml %}
<container version="1.0">
<component id="openai" type="openai-embedder">
<model>text-embedding-3-small</model>
<api-key-secret-ref>openai_api_key</api-key-secret-ref>
<dimensions>1536</dimensions>
</component>
</container>
{% endhighlight %}</pre>

<ul>
<li>
The <code>model</code> specifies which OpenAI model to use.
</li>
<li>
The <code>api-key-secret-ref</code> references a secret in Vespa's
<a href="/en/cloud/security/secret-store.html">secret store</a> containing your OpenAI API key.
For self-hosted OpenAI-compatible endpoints that do not require authentication, this element can be omitted.
</li>
</ul>
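
<p>To target a self-hosted OpenAI-compatible API, set the <code>endpoint</code> element explicitly.
A sketch where the model name, dimensions, and URL are placeholders for your own deployment;
<code>api-key-secret-ref</code> is omitted here on the assumption that the endpoint requires no authentication:</p>

<pre>{% highlight xml %}
<container version="1.0">
<component id="openai" type="openai-embedder">
<model>nomic-embed-text</model>
<dimensions>768</dimensions>
<endpoint>http://localhost:8080/v1/embeddings</endpoint>
</component>
</container>
{% endhighlight %}</pre>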

<p>See the <a href="../reference/rag/embedding.html#openai-embedder-reference-config">reference</a>
for all configuration parameters.</p>

<h3 id="mistral-embedder">Mistral Embedder</h3>

<p>An embedder that uses the <a href="https://docs.mistral.ai/capabilities/embeddings/overview/">Mistral</a>
embeddings API to generate embeddings for semantic search.</p>

<pre>{% highlight xml %}
<container version="1.0">
<component id="mistral" type="mistral-embedder">
<model>mistral-embed</model>
<api-key-secret-ref>mistral_api_key</api-key-secret-ref>
<dimensions>1024</dimensions>
</component>
</container>
{% endhighlight %}</pre>

<ul>
<li>
The <code>model</code> specifies which Mistral model to use.
</li>
<li>
The <code>api-key-secret-ref</code> references a secret in Vespa's
<a href="/en/cloud/security/secret-store.html">secret store</a> containing your Mistral API key.
This is required for authentication.
</li>
</ul>

<p>Some Mistral models, such as <code>codestral-embed</code>, support output quantization.
See the <a href="../reference/rag/embedding.html#mistral-embedder-reference-config">reference</a>
for all configuration parameters.</p>

<h2 id="embedder-performance">Embedder performance</h2>

<h3 id="combining-with-foreach">Combining with foreach</h3>
<p>See <a href="../writing/indexing.html#execution-value-example">Indexing language execution value</a> for details.</p>


<h3 id="separate-feed-and-search-embedders">Separate feed and search embedders</h3>

<p>In Vespa Cloud, it is standard practice to configure separate container clusters for feed and search, so that
bursty feed load cannot affect query latency. When using HTTP-based cloud embedders
(<a href="#voyageai-embedder">VoyageAI</a>, <a href="#openai-embedder">OpenAI</a>, <a href="#mistral-embedder">Mistral</a>),
configure a separate embedder component in each cluster. This lets you pick different models and API keys per workload,
and gives two additional benefits: <strong>cost optimization</strong> (via model variants) and
<strong>rate limit isolation</strong>.</p>

<pre>{% highlight xml %}
<container id="feed" version="1.0">
<component id="voyage" type="voyage-ai-embedder">
<model>voyage-4-large</model>
<dimensions>1024</dimensions>
<api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
</component>
<document-api/>
</container>

<container id="search" version="1.0">
<component id="voyage" type="voyage-ai-embedder">
<model>voyage-4-lite</model>
<dimensions>1024</dimensions>
<api-key-secret-ref>voyage_search_api_key</api-key-secret-ref>
</component>
<search/>
</container>
{% endhighlight %}</pre>

<h4 id="cost-optimization-with-model-variants">Cost optimization with model variants</h4>
<p>When a provider offers multiple model sizes that share the same embedding space, you can use a more powerful
(and more expensive) model for document embeddings while using a smaller, cheaper model for query embeddings.
Since document embedding happens once during indexing but query embedding occurs on every search request,
this can significantly reduce operational costs while maintaining retrieval quality.</p>

<p>For example, the <a href="https://blog.voyageai.com/2026/01/15/voyage-4/">Voyage 4 model family</a> shares a
vector space across sizes, making it a natural fit for this pattern:
use <code>voyage-4-large</code> in the feed cluster and <code>voyage-4-lite</code> in the search cluster as shown above.
See also <a href="#voyageai-local-query-inference">Using voyage-4-nano for local query inference</a>
for an even more cost-effective query-side option.</p>

<h4 id="rate-limit-isolation">Rate limit isolation</h4>
<p>Separating feed and search operations is particularly important for managing API rate limits.
Bursty document feeding operations can consume significant API quota, potentially causing rate limit errors
that affect search queries. By using <strong>separate API keys</strong> for feed and search embedders,
you ensure that feeding bursts don't negatively impact search.</p>

<h3 id="thread-pool-tuning">Thread pool tuning for cloud embedders</h3>
<p>When using an HTTP-based cloud embedder (VoyageAI, OpenAI, Mistral), container feed throughput is primarily
limited by embedding API latency combined with the document processing thread pool size, not by CPU.
Each document being fed blocks a thread while waiting for the embedding API response. To improve throughput,
you likely have to increase the
<a href="../reference/applications/services/docproc.html#threadpool">document processing thread pool size</a>,
assuming the content cluster is not the bottleneck.</p>

<p>For example, consider a container cluster with 2 nodes, each with 8 vCPUs. With the default document processing
thread pool size of 1 thread per vCPU, you have 16 total threads. If the average embedding API latency is 200ms,
the maximum throughput is approximately 16 / 0.2 = 80 documents/second.
See <a href="../performance/container-tuning.html">container tuning</a> for more on container tuning.</p>
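
The bound above can be sketched as a quick calculation (an illustrative helper, not a Vespa API):

```python
def max_feed_throughput(nodes, vcpus_per_node, threads_per_vcpu, api_latency_s):
    """Upper bound on documents/second when each in-flight document blocks
    one document processing thread for one embedding API round trip."""
    total_threads = nodes * vcpus_per_node * threads_per_vcpu
    return total_threads / api_latency_s

# 2 nodes x 8 vCPUs x 1 thread/vCPU, at 200 ms average API latency:
print(max_feed_throughput(2, 8, 1, 0.2))  # 80.0
```

Doubling the thread pool size (or halving API latency) doubles this bound, until the content cluster or the API rate limit becomes the bottleneck.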

<p>Note that the effective throughput can never exceed the rate limit of your API key.
Use the <a href="https://docs.vespa.ai/en/reference/operations/metrics/container.html">embedder metrics</a>
to determine embedder latency and throughput.
For additional throughput improvements, consider enabling <a href="#dynamic-batching">dynamic batching</a>.</p>

<h3 id="dynamic-batching">Dynamic batching</h3>
<p>Dynamic batching combines multiple concurrent embedding requests into a single embedding API call.
This is useful when throughput is constrained by the provider's
requests-per-minute (RPM) limit rather than the tokens-per-minute (TPM) limit.
Batching reduces RPM usage by combining requests; TPM usage is unaffected.</p>
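
A back-of-the-envelope sketch of the effect on RPM (an illustrative helper, not a Vespa API):

```python
import math

def batched_requests_per_min(embed_calls_per_min, max_batch_size):
    """API requests/minute when up to max_batch_size concurrent embedding
    requests are coalesced into one API call; token usage is unchanged."""
    return math.ceil(embed_calls_per_min / max_batch_size)

# 4800 embedding requests/min with batches of up to 16:
print(batched_requests_per_min(4800, 16))  # 300
```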

<p>Dynamic batching is supported by the <a href="#voyageai-embedder">VoyageAI</a>,
<a href="#openai-embedder">OpenAI</a>, and <a href="#mistral-embedder">Mistral</a> embedders.</p>

<pre>{% highlight xml %}
<container id="feed" version="1.0">
<component id="voyage" type="voyage-ai-embedder">
<model>voyage-4-large</model>
<dimensions>1024</dimensions>
<api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
<batching max-size="16" max-delay="200ms"/>
</component>
<document-api/>
</container>
{% endhighlight %}</pre>

<p>The <code>max-size</code> attribute sets the maximum number of requests in a single batch,
and <code>max-delay</code> sets the maximum time to wait for a full batch before sending a partial one.
Batching is disabled by default.</p>

<p>The <a href="../reference/applications/services/docproc.html#threadpool">document processing thread pool size</a>
should be at least <code>max-size</code>, since each thread contributes one request to the batch.</p>


<h2 id="troubleshooting">Troubleshooting</h2>

146 changes: 146 additions & 0 deletions en/reference/rag/embedding.html
<h3 id="voyageai-embedder-reference-config">VoyageAI embedder reference config</h3>
</tbody>
</table>

<h2 id="openai-embedder">OpenAI Embedder</h2>
<p>
An embedder that uses the <a href="https://platform.openai.com/docs/guides/embeddings">OpenAI</a> embeddings API
to generate embeddings.
</p>
<p>
The OpenAI embedder is configured in <a href="../applications/services/services.html">services.xml</a>,
within the <code>container</code> tag:
</p>
<pre>{% highlight xml %}
<container id="default" version="1.0">
<component id="openai" type="openai-embedder">
<model>text-embedding-3-small</model>
<api-key-secret-ref>openai_api_key</api-key-secret-ref>
<dimensions>1536</dimensions>
<endpoint>https://api.openai.com/v1/embeddings</endpoint>
</component>
</container>
{% endhighlight %}</pre>

<h3 id="openai-embedder-reference-config">OpenAI embedder reference config</h3>
<table class="table">
<thead>
<tr>
<th>Name</th>
<th>Occurrence</th>
<th>Description</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td>model</td>
<td>One</td>
<td><strong>Required</strong>. The OpenAI model to use, for example <code>text-embedding-3-small</code> or
<code>text-embedding-3-large</code>. See the
<a href="https://platform.openai.com/docs/guides/embeddings">OpenAI embeddings documentation</a>
for the complete list of available models.</td>
<td>string</td>
<td>N/A</td>
</tr>
<tr>
<td>dimensions</td>
<td>One</td>
<td><strong>Required</strong>. The number of dimensions for the output embedding vectors. Must match the
tensor field definition in your schema. The destination tensor field must use <code>float</code> or
<code>bfloat16</code> cell type — the OpenAI API does not support quantization.</td>
<td>integer</td>
<td>N/A</td>
</tr>
<tr>
<td>api-key-secret-ref</td>
<td>Optional</td>
<td>Reference to the secret in Vespa's <a href="/en/cloud/security/secret-store.html">secret store</a>
containing the OpenAI API key. When unset, requests are sent without an <code>Authorization</code> header.</td>
<td>string</td>
<td>"" (no auth)</td>
</tr>
<tr>
<td>endpoint</td>
<td>Optional</td>
<td>OpenAI API endpoint URL. Set this to target a specific OpenAI-compatible API.</td>
<td>string</td>
<td>https://api.openai.com/v1/embeddings</td>
</tr>
</tbody>
</table>
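
<p>The destination tensor field for the configuration example above could look like the following schema sketch,
assuming a source field named <code>text</code> and the embedder id <code>openai</code>:</p>

<pre>
field embedding type tensor<float>(x[1536]) {
    indexing: input text | embed openai | attribute | index
}
</pre>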

<h2 id="mistral-embedder">Mistral Embedder</h2>
<p>
An embedder that uses the <a href="https://docs.mistral.ai/capabilities/embeddings/overview/">Mistral</a>
embeddings API to generate embeddings.
</p>
<p>
The Mistral embedder is configured in <a href="../applications/services/services.html">services.xml</a>,
within the <code>container</code> tag:
</p>
<pre>{% highlight xml %}
<container id="default" version="1.0">
<component id="mistral" type="mistral-embedder">
<model>mistral-embed</model>
<api-key-secret-ref>mistral_api_key</api-key-secret-ref>
<dimensions>1024</dimensions>
<quantization>auto</quantization>
</component>
</container>
{% endhighlight %}</pre>

<h3 id="mistral-embedder-reference-config">Mistral embedder reference config</h3>
<table class="table">
<thead>
<tr>
<th>Name</th>
<th>Occurrence</th>
<th>Description</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td>model</td>
<td>One</td>
<td><strong>Required</strong>. The Mistral model to use, for example <code>mistral-embed</code> or
<code>codestral-embed</code>. See the
<a href="https://docs.mistral.ai/capabilities/embeddings/overview/">Mistral embeddings documentation</a>
for the complete list of available models.</td>
<td>string</td>
<td>N/A</td>
</tr>
<tr>
<td>api-key-secret-ref</td>
<td>One</td>
<td><strong>Required</strong>. Reference to the secret in Vespa's
<a href="/en/cloud/security/secret-store.html">secret store</a> containing the Mistral API key.</td>
<td>string</td>
<td>N/A</td>
</tr>
<tr>
<td>dimensions</td>
<td>One</td>
<td><strong>Required</strong>. The number of dimensions for the output embedding vectors. Must match the
tensor field definition in your schema. See the
<a href="https://docs.mistral.ai/capabilities/embeddings/overview/">Mistral embeddings documentation</a>
for model-specific dimension support.</td>
<td>integer</td>
<td>N/A</td>
</tr>
<tr>
<td>quantization</td>
<td>Optional</td>
<td>Output quantization format for embedding vectors. Valid values are <code>auto</code>, <code>float</code>,
<code>int8</code>, or <code>binary</code>. See the <code>quantization</code> row of the
<a href="#voyageai-embedder-reference-config">VoyageAI embedder reference config</a>
for details on <code>auto</code> resolution and the destination tensor layout required for <code>int8</code>
and <code>binary</code>. Note that not all Mistral models support <code>int8</code> and <code>binary</code>
quantization — see the
<a href="https://docs.mistral.ai/capabilities/embeddings/overview/">Mistral embeddings documentation</a>
for per-model support.</td>
<td>string</td>
<td>auto</td>
</tr>
</tbody>
</table>
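
<p>For <code>binary</code> quantization, bits are packed into <code>int8</code> cells, so a 1024-dimension
embedding becomes 128 <code>int8</code> cells. A schema sketch, assuming the embedder id <code>mistral</code>
and a source field named <code>text</code>:</p>

<pre>
field embedding type tensor<int8>(x[128]) {
    indexing: input text | embed mistral | attribute
}
</pre>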

<h2 id="huggingface-tokenizer-embedder">Huggingface tokenizer embedder</h2>
<p>
The Huggingface tokenizer embedder is configured in <a href="../applications/services/services.html">services.xml</a>,