Commit 08b7199

Merge pull request #4655 from vespa-engine/bjorncs/openai-mistral-embedder-docs
Document OpenAI and Mistral embedders
2 parents: 918de5f + 5f1991a

2 files changed (331 additions & 71 deletions)

File tree:

en/rag/embedding.html (155 additions & 71 deletions)
@@ -576,93 +576,77 @@ <h4 id="voyageai-input-types">Input type detection</h4>
 <a href="https://javadoc.io/static/com.yahoo.vespa/linguistics/8.620.35/com/yahoo/language/process/Embedder.Context.html">Embedder.Context</a>
 when calling the embedder from Java code.</p>
 
-<h4 id="voyageai-best-practices">Best practices</h4>
-<p>For production deployments, we recommend configuring <strong>separate embedder components for feed and search operations</strong>.
-This architectural pattern provides two key benefits - cost optimization and rate limit isolation.
-In Vespa Cloud, it's best practice to configure these embedders in separate container clusters for feed and search.</p>
+<h4 id="voyageai-local-query-inference">Using voyage-4-nano for local query inference</h4>
+<p>The <a href="model-hub.html#voyage-4-nano">voyage-4-nano</a>
+model is available as an ONNX model for use with the
+<a href="#huggingface-embedder">Hugging Face embedder</a>.
+Since it shares the same embedding space as the larger
+<a href="https://blog.voyageai.com/2026/01/15/voyage-4/">Voyage 4</a> models,
+it can be used for query embeddings with local inference, trading some accuracy for lower cost
+by eliminating API usage for queries entirely.</p>
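A minimal sketch of such a query-side setup with the Hugging Face embedder, assuming voyage-4-nano
resolves as a model-id on the Vespa Cloud model hub (verify the exact id and any tokenizer
configuration on the model hub page linked above):

<pre>{% highlight xml %}
<container id="search" version="1.0">
  <!-- Assumed model-id; see model-hub.html#voyage-4-nano for the authoritative id -->
  <component id="voyage-nano" type="hugging-face-embedder">
    <transformer-model model-id="voyage-4-nano"/>
  </component>
  <search/>
</container>
{% endhighlight %}</pre>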
 
-<pre>{% highlight xml %}
-<container id="feed" version="1.0">
-  <component id="voyage" type="voyage-ai-embedder">
-    <model>voyage-4-large</model>
-    <dimensions>1024</dimensions>
-    <api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
-  </component>
-  <document-api/>
-</container>
+<h3 id="openai-embedder">OpenAI Embedder</h3>
 
-<container id="search" version="1.0">
-  <component id="voyage" type="voyage-ai-embedder">
-    <model>voyage-4-lite</model>
-    <dimensions>1024</dimensions>
-    <api-key-secret-ref>voyage_search_api_key</api-key-secret-ref>
+<p>Available since {% include version.html version="8.678" %}</p>
+
+<p>An embedder that uses the <a href="https://platform.openai.com/docs/guides/embeddings">OpenAI</a> embeddings API
+to generate embeddings for semantic search. The embedder can target any OpenAI-compatible API.</p>
+
+<pre>{% highlight xml %}
+<container version="1.0">
+  <component id="openai" type="openai-embedder">
+    <model>text-embedding-3-small</model>
+    <api-key-secret-ref>openai_api_key</api-key-secret-ref>
+    <dimensions>1536</dimensions>
   </component>
-  <search/>
 </container>
 {% endhighlight %}</pre>
 
-<h5 id="voyageai-cost-optimization">Cost optimization with model variants</h5>
-<p>The <a href="https://blog.voyageai.com/2026/01/15/voyage-4/">Voyage 4 model family</a> features a shared embedding space
-across different model sizes. This enables a cost-effective strategy where you can use a more powerful (and expensive) model
-for document embeddings, while using a smaller, cheaper model for query embeddings.
-Since document embedding happens once during indexing but query embedding occurs on every search request,
-this approach can significantly reduce operational costs while maintaining quality.</p>
-
-<p>The <a href="model-hub.html#voyage-4-nano">voyage-4-nano</a>
-model is available as an ONNX model for use with the
-<a href="#huggingface-embedder">Hugging Face embedder</a>.
-Since it shares the same embedding space as the larger Voyage 4 models,
-it can be used for query embeddings with local inference, trading some accuracy for lower cost
-by eliminating API usage for queries entirely.</p>
-
-<h5 id="voyageai-rate-limit-isolation">Rate limit isolation</h5>
-<p>Separating feed and search operations is particularly important for managing VoyageAI API rate limits.
-Bursty document feeding operations can consume significant API quota, potentially causing rate limit errors
-that affect search queries. By using <strong>separate API keys</strong> for feed and search embedders,
-you ensure that feeding bursts don't negatively impact search.</p>
+<ul>
+  <li>
+    The <code>model</code> specifies which OpenAI model to use.
+  </li>
+  <li>
+    The <code>api-key-secret-ref</code> references a secret in Vespa's
+    <a href="/en/cloud/security/secret-store.html">secret store</a> containing your OpenAI API key.
+    For self-hosted OpenAI-compatible endpoints that do not require authentication, this element can be omitted.
+  </li>
+</ul>
 
-<h5 id="voyageai-document-processing-concurrency">Thread pool tuning</h5>
-<p>When using the VoyageAI embedder, container feed throughput is primarily limited by VoyageAI API latency
-combined with the document processing thread pool size, not by CPU. Each document being fed blocks a thread
-while waiting for the VoyageAI API response. To improve throughput, you likely have to increase the
-<a href="../reference/applications/services/docproc.html#threadpool">document processing thread pool size</a>,
-assuming the content cluster is not the bottleneck.</p>
+<p>See the <a href="../reference/rag/embedding.html#openai-embedder-reference-config">reference</a>
+for all configuration parameters.</p>
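A minimal sketch of how this component could then be used from a schema, assuming a string field
named text; the schema and field names are illustrative, and the tensor is sized to the 1536
dimensions configured above:

<pre>
schema doc {
    document doc {
        field text type string {
            indexing: summary | index
        }
    }
    # "openai" matches the component id configured in services.xml
    field embedding type tensor&lt;float&gt;(x[1536]) {
        indexing: input text | embed openai | attribute | index
    }
}
</pre>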
 
-<p>For example, consider a container cluster with 2 nodes, each with 8 vCPUs. With the default document processing
-thread pool size of 1 thread per vCPU, you have 16 total threads. If the average VoyageAI API latency is 200ms,
-the maximum throughput is approximately 16 / 0.2 = 80 documents/second.
-See <a href="../performance/container-tuning.html">container tuning</a> for more on container tuning.</p>
+<h3 id="mistral-embedder">Mistral Embedder</h3>
 
-<p>Note that the effective throughput can never exceed the rate limit of your VoyageAI API key.
-Use the <a href="https://docs.vespa.ai/en/reference/operations/metrics/container.html">embedder metrics</a>
-to determine embedder latency and throughput.
-For additional throughput improvements, consider enabling <a href="#voyageai-dynamic-batching">dynamic batching</a>.</p>
+<p>Available since {% include version.html version="8.678" %}</p>
 
-<h5 id="voyageai-dynamic-batching">Dynamic batching</h5>
-<p>Dynamic batching combines multiple concurrent embedding requests into a single VoyageAI API call.
-This is useful when throughput is constrained by VoyageAI's
-<a href="https://docs.voyageai.com/docs/rate-limits">RPM (requests per minute) limit</a>
-rather than the TPM (tokens per minute) limit.
-Batching reduces RPM usage by combining requests; TPM usage is unaffected.</p>
+<p>An embedder that uses the <a href="https://docs.mistral.ai/capabilities/embeddings/overview/">Mistral</a>
+embeddings API to generate embeddings for semantic search.</p>
 
 <pre>{% highlight xml %}
-<container id="feed" version="1.0">
-  <component id="voyage" type="voyage-ai-embedder">
-    <model>voyage-4-large</model>
+<container version="1.0">
+  <component id="mistral" type="mistral-embedder">
+    <model>mistral-embed</model>
+    <api-key-secret-ref>mistral_api_key</api-key-secret-ref>
     <dimensions>1024</dimensions>
-    <api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
-    <batching max-size="16" max-delay="200ms"/>
   </component>
-  <document-api/>
 </container>
 {% endhighlight %}</pre>
 
-<p>The <code>max-size</code> attribute sets the maximum number of requests in a single batch,
-and <code>max-delay</code> sets the maximum time to wait for a full batch before sending a partial one.
-Batching is disabled by default.</p>
+<ul>
+  <li>
+    The <code>model</code> specifies which Mistral model to use.
+  </li>
+  <li>
+    The <code>api-key-secret-ref</code> references a secret in Vespa's
+    <a href="/en/cloud/security/secret-store.html">secret store</a> containing your Mistral API key.
+    This is required for authentication.
+  </li>
+</ul>
 
-<p>The <a href="../reference/applications/services/docproc.html#threadpool">document processing thread pool size</a>
-should be at least <code>max-size</code>, since each thread contributes one request to the batch.</p>
+<p>Mistral supports output quantization on models that offer it, such as <code>codestral-embed</code>.
+See the <a href="../reference/rag/embedding.html#mistral-embedder-reference-config">reference</a>
+for all configuration parameters.</p>
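Array inputs work the same way through a mapped tensor; an illustrative sketch, assuming a
chunks array field and the 1024 dimensions configured above (field names are hypothetical):

<pre>
# Each array element is embedded separately into one mapped tensor cell
field chunk_embeddings type tensor&lt;float&gt;(chunk{}, x[1024]) {
    indexing: input chunks | embed mistral | attribute
}
</pre>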
 
 <h2 id="embedder-performance">Embedder performance</h2>
 
@@ -675,14 +659,21 @@ <h2 id="embedder-performance">Embedder performance</h2>
 </li>
 <li>
 The number of inputs to the <code>embed</code> call. When encoding arrays, consider how many inputs a single document can have.
-For CPU inference, increasing <a href="../reference/api/document-v1.html#timeout">feed timeout</a> settings
+For local CPU inference, increasing <a href="../reference/api/document-v1.html#timeout">feed timeout</a> settings
 might be required when documents have many <code>embed</code> inputs.
 </li>
 </ul>
-<p>Using <a href="../reference/rag/embedding.html#embedder-onnx-reference-config">GPU</a>, especially for longer sequence lengths (documents),
+<p>For local ONNX-based embedders (such as the <a href="#huggingface-embedder">Hugging Face</a>,
+<a href="#bert-embedder">Bert</a>, <a href="#colbert-embedder">ColBERT</a>, and <a href="#splade-embedder">SPLADE</a> embedders),
+using <a href="../reference/rag/embedding.html#embedder-onnx-reference-config">GPU</a>, especially for longer sequence lengths (documents),
 can dramatically improve performance and reduce cost.
 See the blog post on <a href="https://blog.vespa.ai/gpu-accelerated-ml-inference-in-vespa-cloud/">GPU-accelerated ML inference in Vespa Cloud</a>.
 With GPU-accelerated instances, using fp16 models instead of fp32 can increase throughput by as much as 3x.</p>
+<p>For cloud embedders that call an external API
+(<a href="#voyageai-embedder">VoyageAI</a>, <a href="#openai-embedder">OpenAI</a>, <a href="#mistral-embedder">Mistral</a>),
+throughput is bound by API latency and rate limits rather than local hardware.
+See <a href="#thread-pool-tuning">Thread pool tuning for cloud embedders</a> and
+<a href="#dynamic-batching">dynamic batching</a> for tuning guidance.</p>
 <p>
 Refer to <a href="../rag/binarizing-vectors">binarizing vectors</a> for how to reduce vector size.
 </p>
@@ -843,6 +834,99 @@ <h3 id="combining-with-foreach">Combining with foreach</h3>
 <p>See <a href="../writing/indexing.html#execution-value-example">Indexing language execution value</a> for details.</p>
 
 
+<h3 id="separate-feed-and-search-embedders">Separate feed and search embedders</h3>
+
+<p>In Vespa Cloud, it is a general best practice to configure separate container clusters for feed and search, so that
+bursty feed load cannot affect query latency. When using HTTP-based cloud embedders
+(<a href="#voyageai-embedder">VoyageAI</a>, <a href="#openai-embedder">OpenAI</a>, <a href="#mistral-embedder">Mistral</a>),
+configure a separate embedder component in each cluster. This lets you pick different models and API keys per workload,
+and gives two additional benefits: <strong>cost optimization</strong> (via model variants) and
+<strong>rate limit isolation</strong>.</p>
+
+<pre>{% highlight xml %}
+<container id="feed" version="1.0">
+  <component id="voyage" type="voyage-ai-embedder">
+    <model>voyage-4-large</model>
+    <dimensions>1024</dimensions>
+    <api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
+  </component>
+  <document-api/>
+</container>
+
+<container id="search" version="1.0">
+  <component id="voyage" type="voyage-ai-embedder">
+    <model>voyage-4-lite</model>
+    <dimensions>1024</dimensions>
+    <api-key-secret-ref>voyage_search_api_key</api-key-secret-ref>
+  </component>
+  <search/>
+</container>
+{% endhighlight %}</pre>
+
+<h4 id="cost-optimization-with-model-variants">Cost optimization with model variants</h4>
+<p>When a provider offers multiple model sizes that share the same embedding space, you can use a more powerful
+(and more expensive) model for document embeddings while using a smaller, cheaper model for query embeddings.
+Since document embedding happens once during indexing but query embedding occurs on every search request,
+this can significantly reduce operational costs while maintaining retrieval quality.</p>
+
+<p>For example, the <a href="https://blog.voyageai.com/2026/01/15/voyage-4/">Voyage 4 model family</a> shares a
+vector space across sizes, making it a natural fit for this pattern:
+use <code>voyage-4-large</code> in the feed cluster and <code>voyage-4-lite</code> in the search cluster as shown above.
+See also <a href="#voyageai-local-query-inference">Using voyage-4-nano for local query inference</a>
+for an even more cost-effective query-side option.</p>
+
+<h4 id="rate-limit-isolation">Rate limit isolation</h4>
+<p>Separating feed and search operations is particularly important for managing API rate limits.
+Bursty document feeding operations can consume significant API quota, potentially causing rate limit errors
+that affect search queries. By using <strong>separate API keys</strong> for feed and search embedders,
+you ensure that feeding bursts don't negatively impact search.</p>
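To make the split concrete: queries against the search cluster invoke its embedder by component id.
An illustrative query, assuming a document-side embedding field and a rank profile that declares a
query tensor q (field, profile, and query text are hypothetical):

<pre>
yql=select * from doc where {targetHits: 10}nearestNeighbor(embedding, q)
input.query(q)=embed(voyage, @text)
text=how to tune feed throughput
</pre>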
+
+<h3 id="thread-pool-tuning">Thread pool tuning for cloud embedders</h3>
+<p>When using an HTTP-based cloud embedder (VoyageAI, OpenAI, Mistral), container feed throughput is primarily
+limited by embedding API latency combined with the document processing thread pool size, not by CPU.
+Each document being fed blocks a thread while waiting for the embedding API response. To improve throughput,
+you likely have to increase the
+<a href="../reference/applications/services/docproc.html#threadpool">document processing thread pool size</a>,
+assuming the content cluster is not the bottleneck.</p>
+
+<p>For example, consider a container cluster with 2 nodes, each with 8 vCPUs. With the default document processing
+thread pool size of 1 thread per vCPU, you have 16 total threads. If the average embedding API latency is 200ms,
+the maximum throughput is approximately 16 / 0.2 = 80 documents/second.
+See <a href="../performance/container-tuning.html">container tuning</a> for more details.</p>
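An illustrative sketch of raising the feed cluster's thread pool; the exact element names and
thread-count semantics should be verified against the docproc reference linked above:

<pre>{% highlight xml %}
<container id="feed" version="1.0">
  <document-processing>
    <!-- Sketch only: with 64 threads and ~200ms API latency,
         the throughput ceiling is roughly 64 / 0.2 = 320 documents/second -->
    <threadpool>
      <threads>64</threads>
    </threadpool>
  </document-processing>
  <document-api/>
</container>
{% endhighlight %}</pre>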
+
+<p>Note that the effective throughput can never exceed the rate limit of your API key.
+Use the <a href="../reference/operations/metrics/container.html">embedder metrics</a>
+to determine embedder latency and throughput.
+For additional throughput improvements, consider enabling <a href="#dynamic-batching">dynamic batching</a>.</p>
+
+<h3 id="dynamic-batching">Dynamic batching</h3>
+<p>Dynamic batching combines multiple concurrent embedding requests into a single embedding API call.
+This is useful when throughput is constrained by the provider's
+requests-per-minute (RPM) limit rather than the tokens-per-minute (TPM) limit.
+Batching reduces RPM usage by combining requests; TPM usage is unaffected.</p>
+
+<p>Dynamic batching is supported by the <a href="#voyageai-embedder">VoyageAI</a>,
+<a href="#openai-embedder">OpenAI</a>, and <a href="#mistral-embedder">Mistral</a> embedders.</p>
+
+<pre>{% highlight xml %}
+<container id="feed" version="1.0">
+  <component id="voyage" type="voyage-ai-embedder">
+    <model>voyage-4-large</model>
+    <dimensions>1024</dimensions>
+    <api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
+    <batching max-size="16" max-delay="200ms"/>
+  </component>
+  <document-api/>
+</container>
+{% endhighlight %}</pre>
+
+<p>The <code>max-size</code> attribute sets the maximum number of requests in a single batch,
+and <code>max-delay</code> sets the maximum time to wait for a full batch before sending a partial one.
+Batching is disabled by default.</p>
+
+<p>The <a href="../reference/applications/services/docproc.html#threadpool">document processing thread pool size</a>
+should be at least <code>max-size</code>, since each thread contributes one request to the batch.</p>
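A worked example under the configuration above: with a thread pool of 32 threads all blocked on
embedding calls, max-size="16" collapses those 32 concurrent requests into 2 API calls, cutting
RPM usage by up to 16x while total token usage (TPM) stays the same. When fewer than 16 requests
are queued, max-delay="200ms" bounds the extra latency a partial batch waits before being sent.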
+
 
 <h2 id="troubleshooting">Troubleshooting</h2>
 