@@ -576,93 +576,77 @@ <h4 id="voyageai-input-types">Input type detection</h4>
 <a href="https://javadoc.io/static/com.yahoo.vespa/linguistics/8.620.35/com/yahoo/language/process/Embedder.Context.html">Embedder.Context</a>
 when calling the embedder from Java code.</p>
 
-<h4 id="voyageai-best-practices">Best practices</h4>
-<p>For production deployments, we recommend configuring <strong>separate embedder components for feed and search operations</strong>.
-This architectural pattern provides two key benefits - cost optimization and rate limit isolation.
-In Vespa Cloud, it's best practice to configure these embedders in separate container clusters for feed and search.</p>
+<h4 id="voyageai-local-query-inference">Using voyage-4-nano for local query inference</h4>
+<p>The <a href="model-hub.html#voyage-4-nano">voyage-4-nano</a>
+model is available as an ONNX model for use with the
+<a href="#huggingface-embedder">Hugging Face embedder</a>.
+Since it shares the same embedding space as the larger
+<a href="https://blog.voyageai.com/2026/01/15/voyage-4/">Voyage 4</a> models,
+it can be used for query embeddings with local inference, trading some accuracy for lower cost
+by eliminating API usage for queries entirely.</p>
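+
+<p>A minimal sketch of a search container using this model for local query inference
+(the <code>model-id</code> value is an assumption here; see the
+<a href="model-hub.html#voyage-4-nano">model hub</a> for the exact identifier):</p>
+
+<pre>{% highlight xml %}
+<container id="search" version="1.0">
+    <!-- Local ONNX inference: query embeddings incur no API calls -->
+    <component id="voyage-nano" type="hugging-face-embedder">
+        <transformer-model model-id="voyage-4-nano"/>
+    </component>
+    <search/>
+</container>
+{% endhighlight %}</pre>
+
+<p>Documents can still be embedded with a larger Voyage 4 model at feed time,
+since the shared embedding space keeps the vectors comparable.</p>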
 
-<pre>{% highlight xml %}
-<container id="feed" version="1.0">
-    <component id="voyage" type="voyage-ai-embedder">
-        <model>voyage-4-large</model>
-        <dimensions>1024</dimensions>
-        <api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
-    </component>
-    <document-api/>
-</container>
+<h3 id="openai-embedder">OpenAI Embedder</h3>
 
-<container id="search" version="1.0">
-    <component id="voyage" type="voyage-ai-embedder">
-        <model>voyage-4-lite</model>
-        <dimensions>1024</dimensions>
-        <api-key-secret-ref>voyage_search_api_key</api-key-secret-ref>
+<p>Available since {% include version.html version="8.678" %}</p>
+
+<p>An embedder that uses the <a href="https://platform.openai.com/docs/guides/embeddings">OpenAI</a> embeddings API
+to generate embeddings for semantic search. The embedder can target any OpenAI-compatible API.</p>
+
+<pre>{% highlight xml %}
+<container version="1.0">
+    <component id="openai" type="openai-embedder">
+        <model>text-embedding-3-small</model>
+        <api-key-secret-ref>openai_api_key</api-key-secret-ref>
+        <dimensions>1536</dimensions>
     </component>
-    <search/>
 </container>
 {% endhighlight %}</pre>
 
-<h5 id="voyageai-cost-optimization">Cost optimization with model variants</h5>
-<p>The <a href="https://blog.voyageai.com/2026/01/15/voyage-4/">Voyage 4 model family</a> features a shared embedding space
-across different model sizes. This enables a cost-effective strategy where you can use a more powerful (and expensive) model
-for document embeddings, while using a smaller, cheaper model for query embeddings.
-Since document embedding happens once during indexing but query embedding occurs on every search request,
-this approach can significantly reduce operational costs while maintaining quality.</p>
-
-<p>The <a href="model-hub.html#voyage-4-nano">voyage-4-nano</a>
-model is available as an ONNX model for use with the
-<a href="#huggingface-embedder">Hugging Face embedder</a>.
-Since it shares the same embedding space as the larger Voyage 4 models,
-it can be used for query embeddings with local inference, trading some accuracy for lower cost
-by eliminating API usage for queries entirely.</p>
-
-<h5 id="voyageai-rate-limit-isolation">Rate limit isolation</h5>
-<p>Separating feed and search operations is particularly important for managing VoyageAI API rate limits.
-Bursty document feeding operations can consume significant API quota, potentially causing rate limit errors
-that affect search queries. By using <strong>separate API keys</strong> for feed and search embedders,
-you ensure that feeding bursts don't negatively impact search.</p>
+<ul>
+    <li>
+        The <code>model</code> element specifies which OpenAI model to use.
+    </li>
+    <li>
+        The <code>api-key-secret-ref</code> element references a secret in Vespa's
+        <a href="/en/cloud/security/secret-store.html">secret store</a> containing your OpenAI API key.
+        For self-hosted OpenAI-compatible endpoints that do not require authentication, this element can be omitted.
+    </li>
+</ul>
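+
+<p>Once configured, the embedder is referenced by its component id from the schema's
+<code>indexing</code> statement and from queries.
+A minimal sketch (the field and query parameter names are illustrative):</p>
+
+<pre>
+field embedding type tensor&lt;float&gt;(x[1536]) {
+    indexing: input text | embed openai | attribute
+}
+</pre>
+
+<p>At query time, pass the query text through the same embedder with
+<code>input.query(e)=embed(openai, @text)</code>.</p>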
 
-<h5 id="voyageai-document-processing-concurrency">Thread pool tuning</h5>
-<p>When using the VoyageAI embedder, container feed throughput is primarily limited by VoyageAI API latency
-combined with the document processing thread pool size, not by CPU. Each document being fed blocks a thread
-while waiting for the VoyageAI API response. To improve throughput, you likely have to increase the
-<a href="../reference/applications/services/docproc.html#threadpool">document processing thread pool size</a>,
-assuming the content cluster is not the bottleneck.</p>
+<p>See the <a href="../reference/rag/embedding.html#openai-embedder-reference-config">reference</a>
+for all configuration parameters.</p>
 
-<p>For example, consider a container cluster with 2 nodes, each with 8 vCPUs. With the default document processing
-thread pool size of 1 thread per vCPU, you have 16 total threads. If the average VoyageAI API latency is 200ms,
-the maximum throughput is approximately 16 / 0.2 = 80 documents/second.
-See <a href="../performance/container-tuning.html">container tuning</a> for more on container tuning.</p>
+<h3 id="mistral-embedder">Mistral Embedder</h3>
 
-<p>Note that the effective throughput can never exceed the rate limit of your VoyageAI API key.
-Use the <a href="https://docs.vespa.ai/en/reference/operations/metrics/container.html">embedder metrics</a>
-to determine embedder latency and throughput.
-For additional throughput improvements, consider enabling <a href="#voyageai-dynamic-batching">dynamic batching</a>.</p>
+<p>Available since {% include version.html version="8.678" %}</p>
 
-<h5 id="voyageai-dynamic-batching">Dynamic batching</h5>
-<p>Dynamic batching combines multiple concurrent embedding requests into a single VoyageAI API call.
-This is useful when throughput is constrained by VoyageAI's
-<a href="https://docs.voyageai.com/docs/rate-limits">RPM (requests per minute) limit</a>
-rather than the TPM (tokens per minute) limit.
-Batching reduces RPM usage by combining requests; TPM usage is unaffected.</p>
+<p>An embedder that uses the <a href="https://docs.mistral.ai/capabilities/embeddings/overview/">Mistral</a>
+embeddings API to generate embeddings for semantic search.</p>
 
 <pre>{% highlight xml %}
-<container id="feed" version="1.0">
-    <component id="voyage" type="voyage-ai-embedder">
-        <model>voyage-4-large</model>
+<container version="1.0">
+    <component id="mistral" type="mistral-embedder">
+        <model>mistral-embed</model>
+        <api-key-secret-ref>mistral_api_key</api-key-secret-ref>
         <dimensions>1024</dimensions>
-        <api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
-        <batching max-size="16" max-delay="200ms"/>
     </component>
-    <document-api/>
 </container>
 {% endhighlight %}</pre>
 
-<p>The <code>max-size</code> attribute sets the maximum number of requests in a single batch,
-and <code>max-delay</code> sets the maximum time to wait for a full batch before sending a partial one.
-Batching is disabled by default.</p>
+<ul>
+    <li>
+        The <code>model</code> element specifies which Mistral model to use.
+    </li>
+    <li>
+        The <code>api-key-secret-ref</code> element references a secret in Vespa's
+        <a href="/en/cloud/security/secret-store.html">secret store</a> containing your Mistral API key.
+        This is required for authentication.
+    </li>
+</ul>
 
-<p>The <a href="../reference/applications/services/docproc.html#threadpool">document processing thread pool size</a>
-should be at least <code>max-size</code>, since each thread contributes one request to the batch.</p>
+<p>Mistral supports output quantization on models that offer it, such as <code>codestral-embed</code>.
+See the <a href="../reference/rag/embedding.html#mistral-embedder-reference-config">reference</a>
+for all configuration parameters.</p>
 
 <h2 id="embedder-performance">Embedder performance</h2>
 
@@ -675,14 +659,21 @@ <h2 id="embedder-performance">Embedder performance</h2>
     </li>
     <li>
         The number of inputs to the <code>embed</code> call. When encoding arrays, consider how many inputs a single document can have.
-        For CPU inference, increasing <a href="../reference/api/document-v1.html#timeout">feed timeout</a> settings
+        For local CPU inference, increasing <a href="../reference/api/document-v1.html#timeout">feed timeout</a> settings
         might be required when documents have many <code>embed</code> inputs.
     </li>
 </ul>
-<p>Using <a href="../reference/rag/embedding.html#embedder-onnx-reference-config">GPU</a>, especially for longer sequence lengths (documents),
+<p>For local ONNX-based embedders (such as the <a href="#huggingface-embedder">Hugging Face</a>,
+<a href="#bert-embedder">Bert</a>, <a href="#colbert-embedder">ColBERT</a>, and <a href="#splade-embedder">SPLADE</a> embedders),
+using <a href="../reference/rag/embedding.html#embedder-onnx-reference-config">GPU</a>, especially for longer sequence lengths (documents),
 can dramatically improve performance and reduce cost.
 See the blog post on <a href="https://blog.vespa.ai/gpu-accelerated-ml-inference-in-vespa-cloud/">GPU-accelerated ML inference in Vespa Cloud</a>.
 With GPU-accelerated instances, using fp16 models instead of fp32 can increase throughput by as much as 3x compared to fp32.</p>
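+<p>A minimal sketch of enabling GPU inference for a local embedder, assuming a GPU-equipped node
+(see the <a href="../reference/rag/embedding.html#embedder-onnx-reference-config">ONNX reference</a>
+for the exact parameters):</p>
+
+<pre>{% highlight xml %}
+<component id="e5" type="hugging-face-embedder">
+    <transformer-model model-id="e5-small-v2"/>
+    <!-- Run ONNX inference on GPU device 0 instead of CPU -->
+    <onnx-gpu-device>0</onnx-gpu-device>
+</component>
+{% endhighlight %}</pre>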
+<p>For cloud embedders that call an external API
+(<a href="#voyageai-embedder">VoyageAI</a>, <a href="#openai-embedder">OpenAI</a>, <a href="#mistral-embedder">Mistral</a>),
+throughput is bound by API latency and rate limits rather than local hardware.
+See <a href="#thread-pool-tuning">Thread pool tuning for cloud embedders</a> and
+<a href="#dynamic-batching">dynamic batching</a> for tuning guidance.</p>
 <p>
     Refer to <a href="../rag/binarizing-vectors">binarizing vectors</a> for how to reduce vector size.
 </p>
@@ -843,6 +834,99 @@ <h3 id="combining-with-foreach">Combining with foreach</h3>
 <p>See <a href="../writing/indexing.html#execution-value-example">Indexing language execution value</a> for details.</p>
 
 
+<h3 id="separate-feed-and-search-embedders">Separate feed and search embedders</h3>
+
+<p>In Vespa Cloud, it is a best practice to configure separate container clusters for feed and search, so that
+bursty feed load cannot affect query latency. When using HTTP-based cloud embedders
+(<a href="#voyageai-embedder">VoyageAI</a>, <a href="#openai-embedder">OpenAI</a>, <a href="#mistral-embedder">Mistral</a>),
+configure a separate embedder component in each cluster. This lets you pick different models and API keys per workload,
+and gives two additional benefits: <strong>cost optimization</strong> (via model variants) and
+<strong>rate limit isolation</strong>.</p>
+
+<pre>{% highlight xml %}
+<container id="feed" version="1.0">
+    <component id="voyage" type="voyage-ai-embedder">
+        <model>voyage-4-large</model>
+        <dimensions>1024</dimensions>
+        <api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
+    </component>
+    <document-api/>
+</container>
+
+<container id="search" version="1.0">
+    <component id="voyage" type="voyage-ai-embedder">
+        <model>voyage-4-lite</model>
+        <dimensions>1024</dimensions>
+        <api-key-secret-ref>voyage_search_api_key</api-key-secret-ref>
+    </component>
+    <search/>
+</container>
+{% endhighlight %}</pre>
+
+<h4 id="cost-optimization-with-model-variants">Cost optimization with model variants</h4>
+<p>When a provider offers multiple model sizes that share the same embedding space, you can use a more powerful
+(and more expensive) model for document embeddings while using a smaller, cheaper model for query embeddings.
+Since document embedding happens once during indexing but query embedding occurs on every search request,
+this can significantly reduce operational costs while maintaining retrieval quality.</p>
+
+<p>For example, the <a href="https://blog.voyageai.com/2026/01/15/voyage-4/">Voyage 4 model family</a> shares a
+vector space across sizes, making it a natural fit for this pattern:
+use <code>voyage-4-large</code> in the feed cluster and <code>voyage-4-lite</code> in the search cluster as shown above.
+See also <a href="#voyageai-local-query-inference">Using voyage-4-nano for local query inference</a>
+for an even more cost-effective query-side option.</p>
+
+<h4 id="rate-limit-isolation">Rate limit isolation</h4>
+<p>Separating feed and search operations is particularly important for managing API rate limits.
+Bursty document feeding operations can consume significant API quota, potentially causing rate limit errors
+that affect search queries. By using <strong>separate API keys</strong> for feed and search embedders,
+you ensure that feeding bursts don't negatively impact search.</p>
+
+<h3 id="thread-pool-tuning">Thread pool tuning for cloud embedders</h3>
+<p>When using an HTTP-based cloud embedder (VoyageAI, OpenAI, Mistral), container feed throughput is primarily
+limited by embedding API latency combined with the document processing thread pool size, not by CPU.
+Each document being fed blocks a thread while waiting for the embedding API response. To improve throughput,
+you likely have to increase the
+<a href="../reference/applications/services/docproc.html#threadpool">document processing thread pool size</a>,
+assuming the content cluster is not the bottleneck.</p>
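+
+<p>A sketch of raising the thread pool in the feed cluster; the tuning element names below are
+an assumption, so consult the
+<a href="../reference/applications/services/docproc.html#threadpool">docproc reference</a>
+for the exact supported syntax:</p>
+
+<pre>{% highlight xml %}
+<container id="feed" version="1.0">
+    <document-processing>
+        <!-- Assumed tuning syntax: size the pool to cover time spent waiting on the API -->
+        <threadpool>
+            <threads>32</threads>
+        </threadpool>
+    </document-processing>
+    <document-api/>
+</container>
+{% endhighlight %}</pre>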
+
+<p>For example, consider a container cluster with 2 nodes, each with 8 vCPUs. With the default document processing
+thread pool size of 1 thread per vCPU, you have 16 total threads. If the average embedding API latency is 200ms,
+the maximum throughput is approximately 16 / 0.2 = 80 documents/second.
+See <a href="../performance/container-tuning.html">container tuning</a> for more details.</p>
+
+<p>Note that the effective throughput can never exceed the rate limit of your API key.
+Use the <a href="../reference/operations/metrics/container.html">embedder metrics</a>
+to determine embedder latency and throughput.
+For additional throughput improvements, consider enabling <a href="#dynamic-batching">dynamic batching</a>.</p>
+
+<h3 id="dynamic-batching">Dynamic batching</h3>
+<p>Dynamic batching combines multiple concurrent embedding requests into a single embedding API request.
+This is useful when throughput is constrained by the provider's
+requests-per-minute (RPM) limit rather than the tokens-per-minute (TPM) limit.
+Batching reduces RPM usage by combining requests; TPM usage is unaffected.</p>
+
+<p>Dynamic batching is supported by the <a href="#voyageai-embedder">VoyageAI</a>,
+<a href="#openai-embedder">OpenAI</a>, and <a href="#mistral-embedder">Mistral</a> embedders.</p>
+
+<pre>{% highlight xml %}
+<container id="feed" version="1.0">
+    <component id="voyage" type="voyage-ai-embedder">
+        <model>voyage-4-large</model>
+        <dimensions>1024</dimensions>
+        <api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
+        <batching max-size="16" max-delay="200ms"/>
+    </component>
+    <document-api/>
+</container>
+{% endhighlight %}</pre>
+
+<p>The <code>max-size</code> attribute sets the maximum number of requests in a single batch,
+and <code>max-delay</code> sets the maximum time to wait for a full batch before sending a partial one.
+Batching is disabled by default.</p>
+
+<p>The <a href="../reference/applications/services/docproc.html#threadpool">document processing thread pool size</a>
+should be at least <code>max-size</code>, since each thread contributes one request to the batch.</p>
+
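+<p>To illustrate with the numbers from the thread pool example above: with 16 document processing threads,
+<code>max-size="16"</code>, and 200ms API latency, a saturated feed still moves about
+16 / 0.2 = 80 documents/second, but as roughly 5 batched API requests/second instead of 80,
+a 16x reduction in RPM usage.</p>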
 
 <h2 id="troubleshooting">Troubleshooting</h2>
 