---
# Copyright Vespa.ai. All rights reserved.
title: "Embedding"
redirect_from:
- /en/embedding.html
---
<p>
A common technique is to map unstructured data - say, text or images -
to points in an abstract vector space and then do the computation in that space.
For example, retrieve
similar data by <a href="../querying/approximate-nn-hnsw">finding nearby points in the vector space</a>,
or <a href="../ranking/onnx">using the vectors as input to a neural net</a>.
This mapping is referred to as <em>embedding</em>.
Read more about embedding and embedding management in this
<a href="https://blog.vespa.ai/tailoring-frozen-embeddings-with-vespa/">blog post</a>.
</p>
<p>Embedding vectors can be sent to Vespa in queries and writes:</p>
<img src="/assets/img/vespa-overview-embeddings-1.svg" alt="document- and query-embeddings"
width="890px" height="auto"/>
<p>Alternatively, you can use the <code>embed</code> function to generate the embeddings inside Vespa
to reduce vector transfer costs and make clients simpler:
</p>
<img src="/assets/img/vespa-overview-embeddings-2.svg" alt="Vespa's embedding feature, creating embeddings from text"
width="890px" height="auto"/>
<p>
Adding embeddings to schemas changes the characteristics of an application:
memory usage will grow, and feeding latency might increase.
Read more on how to address this in <a href="../rag/binarizing-vectors">binarizing vectors</a>.
</p>
<h2 id="configuring-embedders">Configuring embedders</h2>
<p>Embedders are <a href="../applications/components.html">components</a> which must be configured in your
<a href="../reference/applications/services/services.html">services.xml</a>. Components are shared and can be used across schemas.</p>
<pre>{% highlight xml %}
<container id="default" version="1.0">
<component id="hf-embedder" type="hugging-face-embedder">
<transformer-model path="my-models/model.onnx"/>
<tokenizer-model path="my-models/tokenizer.json"/>
<prepend>
<query>query:</query>
<document>passage:</document>
</prepend>
</component>
...
</container>
{% endhighlight %}</pre>
<p>You can <a href="https://javadoc.io/doc/com.yahoo.vespa/linguistics/latest/com/yahoo/language/process/Embedder.html">write your own</a>,
or use <a href="#provided-embedders">embedders provided in Vespa</a>.</p>
<p>
If you have multiple container clusters that are using the same embedder,
consider using an <a href="../reference/applications/services/container.html#include">include</a> statement
to avoid duplicating component config.
</p>
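<p>
A sketch of this approach (the directory name <code>shared-components</code> and the <code>components</code>
wrapper element in the included file are assumptions here; see the
<a href="../reference/applications/services/container.html#include">include reference</a> for the exact contract):
</p>
<pre>{% highlight xml %}
<!-- services.xml: both clusters include the same component definitions -->
<container id="feed" version="1.0">
    <include dir="shared-components"/>
    <document-api/>
</container>
<container id="query" version="1.0">
    <include dir="shared-components"/>
    <search/>
</container>

<!-- shared-components/embedders.xml -->
<components>
    <component id="hf-embedder" type="hugging-face-embedder">
        <transformer-model path="my-models/model.onnx"/>
        <tokenizer-model path="my-models/tokenizer.json"/>
    </component>
</components>
{% endhighlight %}</pre>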
<h2 id="embedding-a-query-text">Embedding a query text</h2>
<p>Where you would otherwise supply a tensor in a query request,
you can (with an embedder configured) instead supply any text enclosed in <code>embed()</code>:</p>
<pre>
input.query(q)=<span class="pre-hilite">embed(myEmbedderId, "Hello%20world")</span>
</pre>
<p>Both single and double quotes are permitted, and if you have only configured a single embedder,
you can skip the embedder id argument and the quotes.</p>
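<p>For example, if only a single embedder is configured, the embedder id can be left out:</p>
<pre>
input.query(q)=embed("Hello world")
</pre>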
<p>The text argument can be supplied by a referenced parameter instead, using the <code>@parameter</code> syntax:</p>
<pre>{% highlight json %}
{
"yql": "select * from doc where {totalTargetHits:10}nearestNeighbor(embedding_field, query_embedding)",
"text": "my text to embed",
"input.query(query_embedding)": "embed(@text)",
}
{% endhighlight %}</pre>
<p>Remember that regardless of whether you are using embedders, input tensors
must always be <a href="../reference/schemas/schemas.html#inputs">defined in the schema's rank-profile</a>.</p>
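<p>A minimal sketch of such a rank-profile definition, assuming a 384-dimensional embedding field named
<code>embeddings</code> (as in the schema examples below) and a query tensor named <code>query(q)</code>:</p>
<pre>
rank-profile semantic {
    inputs {
        query(q) tensor<float>(x[384])
    }
    first-phase {
        expression: closeness(field, embeddings)
    }
}
</pre>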
<h2 id="embedding-a-document-field">Embedding a document field</h2>
<p>Use the <code><a href="../reference/writing/indexing-language.html#embed">embed</a></code> function
of the <a href="../reference/writing/indexing-language.html#indexing-statement">indexing language</a>
to convert strings into embeddings:</p>
<pre>
schema doc {
document doc {
field title type string {
indexing: summary | index
}
}
field embeddings type tensor<bfloat16>(x[384]) {
indexing {
input title | <span class="pre-hilite">embed embedderId</span> | attribute | index
}
}
}
</pre>
<p>Notice that the embedding field is defined outside the <code>document</code> clause in the schema.
If you have only configured a single embedder, you can skip the embedder id argument.</p>
<p>The input field can also be an array, in which case the output becomes a rank-two tensor; see
<a href="https://blog.vespa.ai/semantic-search-with-multi-vector-indexing/">this blog post</a>:</p>
<pre>
schema doc {
document doc {
field chunks type array<string> {
indexing: index | summary
}
}
field embeddings type tensor<bfloat16>(p{},x[5]) {
indexing: input chunks | <span class="pre-hilite">embed embedderId</span> | attribute | index
}
}
</pre>
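<p>Querying such a multi-vector field works the same way as for a single vector. A minimal sketch, assuming a
rank-profile that defines <code>query(q)</code> as <code>tensor<float>(x[5])</code>:</p>
<pre>{% highlight json %}
{
    "yql": "select * from doc where {targetHits:10}nearestNeighbor(embeddings, q)",
    "input.query(q)": "embed(embedderId, @text)",
    "text": "my query text"
}
{% endhighlight %}</pre>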
<h2 id="provided-embedders">Provided embedders</h2>
<p>Vespa provides several embedders as part of the platform.</p>
<h3 id="huggingface-embedder">Huggingface Embedder</h3>
<p>An embedder using any <a href="https://huggingface.co/docs/tokenizers/index">Huggingface tokenizer</a>,
including multilingual tokenizers,
to produce tokens which are then input to a supplied transformer model in <a href="https://onnx.ai/">ONNX</a> model format:</p>
<pre>{% highlight xml %}
<container id="default" version="1.0">
<component id="hf-embedder" type="hugging-face-embedder">
<transformer-model path="my-models/model.onnx"/>
<tokenizer-model path="my-models/tokenizer.json"/>
</component>
...
</container>
{% endhighlight %}</pre>
<p>The huggingface-embedder supports all
<a href="https://huggingface.co/docs/tokenizers/index">Huggingface tokenizer implementations</a>.</p>
<ul>
<li>
The <code>transformer-model</code> specifies the embedding model in <a href="https://onnx.ai/">ONNX</a> format.
See <a href="../ranking/onnx#using-optimum-to-export-models-to-onnx-format">exporting models to ONNX</a>
for how to export embedding models from Huggingface to be compatible with Vespa's <code>hugging-face-embedder</code>.
See <a href="../ranking/onnx#limitations-on-model-size-and-complexity">Limitations on Model Size and Complexity</a>
for details on the ONNX model format supported by Vespa.
</li>
<li>
The <code>tokenizer-model</code> specifies the Huggingface <code>tokenizer.json</code> formatted file.
See <a href="https://huggingface.co/transformers/v4.8.0/fast_tokenizers.html#loading-from-a-json-file">HF loading tokenizer from a JSON file.</a>
</li>
</ul>
<p>Use <code>path</code> to supply the model files from the application package,
<code>url</code> to supply them from a remote server, or
<code>model-id</code> to use a
<a href="https://cloud.vespa.ai/en/model-hub#hugging-face-embedder">model supplied by Vespa Cloud</a>.
You can also use a model hosted in a private Huggingface Model Hub repository by adding your Huggingface API token
to the <a href="../security/secret-store">secret store</a> and referring to the secret
using <code>secret-ref</code> in the model tag.
See <a href="../reference/rag/embedding.html#model-config-reference">model config reference</a> for more details.
</p>
<pre>{% highlight xml %}
<container id="default" version="1.0">
<component id="e5" type="hugging-face-embedder">
<transformer-model url="https://huggingface.co/intfloat/e5-small-v2/resolve/main/model.onnx"/>
<tokenizer-model url="https://huggingface.co/intfloat/e5-small-v2/raw/main/tokenizer.json"/>
</component>
...
</container>{% endhighlight %}</pre>
<p>See the <a href="../reference/rag/embedding.html#huggingface-embedder-reference-config">reference</a>
for all configuration parameters.</p>
<h4 id="huggingface-embedder-models">Huggingface embedder models</h4>
<p>
The following are examples of text embedding models that can be used with the hugging-face-embedder
and their output <a href="../ranking/tensor-user-guide.html">tensor</a> dimensionality.
The resulting <a href="../reference/ranking/tensor.html#tensor-type-spec">tensor type</a> can use <code>float</code> or
<code>bfloat16</code> cells, or <code>int8</code> with binarized quantization.
See blog post <a href="https://blog.vespa.ai/combining-matryoshka-with-binary-quantization-using-embedder/">Combining matryoshka with binary-quantization</a>
for more examples of using the Huggingface embedder with binary quantization.
</p>
<p>The following models use <code>pooling-strategy</code> <code>mean</code>,
which is the default <a href="../reference/rag/embedding.html#huggingface-embedder-reference-config">pooling-strategy</a>:</p>
<ul>
<li><a href="https://huggingface.co/intfloat/e5-small-v2">intfloat/e5-small-v2</a> produces <code>tensor<float>(x[384])</code></li>
<li><a href="https://huggingface.co/intfloat/e5-base-v2">intfloat/e5-base-v2</a> produces <code>tensor<float>(x[768])</code></li>
<li><a href="https://huggingface.co/intfloat/e5-large-v2">intfloat/e5-large-v2</a> produces <code>tensor<float>(x[1024])</code></li>
<li><a href="https://huggingface.co/intfloat/multilingual-e5-base">intfloat/multilingual-e5-base</a> produces <code>tensor<float>(x[768])</code></li>
</ul>
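<p>For example, with <a href="https://huggingface.co/intfloat/e5-small-v2">intfloat/e5-small-v2</a> configured as the
<code>e5</code> component shown above, a corresponding schema field could look like the following sketch
(the input field, field name and distance metric are illustrative choices):</p>
<pre>
field embedding type tensor<float>(x[384]) {
    indexing: input text | embed e5 | attribute | index
    attribute {
        distance-metric: angular
    }
    index: hnsw
}
</pre>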
<p>
The following models are useful for binarization and Matryoshka dimensionality flexibility where only the first k
dimensions are retained.
<a href="https://blog.vespa.ai/combining-matryoshka-with-binary-quantization-using-embedder/">
Matryoshka 🤝 Binary vectors: Slash vector search costs with Vespa</a> is a great read on this subject.
When enabling binarization with <code>int8</code> use <a href="../reference/schemas/schemas.html#hamming">distance-metric hamming</a>:
</p>
<ul>
<li><a href="https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1">mxbai-embed-large-v1</a> produces <code>tensor<float>(x[1024])</code>. This model
is also useful for binarization, which can be triggered by using destination <code>tensor<int8>(x[128])</code>.
Use <code>pooling-strategy</code> <code>cls</code> and <code>normalize</code> <code>true</code>.</li>
<li><a href="https://huggingface.co/nomic-ai/nomic-embed-text-v1.5">nomic-embed-text-v1.5</a> produces <code>tensor<float>(x[768])</code>. This model
is also useful for binarization, which can be triggered by using destination <code>tensor<int8>(x[96])</code>. Use <code>normalize</code> <code>true</code>.</li>
</ul>
<p>Snowflake arctic model series:</p>
<ul>
<li><a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-xs">snowflake-arctic-embed-xs</a> produces <code>tensor<float>(x[384])</code>.
Use <code>pooling-strategy</code> <code>cls</code> and <code>normalize</code> <code>true</code>.</li>
<li><a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">snowflake-arctic-embed-m</a> produces <code>tensor<float>(x[768])</code>.
Use <code>pooling-strategy</code> <code>cls</code> and <code>normalize</code> <code>true</code>.</li>
</ul>
<p>All of these example text embedding models can be used in combination with Vespa's
<a href="../querying/nearest-neighbor-search">nearest neighbor search</a>
using the appropriate <a href="../reference/schemas/schemas.html#distance-metric">distance-metric</a>.
Notice that to use the <a href="../reference/schemas/schemas.html#prenormalized-angular">distance-metric: prenormalized-angular</a>,
the <code>normalize</code> configuration must be set to <code>true</code>.</p>
<p>Check the <a href="https://huggingface.co/blog/mteb">Massive Text Embedding Benchmark</a> (MTEB) benchmark and
<a href="https://huggingface.co/spaces/mteb/leaderboard">MTEB leaderboard</a>
for help with choosing an embedding model.</p>
<h3 id="bert-embedder">Bert embedder</h3>
<p>
DEPRECATED; prefer using the <a href="#huggingface-embedder">Huggingface
Embedder</a> instead of the Bert embedder.
</p>
<p>An embedder using the <a href="../reference/rag/embedding.html#wordpiece-embedder">WordPiece</a> embedder to produce tokens
which are then input to a supplied <a href="https://onnx.ai/">ONNX</a> model in the form expected by a BERT base model:</p>
<pre>{% highlight xml %}
<container version="1.0">
<component id="myBert" type="bert-embedder">
<transformer-model url="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/onnx/model.onnx"/>
<tokenizer-vocab url="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/raw/main/vocab.txt"/>
<max-tokens>128</max-tokens>
<transformer-output>last_hidden_state</transformer-output>
</component>
</container>
{% endhighlight %}</pre>
<ul>
<li>
The <code>transformer-model</code> specifies the embedding model in <a href="https://onnx.ai/">ONNX</a> format.
See <a href="../ranking/onnx#using-optimum-to-export-models-to-onnx-format">exporting models to ONNX</a>,
for how to export embedding models from Huggingface to compatible <a href="https://onnx.ai/">ONNX</a> format.
</li>
<li>
The <code>tokenizer-vocab</code> specifies the Huggingface <code>vocab.txt</code> file, with one valid token per line.
Note that the Bert embedder does not support the <code>tokenizer.json</code> formatted tokenizer configuration files.
This means that tokenization settings like max tokens should be set explicitly.
</li>
<li>
The <code>transformer-output</code> specifies the name given
to the embedding output in the <code>model.onnx</code> file;
this will differ depending on how the model is exported to
ONNX format. One common name is <code>last_hidden_state</code>,
especially in transformer-based models. Other common names are
<code>output</code> or
<code>output_0</code>,
<code>embedding</code> or
<code>embeddings</code>,
<code>sentence_embedding</code>,
<code>pooled_output</code>,
or
<code>encoder_last_hidden_state</code>.
The default is <code>output_0</code>.
</li>
</ul>
<p>The Bert embedder is limited to English (<a href="../reference/rag/embedding.html#wordpiece-embedder">WordPiece</a>) and
BERT-styled transformer models with three model inputs
(<em>input_ids, attention_mask, token_type_ids</em>).
Prefer using the <a href="#huggingface-embedder">Huggingface Embedder</a> instead of the Bert embedder.</p>
<p>See <a href="../reference/rag/embedding.html#bert-embedder-reference-config">configuration reference</a> for all configuration options.</p>
<h3 id="colbert-embedder">ColBERT embedder</h3>
<p>An embedder supporting <a href="https://github.com/stanford-futuredata/ColBERT">ColBERT</a> models. The
ColBERT embedder maps text to <em>token</em> embeddings, representing a text as multiple
contextualized embeddings. This produces better quality than reducing all tokens into a single vector.</p>
<p>Read more about ColBERT and the ColBERT embedder in the blog posts
<a href="https://blog.vespa.ai/announcing-colbert-embedder-in-vespa/">Announcing the Vespa ColBERT embedder</a>
and <a href="https://blog.vespa.ai/announcing-long-context-colbert-in-vespa/">Announcing Vespa Long-Context ColBERT</a>.</p>
<pre>{% highlight xml %}
<container version="1.0">
<component id="colbert" type="colbert-embedder">
<transformer-model url="https://huggingface.co/colbert-ir/colbertv2.0/resolve/main/model.onnx"/>
<tokenizer-model url="https://huggingface.co/colbert-ir/colbertv2.0/raw/main/tokenizer.json"/>
<max-query-tokens>32</max-query-tokens>
<max-document-tokens>128</max-document-tokens>
</component>
</container>
{% endhighlight %}</pre>
<ul>
<li>
The <code>transformer-model</code> specifies the ColBERT embedding model in <a href="https://onnx.ai/">ONNX</a> format.
See <a href="../ranking/onnx#using-optimum-to-export-models-to-onnx-format">exporting models to ONNX</a>
for how to export embedding models from Huggingface to compatible <a href="https://onnx.ai/">ONNX</a> format.
The <a href="https://huggingface.co/vespa-engine/col-minilm">vespa-engine/col-minilm</a> page on the HF
model hub has a detailed example of how to export a colbert checkpoint to ONNX format for accelerated inference.
</li>
<li>
The <code>tokenizer-model</code> specifies the Huggingface <code>tokenizer.json</code> formatted file.
See <a href="https://huggingface.co/transformers/v4.8.0/fast_tokenizers.html#loading-from-a-json-file"> HF loading tokenizer from a JSON file.</a>
</li>
<li>
The <code>max-query-tokens</code> controls the maximum number of query text tokens that are represented as vectors,
and similarly, <code>max-document-tokens</code> controls the document side. These parameters
can be used to control resource usage.
</li>
</ul>
<p>See <a href="../reference/rag/embedding.html#colbert-embedder-reference-config">configuration reference</a> for all
configuration options and defaults.</p>
<p>The ColBERT token embeddings are represented as a
<a href="../ranking/tensor-user-guide.html#tensor-concepts">mixed tensor</a>: <code>tensor<float>(token{}, x[dim])</code> where
<code>dim</code> is the vector dimensionality of the contextualized token embeddings.
The <a href="https://huggingface.co/colbert-ir/colbertv2.0">colbert model checkpoint</a> on Hugging Face hub
uses 128 dimensions.</p>
<p>The embedder destination tensor is defined in the <a href="../basics/schemas.html">schema</a>, and
depending on the target <a href="../reference/ranking/tensor.html#tensor-type-spec">tensor cell precision</a> definition
the embedder can compress the representation:
if the target tensor cell type is <code>int8</code>, the ColBERT embedder binarizes the document-side token embeddings
to 1 bit per value, reducing the token embedding storage footprint
by 32x compared to using <code>float</code>. The <i>query</i> representation is not binarized.
The following demonstrates two ways to use the ColBERT embedder in
the document schema to <a href="#embedding-a-document-field">embed a document field</a>.
</p>
<pre>
schema doc {
document doc {
field text type string {..}
}
field colbert_tokens type tensor<float>(token{}, x[128]) {
indexing: input text | embed colbert | attribute
}
field colbert_tokens_compressed type tensor<int8>(token{}, x[16]) {
indexing: input text | embed colbert | attribute
}
}
</pre>
<p>The first field, <code>colbert_tokens</code>, stores the original representation because the destination tensor
cell type is <code>float</code>. The second field, <code>colbert_tokens_compressed</code>, stores the binarized representation.
When using <code>int8</code> tensor cell precision,
divide the original vector dimensionality by 8 (128/8 = 16).</p>
<p>You can also use <code>bfloat16</code> instead of <code>float</code> to reduce storage by 2x compared to <code>float</code>.</p>
<pre>
field colbert_tokens type tensor<bfloat16>(token{}, x[128]) {
indexing: input text | embed colbert | attribute
}
</pre>
<p>You can also use the ColBERT embedder with an array of strings (representing chunks):</p>
<pre>
schema doc {
document doc {
field chunks type array<string> {..}
}
field colbert_tokens_compressed type tensor<int8>(chunk{}, token{}, x[16]) {
indexing: input chunks | embed colbert chunk | attribute
}
}
</pre>
<p>Here, we need a second mapped dimension in the target tensor and a second argument to embed,
telling the ColBERT embedder the name of the tensor dimension to use for the chunks.</p>
<p>Notice that the examples above did not specify the <code>index</code> function for creating an
<a href="../querying/approximate-nn-hnsw">HNSW</a> index.
The ColBERT representation is intended for ranking,
not for retrieval with Vespa's nearestNeighbor query operator;
for retrieval, use e.g. a document-level vector and/or lexical matching.</p>
<p>To reduce memory footprint, use <a href="../content/attributes.html#paged-attributes">paged attributes</a>.</p>
<h4 id="colbert-ranking">ColBERT ranking</h4>
<p>
See the sample applications for using ColBERT in ranking with variants of the MaxSim similarity operator
expressed using Vespa tensor computation expressions. See:
<a href="https://github.com/vespa-engine/sample-apps/tree/master/colbert">colbert</a> and
<a href="https://github.com/vespa-engine/sample-apps/tree/master/colbert-long">colbert-long</a>.
</p>
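<p>A minimal MaxSim sketch, assuming the <code>colbert_tokens</code> field above and a query tensor
<code>query(qt)</code> holding the ColBERT query token embeddings (the tensor and profile names here are
illustrative; see the sample applications for complete, tested rank-profiles):</p>
<pre>
rank-profile colbert-maxsim {
    inputs {
        query(qt) tensor<float>(querytoken{}, x[128])
    }
    first-phase {
        expression {
            sum(
                reduce(
                    sum(query(qt) * attribute(colbert_tokens), x),
                    max, token
                ),
                querytoken
            )
        }
    }
}
</pre>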
<h3 id="splade-embedder">SPLADE embedder</h3>
<p>An embedder supporting <a href="https://github.com/naver/splade">SPLADE</a> models. The
SPLADE embedder maps text to a mapped tensor, representing the text as a sparse vector of unique tokens and their weights.</p>
<pre>{% highlight xml %}
<container version="1.0">
<component id="splade" type="splade-embedder">
<transformer-model path="models/splade_model.onnx"/>
<tokenizer-model path="models/tokenizer.json"/>
</component>
</container>
{% endhighlight %}</pre>
<ul>
<li>
The <code>transformer-model</code> specifies the SPLADE embedding model in <a href="https://onnx.ai/">ONNX</a> format.
See <a href="../ranking/onnx#using-optimum-to-export-models-to-onnx-format">exporting models to ONNX</a>
for how to export embedding models from Huggingface to compatible <a href="https://onnx.ai/">ONNX</a> format.
</li>
<li>
The <code>tokenizer-model</code> specifies the Huggingface <code>tokenizer.json</code> formatted file.
See <a href="https://huggingface.co/transformers/v4.8.0/fast_tokenizers.html#loading-from-a-json-file"> HF loading tokenizer from a JSON file.</a>
</li>
</ul>
<p>See <a href="../reference/rag/embedding.html#splade-embedder-reference-config">configuration reference</a> for all
configuration options and defaults.</p>
<p>
The splade token weights are represented as a
<a href="../ranking/tensor-user-guide.html#tensor-concepts">mapped tensor</a>: <code>tensor<float>(token{})</code>.
</p>
<p>
The embedder destination tensor is defined in the <a href="../basics/schemas.html">schema</a>.
The following demonstrates how to use the SPLADE embedder in the document schema to
<a href="#embedding-a-document-field">embed a document field</a>.
</p>
<pre>
schema doc {
document doc {
field text type string {..}
}
field splade_tokens type tensor<float>(token{}) {
indexing: input text | embed splade | attribute
}
}
</pre>
<p>You can also use the SPLADE embedder with an array of strings (representing chunks), here also
using the lower tensor cell precision <code>bfloat16</code>:</p>
<pre>
schema doc {
document doc {
field chunks type array<string> {..}
}
field splade_tokens type tensor<bfloat16>(chunk{}, token{}) {
indexing: input chunks | embed splade chunk | attribute
}
}
</pre>
<p>Here, we need a second mapped dimension in the target tensor and a second argument to embed,
telling the splade embedder the name of the tensor dimension to use for the chunks.</p>
<p>To reduce memory footprint, use <a href="../content/attributes.html#paged-attributes">paged attributes</a>.</p>
<h4 id="splade-ranking">SPLADE ranking</h4>
<p>
See the <a href="https://github.com/vespa-engine/sample-apps/tree/master/splade">splade</a> sample application for how to use SPLADE in ranking,
including also how to use the SPLADE embedder with an array of strings (representing chunks).
</p>
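<p>Ranking with SPLADE is typically a sparse dot product between query and document token weights. A minimal
sketch, assuming the <code>splade_tokens</code> field above and an illustrative query tensor <code>query(qt)</code>
of type <code>tensor<float>(token{})</code>:</p>
<pre>
rank-profile splade {
    inputs {
        query(qt) tensor<float>(token{})
    }
    first-phase {
        expression: sum(query(qt) * attribute(splade_tokens))
    }
}
</pre>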
{% include chip-cloud.html %}
<h3 id="voyageai-embedder">VoyageAI Embedder</h3>
<p>An embedder that uses the <a href="https://www.voyageai.com/">VoyageAI</a> embedding API
to generate high-quality embeddings for semantic search. This embedder calls the VoyageAI API service
and does not require local model files or ONNX inference. All embeddings returned by VoyageAI are normalized
to unit length, making them suitable for cosine similarity and
<a href="../reference/schemas/schemas.html#prenormalized-angular">prenormalized-angular</a> distance metrics
(see <a href="https://docs.voyageai.com/docs/faq#which-similarity-function-should-i-use">VoyageAI FAQ</a>).</p>
<pre>{% highlight xml %}
<container version="1.0">
<component id="voyage" type="voyage-ai-embedder">
<model>voyage-4</model>
<api-key-secret-ref>voyage_api_key</api-key-secret-ref>
<dimensions>1024</dimensions>
</component>
</container>
{% endhighlight %}</pre>
<ul>
<li>
The <code>model</code> specifies which VoyageAI model to use.
</li>
<li>
The <code>api-key-secret-ref</code> references a secret in Vespa's
<a href="/en/cloud/security/secret-store.html">secret store</a> containing your VoyageAI API key.
This is required for authentication.
</li>
</ul>
<p>See the <a href="../reference/rag/embedding.html#voyageai-embedder-reference-config">reference</a>
for all configuration parameters.</p>
<h4 id="voyageai-embedder-models">VoyageAI embedder models</h4>
<p>For the complete list of available models and their specifications, see:</p>
<ul>
<li><a href="https://docs.voyageai.com/docs/embeddings">VoyageAI Embeddings Documentation</a> - General-purpose and specialized models</li>
<li><a href="https://docs.voyageai.com/docs/contextualized-chunk-embeddings">Contextualized Chunk Embeddings</a> - Models for embedding document chunks with surrounding context. See <a href="../rag/working-with-chunks.html">Working with chunks</a>.</li>
<li><a href="https://docs.voyageai.com/docs/multimodal-embeddings">Multimodal Embeddings</a> - Multimodal models for text, images, and video</li>
</ul>
<h4 id="voyageai-contextualized-embeddings">Contextualized chunk embeddings</h4>
<p>To use <a href="https://docs.voyageai.com/docs/contextualized-chunk-embeddings">contextualized chunk embeddings</a>,
configure the VoyageAI embedder with a <code>voyage-context-*</code> model and use it to embed an
<code>array<string></code> field containing your document chunks:</p>
<pre>
schema doc {
document doc {
field chunks type array<string> {
indexing: index | summary
}
}
field embeddings type tensor<float>(chunk{}, x[1024]) {
indexing: input chunks | embed voyage | attribute | index
attribute {
distance-metric: prenormalized-angular
}
}
}
</pre>
<p>
When embedding array fields with a contextualized chunk embedding model, Vespa sends all chunks from a document in a single API request,
allowing Voyage to encode each chunk with context from the other chunks.
Be aware that the combined size of all chunks in a document must fit within the VoyageAI API's input token limit.
See <a href="working-with-chunks.html">Working with chunks</a> for chunking strategies.
</p>
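<p>Queries embed the query text with the same component, as described in
<a href="#embedding-a-query-text">embedding a query text</a>. A minimal query sketch, assuming a rank-profile that
defines <code>query(q)</code> as <code>tensor<float>(x[1024])</code>:</p>
<pre>{% highlight json %}
{
    "yql": "select * from doc where {targetHits:10}nearestNeighbor(embeddings, q)",
    "input.query(q)": "embed(voyage, @text)",
    "text": "my query text"
}
{% endhighlight %}</pre>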
<h4 id="voyageai-input-types">Input type detection</h4>
<p>VoyageAI models distinguish between query and document embeddings for improved retrieval quality.
The embedder automatically detects the context and sets the appropriate input type based on whether
the embedding is performed during feed (indexing) or query processing in Vespa.</p>
<p>For advanced use cases where you need to control the input type programmatically,
you can use the <code>destination</code> property of the
<a href="https://javadoc.io/static/com.yahoo.vespa/linguistics/8.620.35/com/yahoo/language/process/Embedder.Context.html">Embedder.Context</a>
when calling the embedder from Java code.</p>
<h4 id="voyageai-best-practices">Best practices</h4>
<p>For production deployments, we recommend configuring <strong>separate embedder components for feed and search operations</strong>.
This architectural pattern provides two key benefits: cost optimization and rate limit isolation.
In Vespa Cloud, it's best practice to configure these embedders in separate container clusters for feed and search.</p>
<pre>{% highlight xml %}
<container id="feed" version="1.0">
<component id="voyage" type="voyage-ai-embedder">
<model>voyage-4-large</model>
<dimensions>1024</dimensions>
<api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
</component>
<document-api/>
</container>
<container id="search" version="1.0">
<component id="voyage" type="voyage-ai-embedder">
<model>voyage-4-lite</model>
<dimensions>1024</dimensions>
<api-key-secret-ref>voyage_search_api_key</api-key-secret-ref>
</component>
<search/>
</container>
{% endhighlight %}</pre>
<h5 id="voyageai-cost-optimization">Cost optimization with model variants</h5>
<p>The <a href="https://blog.voyageai.com/2026/01/15/voyage-4/">Voyage 4 model family</a> features a shared embedding space
across different model sizes. This enables a cost-effective strategy where you can use a more powerful (and expensive) model
for document embeddings, while using a smaller, cheaper model for query embeddings.
Since document embedding happens once during indexing but query embedding occurs on every search request,
this approach can significantly reduce operational costs while maintaining quality.</p>
<p>The <a href="model-hub.html#voyage-4-nano">voyage-4-nano</a>
model is available as an ONNX model for use with the
<a href="#huggingface-embedder">Hugging Face embedder</a>.
Since it shares the same embedding space as the larger Voyage 4 models,
it can be used for query embeddings with local inference, trading some accuracy for lower cost
by eliminating API usage for queries entirely.</p>
<h5 id="voyageai-rate-limit-isolation">Rate limit isolation</h5>
<p>Separating feed and search operations is particularly important for managing VoyageAI API rate limits.
Bursty document feeding operations can consume significant API quota, potentially causing rate limit errors
that affect search queries. By using <strong>separate API keys</strong> for feed and search embedders,
you ensure that feeding bursts don't negatively impact search.</p>
<h5 id="voyageai-document-processing-concurrency">Thread pool tuning</h5>
<p>When using the VoyageAI embedder, container feed throughput is primarily limited by VoyageAI API latency
combined with the document processing thread pool size, not by CPU. Each document being fed blocks a thread
while waiting for the VoyageAI API response. To improve throughput, you likely have to increase the
<a href="../reference/applications/services/docproc.html#threadpool">document processing thread pool size</a>,
assuming the content cluster is not the bottleneck.</p>
<p>For example, consider a container cluster with 2 nodes, each with 8 vCPUs. With the default document processing
thread pool size of 1 thread per vCPU, you have 16 total threads. If the average VoyageAI API latency is 200ms,
the maximum throughput is approximately 16 / 0.2 = 80 documents/second.
See <a href="../performance/container-tuning.html">container tuning</a> for more on container tuning.</p>
<p>Note that the effective throughput can never exceed the rate limit of your VoyageAI API key.
Use the <a href="https://docs.vespa.ai/en/reference/operations/metrics/container.html">embedder metrics</a>
to determine embedder latency and throughput.
For additional throughput improvements, consider enabling <a href="#voyageai-dynamic-batching">dynamic batching</a>.</p>
<h5 id="voyageai-dynamic-batching">Dynamic batching</h5>
<p>Dynamic batching combines multiple concurrent embedding requests into a single VoyageAI API call.
This is useful when throughput is constrained by VoyageAI's
<a href="https://docs.voyageai.com/docs/rate-limits">RPM (requests per minute) limit</a>
rather than the TPM (tokens per minute) limit.
Batching reduces RPM usage by combining requests; TPM usage is unaffected.</p>
<pre>{% highlight xml %}
<container id="feed" version="1.0">
<component id="voyage" type="voyage-ai-embedder">
<model>voyage-4-large</model>
<dimensions>1024</dimensions>
<api-key-secret-ref>voyage_feed_api_key</api-key-secret-ref>
<batching max-size="16" max-delay="200ms"/>
</component>
<document-api/>
</container>
{% endhighlight %}</pre>
<p>The <code>max-size</code> attribute sets the maximum number of requests in a single batch,
and <code>max-delay</code> sets the maximum time to wait for a full batch before sending a partial one.
Batching is disabled by default.</p>
<p>The <a href="../reference/applications/services/docproc.html#threadpool">document processing thread pool size</a>
should be at least <code>max-size</code>, since each thread contributes one request to the batch.</p>
<h2 id="embedder-performance">Embedder performance</h2>
<p>Embedding inference can be resource-intensive for larger embedding models. Factors that impact performance:</p>
<ul>
<li>The embedding model parameters. Larger models are more expensive to evaluate than smaller models.</li>
<li>The sequence input length. Transformer models scale quadratically with input length. Since queries
are typically shorter than documents, embedding queries is less computationally intensive than embedding documents.
</li>
<li>
The number of inputs to the <code>embed</code> call. When encoding arrays, consider how many inputs a single document can have.
For CPU inference, increasing <a href="../reference/api/document-v1.html#timeout">feed timeout</a> settings
might be required when documents have many <code>embed</code> inputs.
</li>
</ul>
<p>Using <a href="../reference/rag/embedding.html#embedder-onnx-reference-config">GPU</a>, especially for longer sequence lengths (documents),
can dramatically improve performance and reduce cost.
See the blog post on <a href="https://blog.vespa.ai/gpu-accelerated-ml-inference-in-vespa-cloud/">GPU-accelerated ML inference in Vespa Cloud</a>.
With GPU-accelerated instances, using fp16 models instead of fp32 can increase throughput by as much as 3x.</p>
<p>
Refer to <a href="../rag/binarizing-vectors">binarizing vectors</a> for how to reduce vector size.
</p>
<h2 id="metrics">Metrics</h2>
<p>Vespa's built-in embedders emit metrics for computation time and token sequence length.
These metrics are prefixed with <code>embedder.</code>
and listed in the <a href="../reference/operations/metrics/container.html">Container Metrics</a> reference documentation.
Third-party embedder implementations may inject the <code>ai.vespa.embedding.Embedder.Runtime</code> component to easily
emit the same predefined metrics, although emitting custom metrics is perfectly fine.</p>
<h2 id="sample-applications">Sample applications</h2>
<p>These sample applications use embedders:</p>
<ul>
<li><a href="https://github.com/vespa-engine/sample-apps/tree/master/commerce-product-ranking">commerce-product-ranking</a> -
demonstrates using multiple embedders</li>
<li><a href="https://github.com/vespa-engine/sample-apps/tree/master/multi-vector-indexing">multi-vector-indexing</a>
demonstrates how to use embedders with multiple document field inputs</li>
<li><a href="https://github.com/vespa-engine/sample-apps/tree/master/colbert">colbert</a>
demonstrates how to use the colbert-embedder </li>
<li><a href="https://github.com/vespa-engine/sample-apps/tree/master/colbert-long">colbert-long</a>
demonstrates how to use the colbert-embedder with long contexts (array input) </li>
<li><a href="https://github.com/vespa-engine/sample-apps/tree/master/splade">splade</a> demonstrates
how to use the splade-embedder.</li>
</ul>
<h2 id="tricks-and-tips">Tricks and tips</h2>
<p>Various tricks that are useful with embedders.</p>
<h3 id="adding-a-fixed-string-to-a-query-text">Adding a fixed string to a query text</h3>
<p>
Embedding models might require text to be prepended with a fixed string, e.g.:
</p>
<pre>{% highlight xml %}
<component id="e5" type="hugging-face-embedder">
<transformer-model url="https://huggingface.co/intfloat/e5-small-v2/resolve/main/model.onnx"/>
<tokenizer-model url="https://huggingface.co/intfloat/e5-small-v2/raw/main/tokenizer.json"/>
<prepend>
<query>query:</query>
<document>passage:</document>
</prepend>
</component>
{% endhighlight %}</pre>
<p>
The above configuration prepends the given strings to query text and to document field text before embedding.
Find a complete example in the <a href="https://github.com/vespa-engine/sample-apps/tree/master/colbert">ColBERT</a>
sample application.
</p>
<p>
An alternative approach is to use query profiles to prepend query data.
If you need to add a standard wrapper or a prefix instruction around the input text you want to embed,
use parameter substitution to supply the text, as in <code>embed(myEmbedderId, @text)</code>,
and let the parameter (<code>text</code> here) be defined in a <a href="../querying/query-profiles.html">query profile</a>,
which in turn uses <a href="../querying/query-profiles.html#value-substitution">value substitution</a>
to insert the value of another query request parameter into it. The following concrete example,
where queries should have a prefix instruction before being embedded into a vector representation,
defines a <code>text</code> field in <code>search/query-profiles/default.xml</code>:
</p>
<pre>{% highlight xml %}
<query-profile id="default">
<field name="text">"Represent this sentence for searching relevant passages: %{user_query}</field>
</query-profile>
{% endhighlight %}</pre>
<p>Then, at query request time, we pass <code>user_query</code> as a request parameter; this parameter is used to produce
the <code>text</code> value, which is then embedded.</p>
<pre>{% highlight json %}
{
"yql": "select * from doc where userQuery() or (totalTtargetHits: 100}nearestNeighbor(embedding, e))",
"input.query(e)": "embed(mxbai, @text)",
"user_query": "space contains many suns"
}
{% endhighlight %}</pre>
<p>
The text that is embedded by the embedder is then:
<em>Represent this sentence for searching relevant passages: space contains many suns</em>.
</p>
<h3 id="concatenating-input-fields">Concatenating input fields</h3>
<p>You can concatenate values in indexing using "<code>.</code>", and handle missing field values using
<a href="../writing/indexing.html#choice-example">choice</a>
to produce a single input for an embedder:
</p>
<pre>
schema doc {
document doc {
field title type string {
indexing: summary | index
}
field body type string {
indexing: summary | index
}
}
field embeddings type tensor<bfloat16>(x[384]) {
indexing {
(input title || "") . " " . (input body || "") | <span class="pre-hilite">embed embedderId</span> | attribute | index
}
index: hnsw
}
}
</pre>
<p>You can also use concatenation to add a fixed preamble to the string to embed.</p>
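<p>A sketch of this, extending the field above with an illustrative <em>passage:</em> preamble:</p>
<pre>
field embeddings type tensor<bfloat16>(x[384]) {
    indexing {
        "passage: " . (input title || "") . " " . (input body || "") | embed embedderId | attribute | index
    }
    index: hnsw
}
</pre>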
<h3 id="combining-with-foreach">Combining with foreach</h3>
<p>The indexing expression can also use <code>for_each</code> and include other document fields.
For example, the <em>E5</em> family of embedding models uses instructions along with the input. The following
expression prefixes the input with <em>passage: </em> followed by a concatenation of the title and a text chunk.</p>
<pre>
schema doc {
document doc {
field title type string {
indexing: summary | index
}
field chunks type array<string> {
indexing: index | summary
}
}
field embedding type tensor<bfloat16>(p{}, x[384]) {
indexing {
input chunks |
for_each {
"passage: " . (input title || "") . " " . ( _ || "")
} | embed e5 | attribute | index
}
attribute {
distance-metric: prenormalized-angular
}
}
}
</pre>
<p>See <a href="../writing/indexing.html#execution-value-example">Indexing language execution value</a>for details.</p>
<h2 id="troubleshooting">Troubleshooting</h2>
<p>This section covers common issues and how to resolve them.</p>
<h3 id="model-download-failure">Model download failure</h3>
<p>
If models fail to download, the Vespa stateless container service will not start, failing with
<code>RuntimeException: Not able to create config builder for payload</code>;
see this <a href="../applications/components.html#component-load">example</a>.
</p>
<p>
Check the Vespa log for more details;
the most common reasons for download failure are network issues or incorrect URLs.
</p>
<p>This will also be visible in the Vespa status output as the container will not listen to its port:</p>
<pre>
vespa status -t http://127.0.0.1:8080
Container at http://127.0.0.1:8080 is not ready: unhealthy container at http://127.0.0.1:8080/status.html: Get "http://127.0.0.1:8080/status.html": EOF
Error: services not ready: http://127.0.0.1:8080
</pre>
<h3 id="tensor-shape-mismatch">Tensor shape mismatch</h3>
<p>
The native embedder implementations expect that the output tensor has a specific shape.
If the shape is incorrect, you will see an error message during feeding like:
</p>
<pre>
feed: got status 500 ({"pathId":"..","..","message":"[UNKNOWN(252001) @ tcp/vespa-container:19101/chain.indexing]:
Processing failed. Error message: java.lang.IllegalArgumentException: Expected 3 output dimensions for output name 'sentence_embedding': [batch, sequence, embedding], got 2 -- See Vespa log for details. "}) for put xx:not retryable
</pre>
<p>
This means that the exported ONNX model output tensor does not have the expected shape. For example, the message above is
logged by the <a href="#huggingface-embedder">hf-embedder</a>, which expects the output shape to be [batch, sequence, embedding] (a 3D tensor). This is because the embedder
implementation performs the <a href="../reference/rag/embedding.html#huggingface-embedder">pooling-strategy</a> over the sequence dimension to produce a single embedding vector.
The batch size is always 1 for Vespa embeddings.
</p>
<p>
See <a href="../ranking/onnx#using-optimum-to-export-models-to-onnx-format">onnx export</a> for how to export models to ONNX format with the correct output shapes and
<a href="../ranking/onnx#debugging-onnx-models">onnx debug</a> for debugging input and output names.
</p>
<h3 id="input-names">Input names</h3>
<p>The native embedder implementations expect that the ONNX model accepts certain input names.
If the names are incorrect, the Vespa container service will not start,
and you will see an error message in the Vespa log like:</p>
<pre>
WARNING container Container.com.yahoo.container.di.Container
Caused by: java.lang.IllegalArgumentException: Model does not contain required input: 'input_ids'. Model contains: my_input
</pre>
<p>This means that the ONNX model accepts "my_input", while our configuration attempted to use "input_ids". The default
input names for the <a href="#huggingface-embedder">hf-embedder</a> are "input_ids", "attention_mask" and "token_type_ids". These are overridable
in the configuration (<a href="../reference/rag/embedding.html#huggingface-embedder">reference</a>). Some embedding models do not
use the "token_type_ids" input. We can specify this in the configuration by setting <code>transformer-token-type-ids</code> to empty,
illustrated by the following example.</p>
<pre>
{% highlight xml %}
<component id="hf-embedder" type="hugging-face-embedder">
<transformer-model path="my-models/a-model-without-token-types-model.onnx"/>
<tokenizer-model path="my-models/tokenizer.json"/>
<transformer-token-type-ids/>
</component>
{% endhighlight %}</pre>
<h3 id="output-names">Output names</h3>
<p>The native embedder implementations expect that the ONNX model produces certain output names.
If the names are incorrect, the Vespa stateless container service will not start,
and you will see an error message in the Vespa log like:</p>
<pre>
Model does not contain required output: 'test'. Model contains: last_hidden_state
</pre>
<p>This means that the ONNX model produces "last_hidden_state", while our configuration attempted to use "test". The default
output name for the <a href="#huggingface-embedder">hf-embedder</a> is "last_hidden_state". This is overridable
in the configuration. See <a href="../reference/rag/embedding.html#huggingface-embedder">reference</a>.</p>
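<p>For example, a sketch overriding the output name for a hypothetical model that exposes an output named
<code>sentence_embedding</code> (see the reference above for the exact configuration element):</p>
<pre>{% highlight xml %}
<component id="hf-embedder" type="hugging-face-embedder">
    <transformer-model path="my-models/model.onnx"/>
    <tokenizer-model path="my-models/tokenizer.json"/>
    <transformer-output>sentence_embedding</transformer-output>
</component>
{% endhighlight %}</pre>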
<h3 id="EOF">EOF</h3>
<p>If vespa status shows that the container is healthy, but you observe an EOF error during feeding, this means that the stateless container service has
crashed and stopped listening to its port. This could be related to the embedder ONNX model size, docker container memory resource constraints,
or the configured JVM heap size of the Vespa stateless container service.
</p>
<pre>
vespa feed ext/1.json
feed: got error "Post "http://127.0.0.1:8080/document/v1/doc/doc/docid/1": unexpected EOF" (no body) for put id:doc:doc::1: giving up after 10 attempts
</pre>
<p>This could be related to insufficient stateless container (JVM) memory.
Check the container logs for OOM errors. See <a href="../performance/container-tuning.html#jvm-tuning">jvm-tuning</a> for JVM tuning options (the default heap size is 1.5 GB).
Container crashes could also be caused by too little memory allocated to the docker or podman container, which can cause the Linux kernel to kill processes to free memory.
See the <a href="../operations/self-managed/docker-containers.html#memory">docker containers memory</a> documentation.
</p>