---
# Copyright Vespa.ai. All rights reserved.
title: "Vespa Serving Scaling Guide"
---
<p><em>Vespa can scale in multiple scaling dimensions:</em></p>
<ul>
<li>Scale document volume and write volume</li>
<li>Scale query throughput</li>
<li>Scale serving latency to meet service level agreements (SLA)</li>
</ul>
<p>The question one tries to answer during a sizing exercise is:
<em>What would the total cost be to serve this use case using Vespa?</em></p>
<p>
This document helps you size an application correctly at the lowest possible cost.
Vespa is used to implement many use cases, and this document is relevant for all of them:
</p>
<ul>
<li>Serving a <a href="/en/ranking/nativerank.html">text ranking</a> use case
or a <a href="../learn/tutorials/news-1-deploy-an-application">recommendation</a> use case</li>
<li>Serving a machine learned model, e.g., an
<a href="../ranking/onnx">ONNX</a>, <a href="../ranking/xgboost">XGBoost</a>, or
<a href="../ranking/lightgbm">LightGBM</a> model</li>
</ul>
<p>
With Vespa, it is possible to do benchmarking on a few nodes
to infer the overall performance and cost of the chosen deployment architecture,
and as Vespa supports <a href="../content/elasticity.html">live resizing</a>,
it is easy to scale from a prototype to a full production size deployment.
</p><p>
This document covers sizing and capacity planning for serving,
see <a href="sizing-feeding.html">feed performance sizing</a> for feeding
and <a href="feature-tuning.html">Vespa serving feature tuning</a> for query feature tuning.
It also covers the following topics:
</p>
<ul>
<li><a href="#data-distribution">Data distribution</a> in Vespa and how it impacts serving</li>
<li><a href="#content-cluster-scalability-model">Scaling Serving Latency and Throughput</a> in Vespa</li>
<li><a href="#scaling-document-volume-per-node">Scaling Data Volume</a> in Vespa</li>
</ul>
<h2 id="data-distribution">Data distribution in Vespa - flat versus grouped</h2>
<p>
The basic element in the Vespa search architecture is a content node, which is part of a content cluster.
A Vespa deployment can have several content clusters, which can be scaled independently.
</p><p>
A content node holds a fraction of the entire data corpus.
Data is distributed to nodes using a <a href="../content/idealstate.html">distribution algorithm</a>
whose goal is to distribute data uniformly over the set of nodes.
The goal is also to avoid distribution skew,
while at the same time supporting re-distribution of data
with minimal data movement if the size of the content cluster changes.
Read <a href="../content/elasticity.html">content cluster elasticity</a> to learn
how data is distributed across nodes, and how adding or removing nodes works.
See also <a href="../content/consistency.html">Vespa's consistency model</a> documentation.
</p>
<h3 id="flat-content-distribution">Flat content distribution</h3>
<img src="/assets/img/flat-content-distribution.svg" width="942px" height="auto" alt="Flat content distribution"/>
<p>
With a flat distribution, the content is distributed to content nodes using the
<a href="../content/idealstate.html">ideal state distribution algorithm</a>.
A query is dispatched in parallel from a container instance to
<strong>all</strong> content nodes in the content cluster.
Each content node searches the <em>active</em> part of the <em>ready</em> sub-database.
The above figure illustrates a deployment using 4 nodes with <em>redundancy</em> 2 and <em>searchable-copies</em> 2 -
see the <a href="#high-data-availability">availability</a> section.
</p><p>
When using flat data distribution, the only way to scale query throughput is to reduce the search latency.
Given a fixed occupancy (users, load clients),
this relationship between query throughput and latency is described by
<a href="https://en.wikipedia.org/wiki/Little%27s_law">Little's law</a> -
more on this in <a href="#content-cluster-scalability-model">content cluster scalability model</a> section.
</p>
<h3 id="grouped-content-distribution">Grouped content distribution</h3>
<img src="/assets/img/grouped-content-distribution.svg" width="942px" height="auto" alt="Grouped content distribution"/>
<p>
With a grouped distribution, content is distributed to a configured set of <em>groups</em>,
such that the entire document collection is contained in each group.
A <em>group</em> contains a set of content nodes where the content is
distributed using the <a href="../content/idealstate.html">distribution algorithm</a>.
In the above illustration, there are 4 nodes in total, 2 groups with 2 nodes in each group.
<em>redundancy</em> is 2 and <em>searchable-copies</em> is also 2.
As the figure shows, with this grouped configuration
the content nodes only have a populated Ready sub-database.
A query is dispatched in parallel to all nodes in <strong>one group</strong> at a time
using a <a href="../reference/applications/services/content.html#dispatch-policy">dispatch-policy</a>.
The default policy is <em>adaptive</em>, which load balances over the set of groups, aiming at even latency.
</p>
<h3 id="high-data-availability">High Data Availability</h3>
<p>
Ideally, the data is available and searchable at all times, even during node failures.
High availability costs resources due to data replication.
How many replicas of the data to configure
depends on what kind of availability guarantees the deployment should provide.
Configure availability vs cost:
</p>
<table class="table">
<thead></thead><tbody>
<tr>
<th><a href="../reference/applications/services/content.html#redundancy">redundancy</a></th>
<td>
<p id="redundancy">
Defines the total number of copies of each piece of data the cluster will store and maintain to avoid data loss.
Example: with a redundancy of 2, the system tolerates 1 node failure
before any further node failures may cause data to become unavailable.
</p>
</td>
</tr><tr>
<th style="white-space: nowrap">
<a href="../reference/applications/services/content.html#searchable-copies">searchable-copies</a></th>
<td>
<p id="searchable-copies">
Configures how many of the copies (as configured with <em>redundancy</em>)
to be indexed (<em>ready</em>) at any time.
Configuring <em>searchable-copies</em> to be less than <em>redundancy</em> saves resources (memory, disk, cpu),
as not all copies are indexed (<em>ready</em>).
In case of node failure, the remaining nodes need to index the
<em>not ready</em> documents which belonged to the failed node.
During this transition period, search has reduced coverage.
</p>
</td>
</tr>
</tbody>
</table>
<h3 id="content-node-database">Content node database</h3>
<img src="/assets/img/proton-databases.svg" width="720px" height="auto" alt="Content node databases"/>
<p>
The above figure illustrates the three
<a href="../content/proton.html#sub-databases">sub-databases</a> inside a Vespa content node.
</p>
<ul>
<li>The documents in the <strong>Ready</strong> DB are indexed,
but only the documents in <b>Active</b> state are searchable.
In a flat distributed system there is only one active instance of the same document,
while with grouped distribution there is one active instance per group.</li>
<li>The documents in the <b>Not Ready</b> DB are stored but not indexed.</li>
<li>The documents in the <b>Removed</b> DB are stored but blocklisted, hidden from search.
The documents are permanently deleted from storage by
<a href="../content/proton.html#proton-maintenance-jobs">Proton maintenance jobs</a>.</li>
</ul>
<p>
If the availability guarantees tolerate temporary search coverage loss during node
failures (e.g. <em>searchable-copies</em>=1), this is by far the best option for serving performance -
the query work per node is lower, as index structures like posting lists are smaller.
The index structures only contain documents in <em>Active</em> state,
excluding <em>Not Active</em> documents.
</p><p>
With <em>searchable-copies</em>=2 and <em>redundancy</em>=2,
each replica is fully indexed on separate content nodes.
Only the documents in <em>Active</em> state are searchable,
the posting lists for a given term are (up to) doubled as compared to <em>searchable-copies</em>=1.
</p><p>
See <a href="sizing-examples.html">Content cluster Sizing example deployments</a>
for examples using grouped and flat data distribution.
</p>
<h2 id="life-of-a-query-in-vespa">Life of a query in Vespa</h2>
<p>
Find an overview in <a href="../querying/query-api.html#query-execution">query execution</a>:
</p>
<img src="/assets/img/query-to-response.svg" width="645px" height="auto"
alt="Query execution - from query to response"/>
<p>
Vespa executes a query in two protocol phases
(or more if using <a href="../querying/grouping.html">result grouping features</a>)
to optimize the network footprint of the parallel query execution.
The first protocol phase executes the query in parallel over content nodes in a group to find the global top hits,
the second protocol phase fetches the data of the global top hits.
</p><p>
During the first phase,
content nodes match and <a href="../basics/ranking.html">rank</a> documents using the selected rank-profile/model.
The hits are returned to the stateless container for merging
and potentially blending when multiple content clusters are involved.
</p><p>
When the global top ranking documents are found,
the second protocol phase fetches the summary data for the global best hits
(e.g. summary snippets, the original field contents, and ranking features).
By doing the query in two protocol
phases one avoids transferring summary data for hits which will not make it into the global best hits.
</p><p>
Components involved in query execution:
</p>
<ul>
<li><strong>Container</strong>
<ul>
<li>Parses the <a href="../querying/query-api.html">API</a> request
and the <a href="../querying/query-language.html">query</a> and run time context features.</li>
<li>Modifies the query according to the schema specification (stemming, etc.) for a text search application
or creating run time query or user context tensors for an ML serving application.</li>
<li>Invokes chains of custom <a href="../applications/components.html">container components/plugins</a>
which can work on the request and query input and also the results.</li>
<li>Dispatches the query to content nodes in the content cluster(s) for parallel execution.
With flat distribution, queries are dispatched to all content nodes,
while with a grouped distribution the query is dispatched to all content nodes within a group
and queries are load-balanced between the groups using a
<a href="../reference/applications/services/content.html#dispatch-policy">dispatch-policy</a>.</li>
<li>Blends the global top-ranking results from cluster(s).</li>
<li>Fetches the top-ranking results with document summaries from cluster(s).</li>
<li>Processes the results, possibly re-ranks the top-k, and finally renders the results back to the client.</li>
</ul>
</li>
<li><strong>Content node (Proton)</strong>
<ul>
<li>Finding all documents matching the <a href="../querying/query-api.html">query specification</a>.
For an ML serving use case, the selection might be a subset of the content pool
(e.g. limit the model to only be evaluated for content-type video documents),
while for a text ranking application it might be a
<a href="../ranking/wand.html">WAND</a> text matching query.</li>
<li>Calculating the score (which might be a text ranking relevancy score or
the inferred score of a Machine Learned model) of each hit,
using the chosen rank-profile. See <a href="../basics/ranking.html">ranking with Vespa</a>.</li>
<li>Aggregating information over all the generated hits using <a href="../querying/grouping.html">result grouping</a>.</li>
<li>Sorting hits on relevancy score (text ranking) or inference score (e.g. ML model serving),
or on attribute(s).
See <em>max-hits-per-partition</em> and <em>top-k-probability</em> in
<a href="../reference/applications/services/content.html#dispatch-tuning">dispatch tuning</a>
for how to tune how many hits to return.</li>
<li>Processing and returning the document summaries of the selected top hits
(during summary fetch phase after merging and blending has happened on levels above).</li>
</ul>
</li>
</ul>
<p>
The detailed execution inside Proton during the matching and ranking first protocol phase is:
</p>
<ol>
<li>Build up the query tree from the serialized network representation.</li>
<li>Lookup the query terms in the index and B-tree dictionaries
and estimate the number of hits each term and parts of the query tree will produce.
Terms which search attribute fields without <a href="../content/attributes.html#fast-search">fast-search</a>
are given a hit count estimate equal to the total number of documents.</li>
<li>Optimize and re-arrange the query tree for most efficient performance, trying to move terms or
operators with the lowest hit ratio estimate first in the query tree.</li>
<li>Prepare for query execution, by fetching posting lists from the index and B-tree structures.</li>
<li>Multithreaded execution per search starts using the above information.
Each thread will do its own thread local setup.</li>
<li>Each search thread will evaluate the query over its document space.</li>
<li>The search threads complete first phase and agree which hits will continue to second phase ranking
(if enabled per the used rank-profile).
The threads operate over a shared heap with the global top ranking hits.</li>
<li>Each thread then completes second-phase ranking and grouping/aggregation/sorting.</li>
<li>The results from all threads are merged and returned to the container.</li>
</ol>
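<p>
The two protocol phases can be illustrated with a toy sketch.
This is not Vespa code - the partitioning, function names, and data are all made up for illustration -
but it shows why only (score, id) pairs travel in the first phase,
while summary data is fetched only for the global winners:
</p>

```python
import heapq

# Toy corpus partitioned across two "content nodes": (doc_id, score, summary).
# In Vespa the scores come from the rank-profile; here they are hardcoded.
NODE_PARTITIONS = [
    [(1, 0.9, "doc-1 summary"), (2, 0.3, "doc-2 summary")],
    [(3, 0.7, "doc-3 summary"), (4, 0.8, "doc-4 summary")],
]

def first_phase(partition, k):
    # Each node matches and ranks its local documents,
    # returning only (score, doc_id) for its local top-k hits.
    return heapq.nlargest(k, ((score, doc_id) for doc_id, score, _ in partition))

def second_phase(partition, doc_ids):
    # Summary data is fetched only for the global top hits.
    return {d: s for d, _, s in partition if d in doc_ids}

def query(k=2):
    # Phase one: dispatch to all nodes (in parallel in Vespa, sequential here),
    # then merge the per-node top-k lists into the global top-k.
    merged = heapq.merge(*(first_phase(p, k) for p in NODE_PARTITIONS), reverse=True)
    top = [doc_id for _, doc_id in list(merged)[:k]]
    # Phase two: fill summaries for the winners only - summary bytes for
    # documents that did not make the global top-k never cross the network.
    summaries = {}
    for p in NODE_PARTITIONS:
        summaries.update(second_phase(p, set(top)))
    return top, summaries
```

<p>
Running <code>query()</code> on this toy corpus returns documents 1 and 4 with their summaries;
document 3 scored well locally on its node but is never summary-filled because it lost the global merge.
</p>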
<p>
<a href="../jdisc/">Container</a> clusters are stateless and easy to scale horizontally,
and don't require any data distribution during re-sizing.
The set of stateful content nodes can be scaled independently
and <a href="../content/elasticity.html">re-sized</a> which requires re-distribution of data.
Re-distribution of data in Vespa is supported and designed to be done without significant serving impact.
Altering the number of nodes or groups in a Vespa content cluster does not require re-feeding of the corpus,
so it's easy to start out with a sample prototype and scale it to production scale workloads.
</p>
<h2 id="content-cluster-scalability-model">Content cluster scalability model</h2>
<p>
Vespa is a parallel computing platform
where the work of matching and ranking is parallelized across a set of nodes and processors.
The speedup one can get by altering the number of nodes in a Vespa content group follows
<a href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl's law</a>,
which is a formula used to find the maximum improvement possible by improving a particular part of a system.
In parallel computing, <em>Amdahl's law</em> is mainly used to predict the theoretical
maximum speedup for program processing using multiple processors.
In Vespa, as in any parallel computing system,
there is work which can be parallelized and work which cannot.
The relationship between these two work types determine how to best scale the system,
using a flat or grouped distribution.
</p>
<table class="table">
<tr>
<td><strong>static query work</strong></td>
<td>
<p id="static-query-work">
Portion of the query work on a content node that does not depend on the
number of documents indexed on the node.
This is an administrative overhead caused by system design and abstractions,
e.g. number of memory allocations per query term.
Typically, a large query tree means higher static work,
and this work cannot be parallelized over multiple processors, threads or nodes.
The static query work portion is described in steps 1 to 4 and step 9 in the detailed life of a query explanation above.
</p>
</td>
</tr><tr>
<td><strong>dynamic query work</strong></td>
<td>
<p id="dynamic-query-work">
Portion of the query work on a content node that depends on the number
of documents indexed and active on the node.
This portion of the work scales mostly linearly with the number of matched documents.
The dynamic query work can be parallelized over multiple processors and nodes.
Referenced later as <em>DQW</em>.
<em>DQW</em> also includes the second-phase protocol summary fill,
where the actual contents of the global best documents are fetched from the content nodes
which produced the hits in the first protocol phase.
</p>
</td>
</tr><tr>
<td><strong>Total query work</strong></td>
<td>
<p id="total-query-work">
The total query work is given as the dynamic query work (<em>DQW</em>) + static query work (<em>SQW</em>).
</p>
</td>
</tr>
</table>
<p>
Adding content nodes to a content cluster (keeping the total document volume fixed) with flat distribution
reduces the dynamic query work per node (<em>DQW</em>),
but does not reduce the static query work (<em>SQW</em>).
The overall system cost also increases as you need to rent another node.
</p><p>
Since <em>DQW</em> scales almost linearly with the number of documents on the content nodes,
one can distribute the work over more nodes.
<em>Amdahl's law</em> specifies that the maximum speedup one can achieve by parallelizing the
dynamic work (<em>DQW</em>) is given by the formula:
</p><p class="equation-container"><!-- depends on mathjax -->
$$\text{max_speedup}_{\text{group}} = \frac{1}{1 - \frac{DQW}{SQW+DQW}}$$
</p><p>
For example, if one sees by inspecting <a href="#metrics-for-vespa-sizing">metrics</a> that <em>DQW</em> = 0.50 and <em>SQW</em> = 0.50,
the maximum speedup one can get by using more nodes to increase parallelism and decrease <em>DQW</em> is 2.
With fixed occupancy (number of users, clients or load),
<a href="https://en.wikipedia.org/wiki/Little%27s_law">Little's law</a>
tells us that one could achieve twice the throughput if one is able to speed up the latency by a factor of two:
</p><p class="equation-container"><!-- depends on mathjax -->
$$\frac{1}{1 - \frac{0.5}{0.5+0.5}} = 2$$
</p><p>
When <em>SQW</em> is no longer significantly less than <em>DQW</em>,
adding more nodes in a flat distributed cluster just increases the overall system cost
without any serving performance gain,
except for increasing the overall supported feed throughput,
which increases almost linearly with the number of nodes.
</p><p>
Two different <em>DQW/(DQW+SQW)</em> factors are illustrated in the figures below.
The overall query work <em>TQW</em> is the same for both cases (10 ms),
but the <em>DQW</em> portion of the <em>TQW</em> is different.
The throughput (QPS) is a function of the latency (<a href="https://en.wikipedia.org/wiki/Little%27s_law">Little's Law</a>)
and the number of cpu cores * nodes.
Using 1 node with 24 v-cpu cores and 10 ms service time (<em>TQW</em>),
one can expect reaching close to 2400 QPS at 100% utilization
(unless there are other bottlenecks like network or stateless container processing).
</p>
<figure>
<img src="/assets/img/ScalingLatencyFactor0.5.svg" alt="Scaling throughput/latency where DQW/(SQW+DQW)=0.5"/>
<img src="/assets/img/ScalingLatencyFactor0.005.svg" alt="Scaling throughput/latency where DQW/(SQW+DQW)=0.9"/>
</figure>
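<p>
The numbers in the figures can be reproduced directly from the two formulas.
A minimal sketch, using the 24-core, 10 ms example from the text
(the function names are illustrative, not a Vespa API):
</p>

```python
def max_speedup(dqw, sqw):
    """Amdahl's law: upper bound on speedup when only the
    dynamic query work (dqw) can be parallelized over nodes/threads."""
    parallel_fraction = dqw / (dqw + sqw)
    return 1.0 / (1.0 - parallel_fraction)

def max_qps(cores, latency_s):
    """Little's law at full utilization: concurrency / service time."""
    return cores / latency_s

# DQW = SQW = 5 ms: at most a 2x speedup, no matter how many nodes are added.
print(max_speedup(0.005, 0.005))  # -> 2.0

# 24 v-cpu cores at 10 ms service time: ~2400 QPS at 100% utilization.
print(max_qps(24, 0.010))  # -> 2400.0
```

<p>
Halving the latency to 5 ms (the best case for the first figure) doubles the ceiling to ~4800 QPS,
while a <em>DQW</em> fraction of 0.9 (second figure) bounds the speedup at 10x.
</p>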
<p>
In the first figure the overall latency is 10 ms,
but the dynamic query work accounts for only 50% of it,
and given <em>Amdahl's law</em> it follows that the maximum speedup one can get is two.
This is true regardless of how many processors or nodes the dynamic query work is parallelized over.
No matter how many nodes one adds, one does not get above 4800 queries/s;
the only thing one achieves by adding more nodes
is increasing the cost without <em>any</em> performance benefit.
</p><p>
In the second figure there is a system where the dynamic work portion is much higher (0.9),
and the theoretical maximum speedup becomes bounded at 10x as given by <em>Amdahl's law</em>.
Note that both figures are with a single flat distributed content cluster with a fixed document volume.
</p><p>
Given the theory, one can derive two rules of thumb for scaling throughput and latency:</p>
<table class="table">
<thead></thead><tbody>
<tr>
<th>Add nodes in a flat distribution</th>
<td>When DQW/TQW is large (close to 1.0),
throughput QPS can be scaled by just adding more content nodes in a system using flat distribution.
This will reduce the number of documents per node,
and thus reduce the <em>DQW</em> per node.</td>
</tr><tr>
<th style="white-space:nowrap;">Add groups using grouped distribution</th>
<td>When DQW/TQW is low, one can no longer just add more content nodes to scale throughput
and must instead use a grouped distribution to scale throughput.</td>
</tr>
</tbody>
</table>
<h2 id="scaling-latency-in-a-content-group">Scaling latency in a content group</h2>
<p>
Irrespective of using single group (flat distribution) or multiple groups,
the serving latency depends on the factors already described; <em>DQW</em> and <em>SQW</em>.
For use cases where <em>DQW</em> dominates the total query work <em>TQW</em>,
one can effectively scale latency down by parallelizing the <em>DQW</em> over more nodes per group.
</p><p>
It is important to decide on a latency service level agreement (SLA)
before sizing the Vespa deployment for the application and query features.
A latency SLA is often specified as a latency percentile at a certain throughput level - example:</p>
<ul>
<li>SLA Example 1: 95.0 percentile < 100 ms @ 2000 QPS</li>
<li>SLA Example 2: 95.0 percentile < 40 ms @ 8000 QPS</li>
</ul>
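<p>
Whether a benchmark run meets such an SLA can be checked directly from the latency samples.
A minimal sketch using the nearest-rank percentile definition
(the sample data is made up; in practice use the samples collected at the target QPS):
</p>

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples_ms)
    rank = math.ceil(pct / 100.0 * len(ordered))
    return ordered[rank - 1]

def meets_sla(samples_ms, pct, limit_ms):
    return percentile(samples_ms, pct) < limit_ms

# 20 made-up latency samples (ms) from a benchmark run at the target load:
samples = [12, 15, 11, 30, 25, 14, 13, 90, 18, 22,
           16, 19, 21, 28, 17, 33, 26, 24, 20, 27]

# SLA example 1: 95th percentile < 100 ms
print(percentile(samples, 95.0))      # -> 33
print(meets_sla(samples, 95.0, 100))  # -> True
```

<p>
Note how the single 90 ms outlier does not move the 95th percentile here,
but would dominate the 99th percentile; choose the SLA percentile accordingly.
</p>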
<p>
Different use cases might have different performance characteristics,
depending on how large the dynamic query work portion is compared to the static query work portion.
This graph illustrates the relationship between overall latency versus number of documents indexed per node for two
different use cases.
</p>
<figure>
<img src="/assets/img/latency-documents.svg" alt="Latency vs document count per node"/>
</figure>
<ul>
<li>
<p>
For the yellow use case the measured latency is almost independent of the total document volume.
This is called sublinear latency scaling, which calls for scaling up using better flavor
specification instead of scaling out.
</p>
<p>
The observed latency at 10M documents per node is almost the same as with 1M documents per node.
Such a use case would be most cost-effective by storing as many documents as possible
(within the memory/disk/feeding constraints set by the concurrency settings and node flavor)
and scaling throughput by using a grouped distribution.
Efficient sublinear query operators have scaling properties like the yellow case;
examples include the <a href="../ranking/wand.html">wand operators</a> and the
<a href="../querying/approximate-nn-hnsw">approximate nearest neighbor search operator</a>.
</p>
</li>
<li>
<p>
For the blue use case the measured latency shows a clear correlation with the document volume.
This is a case where the dynamic query work portion is high,
and adding nodes to the flat group will reduce the serving latency.
The sweet spot is found where targeted latency SLA is achieved.
This sweet spot depends on which model or ranking features are in use,
e.g. how expensive the model is per retrieved or ranked document.
</p>
<p>
For example, a <a href="../xgboost">GBDT xgboost model</a> with 3000 trees
might breach the targeted latency SLA already at 200K documents,
while a 300 tree model might be below the SLA at 2M documents.
Using exact <a href="../querying/nearest-neighbor-search">nearest neighbor search</a>
has scaling properties like the blue case.
See also <a href="feature-tuning.html">feature tuning</a>.
</p>
</li>
</ul>
<h3 id="reduce-latency-with-multi-threaded-per-search-execution">
Reduce latency with multithreaded per-search execution</h3>
<p>
It is possible to reduce latency of queries
where the <a href="#dynamic-query-work">dynamic query work</a> portion is high.
Using multiple threads per search for a use case where the static query work is high
will be as wasteful as adding nodes to a flat distribution.
</p>
<figure>
<img src="/assets/img/Threads-per-search.svg" alt="Content node search latency vs threads per search"/>
</figure>
<p>
Using more threads per search will reduce latency as long as the dynamic portion of the query cost is high
compared to the static query cost. The reduction in latency comes with the cost of higher cpu utilization.
</p>
<p>
A search request with four threads will occupy all four threads until the last thread has completed,
and the intra-node per thread document space partitioning must be balanced to give optimal results.
</p>
<p>
For rank profiles with second phase ranking, see <a href="../ranking/phased-ranking.html">phased ranking</a>,
the hits from first-phase ranking are rebalanced so that each matching thread scores about
the same amount of hits using the second phase ranking expression.
</p>
<p>
From the above examples with the blue and yellow use cases, it follows that
</p>
<ul>
<li>Linear exact nearest neighbor search latency can be reduced by using more threads per search</li>
<li>Sublinear approximate nearest neighbor search latency does not benefit from using more threads per search</li>
</ul>
<p>
By default the number of threads per search is one,
as that gives the best resource usage measured as CPU resources used per query.
The optimal threads per search depends on the query use case,
and should be evaluated by benchmarking.
</p><p>
The global threads per search setting is tuned by
<a href="../reference/applications/services/content.html#requestthreads-persearch">persearch</a>.
This can be overridden with a lower value in
<a href="../reference/schemas/schemas.html#num-threads-per-search">rank profiles</a>,
so that different query use cases can use a different number of threads per search.
Using multiple threads per search allows better utilization of
multicore cpu architectures for low query volume applications.
The <code>persearch</code> number in services should always be equal to the highest
<code>num-threads-per-search</code> in your rank profiles.
Setting it higher reduces the maximum number of concurrent queries without any latency benefit.
</p>
<h4 id="thread-configuration">Thread configuration</h4>
<p>
The <code>search</code> and <code>persearch</code> settings together determine
the maximum number of queries that can execute concurrently on a content node:
</p>
<pre>
max concurrent queries = search / persearch
</pre>
<p>
For example, with <code>search=64</code> (the default) and <code>persearch=8</code>,
only 8 queries can execute simultaneously.
Queries arriving when all slots are busy are queued in the match engine,
adding latency without any backpressure to the caller.
This can significantly reduce throughput even when CPU cores are available.
</p>
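<p>
The slot arithmetic is simple but worth making explicit; a minimal sketch
using the values from the example above (the function name is illustrative):
</p>

```python
def max_concurrent_queries(search_threads, persearch):
    """Each query occupies `persearch` threads from the pool of
    `search_threads` match threads, so at most this many queries
    can execute at once; the rest queue in the match engine."""
    return search_threads // persearch

# Default search=64 with persearch=8: only 8 queries run simultaneously.
print(max_concurrent_queries(64, 8))   # -> 8

# Increasing search to 128 restores 16 concurrent query slots.
print(max_concurrent_queries(128, 8))  # -> 16
```

<p>
With the default <code>persearch=1</code>, the same <code>search=64</code> pool
would allow 64 concurrent queries, which is why raising <code>persearch</code>
trades throughput headroom for latency.
</p>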
<p>
To get started, set <code>search</code> to 2x the number of CPU cores.
<code>persearch</code> should never exceed the number of cores.
<code>summary</code> should equal the number of cores.
Start with the default value of 1 for <code>persearch</code>, and only increase it to lower query latency,
as doing so reduces throughput and efficiency.
</p>
<p>
If you increase <code>persearch</code>, consider increasing <code>search</code> proportionally
to maintain enough concurrent query slots.
For example, if you need <code>persearch=8</code> on a 16-core machine,
setting <code>search=128</code> gives 16 concurrent query slots,
enough that queries using fewer threads (via per-rank-profile
<a href="../reference/schemas/schemas.html#num-threads-per-search">num-threads-per-search</a>
overrides) are not starved of executor slots.
</p>
<p>
Monitor <code>content.proton.executor.match.utilization</code> to detect when
the match engine is saturated: a sustained value at or near 1.0 indicates
all executor slots are busy and queries are queuing.
A high <code>content.proton.executor.match.queuesize.max</code> relative to
<code>search / persearch</code> confirms the bottleneck.
</p>
<h2 id="scaling-retrieval-size">Scaling the size of the retrievable unit</h2>
<p>
Retrieving units that are too small or too large can have a drastic impact on both the quality and
performance of your search. Consider an example where we want to search in PDF files: Creating one
document per PDF file seems like a logical solution, and with Vespa this is possible - up to a point.
But as system architects we must consider the potential edge cases: some files may be entire books,
or long reports with many hundreds or even thousands of pages.
<p>The current maximum document size is limited by the maximum protobuf message size of 2 GB, but we advise
staying well below this limit: at most 200 MB, and ideally below 1 MB, for even, predictable performance.
For a sense of scale, the complete text of the Bible is about 4 MB.
<p><em>Split too-large documents into smaller units for better search quality and performance!</em>
Natural subdivisions like chapters, parts or sections are good candidates for splitting into separate
Vespa documents.
<h3 id="when-documents-are-too-large">When documents are too large</h3>
If each document is a complete PDF file and some are very large, what problems could we run into?
<p>
<b>Usefulness of the result</b> - knowing that there are relevant parts <em>somewhere</em> in
several hundred pages of text is not very helpful to the user. As of 2025, this still holds
when the "user" is an LLM in a RAG or agentic workflow.
<p>
We can improve the usefulness by providing <a href="/en/querying/document-summaries.html#dynamic-snippets">dynamic snippets</a>
or returning per-chunk similarity scores as <a href="/en/ranking/ranking-expressions-features.html#accessing-feature-function-values-in-results">feature values</a>
to be able to identify the most relevant portions of the returned summary.
<p>
<b>Performance</b> - Returning large document values in the query response over HTTP has a significant
cost, both in CPU time spent rendering and compressing the response, and in network transfer time. This
can easily become the largest contribution to the total end-to-end latency.
<p>
Large documents can also contribute to poor performance in indexing and query execution, or greatly
increase the amount of temporary memory required for complex ranking expressions like multi-dimensional ColBERT MaxSim.
As documents are processed, indexed, stored and ranked as individual units, working on a few very large documents
at a time may not give the system enough opportunity to parallelize, resulting in poor, uneven utilization
of resources. Even a small fraction of very large documents may impact your mean (and especially higher-percentile)
latencies, both for processing and query execution.
<h3 id="too-small-documents">When documents are too small</h3>
What problems can occur if documents are very small? Consider indexing small fragments of text, like a single sentence or even a word.
<p>
<b>Granularity</b> - As the size of text fragments decreases, we are less likely to find good matches for queries, as the relevant terms
or context are spread across multiple documents. The response may not contain enough information to resolve the user's information need,
or even to judge whether the source is likely to resolve it if examined in full. This problem is described in
<a href="https://nlp.stanford.edu/IR-book/html/htmledition/choosing-a-document-unit-1.html">traditional information retrieval literature</a>,
and has also been a popular topic in recent years as "chunking" for semantic search.
<p>
<b>Overhead</b> - Splitting a document into very small pieces means that more resources are spent on per-document overhead.
Shared metadata, e.g. the abstract or access permissions of a document, is replicated many times, and updates/deletes to the
source document or its metadata must fan out to all the sub-documents, increasing write load. Unlike too-large documents, having a
fraction of very small documents is fine; what matters for efficiency is that the average size is not too small.
<h2 id="scaling-document-volume-per-node">Scaling document volume per node</h2>
<p>
One wants to fit as many documents as possible into a node, given the node constraints
(e.g. available CPU, memory, disk), while maintaining:
<ul>
<li>The targeted search latency SLA</li>
<li>The targeted feed and update rate, and feed latency</li>
</ul>
<p>
With the latency SLA in mind, benchmark with increasing number of documents per node
and watch system level metrics and Vespa metrics.
If latency is within the stated latency SLA and the system meets the targeted sustained feed rate,
overall cost is reduced by fitting more documents into each node
(e.g. by increasing memory, cpu and disk constraints set by the node flavor).
</p><p>
With the larger fan-out of using more nodes to partition the data also comes higher tail latency,
as the search waits for results from all nodes, so the overall execution time depends on
the slowest node at the time of the query. In such cases with large fan-out, using
<a href="../reference/applications/services/content.html#coverage">adaptive timeout</a> is recommended
to keep tail latency in check.
</p><p>
Vespa will block feed operations
if <a href="../reference/applications/services/content.html#resource-limits">resource limits</a> have been reached.
</p>
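<p>
The adaptive timeout referred to above is configured with the <code>coverage</code>
element of the content cluster in services.xml; a sketch with illustrative values
(the cluster id is hypothetical, and the values are not recommendations):
</p>
<pre>
&lt;content id="search" version="1.0"&gt;
  &lt;search&gt;
    &lt;coverage&gt;
      &lt;minimum&gt;0.95&lt;/minimum&gt;
      &lt;min-wait-after-coverage-factor&gt;0.2&lt;/min-wait-after-coverage-factor&gt;
      &lt;max-wait-after-coverage-factor&gt;0.3&lt;/max-wait-after-coverage-factor&gt;
    &lt;/coverage&gt;
  &lt;/search&gt;
  ...
&lt;/content&gt;
</pre>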
<h3 id="disk-usage-sizing">Disk usage sizing</h3>
<p>
Disk usage of a content node increases as the document volume increases:
The disk usage per document depends on various factors like the number of schemas,
the number of indexed fields and their settings,
and most important the size of the fields that are indexed and stored.
The simplest way to determine the disk usage is to index documents
and watch the disk usage along with the relevant metrics.
The <em>redundancy</em> (number of copies) naturally impacts the disk usage footprint.
</p><p>
Note that <a href="../content/proton.html#proton-maintenance-jobs">content node maintenance jobs</a>
temporarily increase disk usage.
<em>Index fusion</em> is an example: new index files are written while the job runs,
increasing used disk space until it completes.
The space used depends on configuration and data - disk headroom must include this temporary usage.
See <a href="#metrics-for-vespa-sizing">metrics for capacity planning</a>.
</p>
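<p>
A back-of-the-envelope sketch of the disk budget (all factors are assumptions
to be replaced with measured values, and the headroom factor for maintenance jobs
is illustrative, not a recommendation):
</p>
<pre>
def disk_needed_bytes(num_documents: int, bytes_per_document: float,
                      redundancy: int, headroom_factor: float = 1.4) -> float:
    """Estimate cluster disk need: measured per-document footprint times
    document count and redundancy, with headroom for temporary usage by
    maintenance jobs such as index fusion."""
    return num_documents * bytes_per_document * redundancy * headroom_factor
</pre>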
<h3 id="memory-usage-sizing">Memory usage sizing</h3>
<p>
The memory usage on a content node increases as the document volume increases,
almost linearly with the number of documents.
The vespa-proton-bin process (content node) uses the full 64-bit virtual address space,
so the virtual memory usage reported might be high,
as both index and summary files are mapped into memory using mmap
and pages are paged into memory as needed.
</p><p>
The memory usage per document depends on the number of fields,
the raw size of the documents and how many of the fields are defined as
<a href="../content/attributes.html">attributes</a>.
Also see <a href="#metrics-for-vespa-sizing">metrics for capacity planning</a>.
</p>
<h2 id="scaling-throughput">Scaling Throughput</h2>
<p>
As seen in the previous sections,
when the static query work (<em>SQW</em>) becomes large,
scale throughput using grouped distribution.
Regardless of whether throughput is scaled using a grouped distribution
(for a high static query work portion)
or a flat distribution (for a high dynamic query work portion),
one should identify how much throughput the total system supports.
</p><p>
Finding the throughput at which latency starts climbing sharply
is important in order to make sure that the deployed system is scaled well below this threshold,
that it has capacity to absorb load increases over time,
and that it has sufficient capacity to sustain node outages during peak traffic.
</p><p>
At some throughput level, some resource in the system will be fully saturated,
and requests will queue up, causing latency to spike,
as requests spend more time waiting for the saturated resource.
</p><p>
This behaviour is illustrated in the figure below:
</p>
<figure>
<img src="/assets/img/QPS-scaling.svg" alt="Latency vs throughput"/>
</figure>
<p>
A small increase in serving latency is observed as throughput increases,
until saturated at approximately 2200 QPS.
Pushing more queries than this only increases queueing time,
and latency increases sharply.
</p>
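<p>
Why latency spikes near saturation can be illustrated with a simple open queueing model.
The M/M/1-style sketch below is not a model of Vespa itself; it only shows how waiting time
grows without bound as utilization approaches 1.0 (all numbers are made up for illustration):
</p>
<pre>
def expected_latency_ms(qps: float, service_time_ms: float, servers: int) -> float:
    """Approximate latency with an M/M/1-style model per server, assuming
    arrivals are spread evenly over identical servers."""
    per_server_qps = qps / servers
    utilization = per_server_qps * (service_time_ms / 1000.0)
    if utilization >= 1.0:
        return float("inf")  # saturated: the queue grows without bound
    return service_time_ms / (1.0 - utilization)

# Latency stays near the 40 ms service time at low load, then climbs
# sharply as throughput approaches the capacity of 96 * 25 = 2400 QPS:
for qps in (500, 1000, 2000, 2200, 2350):
    print(qps, round(expected_latency_ms(qps, service_time_ms=40, servers=96), 1))
</pre>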
<h2 id="scaling-for-failures-and-headroom">Scaling for failures and headroom</h2>
<p>
It is important to also measure behaviour under non-ideal circumstances,
to avoid overly optimistic results -
e.g., by simulating node failures or node replacements,
and by verifying feeding concurrency versus search and serving.
See <a href="sizing-examples.html#serving-availability-tuning">Serving availability tuning</a>.
</p><p>
Generally, the higher utilization a system has in production,
the more fragile it becomes when changing query patterns or ranking models.
</p><p>
The target system utilization should be kept sufficiently low for the response times to be reasonable and within latency SLA,
even with some extra traffic occurring at peak hours.
See also <a href="graceful-degradation.html">graceful degradation</a>.
Also see <a href="sizing-feeding.html#feed-vs-search">sizing write versus read</a>.
</p>
<h2 id="metrics-for-vespa-sizing">Metrics for Vespa Sizing</h2>
<p>
The relevant <a href="../operations/metrics.html">Vespa Metrics</a> for measuring the cost factors,
in addition to system-level metrics like CPU utilization, are:
</p>
<em>Metric capturing static query work (SQW) at content nodes </em>
<pre>
content.proton.documentdb.matching.rank_profile.query_setup_time
</pre>
<em>Metric capturing dynamic query work (DQW) at content nodes</em>
<pre>
content.proton.documentdb.matching.rank_profile.query_latency
</pre>
<p>
By sampling these metrics,
one can calculate the theoretical speedup we can achieve by increasing number of nodes using flat distribution,
by using <em>Amdahl's law</em>:
</p><p class="equation-container"><!-- depends on mathjax -->
$$\text{max\_speedup} = \frac{1}{1 - \frac{match\_time}{query\_setup\_time + match\_time}}$$
</p><p>
In addition, the following metrics are used to find the number of matches per query, per node:
<pre>
content.proton.documentdb.matching.rank_profile.docs_matched
content.proton.documentdb.matching.rank_profile.queries
</pre>
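<p>
As a sketch, sampled values of these metrics can be plugged into the formula above
(the sample values below are made up for illustration):
</p>
<pre>
def max_speedup(query_setup_time: float, match_time: float) -> float:
    """Amdahl's law upper bound on speedup from flat scaling: only the
    dynamic query work (match_time) shrinks when adding nodes, while the
    static query work (query_setup_time) is repeated for every query."""
    parallel_fraction = match_time / (query_setup_time + match_time)
    return 1.0 / (1.0 - parallel_fraction)

def matches_per_query(docs_matched: float, queries: float) -> float:
    """Average number of matched documents per query on a node."""
    return docs_matched / queries

# Example: 2 ms query setup and 18 ms matching gives a parallel
# fraction of 0.9, so adding nodes can give at most ~10x speedup:
print(round(max_speedup(query_setup_time=2.0, match_time=18.0), 3))
</pre>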
<strong>Disk usage</strong>:
<ul>
<li>documentdb: <em>vespa.content.proton.documentdb.disk_usage.last</em></li>
<li>transaction log: <em>vespa.content.proton.transactionlog.disk_usage.last</em></li>
</ul>
<strong>Memory usage</strong>:
<ul>
<li>documentdb: <em>vespa.content.proton.documentdb.memory_usage.allocated_bytes.last</em></li>
</ul>