documentation/en/content/elasticity.html at 3b3f076a7e736c181fa7413e8d2de29ac8511403 · vespa-engine/documentation · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
---
# Copyright Vespa.ai. All rights reserved.
title: "Content cluster elasticity"
redirect_from:
- /en/elastic-vespa.html
- /en/elasticity.html
---

<p>
  Vespa clusters can be grown and shrunk while serving queries and writes.
  Documents in content clusters are automatically redistributed on changes to maintain an even distribution
  with minimal data movement.
  To resize, just change the <a href="../reference/applications/services/services.html#nodes">nodes</a>
  and redeploy the application - no restarts needed.
</p>
<img src="/assets/img/elastic-grow.svg" style="max-width: 300px;"
     alt="A cluster growing in two dimensions" />

<p>
  Documents are managed by Vespa in chunks called <a href="#buckets">buckets</a>.
  The size and number of buckets are completely managed by Vespa and there is never any
  need to manually control sharding.
</p>

<p>
  The elasticity mechanism is also used to recover from a node loss:
  New replicas of documents are created automatically on other nodes to maintain
  the configured redundancy.
  Failed nodes is therefore not a problem that requires immediate attention -
  clusters will self-heal from node failures as long as there are sufficient resources.

</p>
<img src="/assets/img/elastic-fail.svg" style="max-width: 300px;"
     alt="A cluster with a node failure" />

<p>
  When you want to remove nodes from a content cluster, you can have the system migrate data
  off them in an orderly fashion prior to removal. This is done by marking nodes as <em>retired</em>.
  This is useful to remove nodes that should be retired, but also to migrate a cluster to entirely new nodes
  while online: Add the new nodes, mark the old nodes retired, wait for the data to be redistributed
  and remove the old nodes.
</p>
<p>
  The auto-elasticity is configured for a normal fail-safe operation,
  but there are tradeoffs like recovery speed and resource usage.
  Learn more in <a href="../operations/self-managed/admin-procedures.html#content-cluster-configuration">procedures</a>.
</p>


<h2 id="adding-nodes">Adding nodes</h2>

<p>
  To add or remove nodes from a content cluster, just <code>nodes</code>
  tag of the <a href="../reference/applications/services/content.html">content</a> cluster in
  <a href="../reference/applications/services/services.html">services.xml</a>
  and <a href="../basics/applications.html#deploying-applications">redeploy</a>.
  Read more in <a href="../operations/self-managed/admin-procedures.html">procedures</a>.
</p>

<p>
  When adding a new node, a new <em>ideal state</em> is calculated for all buckets.
  The buckets mapped to the new node are moved, the superfluous are removed.
  See redistribution example - add a new node to the system, with redundancy n=2:
</p>
<img src="/assets/img/add-node-move-buckets.svg" width="690px" height="auto"
     alt="Bucket migration as a node is added to the cluster"/>
<p>
  The distribution algorithm generates a random node sequence for each bucket.
  In this example with n=2, replicas map to the two nodes sorted first.
  The illustration shows how placement onto two nodes changes as a third node is added.
  The new node takes over as primary for the buckets where it got sorted first,
  and as secondary for the buckets where it got sorted second.
  This ensures minimal data movement when nodes come and go, and allows capacity to be changed easily.
</p><p>
  No buckets are moved between the existing nodes when a new node is added.
  Based on the pseudo-random sequences, some buckets change from primary to secondary, or are removed.
  Multiple nodes can be added in the same deployment.
</p>


<h2 id="removing-nodes">Removing nodes</h2>

<p>
  Whether a node fails or is <em>retired</em>, the same redistribution happens.
  If the node is retired, replicas are generated on the other nodes
  and the node stays up, but with no active replicas.
  Example of redistribution after node failure, n=2:
</p>
<img src="/assets/img/lose-node-move-buckets.svg" width="690px" height="auto"
     alt="Bucket migration as a node is removed from the cluster"/>
<p>
  Here, node 2 fails. This node held the active replicas of bucket 2 and 6.
  Once the node fails the secondary replicas are set active.
  If they were already in a <em>ready</em> state, they start serving queries immediately,
  otherwise they will index replicas,
  see <a href="../reference/applications/services/content.html#searchable-copies">searchable-copies</a>.
  All buckets that no longer have secondary replicas
  are merged to the remaining nodes according to the ideal state.
</p>


<h2 id="grouped-distribution">Grouped distribution</h2>

<p>
  Nodes in content clusters can be placed in <a href="../reference/applications/services/content.html#group">groups</a>.
  A group of nodes in a content cluster will have one or more complete replicas of the entire document corpus.
</p>
<img src="/assets/img/query-groups.svg" style="max-width: 400px;"
     alt="A cluster changes from using one to many groups" />

<p>This is useful in the cases listed below:</p>

<table class="table">
  <thead></thead><tbody>
<tr>
  <th>Cluster upgrade</th>
  <td>
    With multiple groups it becomes safe to take out a full group for upgrade instead of just one node at a time.
    <a href="../operations/self-managed/live-upgrade.html">Read more</a>.
  </td>
</tr><tr>
  <th style="white-space: nowrap">Query throughput</th>
  <td>
    Applications with high query rates and/or high static query cost can use groups to scale to higher query rates
    since Vespa will automatically send a query to just a single group.
    <a href="../performance/sizing-search.html">Read more</a>.
  </td>
</tr><tr>
  <th>Topology</th>
  <td>
    By using groups you can control replica placement over network switches or racks to
    ensure there is redundancy at the switch and rack level.
  </td>
</tr>
</tbody>
</table>
<p>
  Tuning group sizes and node resources enables applications to easily find the latency/cost sweet spot,
  the elasticity operations are automatic and queries and writes work as usual with no downtime.
</p>


<h2 id="changing-topology">Changing topology</h2>
<p>
  A Vespa elasticity feature is the ability to change topology (i.e. grouped distribution) without service disruption.
  This is a live change, and will auto-redistribute documents to the new topology.
</p>
<p>
  Also read <a href="../operations/self-managed/admin-procedures.html#topology-change">topology change</a>
  if running Vespa self-hosted - the below steps are general for all hosting options.
</p>


<h3 id="replicas">Replicas</h3>
<p>
  When changing topology, pay attention to the <a href="../reference/applications/services/content.html#min-redundancy">min-redundancy</a> setting -
  this setting configures a <em>minimum</em> number of replicas in a cluster,
  the <em>actual</em> number is topology dependent - example:
</p>
<p>
  A flat cluster with min-redundancy n=2 and 15 nodes is changed into a grouped cluster with 3 groups with 5 nodes each
  (total node count and n is kept unchanged).
  In this case, the actual redundancy will be 3 after the change,
  as each of the 3 groups will have at least 1 replica for full query coverage.
  The practical consequence is that disk and memory requirements per node <em>increases</em> due to the change to topology.
  It is therefore important to calculate the actual replica count before reconfiguring topology.
</p>


<h3 id="query-coverage">Query coverage</h3>
<p>
  Changing topology might cause query coverage loss in the transition, unless steps taken in the right order.
  If full coverage is not important, just make the change and wait for document redistribution to complete.
</p>
<p>
  To keep full query coverage, make sure not to change both group size and number of groups at the same time:
</p>
<ol>
  <li>
    To add nodes for more data, or to have less data per node, increase group size.
    E.g., in a 2-group cluster with 8 nodes per group, add 4 nodes for a 25% capacity increase with 10 nodes per group.
  </li>
  <li>
    If the goal is to add query capacity, add one or more groups, with the same node count as existing group(s).
    A flat cluster is the same as one group - if the flat cluster has 8 nodes,
    change to a grouped cluster with 2 groups of 8 nodes per group.
    This will add an empty group, which is put in query serving once populated.
  </li>
</ol>
<p>
  In short, if the end-state means both changing number of groups and node count per group,
  do this as separate steps, as a combination of the above.
  Between each step, wait for document redistribution to complete using the <code>merge_bucket.pending</code> metric -
  see <a href="../writing/initial-batch-feed.html">example</a>.
</p>


<h2 id="buckets">Buckets</h2>

<p>
  To manage documents, Vespa groups them in <em>buckets</em>,
  using hashing or hints in the <a href="../schemas/documents.html">document id</a>.
</p>

<p>
  A document Put or Update is sent to all replicas of the bucket with the document.
  If bucket replicas are out of sync, a bucket merge operation is run to re-sync the bucket.
  A bucket contains <a href="../operations/self-managed/admin-procedures.html#data-retention-vs-size">tombstones</a>
  of recently removed documents.
</p>

<p>
  Buckets are split when they grow too large, and joined when they shrink.
  This is a key feature for high performance in small to large instances,
  and eliminates need for downtime or manual operations when scaling.
  Buckets are purely a content management concept, and data is not stored or
  indexed in separate buckets, nor does queries relate to buckets in any way.
  Read more in <a href="buckets.html">buckets</a>.
</p>


<h2 id="ideal-state-distribution-algorithm">Ideal state distribution algorithm</h2>
<p>
  The <a href="idealstate.html">ideal state distribution algorithm</a>
  uses a variant of the <a href="https://ceph.io/assets/pdfs/weil-crush-sc06.pdf">CRUSH algorithm</a>
  to decide bucket placement.
  It makes a minimal number of documents move when nodes are added or removed.
  Central to the algorithm is the assignment of a node sequence to each bucket:
</p>
<img src="/assets/img/bucket-node-sequence.svg" width="675px" height="auto"
     alt="Assignment of a node sequence to each bucket" />
<p>
  Steps to assign a bucket to a set of nodes:
</p>
<ol>
  <li>Seed a random generator with the bucket ID to generate a pseudo-random sequence of numbers.
    Using the bucket ID as seed will then always generate the same sequence for the bucket.</li>
  <li>Nodes are ordered by <a href="../reference/applications/services/content.html#node">distribution-key</a>,
    assign the random number in that order.
    E.g. a node with distribution-key 0 will get the first random number, node 1 the second.</li>
  <li>Sort the node list by the random number.</li>
  <li>Select nodes in descending random number order -
    above, node 1, 3 and 0 will store bucket 0x3c000000000000a0 with n=3 (redundancy).
    For n=2, node 1 and 3 will store the bucket.
    This specification of where to place a bucket is called the bucket's <em>ideal state</em>.</li>
</ol>
<p>
  Repeat this for all buckets in the system.
</p>


<h2 id="consistency">Consistency</h2>

<p>
  Consistency is maintained at bucket level.
  Content nodes calculate local checksums based on the bucket contents,
  and the distributors compare checksums across the bucket replicas.
  A <em>bucket merge</em> is issued to resolve inconsistency, when detected.
  While there are inconsistent bucket replicas, operations are routed to the "best" replica.
</p>

<p>
  As buckets are split and joined, it is possible for replicas of a bucket to be split at different levels.
  A node may have been down while its buckets have been split or joined.
  This is called <em>inconsistent bucket splitting</em>.
  Bucket checksums can not be compared across buckets with different split levels.
  Consequently, content nodes do not know whether all documents exist in enough replicas in this state.
  Due to this, inconsistent splitting is one of the highest maintenance priorities.
  After all buckets are split or joined back to the same level,
  the content nodes can verify that all the replicas are consistent and fix any detected issues with a merge.
  <a href="consistency">Read more</a>.
</p>


<h2 id="further-reading">Further reading</h2>

<ul>
  <li><a href="content-nodes.html">content nodes</a></li>
  <li><a href="proton.html">proton</a> - see <em>ready</em> state</li>
</ul>