DBMS-Course/topics/09-storage-engines.html at main · sridhar-3009/DBMS-Course · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0">
  <script>try{var t=localStorage.getItem("dbms-theme")||"dark";document.documentElement.setAttribute("data-theme",t)}catch(e){document.documentElement.setAttribute("data-theme","dark")}</script>
  <title>Storage Engines — DBMS Illustrated</title>
  <link rel="stylesheet" href="../css/style.css">
  <style>:root{--topic-color:#F97316}</style>
</head>
<body>
<div class="reading-progress" id="reading-progress"></div>
<nav class="navbar">
  <div class="navbar-inner">
    <a class="navbar-brand" href="../index.html"><span class="brand-icon" style="color:var(--primary-soft)">&#9670;</span>DBMS Illustrated</a>
    <div class="navbar-links"><a href="../index.html#topics">All Topics</a></div>
    <div class="navbar-actions"><button class="btn-icon" id="theme-toggle" title="Toggle theme">&#9788;</button></div>
  </div>
</nav>

<div class="container">
  <div class="breadcrumb">
    <a href="../index.html">Home</a><span class="breadcrumb-sep">›</span>
    <a href="../index.html#topics">Course</a><span class="breadcrumb-sep">›</span><span>Storage Engines</span>
  </div>

  <div class="topic-header">
    <div class="topic-badge">Topic 09</div>
    <h1>Storage <span class="accent">Engines</span></h1>
    <p class="subtitle">When you save something in a database, where does it actually go? How does the database survive a power outage without losing your data? This topic covers what happens under the hood: the journal that makes writes safe, the memory cache that makes reads fast, and the two main ways databases store data on disk.</p>
    <div class="company-badges">
      <span class="badge">InnoDB (MySQL)</span><span class="badge">PostgreSQL heap</span>
      <span class="badge">RocksDB (LSM)</span><span class="badge">WiredTiger (MongoDB)</span><span class="badge">MyISAM</span>
    </div>
  </div>

  <div class="at-a-glance reveal">
    <h3>At a Glance</h3>
    <ul>
      <li><strong>InnoDB</strong> — MySQL's default storage engine. Stores rows in a B+tree organized by primary key. Supports real transactions, row-level locking, and crash recovery via a journal (called the redo log or WAL).</li>
      <li><strong>WAL (Write-Ahead Log)</strong> — before touching any data on disk, the database first writes a record of the change to a journal file. If the machine crashes, the journal can replay all committed changes and restore the database to a safe state.</li>
      <li><strong>LSM tree</strong> — an alternative storage design that accepts all writes into memory first, then flushes them to sorted files on disk in batches. Much faster for write-heavy workloads; slightly slower for reads. Used by RocksDB, Cassandra, and others.</li>
      <li><strong>Buffer pool</strong> — a large chunk of RAM where the database caches recently used pages of data. Reading from RAM is thousands of times faster than reading from disk, so the goal is to keep frequently accessed data in the buffer pool at all times.</li>
      <li><strong>Vacuum / compaction</strong> — deleted rows are not immediately removed from disk; they leave behind dead space. VACUUM (PostgreSQL) and compaction (LSM engines) periodically clean up that space so the database does not bloat over time.</li>
    </ul>
  </div>

  <div class="section-label">Core Concepts</div>
  <div class="mini-cards">
    <div class="mini-card">
      <div class="mini-card-icon"></div>
      <h4>Pages &amp; Buffer Pool</h4>
      <p>Think of a database like a book. The database does not read one word at a time — it reads a whole page at a time. Each page is typically 8KB or 16KB and contains many rows. The buffer pool is the database's working desk: recently used pages sit in RAM so they can be read instantly. When a page is modified but not yet saved to disk, it is called a dirty page. The database writes dirty pages to disk in the background.</p>
    </div>
    <div class="mini-card">
      <div class="mini-card-icon"></div>
      <h4>Write-Ahead Log (WAL)</h4>
      <p>Imagine a chef who writes every step in a notebook before doing anything in the kitchen. If the kitchen catches fire mid-recipe, you can replay the notebook and pick up exactly where things went wrong. The WAL is that notebook. Every change is written to the journal before it touches the actual data. A transaction is considered committed the moment the journal entry is safely on disk — the data page itself can be updated later at a convenient time.</p>
    </div>
    <div class="mini-card">
      <div class="mini-card-icon"></div>
      <h4>LSM Trees</h4>
      <p>Traditional databases update data in place on disk, which involves a lot of jumping around (random I/O). LSM trees take a different approach: every write lands in memory first (in a structure called the memtable), and once memory fills up, the batch is written to disk sequentially — like appending to a log file. Sequential writes are much faster than random ones. The tradeoff is that reads may need to check multiple sorted files, and background work called compaction periodically merges those files to keep reads manageable.</p>
    </div>
    <div class="mini-card">
      <div class="mini-card-icon"></div>
      <h4>Checkpoints &amp; Recovery</h4>
      <p>Think of a checkpoint like a save point in a video game. Periodically, the database takes all the dirty pages from memory and writes them to disk, then marks the current position in the journal. If the machine crashes later, recovery only needs to replay the journal from the last checkpoint — not from the very beginning. More frequent checkpoints mean faster recovery after a crash, but slightly more disk activity while the database is running.</p>
    </div>
  </div>

  <!-- ── Page Layout ────────────────────────────────────────── -->
  <div class="section-label">Page Layout</div>
  <h2 class="section-title">How Data Pages Work</h2>
  <p class="section-desc">The page (also called a block) is the fundamental unit of I/O in a database engine — typically 4KB, 8KB, or 16KB. Everything the database reads from or writes to disk is in whole pages. Understanding page layout explains why certain operations are expensive and why fragmentation occurs.</p>

  <div class="steps">
    <div class="step-item">
      <div class="step-num">1</div>
      <div>
        <h4>Page structure</h4>
        <p>A database page is like a physical sheet of paper in a filing folder. At the top is a header with bookkeeping information — the page number, how much free space is left, and a sequence number (called the LSN) that the journal uses to track which changes have been applied. The middle holds the actual row data. A small directory at the bottom keeps track of where each row starts within the page. When the page fills up, a new page is allocated — like adding a new sheet to the folder.</p>
      </div>
    </div>
    <div class="step-item">
      <div class="step-num">2</div>
      <div>
        <h4>Fill factor</h4>
        <p>If you pack a page completely full, the next INSERT has nowhere to go and the database must split the page in two — an expensive operation. Fill factor is the setting that controls how full a page is allowed to get before the database considers it "full." InnoDB leaves about 7% empty by default. PostgreSQL lets you configure this per table. For tables that receive lots of updates, leaving extra free space (setting fillfactor to 70-80%) means updates can fit on the same page rather than spilling to a new one.</p>
      </div>
    </div>
    <div class="step-item">
      <div class="step-num">3</div>
      <div>
        <h4>HOT updates (PostgreSQL)</h4>
        <p>In PostgreSQL, every UPDATE creates a new version of the row rather than changing it in place. Normally this also requires updating every index that points to that row — which is expensive if you have many indexes. A HOT update (Heap Only Tuple) is a special shortcut: if the new version fits on the same page and none of the indexed columns changed, PostgreSQL can skip updating the indexes entirely and just leave a pointer inside the page from the old row to the new one. This is much cheaper for tables with frequent small updates on non-indexed columns.</p>
      </div>
    </div>
    <div class="step-item">
      <div class="step-num">4</div>
      <div>
        <h4>Page fragmentation</h4>
        <p>Imagine a notebook where you keep erasing entries. Over time, the notebook has lots of half-empty pages with gaps scattered throughout. Every time you read a page, you bring back a full 8KB from disk — but if half of it is empty space, you wasted that read. This is fragmentation. PostgreSQL's VACUUM FULL and InnoDB's OPTIMIZE TABLE physically rewrite the entire table from scratch, packing rows together tightly. The downside is that this operation locks the table and takes a while — so it is not something you do casually on a large production table.</p>
      </div>
    </div>
  </div>

  <!-- ── B+Tree vs LSM ───────────────────────────────────────── -->
  <div class="section-label">Storage Model Comparison</div>
  <h2 class="section-title">B+Tree vs LSM Tree</h2>

  <div class="table-wrap">
    <table class="compare-table">
      <thead>
        <tr><th>Property</th><th>B+Tree (InnoDB/PostgreSQL)</th><th>LSM Tree (RocksDB/Cassandra)</th></tr>
      </thead>
      <tbody>
        <tr><td>Write path</td><td>Random I/O to existing pages (in-place)</td><td>Sequential append to memtable, then SSTable</td></tr>
        <tr><td>Read path</td><td>O(log n) — single B+tree traversal</td><td>Check memtable + L0..Ln SSTables (bloom filter skips)</td></tr>
        <tr><td>Write amplification</td><td>Low — write page once</td><td>High — data written multiple times during compaction</td></tr>
        <tr><td>Read amplification</td><td>Low — one path from root to leaf</td><td>Higher — may need to check multiple levels</td></tr>
        <tr><td>Space amplification</td><td>Low (with VACUUM/OPTIMIZE)</td><td>Higher — multiple versions exist until compaction</td></tr>
        <tr><td>Delete handling</td><td>Mark dead, VACUUM later</td><td>Tombstone record, compaction removes it</td></tr>
        <tr><td>Best workload</td><td>Mixed read/write, random lookups, range scans</td><td>Write-heavy: logs, events, IoT telemetry, append-only</td></tr>
        <tr><td>Used by</td><td>PostgreSQL, MySQL InnoDB, SQLite, Oracle</td><td>RocksDB, Cassandra, HBase, LevelDB, TiKV</td></tr>
      </tbody>
    </table>
  </div>

  <!-- ── LSM Compaction ──────────────────────────────────────── -->
  <div class="section-label">LSM Internals</div>
  <h2 class="section-title">LSM Tree and Compaction Strategies</h2>
  <p class="section-desc">The LSM tree trades read performance for dramatically higher write throughput by converting random writes into sequential I/O. Understanding compaction is essential for tuning LSM-backed databases like Cassandra and RocksDB.</p>

  <div class="steps">
    <div class="step-item">
      <div class="step-num">1</div>
      <div>
        <h4>Memtable and immutable memtable</h4>
        <p>All incoming writes land in memory first, in a sorted structure called the memtable. Think of it as an in-progress draft document. Once it reaches a size limit (typically 64MB to 256MB), it is "sealed" — no more writes go to it. A fresh memtable takes over for new writes, while the sealed one (now immutable) is flushed to disk in the background. A WAL journal backs the in-memory data so nothing is lost if the machine crashes before the flush completes.</p>
      </div>
    </div>
    <div class="step-item">
      <div class="step-num">2</div>
      <div>
        <h4>SSTables and bloom filters</h4>
        <p>When the memtable is flushed to disk, it becomes a sorted, immutable file called an SSTable (Sorted String Table). Over time many SSTables accumulate, and looking up a key might require checking several of them. To speed this up, each SSTable carries a bloom filter — a small lookup structure that can quickly answer "is this key definitely NOT in this file?" If the bloom filter says no, the whole file is skipped. This reduces a lookup that might scan dozens of files down to checking just one or two.</p>
      </div>
    </div>
    <div class="step-item">
      <div class="step-num">3</div>
      <div>
        <h4>Leveled compaction (RocksDB default)</h4>
        <p>SSTables are organized into levels. Level 0 holds newly flushed files and may have overlapping key ranges. Compaction periodically merges L0 files into Level 1, then Level 1 into Level 2, and so on — each level being about 10x larger than the previous. Within Levels 1 and up, key ranges never overlap, which makes lookups fast (at most one file per level to check). The tradeoff is that data is physically rewritten multiple times as it moves through levels. This is called write amplification, and it is the main cost of the LSM approach.</p>
      </div>
    </div>
    <div class="step-item">
      <div class="step-num">4</div>
      <div>
        <h4>Size-tiered compaction (Cassandra default)</h4>
        <p>Instead of organizing files by level, size-tiered compaction groups SSTables that are similar in size and merges them together. This avoids the high write amplification of leveled compaction — data is not rewritten as many times. The tradeoff is that multiple files may cover the same key range, so reads may need to check more files. This strategy works best for write-heavy workloads where you mostly append and rarely look up individual records. Cassandra also offers a Time-Window Compaction Strategy (TWCS) which is ideal for time-series data that naturally expires after a fixed period.</p>
      </div>
    </div>
  </div>

  <div class="section-label">How It Works</div>
  <h2 class="section-title">InnoDB Write Path</h2>
  <div class="steps">
    <div class="step-item"><div class="step-num">1</div><div><h4>Write to Redo Log (WAL)</h4><p>Before any data page on disk is touched, the change is first recorded in the redo log (the journal). Once that journal entry is safely written to disk, the transaction is considered committed. This is the key insight: writing to a sequential journal file is fast; writing to a random location in a data file is slow. The journal entry is the durable record of what happened.</p></div></div>
    <div class="step-item"><div class="step-num">2</div><div><h4>Modify Page in Buffer Pool</h4><p>The relevant data page is loaded from disk into the buffer pool (if it is not already there). The change is applied to the in-memory copy of the page. The page is now "dirty" — its content in memory does not match what is on disk. It will be written to disk later, in the background.</p></div></div>
    <div class="step-item"><div class="step-num">3</div><div><h4>Undo Log Entry</h4><p>Alongside the redo log, InnoDB writes an undo record — a "before photo" of what the row looked like before the change. If the transaction is rolled back, InnoDB uses the undo record to put the row back to its original state. This also supports MVCC: other transactions can read the old version of the row from the undo log even while your transaction is mid-flight.</p></div></div>
    <div class="step-item"><div class="step-num">4</div><div><h4>Background Flush (Checkpoint)</h4><p>Dirty pages sit in memory until a background process writes them to disk. Periodically, the database performs a checkpoint: it flushes all dirty pages and records exactly where in the journal it has caught up to. This is the save-game moment. If the database crashes after a checkpoint, it only needs to replay the journal from that point forward — not from the beginning of time.</p></div></div>
    <div class="step-item"><div class="step-num">5</div><div><h4>Crash Recovery: ARIES</h4><p>After a crash, the database runs through three phases. First, the Analysis phase scans the journal from the last checkpoint to figure out which transactions were in-flight and which pages were dirty. Second, the Redo phase replays every change in the journal — both committed and uncommitted — to restore the database to the exact state it was in at the moment of the crash. Third, the Undo phase rolls back any transactions that had not yet committed, using the undo logs. This three-phase process is called ARIES recovery.</p></div></div>
  </div>

  <div class="code-block">
    <div class="cb-header"><span class="cb-lang">SQL — InnoDB Configuration</span></div>
    <pre><span class="cmt">Buffer pool: most important InnoDB tuning parameter
-- Set to 70-80% of available RAM on dedicated DB servers</span>
<span class="kw">SET</span> GLOBAL innodb_buffer_pool_size = <span class="num">8</span>G;

<span class="cmt">WAL sync mode: 1 = safest (fsync on commit)</span>
<span class="cmt">0 = fastest (no fsync, risky), 2 = flush but no fsync</span>
<span class="kw">SET</span> GLOBAL innodb_flush_log_at_trx_commit = <span class="num">1</span><span class="cmt">PostgreSQL: WAL configuration</span>
<span class="cmt">wal_level = logical enables logical replication</span>
<span class="cmt">checkpoint_completion_target = 0.9 spreads checkpoint I/O</span>

<span class="cmt">View dirty page stats (InnoDB)</span>
<span class="kw">SELECT</span> * <span class="kw">FROM</span> information_schema.INNODB_BUFFER_POOL_STATS\G</pre>
  </div>

  <div class="section-label">Interactive Demo</div>
  <h2 class="section-title">Page/Block Storage Visualizer</h2>
  <p class="section-desc">INSERT fills free blocks, DELETE marks blocks dashed (fragmented). VACUUM compacts the table. Watch the fill factor gauge change.</p>

  <div class="diagram-box">
    <svg viewBox="0 0 700 220" role="img" aria-label="Heap page layout: page header, slot array, and tuple data">
      <!-- Page border -->
      <rect x="20"  y="20" width="440" height="190" rx="8" class="svg-box" style="stroke:#7C3AED;stroke-width:2"/>
      <text x="240" y="13" text-anchor="middle" class="svg-soft" style="fill:#7C3AED">8 KB Heap Page</text>
      <!-- Header -->
      <rect x="30"  y="30" width="420" height="36" rx="5" class="svg-box-p"/>
      <text x="240" y="46"  text-anchor="middle" class="svg-label">Page Header</text>
      <text x="240" y="60"  text-anchor="middle" class="svg-mono" style="font-size:10px">lsn | page_id | checksum | free_space_start | free_space_end</text>
      <!-- Slot array -->
      <rect x="30"  y="74" width="420" height="32" rx="5" class="svg-box-c"/>
      <text x="240" y="89"  text-anchor="middle" class="svg-label">Slot Array (grows down)</text>
      <text x="240" y="103" text-anchor="middle" class="svg-mono" style="font-size:10px">[slot1: offset=180, len=42] [slot2: offset=140, len=38] [slot3: deleted] ...</text>
      <!-- Free space -->
      <rect x="30"  y="114" width="420" height="36" rx="5" class="svg-box"/>
      <text x="240" y="130" text-anchor="middle" class="svg-soft">Free Space</text>
      <text x="240" y="146" text-anchor="middle" class="svg-soft" style="font-size:10px">(slot array grows down, tuples grow up)</text>
      <!-- Tuples -->
      <rect x="30"  y="158" width="420" height="44" rx="5" class="svg-box-g"/>
      <text x="240" y="174" text-anchor="middle" class="svg-label">Tuple Data (grows up)</text>
      <text x="240" y="190" text-anchor="middle" class="svg-mono" style="font-size:10px">t_xmin | t_xmax | t_ctid | null_bitmap | attrs...</text>
      <!-- WAL flow on right -->
      <rect x="490" y="20"  width="190" height="56" rx="8" class="svg-box-r"/>
      <text x="585" y="40"  text-anchor="middle" class="svg-label">WAL Buffer</text>
      <text x="585" y="56"  text-anchor="middle" class="svg-soft">write-ahead log record</text>
      <text x="585" y="70"  text-anchor="middle" class="svg-mono" style="font-size:10px">XID | page | old+new images</text>
      <rect x="490" y="95"  width="190" height="52" rx="8" class="svg-box-a"/>
      <text x="585" y="115" text-anchor="middle" class="svg-label">Buffer Pool</text>
      <text x="585" y="131" text-anchor="middle" class="svg-soft">dirty page cache</text>
      <text x="585" y="145" text-anchor="middle" class="svg-mono" style="font-size:10px">evict via clock/LRU</text>
      <rect x="490" y="165" width="190" height="42" rx="8" class="svg-box-g"/>
      <text x="585" y="183" text-anchor="middle" class="svg-label">Disk (fsync)</text>
      <text x="585" y="199" text-anchor="middle" class="svg-soft">durable on COMMIT</text>
      <!-- WAL flow arrows -->
      <line x1="585" y1="76" x2="585" y2="95"  class="svg-line svg-dash" style="stroke:#EF4444"/>
      <line x1="585" y1="147" x2="585" y2="165" class="svg-line svg-dash" style="stroke:#F59E0B"/>
      <!-- Packets -->
      <circle class="pkt" cx="460" cy="50"  r="5" fill="#EF4444"/>
      <circle class="pkt" cx="585" cy="147" r="5" fill="#F59E0B"/>
      <circle class="pkt" cx="240" cy="170" r="5" fill="#10B981"/>
    </svg>
    <p class="diagram-caption">InnoDB and PostgreSQL use fixed-size heap pages (8 or 16 KB). Each tuple includes transaction metadata (xmin/xmax) enabling MVCC visibility rules without a separate undo log in Postgres. Writes go to WAL first — data pages are flushed lazily, often in batches.</p>
  </div>

  <div class="demo-section" id="demo-storage">
    <div class="demo-header">
      <h3>Storage Page Layout</h3>
      <div class="demo-controls">
        <button class="demo-btn" data-action="insert-row" style="background:var(--topic-color);color:#fff;border-color:var(--topic-color)">INSERT Row</button>
        <button class="demo-btn danger" data-action="delete-row">DELETE Row</button>
        <button class="demo-btn warn" data-action="vacuum">VACUUM</button>
      </div>
    </div>
    <div class="demo-canvas-wrap"></div>
    <div class="demo-hint">Blue = used page, red dashed = deleted (dead tuple), dark = free. VACUUM reclaims dead tuples.</div>
  </div>

  <div class="comparison-grid">
    <div class="comparison-card">
      <h4>B+Tree (InnoDB/PostgreSQL)</h4>
      <ul>
        <li>Reads: O(log n) — excellent</li>
        <li>Writes: random I/O to existing pages</li>
        <li>Good for mixed read/write workloads</li>
        <li>VACUUM needed for dead tuple cleanup</li>
        <li>Row-level compression possible</li>
      </ul>
    </div>
    <div class="comparison-card">
      <h4>LSM Tree (RocksDB/Cassandra)</h4>
      <ul>
        <li>Writes: sequential append — very fast</li>
        <li>Reads: may check multiple levels</li>
        <li>Bloom filters reduce false reads</li>
        <li>Compaction happens asynchronously</li>
        <li>Best for write-heavy time-series/log data</li>
      </ul>
    </div>
  </div>

  <div class="section-label">Anti-patterns</div>
  <div class="antipatterns">
    <div class="antipattern">
      <h4>Setting innodb_flush_log_at_trx_commit = 0</h4>
      <p>This disables WAL flushing on commit — MySQL uses OS buffers only. Up to 1 second of committed data can be lost on crash. Never use this for financial or critical data. Mode 2 (flush but no fsync) is a safer compromise.</p>
    </div>
    <div class="antipattern">
      <h4>Not running VACUUM (PostgreSQL)</h4>
      <p>PostgreSQL MVCC never overwrites rows — dead tuples accumulate. Without VACUUM, tables bloat (4× is common), queries slow down, and eventually transaction ID wraparound (every ~2B transactions) causes catastrophic downtime. Autovacuum must be tuned and monitored.</p>
    </div>
    <div class="antipattern">
      <h4>Undersizing the buffer pool</h4>
      <p>If the buffer pool is too small, hot pages are constantly evicted and re-read from disk. 70–80% of RAM on a dedicated database server is the standard starting point. Monitor <code>innodb_buffer_pool_reads</code> (disk reads) vs <code>innodb_buffer_pool_read_requests</code> (cache hits).</p>
    </div>
  </div>

  <div class="section-label">Quiz</div>
  <div class="quiz-section">
    <div class="quiz-question">
      <div class="quiz-q-num">Question 1 of 3</div>
      <div class="quiz-q-text">Write-Ahead Logging (WAL) ensures durability by:</div>
      <div class="quiz-options">
        <div class="quiz-option" data-correct="false" data-explanation="WAL is specifically about logging changes, not just making writes faster."><span class="opt-letter">A</span>Making all writes faster by buffering them</div>
        <div class="quiz-option" data-correct="true" data-explanation="WAL requires that log records for a change are flushed to durable storage BEFORE the corresponding data pages. On crash, the log can replay all committed changes even if data pages weren't flushed."><span class="opt-letter">B</span>Writing log records to durable storage BEFORE applying changes to data pages</div>
        <div class="quiz-option" data-correct="false" data-explanation="WAL writes are sequential to a log file, not random to data pages. The benefit is sequential vs random I/O."><span class="opt-letter">C</span>Writing all changes directly to data pages synchronously</div>
        <div class="quiz-option" data-correct="false" data-explanation="WAL doesn't eliminate the need for checkpoints; checkpoints reduce recovery time."><span class="opt-letter">D</span>Eliminating the need for periodic checkpoints</div>
      </div>
      <div class="quiz-feedback"></div>
    </div>
    <div class="quiz-question">
      <div class="quiz-q-num">Question 2 of 3</div>
      <div class="quiz-q-text">In InnoDB, the clustered index means:</div>
      <div class="quiz-options">
        <div class="quiz-option" data-correct="false" data-explanation="Clustered index is not a separate file — the row data IS the leaf page of the clustered B+tree."><span class="opt-letter">A</span>Row data is stored in a separate file from the index</div>
        <div class="quiz-option" data-correct="true" data-explanation="In InnoDB, the primary key IS the clustered index. The B+tree leaf pages contain the actual row data. PK lookups require only one B+tree traversal."><span class="opt-letter">B</span>Row data is stored directly in the primary key B+tree leaf pages</div>
        <div class="quiz-option" data-correct="false" data-explanation="InnoDB uses row-level locking, not page-level."><span class="opt-letter">C</span>Rows are locked at the page level, not row level</div>
        <div class="quiz-option" data-correct="false" data-explanation="A hash-based clustered index is not what InnoDB uses — it uses a B+tree."><span class="opt-letter">D</span>Rows are stored in a hash table sorted by primary key</div>
      </div>
      <div class="quiz-feedback"></div>
    </div>
    <div class="quiz-question">
      <div class="quiz-q-num">Question 3 of 5</div>
      <div class="quiz-q-text">LSM trees are optimized for:</div>
      <div class="quiz-options">
        <div class="quiz-option" data-correct="false" data-explanation="LSM trees can have slower reads than B-trees due to multi-level lookups — they are not optimized for read-heavy random access."><span class="opt-letter">A</span>Read-heavy random lookup workloads</div>
        <div class="quiz-option" data-correct="true" data-explanation="LSM trees convert random writes into sequential appends to an in-memory memtable then sequential disk writes. This is dramatically faster than random B-tree page writes, making them ideal for write-heavy workloads like time-series, logs, and IoT data."><span class="opt-letter">B</span>Write-heavy workloads where sequential write throughput matters</div>
        <div class="quiz-option" data-correct="false" data-explanation="LSM trees write to memory first, then sequential disk — the sequential disk write is the key advantage, not skipping memory."><span class="opt-letter">C</span>Direct random access without memory buffering</div>
        <div class="quiz-option" data-correct="false" data-explanation="LSM trees are not specifically optimized for JOINs — relational databases with B+trees are better for complex joins."><span class="opt-letter">D</span>Complex multi-table JOIN operations</div>
      </div>
      <div class="quiz-feedback"></div>
    </div>
    <div class="quiz-question">
      <div class="quiz-q-num">Question 4 of 5</div>
      <div class="quiz-q-text">What is the purpose of a bloom filter in an LSM-tree storage engine?</div>
      <div class="quiz-options">
        <div class="quiz-option" data-correct="false" data-explanation="Bloom filters do not compress data — they are a probabilistic lookup structure."><span class="opt-letter">A</span>Compressing SSTable data to reduce disk usage</div>
        <div class="quiz-option" data-correct="false" data-explanation="Bloom filters do not sort data — SSTables are already sorted by construction."><span class="opt-letter">B</span>Sorting keys in an SSTable for binary search</div>
        <div class="quiz-option" data-correct="true" data-explanation="A bloom filter for each SSTable lets the database quickly check: 'Is this key definitely NOT in this file?' With zero false negatives, if the bloom filter says no, the SSTable is skipped entirely. This reduces point lookup I/O from O(files) to O(1) in the common case."><span class="opt-letter">C</span>Quickly determining if a key is definitely not in an SSTable, avoiding unnecessary I/O</div>
        <div class="quiz-option" data-correct="false" data-explanation="Bloom filters are read-time optimizations, not write-time buffers."><span class="opt-letter">D</span>Buffering writes before flushing to SSTable</div>
      </div>
      <div class="quiz-feedback"></div>
    </div>
    <div class="quiz-question">
      <div class="quiz-q-num">Question 5 of 5</div>
      <div class="quiz-q-text">In PostgreSQL, a HOT (Heap Only Tuple) update avoids which expensive operation?</div>
      <div class="quiz-options">
        <div class="quiz-option" data-correct="false" data-explanation="WAL writes are always required for durability — HOT updates do not skip WAL."><span class="opt-letter">A</span>Writing a WAL record for the update</div>
        <div class="quiz-option" data-correct="true" data-explanation="A HOT update occurs when the new version fits on the same page AND no indexed columns changed. The index entries are NOT updated — a pointer chain in the heap page links old to new. This eliminates the O(indexes) index maintenance cost per update."><span class="opt-letter">B</span>Updating index entries — the index is not touched for HOT updates</div>
        <div class="quiz-option" data-correct="false" data-explanation="HOT updates still create a new row version in the heap — they just skip the index update."><span class="opt-letter">C</span>Writing the new row version to the heap</div>
        <div class="quiz-option" data-correct="false" data-explanation="MVCC versioning is still used for HOT updates — the old version's xmax is set and a new xmin is recorded."><span class="opt-letter">D</span>MVCC versioning of the row</div>
      </div>
      <div class="quiz-feedback"></div>
    </div>
  </div>

  <div class="section-label">Interview Q&A</div>
  <div class="qa-section">
    <div class="qa-item">
      <div class="qa-q">How does crash recovery work in InnoDB (ARIES algorithm)?<span class="qa-chevron">▾</span></div>
      <div class="qa-a"><div class="qa-a-inner">InnoDB uses the ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) algorithm: (1) <strong>Analysis pass</strong>scan the WAL from the last checkpoint to determine which pages were dirty (incomplete flushes) and which transactions were in-progress at the time of crash. (2) <strong>Redo pass</strong>replay ALL WAL records from the last checkpoint, both committed and uncommitted. This restores the database to the exact state at crash time. (3) <strong>Undo pass</strong>using undo logs, rollback any transactions that were in-progress (uncommitted) at crash time. ARIES's key innovation: "steal + no-force" — dirty pages CAN be flushed before commit (steal), and committed transactions DON'T need pages flushed before commit (no-force) — WAL provides durability.</div></div>
    </div>
    <div class="qa-item">
      <div class="qa-q">What is VACUUM in PostgreSQL and why is it necessary?<span class="qa-chevron">▾</span></div>
      <div class="qa-a"><div class="qa-a-inner">PostgreSQL's MVCC never overwrites or deletes rows in place — instead, UPDATE inserts a new row version and marks the old xmax as the current transaction ID. DELETE just marks xmax. These "dead tuples" remain on disk until VACUUM reclaims them. VACUUM does: (1) Reclaim dead tuple storage for reuse. (2) Update the Free Space Map (FSM) so INSERTs can find free space. (3) Update the Visibility Map (VM) — marks pages with all-visible tuples, allowing index-only scans. (4) Prevent transaction ID wraparound (TXID wraparound vacuum freezes old rows). AUTOVACUUM triggers based on <code>autovacuum_vacuum_scale_factor</code> (default 20% dead tuples). On high-write tables, tune to trigger more aggressively.</div></div>
    </div>
    <div class="qa-item">
      <div class="qa-q">What is the difference between InnoDB and MyISAM storage engines?<span class="qa-chevron">▾</span></div>
      <div class="qa-a"><div class="qa-a-inner">InnoDB: full ACID transactions, row-level locking, MVCC, foreign keys, crash recovery via WAL, clustered primary key index. MyISAM: no transactions, table-level locking only, no FK support, no crash recovery, full-text search (pre-InnoDB 5.6). MyISAM was MySQL's default before 5.5; InnoDB has been the default since. Reasons to still use MyISAM: slightly faster full table reads on read-only tables, simpler storage (no transaction overhead). For any production workload with writes or needing transactions, always use InnoDB. MyISAM is effectively deprecated for most use cases.</div></div>
    </div>
    <div class="qa-item">
      <div class="qa-q">How does an LSM tree work, and what is compaction?<span class="qa-chevron">▾</span></div>
      <div class="qa-a"><div class="qa-a-inner">LSM (Log-Structured Merge) tree: writes go to an in-memory <strong>memtable</strong> (sorted skip list or red-black tree). When memtable fills, it's flushed to disk as an immutable <strong>SSTable</strong> (Sorted String Table). Over time, many SSTables accumulate. <strong>Compaction</strong> merges SSTables: sort-merge overlapping key ranges, discard outdated (overwritten/deleted) entries, produce a new sorted SSTable. Levels: L0 = recently flushed (unsorted across files), L1+ = sorted, compacted. Bloom filters on each SSTable let you skip files that definitely don't contain a key. Read amplification: may need to check multiple levels. Write amplification: data is written multiple times during compaction. Tuning: level-based vs size-tiered compaction strategies (different tradeoffs).</div></div>
    </div>
    <div class="qa-item">
      <div class="qa-q">What is the buffer pool and how should you tune it?<span class="qa-chevron">▾</span></div>
      <div class="qa-a"><div class="qa-a-inner">The buffer pool (InnoDB) or shared_buffers (PostgreSQL) is an in-memory cache of database pages. All reads check the cache first. All writes modify pages in cache first — WAL ensures durability for the unflushed pages. Sizing: for InnoDB, set <code>innodb_buffer_pool_size</code> to 70–80% of RAM on a dedicated DB server. For PostgreSQL, set <code>shared_buffers</code> to ~25% of RAM — PostgreSQL also benefits from the OS page cache via effective_cache_size. Monitor hit ratio: <code>SELECT blks_hit / (blks_hit + blks_read)</code> from pg_stat_database — below 99% is a red flag for read-heavy workloads. InnoDB's LRU has "young" (recently accessed) and "old" (newly inserted) sub-lists to prevent bulk scans from evicting hot data.</div></div>
    </div>
    <div class="qa-item">
      <div class="qa-q">What is write amplification and why does it matter for SSD lifetime?<span class="qa-chevron">▾</span></div>
      <div class="qa-a"><div class="qa-a-inner">Write amplification is the ratio of actual bytes written to disk to the logical bytes written by the application. In a B+tree, a 100-byte update may write an entire 8KB page (80x amplification on small updates). In an LSM tree, compaction rewrites the same data multiple times across levels — a byte written to L1 is rewritten when merged to L2, and again to L3. Write amplification matters for SSDs because NAND flash cells have finite write endurance (typically 1,000–10,000 Program/Erase cycles). High write amplification — from WAL writes, B+tree page writes, and compaction — accelerates flash wear. In production: WAL alone on a busy PostgreSQL server can write 100GB/day; on a 1TB SSD with 3,000 P/E cycles, that's ~30GB total — the SSD would last ~8 years from WAL alone. High-write databases need enterprise SSDs with high endurance ratings or NVMe with over-provisioning.</div></div>
    </div>
    <div class="qa-item">
      <div class="qa-q">How does RocksDB differ from LevelDB?<span class="qa-chevron">▾</span></div>
      <div class="qa-a"><div class="qa-a-inner">RocksDB was forked from Google's LevelDB (2013, Facebook) with production workloads in mind. Key differences: (1) <strong>Concurrency:</strong> RocksDB supports multiple concurrent memtable writes and parallel compaction; LevelDB has a single global lock. (2) <strong>Compaction options:</strong> RocksDB adds Universal and FIFO compaction strategies in addition to Leveled; LevelDB only has Leveled. (3) <strong>Column families:</strong> RocksDB organizes data into separate column families (each with its own LSM tree), enabling different compaction and compression settings per logical table. LevelDB has a single key space. (4) <strong>Production hardening:</strong> backups, SST file ingestion, rate limiters, statistics. RocksDB is used as the storage layer for: CockroachDB, TiKV, MyRocks (MySQL), PebbleDB (CockroachDB's RocksDB replacement), Cassandra plugins. LevelDB is primarily used in Bitcoin Core and simpler embedded use cases.</div></div>
    </div>
  </div>

  <div class="section-label">Further Reading</div>
  <div class="reading-links">
    <a class="reading-link" href="https://dev.mysql.com/doc/refman/8.0/en/innodb-storage-engine.html" target="_blank">InnoDB Architecture</a>
    <a class="reading-link" href="https://github.com/facebook/rocksdb/wiki/RocksDB-Overview" target="_blank">RocksDB (LSM) Overview</a>
  </div>

  <div class="topic-nav">
    <a href="08-concurrency.html" class="topic-nav-link"><div class="topic-nav-arrow">←</div><div><div class="topic-nav-label">Previous</div><div class="topic-nav-title">Concurrency Control</div></div></a>
    <a href="10-nosql.html" class="topic-nav-link next"><div class="topic-nav-arrow">→</div><div><div class="topic-nav-label">Next</div><div class="topic-nav-title">NoSQL Databases</div></div></a>
  </div>
</div>

<script src="../js/main.js"></script>
<script src="../js/demos.js"></script>
</body>
</html>