DBMS-Course/topics/10-nosql.html at main · sridhar-3009/DBMS-Course · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0">
  <script>try{var t=localStorage.getItem("dbms-theme")||"dark";document.documentElement.setAttribute("data-theme",t)}catch(e){document.documentElement.setAttribute("data-theme","dark")}</script>
  <title>NoSQL Databases — DBMS Illustrated</title>
  <link rel="stylesheet" href="../css/style.css">
  <style>:root{--topic-color:#84CC16}</style>
</head>
<body>
<div class="reading-progress" id="reading-progress"></div>
<nav class="navbar">
  <div class="navbar-inner">
    <a class="navbar-brand" href="../index.html"><span class="brand-icon" style="color:var(--primary-soft)">&#9670;</span>DBMS Illustrated</a>
    <div class="navbar-links"><a href="../index.html#topics">All Topics</a></div>
    <div class="navbar-actions"><button class="btn-icon" id="theme-toggle" title="Toggle theme">&#9788;</button></div>
  </div>
</nav>

<div class="container">
  <div class="breadcrumb">
    <a href="../index.html">Home</a><span class="breadcrumb-sep">›</span>
    <a href="../index.html#topics">Course</a><span class="breadcrumb-sep">›</span><span>NoSQL Databases</span>
  </div>

  <div class="topic-header">
    <div class="topic-badge">Topic 10</div>
    <h1>NoSQL <span class="accent">Databases</span></h1>
    <p class="subtitle">Not every problem fits neatly into tables and rows. Sometimes your data looks more like a JSON blob, a social graph, or a time-series of sensor readings. NoSQL databases are purpose-built for these shapes. This topic explains what they trade away from relational databases, and what they gain in return.</p>
    <div class="company-badges">
      <span class="badge">MongoDB</span><span class="badge">Cassandra</span><span class="badge">Redis</span>
      <span class="badge">DynamoDB</span><span class="badge">Couchbase</span><span class="badge">Neo4j</span>
    </div>
  </div>

  <div class="at-a-glance reveal">
    <h3>At a Glance</h3>
    <ul>
      <li><strong>CAP theorem</strong> — when a distributed system's network breaks, you have to choose: do you keep returning answers (even if slightly stale), or do you go silent until you can guarantee correctness? You cannot do both. This fundamental constraint is called the CAP theorem.</li>
      <li><strong>Document DB</strong> (MongoDB) — stores data as JSON-like objects. Instead of splitting data across multiple tables, you pack related information into one document. Great for things like a user profile with settings, addresses, and preferences all bundled together.</li>
      <li><strong>Wide-column DB</strong> (Cassandra) — designed to absorb enormous numbers of writes without slowing down. Think millions of sensor readings per second or event logs from millions of devices. No JOINs; data modeling is driven entirely by query patterns.</li>
      <li><strong>Key-value DB</strong> (Redis, DynamoDB) — the simplest model: you look things up by a key, like a dictionary. Extremely fast, but you can only find data if you know the exact key. Used heavily for caching, sessions, and leaderboards.</li>
      <li><strong>Graph DB</strong> (Neo4j) — stores data as nodes (things) and edges (relationships between things). When you need to ask "who are the friends of my friends?" or "what is the shortest path between two people?", a graph database handles this far better than a table with JOINs.</li>
    </ul>
  </div>

  <div class="section-label">Core Concepts</div>
  <div class="mini-cards">
    <div class="mini-card">
      <div class="mini-card-icon"></div>
      <h4>CAP Theorem</h4>
      <p>Imagine a database spread across two data centers. The network cable between them gets cut. Now what? If you want both sides to agree on the latest data (Consistency), you have to stop accepting new requests until the connection is restored. If you want both sides to keep serving requests (Availability), you accept that they might disagree on some values. You cannot have both at once during a network split. This trade-off is the CAP theorem: when the network fails, choose between Consistency and Availability. Partition Tolerance (staying up despite the split) is non-negotiable for any distributed system.</p>
    </div>
    <div class="mini-card">
      <div class="mini-card-icon"></div>
      <h4>Document Store</h4>
      <p>Think of a document store like a folder of JSON files. Each document can have a different structure — one user might have three email addresses, another might have none. There is no rigid schema forcing every row to look identical. Related data is usually embedded directly inside the document rather than spread across separate tables. This makes reads fast (one fetch gets everything) but makes queries across documents more limited. Used by MongoDB, Couchbase, and Firestore.</p>
    </div>
    <div class="mini-card">
      <div class="mini-card-icon"></div>
      <h4>Wide-Column Store</h4>
      <p>A wide-column store is like a spreadsheet where different rows can have completely different columns. Data is organized by a row key, and within each row, you can store as many or as few named columns as you want. Columns that belong together are grouped into column families and stored physically close on disk, making it efficient to read just a specific set of columns across millions of rows. This design is a natural fit for time-series data: each device ID is the row key, and each timestamp becomes a column. Used by Cassandra, HBase, and Google Bigtable.</p>
    </div>
    <div class="mini-card">
      <div class="mini-card-icon"></div>
      <h4>Eventual Consistency</h4>
      <p>When you update a row in a database that has copies in multiple data centers, those copies do not all update at the exact same instant. For a brief window, different data centers may return different values for the same piece of data. Eventually, all copies will converge to the same value. This is called eventual consistency. It lets the database remain fast and available even across geographic distance. The tradeoff: if you read immediately after a write, you might get the old value. For things like social media likes or view counts, this is fine. For a bank balance, it is not.</p>
    </div>
  </div>

  <div class="section-label">How It Works</div>
  <h2 class="section-title">CAP Theorem Trade-offs</h2>
  <div class="steps">
    <div class="step-item"><div class="step-num">C</div><div><h4>Consistency</h4><p>Every read returns the most recent write. To achieve this, the database must confirm that all replicas have the update before responding. This means waiting for network round trips, which adds latency. If the network is broken, the database may have to refuse requests entirely rather than risk returning stale data. Examples: PostgreSQL running on a single node, ZooKeeper, HBase.</p></div></div>
    <div class="step-item"><div class="step-num">A</div><div><h4>Availability</h4><p>Every request gets a response, even during failures. The database never says "I am unavailable." The tradeoff is that the response might not reflect the very latest write — a replica that has not yet received the update might serve the old value. The system stays up and responsive, but correctness is slightly relaxed. Examples: CouchDB, Cassandra (at low consistency settings), DynamoDB.</p></div></div>
    <div class="step-item"><div class="step-num">P</div><div><h4>Partition Tolerance</h4><p>Network partitions — where some servers can no longer communicate with others — are inevitable in any real distributed system. Partition tolerance means the system keeps working despite this. Because partitions cannot be prevented, every distributed database must be partition tolerant. The real choice, then, is not C vs A vs P — it is: when a partition happens, which do you sacrifice: Consistency or Availability?</p></div></div>
    <div class="step-item"><div class="step-num">+</div><div><h4>PACELC Extension</h4><p>CAP only describes behavior during a network failure. The PACELC model adds what happens during normal operation: even when there is no partition, you still choose between Latency (respond fast, possibly stale) and Consistency (respond correctly, but slower). DynamoDB optimizes for low latency in normal operation. Google Spanner optimizes for consistency in all cases. This extended view is more useful for everyday database selection than CAP alone.</p></div></div>
  </div>

  <div class="table-wrap">
    <table class="compare-table">
      <thead><tr><th>Type</th><th>Examples</th><th>Data Model</th><th>Best For</th><th>Weak At</th></tr></thead>
      <tbody>
        <tr><td>Document</td><td>MongoDB, Couchbase</td><td>JSON/BSON docs</td><td>Flexible schemas, nested data</td><td>Multi-doc transactions, complex JOINs</td></tr>
        <tr><td>Key-Value</td><td>Redis, DynamoDB</td><td>Key → blob</td><td>Caching, sessions, leaderboards</td><td>Querying by value fields</td></tr>
        <tr><td>Wide-Column</td><td>Cassandra, HBase</td><td>Row key + dynamic cols</td><td>Time-series, write-heavy IoT</td><td>Complex queries, JOINs</td></tr>
        <tr><td>Graph</td><td>Neo4j, TigerGraph</td><td>Nodes + edges</td><td>Social graphs, recommendations</td><td>Tabular analytics</td></tr>
        <tr><td>Search</td><td>Elasticsearch</td><td>Inverted index</td><td>Full-text search, logs, metrics</td><td>Transactional writes</td></tr>
      </tbody>
    </table>
  </div>

  <!-- ── Redis Data Structures ──────────────────────────────── -->
  <div class="section-label">Redis Data Structures</div>
  <h2 class="section-title">Redis Beyond Simple Caching</h2>
  <p class="section-desc">Redis is not just a key-value cache. It is a data structure server. The choice of data structure determines what operations are O(1) vs O(n) and what problems each one solves efficiently.</p>

  <div class="table-wrap">
    <table class="compare-table">
      <thead>
        <tr><th>Type</th><th>Commands</th><th>O(n) operations</th><th>Use Case</th></tr>
      </thead>
      <tbody>
        <tr><td>String</td><td>GET, SET, INCR, APPEND</td><td>String ops on large values</td><td>Cache, counters, distributed locks (SET NX EX)</td></tr>
        <tr><td>List</td><td>LPUSH, RPOP, LRANGE</td><td>LINDEX, LINSERT</td><td>Message queues, activity feeds, job queues</td></tr>
        <tr><td>Hash</td><td>HGET, HSET, HMGET</td><td>HGETALL on large hashes</td><td>Object storage (user profile), session data</td></tr>
        <tr><td>Set</td><td>SADD, SISMEMBER, SUNION</td><td>SMEMBERS on large sets</td><td>Unique visitors, tag systems, friend lists</td></tr>
        <tr><td>Sorted Set</td><td>ZADD, ZRANGE, ZRANK</td><td>ZRANGEBYLEX on large sets</td><td>Leaderboards, rate limiters, time-series indexes</td></tr>
        <tr><td>Stream</td><td>XADD, XREAD, XGROUP</td><td>Full stream scan</td><td>Event sourcing, CDC, Kafka-like message log</td></tr>
        <tr><td>Bitmap</td><td>SETBIT, GETBIT, BITCOUNT</td><td>BITPOS on large bitmaps</td><td>Daily active users (1 bit/user/day = 125MB for 1B users)</td></tr>
        <tr><td>HyperLogLog</td><td>PFADD, PFCOUNT</td><td>N/A (O(1) always)</td><td>Approximate unique count with fixed 12KB memory</td></tr>
      </tbody>
    </table>
  </div>

  <!-- ── Cassandra Data Model ────────────────────────────────── -->
  <div class="section-label">Cassandra Data Modeling</div>
  <h2 class="section-title">Partition Key and Clustering Key</h2>
  <p class="section-desc">In Cassandra, data modeling is driven by access patterns — not by normalization rules. The partition key determines which node holds the data. The clustering key determines sort order within a partition. Getting this wrong means scatter-gather queries or hot partitions.</p>

  <div class="steps">
    <div class="step-item">
      <div class="step-num">1</div>
      <div>
        <h4>Partition key</h4>
        <p>In Cassandra, the partition key decides which physical server stores your data. Think of it as the address on a letter: all rows with the same partition key land on the same machine and its replicas. A query that includes the partition key goes straight to that one machine — fast and efficient. A query without it must ask every machine in the cluster — slow and expensive. Good partition keys have many distinct values that are spread evenly: user_id, device_id, or tenant_id. Bad choices are things like status (only a handful of values) or date alone (all writes for today pile onto one server).</p>
      </div>
    </div>
    <div class="step-item">
      <div class="step-num">2</div>
      <div>
        <h4>Clustering key</h4>
        <p>Once the partition key routes you to the right server, the clustering key determines the sort order of rows within that partition. This is what makes range queries fast inside a partition. For example, if your partition key is user_id and your clustering key is event_time, then <code>WHERE user_id = 42 AND event_time &gt; '2024-01-01'</code> goes straight to user 42's partition and scans forward in time order — no random searching needed. You can also use a compound partition key like <code>(user_id, month)</code> to prevent any single partition from growing too large over time.</p>
      </div>
    </div>
    <div class="step-item">
      <div class="step-num">3</div>
      <div>
        <h4>Denormalization is required</h4>
        <p>Cassandra has no JOINs. If you need to look up the same data two different ways — for example, "get orders by user" AND "get orders by status" — you must store it in two separate tables, each optimized for one access pattern. Your application writes to both tables whenever data changes. This feels redundant compared to relational design, but it is the deliberate trade-off: Cassandra gives up flexibility in querying in exchange for extremely fast, predictable writes and reads at any scale.</p>
      </div>
    </div>
    <div class="step-item">
      <div class="step-num">4</div>
      <div>
        <h4>Tunable consistency</h4>
        <p>Cassandra lets you dial in how strict you want consistency to be on each operation. <code>CONSISTENCY ONE</code> returns as soon as the first replica responds — fastest, but possibly stale. <code>CONSISTENCY ALL</code> waits for every replica to confirm — slowest, but always up to date. <code>CONSISTENCY QUORUM</code> is the sweet spot: it waits for a majority (more than half) of replicas. If you use QUORUM for both writes and reads, you are guaranteed to see the latest write, because the majority that served the read will always overlap with the majority that acknowledged the write. This is called strong consistency with RF=3 and quorum.</p>
      </div>
    </div>
  </div>

  <!-- ── Graph Database ──────────────────────────────────────── -->
  <div class="section-label">Graph Databases</div>
  <h2 class="section-title">When Native Graphs Win</h2>
  <p class="section-desc">Relational databases represent relationships with foreign keys and JOINs. For shallow relationships (1–2 hops), JOINs are fast. For deep or variable-depth relationships (friendship networks, dependency graphs, recommendation paths), JOIN count grows exponentially — graph databases win decisively.</p>

  <div class="steps">
    <div class="step-item">
      <div class="step-num">1</div>
      <div>
        <h4>Native graph storage (Neo4j)</h4>
        <p>In Neo4j, relationships between nodes are stored as physical pointers — following a relationship from one node to another is a direct memory jump, not a table scan. Compare this to a relational database: finding "friends of friends" requires two JOINs, each scanning rows to find matches. At five hops deep, the relational JOIN grows enormous. In a graph database, each hop is constant-time regardless of how large the overall graph is. This is why graph databases dominate for social networks, recommendation engines, and fraud detection — these problems are all about following chains of relationships.</p>
      </div>
    </div>
    <div class="step-item">
      <div class="step-num">2</div>
      <div>
        <h4>Cypher query language (Neo4j)</h4>
        <p>Graph databases have their own query language. Neo4j uses Cypher, which is designed to look like the graph you are drawing. The pattern <code>(Alice)-[:KNOWS]->(Bob)</code> reads almost literally: "Alice knows Bob." To find everyone within 3 hops of Alice: <code>MATCH (user:Person {name: 'Alice'})-[:KNOWS*1..3]-(friend:Person) RETURN DISTINCT friend.name</code>. The <code>*1..3</code> means "follow 1 to 3 KNOWS relationships." Writing this same query in SQL would require three separate self-joins and becomes difficult to read and slow at scale.</p>
      </div>
    </div>
    <div class="step-item">
      <div class="step-num">3</div>
      <div>
        <h4>When NOT to use a graph DB</h4>
        <p>Graph databases have clear weaknesses. Aggregating data across all nodes — like "count all users by country" — is not what they are built for; a relational or columnar database does this better. Write throughput is lower than what Cassandra or Kafka can handle. If your main queries are simple lookups like "get user by ID," the overhead of a graph database adds no value. Use a graph database when the core of your problem is navigating multi-step relationships, and the relationships themselves carry data (like the strength of a connection or the distance of a route).</p>
      </div>
    </div>
  </div>

  <div class="section-label">Interactive Demo</div>
  <h2 class="section-title">Document vs Column Store Comparison</h2>
  <p class="section-desc">Compare how a query for a specific column is handled differently in a document store (reads entire document) vs a column store (reads only target column).</p>

  <div class="diagram-box">
    <svg viewBox="0 0 700 240" role="img" aria-label="CAP theorem triangle and NoSQL store placement">
      <!-- Triangle -->
      <polygon points="350,20 100,210 600,210" style="fill:rgba(124,58,237,0.06);stroke:#7C3AED;stroke-width:2;stroke-linejoin:round"/>
      <!-- Corner labels -->
      <text x="350" y="14" text-anchor="middle" class="svg-label" style="fill:#10B981">Consistency</text>
      <text x="350" y="28" text-anchor="middle" class="svg-soft">all nodes see same data</text>
      <text x="72"  y="226" text-anchor="middle" class="svg-label" style="fill:#0EA5E9">Availability</text>
      <text x="72"  y="240" text-anchor="middle" class="svg-soft">always responds</text>
      <text x="628" y="226" text-anchor="middle" class="svg-label" style="fill:#EF4444">Partition Tolerance</text>
      <text x="628" y="240" text-anchor="middle" class="svg-soft">survives net split</text>
      <!-- Edge midpoint labels (pick 2) -->
      <text x="210" y="116" text-anchor="middle" class="svg-soft" style="fill:#10B981">CA systems</text>
      <text x="490" y="116" text-anchor="middle" class="svg-soft" style="fill:#EF4444">CP systems</text>
      <text x="350" y="220" text-anchor="middle" class="svg-soft" style="fill:#0EA5E9">AP systems</text>
      <!-- Product placement dots -->
      <circle cx="225" cy="132" r="7" fill="#0EA5E9" opacity="0.9"/>
      <text x="215" y="150" text-anchor="middle" class="svg-mono" style="font-size:10px;fill:#0EA5E9">Cassandra</text>
      <circle cx="475" cy="132" r="7" fill="#EF4444" opacity="0.9"/>
      <text x="485" y="150" text-anchor="middle" class="svg-mono" style="font-size:10px;fill:#EF4444">HBase</text>
      <circle cx="350" cy="80"  r="7" fill="#10B981" opacity="0.9"/>
      <text x="390" y="86"  text-anchor="middle" class="svg-mono" style="font-size:10px;fill:#10B981">PostgreSQL</text>
      <circle cx="350" cy="210" r="7" fill="#A78BFA" opacity="0.9"/>
      <text x="350" y="207" text-anchor="middle" class="svg-mono" style="font-size:10px;fill:#A78BFA">DynamoDB</text>
      <!-- Note: can only pick 2 -->
      <text x="350" y="53" text-anchor="middle" class="svg-soft" style="fill:#F59E0B;font-size:11px">Pick any two during a network partition</text>
      <!-- Packets along edges -->
      <circle class="pkt" cx="210" cy="116" r="5" fill="#0EA5E9"/>
      <circle class="pkt" cx="490" cy="116" r="5" fill="#EF4444"/>
      <circle class="pkt" cx="350" cy="210" r="5" fill="#A78BFA"/>
    </svg>
    <p class="diagram-caption">CAP theorem says a distributed system can guarantee only two of three properties during a network partition. In practice, partition tolerance is non-negotiable at scale, so the real choice is CP (strong consistency, possible unavailability) vs. AP (always available, eventual consistency).</p>
  </div>

  <div class="demo-section" id="demo-nosql">
    <div class="demo-header">
      <h3>Document vs Column Store Access Patterns</h3>
      <div class="demo-controls">
        <button class="demo-btn" data-action="query-total">Query: totals only</button>
        <button class="demo-btn" data-action="query-all">Query: all columns</button>
        <button class="demo-btn" data-action="reset-nosql">Reset</button>
      </div>
    </div>
    <div class="demo-canvas-wrap"></div>
    <div class="demo-hint">Bytes-read bar shows the dramatic I/O savings of column stores for selective column queries.</div>
  </div>

  <div class="section-label">Anti-patterns</div>
  <div class="antipatterns">
    <div class="antipattern">
      <h4>Using NoSQL just because it's "web scale"</h4>
      <p>NoSQL doesn't automatically scale better than SQL. PostgreSQL handles millions of rows trivially. Choose NoSQL when you have a specific use case (flexible schema, write-heavy, extreme scale) not based on hype.</p>
    </div>
    <div class="antipattern">
      <h4>Ignoring consistency requirements for "eventual consistency"</h4>
      <p>Eventual consistency means reads CAN return stale data. For financial transactions, inventory counts, or anything requiring correctness, eventual consistency is not acceptable without additional application-level conflict resolution.</p>
    </div>
    <div class="antipattern">
      <h4>Modeling NoSQL data like a relational schema</h4>
      <p>Normalizing documents into separate collections and joining them in application code undoes NoSQL's benefits. Design around access patterns: embed data that's always read together. Reference (FK-style) only when data is large and rarely accessed together.</p>
    </div>
  </div>

  <div class="section-label">Quiz</div>
  <div class="quiz-section">
    <div class="quiz-question">
      <div class="quiz-q-num">Question 1 of 3</div>
      <div class="quiz-q-text">The CAP theorem states a distributed system can guarantee at most:</div>
      <div class="quiz-options">
        <div class="quiz-option" data-correct="false" data-explanation="CAP says you can only have 2 of the 3 properties in the presence of network partitions."><span class="opt-letter">A</span>All 3: Consistency, Availability, Partition tolerance</div>
        <div class="quiz-option" data-correct="true" data-explanation="CAP theorem: choose 2 of Consistency, Availability, Partition Tolerance. Since all distributed systems must tolerate partitions, the real choice is between C and A during a partition."><span class="opt-letter">B</span>2 of Consistency, Availability, Partition Tolerance</div>
        <div class="quiz-option" data-correct="false" data-explanation="CAP specifically covers exactly these three properties."><span class="opt-letter">C</span>2 of the 4 ACID properties</div>
        <div class="quiz-option" data-correct="false" data-explanation="All distributed systems must tolerate partitions to function."><span class="opt-letter">D</span>Consistency only — availability and partition tolerance are impossible</div>
      </div>
      <div class="quiz-feedback"></div>
    </div>
    <div class="quiz-question">
      <div class="quiz-q-num">Question 2 of 3</div>
      <div class="quiz-q-text">MongoDB stores data as:</div>
      <div class="quiz-options">
        <div class="quiz-option" data-correct="false" data-explanation="Rows and columns are the relational model — MongoDB is a document store."><span class="opt-letter">A</span>Rows and columns like a relational table</div>
        <div class="quiz-option" data-correct="false" data-explanation="Key-value pairs describe Redis and DynamoDB, not MongoDB's document model."><span class="opt-letter">B</span>Simple key-value pairs only</div>
        <div class="quiz-option" data-correct="true" data-explanation="MongoDB stores data as BSON (Binary JSON) documents. Documents can have nested subdocuments and arrays. Collections are groups of documents."><span class="opt-letter">C</span>BSON (Binary JSON) documents in collections</div>
        <div class="quiz-option" data-correct="false" data-explanation="Column families describe Cassandra/HBase (wide-column stores)."><span class="opt-letter">D</span>Column families like Cassandra</div>
      </div>
      <div class="quiz-feedback"></div>
    </div>
    <div class="quiz-question">
      <div class="quiz-q-num">Question 3 of 5</div>
      <div class="quiz-q-text">For time-series data with extremely high write throughput, the best NoSQL type is:</div>
      <div class="quiz-options">
        <div class="quiz-option" data-correct="false" data-explanation="Document stores can work but aren't optimized for time-series write throughput."><span class="opt-letter">A</span>Document store (MongoDB)</div>
        <div class="quiz-option" data-correct="false" data-explanation="Graph databases are optimized for relationship traversal, not time-series writes."><span class="opt-letter">B</span>Graph database (Neo4j)</div>
        <div class="quiz-option" data-correct="false" data-explanation="Key-value stores work for simple time-series but lack the column-level query efficiency."><span class="opt-letter">C</span>Key-value store (Redis)</div>
        <div class="quiz-option" data-correct="true" data-explanation="Wide-column stores like Cassandra and HBase are designed for write-heavy time-series workloads — LSM-tree backed, partitioned by time, sorted within partitions. Used by Netflix, Twitter, IoT platforms."><span class="opt-letter">D</span>Wide-column store (Cassandra, HBase)</div>
      </div>
      <div class="quiz-feedback"></div>
    </div>
    <div class="quiz-question">
      <div class="quiz-q-num">Question 4 of 5</div>
      <div class="quiz-q-text">In Cassandra, the partition key determines:</div>
      <div class="quiz-options">
        <div class="quiz-option" data-correct="false" data-explanation="Sort order within a partition is determined by the clustering key, not the partition key."><span class="opt-letter">A</span>The sort order of rows within a table</div>
        <div class="quiz-option" data-correct="true" data-explanation="The partition key is hashed to determine which node(s) store the data. All rows with the same partition key are co-located on the same node and its replicas. Queries with the full partition key go to exactly one node — efficient. Without it, queries must scatter to all nodes."><span class="opt-letter">B</span>Which node(s) store the data — all rows with the same partition key are co-located</div>
        <div class="quiz-option" data-correct="false" data-explanation="Replication factor is a cluster-wide or keyspace-level setting, not determined by partition key."><span class="opt-letter">C</span>How many replicas are created for each row</div>
        <div class="quiz-option" data-correct="false" data-explanation="Consistency level is a per-query setting, not determined by partition key."><span class="opt-letter">D</span>The consistency level for reads and writes</div>
      </div>
      <div class="quiz-feedback"></div>
    </div>
    <div class="quiz-question">
      <div class="quiz-q-num">Question 5 of 5</div>
      <div class="quiz-q-text">Redis HyperLogLog is useful for:</div>
      <div class="quiz-options">
        <div class="quiz-option" data-correct="false" data-explanation="HyperLogLog provides approximate, not exact, unique counts — it uses statistical estimation."><span class="opt-letter">A</span>Exact unique user counting with zero error</div>
        <div class="quiz-option" data-correct="true" data-explanation="HyperLogLog approximates the cardinality (number of unique elements) of a set using a fixed 12KB of memory regardless of input size, with ~0.81% error rate. Perfect for counting unique visitors, unique queries, or any large cardinality estimate where a small error is acceptable."><span class="opt-letter">B</span>Approximate unique count with ~0.81% error using only 12KB of memory</div>
        <div class="quiz-option" data-correct="false" data-explanation="HyperLogLog does not store the actual elements — only a probabilistic sketch. You cannot retrieve members."><span class="opt-letter">C</span>Storing all unique user IDs for later retrieval</div>
        <div class="quiz-option" data-correct="false" data-explanation="Sorted sets, not HyperLogLog, are used for leaderboards with ranking."><span class="opt-letter">D</span>Implementing a leaderboard with rankings</div>
      </div>
      <div class="quiz-feedback"></div>
    </div>
  </div>

  <div class="section-label">Interview Q&A</div>
  <div class="qa-section">
    <div class="qa-item">
      <div class="qa-q">When would you choose NoSQL over a relational database?<span class="qa-chevron">▾</span></div>
      <div class="qa-a"><div class="qa-a-inner">Choose NoSQL when: (1) <strong>Schema flexibility</strong>data structure varies per record (e.g., product catalog with different attributes per product type). (2) <strong>Extreme write throughput</strong>Cassandra/HBase handle millions of writes/sec. (3) <strong>Horizontal scale-out</strong>data too large for a single machine and JOINs aren't needed. (4) <strong>Specific access patterns</strong>graph traversal (Neo4j), full-text search (Elasticsearch), sub-millisecond cache (Redis). (5) <strong>Operational simplicity</strong>schema-less development in early-stage products. Stick with SQL when: ACID transactions required, complex queries with JOINs, strong data integrity, team SQL expertise.</div></div>
    </div>
    <div class="qa-item">
      <div class="qa-q">What is eventual consistency and what are its implications?<span class="qa-chevron">▾</span></div>
      <div class="qa-a"><div class="qa-a-inner">Eventual consistency means that given no new writes, all replicas will converge to the same value — eventually. In the window before convergence, different nodes may return different values for the same key. Implications: (1) Don't use for money transfers, inventory counts, or any scenario where stale reads cause business errors. (2) Design around it: use append-only event logs, CRDTs (Conflict-free Replicated Data Types), or vector clocks for conflict resolution. (3) In Cassandra: <code>CONSISTENCY QUORUM</code> (read+write quorum) gives you strong consistency at the cost of availability. <code>CONSISTENCY ONE</code> is eventual. (4) The "eventual" window can be milliseconds (same DC) or seconds (cross-region). Measure it for your workload.</div></div>
    </div>
    <div class="qa-item">
      <div class="qa-q">How does MongoDB's document model differ from a relational schema design-wise?<span class="qa-chevron">▾</span></div>
      <div class="qa-a"><div class="qa-a-inner">Relational: normalize data — one fact in one place, JOINs to reconstruct. MongoDB: design around access patterns — embed data that's read together, reference data that's large or shared. Example: a blog post with comments. Relational: posts table + comments table, JOIN on query. MongoDB: embed comments array inside the post document (if always read together). Trade-offs of embedding: single document read (fast, atomic), but document grows unboundedly and you can't easily query comments across posts. Trade-offs of referencing: JOINs via <code>$lookup</code> (2 queries), but modular and supports independent comment queries. The right choice depends on your dominant read pattern.</div></div>
    </div>
    <div class="qa-item">
      <div class="qa-q">Explain the BASE model and how it contrasts with ACID.<span class="qa-chevron">▾</span></div>
      <div class="qa-a"><div class="qa-a-inner">BASE (coined for NoSQL systems): <strong>Basically Available</strong> — system is available even during partial failures. <strong>Soft state</strong> — state may change over time even without input (replicas catching up). <strong>Eventually consistent</strong> — all replicas converge given no new writes. ACID vs BASE: ACID is pessimistic (prevent inconsistency) and offers strong guarantees at the cost of availability and throughput. BASE is optimistic (tolerate inconsistency temporarily) and offers high availability and throughput at the cost of immediate consistency. Modern NewSQL databases (Spanner, CockroachDB) provide ACID + horizontal scale, blurring the boundary.</div></div>
    </div>
    <div class="qa-item">
      <div class="qa-q">How does DynamoDB handle high availability and consistency?<span class="qa-chevron">▾</span></div>
      <div class="qa-a"><div class="qa-a-inner">DynamoDB replicates data across 3 AZs in a region. By default it offers <strong>eventually consistent reads</strong> — might return stale data from a replica not yet synchronized; fastest and cheapest. <strong>Strongly consistent reads</strong> route to the primary replica and always return the latest committed data — costs 2x read capacity units. DynamoDB is classified as AP in CAP. For global tables (multi-region), it uses last-write-wins conflict resolution. Transactions: DynamoDB supports ACID transactions across multiple items via <code>TransactWriteItems</code> / <code>TransactGetItems</code> — serializable but limited to 25 items and single-region. Pricing is capacity-unit based — design access patterns to be efficient. Single-table design patterns use composite keys and GSIs (Global Secondary Indexes) to support multiple access patterns without JOINs.</div></div>
    </div>
    <div class="qa-item">
      <div class="qa-q">Explain Redis data structures and which to use for a leaderboard.<span class="qa-chevron">▾</span></div>
      <div class="qa-a"><div class="qa-a-inner">Redis Sorted Sets are the native leaderboard data structure. <code>ZADD leaderboard 15230 "alice"</code> adds alice with score 15230. <code>ZREVRANK leaderboard "alice"</code> returns alice's rank (0-indexed from highest). <code>ZREVRANGE leaderboard 0 9 WITHSCORES</code> returns the top 10 with scores. Score updates: <code>ZINCRBY leaderboard 500 "alice"</code> atomically adds 500 to alice's score. Why not a SQL table with an index? SQL requires a query + sort + limit — works fine for small leaderboards. For leaderboards with millions of entries updated thousands of times per second, Redis Sorted Set's O(log n) write and O(log n + k) range read are ideal. Combine with Redis HyperLogLog for "unique players today" metrics and a Hash for player metadata (name, avatar URL) — three data structures, three problems solved.</div></div>
    </div>
    <div class="qa-item">
      <div class="qa-q">What are CRDTs and when are they used in distributed databases?<span class="qa-chevron">▾</span></div>
      <div class="qa-a"><div class="qa-a-inner">CRDTs (Conflict-free Replicated Data Types) are data structures mathematically designed so that any concurrent updates can always be merged without conflict. The merge operation must be commutative (A merge B = B merge A), associative ((A merge B) merge C = A merge (B merge C)), and idempotent (A merge A = A). Examples: G-Counter (grow-only counter — only increments, each replica maintains its own counter, total is sum); 2P-Set (two-phase set — add and remove sets, a removed element can never be re-added); LWW-Register (Last Write Wins — each value tagged with timestamp); PN-Counter (positive-negative counter — supports decrement via two G-Counters). Used by: Riak (peer-to-peer KV), Redis (CRDT-based counters), collaborative editing tools (Google Docs uses operational transforms, a related concept), Dynamo-style databases. CRDTs are suitable when you need eventual consistency without conflict resolution logic and the data structure semantics match the CRDT constraints.</div></div>
    </div>
    <div class="qa-item">
      <div class="qa-q">When should you use MongoDB versus PostgreSQL?<span class="qa-chevron">▾</span></div>
      <div class="qa-a"><div class="qa-a-inner">Use MongoDB when: (1) Data is genuinely document-shaped and varies between records (e-commerce product catalog where electronics have different attributes from clothing). (2) You need flexible schema evolution without ALTER TABLE migrations. (3) Data is always accessed as a whole document (embedded comments, blog posts with metadata). (4) Team prefers JSON-native APIs and doesn't need complex SQL analytics. Use PostgreSQL when: (1) Data has regular structure and strong relationships (users, orders, products with FK constraints). (2) ACID transactions across multiple tables required. (3) Complex analytical queries with JOINs, window functions, CTEs. (4) Schema validation and constraint enforcement needed. (5) Full MVCC and point-in-time recovery needed. Note: PostgreSQL with JSONB columns can serve many document store use cases while retaining full SQL power — a common "best of both" choice.</div></div>
    </div>
  </div>

  <div class="section-label">Further Reading</div>
  <div class="reading-links">
    <a class="reading-link" href="https://www.mongodb.com/docs/manual/data-modeling/" target="_blank">MongoDB Data Modeling</a>
    <a class="reading-link" href="https://cassandra.apache.org/doc/latest/cassandra/data_modeling/" target="_blank">Cassandra Data Modeling</a>
    <a class="reading-link" href="https://jepsen.io/consistency" target="_blank">Consistency Models (Jepsen)</a>
  </div>

  <div class="topic-nav">
    <a href="09-storage-engines.html" class="topic-nav-link"><div class="topic-nav-arrow">←</div><div><div class="topic-nav-label">Previous</div><div class="topic-nav-title">Storage Engines</div></div></a>
    <a href="11-distributed-databases.html" class="topic-nav-link next"><div class="topic-nav-arrow">→</div><div><div class="topic-nav-label">Next</div><div class="topic-nav-title">Distributed Databases</div></div></a>
  </div>
</div>

<script src="../js/main.js"></script>
<script src="../js/demos.js"></script>
</body>
</html>