Commit 3859cc8
fix: deduplicate doc counts in term aggregation for multi-valued fields (#2854)
* fix: deduplicate doc counts in term aggregation for multi-valued fields
Term aggregation was counting term occurrences instead of documents
for multi-valued fields. A document with the same value appearing
multiple times would inflate doc_count.
Add `fetch_block_with_missing_unique_per_doc` to ColumnBlockAccessor
that deduplicates (doc_id, value) pairs, and use it in term aggregation.
Fixes #2721
* refactor: only deduplicate for multivalue cardinality
Duplicates can only occur with multivalue columns, so narrow the
check from !is_full() to is_multivalue().
* fix: handle non-consecutive duplicate values in dedup
Sort values within each doc_id group before deduplicating, so that
non-adjacent duplicates are correctly handled.
Add unit tests for dedup_docid_val_pairs: consecutive duplicates,
non-consecutive duplicates, multi-doc groups, no duplicates, and
single element.
* perf: skip dedup when block has no multivalue entries
Add early return when no consecutive doc_ids are equal, avoiding
unnecessary sort and dedup passes. Remove the 2-element swap
optimization as it is not needed by the dedup algorithm.
---------
Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>1 parent 545169c commit 3859cc8
File tree
2 files changed
+134
-8
lines changed- columnar/src
- src/aggregation/bucket
2 files changed
+134
-8
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
58 | 58 | | |
59 | 59 | | |
60 | 60 | | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
| 112 | + | |
| 113 | + | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
61 | 133 | | |
62 | 134 | | |
63 | 135 | | |
| |||
163 | 235 | | |
164 | 236 | | |
165 | 237 | | |
| 238 | + | |
| 239 | + | |
| 240 | + | |
| 241 | + | |
| 242 | + | |
| 243 | + | |
| 244 | + | |
| 245 | + | |
| 246 | + | |
| 247 | + | |
| 248 | + | |
| 249 | + | |
| 250 | + | |
| 251 | + | |
| 252 | + | |
| 253 | + | |
| 254 | + | |
| 255 | + | |
| 256 | + | |
| 257 | + | |
| 258 | + | |
| 259 | + | |
| 260 | + | |
| 261 | + | |
| 262 | + | |
| 263 | + | |
| 264 | + | |
| 265 | + | |
| 266 | + | |
| 267 | + | |
| 268 | + | |
| 269 | + | |
| 270 | + | |
| 271 | + | |
| 272 | + | |
| 273 | + | |
| 274 | + | |
| 275 | + | |
| 276 | + | |
| 277 | + | |
| 278 | + | |
| 279 | + | |
| 280 | + | |
| 281 | + | |
| 282 | + | |
| 283 | + | |
| 284 | + | |
| 285 | + | |
| 286 | + | |
| 287 | + | |
| 288 | + | |
| 289 | + | |
166 | 290 | | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
807 | 807 | | |
808 | 808 | | |
809 | 809 | | |
810 | | - | |
811 | | - | |
812 | | - | |
813 | | - | |
814 | | - | |
| 810 | + | |
| 811 | + | |
| 812 | + | |
| 813 | + | |
| 814 | + | |
| 815 | + | |
| 816 | + | |
815 | 817 | | |
816 | 818 | | |
817 | 819 | | |
| |||
2347 | 2349 | | |
2348 | 2350 | | |
2349 | 2351 | | |
2350 | | - | |
| 2352 | + | |
2351 | 2353 | | |
2352 | 2354 | | |
2353 | 2355 | | |
| |||
2356 | 2358 | | |
2357 | 2359 | | |
2358 | 2360 | | |
2359 | | - | |
| 2361 | + | |
2360 | 2362 | | |
2361 | 2363 | | |
2362 | 2364 | | |
| |||
2370 | 2372 | | |
2371 | 2373 | | |
2372 | 2374 | | |
2373 | | - | |
| 2375 | + | |
2374 | 2376 | | |
2375 | 2377 | | |
2376 | 2378 | | |
| |||
0 commit comments