fix: [backfill] avoid quadratic varchar over-allocation in allocateVectors (#85)
Merged
sre-ci-robot merged 1 commit into zilliztech:main (Apr 16, 2026)

Conversation
Arrow Java's BaseVariableWidthVector.setInitialCapacity(valueCount, density)
expects `density` as bytes per value, not total bytes. The old code in
MilvusV2BinlogWriter and MilvusLoonWriter passed `batchSize * 32L` as the
second argument, so Arrow computed the data buffer size as
valueCount × density = batchSize × (batchSize × 32) = batchSize² × 32
At the default batchSize=1024 this pre-allocates 32 MiB per varchar
column. With two varchar targets per writer, plus the export-side retain
and the C++ cached_batches_ holding prior buffers, a single RootAllocator
peaked at ~1 GiB in a local repro (with 64 MiB reported as `actual`
leaked on close) for a 20480-row segment carrying ~820 KB of real data,
roughly a 1000× over-allocation. Production surfaced the same bug as a
~7 GiB per-writer peak × 4 concurrent writers = direct-memory OOM.
Pass density = 32.0 directly so the initial allocation reflects the
intended "≈32 bytes per varchar value" estimate. At default batch size
this drops per-column pre-allocation from 32 MiB to 32 KiB.
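The bad and fixed calls can be sketched as below. This is an illustrative, dependency-free sketch: the `dataBufferBytes` helper is hypothetical and only mirrors the rough `valueCount × density` sizing that `BaseVariableWidthVector.setInitialCapacity(int, double)` performs, before the allocator's own power-of-two rounding.

```java
// Sketch (no Arrow dependency): mirrors how setInitialCapacity(valueCount,
// density) sizes the variable-width data buffer. Helper name is hypothetical.
public class DensityDemo {
    // Roughly what Arrow computes for the initial data-buffer size:
    // valueCount values at `density` bytes each.
    static long dataBufferBytes(int valueCount, double density) {
        return (long) Math.ceil(valueCount * density);
    }

    public static void main(String[] args) {
        int batchSize = 1024;
        // Old (buggy): total bytes passed where bytes-per-value was expected,
        // so the estimate becomes batchSize * (batchSize * 32) = batchSize^2 * 32.
        long buggy = dataBufferBytes(batchSize, batchSize * 32.0);
        // Fixed: density is "~32 bytes per varchar value".
        long fixed = dataBufferBytes(batchSize, 32.0);
        System.out.println(buggy); // 33554432 bytes = 32 MiB
        System.out.println(fixed); // 32768 bytes    = 32 KiB
    }
}
```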
Verified locally: the same repro (`--batch-size 1024`, 2 varchar columns
length 20, 20480 rows) that previously died with
`OutOfMemoryException: Failure allocating buffer` + `Memory leaked:
(67125248)` now completes successfully with no allocator warnings.
The quadratic-scaling evidence (`actual = batchSize² × 32 × numCols`):
batchSize | observed actual leaked | batchSize² × 32 × 2
----------|------------------------|--------------------
1024 | 67,125,248 | 67,108,864
2048 | 268,468,224 | 268,435,456
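The table's prediction column can be reproduced with a few lines of arithmetic (the `predictedLeak` helper is hypothetical, named here only to restate the `batchSize² × 32 × numCols` formula):

```java
// Recomputes the "batchSize^2 x 32 x numCols" column of the table above.
public class QuadraticCheck {
    static long predictedLeak(long batchSize, int numVarcharCols) {
        // One over-allocated data buffer per varchar column:
        // batchSize values, each mistakenly sized at batchSize * 32 bytes.
        return batchSize * batchSize * 32L * numVarcharCols;
    }

    public static void main(String[] args) {
        System.out.println(predictedLeak(1024, 2)); // 67108864, vs observed 67,125,248
        System.out.println(predictedLeak(2048, 2)); // 268435456, vs observed 268,468,224
    }
}
```

The small gap between predicted and observed values is consistent with the offset/validity buffers and allocator rounding on top of the quadratic data buffers.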
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Collaborator: /lgtm
Collaborator: [APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: congqixia, liliu-z.