fix: [backfill] avoid quadratic varchar over-allocation in allocateVectors#85

Merged
sre-ci-robot merged 1 commit into zilliztech:main from congqixia:fix/use_correct_density on Apr 16, 2026
congqixia (Contributor) commented Apr 16, 2026

Arrow Java's `BaseVariableWidthVector.setInitialCapacity(valueCount, density)` expects `density` as bytes per value, not total bytes. The old code in `MilvusV2BinlogWriter` and `MilvusLoonWriter` passed `batchSize * 32L` as the second argument, so Arrow computed the data buffer size as

valueCount × density = batchSize × (batchSize × 32) = batchSize² × 32

At the default `batchSize=1024` this pre-allocates 32 MiB per varchar column. With two varchar targets per writer, plus the export-side retain and the C++ `cached_batches_` holding prior buffers, a single `RootAllocator` peaked at ~1 GiB in the local repro (with 64 MiB `actual` leaked on close) for a 20480-row segment carrying ~820 KB of real data, roughly a 1000× over-allocation. Production surfaced the same bug as a ~7 GiB per-writer peak; with 4 concurrent writers that ended in a direct-memory OOM.
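
For reviewers who want to sanity-check the "~820 KB" and "roughly 1000×" figures, here is a small arithmetic sketch. It assumes the real-data estimate is rows × columns × value length using the repro parameters quoted in this description (20480 rows, 2 varchar columns, length 20), that each character takes one byte, and that "~1 GiB" is exactly 2^30 bytes. The class and method names are illustrative only and do not exist in the repository.

```java
// Illustrative arithmetic only; names here are hypothetical.
final class OverAllocation {
    // Payload bytes, assuming one byte per character (an assumption).
    static long realDataBytes(long rows, int cols, int valueLen) {
        return rows * cols * valueLen;
    }

    // Integer ratio of observed peak to real payload.
    static long factor(long peakBytes, long realBytes) {
        return peakBytes / realBytes;
    }

    public static void main(String[] args) {
        long real = realDataBytes(20480, 2, 20); // 819200 bytes, i.e. ~820 KB
        long peak = 1L << 30;                    // "~1 GiB" taken as exactly 2^30
        System.out.println("real data bytes = " + real);
        System.out.println("over-allocation = " + factor(peak, real) + "x"); // 1310x
    }
}
```

Under these assumptions the factor comes out near 1300×, which is consistent with the "roughly 1000×" order of magnitude stated above.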

Pass density = 32.0 directly so the initial allocation reflects the intended "≈32 bytes per varchar value" estimate. At default batch size this drops per-column pre-allocation from 32 MiB to 32 KiB.
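
The before/after numbers can be reproduced with a simplified model of the sizing rule. This is a sketch, not Arrow's code: it assumes the data buffer is sized as `max((long) (valueCount * density), 1)` and ignores the offset and validity buffers as well as any rounding the allocator may apply. `DensitySizing` and `dataBufferBytes` are hypothetical names used only for illustration.

```java
// Simplified model of data-buffer sizing from (valueCount, density).
// Assumption: size = max((long) (valueCount * density), 1). Offset and
// validity buffers, and allocator rounding, are deliberately ignored.
final class DensitySizing {
    static long dataBufferBytes(int valueCount, double density) {
        return Math.max((long) (valueCount * density), 1L);
    }

    public static void main(String[] args) {
        int batchSize = 1024;

        // Old call shape: total bytes passed where bytes-per-value is expected.
        long before = dataBufferBytes(batchSize, batchSize * 32L);

        // Fixed call shape: 32 bytes per value.
        long after = dataBufferBytes(batchSize, 32.0);

        System.out.println("before = " + before + " bytes"); // 33554432 (32 MiB)
        System.out.println("after  = " + after + " bytes");  // 32768 (32 KiB)
    }
}
```

Because the old argument itself grows with `batchSize`, doubling the batch size quadruples the pre-allocation in this model, while the fixed call only doubles it.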

Verified locally: the same repro (`--batch-size 1024`, 2 varchar columns of length 20, 20480 rows) that previously died with `OutOfMemoryException: Failure allocating buffer` + `Memory leaked: (67125248)` now completes successfully with no allocator warnings. The quadratic-scaling evidence (`actual = batchSize² × 32 × numCols`):

batchSize | observed actual leaked | batchSize² × 32 × 2
----------|------------------------|--------------------
    1024  |    67,125,248          |   67,108,864
    2048  |   268,468,224          |  268,435,456

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
liliu-z (Collaborator) commented Apr 16, 2026

/lgtm
/approve

sre-ci-robot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: congqixia, liliu-z


sre-ci-robot merged commit aa222df into zilliztech:main on Apr 16, 2026
3 checks passed