fix: [backfill] avoid quadratic varchar over-allocation in allocateVectors by congqixia · Pull Request #85 · zilliztech/spark-milvus

congqixia · 2026-04-16T15:25:26Z

Arrow Java's BaseVariableWidthVector.setInitialCapacity(valueCount, density) expects density as bytes per value, not total bytes. The old code in MilvusV2BinlogWriter and MilvusLoonWriter passed batchSize * 32L as the second argument, so Arrow computed the data buffer size as

valueCount × density = batchSize × (batchSize × 32) = batchSize² × 32

At the default batchSize=1024 this pre-allocates 32 MiB per varchar column. With two varchar targets per writer, plus the export-side retain and the C++ cached_batches_ holding prior buffers, a single RootAllocator peaked at ~1 GiB in the local repro (with 64 MiB actual leaked on close) for a 20480-row segment carrying ~820 KB of real data, roughly a 1000× over-allocation. In production the same bug surfaced as a ~7 GiB per-writer peak; with 4 concurrent writers that resulted in a direct-memory OOM.
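As a rough sanity check on the "~820 KB" and "roughly 1000×" figures, the sketch below recomputes them from the repro parameters quoted in this description (20480 rows, 2 varchar columns of length 20, ~1 GiB peak). The class and method names are hypothetical, and the payload estimate assumes only the raw string bytes count as real data (no offset or validity buffers).

```java
// Hypothetical helper; estimates real payload vs. observed allocator peak.
final class OverAllocationEstimate {
    // Raw string bytes only: rows × columns × bytes per value (assumption).
    static long payloadBytes(long rows, int varcharCols, int bytesPerValue) {
        return rows * varcharCols * bytesPerValue;
    }

    // Integer ratio of peak allocation to real payload.
    static long overAllocationFactor(long peakBytes, long payloadBytes) {
        return peakBytes / payloadBytes;
    }

    public static void main(String[] args) {
        long payload = payloadBytes(20_480L, 2, 20);
        long peak = 1L << 30; // ~1 GiB peak reported for the local repro
        System.out.println("payload bytes: " + payload);
        System.out.println("over-allocation factor: " + overAllocationFactor(peak, payload));
    }
}
```

Under these assumptions the payload is 819,200 bytes and the factor comes out at about 1,300, the same order of magnitude as the 1000× quoted above.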

Pass density = 32.0 directly so the initial allocation reflects the intended "≈32 bytes per varchar value" estimate. At default batch size this drops per-column pre-allocation from 32 MiB to 32 KiB.
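The before/after arithmetic can be sketched in plain Java without an Arrow dependency. The class and method below are hypothetical and only model the `valueCount × density` product described above; they are not Arrow's code, and the real allocator may round the request up, so the results are requested sizes, not exact buffer sizes.

```java
// Hypothetical model of the data-buffer size requested via
// setInitialCapacity(valueCount, density); not Arrow's actual implementation.
final class DensityPrealloc {
    // density is the expected number of bytes per value.
    static long requestedDataBytes(int valueCount, double density) {
        return Math.max((long) (valueCount * density), 1L);
    }

    public static void main(String[] args) {
        int batchSize = 1024;
        // Old call: total bytes passed where bytes-per-value was expected.
        long before = requestedDataBytes(batchSize, batchSize * 32L);
        // Fixed call: density is the per-value estimate.
        long after = requestedDataBytes(batchSize, 32.0);
        System.out.println("before fix: " + before + " bytes");
        System.out.println("after fix:  " + after + " bytes");
    }
}
```

For batchSize=1024 this yields 33,554,432 bytes for the old argument and 32,768 bytes for the new one, matching the 32 MiB and 32 KiB figures above.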

Verified locally: the same repro (`--batch-size 1024`, 2 varchar columns of length 20, 20480 rows) that previously died with `OutOfMemoryException: Failure allocating buffer` + `Memory leaked: (67125248)` now completes successfully with no allocator warnings. The quadratic-scaling evidence (`actual = batchSize² × 32 × numCols`):

| batchSize | observed actual leaked | batchSize² × 32 × 2 |
|-----------|------------------------|---------------------|
| 1024 | 67,125,248 | 67,108,864 |
| 2048 | 268,468,224 | 268,435,456 |
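The table can be checked with a few lines of Java. This is a minimal sketch with hypothetical names; it only recomputes the predicted column from the formula above and subtracts it from the observed values, and makes no claim about where the remainder comes from.

```java
// Hypothetical check of the table rows against batchSize² × 32 × numCols.
final class QuadraticCheck {
    static long predictedBytes(long batchSize, int numCols) {
        return batchSize * batchSize * 32L * numCols;
    }

    // Observed leaked bytes minus the quadratic prediction.
    static long residual(long batchSize, int numCols, long observedBytes) {
        return observedBytes - predictedBytes(batchSize, numCols);
    }

    public static void main(String[] args) {
        // {batchSize, observed leaked bytes} as reported in the table.
        long[][] rows = { {1024L, 67_125_248L}, {2048L, 268_468_224L} };
        for (long[] row : rows) {
            System.out.println(row[0]
                + ": predicted=" + predictedBytes(row[0], 2)
                + " residual=" + residual(row[0], 2, row[1]));
        }
    }
}
```

Doubling batchSize quadruples the predicted term, while the residual only doubles (16,384 then 32,768 bytes), which is consistent with a quadratic data-buffer term dominating a small linear remainder.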

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>

liliu-z · 2026-04-16T15:57:00Z

/lgtm
/approve

sre-ci-robot · 2026-04-16T15:57:07Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: congqixia, liliu-z


Needs approval from an approver in each of these files:

~~OWNERS~~ [liliu-z]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sre-ci-robot requested review from czs007 and liliu-z April 16, 2026 15:25

sre-ci-robot added the size/S label Apr 16, 2026

sre-ci-robot assigned liliu-z Apr 16, 2026

sre-ci-robot added the lgtm label Apr 16, 2026

sre-ci-robot added the approved label Apr 16, 2026

sre-ci-robot merged commit aa222df into zilliztech:main Apr 16, 2026
3 checks passed

sre-ci-robot merged 1 commit into zilliztech:main from congqixia:fix/use_correct_density
