
feat(parquet/file): pre-allocate BinaryBuilder data buffer using column chunk metadata to eliminate resize overhead#689

Merged
zeroshade merged 6 commits into apache:main from junyan-ling:parquet-binary-prealloc
Mar 5, 2026

Conversation

@junyan-ling
Contributor

@junyan-ling junyan-ling commented Mar 5, 2026

Rationale for this change

This PR addresses issue #688.

byteArrayRecordReader builds binary/string Arrow arrays using array.BinaryBuilder, but the builder's data buffer starts empty and grows via repeated doublings as values are appended. For large binary columns this causes O(log n) realloc+copy cycles per row group, wasting both time and memory.

This PR threads column chunk size metadata (TotalUncompressedSize, NumRows) from columnIterator.NextChunk() down to leafReader, and uses it to pre-allocate the builder's data buffer at the start of each LoadBatch call via BinaryBuilder.ReserveData.
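To make the growth cost concrete, here is a small stand-alone Go sketch (not the actual builder code) that counts bytes moved when appending fixed-size values into a capacity-doubling buffer versus a pre-allocated one:

```go
package main

import "fmt"

// bytesMoved simulates appending numValues values of valSize bytes each into a
// growable buffer. It returns the bytes copied value-by-value (unavoidable)
// and the extra bytes re-copied by capacity-doubling reallocations.
func bytesMoved(numValues, valSize int, prealloc bool) (perValue, regrow int) {
	capacity, length := 0, 0
	if prealloc {
		capacity = numValues * valSize // what a metadata-based reservation buys us
	}
	for i := 0; i < numValues; i++ {
		if length+valSize > capacity {
			if capacity == 0 {
				capacity = valSize
			}
			for length+valSize > capacity {
				capacity *= 2
			}
			regrow += length // a realloc copies all existing bytes
		}
		perValue += valSize // every value is copied in exactly once
		length += valSize
	}
	return perValue, regrow
}

func main() {
	pv, rg := bytesMoved(1000, 1024, false)
	fmt.Printf("grow-as-you-go: %d B appended, %d B re-copied by doublings\n", pv, rg)
	pv, rg = bytesMoved(1000, 1024, true)
	fmt.Printf("pre-allocated:  %d B appended, %d B re-copied by doublings\n", pv, rg)
}
```

With 1,000 values of 1 KiB each, the doubling path re-copies roughly as many bytes again as were appended, while the pre-allocated path re-copies none.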

What changes are included in this PR?

  • parquet/file/record_reader.go: adds ReserveData(int64) to BinaryRecordReader interface and implements it on
    byteArrayRecordReader; adds a no-op implementation on flbaRecordReader.
  • parquet/pqarrow/file_reader.go: columnIterator.NextChunk() now returns (PageReader, uncompressedBytes, numRows, error).
  • parquet/pqarrow/column_readers.go: leafReader stores current row group metadata; LoadBatch calls
    reserveBinaryData(nrecords) after each reset; nextRowGroup takes a remainingRows parameter to extend the reservation when crossing
    row group boundaries mid-batch.
  • parquet/pqarrow/properties.go: adds PreAllocBinaryData bool to ArrowReadProperties (default: false).

Opt in via:

// pf is an open parquet file.Reader; mem is a memory.Allocator
props := pqarrow.ArrowReadProperties{
    PreAllocBinaryData: true,
}
reader, err := pqarrow.NewFileReader(pf, props, mem)

Are these changes tested?

Yes. parquet/pqarrow/binary_prealloc_test.go covers:

  • Default flag value is false (no behaviour change for existing callers)
  • Correctness of output for binary, string, nullable, int32, FLBA, and dict-encoded columns
  • All batch size configurations: unbounded, one batch per row group, multiple batches per row group, and batches that span row group
    boundaries

Benchmark in parquet/pqarrow/reader_writer_test.go (BenchmarkPreAllocBinaryData) compares prealloc=false vs prealloc=true on a two-column
schema (slim string id + fat binary blob, 5 KB–50 KB values, Zstd, 2 row groups × 484 rows):

Environment: Apple M1 Max · count=3 · medians reported

┌────────────────┬─────────────┬─────────────┬────────┬─────────────┬─────────────┬────────┬────────────────┬───────────────┬─────────┐
│ Sub-benchmark  │   ns/op     │   ns/op     │   Δ    │    B/op     │ B/op (true) │ Δ B/op │   allocs/op    │  allocs/op    │   Δ     │
│                │   (false)   │   (true)    │ ns/op  │   (false)   │             │        │    (false)     │    (true)     │ allocs  │
├────────────────┼─────────────┼─────────────┼────────┼─────────────┼─────────────┼────────┼────────────────┼───────────────┼─────────┤
│ batchAll       │   9,117,272 │   7,993,732 │ -12.3% │ 144,021,824 │ 115,098,562 │ -20.1% │            511 │           494 │   -3.3% │
├────────────────┼─────────────┼─────────────┼────────┼─────────────┼─────────────┼────────┼────────────────┼───────────────┼─────────┤
│ batchPerRG     │   9,190,661 │   8,083,567 │ -12.0% │ 144,024,680 │ 115,096,686 │ -20.1% │            513 │           493 │   -3.9% │
├────────────────┼─────────────┼─────────────┼────────┼─────────────┼─────────────┼────────┼────────────────┼───────────────┼─────────┤
│ batchQuarterRG │   9,116,379 │   7,896,174 │ -13.4% │ 144,023,299 │ 115,097,206 │ -20.1% │            512 │           493 │   -3.7% │
└────────────────┴─────────────┴─────────────┴────────┴─────────────┴─────────────┴────────┴────────────────┴───────────────┴─────────┘

Note: production workloads with larger values (~250 KB/row) will see larger improvements, since more reallocation doublings are eliminated at greater value sizes. This benchmark uses 5–50 KB values to keep runtime practical.

Are there any user-facing changes?

Yes, opt-in. A new field PreAllocBinaryData bool is added to ArrowReadProperties. It defaults to false, so all existing code is unaffected. Users with large binary or string columns can enable it to reduce memory allocations and improve read throughput.

@junyan-ling junyan-ling requested a review from zeroshade as a code owner March 5, 2026 21:34
Member

@zeroshade zeroshade left a comment


Thanks for this! Just one question really below. Though you do have some linting issues to fix 😄


func (fr *flbaRecordReader) ReadDictionary() bool { return false }

func (fr *flbaRecordReader) ReserveData(int64) {}
Member


Can't we also figure out how much to reserve here too?

Contributor Author


FixedSizeBinaryBuilder doesn't have a variable-length data buffer: since all values are the same width, the buffer size is fully determined by the slot count, which ReserveValues already handles via bldr.Reserve(n). ReserveData is only meaningful for byteArrayRecordReader, where value lengths vary.
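A toy illustration of the distinction (the function names here are hypothetical, not arrow-go APIs):

```go
package main

import "fmt"

// fixedDataSize: for fixed-size binary, slots x width fully determines the
// data buffer, so reserving slots (as ReserveValues does) already sizes it.
func fixedDataSize(slots, width int) int { return slots * width }

// variableDataSize: for variable-length binary, only the sum of the value
// lengths (known in aggregate from chunk metadata such as
// TotalUncompressedSize) determines the data buffer, hence the separate
// ReserveData hook for that case.
func variableDataSize(lengths []int) int {
	total := 0
	for _, n := range lengths {
		total += n
	}
	return total
}

func main() {
	fmt.Println(fixedDataSize(484, 16))            // 484 slots of a 16-byte FLBA
	fmt.Println(variableDataSize([]int{5, 50, 8})) // three variable-width values
}
```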

Member


Gotcha! Sounds good

@junyan-ling junyan-ling changed the title from "Parquet binary prealloc" to "pre-allocate BinaryBuilder data buffer using column chunk metadata to eliminate resize overhead" Mar 5, 2026
@zeroshade zeroshade changed the title from "pre-allocate BinaryBuilder data buffer using column chunk metadata to eliminate resize overhead" to "feat(parquet/file): pre-allocate BinaryBuilder data buffer using column chunk metadata to eliminate resize overhead" Mar 5, 2026
@zeroshade zeroshade merged commit 6b44167 into apache:main Mar 5, 2026
40 of 41 checks passed
@junyan-ling
Contributor Author

Hi @zeroshade, thanks a lot for merging the PR! I'm curious about the release schedule: what does the cadence look like, and what's the best way for us to pick up this commit in our arrow-go before the next release?

@zeroshade
Member

We just did a release last week, so ideally we'd wait a couple weeks at least before doing a new release. Depending on your own requirements, you could update your own go.mod to point at the commit hash directly (or use a replace directive) if you're okay with that. Is it urgent for you that there be a release with this change soon?

@junyan-ling
Contributor Author

junyan-ling commented Mar 10, 2026

> We just did a release last week, so ideally we'd wait a couple weeks at least before doing a new release. Depending on your own requirements, you could update your own go.mod to point at the commit hash directly (or use a replace directive) if you're okay with that. Is it urgent for you that there be a release with this change soon?

Thanks @zeroshade! Sounds good, I can build an internal version cherry-picking this commit. Thanks.

That said, I need to point out that after cherry-picking this change, the CPU profiles from our e2e benchmark jobs on k8s did not show a dramatic improvement in CPU time. After further investigation, here are the reasons:

  1. Pre-allocation only ensures the destination buffer is large enough, and it does not eliminate the per-value copy. When we read a parquet page, the decompressed bytes live in the page buffer. Each value then gets copied individually from the page buffer into the BinaryBuilder's data buffer via bufferBuilder.Append → copy() → memmove. This copy happens for every single value regardless of how large the destination buffer is. Pre-allocation prevents the destination from needing to resize mid-batch, but the fundamental copy from source to destination is unavoidable. This accounts for most of the memmove cost in the profiles.
  2. Pre-allocation itself has a CPU cost: zeroing. Both Go's runtime and Arrow's memory allocator zero-fill newly allocated memory. When ReserveData pre-allocates the data buffer, every byte is zeroed via memset before any data is written. This zeroing is wasted work: we are about to overwrite every byte with actual values.

@zeroshade
Copy link
Copy Markdown
Member

You're absolutely correct there. This really gives the best benefit to very specific use cases that were causing multiple reallocate-and-copy cycles, but it doesn't do anything to help the case you're mentioning.

I have a series of other performance-focused optimizations in progress, but if there's a particular bottleneck you're hitting somewhere, please let me know and I can try to focus some time on it.

