feat(parquet/file): pre-allocate BinaryBuilder data buffer using column chunk metadata to eliminate resize overhead (#689)
Conversation
zeroshade left a comment
Thanks for this! Just one question really below. Though you do have some linting issues to fix 😄
```go
func (fr *flbaRecordReader) ReadDictionary() bool { return false }

func (fr *flbaRecordReader) ReserveData(int64) {}
```
Can't we also figure out how much to reserve here too?
`FixedSizeBinaryBuilder` doesn't have a variable-length data buffer: since all values are the same width, the buffer size is fully determined by the slot count, which `ReserveValues` already handles via `bldr.Reserve(n)`. `ReserveData` is only meaningful for `byteArrayRecordReader`, where value lengths vary.
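A back-of-the-envelope sketch of that distinction, in plain Go (the helper names `fixedWidthBytes` and `variableWidthBytes` are illustrative, not the arrow-go API):

```go
package main

import "fmt"

// fixedWidthBytes returns the data-buffer size a fixed-size binary
// builder needs: it depends only on the slot count, so reserving
// slots already sizes the buffer fully.
func fixedWidthBytes(byteWidth, numSlots int) int {
	return byteWidth * numSlots
}

// variableWidthBytes returns the data-buffer size a variable-length
// binary builder needs: the sum of individual value lengths, which
// the slot count alone cannot determine, hence a separate ReserveData.
func variableWidthBytes(valueLens []int) int {
	total := 0
	for _, n := range valueLens {
		total += n
	}
	return total
}

func main() {
	fmt.Println(fixedWidthBytes(16, 1000))             // 16000
	fmt.Println(variableWidthBytes([]int{5, 120, 33})) // 158
}
```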
Hi @zeroshade , thanks a lot for merging the PR! I am curious about the release schedule - what does the cadence look like, or what's the best practice for us to upgrade our arrow-go to include this commit before a new release?
We just did a release last week, so ideally we'd wait a couple weeks at least before doing a new release. Depending on your own requirements, you could update your own go.mod to point at the commit hash directly (or use a replace directive) if you're okay with that. Is it urgent for you that there be a release with this change soon?
Thanks @zeroshade ! Sounds good, I can build an internal version cherry-picking this commit. That said, I need to point out that after cherry-picking this change, in the profiling files of our e2e benchmark k8s jobs, I noticed the CPU time is not dramatically improved. After further investigation, here are the reasons:
You're absolutely correct there; this really gives the best benefit to very specific use cases that were causing multiple reallocate-and-copy scenarios, but it doesn't do anything to help the case you're mentioning. I have a series of other performance-focused optimizations I'm working on, but if there's a particular bottleneck you're hitting somewhere, please let me know and I can try to focus some time on that.
Rationale for this change
This PR addresses issue #688.
`byteArrayRecordReader` builds binary/string Arrow arrays using `array.BinaryBuilder`, but the builder's data buffer starts empty and grows via repeated doublings as values are appended. For large binary columns this causes O(log n) realloc+copy cycles per row group, wasting both time and memory.

This PR threads column chunk size metadata (`TotalUncompressedSize`, `NumRows`) from `columnIterator.NextChunk()` down to `leafReader`, and uses it to pre-allocate the builder's data buffer at the start of each `LoadBatch` call via `BinaryBuilder.ReserveData`.

What changes are included in this PR?
- `parquet/file/record_reader.go`: adds `ReserveData(int64)` to the `BinaryRecordReader` interface and implements it on `byteArrayRecordReader`; adds a no-op implementation on `flbaRecordReader`.
- `parquet/pqarrow/file_reader.go`: `columnIterator.NextChunk()` now returns `(PageReader, uncompressedBytes, numRows, error)`.
- `parquet/pqarrow/column_readers.go`: `leafReader` stores the current row group metadata; `LoadBatch` calls `reserveBinaryData(nrecords)` after each reset; `nextRowGroup` takes a `remainingRows` parameter to extend the reservation when crossing row group boundaries mid-batch.
- `parquet/pqarrow/properties.go`: adds `PreAllocBinaryData bool` to `ArrowReadProperties` (default: `false`). Opt in by setting this field to `true`.
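As an aside, the realloc-and-copy pattern that `ReserveData` avoids can be illustrated with a self-contained simulation. This models generic capacity doubling, not the actual `BinaryBuilder` internals; the benchmark's 484 rows × ~50 KB shape is reused only for scale:

```go
package main

import "fmt"

// appendAll grows a byte buffer by doubling, the way an unreserved
// builder does, and counts reallocations and bytes copied.
func appendAll(initialCap, totalBytes, valueSize int) (reallocs, copied int) {
	capacity, length := initialCap, 0
	for length < totalBytes {
		if length+valueSize > capacity {
			if capacity == 0 {
				capacity = valueSize
			}
			for length+valueSize > capacity {
				capacity *= 2
			}
			reallocs++
			copied += length // old contents copied into the new buffer
		}
		length += valueSize
	}
	return reallocs, copied
}

func main() {
	// 484 rows of ~50 KB values, roughly the benchmark's row group shape.
	total := 484 * 50 * 1024

	r, c := appendAll(0, total, 50*1024)
	fmt.Printf("doubling: %d reallocs, %d bytes copied\n", r, c)

	// With the buffer reserved at full size up front: no copies at all.
	r, c = appendAll(total, total, 50*1024)
	fmt.Printf("reserved: %d reallocs, %d bytes copied\n", r, c)
}
```

The unreserved path ends up copying roughly as many bytes as the row group contains, which is the overhead the pre-allocation removes.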
Are these changes tested?
Yes. parquet/pqarrow/binary_prealloc_test.go covers:
boundaries
Benchmark in `parquet/pqarrow/reader_writer_test.go` (`BenchmarkPreAllocBinaryData`) compares `prealloc=false` vs `prealloc=true` on a two-column schema (slim string id + fat binary blob, 5 KB–50 KB values, Zstd, 2 row groups × 484 rows):
Environment: Apple M1 Max · count=3 · medians reported
Note: production workloads with larger values (~250 KB/row) will see larger improvements, since more reallocation doublings are eliminated at greater value sizes. This benchmark uses 5–50 KB values to keep runtime practical.
Are there any user-facing changes?
Yes, opt-in. A new field `PreAllocBinaryData bool` is added to `ArrowReadProperties`. It defaults to `false`, so all existing code is unaffected without any changes. Users with large binary or string columns can enable it to reduce memory allocations and improve read throughput.