Commit 6b44167
feat(parquet/file): pre-allocate BinaryBuilder data buffer using column chunk metadata to eliminate resize overhead (#689)
### Rationale for this change
This PR addresses issue #688.
`byteArrayRecordReader` builds binary/string Arrow arrays using
`array.BinaryBuilder`, but the builder's data buffer starts empty and
grows via repeated doublings as values are appended. For large binary
columns this causes O(log n) realloc+copy cycles per row group, wasting
both time and memory.
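To make the cost of grow-on-demand concrete, here is a minimal, stdlib-only model of a doubling buffer. It is only a sketch: `growthStats` is a hypothetical helper, and the growth rule (double, or jump to the needed size if doubling is not enough) is an assumption for illustration, not a copy of how `array.BinaryBuilder` actually sizes its buffer.

```go
package main

import "fmt"

// growthStats models appending n values of valueSize bytes to a buffer that
// starts at initialCap bytes. When a value does not fit, the capacity is
// doubled (or raised to the needed size if doubling is not enough), which
// counts as one reallocation that copies the current contents.
// NOTE: assumed growth policy, for illustration only.
func growthStats(n, valueSize, initialCap int) (reallocs, copied int) {
	capacity, length := initialCap, 0
	for i := 0; i < n; i++ {
		if need := length + valueSize; need > capacity {
			newCap := capacity * 2
			if newCap < need {
				newCap = need
			}
			capacity = newCap
			reallocs++
			copied += length // existing bytes move to the new buffer
		}
		length += valueSize
	}
	return reallocs, copied
}

func main() {
	// Same shape as one row group in the benchmark: 484 rows of 50 KB each.
	const rows, valueSize = 484, 50 * 1024

	r, c := growthStats(rows, valueSize, 0)
	fmt.Printf("grow on demand: %d reallocs, %d bytes copied\n", r, c)

	r, c = growthStats(rows, valueSize, rows*valueSize)
	fmt.Printf("pre-allocated:  %d reallocs, %d bytes copied\n", r, c)
}
```

In this model the number of reallocations grows logarithmically with the row count, but the bytes copied add up to roughly the size of the final buffer, which is the overhead pre-allocation is meant to remove.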
This PR threads column chunk size metadata (`TotalUncompressedSize`,
`NumRows`) from `columnIterator.NextChunk()` down to `leafReader`, and
uses it to pre-allocate the builder's data buffer at the start of each
`LoadBatch` call via `BinaryBuilder.ReserveData`.
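One plausible way to turn the chunk metadata into a reservation size is to scale the chunk's uncompressed size by the fraction of rows a batch will read. The sketch below assumes exactly that; `estimateReserve` is a hypothetical name, and the PR's own `reserveBinaryData` may compute the figure differently.

```go
package main

import "fmt"

// estimateReserve returns a byte count to reserve for a batch of nrecords
// rows, given a column chunk's total uncompressed size and row count.
// It uses the average bytes per row (rounded up) and never exceeds the
// size of the whole chunk.
// NOTE: hypothetical helper; the heuristic is an assumption.
func estimateReserve(totalUncompressed, numRows, nrecords int64) int64 {
	if totalUncompressed <= 0 || numRows <= 0 || nrecords <= 0 {
		return 0 // no usable metadata: reserve nothing
	}
	if nrecords >= numRows {
		return totalUncompressed // the batch covers the whole chunk
	}
	perRow := (totalUncompressed + numRows - 1) / numRows // ceiling division
	reserve := perRow * nrecords
	if reserve > totalUncompressed {
		reserve = totalUncompressed
	}
	return reserve
}

func main() {
	// A chunk of 484 rows holding about 13 MB of uncompressed data,
	// read in batches of 121 rows (a quarter of the row group).
	fmt.Println(estimateReserve(13_000_000, 484, 121))
}
```

Because the chunk size may include bytes other than the values themselves, an estimate like this is best treated as a capacity hint: reserving slightly too much costs some memory, while reserving too little simply falls back to normal growth.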
### What changes are included in this PR?
- **`parquet/file/record_reader.go`**: adds `ReserveData(int64)` to
`BinaryRecordReader` interface and implements it on
`byteArrayRecordReader`; adds a no-op implementation on
`flbaRecordReader`.
- **`parquet/pqarrow/file_reader.go`**: `columnIterator.NextChunk()` now
returns `(PageReader, uncompressedBytes, numRows, error)`.
- **`parquet/pqarrow/column_readers.go`**: `leafReader` stores current
row group metadata; `LoadBatch` calls
`reserveBinaryData(nrecords)` after each reset; `nextRowGroup` takes a
`remainingRows` parameter to extend the reservation when crossing
row group boundaries mid-batch.
- **`parquet/pqarrow/properties.go`**: adds `PreAllocBinaryData bool` to
`ArrowReadProperties` (default: `false`).
Opt in via:
```go
props := pqarrow.ArrowReadProperties{
	PreAllocBinaryData: true,
}
reader, err := pqarrow.NewFileReader(pf, props, mem)
```
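The `column_readers.go` change mentions extending the reservation when a batch crosses a row group boundary. A rough sketch of that bookkeeping follows, under the assumption that each row group contributes its average bytes per row for the rows taken from it; `chunkMeta` and `reserveForBatch` are hypothetical names, not identifiers from the PR.

```go
package main

import "fmt"

// chunkMeta holds the two metadata values threaded through per row group.
// NOTE: hypothetical type for illustration.
type chunkMeta struct {
	uncompressedBytes int64
	numRows           int64
}

// reserveForBatch sums the bytes to reserve for a batch of batchRows rows
// that begins startRow rows into the sequence of row groups. Rows taken
// from each group are weighted by that group's average bytes per row.
func reserveForBatch(groups []chunkMeta, startRow, batchRows int64) int64 {
	var reserved int64
	offset, remaining := startRow, batchRows
	for _, g := range groups {
		if remaining <= 0 {
			break
		}
		if offset >= g.numRows {
			offset -= g.numRows // batch starts after this row group
			continue
		}
		take := g.numRows - offset
		if take > remaining {
			take = remaining
		}
		reserved += take * g.uncompressedBytes / g.numRows
		remaining -= take
		offset = 0 // later groups are read from their first row
	}
	return reserved
}

func main() {
	groups := []chunkMeta{
		{uncompressedBytes: 1000, numRows: 10},
		{uncompressedBytes: 2000, numRows: 10},
	}
	// A batch of 8 rows starting at row 5 takes 5 rows from the first
	// group and the remaining 3 rows from the second.
	fmt.Println(reserveForBatch(groups, 5, 8))
}
```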
### Are these changes tested?
Yes. `parquet/pqarrow/binary_prealloc_test.go` covers:
- Default flag value is false (no behaviour change for existing callers)
- Correctness of output for binary, string, nullable, int32, FLBA, and dict-encoded columns
- All batch size configurations: unbounded, one batch per row group, multiple batches per row group, and batches that span row group boundaries
Benchmark in `parquet/pqarrow/reader_writer_test.go` (`BenchmarkPreAllocBinaryData`) compares prealloc=false vs prealloc=true on a two-column schema (slim string id + fat binary blob, 5 KB–50 KB values, Zstd, 2 row groups × 484 rows):
Environment: Apple M1 Max · count=3 · medians reported

| Sub-benchmark | ns/op (false) | ns/op (true) | Δ ns/op | B/op (false) | B/op (true) | Δ B/op | allocs/op (false) | allocs/op (true) | Δ allocs/op |
|---|---|---|---|---|---|---|---|---|---|
| batchAll | 9,117,272 | 7,993,732 | -12.3% | 144,021,824 | 115,098,562 | -20.1% | 511 | 494 | -3.3% |
| batchPerRG | 9,190,661 | 8,083,567 | -12.0% | 144,024,680 | 115,096,686 | -20.1% | 513 | 493 | -3.9% |
| batchQuarterRG | 9,116,379 | 7,896,174 | -13.4% | 144,023,299 | 115,097,206 | -20.1% | 512 | 493 | -3.7% |

Note: production workloads with larger values (~250 KB/row) are expected
to see larger improvements, because more reallocation doublings are
eliminated as value sizes grow. This benchmark uses 5–50 KB values to
keep runtime practical.
### Are there any user-facing changes?
Yes, opt-in. A new field `PreAllocBinaryData bool` is added to
`ArrowReadProperties`. It defaults to `false`, so existing code behaves
exactly as before. Users with large binary or string columns can enable
it to reduce memory allocations and improve read throughput.
---------
Co-authored-by: Junyan Ling <jling22@apple.com>
1 parent 2895752 commit 6b44167
6 files changed
Lines changed: 610 additions & 7 deletions