Improvements for parquet writing performance (25%-44%)#7824
Dandandan merged 8 commits into apache:main
Conversation
Force-pushed from 3a807a5 to 11e76a3
Is this ready for review @jhorstmann?
@Dandandan I think this should be ready for review. I haven't gotten around to re-running all the parquet writing benchmarks yet; maybe @alamb can also do his benchmark magic :)
Benchmark results from a local run: very nice improvements for primitives, minor improvements once bloom filters or string types are involved. I think the last two benchmarks are named incorrectly; they actually write 3 columns of types int32, bool and utf8.
@Dandandan wait a second before merging, I think I found another hotspot regarding the bloom filters.

Sure thing!
Those are some pretty impressive numbers @jhorstmann
Updated results for bloom filter benchmarks. There is a very weird effect on my machine where running only these benchmarks is much faster than running them interleaved with non-bloom benchmarks.
🤖 |
🤖: Benchmark completed
😮 |
alamb left a comment:
Thank you so much @jhorstmann -- that is amazing
It is a good call to remove the runtime overhead in release mode
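The overhead removal discussed here is the swap from `assert!` to `debug_assert!` on the hot write path. A minimal sketch of the effect, using a simplified stand-in function rather than the actual `BitWriter::put_value` signature:

```rust
/// Simplified stand-in for a hot-path writer method (not the real
/// parquet-rs API). `debug_assert!` validates inputs in debug builds but
/// is compiled out entirely with `--release`, so the per-value branches
/// disappear from the optimized code.
fn put_value_bits(v: u64, num_bits: u32) -> u64 {
    debug_assert!(num_bits <= 64, "cannot write more than 64 bits");
    debug_assert!(num_bits == 64 || v < (1u64 << num_bits), "value does not fit");
    // Return the low `num_bits` bits that would be appended to the buffer.
    if num_bits == 64 { v } else { v & ((1u64 << num_bits) - 1) }
}

fn main() {
    assert_eq!(put_value_bits(0b1011, 4), 0b1011);
    assert_eq!(put_value_bits(u64::MAX, 64), u64::MAX);
}
```

This trade-off is reasonable here because, as the PR notes, the checks should never be triggered by users of the code and are not required for soundness.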
This is also a good idea to avoid having to clone (even though it is just a few Arcs) each time
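The clone avoidance referenced here corresponds to the "cache logical nulls" change from the PR description. A hedged sketch of the pattern with `std::cell::OnceCell` (the struct and field names are illustrative, not the arrow-rs API):

```rust
use std::cell::OnceCell;

/// Hypothetical sketch: compute the logical nulls of an array once and
/// reuse the cached result, instead of cloning or recomputing the mask on
/// every access during a write.
struct CachedNulls {
    data: Vec<Option<i32>>,
    logical_nulls: OnceCell<Vec<bool>>,
}

impl CachedNulls {
    fn new(data: Vec<Option<i32>>) -> Self {
        Self { data, logical_nulls: OnceCell::new() }
    }

    /// First call computes the null mask; later calls return the cached slice.
    fn logical_nulls(&self) -> &[bool] {
        self.logical_nulls
            .get_or_init(|| self.data.iter().map(Option::is_none).collect())
    }
}

fn main() {
    let arr = CachedNulls::new(vec![Some(1), None, Some(3)]);
    assert_eq!(arr.logical_nulls(), &[false, true, false]);
    // Second call hits the cache; nothing is recomputed or cloned.
    assert_eq!(arr.logical_nulls(), &[false, true, false]);
}
```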
BTW @jhorstmann -- Twitter noticed that the initial results in the description don't reflect what I just measured https://x.com/qrilka/status/1940751432560660823 Perhaps the description needs to be updated with your latest results?
The PR description needs an update now that bloom filters are also improved. I also ran my initial benchmarks with
@alamb I can't reproduce these benchmark numbers. The branch results look plausible, but the numbers for main look much slower than what I'm getting locally. Can you verify that the main results include the changes from PR #7823 / commit a9f316b? It might also be that the benchmark names ending in
My scripts compare the branch against
Force-pushed from 9d6a5f5 to e7f32e9
🤖 |
Ah, that makes sense, I think I created both branches in parallel. This PR is rebased now, that should give a better comparison. |
I restarted the benchmark |
🤖: Benchmark completed
Ah, 25% - 44% -- will adjust |
🚀 indeed |
🚀 🚀 🚀 |
This method was removed in apache#7824, which introduced an optimized code path for writing bloom filters on little-endian architectures. However, the method was still used in the big-endian code path. Due to the use of `#[cfg(target_endian)]`, this went unnoticed in CI. Fixes apache#8207
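A hedged sketch of the kind of `#[cfg(target_endian)]` split involved (illustrative, not the actual writer code): the little-endian branch writes all blocks as one contiguous byte slice, while the big-endian branch converts per block. Anything reachable only from the big-endian branch is never compiled on little-endian targets, which is how an unused helper can slip past CI.

```rust
/// Illustrative sketch (not the actual parquet-rs code). Bloom filter
/// blocks are little-endian on disk; on a little-endian host the in-memory
/// layout already matches, so all blocks can be copied in one go.
fn write_blocks(blocks: &[u32], out: &mut Vec<u8>) {
    #[cfg(target_endian = "little")]
    {
        // Native layout == on-disk layout: one bulk copy.
        let bytes: &[u8] = unsafe {
            std::slice::from_raw_parts(blocks.as_ptr().cast::<u8>(), blocks.len() * 4)
        };
        out.extend_from_slice(bytes);
    }
    #[cfg(target_endian = "big")]
    {
        // Fallback: byte-swap each block individually.
        for b in blocks {
            out.extend_from_slice(&b.to_le_bytes());
        }
    }
}

fn main() {
    let mut out = Vec::new();
    write_blocks(&[1, 0x0102_0304], &mut out);
    // Both branches produce little-endian output.
    assert_eq!(out, vec![1, 0, 0, 0, 4, 3, 2, 1]);
}
```

Since CI typically runs only on little-endian hosts, a helper used solely by the big-endian branch can be deleted without any compile error being reported.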
# Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

- Closes apache#7822. The benchmark update in apache#7823 should be merged first to get a fair baseline.

# Rationale for this change

The changes in this PR improve parquet writing performance for primitives by up to 45%.

# What changes are included in this PR?

There was not a single bottleneck to fix; instead, several small improvements contributed to the performance increase:

- Optimize counting of values and nulls by replacing a loop with code that can be vectorized by the compiler. The number of nulls can also be calculated from the length of the array and the number of values to write, instead of being counted separately.
- Change asserts in `BitWriter::put_value` to `debug_assert` since these should never be triggered by users of the code and are not required for soundness.
- Use slice iteration instead of indexing in `flush_bit_packed_run` to avoid a bounds check.
- Separate iteration for def_levels and non_null_indices using specialized iterators. Range iteration is `TrustedLen` and so avoids multiple capacity checks, and `BitIndexIterator` is more optimized for collecting non-null indices.
- Cache logical nulls of the array to avoid clones or repeated recomputation. This should avoid a pathological case when writing lists of arrays that need logical nulls.
- Optimize bloom filter initialization to a single `memset` and write all blocks as a single slice on little-endian targets.

# Are these changes tested?

Logic should be covered by existing tests.

# Are there any user-facing changes?

No, all changes are to implementation details and do not affect public APIs.
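The first bullet above can be sketched as follows (simplified; the real implementation operates on Arrow validity buffers): summing per-byte popcounts is straight-line code the compiler can vectorize, and the null count then falls out of the array length for free.

```rust
/// Sketch of the vectorizable value/null counting described above
/// (simplified stand-in, not the actual parquet-rs code). `validity` is a
/// bit-packed mask, least-significant bit first, with any trailing padding
/// bits zeroed; `len` is the number of logical values it covers.
fn count_values_and_nulls(validity: &[u8], len: usize) -> (usize, usize) {
    // A branch-free sum of popcounts replaces a per-value loop.
    let values: usize = validity.iter().map(|b| b.count_ones() as usize).sum();
    // Nulls are derived from the length instead of counted in a second pass.
    (values, len - values)
}

fn main() {
    // 0b0000_1011: values at positions 0, 1 and 3; a null at position 2.
    assert_eq!(count_values_and_nulls(&[0b0000_1011], 4), (3, 1));
}
```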