
perf(parquet/compress): set zstd pool encoder concurrency to 1#717

Merged
zeroshade merged 1 commit into apache:main from dimakuz:perf/zstd-pool-concurrency on Mar 17, 2026

Conversation

@dimakuz (Contributor) commented Mar 16, 2026

The zstdEncoderPool is used exclusively by EncodeAll(), which is a single-shot synchronous call that uses exactly one inner block encoder. However, zstd.NewWriter defaults its concurrency to runtime.GOMAXPROCS(0), pre-allocating that many inner block encoders, each with its own ~1 MiB history buffer (ensureHist). On a 10-core machine, each pooled Encoder therefore allocates 10 inner encoders when only one is ever used by EncodeAll.

With WithEncoderConcurrency(1), each pooled encoder creates a single inner encoder, matching actual usage. The streaming Write/Close path is unaffected — it does not use the pool.

Benchmark results (Apple M4 Pro, arm64, 256 KiB semi-random data):

    BenchmarkZstdPooledEncodeAll/Default-14        11000 B/op   5250 MB/s
    BenchmarkZstdPooledEncodeAll/Concurrency1-14     810 B/op   5500 MB/s

That is ~14x less memory allocated per operation and ~5% higher throughput, attributable to reduced GC pressure.

In a parquet write workload (1 GiB Arrow data, ZSTD level 3), this reduced ensureHist allocations from 22 GiB to 7 GiB and madvise kernel CPU from 4.6s to 2.3s (10% wall-time improvement).

Rationale for this change

High memory churn during parquet encoding

What changes are included in this PR?

A change to the zstd encoder concurrency, plus a benchmark to reproduce the results.

Are these changes tested?

Yes

Are there any user-facing changes?

No

@dimakuz dimakuz requested a review from zeroshade as a code owner March 16, 2026 19:52
@zeroshade (Member) commented

Should this be a configurable setting that we just default to 1?

@zeroshade (Member) left a review

just the one question

@dimakuz (Contributor, Author) commented Mar 17, 2026

Thanks for the prompt review!
I think that with the current implementation of the zstd codec and the zstd library there's no real reason to make it configurable: we encode as a one-shot (EncodeAll), which internally uses a single inner encoder and never benefits from parallel encoders being available.
If you prefer to plumb it out in some way so it's more controllable from the outside, I can try.

@zeroshade (Member) commented

I think this is fine for now, we can look into making it more controllable in a follow-up. Thanks!

@zeroshade zeroshade merged commit 5a94422 into apache:main Mar 17, 2026
39 of 41 checks passed