Replies: 2 comments
Thanks for the detailed report @aldder, and I hear you that this is a real concern. What you're describing lines up with a tradeoff made in #4193, where partitioned writes moved to a hash repartition with per-stream writers. That is much faster on balanced data, but as you pointed out, the benchmarks in that PR showed higher memory usage on skewed inputs. For anyone hitting this now:

The most useful next step would be a small repro plus a comparison across:

That would help us tell apart an expected throughput/memory tradeoff, a regression in defaults, and the need for a more conservative fallback for skewed workloads. I'll be running these locally but would appreciate additional evidence. Follow-ups I think we can look at from here:

If you can share more about the workload shape (Python vs Rust, partition cardinality, degree of skew, rough batch sizes), that would help narrow things down. Thanks!
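On gathering that workload-shape data: partition cardinality and degree of skew are cheap to summarize before a write. A minimal stdlib sketch (the partition values and counts below are illustrative, not taken from any real workload):

```python
from collections import Counter

def partition_skew_summary(keys):
    """Summarize partition cardinality and skew for a list of partition keys."""
    counts = Counter(keys)
    total = sum(counts.values())
    hottest = max(counts.values())
    return {
        "cardinality": len(counts),       # distinct partition values
        "rows": total,
        "max_fraction": hottest / total,  # share of rows in the hottest partition
    }

# Illustrative skewed input: 95% of rows land in one partition.
keys = ["eu"] * 9_500 + ["us", "apac", "latam", "jp", "uk"] * 100
print(partition_skew_summary(keys))
# {'cardinality': 6, 'rows': 10000, 'max_fraction': 0.95}
```

A `max_fraction` close to 1.0 with high cardinality is the shape that tends to stress per-partition writers.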
Hello @ethan-tyler, sorry for the late reply, and thanks for the suggestions. I tried looking through the documentation but couldn't find how to modify the

Regarding the workload shape, it's difficult to extract this information because it usually doesn't appear in the logs, and I'm having a hard time building a reproducible example. However, looking at the latest issues, I suspect the memory issue could be related to this one (#4301), since that's exactly what we have in place in our processes (streamed execution + literal partitions with the

Let's see what happens with 1.5.1!
Dear delta-rs team,
I’ve been testing the new delta-rs v1.5 across several AWS production workloads, and the results are a bit of a double-edged sword. I wanted to start a conversation about the trade-offs introduced by the recent performance optimizations, specifically regarding PR #4193.
The Good
On "well-behaved" datasets, the performance gains are massive. We are seeing up to a 50% reduction in execution time, which is an incredible achievement by the maintainers. The throughput improvements are definitely noticeable.
The Problem: Memory Overload (OOM)
However, we’ve encountered significant regressions in stability for processes dealing with unbalanced partitions (data skew). While the old version handled these gracefully (albeit more slowly), v1.5 leads to immediate Out Of Memory (OOM) errors in these real-world scenarios.
It seems the new implementation prioritizes "pushing" data as fast as possible, but lacks a robust back-pressure mechanism or a memory-aware buffer when dealing with skewed data. As noted in this comment, the peak memory usage has spiked significantly.
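To make the back-pressure point concrete, here is a toy model (plain Python, not delta-rs internals) of hash repartitioning into per-partition queues with no back-pressure: each tick the producer emits a fixed number of batches, and each consumer drains at most one batch from its own queue. On balanced input the queues stay flat; when every batch hashes to one partition, that queue's depth grows linearly with time, which is exactly the skewed-input memory pattern:

```python
def peak_queue_depth(ticks, num_partitions):
    """Toy model: per-partition queues with no back-pressure.
    ticks: each element lists the partition ids of the batches produced
    that tick; each consumer then drains at most one batch per tick."""
    queues = [0] * num_partitions
    peak = 0
    for batch_ids in ticks:
        for pid in batch_ids:
            queues[pid] += 1
        peak = max(peak, max(queues))  # high-water mark before draining
        for i in range(num_partitions):
            if queues[i] > 0:
                queues[i] -= 1
    return peak

K, T = 8, 100
balanced = [list(range(K)) for _ in range(T)]  # one batch per partition per tick
skewed = [[0] * K for _ in range(T)]           # every batch hits partition 0

print(peak_queue_depth(balanced, K))  # 1   (queues never back up)
print(peak_queue_depth(skewed, K))    # 701 (depth grows ~7 per tick, unbounded)
```

The balanced case is bounded regardless of how long the job runs; the skewed case is bounded only by input size, which is where an OOM comes from.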
Questions for the developers:
Is this expected behavior? While performance is key, delta-rs is often used in memory-constrained environments (like AWS Lambda or small Fargate tasks). Should a performance-oriented update sacrifice the "stability safety net" for skewed data?
Configuration: Are there new knobs or environment variables we can use to throttle this new behavior for specific workloads without rolling back to v1.4?
Future Mitigation: Are there plans to implement a more conservative memory allocation strategy or better handling for partition skew in the upcoming patches?
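On the mitigation question, the generic shape of such a fix is a bounded channel between the repartition stage and each writer: the producer blocks once a queue is full, so buffered batches are capped at the queue size regardless of skew. A minimal illustration in plain Python (a sketch of the general technique only, not anything delta-rs currently exposes):

```python
import queue
import threading

ch = queue.Queue(maxsize=4)  # at most 4 batches buffered at any time
consumed = []

def consumer():
    while True:
        item = ch.get()
        if item is None:  # sentinel: producer is done
            break
        consumed.append(item)

t = threading.Thread(target=consumer)
t.start()
for batch in range(100):
    ch.put(batch)  # blocks whenever the channel already holds 4 batches
ch.put(None)
t.join()
print(len(consumed))  # 100: all batches arrive, memory capped at 4 in flight
```

The trade is throughput for a hard memory ceiling, which is the "stability safety net" behavior the old version effectively had.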
I’d love to hear if others are facing similar issues and if we should consider the current implementation "safe" for general-purpose production use where data distribution isn't always perfect.