Replies: 2 comments
Thanks for the detailed report @aldder, and I hear you that this is a real concern. What you're describing lines up with a tradeoff made in #4193, where partitioned writes moved to a hash repartition with per-stream writers. That is much faster on balanced data, but as you pointed out, the benchmarks in that PR showed higher memory usage on skewed inputs. For anyone hitting this now:

The most useful next step would be a small repro plus a comparison across:

That would help us tell apart an expected throughput/memory tradeoff, a regression in defaults, and the need for a more conservative fallback for skewed workloads. I'll be running these locally but would appreciate additional evidence. Follow-ups I think we can look at from here:

If you can share more about the workload shape (Python vs Rust, partition cardinality, degree of skew, rough batch sizes), that would help narrow things down. Thanks!
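On gathering that workload-shape data: partition cardinality and degree of skew are cheap to summarize before a write. A minimal stdlib sketch (the partition values and counts below are illustrative, not taken from any real workload):

```python
from collections import Counter

def partition_skew_summary(keys):
    """Summarize partition cardinality and skew for a list of partition keys."""
    counts = Counter(keys)
    total = sum(counts.values())
    hottest = max(counts.values())
    return {
        "cardinality": len(counts),       # distinct partition values
        "rows": total,
        "max_fraction": hottest / total,  # share of rows in the hottest partition
    }

# Illustrative skewed input: 95% of rows land in one partition.
keys = ["eu"] * 9_500 + ["us", "apac", "latam", "jp", "uk"] * 100
print(partition_skew_summary(keys))
# {'cardinality': 6, 'rows': 10000, 'max_fraction': 0.95}
```

A `max_fraction` close to 1.0 with high cardinality is the shape that tends to stress per-partition writers.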
Hello @ethan-tyler, sorry for the late reply, and thanks for the suggestions. I tried looking through the documentation but couldn't find how to modify the

Regarding the workload shape, it's difficult to extract this information because it usually doesn't appear in the logs, and I'm having a hard time building a reproducible example. However, looking at the latest issues, I suspect the memory issue could be related to this one (#4301), since that's exactly what we have in place in our processes (streamed execution + literal partitions with the

Let's see what happens with 1.5.1!
Dear delta-rs team,
I’ve been testing the new delta-rs v1.5 across several AWS production workloads, and the results are a bit of a double-edged sword. I wanted to start a conversation about the trade-offs introduced by the recent performance optimizations, specifically regarding PR #4193.
The Good
On "well-behaved" datasets, the performance gains are massive. We are seeing up to a 50% reduction in execution time, which is an incredible achievement by the maintainers. The throughput improvements are definitely noticeable.
The Problem: Memory Overload (OOM)
However, we’ve encountered significant regressions in stability for processes dealing with unbalanced partitions (data skew). While the old version handled these gracefully (albeit more slowly), v1.5 leads to immediate Out Of Memory (OOM) errors in these real-world scenarios.
It seems the new implementation prioritizes "pushing" data as fast as possible, but lacks a robust back-pressure mechanism or a memory-aware buffer when dealing with skewed data. As noted in this comment, the peak memory usage has spiked significantly.
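To make the back-pressure point concrete, here is a toy model (plain Python, not delta-rs internals) of hash repartitioning into per-partition queues with no back-pressure: each tick the producer emits a fixed number of batches, and each consumer drains at most one batch from its own queue. On balanced input the queues stay flat; when every batch hashes to one partition, that queue's depth grows linearly with time, which is exactly the skewed-input memory pattern:

```python
def peak_queue_depth(ticks, num_partitions):
    """Toy model: per-partition queues with no back-pressure.
    ticks: each element lists the partition ids of the batches produced
    that tick; each consumer then drains at most one batch per tick."""
    queues = [0] * num_partitions
    peak = 0
    for batch_ids in ticks:
        for pid in batch_ids:
            queues[pid] += 1
        peak = max(peak, max(queues))  # high-water mark before draining
        for i in range(num_partitions):
            if queues[i] > 0:
                queues[i] -= 1
    return peak

K, T = 8, 100
balanced = [list(range(K)) for _ in range(T)]  # one batch per partition per tick
skewed = [[0] * K for _ in range(T)]           # every batch hits partition 0

print(peak_queue_depth(balanced, K))  # 1   (queues never back up)
print(peak_queue_depth(skewed, K))    # 701 (depth grows ~7 per tick, unbounded)
```

The balanced case is bounded regardless of how long the job runs; the skewed case is bounded only by input size, which is where an OOM comes from.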
Questions for the developers:
Is this expected behavior? While performance is key, delta-rs is often used in memory-constrained environments (like AWS Lambda or small Fargate tasks). Should a performance-oriented update sacrifice the "stability safety net" for skewed data?
Configuration: Are there new knobs or environment variables we can use to throttle this new behavior for specific workloads without rolling back to v1.4?
Future Mitigation: Are there plans to implement a more conservative memory allocation strategy or better handling for partition skew in the upcoming patches?
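On the mitigation question, the generic shape of such a fix is a bounded channel between the repartition stage and each writer: the producer blocks once a queue is full, so buffered batches are capped at the queue size regardless of skew. A minimal illustration in plain Python (a sketch of the general technique only, not anything delta-rs currently exposes):

```python
import queue
import threading

ch = queue.Queue(maxsize=4)  # at most 4 batches buffered at any time
consumed = []

def consumer():
    while True:
        item = ch.get()
        if item is None:  # sentinel: producer is done
            break
        consumed.append(item)

t = threading.Thread(target=consumer)
t.start()
for batch in range(100):
    ch.put(batch)  # blocks whenever the channel already holds 4 batches
ch.put(None)
t.join()
print(len(consumed))  # 100: all batches arrive, memory capped at 4 in flight
```

The trade is throughput for a hard memory ceiling, which is the "stability safety net" behavior the old version effectively had.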
I’d love to hear if others are facing similar issues and if we should consider the current implementation "safe" for general-purpose production use where data distribution isn't always perfect.