Add fast-and-unsound feature #485
Conversation
Regarding the performance impact:
So I think that adding a feature that implicitly alters the internal implementation of one of the internal functions somewhere deep in the library to make it "unsound, but fast" is not the right way to approach the problem. Documenting such a feature and properly explaining its meaning would also result in a very vague and confusing explanation. Let's try to find a more elegant solution!
This is a new version that attempts to expose the unsound version through additional unsafe functions in the public API. I am still working through building a small use case that demonstrates this. As far as I know, this is unrelated to #480.
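To illustrate the shape of this approach, here is a minimal sketch of a safe/unsafe function pair over a plain `Vec<u8>` (all names are hypothetical, not the crate's actual API; the real code operates on `BytesMut`):

```rust
use std::io::Read;

/// Safe version: grows the buffer with zero-fill before handing it to the
/// reader, so a misbehaving `Read` impl can only ever observe zeros.
fn read_safe<R: Read>(reader: &mut R, buf: &mut Vec<u8>, max: usize) -> std::io::Result<usize> {
    let start = buf.len();
    buf.resize(start + max, 0); // the per-read zero-fill being discussed
    let n = reader.read(&mut buf[start..])?;
    buf.truncate(start + n);
    Ok(n)
}

/// Unsafe opt-in: skips the zero-fill. This is the unsoundness the thread is
/// about: the spare bytes are uninitialized until `read` writes them.
///
/// # Safety
/// The caller must guarantee that `reader` only writes to the slice it is
/// given and returns an accurate byte count; otherwise uninitialized memory
/// becomes observable (and handing out a `&mut [u8]` over uninitialized
/// bytes is itself undefined behavior by a strict reading, which is why this
/// must be an explicit opt-in).
unsafe fn read_unchecked<R: Read>(reader: &mut R, buf: &mut Vec<u8>, max: usize) -> std::io::Result<usize> {
    let start = buf.len();
    buf.reserve(max);
    buf.set_len(start + max); // no zero-fill: contents are uninitialized
    let n = reader.read(&mut buf[start..])?;
    buf.truncate(start + n);
    Ok(n)
}
```

The point of the `unsafe fn` shape is that the soundness obligation is spelled out in a `# Safety` section at the call site, rather than hidden behind a crate-wide feature flag.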
We're seeing the same regression upgrading from 0.24 to 0.26. I'm a little rusty when it comes to profiling and debugging performance issues, but I was also able to narrow it down. A very rough estimate of the observed perf difference is a factor of 60. Sorry if this is otherwise light on additional details; I also wasn't yet able to test this PR in our setup.
---
I think we can optimise the read impl to avoid having to zero-fill on each read, while maintaining safety and without any additional copies. I'll raise a PR when I get some time 🙂
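One safe shape such an optimisation could take (a simplified sketch of the general idea, not the actual follow-up PR, and unlike the goal above it does pay one copy in `take`): zero the buffer once up front and track a fill cursor, so no per-read zeroing is needed:

```rust
use std::io::Read;

/// Simplified sketch of a read buffer that pays the zero-fill cost once at
/// construction instead of on every read.
struct ReadBuffer {
    buf: Vec<u8>,   // fully initialized once, then reused
    filled: usize,  // how many bytes currently hold valid data
}

impl ReadBuffer {
    fn new(capacity: usize) -> Self {
        // Single upfront zero-fill; subsequent reads reuse the same memory.
        ReadBuffer { buf: vec![0; capacity], filled: 0 }
    }

    /// Read from `r` into the unfilled region; no zeroing on this path.
    fn read_from<R: Read>(&mut self, r: &mut R) -> std::io::Result<usize> {
        let n = r.read(&mut self.buf[self.filled..])?;
        self.filled += n;
        Ok(n)
    }

    /// Hand back the filled bytes and reset the cursor for reuse.
    fn take(&mut self) -> Vec<u8> {
        let out = self.buf[..self.filled].to_vec();
        self.filled = 0;
        out
    }
}
```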
---
I was hoping the new e2e benchmarks could highlight this as a potential gain. However, when I run #496 vs the unsound `set_len` version there doesn't seem to be a difference outside of noise. Perhaps this issue was really the regression in large-message performance fixed in #496 rather than the zero-filling code? While I think eliminating inefficient zero-fills is doable, it may not actually be worth any extra complexity if it isn't yielding objective performance improvements. If someone is still seeing a regression with #496, please detail your usage scenario so we can benchmark it.
---
The workload we have is more like lots of small messages at high frequency (around 10,000/second, up to 100,000/second in bursts) rather than huge messages. At a high enough frequency, zeroing out memory becomes (very) noticeable in profiling.
---
The new "send+recv" benches do bench high-frequency single-message perf. Post-#496 I can't produce much difference in performance. That is, unless I set a very high read buffer size, say 4MiB (instead of the default 128KiB). Then I do see a diff between the two, particularly for small messages. This makes sense, as each small-message read then needs to fill 4M of zeros. And this perhaps also explains the existing performance regressions: even though 4M is not the default, the old behaviour tried to read (and therefore zero-fill) the entire available capacity, so it is feasible that some large-message usage caused the 128KiB buf to grow and then start showing bad perf. So I think this is mostly resolved by #496, since that changes the read to only ever read the configured read buffer size.

My conclusion:
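To make the buffer-size effect concrete, here is a back-of-envelope sketch (my numbers, not taken from the benchmarks): under the old behaviour each read zero-fills the whole spare capacity, so the zeroed-bytes-per-payload-byte ratio scales linearly with buffer size:

```rust
/// Bytes zeroed per payload byte when every read zero-fills the whole
/// spare capacity (worst case: buffer fully spare, one small message per read).
fn zero_fill_ratio(buffer_bytes: usize, message_bytes: usize) -> f64 {
    buffer_bytes as f64 / message_bytes as f64
}

// For 100-byte messages: a 128KiB buffer zeroes ~1310 bytes per payload byte,
// while a 4MiB buffer zeroes ~41943, a 32x difference. This matches why the
// regression only shows up clearly with large (or grown) read buffers.
```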
---
Yeah, this sounds like a sensible approach!
---
FYI safe & sound fix for this issue: #524
Hi!
I have been experiencing some performance trouble that we drilled down to being caused by the use of `BytesMut::resize`. I realize that this topic was recently discussed and the decision was to use `resize` precisely to ensure soundness. I think this is a prudent decision for general use cases. Still, I would like to propose this change so that, as an opt-in, using a feature flag, it is possible to get top performance at the expense of soundness:
In some cases, speed could be preferable over soundness if we control the implementation of `Read` that is used for the underlying stream. By default the feature is NOT activated, keeping the current behavior as-is. However, when the feature is activated, it gives better performance when the user ensures correctness by verifying the behavior of the underlying `Read` implementation.

In my use case, which I admit is very limited, this change fixes a catastrophic latency issue (the time spent initializing memory to 0 dwarfs the rest).
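For concreteness, a sketch of what the proposed opt-in could look like (the feature name and function are illustrative, not the crate's actual code, and use `Vec<u8>` in place of `BytesMut`):

```rust
/// Grow `buf` by `additional` bytes ahead of a read.
///
/// With the hypothetical `fast-and-unsound` feature disabled (the default),
/// the new bytes are zero-filled via `resize`, which is safe. With it
/// enabled, `set_len` skips initialization; soundness then rests entirely on
/// the caller-verified behaviour of the underlying `Read` impl.
fn prepare_spare(buf: &mut Vec<u8>, additional: usize) {
    let new_len = buf.len() + additional;

    #[cfg(not(feature = "fast-and-unsound"))]
    buf.resize(new_len, 0); // safe default: zero-fill the new bytes

    #[cfg(feature = "fast-and-unsound")]
    {
        buf.reserve(additional);
        // SAFETY (claimed by the opt-in): the Read impl only writes to the
        // buffer and reports an accurate count, so the uninitialized bytes
        // are never read.
        unsafe { buf.set_len(new_len) };
    }
}
```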