Add fast-and-unsound feature#485

Closed
woutersl wants to merge 1 commit into snapview:master from cenotelie:fast-and-unsound

Conversation

@woutersl commented Mar 4, 2025

Hi!

I have been experiencing some performance trouble that we traced down to the use of BytesMut::resize. I realize that this topic was recently discussed and that the decision was to use resize precisely to ensure soundness. I think this is a prudent decision for general use cases.

Still, I would like to propose this change so that, as an opt-in via a feature flag, it is possible to get top performance at the expense of soundness:

In some cases, speed can be preferable to soundness if we control the implementation of Read used for the underlying stream. By default the feature is NOT activated, keeping the current behavior as-is. When activated, it yields better performance, provided the user ensures correctness by verifying the behavior of the underlying Read implementation.

In my use case, which I admit is very narrow, this change fixes a catastrophic latency issue (the time spent zero-initializing memory dwarfs everything else).
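
To make the trade-off concrete, here is roughly what the flag changes, sketched over bytes::BytesMut (illustrative, not the PR's exact code):

```rust
use bytes::BytesMut;

// Sketch of what the feature flag changes: how the buffer's tail is
// made available to the underlying `Read` before each read call.
fn prepare_tail(buf: &mut BytesMut, n: usize) {
    let len = buf.len();

    #[cfg(not(feature = "fast-and-unsound"))]
    buf.resize(len + n, 0); // memset of `n` bytes on every single read

    #[cfg(feature = "fast-and-unsound")]
    {
        buf.reserve(n);
        // No memset: the tail is uninitialized memory. This is only
        // sound if the `Read` impl never reads from the buffer it is
        // handed and reports an accurate byte count.
        unsafe { buf.set_len(len + n) };
    }
}
```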

@daniel-abramov (Member) commented Mar 4, 2025

Regarding the performance impact:

  1. Could you please share how much of a difference you observe in your tests and share an SSCCE? This may help us understand your particular use case and find a better solution for your problem.
  2. Could you also check whether your problem might be related to #480 (Heavy memory usage difference between 0.24 and 0.25)? If the in_buffer's size is very large and it ends up zeroing the whole buffer on every single read, that might result in some noticeable differences.

Regarding the new fast-and-unsound feature: I agree that if there is a performance issue we must solve it, and I'm confident there is a way to solve it without making the safe function unsound. While it's true that performance is sometimes achieved at the expense of "safe behavior", the idiomatic Rust way of handling such issues is to provide an unsafe function to the end user, so that the user deliberately opts in to a certain behavior, understanding all the risks and responsibilities that come with that choice.

So I think that adding a feature that implicitly alters the internal implementation of a function somewhere deep in the library to make it "unsound, but fast" is not the right way to approach the problem. Documenting such a feature and properly explaining its meaning would also result in a very vague and confusing explanation. Let's try to find a more elegant solution!
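
To illustrate the pattern (hypothetical names, not tungstenite's actual API), the unsound path would be a separate unsafe fn with a documented contract, so the caller opts in at the call site rather than via a crate-wide flag:

```rust
use bytes::BytesMut;
use std::io::Read;

// Hypothetical shape of the explicit opt-in; `read_in_uninit` is an
// invented name, not part of tungstenite's API.
pub struct FrameReader<S> {
    stream: S,
    in_buffer: BytesMut,
}

impl<S: Read> FrameReader<S> {
    /// Safe default: zero-fills the read destination first.
    pub fn read_in(&mut self, n: usize) -> std::io::Result<usize> {
        let len = self.in_buffer.len();
        self.in_buffer.resize(len + n, 0);
        let read = self.stream.read(&mut self.in_buffer[len..])?;
        self.in_buffer.truncate(len + read);
        Ok(read)
    }

    /// # Safety
    /// The `Read` impl of `S` must only write to the buffer it is
    /// given, never read from it, and must return an accurate count.
    pub unsafe fn read_in_uninit(&mut self, n: usize) -> std::io::Result<usize> {
        let len = self.in_buffer.len();
        self.in_buffer.reserve(n);
        // SAFETY: length stays within reserved capacity; the
        // uninitialized tail is only written by `read` below, per
        // this function's safety contract.
        unsafe { self.in_buffer.set_len(len + n) };
        let read = self.stream.read(&mut self.in_buffer[len..])?;
        self.in_buffer.truncate(len + read);
        Ok(read)
    }
}
```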

@woutersl (Author) commented Mar 5, 2025

This is a new version that attempts to expose the unsound version through additional unsafe functions in the public API.
Do you think this is the direction to go?

I am still working through building a small use case that demonstrates this.

As far as I know, this is unrelated to #480.

@stephank

We're seeing the same issue after upgrading from 0.24 to 0.26.

I'm a little rusty when it comes to profiling and debugging performance issues, but I was also able to narrow it down to at least FrameCodec::read_in before finding this issue. I tried profiling with perf, which shows >80% of time in memset, which lines up with BytesMut::resize. I eventually found FrameCodec::read_in by simply interrupting a process under load with gdb. (I guess BytesMut::resize is inlined in my particular build.)

A very rough estimate of the observed perf difference is a factor of 60.

Sorry if this is otherwise light on additional details. I also wasn't yet able to test this PR in our setup.

@alexheretic (Contributor)

I think we can optimise the read impl to avoid having to zero fill on each read, while maintaining safety and without any additional copies. I'll raise a PR when I get some time 🙂
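
For the curious, one shape such a fix could take (a sketch under my own assumptions, not necessarily what the coming PR will do): zero the buffer exactly once at allocation and track the filled region with a cursor, so the hot read path never touches memset:

```rust
use std::io::Read;

struct ReadBuf {
    buf: Box<[u8]>, // zero-initialized once, here
    filled: usize,  // number of valid bytes at the front
}

impl ReadBuf {
    fn with_capacity(size: usize) -> Self {
        Self { buf: vec![0u8; size].into_boxed_slice(), filled: 0 }
    }

    /// Read into the unfilled tail; the memory is already
    /// initialized, so this is safe and involves no zeroing.
    fn read_from<R: Read>(&mut self, stream: &mut R) -> std::io::Result<usize> {
        let n = stream.read(&mut self.buf[self.filled..])?;
        self.filled += n;
        Ok(n)
    }

    /// Discard `n` consumed bytes from the front. A real
    /// implementation would avoid this copy by tracking a read
    /// offset instead.
    fn advance(&mut self, n: usize) {
        self.buf.copy_within(n..self.filled, 0);
        self.filled -= n;
    }
}
```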

@alexheretic (Contributor)

I was hoping the new e2e benchmarks could highlight this as a potential gain. However, when I run #496 against the unsound set_len variant, there doesn't seem to be a difference outside of noise.

group                #496                                   #496+unsound
-----                ----                                   ------------
send+recv/1 GiB      1.00   1212.3±2.19ms  1689.4 MB/sec    1.00  1209.7±21.52ms  1692.9 MB/sec
send+recv/128 MiB    1.02    179.0±0.85ms  1429.9 MB/sec    1.00    176.3±0.71ms  1452.1 MB/sec
send+recv/16 MiB     1.00     12.7±0.32ms     2.5 GB/sec    1.07     13.6±0.38ms     2.3 GB/sec
send+recv/2 MiB      1.00   935.2±21.91µs     4.2 GB/sec    1.01   940.0±10.86µs     4.2 GB/sec
send+recv/256 KiB    1.01    115.2±2.30µs     4.2 GB/sec    1.00    114.0±3.10µs     4.3 GB/sec
send+recv/32 KiB     1.04     29.3±0.50µs     2.1 GB/sec    1.00     28.2±0.54µs     2.2 GB/sec
send+recv/4 KiB      1.02     15.4±0.02µs   506.5 MB/sec    1.00     15.1±0.06µs   516.8 MB/sec
send+recv/512 B      1.00     13.3±0.24µs    73.4 MB/sec    1.01     13.4±0.04µs    72.9 MB/sec

Perhaps this issue was really the regression in large-message performance fixed in #496 rather than the zero-filling code?

While I think eliminating inefficient zero-fills is doable, it may not be worth any extra complexity if it isn't yielding measurable performance improvements.

If someone is still seeing a regression with #496 please detail your usage scenario so we can benchmark it.

@woutersl (Author)

The workload we have is lots of small messages at a high frequency (around 10,000/second, with bursts up to 100,000/second) rather than huge messages. At a high enough frequency, zeroing out memory becomes very noticeable in profiling.

@alexheretic (Contributor)

The new "send+recv" benches do cover high-frequency single-message perf. Post #496, I can't produce much difference in performance.

That is, unless I set a very high read buffer size, say 4 MiB (instead of the default 128 KiB). Then I see a diff between the two, particularly for small messages.

group                 resize(4M read buf)                    setlen(4M read buf)  
-----                 -------------------                    -------------------
send+recv/1 GiB       1.00  1187.6±20.59ms  1724.5 MB/sec    1.02   1214.9±3.45ms  1685.7 MB/sec
send+recv/128 MiB     1.01    179.6±2.53ms  1425.2 MB/sec    1.00    177.3±4.75ms  1443.8 MB/sec
send+recv/16 MiB      1.00     13.5±0.51ms     2.3 GB/sec    1.04     14.0±0.30ms     2.2 GB/sec
send+recv/2 MiB       1.00  1073.7±38.67µs     3.6 GB/sec    1.04  1111.8±16.95µs     3.5 GB/sec
send+recv/256 KiB     1.27    154.1±4.72µs     3.2 GB/sec    1.00    121.3±2.87µs     4.0 GB/sec
send+recv/32 KiB      1.31     35.9±0.31µs  1742.5 MB/sec    1.00     27.4±0.85µs     2.2 GB/sec
send+recv/4 KiB       1.81     28.0±0.04µs   279.2 MB/sec    1.00     15.5±0.04µs   504.6 MB/sec
send+recv/512 B       3.01     41.1±1.51µs    23.7 MB/sec    1.00     13.7±0.09µs    71.5 MB/sec

This makes sense, as each small-message read needs to fill 4 MiB with zeros; at 10,000 messages/second that alone is roughly 40 GB/s of memset.

And this perhaps also explains the existing performance regressions: even though 4 MiB is not the default, the old behaviour tried to read into (and therefore zero-fill) the entire available capacity. So it is feasible that some large-message usage caused the 128 KiB buffer to grow and then start showing bad perf.

So I think this is mostly resolved by #496, since that changes the read to only ever read at most the configured read_buffer_size (default 128 KiB) per call. With that, even if I manually grow the actual read buffer capacity to 4 MiB, the perf diff becomes difficult to see (presumably it is lost in other overheads); see the table and the sketch after it below.

group                resize(128K read buf)                  setlen(128K read buf)
-----                ---------------------                  ---------------------
send+recv/1 GiB      1.01  1310.2±20.95ms  1563.1 MB/sec    1.00   1297.3±7.94ms  1578.7 MB/sec
send+recv/128 MiB    1.00    196.9±2.47ms  1299.9 MB/sec    1.00    197.9±6.88ms  1293.5 MB/sec
send+recv/16 MiB     1.00     15.0±0.10ms     2.1 GB/sec    1.06     16.0±0.35ms  2002.4 MB/sec
send+recv/2 MiB      1.00  1145.5±33.09µs     3.4 GB/sec    1.09  1247.4±19.38µs     3.1 GB/sec
send+recv/256 KiB    1.05    139.0±0.96µs     3.5 GB/sec    1.00    131.8±3.76µs     3.7 GB/sec
send+recv/32 KiB     1.00     29.1±0.52µs     2.1 GB/sec    1.12     32.6±1.05µs  1919.9 MB/sec
send+recv/4 KiB      1.03     16.5±0.08µs   472.4 MB/sec    1.00     16.0±0.27µs   486.8 MB/sec
send+recv/512 B      1.00     13.9±0.32µs    70.5 MB/sec    1.01     14.0±0.05µs    69.9 MB/sec
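
For reference, the capped-read idea boils down to something like this (a simplified sketch, not #496's literal code):

```rust
use bytes::BytesMut;
use std::io::Read;

// Grow the buffer by at most `read_buffer_size` per call, so the
// per-read zero-fill is bounded by the configured size rather than
// by however large the buffer's capacity has grown.
fn read_capped<R: Read>(
    stream: &mut R,
    buf: &mut BytesMut,
    read_buffer_size: usize, // 128 * 1024 by default, per this thread
) -> std::io::Result<usize> {
    let len = buf.len();
    buf.resize(len + read_buffer_size, 0); // bounded memset
    let n = stream.read(&mut buf[len..])?;
    buf.truncate(len + n);
    Ok(n)
}
```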

My conclusion:

@daniel-abramov (Member)

Yeah, this sounds like a sensible approach!

@alexheretic (Contributor)

FYI safe & sound fix for this issue: #524
