Add fast-and-unsound feature#485

Closed
woutersl wants to merge 1 commit into snapview:master from cenotelie:fast-and-unsound

Conversation

@woutersl commented Mar 4, 2025

Hi!

I have been experiencing some performance trouble that we traced down to the use of BytesMut::resize. I realize that this topic was recently discussed and that the decision was to use resize precisely to ensure soundness. I think this is a prudent decision for general use cases.

Still, I would like to propose this change so that, as an opt-in via a feature flag, it is possible to get top performance at the expense of soundness:

In some cases, speed can be preferable to soundness if we control the implementation of Read used for the underlying stream. By default the feature is NOT activated, keeping the current behavior as-is. When activated, it yields better performance, provided the user ensures correctness by verifying the behavior of the underlying Read implementation.

In my use case, which I admit is very narrow, this change fixes a catastrophic latency issue (the time spent zero-initializing memory dwarfs everything else).
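
To make the trade-off concrete, here is roughly what the flag changes, sketched over bytes::BytesMut (illustrative, not the PR's exact code):

```rust
use bytes::BytesMut;

// Sketch of what the feature flag changes: how the buffer's tail is
// made available to the underlying `Read` before each read call.
fn prepare_tail(buf: &mut BytesMut, n: usize) {
    let len = buf.len();

    #[cfg(not(feature = "fast-and-unsound"))]
    buf.resize(len + n, 0); // memset of `n` bytes on every single read

    #[cfg(feature = "fast-and-unsound")]
    {
        buf.reserve(n);
        // No memset: the tail is uninitialized memory. This is only
        // sound if the `Read` impl never reads from the buffer it is
        // handed and reports an accurate byte count.
        unsafe { buf.set_len(len + n) };
    }
}
```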

@daniel-abramov (Member) commented Mar 4, 2025

Regarding the performance impact:

  1. Could you please share how much of a difference you observe in your tests and share an SSCCE? This may help us understand your particular use case and find a better solution for your problem.
  2. Could you also check whether your problem might be related to #480 (Heavy memory usage difference between 0.24 and 0.25)? If the in_buffer's size is very large and it ends up zeroing the whole buffer on every single read, that might result in some noticeable differences.

Regarding the new fast-and-unsound feature: I agree that if there is a performance issue we must solve it, and I'm confident there is a way to solve it without making the safe function unsound. While it's true that performance is sometimes achieved at the expense of "safe behavior", the idiomatic Rust way of handling such issues is to provide an unsafe function to the end user, so that the user deliberately opts in to a certain behavior, understanding all the risks and responsibilities that come with that choice.

So I think that adding a feature that implicitly alters the internal implementation of a function somewhere deep in the library to make it "unsound, but fast" is not the right way to approach the problem. Documenting such a feature and properly explaining its meaning would also result in a very vague and confusing explanation. Let's try to find a more elegant solution!
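
To illustrate the pattern (hypothetical names, not tungstenite's actual API), the unsound path would be a separate unsafe fn with a documented contract, so the caller opts in at the call site rather than via a crate-wide flag:

```rust
use bytes::BytesMut;
use std::io::Read;

// Hypothetical shape of the explicit opt-in; `read_in_uninit` is an
// invented name, not part of tungstenite's API.
pub struct FrameReader<S> {
    stream: S,
    in_buffer: BytesMut,
}

impl<S: Read> FrameReader<S> {
    /// Safe default: zero-fills the read destination first.
    pub fn read_in(&mut self, n: usize) -> std::io::Result<usize> {
        let len = self.in_buffer.len();
        self.in_buffer.resize(len + n, 0);
        let read = self.stream.read(&mut self.in_buffer[len..])?;
        self.in_buffer.truncate(len + read);
        Ok(read)
    }

    /// # Safety
    /// The `Read` impl of `S` must only write to the buffer it is
    /// given, never read from it, and must return an accurate count.
    pub unsafe fn read_in_uninit(&mut self, n: usize) -> std::io::Result<usize> {
        let len = self.in_buffer.len();
        self.in_buffer.reserve(n);
        // SAFETY: length stays within reserved capacity; the
        // uninitialized tail is only written by `read` below, per
        // this function's safety contract.
        unsafe { self.in_buffer.set_len(len + n) };
        let read = self.stream.read(&mut self.in_buffer[len..])?;
        self.in_buffer.truncate(len + read);
        Ok(read)
    }
}
```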

@woutersl (Author) commented Mar 5, 2025

This is a new version that attempts to expose the unsound version through additional unsafe functions in the public API.
Do you think this is the direction to go?

I am still working through building a small use case that demonstrates this.

As far as I know, this is unrelated to #480.

@stephank

We're seeing the same issue after upgrading from 0.24 to 0.26.

I'm a little rusty when it comes to profiling and debugging performance issues, but I was also able to narrow it down to at least FrameCodec::read_in before finding this issue. I tried profiling with perf, which shows >80% of time in memset, which lines up with BytesMut::resize. I eventually found FrameCodec::read_in by simply interrupting a process under load with gdb. (I guess BytesMut::resize is inlined in my particular build.)

A very rough estimate of the observed perf difference is a factor of 60.

Sorry if this is otherwise light on additional details. I also wasn't yet able to test this PR in our setup.

@alexheretic (Contributor)

I think we can optimise the read impl to avoid having to zero fill on each read, while maintaining safety and without any additional copies. I'll raise a PR when I get some time 🙂
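
For the curious, one shape such a fix could take (a sketch under my own assumptions, not necessarily what the coming PR will do): zero the buffer exactly once at allocation and track the filled region with a cursor, so the hot read path never touches memset:

```rust
use std::io::Read;

struct ReadBuf {
    buf: Box<[u8]>, // zero-initialized once, here
    filled: usize,  // number of valid bytes at the front
}

impl ReadBuf {
    fn with_capacity(size: usize) -> Self {
        Self { buf: vec![0u8; size].into_boxed_slice(), filled: 0 }
    }

    /// Read into the unfilled tail; the memory is already
    /// initialized, so this is safe and involves no zeroing.
    fn read_from<R: Read>(&mut self, stream: &mut R) -> std::io::Result<usize> {
        let n = stream.read(&mut self.buf[self.filled..])?;
        self.filled += n;
        Ok(n)
    }

    /// Discard `n` consumed bytes from the front. A real
    /// implementation would avoid this copy by tracking a read
    /// offset instead.
    fn advance(&mut self, n: usize) {
        self.buf.copy_within(n..self.filled, 0);
        self.filled -= n;
    }
}
```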

@alexheretic (Contributor)

I was hoping the new e2e benchmarks could highlight this as a potential gain. However, when I run #496 against the unsound set_len variant, there doesn't seem to be a difference outside of noise.

group                #496                                   #496+unsound
-----                ----                                   ------------
send+recv/1 GiB      1.00   1212.3±2.19ms  1689.4 MB/sec    1.00  1209.7±21.52ms  1692.9 MB/sec
send+recv/128 MiB    1.02    179.0±0.85ms  1429.9 MB/sec    1.00    176.3±0.71ms  1452.1 MB/sec
send+recv/16 MiB     1.00     12.7±0.32ms     2.5 GB/sec    1.07     13.6±0.38ms     2.3 GB/sec
send+recv/2 MiB      1.00   935.2±21.91µs     4.2 GB/sec    1.01   940.0±10.86µs     4.2 GB/sec
send+recv/256 KiB    1.01    115.2±2.30µs     4.2 GB/sec    1.00    114.0±3.10µs     4.3 GB/sec
send+recv/32 KiB     1.04     29.3±0.50µs     2.1 GB/sec    1.00     28.2±0.54µs     2.2 GB/sec
send+recv/4 KiB      1.02     15.4±0.02µs   506.5 MB/sec    1.00     15.1±0.06µs   516.8 MB/sec
send+recv/512 B      1.00     13.3±0.24µs    73.4 MB/sec    1.01     13.4±0.04µs    72.9 MB/sec

Perhaps this issue was really the regression in large-message performance fixed in #496 rather than the zero-filling code?

While I think eliminating inefficient zero-fills is doable, it may not be worth any extra complexity if it isn't yielding measurable performance improvements.

If someone is still seeing a regression with #496 please detail your usage scenario so we can benchmark it.

@woutersl (Author)

The workload we have is lots of small messages at a high frequency (around 10,000/second, with bursts up to 100,000/second) rather than huge messages. At a high enough frequency, zeroing out memory becomes very noticeable in profiling.

@alexheretic (Contributor)

The new "send+recv" benches do cover high-frequency single-message perf. Post #496, I can't produce much difference in performance.

That is, unless I set a very high read buffer size, say 4 MiB (instead of the default 128 KiB). Then I see a diff between the two, particularly for small messages.

group                 resize(4M read buf)                    setlen(4M read buf)  
-----                 -------------------                    -------------------
send+recv/1 GiB       1.00  1187.6±20.59ms  1724.5 MB/sec    1.02   1214.9±3.45ms  1685.7 MB/sec
send+recv/128 MiB     1.01    179.6±2.53ms  1425.2 MB/sec    1.00    177.3±4.75ms  1443.8 MB/sec
send+recv/16 MiB      1.00     13.5±0.51ms     2.3 GB/sec    1.04     14.0±0.30ms     2.2 GB/sec
send+recv/2 MiB       1.00  1073.7±38.67µs     3.6 GB/sec    1.04  1111.8±16.95µs     3.5 GB/sec
send+recv/256 KiB     1.27    154.1±4.72µs     3.2 GB/sec    1.00    121.3±2.87µs     4.0 GB/sec
send+recv/32 KiB      1.31     35.9±0.31µs  1742.5 MB/sec    1.00     27.4±0.85µs     2.2 GB/sec
send+recv/4 KiB       1.81     28.0±0.04µs   279.2 MB/sec    1.00     15.5±0.04µs   504.6 MB/sec
send+recv/512 B       3.01     41.1±1.51µs    23.7 MB/sec    1.00     13.7±0.09µs    71.5 MB/sec

This makes sense, as each small-message read needs to fill 4 MiB with zeros; at 10,000 messages/second that alone is roughly 40 GB/s of memset.

And this perhaps also explains the existing performance regressions: even though 4 MiB is not the default, the old behaviour tried to read into (and therefore zero-fill) the entire available capacity. So it is feasible that some large-message usage caused the 128 KiB buffer to grow and then start showing bad perf.

So I think this is mostly resolved by #496, since that changes the read to only ever read at most the configured read_buffer_size (default 128 KiB) per call. With that, even if I manually grow the actual read buffer capacity to 4 MiB, the perf diff becomes difficult to see (presumably it is lost in other overheads); see the table and the sketch after it below.

group                resize(128K read buf)                  setlen(128K read buf)
-----                ---------------------                  ---------------------
send+recv/1 GiB      1.01  1310.2±20.95ms  1563.1 MB/sec    1.00   1297.3±7.94ms  1578.7 MB/sec
send+recv/128 MiB    1.00    196.9±2.47ms  1299.9 MB/sec    1.00    197.9±6.88ms  1293.5 MB/sec
send+recv/16 MiB     1.00     15.0±0.10ms     2.1 GB/sec    1.06     16.0±0.35ms  2002.4 MB/sec
send+recv/2 MiB      1.00  1145.5±33.09µs     3.4 GB/sec    1.09  1247.4±19.38µs     3.1 GB/sec
send+recv/256 KiB    1.05    139.0±0.96µs     3.5 GB/sec    1.00    131.8±3.76µs     3.7 GB/sec
send+recv/32 KiB     1.00     29.1±0.52µs     2.1 GB/sec    1.12     32.6±1.05µs  1919.9 MB/sec
send+recv/4 KiB      1.03     16.5±0.08µs   472.4 MB/sec    1.00     16.0±0.27µs   486.8 MB/sec
send+recv/512 B      1.00     13.9±0.32µs    70.5 MB/sec    1.01     14.0±0.05µs    69.9 MB/sec
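
For reference, the capped-read idea boils down to something like this (a simplified sketch, not #496's literal code):

```rust
use bytes::BytesMut;
use std::io::Read;

// Grow the buffer by at most `read_buffer_size` per call, so the
// per-read zero-fill is bounded by the configured size rather than
// by however large the buffer's capacity has grown.
fn read_capped<R: Read>(
    stream: &mut R,
    buf: &mut BytesMut,
    read_buffer_size: usize, // 128 * 1024 by default, per this thread
) -> std::io::Result<usize> {
    let len = buf.len();
    buf.resize(len + read_buffer_size, 0); // bounded memset
    let n = stream.read(&mut buf[len..])?;
    buf.truncate(len + n);
    Ok(n)
}
```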

My conclusion:

@daniel-abramov (Member)

Yeah, this sounds like a sensible approach!

@alexheretic (Contributor)

FYI safe & sound fix for this issue: #524
