
Faster implementation of InputBuffer #214

Merged
daniel-abramov merged 3 commits into master from faster-input-buffer
Jul 8, 2021

Conversation

@daniel-abramov
Member

@daniel-abramov daniel-abramov commented Jul 5, 2021

Fixes #210.

If you don't know the context, please check the discussion and decision in the linked issue, as well as the latest PR in input_buffer.

The new simplified implementation only makes a significant difference on the x86_64 architecture. On Armv8.4-A A64 (Apple M1) both versions perform really well, and the new version is even marginally (though not significantly) slower there. On Intel (x86_64), however, the difference is substantial.

===== M1 Mac (Armv8.4-A A64) =====

buffers/InputBuffer     time:   [543.66 us 549.32 us 555.54 us]
                        thrpt:  [7.0315 GiB/s 7.1110 GiB/s 7.1851 GiB/s]

buffers/ReadBuffer      time:   [552.16 us 565.83 us 581.26 us]
                        thrpt:  [6.7203 GiB/s 6.9036 GiB/s 7.0745 GiB/s]


===== Intel (x86_64) =====

buffers/InputBuffer     time:   [4.7392 ms 4.7601 ms 4.7817 ms]
                        thrpt:  [836.52 MiB/s 840.31 MiB/s 844.03 MiB/s]

buffers/ReadBuffer      time:   [726.86 us 730.72 us 734.81 us]
                        thrpt:  [5.3160 GiB/s 5.3457 GiB/s 5.3741 GiB/s]

So migrating to the newer ReadBuffer seems to make sense, as the "average performance gain" is quite significant (and I believe most users are still on x86_64).

⚠️ Notes:

  1. The PR also removes a couple of error variants that could have been triggered by tungstenite-rs. These were removed because they could only occur if the incoming message did not fit into the hardcoded u32::max_value() limit. We currently have max_frame_size and max_message_size, but we don't use them to cap the size of the input/read buffer; instead we compare the values after each read. In theory this is not a problem when using tungstenite-rs with non-blocking sockets (as in tokio-tungstenite), though it could pose an issue for those who use blocking sockets. In other words, the previous u32::max_value() check did not really add any "additional protection", so I removed those errors altogether to avoid complicating the ReadBuffer implementation.
  2. There is no automation to run criterion benchmarks in our CI. That's fine, as criterion is not meant to be used that way and may produce many false positives and false negatives in a CI environment. To automate benchmarking in CI we would have to use a tool like iai, which was written for exactly that purpose (it works differently than criterion), but I think that's better done as a separate PR.
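For readers unfamiliar with the design being discussed, here is a minimal sketch of the chunked-read idea behind a read buffer. All names, the chunk size, and the cursor layout are hypothetical illustrations, not the actual tungstenite-rs `ReadBuffer` code:

```rust
// Hypothetical sketch of a chunked read buffer, NOT the actual
// tungstenite-rs implementation: the stream is read in fixed-size
// chunks and appended to a growable Vec wrapped in a Cursor.
use std::io::{Cursor, Read, Result};

/// Buffer that pulls data from a stream in `CHUNK`-sized reads and
/// tracks the consumer's position with a cursor.
struct ReadBuffer<const CHUNK: usize> {
    data: Cursor<Vec<u8>>,
}

impl<const CHUNK: usize> ReadBuffer<CHUNK> {
    fn new() -> Self {
        Self { data: Cursor::new(Vec::new()) }
    }

    /// Read one chunk from the stream into the buffer, returning the
    /// number of bytes read (0 on EOF).
    fn read_from<S: Read>(&mut self, stream: &mut S) -> Result<usize> {
        let mut chunk = [0u8; CHUNK];
        let n = stream.read(&mut chunk)?;
        self.data.get_mut().extend_from_slice(&chunk[..n]);
        Ok(n)
    }
}

fn main() -> Result<()> {
    // A byte slice acts as the "stream" for this sketch.
    let mut src: &[u8] = b"hello websocket";
    let mut buf: ReadBuffer<4> = ReadBuffer::new();
    while buf.read_from(&mut src)? > 0 {}
    assert_eq!(buf.data.get_ref().as_slice(), b"hello websocket");
    println!("buffered {} bytes", buf.data.get_ref().len());
    Ok(())
}
```

The key property the benchmarks above exercise is that appending to a `Vec` amortizes reallocation, rather than shuffling bytes around on every read.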

P.S.: Thanks @qiujiangkun for bringing up this topic and for the performance testing of tungstenite-rs 👍

@agalakhov
Member

Is there any size limit now? Is it possible to DoS it via memory exhaustion?

@daniel-abramov
Member Author

Is there any size limit now? Is it possible to DoS it via memory exhaustion?

Do you mean the points I mentioned in (1)? If so, I think there is no problem here: the previous limit on the InputBuffer was a hardcoded usize::max_value(), which I believe could not even be reached under normal circumstances, because max_frame_size and max_message_size were (and still are) checked during message assembly. Since we read in chunks, the only way to trigger the error from the InputBuffer was for the user to set either max_frame_size or max_message_size in the WebSocketConfig to Some(usize::MAX). The other case where the limit could be hit is the initial handshake.

I don't think the new implementation is worse than the previous one in those particular cases: even if we get to usize::MAX, the underlying Vec is at its maximum size (len is a usize), and pushing to such a Vec panics. But the old implementation did not handle this any better, since it tried to return an error by summing and comparing usize values without taking overflow into account; adding a reserve to a usize already at its maximum would panic in the old implementation as well.

In other words, the new drop-in replacement is probably not perfect (there is still a way to hit a panic), but it is at least no worse than the previous implementation. Please correct me if I'm wrong or missed something.
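To illustrate the overflow point above, here is a standalone sketch (not tungstenite-rs code) of how unchecked usize addition differs from explicit checked arithmetic:

```rust
fn main() {
    // Naive length bookkeeping: computing `usize::MAX + reserve`
    // directly would overflow, which panics in debug builds and wraps
    // silently in release builds (unless overflow checks are enabled).
    let current_len = usize::MAX;
    let reserve = 1usize;

    // `checked_add` surfaces the overflow as `None` instead of
    // panicking or wrapping, so it can be turned into an error.
    assert_eq!(current_len.checked_add(reserve), None);

    // `saturating_add` is another explicit option: it clamps at MAX.
    assert_eq!(current_len.saturating_add(reserve), usize::MAX);

    println!("overflow handled explicitly");
}
```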

However, while replying to your comment and checking the unit tests (trying to extend them), I noticed another thing: the new implementation did not clean up the data already consumed by the cursor (which is normally no longer needed). So I pushed a new commit that implements behavior similar to the previous remove_garbage stage in the InputBuffer and extended the unit test to account for that.

With that step implemented, the new ReadBuffer is still faster:

buffers/InputBuffer     time:   [3.1572 ms 3.1728 ms 3.1904 ms]                                 
                        thrpt:  [1.2244 GiB/s 1.2312 GiB/s 1.2373 GiB/s]

buffers/ReadBuffer      time:   [1.2585 ms 1.2659 ms 1.2737 ms]                                
                        thrpt:  [3.0668 GiB/s 3.0857 GiB/s 3.1038 GiB/s]
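As a hedged sketch of what such a remove_garbage-style cleanup stage might look like (names are hypothetical, not the actual commit):

```rust
// Hypothetical sketch of a `remove_garbage`-style cleanup: drop the
// bytes the cursor has already consumed so the buffer does not grow
// without bound. Names are illustrative, not the actual tungstenite-rs
// code from the commit discussed above.
fn remove_consumed(buf: &mut Vec<u8>, cursor_pos: &mut usize) {
    // Drain everything before the cursor, then reset the cursor to the
    // new start of the buffer.
    buf.drain(..*cursor_pos);
    *cursor_pos = 0;
}

fn main() {
    let mut buf = b"already-read|pending".to_vec();
    let mut pos = "already-read|".len();

    remove_consumed(&mut buf, &mut pos);

    assert_eq!(buf, b"pending");
    assert_eq!(pos, 0);
    println!("{} bytes retained", buf.len());
}
```

`Vec::drain` shifts the remaining bytes to the front in one pass, which is presumably why the extra cleanup step still leaves the ReadBuffer ahead of the old InputBuffer in the numbers above.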

@daniel-abramov daniel-abramov merged commit c742916 into master Jul 8, 2021
@daniel-abramov daniel-abramov deleted the faster-input-buffer branch March 22, 2023 23:13
takumi-earth pushed a commit to earthlings-dev/tungstenite-rs that referenced this pull request Jan 27, 2026

Development

Successfully merging this pull request may close these issues.

Bad performance with input_buffer > 3.0
