
Websocket lifecycle fixes #2411

Open
stephenberry wants to merge 7 commits into main from websocket

Conversation

@stephenberry
Owner

@stephenberry stephenberry commented Mar 30, 2026

Fix websocket_client crash on Windows IOCP (issue #2409)

Problem

websocket_client crashes with STATUS_ACCESS_VIOLATION (0xC0000005) on Windows IOCP during run(). The crash occurs at op->complete() in ASIO's IOCP event loop, intermittently, during the SSL handshake / early connection phase.

Root cause

This is ASIO issue #312 — an unfixed upstream bug. When PostQueuedCompletionStatus fails due to IOCP resource exhaustion, ASIO falls back to an internal completed_ops_ queue. But the completion key (overlapped_contains_result) is only passed as an argument to PostQueuedCompletionStatus and never stored on the operation object. When do_one() later picks up the operation from the fallback queue, it dispatches with the wrong completion key, causing op->complete() to be called with incorrect error codes and byte counts — leading to undefined behavior and crashes.

Fix

ASIO IOCP patch (cmake/asio-iocp-fix.cmake): Applied automatically to bundled ASIO (FetchContent path). Based on MongoDB's proven patch:

  • Adds a completionKey_ member to win_iocp_operation to store the completion key on the operation object
  • Stores the key before calling PostQueuedCompletionStatus, so it survives the fallback path
  • Passes op->completionKey() instead of a local variable when re-posting

The patch is 16 lines across 2 ASIO internal headers. It is idempotent (detects if already applied) and only runs for the bundled ASIO path. Users with system-installed ASIO or Boost.Asio should apply the patch to their ASIO installation if they experience the crash.

cancel_all() safety (websocket_client.hpp): Cancel and close raw sockets before resetting their shared_ptrs, then drain pending IOCP completion handlers with ctx->poll(). During the handshake phase, the websocket_connection doesn't exist yet, so force_close() is a no-op — the raw sockets must be cancelled directly.

Handler ordering (websocket_client.hpp): Set on_message/on_close/on_error handlers on the websocket_connection before calling set_initial_data(), so WebSocket frames that arrive in the same TCP segment as the HTTP upgrade response are not silently dropped.

New tests (websocket_client_lifetime_test.cpp) — 8 tests

  • Destroy-during-resolve, destroy-during-connect, destroy-during-handshake, destroy-during-active-connection
  • Rapid connect/destroy stress test (50 cycles at varying timings)
  • Destroy after connection failure, shared context destruction order
  • immediate_server_message_received: server sends a message immediately on open; verifies the client receives it (fails ~25% without the handler ordering fix)

@GTruf

GTruf commented Mar 31, 2026

@stephenberry, hello, I used your websocket branch and verified that the code in the library is indeed the updated version, but unfortunately, the error persists. Based on the logs from asio, it’s still crashing at op->complete.

[image]

Try my code example:

#include "glaze/net/websocket_client.hpp"

#include <iostream>
#include <string_view>
#include <system_error>

int main() {
    glz::websocket_client client;

    client.set_ssl_verify_mode(asio::ssl::verify_none);

    client.on_open([]() {
        std::cout << "Connected to WebSocket server!" << std::endl;
    });

    client.on_message([](std::string_view message, glz::ws_opcode opcode) {
        std::cout << message << std::endl;
    });

    client.on_close([](glz::ws_close_code code, std::string_view reason) {
        std::cout << "Connection closed with code: " << static_cast<int>(code);
        if (!reason.empty()) {
            std::cout << ", reason: " << reason;
        }
        std::cout << std::endl;
    });

    client.on_error([](std::error_code ec) {
        std::cerr << "Error: " << ec.message() << std::endl;
    });

    client.connect("wss://wseea.okx.com:8443/ws/v5/public");
    client.run();

    return 0;
}

The same problem:
[image]

[image]

@stephenberry
Owner Author

stephenberry commented Mar 31, 2026

@GTruf, the issue is with the handshake/upgrade being asynchronous. I've now made it synchronous, but the current fix spawns a thread for each client that connects. I want to make a long term asynchronous solution, but this might be a valid short term fix. Are you expecting to deal with tens of thousands of websocket clients?

@GTruf

GTruf commented Mar 31, 2026

@stephenberry, not tens of thousands, but in the long run it could be several dozen or a few hundred. A thread per connection doesn't seem like the best solution.

@stephenberry
Owner Author

Yeah, I don't like the fixes I've tried so far. The problem is that the original code looks correct, so this is probably a Windows IOCP bug. The synchronous handshake sidesteps the problem by never registering the socket with IOCP during the handshake phase. The only major cost is that handshakes are serialized when multiple clients share one io_context, which is an uncommon pattern for a WebSocket client.

@GTruf

GTruf commented Mar 31, 2026

@stephenberry, so, for now, you’ll keep the synchronous version specifically for Windows and the asynchronous version for Unix? If so, could you please test my example on Unix (Ubuntu 24/25, for example) using your asynchronous version?

And what are your long-term plans for Windows? Will you open an issue for asio, or will you try to find workarounds?

@GTruf

GTruf commented Mar 31, 2026

@stephenberry, in this issue, someone added a commit whose title suggests it fixes the problem. Have you tried applying that patch to the ASIO code? It might just work.

[image]


# Apply IOCP fix for issue #312 (Windows crash in op->complete)
include(${CMAKE_CURRENT_SOURCE_DIR}/../cmake/asio-iocp-fix.cmake)
apply_asio_iocp_fix("${asio_SOURCE_DIR}/asio/include")

@GTruf GTruf Mar 31, 2026


If all of this works, a check will be needed so that the patch is applied only when the OS is Windows.

# Apply IOCP fix for issue #312 (Windows crash in op->complete)
include(${CMAKE_CURRENT_SOURCE_DIR}/../cmake/asio-iocp-fix.cmake)
apply_asio_iocp_fix("${asio_SOURCE_DIR}/asio/include")
if(WIN32)

@GTruf GTruf Mar 31, 2026


@stephenberry, Was the original IOCP problem finally solved?

@stephenberry
Owner Author

@GTruf, I'm going to run tests on Windows and get back to you soon. Still trying to understand the whole of the problem.


Development

Successfully merging this pull request may close these issues.

CRITICAL BLOCKER BUG: crash on websocket_client::run()

2 participants