Skip to content

WebSocket auth handshake: tokens off URLs, in-band refresh#1502

Open
JSv4 wants to merge 15 commits intomainfrom
feature/websocket-auth-handshake
Open

WebSocket auth handshake: tokens off URLs, in-band refresh#1502
JSv4 wants to merge 15 commits intomainfrom
feature/websocket-auth-handshake

Conversation

@JSv4
Copy link
Copy Markdown
Collaborator

@JSv4 JSv4 commented May 3, 2026

Summary

Move WebSocket JWTs off URL query strings (where they leak into nginx logs, browser history, Referer headers, Sentry breadcrumbs, CDN/WAF logs) onto the standard Sec-WebSocket-Protocol handshake header, with an in-band {type:"AUTH",token:...} refresh protocol so token rotation no longer reconnects the socket.

Hard cutover: ?token= URL parameters are stripped by the new middleware and ignored. Stale browser tabs from before the deploy must reload to recover.

What changed

Backend (config/websocket/)

  • middleware.py — rewritten. Reads tokens only from Sec-WebSocket-Protocol (["opencontracts.jwt.v1", "<jwt>"]). Query string + Authorization header no longer consulted. Sets scope["accepted_subprotocol"] so consumers can echo it on accept.
  • auth_handshake.py — new AuthHandshakeMixin. Provides accept_with_auth() (echoes subprotocol + emits initial AUTH_OK), handle_auth_message() (in-band refresh: refuses user-pk swap → 4002, refuses revoked permissions → 4003, swaps scope["user"] on success), request_token_refresh() (server-nudge with grace-timer that closes 4001 on timeout), cleanup_auth_handshake().
  • consumers/{unified_agent_conversation,thread_updates,notification_updates}.py — all three integrate the mixin. UnifiedAgent + ThreadUpdates override _validate_resource_permissions so refresh re-runs corpus/document/conversation access checks.

Frontend (frontend/src/)

  • utils/websocketAuth.ts — new. Constants (WS_AUTH_SUBPROTOCOL, close codes), tiny pure helpers (buildAuthProtocols, buildAuthMessage, parseAuthMessage).
  • hooks/useWebSocketAuth.ts — new shared hook. Owns one WebSocket, intercepts AUTH frames, fires in-band refresh on authToken reactive-var change, close-code-aware reconnect policy.
  • components/chat/get_websockets.ts — token parameter stripped from all builders. Deprecated getDocumentQueryWebSocket and getCorpusQueryWebSocket deleted.
  • hooks/useAgentChat.ts, hooks/useNotificationWebSocket.ts, components/corpuses/CorpusChat.tsx, components/knowledge_base/document/utils.ts, components/knowledge_base/document/right_tray/ChatTray.tsx, plus useBadgeNotifications/useExtractCompletionNotification/useJobNotifications — all updated to compose the shared hook and stop reading authToken directly.

Wire protocol

Client → Server  (refresh):    {"type":"AUTH","token":"<jwt>"}
Server → Client  (initial):    {"type":"AUTH_OK","user_id":...,"anonymous":false,"refreshed":false}
Server → Client  (refresh):    {"type":"AUTH_OK","refreshed":true}
Server → Client  (refusal):    {"type":"AUTH_FAILED","reason":"EXPIRED|INVALID|USER_MISMATCH|PERMISSION_REVOKED"}
Server → Client  (server nudge): {"type":"AUTH_REFRESH_REQUIRED","grace_seconds":30}

Close codes: 4000 unauthenticated, 4001 token expired, 4002 token invalid (or pk-mismatch on refresh), 4003 permission denied (or revoked on refresh), 4029 rate-limited.

Test plan

  • docker compose -f test.yml run --rm django pytest opencontractserver/tests/test_websocket_auth.py -v — middleware unit tests + mixin unit tests + per-consumer integration tests + ?token= regression tests + user-pk-swap refusal
  • cd frontend && yarn test:unit run src/utils/__tests__/websocketAuth.test.ts src/hooks/__tests__/useWebSocketAuth.test.ts src/hooks/__tests__/useNotificationWebSocket.auth.test.ts src/hooks/__tests__/useAgentChat.reconnection.test.ts
  • cd frontend && yarn lint && yarn tsc --noEmit
  • Manual sanity: docs/test_scripts/websocket-auth-handshake.md — confirm browser DevTools shows no ?token= in WS request URL, Sec-WebSocket-Protocol: opencontracts.jwt.v1, <jwt> header on the handshake, AUTH_OK as first frame, single AUTH frame on Auth0 silent renewal (no socket churn)

Risk

  • Hard cutover means stale browser tabs from before the deploy must reload. No backward-compat for ?token=.
  • If a reverse proxy strips Sec-WebSocket-Protocol, all auth fails. nginx forwards by default; deployment-time config check covered in docs/test_scripts/websocket-auth-handshake.md.

JSv4 added 12 commits May 3, 2026 16:05
…l only

Removes the historical query-string (?token=) and Authorization-header token
extraction paths. Tokens are now read exclusively from the Sec-WebSocket-Protocol
handshake header (format: "opencontracts.jwt.v1, <jwt>"), which is the only
channel a browser WebSocket constructor can populate without a custom header.

The subprotocol marker is always echoed on scope["accepted_subprotocol"] when
present — even on auth failure — so consumers can close cleanly with a 4xxx code
rather than failing at the transport layer. Auth errors are surfaced on
scope["auth_error"] with a typed close code (4001 expired, 4002 invalid).

Exports: WS_AUTH_SUBPROTOCOL, WS_CLOSE_UNAUTHENTICATED, WS_CLOSE_TOKEN_EXPIRED,
WS_CLOSE_TOKEN_INVALID, WS_CLOSE_RATE_LIMITED. Backwards-compat alias
GraphQLJWTTokenAuthMiddleware kept until consumers are updated.

Adds JWTAuthMiddlewareSubprotocolTests (6 tests) covering: no header →
anonymous, marker-only → anonymous + echoed subprotocol, valid token →
authenticated, invalid token → 4002 error + echoed subprotocol, ?token=
ignored, Authorization header ignored.
…perm checks

- UnifiedAgentConsumer now inherits AuthHandshakeMixin (before AsyncWebsocketConsumer
  in MRO) and calls accept_with_auth() / cleanup_auth_handshake() in connect /
  disconnect.
- Extracted corpus/document permission checks into _validate_resource_permissions()
  override so the mixin can re-validate on in-band token refresh.
- receive() fast-paths AUTH frames to handle_auth_message() before rate-limit
  gates or agent dispatch.
- JWTAuthMiddleware._parse_subprotocol_token() now also checks
  scope["subprotocols"] (ASGI spec field set by WebsocketCommunicator in tests)
  in addition to the Sec-WebSocket-Protocol header; production browser behaviour
  is unchanged.
- Updated GraphQLJWTTokenAuthMiddlewareTestCase to use subprotocol transport
  and drain the initial AUTH_OK frame before sending queries.
- Added UnifiedAgentHandshakeTests covering in-band refresh success and
  user-mismatch rejection.
- Added type: ignore comments to suppress pre-existing mypy mixin attr errors
  in auth_handshake.py and the method-assign error in the test fixture.
- Add AuthHandshakeMixin to MRO before AsyncWebsocketConsumer
- Replace accept() with accept_with_auth() to emit initial AUTH_OK frame
- Add AUTH fast-path at top of receive() for in-band token refresh
- Add _validate_resource_permissions() override re-checking conversation
  access (owner, superuser, corpus/document visibility) on refresh
- Call cleanup_auth_handshake() in disconnect()
- Add ThreadUpdatesHandshakeTests (3 tests: valid token, no token 4000,
  in-band refresh)
- Add AuthHandshakeMixin to MRO before AsyncWebsocketConsumer
- Replace accept() with accept_with_auth() to emit initial AUTH_OK frame
  before the existing CONNECTED frame
- Add AUTH fast-path at top of receive() for in-band token refresh
- Call cleanup_auth_handshake() in disconnect()
- No _validate_resource_permissions() override needed; consumer is bound
  to the authenticated user themselves (user-pk swap already forbidden by
  the mixin)
- Add NotificationUpdatesHandshakeTests (3 tests: AUTH_OK+CONNECTED pair,
  no token 4001, in-band refresh)
…pusQuery wrappers

Tokens are now carried exclusively via the Sec-WebSocket-Protocol subprotocol
header (useWebSocketAuth). The three remaining URL builders — getUnifiedAgentWebSocket,
getThreadUpdatesWebSocket, getNotificationUpdatesWebSocket — no longer accept or embed
token parameters. getDocumentQueryWebSocket and getCorpusQueryWebSocket are deleted.
Replaces bespoke WebSocket lifecycle with the shared useWebSocketAuth hook.
Token rotation is now in-band (no socket churn). The autoReconnect option is
removed from the public API since reconnect behavior is owned by useWebSocketAuth.
Callers that passed autoReconnect: true have the option removed (it remains the default).
Deletes local getEnvVar / resolveWsBaseUrl / getUnifiedAgentWebSocketUrl helpers;
delegates to getUnifiedAgentWebSocket from get_websockets.ts. The WebSocket lifecycle
(open, close, reconnect, in-band token refresh) is now owned by useWebSocketAuth.
reconnectTrigger state and auth_token reactive-var read are removed.
Test file updated to import getUnifiedAgentWebSocket from get_websockets and
assert no token in URL.
CorpusChat: import getUnifiedAgentWebSocket instead of getCorpusQueryWebSocket;
drop auth_token from URL builder call and effect dep array.
document/utils.ts: drop token parameter from getWebSocketUrl; remove @deprecated
JSDoc block.
ChatTray: update getWebSocketUrl call to drop token arg; remove now-unused
auth_token reactive-var read and authToken import.
Comment thread config/websocket/auth_handshake.py Fixed
@claude
Copy link
Copy Markdown

claude Bot commented May 3, 2026

Code Review: WebSocket auth handshake — tokens off URLs, in-band refresh

This is a well-motivated security improvement with solid architecture. The core approach is correct and the test coverage is genuinely impressive. There are a few issues worth addressing before merge.


Overview

Moves JWT transport from URL query strings (logged everywhere) to Sec-WebSocket-Protocol headers, and eliminates socket reconnect on token rotation via an in-band AUTH frame protocol. Hard cutover — ?token= stops working.


Critical Bug: CorpusChat.tsx and ChatTray.tsx are not migrated

Both components still create sockets directly with new WebSocket(wsUrl) — no subprotocols. The token was removed from the URL but never replaced with the subprotocol mechanism. These connections will hit the middleware as anonymous users and be rejected.

CorpusChat.tsx (~line 335 in the diff):

const wsUrl = getUnifiedAgentWebSocket({ corpusId, conversationId: ... });
const newSocket = new WebSocket(wsUrl);  // ← no protocols[]

ChatTray.tsx (~line 777 in the diff):

const wsUrl = getWebSocketUrl(documentId, selectedConversationId, corpusId);
const newSocket = new WebSocket(wsUrl);  // ← no protocols[]

Both need to be refactored to use useWebSocketAuth (or at minimum pass buildAuthProtocols(token) as the second argument to new WebSocket()), the same way useAgentChat and useNotificationWebSocket were migrated. Without this fix, corpus chat and document chat tray are completely broken for authenticated users after this deploy.


Bug: Stale ws value returned from useWebSocketAuth

return {
  ws: wsRef.current,  // evaluated at render time — stale until next render
  ...
};

wsRef.current is set inside a useEffect (after render), but the return value is evaluated during render. The first render after mount returns null; after the effect runs and the socket opens, ws stays stale until something triggers a re-render. The consumers work around this with their own useEffect(() => { socketRef.current = ws; }, [ws]), but that too will only fire when ws changes between renders — and since refs don't cause re-renders, it only fires coincidentally.

Since send and reconnect are already exposed and consumers use those, consider simply not returning ws at all to remove the footgun, or expose it as a ref directly.


Bug: Rate limit bypassed for AUTH frame spam

In all three consumers, AUTH frames are dispatched before the rate-limit check:

if isinstance(_payload, dict) and _payload.get("type") == "AUTH":
    await self.handle_auth_message(_payload)
    return  # ← skips rate-limit gate

if await check_ws_rate_limit(self, "WS_HEARTBEAT"):
    return

A malicious client with a valid connection can send unlimited AUTH frames. Each one hits _get_user_from_token() and then _validate_resource_permissions() (which runs DB queries). Consider adding a per-connection AUTH rate limit (e.g., a simple token bucket or a cooldown ref) before calling handle_auth_message.


Issue: Reconnect loop on close 4000/4001

useWebSocketAuth.ts's close handler:

if (code === WS_CLOSE_NORMAL || code === WS_CLOSE_PERMISSION_DENIED) return;
if (code === WS_CLOSE_TOKEN_INVALID) { onAuthInvalid?.(); return; }
// everything else → exponential backoff reconnect

Codes 4000 (unauthenticated) and 4001 (token expired) fall through to reconnect. If the user is genuinely unauthenticated or has a persistently expired token, this creates a retry storm. 4001 particularly — the backend closes with it when the initial handshake token is expired, so retrying will just keep getting rejected. Consider treating WS_CLOSE_UNAUTHENTICATED (4000) the same as WS_CLOSE_TOKEN_INVALID (4002), and treating WS_CLOSE_TOKEN_EXPIRED (4001) as a trigger for onAuthInvalid rather than a reconnect.


Minor: Redundant JSON parsing in receive()

All three consumers now parse JSON twice when a non-AUTH frame arrives:

try:
    _payload = json.loads(text_data)  # parse #1 — check for AUTH
except json.JSONDecodeError:
    _payload = None
if isinstance(_payload, dict) and _payload.get("type") == "AUTH":
    ...
    return

# original code below:
try:
    payload = json.loads(text_data)  # parse #2 — actual dispatch

Pass _payload through (or early-exit on decode failure) to avoid the double parse.


Minor: Nested import of already-imported symbol in thread_updates.py

ThreadUpdatesConsumer._validate_resource_permissions re-imports Conversation inside the nested _check function, but Conversation is already imported at the module level (line ~27 of the original file). Same for Corpus and Document — they can be module-level imports.


Minor: disconnect() is now a silent no-op

const disconnect = useCallback(() => {
    /* lifecycle owned by useWebSocketAuth */
}, []);

Any existing callers of disconnect() will silently do nothing. Worth either removing from the returned interface (breaking change) or noting this more prominently, since silent no-ops are hard to debug.


What's good

  • Security objective is correct and important. JWT off URL is a real win.
  • _parse_subprotocol_token dual-path (header vs scope["subprotocols"]) correctly handles both production and WebsocketCommunicator test paths. Good foresight.
  • User-pk swap refusal is defense-in-depth done right — mixin-level so every consumer gets it.
  • AuthHandshakeMixin class-level defaults (_refresh_grace_task = None, _initial_auth_sent = False) work correctly in Python (assignment creates instance attributes), though documenting this intent would help.
  • Test coverage is comprehensive — middleware unit tests, mixin isolation tests, per-consumer integration tests, and regression tests for the old ?token= path. The test_nudge_refresh_grace_timer_closes_4001_on_timeout test using asyncio.sleep(0.2) to test a 0.05s timer is pragmatic.
  • CHANGELOG is thorough and correctly categorized under ### Security.
  • Removal of dead code (getDocumentQueryWebSocket, getCorpusQueryWebSocket, duplicate URL-builder logic in useAgentChat.ts) follows the project's no-dead-code rule cleanly.

Summary

The CorpusChat.tsx and ChatTray.tsx migration is the blocking issue — those chat surfaces will be broken for authenticated users after deploy. The AUTH rate-limit bypass and reconnect-loop edge cases are also worth fixing before shipping to production. Everything else is addressable as a follow-up.

Critical fixes from claude-review feedback:

- CorpusChat.tsx and ChatTray.tsx now compose useWebSocketAuth instead of
  bare `new WebSocket(url)`. Without this both chat surfaces would have
  authenticated as anonymous against the new middleware (no subprotocol →
  no token). Sub-tool unwrapping for ask_document approvals stays in
  CorpusChat; ChatTray keeps its message-state-derived processing
  indicator. pendingApproval is mirrored into a ref so the once-installed
  onmessage closure reads the latest value without retriggering connect.

- AuthHandshakeMixin.handle_auth_message now enforces a 1-second
  per-connection cooldown before any DB work (token validation +
  resource permission check). Stops a malicious client from spamming
  AUTH frames to burn queries; legitimate Auth0 refreshes are ~50 min
  apart so unaffected.

- useWebSocketAuth now treats close codes 4000 (UNAUTHENTICATED) and
  4001 (TOKEN_EXPIRED) as auth-invalid signals (was: silently fell
  through to exponential-backoff reconnect, hammering a server that
  would just reject them the same way). The hook also no longer returns
  the underlying ws reference — it was stale on first render and a
  footgun for callers; useAgentChat's socketRef mirror is dropped.
  useNotificationWebSocket drops the no-op disconnect()/connect aliases
  for the same reason.

- Each consumer's receive() parses the incoming text frame once and
  dispatches AUTH or non-AUTH from the single parsed payload (was
  double-parsing). thread_updates.py consolidates Corpus/Document/
  Conversation imports to the module level.

Test wrappers (CorpusChat, ChatTray) needed `static OPEN = 1` on their
StubSocket so the hook's `ws.readyState !== WebSocket.OPEN` send guard
works after `window.WebSocket = StubSocket`. useAgentChat unit tests
also adjust three stale assertions left over from the URL-token
removal: token=test-token in URL → expect not to contain token=, and
the two _fail()-driven error tests now drive errors via close 4002 +
onAuthInvalid (the post-migration path that actually surfaces errors
to consumers).

CHANGELOG updated. New CT screenshot:
knowledge-base--chat-tray--conversation-list.png.
}
)
)
except Exception:
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant