Skip to content

Fix self-sustaining WebRTC reconnect storm in remote transport#1888

Merged
MarvinSchenkel merged 3 commits into
mainfrom
fix/webrtc-reconnect-storm
Jun 8, 2026
Merged

Fix self-sustaining WebRTC reconnect storm in remote transport#1888
MarvinSchenkel merged 3 commits into
mainfrom
fix/webrtc-reconnect-storm

Conversation

@MarvinSchenkel

Copy link
Copy Markdown
Contributor

Problem

In remote (WebRTC) mode the transport could enter a perpetual reconnect loop: a healthy connection — data channel open, ICE connected — was torn down and rebuilt roughly every 2 seconds and never stabilised. On iOS PWAs this surfaced as an endless connecting → connected → reconnecting cycle after backgrounding the app, and server-side as a new WebRTC session created and destroyed every ~2s.

Root cause

Three compounding issues in WebRTCTransport (src/plugins/remote/webrtc-transport.ts):

  1. cleanup() re-triggered recovery (the engine of the loop). Each reconnect runs cleanup() then connect(). cleanup() closed the data channel while its onclose handler was still attached, and that handler calls scheduleReconnect(). So the act of tearing down to reconnect scheduled another reconnect — and since connect() re-creates a healthy channel, the next cleanup closed a healthy channel and re-armed the loop. Once entered it never stopped.
  2. connect() / scheduleReconnect() were not serialised. Multiple recovery drivers (dataChannel.onclose, signaling peer-disconnected, ICE/connection failed) — all of which fire together on a real network drop or iOS resume — ran concurrently and corrupted the shared peer connection / SDP state (InvalidModificationError: setLocalDescription ... SDP does not match). The previous guard only blocked while RECONNECTING with a pending timer.
  3. Backoff never escalated. reconnectAttempts reset to 0 on every successful connect; because each storm cycle briefly succeeds, the delay stayed flat (~1.5s) instead of backing off.

Changes

  • Detach handlers before close in cleanup() (data channel + peer connection) so teardown no longer re-enters the recovery path.
  • Serialise connect() via an in-flight guard, and bail from scheduleReconnect() when a reconnect is already pending or a connect is in flight.
  • Defer the backoff reset until the connection has been stable for 5s, and add jitter to the reconnect delay.
  • Only treat the terminal ICE failed state as a failure; let the transient disconnected state self-heal (it usually recovers on its own; the browser escalates disconnected → failed if it does not). This aligns with WebRTC guidance.

Verification

Reproduced deterministically against a local debug server (frontend dev build connected over real WebRTC/signaling), correlating client console + server logs:

Before After
Concurrent connect() InvalidModificationError, teardown extra calls bail cleanly, single negotiation
Sustained flapping stuck reconnecting 25s+ after triggers stop recovers to connected, no pending timer, attempts reset
Server sessions new session every ~2s, indefinitely churn bounded to the trigger window, then quiet

vue-tsc --noEmit, ESLint, and the vitest suite (176 tests) all pass.

Follow-up (not in this PR)

In-place ICE restart recovery (restartIce()). The WebRTC-recommended way to recover from a connectivity drop is an in-place ICE restart, which preserves DTLS keys and data channels instead of tearing down and rebuilding the whole RTCPeerConnection. This PR makes the existing teardown-and-rebuild path correct and bounded; it does not adopt restartIce().

Adopting it requires the server to support in-place renegotiation on an existing peer connection (today the gateway creates a fresh RTCPeerConnection per signaling session), so it is best done alongside the planned switch to a server-side WebRTC library with trickle-ICE support. At that point the natural companion change is to add a short grace timer on ICE disconnected that triggers restartIce() (cheap recovery) rather than waiting for failed. The serialised-negotiation guard added here is the foundation that work builds on.

The remote WebRTC transport could enter a perpetual reconnect loop in
which a healthy connection (data channel open, ICE connected) was torn
down and rebuilt roughly every 2s and never stabilised. On iOS PWAs this
showed up as an endless "connecting → connected → reconnecting" cycle
after backgrounding the app.

Root cause was three compounding issues in WebRTCTransport:
- cleanup() closed the data channel with its onclose handler still
  attached, so tearing down to reconnect scheduled another reconnect -
  the engine of the loop.
- connect()/scheduleReconnect() were not serialised, so multiple
  recovery drivers (data channel close, peer-disconnected, connection
  failure) ran concurrently and corrupted the shared peer connection and
  SDP state (InvalidModificationError on setLocalDescription).
- reconnectAttempts reset on every successful connect, so the backoff
  never escalated during a flapping loop and stayed at a flat ~1.5s.

Changes:
- Detach data channel and peer connection handlers before closing them
  in cleanup() so teardown no longer re-triggers recovery.
- Serialise connect() with an in-flight guard and bail from
  scheduleReconnect() when a reconnect is pending or a connect is in
  flight.
- Defer the backoff reset until the connection has been stable for 5s
  and add jitter to the reconnect delay.
- Treat only the terminal ICE `failed` state as a failure; let the
  transient `disconnected` state self-heal.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a self-sustaining reconnect loop (“reconnect storm”) in the remote WebRTC transport by preventing teardown from re-triggering recovery, serializing connection attempts to avoid SDP/PC state corruption, and making reconnect backoff behave correctly under flapping connections.

Changes:

  • Detach DataChannel/PeerConnection handlers before closing in cleanup() to avoid teardown re-entering reconnection logic.
  • Serialize connect() and gate scheduleReconnect() to prevent concurrent negotiations and duplicate reconnect scheduling.
  • Improve reconnect backoff behavior by adding jitter and only resetting the backoff after the connection remains stable for 5 seconds; also treat ICE failed (not disconnected) as the failure signal.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/plugins/remote/webrtc-transport.ts
Addresses review feedback: cleanup() didn't cancel stableConnectionTimer,
so on a non-reconnect teardown (reconnect disabled, or peer-disconnect via
the non-reconnect path) the pending backoff reset could fire after the
transport was closed and kept the instance referenced until it did. Clear
it in cleanup() - the single teardown funnel - and drop the now-redundant
clear in disconnect().

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment thread src/plugins/remote/webrtc-transport.ts
Addresses review feedback: the stable-connection timer was cleared only
after the max-attempts guard, so with a finite maxReconnectAttempts a drop
within the 5s stability window could give up (FAILED) yet leave the timer
running, which then reset reconnectAttempts to 0 and bypassed the max.
Clear it at the top of scheduleReconnect() so it's cancelled on every path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

@stvncode stvncode left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ✅

@MarvinSchenkel MarvinSchenkel merged commit 0283581 into main Jun 8, 2026
9 checks passed
@MarvinSchenkel MarvinSchenkel deleted the fix/webrtc-reconnect-storm branch June 8, 2026 15:57
github-actions Bot pushed a commit to music-assistant/server that referenced this pull request Jun 8, 2026
Update music-assistant-frontend to version
[2.17.181](https://github.com/music-assistant/frontend/releases/tag/2.17.181)


- Fix queue items disappearing in fullscreen player (by @MarvinSchenkel
in [#1874](music-assistant/frontend#1874))
- Add translation key for now playing badge (by @MarvinSchenkel in
[#1889](music-assistant/frontend#1889))

## 🚀 Features and enhancements

- Refactor heart icon and add it to the artist page (by @stvncode in
[#1891](music-assistant/frontend#1891))
- Add back subtitle for discover page (by @stvncode in
[#1890](music-assistant/frontend#1890))
- Bigger tiles on mobile (by @stvncode in
[#1887](music-assistant/frontend#1887))

## 🐛 Bugfixes

- Fix self-sustaining WebRTC reconnect storm in remote transport (by
@MarvinSchenkel in
[#1888](music-assistant/frontend#1888))
- Subtle placeholder for both dark and light mode (by @stvncode in
[#1886](music-assistant/frontend#1886))

## 🧰 Maintenance and dependency bumps

<details>
<summary>8 changes</summary>

- Add built-in playlists for favorites and random tracks (by @OzGav in
[#1876](music-assistant/frontend#1876))
- Chore(deps): Bump ludeeus/action-require-labels from 1.1.0 to 2.0.0
(by @[dependabot[bot]](https://github.com/apps/dependabot) in
[#1878](music-assistant/frontend#1878))
- Chore(deps): Bump happy-dom from 20.9.0 to 20.10.2 (by
@[dependabot[bot]](https://github.com/apps/dependabot) in
[#1879](music-assistant/frontend#1879))
- Chore(deps): Bump marked from 18.0.3 to 18.0.5 (by
@[dependabot[bot]](https://github.com/apps/dependabot) in
[#1880](music-assistant/frontend#1880))
- Chore(deps-dev): Bump @types/node from 25.9.1 to 25.9.2 (by
@[dependabot[bot]](https://github.com/apps/dependabot) in
[#1881](music-assistant/frontend#1881))
- Chore(deps): Bump @tanstack/vue-form from 1.32.0 to 1.33.0 (by
@[dependabot[bot]](https://github.com/apps/dependabot) in
[#1882](music-assistant/frontend#1882))
- Chore(deps): Bump vue-i18n from 11.4.4 to 11.4.5 (by
@[dependabot[bot]](https://github.com/apps/dependabot) in
[#1883](music-assistant/frontend#1883))
- Chore(deps): Bump reka-ui from 2.9.8 to 2.9.9 (by
@[dependabot[bot]](https://github.com/apps/dependabot) in
[#1884](music-assistant/frontend#1884))
</details>


## 🙇 Contributors

@MarvinSchenkel, @OzGav and @stvncode

Co-authored-by: stvncode <25082266+stvncode@users.noreply.github.com>
anatosun pushed a commit to anatosun/music-assistant-server that referenced this pull request Jun 14, 2026
Update music-assistant-frontend to version
[2.17.181](https://github.com/music-assistant/frontend/releases/tag/2.17.181)


- Fix queue items disappearing in fullscreen player (by @MarvinSchenkel
in [music-assistant#1874](music-assistant/frontend#1874))
- Add translation key for now playing badge (by @MarvinSchenkel in
[music-assistant#1889](music-assistant/frontend#1889))

## 🚀 Features and enhancements

- Refactor heart icon and add it to the artist page (by @stvncode in
[music-assistant#1891](music-assistant/frontend#1891))
- Add back subtitle for discover page (by @stvncode in
[music-assistant#1890](music-assistant/frontend#1890))
- Bigger tiles on mobile (by @stvncode in
[music-assistant#1887](music-assistant/frontend#1887))

## 🐛 Bugfixes

- Fix self-sustaining WebRTC reconnect storm in remote transport (by
@MarvinSchenkel in
[music-assistant#1888](music-assistant/frontend#1888))
- Subtle placeholder for both dark and light mode (by @stvncode in
[music-assistant#1886](music-assistant/frontend#1886))

## 🧰 Maintenance and dependency bumps

<details>
<summary>8 changes</summary>

- Add built-in playlists for favorites and random tracks (by @OzGav in
[music-assistant#1876](music-assistant/frontend#1876))
- Chore(deps): Bump ludeeus/action-require-labels from 1.1.0 to 2.0.0
(by @[dependabot[bot]](https://github.com/apps/dependabot) in
[music-assistant#1878](music-assistant/frontend#1878))
- Chore(deps): Bump happy-dom from 20.9.0 to 20.10.2 (by
@[dependabot[bot]](https://github.com/apps/dependabot) in
[music-assistant#1879](music-assistant/frontend#1879))
- Chore(deps): Bump marked from 18.0.3 to 18.0.5 (by
@[dependabot[bot]](https://github.com/apps/dependabot) in
[music-assistant#1880](music-assistant/frontend#1880))
- Chore(deps-dev): Bump @types/node from 25.9.1 to 25.9.2 (by
@[dependabot[bot]](https://github.com/apps/dependabot) in
[music-assistant#1881](music-assistant/frontend#1881))
- Chore(deps): Bump @tanstack/vue-form from 1.32.0 to 1.33.0 (by
@[dependabot[bot]](https://github.com/apps/dependabot) in
[music-assistant#1882](music-assistant/frontend#1882))
- Chore(deps): Bump vue-i18n from 11.4.4 to 11.4.5 (by
@[dependabot[bot]](https://github.com/apps/dependabot) in
[music-assistant#1883](music-assistant/frontend#1883))
- Chore(deps): Bump reka-ui from 2.9.8 to 2.9.9 (by
@[dependabot[bot]](https://github.com/apps/dependabot) in
[music-assistant#1884](music-assistant/frontend#1884))
</details>


## 🙇 Contributors

@MarvinSchenkel, @OzGav and @stvncode

Co-authored-by: stvncode <25082266+stvncode@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants