Crash when shutting down Wifi with 'Wifi 0' #24536

Merged
s-hadinger merged 1 commit into arendst:development from s-hadinger:fix_wifi_shutdown_crash
Mar 9, 2026

@s-hadinger
Collaborator

Description:

Fix a long-standing crash that occurs when Wifi is disconnected with the command Wifi 0. The fix ensures that the Webserver and UDP server sockets are closed and reopened each time Wifi is disconnected.

Here is the full analysis by Claude:

Wifi Shutdown Crash Analysis — Wifi 0 Command

The Bug

Executing Wifi 0 (CmndWifi()) while there is active Web or UDP traffic causes a crash. The crash does not occur when there is no traffic at the time of shutdown.

Crash Traces (call chain, caller → callee)

TCP (webserver):

Xdrv01() → WebServer::handleClient() → NetworkServer::accept() →
lwip_accept → netconn_delete → netconn_free → netconn_drain →
pbuf_free → esp_pbuf_free   ← CRASH

UDP (Berry/Matter):

Xdrv52() → be_pcall → be_udp_read → NetworkUDP::parsePacket() →
lwip_recvfrom → lwip_recvfrom_udp_raw → netbuf_delete →
pbuf_free → esp_pbuf_free   ← CRASH

Architecture Context

Tasmota is single-threaded and synchronous. The main loop calls Every250mSeconds(), which runs a state machine, and also calls FUNC_LOOP, which runs driver loops including PollDnsWebserver() → Webserver->handleClient().

The webserver uses a listening TCP socket managed by lwIP. Incoming TCP connections land in the socket's accept backlog — a queue of pending connections with associated pbuf chains that reference the network interface (netif) they arrived on.

UDP sockets similarly queue received packets in lwIP's internal receive buffer, with pbuf chains referencing the netif the packet arrived on.

Chain of Events Leading to Crash

The main loop is Scheduler() in tasmota.ino. Each iteration runs, in order:

  1. XdrvXsnsCall(FUNC_LOOP): calls all driver loops, including Xdrv01() → PollDnsWebserver() → Webserver->handleClient()
  2. (various 50ms/100ms handlers)
  3. Every250mSeconds(): state machine, runs every 250ms

Step 1: CmndWifi(0) is called (in support_command.ino)

  • Sets Settings->flag4.network_wifi = 0 (that's all it does for payload 0)
  • Does NOT immediately disable WiFi — just sets the flag
  • Returns to the main loop

Step 2: Every250mSeconds() state machine, case 2 (in support_tasmota.ino)

  • Runs every 250ms at the x.5 second mark
  • Checks Settings->flag4.network_wifi: if 0, calls WifiDisable()
  • WifiDisable() (in support_wifi.ino) → WifiShutdown() → WiFi.disconnect(true, true) + WifiSetMode(WIFI_OFF)
  • This tears down the WiFi netif in lwIP, freeing its memory structures
  • Meanwhile, any socket (TCP or UDP) still has queued data with pbuf pointers to the now-destroyed WiFi netif

Step 3: Next Scheduler() iteration, FUNC_LOOP runs

  • TCP: Xdrv01(FUNC_LOOP) → PollDnsWebserver() → Webserver->handleClient() → lwip_accept() → pbuf_free() on dangling pointer → crash
  • UDP: Xdrv52(FUNC_LOOP) → Berry udp.read() → lwip_recvfrom() → netbuf_delete → pbuf_free() on dangling pointer → crash

Note: the crash can also happen later (e.g. when WiFi is re-enabled) if the socket read is somehow skipped — the poison persists in the socket until the socket is closed.
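The three steps above can be illustrated with a small host-side model. This is only a sketch under stated assumptions: `Netif`, `Packet`, `Socket` and `next_loop_is_safe()` are hypothetical stand-ins invented for this illustration, not lwIP or Tasmota types, and a `false` return marks the point where the real firmware would dereference freed memory.

```cpp
#include <cassert>
#include <deque>
#include <memory>

// Hypothetical stand-in for an lwIP netif; `alive` models whether its memory is still valid.
struct Netif { bool alive = true; };

// Hypothetical stand-in for a queued pbuf chain; it remembers the netif it arrived on.
struct Packet { std::shared_ptr<Netif> netif; };

// Hypothetical stand-in for a TCP listener or UDP socket with an internal queue.
struct Socket {
  std::deque<Packet> queue;  // accept backlog or UDP receive buffer

  // Drains the queue. Returns false if an entry references a torn-down netif,
  // which is where the real code would crash in pbuf_free().
  bool poll() {
    while (!queue.empty()) {
      Packet p = queue.front();
      queue.pop_front();
      assert(p.netif != nullptr);
      if (!p.netif->alive) return false;
    }
    return true;
  }

  // Closing the socket discards everything that was queued.
  void close() { queue.clear(); }
};

// One pass through the sequence: traffic may be pending, Wifi 0 tears down
// the netif, then the next FUNC_LOOP polls the socket.
bool next_loop_is_safe(bool traffic_pending, bool close_before_teardown) {
  auto wifi = std::make_shared<Netif>();
  Socket listener;
  if (traffic_pending) listener.queue.push_back(Packet{wifi});
  if (close_before_teardown) listener.close();
  wifi->alive = false;     // models WifiShutdown() destroying the netif
  return listener.poll();  // models handleClient() / parsePacket() on the next loop
}
```

In this model, `next_loop_is_safe(false, false)` is safe, which matches the observation that the crash needs active traffic, while `next_loop_is_safe(true, false)` is not.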

Why the Poison Persists

The dangling references live inside lwIP-internal queues (TCP accept backlog, UDP receive buffer). The only way to clear them is to close the socket (lwip_close), which properly frees all queued data.

Critically, WifiShutdown() contains multiple delay() calls (totaling ~300ms) that yield to the RTOS scheduler. During these yields, the lwIP TCP/IP task runs and can queue new incoming data into sockets. This means even if you flush a socket right before WifiShutdown() and immediately reopen it, the new socket gets re-poisoned during the delays before WiFi.disconnect() destroys the netif.

Approaches That Don't Work

  1. Guarding PollDnsWebserver() to skip handleClient(): Only postpones the crash. When WiFi is re-enabled, the listening socket still has the poisoned backlog. The next lwip_accept() crashes.

  2. Selectively draining WiFi connections from the backlog: The crash happens inside lwip_accept() before the connection is returned to user code. There is no opportunity to inspect or filter.

  3. Timing flags (disable_in_progress): The crash happens after WifiDisable() returns, in the same main loop iteration. No flag-based deferral helps.

  4. close() + begin() before WifiShutdown(): The fresh socket gets re-poisoned during the delay() calls inside WifiShutdown(), which yield to the RTOS and allow the lwIP task to queue new WiFi connections before WiFi.disconnect() destroys the netif.
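Approach 4 can be contrasted with the ordering that works using a toy model. This is a sketch under stated assumptions: `Listener`, `shutdown_with_yields()` and `dangling_entries()` are hypothetical names, and each "yield" is assumed to let the network task queue exactly one new connection on an open listener.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical listening socket: `queued` counts entries that reference the WiFi netif.
struct Listener {
  bool open = true;
  std::size_t queued = 0;
  void close() { open = false; queued = 0; }  // closing frees all queued entries
  void begin() { open = true; }
};

// Models WifiShutdown(): every delay() yields, and during each yield the
// network task may queue a new connection on any open listener. Afterwards
// the netif is destroyed, so whatever is still queued is left dangling.
std::size_t shutdown_with_yields(Listener& l, int yields) {
  assert(yields >= 0);
  for (int i = 0; i < yields; ++i) {
    if (l.open) ++l.queued;
  }
  return l.queued;
}

enum class Strategy { ReopenBeforeShutdown, ReopenAfterShutdown };

// Returns how many queued entries reference the destroyed netif afterwards.
std::size_t dangling_entries(Strategy s, int yields) {
  Listener l;
  l.queued = 1;  // traffic already pending when Wifi 0 is issued
  l.close();
  if (s == Strategy::ReopenBeforeShutdown) l.begin();
  std::size_t dangling = shutdown_with_yields(l, yields);
  if (s == Strategy::ReopenAfterShutdown) l.begin();
  return dangling;
}
```

With three yields, reopening before the shutdown leaves three dangling entries in this model, while reopening after it leaves none, because a closed listener has nothing for the network task to queue into.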

The Fix

Close all sockets before WiFi teardown, reopen them after.

TCP Webserver (C level)

WebserverStopSocket() / WebserverStartSocket() in xdrv_01_9_webserver.ino:

void WebserverStopSocket(void) {
  if (Webserver) { Webserver->close(); }
}
void WebserverStartSocket(void) {
  if (Webserver) { Webserver->begin(); }
}

UDP and other sockets (driver notification)

WifiDisable() dispatches FUNC_NETWORK_DOWN before teardown and FUNC_NETWORK_UP after (if Ethernet is still up). Berry's xdrv_52 forwards these as driver events "network_down" / "network_up".

WifiDisable() (in support_wifi.ino)

void WifiDisable(void) {
  if (!TasmotaGlobal.global_state.wifi_down) {
    WebserverStopSocket();             // close TCP listening socket
    XdrvXsnsCall(FUNC_NETWORK_DOWN);  // notify drivers (Berry, emulation, etc.)
    WifiShutdown();                    // delay() calls yield to RTOS — no open sockets to poison
    WifiSetMode(WIFI_OFF);
    WebserverStartSocket();            // reopen TCP listening socket
    if (!TasmotaGlobal.global_state.eth_down) {
      XdrvXsnsCall(FUNC_NETWORK_UP);  // notify drivers to reopen sockets
    }
  }
  TasmotaGlobal.global_state.wifi_down = 1;
}
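The ordering in WifiDisable() can be checked on a host with a mock that records the sequence of calls. This is a sketch only: `DriverRegistry`, `UdpDriver` and `wifi_disable_trace()` are hypothetical names used for illustration, and the trace strings stand in for the real socket and WiFi calls.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

enum class NetEvent { Down, Up };

// Hypothetical stand-in for XdrvXsnsCall(): forwards an event to every driver.
struct DriverRegistry {
  std::vector<std::function<void(NetEvent)>> drivers;
  void call(NetEvent e) { for (auto& d : drivers) d(e); }
};

// Hypothetical driver that owns a UDP socket and follows the notifications.
struct UdpDriver {
  bool socket_open = true;
  void on_event(NetEvent e) { socket_open = (e == NetEvent::Up); }
};

// Replays the sequence used by WifiDisable() and returns the call order.
std::vector<std::string> wifi_disable_trace(bool eth_up, UdpDriver& udp) {
  std::vector<std::string> trace;
  DriverRegistry reg;
  reg.drivers.push_back([&](NetEvent e) {
    udp.on_event(e);
    trace.push_back(e == NetEvent::Down ? "udp:close" : "udp:reopen");
  });

  trace.push_back("web:close");      // WebserverStopSocket()
  reg.call(NetEvent::Down);          // FUNC_NETWORK_DOWN
  assert(!udp.socket_open);          // nothing is open while the netif goes away
  trace.push_back("wifi:shutdown");  // WifiShutdown() + WifiSetMode(WIFI_OFF)
  trace.push_back("web:begin");      // WebserverStartSocket()
  if (eth_up) reg.call(NetEvent::Up);  // FUNC_NETWORK_UP only if Ethernet is still up
  return trace;
}
```

When Ethernet is down, the mock leaves the UDP socket closed after the call, which mirrors the conditional FUNC_NETWORK_UP dispatch in the function above.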

Berry/Matter UDP (Berry level)

Berry events are dispatched via tasmota.event() to Tasmota drivers (objects registered with tasmota.add_driver()). They are NOT dispatched to rules (tasmota.add_rule()). The method name on the driver must match the event name.

Matter_Device (in Matter_zz_Device.be) is already a Tasmota driver. Added methods:

def network_down()
  if self.udp_server
    self.udp_server.flush_socket()
  end
end

def network_up()
  if self.udp_server
    self.udp_server.reopen_socket()
  end
end

Matter_UDPServer (in Matter_UDPServer.be) — new methods:

def flush_socket()
  if self.listening && self.udp_socket != nil
    self.udp_socket.close()       # Berry udp class uses 'close', not 'stop'
    self.udp_socket = nil
    self.packets_sent = []        # clear retransmit queue, remote will retry
  end
end

def reopen_socket()
  if self.listening && self.udp_socket == nil
    self.udp_socket = udp()
    var ok = self.udp_socket.begin(self.addr, self.port)
    if !ok  log("MTR: error reopening UDP server", 2)  end
  end
end

The loop() method already guards against udp_socket == nil (if self.udp_socket == nil return end). A nil guard was also added to send() to prevent crashes during the brief window when the socket is closed.
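The nil guard on send() follows a common pattern, shown here in C++ rather than Berry as a rough analogue. `FakeUdpSocket` and `UdpServer` are hypothetical types for this sketch, and the actual Berry guard in Matter_UDPServer.be may differ in detail.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical socket that only records what was sent.
struct FakeUdpSocket { std::vector<std::string> sent; };

struct UdpServer {
  std::unique_ptr<FakeUdpSocket> socket;  // null while WiFi is being torn down
  bool listening = true;

  // Guarded send: reports failure instead of touching a missing socket.
  bool send(const std::string& payload) {
    if (!socket) return false;
    socket->sent.push_back(payload);
    return true;
  }

  // Analogue of flush_socket(): drop the socket and everything queued in it.
  void flush_socket() {
    if (listening && socket) socket.reset();
  }

  // Analogue of reopen_socket(): create a fresh socket if none exists.
  void reopen_socket() {
    if (listening && !socket) socket.reset(new FakeUdpSocket());
  }
};
```

A send attempted between flush_socket() and reopen_socket() returns false in this sketch, and the caller can rely on the remote side retrying.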

Why EspRestart() / deep sleep don't need the fix

Other callers of WifiShutdown() (EspRestart, DeepSleepStart, ULP sleep) don't need the flush because the device is shutting down — the dangling pointer never gets dereferenced.

Gotchas Discovered

  1. Berry udp class uses close(), not stop(): The native method is mapped as close in be_udp_lib.c. Using .stop() raises attribute_error. The existing Matter_UDPServer.stop() method had the same latent bug (fixed).

  2. Berry event dispatch goes to drivers, not rules: callBerryEventDispatcher() calls tasmota.event() which dispatches to driver methods via introspect.get(driver, event_name). It does NOT trigger tasmota.add_rule() callbacks. To receive these events, a class must be registered as a driver (tasmota.add_driver(self)) and have a method matching the event name.

  3. FUNC_NETWORK_DOWN is already called in Every250mSeconds case 3: But only when network_down is true (both WiFi AND Ethernet down), and 250ms after WifiDisable() — too late. Our new call in WifiDisable() fires before teardown regardless of Ethernet state.
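Gotcha 2 comes down to how events are looked up. The C++ sketch below models that lookup under stated assumptions: `Driver`, `EventBus` and the `rules` map are hypothetical and only mimic the behavior described above, where an event reaches a driver method with a matching name and never a rule callback.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

using Handler = std::function<void()>;

// Hypothetical driver: a set of named methods, looked up by event name.
struct Driver { std::map<std::string, Handler> methods; };

struct EventBus {
  std::vector<Driver*> drivers;          // models tasmota.add_driver()
  std::map<std::string, Handler> rules;  // models tasmota.add_rule(); event() never reads it

  // Calls the method named `name` on every driver that has one.
  // Returns how many drivers handled the event.
  int event(const std::string& name) {
    int handled = 0;
    for (Driver* d : drivers) {
      auto it = d->methods.find(name);
      if (it != d->methods.end()) {
        it->second();
        ++handled;
      }
    }
    return handled;
  }
};
```

In this model a callback stored only in `rules` is never invoked by `event()`, which is why a class has to be registered as a driver and expose a method with the exact event name.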

Key Files

| File | Role |
|---|---|
| tasmota/tasmota_support/support_wifi.ino | WifiDisable(), WifiShutdown(), EspRestart() |
| tasmota/tasmota_xdrv_driver/xdrv_01_9_webserver.ino | WebserverStopSocket(), WebserverStartSocket(), PollDnsWebserver() |
| tasmota/tasmota_xdrv_driver/xdrv_52_9_berry.ino | Berry event dispatch for FUNC_NETWORK_DOWN / FUNC_NETWORK_UP |
| lib/libesp32/berry_matter/src/embedded/Matter_zz_Device.be | network_down() / network_up() driver methods |
| lib/libesp32/berry_matter/src/embedded/Matter_UDPServer.be | flush_socket() / reopen_socket() / send() nil guard |
| lib/libesp32/berry_tasmota/src/be_udp_lib.c | Berry udp class definition (method is close, not stop) |
| tasmota/tasmota_support/support_tasmota.ino | Every250mSeconds() state machine (case 2 triggers WifiDisable) |
| tasmota/tasmota.ino | WIFI struct definition |
| Framework: NetworkServer | sockfd is private, no accessor; cannot patch externally |
| Framework: WebServer | handleClient() calls _server.accept() → lwip_accept() |
| Framework: NetworkUDP | parsePacket() calls recvfrom() → lwip_recvfrom() |

Checklist:

  • The pull request is done against the latest development branch
  • Only relevant files were touched
  • Only one feature/fix was added per PR and the code change compiles without warnings
  • The code change is tested and works with Tasmota core ESP8266 V.2.7.8
  • The code change is tested and works with Tasmota core ESP32 V.3.1.10
  • I accept the CLA.

NOTE: The code change must pass CI tests. Your PR cannot be merged unless tests pass

@s-hadinger s-hadinger merged commit 1c90d59 into arendst:development Mar 9, 2026
64 checks passed
josef109 pushed a commit to josef109/Tasmota that referenced this pull request Mar 15, 2026