rabbit_msg_store: stop GC with exit(shutdown) during shutdown
When `rabbit_msg_store` shuts down, its `terminate/2` callback calls
`rabbit_msg_store_gc:stop/1`, a synchronous `gen_server2:call` with an
`infinity` timeout. If the GC process is blocked mid-`handle_cast` on
disk I/O (for example, during compaction under disk pressure), the call
blocks indefinitely: a `gen_server2` only answers calls from its
receive loop, which it cannot reach while a callback is running.
`terminate/2` therefore never reaches the code that writes the recovery
files (`file_summary.ets`, `msg_store_index.ets`, `clean.dot`).
Eventually the msg_store child's supervisor shutdown timeout
(`msg_store_shutdown_timeout`, default 600s) expires and the supervisor
kills the msg_store process with reason `killed`. With no recovery
files on disk, the next startup rebuilds the indices from scratch by
scanning every segment file, which is slow and expensive for large
stores.
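For orientation, a minimal sketch of the blocking shape (simplified;
function bodies are illustrative, not the exact upstream code):

```erlang
%% rabbit_msg_store_gc: a synchronous stop, answered only once the GC
%% returns to its receive loop.
stop(GCPid) ->
    gen_server2:call(GCPid, stop, infinity).

%% But the GC may be deep inside a callback, never reaching that loop:
handle_cast({compact, File}, State) ->
    ok = compact_file(File, State),  %% can block for minutes under disk pressure
    {noreply, State}.
```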
This was observed in production on a broker under PerfTest load. The
persistent message store for the `/` vhost logged "Stopping message
store" and then nothing for 10 minutes until the supervisor killed
it. The GC process was blocked on disk I/O while disk free space was
hovering near the 2 GiB limit. On restart the store logged
"rebuilding indices from scratch" despite the shutdown having been
initiated gracefully via `rabbitmqctl stop`.
Replace the synchronous `rabbit_msg_store_gc:stop/1` call in
`terminate/2` with a new `stop_gc/1` helper that monitors the GC,
sends `exit(GCPid, shutdown)`, and waits for the `'DOWN'` message.
`rabbit_msg_store_gc` does not trap exits, so an exit signal terminates
it immediately even while it is blocked on disk I/O inside a
`handle_cast` callback: exit signals are delivered by the runtime
scheduler, not by the user-level receive loop, so the signal preempts
the blocked callback. The GC's `terminate/2` is a no-op, so bypassing
it has no side effects.
The wait for `'DOWN'` is bounded by
`max(msg_store_shutdown_timeout - 60_000, 5_000)` so that `terminate/2`
stays within the msg_store child's own supervisor shutdown timeout
and leaves at least 60s for the remaining steps (syncing the current
file, writing the file summary, tearing down ETS, writing recovery
terms). If the shutdown signal does not produce a `'DOWN'` in time,
fall back to `exit(GCPid, kill)`. `kill` cannot be trapped, so the
inner `receive` has no timeout.
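A sketch of the described helper, assuming the timeout is read from the
`rabbit` application environment (the exact plumbing in the PR may
differ):

```erlang
stop_gc(GCPid) ->
    ShutdownTimeout = application:get_env(rabbit, msg_store_shutdown_timeout,
                                          600_000),
    Timeout = max(ShutdownTimeout - 60_000, 5_000),
    MRef = erlang:monitor(process, GCPid),
    %% The GC does not trap exits, so 'shutdown' terminates it even inside
    %% a blocked handle_cast; no cooperation from its receive loop is needed.
    exit(GCPid, shutdown),
    receive
        {'DOWN', MRef, process, GCPid, _Reason} ->
            ok
    after Timeout ->
            %% 'kill' cannot be trapped, so this receive needs no timeout.
            exit(GCPid, kill),
            receive
                {'DOWN', MRef, process, GCPid, _} ->
                    ok
            end
    end.
```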
Killing the GC mid-operation is safe with respect to message data:
- `compact_file` copies messages before updating the index, and the
original data remains on disk until truncation. The code comments
confirm: "it's OK if we crash at any point before we update the
index because the old data is still there until we truncate."
- `truncate_file` only removes data that has already been compacted
to earlier offsets.
- `delete_file` only deletes files with zero valid messages,
enforced by assertions before the delete.
The unclean recovery path (`build_index/3`) rebuilds everything from
the actual segment files on disk using `scan_file_for_valid_messages`,
so any inconsistency between the file summary and the on-disk state
is handled. In the common case (GC killed before it modified the file
summary ETS), the recovery files are fully consistent and the next
startup recovers cleanly without a rebuild.
Add `rabbit_msg_store:gc_pid/1` to expose the GC pid for testing.
Add two test cases to `backing_queue_SUITE`:
- `msg_store_gc_stuck_suspended` suspends the GC with `sys:suspend`,
terminates the persistent store via the supervisor, and verifies
that the store recovers cleanly (`successfully_recovered_state`
returns `true`) with all messages intact. This covers the case
where the GC is blocked in its receive loop.
- `msg_store_gc_stuck_mid_callback` mocks `compact_file/2` to block
indefinitely, sends a compact cast so the GC enters the blocking
`handle_cast` callback, then terminates the store via the supervisor.
This covers the scenario that motivated the PR: a GC stuck on disk I/O
inside a callback.
Both tests confirm that `exit(GcPid, shutdown)` terminates the GC
process regardless of what Erlang code it is running, that
`terminate/2` completes and writes the recovery files, and that the
next startup recovers cleanly.
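A condensed sketch of the first test's shape; the pid lookup, child id,
and restart helpers are placeholders for the suite's actual setup:

```erlang
msg_store_gc_stuck_suspended(Config) ->
    StorePid = msg_store_pid(Config),          %% placeholder helper
    GCPid = rabbit_msg_store:gc_pid(StorePid),
    ok = sys:suspend(GCPid),                   %% park the GC in its receive loop
    %% Terminating via the supervisor runs terminate/2, which must not
    %% hang on the unresponsive GC.
    ok = supervisor:terminate_child(msg_store_sup(Config), msg_store_persistent),
    false = is_process_alive(GCPid),
    RestartedPid = restart_msg_store(Config),  %% placeholder helper
    %% Clean recovery: the recovery files were written before shutdown.
    true = rabbit_msg_store:successfully_recovered_state(RestartedPid).
```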