Fix false error logs for non-worker child processes in reap_workers#3566
Fix false error logs for non-worker child processes in reap_workers#3566joho54 wants to merge 2 commits intobenoitc:masterfrom
Conversation
When gunicorn runs as PID 1 (e.g. in containers without tini/dumb-init), it inherits orphaned child processes via the standard UNIX reparenting mechanism. The current reap_workers() logs ERROR for every reaped process before checking whether it is actually a known worker, causing false alerts in monitoring systems like Sentry. Move the WORKERS membership check before exit status logging so that non-worker child processes are reaped silently (with a DEBUG log) while real worker exits continue to be reported as errors. This is a minimal, behavior-preserving fix: waitpid(-1) is kept as-is to fulfill PID 1 zombie reaping duties. Only the log level changes for processes not in the WORKERS dict. Ref benoitc#3220 Related: benoitc#1585
|
I don't disagree, but I believe with very slightly more complexity, a more useful logging is possible.
Yes!
Yes!
Yes!
No! If someone bothered to put log level to DEBUG, then they clearly were interested. Likely the oprhaned child process was reaped by Gunicorn after something went wrong. Once Gunicorn clears it, the status is gone and may not be reported at a higher level. So even after demoting the log level for the following messages and clarifying that the messages are not about a Gunicorn worker, those messages should appear. Only the special treatment of Gunicorn-worker-specific exit codes should be skipped. |
Per review feedback, preserve WIFEXITED/WIFSIGNALED details in DEBUG logs for non-worker children instead of a generic status message. Only worker-specific handling (HaltServer, child_exit) is skipped.
|
Thanks for the feedback! I've added WIFEXITED/WIFSIGNALED logging for non-worker children reaping, while only skipping worker specific handling. Please check 3e705a6 |
Summary
When gunicorn runs as PID 1 in a container (common in Docker/ECS/Kubernetes without
tiniordumb-init),reap_workers()emits false ERROR logs for non-worker child processes reaped viawaitpid(-1).This PR moves the
WORKERSmembership check before exit status logging, so that non-worker processes are reaped silently at DEBUG level while real worker exits continue to be reported as errors.Problem
reap_workers()callsos.waitpid(-1), which reaps any terminated child process. When gunicorn is PID 1, orphaned child processes are reparented to it by the kernel. Upon termination, these non-worker processes are reaped and incorrectly logged as:This triggers false alerts in error tracking systems (e.g. Sentry).
Current code flow
The error log fires before the worker membership check, so every non-worker child with a non-zero exit code produces a spurious error.
Evidence from production
Environment: AWS ECS Fargate, gunicorn 21.2.0, Python 3.11, Django backend
Observed:
Worker (pid:30740) exited with code 23fromgunicorn.errorlogger--workers 3)Booting workerlogs for pid 30740max_requestswas not configured, ruling out worker recyclingFix
Move the
WORKERS.pop(wpid)check to the top of the else branch:What stays the same:
waitpid(-1)is preserved — PID 1 must reap all children to prevent zombiesHaltServeron boot/load errors, signal logging — all unchangedchild_exithook only fires for real workers (same as before)What changes:
Tests added
test_reap_non_worker_child_no_error_log— non-worker exit with error code produces no ERROR logtest_reap_non_worker_child_with_signal_no_error_log— non-worker killed by SIGKILL produces no OOM errortest_reap_mixed_worker_and_non_worker— mixed scenario: non-worker is silent, real worker is reportedAll existing tests pass (1671 passed, 214 skipped).
Design note: why DEBUG instead of suppressing entirely
A question that may come up: is it safe to silently reap unknown child processes?
First,
waitpiddoes not kill anything — it reaps an already terminated process from the kernel's process table. The three options are:DEBUG was chosen over complete suppression so that operators can still trace unexpected child processes when needed (e.g.
--log-level debug). This preserves observability without generating noise at the default log level.The reaping itself is not optional — as PID 1, gunicorn must call
waitpidfor all children, whether it spawned them or not. This is standard POSIX behavior: the kernel reparents orphaned processes to init (PID 1), and init is responsible for reaping them.Related issues
This PR takes a simpler approach than #1585: rather than changing
waitpid(-1)to per-pidwaitpid, it keeps the reaping behavior intact and only fixes the log level for non-worker processes.