Skip to content

Fix false error logs for non-worker child processes in reap_workers#3566

Open
joho54 wants to merge 2 commits intobenoitc:masterfrom
joho54:fix/reap-workers-non-worker-child-error-log
Open

Fix false error logs for non-worker child processes in reap_workers#3566
joho54 wants to merge 2 commits intobenoitc:masterfrom
joho54:fix/reap-workers-non-worker-child-error-log

Conversation

@joho54
Copy link
Copy Markdown

@joho54 joho54 commented Mar 31, 2026

Summary

When gunicorn runs as PID 1 in a container (common in Docker/ECS/Kubernetes without tini or dumb-init), reap_workers() emits false ERROR logs for non-worker child processes reaped via waitpid(-1).

This PR moves the WORKERS membership check before exit status logging, so that non-worker processes are reaped silently at DEBUG level while real worker exits continue to be reported as errors.

Problem

reap_workers() calls os.waitpid(-1), which reaps any terminated child process. When gunicorn is PID 1, orphaned child processes are reparented to it by the kernel. Upon termination, these non-worker processes are reaped and incorrectly logged as:

[ERROR] Worker (pid:30740) exited with code 23

This triggers false alerts in error tracking systems (e.g. Sentry).

Current code flow

waitpid(-1) -> get exit code -> log ERROR -> check if worker -> skip if not worker

The error log fires before the worker membership check, so every non-worker child with a non-zero exit code produces a spurious error.

Evidence from production

Environment: AWS ECS Fargate, gunicorn 21.2.0, Python 3.11, Django backend

Observed:

  • Sentry captured Worker (pid:30740) exited with code 23 from gunicorn.error logger
  • CloudWatch log analysis showed only 3 active workers in the container: pid 40, 41, 42 (matching --workers 3)
  • Zero application logs from pid 30740 — it was never a gunicorn worker
  • Zero Booting worker logs for pid 30740
  • max_requests was not configured, ruling out worker recycling
  • All 3 real workers continued processing requests normally before and after the event
  • The error was a single occurrence over 7 days — a non-worker child process reaped by PID 1

Fix

Move the WORKERS.pop(wpid) check to the top of the else branch:

worker = self.WORKERS.pop(wpid, None)
if not worker:
    self.log.debug(
        "Reaped unknown child process (pid:%s, status:%s). "
        "Not a gunicorn worker.", wpid, status)
    continue

# Only report exit status for actual workers
exitcode = ...

What stays the same:

  • waitpid(-1) is preserved — PID 1 must reap all children to prevent zombies
  • Worker error reporting, HaltServer on boot/load errors, signal logging — all unchanged
  • child_exit hook only fires for real workers (same as before)

What changes:

  • Non-worker child processes are logged at DEBUG instead of ERROR
  • No false alerts in monitoring systems

Tests added

  • test_reap_non_worker_child_no_error_log — non-worker exit with error code produces no ERROR log
  • test_reap_non_worker_child_with_signal_no_error_log — non-worker killed by SIGKILL produces no OOM error
  • test_reap_mixed_worker_and_non_worker — mixed scenario: non-worker is silent, real worker is reported

All existing tests pass (1671 passed, 214 skipped).

Design note: why DEBUG instead of suppressing entirely

A question that may come up: is it safe to silently reap unknown child processes?

First, waitpid does not kill anything — it reaps an already terminated process from the kernel's process table. The three options are:

Approach Result
Don't reap Zombie accumulates in process table (worse)
Reap + ERROR log Zombie prevented, but false alert (current behavior)
Reap + DEBUG log Zombie prevented, no false alert, still traceable (this PR)

DEBUG was chosen over complete suppression so that operators can still trace unexpected child processes when needed (e.g. --log-level debug). This preserves observability without generating noise at the default log level.

The reaping itself is not optional — as PID 1, gunicorn must call waitpid for all children, whether it spawned them or not. This is standard POSIX behavior: the kernel reparents orphaned processes to init (PID 1), and init is responsible for reaping them.

Related issues

This PR takes a simpler approach than #1585: rather than changing waitpid(-1) to per-pid waitpid, it keeps the reaping behavior intact and only fixes the log level for non-worker processes.

When gunicorn runs as PID 1 (e.g. in containers without tini/dumb-init),
it inherits orphaned child processes via the standard UNIX reparenting
mechanism. The current reap_workers() logs ERROR for every reaped process
before checking whether it is actually a known worker, causing false
alerts in monitoring systems like Sentry.

Move the WORKERS membership check before exit status logging so that
non-worker child processes are reaped silently (with a DEBUG log) while
real worker exits continue to be reported as errors.

This is a minimal, behavior-preserving fix: waitpid(-1) is kept as-is
to fulfill PID 1 zombie reaping duties. Only the log level changes for
processes not in the WORKERS dict.

Ref benoitc#3220
Related: benoitc#1585
@pajod
Copy link
Copy Markdown
Contributor

pajod commented Apr 4, 2026

I don't disagree, but I believe with very slightly more complexity, a more useful logging is possible.

Non-worker child processes are logged at DEBUG instead of ERROR

Yes!

No false alerts in monitoring systems

Yes!

Unambigously explain the situation "Not a gunicorn worker"

Yes!

skip os.WIFEXITED / os.WIFSIGNALED details from the (demoted to DEBUG level) log line

No! If someone bothered to put log level to DEBUG, then they clearly were interested. Likely the oprhaned child process was reaped by Gunicorn after something went wrong. Once Gunicorn clears it, the status is gone and may not be reported at a higher level. So even after demoting the log level for the following messages and clarifying that the messages are not about a Gunicorn worker, those messages should appear. Only the special treatment of Gunicorn-worker-specific exit codes should be skipped.

Per review feedback, preserve WIFEXITED/WIFSIGNALED details in DEBUG
logs for non-worker children instead of a generic status message.
Only worker-specific handling (HaltServer, child_exit) is skipped.
@joho54
Copy link
Copy Markdown
Author

joho54 commented Apr 4, 2026

Thanks for the feedback! I've added WIFEXITED/WIFSIGNALED logging for non-worker children reaping, while only skipping worker specific handling. Please check 3e705a6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants