
Check all the worker pids instead of every child.#1585

Closed
jwg4 wants to merge 10 commits into benoitc:master from jwg4:patch-1

Conversation


@jwg4 jwg4 commented Sep 6, 2017

This solves a problem where this wait is triggered by something done on a child process launched by a third-party library which we use when setting up workers.

This seems to be a reasonable fix: we only need to wait on the pids of the workers, not on any other arbitrary child pid.

Comment thread gunicorn/arbiter.py Outdated
try:
while True:
wpid, status = os.waitpid(-1, os.WNOHANG)
for pid, _ in self.WORKERS.items():
Contributor


No need for .items() since you don't use the dictionary value.

for pid in tuple(self.WORKERS): might be a better line. The tuple forces a copy of the pids, because the WORKERS dict might change in the meantime.

Author

@jwg4 jwg4 Sep 7, 2017


Done, thanks.



The idiomatic way to iterate over the keys of a possibly-changing dict would be:

for pid in self.WORKERS.keys():

Why not use that? It's much clearer what you're trying to do.
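[Editor's note] As an aside on the two spellings discussed above: in Python 3, .keys() returns a live view, so iterating it gives no protection against the dict changing underneath you, whereas tuple() snapshots the pids first. (At the time of this PR, Python 2's .keys() returned a list, which is why the two looked equivalent.) A minimal, illustrative demonstration, not gunicorn code:

```python
# Mutating a dict while iterating its live .keys() view raises
# RuntimeError in Python 3.
workers = {101: "w1", 102: "w2"}
try:
    for pid in workers.keys():   # live view, not a copy
        workers.pop(pid)
except RuntimeError:
    print("dict changed size during iteration")

# tuple() snapshots the pids first, so mutation is safe.
workers = {101: "w1", 102: "w2"}
for pid in tuple(workers):
    workers.pop(pid)
print(workers)  # {}
```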

jwg4 added 7 commits September 8, 2017 16:04
This if clause was never needed in this version. If you call waitpid with an explicit pid, it either returns a record with that same pid, or raises OSError. It's not possible for it to return 0 if we call it with a pid.
We don't test for any behavior; we just want to make sure that nothing bad happens here.

@RonRothman

Just checking: Is there a possible race condition where the arbiter would have tried to spin up a new worker process before WORKERS was updated?

Also, just curious: what problem is this PR intended to solve?

Thanks!

@benoitc
Owner

benoitc commented Sep 11, 2017

About the patch: looking only for the pids known at the time is not really reliable on all systems; it's better to wait for all child processes that can exist and filter them, which is what the current code does. Also, the result of waitpid might be (0, 0) on some systems, so this value needs to be checked for.

@RonRothman i guess it's related to #1584

@jwg4 see my comment on the ticket
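[Editor's note] The loop benoitc describes can be sketched as follows. This is an illustrative reconstruction, not gunicorn's actual reap_workers; known_pids stands in for the arbiter's WORKERS table:

```python
import errno
import os

def reap_children(known_pids):
    """Wait on any child (-1), stop on the (0, 0) 'nothing exited
    yet' result, and filter the reaped pids against the known worker
    table afterwards. Returns the (pid, status) pairs of reaped
    workers; non-worker children are reaped silently."""
    reaped = []
    try:
        while True:
            wpid, status = os.waitpid(-1, os.WNOHANG)
            if not wpid:              # (0, 0): children exist, none exited
                break
            if wpid in known_pids:    # ignore non-worker children
                reaped.append((wpid, status))
    except OSError as e:
        if e.errno != errno.ECHILD:   # ECHILD: no children at all
            raise
    return reaped
```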

@temoto
Contributor

temoto commented Sep 11, 2017

There is a race condition; one way to fix it is to make WORKERS a threading.Queue.

@jwg4
Author

jwg4 commented Sep 11, 2017

@temoto I am not sure this would fix the race condition. When a new worker forks, couldn't it conceivably die before adding itself to WORKERS? This could happen whether or not WORKERS itself was thread-safe.

@temoto
Contributor

temoto commented Sep 11, 2017

@jwg4 a worker could not add itself, because it's a separate process; it can't modify the master process's memory. Consider this code:

# in master process
pid = os.fork()  # or subprocess.Popen()
if pid > 0:
    workers_queue.put(pid)

@jwg4
Author

jwg4 commented Sep 11, 2017

Yes, I missed the check for the return value. However, doesn't the same objection apply? Couldn't the child process die before the main process has had a chance to modify the list of workers?

@temoto
Contributor

temoto commented Sep 11, 2017

Yes, you are right. One way to fix that is to check whether the new pid is still running after workers_queue.put(). My gut tells me it's getting too complicated for a reliable production system.
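[Editor's note] The liveness check temoto mentions is typically done with signal 0. A minimal sketch, with the hypothetical helper name pid_alive (not part of gunicorn):

```python
import errno
import os

def pid_alive(pid):
    """Signal 0 performs the existence/permission check without
    actually delivering a signal to the target process."""
    try:
        os.kill(pid, 0)
    except OSError as e:
        if e.errno == errno.ESRCH:   # no such process
            return False
        # EPERM: the process exists but is owned by another user
    return True
```

Note the caveat that makes this approach fragile: a child that has already exited but not yet been reaped (a zombie) still passes this check.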

@benoitc
Owner

benoitc commented Sep 11, 2017

@jwg4 this can't happen, though: signals are queued, so it will be handled at some point. Anyway, I think the fix I proposed on the related issue is enough. It makes sure that signal handlers are reset after the worker is spawned and before eventlet is set up.
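[Editor's note] The kind of handler reset benoitc describes might look like the following sketch. The signal list and function name are illustrative assumptions, not gunicorn's actual code:

```python
import signal

# After fork, the worker restores default dispositions so handlers
# inherited from the arbiter don't fire in the child before the
# worker (or eventlet) installs its own.
INHERITED_SIGNALS = (signal.SIGTERM, signal.SIGINT, signal.SIGUSR1)

def reset_signal_handlers():
    for sig in INHERITED_SIGNALS:
        signal.signal(sig, signal.SIG_DFL)
```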

@jwg4
Author

jwg4 commented Sep 12, 2017

I agree, @benoitc, your fix is more straightforward and less uncertain.

@jwg4 jwg4 closed this Sep 12, 2017
@benoitc
Owner

benoitc commented Sep 12, 2017

@jwg4 OK, I will make a release on Thursday and send a PR later tonight.

joho54 added a commit to joho54/gunicorn that referenced this pull request Mar 31, 2026
When gunicorn runs as PID 1 (e.g. in containers without tini/dumb-init),
it inherits orphaned child processes via the standard UNIX reparenting
mechanism. The current reap_workers() logs ERROR for every reaped process
before checking whether it is actually a known worker, causing false
alerts in monitoring systems like Sentry.

Move the WORKERS membership check before exit status logging so that
non-worker child processes are reaped silently (with a DEBUG log) while
real worker exits continue to be reported as errors.

This is a minimal, behavior-preserving fix: waitpid(-1) is kept as-is
to fulfill PID 1 zombie reaping duties. Only the log level changes for
processes not in the WORKERS dict.

Ref benoitc#3220
Related: benoitc#1585
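[Editor's note] The reordering described in this commit message can be sketched as follows; the function and logger names are illustrative, not the actual gunicorn code:

```python
import logging
import os

log = logging.getLogger("gunicorn.arbiter")

def reap_and_log(workers):
    """Resolve the pid against the worker table *before* logging, so
    orphans reparented to PID 1 are reaped quietly at DEBUG while
    real worker exits are still reported at ERROR."""
    while True:
        try:
            wpid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:        # no children at all
            break
        if not wpid:                     # children exist but none exited
            break
        worker = workers.pop(wpid, None)
        if worker is None:
            log.debug("Reaped non-worker child (pid: %s)", wpid)
            continue
        log.error("Worker (pid:%s) exited with status %s", wpid, status)
```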