Check all the worker pids instead of every child. #1585
jwg4 wants to merge 10 commits into benoitc:master
This solves a problem where this wait is triggered by something done on a child process launched by a third-party library which we use when setting up workers.
     try:
-        while True:
-            wpid, status = os.waitpid(-1, os.WNOHANG)
+        for pid, _ in self.WORKERS.items():
No need for .items() since you don't use the dictionary values.
for pid in tuple(self.WORKERS): might be a better line. The tuple forces a copy of the pids, because the WORKERS dict might change in the meantime.
The idiomatic way to iterate over the keys of a possibly-changing dict would be:
for pid in self.WORKERS.keys():
Why not use that? It's much clearer what you're trying to do.
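Whether .keys() is safe here depends on the Python version. In Python 2 it returns a list, which is a copy; in Python 3 it returns a live view, and changing the dict's size while iterating over it raises RuntimeError. A minimal sketch, assuming Python 3; the helper iterate_and_remove and the pids are hypothetical and not part of gunicorn:

```python
def iterate_and_remove(workers, snapshot):
    """Pop every entry while iterating over the keys.

    Returns the list of pids visited, or None if Python refused to
    continue because the dict changed size during iteration.
    """
    # snapshot=True copies the keys first; snapshot=False iterates the live view.
    keys = tuple(workers) if snapshot else workers.keys()
    visited = []
    try:
        for pid in keys:
            visited.append(pid)
            workers.pop(pid, None)  # simulates a worker being removed mid-loop
    except RuntimeError:
        return None
    return visited


# Hypothetical pids, for illustration only.
live = iterate_and_remove({101: "w1", 102: "w2"}, snapshot=False)
copied = iterate_and_remove({101: "w1", 102: "w2"}, snapshot=True)
```

Under these assumptions the live-view loop fails after the first removal, while the tuple snapshot visits both pids.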
This if clause was never needed in this version. If you call waitpid with an explicit pid, it either returns a record with that same pid, or raises OSError. It's not possible for it to return 0 if we call it with a pid.
We don't test for any behavior, we just want to make sure that nothing bad happens here.
This improves coverage.
Just checking: Is there a possible race condition where the arbiter would have tried to spin up a new worker process before Also, just curious: what problem is this PR intended to solve? Thanks!
About the patch: looking only for the pids known at the time is not really reliable on all systems. It's better to wait for all child processes that can exist and filter them, which is what the current code does. Also, the result of waitpid might be (0, 0) on a system, so this value needs to be checked first. @RonRothman I guess it's related to #1584. @jwg4 see my comment on the ticket.
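The (0, 0) case can be checked directly. With os.WNOHANG, os.waitpid returns (0, 0) when the requested child exists but has not exited yet, even when an explicit pid is passed. A minimal sketch for POSIX systems; poll_child and demo are hypothetical names, not gunicorn functions:

```python
import os
import signal
import time


def poll_child(pid):
    """Non-blocking check: None while the child is still running, else its raw status."""
    wpid, status = os.waitpid(pid, os.WNOHANG)
    if wpid == 0:
        # The child exists but has not changed state yet.
        return None
    return status


def demo():
    pid = os.fork()
    if pid == 0:
        # Child: stay alive long enough for the parent to poll it.
        time.sleep(30)
        os._exit(0)
    still_running = poll_child(pid)      # child is asleep, so nothing to reap
    os.kill(pid, signal.SIGKILL)         # now terminate it
    _, status = os.waitpid(pid, 0)       # blocking wait reaps the child
    return still_running, os.WIFSIGNALED(status), os.WTERMSIG(status)
```

A caller that polls explicit pids therefore still needs the zero check before treating the result as an exit.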
There is a race condition; one way to fix it is to make
@temoto I am not sure if this would fix the race condition. When a new worker forks, couldn't it conceivably die before adding itself to
@jwg4 a worker could not add itself because it's a separate process; it can't modify the master process's memory. Consider code:
Yes, I missed the check for the return value. However, doesn't the same objection apply? Couldn't the child thread die before the main thread has had a chance to modify the list of workers?
Yes, you are right. One way to fix that is to check whether the new pid is still running after
@jwg4 this can't happen though; signals are queued, so it will be handled at some point. Anyway, I think the fix I proposed in the related issue is enough. It makes sure that signal handlers are reset after the worker is spawned and before eventlet is set up.
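A related point may help with the worry about a worker dying before the parent records its pid: on POSIX, a child that has exited stays around as a zombie until its parent waits for it, so its exit status is not lost in the meantime. A minimal sketch, assuming SIGCHLD has its default disposition; register_late and the plain dict standing in for WORKERS are hypothetical:

```python
import os
import time


def register_late(workers, delay=0.2):
    """Fork a child that exits at once, register its pid only after a delay, then reap it."""
    pid = os.fork()
    if pid == 0:
        os._exit(3)                 # child: die immediately with exit code 3
    time.sleep(delay)               # parent is slow to record the new pid
    workers[pid] = "worker"         # registration happens after the child is gone
    wpid, status = os.waitpid(pid, 0)   # the zombie is still there to be reaped
    workers.pop(wpid, None)
    return wpid == pid, os.WEXITSTATUS(status)
```

The blocking wait is used only to keep the sketch deterministic; a non-blocking loop would see the same status once the child has exited.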
I agree, @benoitc: your fix is more straightforward and less uncertain.
@jwg4 OK, I will make a release on Thursday and send a PR later tonight.
When gunicorn runs as PID 1 (e.g. in containers without tini/dumb-init), it inherits orphaned child processes via the standard UNIX reparenting mechanism. The current reap_workers() logs ERROR for every reaped process before checking whether it is actually a known worker, causing false alerts in monitoring systems like Sentry.
Move the WORKERS membership check before exit status logging so that non-worker child processes are reaped silently (with a DEBUG log) while real worker exits continue to be reported as errors.
This is a minimal, behavior-preserving fix: waitpid(-1) is kept as-is to fulfill PID 1 zombie reaping duties. Only the log level changes for processes not in the WORKERS dict.
Ref benoitc#3220
Related: benoitc#1585
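The ordering described in that commit message can be sketched as follows. This is an illustration of the idea only, not gunicorn's actual reap_workers(): the function reap_children, the logger name, and the plain dict standing in for WORKERS are all assumptions.

```python
import logging
import os

log = logging.getLogger("reap_sketch")  # hypothetical logger name


def reap_children(workers):
    """Reap every exited child; return (worker_pids, other_pids) that were reaped."""
    reaped_workers, reaped_others = [], []
    while True:
        try:
            wpid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if wpid == 0:
            break  # children exist, but none has exited yet
        if wpid not in workers:
            # Membership check comes first: unknown children are reaped quietly.
            log.debug("reaped non-worker child (pid %s)", wpid)
            reaped_others.append(wpid)
            continue
        workers.pop(wpid)
        # Only known workers are reported at ERROR level.
        log.error("worker (pid %s) exited with raw status %s", wpid, status)
        reaped_workers.append(wpid)
    return reaped_workers, reaped_others
```

Keeping waitpid(-1) means orphaned children are still reaped when the process is PID 1; only the log level differs between the two branches.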
This solves a problem where this wait is triggered by something done on a child process launched by a third-party library which we use when setting up workers.
This seems to be a reasonable fix: we only need to wait on the pids of the workers, not on any other arbitrary child pid.