Check all the worker pids instead of every child. by jwg4 · Pull Request #1585 · benoitc/gunicorn

jwg4 · 2017-09-06T16:57:18Z

This solves a problem where this wait is triggered by something done on a child process launched by a third-party library which we use when setting up workers.

This seems to be a reasonable fix: we only need to wait on the pids of the workers, not on any other arbitrary child pid.

temoto · 2017-09-07T05:28:22Z

-        try:
-            while True:
-                wpid, status = os.waitpid(-1, os.WNOHANG)
+        for pid, _ in self.WORKERS.items():


No need for .items(), since you don't use the dictionary value.

for pid in tuple(self.WORKERS): might be a better line. tuple forces a copy of the pids, because the WORKERS dict might change in the meantime.

Done, thanks.

RonRothman · 2017-09-10T18:42:52Z

-        try:
-            while True:
-                wpid, status = os.waitpid(-1, os.WNOHANG)
+        for pid, _ in self.WORKERS.items():


The idiomatic way to iterate over the keys of a possibly-changing dict would be:

for pid in self.WORKERS.keys():

Why not use that? It's much clearer what you're trying to do.
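For context, here is a minimal sketch of how the two spellings differ, using a plain dict as a stand-in for WORKERS. On Python 3, keys() returns a live view, so removing entries while iterating over it raises RuntimeError, whereas tuple(...) takes a snapshot first. On Python 2, keys() returned a list copy, so both forms behaved alike there. The function names are hypothetical and not part of gunicorn.

```python
def remove_all_via_snapshot(workers):
    """Remove every entry while iterating over a snapshot of the keys."""
    for pid in tuple(workers):  # tuple() copies the keys up front
        workers.pop(pid, None)
    return workers


def remove_all_via_keys_view(workers):
    """Attempt the same removal while iterating the live keys() view.

    Returns True if Python 3 rejected the mutation with RuntimeError.
    """
    try:
        for pid in workers.keys():
            workers.pop(pid, None)
    except RuntimeError:
        # "dictionary changed size during iteration"
        return True
    return False
```

The snapshot costs one small allocation per call, which is negligible for a dict holding a handful of worker pids.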

RonRothman · 2017-09-10T18:44:47Z

Just checking: Is there a possible race condition where the arbiter would have tried to spin up a new worker process before WORKERS was updated?

Also, just curious: what problem is this PR intended to solve?

Thanks!

benoitc · 2017-09-11T07:12:55Z

About the patch: looking only for the pids known at the time is not really reliable on all systems. It's better to wait for all child processes that can exist and filter them, which is what the current code does. Also, the result of waitpid might be (0, 0) on some systems, so this value needs to be checked first.

@RonRothman I guess it's related to #1584

@jwg4 see my comment on the ticket
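The (0, 0) result mentioned above can be reproduced with a short sketch. With os.WNOHANG, os.waitpid() returns (0, 0) when the requested child exists but has not changed state yet, and this holds for an explicit pid as well as for -1. The helper names poll_child and demo are hypothetical; the sketch assumes a POSIX system where os.fork() is available.

```python
import os
import signal
import time


def poll_child(pid):
    """Return None while `pid` is still running, else its raw wait status."""
    wpid, status = os.waitpid(pid, os.WNOHANG)
    if wpid == 0:
        # WNOHANG: the child exists but has not exited yet.
        return None
    return status


def demo():
    pid = os.fork()
    if pid == 0:
        # Child: stay alive long enough for the parent to poll it.
        time.sleep(30)
        os._exit(0)
    still_running = poll_child(pid) is None
    os.kill(pid, signal.SIGKILL)
    _, status = os.waitpid(pid, 0)  # blocking wait reaps the child
    return still_running, os.WIFSIGNALED(status)
```

A caller that loops over known pids therefore still has to treat a zero pid as "nothing to reap yet" rather than as a finished worker.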

temoto · 2017-09-11T12:12:57Z

There is a race condition; one way to fix it is to make WORKERS a thread-safe queue (queue.Queue in Python 3).

jwg4 · 2017-09-11T13:20:08Z

@temoto I am not sure if this would fix the race condition. When a new worker forks couldn't it conceivably die before adding itself to WORKERS? This could happen whether or not WORKERS itself was thread-safe.

temoto · 2017-09-11T17:24:21Z

@jwg4 a worker cannot add itself, because it's a separate process and can't modify the master process's memory. Consider this code:

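# in master process
import os
import queue

workers_queue = queue.Queue()  # lives only in the master's memory
pid = os.fork()  # or subprocess.Popen()
if pid > 0:  # parent branch; the child sees pid == 0 and skips this
    workers_queue.put(pid)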

jwg4 · 2017-09-11T18:00:56Z

Yes, I missed the check for the return value. However, doesn't the same objection apply? Couldn't the child process die before the master process has had a chance to modify the list of workers?

temoto · 2017-09-11T19:18:13Z

Yes, you are right. One way to fix that is to check whether the new pid is still running after workers_queue.put(). My gut tells me this is getting too complicated for a reliable production system.

benoitc · 2017-09-11T19:21:10Z

@jwg4 this can't happen, though: signals are queued, so it will be handled at some point. Anyway, I think the fix I proposed on the related issue is enough. It makes sure that signal handlers are reset after the worker is spawned and before eventlet is set up.
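As a rough illustration of what resetting handlers in a freshly forked worker could look like, here is a minimal sketch. It is not the actual change made in gunicorn: the function name and the choice of signals are assumptions made for the example, and only standard signal module calls are used. Note that signal.signal() may only be called from the main thread.

```python
import signal

# Hypothetical selection of signals; not taken from gunicorn's source.
SIGNALS_TO_RESET = (signal.SIGCHLD, signal.SIGTERM)


def reset_signal_handlers(signums=SIGNALS_TO_RESET):
    """Restore the default disposition for each signal.

    Returns a dict mapping each signal number to the handler it replaced,
    so a caller can inspect or restore the previous state.
    """
    previous = {}
    for signum in signums:
        previous[signum] = signal.signal(signum, signal.SIG_DFL)
    return previous
```

Called in the child right after the fork, this would keep handlers inherited from the master from firing inside the worker while third-party setup code spawns its own helper processes.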

jwg4 · 2017-09-12T12:25:23Z

I agree, @benoitc: your fix is more straightforward and less uncertain.

benoitc · 2017-09-12T12:27:05Z

@jwg4 OK, I will make a release on Thursday and send a PR later tonight.

When gunicorn runs as PID 1 (e.g. in containers without tini/dumb-init), it inherits orphaned child processes via the standard UNIX reparenting mechanism. The current reap_workers() logs ERROR for every reaped process before checking whether it is actually a known worker, causing false alerts in monitoring systems like Sentry.

Move the WORKERS membership check before exit status logging so that non-worker child processes are reaped silently (with a DEBUG log) while real worker exits continue to be reported as errors.

This is a minimal, behavior-preserving fix: waitpid(-1) is kept as-is to fulfill PID 1 zombie reaping duties. Only the log level changes for processes not in the WORKERS dict.

Ref benoitc#3220
Related: benoitc#1585
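The ordering described above can be sketched as follows. This is a simplified illustration under stated assumptions, not gunicorn's reap_workers(): the logger name, the function names, and the plain dict standing in for WORKERS are all hypothetical. It keeps waitpid(-1) so that every exited child is reaped, and decides the log level only after the membership check.

```python
import logging
import os
import time

log = logging.getLogger("reap_sketch")  # hypothetical logger name


def reap_children(workers):
    """Reap every exited child; return a list of (pid, was_worker) pairs."""
    reaped = []
    while True:
        try:
            wpid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children left
        if wpid == 0:
            break  # children exist, but none has exited yet
        if wpid not in workers:
            # Not a known worker: reap it quietly.
            log.debug("reaped non-worker child (pid: %s)", wpid)
            reaped.append((wpid, False))
            continue
        workers.pop(wpid)
        log.error("worker (pid: %s) exited with status %s", wpid, status)
        reaped.append((wpid, True))
    return reaped


def spawn_exiting_child(code):
    """Fork a child that exits immediately with the given exit code."""
    pid = os.fork()
    if pid == 0:
        os._exit(code)
    return pid


def demo():
    known = spawn_exiting_child(3)
    stray = spawn_exiting_child(0)
    workers = {known: "worker"}  # placeholder value, not a real Worker
    seen = {}
    deadline = time.monotonic() + 5
    while len(seen) < 2 and time.monotonic() < deadline:
        for wpid, was_worker in reap_children(workers):
            if wpid in (known, stray):
                seen[wpid] = was_worker
        time.sleep(0.01)
    return seen == {known: True, stray: False} and not workers
```

In demo(), one child is registered as a worker and one is not; both are reaped, but only the registered one goes through the error path.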

Check all the worker pids instead of every child.

3716704

jwg4 mentioned this pull request Sep 6, 2017

Can't boot eventlet workers with eventlet 0.21.0 #1584

Closed

temoto reviewed Sep 7, 2017

Loop over a tuple of worker pids.

212e9dd

jwg4 mentioned this pull request Sep 7, 2017

new monotonic broken on docker? eventlet/eventlet#401

Open

jwg4 added 7 commits September 8, 2017 16:04

Remove this check - no longer needed.

20e2ef9

This if clause was never needed in this version. If you call waitpid with an explicit pid, it either returns a record with that same pid, or raises OSError. It's not possible for it to return a 0 if we call it with a pid.

Only return one value - we only call waitpid once.

6371f8d

Check that raising the OSError works.

91aed8e

We don't test for any behavior, we just want to make sure that nothing bad happens here.

Correct how we mock an exception raise

2171806

Add a test to make sure these errors get re-raised.

737393a

This improves coverage.

Check for raised error pytest style.

e07c085

Access the ExceptionInfo correctly.

78a53f4

RonRothman suggested changes Sep 10, 2017

Loop over the keys.

e2ed3b2

jwg4 closed this Sep 12, 2017

joho54 mentioned this pull request Mar 31, 2026

Fix false error logs for non-worker child processes in reap_workers #3566

Open
