Skip to content

@apollo/gateway: Make ApolloGateway.stop() reliable and required#452

Merged
glasser merged 2 commits into
mainfrom
glasser/effective-stop
Feb 9, 2021
Merged

@apollo/gateway: Make ApolloGateway.stop() reliable and required#452
glasser merged 2 commits into
mainfrom
glasser/effective-stop

Conversation

@glasser

@glasser glasser commented Feb 4, 2021

Copy link
Copy Markdown
Member

The (not particularly documented) ApolloGateway.stop() method didn't reliably
stop the gateway from polling. All it did was cancel a timeout. But if it is
called while pollServices is in the middle of running, it would do nothing, and
then pollServices would carry on and set another timeout.

Better semantics would be for stop() to reliably stop polling: allow the current
poll to complete, ensure that there will be no more polls, and then (async)
return. This PR implements those semantics, by implementing an explicit state
machine for ApolloGateway's polling.

One reason that these bugs were able to survive is that calling stop() is often
unnecessary. In apollographql/apollo-server#3105 we
chose to unref the polling timeout to allow the Node process to exit if it's
the only thing left on the event loop, instead of encouraging users of
ApolloGateway to be responsible and call stop() themselves. While this may
be reasonable when the gateway's lifecycle is merely a schema polling timer, we
may have future changes to the gateway where proper lifecycle handling is more
important. So this PR also moves away from the world where it's fine to not
bother to explicitly stop the gateway.

That said, in the common case, we don't want users to have to write gateway
stopping code. It makes more sense for stopping the ApolloGateway to be the
responsibility of ApolloServer.stop(), just as ApolloServer.stop() can
trigger events in plugins. So in the recently-released Apollo Server v2.20,
ApolloServer.stop() calls ApolloGateway.stop(). This should mean that in
most cases, the missing unref shouldn't keep the process running, as long as
you've run ApolloServer.stop() (and if you don't, it's likely that other parts
of the server are keeping your process running).

What if you're still running an old version of Apollo Server? For a bit of
backwards compatibility, ApolloGateway detects if it appears to be connected
to Apollo Server older than v2.18. If so, it continues to do the old
unref(). If you're using v2.18 or v2.19, the upgrade to v2.20 should be pretty
painless (it's mostly just minor bugfixes). If you're using ApolloGateway on its
own for some reason, and this change causes your processes to hang on shutdown,
adding a stop() call should be pretty straightforward. (If we learn that this
change really is devastating, we can always go back to an unconditional
timer.unref() later.)

Fixes apollographql/apollo-server#4428.

@glasser glasser requested a review from trevor-scheer February 4, 2021 06:21
@glasser

glasser commented Feb 4, 2021

Copy link
Copy Markdown
Member Author

Before merging this I should implement the Apollo Server change described in it and release v2.20. It's ready for review but I'll mark as draft for this reason.

@glasser glasser marked this pull request as draft February 4, 2021 06:22
Comment thread gateway-js/src/index.ts Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I be able to call load again after awaiting the stop?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably disregard, I can tell below it's explicitly written with the intention of not allowing that.

Comment thread gateway-js/src/index.ts Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! TIL

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I actually don't know offhand if you get unbounded stack sizes from recursion in async functions.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I checked and you do.)

glasser added a commit to apollographql/apollo-server that referenced this pull request Feb 8, 2021
Because AS is what invokes ApolloGateway.load, it should be its responsibility
to invoke the matching stop method.

apollographql/federation#452 (aimed at @apollo/gateway
0.23) will change ApolloGateway to no longer "unref" its polling timer: ie, it
makes calling `stop()` actually part of its expected API instead of something
you can ignore if you feel like it without affecting program shutdown. It has
a bit of a hack to still unref the timer if it looks like you're using an
old (pre-2.18) version of Apollo Server, but this PR (which will be released in
v2.20.0) will make ApolloServer stop the gateway for you.

Part of fixing #4428.
glasser added a commit to apollographql/apollo-server that referenced this pull request Feb 8, 2021
Because AS is what invokes ApolloGateway.load, it should be its responsibility
to invoke the matching stop method.

apollographql/federation#452 (aimed at @apollo/gateway
0.23) will change ApolloGateway to no longer "unref" its polling timer: ie, it
makes calling `stop()` actually part of its expected API instead of something
you can ignore if you feel like it without affecting program shutdown. It has
a bit of a hack to still unref the timer if it looks like you're using an
old (pre-2.18) version of Apollo Server, but this PR (which will be released in
v2.20.0) will make ApolloServer stop the gateway for you.

Part of fixing #4428.

Also simplify typings of toDispose set. 
`() => ValueOrPromise<void>` is a weird type because `void` means "I'm not going
to look at the return value" which is sorta incompatible with "but I need to see
if it's a Promise or not". So just make these all async (changing a couple
implementations by adding `async`).
The (not particularly documented) ApolloGateway.stop() method didn't reliably
stop the gateway from polling. All it did was cancel a timeout. But if it is
called while pollServices is in the middle of running, it would do nothing, and
then pollServices would carry on and set another timeout.

Better semantics would be for stop() to reliably stop polling: allow the current
poll to complete, ensure that there will be no more polls, and then (async)
return. This PR implements those semantics, by implementing an explicit state
machine for ApolloGateway's polling.

One reason that these bugs were able to survive is that calling stop() is often
unnecessary. In apollographql/apollo-server#3105 we
chose to `unref` the polling timeout to allow the Node process to exit if it's
the only thing left on the event loop, instead of encouraging users of
`ApolloGateway` to be responsible and call `stop()` themselves. While this may
be reasonable when the gateway's lifecycle is merely a schema polling timer, we
may have future changes to the gateway where proper lifecycle handling is more
important. So this PR also moves away from the world where it's fine to not
bother to explicitly stop the gateway.

That said, in the common case, we don't want users to have to write gateway
stopping code. It makes more sense for stopping the `ApolloGateway` to be the
responsibility of `ApolloServer.stop()`, just as `ApolloServer.stop()` can
trigger events in plugins. So in the recently-released Apollo Server v2.20,
`ApolloServer.stop()` calls `ApolloGateway.stop()`. This should mean that in
most cases, the missing `unref` shouldn't keep the process running, as long as
you've run `ApolloServer.stop()` (and if you don't, it's likely that other parts
of the server are keeping your process running).

What if you're still running an old version of Apollo Server? For a bit of
backwards compatibility, `ApolloGateway` detects if it appears to be connected
to Apollo Server older than v2.18. If so, it continues to do the old
unref().  If you're using v2.18 or v2.19, the upgrade to v2.20 should be pretty
painless (it's mostly just minor bugfixes). If you're using ApolloGateway on its
own for some reason, and this change causes your processes to hang on shutdown,
adding a `stop()` call should be pretty straightforward. (If we learn that this
change really is devastating, we can always go back to an unconditional
timer.unref() later.)

Fixes #4428.
@glasser glasser force-pushed the glasser/effective-stop branch from c4fe737 to d82e916 Compare February 9, 2021 23:09
@glasser glasser marked this pull request as ready for review February 9, 2021 23:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ApolloGateway.stop() is not reliable

3 participants