Gateway: retry logic for requests to GCS#3836

Merged
trevor-scheer merged 6 commits into release-2.12.0 from trevor/gateway-retries on Mar 23, 2020

@trevor-scheer trevor-scheer commented Feb 27, 2020

Contributor

This PR utilizes make-fetch-happen's built-in retry capabilities in order to retry failed requests to GCS.

It's worth noting that retries only occur on certain types of failures. For additional details, please see the docs.

Additionally, now that we've added retries, this PR adjusts how polling is done in order to prevent the possibility of multiple in-flight updates. The next "tick" only begins after a full round of updating is completed rather than on a perfectly regular interval. Thanks to @abernix for suggesting this change.

To elaborate a bit: previously the gateway would fire off a series of fetches to GCS every 10s (unless specified otherwise). It was (pretty safely) assumed that this wouldn't be problematic - though technically a race condition exists if the fetches were to take a number of seconds each. With the introduction of retries, this becomes considerably more likely due to exponential backoff. To prevent this condition, the gateway starts the next 10s "tick" once it's finished its round of requests to GCS.
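
To make the backoff concern concrete, here is a rough sketch of how retry delays add up under a plain exponential schedule. The helper name `backoffDelays` is hypothetical, and the 1000 ms initial delay and factor of 2 are illustrative assumptions rather than values confirmed by this PR; only the count of 5 retries comes from the commit message further down.

```typescript
// Hypothetical helper: delay before each retry (0-based attempt), using the
// common exponential formula min(minTimeout * factor^attempt, maxTimeout).
// The default values here are assumptions chosen for illustration only.
function backoffDelays(
  retries: number,
  minTimeoutMs: number = 1000,
  factor: number = 2,
  maxTimeoutMs: number = Infinity,
): number[] {
  return Array.from({ length: retries }, (_, attempt) =>
    Math.min(minTimeoutMs * Math.pow(factor, attempt), maxTimeoutMs),
  );
}

const delays = backoffDelays(5);
const totalMs = delays.reduce((sum, d) => sum + d, 0);

console.log(delays);  // [ 1000, 2000, 4000, 8000, 16000 ]
console.log(totalMs); // 31000
```

Under these assumed settings, a single request that exhausts its retries spends about 31 seconds waiting, which is roughly three polling intervals at the 10s default.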

@trevor-scheer trevor-scheer force-pushed the abernix/gateway-minor-qol-improvements branch from be8e5a8 to 4f2fbff Compare March 3, 2020 19:27
@trevor-scheer trevor-scheer force-pushed the trevor/gateway-retries branch 2 times, most recently from 605f47c to 7389f98 Compare March 6, 2020 23:15
@trevor-scheer trevor-scheer changed the base branch from abernix/gateway-minor-qol-improvements to release-2.12.0 March 6, 2020 23:15
@trevor-scheer trevor-scheer marked this pull request as ready for review March 11, 2020 00:56

@abernix abernix left a comment

Member

I think all of the implementation here looks good, with a couple of comments about clarity/intent. However, I'm concerned we're not doing anything explicit or programmatic to ensure that we don't have multiple concurrent retry-able requests being retried at the same time.

For example, the default polling interval for pollingTimer is 10000. Could there be more than one at a time, e.g. the second of the five requests to GCS to obtain a composed schema fails and starts retrying with 30 seconds to go, but another invocation of pollingTimer is kicked off before the retries elapse and starts its own process?

Perhaps the answer here is that the definitive resolution/rejection of the entirety of a fetch pass is what sets the next interval into motion. In other words, setInterval changes to setTimeout, and the new timer is created by the rejection/resolution of the totality of the multiple fetches (or rather, fetchApolloGcses) after all retries are fully resolved, which all currently happens within updateComposition here:

await this.updateComposition();

Does that make sense?
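
The setInterval-to-setTimeout idea above can be sketched roughly as follows. This is not the gateway's implementation: `SequentialPoller`, `update`, and `intervalMs` are hypothetical names, with `update` standing in for something like `updateComposition`. The only point it illustrates is that the next timer is created after the current round settles, so two rounds never overlap.

```typescript
type UpdateFn = () => Promise<void>;

// Hypothetical poller: schedules the next tick only once the current
// update (including any retries inside it) has resolved or rejected.
class SequentialPoller {
  private timer: ReturnType<typeof setTimeout> | undefined;
  private stopped = false;
  private readonly update: UpdateFn;
  private readonly intervalMs: number;

  constructor(update: UpdateFn, intervalMs: number) {
    this.update = update;
    this.intervalMs = intervalMs;
  }

  start(): void {
    this.stopped = false;
    this.scheduleNext();
  }

  stop(): void {
    this.stopped = true;
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = undefined;
  }

  private scheduleNext(): void {
    this.timer = setTimeout(async () => {
      try {
        await this.update();
      } catch (err) {
        // A failed round should not end polling; log and carry on.
        console.error("update failed:", err);
      } finally {
        // Only now is the next tick armed, so rounds cannot overlap.
        if (!this.stopped) this.scheduleNext();
      }
    }, this.intervalMs);
  }
}
```

With this shape, a slow round simply pushes the next tick later instead of stacking a second round on top of it. The trade-off is that polling is no longer on a perfectly regular interval.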

Comment thread packages/apollo-gateway/src/__tests__/integration/networkRequests.test.ts Outdated
Comment thread packages/apollo-gateway/src/index.ts Outdated
@trevor-scheer trevor-scheer force-pushed the trevor/gateway-retries branch from ed6446b to b13fdd3 Compare March 20, 2020 00:34
@trevor-scheer

Contributor Author

@abernix I've incorporated your larger feedback into b13fdd3. Please take a look and let me know how you feel about the changes!

@abernix abernix left a comment

Member

Lovely!

Comment on lines +460 to +463
// Prevent the Node.js event loop from remaining active (and preventing,
// e.g. process shutdown) by calling `unref` on the `Timeout`. For more
// information, see https://nodejs.org/api/timers.html#timers_timeout_unref.
this.pollingTimer?.unref();

Member

Substantially less critical to have this now that it's not a (forever) interval, but probably worth keeping.
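
For context, a small sketch of what `unref` does on a Node.js timer. The variable name and the placeholder callback are illustrative, not taken from the gateway source; `unref()` and `hasRef()` are the standard methods on Node's `Timeout` object.

```typescript
// Illustrative 10s timer; the callback body is a placeholder.
const pollingTimer = setTimeout(() => {
  console.log("poll tick");
}, 10_000);

// Without unref(), Node.js would keep the process alive until this timer fires.
pollingTimer.unref();

// hasRef() reports whether the timer still holds the event loop open.
console.log(pollingTimer.hasRef()); // false
```

A one-shot timeout holds the process open for at most one interval, whereas an interval would hold it open indefinitely, which is why the call matters less after the switch.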

Comment thread packages/apollo-gateway/src/index.ts
@trevor-scheer trevor-scheer force-pushed the trevor/gateway-retries branch from 16fef9a to ea10d29 Compare March 23, 2020 15:19
@trevor-scheer trevor-scheer changed the title [WIP] Gateway: retry logic for requests to GCS Gateway: retry logic for requests to GCS Mar 23, 2020
@abernix abernix added this to the Release 2.12.0 milestone Mar 23, 2020
@trevor-scheer trevor-scheer merged commit cdee9d6 into release-2.12.0 Mar 23, 2020
@trevor-scheer trevor-scheer deleted the trevor/gateway-retries branch March 23, 2020 23:27
abernix pushed a commit to apollographql/federation that referenced this pull request Sep 4, 2020
…#3836)

Implement gateway retry logic for requests to GCS. Failed requests will retry up to 5 times.

Additionally, this PR adjusts how polling is done in order to prevent the possibility of multiple in-flight updates. The next "tick" only begins after a full round of updating is completed rather than on a perfectly regular interval. Thanks to @abernix for suggesting this change. For more details please see the PR description.
Apollo-Orig-Commit-AS: apollographql/apollo-server@cdee9d6
@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Mar 16, 2023