Implement exponential backoff retry mechanism for transport tasks #1837
Merged
sphuber merged 1 commit on Aug 2, 2018
Codecov Report
@@             Coverage Diff             @@
##           develop    #1837     +/-    ##
===========================================
+ Coverage    66.69%   66.73%   +0.03%
===========================================
  Files          317      317
  Lines        32407    32406      -1
===========================================
+ Hits         21613    21625     +12
+ Misses       10794    10781     -13
JobProcesses have various tasks that they need to execute that require a transport. These can fail because the command executed over the transport raises an exception. Examples are the submission of a job calculation and the update of its scheduler state. Such failures do not necessarily mean that the job is irrecoverably lost: the internet connection may be temporarily unavailable, or the scheduler may simply not be responding. Instead of putting the process in an excepted state, the engine should automatically retry at a later stage.

Here we implement the exponential_backoff_retry utility, a coroutine that can wrap another function or coroutine. It tries to run the wrapped callable and reruns it when an exception is caught. Once an exception has been caught as many times as the maximum number of allowed attempts, the exception is re-raised.

This is implemented in the various transport tasks that are called by the Waiting state of the JobProcess class:

* task_submit_job: submit the calculation
* task_update_job: update the scheduler state
* task_retrieve_job: retrieve the files of the completed calculation
* task_kill_job: kill the job through the scheduler

These are now wrapped in the exponential_backoff_retry coroutine, which gives the process some leeway when they fail for reasons that often resolve themselves given time.
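
For readers who want a concrete picture of the retry behaviour described above, here is a minimal sketch. It is not the code merged in this PR: it uses the standard library's asyncio purely for illustration, and the parameter names `initial_interval` and `max_attempts`, their defaults, and the doubling factor are all assumptions.

```python
import asyncio
import inspect


async def exponential_backoff_retry(fct, initial_interval=10.0, max_attempts=5):
    """Run ``fct`` and rerun it when it raises, doubling the wait each time.

    Illustrative sketch only: the signature and defaults are assumptions,
    not the API introduced by this PR.

    :param fct: a plain function or coroutine function taking no arguments
    :param initial_interval: seconds to wait after the first failure
    :param max_attempts: total number of attempts before giving up
    """
    interval = initial_interval

    for attempt in range(1, max_attempts + 1):
        try:
            result = fct()
            # Support both plain functions and coroutine functions.
            if inspect.isawaitable(result):
                result = await result
            return result
        except Exception:
            # Out of attempts: let the last exception propagate to the caller.
            if attempt == max_attempts:
                raise
            # Otherwise wait, then double the interval for the next failure.
            await asyncio.sleep(interval)
            interval *= 2
```

With the assumed defaults, a task that keeps failing is attempted five times with waits of 10, 20, 40 and 80 seconds in between, after which the last exception reaches the caller. A transport task would be passed in as `fct`, for example via `functools.partial` if it needs arguments.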
Fixes #1834