Some workflows that have a "not alive" status (finished, failed, deleted, stopped, etc.) in the DB can continue running on Kubernetes, and even start reporting their status to job-status-consumer.
Example of such a workflow from the job-status-consumer logs:
Workflow 9b67170b-33ce-4dc3-8150-99e490afcade is reported as deleted in the DB. But, according to the logs above, it even finished its execution. A user reported that some of the workflows were stuck in "pending".
In delete_workflow, it is possible to delete a "pending" workflow, or more precisely, to mark the workflow as deleted in the DB. This probably means that the workflow was deleted between entering the pending status and the actual workflow-engine pod start (i.e. before the first message was sent to the job-status queue).
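The race described above can be sketched as follows. This is a minimal illustration with hypothetical names (FakeDB, delete_workflow, is_alive here are stand-ins, not the actual REANA code): deleting the workflow only updates the DB record, so Kubernetes can still start the batch pod of a workflow that was deleted while "pending".

```python
# Hypothetical sketch of the race: delete_workflow() only touches the DB,
# no Kubernetes object is cleaned up, so the batch pod can start later.

ALIVE_STATUSES = {"created", "queued", "pending", "running"}

class FakeDB:
    """Stand-in for the workflow table in the REANA database."""
    def __init__(self):
        self.status = {}

def delete_workflow(db, workflow_id):
    # Mirrors the behaviour described above: mark deleted in the DB only.
    db.status[workflow_id] = "deleted"

def is_alive(db, workflow_id):
    # The kind of check job-status-consumer performs on incoming messages.
    return db.status[workflow_id] in ALIVE_STATUSES

db = FakeDB()
db.status["9b67170b"] = "pending"    # workflow submitted, pod not started yet
delete_workflow(db, "9b67170b")      # user deletes it while still pending
# ...later the workflow-engine pod still starts, runs, and reports status:
assert not is_alive(db, "9b67170b")  # consumer sees a "not alive" workflow
```

Nothing in this flow cancels the pending pod, which matches the NotReady batch pod observed in the reproduction steps below.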
Some open questions are:
Why did the workflow stay in the "pending" state for so long, and why did it afterwards actually start running on Kubernetes?
How to reproduce
Start a helloworld workflow: reana-client run -w test
As soon as the workflow starts, wait 3-5 seconds (it should go to the "pending" state) and delete it: reana-client delete -w test
Check reana-client list; it will show that the test workflow is deleted
Check kubectl get pods; you will find the batch pod in NotReady state (and it will stay like this)
Check kubectl logs deployment/reana-workflow-controller job-status-consumer; it will show that the workflow was not in an alive state but still continued to execute
Looking into the code:
reana-workflow-controller/reana_workflow_controller/rest/utils.py
Lines 202 to 211 in 2e539ec
Next actions
Set better logging for "not alive" workflows, so it is clear from the logs which state a workflow is in if it is not alive (job-status-consumer: improve logging of "not alive" workflows #443)
If a workflow is deleted but still running on Kubernetes, this somehow needs to be detected and fixed, so that we do not have hanging workflows.
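The second action item could take the shape of a periodic reconciliation pass. The sketch below is an assumption about how such a check might look, not the real REANA or Kubernetes API: it compares the DB view of workflow statuses with the set of pods actually running and reports workflows that are "not alive" in the DB but still have a pod.

```python
# Hypothetical reconciliation pass: detect workflows that the DB considers
# "not alive" but that still have a running pod on Kubernetes.
# find_hanging_workflows and the data shapes are illustrative assumptions.

ALIVE_STATUSES = {"created", "queued", "pending", "running"}

def find_hanging_workflows(db_statuses: dict, running_pods: set) -> set:
    """Return workflow IDs that are not alive in the DB but still have a pod."""
    return {
        workflow_id
        for workflow_id, status in db_statuses.items()
        if status not in ALIVE_STATUSES and workflow_id in running_pods
    }

# Example matching the report above: a deleted workflow whose pod still exists.
db_statuses = {"9b67170b": "deleted", "aaaa1111": "running"}
running_pods = {"9b67170b", "aaaa1111"}
print(find_hanging_workflows(db_statuses, running_pods))  # {'9b67170b'}
```

A real implementation would then delete the corresponding Kubernetes objects (or alert an operator) for each hanging workflow it finds.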