
job-status-consumer: improve handling of "not alive" workflows #437

@VMois

Description

Some workflows that have a "not alive" status (finished, failed, deleted, stopped, etc.) in the DB can continue running on Kubernetes, and can even start reporting their status to job-status-consumer.

An example of such a workflow from the job-status-consumer logs:

2022-03-24 14:20:16,312 | root | MainThread | WARNING | Event for not alive workflow 9b67170b-33ce-4dc3-8150-99e490afcade received:
{"workflow_uuid": "9b67170b-33ce-4dc3-8150-99e490afcade", "logs": "", "status": 1, "message": {"progress": {"total": {"total": 1, "job_ids": []}}}}
2022-03-24 14:20:16,315 | root | MainThread | WARNING | Event for not alive workflow 9b67170b-33ce-4dc3-8150-99e490afcade received:
{"workflow_uuid": "9b67170b-33ce-4dc3-8150-99e490afcade", "logs": "", "status": 1, "message": {"progress": {"running": {"total": 1, "job_ids": ["51060737-7e03-4255-a779-0e71e061e5f5"]}}}}
...
2022-03-24 14:20:16,860 | root | MainThread | WARNING | Event for not alive workflow 9b67170b-33ce-4dc3-8150-99e490afcade received:
{"workflow_uuid": "9b67170b-33ce-4dc3-8150-99e490afcade", "logs": "", "status": 2, "message": {"progress": {"finished": {"total": 1, "job_ids": ["51060737-7e03-4255-a779-0e71e061e5f5"]}}}}

Workflow 9b67170b-33ce-4dc3-8150-99e490afcade is reported as deleted in the DB, but, according to the logs above, it even finished its execution. A user reported that some of their workflows were stuck in "pending".
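For context, the warning above presumably comes from a guard of roughly this shape in the consumer (a minimal sketch; ALIVE_STATUSES and the handler signature are illustrative, not the actual reana-workflow-controller code):

import json
import logging

# Illustrative "alive" set; the real statuses come from the RunStatus enum.
ALIVE_STATUSES = {"created", "queued", "pending", "running"}

def on_job_status_message(body, workflow):
    """Sketch of the consumer-side guard for not-alive workflows."""
    if workflow.status not in ALIVE_STATUSES:
        # The workflow is already finished/failed/deleted/stopped in the DB,
        # yet its pods are still emitting events: log and drop the event.
        logging.warning(
            "Event for not alive workflow %s received:\n%s",
            json.loads(body)["workflow_uuid"],
            body,
        )
        return
    # ... normal status/progress update path ...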

Looking into the code:

def delete_workflow(workflow, all_runs=False, workspace=False):
    """Delete workflow."""
    if workflow.status in [
        RunStatus.created,
        RunStatus.finished,
        RunStatus.stopped,
        RunStatus.deleted,
        RunStatus.failed,
        RunStatus.queued,
        RunStatus.pending,
    ]:
        # ... (rest of the function omitted)

In delete_workflow, it is possible to delete a "pending" workflow, or, more precisely, to mark the workflow as deleted in the DB. This probably means that the workflow was deleted between reaching the pending status and the actual start of the workflow-engine pod (and the sending of the first message to the job-status queue).
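One hypothetical way to narrow this race would be for delete_workflow to also issue a best-effort deletion of the workflow's batch job on Kubernetes, instead of only marking the workflow as deleted in the DB. A minimal sketch using the official kubernetes Python client; the reana-run-batch-<workflow_id> job name and the namespace are assumptions about the deployment, not the actual REANA code:

from kubernetes import client, config
from kubernetes.client.rest import ApiException

def delete_workflow_batch_job(workflow_id, namespace="default"):
    """Sketch: best-effort deletion of the workflow's batch job.

    Assumes the batch job is named reana-run-batch-<workflow_id>; the real
    naming convention and namespace may differ per deployment.
    """
    config.load_incluster_config()  # we run inside the cluster
    batch = client.BatchV1Api()
    try:
        batch.delete_namespaced_job(
            name="reana-run-batch-{}".format(workflow_id),
            namespace=namespace,
            propagation_policy="Background",  # also remove the job's pods
        )
    except ApiException as exc:
        if exc.status != 404:  # the job may have never started; ignore "not found"
            raise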

Some optional questions:

  • Why did the workflow stay in the "pending" state for so long, and why did it then actually start running on Kubernetes?

How to reproduce

  1. Start a helloworld workflow: reana-client run -w test
  2. As soon as the workflow starts, wait 3-5 seconds (it should go to the "pending" state) and delete it: reana-client delete -w test
  3. Check reana-client list; it will show that the test workflow is deleted
  4. Check kubectl get pods; you will find the batch pod in NotReady state (and it will stay like this)
  5. Check kubectl logs deployment/reana-workflow-controller job-status-consumer; it will show that the workflow was not in an alive state but still continued to execute

Next actions

  • set better logging for "not alive" workflows, so that it is clear from the logs which state a workflow is in when it is not alive (job-status-consumer: improve logging of "not alive" workflows #443)

  • if a workflow is deleted while it is still running on Kubernetes, this somehow needs to be detected and fixed so that we do not end up with hanging workflows (see the sketch below)
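A possible direction for the second point is a periodic reconciliation pass that compares batch jobs on the cluster against workflow statuses in the DB and removes the orphans. A rough sketch, again assuming the official kubernetes Python client; the label selector and the get_workflow_status helper are hypothetical:

from kubernetes import client, config

ALIVE_STATUSES = {"created", "queued", "pending", "running"}  # illustrative

def reconcile_hanging_workflows(get_workflow_status, namespace="default"):
    """Sketch: delete batch jobs whose workflow is no longer alive in the DB.

    get_workflow_status is a hypothetical helper mapping a workflow UUID to
    its status in the REANA DB; the label selector is also an assumption.
    """
    config.load_incluster_config()
    batch = client.BatchV1Api()
    jobs = batch.list_namespaced_job(
        namespace, label_selector="reana-run-batch-workflow-uuid"
    )
    for job in jobs.items:
        workflow_id = job.metadata.labels.get("reana-run-batch-workflow-uuid")
        if workflow_id and get_workflow_status(workflow_id) not in ALIVE_STATUSES:
            # Orphaned batch job: the DB says the workflow is not alive,
            # but the job is still present on the cluster.
            batch.delete_namespaced_job(
                job.metadata.name, namespace, propagation_policy="Background"
            )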
