
job-status-consumer: improve handling of "not alive" workflows #437

@VMois

Description

Some workflows that have a "not alive" status (finished, failed, deleted, stopped, etc.) in the DB can continue running on Kubernetes, and can even start reporting their status to job-status-consumer.

An example of such a workflow from the job-status-consumer logs:

2022-03-24 14:20:16,312 | root | MainThread | WARNING | Event for not alive workflow 9b67170b-33ce-4dc3-8150-99e490afcade received:
{"workflow_uuid": "9b67170b-33ce-4dc3-8150-99e490afcade", "logs": "", "status": 1, "message": {"progress": {"total": {"total": 1, "job_ids": []}}}}
2022-03-24 14:20:16,315 | root | MainThread | WARNING | Event for not alive workflow 9b67170b-33ce-4dc3-8150-99e490afcade received:
{"workflow_uuid": "9b67170b-33ce-4dc3-8150-99e490afcade", "logs": "", "status": 1, "message": {"progress": {"running": {"total": 1, "job_ids": ["51060737-7e03-4255-a779-0e71e061e5f5"]}}}}
...
2022-03-24 14:20:16,860 | root | MainThread | WARNING | Event for not alive workflow 9b67170b-33ce-4dc3-8150-99e490afcade received:
{"workflow_uuid": "9b67170b-33ce-4dc3-8150-99e490afcade", "logs": "", "status": 2, "message": {"progress": {"finished": {"total": 1, "job_ids": ["51060737-7e03-4255-a779-0e71e061e5f5"]}}}}

Workflow 9b67170b-33ce-4dc3-8150-99e490afcade is reported as deleted in the DB, but, according to the logs above, it even finished its execution. A user reported that some of their workflows were stuck in "pending".
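For context, the warning above presumably comes from a guard of roughly this shape in the consumer (a minimal sketch; ALIVE_STATUSES and the handler signature are illustrative, not the actual reana-workflow-controller code):

import json
import logging

# Illustrative "alive" set; the real statuses come from the RunStatus enum.
ALIVE_STATUSES = {"created", "queued", "pending", "running"}

def on_job_status_message(body, workflow):
    """Sketch of the consumer-side guard for not-alive workflows."""
    if workflow.status not in ALIVE_STATUSES:
        # The workflow is already finished/failed/deleted/stopped in the DB,
        # yet its pods are still emitting events: log and drop the event.
        logging.warning(
            "Event for not alive workflow %s received:\n%s",
            json.loads(body)["workflow_uuid"],
            body,
        )
        return
    # ... normal status/progress update path ...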

Looking into the code:

def delete_workflow(workflow, all_runs=False, workspace=False):
    """Delete workflow."""
    if workflow.status in [
        RunStatus.created,
        RunStatus.finished,
        RunStatus.stopped,
        RunStatus.deleted,
        RunStatus.failed,
        RunStatus.queued,
        RunStatus.pending,
    ]:
        # ... (rest of the function omitted)

In delete_workflow, it is possible to delete a "pending" workflow, or, more precisely, to mark the workflow as deleted in the DB. This probably means that the workflow was deleted between reaching the pending status and the actual start of the workflow-engine pod (and the sending of the first message to the job-status queue).
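One hypothetical way to narrow this race would be for delete_workflow to also issue a best-effort deletion of the workflow's batch job on Kubernetes, instead of only marking the workflow as deleted in the DB. A minimal sketch using the official kubernetes Python client; the reana-run-batch-<workflow_id> job name and the namespace are assumptions about the deployment, not the actual REANA code:

from kubernetes import client, config
from kubernetes.client.rest import ApiException

def delete_workflow_batch_job(workflow_id, namespace="default"):
    """Sketch: best-effort deletion of the workflow's batch job.

    Assumes the batch job is named reana-run-batch-<workflow_id>; the real
    naming convention and namespace may differ per deployment.
    """
    config.load_incluster_config()  # we run inside the cluster
    batch = client.BatchV1Api()
    try:
        batch.delete_namespaced_job(
            name="reana-run-batch-{}".format(workflow_id),
            namespace=namespace,
            propagation_policy="Background",  # also remove the job's pods
        )
    except ApiException as exc:
        if exc.status != 404:  # the job may have never started; ignore "not found"
            raise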

Some optional questions:

  • Why did the workflow stay in the "pending" state for so long, and why did it then actually start running on Kubernetes?

How to reproduce

  1. Start a helloworld workflow: reana-client run -w test
  2. As soon as the workflow starts, wait 3-5 seconds (it should go to the "pending" state) and delete it: reana-client delete -w test
  3. Check reana-client list; it will show that the test workflow is deleted
  4. Check kubectl get pods; you will find the batch pod in NotReady state (and it will stay like this)
  5. Check kubectl logs deployment/reana-workflow-controller job-status-consumer; it will show that the workflow was not in an alive state but still continued to execute

Next actions

  • set better logging for "not alive" workflows, so that it is clear from the logs which state a workflow is in when it is not alive (job-status-consumer: improve logging of "not alive" workflows #443)

  • if a workflow is deleted while it is still running on Kubernetes, this somehow needs to be detected and fixed so that we do not end up with hanging workflows (see the sketch below)
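A possible direction for the second point is a periodic reconciliation pass that compares batch jobs on the cluster against workflow statuses in the DB and removes the orphans. A rough sketch, again assuming the official kubernetes Python client; the label selector and the get_workflow_status helper are hypothetical:

from kubernetes import client, config

ALIVE_STATUSES = {"created", "queued", "pending", "running"}  # illustrative

def reconcile_hanging_workflows(get_workflow_status, namespace="default"):
    """Sketch: delete batch jobs whose workflow is no longer alive in the DB.

    get_workflow_status is a hypothetical helper mapping a workflow UUID to
    its status in the REANA DB; the label selector is also an assumption.
    """
    config.load_incluster_config()
    batch = client.BatchV1Api()
    jobs = batch.list_namespaced_job(
        namespace, label_selector="reana-run-batch-workflow-uuid"
    )
    for job in jobs.items:
        workflow_id = job.metadata.labels.get("reana-run-batch-workflow-uuid")
        if workflow_id and get_workflow_status(workflow_id) not in ALIVE_STATUSES:
            # Orphaned batch job: the DB says the workflow is not alive,
            # but the job is still present on the cluster.
            batch.delete_namespaced_job(
                job.metadata.name, namespace, propagation_policy="Background"
            )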
