Race condition in job.c/slurmdrmaa_job_update_status

Hi,

This seems to be a longstanding issue in the old code, but I'm reporting it here because this is the version I use myself. In slurmdrmaa_job_update_status, a state of 32772/JOB_CANCELLED explicitly sets the exit status to -1, but that value is immediately overwritten because execution falls through into the next case of the switch statement. It seems pretty obvious to me that the author intended to put a 'break' at the end of that code block. (A classic C bug!)
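
To illustrate the fall-through pattern described above, here is a minimal sketch. The enum and function names are hypothetical stand-ins chosen for illustration; they are not the actual code in job.c or the definitions in SLURM's headers.

```c
/* Hypothetical job states; NOT the real SLURM constants. */
enum sketch_state { SKETCH_CANCELLED, SKETCH_COMPLETE };

/* Buggy shape: the CANCELLED case has no 'break', so control
 * continues into the COMPLETE case and the -1 is overwritten. */
int exit_status_fallthrough(enum sketch_state state, int reported)
{
    int exit_status = 0;
    switch (state) {
    case SKETCH_CANCELLED:
        exit_status = -1;
        /* no break here: execution continues into the next case */
    case SKETCH_COMPLETE:
        exit_status = reported;
        break;
    }
    return exit_status;
}

/* Fixed shape: the 'break' keeps the forced -1 for cancelled jobs. */
int exit_status_with_break(enum sketch_state state, int reported)
{
    int exit_status = 0;
    switch (state) {
    case SKETCH_CANCELLED:
        exit_status = -1;
        break;
    case SKETCH_COMPLETE:
        exit_status = reported;
        break;
    }
    return exit_status;
}
```

With a reported status of 0, the first function returns 0 for a cancelled job, while the second returns -1.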

Most of the time this doesn't matter, because SLURM reports the correct exit status. However, I've found that for jobs on my cluster that aborted after overrunning a high memory limit, the exit status gets reported as 0, and the caller then has no way to see that the job actually failed. Adding a small artificial wait and then re-querying the status fixes the problem, so I'm sure it's a race condition, and forcing the status to -1 (or 15, or anything but 0) is reasonable behaviour to avoid it. Patch attached.
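
The wait-and-re-query experiment mentioned above can be sketched roughly as follows. This is an illustration only: `query_fn`, `requery_after_wait`, `fake_job` and `fake_query` are hypothetical names, and the stub merely simulates a status that changes between two queries (the value 9 is arbitrary). It does not call SLURM or slurm-drmaa.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Hypothetical status query: returns 0 on success and stores the
 * exit status it currently sees in *exit_status. */
typedef int (*query_fn)(void *ctx, int *exit_status);

/* Query once, wait wait_ms milliseconds, then query again so that
 * the second answer replaces a possibly stale first one. */
int requery_after_wait(query_fn query, void *ctx, long wait_ms,
                       int *exit_status)
{
    struct timespec delay;
    int rc = query(ctx, exit_status);
    if (rc != 0)
        return rc;

    delay.tv_sec = wait_ms / 1000;
    delay.tv_nsec = (wait_ms % 1000) * 1000000L;
    nanosleep(&delay, NULL);

    return query(ctx, exit_status);
}

/* Simulated job for demonstration: the first query reports 0, later
 * queries report an arbitrary non-zero status. */
struct fake_job { int calls; };

int fake_query(void *ctx, int *exit_status)
{
    struct fake_job *job = ctx;
    job->calls++;
    *exit_status = (job->calls == 1) ? 0 : 9;
    return 0;
}
```

A wait like this only narrows the window rather than closing it, which is why forcing a non-zero status in the cancelled case looks like the more robust fix.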

TIM

[givemeabreak_patch.txt](https://github.com/natefoo/slurm-drmaa-old/files/571869/givemeabreak_patch.txt)

