Hi,
This seems to be a longstanding issue inherited from the old code, but I'm reporting it here because this is the version I'm using myself. In slurmdrmaa_job_update_status, a state of 32772/JOB_CANCELLED explicitly sets the exit status to -1, but that value is immediately overwritten because execution falls through into the next case of the switch statement. It seems pretty obvious to me that the author intended to put a 'break' at the end of that code block. (A classic C bug!)
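For anyone unfamiliar with this kind of bug, here is a minimal sketch of the fall-through pattern in C. It is not the actual slurm-drmaa source: the enum demo_state and the functions exit_status_buggy, exit_status_fixed and demo_self_check are hypothetical names made up for illustration, and the real switch handles more states than this.

```c
#include <assert.h>

/* Hypothetical stand-ins for illustration only; these are NOT the real
 * SLURM job states or the real slurm-drmaa data structures. */
enum demo_state { DEMO_CANCELLED, DEMO_COMPLETE };

/* Buggy shape: there is no 'break' after the cancelled case, so control
 * falls through and the reported status overwrites the forced -1. */
int exit_status_buggy(enum demo_state state, int reported_status)
{
    int exit_status = 0;

    switch (state) {
    case DEMO_CANCELLED:
        exit_status = -1;
        /* missing 'break' here: execution continues into the next case */
    case DEMO_COMPLETE:
        exit_status = reported_status;
        break;
    }
    return exit_status;
}

/* Fixed shape: the 'break' keeps the forced -1 for cancelled jobs. */
int exit_status_fixed(enum demo_state state, int reported_status)
{
    int exit_status = 0;

    switch (state) {
    case DEMO_CANCELLED:
        exit_status = -1;
        break;
    case DEMO_COMPLETE:
        exit_status = reported_status;
        break;
    }
    return exit_status;
}

/* Small self-check: a cancelled job whose reported status is 0. */
void demo_self_check(void)
{
    assert(exit_status_buggy(DEMO_CANCELLED, 0) == 0);   /* failure is hidden */
    assert(exit_status_fixed(DEMO_CANCELLED, 0) == -1);  /* failure is visible */
}
```

In the buggy version the forced -1 only ever survives by accident, so a cancelled job whose reported status is 0 looks like a success to the caller.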
Most of the time this doesn't matter, because SLURM gives the correct exit status. However, I've found that for jobs that have aborted due to overrunning a high memory limit on my cluster, the exit status gets reported as 0, and the caller then has no way to see that the job actually failed. Adding a small artificial wait and then re-querying the status fixes the problem, so I'm sure it's a race condition, and that forcing the status to -1 (or 15, or anything but 0) is reasonable behaviour to avoid it. Patch attached.
TIM
givemeabreak_patch.txt