all errors reported as FSD_ERRNO_INTERNAL_ERROR #1

@tbooth

Hi,

Thanks for merging my previous fix. This one is in a similar vein.

On line 134 of slurm_drmaa/job.c, any failure while updating the job status is reported back as FSD_ERRNO_INTERNAL_ERROR. The caller needs to know whether the error is intermittent (e.g. a network time-out), in which case the job status can probably be queried successfully a few minutes later, or terminal, in which case the job is dead. I've prepared a complementary patch to Snakemake that treats FSD_ERRNO_DRM_COMMUNICATION_FAILURE as an intermittent fault and keeps polling the job.
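To illustrate the caller-side behaviour described above, here is a minimal sketch in C of a polling loop that retries on a communication failure and gives up on anything else. It is not the Snakemake patch (which is Python) and not part of slurm-drmaa. The enum values, the `poll_fn` callback type, `poll_job` and the `scripted` test double are all hypothetical names invented for this example.

```c
#include <stddef.h>

/* Stand-in result codes for illustration only; a real caller would look at
 * the DRMAA error code returned by its own binding. */
typedef enum {
    POLL_OK,                    /* status query succeeded */
    POLL_COMMUNICATION_FAILURE, /* e.g. socket timed out: worth retrying */
    POLL_INTERNAL_ERROR         /* anything else: treated as fatal here */
} poll_result;

typedef enum {
    OUTCOME_STATUS_OBTAINED,    /* a query eventually succeeded */
    OUTCOME_GAVE_UP,            /* too many transient failures in a row */
    OUTCOME_FATAL               /* a non-transient error was reported */
} poll_outcome;

/* Hypothetical callback: performs one status query. */
typedef poll_result (*poll_fn)(void *ctx);

/* Keep polling while failures look transient, up to a retry budget. */
poll_outcome poll_job(poll_fn poll, void *ctx, int max_transient_failures)
{
    int transient = 0;
    for (;;) {
        switch (poll(ctx)) {
        case POLL_OK:
            return OUTCOME_STATUS_OBTAINED;
        case POLL_COMMUNICATION_FAILURE:
            if (++transient > max_transient_failures)
                return OUTCOME_GAVE_UP;
            /* a real caller would sleep here before the next attempt */
            break;
        default:
            return OUTCOME_FATAL;
        }
    }
}

/* Test double: replays a fixed sequence of results, repeating the last one.
 * The sequence must contain at least one element. */
typedef struct {
    const poll_result *seq;
    size_t len;
    size_t pos;
} scripted;

poll_result scripted_poll(void *ctx)
{
    scripted *s = (scripted *)ctx;
    poll_result r = s->seq[s->pos < s->len ? s->pos : s->len - 1];
    if (s->pos < s->len)
        s->pos++;
    return r;
}
```

With everything reported as an internal error, a loop like this has no way to take the retry branch, which is the motivation for the patch below.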

Really, the DRMAA library should make a better attempt to convert SLURM errors to meaningful DRMAA error codes, but this is a start.
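As a rough idea of what a more systematic conversion might look like, here is a sketch of a table-driven lookup in C. It is only an illustration, not code from slurm-drmaa. The `DEMO_*` constants are placeholders with made-up values standing in for the real SLURM and FSD error codes, and `demo_map_slurm_errno` is a hypothetical helper name.

```c
#include <stddef.h>

/* Placeholder values for illustration only; real code would use the
 * constants from the SLURM and drmaa_utils headers instead. */
enum {
    DEMO_SOCKET_TIMEOUT = 1,        /* stands in for SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT */
    DEMO_CTLD_CONNECTION_ERROR = 2, /* stands in for SLURMCTLD_COMMUNICATIONS_CONNECTION_ERROR */
    DEMO_SOMETHING_ELSE = 3         /* any code without a table entry */
};

enum {
    DEMO_FSD_INTERNAL_ERROR = 100,            /* stands in for FSD_ERRNO_INTERNAL_ERROR */
    DEMO_FSD_DRM_COMMUNICATION_FAILURE = 101  /* stands in for FSD_ERRNO_DRM_COMMUNICATION_FAILURE */
};

struct demo_errno_map {
    int slurm_code;
    int fsd_code;
};

/* One row per SLURM error that has a more specific DRMAA meaning. */
static const struct demo_errno_map demo_map[] = {
    { DEMO_SOCKET_TIMEOUT,        DEMO_FSD_DRM_COMMUNICATION_FAILURE },
    { DEMO_CTLD_CONNECTION_ERROR, DEMO_FSD_DRM_COMMUNICATION_FAILURE },
};

/* Look the code up in the table; fall back to the internal error,
 * which matches the current behaviour for unmapped codes. */
int demo_map_slurm_errno(int slurm_code)
{
    size_t i;
    for (i = 0; i < sizeof demo_map / sizeof demo_map[0]; i++) {
        if (demo_map[i].slurm_code == slurm_code)
            return demo_map[i].fsd_code;
    }
    return DEMO_FSD_INTERNAL_ERROR;
}
```

Keeping the mapping in one table would make it easy to add further rows as more SLURM errors are classified, rather than growing the if/else chain at each call site.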

Let me know if you'd prefer me to submit this stuff elsewhere. It's hard to see who is maintaining the definitive slurm-drmaa libs just now.

*** tim_testing_slurm//build/slurm-drmaa-1.2.0.2/slurm_drmaa/job.c.orig	2016-11-04 15:09:49.000000000 +0000
--- tim_testing_slurm//build/slurm-drmaa-1.2.0.2/slurm_drmaa/job.c	2017-06-09 15:05:38.000000000 +0100
***************
*** 131,138 ****
  
  			if (_slurm_errno == ESLURM_INVALID_JOB_ID) {
  				self->on_missing(self);
! 			} else {
! 				fsd_exc_raise_fmt(FSD_ERRNO_INTERNAL_ERROR,"slurm_load_jobs error: %s,job_id: %s", slurm_strerror(slurm_get_errno()), self->job_id);
  			}
  		}
  		if (job_info) {
--- 131,150 ----
  
  			if (_slurm_errno == ESLURM_INVALID_JOB_ID) {
  				self->on_missing(self);
! 			} else
!                 // We should detect the error corresponding to "Socket timed out" and report
!                 // it explicitly as FSD_ERRNO_TIMEOUT or maybe FSD_ERRNO_DRM_COMMUNICATION_FAILURE
!                 // ( I'm not sure if FSD_ERRNO_TIMEOUT is the same as DRMAA_ERRNO_EXIT_TIMEOUT,
!                 //   which simply indicates the job is still running?? Maybe we should try it and see. )
!                 // To see what _slurm_errno corresponds to which message let's look at
!                 // 'slurm_strerror' in the slurm source code...
!                 //   https://github.com/SchedMD/slurm/blob/master/src/common/slurm_errno.c
!             if ( _slurm_errno == SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT ||
!                  _slurm_errno == SLURMCTLD_COMMUNICATIONS_CONNECTION_ERROR
!                ) {
!                 fsd_exc_raise_fmt(FSD_ERRNO_DRM_COMMUNICATION_FAILURE,"slurm_load_jobs error: %s,job_id: %s", slurm_strerror(_slurm_errno), self->job_id);
!             } else {
! 				fsd_exc_raise_fmt(FSD_ERRNO_INTERNAL_ERROR,"slurm_load_jobs error: %s,job_id: %s", slurm_strerror(_slurm_errno), self->job_id);
  			}
  		}
  		if (job_info) {

Cheers,

TIM
