Check Documentation
I have checked the documentation and existing issues for this error, but I didn't find anything related to caching.
Description of the bug
When I relaunch a Sarek workflow, a seemingly random subset of MapReads jobs (ranging from 5% to 50%) is re-run despite cached results being available. This causes all downstream jobs for the affected samples to be re-run as well. For large datasets, especially those processed in the cloud, this bug incurs a significant cost whenever the workflow doesn't complete successfully on the first try, especially if most of the workflow had finished before the error arose.
I tried using previous versions of Sarek (specifically, 2.6.1, 2.5.2, and 2.5), and they all seem to run into this bug.

Steps to reproduce
I adapted the input file from the test profile by artificially increasing the number of rows to 100. This makes it easier to spot this issue, which doesn't always occur with only a few samples. It's also easier to notice the issue if you monitor the workflow in Nextflow Tower.
- Run the following Nextflow command, where `params.yml` contains the parameters copied below. It should complete successfully. On a local machine, this command took less than 2 minutes.

  ```
  nextflow run 'https://github.com/nf-core/sarek' -r '2.7.1' -dsl1 -profile 'test,docker' -params-file 'params.yml'
  ```

- Re-run the same Nextflow command with `-resume`. It should also complete successfully, but a subset of `MapReads` jobs will have been re-run instead of being cached.
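To quantify how many MapReads tasks were cached versus re-run on the second pass, the resume output can be grepped for Nextflow's "Cached process" and "Submitted process" lines. The snippet below is only a sketch: it fabricates a three-line log in the style of Nextflow's console output so the counting is runnable as-is; in practice, point the greps at the real secondpass-nextflow.log attached to this report.

```shell
# Sketch: count cached vs re-run MapReads tasks in a resumed run.
# The log content below is fabricated to mimic Nextflow's console lines;
# replace it with the real secondpass-nextflow.log from this report.
log='secondpass-nextflow.log'
cat > "$log" <<'EOF'
[12/ab34cd] Cached process > MapReads (sample_001)
[56/ef78ab] Submitted process > MapReads (sample_002)
[90/cd12ef] Cached process > MapReads (sample_003)
EOF
cached=$(grep -c 'Cached process > MapReads' "$log")
rerun=$(grep -c 'Submitted process > MapReads' "$log")
echo "cached=$cached rerun=$rerun"   # → cached=2 rerun=1
```

On a correct resume, the re-run count for MapReads should be zero when nothing changed between the two invocations.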
Parameters
The parameters aside from `input` are included to skip steps that aren't relevant to the reproducible example.

```yaml
input: https://gist.githubusercontent.com/BrunoGrandePhD/6869bf506e6acf7b920a22b666f4e443/raw/ff7ea0757ebbd324cd125b1d48f54e6ec9f969c9/sarek-test.tsv
skip_qc: "bamqc,baserecalibrator,bcftools,documentation,fastqc,markduplicates,multiqc,samtools,sentieon,vcftools,versions"
no_intervals: true
skip_markduplicates: true
known_indels: null
```
Expected behaviour
I expect that workflow re-runs will leverage cached results unless there's a reason to believe that the job needs to be updated. In the situation I describe above, nothing changes between the two runs, so there shouldn't be a reason for jobs to be re-run, especially not a seemingly random subset of jobs.
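One way to pin down why the affected tasks are re-run is to compare per-task cache hashes between the two runs, which Nextflow can emit via `nextflow log <run_name> -f hash,name` (`hash` and `name` are valid fields of the `log` subcommand; the run names are placeholders). The comparison step can be sketched as follows; the two listings are fabricated here so the comparison is runnable as-is:

```shell
# Sketch: find tasks whose cache hash differs between two runs.
# In practice, produce these listings with:
#   nextflow log <first_run_name>  -f hash,name > first.tsv
#   nextflow log <second_run_name> -f hash,name > second.tsv
# The data below is fabricated for illustration.
printf 'aa11\tMapReads (sample_001)\naa22\tMapReads (sample_002)\n' > first.tsv
printf 'aa11\tMapReads (sample_001)\nbb33\tMapReads (sample_002)\n' > second.tsv
# First pass records h[name]=hash from first.tsv; second pass reports names
# whose hash changed, i.e. tasks Nextflow would not resume from cache.
changed=$(awk -F'\t' 'NR==FNR {h[$2]=$1; next} h[$2]!=$1 {print $2}' first.tsv second.tsv)
echo "hash changed for: $changed"   # → hash changed for: MapReads (sample_002)
```

If the hashes do differ between otherwise identical runs, that points at a non-deterministic input to the hash (e.g. file staging order or timestamps) rather than an actual change in the data.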
Log files
The `.nextflow.log` files below correspond to my walkthrough of the steps listed in "Steps to reproduce":
- firstpass-nextflow.log
- secondpass-nextflow.log
System
- Hardware: AWS EC2 instances
- Executor: Tried on both local EC2 instances and on AWS Batch
- OS: Amazon Linux
- Version: 2 (Kernel: `Linux 4.14.232-176.381.amzn2.x86_64`)
Nextflow Installation
- Version: 22.04.2
Container engine
- Engine: Docker
- Version: 20.10.13
- Image tag: `nfcore/sarek:2.7.1`
Additional context