From @apeltzer cf SciLifeLab#730
Might simply work like this: https://software.broadinstitute.org/gatk/documentation/article?id=11245
Copying the text from there over here:
You don't need a Spark cluster to run Spark-enabled GATK tools!
If you're working on a "normal" machine (even just a laptop) with multiple CPU cores, the GATK engine can still use Spark to create a virtual standalone cluster in place, and set it to take advantage of however many cores are available on the machine -- or however many you choose to allocate. See the example parameters below and the local-Spark tutorial for more information on how to control this. And if your machine only has a single core, these tools can always be run in single-core mode -- it'll just take longer for them to finish.
To be clear, even the Spark-only tools can be run on regular machines, though in practice a few of them may be prohibitively slow (SV tools and PathSeq). See the Tool Docs for tool-specific recommendations.
If you do have access to a Spark cluster, the Spark-enabled tools are going to be extra happy but you may need to provide some additional parameters to use them effectively. See the cluster-Spark tutorial for more information.
Example command-line parameters
Here are some example arguments you would give to a Spark-enabled GATK tool:
--spark-master local[*] -> "Run on the local machine using all cores"
--spark-master local[2] -> "Run on the local machine using two cores"
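A minimal sketch of what that could look like for one of the Spark-enabled tools (MarkDuplicatesSpark here; file names are placeholders, and quoting `local[*]` just keeps the shell from globbing it):

```bash
# Placeholder input/output names; local[*] = use all cores on this machine
gatk MarkDuplicatesSpark \
    -I sample.bam \
    -O sample.markdup.bam \
    -M sample.markdup_metrics.txt \
    --spark-master 'local[*]'
```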
It should be possible to run e.g. Picard MarkDuplicates like this, and the same goes for some other tools such as BaseRecalibrator and HaplotypeCaller (see the sketch below).
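The Spark versions live under separate tool names (with a `Spark` suffix), so the calls might look roughly like this -- a sketch only, with placeholder file names; note that some of the Spark variants are still flagged as beta and may have slightly different input requirements, so check the Tool Docs first:

```bash
# Placeholder names throughout; arguments mirror the non-Spark tools
gatk BaseRecalibratorSpark \
    -I sample.markdup.bam \
    -R reference.fasta \
    --known-sites dbsnp.vcf.gz \
    -O sample.recal.table \
    --spark-master 'local[4]'

gatk HaplotypeCallerSpark \
    -I sample.recalibrated.bam \
    -R reference.fasta \
    -O sample.vcf.gz \
    --spark-master 'local[4]'
```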
https://gatkforums.broadinstitute.org/gatk/discussion/23441/markduplicatespark-is-slower-than-normal-markduplicates
We should really look into this!
Apparently it benefits from not sorting reads prior to MarkDuplicates, taking the BWA-MEM BAM output directly. So we should really try it :-)
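Roughly, that would mean feeding the queryname-grouped BAM straight from bwa mem into MarkDuplicatesSpark and skipping the intermediate coordinate sort -- a sketch with placeholder file names and a hypothetical read group:

```bash
# bwa mem output is grouped by query name, which is what MarkDuplicatesSpark expects,
# so no samtools sort step in between (placeholder sample/reference names)
bwa mem -t 8 -R '@RG\tID:sample\tSM:sample\tLB:lib1\tPL:ILLUMINA' \
    reference.fasta sample_R1.fastq.gz sample_R2.fastq.gz \
    | samtools view -b -o sample.unsorted.bam -

# MarkDuplicatesSpark coordinate-sorts its output by default,
# so no separate sort is needed afterwards either
gatk MarkDuplicatesSpark \
    -I sample.unsorted.bam \
    -O sample.markdup.bam \
    --spark-master 'local[8]'
```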
We should move this over to nf-core/sarek ;)
Moved over ;-)