Use spark for GATK #64

@maxulysse

From @apeltzer cf SciLifeLab#730

Might simply work like this: https://software.broadinstitute.org/gatk/documentation/article?id=11245

Copying the text from there over here:

You don't need a Spark cluster to run Spark-enabled GATK tools!

If you're working on a "normal" machine (even just a laptop) with multiple CPU cores, the GATK engine can still use Spark to create a virtual standalone cluster in place, and set it to take advantage of however many cores are available on the machine -- or however many you choose to allocate. See the example parameters below and the local-Spark tutorial for more information on how to control this. And if your machine only has a single core, these tools can always be run in single-core mode -- it'll just take longer for them to finish.

To be clear, even the Spark-only tools can be run on regular machines, though in practice a few of them may be prohibitively slow (SV tools and PathSeq). See the Tool Docs for tool-specific recommendations.

If you do have access to a Spark cluster, the Spark-enabled tools are going to be extra happy but you may need to provide some additional parameters to use them effectively. See the cluster-Spark tutorial for more information.

Example command-line parameters

Here are some example arguments you would give to a Spark-enabled GATK tool:

    --spark-master local[*] -> "Run on the local machine using all cores"
    --spark-master local[2] -> "Run on the local machine using two cores" 
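
To make the two arguments above concrete, here is a rough sketch of how a wrapper could build the `--spark-master` value from a core count. The helper names `spark_master` and `markdup_spark_cmd` and the file names `input.bam` and `marked.bam` are made up for illustration; only `--spark-master local[N]` comes from the text above. The command is printed rather than executed, so the sketch runs without GATK installed.

```shell
#!/bin/sh
# Hypothetical helper: turn a requested core count into a --spark-master value.
# No argument or "all" maps to local[*], i.e. use every available core.
spark_master() {
    cores="${1:-all}"
    case "$cores" in
        all) printf 'local[*]\n' ;;
        *[!0-9]*) echo "invalid core count: $cores" >&2; return 1 ;;
        *) printf 'local[%s]\n' "$cores" ;;
    esac
}

# Hypothetical helper: print (do not run) a MarkDuplicatesSpark command line.
# input.bam and marked.bam are placeholder file names.
markdup_spark_cmd() {
    master="$(spark_master "$1")" || return 1
    # The value is single-quoted so the shell does not treat [..] or * as a glob.
    printf "gatk MarkDuplicatesSpark -I input.bam -O marked.bam --spark-master '%s'\n" "$master"
}

markdup_spark_cmd 2
markdup_spark_cmd all
```

Quoting the value matters in practice: an unquoted `local[*]` or `local[2]` is a glob pattern and may be expanded by the shell if a matching file name happens to exist in the working directory.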

It should be possible to run e.g. Picard MarkDuplicates like this; the same goes for some other tools such as BaseRecalibrator and HaplotypeCaller.


https://gatkforums.broadinstitute.org/gatk/discussion/23441/markduplicatespark-is-slower-than-normal-markduplicates

We should really look into this!


Apparently MarkDuplicatesSpark benefits when reads are not sorted before duplicate marking and it takes the BAM output of BWA-MEM directly. So we should really try it :-)


We should move this over to nf-core/sarek ;)


Moved over ;-)

Labels: enhancement, help wanted
