From @apeltzer cf SciLifeLab#730
Might simply work like this: https://software.broadinstitute.org/gatk/documentation/article?id=11245
Copying the text from there over here:
You don't need a Spark cluster to run Spark-enabled GATK tools!
If you're working on a "normal" machine (even just a laptop) with multiple CPU cores, the GATK engine can still use Spark to create a virtual standalone cluster in place, and set it to take advantage of however many cores are available on the machine -- or however many you choose to allocate. See the example parameters below and the local-Spark tutorial for more information on how to control this. And if your machine only has a single core, these tools can always be run in single-core mode -- it'll just take longer for them to finish.
To be clear, even the Spark-only tools can be run on regular machines, though in practice a few of them may be prohibitively slow (SV tools and PathSeq). See the Tool Docs for tool-specific recommendations.
If you do have access to a Spark cluster, the Spark-enabled tools are going to be extra happy but you may need to provide some additional parameters to use them effectively. See the cluster-Spark tutorial for more information.
Example command-line parameters
Here are some example arguments you would give to a Spark-enabled GATK tool:
--spark-master local[*] -> "Run on the local machine using all cores"
--spark-master local[2] -> "Run on the local machine using two cores"
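A minimal sketch of what that could look like for one of the Spark-enabled tools (MarkDuplicatesSpark here; file names are placeholders, and quoting `local[*]` just keeps the shell from globbing it):

```bash
# Placeholder input/output names; local[*] = use all cores on this machine
gatk MarkDuplicatesSpark \
    -I sample.bam \
    -O sample.markdup.bam \
    -M sample.markdup_metrics.txt \
    --spark-master 'local[*]'
```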
It should be possible to run e.g. Picard MarkDuplicates like this, and the same goes for some other tools such as BaseRecalibrator and HaplotypeCaller (see the sketch below).
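The Spark versions live under separate tool names (with a `Spark` suffix), so the calls might look roughly like this -- a sketch only, with placeholder file names; note that some of the Spark variants are still flagged as beta and may have slightly different input requirements, so check the Tool Docs first:

```bash
# Placeholder names throughout; arguments mirror the non-Spark tools
gatk BaseRecalibratorSpark \
    -I sample.markdup.bam \
    -R reference.fasta \
    --known-sites dbsnp.vcf.gz \
    -O sample.recal.table \
    --spark-master 'local[4]'

gatk HaplotypeCallerSpark \
    -I sample.recalibrated.bam \
    -R reference.fasta \
    -O sample.vcf.gz \
    --spark-master 'local[4]'
```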
https://gatkforums.broadinstitute.org/gatk/discussion/23441/markduplicatespark-is-slower-than-normal-markduplicates
We should really look into this!
Apparently it benefits from not sorting reads prior to MarkDuplicates, taking the BWA-MEM BAM output directly. So we should really try it :-)
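Roughly, that would mean feeding the queryname-grouped BAM straight from bwa mem into MarkDuplicatesSpark and skipping the intermediate coordinate sort -- a sketch with placeholder file names and a hypothetical read group:

```bash
# bwa mem output is grouped by query name, which is what MarkDuplicatesSpark expects,
# so no samtools sort step in between (placeholder sample/reference names)
bwa mem -t 8 -R '@RG\tID:sample\tSM:sample\tLB:lib1\tPL:ILLUMINA' \
    reference.fasta sample_R1.fastq.gz sample_R2.fastq.gz \
    | samtools view -b -o sample.unsorted.bam -

# MarkDuplicatesSpark coordinate-sorts its output by default,
# so no separate sort is needed afterwards either
gatk MarkDuplicatesSpark \
    -I sample.unsorted.bam \
    -O sample.markdup.bam \
    --spark-master 'local[8]'
```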
We should move this over to nf-core/sarek ;)
Moved over ;-)