
Large memory footprint in bedtools coverage process #824

@TCLamnidis

Description of the bug

Using bedtools coverage within eager currently requires a large amount of memory, which can be inefficient and prohibitive.
With a 1.1 GB input BAM, the process required 30 GB (!) of memory to complete, causing multiple retries.

$ cat .command.sh
#!/bin/bash -euo pipefail
bedtools coverage -nonamecheck -a 1240K.pos.list_hs37d5.0based.bed -b PMR002.A0101.SG1_rmdup.bam | pigz -p 1 > "PMR002.A0101.SG1_rmdup".breadth.gz
bedtools coverage -nonamecheck -a 1240K.pos.list_hs37d5.0based.bed -b PMR002.A0101.SG1_rmdup.bam -mean | pigz -p 1 > "PMR002.A0101.SG1_rmdup".depth.gz

$ /bin/time -v bash .command.sh
    Command being timed: "bash .command.sh"
    User time (seconds): 157.60
    System time (seconds): 42.88
    Percent of CPU this job got: 102%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 3:14.69
[...]
    Maximum resident set size (kbytes): 30595900
[...]
    Exit status: 0

Expected behaviour

Get coverage calculations with a smaller memory footprint.

Additional context

It seems that passing a genome file (-g) together with the -sorted flag to bedtools coverage cuts down the required computational resources massively.
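For context, the -g genome file is just a two-column chromosome/length table derived from the @SQ header lines. A minimal sketch of that extraction using awk instead of sed (the sample header lines here stand in for real `samtools view -H <bam>` output):

```shell
# Sample @SQ lines standing in for `samtools view -H <bam>` output
# (assumption: the BAM header is the source of truth for contig lengths).
printf '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:1\tLN:249250621\n@SQ\tSN:2\tLN:243199373\n' |
  awk -F'\t' '$1 == "@SQ" { sub(/^SN:/, "", $2); sub(/^LN:/, "", $3); print $2 "\t" $3 }' \
  > genome.txt
cat genome.txt
```

The contig order in this file must match the sort order of the inputs for -sorted mode to work.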

$ cat .command.sh_edited
#!/bin/bash -euo pipefail
samtools view -H PMR002.A0101.SG1_rmdup.bam |grep @SQ|sed 's/@SQ\tSN:\|LN://g' > genome.txt
bedtools coverage -nonamecheck -a 1240K.pos.list_hs37d5.0based.bed -g genome.txt -sorted -b PMR002.A0101.SG1_rmdup.bam | pigz -p 1 > "PMR002.A0101.SG1_rmdup".breadth.new.gz
bedtools coverage -nonamecheck -a 1240K.pos.list_hs37d5.0based.bed -g genome.txt -sorted -b PMR002.A0101.SG1_rmdup.bam -mean | pigz -p 1 > "PMR002.A0101.SG1_rmdup".depth.new.gz

$ /bin/time -v bash .command.sh_edited
    Command being timed: "bash .command.sh_edited"
    User time (seconds): 80.94
    System time (seconds): 1.03
    Percent of CPU this job got: 110%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:14.05
[...]
    Maximum resident set size (kbytes): 23764
[...]
    Exit status: 0

In my limited test set, the output file contents are identical, though the gzipped versions of them have different checksums (expected, since the gzip header embeds a timestamp).

$ md5sum PMR002.A0101.SG1_rmdup.{dept,bread}*[!gz]
b1958389b5520a5f31e084eca5458610  PMR002.A0101.SG1_rmdup.depth
b1958389b5520a5f31e084eca5458610  PMR002.A0101.SG1_rmdup.depth.new
32c1422d4cd05f84616057c736105b6b  PMR002.A0101.SG1_rmdup.breadth
32c1422d4cd05f84616057c736105b6b  PMR002.A0101.SG1_rmdup.breadth.new

$ md5sum PMR002.A0101.SG1_rmdup.{dept,bread}*gz
707a260525c99e11e7711f6d14488e68  PMR002.A0101.SG1_rmdup.depth.gz
0b3ebac66221ce100ab55fab1208f7ea  PMR002.A0101.SG1_rmdup.depth.new.gz
87f8c0042a20e79600d11356bfd4d811  PMR002.A0101.SG1_rmdup.breadth.gz
e75df82f6254968a1d09f8a902ab1b64  PMR002.A0101.SG1_rmdup.breadth.new.gz
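The differing .gz checksums are generic gzip behaviour rather than a content difference: the gzip header stores the input's modification time (and name), so archives of identical content can differ byte-for-byte. A standalone demonstration, not using the files above:

```shell
# Compress the same file twice, touching it in between so only the header
# mtime changes; payloads stay identical while the .gz bytes differ.
printf 'same content\n' > data.txt
gzip -c data.txt > a.gz
sleep 1
touch data.txt          # change only the modification time
gzip -c data.txt > b.gz
cmp -s a.gz b.gz || echo "gz files differ"
[ "$(gzip -dc a.gz)" = "$(gzip -dc b.gz)" ] && echo "contents identical"
```

Comparing decompressed streams (e.g. `zcat f.gz | md5sum`), or compressing with `gzip -n`, sidesteps this when reproducible checksums are needed.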

The genome file is generated on the spot, and I believe the BAM files that reach this process within eager will always be coordinate-sorted, so this seems like an obvious optimisation step.
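If one wanted a cheap guard for the "always sorted" assumption, the SO: tag of the @HD line could be checked before enabling -sorted. A sketch on a sample header line (a real pipeline would feed it `samtools view -H <bam> | head -n1`):

```shell
# Sample @HD line (assumption: real input would come from `samtools view -H`).
header="$(printf '@HD\tVN:1.6\tSO:coordinate')"
case "$header" in
  *SO:coordinate*) sorted=yes ;;  # safe to pass -g/-sorted to bedtools coverage
  *)               sorted=no  ;;  # fall back to the default (memory-hungry) mode
esac
echo "coordinate-sorted: $sorted"
```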

Is there a reason to avoid implementing this?

Metadata

Labels

bug (Something isn't working)
