Description of the bug
Using bedtools coverage within eager currently requires a large amount of memory, which can be inefficient and prohibitive.
With a 1.1 GB input file, the process required 30 GB(!) of memory to complete, causing multiple retries.
$ cat .command.sh
#!/bin/bash -euo pipefail
bedtools coverage -nonamecheck -a 1240K.pos.list_hs37d5.0based.bed -b PMR002.A0101.SG1_rmdup.bam | pigz -p 1 > "PMR002.A0101.SG1_rmdup".breadth.gz
bedtools coverage -nonamecheck -a 1240K.pos.list_hs37d5.0based.bed -b PMR002.A0101.SG1_rmdup.bam -mean | pigz -p 1 > "PMR002.A0101.SG1_rmdup".depth.gz
$ /bin/time -v bash .command.sh
Command being timed: "bash .command.sh"
User time (seconds): 157.60
System time (seconds): 42.88
Percent of CPU this job got: 102%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:14.69
[...]
Maximum resident set size (kbytes): 30595900
[...]
Exit status: 0
Expected behaviour
Get coverage calculations with a smaller memory footprint.
Additional context
Supplying a genome file (-g) and the -sorted flag to bedtools coverage seems to cut down the computational resources massively.
$ cat .command.sh_edited
#!/bin/bash -euo pipefail
samtools view -H PMR002.A0101.SG1_rmdup.bam | grep @SQ | sed 's/@SQ\tSN:\|LN://g' > genome.txt
bedtools coverage -nonamecheck -a 1240K.pos.list_hs37d5.0based.bed -g genome.txt -sorted -b PMR002.A0101.SG1_rmdup.bam | pigz -p 1 > "PMR002.A0101.SG1_rmdup".breadth.new.gz
bedtools coverage -nonamecheck -a 1240K.pos.list_hs37d5.0based.bed -g genome.txt -sorted -b PMR002.A0101.SG1_rmdup.bam -mean | pigz -p 1 > "PMR002.A0101.SG1_rmdup".depth.new.gz
$ /bin/time -v bash .command.sh_edited
Command being timed: "bash .command.sh_edited"
User time (seconds): 80.94
System time (seconds): 1.03
Percent of CPU this job got: 110%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:14.05
[...]
Maximum resident set size (kbytes): 23764
[...]
Exit status: 0
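For reference, the sed step above reduces each @SQ header line to the two-column chrom/length format that bedtools expects for -g. A minimal sketch with a made-up header line (the \| alternation requires GNU sed):

```shell
# A hypothetical @SQ line as emitted by `samtools view -H`; the sed deletes
# the "@SQ\tSN:" and "LN:" tags, leaving "<chrom>\t<length>" per line,
# which is the genome-file format bedtools coverage -g expects.
printf '@SQ\tSN:1\tLN:249250621\n' | sed 's/@SQ\tSN:\|LN://g'
# → 1	249250621
```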
In my limited test set, the output file contents are identical, though the gzipped versions have different checksums (expected, since the gzip header embeds metadata such as the modification time).
$ md5sum PMR002.A0101.SG1_rmdup.{dept,bread}*[!gz]
b1958389b5520a5f31e084eca5458610 PMR002.A0101.SG1_rmdup.depth
b1958389b5520a5f31e084eca5458610 PMR002.A0101.SG1_rmdup.depth.new
32c1422d4cd05f84616057c736105b6b PMR002.A0101.SG1_rmdup.breadth
32c1422d4cd05f84616057c736105b6b PMR002.A0101.SG1_rmdup.breadth.new
$ md5sum PMR002.A0101.SG1_rmdup.{dept,bread}*gz
707a260525c99e11e7711f6d14488e68 PMR002.A0101.SG1_rmdup.depth.gz
0b3ebac66221ce100ab55fab1208f7ea PMR002.A0101.SG1_rmdup.depth.new.gz
87f8c0042a20e79600d11356bfd4d811 PMR002.A0101.SG1_rmdup.breadth.gz
e75df82f6254968a1d09f8a902ab1b64 PMR002.A0101.SG1_rmdup.breadth.new.gz
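The .gz checksum mismatch is expected even with byte-identical payloads: the gzip header (RFC 1952) records metadata such as the original file name and a timestamp, so compressing the same bytes at different times or under different names produces different archives (gzip -n omits this metadata). A quick sketch with throwaway file names:

```shell
# Identical content written to two differently named files:
printf '1\t0\t1\t5\n' > a.bed
printf '1\t0\t1\t5\n' > b.bed
gzip -k a.bed b.bed
# The archives can differ (header metadata: name, mtime), but the
# decompressed streams are identical:
if [ "$(zcat a.bed.gz | md5sum)" = "$(zcat b.bed.gz | md5sum)" ]; then
  echo "payloads identical"
fi
```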
The genome file is generated on the spot, and I believe the BAM files that reach this process within eager are always coordinate-sorted, so this seems like an obvious optimisation step.
Is there a reason to avoid implementing this?