
Large memory footprint in bedtools coverage process #824

@TCLamnidis

Description of the bug

Using bedtools coverage within eager currently requires a large amount of memory, which can be inefficient and prohibitive.
With a 1.1 GB input BAM, the process required 30 GB (!) of memory to complete, causing multiple retries.

$ cat .command.sh
#!/bin/bash -euo pipefail
bedtools coverage -nonamecheck -a 1240K.pos.list_hs37d5.0based.bed -b PMR002.A0101.SG1_rmdup.bam | pigz -p 1 > "PMR002.A0101.SG1_rmdup".breadth.gz
bedtools coverage -nonamecheck -a 1240K.pos.list_hs37d5.0based.bed -b PMR002.A0101.SG1_rmdup.bam -mean | pigz -p 1 > "PMR002.A0101.SG1_rmdup".depth.gz

$ /bin/time -v bash .command.sh
    Command being timed: "bash .command.sh"
    User time (seconds): 157.60
    System time (seconds): 42.88
    Percent of CPU this job got: 102%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 3:14.69
[...]
    Maximum resident set size (kbytes): 30595900
[...]
    Exit status: 0

Expected behaviour

Get coverage calculations with a smaller memory footprint.

Additional context

It seems that passing a genome file (-g) together with the -sorted flag to bedtools coverage cuts down the required computational resources massively.
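For context, the -g genome file is just a two-column chromosome/length table derived from the @SQ header lines. A minimal sketch of that extraction using awk instead of sed (the sample header lines here stand in for real `samtools view -H <bam>` output):

```shell
# Sample @SQ lines standing in for `samtools view -H <bam>` output
# (assumption: the BAM header is the source of truth for contig lengths).
printf '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:1\tLN:249250621\n@SQ\tSN:2\tLN:243199373\n' |
  awk -F'\t' '$1 == "@SQ" { sub(/^SN:/, "", $2); sub(/^LN:/, "", $3); print $2 "\t" $3 }' \
  > genome.txt
cat genome.txt
```

The contig order in this file must match the sort order of the inputs for -sorted mode to work.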

$ cat .command.sh_edited
#!/bin/bash -euo pipefail
samtools view -H PMR002.A0101.SG1_rmdup.bam |grep @SQ|sed 's/@SQ\tSN:\|LN://g' > genome.txt
bedtools coverage -nonamecheck -a 1240K.pos.list_hs37d5.0based.bed -g genome.txt -sorted -b PMR002.A0101.SG1_rmdup.bam | pigz -p 1 > "PMR002.A0101.SG1_rmdup".breadth.new.gz
bedtools coverage -nonamecheck -a 1240K.pos.list_hs37d5.0based.bed -g genome.txt -sorted -b PMR002.A0101.SG1_rmdup.bam -mean | pigz -p 1 > "PMR002.A0101.SG1_rmdup".depth.new.gz

$ /bin/time -v bash .command.sh_edited
    Command being timed: "bash .command.sh_edited"
    User time (seconds): 80.94
    System time (seconds): 1.03
    Percent of CPU this job got: 110%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:14.05
[...]
    Maximum resident set size (kbytes): 23764
[...]
    Exit status: 0

In my limited test set, the output file contents are identical, though the gzipped versions of them have different checksums (expected, since the gzip header embeds a timestamp).

$ md5sum PMR002.A0101.SG1_rmdup.{dept,bread}*[!gz]
b1958389b5520a5f31e084eca5458610  PMR002.A0101.SG1_rmdup.depth
b1958389b5520a5f31e084eca5458610  PMR002.A0101.SG1_rmdup.depth.new
32c1422d4cd05f84616057c736105b6b  PMR002.A0101.SG1_rmdup.breadth
32c1422d4cd05f84616057c736105b6b  PMR002.A0101.SG1_rmdup.breadth.new

$ md5sum PMR002.A0101.SG1_rmdup.{dept,bread}*gz
707a260525c99e11e7711f6d14488e68  PMR002.A0101.SG1_rmdup.depth.gz
0b3ebac66221ce100ab55fab1208f7ea  PMR002.A0101.SG1_rmdup.depth.new.gz
87f8c0042a20e79600d11356bfd4d811  PMR002.A0101.SG1_rmdup.breadth.gz
e75df82f6254968a1d09f8a902ab1b64  PMR002.A0101.SG1_rmdup.breadth.new.gz
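The differing .gz checksums are generic gzip behaviour rather than a content difference: the gzip header stores the input's modification time (and name), so archives of identical content can differ byte-for-byte. A standalone demonstration, not using the files above:

```shell
# Compress the same file twice, touching it in between so only the header
# mtime changes; payloads stay identical while the .gz bytes differ.
printf 'same content\n' > data.txt
gzip -c data.txt > a.gz
sleep 1
touch data.txt          # change only the modification time
gzip -c data.txt > b.gz
cmp -s a.gz b.gz || echo "gz files differ"
[ "$(gzip -dc a.gz)" = "$(gzip -dc b.gz)" ] && echo "contents identical"
```

Comparing decompressed streams (e.g. `zcat f.gz | md5sum`), or compressing with `gzip -n`, sidesteps this when reproducible checksums are needed.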

The genome file is generated on the spot, and I believe the BAM files that reach this process within eager will always be coordinate-sorted, so this seems like an obvious optimisation step.
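If one wanted a cheap guard for the "always sorted" assumption, the SO: tag of the @HD line could be checked before enabling -sorted. A sketch on a sample header line (a real pipeline would feed it `samtools view -H <bam> | head -n1`):

```shell
# Sample @HD line (assumption: real input would come from `samtools view -H`).
header="$(printf '@HD\tVN:1.6\tSO:coordinate')"
case "$header" in
  *SO:coordinate*) sorted=yes ;;  # safe to pass -g/-sorted to bedtools coverage
  *)               sorted=no  ;;  # fall back to the default (memory-hungry) mode
esac
echo "coordinate-sorted: $sorted"
```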

Is there a reason to avoid implementing this?

Metadata

Labels

bug (Something isn't working)
