Commit 489c112

2 parents d54938b + ef519e3 commit 489c112

1 file changed

Lines changed: 31 additions & 0 deletions

File tree

README.md

@@ -0,0 +1,31 @@
## PartSortBWT

Low-memory BWT computation. Usage:

```
std::string input = ...; // must end with 0, all other characters in 1-5
std::string output;
PartSortBWT(input, output); // computes the BWT and stores it in output
PartSortBWT(input, input); // computes the BWT in place, overwriting input
```

The input and output can be the same string. For a real example see [src/main.cpp](src/main.cpp). The input string must end with 0, and all other characters must be in the range 1-5, representing the alphabet `$NACGT`. Not recommended for strings with long runs of N's, e.g. GRCh38, which runs out of memory in the benchmark.
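
The input constraints above can be met with a small encoding step before calling `PartSortBWT`. The following is a minimal sketch, not code from the repository: `encodeForBWT` is a hypothetical helper, and it assumes that the codes are byte values assigned in the order of `$NACGT` (`$` = 0, `N` = 1, `A` = 2, `C` = 3, `G` = 4, `T` = 5). Check [src/main.cpp](src/main.cpp) for the encoding the author actually uses.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of PartSortBWT.
// ASSUMPTION: codes are byte values in the order of `$NACGT`,
// i.e. $ = 0, N = 1, A = 2, C = 3, G = 4, T = 5.
std::string encodeForBWT(const std::string& dna) {
    std::string encoded;
    encoded.reserve(dna.size() + 1);
    for (char c : dna) {
        switch (c) {
            case 'N': case 'n': encoded.push_back(1); break;
            case 'A': case 'a': encoded.push_back(2); break;
            case 'C': case 'c': encoded.push_back(3); break;
            case 'G': case 'g': encoded.push_back(4); break;
            case 'T': case 't': encoded.push_back(5); break;
            default:
                throw std::invalid_argument("unsupported character in input");
        }
    }
    encoded.push_back(0); // terminator, the `$` of the alphabet
    return encoded;
}
```

The result can then be passed as the `input` argument. Rejecting unknown characters rather than mapping them to N is a design choice of this sketch; other inputs may call for a different policy.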

### Benchmark

Measured with `/usr/bin/time -v bin/main input_file > /dev/null` using commit [19bb9f4](https://github.com/maickrau/PartSortBWT/commit/19bb9f42b6892d62f7952889ad96fb475c5f35a5) on a laptop.

| Dataset | Size | Time (h:mm:ss) | Memory |
| --- | --- | --- | --- |
| CHM13 chr1 hpc | 175 Mbp | 0:00:44 | 291 MB |
| CHM13 hpc | 2.1 Gbp | 0:11:35 | 3.6 GB |
| CHM13 | 3.1 Gbp | 0:30:03 | 4.9 GB |
| Assembly graph hpc | 3.8 Gbp | 0:37:26 | 6.5 GB |
| AAAAAA... | 1 Mbp | 0:14:49 | 19 MB |
| GRCh38 | 3.3 Gbp | out of memory crash | >12 GB |

### Method

Suffixes are classified by their first 4 characters. All suffixes starting with AAAA are collected and sorted first, which gives the first x characters of the BWT; the suffixes starting with AAAC then give the next y characters, and so on for every 4-character prefix.
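
The bucketing idea can be illustrated with a deliberately naive sketch. This is not the PartSortBWT implementation: `bucketKey` and `naiveBucketedBWT` are hypothetical names, the code rescans the whole input once per bucket, and it sorts with plain suffix comparisons, so it is far slower than the real tool. It assumes the input uses byte values 1-5 and ends with a single 0, and it visits the buckets in increasing order of those byte values.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative only, not the PartSortBWT implementation.
// ASSUMPTION: `text` holds byte values 1-5 and ends with a single 0.
constexpr std::size_t kPrefixLen = 4;
constexpr std::size_t kAlphabet = 6;              // $, N, A, C, G, T
constexpr std::size_t kBuckets = 6 * 6 * 6 * 6;   // one bucket per 4-character prefix

// Key of the suffix at `pos`: its first four characters read as a base-6
// number, padded with 0 when the suffix is shorter than four characters.
std::size_t bucketKey(const std::string& text, std::size_t pos) {
    std::size_t key = 0;
    for (std::size_t j = 0; j < kPrefixLen; ++j) {
        std::size_t c = pos + j < text.size()
            ? static_cast<unsigned char>(text[pos + j])
            : 0;
        key = key * kAlphabet + c;
    }
    return key;
}

std::string naiveBucketedBWT(const std::string& text) {
    std::string bwt;
    bwt.reserve(text.size());
    std::vector<std::size_t> bucket; // suffix positions of the current bucket only
    for (std::size_t key = 0; key < kBuckets; ++key) {
        bucket.clear();
        for (std::size_t pos = 0; pos < text.size(); ++pos) {
            if (bucketKey(text, pos) == key) bucket.push_back(pos);
        }
        // Sort the suffixes of this bucket lexicographically.
        std::sort(bucket.begin(), bucket.end(),
                  [&text](std::size_t a, std::size_t b) {
                      return text.compare(a, std::string::npos,
                                          text, b, std::string::npos) < 0;
                  });
        // Each sorted suffix contributes the character that precedes it.
        for (std::size_t pos : bucket) {
            bwt.push_back(pos == 0 ? text.back() : text[pos - 1]);
        }
    }
    return bwt;
}
```

Because buckets are processed in key order and each bucket's output is appended immediately, only the suffix positions of a single bucket are held in memory at any time, at the cost of repeated passes over the input in this sketch.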

For an input of n characters, expected memory use for a random DNA string is `input + output + ~0.875n` bytes. If the input and output are the same string, total memory use is ~1.875n bytes. Less random strings use more memory; the worst case is a single repeated character (e.g. `NNNNNNNN...`), which uses ~16n bytes. Expected runtime for a random string is `O(n log^2 n)`, and the worst case is `O(n^2 log n)`.
