Commit ad62f98

Merge pull request #484 from vgteam/tabix
Integrating tabix-based index files for faster subgraph extraction
2 parents 4968e67 + 9bd8dbc commit ad62f98

31 files changed

Lines changed: 1735 additions & 369 deletions

README.md

Lines changed: 4 additions & 0 deletions
@@ -120,6 +120,10 @@ To load your own data into the Sequence Tube Map, see the guide to [Adding Your
 
 Previously we provided a Docker image at [https://hub.docker.com/r/wolfib/sequencetubemap/](https://hub.docker.com/r/wolfib/sequencetubemap/), which contained the build of this repo as well as a vg executable for data preprocessing and extraction. We now recommend a different installation approach, either using the [online version](#online-version) or a full installation of the [local version](#local-version). However, if you would like to Dockerize the Sequence Tube Map, the repository includes a `Dockerfile`.
 
+## Using tabix-based index files
+
+More information about using this faster alternative is available in [README.tabix.md](README.tabix.md).
+
 ## Contributing
 
 For information on how to develop on the Sequence Tube Map codebase, please see the [Development Guide](doc/development.md).

README.tabix.md

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
## Tabix-based index files

Three files are used, each indexed with tabix (which adds a companion `.tbi` file):

1. `nodes.tsv.gz` contains the sequence of each node.
2. `pos.bed.gz` contains the position (as node intervals) of regions on each haplotype.
3. `haps.gaf.gz` contains the path followed by each haplotype (split in pieces).

Briefly, these three index files can be quickly queried to extract a subgraph covering a region of interest: the `pos.bed.gz` index first tells us which nodes are covered, the `nodes.tsv.gz` index then gives us the sequence of these nodes, and finally we stitch together the haplotype pieces in those nodes from the `haps.gaf.gz` index.
This approach is implemented in the [`chunkix.py`](scripts/chunkix.py) script, which can produce either a GFA file or the files used by the sequenceTubeMap.
The sequenceTubeMap uses this script internally when given tabix-based index files.
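
The three-step lookup described above can be sketched in a few lines of Python. This is a toy illustration, not the actual `chunkix.py` logic: small in-memory tables stand in for the tabix-indexed files, and their layout (the `POS`, `NODES`, and `HAPS` structures, the node-interval encoding, and the `extract_subgraph` helper) is assumed for the example only.

```python
# Toy stand-ins for the three index files (all layouts are assumptions).
# pos.bed.gz: haplotype regions mapped to the interval of node IDs covering them.
POS = [
    # (haplotype, start, end, first_node, last_node)
    ("hapA", 0, 100, 1, 3),
    ("hapA", 100, 200, 4, 6),
]
# nodes.tsv.gz: node ID -> node sequence.
NODES = {1: "ACG", 2: "T", 3: "GGA", 4: "C", 5: "TTA", 6: "G"}
# haps.gaf.gz: haplotype pieces as (haplotype name, list of node IDs).
HAPS = [
    ("hapA", [1, 2, 3]),
    ("hapA", [4, 5, 6]),
    ("hapB", [1, 3, 4]),
    ("hapB", [5, 6]),
]


def extract_subgraph(hap, start, end):
    """Return (node sequences, haplotype paths) for the region hap:start-end."""
    # Step 1: find the node-ID range covered by the query region.
    hits = [r for r in POS if r[0] == hap and r[1] < end and r[2] > start]
    if not hits:
        return {}, {}
    lo = min(r[3] for r in hits)
    hi = max(r[4] for r in hits)
    # Step 2: fetch the sequence of every node in that range.
    seqs = {n: s for n, s in NODES.items() if lo <= n <= hi}
    # Step 3: stitch together the haplotype pieces that touch those nodes.
    paths = {}
    for name, piece in HAPS:
        kept = [n for n in piece if lo <= n <= hi]
        if kept:
            paths.setdefault(name, []).extend(kept)
    return seqs, paths
```

In the real setup each step would be a tabix query against the corresponding file rather than a scan over a Python list, which is what makes the extraction fast on a large pangenome.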

## Using tabix-based index files in the sequenceTubeMap

The version on this `tabix` branch can use these index files, for example when they are provided as mounted files:

- the `pos.bed.gz` index in the *graph* field
- the `nodes.tsv.gz` index in the *node* field
- the `haps.gaf.gz` index in the *haplotype* field

---

![](images/mount.tabix.index.png)

---

Once the index files are mounted, one can query any region on any haplotype in the form *HAPNAME_CONTIG:START-END*.
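
As an illustration of the query format, here is a minimal sketch that splits such a region string into its parts. The `parse_region` helper and its regular expression are assumptions made for this example; they are not taken from the sequenceTubeMap code, and the sketch does not address whether coordinates are 0-based or 1-based.

```python
import re

# Path name (haplotype and contig), then ":", then START-END.
# The greedy ".+" keeps any earlier ":" or "#" inside the path name.
_REGION = re.compile(r"^(?P<path>.+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(region):
    """Split 'PATH:START-END' into (path, start, end)."""
    match = _REGION.match(region.strip())
    if match is None:
        raise ValueError(f"not of the form PATH:START-END: {region!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"start is after end in {region!r}")
    return match["path"], start, end
```

For example, the default region of the prepared data set, `GRCh38#0#chr17:7674450-7675333`, splits into the path name `GRCh38#0#chr17` and the interval 7674450 to 7675333.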

Other tracks, for example reads or annotations in bgzipped/indexed GAF files, can be added as *reads* in the menu.

---

![](images/mount.tabix.index.annot.png)

---

Note that you can set a color for each track, either from the existing palettes or by picking a specific color.

---

![](images/mount.tabix.index.annot.color.png)

---

## ~~Installation~~ Using the docker container

A Docker image with this new sequenceTubeMap version and all the necessary dependencies is available at `quay.io/jmonlong/sequencetubemap:tabix_dev`.

To use it, run:

```sh
docker run -it -p 3210:3000 -v `pwd`:/data quay.io/jmonlong/sequencetubemap:tabix_dev
```

Note that the `-p 3210:3000` option maps port 3000 in the container to port 3210 on the host.
In practice, pick an unused port.
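
One way to find an unused port is to ask the operating system for one. The sketch below is only a convenience, with hypothetical helper names (`find_free_port`, `docker_command`); it prints the same `docker run` command as above with the host port filled in. Another process could still claim the port between the check and the moment Docker starts.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose a free port
        return sock.getsockname()[1]


def docker_command(port, image="quay.io/jmonlong/sequencetubemap:tabix_dev"):
    """Build the docker run command, mapping host `port` to container port 3000."""
    return f"docker run -it -p {port}:3000 -v `pwd`:/data {image}"


if __name__ == "__main__":
    print(docker_command(find_free_port()))
```

The sequenceTubeMap is then reachable at `http://localhost:PORT/`, where `PORT` is the host port printed in the command.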

Then open: http://localhost:3210/

Note: For mounted files, this assumes all files (pangenomes, reads, annotations) are in the current working directory or in subdirectories.
To test with the files that are already prepared, download all the files (see below).
Then either load them as *custom* data, adding the tracks with the *Configure Tracks* button, or use the prepared data set "HPRC Minigraph-Cactus v1.1".
The files for this data set are defined in the [config.json file](docker/config.json) used to build the Docker image.

## Available tabix-based index files for the Minigraph-Cactus v1.1 pangenome

Index files and some annotations have been deposited at https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/

To download them all:

```
# pangenome index files
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.haps.gaf.gz
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.haps.gaf.gz.tbi
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.nodes.tsv.gz
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.nodes.tsv.gz.tbi
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.pos.bed.gz
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.pos.bed.gz.tbi

# annotation files
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/gene_exon.gaf.gz
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/gene_exon.gaf.gz.tbi
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/gwasCatalog.hprc-v1.1-mc-grch38.sorted.gaf.gz
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/gwasCatalog.hprc-v1.1-mc-grch38.sorted.gaf.gz.tbi
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/rm.gaf.gz
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/rm.gaf.gz.tbi
```
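
Each downloaded `.gz` file needs its `.tbi` companion next to it. A quick check for an interrupted or partial download could look like the sketch below; the `missing_tbi` helper is hypothetical and only tests that the companion file exists, not that the index is valid.

```python
from pathlib import Path


def missing_tbi(directory="."):
    """List the .gz files in `directory` that lack a companion .tbi index."""
    missing = []
    for gz in sorted(Path(directory).glob("*.gz")):
        # The index sits next to the data file, e.g. rm.gaf.gz -> rm.gaf.gz.tbi
        if not Path(str(gz) + ".tbi").exists():
            missing.append(gz.name)
    return missing


if __name__ == "__main__":
    for name in missing_tbi("."):
        print(f"missing index for {name}")
```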

## Building tabix-based index files from a GFA

### Optional. Make a GFA from a GBZ file

In some cases, you will want to use exactly the same pangenome space as a specific GBZ file, for example to visualize reads or annotations on that pangenome.
The GFA provided in the HPRC repo might not match exactly because some nodes may have been split when making the GBZ file.
You can convert a GBZ to a GFA (without translating the nodes back to the original GFA) with:

```sh
vg convert --no-translation -f -t 4 hprc-v1.1-mc-grch38.gbz | gzip > hprc-v1.1-mc-grch38.gfa.gz
```

### Run the `pgtabix.py` python script

The `pgtabix.py` script can be found in the [`scripts` directory](scripts).
It's also present in the `/build/sequenceTubeMap/scripts` directory of the Docker container `quay.io/jmonlong/sequencetubemap:tabix_dev`.

```sh
python3 pgtabix.py -g hprc-v1.1-mc-grch38.gfa.gz -o output.prefix
```

It takes about 1.5 to 2 hours to build index files for the Minigraph-Cactus v1.1 pangenome.
This process should scale linearly with the number of haplotypes.

## Making your own annotation files

To make your own annotation files, we have developed a pipeline to project haplotype-level annotation files (e.g. BED, GFF) onto a pangenome (e.g. GBZ).
Once the projected GAF files are sorted, bgzipped and indexed, they can be queried quickly, for example by the sequenceTubeMap.

The pipeline is described in the [manuscript](https://jmonlong.github.io/manu-vggafannot/), and the scripts and documentation are available in [the GitHub repository](https://github.com/jmonlong/manu-vggafannot?tab=readme-ov-file).
In particular, examples of how annotation files were projected for this manuscript are described in [this section](https://github.com/jmonlong/manu-vggafannot/tree/main/analysis/annotate).

docker/Dockerfile

Lines changed: 23 additions & 8 deletions
@@ -6,22 +6,37 @@ ENV DEBCONF_NONINTERACTIVE_SEEN true
 
 ENV TZ=America/Los_Angeles
 
-
 # install basic apt dependencies
 # note: most vg apt dependencies are installed by "make get-deps" below
 RUN apt-get -qq update && apt-get -qq install -y \
-    git \
-    wget \
-    less \
-    npm \
-    nano
-
+    git \
+    wget \
+    less \
+    npm \
+    nano \
+    make \
+    g++ \
+    gcc \
+    zlib1g-dev \
+    libbz2-dev \
+    liblzma-dev \
+    python3 \
+    build-essential
+
+# install tabix/bgzip
+RUN wget --quiet --no-check-certificate https://github.com/samtools/htslib/releases/download/1.21/htslib-1.21.tar.bz2 && \
+    tar -xjvf htslib-1.21.tar.bz2 && \
+    cd htslib-1.21 && \
+    ./configure && \
+    make && make install
+
+# install node
 RUN npm cache clean -f
 
 RUN npm install -g n && n stable
 
 # download vg binary
-RUN wget --quiet --no-check-certificate https://github.com/vgteam/vg/releases/download/v1.59.0/vg \
+RUN wget --quiet --no-check-certificate https://github.com/vgteam/vg/releases/download/v1.64.1/vg \
     && mv vg /bin/vg && chmod +x /bin/vg
 
 WORKDIR /build

docker/config.json

Lines changed: 25 additions & 4 deletions
@@ -13,6 +13,22 @@
       "simplify": false,
       "removeSequences": false
     },
+    {
+      "name": "HPRC Minigraph-Cactus v1.1",
+      "tracks": [
+        {"trackFile": "/data/hprc.pos.bed.gz", "trackType": "graph", "trackColorSettings": {"mainPalette": "ygreys", "auxPalette": "greys"}},
+        {"trackFile": "/data/hprc.nodes.tsv.gz", "trackType": "node"},
+        {"trackFile": "/data/hprc.haps.gaf.gz", "trackType": "haplotype"},
+        {"trackFile": "/data/gene_exon.gaf.gz", "trackType": "read", "trackColorSettings": {"mainPalette": "reds", "auxPalette": "reds"}},
+        {"trackFile": "/data/rm.gaf.gz", "trackType": "read", "trackColorSettings": {"mainPalette": "blues", "auxPalette": "blues"}},
+        {"trackFile": "/data/gwasCatalog.hprc-v1.1-mc-grch38.sorted.gaf.gz",
+         "trackType": "read", "trackColorSettings": {"mainPalette": "plainColors", "auxPalette": "plainColors"}}
+      ],
+      "region": "GRCh38#0#chr17:7674450-7675333",
+      "dataType": "built-in",
+      "simplify": false,
+      "removeSequences": false
+    },
     {
       "name": "Lancet example",
       "tracks": [
@@ -48,6 +64,7 @@
     }
   ],
   "vgPath": [""],
+  "chunkixPath": ["/data", "scripts"],
   "dataPath": "/data",
   "internalDataPath": "exampleData/internal/",
   "tempDirPath": "temp",
@@ -57,27 +74,31 @@
   "defaultGraphColorPalette" : {
     "mainPalette": "#000000",
     "auxPalette": "greys",
-    "colorReadsByMappingQuality": false
+    "colorReadsByMappingQuality": false,
+    "alphaReadsByMappingQuality": false
   },
 
   "defaultHaplotypeColorPalette" : {
     "mainPalette": "plainColors",
     "auxPalette": "lightColors",
-    "colorReadsByMappingQuality": false
+    "colorReadsByMappingQuality": false,
+    "alphaReadsByMappingQuality": false
   },
 
   "defaultReadColorPalette" : {
     "mainPalette": "blues",
     "auxPalette": "reds",
-    "colorReadsByMappingQuality": false
+    "colorReadsByMappingQuality": false,
+    "alphaReadsByMappingQuality": false
   },
 
   "defaultTrackProps" : {
     "trackType": "graph",
     "trackColorSettings": {
       "mainPalette": "#000000",
       "auxPalette": "greys",
-      "colorReadsByMappingQuality": false
+      "colorReadsByMappingQuality": false,
+      "alphaReadsByMappingQuality": false
     }
   },
 

102 KB

images/mount.tabix.index.annot.png

66.1 KB

images/mount.tabix.index.png

61.1 KB