## Tabix-based index files

Three files are used, each indexed with tabix (which adds a companion `.tbi` file):

1. `nodes.tsv.gz` contains the sequence of each node.
2. `pos.bed.gz` contains the position (as node intervals) of regions on each haplotype.
3. `haps.gaf.gz` contains the path followed by each haplotype (split into pieces).

Briefly, these three index files can be quickly queried to extract a subgraph covering a region of interest: the `pos.bed.gz` index first tells us which nodes are covered, the `nodes.tsv.gz` index then gives us the sequences of those nodes, and finally we stitch together the haplotype pieces traversing those nodes using the `haps.gaf.gz` index.
This approach is implemented in the [`chunkix.py`](scripts/chunkix.py) script, which can produce a GFA file or the files used by the sequenceTubeMap.
The sequenceTubeMap uses this script internally when given tabix-based index files.
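
The lookup chain above can be sketched with toy in-memory data standing in for the tabix queries (all node IDs, haplotype names, and coordinates below are made up for illustration; the real script queries the bgzipped, tabix-indexed files instead):

```python
# Toy stand-ins for the three tabix-indexed files (illustrative data only).
# pos.bed.gz: haplotype interval -> node IDs covering it
pos = {("GRCh38.chr20", 0, 8): [1, 2, 3]}
# nodes.tsv.gz: node ID -> node sequence
nodes = {1: "ACG", 2: "T", 3: "GATT"}
# haps.gaf.gz: haplotype name -> oriented node path (pieces concatenated)
haps = {"GRCh38.chr20": [">1", ">2", ">3"]}

def extract_subgraph(hap, start, end):
    """Find the nodes covering a region, fetch their sequences,
    and restrict the haplotype path to those nodes."""
    node_ids = next(ids for (h, s, e), ids in pos.items()
                    if h == hap and s <= start and end <= e)
    seqs = {n: nodes[n] for n in node_ids}
    wanted = set(node_ids)
    path = [step for step in haps[hap] if int(step[1:]) in wanted]
    return seqs, path

seqs, path = extract_subgraph("GRCh38.chr20", 0, 8)
```

This is only a sketch of the query flow; see `chunkix.py` for the actual implementation.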

## Using tabix-based index files in the sequenceTubeMap

The version on this `tabix` branch can use these index files, for example when mounted files are provided:

- the `pos.bed.gz` index in the *graph* field
- the `nodes.tsv.gz` index in the *node* field
- the `haps.gaf.gz` index in the *haplotype* field


Once the index files are mounted, you can query any region of any haplotype using the form *HAPNAME_CONTIG:START-END*.
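
A region string of that form can be split into its parts as follows (the contig name below is illustrative):

```python
import re

def parse_region(region):
    """Split a region query of the form HAPNAME_CONTIG:START-END
    into (contig, start, end)."""
    m = re.fullmatch(r"(?P<contig>.+):(?P<start>\d+)-(?P<end>\d+)", region)
    if m is None:
        raise ValueError(f"malformed region: {region}")
    return m.group("contig"), int(m.group("start")), int(m.group("end"))

# Example query (contig name and coordinates are made up for illustration)
contig, start, end = parse_region("GRCh38.chr20:31000000-31100000")
```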

Other tracks, for example reads or annotations in bgzipped and tabix-indexed GAF files, can be added as *reads* in the menu.


Note that you can set a color for each track, either using the existing palettes or by picking a specific color.


## ~~Installation~~ Using the Docker container

A Docker container with this new sequenceTubeMap version and all the necessary dependencies is available at `quay.io/jmonlong/sequencetubemap:tabix_dev`.

To use it, run:

```sh
docker run -it -p 3210:3000 -v "$(pwd)":/data quay.io/jmonlong/sequencetubemap:tabix_dev
```

Here, the `-p` option maps port 3000 in the container to port 3210 on the host.
In practice, pick any unused port.

Then open: http://localhost:3210/

Note: for mounted files, this assumes that all files (pangenomes, reads, annotations) are in the current working directory or its subdirectories.
To test with files that are already prepared, download them all (see below).
Then either use them as *custom* data, adding the tracks with the *Configure Tracks* button, or use the prepared data set "HPRC Minigraph-Cactus v1.1".
The files for this data set are defined in the [config.json file](docker/config.json) used to build the Docker image.

## Available tabix-based index files for the Minigraph-Cactus v1.1 pangenome

Index files and some annotations have been deposited at https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/

To download them all:

```sh
# pangenome index files
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.haps.gaf.gz
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.haps.gaf.gz.tbi
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.nodes.tsv.gz
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.nodes.tsv.gz.tbi
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.pos.bed.gz
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.pos.bed.gz.tbi

# annotation files
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/gene_exon.gaf.gz
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/gene_exon.gaf.gz.tbi
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/gwasCatalog.hprc-v1.1-mc-grch38.sorted.gaf.gz
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/gwasCatalog.hprc-v1.1-mc-grch38.sorted.gaf.gz.tbi
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/rm.gaf.gz
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/rm.gaf.gz.tbi
```

## Building tabix-based index files from a GFA

### Optional: Make a GFA from a GBZ file

In some cases, you will want to use exactly the same pangenome space as a specific GBZ file, for example to visualize reads or annotations on that pangenome.
The GFA provided in the HPRC repo might not match exactly, because some nodes may have been split when the GBZ file was made.
You can convert a GBZ to a GFA (without translating the nodes back to the original GFA) with:

```sh
vg convert --no-translation -f -t 4 hprc-v1.1-mc-grch38.gbz | gzip > hprc-v1.1-mc-grch38.gfa.gz
```

### Run the `pgtabix.py` Python script

The `pgtabix.py` script can be found in the [`scripts` directory](scripts).
It is also present in the `/build/sequenceTubeMap/scripts` directory of the Docker container `quay.io/jmonlong/sequencetubemap:tabix_dev`.

```sh
python3 pgtabix.py -g hprc-v1.1-mc-grch38.gfa.gz -o output.prefix
```

It takes about 1.5 to 2 hours to build the index files for the Minigraph-Cactus v1.1 pangenome.
This process should scale linearly with the number of haplotypes.

## Making your own annotation files

To make your own annotation files, we have developed a pipeline to project haplotype-level annotation files (e.g. BED, GFF) onto a pangenome (e.g. GBZ).
Once the projected GAF files are sorted, bgzipped, and indexed, they can be queried quickly, for example by the sequenceTubeMap.

The pipeline is described in the [manuscript](https://jmonlong.github.io/manu-vggafannot/), and the scripts and documentation were deposited in [the GitHub repository](https://github.com/jmonlong/manu-vggafannot?tab=readme-ov-file).
In particular, examples of how annotation files were projected for this manuscript are described in [this section](https://github.com/jmonlong/manu-vggafannot/tree/main/analysis/annotate).