How to prepare your variant call sets (VCF) for VariantSurvival

Table of Contents

  1. Preamble
  2. Merge variant callsets across individuals
  3. Annotate genes in the multi-sample VCF file

Preamble

VariantSurvival is a platform to analyze genotype-treatment response with respect to structural variants (SVs). Hence, the survival analysis in the VariantSurvival Shiny App builds on SVs, and an SV callset is expected for every individual of a study group. VariantSurvival is independent of the sequencing platform and works with callsets from arbitrary variant calling algorithms, as long as they comply with the standard VCF file format. The following sections provide recommendations on how to prepare the input data for VariantSurvival from multiple SV callsets.

Merge variant callsets across individuals

TL;DR:

Use a tool of your choice to merge SV callsets across all individuals of your study groups. Some recommendations listed below:

Step-by-step:

The first step towards the input file format of VariantSurvival is to merge SV callsets across multiple individuals, e.g. of a study group. For a clinical study, this must include all individuals of the study, i.e. of both the drug and the placebo group. Merging SVs across multiple individuals can be done with a variety of bioinformatics tools (see the recommendations above). Here, we demonstrate the merging procedure using SURVIVOR and assume the following generic folder structure:

project
└── variants
    ├── sample1.vcf
    ├── sample2.vcf
    ├── sample3.vcf
    ├── sample4.vcf
    └── sample5.vcf

First, SURVIVOR requires a list of VCF files to merge. From your shell you can list all your VCF files and write them into a text file called samples via

ls variants/*.vcf > samples
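Before merging, it can help to check that the list is complete. The following is a minimal sketch, not part of the official workflow: it recreates the folder structure above with empty placeholder files in a temporary directory (an assumption made only so the snippet runs on its own), writes the list, and verifies that every listed path exists.

```shell
# Toy layout for illustration only; in practice, cd into your own project folder.
project=$(mktemp -d)
mkdir -p "$project/variants"
for i in 1 2 3 4 5; do
  : > "$project/variants/sample$i.vcf"   # empty placeholder, not a real VCF
done
cd "$project"

# Write one VCF path per line into the list file.
ls variants/*.vcf > samples

# Sanity check: every listed file must exist; count the entries on the way.
n_listed=0
while IFS= read -r vcf; do
  [ -f "$vcf" ] || { echo "missing: $vcf" >&2; exit 1; }
  n_listed=$((n_listed + 1))
done < samples
echo "listed $n_listed VCF files"
```

If the reported number differs from the number of individuals in your study, the list should be fixed before running the merge.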

Next, we use SURVIVOR to actually merge the variant callsets (now listed in the 'samples' file) into one multi-sample VCF file via

SURVIVOR merge samples 1000 1 1 1 0 30 samples_merged.vcf

For more details on the individual parameters of the SURVIVOR merge command please see the official SURVIVOR Wiki. In the next step we will look into how each individual SV within samples_merged.vcf can be annotated with gene identifiers in order for VariantSurvival to filter variants of interest.
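The positional arguments of the SURVIVOR merge call above are easier to read when they are named. The sketch below reflects our reading of the SURVIVOR usage text and is not authoritative; the variable names are our own, and the sixth argument in particular may be ignored by recent SURVIVOR versions, so please verify the meanings against the SURVIVOR Wiki.

```shell
# Positional arguments of `SURVIVOR merge`, named for readability.
# Variable names are hypothetical; values are the ones used in this guide.
list_file=samples            # text file with one VCF path per line
max_dist=1000                # max. distance (bp) between breakpoints to merge SVs
min_support=1                # min. number of callsets that must support an SV
same_type=1                  # 1 = only merge SVs of the same type
same_strand=1                # 1 = only merge SVs on the same strand
estimate_dist=0              # 0 = do not estimate the distance from the SV size
min_size=30                  # min. SV size (bp) to be taken into account
out_vcf=samples_merged.vcf   # name of the merged multi-sample VCF

merge_cmd="SURVIVOR merge $list_file $max_dist $min_support $same_type $same_strand $estimate_dist $min_size $out_vcf"
echo "$merge_cmd"

# Run the merge only if SURVIVOR is installed and the list file exists.
if command -v SURVIVOR >/dev/null 2>&1 && [ -f "$list_file" ]; then
  $merge_cmd
fi
```

With min_support=1, an SV found in a single individual is kept in the merged file, which is presumably what a per-individual analysis needs.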

Annotate genes in the multi-sample VCF file

TL;DR:

Use a tool of your choice to annotate the variants of your study group with gene identifiers. Some recommendations listed below:

Step-by-step:

The second step towards the input file format of VariantSurvival is to annotate the merged SV callset of your study group with gene identifiers. A list of gene identifiers together with their genomic coordinates can be retrieved from a database or resource of your choice. The format of that list depends on the annotation tool of choice. We demonstrate the annotation procedure using bcftools, which expects the annotation file to be bgzip-compressed and tabix-indexed; hence we provide the list in the standard BED file format. Given a BED file genes.bed.gz whose leading four columns hold the minimum required information, like

CHROM FROM TO GENE
chr1 11200 11500 TERC
chr1 465076 465431 SOX
chr4 5333101 5333440 HOXD
... ... ... ...

we can use bcftools to annotate the variant records of our merged VCF file via

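# Compress and index the merged VCF
bgzip -c samples_merged.vcf > samples_merged.vcf.gz
tabix -p vcf samples_merged.vcf.gz
# Copy the GENE column of the BED file into an INFO/GENE tag;
# -h supplies the header line that declares the new tag (needs bash for <(...))
bcftools annotate \
  -a genes.bed.gz \
  -c CHROM,FROM,TO,GENE \
  -h <(echo '##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">') \
  samples_merged.vcf.gz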

In more detail, the first two commands compress and index the VCF file resulting from the merge step. The bcftools annotate command adds the gene identifiers to the INFO field of the variant records and extends the header section of the VCF file accordingly. It writes the annotated VCF to standard output unless an output file is given with -o. Note that the BED file does not need a header line, because the column names are defined with -c.
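The file genes.bed.gz used above has to be sorted, bgzip-compressed and tabix-indexed before bcftools can read it. As a minimal sketch of that preparation, the snippet below writes the three example rows to a tab-separated file, sorts it, and compresses and indexes it if the htslib tools are installed. The intermediate file names genes.bed and genes.sorted.bed and the temporary working directory are assumptions for illustration.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Toy gene table from the example rows above (tab-separated, no header line).
printf 'chr4\t5333101\t5333440\tHOXD\n'  > genes.bed
printf 'chr1\t465076\t465431\tSOX\n'    >> genes.bed
printf 'chr1\t11200\t11500\tTERC\n'     >> genes.bed

# tabix needs the file sorted by chromosome, then by start position.
sort -k1,1 -k2,2n genes.bed > genes.sorted.bed

# Compress and index only if bgzip and tabix are available.
if command -v bgzip >/dev/null 2>&1 && command -v tabix >/dev/null 2>&1; then
  bgzip -c genes.sorted.bed > genes.bed.gz
  tabix -f -p bed genes.bed.gz
fi
```

The chromosome names in the BED file must match those in the VCF (e.g. chr1 versus 1), otherwise no record will receive a gene identifier.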