Benchmark

This repository contains a benchmark for germline variant calling using Illumina short-read sequencing (~30X coverage).
The goal is to evaluate the performance of different variant calling pipelines using a well-characterized reference dataset.

The benchmarking focuses on SNPs and small indels, which are typical outputs of short-read germline variant calling pipelines. Structural variants called by Manta are benchmarked separately.


Dataset

HG002 (Genome in a Bottle)

The benchmark uses the HG002 sample from the Genome in a Bottle (GIAB) consortium.

HG002 is widely used as a gold standard dataset for benchmarking germline variant calling pipelines because it provides:

  • High-confidence truth variants
  • High-confidence regions (BED) in which calls are evaluated
  • Well-curated reference datasets for SNPs and indels

Dataset characteristics:

  • Sample: HG002 (NA24385)
  • Sequencing: Illumina short reads
  • Coverage: ~30X
  • Reference genome: GRCh38
  • Truth set: GIAB high-confidence variant calls

The truth VCF and confident region BED files are used to compare predicted variants against the gold standard.


Pipelines

The following pipelines are evaluated in this benchmark.

nf-core/sarek is a widely used Nextflow pipeline for germline and somatic variant calling.

Key characteristics:

  • Developed by the nf-core community
  • Supports multiple variant callers
  • Designed for reproducible genomics workflows
  • Containerized with Docker / Singularity

Typical tools used:

  • Alignment: BWA
  • Variant callers:
    • GATK HaplotypeCaller
    • Strelka2
    • DeepVariant (optional)

The second pipeline, referred to below as "this workflow", is a Nextflow pipeline designed for standardized germline variant calling with Illumina short reads.

This pipeline aims to provide:

  • A clear and reproducible workflow
  • Standardized best-practice variant calling
  • Modular processes for benchmarking and extension

Typical workflow:

  1. Read quality control
  2. Read alignment
  3. BAM processing
  4. Variant calling
  5. Variant filtering
  6. Benchmarking against truth sets

Benchmarking Method

Variant calls generated by each pipeline are compared against the GIAB truth set, using hap.py for SNPs and indels and Truvari for structural variants.

SNP/INDEL Benchmarking

Metrics evaluated include:

  • Precision
  • Recall
  • F1 score
  • SNP performance
  • Indel performance

Benchmarking is performed within the high-confidence regions defined by GIAB.
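
As a rough illustration of what restricting the comparison to high-confidence regions means, the sketch below tests whether a variant position falls inside a BED interval. It assumes standard BED semantics (0-based, half-open intervals), 1-based VCF positions, and non-overlapping intervals per chromosome. The function names and toy intervals are hypothetical and not part of this repository; benchmarking tools handle this step internally and also account for variants that span interval boundaries.

```python
from bisect import bisect_right
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into per-chromosome sorted lists of (start, end)."""
    regions = defaultdict(list)
    for line in lines:
        # Skip blank lines and optional BED header lines
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    for intervals in regions.values():
        intervals.sort()
    return regions


def in_confident_region(regions, chrom, pos):
    """Return True if the 1-based VCF position lies inside a BED interval."""
    intervals = regions.get(chrom, [])
    zero_based = pos - 1
    # Index of the last interval whose start is <= zero_based
    idx = bisect_right(intervals, (zero_based, float("inf"))) - 1
    return idx >= 0 and intervals[idx][0] <= zero_based < intervals[idx][1]


# Toy example (hypothetical intervals, not GIAB data)
regions = load_bed(["chr1\t100\t200", "chr1\t300\t400"])
print(in_confident_region(regions, "chr1", 101))  # first base of the interval
print(in_confident_region(regions, "chr1", 201))  # one base past the end
```

Keeping the intervals sorted lets each lookup use a binary search instead of scanning every region, which matters for genome-wide BED files.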

To benchmark the small variant calls from nf-core/sarek and from this workflow against the HG002 truth set:

cd benchmark/small/benchmark
pixi run --environment snpindelbench bash benchmark_and_summary.sh

Structural Variant (SV) Benchmarking

Structural variants are benchmarked separately using Truvari, a specialized tool for SV comparison.

SV benchmarking workflow:

  1. VCF Normalization: Multi-allelic records are split and indels are left-aligned using bcftools norm
  2. Coordinate Conversion: Query VCFs are converted from hg38 (pipeline output) to hg19 (truth set) using CrossMap with a chain file
  3. Chromosome Naming: The chr prefix is removed from the query VCF to match the truth set naming convention
  4. VCF Sorting & Indexing: Sort and index normalized VCFs for Truvari benchmarking
  5. Truvari Benchmark: Compare variants using Truvari with default parameters
  6. Metrics Extraction: Extract summary statistics from Truvari JSON output
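
The chromosome-naming step (step 3) could look roughly like the sketch below, which strips a leading chr from contig header lines and from the CHROM column of plain-text VCF lines. The function name is hypothetical, and this is not necessarily how benchmark_sv_sarek.sh implements the step. It is deliberately minimal: chromosome names embedded in breakend ALT strings or INFO fields are left unchanged, and chrM becomes M rather than MT, so both cases may need extra handling.

```python
def strip_chr_prefix(vcf_lines):
    """Yield VCF lines with the 'chr' prefix removed from contig names."""
    for line in vcf_lines:
        if line.startswith("##contig=<ID=chr"):
            # Header contig definition, e.g. ##contig=<ID=chr1,length=...>
            yield line.replace("##contig=<ID=chr", "##contig=<ID=", 1)
        elif line.startswith("#"):
            # All other header lines pass through unchanged
            yield line
        else:
            # Data line: only the first (CHROM) column is rewritten
            chrom, sep, rest = line.partition("\t")
            if chrom.startswith("chr"):
                chrom = chrom[3:]
            yield chrom + sep + rest


example = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=chr1,length=248956422>",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t1000\t.\tA\tT",
]
for out_line in strip_chr_prefix(example):
    print(out_line)
```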

Running SV Benchmarking (Manta)

To benchmark Manta structural variant calls from Sarek against the HG002 truth set:

cd benchmark
pixi run --environment svbench bash benchmark_sv_sarek.sh

This script will:

  • Normalize the Manta VCF output
  • Convert from hg38 to hg19 coordinates
  • Run Truvari comparison
  • Generate summary metrics file (HG002_manta.summary.txt)
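
The last step, writing the summary file, might be implemented along the lines of the sketch below, which formats a few fields of a Truvari summary loaded from JSON. The key names used here (TP-base, FP, FN, precision, recall, f1, gt_concordance) are assumptions based on commonly seen Truvari output and may differ between versions; the function names and labels are hypothetical and do not come from benchmark_sv_sarek.sh.

```python
import json

# (label, assumed Truvari summary key) pairs; key names may vary by version
SUMMARY_FIELDS = [
    ("True Positives", "TP-base"),
    ("False Positives", "FP"),
    ("False Negatives", "FN"),
    ("Sensitivity (Recall)", "recall"),
    ("Precision", "precision"),
    ("F1 Score", "f1"),
    ("Genotype Concordance", "gt_concordance"),
]


def format_truvari_summary(summary):
    """Format selected fields of a Truvari summary dict as text lines."""
    lines = []
    for label, key in SUMMARY_FIELDS:
        value = summary.get(key)
        if value is None:
            # Missing or null field: report it rather than fail
            lines.append(f"{label}: NA")
        elif isinstance(value, float):
            lines.append(f"{label}: {value:.4f}")
        else:
            lines.append(f"{label}: {value}")
    return "\n".join(lines)


def summarize_file(json_path):
    """Load a Truvari summary JSON file and return the formatted text."""
    with open(json_path) as handle:
        return format_truvari_summary(json.load(handle))
```

Reporting missing keys as NA instead of raising keeps the summary usable when a field is absent, for example when genotype concordance cannot be computed.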

Expected Output

The benchmark generates:

  • Normalized and converted VCF files
  • Truvari output directory with TP/FP/FN classifications
  • Summary metrics in text format with:
    • True Positives (TP)
    • False Positives (FP)
    • False Negatives (FN)
    • Precision, Recall, F1 score
    • Genotype concordance

Example output:

True Positives: 1082
False Positives: 1858
False Negatives: 12650
Sensitivity (Recall): 0.0788 (7.88%)
Precision: 0.3680 (36.8%)
F1 Score: 0.1298
Genotype Concordance: 0.9039 (90.39%)
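
The ratios in this example follow directly from the three counts. As a quick check, the sketch below recomputes them with the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as the harmonic mean of the two. The function name is hypothetical, and the sketch assumes the same TP count is used for both precision and recall.

```python
def benchmark_metrics(tp, fp, fn):
    """Return (precision, recall, f1) computed from TP/FP/FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts taken from the example output above
precision, recall, f1 = benchmark_metrics(tp=1082, fp=1858, fn=12650)
print(f"Sensitivity (Recall): {recall:.4f}")  # 0.0788
print(f"Precision: {precision:.4f}")          # 0.3680
print(f"F1 Score: {f1:.4f}")                  # 0.1298
```

The low recall reflects the large number of false negatives (12650) relative to true positives (1082), not a problem with precision alone.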

Tools Used

hap.py

hap.py is a widely used tool for benchmarking germline variant calls.

It performs:

  • Variant comparison between predicted VCF and truth VCF
  • Stratified benchmarking for SNPs and indels
  • Standard benchmarking metrics (precision, recall, F1)

Repository:

https://github.com/Illumina/hap.py

Truvari

Truvari is a specialized benchmarking tool for structural variants (SVs).

It performs:

  • SV comparison and matching between query and truth VCFs
  • Genotype concordance calculation
  • Classification of true positives, false positives, and false negatives
  • Generation of stratified comparison reports

Key features:

  • Handles SV size and type variations
  • Provides detailed TP/FP/FN VCF outputs
  • Generates JSON summary statistics
  • Supports multiple matching algorithms

CrossMap

CrossMap is a utility for converting genome coordinates and annotation files between different genome assemblies.

Used for:

  • Converting VCF coordinates from hg38 (pipeline output) to hg19 (truth set)
  • Handling chromosome naming differences (chr prefix)
  • Liftover operations between reference genomes

bcftools

bcftools provides utilities for variant calling and for manipulating VCF/BCF files.

Used for:

  • VCF normalization (splitting multi-allelic records, left-aligning indels)
  • VCF sorting and indexing
  • Variant filtering and annotation

References

  1. hap.py tool for benchmarking germline short-read variants
    https://github.com/Illumina/hap.py

  2. Genome in a Bottle (GIAB) program for gold standard datasets
    https://www.nist.gov/programs-projects/genome-bottle

  3. Truvari: SV benchmarking tool
    https://github.com/ACEnglish/truvari

  4. CrossMap: Genome coordinate conversion tool
    http://crossmap.sourceforge.net/

  5. bcftools: VCF manipulation utilities
    http://samtools.github.io/bcftools/

  6. HG002 Structural Variant Truth Set (v0.6)
    NIST Genome in a Bottle SVs for Tier 1 regions

  7. SV Benchmarking Guide
    See benchmark_sv_sarek.sh for implementation details