FastqToUBAM should detect and handle duplicate read IDs #1051

@dpark01

Problem

When source FASTQs from BaseSpace contain reads with duplicate querynames, the resulting uBAMs cause downstream failures. Specifically, Picard SamToFastq (used inside kraken2/classify workflows) crashes when it encounters a read ID with more than 2 records (i.e., more than the expected R1+R2 pair).

Root cause

BaseSpace FASTQs can contain reads with colliding querynames. In our case, 7-field Illumina read names (without UMI) collided with 8-field names (with UMI suffix), e.g.:

VL00483:22:AAJ2VVNM5:1:1202:24044:55126        (7-field, no UMI)
VL00483:22:AAJ2VVNM5:1:1202:24044:55126:GTAAAAAGCTAA  (8-field, with UMI)

Both get treated as the same queryname by tools that strip the UMI field, resulting in >2 records per queryname.
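A minimal sketch of how the collision can arise, assuming a tool normalizes read names by dropping an 8th colon-delimited field. The helper `strip_umi` is hypothetical and written only for illustration; it is not the actual logic of any tool named in this issue.

```python
def strip_umi(queryname: str) -> str:
    """Drop the 8th colon-delimited field (assumed to be the UMI), if present."""
    fields = queryname.split(":")
    return ":".join(fields[:7]) if len(fields) == 8 else queryname


names = [
    "VL00483:22:AAJ2VVNM5:1:1202:24044:55126",               # 7-field, no UMI
    "VL00483:22:AAJ2VVNM5:1:1202:24044:55126:GTAAAAAGCTAA",  # 8-field, with UMI
]

# Both names collapse to the same queryname after normalization.
normalized = {strip_umi(n) for n in names}
print(normalized)
```

If each of the two original names carries an R1 and an R2 record, the normalized queryname ends up with four records instead of the expected two.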

Impact

All 13 samples in an assemble_denovo_metagenomic submission failed. Each sample had 1–43 affected read IDs (4–172 reads removed per sample out of 0.8M–39M total reads).

Proposed fix

Add validation to FastqToUBAM (or a new post-processing step) that:

  1. Detects querynames appearing more than 2 times in the output uBAM
  2. Removes (or renames) the duplicate-ID reads, preserving properly paired reads
  3. Logs which read IDs were affected and how many reads were removed
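The three steps above could be sketched roughly as follows. This is an illustration only, not the FastqToUBAM implementation: it operates on SAM-formatted text lines held in a list (for example, decoded output of samtools view -h), the function and logger names are hypothetical, and it takes the simplest policy of dropping every record of an affected queryname rather than renaming reads or trying to keep one pair.

```python
import logging
from collections import Counter
from typing import Dict, List, Tuple

log = logging.getLogger("queryname_guard")  # hypothetical logger name


def _qname(sam_line: str) -> str:
    """Return the QNAME (first tab-delimited column) of a SAM record line."""
    return sam_line.split("\t", 1)[0]


def overrepresented_querynames(sam_lines: List[str], max_records: int = 2) -> Dict[str, int]:
    """Step 1: find querynames with more than max_records records (headers skipped)."""
    counts = Counter(_qname(line) for line in sam_lines if not line.startswith("@"))
    return {name: n for name, n in counts.items() if n > max_records}


def drop_querynames(sam_lines: List[str], bad: Dict[str, int]) -> Tuple[List[str], int]:
    """Steps 2 and 3: drop all records of the affected querynames and log what was removed."""
    kept, removed = [], 0
    for line in sam_lines:
        if not line.startswith("@") and _qname(line) in bad:
            removed += 1
        else:
            kept.append(line)
    for name, n in sorted(bad.items()):
        log.warning("queryname %s had %d records; all removed", name, n)
    log.warning("removed %d reads across %d querynames", removed, len(bad))
    return kept, removed


# Toy input: one header line, one proper pair ("ok"), one queryname with 4 records ("dup").
example = [
    "@HD\tVN:1.6\tSO:queryname",
    "ok\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "ok\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "dup\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "dup\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "dup\t77\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII",
    "dup\t141\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII",
]
bad = overrepresented_querynames(example)
kept, removed = drop_querynames(example, bad)
```

Counting first and filtering second needs two passes over the records, but it keeps memory proportional to the number of distinct querynames rather than to the full read set.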

This should be a guardrail against malformed input FASTQs from any sequencer or data source, not just BaseSpace.

Workaround (current)

Manual one-off fix: list the offending read IDs with samtools view | cut -f1 | sort | uniq -c | awk '$1 > 2', then exclude them with read_utils filter_bam --exclude.

Context

  • Affected workspace submission: 6714bfd9-afcc-4fd1-98b7-d7a3af09fb53 (fastq_to_ubam)
  • Failed downstream submission: assemble_denovo_metagenomic
  • Samples: MADPHSEQ-1, -3, -4, -8, -10, -11, -12, -14, -15, -20, -24, -25, -29
