Problem
When source FASTQs from BaseSpace contain reads with duplicate querynames, the resulting uBAMs cause downstream failures. Specifically, Picard SamToFastq (used inside kraken2/classify workflows) crashes when it encounters a read ID with more than 2 records (i.e., more than the expected R1+R2 pair).
Root cause
BaseSpace FASTQs can contain reads with colliding querynames. In our case, 7-field Illumina read names (without UMI) collided with 8-field names (with UMI suffix), e.g.:
VL00483:22:AAJ2VVNM5:1:1202:24044:55126 (7-field, no UMI)
VL00483:22:AAJ2VVNM5:1:1202:24044:55126:GTAAAAAGCTAA (8-field, with UMI)
Both get treated as the same queryname by tools that strip the UMI field, resulting in >2 records per queryname.
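To illustrate the collision, here is a minimal sketch of what a UMI-stripping tool effectively does to the two names above. The `strip_umi` helper and its regular expression are hypothetical and written only for this example; they assume the UMI is a trailing eighth colon-delimited field made up of A/C/G/T/N characters (optionally joined by `+`), and are not taken from Picard or any other tool.

```python
import re
from collections import defaultdict

# Assumption: an 8th colon-delimited field of bases (A/C/G/T/N, optionally
# '+'-joined for dual UMIs) following 7 Illumina fields is a UMI suffix.
_UMI_SUFFIX = re.compile(r"^((?:[^:]+:){6}[^:]+):[ACGTN+]+$")


def strip_umi(qname: str) -> str:
    """Return the 7-field read name, dropping a trailing UMI field if present."""
    match = _UMI_SUFFIX.match(qname)
    return match.group(1) if match else qname


names = [
    "VL00483:22:AAJ2VVNM5:1:1202:24044:55126",               # 7-field, no UMI
    "VL00483:22:AAJ2VVNM5:1:1202:24044:55126:GTAAAAAGCTAA",  # 8-field, with UMI
]

# Group the original names by their UMI-stripped form.
groups = defaultdict(list)
for name in names:
    groups[strip_umi(name)].append(name)

# Any stripped name backed by more than one distinct original name is a collision.
collisions = {key: vals for key, vals in groups.items() if len(vals) > 1}
```

With R1 and R2 present for each original name, a single collision like this yields four records under one queryname.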
Impact
All 13 samples in an assemble_denovo_metagenomic submission failed. Each sample had 1–43 affected read IDs (4–172 reads removed per sample out of 0.8M–39M total reads).
Proposed fix
Add validation to FastqToUBAM (or a new post-processing step) that:
- Detects querynames appearing more than 2 times in the output uBAM
- Removes (or renames) the duplicate-ID reads, preserving properly paired reads
- Logs which read IDs were affected and how many reads were removed
This should be a guardrail against malformed input FASTQs from any sequencer or data source, not just BaseSpace.
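The detect/remove/log steps above could look roughly like the following sketch. It is not the FastqToUBAM implementation: `drop_overrepresented`, the `Record` tuple, and the logger name are all hypothetical, and records are modeled as plain `(queryname, payload)` tuples so the example runs without a BAM library. It follows the "remove" option and drops every record that shares an over-represented ID.

```python
import logging
from collections import Counter
from typing import Dict, Iterable, List, Tuple

log = logging.getLogger("dedup_qnames")  # hypothetical logger name

# (queryname, payload); the payload stands in for a real BAM record.
Record = Tuple[str, str]


def drop_overrepresented(
    records: Iterable[Record], max_per_name: int = 2
) -> Tuple[List[Record], Dict[str, int]]:
    """Remove every record whose queryname occurs more than max_per_name times."""
    # Held in memory for simplicity; a real uBAM would likely need two passes.
    records = list(records)
    counts = Counter(qname for qname, _ in records)
    offenders = {q: n for q, n in counts.items() if n > max_per_name}
    kept = [rec for rec in records if rec[0] not in offenders]
    # Log which read IDs were affected and how many reads were removed.
    for qname, n in sorted(offenders.items()):
        log.warning("removed %d records for over-represented read ID %s", n, qname)
    log.info("removed %d of %d records", len(records) - len(kept), len(records))
    return kept, offenders


demo = [
    ("readA", "R1"), ("readA", "R2"),                                    # normal pair
    ("readB", "R1"), ("readB", "R2"), ("readB", "R1"), ("readB", "R2"),  # colliding pairs
    ("readC", "R1"),                                                     # single-end read
]
kept, offenders = drop_overrepresented(demo)
```

Removing all records for an offending ID, rather than keeping the first pair, avoids guessing which R1 belongs with which R2. Renaming would preserve the reads but needs a rule for telling the colliding pairs apart.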
Workaround (current)
Manual one-off fix using read_utils filter_bam --exclude with a list of offending read IDs extracted via samtools view | cut -f1 | sort | uniq -c | awk '$1 > 2'.
Context
- Affected workspace submission: 6714bfd9-afcc-4fd1-98b7-d7a3af09fb53 (fastq_to_ubam)
- Failed downstream submission: assemble_denovo_metagenomic
- Samples: MADPHSEQ-1, -3, -4, -8, -10, -11, -12, -14, -15, -20, -24, -25, -29