FastqToUBAM should detect and handle duplicate read IDs #1051

## Problem

When source FASTQs from BaseSpace contain reads with duplicate querynames, the resulting uBAMs cause downstream failures. Specifically, Picard `SamToFastq` (used inside kraken2/classify workflows) crashes when it encounters a read ID with more than 2 records (i.e., more than the expected R1+R2 pair).

### Root cause

BaseSpace FASTQs can contain reads with colliding querynames. In our case, 7-field Illumina read names (without UMI) collided with 8-field names (with UMI suffix), e.g.:

```
VL00483:22:AAJ2VVNM5:1:1202:24044:55126        (7-field, no UMI)
VL00483:22:AAJ2VVNM5:1:1202:24044:55126:GTAAAAAGCTAA  (8-field, with UMI)
```

Both get treated as the same queryname by tools that strip the UMI field, resulting in >2 records per queryname.
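A minimal sketch of the collision, for illustration only. It assumes "stripping the UMI field" means dropping a trailing 8th colon-delimited field; `strip_umi` is a hypothetical helper, not a function from any of the tools named here.

```python
from collections import Counter


def strip_umi(read_name: str) -> str:
    """Drop a trailing 8th colon-delimited field (assumed here to be the UMI)."""
    fields = read_name.split(":")
    return ":".join(fields[:7]) if len(fields) == 8 else read_name


names = [
    "VL00483:22:AAJ2VVNM5:1:1202:24044:55126",               # 7-field, no UMI
    "VL00483:22:AAJ2VVNM5:1:1202:24044:55126:GTAAAAAGCTAA",  # 8-field, with UMI
]

# Assume each name occurs once in R1 and once in R2.
records = [name for name in names for _mate in ("R1", "R2")]

# After normalization, both names map to the same queryname.
counts = Counter(strip_umi(name) for name in records)
for qname, n in counts.items():
    print(qname, n)
```

Under these assumptions the two distinct names collapse into one queryname with 4 records, which exceeds the R1+R2 pair that `SamToFastq` expects.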

### Impact

All 13 samples in an `assemble_denovo_metagenomic` submission failed. Each sample had 1–43 affected read IDs; the workaround removed 4–172 reads per sample (consistent with 4 records per affected ID) out of 0.8M–39M total reads.

## Proposed fix

Add validation to `FastqToUBAM` (or a new post-processing step) that:

1. **Detects** querynames appearing more than 2 times in the output uBAM
2. **Removes** (or renames) the duplicate-ID reads, preserving properly paired reads
3. **Logs** which read IDs were affected and how many reads were removed

This should be a guardrail against malformed input FASTQs from any sequencer or data source, not just BaseSpace.
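The detect/remove/log steps could look roughly like the sketch below. It is not the `FastqToUBAM` implementation: the function name, the logger name, and the `(queryname, payload)` record representation are all hypothetical, and it holds every record in memory for brevity. A real version would more likely make two passes over the uBAM (count first, then filter) to cope with tens of millions of reads.

```python
import logging
from collections import Counter
from typing import Dict, Iterable, List, Tuple

logger = logging.getLogger("dup_qname_guard")  # hypothetical logger name

Record = Tuple[str, str]  # (queryname, rest of the record); illustrative only


def drop_overrepresented_qnames(
    records: Iterable[Record], max_per_qname: int = 2
) -> Tuple[List[Record], Dict[str, int]]:
    """Remove every record whose queryname occurs more than max_per_qname times.

    Returns the kept records and a {queryname: count} map of what was removed.
    """
    records = list(records)

    # 1. Detect: count records per queryname.
    counts = Counter(qname for qname, _ in records)
    offending = {q: n for q, n in counts.items() if n > max_per_qname}

    # 2. Remove: drop all records for offending querynames, keep the rest in order.
    kept = [rec for rec in records if rec[0] not in offending]

    # 3. Log: report each affected ID and the total number of reads removed.
    for qname, n in sorted(offending.items()):
        logger.warning("removed %d records for queryname %s", n, qname)
    logger.warning(
        "removed %d reads across %d querynames",
        sum(offending.values()),
        len(offending),
    )
    return kept, offending
```

This variant removes all records for an offending ID rather than trying to guess which pair is the "real" one; renaming instead would need a rule for telling the colliding pairs apart.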

### Workaround (current)

Manual one-off fix: extract the offending read IDs from each uBAM via `samtools view | cut -f1 | sort | uniq -c | awk '$1 > 2'`, then pass that list to `read_utils filter_bam --exclude`.

## Context

- Affected workspace submission: `6714bfd9-afcc-4fd1-98b7-d7a3af09fb53` (fastq_to_ubam)
- Failed downstream submission: `assemble_denovo_metagenomic`
- Samples: MADPHSEQ-1, -3, -4, -8, -10, -11, -12, -14, -15, -20, -24, -25, -29
