I collect files from multiple subdirectories and work on them in a single process. Nextflow does not complain if two files have the same basename, which leads to silent data loss. It seems that when it stages them, the second symlink overwrites the first one in the working directory.
To reproduce, run

    mkdir subdir1 subdir2 && echo hello > subdir1/file && echo world > subdir2/file

and then run this workflow:
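// One channel item holding both files; they share the basename "file"
c = Channel.from([
    [file('subdir1/file'), file('subdir2/file')]
])

process p {
    publishDir '.'
    input: file(x) from c
    output: file('concatenated')
    // $x expands to the staged names of both files
    "cat $x > concatenated"
}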
The intention was to get an output file that contains hello\nworld\n. Instead, I get world\nworld\n.
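The following sketch imitates what I suspect happens during staging. It is not Nextflow's actual staging code, only an assumption about its effect: both inputs are linked into one directory under their basename, and the second link replaces the first. The scratch directory and the `work` folder name are made up for the illustration.

```shell
# Imitation of the suspected staging behaviour; this is NOT Nextflow's real code.
tmp=$(mktemp -d)                      # scratch directory, nothing in the cwd is touched
mkdir "$tmp/subdir1" "$tmp/subdir2" "$tmp/work"
echo hello > "$tmp/subdir1/file"
echo world > "$tmp/subdir2/file"

# Stage both inputs under their basename; -f lets the second link replace the first.
ln -sf "$tmp/subdir1/file" "$tmp/work/file"
ln -sf "$tmp/subdir2/file" "$tmp/work/file"

# The command still names the file twice, so the surviving link is read twice.
result=$(cd "$tmp/work" && cat file file)
printf '%s\n' "$result"
```

If staging works roughly like this, it would explain why the content of the second file appears twice instead of an error being raised.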
To give a little bit of context: in the actual pipeline, the process works with multiple FASTQ files that come from the same individual but were sequenced in different runs. They are stored in different directories, but the file (base-)names follow the standard Illumina scheme <sample-name>_S<sample-index>_L<lane-index>_R1_001.fastq.gz. Because the sample name is identical (the files come from the same individual), a collision occurs whenever another run of that sample happens to use the same sample index and the same lane.
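Until Nextflow reports such collisions itself, a pre-flight check along these lines could catch them. This is only a sketch: the `run1`/`run2` directories and the sample file names are hypothetical stand-ins for two sequencing runs, not paths from my pipeline.

```shell
# Hypothetical layout: run1/ and run2/ stand in for two sequencing runs.
tmp=$(mktemp -d)
mkdir "$tmp/run1" "$tmp/run2"
touch "$tmp/run1/sampleA_S1_L001_R1_001.fastq.gz" \
      "$tmp/run2/sampleA_S1_L001_R1_001.fastq.gz" \
      "$tmp/run2/sampleA_S2_L001_R1_001.fastq.gz"

# Strip the directory part, then report every basename that occurs more than once.
dups=$(find "$tmp" -type f -name '*.fastq.gz' | sed 's|.*/||' | sort | uniq -d)
printf '%s\n' "$dups"
```

Any name printed here would be staged twice under the same basename if both files were passed to one process.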