It seems to me that the current implementation aims to genotype each sample individually, producing n genotype datasets. Is that so?
This approach would not be ideal for end users, since they would then need to merge all these datasets together, which is usually done pairwise. Merging the genotypes of n individuals would therefore require an additional n-1 sequential merging jobs outside of eager.
On the other hand, genotyping all individuals together would prevent running single-stranded and double-stranded libraries together, since pileupCaller's --singleStrandMode applies to the entire set of samples being genotyped.
Instead of leaving the user to run multiple extra jobs, running those jobs for them in the background (which, if even possible, would increase runtime considerably since they are not entirely parallelisable), or giving up the advantages of --singleStrandMode, I propose we either:
a) Do not merge single- and double-stranded libraries from the same sample into a single BAM file, and genotype each group separately. We can then provide the user with two separate genotype datasets (one for single- and one for double-stranded libraries), even if data from the same individual end up in both datasets.
b) Block users entirely from submitting batches that contain both single- and double-stranded libraries. This is the easiest option to implement, but also the least useful.
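To make option (a) a bit more concrete, here is a rough sketch of the grouping step in Python. It is only an illustration, not how eager does or should implement it: the record layout, the sample and library names, and the helpers `group_by_strandedness` and `extra_flags` are all made up for the example. The only real option used is pileupCaller's --singleStrandMode, which would be added for the single-stranded group only.

```python
from collections import defaultdict

# Hypothetical input: one (sample, library, strandedness) record per library.
# The field layout and all names here are assumptions for illustration.
libraries = [
    ("IND001", "lib1", "double"),
    ("IND001", "lib2", "single"),
    ("IND002", "lib1", "double"),
    ("IND003", "lib1", "single"),
]


def group_by_strandedness(records):
    """Return {strandedness: {sample: [library, ...]}}.

    Libraries of the same sample are only grouped together when they
    share the same strandedness, so a sample can appear in both groups.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for sample, library, strandedness in records:
        if strandedness not in ("single", "double"):
            raise ValueError(f"unknown strandedness: {strandedness!r}")
        groups[strandedness][sample].append(library)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {key: dict(value) for key, value in groups.items()}


def extra_flags(strandedness):
    """Extra genotyping flags for one group (single-stranded group only)."""
    return ["--singleStrandMode"] if strandedness == "single" else []


groups = group_by_strandedness(libraries)
for strandedness in sorted(groups):
    samples = sorted(groups[strandedness])
    print(strandedness, samples, extra_flags(strandedness))
```

With this layout, each strandedness group would be genotyped in a single job, so the user gets at most two datasets instead of n.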
Any other ideas? Maybe I am overlooking something?