Fastq from unmapped reads#189
Conversation
|
But the current code already does this? Lines 846 to 898 in bc55df3 It even extracts the unmapped data to either BAM, FastQ depending on the users choice. I think we need something else :-( |
|
@maxibor are these fastq reads pre Trimming and merging, but without human reads? |
in this PR, yes |
But this step works on post AR fastq files |
|
Sorry - that close was courtesy of Maia |
|
I think you are good to go for testing once python version fixed. Minor thing: the help message/description is unspecific of what actually is being output. Thus maybe Alex' confusion |
|
Updated Conda env so Pysam should now come. But the test should fail because the Docker container isn't rebuilt as of now with the new Pysam dependancy |
There was a problem hiding this comment.
(Official review)
Looking more in depth (please correct me if I misunderstand anything).
Major
I can't comment on the python section. However,
-
The BAM file as input into this tool should be immediately after BWA (
sorted.bam) as it should include all possible mapped reads, not just ones that mapped exactly. i.e. if you use the reads post-samtools filter -q 37, the discarded mapped reads would not be filtered by the new module, because they are removed from the mapped BAM file after thesamtools filter. -
Is L921 flipped? Shouldn't it be
if (params.singleEnd) {- as you only indicate a single output file in that conditional block?
Minor
For the process name (L906), flag itself (L363, L218) and help message (L108), I suggest the following extra precision instead of -unmap:
--strip_input_fastq Create pre-Adapter Removal FASTQ files without reads that mapped to reference (e.g. for public upload of privacy sensitive non-host data)
publishDir should renamed to something like /samtools/stripped_fastq rather than unmapped_fastq, as the latter file already exists with the --extract_unmapped functionality.
Equally, the output FASTQ names should be e.g. stripped.fwd.fq.gz and stripped.rev.fq.gz or something. I think R1/R2 would be dangerous for novices re-analyzing the data with the same pipeline, if they forget to add underscores (i.e. there are two R1/R2s in the name). This may then lead to funky input regex for a EAGER2 re-run and subsequent errors.
The actual names/flags can be discussed here - you don't have to go exactly with my 'stripped' suggestion.
|
Followed the suggestions of @jfy133 |
|
@ivelsko owes you a beer ;) |
|
@apeltzer Can you have a look and merge if ok ? |
|
Just failing because of missing |
To answer #187
Comes with a python script (using pysam) to recreate fastq files from the unmapped reads.
However this is using Python3 and the Python in the conda env is currently 2.7 :(