Commit b21fb60

Replace all reference of strip with host removal (or equivalent)
1 parent 648ecd6 commit b21fb60

7 files changed

Lines changed: 220 additions & 60 deletions

.github/workflows/ci.yml

Lines changed: 2 additions & 2 deletions
@@ -101,9 +101,9 @@ jobs:
       - name: MAPPER_BT2 Test running with BowTie2
         run: |
           nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --mapper 'bowtie2' --bt2_alignmode 'local' --bt2_sensitivity 'sensitive' --bt2n 1 --bt2l 16 --bt2_trim5 1 --bt2_trim3 1
-      - name: STRIP_FASTQ Run the basic pipeline with output unmapped reads as fastq
+      - name: HOST REMOVAL_FASTQ Run the basic pipeline with output unmapped reads as fastq
         run: |
-          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --strip_input_fastq
+          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --hostremoval_input_fastq
       - name: BAM_FILTERING Run basic mapping pipeline with mapping quality filtering, and unmapped export
         run: |
           nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_bam_filtering --bam_mapping_quality_threshold 37 --bam_unmapped_type 'fastq'

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -51,6 +51,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 * [#516](https://github.com/nf-core/eager/issues/516) - Made bedtools not report out of memory exit code when warning of inconsistant FASTA/Bed entry names
 * [#504](https://github.com/nf-core/eager/issues/504) - Removed uninformative sexdeterrmine-snps plot from MultiQC report.
 * Nuclear contamination is now reported with the correct library names.
+* [#531](https://github.com/nf-core/eager/pull/531) - Renamed FASTQ stripping with host removal

 ### `Dependencies`

bin/extract_map_reads.py

Lines changed: 13 additions & 13 deletions
@@ -36,8 +36,8 @@ def _get_args():
     parser.add_argument(
         '-m',
         dest='mode',
-        default='strip',
-        help='Read removal mode: remove reads (strip) or replace sequence by N (replace)'
+        default='remove',
+        help='Read removal mode: remove reads (remove) or replace sequence by N (replace)'
     )
     parser.add_argument(
         '-p',
@@ -179,27 +179,27 @@ def difference(list1, list2):
     return(fqd)


-def write_fq(fq_dict, fname, write_mode, strip_mode, proc):
+def write_fq(fq_dict, fname, write_mode, remove_mode, proc):
     """Write to fastq file
     Args:
         fq_dict(dict): fq_dict with unmapped read names as keys,
             unmapped/mapped (u|m), seq, and quality as values in a list
         fname(string) Path to output fastq file
         write_mode (str): 'rb' or 'r'
-        strip_mode (str): strip (remove read) or replace (replace read sequence) by Ns
+        remove_mode (str): remove (remove read) or replace (replace read sequence) by Ns
         proc(int) number of processes
     """
     fq_dict_keys = list(fq_dict.keys())
     if write_mode == 'wb':
         with xopen(fname, mode='wb', threads=proc) as fw:
             for fq_dict_key in fq_dict_keys:
                 wstring = ""
-                if strip_mode == 'strip':
+                if remove_mode == 'remove':
                     if fq_dict[fq_dict_key][0] == 'u':
                         wstring += f"@{fq_dict_key+fq_dict[fq_dict_key][1]}\n"
                         for i in fq_dict[fq_dict_key][2:]:
                             wstring += f"{i}\n"
-                elif strip_mode == 'replace':
+                elif remove_mode == 'replace':
                     # if unmapped, write all the read lines
                     if fq_dict[fq_dict_key][0] == 'u':
                         wstring += f"@{fq_dict_key+fq_dict[fq_dict_key][1]}\n"
@@ -217,12 +217,12 @@ def write_fq(fq_dict, fname, write_mode, strip_mode, proc):
         with open(fname, 'w') as fw:
             for fq_dict_key in fq_dict_keys:
                 wstring = ""
-                if strip_mode == 'strip':
+                if remove_mode == 'remove':
                     if fq_dict[fq_dict_key][0] == 'u':
                         wstring += f"@{fq_dict_key+fq_dict[fq_dict_key][1]}\n"
                         for i in fq_dict[fq_dict_key][2:]:
                             wstring += f"{i}\n"
-                elif strip_mode == 'replace':
+                elif remove_mode == 'replace':
                     # if unmapped, write all the read lines
                     if fq_dict[fq_dict_key][0] == 'u':
                         wstring += f"@{fq_dict_key+fq_dict[fq_dict_key][1]}\n"
@@ -238,8 +238,8 @@ def write_fq(fq_dict, fname, write_mode, strip_mode, proc):
                 fw.write(wstring)


-def check_strip_mode(mode):
-    if mode.lower() not in ['replace', 'strip']:
+def check_remove_mode(mode):
+    if mode.lower() not in ['replace', 'remove']:
         print(f"Mode must be {' or '.join(mode)}")
     return(mode.lower())

@@ -257,7 +257,7 @@ def check_strip_mode(mode):
     else:
         write_mode = "w"

-    strip_mode = check_strip_mode(MODE)
+    remove_mode = check_remove_mode(MODE)
     BAMFILE = pysam.AlignmentFile(BAM, 'r')

     # FORWARD OR SE FILE
@@ -270,7 +270,7 @@ def check_strip_mode(mode):
     # print(fq_dict_fwd)
     print(f"- Writing forward fq to {out_fwd}")
     write_fq(fq_dict=fq_dict_fwd, fname=out_fwd,
-             write_mode=write_mode, strip_mode=strip_mode, proc=PROC)
+             write_mode=write_mode, remove_mode=remove_mode, proc=PROC)

     # REVERSE FILE
     if IN_REV:
@@ -284,4 +284,4 @@ def check_strip_mode(mode):
         fq_dict_rev = get_mapped_reads(fqd_rev, mapped_reads)
         print(f"- Writing reverse fq to {out_rev}")
         write_fq(fq_dict=fq_dict_rev, fname=out_rev,
-                 write_mode=write_mode, strip_mode=strip_mode, proc=PROC)
+                 write_mode=write_mode, remove_mode=remove_mode, proc=PROC)
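
The renamed `check_remove_mode` only prints a message on an unsupported mode, and `' or '.join(mode)` joins the characters of the supplied string instead of the allowed values. A minimal sketch of a stricter check follows. It is not part of this commit: the `VALID_MODES` constant and the choice to raise `ValueError` are assumptions made for illustration.

```python
# Hypothetical constant, not present in bin/extract_map_reads.py
VALID_MODES = ("remove", "replace")


def check_remove_mode(mode, valid_modes=VALID_MODES):
    """Return the lower-cased mode, or raise if it is not supported."""
    normalised = mode.lower()
    if normalised not in valid_modes:
        # Join the allowed values, not the characters of the input string
        raise ValueError(
            f"Mode must be {' or '.join(valid_modes)}, got '{mode}'"
        )
    return normalised


if __name__ == "__main__":
    print(check_remove_mode("Remove"))  # prints: remove
    try:
        check_remove_mode("strip")
    except ValueError as err:
        print(err)  # prints: Mode must be remove or replace, got 'strip'
```

Raising stops the script before any FASTQ output is written with an unknown mode. Whether that is preferable to the print-and-continue behaviour in the diff is a design choice for the maintainers.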

docs/usage.md

Lines changed: 165 additions & 6 deletions
@@ -10,30 +10,189 @@
 - [Running the pipeline](#running-the-pipeline)
 - [Updating the pipeline](#updating-the-pipeline)
 - [Mandatory Arguments](#mandatory-arguments)
+- [`-profile`](#-profile)
+- [`--input`](#--input)
+- [Direct Input Method](#direct-input-method)
+- [TSV Input Method](#tsv-input-method)
+- [`--bam`](#--bam)
+- [`--single_stranded`](#--single_stranded)
+- [`--colour_chemistry`](#--colour_chemistry)
+- [`--fasta`](#--fasta)
+- [`--genome` (using iGenomes)](#--genome-using-igenomes)
 - [Output Directories](#output-directories)
+- [`--outdir`](#--outdir)
+- [`-w / -work-dir`](#-w---work-dir)
 - [Optional Reference Options](#optional-reference-options)
+- [`--large_ref`](#--large_ref)
+- [`--save_reference`](#--save_reference)
+- [`--bwa_index`](#--bwa_index)
+- [`--seq_dict`](#--seq_dict)
+- [`--fasta_index`](#--fasta_index)
 - [Other run specific parameters](#other-run-specific-parameters)
+- [`-r`](#-r)
+- [`--max_memory`](#--max_memory)
+- [`--max_time`](#--max_time)
+- [`--max_cpus`](#--max_cpus)
+- [`--email`](#--email)
+- [`--plaintext_email`](#--plaintext_email)
+- [`-name`](#-name)
+- [`-resume`](#-resume)
+- [`-c`](#-c)
+- [`--monochrome_logs`](#--monochrome_logs)
+- [`--multiqc_config`](#--multiqc_config)
+- [`--custom_config_version`](#--custom_config_version)
 - [Adjustable parameters for nf-core/eager](#adjustable-parameters-for-nf-coreeager)
 - [Step skipping parameters](#step-skipping-parameters)
+- [`--skip_fastqc`](#--skip_fastqc)
+- [`--skip_adapterremoval`](#--skip_adapterremoval)
+- [`--skip_preseq`](#--skip_preseq)
+- [`--skip_deduplication`](#--skip_deduplication)
+- [`--skip_damage_calculation`](#--skip_damage_calculation)
+- [`--skip_qualimap`](#--skip_qualimap)
 - [BAM Conversion Options](#bam-conversion-options)
+- [`--run_convertinputbam`](#--run_convertinputbam)
 - [Complexity Filtering Options](#complexity-filtering-options)
+- [`--complexity_filter_poly_g`](#--complexity_filter_poly_g)
+- [`--complexity_filter_poly_g_min`](#--complexity_filter_poly_g_min)
 - [Adapter Clipping and Merging Options](#adapter-clipping-and-merging-options)
+- [`--clip_forward_adaptor`](#--clip_forward_adaptor)
+- [`--clip_reverse_adaptor`](#--clip_reverse_adaptor)
+- [`--clip_readlength`](#--clip_readlength)
+- [`--clip_min_read_quality`](#--clip_min_read_quality)
+- [`--clip_min_adap_overlap`](#--clip_min_adap_overlap)
+- [`--skip_collapse`](#--skip_collapse)
+- [`--skip_trim`](#--skip_trim)
+- [`--preserve5p`](#--preserve5p)
+- [`--mergedonly`](#--mergedonly)
 - [Read Mapping Parameters](#read-mapping-parameters)
-- [Mapped Reads Stripping](#mapped-reads-stripping)
+- [`--mapper`](#--mapper)
+- [BWA (default)](#bwa-default)
+- [`--bwaalnn`](#--bwaalnn)
+- [`--bwaalnk`](#--bwaalnk)
+- [`--bwaalnl`](#--bwaalnl)
+- [CircularMapper](#circularmapper)
+- [`--circularextension`](#--circularextension)
+- [`--circulartarget`](#--circulartarget)
+- [`--circularfilter`](#--circularfilter)
+- [Bowtie2](#bowtie2)
+- [`--bt2_alignmode`](#--bt2_alignmode)
+- [`--bt2_sensitivity`](#--bt2_sensitivity)
+- [`--bt2n`](#--bt2n)
+- [`--bt2l`](#--bt2l)
+- [`-bt2_trim5`](#-bt2_trim5)
+- [`-bt2_trim3`](#-bt2_trim3)
+- [Mapped Reads Host Removal](#mapped-reads-host-removal)
+- [`--hostremoval_input_fastq`](#--hostremoval_input_fastq)
+- [`--hostremoval_mode`](#--hostremoval_mode)
 - [Read Filtering and Conversion Parameters](#read-filtering-and-conversion-parameters)
+- [`--run_bam_filtering`](#--run_bam_filtering)
+- [`--bam_unmapped_type`](#--bam_unmapped_type)
+- [`--bam_mapping_quality_threshold`](#--bam_mapping_quality_threshold)
+- [`bam_filter_minreadlength`](#bam_filter_minreadlength)
 - [Read DeDuplication Parameters](#read-deduplication-parameters)
+- [`--dedupper`](#--dedupper)
+- [`--dedup_all_merged`](#--dedup_all_merged)
 - [Library Complexity Estimation Parameters](#library-complexity-estimation-parameters)
+- [`--preseq_step_size`](#--preseq_step_size)
 - [DNA Damage Assessment Parameters](#dna-damage-assessment-parameters)
+- [`--udg_type`](#--udg_type)
+- [`--damageprofiler_length`](#--damageprofiler_length)
+- [`--damageprofiler_threshold`](#--damageprofiler_threshold)
+- [`--damageprofiler_yaxis`](#--damageprofiler_yaxis)
+- [`--run_pmdtools`](#--run_pmdtools)
+- [`--pmdtools_range`](#--pmdtools_range)
+- [`--pmdtools_threshold`](#--pmdtools_threshold)
+- [`--pmdtools_reference_mask`](#--pmdtools_reference_mask)
+- [`--pmdtools_max_reads`](#--pmdtools_max_reads)
 - [BAM Trimming Parameters](#bam-trimming-parameters)
+- [`--run_trim_bam`](#--run_trim_bam)
+- [`--bamutils_clip_half_udg_left` / `--bamutils_clip_half_udg_right`](#--bamutils_clip_half_udg_left----bamutils_clip_half_udg_right)
+- [`--bamutils_clip_none_udg_left` / `--bamutils_clip_none_udg_right`](#--bamutils_clip_none_udg_left----bamutils_clip_none_udg_right)
+- [`--bamutils_softclip`](#--bamutils_softclip)
 - [Captured Library Parameters](#captured-library-parameters)
+- [`--snpcapture` false](#--snpcapture-false)
+- [`--bedfile`](#--bedfile)
 - [Feature Annotation Statistics](#feature-annotation-statistics)
+- [`--run_bedtools_coverage`](#--run_bedtools_coverage)
+- [`--anno_file`](#--anno_file)
 - [Genotyping Parameters](#genotyping-parameters)
+- [`--run_genotyping`](#--run_genotyping)
+- [`--genotyping_tool`](#--genotyping_tool)
+- [`--genotyping_source`](#--genotyping_source)
+- [`--gatk_ug_jar`](#--gatk_ug_jar)
+- [`--gatk_call_conf`](#--gatk_call_conf)
+- [`--gatk_ploidy`](#--gatk_ploidy)
+- [`--gatk_dbsnp`](#--gatk_dbsnp)
+- [`--gatk_ug_out_mode`](#--gatk_ug_out_mode)
+- [`--gatk_hc_out_mode`](#--gatk_hc_out_mode)
+- [`--gatk_ug_genotype_model`](#--gatk_ug_genotype_model)
+- [`--gatk_hc_emitrefconf`](#--gatk_hc_emitrefconf)
+- [`--gatk_ug_keep_realign_bam`](#--gatk_ug_keep_realign_bam)
+- [`--gatk_downsample`](#--gatk_downsample)
+- [`--gatk_ug_gatk_ug_defaultbasequalities`](#--gatk_ug_gatk_ug_defaultbasequalities)
+- [`--freebayes_C`](#--freebayes_c)
+- [`--freebayes_g`](#--freebayes_g)
+- [`--freebayes_p`](#--freebayes_p)
+- [`--pileupcaller_bedfile`](#--pileupcaller_bedfile)
+- [`--pileupcaller_snpfile`](#--pileupcaller_snpfile)
+- [`--pileupcaller_method`](#--pileupcaller_method)
+- [`--angsd_glmodel`](#--angsd_glmodel)
+- [`--angsd_glformat`](#--angsd_glformat)
+- [`--angsd_createfasta`](#--angsd_createfasta)
+- [`--angsd_fastamethod`](#--angsd_fastamethod)
+- [`--pileupcaller_transitions_mode`](#--pileupcaller_transitions_mode)
 - [Consensus Sequence Generation](#consensus-sequence-generation)
+- [`--run_vcf2genome`](#--run_vcf2genome)
+- [`--vcf2genome_outfile`](#--vcf2genome_outfile)
+- [`--vcf2genome_header`](#--vcf2genome_header)
+- [`--vcf2genome_minc`](#--vcf2genome_minc)
+- [`--vcf2genome_minq`](#--vcf2genome_minq)
+- [`--vcf2genome_minfreq`](#--vcf2genome_minfreq)
 - [Mitochondrial to Nuclear Ratio](#mitochondrial-to-nuclear-ratio)
+- [`--run_mtnucratio`](#--run_mtnucratio)
+- [`--mtnucratio_header`](#--mtnucratio_header)
 - [SNP Table Generation](#snp-table-generation)
+- [`--run_multivcfanalyzer`](#--run_multivcfanalyzer)
+- [`--write_allele_frequencies`](#--write_allele_frequencies)
+- [`--min_genotype_quality`](#--min_genotype_quality)
+- [`--min_base_coverage`](#--min_base_coverage)
+- [`--min_allele_freq_hom`](#--min_allele_freq_hom)
+- [`--min_allele_freq_het`](#--min_allele_freq_het)
+- [`--additional_vcf_files`](#--additional_vcf_files)
+- [`--reference_gff_annotations`](#--reference_gff_annotations)
+- [`--reference_gff_exclude`](#--reference_gff_exclude)
+- [`--snp_eff_results`](#--snp_eff_results)
 - [Human Sex Determination](#human-sex-determination)
+- [`--run_sexdeterrmine`](#--run_sexdeterrmine)
+- [`--sexdeterrmine_bedfile`](#--sexdeterrmine_bedfile)
 - [Human Nuclear Contamination](#human-nuclear-contamination)
+- [`--run_nuclear_contamination`](#--run_nuclear_contamination)
+- [`--contamination_chrom_name`](#--contamination_chrom_name)
 - [Metagenomic Screening](#metagenomic-screening)
+- [`--run_metagenomic_screening`](#--run_metagenomic_screening)
+- [`--metagenomic_tool`](#--metagenomic_tool)
+- [`--metagenomic_min_support_reads`](#--metagenomic_min_support_reads)
+- [`--database`](#--database)
+- [`--percent_identity`](#--percent_identity)
+- [`--malt_mode`](#--malt_mode)
+- [`--malt_alignment_mode`](#--malt_alignment_mode)
+- [`--malt_top_percent`](#--malt_top_percent)
+- [`--malt_min_support_mode`](#--malt_min_support_mode)
+- [`--malt_min_support_percent`](#--malt_min_support_percent)
+- [`--malt_max_queries`](#--malt_max_queries)
+- [`--malt_memory_mode`](#--malt_memory_mode)
+- [`--run_maltextract`](#--run_maltextract)
+- [`--maltextract_taxon_list`](#--maltextract_taxon_list)
+- [`--maltextract_ncbifiles`](#--maltextract_ncbifiles)
+- [`--maltextract_filter`](#--maltextract_filter)
+- [`--maltextract_toppercent`](#--maltextract_toppercent)
+- [`--maltextract_destackingoff`](#--maltextract_destackingoff)
+- [`--maltextract_downsamplingoff`](#--maltextract_downsamplingoff)
+- [`--maltextract_duplicateremovaloff`](#--maltextract_duplicateremovaloff)
+- [`--maltextract_matches`](#--maltextract_matches)
+- [`--maltextract_megansummary`](#--maltextract_megansummary)
+- [`--maltextract_percentidentity`](#--maltextract_percentidentity)
+- [`maltextract_topalignment`](#maltextract_topalignment)
 - [Clean up](#clean-up)

 ## General Nextflow info
@@ -669,23 +828,23 @@ Number of bases to trim of 5' (left) end of read prior alignment. Maybe useful w

 Number of bases to trim of 3' (right) end of read prior alignment. Maybe useful when left-over sequencing artefacts of in-line barcodes present Default: 0.

-### Mapped Reads Stripping
+### Mapped Reads Host Removal

 These parameters are used for removing mapped reads from the original input FASTQ files, usually in the context of uploading the original FASTQ files to a public read archive (NCBI SRA/EBI ENA).

 These flags will produce FASTQ files almost identical to your input files, except that reads with the same read ID as one found in the mapped bam file, are either removed or 'masked' (every base replaced with Ns).

 This functionality allows you to provide other researchers who wish to re-use your data to apply their own adapter removal/read merging procedures, while maintaining anonyminity for sample donors - for example with microbiome research.

-If using TSV input, stripping is performed library, i.e. after lane merging.
+If using TSV input, mapped read removal is performed library, i.e. after lane merging.

-#### `--strip_input_fastq`
+#### `--hostremoval_input_fastq`

 Create pre-Adapter Removal FASTQ files without reads that mapped to reference (e.g. for public upload of privacy sensitive non-host data)

-#### `--strip_mode`
+#### `--hostremoval_mode`

-Read removal mode. Strip mapped reads completely (`'strip'`) or just replace mapped reads sequence by N (`'replace'`)
+Read removal mode. Completely remove mapped reads from the file(s) (`'remove'`) or just replace mapped reads sequence by N (`'replace'`)

 ### Read Filtering and Conversion Parameters
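
The two `--hostremoval_mode` values documented in this hunk can be illustrated with a small sketch. It is not the pipeline's implementation: the `host_remove` function and the `(read_id, sequence, quality)` record layout are hypothetical, and leaving the quality string unchanged in `'replace'` mode is an assumption, since the documentation only states that every base is replaced with N.

```python
def host_remove(records, mapped_ids, mode="remove"):
    """Yield FASTQ records with mapped reads removed or masked.

    records:    iterable of (read_id, sequence, quality) tuples (assumed layout)
    mapped_ids: set of read IDs found in the mapped BAM file
    mode:       'remove' drops mapped reads; 'replace' masks their bases with N
    """
    if mode not in ("remove", "replace"):
        raise ValueError("mode must be 'remove' or 'replace'")
    for read_id, seq, qual in records:
        if read_id not in mapped_ids:
            # Unmapped read: keep it exactly as it was in the input
            yield read_id, seq, qual
        elif mode == "replace":
            # Mapped read: mask every base, keep the quality string (assumption)
            yield read_id, "N" * len(seq), qual
        # Mapped read in 'remove' mode: emit nothing


reads = [("r1", "ACGT", "IIII"), ("r2", "GGCC", "FFFF")]
mapped = {"r2"}

print(list(host_remove(reads, mapped, "remove")))
# [('r1', 'ACGT', 'IIII')]
print(list(host_remove(reads, mapped, "replace")))
# [('r1', 'ACGT', 'IIII'), ('r2', 'NNNN', 'FFFF')]
```

In `'replace'` mode the output keeps the same number of records as the input, so paired files stay in sync. In `'remove'` mode the mapped reads are absent from the output altogether.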