Usage:
SeqFilter <fa/fq>
SeqFilter [filter] <fa/fq> --out <fa/fq>
cat <fa/fq> | SeqFilter [filter] --out - | ...
Options:
-o|--out <FILE/-> [OFF]
-c|--stdout
Output to file. -c for STDOUT.
--stats <FILE> [-]
Print stats to file. Default is STDOUT, or STDERR if STDOUT is in use.
--ids <FILE/-/IDLIST>
File of sequence IDs or literal list of IDs to be reported. Reads
comma-, whitespace and newline separated lists. Leading '>' or '@'
are ignored.
SeqFilter fasta.fa --ids "seq35,seq49" # list
SeqFilter fasta.fa --ids ids.list # file
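The list parsing described for --ids can be sketched as follows. This is a minimal Python illustration of the stated rules (comma, whitespace and newline separators, one leading '>' or '@' ignored), not SeqFilter's own Perl code; the function name parse_id_list is hypothetical.

```python
import re


def parse_id_list(text):
    """Split an ID list on commas and whitespace, dropping one leading '>' or '@'."""
    ids = []
    for token in re.split(r"[,\s]+", text.strip()):
        if not token:
            continue  # skip empty tokens caused by trailing separators
        if token[0] in ">@":
            token = token[1:]  # FASTA/FASTQ header markers are not part of the id
        if token:
            ids.append(token)
    return ids


print(parse_id_list(">seq35, @seq49\nseq7"))  # ['seq35', 'seq49', 'seq7']
```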
--ids-pattern <FILE/-/PATTERNLIST>
Match a perl PATTERN, or the path to a file containing multiple
PATTERNs, one per line, against sequence ids. Keep matching
sequences.
SeqFilter seqs.fq --ids-patt 'comp2_c13_seq.*|comp2_c88_seq.*'
Extended usage: add capture groups to perl P(A)TT(ER)N using "()".
Splits matched sequences to different output files based on capture.
Use sprintf conversions (%s, %02d, ...) in --out to define a
template for the output file names.
# split seqs by library (LIB1, LIB2, LIB14)
SeqFilter multilib.fq --ids-pattern 'LIB(\d+)' --out multilib_%02d.fq
# creates multilib_01.fq, multilib_02.fq, multilib_14.fq
NOTE: Perl needs to open a filehandle for every split file. This can
slow things down considerably if you split into more than 1000
different files and occurrences of the patterns are randomly mixed in
the source file.
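How a capture group and an --out template combine into a file name can be sketched like this. The Python below only mimics the idea with re and %-formatting; the function name split_target, the pattern 'LIB(\d+)' and the example ids are assumptions for illustration, and SeqFilter itself uses Perl regular expressions and sprintf.

```python
import re


def split_target(seq_id, pattern, template):
    """Return the output file name for seq_id, or None if the pattern does not match."""
    match = re.search(pattern, seq_id)
    if match is None:
        return None
    captures = []
    for group in match.groups():
        # Purely numeric captures are converted so %d-style conversions work.
        if group is not None and group.isdigit():
            captures.append(int(group))
        else:
            captures.append(group)
    return template % tuple(captures)


for seq_id in ("LIB1_read7", "LIB2_read3", "LIB14_read9"):
    print(seq_id, "->", split_target(seq_id, r"LIB(\d+)", "multilib_%02d.fq"))
```

Note that a greedy prefix such as '\w+' in front of '(\d+)' would leave only the last digit for the capture group, so anchoring the pattern on a fixed prefix is the safer choice.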
--ids-exclude
Reverse the behaviour of --ids and --ids-pattern: exclude matched
sequences and keep unmatched ones.
--ids-rename <PATTERN>
Provide a perl substitution pattern as string. The pattern is
applied to every id. Use global "$COUNT" to access the output
sequence counter, $PICARD to access the picard number.
SeqFilter library_1.fq --ids-rename='s/.*\//sprintf("MYLIB_%05d\/%d", $COUNT, $PICARD)/e'
# creates ids:
# @MYLIB_00001/1
# @MYLIB_00002/1 ...
--desc-replace
--desc-append
Remove (no arg)/replace/append description of sequences.
-l|--min-length <INT>
Minimum sequence length.
-L|--max-length <INT>
Maximum length for sequences to be retrieved. Default off.
-A|--fasta
-Q|--fastq <CHAR>
Convert FASTQ/FASTA to FASTA/FASTQ, respectively.
-w|--line-width [80 for FASTA]
FASTA only. Set 0 for single line. Ignored for FASTQ.
--rc|--rev-comp <FILE/-/LIST>
File of sequence IDs or list of IDs to be transformed, no argument
to transform all sequences. Formatting follows the same rules as
--ids files.
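For reference, a reverse-complement transformation can be sketched as below. This is a generic Python illustration, not SeqFilter's implementation: it assumes only A, C, G, T and N need complementing (other characters pass through unchanged) and that FASTQ qualities are simply reversed along with the sequence.

```python
# Translation table for the complement; case is preserved.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq, qual=None):
    """Return (reverse-complemented sequence, reversed quality string or None)."""
    rc_seq = seq.translate(_COMPLEMENT)[::-1]
    rc_qual = qual[::-1] if qual is not None else None
    return rc_seq, rc_qual


print(reverse_complement("AAGC", "!#%I"))  # ('GCTT', 'I%#!')
```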
--careful [OFF]
FASTQ only. Check every FASTQ record to have a valid format. '@/+'
at the start of id lines, sequence and quality of identical length,
phred within boundaries. Slows down the parsing.
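The checks listed for --careful can be sketched roughly as follows. This Python function is an illustration only: the name check_fastq_record, the default offset of 33 and the upper phred bound of 93 are assumptions, and SeqFilter's actual boundaries may differ.

```python
def check_fastq_record(lines, phred_offset=33, max_phred=93):
    """Return a list of problems found in a 4-line FASTQ record (empty if valid)."""
    if len(lines) != 4:
        return ["record does not have 4 lines"]
    header, seq, plus, qual = lines
    problems = []
    if not header.startswith("@"):
        problems.append("id line does not start with '@'")
    if not plus.startswith("+"):
        problems.append("separator line does not start with '+'")
    if len(seq) != len(qual):
        problems.append("sequence and quality differ in length")
    # Assumed bounds: phred values between 0 and max_phred for the given offset.
    if any(not 0 <= ord(ch) - phred_offset <= max_phred for ch in qual):
        problems.append("phred value out of bounds")
    return problems


print(check_fastq_record(["@read1", "ACGT", "+", "IIII"]))  # []
```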
--lower-case
--upper-case
Convert output sequence to lower/upper case.
--iupac-to-N
Convert non [ATGCNatgcn] characters to N.
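The character class above maps directly onto a regular-expression substitution. The Python sketch below shows the idea; it assumes every replaced character becomes an upper-case "N", which may differ from how SeqFilter treats lower-case input.

```python
import re


def iupac_to_n(seq):
    """Replace every character outside [ATGCNatgcn] with 'N'."""
    return re.sub(r"[^ATGCNatgcn]", "N", seq)


print(iupac_to_n("ACRYGTn-"))  # ACNNGTnN
```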
--phred-offset [auto]
FASTQ only. Specify Phred offset for quality scores. Default
auto-detect.
--phred-transform
FASTQ only. Transform phreds from input offset to specified
"--phred-transform" offset, usually 33 to 64 or vice versa.
--phred-mask
FASTQ only. At least two values separated by ",", e.g. "0,10" to mask
all nucleotides with phred below 10 with an "N". Optional additional
values control:
minimum length of unmasked regions
minimum length of masked regions
bps to ignore at the ends of masked regions (shorten masked regions to their
core)
a ratio that determines whether to mask or unmask terminal regions that are
smaller than the minimum unmasked region
NOTE: Requires "--phred-offset".
--trim-window <INT1>,[<INT2>],[<INT3>]
FASTQ only. Trim sequences to quality >= SOFT,HARD,SIZE in a sliding
window, default 10. The sliding window allows positions below the
SOFT cutoff, provided the window mean is higher than SOFT. Qualities
below HARD, default 0, will always terminate a stretch. It is ensured
that a) positions with quality below cutoff only occur within the
remaining sequence, not at its start/end and b) windows never overlap
each other.
--trim-lcs <INT,INT,INT>
FASTQ only. Three values separated by ",", e.g. "30,40,50" to grep
all stretches of quality >= 30 and minimum length 50 from the
sequences. Faster than "--trim-window" yet breaks sequences even on
a single low quality position.
NOTE: "--trim-lcs" and "--trim-window" can be combined, e.g.
--trim-lcs 5,40,100 --trim-window 10
will generate sequences with qualities of at least 5 at every
position and a window mean of 10.
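The core idea behind --trim-lcs, extracting stretches in which every position reaches a minimum quality and which have a minimum length, can be sketched as below. This Python function is a simplified illustration: it models only the quality cutoff and the minimum length, assumes a phred offset of 33, and does not claim to reproduce how SeqFilter interprets all three values.

```python
def high_quality_stretches(seq, qual, min_phred, min_length, phred_offset=33):
    """Return (start, subsequence) for each run with all phreds >= min_phred."""
    phreds = [ord(ch) - phred_offset for ch in qual]
    stretches = []
    start = None
    # Iterate one step past the end so a run reaching the last base is closed.
    for pos in range(len(phreds) + 1):
        good = pos < len(phreds) and phreds[pos] >= min_phred
        if good and start is None:
            start = pos
        elif not good and start is not None:
            if pos - start >= min_length:
                stretches.append((start, seq[start:pos]))
            start = None
    return stretches


# A single low-quality position ('!' = phred 0) breaks the read in two.
print(high_quality_stretches("ACGTACGTAC", "IIII!IIIII", 30, 4))
# [(0, 'ACGT'), (5, 'CGTAC')]
```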
--substr <FILE/-/LIST>
Pathname to a FILE containing information for subseq
extraction/modification. The format is a tsv, by default lines of
the format ID FROM TO are expected. Lines prefixed with '#' are
treated as comments and therefore ignored. If --substr-perl-style is
set, the lines must start with the ID of the read, followed by the
substr values OFFSET,LENGTH,REPLACESEQ,REPLACEQUAL. The parameter
usage is then the same as for the perl builtin "substr" function:
an OFFSET alone is sufficient, a positive value counts from the
start of the sequence, a negative offset from the end, and without
LENGTH the sequence is returned from OFFSET to its end.
REPLACEMENTs are introduced at the OFFSET position: if LENGTH is 0,
it is a simple insertion, else a part is deleted first and the
REPLACEMENT is then inserted. Substring extraction is performed
prior to any other trimming. To trim all reads use '*' instead of
the read id. This command will be performed prior to any individual
substr command.
FROM TO:
# extract sequence from pos 10 to pos 50
read1 10 50
OFFSET [LENGTH [REPLACEMENT]]
# trim read1 head and tail by 10
read1 10
# extract from read2 250 nts starting at pos 15
read2 15 250
# replace 1 nt at offset 3 by an "N" with qual "!" (for FASTQ)
read3 3 1 N !
# trim from all reads 5nts at the beginning and the end.
* 5
* -5
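The OFFSET [LENGTH [REPLACEMENT]] semantics follow Perl's substr(), which can be emulated in a few lines. The Python sketch below is an illustration only: perl_substr is a hypothetical helper, offsets are 0-based as in Perl, and, unlike Perl's four-argument substr, it returns the modified string when a replacement is given.

```python
def perl_substr(seq, offset, length=None, replacement=None):
    """Emulate Perl substr(): extract, or replace when a replacement is given."""
    n = len(seq)
    # A negative offset counts from the end of the sequence.
    start = offset if offset >= 0 else n + offset
    start = max(0, min(start, n))
    if length is None:
        end = n  # no LENGTH: everything from OFFSET to the end
    elif length >= 0:
        end = min(start + length, n)
    else:
        end = max(start, n + length)  # negative LENGTH leaves that many off the end
    if replacement is None:
        return seq[start:end]
    return seq[:start] + replacement + seq[end:]


print(perl_substr("ACGTACGTAC", 3))          # TACGTAC
print(perl_substr("ACGTACGTAC", 2, 3))       # GTA
print(perl_substr("ACGTACGTAC", 3, 1, "N"))  # ACGNACGTAC
```

For FASTQ records, the same offsets would be applied to the quality string, with REPLACEQUAL taking the place of REPLACESEQ.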
--substr-perl-style
By default, substr information is read according to the format FROM
TO. Set this flag to switch the behaviour to the perl substr() like
style of "OFFSET [LENGTH [REPLACEMENT]]".
-N|--Nx <INT,INT...>
Report Nx value (N50, N90...). Default "50,90".
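For readers unfamiliar with the statistic: Nx is the length of the shortest sequence in the smallest set of longest sequences that together cover at least x percent of all bases. A minimal Python sketch of that common definition follows; it is not taken from SeqFilter's source, and the function name nx is hypothetical.

```python
def nx(lengths, x=50):
    """Return the Nx value for a collection of sequence lengths."""
    threshold = sum(lengths) * x / 100
    running = 0
    # Walk from the longest sequence down until x percent of bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return 0  # empty input


lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(nx(lengths, 50), nx(lengths, 90))  # 8 4
```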
-C|--base-composition <BASE(S),BASE(S),BASE(S),...>
Report relative amount of given bases. Takes a "," separated list,
each element of the list can be one or more bases (cumulative).
--base-composition=GC,N # GC content (G and C combined) and N content
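What "cumulative" means for a multi-base element can be sketched as below: each list element is treated as a set of bases whose counts are summed and divided by the total number of bases. This Python illustration assumes case-insensitive counting, which SeqFilter may or may not do; the function name base_composition is hypothetical.

```python
def base_composition(seqs, groups):
    """Return the fraction of bases falling into each group, e.g. ['GC', 'N']."""
    total = sum(len(seq) for seq in seqs)
    result = {}
    for group in groups:
        bases = set(group.upper())
        # Count every base that belongs to any letter of the group.
        count = sum(1 for seq in seqs for ch in seq.upper() if ch in bases)
        result[group] = count / total if total else 0.0
    return result


print(base_composition(["GGCC", "AATN"], ["GC", "N"]))  # {'GC': 0.5, 'N': 0.125}
```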
-H|--histogram
Plot distribution of bases by length as ASCII plot. Uses linear
scale for data sets with difference in order of magnitude < 2, log
scale otherwise.
--[no]-smart-labels
Toggle shortening filepaths to shortest unique labels.
-p|--progress
Display progress bars (equivalent to '--verbose 2')
-q|--quiet
Omit all verbose messages. The same as --verbose=0. Supersedes
--verbose settings.
--verbose <INT>
Set verbosity level, default 2, which outputs statistics and
progress. Set 1 for statistics only or 0 for no verbose output.
-h|--help
Display this help
-V|--version
Display current version
BioInf-Wuerzburg/SeqFilter