This directory contains several command-line utilities that help with ancillary tasks involved in getting V-pipe running.
quick_install.sh is a script that assists in deploying V-pipe: it can automatically download and install bioconda and snakemake, and fetch V-pipe from the repository.
It is possible to directly run it from the web with the following commands:
curl -O 'https://raw.githubusercontent.com/cbg-ethz/V-pipe/master/utils/quick_install.sh'
bash quick_install.sh -w working
cd ./working/

- This will download and install bioconda, snakemake and V-pipe in the current directory (use option `-p` to specify another directory).
- The installed version by default is the `master` git branch.
  - It is possible to specify another git branch or a tag using the `-b` option, e.g. `-b v2.99.1`.
  - Alternatively, the `-r` option will download and install the `.tar.gz` tar-ball package of a specific version.
  - See the release page for available tags and releases.
- Using `-w` will create a working directory and populate it.
  - It will copy over a default `config.yaml` that you can edit to your liking.
  - And it will create a handy `vpipe` short-cut script to invoke `snakemake`:

cd working
# edit config.yaml and provide samples/ directory
./vpipe --jobs 4 --printshellcmds --dry-run

Tips: To create and populate other new working directories, you can call `init_project.sh` from within the new directory:

mkdir -p working_2
cd working_2
../V-pipe/init_project.sh
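If several working directories are needed, the `init_project.sh` call from the tip above can be wrapped in a loop. The following is only a minimal bash sketch: `init_workdirs` is a hypothetical helper name that is not part of V-pipe, and it assumes that the path to `init_project.sh` is given as an absolute path, because the helper changes directory before calling it.

```shell
# Sketch only: init_workdirs is a hypothetical helper, not part of V-pipe.
# Usage: init_workdirs /absolute/path/to/V-pipe/init_project.sh DIR [DIR...]
init_workdirs() {
    local init_script=$1   # assumed to be an absolute path
    shift
    local dir
    for dir in "$@"; do
        mkdir -p "$dir" || return 1
        # init_project.sh is called from within the new directory;
        # the subshell keeps the caller's current directory unchanged.
        ( cd "$dir" && "$init_script" ) || return 1
    done
}
```

For example, `init_workdirs "$PWD/V-pipe/init_project.sh" working_2 working_3` would populate two new directories next to `V-pipe/`.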
Available options:
# ./quick_install.sh -h
usage: ./quick_install.sh [options]
options:
-f force overwriting directories
-p PREFIX prefix directory under which to install V-pipe
[default: current directory]
-b BRANCH install specified branch of V-pipe
[default: master]
-r RELEASE install specified release package of V-pipe
-w WORKDIR create and populate working directory
-m only minimal working directory
-h print this help message and exit

There are also some tools that can help with mass-importing samples into V-pipe.
They will search for .fastq.gz files, put them in the two-level hierarchy that V-pipe expects in samples/ and generate the corresponding samples.tsv. By default, hard-links will be used to save space and avoid duplicating these large sequencing files.
Whichever of these tools is most suitable for you depends on how you receive the files from the sequencing lab.
- `sort_samples_dumb` is for loose collections of FASTQ files.
- `sort_samples_demultiplexstats` and `sort_samples_jobinfo` are better suited when additional information has been provided by the base calling and demultiplexing software.
  - These two also support patch map TSV files, helping to rename the samples from the name used in the original sample sheet in the LIMS (laboratory information management system) to something more flexible. For example, to remap simple sequential numbers to longer names, use the following patch map TSV:

1 05_2021_02_01
2 05_2021_02_02
3 05_2021_02_03
4 05_2021_02_04
5 05_2021_02_05
268 05_2021_02_06
269 05_2021_02_07
270 05_2021_02_08
271 05_2021_02_09
272 05_2021_02_10
273 05_2021_02_11
274 05_2021_02_12
275 05_2021_02_13

sort_samples_dumb is useful when labs are providing the sequencing data simply as a loose collection of FASTQ files.
Example of usage:
V-pipe/utils/sort_samples_dumb -f download/ -t working/samples.tsv -o working/samples -b 20210110

Where:

- `-f`: specifies the main directory containing the downloaded `.fastq.gz` files.
  - all its subdirectories will be searched recursively
- `-t`: specifies the `samples.tsv` file to create
- `-o`: specifies the output directory where to store the files
- `-b`: is the directory to use for the second level (see V-pipe tutorials), e.g.: dates
  - the first level is usually the patient or sample name and `sort_samples_dumb` will attempt to guess it from the file names.
  - the second level is usually the sampling date or sequencing batch, but `sort_samples_dumb` has no simple way to guess that.

Running this command will immediately copy over the .fastq.gz files (using hard links) in the working directory and generate the samples.tsv. After checking the content, you should be able to run V-pipe.
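One way to do the "checking the content" step is to confirm that every row of the generated TSV has a matching two-level directory. The bash sketch below is only an illustration: `check_samples` is a hypothetical helper, not part of V-pipe, and it assumes that `samples.tsv` is tab-separated with the first-level name (sample) in column 1 and the second-level name (batch or date) in column 2.

```shell
# Sketch only: check_samples is a hypothetical helper, not part of V-pipe.
# Usage: check_samples SAMPLES_TSV SAMPLES_DIR
# Prints the number of rows without a matching directory on stdout
# and lists the missing directories on stderr.
check_samples() {
    local tsv=$1 samples_dir=$2
    local sample batch rest missing=0
    # Assumption: column 1 = first level (sample), column 2 = second level (batch).
    # The "|| [ -n ... ]" part also handles a last line without trailing newline.
    while IFS=$'\t' read -r sample batch rest || [ -n "$sample" ]; do
        [ -n "$sample" ] || continue   # skip empty lines
        if [ ! -d "$samples_dir/$sample/$batch" ]; then
            echo "missing: $samples_dir/$sample/$batch" >&2
            missing=$((missing + 1))
        fi
    done < "$tsv"
    echo "$missing"
}
```

With the example above, this could be called as `check_samples working/samples.tsv working/samples`; a result of `0` means every listed sample has its directory.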
Available options:
./sort_samples_dumb -h
Usage: ./sort_samples_dumb -f <DIR> -b <BATCH> [-l <LEN>] [-L {''|--link|--symbolic-link|--reflink}] -o <DIR> -m <MODE>
-f : directory containing .fastq.gz files
-b : batch name to use for 2nd level (e.g.: date)
-l : read lenght (default: autodetect)
-L : link parameter to pass to cp when copying (default: --link)
-t : tsv file (default: samples.<BATCH>.tsv)
-T : do not truncate (empty) the file before starting
-g : store list in .tsv.staging instead and only rename into final .tsv if successful
-D : sample have duplicates (e.g.: across lanes)
-p : prefix to prepend to fastq files (e.g.: for fusing runs)
-s : suffix to append to fastq files (e.g.: for fusing runs)
-o : output directory
-m : POSIX mode parameter to pass to mkdir (e.g.: 0770)

Of special interest:

- `-l`: sets the read length instead of trying to autodetect it with an awk script.
- `-D`: when multiple files have the same name, the importer will group them for merging as the same sample (e.g.: lane-duplicates).
sort_samples_demultiplexstats can help if the lab provides the DemultiPlex Stats files generated by Illumina's bcl2fastq version 2.x software. This tool will use the information provided in the JSON file to match files to the respective samples they came from.
The general syntax is:
V-pipe/utils/sort_samples_demultiplexstats --statsdir downloads/Demultiplex --fastqdir downloads/RawReads --qcdir downloads/FastQC --outdir working/samples

Where:

- `--statsdir`: is the directory containing the file `Stats/Stats.json` produced by `bcl2fastq`.
- `--outdir`: directory where to create the output
- optionally:
  - `--fastqdir`: is the directory containing the `.fastq.gz` files if they are not in the same directory as stats.
  - `--qcdir`: is the directory with the FastQC's quality checks if provided by the lab.
  - `--noempty`: if the lab deleted the empty (0 reads) `.fastq.gz` files.

This command is the first step (analysing Stats.json):
- It will generate a samples TSV file in the output directory: `working/samples/samples.{some date}.tsv`. You can copy the content of this file into your `samples.tsv`.
- It will generate a file `working/samples/movedata.sh`, which you can run with:

  bash working/samples/movedata.sh

  That file will in turn perform the second step: hard-linking all the files from `downloads/RawReads` into `working/samples/{sample name}/{some date}`.
- Once the second step has been performed, you should be able to run V-pipe.
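Copying the generated rows into your `samples.tsv` can also be scripted. The bash sketch below is one possible way, not something V-pipe provides: `append_samples` is a hypothetical helper that merges one or more generated TSV files into a target file while skipping blank lines and rows that are already present.

```shell
# Sketch only: append_samples is a hypothetical helper, not part of V-pipe.
# Usage: append_samples TARGET_TSV GENERATED_TSV [GENERATED_TSV...]
append_samples() {
    local target=$1
    shift
    touch "$target" || return 1
    # Concatenate the existing rows and the new ones, then keep only the
    # first occurrence of each non-empty line, preserving the order.
    cat "$target" "$@" | awk 'NF && !seen[$0]++' > "$target.tmp" \
        && mv "$target.tmp" "$target"
}
```

A possible call, assuming your main file is `working/samples.tsv`, would be `append_samples working/samples.tsv working/samples/samples.*.tsv`.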
Available options:
./sort_samples_demultiplexstats -h
usage: sort_samples_demultiplexstats [-h] -S DIR [-f DIR] [-q DIR] [-o DIR] [-m MODE] [-L CPLINK] [-s] [-a] [-n] [-p TSV]
Uses bcl2fastq's demultiplexing stats as metadata to organise samples
optional arguments:
-h, --help show this help message and exit
-S DIR, --statsdir DIR
directory containing 'Stats/Stats.json'
-f DIR, --fastqdir DIR
directory containing .fastq.gz files if different from above
-q DIR, --qcdir DIR if set, import FastQC's _fastqc.html files from there
-o DIR, --outdir DIR output directory
-m MODE, --mode MODE POSIX file access mode to be passed to mkdir
-L CPLINK, --linking CPLINK
parameter to pass to `cp` for linking files instead of copying their data
--force Force overwriting any existing file when moving
-s, --summary Only display a summary of datasets, not an exhaustive list of all samples
-a, --append Append to the end of movedatafiles.sh, instead of overwritting (use when calling from an external combiner wrapper)
-g, --staging Write samples list in .tsv.staging and only rename them to the final .tsv at the end of movedatafiles.sh if there were no errors.
-n, --noempty skip fastq.gz files with bad yield (0 reads)
-p TSV, --patchmap TSV
patchmap file to rename samples

sort_samples_jobinfo is similar in concept to sort_samples_demultiplexstats, but it relies on the CompletedJobInfo.xml and SampleSheetUsed.csv files generated by the Illumina Analysis Software on Windows, if those files are provided by the lab, and will try to match the samples listed therein to .fastq.gz files.
The general syntax is:
V-pipe/utils/sort_samples_jobinfo --sourcedir=downloads/20210528_061936 --outdir=working/samples

Where:

- `--sourcedir`: is the directory, usually named `{yymmdd}_{hhmmss}` and found inside the subdirectory `Alignment_1` of the Illumina run folder (e.g.: `E:\210527_NDX550487_RUO_0008_AHTWG2AFX2\Alignment_1\20210528_061936`). It contains the two files `CompletedJobInfo.xml` and `SampleSheetUsed.csv` and a subdirectory named `Fastq` containing all the `.fastq.gz` sequencing files.
- `--outdir`: directory where to create the output

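Before running the importer, it can be useful to confirm that the source directory has the layout described above. The bash sketch below does that; `check_sourcedir` is a hypothetical helper, not part of V-pipe, and it only checks for the two files and the `Fastq` subdirectory named in the description.

```shell
# Sketch only: check_sourcedir is a hypothetical helper, not part of V-pipe.
# Usage: check_sourcedir SOURCEDIR
# Returns 0 if the expected files and subdirectory exist, 1 otherwise.
check_sourcedir() {
    local src=$1 status=0
    local f
    for f in CompletedJobInfo.xml SampleSheetUsed.csv; do
        [ -f "$src/$f" ] || { echo "missing file: $src/$f" >&2; status=1; }
    done
    # The .fastq.gz files are expected in the Fastq subdirectory, unless
    # another location is passed to the importer with --fastqdir.
    [ -d "$src/Fastq" ] || { echo "missing directory: $src/Fastq" >&2; status=1; }
    return "$status"
}
```

For the example above, `check_sourcedir downloads/20210528_061936` would report anything that is missing on stderr.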
This command is the first step (analyzing CompletedJobInfo.xml and SampleSheetUsed.csv):

- It will generate a samples TSV file in the output directory: `working/samples/samples.{some date}.tsv`. You can copy the content of this file into your `samples.tsv`.
- It will generate a file `working/samples/movedata.sh`, which you can run with:

  bash working/samples/movedata.sh

  That file will in turn perform the second step: hard-linking all the files from `downloads/20210528_061936/Fastq` into `working/samples/{sample name}/{some date}`.
- Once the second step has been performed, you should be able to run V-pipe.
- If the option `--batch` is provided, it will also generate an additional file `working/samples/batch.{some date}.yaml` including extra information gathered from the files (e.g. the library preparation kit listed in the input CSV). The parameter of `--batch` is used to provide the name of the lab for the `lab:` field in this file.

Available options:
./sort_samples_jobinfo -h
usage: sort_samples_jobinfo [-h] -S DIR [-f DIR] [-o DIR] [-m MODE] [-L CPLINK] [-b LAB] [-s] [-a] [-l] [-p TSV]
Uses CompletedJobInfo.xml and SampleSheetUsed.csv from Illumina Analysis Software
optional arguments:
-h, --help show this help message and exit
-S DIR, --sourcedir DIR
directory containing CompletedJobInfo.xml and SampleSheetUsed.csv
-f DIR, --fastqdir DIR
directory containing .fastq.gz files if not in 'Fastq' subdirectory
-o DIR, --outdir DIR output directory
-m MODE, --mode MODE POSIX file access mode to be passed to mkdir
-L CPLINK, --linking CPLINK
parameter to pass to `cp` for linking files instead of copying their data
--force Force overwriting any existing file when moving
-b LAB, --batch LAB generate batch description
-s, --summary Only display a summary of datasets, not an exhaustive list of all samples
-a, --append Append to the end of movedatafiles.sh, instead of overwritting (use when calling from an external combiner wrapper)
-g, --staging Write samples list in .tsv.staging and only rename them to the final .tsv at the end of movedatafiles.sh if there were no errors.
-l, --forcelanes Explicitly look for sample in each lane (for replicates across lanes)
-p TSV, --patchmap TSV
patchmap file to rename samples
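
A patch map for the `--patchmap` option, like the numbered example shown earlier, can be written by hand or generated. The bash sketch below is one possible way to generate it: `make_patchmap` is a hypothetical helper, not part of V-pipe, and it assumes the two-column, tab-separated layout of that example (original name, then new name), with new names built from a prefix and a two-digit counter.

```shell
# Sketch only: make_patchmap is a hypothetical helper, not part of V-pipe.
# Usage: make_patchmap PREFIX OLD_NAME [OLD_NAME...]
# Prints one "old<TAB>new" row per original sample name.
make_patchmap() {
    local prefix=$1
    shift
    local n=1 old
    for old in "$@"; do
        # New name = prefix, underscore, two-digit running number.
        printf '%s\t%s_%02d\n' "$old" "$prefix" "$n"
        n=$((n + 1))
    done
}
```

For example, `make_patchmap 05_2021_02 1 2 3 4 5 268 > patchmap.tsv` maps `1` to `05_2021_02_01` and `268` to `05_2021_02_06`, in the same way as the example patch map.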