# European Human Genetics Conference 2017

## CAW - Cancer Analysis Workflow to process normal/tumor WGS data

Maxime Garcia 1,
Szilveszter Juhos 2,
Malin Larsson 3,
Teresita Díaz de Ståhl 4,
Jesper Eisfeldt 5,
Sebastian DiLorenzo 6,
Pall Olason 7,
Björn Nystedt 7,
Monica Nistér 4,
Max Käller 8

1. BarnTumörBanken, Department of Oncology Pathology, Science for Life Laboratory, Karolinska Institutet
2. Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University
@@ -18,3 +21,11 @@ Besides variant calls, the workflow provides quality controls presented by Multi
6. Department of Medical Sciences, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Uppsala University
7. Science for Life Laboratory, Department of Cell and Molecular Biology, Uppsala University
8. Science for Life Laboratory, School of Biotechnology, Division of Gene Technology, Royal Institute of Technology

As whole-genome sequencing gets cheaper, it becomes viable to compare NGS data from normal and tumor samples of numerous patients. There are still many challenges, mostly regarding bioinformatics: datasets are huge, workflows are complex, and there are multiple tools to choose from for somatic and structural variant calling and for quality control.

We present CAW (Cancer Analysis Workflow), a complete open source pipeline to resolve somatic variants from WGS data. It is written in Nextflow, a domain-specific language for workflow building. We follow GATK best practices to align, realign and recalibrate short-read data in parallel for both the tumor and the normal sample. After these preprocessing steps, several somatic variant callers scan the resulting BAM files: MuTect1, MuTect2 and Strelka are used to find somatic SNVs and small indels. For structural variants we use Manta. Furthermore, we apply ASCAT to estimate sample heterogeneity, ploidy and CNVs.

The software can start the analysis from raw FASTQ files, from the realignment step, or directly with any subset of variant callers. At the end of the analysis, the resulting VCF files are merged to facilitate downstream processing, though the individual results are also retained. The workflow can accommodate further variant callers or CNV callers. It is also prepared to process normal, tumor and several relapse samples.

Besides variant calls, the workflow provides quality control reports presented by MultiQC. A Docker image is also available, and the open source software can be downloaded from https://github.com/SciLifeLab/CAW.
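
The entry points described in this abstract (raw FASTQ, the realignment step, or any subset of variant callers) can be illustrated with a minimal Python sketch. This is not CAW's Nextflow code: the `Step` names, the `plan` function and the `call:` task labels are hypothetical and only model the ordering stated in the text.

```python
from enum import IntEnum


class Step(IntEnum):
    """Pipeline stages in execution order (illustrative names, not CAW's own)."""
    MAPPING = 1          # align raw FASTQ reads
    REALIGNMENT = 2      # realign aligned reads
    RECALIBRATION = 3    # recalibrate the realigned reads
    VARIANT_CALLING = 4  # run the selected callers on the BAM files
    MERGE_VCF = 5        # merge the resulting VCF files


# Callers named in the abstract.
CALLERS = ("MuTect1", "MuTect2", "Strelka", "Manta", "ASCAT")

# The three starting points named in the abstract.
ENTRY_POINTS = (Step.MAPPING, Step.REALIGNMENT, Step.VARIANT_CALLING)


def plan(start, callers=CALLERS):
    """Return the ordered task names to run when starting at `start`."""
    if start not in ENTRY_POINTS:
        raise ValueError(f"{start.name} is not a supported entry point")
    unknown = set(callers) - set(CALLERS)
    if unknown:
        raise ValueError(f"unknown callers: {sorted(unknown)}")
    tasks = []
    for step in Step:
        if step < start:
            continue  # skip stages before the chosen entry point
        if step is Step.VARIANT_CALLING:
            # keep the canonical caller order, restricted to the requested subset
            tasks.extend(f"call:{c}" for c in CALLERS if c in callers)
        else:
            tasks.append(step.name.lower())
    return tasks


if __name__ == "__main__":
    print(plan(Step.REALIGNMENT, callers=("Strelka", "Manta")))
```

Starting at `Step.VARIANT_CALLING` with a single caller skips all preprocessing and leaves only that caller plus the merge step in the plan.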

# The Nordic Precision Medicine Initiative - Meeting No 5

## Sarek, a portable workflow for WGS analysis of germline and somatic mutations

Maxime Garcia 123*,
Szilveszter Juhos 123*,
Malin Larsson 456,
Teresita Díaz de Ståhl 13,
Johanna Sandgren 13,
Jesper Eisfeldt 73,
Sebastian DiLorenzo 85A,
Marcel Martin B5C,
Pall Olason 95A,
Phil Ewels B2C,
Björn Nystedt 95A*,
Monica Nistér 13,
Max Käller 2D

*Corresponding Author

1. Barntumörbanken, Dept. of Oncology Pathology;
2. Science for Life Laboratory;
3. Karolinska Institutet;
4. Dept. of Physics, Chemistry and Biology;
5. National Bioinformatics Infrastructure Sweden, Science for Life Laboratory;
6. Linköping University;
7. Clinical Genetics, Dept. of Molecular Medicine and Surgery;
8. Dept. of Medical Sciences;
9. Dept. of Cell and Molecular Biology;
A. Uppsala University;
B. Dept. of Biochemistry and Biophysics;
C. Stockholm University;
D. School of Biotechnology, Division of Gene Technology, Royal Institute of Technology

We present Sarek, a portable Open Source pipeline to resolve germline and somatic variants from WGS data. It is written in Nextflow, a domain-specific language for workflow building.
It processes normal samples or normal/tumor pairs (with the option to include matched relapses).

Sarek follows GATK best practices to prepare short-read data, processing the tumor and normal samples of a pair in parallel.
After these preprocessing steps, several variant callers scan the resulting BAM files: Manta for structural variants; Strelka and GATK HaplotypeCaller for germline variants; Freebayes, MuTect1, MuTect2 and Strelka for somatic variants; ASCAT to estimate sample heterogeneity, ploidy and CNVs.
At the end of the analysis, the resulting VCF files can be annotated by SnpEff and/or VEP to facilitate downstream processing.
Our ongoing effort focuses on filtering and prioritizing the annotated variants.

Sarek is based on Docker and Singularity containers, enabling version tracking, reproducibility and the handling of sensitive data.
It is designed with flexible environments in mind, such as a local fat node, an HTC cluster or a cloud environment like AWS.
The workflow can accommodate further variant callers.
Besides variant calls, the workflow provides quality control reports presented by MultiQC.
Checkpoints allow the software to be started from FastQ, BAM or VCF.
Besides WGS data, it can process inputs from WES or gene panels.
The pipeline currently uses GRCh37 or GRCh38 as a reference genome, and custom genomes can also be added.
It has been successfully used to analyze more than two hundred WGS samples sent to the National Genomics Infrastructure (Science for Life Laboratory) by different users.
The MIT-licensed Open Source code can be downloaded from GitHub.
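
The grouping of callers by variant type listed in this abstract can be written down as a small lookup table. The Python sketch below is illustrative only: the category keys and the `tools_for` helper are hypothetical names, not part of Sarek, and the tool lists are copied from the abstract's text.

```python
# Caller groups as listed in the abstract; the dictionary keys are
# hypothetical labels chosen for this sketch.
CALLERS_BY_CATEGORY = {
    "structural": ["Manta"],
    "germline": ["Strelka", "HaplotypeCaller"],
    "somatic": ["Freebayes", "MuTect1", "MuTect2", "Strelka"],
    "copy_number": ["ASCAT"],  # sample heterogeneity, ploidy and CNVs
}


def tools_for(categories):
    """Return the unique tools needed for the requested categories,
    in first-seen order, so a tool shared by two categories is listed once."""
    tools = []
    for category in categories:
        for tool in CALLERS_BY_CATEGORY[category]:
            if tool not in tools:
                tools.append(tool)
    return tools


if __name__ == "__main__":
    print(tools_for(["germline", "somatic"]))
```

Strelka appears in both the germline and the somatic group, which is why the helper de-duplicates while preserving order.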

# Journées Ouvertes en Biologie, Informatique et Mathématiques 2018

## Sarek, a portable workflow for WGS analysis of germline and somatic mutations

Maxime Garcia 123,
Szilveszter Juhos 123,
Malin Larsson 456,
Teresita Díaz de Ståhl 13,
Johanna Sandgren 13,
Jesper Eisfeldt 73,
Sebastian DiLorenzo 85A,
Marcel Martin B5C,
Pall Olason 95A,
Phil Ewels B2C,
Björn Nystedt 95A,
Monica Nistér 13,
Max Käller 2D

Max Käller <max.kaller@scilifelab.se>

1. Barntumörbanken, Dept. of Oncology Pathology;
2. Science for Life Laboratory;
3. Karolinska Institutet;
4. Dept. of Physics, Chemistry and Biology;
5. National Bioinformatics Infrastructure Sweden, Science for Life Laboratory;
6. Linköping University;
7. Clinical Genetics, Dept. of Molecular Medicine and Surgery;
8. Dept. of Medical Sciences;
9. Dept. of Cell and Molecular Biology;
A. Uppsala University;
B. Dept. of Biochemistry and Biophysics;
C. Stockholm University;
D. School of Biotechnology, Division of Gene Technology, Royal Institute of Technology

We present Sarek, a portable Open Source pipeline to resolve germline and somatic variants from WGS data. It is written in Nextflow, a domain-specific language for workflow building. It processes normal samples or normal/tumor pairs (with the option to include matched relapses).

Sarek follows GATK best practices to prepare short-read data, processing the tumor and normal samples of a pair in parallel. After these preprocessing steps, several variant callers scan the resulting BAM files: Manta for structural variants; Strelka and GATK HaplotypeCaller for germline variants; Freebayes, MuTect2 and Strelka for somatic variants; ASCAT and Control-FREEC to estimate sample heterogeneity, ploidy and CNVs. At the end of the analysis, the resulting VCF files can be annotated by SnpEff and/or VEP to facilitate downstream processing. Our ongoing effort focuses on filtering and prioritizing the annotated variants.

Sarek is based on Docker and Singularity containers, enabling version tracking, reproducibility and the handling of sensitive data. It is designed with flexible environments in mind, such as a local fat node, an HTC cluster or a cloud environment like AWS. The workflow is modular and can accommodate further variant callers. Besides variant calls, the workflow provides quality control reports presented by MultiQC. Checkpoints allow the software to be started from FastQ, BAM or VCF. Besides WGS data, it can process inputs from WES or gene panels.

The pipeline currently uses GRCh37 or GRCh38 as a reference genome, and custom genomes can also be added. It has been successfully used to analyze more than two hundred WGS samples sent to the National Genomics Infrastructure (Science for Life Laboratory) by different users. The MIT-licensed Open Source code can be downloaded from GitHub.

The authors thank the Swedish Childhood Cancer Foundation for funding Barntumörbanken. We would like to acknowledge support from Science for Life Laboratory, the National Genomics Infrastructure (NGI), and UPPMAX for providing assistance in massively parallel sequencing and computational infrastructure.