@@ -5,12 +5,73 @@ BinaryDosage: Creates, Merges, and Reads Binary Dosage Files
55
66[![AppVeyor build
77status](https://ci.appveyor.com/api/projects/status/github/USCbiostats/BinaryDosage?branch=master&svg=true)](https://ci.appveyor.com/project/USCbiostats/BinaryDosage)
8+ [![Travis build
9+ status](https://travis-ci.org/USCbiostats/BinaryDosage.svg?branch=master)](https://app.travis-ci.com/USCbiostats/BinaryDosage)
810[![Codecov test
911coverage](https://codecov.io/gh/USCbiostats/BinaryDosage/branch/master/graph/badge.svg)](https://app.codecov.io/gh/USCbiostats/BinaryDosage?branch=master)
1012<!-- badges: end -->
1113
1214# Binary Dosage Files
1315
16+ ### Important News
17+
18+ A new version of BinaryDosage has been developed that reduces data
19+ read times by a factor of more than 10. This new version uses the
20+ htslib library, which greatly improves the read speed of VCF files.
21+ Compiling this new version requires installing
22+ the
23+ [Rhtslib](https://bioconductor.org/packages/release/bioc/html/Rhtslib.html)
24+ library from Bioconductor.
25+
26+ Data compression of the BinaryDosage formatted files has also been
27+ improved. We have had reports that the BinaryDosage formatted files were
28+ over 3 times larger than the gzipped VCF file. This was due to the
29+ compression routine not compressing SNPs with low minor allele
30+ frequencies (\< 0.01) well. When BinaryDosage was first written,
31+ imputation servers did not include many rare SNPs. This is no longer
32+ the case.
33+
34+ To install the latest version of BinaryDosage, it is recommended the
35+ user have R 4.3.x or higher. If the user is using Windows, they will
36+ need to verify that the current version of [R
37+ tools](https://cran.r-project.org/bin/windows/Rtools/) is installed. If
38+ the user is using Linux or Mac OS X, the zlib development tools need to
39+ be installed (the package is often named zlib1g-dev). On most systems,
40+ these tools are already installed.
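
Before installing, it can help to confirm which prerequisites are already in place. The following is a minimal sketch using only base R; the package list is an assumption based on the packages named in this section, and it does not check for Rtools or zlib.

```r
# Report whether the running R version meets the recommendation above.
r_ok <- getRversion() >= "4.3.0"

# Check for the R packages named in this section without attaching them.
# requireNamespace() returns TRUE or FALSE instead of raising an error.
needed <- c("BiocManager", "Rhtslib", "devtools")
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

message("R >= 4.3: ", r_ok)
print(installed)
```

Any package reported as `FALSE` still needs to be installed before continuing.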
41+
42+ The package
43+ [Rhtslib](https://bioconductor.org/packages/release/bioc/html/Rhtslib.html)
44+ from Bioconductor needs to be installed using the following code.
45+
46+ ``` r
47+ if (!require("BiocManager", quietly = TRUE))
48+     install.packages("BiocManager")
49+
50+ BiocManager::install("Rhtslib")
51+ ```
52+
53+ Once the preceding prerequisites are met, the following code will install
54+ the latest version of BinaryDosage.
55+
56+ ``` r
57+ remove.packages("BinaryDosage")
58+ devtools::install_github("https://github.com/USCbiostats/BinaryDosage@htslib")
59+
60+ library(BinaryDosage)
61+ ```
62+
63+ #### Important
64+
65+ All BinaryDosage formatted files created with older versions are fully
66+ compatible with this new version of BinaryDosage.
67+ [GxEScanR](https://github.com/USCbiostats/GxEScanR) works with files
68+ created by all versions of BinaryDosage, including this new one.
69+
70+ The information below is for the current release version of
71+ BinaryDosage. Visit the [htslib
72+ branch](https://github.com/USCbiostats/BinaryDosage/tree/htslib) of
73+ BinaryDosage for more information about the new version.
74+
1475### Introduction
1576
1677Genotype imputation is an essential tool in genomics, enabling
@@ -48,24 +109,23 @@ For GWAS/GWIS analysis of BinaryDosage files, please refer to the
48109 - Family ID
49110 - Subject ID
50111- SNP information
51- - Chromosome number
52- - SNP ID
53- - Location in base pairs
54- - Reference allele
55- - Alternate allele
112+ - Chromosome number
113+ - SNP ID
114+ - Location in base pairs
115+ - Reference allele
116+ - Alternate allele
56117- Genetic information
57118 - Dosage values
58119 - Genotype probabilities, Pr(*g=0*), Pr(*g=1*), Pr(*g=2*)
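
The two kinds of genetic information are related. Under the usual convention, which is assumed here, the dosage is the expected number of alternate alleles, so it can be derived from the genotype probabilities. The values below are made up for illustration.

```r
# Hypothetical genotype probabilities for three subjects at one SNP.
p0 <- c(0.90, 0.10, 0.00)  # Pr(g=0)
p1 <- c(0.10, 0.80, 0.05)  # Pr(g=1)
p2 <- c(0.00, 0.10, 0.95)  # Pr(g=2)

# Each subject's probabilities should sum to 1 (within rounding).
stopifnot(all(abs(p0 + p1 + p2 - 1) < 1e-8))

# Dosage as the expected number of alternate alleles (assumed convention).
dosage <- p1 + 2 * p2
dosage
```

The reverse does not hold in general: a dosage value alone does not determine the three probabilities, which is why a data set may store both.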
59120
60- There are 5 formats for a binary dosage data set. Data sets in formats
121+ There are 4 formats for a binary dosage data set. Data sets in formats
611221, 2, and 3 have 3 files, a sample information file, a SNP information
62123file, and a genetic information file. Data sets in format 4 have just 1
63- file. Format 5 uses per-SNP gzip compression and stores metadata in a
64- companion RDS file (` .bdose.bdi ` ), named by appending ` .bdi ` to the
65- ` .bdose ` filename. This file contains all the information listed above
66- and may contain the following information.
124+ file. This file contains all the information listed above and may
125+ contain the following information.
67126
68- ** Note:** Format 5 is the recommended format for new data sets.
127+ **Note:** Format 4 is recommended and is the default value for all
128+ functions.
69129
70130- Additional SNP information
71131 - Alternate allele frequency
@@ -78,27 +138,27 @@ and may contain the following information.
78138
79139### Functions
80140
81- #### Format 5 (recommended)
82-
83- - ** vcftobd ** - Converts a bgzipped VCF file to a Format 5 binary dosage data set (requires vcfppR)
84- - ** getbd5info ** - Loads a Format 5 file pair and returns an R list (required for ** getsnp ** , ** bdapply ** , and ** mergebd ** )
85- - ** getbd5snp ** - Reads a single SNP from a Format 5 file by index or ID
86- - ** updatebd ** - Converts a legacy format (1–4) binary dosage file to Format 5
87- - ** subsetbd ** - Creates a new Format 5 file containing a subset of SNPs and/or subjects from any binary dosage file (formats 1–5)
88- - ** mergebd ** - Merges two or more Format 5 files into a single Format 5 file
89-
90- #### Legacy formats (1–4)
91-
92- - ** vcftobdlegacy ** - Converts a VCF file to a legacy format (1–4) binary dosage data set (deprecated; use ** vcftobd ** instead)
93- - ** gentobd ** - Converts a GEN (impute2) file to a binary dosage data set
94- - ** bdmerge ** - Merges multiple legacy binary dosage data sets into a single data set
95- - ** getbdinfo ** - Creates an R list containing information about a binary dosage data set (required for ** getsnp ** and ** bdapply ** )
96- - ** getvcfinfo ** - Creates an R list containing information about a VCF file (required for ** vcfapply ** )
97- - ** getgeninfo ** - Creates an R list containing information about a GEN file (required for ** genapply ** )
98- - ** bdapply ** - Applies a function to the data for each SNP in a binary dosage file (requires list returned by ** getbdinfo ** or ** getbd5info ** )
99- - ** vcfapply ** - Applies a function to the data for each SNP in a VCF file (requires list returned by ** getvcfinfo ** )
100- - ** genapply ** - Applies a function to the data for each SNP in a GEN file (requires list returned by ** getgeninfo ** )
101- - ** getsnp ** - Returns dosage and genotype probabilities for a single SNP from a binary dosage file
141+ - **vcftobd** - Converts a VCF file to a Format 5 binary dosage data set
142+ - **vcftobdlegacy** - Converts a VCF file to a legacy format (1-4)
143+ binary dosage data set
144+ - **gentobd** - Converts a GEN (impute2) file to a binary dosage data
145+ set
146+ - **bdmerge** - Merges multiple binary dosage data sets into a single
147+ data set
148+ - **getbdinfo** - Creates an R list containing information about a
149+ binary dosage data set (required for **getsnp** and **bdapply**)
150+ - **getvcfinfo** - Creates an R list containing information about a VCF
151+ file (required for **vcfapply**)
152+ - **getgeninfo** - Creates an R list containing information about a GEN
153+ file (required for **genapply**)
154+ - **bdapply** - Applies a function to the data for each SNP in a binary
155+ dosage file (requires list returned by **getbdinfo**)
156+ - **vcfapply** - Applies a function to the data for each SNP in a VCF
157+ file (requires list returned by **getvcfinfo**)
158+ - **genapply** - Applies a function to the data for each SNP in a GEN
159+ file (requires list returned by **getgeninfo**)
160+ - **getsnp** - Returns the dosages and genotype probabilities for a SNP
161+ from a binary dosage file as an R list
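
As an illustration of the kind of per-SNP function the apply functions are meant for, the sketch below computes an alternate allele frequency from dosages in plain R. The function name `calculate_aaf` and its four-argument shape (dosage plus the three genotype probabilities) are assumptions made for this example; check the package help files for the arguments the apply functions actually pass.

```r
# Alternate allele frequency from dosages: each subject carries two alleles,
# so the mean dosage is divided by 2. The p0, p1, p2 arguments are accepted
# but unused; this argument shape is an assumption, not a documented API.
calculate_aaf <- function(dosage, p0, p1, p2) {
  mean(dosage, na.rm = TRUE) / 2
}

# Hypothetical dosages for five subjects at one SNP.
dosage <- c(0, 1, 2, 0.5, 1.5)
aaf <- calculate_aaf(dosage, p0 = NULL, p1 = NULL, p2 = NULL)
aaf

# With a real data set, the same function might be passed along these lines
# (illustrative only, not run here):
#   bdinfo <- getbdinfo(bdfiles = "path/to/file.bdose")
#   bdapply(bdinfo = bdinfo, func = calculate_aaf)
```

Keeping the per-SNP calculation in a small standalone function makes it easy to test on made-up vectors before running it over a full file.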
102162
103163# Installation
104164
@@ -114,6 +174,15 @@ devtools::install_github("https://github.com/USCbiostats/BinaryDosage")
114174library(BinaryDosage)
115175```
116176
177+ To install the package with vignettes built, use the `build_vignettes`
178+ option:
179+
180+ ``` r
181+ devtools::install_github("https://github.com/USCbiostats/BinaryDosage", build_vignettes = TRUE)
182+
183+ library(BinaryDosage)
184+ ```
185+
117186# Usage
118187
119188#### General Workflow
@@ -210,11 +279,11 @@ mergebd3 <- tempfile()
210279Converting a VCF file into a binary dosage file is simple. The user
211280passes the names of the VCF and information files along with the name
212281for the binary dosage file to the
213- <span style =" font-family :Courier " >vcftobdlegacy</span > function. There are
214- some options available for the
215- <span style =" font-family :Courier " >vcftobdlegacy</span > functions such as using
216- gz compressed files vcf files. More information about these options can
217- be found using the help files or reading the vignette
282+ <span style =" font-family :Courier " >vcftobdlegacy</span > function. There
283+ are some options available for the
284+ <span style =" font-family :Courier " >vcftobdlegacy</span > function, such as
285+ using gz-compressed VCF files. More information about these
286+ options can be found in the help files or in the vignette
218287<span style =" font-family :Courier " >usingvcffiles</span >.
219288
220289The following commands convert VCF data sets 1a and 1b into the binary