Restore correct README: remove Travis badge and htslib section

Re-sync README.Rmd with the state of README.md from commit 353cb4d,
which had been edited directly without updating the .Rmd source.
Removes the Travis CI badge, removes the outdated htslib Important News
block, updates the format count to 5, marks Format 5 as recommended,
and reorganises the functions list into Format 5 and legacy sections.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
 [](https://app.codecov.io/gh/USCbiostats/BinaryDosage?branch=master)
 <!-- badges: end -->
 
 # Binary Dosage Files
 
-### Important News
-
-A new version of BinaryDosage has been developed that significantly reduces data read times by a factor of more than 10 times. This new version uses the hstlib libraries which greatly improves the read speed of VCF files. To compile this new version requires the installation of the [Rhtslib](https://bioconductor.org/packages/release/bioc/html/Rhtslib.html) library from Bioconductor.
-
-Data compression of the BinaryDosage formatted files has also been improved. We have had reports that the BinaryDosage formatted files were over 3 times larger than the gzipped VCF file. This was due to the compression routine not compressing SNPs with low minor allele frequencies (<0.01) well. When BinaryDosage was first written, imputation servers did not include many rare SNPs. This has changed since BinaryDosage was first written.
-
-To install the latest version of BinaryDosage, it is recommended the user have R 4.3.x or higher. If the user is using Windows, they will need to verify that the current version of [R tools](https://cran.r-project.org/bin/windows/Rtools/) is installed. If the user is using Linux or Mac OS X, the zlib development tools need to be installed, often named zlib1g-dev. For most systems, these tools are usually already loaded.
-
-The package [Rhtslib](https://bioconductor.org/packages/release/bioc/html/Rhtslib.html) from BioConductor needs to be installed using the following code.
-```{r, eval = F}
-if (!require("BiocManager", quietly = TRUE))
-install.packages("BiocManager")
-
-BiocManager::install("Rhtslib")
-```
-
-Once the preceding prerequisites are met the follow code will install the latest version of BinaryDosage.
-All BinaryDosage formatted files created with older versions are fully compatible with this new version of BinaryDosage. [GxEScanR](https://github.com/USCbiostats/GxEScanR) works with files created by all versions of BinaryDosage, including this new one.
-
-The information below is for the current release version of BinaryDosage. Visit the [htslib branch](https://github.com/USCbiostats/BinaryDosage/tree/htslib) or BinaryDosage for more information about the new version.
-
 ### Introduction
 
 Genotype imputation is an essential tool in genomics, enabling association testing with markers not directly genotyped, increasing statistical power, and facilitating data pooling between studies that employ different genotyping platforms. Two commonly used software packages for imputation are [minimac](https://genome.sph.umich.edu/wiki/Minimac) and [Impute2](http://mathgen.stats.ox.ac.uk/impute/impute_v2.html). Furthermore, services such as the [Michigan Imputation Server](https://imputationserver.sph.umich.edu/index.html) have made genotype imputation much more accessible and streamlined.
@@ -75,9 +44,9 @@ For GWAS/GWIS analysis of BinaryDosage files, please refer to the [**GxEScanR**]
-There are 4 formats for a binary dosage data set. Data sets in formats 1, 2, and 3 have 3 files, a sample information file, a SNP information file, and a genetic information file. Data sets in format 4 have just 1 file. This file contains all the information listed above and may contain the following information.
+There are 5 formats for a binary dosage data set. Data sets in formats 1, 2, and 3 have 3 files, a sample information file, a SNP information file, and a genetic information file. Data sets in format 4 have just 1 file. Format 5 uses per-SNP gzip compression and stores metadata in a companion RDS file (`.bdinfo`). This file contains all the information listed above and may contain the following information.
 
-**Note:** Format 4 is recommended and is the default value for all functions.
+**Note:** Format 5 is the recommended format for new data sets.
 
 - Additional SNP information
 + Alternate allele frequency
@@ -89,17 +58,28 @@ There are 4 formats for a binary dosage data set. Data sets in formats 1, 2, and
 + Sample size of each data set merged
 
 ### Functions
--**vcftobd** - Converts a VCF file to a Format 5 binary dosage data set
--**vcftobdlegacy** - Converts a VCF file to a legacy format (1-4) binary dosage data set
+
+#### Format 5 (recommended)
+
+-**vcftobd** - Converts a bgzipped VCF file to a Format 5 binary dosage data set (requires vcfppR)
+-**getbd5info** - Loads a Format 5 file pair and returns an R list (required for **getsnp**, **bdapply**, and **mergebd**)
+-**getbd5snp** - Reads a single SNP from a Format 5 file by index or ID
+-**updatebd** - Converts a legacy format (1–4) binary dosage file to Format 5
+-**subsetbd** - Creates a new Format 5 file containing a subset of SNPs and/or subjects from any binary dosage file (formats 1–5)
+-**mergebd** - Merges two or more Format 5 files into a single Format 5 file
+
+#### Legacy formats (1–4)
+
+-**vcftobdlegacy** - Converts a VCF file to a legacy format (1–4) binary dosage data set (deprecated; use **vcftobd** instead)
 -**gentobd** - Converts a GEN (impute2) file to a binary dosage data set
--**bdmerge** - Merges multiple binary dosage data sets into a single data set
--**getbdinfo** - Creates an R List containing information about a binary dosage data set (required for **getsnp** and **bdapply**)
--**getvcfinfo** - Creates an R List containing information about a VCF file (required for **vcfapply**)
--**getgeninfo** - Creates an R List containing information about a GEN file (required for **genapply**)
--**bdapply** - Applies a function to the data for each SNP in a binary dosage file (requires list returned by **getbdinfo**)
+-**bdmerge** - Merges multiple legacy binary dosage data sets into a single data set
+-**getbdinfo** - Creates an R list containing information about a binary dosage data set (required for **getsnp** and **bdapply**)
+-**getvcfinfo** - Creates an R list containing information about a VCF file (required for **vcfapply**)
+-**getgeninfo** - Creates an R list containing information about a GEN file (required for **genapply**)
+-**bdapply** - Applies a function to the data for each SNP in a binary dosage file (requires list returned by **getbdinfo** or **getbd5info**)
 -**vcfapply** - Applies a function to the data for each SNP in a VCF file (requires list returned by **getvcfinfo**)
 -**genapply** - Applies a function to the data for each SNP in a GEN file (requires list returned by **getgeninfo**)
--**getsnp** - Obtain genotype Dosages/Genotype Probabilities from a binary dosage file, outputs results to an R list
+-**getsnp** - Returns dosage and genotype probabilities for a single SNP from a binary dosage file