Commit d64a488

jimb3 and claude committed
Add mergebd vignette; mark bdmerge as legacy
Add vignettes/mergingbd5files.Rmd covering mergebd with subject-merge and SNP-merge examples. Add legacy notice to mergingfiles.Rmd linking to the new vignette. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 0d532c3 commit d64a488

3 files changed: +175 -1 lines changed

DESCRIPTION

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 Package: BinaryDosage
 Title: Creates, Merges, and Reads Binary Dosage Files
-Version: 1.0.0.9026
+Version: 1.0.0.9027
 Authors@R:
     c(person(given = "John",
              family = "Morrison",

vignettes/mergingbd5files.Rmd

Lines changed: 167 additions & 0 deletions
@@ -0,0 +1,167 @@
+---
+title: "Merging Format 5 Binary Dosage Files"
+output:
+  rmarkdown::html_vignette:
+    toc: true
+vignette: >
+  %\VignetteIndexEntry{Merging Format 5 Binary Dosage Files}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>"
+)
+```
+
+```{r setup, echo = FALSE}
+library(BinaryDosage)
+```
+
+# Introduction
+
+The `mergebd` function merges two or more Format 5 binary dosage files into a
+single Format 5 output file. The merge type is detected automatically from the
+input files.
+
+- **Subject merge** — subject IDs do not overlap across files. The output
+  contains all subjects from every input file and the SNPs common to all files.
+- **SNP merge** — SNP IDs do not overlap across files. The output contains all
+  SNPs from every input file and the subjects common to all files.
+
+If both subject IDs and SNP IDs overlap across files an error is returned,
+since the merge type is ambiguous.
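
Reviewer note (not part of the commit): the detection rule described in this introduction can be sketched in base R roughly as follows. `detect_merge_type` is a hypothetical helper written only for illustration; it is not a BinaryDosage function and may not match how `mergebd` is implemented. The vignette does not say what happens when neither subject IDs nor SNP IDs overlap, so the sketch simply refuses that case, which is an assumption.

```r
# Hypothetical helper (not part of BinaryDosage): classify a merge from the
# per-file subject IDs and SNP IDs, each given as a list of character vectors.
detect_merge_type <- function(subject_ids, snp_ids) {
  # TRUE if any ID occurs in more than one of the per-file vectors
  overlaps <- function(ids) {
    all_ids <- unlist(lapply(ids, unique), use.names = FALSE)
    anyDuplicated(all_ids) > 0
  }
  subjects_overlap <- overlaps(subject_ids)
  snps_overlap <- overlaps(snp_ids)

  if (subjects_overlap && snps_overlap) {
    stop("Subject IDs and SNP IDs both overlap; the merge type is ambiguous")
  }
  if (!subjects_overlap && !snps_overlap) {
    # Assumption: the vignette does not describe this case
    stop("Neither subject IDs nor SNP IDs overlap; nothing in common to merge")
  }
  if (!subjects_overlap) "subject" else "snp"
}

# Two files with disjoint subjects and identical SNPs
merge_type <- detect_merge_type(
  subject_ids = list(c("I1", "I2"), c("I3", "I4")),
  snp_ids     = list(c("1:1000:A:G", "1:2000:C:T"),
                     c("1:1000:A:G", "1:2000:C:T"))
)
```

Here `merge_type` is `"subject"`, because only the SNP IDs are shared between the two inputs.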
+
+SNPs are identified by chromosome, position, reference allele, and alternate
+allele, regardless of the SNP ID format stored in each file.
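
Reviewer note (not part of the commit): one way to picture this matching rule is a composite key built from the four fields. The sketch below uses made-up data frames; the column names other than `location` (which the vignette itself uses) are assumptions, and `snp_key` is a hypothetical helper rather than a package function.

```r
# Hypothetical SNP tables from two files that store SNP IDs in different formats
snps_a <- data.frame(
  snpid      = c("rs1", "rs2"),
  chromosome = c("1", "1"),
  location   = c(10000L, 11000L),
  reference  = c("A", "C"),
  alternate  = c("G", "T"),
  stringsAsFactors = FALSE
)
snps_b <- data.frame(
  snpid      = c("1:10000", "1:12000"),
  chromosome = c("1", "1"),
  location   = c(10000L, 12000L),
  reference  = c("A", "G"),
  alternate  = c("G", "A"),
  stringsAsFactors = FALSE
)

# Build a key from chromosome, position, reference and alternate allele,
# deliberately ignoring the snpid column
snp_key <- function(snps) {
  paste(snps$chromosome, snps$location, snps$reference, snps$alternate,
        sep = ":")
}

# SNPs present in both files according to the composite key
common_snps <- intersect(snp_key(snps_a), snp_key(snps_b))
```

Even though the first SNP is named `rs1` in one table and `1:10000` in the other, both map to the same key, so `common_snps` contains that one SNP.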
+
+The function takes the following parameters.
+
+- `bdose_files` — character vector of paths to the input `.bdose` files (at
+  least two). The companion `.bdi` file for each is expected at
+  `paste0(bdose_files[i], ".bdi")`.
+- `bdose_file` — path for the output `.bdose` file. The companion `.bdi` file
+  is written automatically to `paste0(bdose_file, ".bdi")`.
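
Reviewer note (not part of the commit): because the `.bdi` path is derived by appending a suffix to the full `.bdose` path, a caller could check that every companion file exists before merging. The file names below are hypothetical and the check uses base R only.

```r
# Hypothetical input paths; nothing is created on disk by this sketch
bdose_files <- file.path(tempdir(), c("batch1.bdose", "batch2.bdose"))

# Companion index paths, following the paste0() convention in the vignette
bdi_files <- paste0(bdose_files, ".bdi")

# Report which of the expected files are absent
expected <- c(bdose_files, bdi_files)
missing_files <- expected[!file.exists(expected)]
if (length(missing_files) > 0) {
  message("Missing input files: ",
          paste(basename(missing_files), collapse = ", "))
}
```

Note that the suffix is appended rather than substituted, so `batch1.bdose` pairs with `batch1.bdose.bdi`.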
+
+# Setup
+
+The examples below use the bgzipped VCF file included with the package,
+*set1a.vcf.gz*, which contains data for 60 subjects and 10 SNPs on chromosome
+1. All output files are written to a temporary directory.
+
+```{r setup_files, message = FALSE, warning = FALSE}
+vcf_file <- system.file("extdata", "set1a.vcf.gz", package = "BinaryDosage")
+bdose_full <- file.path(tempdir(), "full.bdose")
+
+vcftobd(vcffile = vcf_file, bdose_file = bdose_full)
+bd_full <- getbdinfo(bdose_full)
+
+cat("Subjects:", nrow(bd_full$samples), "\n")
+cat("SNPs: ", nrow(bd_full$snps), "\n")
+```
+
+# Subject merge
+
+A subject merge combines files that cover different subjects but the same (or
+overlapping) set of SNPs. The output contains all subjects and the SNPs common
+to every input file.
+
+The example splits the 60-subject file into two 30-subject files using
+`subsetbd`, then merges them back together.
+
+```{r subject_merge, message = FALSE, warning = FALSE}
+bdose_a <- file.path(tempdir(), "set_a.bdose")
+bdose_b <- file.path(tempdir(), "set_b.bdose")
+bdose_out <- file.path(tempdir(), "merged_subjects.bdose")
+
+sids <- bd_full$samples$sid
+
+subsetbd(bdfiles = bdose_full,
+         bdose_file = bdose_a,
+         subjectids = sids[1:30])
+
+subsetbd(bdfiles = bdose_full,
+         bdose_file = bdose_b,
+         subjectids = sids[31:60])
+
+mergebd(bdose_files = c(bdose_a, bdose_b),
+        bdose_file = bdose_out)
+
+bd_a <- getbdinfo(bdose_a)
+bd_b <- getbdinfo(bdose_b)
+bd_out <- getbdinfo(bdose_out)
+
+cat("File A subjects:", nrow(bd_a$samples), "\n")
+cat("File B subjects:", nrow(bd_b$samples), "\n")
+cat("Merged subjects:", nrow(bd_out$samples), "\n")
+cat("Merged SNPs: ", nrow(bd_out$snps), "\n")
+```
+
+The merged file contains all 60 subjects and all 10 SNPs.
+
+## Verifying subject order
+
+The subjects in the merged file appear in input-file order: all subjects from
+the first file followed by all subjects from the second file.
+
+```{r subject_order}
+knitr::kable(bd_out$samples, caption = "Subjects in merged file")
+```
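
Reviewer note (not part of the commit): the ordering claim could also be checked programmatically instead of by reading the table. The sketch below uses stand-in ID vectors so that it runs on its own; in the vignette the equivalent inputs would presumably be `bd_out$samples$sid`, `bd_a$samples$sid` and `bd_b$samples$sid`. `in_input_file_order` is a hypothetical helper.

```r
# Hypothetical helper: TRUE when the merged IDs equal the per-file IDs
# concatenated in the order the files were supplied
in_input_file_order <- function(merged_ids, input_ids) {
  identical(merged_ids, unlist(input_ids, use.names = FALSE))
}

# Stand-in subject IDs for two input files and two candidate merged results
sid_a <- c("I1", "I2", "I3")
sid_b <- c("I4", "I5", "I6")

ordered_ok  <- in_input_file_order(c("I1", "I2", "I3", "I4", "I5", "I6"),
                                   list(sid_a, sid_b))
ordered_bad <- in_input_file_order(c("I4", "I5", "I6", "I1", "I2", "I3"),
                                   list(sid_a, sid_b))
```

`ordered_ok` is `TRUE` and `ordered_bad` is `FALSE`; a `stopifnot()` around such a check would make the vignette fail loudly if the ordering guarantee ever changed.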
+
+# SNP merge
+
+A SNP merge combines files that cover different SNPs but the same (or
+overlapping) set of subjects. The output contains all SNPs and the subjects
+common to every input file.
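
Reviewer note (not part of the commit): "common to every input file" is a set intersection across all inputs. A minimal base R sketch with made-up subject IDs follows; it illustrates the set logic only and makes no claim about how `mergebd` computes it internally.

```r
# Hypothetical subject IDs found in three input files
subject_ids <- list(
  c("I1", "I2", "I3", "I4"),
  c("I2", "I3", "I4", "I5"),
  c("I3", "I4", "I6")
)

# Fold intersect() over the list to keep only IDs present in every file
common_subjects <- Reduce(intersect, subject_ids)
```

With these inputs `common_subjects` holds `I3` and `I4`, the only IDs that occur in all three vectors.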
+
+The example splits the 10-SNP file into two 5-SNP files using `subsetbd`, then
+merges them back together.
+
+```{r snp_merge, message = FALSE, warning = FALSE}
+bdose_snp_a <- file.path(tempdir(), "snp_a.bdose")
+bdose_snp_b <- file.path(tempdir(), "snp_b.bdose")
+bdose_snp_out <- file.path(tempdir(), "merged_snps.bdose")
+
+locs <- bd_full$snps$location
+
+subsetbd(bdfiles = bdose_full,
+         bdose_file = bdose_snp_a,
+         locations = locs[1:5])
+
+subsetbd(bdfiles = bdose_full,
+         bdose_file = bdose_snp_b,
+         locations = locs[6:10])
+
+mergebd(bdose_files = c(bdose_snp_a, bdose_snp_b),
+        bdose_file = bdose_snp_out)
+
+bd_snp_a <- getbdinfo(bdose_snp_a)
+bd_snp_b <- getbdinfo(bdose_snp_b)
+bd_snp_out <- getbdinfo(bdose_snp_out)
+
+cat("File A SNPs: ", nrow(bd_snp_a$snps), "\n")
+cat("File B SNPs: ", nrow(bd_snp_b$snps), "\n")
+cat("Merged SNPs: ", nrow(bd_snp_out$snps), "\n")
+cat("Merged subjects:", nrow(bd_snp_out$samples), "\n")
+```
+
+## Verifying SNP order
+
+SNPs appear in input-file order: all SNPs from the first file followed by all
+SNPs from the second file.
+
+```{r snp_order}
+knitr::kable(bd_snp_out$snps, caption = "SNPs in merged file")
+```
+
+```{r cleanup, include = FALSE}
+unlink(c(bdose_full, paste0(bdose_full, ".bdi"),
+         bdose_a, paste0(bdose_a, ".bdi"),
+         bdose_b, paste0(bdose_b, ".bdi"),
+         bdose_out, paste0(bdose_out, ".bdi"),
+         bdose_snp_a, paste0(bdose_snp_a, ".bdi"),
+         bdose_snp_b, paste0(bdose_snp_b, ".bdi"),
+         bdose_snp_out, paste0(bdose_snp_out, ".bdi")))
+```

vignettes/mergingfiles.Rmd

Lines changed: 7 additions & 0 deletions
@@ -18,6 +18,13 @@ knitr::opts_chunk$set(
 library(BinaryDosage)
 ```
 
+**Note:** `bdmerge` is a legacy function that operates on formats 1–4 only.
+For Format 5 files, use `mergebd` instead. See the
+[Merging Format 5 Binary Dosage Files](mergingbd5files.html) vignette for
+details.
+
+---
+
 Quite often subjects have their genotypes imputed in batches. The files returned by these imputation can be converted into binary dosage files. These binary files can be merged into a single file if they have the same SNPs and different subjects using the bdmerge routine.
 
 ## bdmerge
