Skip to content

Commit e280a8d

Browse files
jimb3claude
andcommitted
Add bd5snpreaders vignette; update README with buf/con functions
New vignette covers all three Format 5 SNP readers: getbd5snp (simple, allocates per call), getbd5snp_buf (pre-allocated vectors), and getbd5snp_con (persistent connection via openbd5con/closebd5con). Explains when to use each, the R copy-on-modify constraint, and verifies all three produce identical results. README functions list updated to include getbd5snp_buf, openbd5con, closebd5con, and getbd5snp_con with brief descriptions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 1ff0bbe commit e280a8d

File tree

4 files changed

+309
-3
lines changed

4 files changed

+309
-3
lines changed

DESCRIPTION

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
Package: BinaryDosage
22
Title: Creates, Merges, and Reads Binary Dosage Files
3-
Version: 1.0.0.9023
3+
Version: 1.0.0.9024
44
Authors@R:
55
c(person(given = "John",
66
family = "Morrison",

README.Rmd

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -62,7 +62,11 @@ There are 5 formats for a binary dosage data set. Data sets in formats 1, 2, and
6262
#### Format 5 (recommended)
6363

6464
- **vcftobd** - Converts a bgzipped VCF file to a Format 5 binary dosage data set (requires vcfppR)
65-
- **getbd5snp** - Reads a single SNP from a Format 5 file by index or ID
65+
- **getbd5snp** - Reads a single SNP from a Format 5 file by index or ID; returns a list with dosage and genotype probabilities
66+
- **getbd5snp_buf** - Like **getbd5snp** but writes results into pre-allocated vectors, avoiding repeated memory allocation in loops
67+
- **openbd5con** - Opens a persistent connection to a Format 5 file for use with **getbd5snp_con**
68+
- **closebd5con** - Explicitly closes a connection opened by **openbd5con**
69+
- **getbd5snp_con** - Like **getbd5snp_buf** but reuses an open connection, eliminating file open/close overhead per SNP
6670
- **updatebd** - Converts a legacy format (1–4) binary dosage file to Format 5
6771
- **subsetbd** - Creates a new Format 5 file containing a subset of SNPs and/or subjects from any binary dosage file (formats 1–5)
6872
- **mergebd** - Merges two or more Format 5 files into a single Format 5 file

README.md

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -81,7 +81,16 @@ listed above and may contain the following information.
8181

8282
- **vcftobd** - Converts a bgzipped VCF file to a Format 5 binary dosage
8383
data set (requires vcfppR)
84-
- **getbd5snp** - Reads a single SNP from a Format 5 file by index or ID
84+
- **getbd5snp** - Reads a single SNP from a Format 5 file by index or
85+
ID; returns a list with dosage and genotype probabilities
86+
- **getbd5snp_buf** - Like **getbd5snp** but writes results into
87+
pre-allocated vectors, avoiding repeated memory allocation in loops
88+
- **openbd5con** - Opens a persistent connection to a Format 5 file for
89+
use with **getbd5snp_con**
90+
- **closebd5con** - Explicitly closes a connection opened by
91+
**openbd5con**
92+
- **getbd5snp_con** - Like **getbd5snp_buf** but reuses an open
93+
connection, eliminating file open/close overhead per SNP
8594
- **updatebd** - Converts a legacy format (1–4) binary dosage file to
8695
Format 5
8796
- **subsetbd** - Creates a new Format 5 file containing a subset of SNPs

vignettes/bd5snpreaders.Rmd

Lines changed: 293 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,293 @@
1+
---
2+
title: "Reading SNPs from Format 5 Files"
3+
output:
4+
rmarkdown::html_vignette:
5+
toc: true
6+
vignette: >
7+
%\VignetteIndexEntry{Reading SNPs from Format 5 Files}
8+
%\VignetteEngine{knitr::rmarkdown}
9+
%\VignetteEncoding{UTF-8}
10+
---
11+
12+
```{r, include = FALSE}
13+
knitr::opts_chunk$set(
14+
collapse = TRUE,
15+
comment = "#>"
16+
)
17+
```
18+
19+
```{r setup, echo = FALSE}
20+
library(BinaryDosage)
21+
```
22+
23+
# Introduction
24+
25+
The BinaryDosage package provides three functions for reading individual SNPs
26+
from Format 5 binary dosage files. They return the same data but differ in how
27+
they manage memory and file connections, which matters when reading many SNPs
28+
in a loop.
29+
30+
| Function | Allocates output? | Opens file per call? |
31+
|---|---|---|
32+
| `getbd5snp` | Yes — returns a new list | Yes |
33+
| `getbd5snp_buf` | No — writes into caller vectors | Yes |
34+
| `getbd5snp_con` | No — writes into caller vectors | No — reuses open connection |
35+
36+
All three functions require the `"genetic-info"` list returned by `getbdinfo`.
37+
38+
# Setup
39+
40+
The examples below use the small bgzipped VCF file included with the package.
41+
First, convert it to a Format 5 file pair and load the metadata.
42+
43+
```{r setup_files, message = FALSE, warning = FALSE}
44+
vcf_file <- system.file("extdata", "set1a.vcf.gz", package = "BinaryDosage")
45+
bdose_file <- file.path(tempdir(), "set1a.bdose")
46+
47+
vcftobd(vcffile = vcf_file, bdose_file = bdose_file)
48+
bd5 <- getbdinfo(bdose_file)
49+
50+
n_snps <- nrow(bd5$snps)
51+
n_samp <- nrow(bd5$samples)
52+
cat("SNPs:", n_snps, " Samples:", n_samp, "\n")
53+
```
54+
55+
---
56+
57+
# getbd5snp
58+
59+
`getbd5snp` is the simplest interface. It opens the `.bdose` file, seeks to
60+
the requested SNP's compressed block, decompresses it, decodes the values, and
61+
returns them as a new list. The file is opened and closed on every call.
62+
63+
## Parameters
64+
65+
- `bd5info` — object returned by `getbdinfo`
66+
- `snp` — 1-based integer index or character SNP ID
67+
68+
## Return value
69+
70+
A list with four numeric vectors of length *n_samples*:
71+
72+
- `dosage` — DS values in \[0, 2\]; `NA` = missing
73+
- `p0` — Pr(*g*=0) in \[0, 1\]; `NA` = missing
74+
- `p1` — Pr(*g*=1) in \[0, 1\]; `NA` = missing
75+
- `p2` — Pr(*g*=2) in \[0, 1\]; `NA` = missing
76+
77+
## Reading a SNP by index
78+
79+
```{r getbd5snp_index}
80+
snp1 <- getbd5snp(bd5info = bd5, snp = 1L)
81+
82+
knitr::kable(
83+
data.frame(
84+
SampleID = head(bd5$samples$sid, 10),
85+
Dosage = round(head(snp1$dosage, 10), 4),
86+
P_00 = round(head(snp1$p0, 10), 4),
87+
P_01 = round(head(snp1$p1, 10), 4),
88+
P_11 = round(head(snp1$p2, 10), 4)
89+
),
90+
caption = paste("SNP", bd5$snps$snpid[1], "— first 10 subjects")
91+
)
92+
```
93+
94+
## Reading a SNP by ID
95+
96+
```{r getbd5snp_id}
97+
snp3 <- getbd5snp(bd5info = bd5, snp = "1:12000:T:C")
98+
99+
knitr::kable(
100+
data.frame(
101+
SampleID = head(bd5$samples$sid, 10),
102+
Dosage = round(head(snp3$dosage, 10), 4),
103+
P_00 = round(head(snp3$p0, 10), 4),
104+
P_01 = round(head(snp3$p1, 10), 4),
105+
P_11 = round(head(snp3$p2, 10), 4)
106+
),
107+
caption = "SNP 1:12000:T:C — first 10 subjects"
108+
)
109+
```
110+
111+
## Using getbd5snp in a loop
112+
113+
`getbd5snp` is convenient but allocates a new list and opens the file on every
114+
call. For a small number of SNPs this is fine. The following calculates the
115+
alternate allele frequency for every SNP.
116+
117+
```{r getbd5snp_loop}
118+
aaf <- numeric(n_snps)
119+
for (i in seq_len(n_snps)) {
120+
aaf[i] <- mean(getbd5snp(bd5, i)$dosage, na.rm = TRUE) / 2
121+
}
122+
123+
knitr::kable(
124+
data.frame(snpid = bd5$snps$snpid, aaf = round(aaf, 4)),
125+
caption = "Alternate allele frequencies via getbd5snp"
126+
)
127+
```
128+
129+
---
130+
131+
# getbd5snp_buf
132+
133+
`getbd5snp_buf` eliminates the per-call list allocation by writing results
134+
directly into four numeric vectors that the caller pre-allocates once before
135+
the loop. This avoids repeated garbage collection pressure when reading many
136+
SNPs. The file is still opened and closed on every call.
137+
138+
## Parameters
139+
140+
- `bd5info` — object returned by `getbdinfo`
141+
- `snp` — 1-based integer index or character SNP ID
142+
- `dosage`, `p0`, `p1`, `p2` — pre-allocated `numeric(n_samples)` vectors
143+
144+
## Return value
145+
146+
`NULL` invisibly. The four output vectors are updated in place.
147+
148+
## Important: R copy-on-modify semantics
149+
150+
The output vectors **must not have more than one R binding** at the call site.
151+
If another variable also points to the same vector, R's copy-on-modify rule
152+
will copy the vector before writing, so the caller's variable will not be
153+
updated. Always use plain, freshly allocated vectors:
154+
155+
```r
156+
# Correct
157+
dosage <- numeric(n_samp)
158+
getbd5snp_buf(bd5, 1L, dosage, p0, p1, p2)
159+
160+
# Wrong — dosage2 creates a second binding; the update may not be visible
161+
dosage2 <- dosage
162+
getbd5snp_buf(bd5, 1L, dosage, p0, p1, p2)
163+
```
164+
165+
## Example
166+
167+
```{r getbd5snp_buf_loop}
168+
dosage <- numeric(n_samp)
169+
p0 <- numeric(n_samp)
170+
p1 <- numeric(n_samp)
171+
p2 <- numeric(n_samp)
172+
173+
aaf_buf <- numeric(n_snps)
174+
for (i in seq_len(n_snps)) {
175+
getbd5snp_buf(bd5info = bd5, snp = i, dosage = dosage,
176+
p0 = p0, p1 = p1, p2 = p2)
177+
aaf_buf[i] <- mean(dosage, na.rm = TRUE) / 2
178+
}
179+
180+
knitr::kable(
181+
data.frame(snpid = bd5$snps$snpid, aaf = round(aaf_buf, 4)),
182+
caption = "Alternate allele frequencies via getbd5snp_buf"
183+
)
184+
```
185+
186+
---
187+
188+
# getbd5snp_con
189+
190+
`getbd5snp_con` is the highest-performance option. It combines the
191+
pre-allocated vector approach of `getbd5snp_buf` with a persistent open file
192+
connection, so the `.bdose` file is opened once before the loop and kept open
193+
for all reads. The C-level read uses zlib directly rather than R's `memDecompress`.
194+
195+
Use this when reading a large number of SNPs sequentially and minimizing
196+
elapsed time matters.
197+
198+
## Workflow
199+
200+
1. Call `openbd5con(bd5info)` to open the file and get a `"bd5con"` object.
201+
2. Call `getbd5snp_con` inside the loop, passing the `"bd5con"` object.
202+
3. Call `closebd5con` when finished to release the file handle promptly.
203+
(The connection is also closed automatically when the `"bd5con"` object
204+
is garbage-collected or when R exits, so the explicit close is optional
205+
but recommended.)
206+
207+
## openbd5con
208+
209+
Opens a persistent connection to the `.bdose` file.
210+
211+
**Parameters**
212+
213+
- `bd5info` — object returned by `getbdinfo`
214+
215+
**Return value**
216+
217+
An object of class `"bd5con"` to be passed to `getbd5snp_con` and
218+
`closebd5con`.
219+
220+
## closebd5con
221+
222+
Explicitly closes the connection.
223+
224+
**Parameters**
225+
226+
- `bd5con` — object returned by `openbd5con`
227+
228+
**Return value**
229+
230+
`NULL` invisibly.
231+
232+
## getbd5snp_con
233+
234+
**Parameters**
235+
236+
- `bd5info` — object returned by `getbdinfo`
237+
- `snp` — 1-based integer index or character SNP ID
238+
- `dosage`, `p0`, `p1`, `p2` — pre-allocated `numeric(n_samples)` vectors
239+
- `bd5con` — object returned by `openbd5con`
240+
241+
**Return value**
242+
243+
`NULL` invisibly. The four output vectors are updated in place.
244+
245+
The same copy-on-modify constraint as `getbd5snp_buf` applies.
246+
247+
## Example
248+
249+
```{r getbd5snp_con_loop}
250+
dosage <- numeric(n_samp)
251+
p0 <- numeric(n_samp)
252+
p1 <- numeric(n_samp)
253+
p2 <- numeric(n_samp)
254+
255+
con <- openbd5con(bd5)
256+
257+
aaf_con <- numeric(n_snps)
258+
for (i in seq_len(n_snps)) {
259+
getbd5snp_con(bd5info = bd5, snp = i, dosage = dosage,
260+
p0 = p0, p1 = p1, p2 = p2, bd5con = con)
261+
aaf_con[i] <- mean(dosage, na.rm = TRUE) / 2
262+
}
263+
264+
closebd5con(con)
265+
266+
knitr::kable(
267+
data.frame(snpid = bd5$snps$snpid, aaf = round(aaf_con, 4)),
268+
caption = "Alternate allele frequencies via getbd5snp_con"
269+
)
270+
```
271+
272+
---
273+
274+
# Choosing a reader
275+
276+
- **`getbd5snp`** — use when reading a small number of SNPs, or when
277+
simplicity matters more than speed.
278+
- **`getbd5snp_buf`** — use in loops where allocation overhead is measurable
279+
but the file seek cost is acceptable.
280+
- **`getbd5snp_con`** — use when reading all or most SNPs sequentially and
281+
elapsed time is important. The persistent connection avoids both the
282+
allocation and the file open/close on every iteration.
283+
284+
All three functions produce identical results, as confirmed below.
285+
286+
```{r verify}
287+
all(aaf == aaf_buf)
288+
all(aaf == aaf_con)
289+
```
290+
291+
```{r cleanup, include = FALSE}
292+
unlink(c(bdose_file, paste0(bdose_file, ".bdi")))
293+
```

0 commit comments

Comments
 (0)