|
| 1 | +--- |
| 2 | +title: "Reading SNPs from Format 5 Files" |
| 3 | +output: |
| 4 | + rmarkdown::html_vignette: |
| 5 | + toc: true |
| 6 | +vignette: > |
| 7 | + %\VignetteIndexEntry{Reading SNPs from Format 5 Files} |
| 8 | + %\VignetteEngine{knitr::rmarkdown} |
| 9 | + %\VignetteEncoding{UTF-8} |
| 10 | +--- |
| 11 | + |
| 12 | +```{r, include = FALSE} |
| 13 | +knitr::opts_chunk$set( |
| 14 | + collapse = TRUE, |
| 15 | + comment = "#>" |
| 16 | +) |
| 17 | +``` |
| 18 | + |
| 19 | +```{r setup, echo = FALSE} |
| 20 | +library(BinaryDosage) |
| 21 | +``` |
| 22 | + |
| 23 | +# Introduction |
| 24 | + |
| 25 | +The BinaryDosage package provides three functions for reading individual SNPs |
| 26 | +from Format 5 binary dosage files. They return the same data but differ in how |
| 27 | +they manage memory and file connections, which matters when reading many SNPs |
| 28 | +in a loop. |
| 29 | + |
| 30 | +| Function | Allocates output? | Opens file per call? | |
| 31 | +|---|---|---| |
| 32 | +| `getbd5snp` | Yes — returns a new list | Yes | |
| 33 | +| `getbd5snp_buf` | No — writes into caller vectors | Yes | |
| 34 | +| `getbd5snp_con` | No — writes into caller vectors | No — reuses open connection | |
| 35 | + |
| 36 | +All three functions require the `"genetic-info"` list returned by `getbdinfo`. |
| 37 | + |
| 38 | +# Setup |
| 39 | + |
| 40 | +The examples below use the small bgzipped VCF file included with the package. |
| 41 | +First, convert it to a Format 5 file pair and load the metadata. |
| 42 | + |
| 43 | +```{r setup_files, message = FALSE, warning = FALSE} |
| 44 | +vcf_file <- system.file("extdata", "set1a.vcf.gz", package = "BinaryDosage") |
| 45 | +bdose_file <- file.path(tempdir(), "set1a.bdose") |
| 46 | +
|
| 47 | +vcftobd(vcffile = vcf_file, bdose_file = bdose_file) |
| 48 | +bd5 <- getbdinfo(bdose_file) |
| 49 | +
|
| 50 | +n_snps <- nrow(bd5$snps) |
| 51 | +n_samp <- nrow(bd5$samples) |
| 52 | +cat("SNPs:", n_snps, " Samples:", n_samp, "\n") |
| 53 | +``` |
| 54 | + |
| 55 | +--- |
| 56 | + |
| 57 | +# getbd5snp |
| 58 | + |
| 59 | +`getbd5snp` is the simplest interface. It opens the `.bdose` file, seeks to |
| 60 | +the requested SNP's compressed block, decompresses it, decodes the values, and |
| 61 | +returns them as a new list. The file is opened and closed on every call. |
| 62 | + |
| 63 | +## Parameters |
| 64 | + |
| 65 | +- `bd5info` — object returned by `getbdinfo` |
| 66 | +- `snp` — 1-based integer index or character SNP ID |
| 67 | + |
| 68 | +## Return value |
| 69 | + |
| 70 | +A list with four numeric vectors of length *n_samples*: |
| 71 | + |
| 72 | +- `dosage` — DS values in \[0, 2\]; `NA` = missing |
| 73 | +- `p0` — Pr(*g*=0) in \[0, 1\]; `NA` = missing |
| 74 | +- `p1` — Pr(*g*=1) in \[0, 1\]; `NA` = missing |
| 75 | +- `p2` — Pr(*g*=2) in \[0, 1\]; `NA` = missing |
| 76 | + |
| 77 | +## Reading a SNP by index |
| 78 | + |
| 79 | +```{r getbd5snp_index} |
| 80 | +snp1 <- getbd5snp(bd5info = bd5, snp = 1L) |
| 81 | +
|
| 82 | +knitr::kable( |
| 83 | + data.frame( |
| 84 | + SampleID = head(bd5$samples$sid, 10), |
| 85 | + Dosage = round(head(snp1$dosage, 10), 4), |
| 86 | + P_00 = round(head(snp1$p0, 10), 4), |
| 87 | + P_01 = round(head(snp1$p1, 10), 4), |
| 88 | + P_11 = round(head(snp1$p2, 10), 4) |
| 89 | + ), |
| 90 | + caption = paste("SNP", bd5$snps$snpid[1], "— first 10 subjects") |
| 91 | +) |
| 92 | +``` |
| 93 | + |
| 94 | +## Reading a SNP by ID |
| 95 | + |
| 96 | +```{r getbd5snp_id} |
| 97 | +snp3 <- getbd5snp(bd5info = bd5, snp = "1:12000:T:C") |
| 98 | +
|
| 99 | +knitr::kable( |
| 100 | + data.frame( |
| 101 | + SampleID = head(bd5$samples$sid, 10), |
| 102 | + Dosage = round(head(snp3$dosage, 10), 4), |
| 103 | + P_00 = round(head(snp3$p0, 10), 4), |
| 104 | + P_01 = round(head(snp3$p1, 10), 4), |
| 105 | + P_11 = round(head(snp3$p2, 10), 4) |
| 106 | + ), |
| 107 | + caption = "SNP 1:12000:T:C — first 10 subjects" |
| 108 | +) |
| 109 | +``` |
| 110 | + |
| 111 | +## Using getbd5snp in a loop |
| 112 | + |
| 113 | +`getbd5snp` is convenient but allocates a new list and opens the file on every |
| 114 | +call. For a small number of SNPs this is fine. The following calculates the |
| 115 | +alternate allele frequency for every SNP. |
| 116 | + |
| 117 | +```{r getbd5snp_loop} |
| 118 | +aaf <- numeric(n_snps) |
| 119 | +for (i in seq_len(n_snps)) { |
| 120 | + aaf[i] <- mean(getbd5snp(bd5, i)$dosage, na.rm = TRUE) / 2 |
| 121 | +} |
| 122 | +
|
| 123 | +knitr::kable( |
| 124 | + data.frame(snpid = bd5$snps$snpid, aaf = round(aaf, 4)), |
| 125 | + caption = "Alternate allele frequencies via getbd5snp" |
| 126 | +) |
| 127 | +``` |
| 128 | + |
| 129 | +--- |
| 130 | + |
| 131 | +# getbd5snp_buf |
| 132 | + |
| 133 | +`getbd5snp_buf` eliminates the per-call list allocation by writing results |
| 134 | +directly into four numeric vectors that the caller pre-allocates once before |
| 135 | +the loop. This avoids repeated garbage collection pressure when reading many |
| 136 | +SNPs. The file is still opened and closed on every call. |
| 137 | + |
| 138 | +## Parameters |
| 139 | + |
| 140 | +- `bd5info` — object returned by `getbdinfo` |
| 141 | +- `snp` — 1-based integer index or character SNP ID |
| 142 | +- `dosage`, `p0`, `p1`, `p2` — pre-allocated `numeric(n_samples)` vectors |
| 143 | + |
| 144 | +## Return value |
| 145 | + |
| 146 | +`NULL` invisibly. The four output vectors are updated in place. |
| 147 | + |
| 148 | +## Important: R copy-on-modify semantics |
| 149 | + |
| 150 | +The output vectors **must not have more than one R binding** at the call site. |
| 151 | +If another variable also points to the same vector, R's copy-on-modify rule |
| 152 | +will copy the vector before writing, so the caller's variable will not be |
| 153 | +updated. Always use plain, freshly allocated vectors: |
| 154 | + |
| 155 | +```r |
| 156 | +# Correct |
| 157 | +dosage <- numeric(n_samp) |
| 158 | +getbd5snp_buf(bd5, 1L, dosage, p0, p1, p2) |
| 159 | + |
| 160 | +# Wrong — dosage2 creates a second binding; the update may not be visible |
| 161 | +dosage2 <- dosage |
| 162 | +getbd5snp_buf(bd5, 1L, dosage, p0, p1, p2) |
| 163 | +``` |
| 164 | + |
| 165 | +## Example |
| 166 | + |
| 167 | +```{r getbd5snp_buf_loop} |
| 168 | +dosage <- numeric(n_samp) |
| 169 | +p0 <- numeric(n_samp) |
| 170 | +p1 <- numeric(n_samp) |
| 171 | +p2 <- numeric(n_samp) |
| 172 | +
|
| 173 | +aaf_buf <- numeric(n_snps) |
| 174 | +for (i in seq_len(n_snps)) { |
| 175 | + getbd5snp_buf(bd5info = bd5, snp = i, dosage = dosage, |
| 176 | + p0 = p0, p1 = p1, p2 = p2) |
| 177 | + aaf_buf[i] <- mean(dosage, na.rm = TRUE) / 2 |
| 178 | +} |
| 179 | +
|
| 180 | +knitr::kable( |
| 181 | + data.frame(snpid = bd5$snps$snpid, aaf = round(aaf_buf, 4)), |
| 182 | + caption = "Alternate allele frequencies via getbd5snp_buf" |
| 183 | +) |
| 184 | +``` |
| 185 | + |
| 186 | +--- |
| 187 | + |
| 188 | +# getbd5snp_con |
| 189 | + |
| 190 | +`getbd5snp_con` is the highest-performance option. It combines the |
| 191 | +pre-allocated vector approach of `getbd5snp_buf` with a persistent open file |
| 192 | +connection, so the `.bdose` file is opened once before the loop and kept open |
| 193 | +for all reads. The C-level read uses zlib directly rather than R's `memDecompress`. |
| 194 | + |
| 195 | +Use this when reading a large number of SNPs sequentially and minimizing |
| 196 | +elapsed time matters. |
| 197 | + |
| 198 | +## Workflow |
| 199 | + |
| 200 | +1. Call `openbd5con(bd5info)` to open the file and get a `"bd5con"` object. |
| 201 | +2. Call `getbd5snp_con` inside the loop, passing the `"bd5con"` object. |
| 202 | +3. Call `closebd5con` when finished to release the file handle promptly. |
| 203 | + (The connection is also closed automatically when the `"bd5con"` object |
| 204 | + is garbage-collected or when R exits, so the explicit close is optional |
| 205 | + but recommended.) |
| 206 | + |
| 207 | +## openbd5con |
| 208 | + |
| 209 | +Opens a persistent connection to the `.bdose` file. |
| 210 | + |
| 211 | +**Parameters** |
| 212 | + |
| 213 | +- `bd5info` — object returned by `getbdinfo` |
| 214 | + |
| 215 | +**Return value** |
| 216 | + |
| 217 | +An object of class `"bd5con"` to be passed to `getbd5snp_con` and |
| 218 | +`closebd5con`. |
| 219 | + |
| 220 | +## closebd5con |
| 221 | + |
| 222 | +Explicitly closes the connection. |
| 223 | + |
| 224 | +**Parameters** |
| 225 | + |
| 226 | +- `bd5con` — object returned by `openbd5con` |
| 227 | + |
| 228 | +**Return value** |
| 229 | + |
| 230 | +`NULL` invisibly. |
| 231 | + |
| 232 | +## getbd5snp_con |
| 233 | + |
| 234 | +**Parameters** |
| 235 | + |
| 236 | +- `bd5info` — object returned by `getbdinfo` |
| 237 | +- `snp` — 1-based integer index or character SNP ID |
| 238 | +- `dosage`, `p0`, `p1`, `p2` — pre-allocated `numeric(n_samples)` vectors |
| 239 | +- `bd5con` — object returned by `openbd5con` |
| 240 | + |
| 241 | +**Return value** |
| 242 | + |
| 243 | +`NULL` invisibly. The four output vectors are updated in place. |
| 244 | + |
| 245 | +The same copy-on-modify constraint as `getbd5snp_buf` applies. |
| 246 | + |
| 247 | +## Example |
| 248 | + |
| 249 | +```{r getbd5snp_con_loop} |
| 250 | +dosage <- numeric(n_samp) |
| 251 | +p0 <- numeric(n_samp) |
| 252 | +p1 <- numeric(n_samp) |
| 253 | +p2 <- numeric(n_samp) |
| 254 | +
|
| 255 | +con <- openbd5con(bd5) |
| 256 | +
|
| 257 | +aaf_con <- numeric(n_snps) |
| 258 | +for (i in seq_len(n_snps)) { |
| 259 | + getbd5snp_con(bd5info = bd5, snp = i, dosage = dosage, |
| 260 | + p0 = p0, p1 = p1, p2 = p2, bd5con = con) |
| 261 | + aaf_con[i] <- mean(dosage, na.rm = TRUE) / 2 |
| 262 | +} |
| 263 | +
|
| 264 | +closebd5con(con) |
| 265 | +
|
| 266 | +knitr::kable( |
| 267 | + data.frame(snpid = bd5$snps$snpid, aaf = round(aaf_con, 4)), |
| 268 | + caption = "Alternate allele frequencies via getbd5snp_con" |
| 269 | +) |
| 270 | +``` |
| 271 | + |
| 272 | +--- |
| 273 | + |
| 274 | +# Choosing a reader |
| 275 | + |
| 276 | +- **`getbd5snp`** — use when reading a small number of SNPs, or when |
| 277 | + simplicity matters more than speed. |
| 278 | +- **`getbd5snp_buf`** — use in loops where allocation overhead is measurable |
| 279 | + but the file seek cost is acceptable. |
| 280 | +- **`getbd5snp_con`** — use when reading all or most SNPs sequentially and |
| 281 | + elapsed time is important. The persistent connection avoids both the |
| 282 | + allocation and the file open/close on every iteration. |
| 283 | + |
| 284 | +All three functions produce identical results, as confirmed below. |
| 285 | + |
| 286 | +```{r verify} |
| 287 | +all(aaf == aaf_buf) |
| 288 | +all(aaf == aaf_con) |
| 289 | +``` |
| 290 | + |
| 291 | +```{r cleanup, include = FALSE} |
| 292 | +unlink(c(bdose_file, paste0(bdose_file, ".bdi"))) |
| 293 | +``` |
0 commit comments