anndataR is a Bioconductor R package that provides native R support for reading/writing .h5ad (AnnData) files and bidirectional conversion between AnnData objects and popular Bioconductor/Seurat formats.
Key Features:
- Native R reading/writing of .h5ad files (HDF5 backend)
- Three AnnData implementations: InMemoryAnnData, HDF5AnnData, ReticulateAnnData
- Bidirectional conversion: SingleCellExperiment ↔ AnnData ↔ Seurat
- S4 coercion system for seamless type conversion
- Python interoperability via reticulate
AbstractAnnData (abstract base class)
├── InMemoryAnnData (all data in RAM)
├── HDF5AnnData (HDF5-backed, lazy loading)
├── ReticulateAnnData (Python anndata wrapper via reticulate)
└── AnnDataView (lazy subsetting view, no data copying)
When to use each:
- InMemoryAnnData: Default for most operations, fast for small/medium datasets
- HDF5AnnData: Large datasets exceeding RAM, on-disk persistence
- ReticulateAnnData: Interoperability with Python anndata library, testing against Python behavior
- AnnDataView: Lazy subsetting/slicing without copying data; convert to concrete implementation when needed
- X: Main matrix (n_obs × n_vars)
- layers: Named list of alternative matrices with same dimensions as X
- obs: DataFrame with observation (cell) metadata (n_obs rows)
- var: DataFrame with variable (gene) metadata (n_vars rows)
- obsm: List of observation-aligned matrices (e.g., PCA, UMAP embeddings)
- varm: List of variable-aligned matrices
- obsp: List of observation pairwise matrices (e.g., cell-cell distances)
- varp: List of variable pairwise matrices
- uns: Unstructured metadata (arbitrary nested lists/dicts)
Important validation rules:
- All matrices in X/layers must have same shape: [n_obs, n_vars]
- Row names of obs/var define observation/variable names
- obsm/varm matrices must align: first dimension matches n_obs/n_vars
- obsp/varp must be square pairwise matrices
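The rules above can be illustrated with a small base-R check. This is only a sketch: check_alignment() is a hypothetical helper written for this example and is not part of anndataR, whose setters rely on the validation helpers in AbstractAnnData.

```r
# Hypothetical helper (not part of anndataR): stops on the first violated rule.
check_alignment <- function(n_obs, n_vars, layers = list(), obsm = list(), obsp = list()) {
  shape <- as.integer(c(n_obs, n_vars))
  for (name in names(layers)) {
    # Every layer must be exactly n_obs x n_vars
    if (!identical(dim(layers[[name]]), shape)) {
      stop("layers[['", name, "']] must have shape [n_obs, n_vars]")
    }
  }
  for (name in names(obsm)) {
    # obsm entries only need their first dimension to match n_obs
    if (!identical(nrow(obsm[[name]]), shape[1])) {
      stop("obsm[['", name, "']] must have n_obs rows")
    }
  }
  for (name in names(obsp)) {
    # obsp entries must be square: n_obs x n_obs
    if (!identical(dim(obsp[[name]]), shape[c(1, 1)])) {
      stop("obsp[['", name, "']] must have shape [n_obs, n_obs]")
    }
  }
  invisible(TRUE)
}

# A consistent set of slots passes
ok <- check_alignment(
  n_obs = 3, n_vars = 5,
  layers = list(counts = matrix(0, 3, 5)),
  obsm = list(X_pca = matrix(0, 3, 2)),
  obsp = list(distances = matrix(0, 3, 3))
)

# A non-square obsp entry is rejected
bad <- try(check_alignment(3, 5, obsp = list(distances = matrix(0, 3, 2))), silent = TRUE)
```

The same pattern would extend to varm and varp by checking against n_vars instead of n_obs.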
obs_names and var_names are now stored separately from obs/var data.frames:
# InMemoryAnnData and HDF5AnnData store names separately
private$.obs_names # Character vector of observation names
private$.var_names # Character vector of variable names
# obs/var data.frames are stored WITHOUT rownames internally
private$.obs # Data.frame with NULL rownames
private$.var # Data.frame with NULL rownames
# Dimnames are added ON-THE-FLY when users access data
ad$obs # Returns data.frame with rownames = obs_names
ad$X # Returns matrix with dimnames from obs_names/var_names
Why this matters:
- All matrix data (X, layers, obsm, varm, obsp, varp) is stored internally without dimnames
- Dimnames are dynamically added via .add_matrix_dimnames() and .add_obsvar_dimnames() helper methods
- This ensures consistency with Python anndata where obs/var names are separate from the data
- When setting obs/var, extract rownames first if present, then strip them
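A minimal base-R sketch of this storage scheme, assuming a plain environment as a stand-in for the private fields. The names store, set_obs() and get_obs() are hypothetical and exist only for this illustration; .row_names_info() is used here in place of the package's own has_row_names() helper.

```r
# Hypothetical stand-in for the private fields of an AnnData object
store <- new.env()

set_obs <- function(store, value) {
  # .row_names_info() is positive only when row names were set explicitly
  if (.row_names_info(value) > 0L) {
    store$obs_names <- rownames(value)
  }
  rownames(value) <- NULL # keep the stored data.frame free of names
  store$obs <- value
  invisible(store)
}

get_obs <- function(store) {
  obs <- store$obs
  rownames(obs) <- store$obs_names # names are attached only on access
  obs
}

set_obs(store, data.frame(row.names = c("A", "B", "C"), cell_type = c("T", "B", "T")))

# The stored copy has automatic row names; the returned copy carries obs_names
stored_is_unnamed <- .row_names_info(store$obs) < 0L
returned_names <- rownames(get_obs(store))
```

Keeping the names in one place means a rename only has to touch a single character vector, and every accessor picks it up on the next read.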
Pattern for setters:
# Extract names before validation
if (!is.null(value) && has_row_names(value)) {
private$.obs_names <- rownames(value)
rownames(value) <- NULL
}
private$.obs <- private$.validate_obsvar_dataframe(value, "obs")
Use obs_names/var_names properties directly:
- Don't use rownames(ad$obs) or colnames(ad$X) in internal code
- Use ad$obs_names and ad$var_names for reliable access to names
AnnDataView provides lazy subsetting without data copying:
# Create a view with S3 [ operator
ad <- AnnData(
X = matrix(1:15, 3L, 5L),
obs = data.frame(row.names = LETTERS[1:3], cell_type = c("A", "B", "A"))
)
view <- ad[ad$obs$cell_type == "A", ] # Returns AnnDataView, no data copied
# Convert to concrete implementation when needed
result <- view$as_InMemoryAnnData() # Now subsetting is applied
Key characteristics:
- Inherits from AbstractAnnData
- Stores base AnnData object and subset indices
- All getters apply subsetting on-the-fly
- Setters are disabled - must convert to concrete implementation first
- Use .apply_subset() helper for matrix/data.frame subsetting
- Use .apply_vector_subset() for vector subsetting (obs_names, var_names)
Testing pattern: See tests/testthat/test-AnnDataView.R for usage examples
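The idea behind a lazy view can be sketched in base R without any package code. Everything below is hypothetical (base, make_view() and as_list() are names made up for this example). It uses plain closures rather than the R6 class that AnnDataView actually is, so it shows the concept and not the implementation.

```r
# Hypothetical base object: a plain list standing in for an AnnData
base <- list(
  X = matrix(1:15, 3L, 5L),
  obs = data.frame(cell_type = c("A", "B", "A")),
  obs_names = LETTERS[1:3]
)

# Hypothetical view constructor: stores the base object and the row index only
make_view <- function(base, obs_idx) {
  force(base)
  force(obs_idx)
  list(
    # Getters apply the subset each time they are called
    X = function() base$X[obs_idx, , drop = FALSE],
    obs = function() base$obs[obs_idx, , drop = FALSE],
    obs_names = function() base$obs_names[obs_idx],
    # Materialize the subset into a new, independent list
    as_list = function() {
      list(
        X = base$X[obs_idx, , drop = FALSE],
        obs = base$obs[obs_idx, , drop = FALSE],
        obs_names = base$obs_names[obs_idx]
      )
    }
  )
}

view <- make_view(base, base$obs$cell_type == "A")
view_dim <- dim(view$X())
view_names <- view$obs_names()
```

No subset is computed until a getter is called, which mirrors the "getters apply subsetting on-the-fly" behavior listed above.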
Standard R S3 methods now work on all AnnData objects:
# Dimension methods
dim(ad) # [n_obs, n_vars]
nrow(ad) # n_obs
ncol(ad) # n_vars
dimnames(ad) # list(obs_names, var_names)
rownames(ad) # obs_names
colnames(ad) # var_names
# Subsetting with [ operator
ad[1:5, ] # Subset observations, returns AnnDataView
ad[, 1:10] # Subset variables, returns AnnDataView
ad[1:5, 1:10] # Subset both, returns AnnDataView
Implementation: See R/AbstractAnnData-s3methods.R for all S3 methods
Pattern: HDF5AnnData manages file handles with lifecycle:
# Automatic closure on finalization
adata <- read_h5ad("file.h5ad") # close_on_finalize = TRUE
# Manual closure required for explicit construction
adata <- HDF5AnnData$new("file.h5ad")
adata$close() # Must call explicitly
# Check validity before operations
private$.check_file_valid() # Throws if handle closed
Testing pattern: test-h5ad-fileclosure.R validates handles close properly.
Alignment validation:
# All setters use validation helpers from AbstractAnnData
private$.validate_aligned_array(
value,
"X",
shape = c(self$n_obs(), self$n_vars()),
expected_rownames = rownames(self),
expected_colnames = colnames(self)
)
Dimension checking:
private$.validate_aligned_mapping(
value,
"layers",
c(self$n_obs(), self$n_vars()),
expected_rownames = rownames(self),
expected_colnames = colnames(self)
)
Pattern: Helper functions check for optional dependencies:
# tests/testthat/helper-skip_if_no_anndata.R
skip_if_no_anndata <- function() {
if (!rlang::is_installed("reticulate")) {
skip("reticulate not installed")
}
# Check Python anndata module
if (!reticulate::py_module_available("anndata")) {
skip("Python anndata not available")
}
}
# Usage in tests
test_that("ReticulateAnnData works", {
skip_if_no_anndata()
# ... test code ...
})
Available skip helpers:
- skip_if_no_anndata(): Python anndata + reticulate
- skip_if_no_dummy_anndata(): anndataR.testdata Python module (for roundtrip tests)
- skip_if_no_h5diff(): h5diff CLI tool
Goal: Ensure R read/write matches Python anndata behavior
# tests/testthat/test-roundtrip-X.R example
test_that("Dense X roundtrips correctly", {
skip_if_no_dummy_anndata()
# Generate test data with known structure
adata <- generate_dataset(
X_type = "dense",
obs = data.frame(row.names = LETTERS[1:10]),
var = data.frame(row.names = letters[1:5])
)
# Write R → H5AD
tmp <- tempfile(fileext = ".h5ad")
adata$write_h5ad(tmp)
# Read back and verify
adata2 <- read_h5ad(tmp)
expect_equal(adata$X, adata2$X)
# Compare with Python via dummy_anndata
py <- reticulate::import("anndataR.testdata")
py_adata <- py$dummy_anndata(X = "dense")
expect_equal_py(adata, py_adata) # Custom helper
})
Custom helpers:
- expect_equal_py(r_adata, py_adata): Compare R and Python AnnData objects
- generate_dataset(): Create test AnnData with configurable matrix types
Usage:
# R/generate_dataset.R
adata <- generate_dataset(
X_type = "dense", # "dense", "sparse", "csparse", "rsparse"
obs = data.frame(row.names = LETTERS[1:100]),
var = data.frame(row.names = letters[1:50]),
n_layers = 2,
obs_has_row_names = TRUE # Toggle for testing edge cases
)
Pattern: from_*() functions for explicit conversion:
# from_SingleCellExperiment.R
adata <- from_SingleCellExperiment(
sce,
output_class = "InMemoryAnnData", # or "HDF5AnnData"
X_name = "counts", # Which assay → X
layers = c("logcounts"), # Additional assays → layers
uns_keys = c("pca") # Which metadata → uns
)
# from_Seurat.R
adata <- from_Seurat(
seurat_obj,
output_class = "InMemoryAnnData",
assay = "RNA" # Which assay to extract
)
Guessing pattern: from_Seurat() uses helper functions to infer how Seurat components map onto AnnData slots:
- .from_Seurat_guess_layers(): Identify assay data slots
- .from_Seurat_guess_obsms(): Extract dimensionality reductions (PCA, UMAP)
- .from_Seurat_guess_uns(): Map miscellaneous metadata
Pattern: as_*() functions with optional parameters:
# as_SingleCellExperiment.R
sce <- as_SingleCellExperiment(
adata,
X_name = "counts", # What to call X assay
layer_names = NULL # Which layers to include (NULL = all)
)
# as_Seurat.R
seurat <- as_Seurat(
adata,
assay_name = "RNA" # Assay name in Seurat object
)
# Full R CMD check (Bioconductor standards)
R CMD build .
R CMD check --as-cran anndataR_*.tar.gz
# Quick test suite
R -e 'devtools::test()'
# Run specific test file
R -e 'devtools::test(filter = "roundtrip-X")'
# Lint checking
R -e 'lintr::lint_package()'
# Check code formatting
air format --check .
# Reformat code
air format .
Key rules:
- Line length: 120 characters max
- Use paste() to wrap long strings in test helpers
- Prefer explicit :: for package functions in examples
Example fix:
# ❌ Too long
expect_warning(as(adata, "Seurat"), "Consider using as_Seurat() for more control over the conversion")
# ✅ Wrapped
expect_warning(
as(adata, "Seurat"),
paste(
"Consider using as_Seurat() for more control",
"over the conversion"
)
)
Pattern: Check for optional packages before use:
# R/check_requires.R
check_requires <- function(package, reason = NULL) {
if (!rlang::is_installed(package)) {
msg <- c(
"!" = "Package {.pkg {package}} is required",
"i" = "Install with: {.code install.packages(\"{package}\")}"
)
if (!is.null(reason)) {
msg <- c(msg, "i" = reason)
}
cli::cli_abort(msg)
}
}
# Usage
as_Seurat <- function(adata, ...) {
check_requires("SeuratObject", "for converting to Seurat objects")
# ... conversion code ...
}
Problem: Trying to access obs/var names via rownames instead of dedicated properties
Wrong:
# ❌ Internal code should not use this pattern
obs_ids <- rownames(adata$obs)
gene_ids <- colnames(adata$X)
Correct:
# ✅ Always use dedicated properties
obs_ids <- adata$obs_names
gene_ids <- adata$var_names
Why: obs_names/var_names are stored separately and added on-the-fly to user-facing data
Problem: R data.frames require unique row names, but AnnData allows duplicates
Solution: generate_dataframe.R has obs_has_row_names parameter for testing edge cases
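The underlying R constraint is easy to reproduce in base R. The sketch below uses only base functions; the variable names are made up for the example. It also shows why keeping names in a separate character vector sidesteps the restriction.

```r
obs <- data.frame(cell_type = c("T", "B", "T"))
dup_names <- c("cell1", "cell1", "cell2")

# Assigning duplicated row names to a data.frame raises an error
outcome <- tryCatch(
  suppressWarnings({
    rownames(obs) <- dup_names
    "accepted"
  }),
  error = function(e) "rejected"
)

# A plain character vector has no uniqueness constraint
obs_names <- dup_names
has_duplicates <- anyDuplicated(obs_names) > 0
```

Tests for this edge case therefore need data whose names never pass through data.frame row names.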
Problem: Different sparse matrix classes (Matrix::dgCMatrix vs Matrix::dgRMatrix)
Solution: Use generate_dataset(X_type = "csparse") for CSC, "rsparse" for CSR testing
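For orientation, the two layouts can be compared directly with the Matrix package. This sketch assumes Matrix is installed and uses only its generic as() coercions. For a numeric matrix these are expected to produce the dgCMatrix and dgRMatrix classes named above, but the example only checks the virtual classes.

```r
library(Matrix) # assumption: the Matrix package is installed

dense <- matrix(c(1, 0, 0, 0, 2, 0, 0, 0, 3, 4, 0, 0), nrow = 3)

m_csc <- as(dense, "CsparseMatrix") # compressed sparse column (CSC)
m_csr <- as(m_csc, "RsparseMatrix") # compressed sparse row (CSR)

# Both representations hold the same values
same_values <- isTRUE(all.equal(as.matrix(m_csc), as.matrix(m_csr)))
```

Because the values are identical and only the storage layout differs, read/write code has to be tested with both classes.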
Problem: Python distinguishes None from {}; R treats NULL differently
Solution: As of anndata 0.12.0, write NULL as empty HDF5 dataset. Controlled by options(anndataR.write_null = TRUE)
Problem: Loop variables not captured in closure
# ❌ WRONG - all handlers reference final iteration value
for (class in classes) {
setAs(class, "Seurat", function(from) convert(from, class))
}
# ✅ CORRECT - force() captures variable value
.make_convert_handler <- function(convert_fn, from_str, to_str) {
force(convert_fn)
force(from_str)
force(to_str)
function(from) convert_fn(from)
}
Core Classes:
- R/AbstractAnnData.R: Base class with abstract slots and validation
- R/AbstractAnnData-s3methods.R: S3 methods (dim, nrow, ncol, dimnames, [)
- R/InMemoryAnnData.R: RAM-based implementation
- R/HDF5AnnData.R: HDF5-backed implementation
- R/ReticulateAnnData.R: Python wrapper (experimental)
- R/AnnDataView.R: Lazy view for subsetting without data copying
I/O:
- R/read_h5ad.R, R/read_h5ad_helpers.R: Reading HDF5 files
- R/write_h5ad.R, R/write_h5ad_helpers.R: Writing HDF5 files
- R/write_hdf5_helpers.R: Low-level HDF5 utilities
Conversion:
- R/as-coercions.R: S4 coercion registration
- R/as_AnnData.R: Generic converter to AnnData
- R/as_SingleCellExperiment.R: AnnData → SCE
- R/as_Seurat.R: AnnData → Seurat
- R/from_SingleCellExperiment.R: SCE → AnnData
- R/from_Seurat.R: Seurat → AnnData
Testing:
- R/generate_dataset.R, R/generate_*.R: Mock data generation
- tests/testthat/helper-*.R: Test utilities and skip helpers
- tests/testthat/test-roundtrip-*.R: Python compatibility tests
- tests/testthat/test-as-*.R: Conversion validation
- tests/testthat/test-AnnDataView.R: Lazy subsetting tests
- tests/testthat/test-AbstractAnnData-s3methods.R: S3 method tests
Utilities:
- R/check_requires.R: Dependency validation
- R/ui.R: CLI messaging helpers
- R/utils.R: Miscellaneous helpers
- R/known_issues.R: Track known bugs/limitations
Class documentation:
#' @title InMemoryAnnData
#' @description Implementation of an in-memory AnnData object.
#' @seealso [AnnData-usage] for details on creating and using AnnData objects
#' @family AnnData classes
#' @examples
#' adata <- AnnData(X = matrix(1:15, 3L, 5L), ...)
Cross-references:
- Use [AnnData-usage] for user-facing documentation
- Link related functions: @seealso [read_h5ad()], [write_h5ad()]
- vignettes/anndataR.Rmd: Main usage guide
- vignettes/software_design.Rmd: Architecture documentation
- vignettes/usage_*.Rmd: Format-specific tutorials
- vignettes/known_issues.Rmd: Documented limitations
Bioconductor requirements:
- Must pass R CMD check --as-cran with no errors/warnings
- BiocCheck compliance
- All examples must run successfully
- Conditional package usage (see check_requires pattern)
# In-memory (default)
adata <- AnnData(
X = matrix(1:15, 3L, 5L),
obs = data.frame(row.names = LETTERS[1:3]),
var = data.frame(row.names = letters[1:5])
)
# HDF5-backed
adata <- HDF5AnnData$new("path.h5ad", mode = "w")
adata$X <- matrix(1:15, 3L, 5L)
# ... set other slots ...
# From file
adata <- read_h5ad("file.h5ad")
# SingleCellExperiment → AnnData
adata <- from_SingleCellExperiment(sce, X_name = "counts")
# AnnData → SingleCellExperiment
sce <- as_SingleCellExperiment(adata)
# Seurat → AnnData
adata <- from_Seurat(seurat_obj)
# AnnData → Seurat
seurat <- as_Seurat(adata)
# S4 coercion (if registered)
sce <- as(adata, "SingleCellExperiment")
# Dimensions (using S3 methods)
dim(adata) # [n_obs, n_vars]
nrow(adata) # n_obs
ncol(adata) # n_vars
dimnames(adata) # list(obs_names, var_names)
rownames(adata) # obs_names
colnames(adata) # var_names
# Or using R6 methods
adata$n_obs() # Number of observations
adata$n_vars() # Number of variables
# Main matrix
adata$X # Read
adata$X <- mat # Write
# Metadata
adata$obs # Observation metadata
adata$var # Variable metadata
adata$uns # Unstructured metadata
# Additional matrices
adata$layers[["raw"]] # Named layer
adata$obsm[["X_pca"]] # Observation matrix (PCA)
adata$obsp[["distances"]] # Pairwise matrix
# Subsetting returns AnnDataView (lazy, no data copied)
view <- adata[1:10, ] # Subset observations
view <- adata[, c("gene1", "gene2")] # Subset variables
view <- adata[adata$obs$cell_type == "T", ] # Conditional subsetting
# Convert to concrete implementation to apply changes
result <- view$as_InMemoryAnnData()
result <- view$as_HDF5AnnData("output.h5ad")
- Does this need Python compatibility? → Add roundtrip test
- Requires optional package? → Use check_requires() + conditional tests
- Modifying validation? → Check AbstractAnnData validators
- New matrix type? → Add to generate_dataset() for testing
- Changing uns handling? → Consider Python dict requirements
- HDF5 changes? → Verify with h5diff against Python output
- New conversion feature? → Update both from_*() and as_*() paths
- AnnData spec: Official Python documentation
- vignettes/software_design.Rmd: Detailed architecture diagrams
- inst/known_issues.yaml: Tracked bugs and workarounds
- Bioconductor submission guidelines