Important
Data Migration Notice: This dataset is now available on the Google Cloud Marketplace.
Note: The new bucket is subject to Requester Pays. Users can access up to 2TB of data per month for free before fees apply.
Access to the current GCS bucket (gs://arc-scbasecount) will be deprecated on March 31, 2026. Please update your workflows to use the Google Marketplace bucket gs://arc-institute-virtual-cell-atlas.
scBaseCount is a continuously updated single-cell RNA-seq database that employs an AI-driven, hierarchical agent workflow to automate discovery, metadata extraction, and standardized preprocessing of Sequence Read Archive (SRA) data.
It is currently the largest public repository of single-cell data, comprising over 502 million cells (and expanding) across 27 organisms and 75 tissues.
By continually discovering, annotating, and reprocessing raw single-cell RNA-seq data, scBaseCount offers an expansive and harmonized repository that can serve as a foundation for AI-driven modeling and integrative meta-analyses.
scBaseCount: An AI agent-curated, uniformly processed, and continually expanding single cell data repository. Nicholas D. Youngblut, Christopher Carpenter, Arshia Nayebnazar, Abhinav Adduri, Rohan Shah, Chiara Ricci-Tam, Jaanak Prashar, Rajesh Ilango, Noam Teyssier, Silvana Konermann, Patrick D. Hsu, Alexander Dobin, Dave P. Burke, Hani Goodarzi, Yusuf H Roohani. bioRxiv 2025.02.27.640494; doi: https://doi.org/10.1101/2025.02.27.640494
- Google Cloud Marketplace (Recommended)
- Google Cloud Storage (Google Marketplace bucket: gs://arc-institute-virtual-cell-atlas)
- Google Cloud Storage (Deprecated: access ends March 31, 2026)
  - Path: gs://arc-scbasecount
See the tutorials below on programmatically accessing the data.
You can also directly download the data with the gsutil tool.
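As a minimal sketch of a direct download, the snippet below only assembles a gsutil command line; it does not run it. It relies on gsutil's top-level `-u` option, which names the project billed for Requester Pays access. The billing project ID and the object path are hypothetical placeholders, since the bucket layout is not described here.

```python
import shlex

# Bucket name taken from the migration notice above.
NEW_BUCKET = "gs://arc-institute-virtual-cell-atlas"


def build_gsutil_cp(billing_project, object_path, dest_dir="."):
    """Return the argument list for a Requester Pays copy with gsutil.

    billing_project and object_path are supplied by the caller; the values
    used below are placeholders, not real project IDs or bucket paths.
    """
    source = f"{NEW_BUCKET}/{object_path.lstrip('/')}"
    # -u sets the project that is billed for Requester Pays requests.
    return ["gsutil", "-u", billing_project, "cp", source, dest_dir]


cmd = build_gsutil_cp("my-billing-project", "path/to/file.h5ad")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the placeholders are replaced with a real billing project and object path.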
‼️ This release is a revamp and expansion of the initial release. It expands on the cells, metadata, and STARsolo count features (see below). The initial release can generally be considered deprecated, but we will continue to support it.
- >502 million cells
  - Note: the number of cells differs by STARsolo feature type
- 27 organisms
- 75 tissues
- h5ad files include all STARsolo mapping approaches:
  - Gene* feature types:
    - Unique: Only UMIs from reads that map to a single gene are counted. Stored in the adata.X matrix.
    - EM: UMIs from reads that map to multiple genes are distributed among those genes with an expectation-maximization (EM) algorithm. Stored in the UniqueAndMult-EM layer.
    - Uniform: UMIs from reads that map to multiple genes are distributed uniformly among those genes. Stored in the UniqueAndMult-Uniform layer.
  - Velocyto feature type:
    - spliced: Counts assigned to spliced transcripts. Stored in the adata.X matrix and the spliced layer.
    - unspliced: Counts assigned to unspliced transcripts. Stored in the unspliced layer.
    - ambiguous: Counts that cannot be assigned as either spliced or unspliced. Stored in the ambiguous layer.
- tissue_ontology_term_id: Tissue ontology term ID (if applicable)
- disease_ontology_term_id: Disease ontology term ID (if applicable)
- antibody_derived_tag: SRAgent was used to determine whether the SRX accession metadata in the SRA/ENA included any mention of "antibody derived tag". Such SRX accessions were labeled as "maybe" for ADT status.
- UMI count per feature type
- Gene count per feature type
- cell_type: Cell type annotation (if available)
- cell_type_ontology_term_id: Cell type ontology term ID (if available)
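The layer layout listed above for the Gene* and Velocyto feature types can be captured in a small lookup helper. This is a sketch, not part of any scBaseCount tooling: `get_counts` is a hypothetical name, and the toy object stands in for a real AnnData object, which exposes the same `.X` and `.layers` attributes after loading an h5ad file with `anndata.read_h5ad`.

```python
from types import SimpleNamespace

# Layer names as listed above; None means the matrix is stored in adata.X.
GENE_LAYERS = {
    "Unique": None,
    "EM": "UniqueAndMult-EM",
    "Uniform": "UniqueAndMult-Uniform",
}
# For Velocyto, the spliced matrix is in adata.X and also in the spliced layer.
VELOCYTO_LAYERS = {
    "spliced": "spliced",
    "unspliced": "unspliced",
    "ambiguous": "ambiguous",
}


def get_counts(adata, feature_type, variant):
    """Return the count matrix for a feature type and counting variant."""
    table = VELOCYTO_LAYERS if feature_type == "Velocyto" else GENE_LAYERS
    layer = table[variant]
    return adata.X if layer is None else adata.layers[layer]


# Toy stand-in for an AnnData object (2 cells x 2 genes, made-up values).
toy = SimpleNamespace(
    X=[[1, 0], [0, 2]],
    layers={
        "UniqueAndMult-EM": [[1.5, 0.0], [0.5, 2.0]],
        "UniqueAndMult-Uniform": [[2, 0], [1, 2]],
    },
)

unique = get_counts(toy, "Gene", "Unique")
em = get_counts(toy, "GeneFull", "EM")
```

Keeping the mapping in one place avoids scattering layer-name strings through analysis code, which matters if the names change between releases.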
We created the NRX accession prefix and use it to identify datasets not in the SRA/ENA.
NRX accessions are datasets that do not have any SRX/ERX accession ID.
These datasets are associated with CZI CELLxGENE collections and were made public but never uploaded to the SRA/ENA.
We manually processed these datasets outside of the SRAgent/scRecounter workflow.
In order to associate an NRX accession with a CZI collection, you can use the sample metadata tables (parquet files).
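The lookup described above might be sketched as follows. The rows are made-up stand-ins for records from a sample metadata table; the accession values and collection IDs are hypothetical, and only the column names (`srx_accession`, `czi_collection_id`, `czi_collection_name`) come from the metadata fields documented here. The sketch also assumes NRX accessions are stored in the `srx_accession` column.

```python
# Hypothetical rows; real tables could be loaded with pandas.read_parquet.
rows = [
    {"srx_accession": "SRX0000001", "czi_collection_id": None,
     "czi_collection_name": None},
    {"srx_accession": "NRX0000001", "czi_collection_id": "collection-a",
     "czi_collection_name": "Hypothetical collection A"},
    {"srx_accession": "NRX0000002", "czi_collection_id": "collection-b",
     "czi_collection_name": "Hypothetical collection B"},
]


def nrx_to_collection(rows):
    """Map each NRX accession to its CZI collection ID."""
    return {
        row["srx_accession"]: row["czi_collection_id"]
        for row in rows
        if row["srx_accession"].startswith("NRX")
    }


mapping = nrx_to_collection(rows)
print(mapping)
```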
Disease annotations are extracted at the study level, not the sample or cell level. SRAgent derives disease labels from author-supplied abstracts in the SRA, which describe the overall study design (e.g., "A study of COVID-19 infection") rather than the status of individual donors or samples. As a result, a single disease label (e.g., "COVID-19") may be propagated to all samples within a study, including healthy controls.
- >230 million cells
- 21 organisms
- Multiple STARsolo count features (e.g., Gene, GeneFull_Ex50pAS, and Velocyto)
  - Note: only filtered count tables are provided (no raw)
The observation metadata was obtained via SRAgent.
Each sample includes the following metadata fields:
Core Identifiers
- entrez_id: Entrez database identifier
- srx_accession: SRA experiment accession
- file_path: Path to h5ad file in Google Cloud Storage
- obs_count: Total number of cells/observations
Technical Information
- lib_prep: Library preparation method (10X Genomics or other)
- tech_10x: Specific 10X Genomics technology (e.g. 3' or 5')
- cell_prep: Sample preparation type (single nucleus or single cell)
Biological Information
- organism: Species of origin
- tissue: Source tissue
- disease: Disease status (if applicable)
- perturbation: Experimental perturbations (if applicable)
- cell_line: Cell line information (if applicable)
Collection Information
- czi_collection_id: CZI collection identifier (if applicable)
- czi_collection_name: CZI collection name (if applicable)
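As an illustration of how these sample-level fields can be combined, the sketch below selects samples by organism and tissue and sums their `obs_count`. The rows and their values (including how organism and tissue names are spelled) are hypothetical; only the field names come from the list above.

```python
# Hypothetical sample metadata rows using the documented field names.
samples = [
    {"srx_accession": "SRX0000001", "organism": "Homo sapiens",
     "tissue": "lung", "obs_count": 5000},
    {"srx_accession": "SRX0000002", "organism": "Homo sapiens",
     "tissue": "blood", "obs_count": 12000},
    {"srx_accession": "SRX0000003", "organism": "Mus musculus",
     "tissue": "lung", "obs_count": 8000},
]


def select_samples(rows, organism, tissue=None):
    """Return rows matching an organism and, optionally, a tissue."""
    return [
        row for row in rows
        if row["organism"] == organism
        and (tissue is None or row["tissue"] == tissue)
    ]


human = select_samples(samples, "Homo sapiens")
human_cells = sum(row["obs_count"] for row in human)
human_lung_paths = [
    row["srx_accession"]
    for row in select_samples(samples, "Homo sapiens", "lung")
]
```

The same filter could return the `file_path` of each match to decide which h5ad files to download.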
Each individual cell (obs) contains:
- gene_count: Number of unique genes detected
- umi_count: Total number of Unique Molecular Identifiers (UMIs)
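To make the two per-cell fields concrete, the sketch below derives them from one cell's row of a count matrix. It assumes that `gene_count` is the number of genes with a nonzero count and `umi_count` is the sum of all counts in the row; the source states the definitions but not the exact computation, and the count values are made up.

```python
def cell_summary(counts):
    """Compute per-cell summary values from one row of a count matrix.

    Assumes gene_count = number of genes with a nonzero count and
    umi_count = sum of all counts for the cell.
    """
    return {
        "gene_count": sum(1 for c in counts if c > 0),
        "umi_count": sum(counts),
    }


# One hypothetical cell measured across five genes.
summary = cell_summary([0, 3, 1, 0, 5])
print(summary)  # {'gene_count': 3, 'umi_count': 9}
```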
- Gene: Counts reads mapping entirely to exonic regions of annotated transcripts, capturing fully spliced transcripts. This procedure represents traditional gene expression analyses and was the default in earlier CellRanger versions.
- GeneFull: Counts reads overlapping the entire gene locus, i.e. including both exonic and intronic regions. This captures both unspliced (primary) and spliced transcripts and provides a more comprehensive view of gene activity.
- GeneFull_ExonOverIntron: Counts reads overlapping exonic and intronic regions, but assigns higher priority to exonic overlaps. This option helps to resolve reads that map to overlapping genes.
- GeneFull_Ex50pAS: Similar to the above option, but with a more sophisticated priority scheme that prioritizes partial and antisense exonic overlaps over intronic reads.
- Velocyto: Generates separate count matrices for spliced, unspliced, and ambiguous reads, following the rules from La Manno et al., 2018. This enables RNA velocity analyses to infer dynamic cellular processes.
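The difference between Gene and GeneFull can be illustrated with a deliberately simplified toy example. It is not how STARsolo works internally: real counting operates on alignments and UMIs, whereas here each read is pre-labeled with a hypothetical gene name and a region ("exonic" or "intronic"). The sketch only shows why including intronic reads can never lower a gene's count.

```python
from collections import Counter

# Hypothetical pre-labeled reads: (gene, region overlapped by the read).
reads = [
    ("GENE_A", "exonic"),
    ("GENE_A", "intronic"),
    ("GENE_B", "exonic"),
    ("GENE_B", "intronic"),
    ("GENE_B", "intronic"),
]


def count_reads(reads, include_intronic):
    """Count reads per gene, optionally including intronic reads."""
    counts = Counter()
    for gene, region in reads:
        if region == "exonic" or (include_intronic and region == "intronic"):
            counts[gene] += 1
    return dict(counts)


gene_like = count_reads(reads, include_intronic=False)      # exonic only
genefull_like = count_reads(reads, include_intronic=True)   # exonic + intronic
print(gene_like)      # {'GENE_A': 1, 'GENE_B': 1}
print(genefull_like)  # {'GENE_A': 2, 'GENE_B': 3}
```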
- STARsolo count features (e.g., Gene, GeneFull_Ex50pAS, and Velocyto)
  - GeneFull* will have more counts than Gene due to the mapping algorithm.
  - Velocyto h5ad files include 3 layers: spliced, unspliced, and ambiguous.
  - The X matrix is the spliced matrix, in addition to the spliced layer.