single-cell-tools is a small toolkit for working with single-cell datasets locally or on a shared cluster. The repo centers around a Gradio app for generating embeddings, exploring datasets, downloading public data into a managed .data/ workspace, and kicking off Slurm jobs when the workload is too large for an interactive session.
- Python 3.12+ (see
requires-pythoninpyproject.toml) - uv for environments and runs (
uv sync,uv run …)
Optional: copy .env.example to .env and fill in only what you need (tokens, paths). .env is gitignored.
If you want to try it quickly, the shortest path is:
git clone <repository-url>
cd single-cell-tools
uv sync
uv run python main.py embeddingsYou can launch the same Gradio UI by running the app module directly from the repo root:
uv run python app/app.pymain.py embeddings is a thin wrapper that runs app.app as a module (python -m app.app); using app/app.py skips the CLI and is handy in editors or when you only care about the embeddings UI.
Then open the URL Gradio prints (often http://127.0.0.1:7861), point the app at a server-side .h5ad, .csv, or .tsv, choose a model, and run embeddings on the current node or through sbatch. Adjust Slurm partition and --gres in the UI to match your cluster.
- Compute embeddings with
geneformer,scGPT, orscVI - Load data from a server path instead of uploading large files through the browser
- Download datasets from direct URLs, Zenodo, GEO, or CELLxGENE into
.data/ - Convert downloaded
.csvand.tsvfiles into.h5adinside the repo - Inspect PCA and UMAP views
- Submit embedding jobs to Slurm with automatic resource recommendations
- Run benchmark-related workflows from the app when your environment supports them
This project is set up for uv:
uv syncThat creates the environment from pyproject.toml and uv.lock.
Launch the app (either form is equivalent for the embeddings UI):
uv run python main.py embeddingsor:
uv run python app/app.pyFrom there, a typical workflow looks like this:
- Pick a dataset from the managed
.data/area or paste a server path to a.h5ad,.csv, or.tsv. - If you do not already have a dataset handy, use the app's "Download Dataset" section to pull one from a URL, Zenodo, GEO, or CELLxGENE.
- Choose a model.
- Run embeddings either on the current node or through Slurm.
- Explore the outputs in the app, including dimensionality reduction views and saved artifacts.
Some models need local weights or checkpoints. You can fetch them with:
uv run python main.py download-weights -- --models geneformer scgpt transcriptformerYou can also call the script directly:
uv run python scripts/download_weights.py --models geneformer scgpt transcriptformerAvailable downloads include:
geneformer(Hugging Face snapshot)transcriptformer(official CZI weights from S3 — seemodels/README.mdfor variants)scgpt(Google Drive folder)xtrimo(Hugging Face; may need a token)
Model files are stored under models/ (see models/README.md for TranscriptFormer layout and how it differs from Hugging Face). File contents are gitignored so the clone stays small.
Hugging Face: some downloads (e.g. xtrimo) may need a token. Set HF_TOKEN in the environment or in .env — never commit it.
Useful environment variables if you want to point the app at existing local checkpoints:
GENEFORMER_MODELTRANSCRIPTFORMER_MODELSCGPT_CKPT_DIR
A fuller list of optional variables (paths, SCFMS_ALLOWED_PATH_PREFIXES, W&B, etc.) is in .env.example.
The app has a built-in dataset download panel. Downloaded files are organized under:
.data/<dataset-name>/
Inside each dataset folder:
data/holds the raw downloaded filesprocessed/holds derived.h5adfiles when conversion succeeds
Right now the in-repo conversion step handles .h5ad, .csv, and .tsv inputs. If a download does not convert automatically, the raw files are still preserved in .data/ so you can process them yourself.
For larger jobs, you can submit embeddings from the CLI:
uv run python main.py submit-embedding-slurm -- \
--input-h5ad /path/to/my_dataset.h5ad \
--model scgpt \
--matrix-spec X \
--mem auto \
--time autoOr call the helper directly:
uv run python scripts/submit_scfm_embedding_slurm.py \
--input-h5ad /path/to/my_dataset.h5ad \
--model scgpt \
--matrix-spec X \
--mem auto \
--time autoThis writes staging files under scfms_job_store/ (override with SCFMS_JOB_DIR) and, unless you override it, saves outputs under scfms_job_store/slurm_embeddings/. The generated batch script is named slurm_gpu_embed.sh; use --partition and --gres that your site supports.
Slurm defaults from .env: set SCFMS_SLURM_PARTITION and optional SCFMS_SLURM_ACCOUNT in .env (see .env.example). The UI and CLI prefill the partition from that variable, and a non-empty account adds #SBATCH -A … to generated scripts. Values you type in the UI or pass with --partition still override the partition default.
If you just want to see what would be submitted:
uv run python scripts/submit_scfm_embedding_slurm.py \
--input-h5ad /path/to/my_dataset.h5ad \
--model geneformer \
--dry-runapp/: the Gradio application and related UI logic (app/app.pycan be run withuv run python app/app.py)scripts/: download helpers, embedding utilities, and Slurm helpersmain.py: a simple CLI entrypoint for the most common commands (e.g.embeddingsdelegates toapp.app).data/: managed dataset workspace created at runtimemodels/: local model weights and checkpoints
models/,.data/, logs, and runtime job artifacts are ignored by git- the CLI entrypoints available today are
embeddings,download-weights, andsubmit-embedding-slurm - benchmark and W&B-related features are available in the app, but they depend on your local environment being set up for them