This repository contains analysis code for the phenotypic plasticity project, which links gene expression variability across epithelial cell types (single‑cell RNA‑seq) to DNA methylation divergence (WGBS PWD) across regulatory contexts.
The core scientific question is:
Do genes that are more expression‑plastic across epithelial cell types show more conserved or more divergent DNA methylation in promoters and enhancers?
- Data sources:
- SCP259: Normal vs. Inflamed (UC) colon (Epithelial cells only).
- SCP1162: Normal vs. Tumor (MMRd & MMRp) (Epithelial cells only).
- Pipeline:
- Collapse transcript‑level rows → gene‑level counts (summation).
- Per‑cell normalization: CP10k (Counts Per 10k).
- Log transform: log2(CP10k + 1).
- Average within epithelial cell types.
- Metrics per gene:
mean_expr: Mean expression across epithelial cell types.var_expr: Variance across epithelial cell types (Plasticity Proxy).
- Data source: Whole‑genome bisulfite sequencing (WGBS).
- Metric: Pairwise Methylation Divergence (PWD).
- Regulatory Contexts:
- TSS500: ±500 bp around the 5′‑most TSS per gene.
- pELS: Proximal Enhancer-Like Signatures (mapped to nearest gene).
- dELS: Distal Enhancer-Like Signatures (mapped to nearest gene).
- Aggregation: Interval‑level PWD is aggregated to the gene level using a CpG‑weighted mean.
This project integrates single-cell transcriptomics with bulk Whole Genome Bisulfite Sequencing (WGBS) data. The following section outlines the origin of all genomic intervals and expression datasets used in this analysis.
To create a comprehensive transcript universe, we merged two major UCSC tables for the hg38/GRCh38 build:
- Sources:
refGeneandncbiRefSeqtables via the UCSC Table Browser. - Processing: * Removed version suffixes from accessions (e.g.,
NM_001.1→NM_001).- Filtered for mRNA (
NM/XM) and non-coding RNA (NR/XR) transcripts. - 5'-Most TSS Selection: For each gene symbol (
name2), we identified the representative Transcription Start Site (TSS) by selecting the minimumtxStartfor positive-strand genes and the maximumtxEndfor negative-strand genes. This ensures we capture the most upstream regulatory architecture for every gene.
- Filtered for mRNA (
We utilized several specialized databases to define the cis-regulatory landscape:
| Category | Source | Description / Filters |
|---|---|---|
| cCREs | ENCODE SCREEN | Large Intestine (noccl): Proximal (pELS) and Distal (dELS) enhancer-like signatures. |
| TFBS | ChIP-Atlas | Tissue: Digestive tract; Threshold: 100; Antigen: ALL. |
| CpG Islands | UCSC Browser | cpgIslandExt table (hg38). |
Enhancer Mapping: Since SCREEN cCREs are not pre-linked to genes, pELS and dELS intervals were assigned to the nearest gene using a genomic nearest-neighbor operation relative to the refseq_all 5'-most TSS.
Expression metrics were derived from the following epithelial-specific datasets hosted on the Broad Single Cell Portal:
- SCP259 (Normal vs. Ulcerative Colitis): Analyzed 15 epithelial cell types (e.g., Stem, Enterocytes, Goblet) across ~123,000 cells to compute variance-based plasticity.
- SCP1162 (Normal vs. CRC): Analyzed ~168,000 epithelial cells across Normal, MMRd, and MMRp tumor environments.
This script must be run first. It generates the foundational data structures (gene models, intervals, and PWD tables) used by all downstream analysis scripts.
🚨 Master Pipeline: Intervals & Methylation
- Builds
refseq_all: Consolidates UCSC RefSeq and KnownGene tables into a single master annotation. - Enforces 5'–Most Rule: Rigorously selects the single 5'–most transcript per gene (strand–aware) to define the canonical gene start.
- Defines Intervals: Generates coordinates for Promoters (
TSS500), Gene Bodies (TxBody,CDS), and regulatory features (pELS,dELS,CGI,TF_peaks) mapped to the nearest gene. - Calculates PWD: Runs the memory–efficient, chunked PWD calculation engine for all defined intervals and sample pairs.
Output: Saves gene-level RDS files (e.g., pwd_TSS500.rds, pwd_pELS.rds) to the /data folder.
These scripts consume the PWD files generated above and integrate them with single-cell expression data to produce the final figures.
Dataset: Normal Colon vs. Ulcerative Colitis (UC).
- Expression Processing:
- Loads raw SCP259 matrices.
- Computes
meanandvarianceof expression for Healthy and UC epithelial cells.
- Integration: Merges expression stats with
pwd_TSS500,pwd_pELS, andpwd_dELS. - Visualization: Generates scatter plots of Methylation Divergence (X) vs. Expression Plasticity (Y).
- Output: Summary tables and plots for Healthy/UC conditions.
Dataset: Normal Colon vs. Colorectal Cancer (MMRd & MMRp).
- Robust ID Mapping: Implements advanced logic to clean and map messy gene IDs (e.g.,
ENSG..._Symbol, version numbers) to valid HGNC symbols. - Expression Processing: Uses BPCells for memory-efficient processing of the large SCP1162 dataset.
- Grouping: Separately processes Normal, Tumor_MMRd, and Tumor_MMRp epithelial populations.
- Visualization: Compares plasticity vs. conservation across normal and tumor states.
Methodological Validation: Why the 5'–most TSS?
- Goal: Scientifically validates the decision to use the 5'–most TSS as the representative promoter.
- Analysis:
- Identifies genes with multiple unique TSSs.
- Compares PWD of the 5'–most TSS against alternative downstream TSSs.
- Key Insight: Demonstrates that phenotypic plasticity signals are best captured by the canonical 5' start site.
Data Quality Control
- Diagnoses duplicate transcripts and one‑to‑many mappings.
- Documents decisions made regarding gene ID collapsing and filtering.
These files are retained for reproducibility and transparency. While their core logic has been integrated into the Master Data Generation pipeline, they serve as independent references for specific methodological steps.
Legacy Module: TSS Definition
- Defines the logic for selecting the 5′‑most TSS per gene (strand‑aware) and building promoter windows.
- Status: Its logic is now fully incorporated into
Refseq_master.Rmd. Retained as a standalone record of the interval definition strategy.
Legacy Module: Methylation Calculation
- Contains the original standalone code for calculating PWD from interval tables.
- Status: The PWD calculation engine has been optimized and moved to
Refseq_master.Rmd. This file is useful for auditing the math behind the divergence metric.
Exploratory Notebooks: ID Resolution
- These scripts explore the complex mapping between Ensembl transcript IDs (ENST) and gene symbols.
- Status: They document the rationale behind the final ID selection strategy used to merge the Single-Cell (Expression) and WGBS (Methylation) datasets.
graph TD
A[Refseq_master.Rmd] -->|Generates| B(data/refseq_all.rds)
A -->|Generates| C(data/pwd_*.rds)
C --> D[SCP259_Analysis.Rmd]
C --> E[SCP1162_Analysis.Rmd]
B --> F[5prime_validation.Rmd]
subgraph "Reference / Legacy"
G[TSS_gene_mapping.Rmd]
H[Refseq_PWD.Rmd]
I[ENST_mapping_*.Rmd]
end