GDSCN_Book_Statistics_for_Genomics_scRNA-seq/08-annotation.Rmd at main · jhudsl/GDSCN_Book_Statistics_for_Genomics_scRNA-seq · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
# (PART\*) ANNOTATION {-}

# Marker gene detection

We have managed to identify clusters of related cells in the data. However, these clusters aren't very useful until we can identify the biology meaning of each group. This is where functional annotation comes in.

The real art, and the greatest challenge, in an scRNA-seq analysis comes when interpreting the results. Up to this point (cleaning and clustering the data), the analysis and computation has been straightforward. Figuring out the biological state that each cluster represents, on the other hand, is more difficult, as it requires applying prior biological knowledge to the dataset.

Thanks to previous research, we know many _marker genes_, or genes can be used to identify particular cell types. These genes are differentially expressed across cell types, and by examining the expression profiles of multiple marker genes across all the clusters, we can assign particular cell type identities to each cluster.

## Calculating and ranking effect size summary statistics

We begin by comparing each pair of clusters and calculating scores for expression differences between the two for each gene. We have multiple options for the statistics used to compare expression values.

```{r, warning = FALSE, message = FALSE, echo=FALSE}
library(scRNAseq)
library(scater)
library(scran)
library(BiocSingular)
sce.zeisel <- ZeiselBrainData()
#sce.zeisel <- aggregateAcrossFeatures(sce.zeisel, id=sub("_loc[0-9]+$", "", rownames(sce.zeisel)))
stats <- perCellQCMetrics(sce.zeisel, subsets=list(
    Mt=rowData(sce.zeisel)$featureType=="mito"))
qc <- quickPerCellQC(stats, percent_subsets=c("altexps_ERCC_percent",
    "subsets_Mt_percent"))
sce.zeisel.qc <- sce.zeisel[,!qc$discard]
set.seed(1000)
clusters <- quickCluster(sce.zeisel.qc)
sce.zeisel.qc <- computeSumFactors(sce.zeisel.qc, cluster=clusters)
sce.zeisel.qc <- logNormCounts(sce.zeisel.qc)
dec.zeisel.qc <- modelGeneVarWithSpikes(sce.zeisel.qc, "ERCC")
top1000.hvgs <- getTopHVGs(dec.zeisel.qc, n=1000)
sce.zeisel.PCA.1000 <- denoisePCA(sce.zeisel.qc, technical=dec.zeisel.qc, subset.row=top1000.hvgs)
sce.zeisel.tsne20 <- runTSNE(sce.zeisel.PCA.1000, dimred="PCA")
nn.clusters <- clusterCells(sce.zeisel.tsne20, use.dimred="PCA")
colLabels(sce.zeisel.tsne20) <- nn.clusters
```

  * **AUC** (area under the curve) is the probability that a randomly chosen observation from cluster A is greater than a randomly chosen observation from cluster B. This statistic is a way to quantify how well we can distinguish between two distributions (clusters) in a pairwise comparison. An AUC of 1 means all values in cluster A are greater than any value from cluster B and suggests upregulation. An AUC of of 0.5 means the two clusters are indistinguishable from each other, while an AUC of 0 suggests the marker gene observations in cluster A are downregulated compared to those in cluster B.

  * **Cohen's _d_** is a standardized log-fold change, and can be thought of as the number of standard deviations that separate the two groups. Positive values of Cohen's _d_ suggest that our cluster of interest (cluster A) are upregulated compared to cluster B, while negative values suggest the marker gene observations in cluster A are downregulated compared to cluster B.

  * **log-fold change (logFC)** is a measure of whether there is a difference in expression between clusters. Keep in mind that these values ignore the magnitude of the change. As with the others, positive values indicate upregulation in the cluster of interest (cluster A), while negative values indicate downregulation.

For each of these statistics, `scoreMarkers` calculates mean, median, minimum value (min), maximum value (max), and minimum rank (rank; the smallest rank of each gene across all pairwise comparisons). For most of these measures, a larger number indicates upregulation. For minimum rank, however, a small value means the gene is one of the top upregulated genes.

AUC or Cohen’s _d_ are effective regardless of the magnitude of the expression values and thus are good choices for general marker detection. The log-fold change in the detected proportion is specifically useful for identifying binary changes in expression.

For this exercise, we're going to focus on upregulated markers, since those are particularly useful for identifying cell types in a heterogeneous population like the Zeisel dataset. We use the `findMarkers` command, which quickly identifies potential marker genes.

```{r, warning = FALSE, message = FALSE}
# identify those genes which are upregulated in some clusters compared to others
markers <- findMarkers(sce.zeisel.tsne20, direction = "up")
marker.set <- markers[["1"]]
head(marker.set, 10)
```

This dataframe shows us the in log-fold expression change for each potential marker gene between cluster 1 and every other cluster.

## Comparing gene expression levels across clusters

Once we've identified potential marker genes, we can use a heatmap to compare gene expression in each cell between clusters.

```{r, warning = FALSE, message = FALSE, echo=FALSE}
install.packages("pheatmap")
```

```{r, warning = FALSE, message = FALSE}
# pull the top-most upregulated markers from cluster 1 (compared to the rest of the clusters) and look at their expression in all clusters
top.markers <- rownames(marker.set)[marker.set$Top <= 10]
plotHeatmap(sce.zeisel.tsne20, features = top.markers, order_columns_by = "label")
```

In this heatmap, clusters are on the horizontal, while the top upregulated genes in cluster 1 are on the vertical. The magnitude of the log-fold expression change is indicated by color of each cell.

We can also create a heatmap that shows the mean log-fold change of cluster 1 cells compared to the mean of each other cluster. This can simplify the heatmap and is useful when dealing with many clusters.

```{r, warning = FALSE, message = FALSE}
# AnVIL::install("pheatmap")
library(pheatmap)

#this heatmap lets us compare the average expression of the gene within a cluster compared to the other clusters
logFCs <- getMarkerEffects(marker.set[1:50,])
pheatmap(logFCs, breaks = seq(-5, 5, length.out = 101))
```

Here we see that three genes are generally upregulated in Cluster 1 compared to the other clusters: _Gad1_, _Gad2_, and _Slc6a1_. This is where prior biological knowledge comes in handy, as both _Gad1_ and _Slc6a1_ are known interneuron markers (Zeng et al. 2012).

::: {.reflection}
QUESTION
1. Are there any groups or patterns you see in the second heatmap that look interesting?
:::

# Cell type annotation

Unless you are already an expert in neuronal cell expression, you probably didn't know _Gad1_ and _Slc6a1_ are known interneuron markers until you were told. Luckily, we have several high-quality references databases that can be used for annotating scRNA-seq datasets. Some of these references are useful for identifying cell types (the Zeisel dataset we have been using is a well-known reference for identifying neuronal cell types). Others, such as the Gene Ontology (GO) or Kyoto Encyclopedia of Genes and Genomes (KEGG) collections, are useful for identifying biological processes associated with each cluster.

## Annotating cell types

There are three basic strategies for annotating datasets: match the expression profile of each individual cell to the expression profile of cells from a reference dataset (the reference dataset approach); identify sets of marker genes highly expressed in each cell and match to gene sets from known cell types (the gene set approach); or perform a gene-set enrichment analysis on the marker genes that define each cluster.

The Zeisel dataset we have been working with has actually been annotated all this time.

```{r, warning = FALSE, message = FALSE}
# calculate the top marker genes assigned to each cell type (level1class) in the Zeisel dataset
wilcox.z <- pairwiseWilcox(sce.zeisel.qc, sce.zeisel.qc$level1class,
    lfc = 1, direction = "up")
markers.z <- getTopMarkers(wilcox.z$statistics, wilcox.z$pairs,
    pairwise = FALSE, n = 50)

# look at how many cell-type categories there are, as well as how many cells assigned to each category
lengths(markers.z)
```

This dataset has grouped the individual cells into 7 different categories of neuronal subtypes. We can use the gene sets that define these categories to annotate a second brain scRNA-seq dataset from Tasic et al. (2016). (We will not go through all the data cleaning and clustering steps for this new dataset - having made it this far, we trust you can do that!)

```{r, warning = FALSE, message = FALSE, echo=FALSE}
BiocManager::install("GSEABase")
BiocManager::install("AUCell")
```

```{r, warning = FALSE, message = FALSE}
#load the new dataset
sce.tasic <- TasicBrainData()
```

We first create the gene set lists using the `GSEABase` package. The `AUCell` package identifies and ranks marker sets highly expressed in each cell using an area under the curve (AUC) approach. We can assign cell type identity in the Tasic dataset by taking the marker set with the highest AUC as the label for the cell.

```{r, warning = FALSE, message = FALSE}
# AnVIL::install("GSEABase")
library(GSEABase)

#create a dataset that contains just the information about the Zeisel cell-type categories and the marker genes that define each cell-type
all.sets <- lapply(names(markers.z), function(x) {
    GeneSet(markers.z[[x]], setName = x)
})
all.sets <- GeneSetCollection(all.sets)

# AnVIL::install("AUCell")
library(AUCell)

# rank genes by expression values within each cell
rankings <- AUCell_buildRankings(counts(sce.tasic),
    plotStats = FALSE, verbose = FALSE)

# calculate AUC for each previously-defined marker set (from Zeisel) in the Tasic data
cell.aucs <- AUCell_calcAUC(all.sets, rankings)
results <- t(assay(cell.aucs))
```

After assigning cell type identities to each cluster, a researcher should always verify that the identities make sense. Since the Tasic dataset has also been annotated, we can compare our annotation to the researcher-provided annotation as a sort of sanity check.

```{r, warning = FALSE, message = FALSE}
# assign cell type identity in the Tasic dataset by assumig the marker set with the top AUC is the proper label
new.labels <- colnames(results)[max.col(results)]

# compare our annotations to the annotations provided by the Tasic dataset
tab <- table(new.labels, sce.tasic$broad_type)
tab
```

As you can see, the labels applied by the researchers and by our annotation mostly match. (Pyramidal SS nerves are primarily glutamatergic, so although the categories are labeled differently, those two categories do indeed match!) The major exception is the oligodendrocyte precursor cells, which our annotation called astrocytes. However, this mismatch isn't as concerning as you might think, once you know that both astrocytes and oligodendrocytes come from the same precursor cell lineage.

::: {.reflection}
QUESTION
1. Should we be concerned when 1 or 2 cells are assigned a different cell type identity by our annotation than by the researchers? Why or why not?
:::

```{r, warning = FALSE, message = FALSE}
sessionInfo()
```