Detecting novel/unrepresented cell states
Background
This case study illustrates how you can apply reference-based analysis of single-cell data to interpret cell states beyond those represented in the reference map. Upon encountering novel cell states, a reference map can also be updated to incorporate the new cell diversity.
For this example, we will use a dataset of CD4+ T cells responding to color adenocarcinoma murine tumors MC38_GP66 (all T cells are specific for the GP66 epitope expressed by the tumor). These cells were isolated both from the tumor (TILs) and from lymphnodes (LN) of two mice, individually. The data are available at GEO - GSE200635. We will project these T cells on a reference map of virus-specific CD4+ T cells, and determine the degree of correspondence of the T cells from the two disease models.
To install the ProjecTILs package for reference-based analysis of scRNA-seq data, please refer to the instructions on the ProjecTILs repository.
R Environment
scRNA-seq data preparation
We can download the single-cell data directly from Gene Expression Omnibus (GEO).
library(GEOquery)
geo_acc <- "GSE200635"
datadir <- "input/MC38_GP66"
geodir <- sprintf("%s/%s", datadir, geo_acc)
gse <- getGEO(geo_acc)
system(paste0("mkdir -p ", datadir))
getGEOSuppFiles(geo_acc, baseDir = datadir)
system(sprintf("tar -xvf %s/GSE200635_RAW.tar -C %s", geodir, geodir))
# rename files to be understood by Seurat
system(sprintf("mv %s/GSM6040471_matrix.mtx.gz %s/matrix.mtx.gz", geodir, geodir))
system(sprintf("mv %s/GSM6040471_barcodes.tsv.gz %s/barcodes.tsv.gz", geodir, geodir))
system(sprintf("mv %s/GSM6040471_features.tsv.gz %s/features.tsv.gz", geodir, geodir))
Read in 10x data and store as a Seurat object
These data have been multiplexed - i.e. different samples were tagged
with antibodies and then loaded in a single sequencing run. The
following code de-multiplexes the data to reassign cells to their sample
of origin. It uses the HTODemux
function from Seurat,
described in detail in
this tutorial.
ID <- "MC38GP66"
# Gene expression
data <- CreateSeuratObject(obj[["Gene Expression"]], project = "MC38_GP66")
data <- NormalizeData(data, assay = "RNA")
# Antibody capture
valid.hashtags <- c(1, 2, 3, 4, 5)
names <- c("LN_1", "LN_2", "TIL_1", "TIL_2", "CD44_2")
abc <- obj[["Antibody Capture"]][valid.hashtags, ]
rownames(abc) <- names
data[["hash"]] <- CreateAssayObject(counts = abc)
data <- NormalizeData(data, assay = "hash", normalization.method = "CLR")
data <- HTODemux(data, assay = "hash", positive.quantile = 0.995, kfunc = "kmeans")
data$Sample <- data$hash.ID
data <- RenameCells(object = data, add.cell.id = ID)
data$barcode <- rownames(data@meta.data)
Idents(data) <- "hash_maxID"
RidgePlot(data, assay = "hash", features = rownames(data[["hash"]]), ncol = 3)
Doublet CD44-2 LN-1 TIL-2 LN-2 TIL-1 Negative
435 411 203 419 743 301 51
Remove cells that could not be demultiplexed (doublets and untagged cells).
which.samples <- c("LN-1", "LN-2", "TIL-1", "TIL-2")
data <- subset(data, subset = Sample %in% which.samples)
data$Sample <- factor(data$Sample, levels = which.samples)
data$Tissue <- "LN"
data$Tissue[data$Sample %in% c("TIL-1", "TIL-2")] = "TIL"
data$subject <- factor(data$Sample, levels = c("LN-1", "LN-2", "TIL-1", "TIL-2"),
labels = c("m1", "m2", "m1", "m2"))
table(data$Tissue)
LN TIL
946 720
One may want to additionally perform standard quality checks (e.g. number of UMIs, percentage of mitochondrial genes etc.), but for the sake of simplicity we will keep all cells in this case study.
Load reference atlas
The CD4+ T cell atlas is described in Andreatta et al. (2021), and can be downloaded from Figshare at: doi.org/10.6084/m9.figshare.16592693.v2
You can also use the commands below to download the atlas directly within R, and load it into memory.
# Download the reference atlas
cd4.atlas.file <- "ref_LCMV_CD4_mouse_release_v1.rds"
if (!file.exists(cd4.atlas.file)) {
dataUrl <- "https://figshare.com/ndownloader/files/31057081"
download.file(dataUrl, cd4.atlas.file)
}
ref <- load.reference.map(cd4.atlas.file)
[1] "Loading Custom Reference Atlas..."
[1] "Loaded Custom Reference map ref_LCMV_CD4_mouse_v1"
Diagnostics of query-reference fit
Radar plots for a panel of marker genes are a good initial indication of how well the query can be matched to the reference.
genes4radar <- c("Cxcr6", "Id2", "Tbx21", "Ccl5", "Ly6c2", "Cxcr5", "Tox", "Izumo1r",
"Tnfsf8", "Tcf7", "Foxp3", "Ctla4")
rr.list <- plot.states.radar(ref = ref, query = query.projected, genes4radar = genes4radar,
min.cells = 100, return.as.list = T)
rr <- wrap_plots(rr.list)
rr
We can observe that TILs projected in the Th1 effector cluster lack some key markers for Th1 (Ccl5, Ly6c2). This is a first hint that T helpers from the tumor may be different from Th1 cells in a viral context.
Indeed, as seen below, tumor-specific T cells assigned to Th1 cluster have a much lower silhouette coefficient than other subtypes, suggesting that they fit poorly with the reference Th1 effectors.
Cluster Silhouette Silhouette.norm
1 INFI_stimulated 0.13923288 0.7093004
2 Tfh_Effector 0.11312605 0.7119907
3 Th1_Effector 0.05240514 0.2582595
4 Treg 0.24598558 1.0000000
Cluster Silhouette Silhouette.norm
1 Tcm 0.2750466 0.6966634
2 Tfh_Effector 0.1846125 1.0000000
3 Tfh_Memory 0.2067704 0.9120953
4 Treg 0.2142372 0.9252067
What distinguishes these Th-like TILs from the reference Th1
effectors? The find.discriminant.genes()
function can help
you calculate differentially expressed genes between cells of a specific
subset.
library(EnhancedVolcano)
min.pct <- 0.1
min.diff.pct <- 0.3
logfc.threshold <- 2
genes.use.DE <- rownames(ref)
genes.use.DE <- grep("^Gm|Rik$", genes.use.DE, value = T, invert = T)
# Based on Ensembl 'biotype' classifications in mouse, genes starting with
# 'Gm-' or ending with 'Rik' include protein-coding RNAs, long non-coding RNAs,
# and antisense transcripts. Genes with the 'Gm' prefix seem to be enriched for
# pseudogenes.
tab <- find.discriminant.genes(ref = ref, query = query.projected$TIL, state = "Th1_Effector",
min.pct = min.pct, min.diff.pct = min.diff.pct, logfc.threshold = logfc.threshold,
which.genes = genes.use.DE)
head(tab)
p_val avg_log2FC pct.1 pct.2 p_val_adj
Ccl1 2.761870e-251 7.080878 0.614 0.009 9.097601e-247
Igfbp7 9.457282e-249 6.322539 0.721 0.029 3.115229e-244
Ccr8 1.187229e-193 4.582086 0.833 0.097 3.910733e-189
Ebi3 1.835231e-183 4.703946 0.644 0.041 6.045251e-179
Lgals7 9.151339e-164 5.747199 0.622 0.051 3.014451e-159
Tnfrsf9 6.159990e-147 4.801449 0.893 0.227 2.029101e-142
a <- EnhancedVolcano(tab, lab = rownames(tab), x = "avg_log2FC", y = "p_val", FCcutoff = 1,
pCutoff = 10^(-10), title = "Th1 effector cells", subtitle = "TILs vs. LCMV spleen",
drawConnectors = T, max.overlaps = 20)
a
Interestingly, several markers associated with Th2 differentiation are overexpressed in tumoral T helpers, e.g. Ccl1, Ccr8 and Igfbp7, while Th1-associated genes are downregulated (Ccl5, Ly6c2). This suggests that CD4+ T cells acquire distinct effector programs in cancer and infection.
Recalculate map with novel state
We can re-cacalculate the low-dimensional representation of the data to also account for the projected data. In this way, any novelty brought by the new dataset can contribute in determining the low dimensional space.
set.seed(1234)
merged <- recalculate.embeddings(ref, projected = query.projected$TIL, umap.method = "umap",
resol = 1)
Idents(merged) <- "functional.cluster"
plot.projection(ref = merged, query = subset(merged, subset = ref_or_query == "query"),
linesize = 0.5) + theme(aspect.ratio = 1) + ggtitle("Recalculated reference space")
Where do the TILs end up in the recalculated embedding?
sub <- merged
sub$functional.cluster[sub$ref_or_query == "query"] = NA
a <- DimPlot(sub, reduction = "umap", cols = palette, group.by = "functional.cluster") +
theme(aspect.ratio = 1) + theme(axis.ticks = element_blank(), axis.text = element_blank()) +
NoLegend() + ggtitle("Reference")
sub <- merged
sub$functional.cluster[sub$ref_or_query == "ref"] = NA
b <- DimPlot(sub, reduction = "umap", cols = palette, group.by = "functional.cluster") +
theme(aspect.ratio = 1) + theme(axis.ticks = element_blank(), axis.text = element_blank()) +
ggtitle("Projected")
c <- DimPlot(merged, group.by = "seurat_clusters", label = T) + theme(aspect.ratio = 1) +
theme(axis.ticks = element_blank(), axis.text = element_blank()) + ggtitle("re-clustering") +
NoLegend()
d <- FeaturePlot(merged, features = "newclusters") + theme(aspect.ratio = 1) + theme(axis.ticks = element_blank(),
axis.text = element_blank()) + ggtitle("Query-enriched clusters")
(a | b)/(c | d)
Tregs of the query were much more abundant than those of the reference, but they all occupy the same area. Tumor-specific T helpers instead appear to form a separate cluster in the recalculated map.
We can now annotate the new cluster of Th-like cells from the tumor, as well as the combined cluster of Tregs from the reference and the tumor.
merged$functional.cluster[merged$seurat_clusters == 9] <- "Treg"
merged$functional.cluster[merged$seurat_clusters == 12] <- "tumoralTh"
palette2 <- c(palette, tumoralTh = "brown")
merged@misc$atlas.palette <- palette2
merged$functional.cluster.new <- merged$functional.cluster
merged$functional.cluster.new[merged$seurat_clusters == "tumoralTh"] <- merged$seurat_clusters[merged$seurat_clusters ==
"tumoralTh"]
DimPlot(merged, reduction = "umap", group.by = "functional.cluster.new", cols = palette2) +
theme(aspect.ratio = 1) + theme(axis.ticks = element_blank(), axis.text = element_blank()) +
ggtitle("Updated reference map")
This constitutes an updated reference that you can use directly to project and interpret new data!
Further reading
Dataset available on Gene Expression Omnibus - GSE200635
ProjecTILs case studies - INDEX - Repository
ProjecTILs method Andreatta et. al (2021) Nat. Comm. and code
References
- M. Andreatta, Z. Sherman, A. Tjitropranoto, M. C. Kelly, T. Ciucci, S. J. Carmona “A single-cell reference map delineates CD4+ T cell subtype-specific adaptation during acute and chronic viral infections” eLife 2022 doi: https://doi.org/10.7554/eLife.76339