Detecting novel/unrepresented cell states

Background

This case study illustrates how you can apply reference-based analysis of single-cell data to interpret cell states beyond those represented in the reference map. Upon encountering novel cell states, a reference map can also be updated to incorporate the new cell diversity.

For this example, we will use a dataset of CD4+ T cells responding to color adenocarcinoma murine tumors MC38_GP66 (all T cells are specific for the GP66 epitope expressed by the tumor). These cells were isolated both from the tumor (TILs) and from lymphnodes (LN) of two mice, individually. The data are available at GEO - GSE200635. We will project these T cells on a reference map of virus-specific CD4+ T cells, and determine the degree of correspondence of the T cells from the two disease models.

To install the ProjecTILs package for reference-based analysis of scRNA-seq data, please refer to the instructions on the ProjecTILs repository.

R Environment

library(renv)
renv::restore()

library(ggplot2)
library(reshape2)
library(patchwork)
library(ProjecTILs)

scRNA-seq data preparation

We can download the single-cell data directly from Gene Expression Omnibus (GEO).

library(GEOquery)
geo_acc <- "GSE200635"
datadir <- "input/MC38_GP66"
geodir <- sprintf("%s/%s", datadir, geo_acc)
gse <- getGEO(geo_acc)

system(paste0("mkdir -p ", datadir))
getGEOSuppFiles(geo_acc, baseDir = datadir)

system(sprintf("tar -xvf %s/GSE200635_RAW.tar -C %s", geodir, geodir))

# rename files to be understood by Seurat
system(sprintf("mv %s/GSM6040471_matrix.mtx.gz %s/matrix.mtx.gz", geodir, geodir))
system(sprintf("mv %s/GSM6040471_barcodes.tsv.gz %s/barcodes.tsv.gz", geodir, geodir))
system(sprintf("mv %s/GSM6040471_features.tsv.gz %s/features.tsv.gz", geodir, geodir))

Read in 10x data and store as a Seurat object

obj <- Read10X(geodir)

These data have been multiplexed - i.e. different samples were tagged with antibodies and then loaded in a single sequencing run. The following code de-multiplexes the data to reassign cells to their sample of origin. It uses the HTODemux function from Seurat, described in detail in this tutorial.

ID <- "MC38GP66"

# Gene expression
data <- CreateSeuratObject(obj[["Gene Expression"]], project = "MC38_GP66")

# Antibody capture
valid.hashtags <- c(1, 2, 3, 4, 5)
names <- c("LN_1", "LN_2", "TIL_1", "TIL_2", "CD44_2")

abc <- obj[["Antibody Capture"]][valid.hashtags, ]

rownames(abc) <- names
data[["hash"]] <- CreateAssayObject(counts = abc)

data <- NormalizeData(data, assay = "hash", normalization.method = "CLR")
data <- HTODemux(data, assay = "hash", positive.quantile = 0.995, kfunc = "kmeans")
data$Sample <- data$hash.ID
data <- RenameCells(object = data, add.cell.id = ID)
data$barcode <- rownames(data@meta.data)

Idents(data) <- "hash_maxID"
RidgePlot(data, assay = "hash", features = rownames(data[["hash"]]), ncol = 3)

table(data$Sample)

 Doublet   CD44-2     LN-1    TIL-2     LN-2    TIL-1 Negative 
     435      411      203      419      743      301       51 

Remove cells that could not be demultiplexed (doublets and untagged cells).

which.samples <- c("LN-1", "LN-2", "TIL-1", "TIL-2")
data <- subset(data, subset = Sample %in% which.samples)

data$Sample <- factor(data$Sample, levels = which.samples)
data$Tissue <- "LN"
data$Tissue[data$Sample %in% c("TIL-1", "TIL-2")] = "TIL"

data$subject <- factor(data$Sample, levels = c("LN-1", "LN-2", "TIL-1", "TIL-2"),
    labels = c("m1", "m2", "m1", "m2"))

table(data$Tissue)

 LN TIL 
946 720 

One may want to additionally perform standard quality checks (e.g. number of UMIs, percentage of mitochondrial genes etc.), but for the sake of simplicity we will keep all cells in this case study.

Load reference atlas

The CD4+ T cell atlas is described in Andreatta et al. (2021), and can be downloaded from Figshare at: doi.org/10.6084/m9.figshare.16592693.v2

You can also use the commands below to download the atlas directly within R, and load it into memory.

# Download the reference atlas
cd4.atlas.file <- "ref_LCMV_CD4_mouse_release_v1.rds"
if (!file.exists(cd4.atlas.file)) {
    dataUrl <- "https://figshare.com/ndownloader/files/31057081"
    download.file(dataUrl, cd4.atlas.file)
}
ref <- load.reference.map(cd4.atlas.file)
[1] "Loading Custom Reference Atlas..."
[1] "Loaded Custom Reference map ref_LCMV_CD4_mouse_v1"
palette <- ref@misc$atlas.palette
DimPlot(ref, cols = palette)

Project tumor-specific data into viral atlas

The make.projection() function from the ProjecTILs package allows projecting new data into the reference map.

# Project data by subject (two mice separately)
query.list <- SplitObject(data, split.by = "subject")

query.projected <- make.projection(query.list, ref = ref, ncores = 2)

Now see projection on individual samples

query.projected.merged <- Reduce(f = merge.Seurat.embeddings, x = query.projected)

query.projected <- SplitObject(query.projected.merged, split.by = "Tissue")
library(patchwork)

plots <- list()
lgt <- length(query.projected)

for (i in seq_along(query.projected)) {
    sample <- names(query.projected)[i]
    plots[[i]] <- plot.projection(ref, query.projected[[i]], linesize = 0.5, pointsize = 0.5) +
        ggtitle(sample) + NoLegend() + theme(axis.ticks = element_blank(), axis.text = element_blank())

    query.projected[[i]] <- cellstate.predict(ref = ref, query = query.projected[[i]],
        reduction = "umap")

    plots[[i + lgt]] <- plot.statepred.composition(ref, query = query.projected[[i]],
        cols = palette, metric = "Percent") + theme_bw() + theme(axis.text.x = element_blank(),
        legend.position = "none") + ylim(0, 70) + ggtitle(" ")
}

wrap_plots(plots, ncol = 2)

While T cells from the lymphnode are mostly of the follicular helper lineage, tumor-infiltrating T cells are predicted to be mostly Tregs and T helpers. We can inspect a panel of important CD4 T cell markers genes for these subtypes, to verify how well they match the reference profiles.

Diagnostics of query-reference fit

Radar plots for a panel of marker genes are a good initial indication of how well the query can be matched to the reference.

genes4radar <- c("Cxcr6", "Id2", "Tbx21", "Ccl5", "Ly6c2", "Cxcr5", "Tox", "Izumo1r",
    "Tnfsf8", "Tcf7", "Foxp3", "Ctla4")


rr.list <- plot.states.radar(ref = ref, query = query.projected, genes4radar = genes4radar,
    min.cells = 100, return.as.list = T)
rr <- wrap_plots(rr.list)
rr

We can observe that TILs projected in the Th1 effector cluster lack some key markers for Th1 (Ccl5, Ly6c2). This is a first hint that T helpers from the tumor may be different from Th1 cells in a viral context.

Indeed, as seen below, tumor-specific T cells assigned to Th1 cluster have a much lower silhouette coefficient than other subtypes, suggesting that they fit poorly with the reference Th1 effectors.

compute_silhouette(ref, query = query.projected$TIL, normalize.scores = T)
          Cluster Silhouette Silhouette.norm
1 INFI_stimulated 0.16376759       0.8342887
2    Tfh_Effector 0.14483651       0.9115695
3    Th1_Effector 0.06571984       0.3238761
4            Treg 0.21312575       0.9204067
compute_silhouette(ref, query = query.projected$LN, normalize.scores = T)
       Cluster Silhouette Silhouette.norm
1          Tcm 0.26266760       0.6653086
2 Tfh_Effector 0.18716131       1.0000000
3   Tfh_Memory 0.20776726       0.9164924
4 Th1_Effector 0.02240643       0.1104219
5         Treg 0.17593343       0.7597876

What distinguishes these Th-like TILs from the reference Th1 effectors? The find.discriminant.genes() function can help you calculate differentially expressed genes between cells of a specific subset.

library(EnhancedVolcano)
min.pct <- 0.1
min.diff.pct <- 0.3
logfc.threshold <- 2


genes.use.DE <- rownames(ref)
genes.use.DE <- grep("^Gm|Rik$", genes.use.DE, value = T, invert = T)
# Based on Ensembl 'biotype' classifications in mouse, genes starting with
# 'Gm-' or ending with 'Rik' include protein-coding RNAs, long non-coding RNAs,
# and antisense transcripts. Genes with the 'Gm' prefix seem to be enriched for
# pseudogenes.


tab <- find.discriminant.genes(ref = ref, query = query.projected$TIL, state = "Th1_Effector",
    min.pct = min.pct, min.diff.pct = min.diff.pct, logfc.threshold = logfc.threshold,
    which.genes = genes.use.DE)

head(tab)
                p_val avg_log2FC pct.1 pct.2     p_val_adj
Igfbp7  2.649832e-225   2.545170 0.647 0.029 8.728547e-221
Ccl1    1.439674e-213   3.961802 0.524 0.009 4.742286e-209
Ccr8    1.005959e-204   2.813846 0.815 0.097 3.313630e-200
Tnfrsf9 4.136379e-170   3.504702 0.902 0.227 1.362523e-165
Capg    2.506177e-140   2.316081 0.905 0.312 8.255348e-136
Il2ra   1.009715e-136   2.061609 0.658 0.094 3.326003e-132
a <- EnhancedVolcano(tab, lab = rownames(tab), x = "avg_log2FC", y = "p_val", FCcutoff = 1,
    pCutoff = 10^(-10), title = "Th1 effector cells", subtitle = "TILs vs. LCMV spleen",
    drawConnectors = T, max.overlaps = 20)

a

Interestingly, several markers associated with Th2 differentiation are overexpressed in tumoral T helpers, e.g. Ccl1, Ccr8 and Igfbp7, while Th1-associated genes are downregulated (Ccl5, Ly6c2). This suggests that CD4+ T cells acquire distinct effector programs in cancer and infection.

Recalculate map with novel state

We can re-cacalculate the low-dimensional representation of the data to also account for the projected data. In this way, any novelty brought by the new dataset can contribute in determining the low dimensional space.

set.seed(1234)
merged <- recalculate.embeddings(ref, projected = query.projected$TIL, umap.method = "umap",
    resol = 1)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 13440
Number of edges: 500523

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.7759
Number of communities: 15
Elapsed time: 2 seconds
Idents(merged) <- "functional.cluster"
plot.projection(ref = merged, query = subset(merged, subset = ref_or_query == "query"),
    linesize = 0.5) + theme(aspect.ratio = 1) + ggtitle("Recalculated reference space")

Where do the TILs end up in the recalculated embedding?

sub <- merged
sub$functional.cluster[sub$ref_or_query == "query"] = NA
a <- DimPlot(sub, reduction = "umap", cols = palette, group.by = "functional.cluster") +
    theme(aspect.ratio = 1) + theme(axis.ticks = element_blank(), axis.text = element_blank()) +
    NoLegend() + ggtitle("Reference")

sub <- merged
sub$functional.cluster[sub$ref_or_query == "ref"] = NA

b <- DimPlot(sub, reduction = "umap", cols = palette, group.by = "functional.cluster") +
    theme(aspect.ratio = 1) + theme(axis.ticks = element_blank(), axis.text = element_blank()) +
    ggtitle("Projected")

c <- DimPlot(merged, group.by = "seurat_clusters", label = T) + theme(aspect.ratio = 1) +
    theme(axis.ticks = element_blank(), axis.text = element_blank()) + ggtitle("re-clustering") +
    NoLegend()

d <- FeaturePlot(merged, features = "newclusters") + theme(aspect.ratio = 1) + theme(axis.ticks = element_blank(),
    axis.text = element_blank()) + ggtitle("Query-enriched clusters")


(a | b)/(c | d)

Tregs of the query were much more abundant than those of the reference, but they all occupy the same area. Tumor-specific T helpers instead appear to form a separate cluster in the recalculated map.

We can now annotate the new cluster of Th-like cells from the tumor, as well as the combined cluster of Tregs from the reference and the tumor.

merged$functional.cluster[merged$seurat_clusters == 9] <- "Treg"
merged$functional.cluster[merged$seurat_clusters == 12] <- "tumoralTh"

palette2 <- c(palette, tumoralTh = "brown")
merged@misc$atlas.palette <- palette2

merged$functional.cluster.new <- merged$functional.cluster
merged$functional.cluster.new[merged$seurat_clusters == "tumoralTh"] <- merged$seurat_clusters[merged$seurat_clusters ==
    "tumoralTh"]

DimPlot(merged, reduction = "umap", group.by = "functional.cluster.new", cols = palette2) +
    theme(aspect.ratio = 1) + theme(axis.ticks = element_blank(), axis.text = element_blank()) +
    ggtitle("Updated reference map")

This constitutes an updated reference that you can use directly to project and interpret new data!

Further reading

Dataset available on Gene Expression Omnibus - GSE200635

ProjecTILs case studies - INDEX - Repository

ProjecTILs method Andreatta et. al (2021) Nat. Comm. and code

References

  • M. Andreatta, Z. Sherman, A. Tjitropranoto, M. C. Kelly, T. Ciucci, S. J. Carmona “A single-cell reference map delineates CD4+ T cell subtype-specific adaptation during acute and chronic viral infections” eLife 2022 doi: https://doi.org/10.7554/eLife.76339