Projecting scRNA-seq data onto a reference map

ProjecTILs is a computational method to project scRNA-seq data into reference single-cell atlases, enabling their direct comparison in a stable, annotated system of coordinates. This tutorial outlines the main functions implemented in ProjecTILs on a small, simple dataset. For more advanced (and more biologically interesting) applications of ProjecTILs see this list of ProjecTILs case studies

R environment

First, check package dependencies and install ProjecTILs

Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS = "true")
options(timeout = max(300, getOption("timeout")))

if (!requireNamespace("renv")) install.packages("renv")
library(renv)
renv::restore()
- The library is already synchronized with the lockfile.
# remotes::install_github('carmonalab/STACAS')
# remotes::install_github('carmonalab/ProjecTILs')

library(ProjecTILs)
library(Seurat)

Load reference atlas and query data

First, load a mouse reference TIL atlas. Several reference maps are available from GitHub pr from the SPICA website. If no reference map file is provided, the function load.reference.map() will automatically download it from https://doi.org/10.6084/m9.figshare.12478571

ref <- load.reference.map()
[1] "Loading Default Reference Atlas..."
[1] "/Users/mass/Documents/Projects/Cell_clustering/ProjecTILs.demo/ref_TILAtlas_mouse_v1.rds"
[1] "Loaded Reference map ref_TILAtlas_mouse_v1"

Let’s explore the reference atlas

refCols <- c("#edbe2a", "#A58AFF", "#53B400", "#F8766D", "#00B6EB", "#d1cfcc", "#FF0000",
    "#87f6a5", "#e812dd")
DimPlot(ref, label = T, cols = refCols)

See expression of important marker genes across reference subtypes

markers <- c("Cd4", "Cd8a", "Ccr7", "Tcf7", "Pdcd1", "Havcr2", "Tox", "Izumo1r",
    "Cxcr6", "Xcl1", "Gzmb", "Gzmk", "Ifng", "Foxp3")
VlnPlot(ref, features = markers, stack = T, flip = T, fill.by = "ident", cols = refCols,
    assay = "RNA") + NoLegend()

Now let’s load a query dataset - Miller et al., Nature Immunol (2019)

# A sample data set is provided with the ProjecTILs package
querydata <- ProjecTILs::query_example_seurat

More generally, it is possible to load a query matrix with gene names and barcodes (e.g. 10X format or raw counts)

## Raw count matrix from GEO
library(GEOquery)
geo_acc <- "GSE86028"
getGEOSuppFiles(geo_acc)

fname2 <- sprintf("%s/GSE86028_TILs_sc_wt_mtko.tpm.log2.txt.gz", geo_acc)
querydata2 <- read.sc.query(fname2, type = "raw.log2")

Run Projection algorithm

The main function in ProjecTILs is Run.ProjecTILs, which takes as input a reference map and query dataset. The query will be batch-corrected and projected into the reference map, with low-dimensional embeddings (PCA and UMAP) compatible with those of the reference.

query.projected <- Run.ProjecTILs(querydata, ref = ref)
  |                                                                              |                                                                      |   0%[1] "Using assay RNA for query"
[1] "40 out of 1501 ( 3% ) non-pure cells removed. Use filter.cells=FALSE to avoid pre-filtering"
[1] "Aligning query to reference map for batch-correction..."

Projecting corrected query onto Reference PCA space

Projecting corrected query onto Reference UMAP space
  |                                                                              |======================================================================| 100%

NB: by default, Run.ProjecTILs() will pre-filter T cells using scGate. In case the input dataset is already pre-filtered, you can disable this step using Run.ProjecTILs(querydata, ref=ref, filter.cells = FALSE). If you are using a custom reference map that is not composed of T cells, you may specify a different scGate filter using the scGate_model parameter.

Visualize projection

Plot projection of new data over the reference in UMAP space. The contour lines display the density of projected query cells onto the reference map.

plot.projection(ref, query.projected, linesize = 0.5, pointsize = 0.5)

Plot the predicted composition of the query in terms of reference T cell subtypes

plot.statepred.composition(ref, query.projected, metric = "Percent")

Compare gene expression

How do the gene expression levels compare between reference and query for the different cell states?

genes4radar = c("Foxp3", "Cd4", "Cd8a", "Tcf7", "Ccr7", "Sell", "Gzmb", "Gzmk", "Pdcd1",
    "Havcr2", "Tox", "Mki67")

plot.states.radar(ref, query = query.projected, genes4radar = genes4radar, min.cells = 20)

Compare gene programs

We may want to compare query and reference for gene programs, rather than individual genes. For example, we can use signatures stored in the SignatuR database and calculate average signature scores per subtype.

remotes::install_github("carmonalab/SignatuR")
library(SignatuR)

programs <- GetSignature(SignatuR$Mm$Programs)
names(programs)
[1] "cellCycle.G1S"      "cellCycle.G2M"      "HeatShock"         
[4] "IfnResp"            "Tcell.stemness"     "Tcell.cytotoxicity"
[7] "Tcell.exhaustion"  

We can obtain per-cell scores using UCell, and then generate radar plots on average signature scores per subtype:

library(UCell)
ref <- AddModuleScore_UCell(ref, features = programs, assay = "RNA", name = NULL)
query.projected <- AddModuleScore_UCell(query.projected, features = programs, assay = "RNA",
    name = NULL)

plot.states.radar(ref, query = query.projected, meta4radar = names(programs))

Compare cell states across conditions

If we have multiple conditions (e.g. control vs. treatment, or samples from different tissues), we can search for discriminant genes between conditions (otherwise, by default this analysis is performed using the reference as the ‘control’)

# Simulate a condition which e.g. increases Gzmb expression compared to control
query.control <- subset(query.projected, subset = Gzmb < 1.5)
query.perturb <- subset(query.projected, subset = Gzmb >= 1.5)

plot.states.radar(ref, query = list(Control = query.control, Query = query.perturb))

In this toy example, where we simulated a condition that increases Gzmb expression compared to control, we expect cytotoxicity genes to drive differences.

discriminantGenes <- find.discriminant.genes(ref = ref, query = query.perturb, query.control = query.control,
    state = "CD8_Tex")
head(discriminantGenes, n = 10)
                  p_val avg_log2FC pct.1 pct.2     p_val_adj
Gzmb      8.036650e-138  2.9879497 1.000 0.843 2.197301e-133
Gzmc       1.866108e-48  3.2420851 0.804 0.412  5.102125e-44
AA467197   3.757184e-36  1.7734309 0.845 0.502  1.027252e-31
Lgals1     2.526012e-34  0.4732254 1.000 1.000  6.906369e-30
Ccl3       7.093916e-23  1.6487859 0.901 0.737  1.939548e-18
Mt1        3.450720e-22  1.5050687 0.766 0.494  9.434614e-18
Serpinb6b  2.852272e-21  1.0106648 0.851 0.616  7.798398e-17
Ccl9       6.441550e-21  2.6732170 0.485 0.169  1.761184e-16
Prf1       2.027946e-20  1.8743347 0.500 0.180  5.544607e-16
Gzme       5.399204e-20  3.2061630 0.460 0.161  1.476196e-15

We can use a volcano plot to display differentially expressed genes:

library(EnhancedVolcano)
EnhancedVolcano(discriminantGenes, lab = rownames(discriminantGenes), x = "avg_log2FC",
    y = "p_val", pCutoff = 1e-09, FCcutoff = 0.5, labSize = 5, legendPosition = "none",
    drawConnectors = F, title = "Gzmb_high vs. Gzmb_low (Tex)")

Using a random subsetting, p-values should not be significant:

rand.list <- ProjecTILs:::randomSplit(query.projected, n = 2, seed = 1)
discriminantGenes <- find.discriminant.genes(ref = ref, query = rand.list[[1]], query.control = rand.list[[2]],
    logfc.threshold = 0.01, state = "CD8_Tex")
EnhancedVolcano(discriminantGenes, lab = rownames(discriminantGenes), x = "avg_log2FC",
    y = "p_val", pCutoff = 1e-09, FCcutoff = 0.5, labSize = 5, legendPosition = "none",
    drawConnectors = F, title = "Random split (Tex)")

Using ProjecTILs as a classifier

If you do not wish to embed your query data into the reference space, you may also simply use ProjecTILs as a cell type classifier. This may be useful to annotate cell types in your query without altering existing embeddings.

See the query dataset in unsupervised low-dim embeddings:

querydata <- querydata |>
    FindVariableFeatures(nfeatures = 500) |>
    ScaleData() |>
    RunPCA(npcs = 10) |>
    RunUMAP(dims = 1:10)

DimPlot(querydata)

The ProjecTILs.classifier function applies reference-projection but does not alter the current embeddings.

querydata <- ProjecTILs.classifier(query = querydata, ref = ref)
  |                                                                              |                                                                      |   0%[1] "Using assay RNA for query"
[1] "40 out of 1501 ( 3% ) non-pure cells removed. Use filter.cells=FALSE to avoid pre-filtering"
[1] "Aligning query to reference map for batch-correction..."

Projecting corrected query onto Reference PCA space

Projecting corrected query onto Reference UMAP space
  |                                                                              |======================================================================| 100%
palette <- c("#edbe2a", "#A58AFF", "#53B400", "#F8766D", "#00B6EB", "#d1cfcc", "#FF0000",
    "#87f6a5", "#e812dd", "#777777")
names(palette) <- c(levels(ref$functional.cluster), "NA")
DimPlot(querydata, group.by = "functional.cluster", cols = palette)

We can confirm that most of the cells were classified as CD8_Tex. Please note that filtered cells (i.e. those that were removed by the scGate filter) are assigned the NA label, as they correspond to cell types that are not present in the reference.

Further reading

For applications of ProjecTILs to gain biological insight on several public datasets please see the ProjecTILs case studies - INDEX - Repository

To generate your own reference map for ProjecTILs see Custom reference map tutorial

Publication: Andreatta et al Nat. Comm. 2021

ProjecTILs repository