Projecting scRNA-seq data onto a reference map
ProjecTILs is a computational method to project scRNA-seq data into reference single-cell atlases, enabling their direct comparison in a stable, annotated system of coordinates. This tutorial outlines the main functions implemented in ProjecTILs on a small, simple dataset. For more advanced (and more biologically interesting) applications of ProjecTILs see this list of ProjecTILs case studies
R environment
First, check package dependencies and install ProjecTILs
Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS = "true")
options(timeout = max(300, getOption("timeout")))
if (!requireNamespace("renv")) install.packages("renv")
library(renv)
renv::restore()
- The library is already synchronized with the lockfile.
Load reference atlas and query data
First, load a mouse reference TIL atlas. Several reference maps are
available from
GitHub pr from the SPICA
website. If no reference map file is provided, the function
load.reference.map()
will automatically download it from https://doi.org/10.6084/m9.figshare.12478571
[1] "Loading Default Reference Atlas..."
[1] "/Users/mass/Documents/Projects/Cell_clustering/ProjecTILs.demo/ref_TILAtlas_mouse_v1.rds"
[1] "Loaded Reference map ref_TILAtlas_mouse_v1"
Let’s explore the reference atlas
refCols <- c("#edbe2a", "#A58AFF", "#53B400", "#F8766D", "#00B6EB", "#d1cfcc", "#FF0000",
"#87f6a5", "#e812dd")
DimPlot(ref, label = T, cols = refCols)
See expression of important marker genes across reference subtypes
markers <- c("Cd4", "Cd8a", "Ccr7", "Tcf7", "Pdcd1", "Havcr2", "Tox", "Izumo1r",
"Cxcr6", "Xcl1", "Gzmb", "Gzmk", "Ifng", "Foxp3")
VlnPlot(ref, features = markers, stack = T, flip = T, fill.by = "ident", cols = refCols,
assay = "RNA") + NoLegend()
Now let’s load a query dataset - Miller et al., Nature Immunol (2019)
# A sample data set is provided with the ProjecTILs package
querydata <- ProjecTILs::query_example_seurat
More generally, it is possible to load a query matrix with gene names and barcodes (e.g. 10X format or raw counts)
Run Projection algorithm
The main function in ProjecTILs is Run.ProjecTILs
, which
takes as input a reference map and query dataset. The query will be
batch-corrected and projected into the reference map, with
low-dimensional embeddings (PCA and UMAP) compatible with those of the
reference.
| | | 0%[1] "Using assay RNA for query"
[1] "40 out of 1501 ( 3% ) non-pure cells removed. Use filter.cells=FALSE to avoid pre-filtering"
[1] "Aligning query to reference map for batch-correction..."
Projecting corrected query onto Reference PCA space
Projecting corrected query onto Reference UMAP space
| |======================================================================| 100%
NB: by default, Run.ProjecTILs()
will
pre-filter T cells using scGate. In case the
input dataset is already pre-filtered, you can disable this step using
Run.ProjecTILs(querydata, ref=ref, filter.cells = FALSE)
.
If you are using a custom reference map that is not composed of T cells,
you may specify a different scGate filter using the
scGate_model
parameter.
Visualize projection
Plot projection of new data over the reference in UMAP space. The contour lines display the density of projected query cells onto the reference map.
Plot the predicted composition of the query in terms of reference T cell subtypes
Compare gene expression
How do the gene expression levels compare between reference and query for the different cell states?
Compare gene programs
We may want to compare query and reference for gene programs, rather than individual genes. For example, we can use signatures stored in the SignatuR database and calculate average signature scores per subtype.
remotes::install_github("carmonalab/SignatuR")
library(SignatuR)
programs <- GetSignature(SignatuR$Mm$Programs)
names(programs)
[1] "cellCycle.G1S" "cellCycle.G2M" "HeatShock"
[4] "IfnResp" "Tcell.stemness" "Tcell.cytotoxicity"
[7] "Tcell.exhaustion"
We can obtain per-cell scores using UCell, and then generate radar plots on average signature scores per subtype:
Compare cell states across conditions
If we have multiple conditions (e.g. control vs. treatment, or samples from different tissues), we can search for discriminant genes between conditions (otherwise, by default this analysis is performed using the reference as the ‘control’)
# Simulate a condition which e.g. increases Gzmb expression compared to control
query.control <- subset(query.projected, subset = Gzmb < 1.5)
query.perturb <- subset(query.projected, subset = Gzmb >= 1.5)
plot.states.radar(ref, query = list(Control = query.control, Query = query.perturb))
In this toy example, where we simulated a condition that increases Gzmb expression compared to control, we expect cytotoxicity genes to drive differences.
discriminantGenes <- find.discriminant.genes(ref = ref, query = query.perturb, query.control = query.control,
state = "CD8_Tex")
head(discriminantGenes, n = 10)
p_val avg_log2FC pct.1 pct.2 p_val_adj
Gzmb 8.036650e-138 2.9879497 1.000 0.843 2.197301e-133
Gzmc 1.866108e-48 3.2420851 0.804 0.412 5.102125e-44
AA467197 3.757184e-36 1.7734309 0.845 0.502 1.027252e-31
Lgals1 2.526012e-34 0.4732254 1.000 1.000 6.906369e-30
Ccl3 7.093916e-23 1.6487859 0.901 0.737 1.939548e-18
Mt1 3.450720e-22 1.5050687 0.766 0.494 9.434614e-18
Serpinb6b 2.852272e-21 1.0106648 0.851 0.616 7.798398e-17
Ccl9 6.441550e-21 2.6732170 0.485 0.169 1.761184e-16
Prf1 2.027946e-20 1.8743347 0.500 0.180 5.544607e-16
Gzme 5.399204e-20 3.2061630 0.460 0.161 1.476196e-15
We can use a volcano plot to display differentially expressed genes:
library(EnhancedVolcano)
EnhancedVolcano(discriminantGenes, lab = rownames(discriminantGenes), x = "avg_log2FC",
y = "p_val", pCutoff = 1e-09, FCcutoff = 0.5, labSize = 5, legendPosition = "none",
drawConnectors = F, title = "Gzmb_high vs. Gzmb_low (Tex)")
Using a random subsetting, p-values should not be significant:
rand.list <- ProjecTILs:::randomSplit(query.projected, n = 2, seed = 1)
discriminantGenes <- find.discriminant.genes(ref = ref, query = rand.list[[1]], query.control = rand.list[[2]],
logfc.threshold = 0.01, state = "CD8_Tex")
EnhancedVolcano(discriminantGenes, lab = rownames(discriminantGenes), x = "avg_log2FC",
y = "p_val", pCutoff = 1e-09, FCcutoff = 0.5, labSize = 5, legendPosition = "none",
drawConnectors = F, title = "Random split (Tex)")
Using ProjecTILs as a classifier
If you do not wish to embed your query data into the reference space, you may also simply use ProjecTILs as a cell type classifier. This may be useful to annotate cell types in your query without altering existing embeddings.
See the query dataset in unsupervised low-dim embeddings:
querydata <- querydata |>
FindVariableFeatures(nfeatures = 500) |>
ScaleData() |>
RunPCA(npcs = 10) |>
RunUMAP(dims = 1:10)
DimPlot(querydata)
The ProjecTILs.classifier
function applies
reference-projection but does not alter the current embeddings.
| | | 0%[1] "Using assay RNA for query"
[1] "40 out of 1501 ( 3% ) non-pure cells removed. Use filter.cells=FALSE to avoid pre-filtering"
[1] "Aligning query to reference map for batch-correction..."
Projecting corrected query onto Reference PCA space
Projecting corrected query onto Reference UMAP space
| |======================================================================| 100%
palette <- c("#edbe2a", "#A58AFF", "#53B400", "#F8766D", "#00B6EB", "#d1cfcc", "#FF0000",
"#87f6a5", "#e812dd", "#777777")
names(palette) <- c(levels(ref$functional.cluster), "NA")
DimPlot(querydata, group.by = "functional.cluster", cols = palette)
We can confirm that most of the cells were classified as CD8_Tex. Please note that filtered cells (i.e. those that were removed by the scGate filter) are assigned the NA label, as they correspond to cell types that are not present in the reference.
Further reading
For applications of ProjecTILs to gain biological insight on several public datasets please see the ProjecTILs case studies - INDEX - Repository
To generate your own reference map for ProjecTILs see Custom reference map tutorial
Publication: Andreatta et al Nat. Comm. 2021
ProjecTILs repository