Building a custom reference atlas for ProjecTILs
Introduction
ProjecTILs comes with a number of curated references (see e.g. TIL atlas, CD8 viral infection atlas or CD4 viral infection atlas). However, you may want to use your own custom atlas as a reference for dataset projection. This vignette will walk you through the steps required to convert your custom single-cell map into a reference atlas for ProjecTILs.
We are going to use the dataset by Zilionis et al., and generate a reference map for monocyte, macrophages and dendritic cells (MoMacDC) in murine tumors. You should be able to adapt the code for any mouse or human dataset (or integrated collection of datasets) for which you have calculated PCA and UMAP embeddings.
Note: For integrated collections of datasets, a requirement is to have corrected gene expression values for variable genes, e.g. as returned by Seurat or STACAS integration. Integrated objects generated by methods like Harmony, which only return PCA embeddings, cannot currently be used as ProjecTILs references.
R enviroment
Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS = "true")
if (!requireNamespace("renv")) install.packages("renv")
library(renv)
renv::restore()
- The library is already synchronized with the lockfile.
Get scRNA-seq data
The data matrices for this study can be found at the Gene Expression Omnibus GEO: GSE127465. We will download the data and then subset on the Monocyte-macrophage-DC classifications provided by the authors.
do_download <- T
options(timeout = max(300, getOption("timeout")))
library(GEOquery)
geo_acc <- "GSE127465"
datadir <- "input/Zilionis"
dir.create(datadir, showWarnings = FALSE)
gse <- getGEO(geo_acc)
if (do_download) {
series <- paste0(geo_acc, "_series_matrix.txt.gz")
system(paste0("mkdir -p ", datadir))
getGEOSuppFiles(geo_acc, baseDir = datadir)
}
Read in the data and store them in a Seurat object
# Read metadata
metadata <- sprintf("%s/%s/GSE127465_mouse_cell_metadata_15939x12.tsv.gz", datadir,
geo_acc)
meta <- read.delim(metadata, header = T)
meta$Barcode_library <- paste(meta$Barcode, meta$Library, sep = "_")
rownames(meta) <- meta$Barcode_library
# Read expression matrix and gene list
matrix <- sprintf("%s/%s/GSE127465_mouse_counts_normalized_15939x28205.mtx.gz", datadir,
geo_acc)
genes <- sprintf("%s/%s/GSE127465_gene_names_mouse_28205.tsv.gz", datadir, geo_acc)
exp.mat <- readMM(matrix)
gene.list <- read.csv(genes, header = F)$V1
colnames(exp.mat) <- gene.list
rownames(exp.mat) <- meta$Barcode_library
data.Mm <- CreateSeuratObject(t(exp.mat), project = "Zilionis_Mm", meta.data = meta)
## NB: data are already normalized, no counts available
data.Mm@assays$RNA@data <- data.Mm@assays$RNA@counts
# Subset on tumour MoMacDCs for this demo
mo.mac <- subset(data.Mm, subset = Major.cell.type %in% c("MoMacDC", "pDC"))
Generate low-dimensional embeddings using a standard Seurat pipeline
mo.mac <- NormalizeData(mo.mac, verbose = FALSE)
mo.mac <- FindVariableFeatures(mo.mac, nfeatures = 1500, verbose = FALSE)
# bk.list <- unlist(scGate::genes.blacklist.default$Mm)
# VariableFeatures(mo.mac) <- setdiff(VariableFeatures(mo.mac), bk.list)
ndim = 20
seed = 1234
mo.mac <- ScaleData(mo.mac, verbose = TRUE)
mo.mac <- RunPCA(mo.mac, features = VariableFeatures(mo.mac), ndims.print = 1:5,
nfeatures.print = 5, npcs = ndim)
mo.mac <- RunUMAP(mo.mac, reduction = "pca", dims = 1:ndim, seed.use = seed)
Generate ProjecTILs reference
We are now ready to convert the dataset into a ProjecTILs reference.
The make.reference
function will populate the required
metadata fields and return an object that should be directly usable as a
reference. A few things to keep in mind:
The
annotation.column
parameter identifies the metadata column to be used as the primary cell labels of the reference. It will be stored asfunctional.cluster
in the metadata of the new reference map.We recommend using
recalculate.umap=TRUE
to build the reference. This will probably slightly alter the shape of your UMAP, but is necessary to return a UMAP predictive model for embedding new data points after projection. It is possible to preserve the original UMAP visualization by settingrecalculate.umap=FALSE
, in which case query UMAP embeddings will use an approximate algorithm (see below).If your object is the result of data integration, rembember to set
assay="integrated"
or set the default assay to integrated prior to runningmake.reference
ref.momac <- make.reference(ref = mo.mac, ndim = ndim, seed = seed, recalculate.umap = TRUE,
annotation.column = "Minor.subset")
DimPlot(ref.momac, label = T, repel = T, label.size = 4) + theme(aspect.ratio = 1) +
NoLegend()
If you really do not want to alter the UMAP space of your reference,
you may choose to let recalculate.umap=FALSE
. In this case,
the original UMAP embeddings are preserved, but the reference won’t have
a predictive UMAP model. When new data are projected into this
reference, an approximate algorithm based on nearest neighbors in PCA
space is used instead of umap::predict.umap
for UMAP
embedding. This only affects UMAP embeddings - reference-based cell type
label classification is normally performed in PCA space and will give
identical results regardless of the status of
recalculate.umap
.
Prepare a reference without altering UMAP embeddings.
Test the reference
That’s it! you can already use the dataset as a reference to project new data. We may test it by making a dummy dataset using one of the dataset subtypes, e.g. the DC3 cells.
query <- subset(mo.mac, subset = Minor.subset == "DC3")
query.projected <- Run.ProjecTILs(query, ref = ref.momac, filter.cell = F, skip.normalize = T)
|
| | 0%[1] "Using assay RNA for query"
[1] "Aligning query to reference map for batch-correction..."
Projecting corrected query onto Reference PCA space
Projecting corrected query onto Reference UMAP space
|
|======================================================================| 100%
See the projection results
Optional metadata for your atlas
Several additional metadata can be added to the reference to improve its capabilities, as described below.
Custom color palette
You may store a custom color palette with the atlas, so that all
ProjecTILs plotting functions will return a consistent color scheme.
Simply store this information in the ref@misc$atlas.palette
slot.
scGate models
Your reference map contains a certain subset of cell types. When projecting new data into the reference, one has to ensure that the cell types of the query are represented in the reference. For example, in the MoMacDC reference we generated in this tutorial, we will be able to only project MoMacDCs to obtain sound classifications. If you have carefully sorted the cells of your query for a cell type of interest, that should be sufficient. However, it is useful to have an automated check for ‘gating’ out pure cells for the cell types of the reference. You can store scGate models in ProjecTILs references to inform the projection algorithm which cell types are represented in the reference.
You may design your own gating models, or use some of the models distributed with scGate. If you plan to use the map for cross-species mapping, you may want to store gating models both for mouse and human.
library(scGate)
models <- scGate::get_scGateDB()
gate.mouse <- models$mouse$generic$MoMacDC
gate.human <- models$human$generic$MoMacDC
ref.momac@misc$scGate$mouse <- gate.mouse
ref.momac@misc$scGate$human <- gate.human
Run projection algorithm with automated gating model
query <- subset(mo.mac, subset = Minor.subset == "DC3")
query.projected <- Run.ProjecTILs(query, ref = ref.momac, filter.cell = TRUE, skip.normalize = T)
|
| | 0%[1] "Using assay RNA for query"
[1] "6 out of 299 ( 2% ) non-pure cells removed. Use filter.cells=FALSE to avoid pre-filtering"
[1] "Aligning query to reference map for batch-correction..."
Projecting corrected query onto Reference PCA space
Projecting corrected query onto Reference UMAP space
|
|======================================================================| 100%