Building a custom reference atlas for ProjecTILs

Introduction

ProjecTILs comes with a number of curated references (see e.g. TIL atlas, CD8 viral infection atlas or CD4 viral infection atlas). However, you may want to use your own custom atlas as a reference for dataset projection. This vignette will walk you through the steps required to convert your custom single-cell map into a reference atlas for ProjecTILs.

We are going to use the dataset by Zilionis et al., and generate a reference map for monocyte, macrophages and dendritic cells (MoMacDC) in murine tumors. You should be able to adapt the code for any mouse or human dataset (or integrated collection of datasets) for which you have calculated PCA and UMAP embeddings.

Note: For integrated collections of datasets, a requirement is to have corrected gene expression values for variable genes, e.g. as returned by Seurat or STACAS integration. Integrated objects generated by methods like Harmony, which only return PCA embeddings, cannot currently be used as ProjecTILs references.

R enviroment

Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS = "true")

if (!requireNamespace("renv")) install.packages("renv")
library(renv)
renv::restore()
- The library is already synchronized with the lockfile.
# remotes::install_github('carmonalab/ProjecTILs')

library(Seurat)
library(scGate)
library(ProjecTILs)
library(Matrix)

Get scRNA-seq data

The data matrices for this study can be found at the Gene Expression Omnibus GEO: GSE127465. We will download the data and then subset on the Monocyte-macrophage-DC classifications provided by the authors.

do_download <- T
options(timeout = max(300, getOption("timeout")))

library(GEOquery)
geo_acc <- "GSE127465"

datadir <- "input/Zilionis"
dir.create(datadir, showWarnings = FALSE)

gse <- getGEO(geo_acc)

if (do_download) {
    series <- paste0(geo_acc, "_series_matrix.txt.gz")

    system(paste0("mkdir -p ", datadir))
    getGEOSuppFiles(geo_acc, baseDir = datadir)
}

Read in the data and store them in a Seurat object

# Read metadata
metadata <- sprintf("%s/%s/GSE127465_mouse_cell_metadata_15939x12.tsv.gz", datadir,
    geo_acc)
meta <- read.delim(metadata, header = T)
meta$Barcode_library <- paste(meta$Barcode, meta$Library, sep = "_")
rownames(meta) <- meta$Barcode_library

# Read expression matrix and gene list
matrix <- sprintf("%s/%s/GSE127465_mouse_counts_normalized_15939x28205.mtx.gz", datadir,
    geo_acc)
genes <- sprintf("%s/%s/GSE127465_gene_names_mouse_28205.tsv.gz", datadir, geo_acc)

exp.mat <- readMM(matrix)
gene.list <- read.csv(genes, header = F)$V1

colnames(exp.mat) <- gene.list
rownames(exp.mat) <- meta$Barcode_library

data.Mm <- CreateSeuratObject(t(exp.mat), project = "Zilionis_Mm", meta.data = meta)
## NB: data are already normalized, no counts available
data.Mm@assays$RNA@data <- data.Mm@assays$RNA@counts

# Subset on tumour MoMacDCs for this demo
mo.mac <- subset(data.Mm, subset = Major.cell.type %in% c("MoMacDC", "pDC"))

Generate low-dimensional embeddings using a standard Seurat pipeline

mo.mac <- NormalizeData(mo.mac, verbose = FALSE)
mo.mac <- FindVariableFeatures(mo.mac, nfeatures = 1500, verbose = FALSE)

# bk.list <- unlist(scGate::genes.blacklist.default$Mm)
# VariableFeatures(mo.mac) <- setdiff(VariableFeatures(mo.mac), bk.list)

ndim = 20
seed = 1234
mo.mac <- ScaleData(mo.mac, verbose = TRUE)

mo.mac <- RunPCA(mo.mac, features = VariableFeatures(mo.mac), ndims.print = 1:5,
    nfeatures.print = 5, npcs = ndim)
mo.mac <- RunUMAP(mo.mac, reduction = "pca", dims = 1:ndim, seed.use = seed)
library(ggplot2)
DimPlot(mo.mac, group.by = "Minor.subset", label = T, repel = T, label.size = 4) +
    theme(aspect.ratio = 1) + NoLegend()

Generate ProjecTILs reference

We are now ready to convert the dataset into a ProjecTILs reference. The make.reference function will populate the required metadata fields and return an object that should be directly usable as a reference. A few things to keep in mind:

  • The annotation.column parameter identifies the metadata column to be used as the primary cell labels of the reference. It will be stored as functional.cluster in the metadata of the new reference map.

  • We recommend using recalculate.umap=TRUE to build the reference. This will probably slightly alter the shape of your UMAP, but is necessary to return a UMAP predictive model for embedding new data points after projection. It is possible to preserve the original UMAP visualization by setting recalculate.umap=FALSE, in which case query UMAP embeddings will use an approximate algorithm (see below).

  • If your object is the result of data integration, rembember to set assay="integrated" or set the default assay to integrated prior to running make.reference

ref.momac <- make.reference(ref = mo.mac, ndim = ndim, seed = seed, recalculate.umap = TRUE,
    annotation.column = "Minor.subset")
DimPlot(ref.momac, label = T, repel = T, label.size = 4) + theme(aspect.ratio = 1) +
    NoLegend()

If you really do not want to alter the UMAP space of your reference, you may choose to let recalculate.umap=FALSE. In this case, the original UMAP embeddings are preserved, but the reference won’t have a predictive UMAP model. When new data are projected into this reference, an approximate algorithm based on nearest neighbors in PCA space is used instead of umap::predict.umap for UMAP embedding. This only affects UMAP embeddings - reference-based cell type label classification is normally performed in PCA space and will give identical results regardless of the status of recalculate.umap.

Prepare a reference without altering UMAP embeddings.

ref.B <- make.reference(ref = mo.mac, ndim = ndim, seed = seed, recalculate.umap = FALSE,
    annotation.column = "Minor.subset")
DimPlot(ref.B, label = T, repel = T, label.size = 4) + theme(aspect.ratio = 1)

Test the reference

That’s it! you can already use the dataset as a reference to project new data. We may test it by making a dummy dataset using one of the dataset subtypes, e.g. the DC3 cells.

query <- subset(mo.mac, subset = Minor.subset == "DC3")

query.projected <- Run.ProjecTILs(query, ref = ref.momac, filter.cell = F, skip.normalize = T)

  |                                                                            
  |                                                                      |   0%[1] "Using assay RNA for query"
[1] "Aligning query to reference map for batch-correction..."

Projecting corrected query onto Reference PCA space

Projecting corrected query onto Reference UMAP space

  |                                                                            
  |======================================================================| 100%

See the projection results

a <- plot.projection(ref = ref.momac, query = query.projected, linesize = 0.5) +
    NoLegend() + ggtitle("DC3 re-projected")
b <- plot.statepred.composition(ref.momac, query.projected)
a | b

Optional metadata for your atlas

Several additional metadata can be added to the reference to improve its capabilities, as described below.

Custom color palette

You may store a custom color palette with the atlas, so that all ProjecTILs plotting functions will return a consistent color scheme. Simply store this information in the ref@misc$atlas.palette slot.

library(RColorBrewer)

subtypes <- levels(ref.momac$functional.cluster)
n_subtypes <- length(subtypes)

palette <- brewer.pal(n_subtypes, "Paired")
names(palette) <- subtypes

# Store palette
ref.momac@misc$atlas.palette <- palette
a <- plot.projection(ref = ref.momac, query = query.projected, linesize = 0.5) +
    NoLegend() + ggtitle("DC3 re-projected")
b <- plot.statepred.composition(ref.momac, query.projected)
a | b

scGate models

Your reference map contains a certain subset of cell types. When projecting new data into the reference, one has to ensure that the cell types of the query are represented in the reference. For example, in the MoMacDC reference we generated in this tutorial, we will be able to only project MoMacDCs to obtain sound classifications. If you have carefully sorted the cells of your query for a cell type of interest, that should be sufficient. However, it is useful to have an automated check for ‘gating’ out pure cells for the cell types of the reference. You can store scGate models in ProjecTILs references to inform the projection algorithm which cell types are represented in the reference.

You may design your own gating models, or use some of the models distributed with scGate. If you plan to use the map for cross-species mapping, you may want to store gating models both for mouse and human.

library(scGate)

models <- scGate::get_scGateDB()

gate.mouse <- models$mouse$generic$MoMacDC
gate.human <- models$human$generic$MoMacDC

ref.momac@misc$scGate$mouse <- gate.mouse
ref.momac@misc$scGate$human <- gate.human

Run projection algorithm with automated gating model

query <- subset(mo.mac, subset = Minor.subset == "DC3")

query.projected <- Run.ProjecTILs(query, ref = ref.momac, filter.cell = TRUE, skip.normalize = T)

  |                                                                            
  |                                                                      |   0%[1] "Using assay RNA for query"
[1] "6 out of 299 ( 2% ) non-pure cells removed. Use filter.cells=FALSE to avoid pre-filtering"
[1] "Aligning query to reference map for batch-correction..."

Projecting corrected query onto Reference PCA space

Projecting corrected query onto Reference UMAP space

  |                                                                            
  |======================================================================| 100%

References

ProjecTILs GitHub page - Repository

ProjecTILs case studies - INDEX - Repository