scGate: marker-based purification of cell types from single-cell RNA-seq datasets

This demo illustrates the main functionalities of the scGate package for purifying cell type populations of interest from heterogeneous single-cell cell datasets. We start from basic, single-gene filters and move gradually to more complex hierarchical models composed of multi-gene signatures. We show these examples on public data, but you should be able to adapt them to your own single-cell datasets.

Setting up the environment

library(renv)
renv::restore()

library(ggplot2)
library(dplyr)
library(patchwork)
library(viridis)
library(Seurat)

#Install scGate
#install.packages("scGate")
library(scGate)

Loading demo datasets

This helper function retrieves some toy datasets with cell type annotations, useful to test our models:

testing.datasets <- scGate::get_testing_data(version = "hsa.latest")
palette <- c(list(Impure = "gray", Pure = "green"))

Let’s play with a (downsampled) dataset of PBMCs from Hao et al. 2021 (https://doi.org/10.1016/j.cell.2021.04.048)

obj <- testing.datasets[["Satija"]]
DimPlot(obj, label = T, repel = T, group.by = "celltype.l1") + theme(legend.position = "none",
    aspect.ratio = 1)

Creating a simple gating model

Now let’s setup a simple scGate gating model to purify a population of interest - here B cells.

B cells are usually identified by the expression of CD20, encoded by MS4A1.

my_scGate_model <- gating_model(name = "Bcell", signature = c("MS4A1"))
my_scGate_model
  levels   use_as  name signature
1 level1 positive Bcell     MS4A1

Run scGate to purify B cells with the simple single-gene model

obj <- scGate(data = obj, model = my_scGate_model, verbose = T)
DimPlot(obj, cols = palette) + theme(aspect.ratio = 1)

Because in this demo example we know the ground truth cell types (at least, as defined by the authors), we can evaluate scGate predictive performance in terms of precision/positive predictive value, recall/sensitivity, and accuracy using Matthews Correlation Coefficient (MCC)

scGate::performance.metrics(grepl("^B ", obj$cell_type), obj$is.pure == "Pure")
     PREC       REC       MCC 
0.9967742 0.9967742 0.9961825 

With this very simple model, >99% (Recall) are isolated with a >99% purity (Precision), and overall classification performance (MCC) >99%

Another example: we can isolate plasmacytoid dendritic cells (pDCs), defined using the marker LILRA4 (e.g. https://pubmed.ncbi.nlm.nih.gov/30395816/)

my_scGate_model <- gating_model(name = "pDC", signature = c("LILRA4"))  # add one positive signature
my_scGate_model
  levels   use_as name signature
1 level1 positive  pDC    LILRA4
# Run the model
obj <- scGate(data = obj, model = my_scGate_model)
DimPlot(obj, cols = palette) + theme(aspect.ratio = 1)

scGate::performance.metrics(obj$cell_type == "pDC", obj$is.pure == "Pure")
PREC  REC  MCC 
   1    1    1 

Gating models with positive and negative markers

Natural killer cells (NKs) are characterized by the marker KLRD1. However, KLRD1 can also be expressed by some T cell subsets. To improve sensitivity to isolate NKs, we can include in our gating strategy “negative” T cell markers such as CD3D.

my_scGate_model <- gating_model(name = "NK", signature = c("NCAM1+", "KLRD1+", "CD3D-"))

obj <- scGate(data = obj, model = my_scGate_model, assay = DefaultAssay(obj))
DimPlot(obj, cols = palette) + theme(aspect.ratio = 1)

With this simple model, 91% of NKs (recall) are isolated with a 100% purity (Precision), and overall classification performance (MCC) of 95%

scGate::performance.metrics(grepl("^NK", obj$cell_type), obj$is.pure == "Pure")
     PREC       REC       MCC 
1.0000000 0.9386792 0.9653516 

Hierarchical gating models

In the examples above, we have been using a single-cell dataset derived from blood (PBMC). When working with more complex tissues, such as tumors, we might need to apply multiple levels of gating to isolate the population of interest. scGate allows to purify by steps, using a hierarchical gating model - for example, first purifying immune cells, and among the immune cells the cell type of interest.

Let’s explore (a downsampled version of) the whole tumor dataset by Jerby-Arnon et al. 2018 (https://doi.org/10.1016/j.cell.2018.09.006), with the cell type annotations provided by the authors.

obj <- testing.datasets[["JerbyArnon"]]
DimPlot(obj, label = T, repel = T, group.by = "cell_type") + theme(legend.position = "none",
    aspect.ratio = 1)

The dataset comprises non-immune populations such as malignant/cancer cells (Mal), Endothelial cells (Endo) and cancer-associated fibroblasts (CAF). We can try to purify first all immune cells, using pan-immune cell marker CD45 (encoded by the gene PTPRC):

my_scGate_model <- gating_model(name = "immune", signature = c("PTPRC"))
obj <- scGate(data = obj, model = my_scGate_model, verbose = T)
DimPlot(obj, cols = palette) + theme(aspect.ratio = 1)

scGate::performance.metrics(obj$cell_type %in% c("B.cell", "NK", "T.CD4", "T.CD8",
    "T.cell", "Macrophage"), obj$is.pure == "Pure")
     PREC       REC       MCC 
0.8942701 0.9827586 0.7990127 

From the immmune cells, we can generate a simple gating model to purify macrophages (using common markers CD68 and FCGR1A).

Instead of purifying macrophages directly from the whole tissue, we can set up a hierarchial scGate model to: i) isolate immune cells as in the previous example, ii) isolate macrophages from immune cells.

Hierarchical gating models can be specified in scGate using parameter “level”, as follows:

my_scGate_model <- gating_model(name = "immune", signature = c("PTPRC"), level = 1)  # initialize model with one positive signature
my_scGate_model <- gating_model(model = my_scGate_model, name = "macrophage", signature = c("CD68",
    "FCGR1A"), level = 2)  # add positive signature at second step

obj <- scGate(data = obj, model = my_scGate_model, save.levels = T)

DimPlot(obj, cols = palette) + theme(aspect.ratio = 1)

scGate::performance.metrics(obj$cell_type %in% c("Macrophage"), obj$is.pure == "Pure")
     PREC       REC       MCC 
0.8968254 0.9300412 0.9010617 

Here we isolated 95% of macrophages with a 90% purity (according to annotation by Jerby-Arnon et al.). We could easily improve PREC and REC by adding more positive and negative markers, respectively (e.g. removing lymphocytes)

We can always inspect the distribution of UCell scores calculated for each signature. For example:

FeaturePlot(obj, features = c("macrophage_UCell")) + scale_color_viridis(option = "D") +
    theme(aspect.ratio = 1)

We can see how the two-level gating model worked in each step using the function plot_levels. The first plot shows the purification step for CD45+ cells, the second plot the isolation of macrophages.

wrap_plots(plot_levels(obj))

Editing models

scGate models of arbitrary complexity can be easily constructed to achieve high purification performance. These gating models can be written using the gating_model() function, and manually edited using the fix() function.

You can also export your basic model as tabulated text and edit it manually in Excel.

fix(my_scGate_model)  # edit locally in R
write.table(my_scGate_model, "test.tsv", sep = "\t")  # export and then edit on Excel
my_scGate_model_edited <- scGate::load_scGate_model("test.tsv")  # tabulated-text model can loaded

Evaluation of model performance

We provide a built-in function test_my_model to automatically evaluate the performance of a gating model using 3 pre-annotated testing datasets:

my_scGate_model <- gating_model(name = "Bcell", signature = c("MS4A1"))
panBcell.performance <- test_my_model(my_scGate_model, target = "Bcell")

panBcell.performance$performance
                PREC       REC       MCC
JerbyArnon 0.8654545 0.9794239 0.9091995
Zilionis   0.9576720 0.9627660 0.9560761
Satija     0.9967742 0.9967742 0.9961825

In one of the datasets, precision was not optimal (90%).

We can refine it, for instance, by removing potentially contaminating T cell genes (CD3D) and try again:

my_scGate_model <- gating_model(model = my_scGate_model, name = "Tcell", signature = c("CD3D"),
    negative = T)
panBcell.performance <- test_my_model(my_scGate_model, target = "Bcell")

panBcell.performance$performance
                PREC       REC       MCC
JerbyArnon 0.9449153 0.9176955 0.9218635
Zilionis   0.9576720 0.9627660 0.9560761
Satija     0.9967742 0.9967742 0.9961825

Predictive performance is now very high for the three sample datasets.

Using pre-defined gating models

We also provide some pre-defined gating models, these can be retrieved with the get_scGateDB() function:

models.DB <- scGate::get_scGateDB()

names(models.DB$human$generic)
 [1] "Bcell"                   "Bcell.GerminalCenter"   
 [3] "Bcell.NonGerminalCenter" "CD4T"                   
 [5] "CD8T"                    "CD8TIL"                 
 [7] "Endothelial"             "Epithelial"             
 [9] "Erythrocyte"             "Immune"                 
[11] "Macrophage"              "Mast"                   
[13] "Megakaryocyte"           "MoMac"                  
[15] "MoMacDC"                 "MoMacDCMast"            
[17] "Monocyte"                "Myeloid"                
[19] "Neutrophils"             "NK"                     
[21] "PanBcell"                "panDC"                  
[23] "PlasmaCell"              "Stromal"                
[25] "Tcell"                   "Tcell.alphabeta"        

names(models.DB$mouse$generic)
[1] "CD4T"            "CD8T"            "MoMacDC"         "Myeloid"        
[5] "panDC"           "Tcell"           "Tcell.alphabeta"

Visualizing gating models

To visualize complex models in tree-like structures, we provide the plot_tree() function.

The following is a model to purify plasma (B) cells with high accuracy:

library(ggparty)
my_scGate_model <- models.DB$human$generic$PlasmaCell
plt.tree <- scGate::plot_tree(my_scGate_model)
plt.tree

At each level in the tree, UCell scores are evaluated both for positive and negative signatures. Only cells with sufficiently high UCell scores for at least one positive signature, and consistently low UCell scores for all negative signatures will be passed on to the next level in the tree. Upon reaching the final branch of the tree, only cells that passed all gating levels are labeled as “Pure” for the population of interest.

We can run the Plasma cell model on the whole-tumor dataset by Zilionis et al. (https://doi.org/10.1016/j.immuni.2019.03.009)

obj <- testing.datasets[["Zilionis"]]

DimPlot(obj, group.by = "cell_type", label = T, repel = T, label.size = 3) + theme(aspect.ratio = 1) +
    ggtitle("Original manual annotation") + NoLegend()

Run scGate with save.levels=TRUE to output per-level results

obj <- scGate(obj, model = my_scGate_model, save.levels = TRUE)
DimPlot(obj, cols = palette) + theme(aspect.ratio = 1)

Visualize filtering results by level. We can see how the purity increases at each level.

plots <- scGate::plot_levels(obj)
wrap_plots(plots, ncol = 2)

Evaluate performance of filtering compared to original annotation

scGate::performance.metrics(actual = obj$Plasma_cell, pred = obj$is.pure == "Pure")
     PREC       REC       MCC 
0.9764706 0.9021739 0.9357819 

scGate as a multi-class classifier

scGate can also be used a cell type classifier, to annotate multiple cell types in a dataset. Simply provide a list of models (one for each cell type of interest) to scGate and these will be jointly evaluated (faster than computing them individually)

For instance, we can use a list of models from the default scGate DB:

models.DB <- scGate::get_scGateDB()

models.list <- models.DB$human$TME_HiRes

Then we can run scGate with this list of models:

obj <- testing.datasets[["Zilionis"]]

obj <- scGate(obj, model = models.list, ncores = 4)

Cells that are unequivocally assigned to only one cell type will be annotated to that cell type; cells that pass the gate of more than one model will be annotated as “Multi”; cells that cannot be gated by any model will be labeled as NA. These labels are exported in a new metadata column named scGate_multi:

table(obj$scGate_multi, useNA = "ifany")

      Bcell        CD4T        CD8T Endothelial  Epithelial  Fibroblast 
        190         198         120          44         148         104 
 Macrophage        Mast    Monocyte       Multi Neutrophils          NK 
        209          66          74           4          97         135 
      panDC  PlasmaCell        <NA> 
         96          85         430 
multi <- DimPlot(obj, group.by = "scGate_multi", label = T, repel = T, label.size = 2) +
    theme(aspect.ratio = 1) + ggtitle("scGate annotation") + NoLegend()

orig <- DimPlot(obj, group.by = "cell_type", label = T, repel = T, label.size = 2) +
    theme(aspect.ratio = 1) + ggtitle("Original manual annot") + NoLegend()

orig | multi

We note that cell types for which a model was available were consistently annotated. In this example we did not provide models for e.g. Neutrophils and Mast cells, so these cells were left unannotated. We leave it as an exercise to design models for these cell types based on markers from literature, and to increase the coverage of cells that can be confidently annotated. scGate users are welcome to contribute their gating models to the scGate models repository!

Fast gating using pre-calculated dimensionality reduction

By default, scGate performs dimensionality reduction (DM) at each hierarchical level of the gating strategy, for sensitive kNN smoothing (see paper). However, if a pre-calculated DM is available in your input object, this can be provided to scGate to avoid this step and speed-up computation. This feature can be particularly useful to gate integrated datasets (eg after using Harmony, Seurat, or STACAS).

Here we will run scGate using the pre-calculated pca space, by setting reduction = "pca"

obj <- scGate(obj, model = models.list, reduction = "pca", ncores = 4)

We also used ncores=4 to tell scGate to parallelize the computation using 4 cores, to speed up

multi <- DimPlot(obj, group.by = "scGate_multi") + theme(aspect.ratio = 1)

orig <- DimPlot(obj, group.by = "cell_type", label = T, repel = T, label.size = 2) +
    theme(aspect.ratio = 1) + ggtitle("Original manual annot") + NoLegend()

orig | multi

Further reading

The scGate package and installation instructions are available at: scGate package

The code for this demo can be found on GitHub

The repository for scGate gating moels is at: scGate models repository

References

  • Hao, Y., Hao, S., Andersen-Nissen, E., Mauck III, W. M., Zheng, S., Butler, A., … & Satija, R. (2021). Integrated analysis of multimodal single-cell data. Cell.

  • Murray, L., Xi, Y., & Upham, J. W. (2019). CLEC4C gene expression can be used to quantify circulating plasmacytoid dendritic cells. Journal of immunological methods, 464, 126-130.

  • Jerby-Arnon, L., Shah, P., Cuoco, M. S., Rodman, C., Su, M. J., Melms, J. C., … & Regev, A. (2018). A cancer cell program promotes T cell exclusion and resistance to checkpoint blockade. Cell, 175(4), 984-997.

  • Zilionis, R., Engblom, C., Pfirschke, C., Savova, V., Zemmour, D., Saatcioglu, H. D., … & Klein, A. M. (2019). Single-cell transcriptomics of human and mouse lung cancers reveals conserved myeloid populations across individuals and species. Immunity, 50(5), 1317-1334.