scGate to annotate integrated scRNA-seq datasets

A typical task in single-cell analysis is cell type annotation of datasets composed of multiple samples. You may have used one of several tools for batch-effect correction to integrate multiple samples, and generated a combined dataset. In this demo we will show how scGate can help you annotate an integrated dataset, by using simple, customizable models based on common marker genes from literature. We will show the case of a PBMC dataset integrated either with STACAS or Harmony, but the same applies to different integration tools.

Set up the environment

library(renv)
renv::restore()

library(ggplot2)
library(dplyr)
library(patchwork)
library(Seurat)
library(SeuratObject)
library(remotes)

if (!require("harmony", quietly = TRUE))
  install.packages("harmony")

#Packages from GitHub
if (!require("seurat-data", quietly = TRUE))
  remotes::install_github('satijalab/seurat-data')

if (!require("STACAS", quietly = TRUE))
  remotes::install_github("carmonalab/STACAS")

library(scGate)
library(SeuratData)
library(STACAS)

Get a test dataset

Download the dataset of PBMCs (SCP424) distributed with SeuratData. For more information on this dataset you can do ?pbmcsca

options(timeout = max(300, getOption("timeout")))

InstallData("pbmcsca")
data("pbmcsca")

scGate on STACAS-integrated object

Integrate different batches (in this example, datasets generated with different sequencing method) with STACAS

nfeatures <- 1000  # define number of variable features to consider
npcs <- 20  # define number of Principal Components for dimensionality reduction

pbmcsca <- UpdateSeuratObject(pbmcsca) |>
    NormalizeData()
pbmcsca.list <- SplitObject(pbmcsca, split.by = "Method")
pbmc.stacas <- Run.STACAS(pbmcsca.list, anchor.features = nfeatures, dims = 1:npcs)
pbmc.stacas <- RunUMAP(pbmc.stacas, dims = 1:npcs)
DimPlot(pbmc.stacas, group.by = "Method") + theme(aspect.ratio = 1)

We can run scGate directly on this integrated space, for instance to isolate NK cells

models.db <- scGate::get_scGateDB()
model.NK <- models.db$human$generic$NK

pbmc.stacas <- scGate(pbmc.stacas, model = model.NK, reduction = "pca", ncores = 4,
    output.col.name = "NK")

We can compare the automatic filtering to the “CellType” manual annotation by the authors:

DimPlot(pbmc.stacas, group.by = c("NK", "CellType"), ncol = 2) + theme(aspect.ratio = 1)

New models can be easily defined based on cell type-specific markers from literature. For instance, we can set up a new simple model to identify Megakaryocytes:

model.MK <- scGate::gating_model(name = "Megakaryocyte", signature = c("ITGA2B",
    "PF4", "PPBP"))

pbmc.stacas <- scGate(pbmc.stacas, model = model.MK, reduction = "pca", ncores = 4,
    output.col.name = "Megakaryocyte")
DimPlot(pbmc.stacas, group.by = c("Megakaryocyte", "CellType"), ncol = 2) + theme(aspect.ratio = 1)

We can also run multiple gating models at once. Besides pure/impure classifications for each model, scGate will also return a combined annotation based on all the models we provided. In this setting, scGate can be used as a multi-classifier to automatically annotate datasets:

models.hs <- models.db$human$generic
models.list <- models.hs[c("Bcell", "PlasmaCell", "CD4T", "CD8T", "Monocyte", "NK",
    "Erythrocyte", "Megakaryocyte", "Mast", "panDC")]

pbmc.stacas <- scGate(pbmc.stacas, model = models.list, reduction = "pca", ncores = 4)
DimPlot(pbmc.stacas, group.by = c("Method", "CellType", "scGate_multi"), ncol = 3) +
    theme(aspect.ratio = 1)

UCell scores for individual signatures are also available in metadata (’*_UCell’ columns). These scores are useful to see which features contribute more strongly to a particular gating model:

FeaturePlot(pbmc.stacas, ncol = 3, features = c("Tcell_UCell", "CD4T_UCell", "CD8T_UCell",
    "MoMacDC_UCell", "pDC_UCell", "Bcell_UCell"))

scGate on Harmony-integrated object

A very popular tool for single-cell data integration is Harmony. The RunHarmony() function provides a convenient wrapper to integrate samples stored in a Seurat object:

pbmcsca <- NormalizeData(pbmcsca) %>%
    FindVariableFeatures(nfeatures = nfeatures) %>%
    ScaleData() %>%
    RunPCA(npcs = npcs)
pbmc.harmony <- RunHarmony(pbmcsca, group.by.vars = "Method")

The corrected embeddings after batch effect correction will be stored in the ‘harmony’ reduction slot:

pbmc.harmony <- RunUMAP(pbmc.harmony, reduction = "harmony", dims = 1:npcs)

Let’s apply scGate in this space to isolate high-quality T cells:

models.db <- scGate::get_scGateDB()
model.Tcell <- models.db$human$generic$Tcell

pbmc.harmony <- scGate(pbmc.harmony, model = model.Tcell, reduction = "harmony",
    ncores = 4, output.col.name = "Tcell")
DimPlot(pbmc.harmony, group.by = c("Tcell", "CellType"), ncol = 2) + theme(aspect.ratio = 1)

We can also efficiently run multiple gating models at once, by providing a list of gating models:

models.db <- scGate::get_scGateDB()

models.hs <- models.db$human$generic
models.list <- models.hs[c("Bcell", "PlasmaCell", "CD4T", "CD8T", "Monocyte", "NK",
    "Erythrocyte", "Megakaryocyte", "Mast", "panDC")]

pbmc.harmony <- scGate(pbmc.harmony, model = models.list, reduction = "harmony",
    ncores = 4)
DimPlot(pbmc.harmony, group.by = c("Method", "CellType", "scGate_multi"), ncol = 3) +
    theme(aspect.ratio = 1)

sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] formatR_1.14             rmdformats_1.0.4         knitr_1.45              
 [4] pbmcsca.SeuratData_3.0.0 SeuratData_0.2.2.9001    scGate_1.6.0            
 [7] STACAS_2.2.0             harmony_1.1.0            Rcpp_1.0.11             
[10] remotes_2.4.2.1          Seurat_5.0.0             SeuratObject_5.0.0      
[13] sp_2.1-1                 patchwork_1.1.3          dplyr_1.1.3             
[16] ggplot2_3.4.4            renv_1.0.3              

loaded via a namespace (and not attached):
 [1] RcppAnnoy_0.0.21        splines_4.3.0           later_1.3.1            
 [4] bitops_1.0-7            tibble_3.2.1            R.oo_1.25.0            
 [7] polyclip_1.10-6         fastDummies_1.7.3       lifecycle_1.0.3        
[10] globals_0.16.2          lattice_0.22-5          MASS_7.3-60            
[13] magrittr_2.0.3          plotly_4.10.3           sass_0.4.7             
[16] rmarkdown_2.25          jquerylib_0.1.4         yaml_2.3.7             
[19] httpuv_1.6.12           sctransform_0.4.1       spam_2.10-0            
[22] spatstat.sparse_3.0-3   reticulate_1.34.0       cowplot_1.1.1          
[25] pbapply_1.7-2           RColorBrewer_1.1-3      abind_1.4-5            
[28] zlibbioc_1.48.0         Rtsne_0.16              GenomicRanges_1.54.1   
[31] purrr_1.0.2             R.utils_2.12.2          BiocGenerics_0.48.1    
[34] RCurl_1.98-1.13         rappdirs_0.3.3          GenomeInfoDbData_1.2.11
[37] IRanges_2.36.0          S4Vectors_0.40.1        ggrepel_0.9.4          
[40] irlba_2.3.5.1           listenv_0.9.0           spatstat.utils_3.0-4   
[43] goftest_1.2-3           RSpectra_0.16-1         spatstat.random_3.2-1  
[46] fitdistrplus_1.1-11     parallelly_1.36.0       leiden_0.4.3           
[49] codetools_0.2-19        DelayedArray_0.28.0     tidyselect_1.2.0       
[52] farver_2.1.1            UCell_2.6.1             matrixStats_1.0.0      
[55] stats4_4.3.0            spatstat.explore_3.2-5  jsonlite_1.8.7         
[58] BiocNeighbors_1.20.0    ellipsis_0.3.2          progressr_0.14.0       
[61] ggridges_0.5.4          survival_3.5-7          tools_4.3.0            
[64] ica_1.0-3               glue_1.6.2              SparseArray_1.2.0      
[67] gridExtra_2.3           xfun_0.41               MatrixGenerics_1.14.0  
[70] GenomeInfoDb_1.38.0     withr_2.5.2             BiocManager_1.30.22    
[73] fastmap_1.1.1           fansi_1.0.5             digest_0.6.33          
 [ reached getOption("max.print") -- omitted 68 entries ]

Final notes

scGate can be applied on individual samples to purify a cell population of interest and remove contaminants, prior to downstream steps in the analysis pipeline (e.g. integration, clustering, differential gene expression, etc.). However, it is becoming increasingly common to work with integrated (i.e. batch-corrected) collections of datasets, e.g. from multiple samples or batches of samples. As we have shown here, scGate can be applied directly on integrated objects, and their batch-corrected low-dimensional representations, to aid the annotation of cell types based on marker genes.

By default, scGate calculates PCA embeddings from normalized (raw) feature counts, and repeats this operation for each hierarchical level of a gating model. Signature scores for each cell (calculated using UCell) are smoothed by the scores of the neighboring cells, and used to determine whether a given cell “passes the gate”. While for individual samples such neighbor smoothing is generally more accurate when recalculated at each level of gating, it can be costly in terms of computing time. Providing a precalculated dimensionality reduction to scGate, as shown in this demo, can significantly speed up computation while benefiting from mitigated batch effects for nearest neighbors smoothing.

Further reading

The scGate package and installation instructions are available at: scGate package

The code for this demo can be found on GitHub

The repository for scGate gating moels is at: scGate models repository

References

Ding, Jiarui, et al. “Systematic comparison of single-cell and single-nucleus RNA-sequencing methods.” Nature biotechnology 38.6 (2020): 737-746.

Andreatta, Massimo, and Santiago J. Carmona. “STACAS: Sub-Type Anchor Correction for Alignment in Seurat to integrate single-cell RNA-seq data.” Bioinformatics 37.6 (2021): 882-884.

Korsunsky, Ilya, et al. “Fast, sensitive and accurate integration of single-cell data with Harmony.” Nature methods 16.12 (2019): 1289-1296.