Data integration with STACAS

STACAS is a method for scRNA-seq integration that is particuarly suited for collections of datasets with large cell type imbalance.

Prior cell type knowledge, given as cell labels, can be provided to the algorithm to perform semi-supervised integration, leading to increased preservation of biological variability in the data. STACAS is robust to partial and imperfect cell type labels and can be applied to large-scale integrations.

In this demo we will show the application of STACAS to integrate a collection of scRNA-seq datasets of immune cells from multiple donors, human tissues and studies, assembled by Luecken et al. for their comprehensive benchmark.

The data are available at: figshare/12420968

R environment

Get and load some useful packages

renv::restore()

if (!require("remotes", quietly = TRUE))
    install.packages("remotes")
library(remotes)

if (!requireNamespace("STACAS", quietly = TRUE))
  remotes::install_github("carmonalab/STACAS")
library(Seurat)
library(dplyr)
library(ggplot2)
library(STACAS)

seed = 1234
set.seed(seed)

Load test datasets

Download the dataset of human immune cells assembled by Luecken et al., and convert them to Seurat objects.

download <- FALSE
where <- 'aux'
dir.create(where, showWarnings = FALSE)

rds.path <- sprintf("%s/Immune_ALL_human.rds", where)

if(download){

  options(timeout=5000)
  url <- "https://figshare.com/ndownloader/files/25717328"
  h5.path <- sprintf("%s/Immune_ALL_human.rds", where)
  download.file(url = url, destfile = h5.path)
  
  if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

  if (!require("zellkonverter", quietly = TRUE))
    install.packages("zellkonverter") # to convert from 'h5ad' to 'sce' object
  
  #Convert to Seurat 
  object.sce <- zellkonverter::readH5AD(h5.path)
  object <- Seurat::as.Seurat(object.sce, counts = "counts", data = "X")
  object <- RenameAssays(object, originalexp="RNA")
  rm (object.sce)
  
  Idents(object) <- "final_annotation"
  saveRDS(object = object,file = rds.path)

}else{
  object <- readRDS(rds.path)
}

Cell types were annotated by the authors on each dataset individually, using a common dictionary of cell types (see https://github.com/theislab/scib-reproducibility). These are stored in the final_annotation metadata column. Study of origin is stored in batch metadaa

table(object$final_annotation, object$batch)[,1:5]
##                                   
##                                     10X Freytag Oetjen_A Oetjen_P Oetjen_U
##   CD4+ T cells                     2937    1238      577      295     1652
##   CD8+ T cells                      350     270       93      639      253
##   CD10+ B cells                       0       0       91       41       75
##   CD14+ Monocytes                  3388     452      350      263      384
##   CD16+ Monocytes                   364      25       61       26       78
##   CD20+ B cells                    1546     427       43       31      417
##   Erythrocytes                        0       0      186     1219       97
##   Erythroid progenitors               0       0      112      249      102
##   HSPCs                              28       0      227      119       99
##   Megakaryocyte progenitors          21      16       72       71       76
##   Monocyte progenitors                0       0      183      109      136
##   Monocyte-derived dendritic cells  182       0       89       60       65
##   NK cells                          756     476       26        0       63
##   NKT cells                        1056     432      389       62      157
##   Plasma cells                       18       0       33       40       38
##   Plasmacytoid dendritic cells       81      11       54       41       38

How does the collection of datasets look without any integration? Run a standard Seurat pipeline for dimensionality reduction and visualization.

nfeatures <- 1000
ndim <- 20
object <- FindVariableFeatures(object, nfeatures = nfeatures) %>%
  NormalizeData() %>% ScaleData() %>%
  RunPCA(npcs=ndim) %>% RunUMAP(dims=1:ndim)
p1_pre <- DimPlot(object, group.by = "batch") + theme(aspect.ratio = 1) +
  ggtitle("Dataset/batch before integration")
p2_pre <- DimPlot(object, group.by = "final_annotation", label=T, label.size = 5) +
  NoLegend() +
  theme(aspect.ratio = 1) + ggtitle("Cell labels before integration")


p1_pre | p2_pre

Although cells mostly cluster by the cell type (as annotated in individual datasets), there are also visible batch effects (seen as dataset-specific clustering).

One-liner STACAS

STACAS integration can be performed with a single one-liner command:

object_integrated <- object %>% SplitObject(split.by = "batch") %>%
      Run.STACAS(dims = 1:ndim, anchor.features = nfeatures) %>%
      RunUMAP(dims = 1:ndim) 

DimPlot(object_integrated, group.by = "batch")

Step-by-step STACAS integration

For more control over the integration steps in STACAS, you can run individual steps separately and inspect intermediate results.

1. Find integration anchors between datasets/batches

obj.list <- SplitObject(object, split.by = "batch")

stacas_anchors <- FindAnchors.STACAS(obj.list, 
                                     anchor.features = nfeatures,
                                     dims = 1:ndim)
##  1/10 2/10 3/10 4/10 5/10 6/10 7/10 8/10 9/10 10/10
## Finding integration anchors...

2. Guide tree for integration order

st1 <- SampleTree.STACAS(
  anchorset = stacas_anchors,
  obj.names = names(obj.list)
  )

Are samples from the same sequencing technology or from similar tissues clustering together? Different hierarchical clustering methods are available as hclust.methods parameter.

3. Dataset integration

object_integrated <- IntegrateData.STACAS(stacas_anchors,
                                          sample.tree = st1,
                                          dims=1:ndim)

Calculate low-dimensional embeddings and visualize integration results in UMAP

object_integrated <- object_integrated %>% ScaleData() %>%
  RunPCA(npcs=ndim) %>% RunUMAP(dims=1:ndim)
p1_int <- DimPlot(object_integrated, group.by = "batch") +
  theme(aspect.ratio = 1) + ggtitle("Dataset/batch after integration")
p2_int <- DimPlot(object_integrated, group.by = "final_annotation", label=T, label.size = 5) +
  NoLegend() + theme(aspect.ratio = 1) + ggtitle("Cell labels after integration") 

p1_int | p2_int

At least visually, cells from different studies appear to be better mixed than the uncorrected data. In a later section we will use quantitative metrics to verify whether integration was successful in removing batch effects.

Semi-supervised integration

When available, cell type annotations can be used to guide the alignment. STACAS will use this information to penalize anchors where cell types are inconsistent.

In this dataset, cells were annotated by the authors of the benchmark. For your own data, you may want to do manual annotation or apply one of several cell annotation tools, such as SingleR, Garnett or scGate. We will be posting some examples of cell type annotation using scGate in a different demo.

Run semi-supervised STACAS as one-liner by indicating the metadata column that contains cell annotations (cell.labels in this case):

object_integrated_ss <- obj.list %>%
  Run.STACAS(dims = 1:ndim, anchor.features = nfeatures, cell.labels = "final_annotation")

Note that there is no need for ALL cells to be annotated: we recommend to set labels to NA or unknown for cells that cannot be confidently annotated, and they won’t be penalized for label inconsistency. In addition, you can decide how much weight to give to cell labels with the label.confidence parameter (from 0 to 1).

Visualize on UMAP space

object_integrated_ss <- object_integrated_ss %>% RunUMAP(dims=1:ndim)
p1_ss <- DimPlot(object_integrated_ss, group.by = "batch") +
  theme(aspect.ratio = 1) +
  ggtitle("Dataset/batch after semi-supervised integration")
p2_ss <- DimPlot(object_integrated_ss, group.by = "final_annotation", label=T, label.size = 5) + 
  NoLegend() + theme(aspect.ratio = 1) + ggtitle("Cell labels after semi-supervised integration")

p1_ss | p2_ss