Data integration with STACAS
STACAS is a method for scRNA-seq integration that is particuarly suited for collections of datasets with large cell type imbalance.
Prior cell type knowledge, given as cell labels, can be provided to the algorithm to perform semi-supervised integration, leading to increased preservation of biological variability in the data. STACAS is robust to partial and imperfect cell type labels and can be applied to large-scale integrations.
In this demo we will show the application of STACAS to integrate a collection of scRNA-seq datasets of immune cells from multiple donors, human tissues and studies, assembled by Luecken et al. for their comprehensive benchmark.
The data are available at: figshare/12420968
R environment
Get and load some useful packages
Load test datasets
Download the dataset of human immune cells assembled by Luecken et al., and convert them to Seurat objects.
download <- FALSE
where <- 'aux'
dir.create(where, showWarnings = FALSE)
rds.path <- sprintf("%s/Immune_ALL_human.rds", where)
if(download){
options(timeout=5000)
url <- "https://figshare.com/ndownloader/files/25717328"
h5.path <- sprintf("%s/Immune_ALL_human.rds", where)
download.file(url = url, destfile = h5.path)
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!require("zellkonverter", quietly = TRUE))
install.packages("zellkonverter") # to convert from 'h5ad' to 'sce' object
#Convert to Seurat
object.sce <- zellkonverter::readH5AD(h5.path)
object <- Seurat::as.Seurat(object.sce, counts = "counts", data = "X")
object <- RenameAssays(object, originalexp="RNA")
rm (object.sce)
Idents(object) <- "final_annotation"
saveRDS(object = object,file = rds.path)
}else{
object <- readRDS(rds.path)
}
Cell types were annotated by the authors on each dataset
individually, using a common dictionary of cell types (see https://github.com/theislab/scib-reproducibility). These
are stored in the final_annotation
metadata column. Study
of origin is stored in batch
metadaa
##
## 10X Freytag Oetjen_A Oetjen_P Oetjen_U
## CD4+ T cells 2937 1238 577 295 1652
## CD8+ T cells 350 270 93 639 253
## CD10+ B cells 0 0 91 41 75
## CD14+ Monocytes 3388 452 350 263 384
## CD16+ Monocytes 364 25 61 26 78
## CD20+ B cells 1546 427 43 31 417
## Erythrocytes 0 0 186 1219 97
## Erythroid progenitors 0 0 112 249 102
## HSPCs 28 0 227 119 99
## Megakaryocyte progenitors 21 16 72 71 76
## Monocyte progenitors 0 0 183 109 136
## Monocyte-derived dendritic cells 182 0 89 60 65
## NK cells 756 476 26 0 63
## NKT cells 1056 432 389 62 157
## Plasma cells 18 0 33 40 38
## Plasmacytoid dendritic cells 81 11 54 41 38
How does the collection of datasets look without any integration? Run a standard Seurat pipeline for dimensionality reduction and visualization.
nfeatures <- 1000
ndim <- 20
object <- FindVariableFeatures(object, nfeatures = nfeatures) %>%
NormalizeData() %>% ScaleData() %>%
RunPCA(npcs=ndim) %>% RunUMAP(dims=1:ndim)
p1_pre <- DimPlot(object, group.by = "batch") + theme(aspect.ratio = 1) +
ggtitle("Dataset/batch before integration")
p2_pre <- DimPlot(object, group.by = "final_annotation", label=T, label.size = 5) +
NoLegend() +
theme(aspect.ratio = 1) + ggtitle("Cell labels before integration")
p1_pre | p2_pre
Although cells mostly cluster by the cell type (as annotated in individual datasets), there are also visible batch effects (seen as dataset-specific clustering).
One-liner STACAS
STACAS integration can be performed with a single one-liner command:
Step-by-step STACAS integration
For more control over the integration steps in STACAS, you can run individual steps separately and inspect intermediate results.
1. Find integration anchors between datasets/batches
2. Guide tree for integration order
Are samples from the same sequencing technology or from similar
tissues clustering together? Different hierarchical clustering methods
are available as hclust.methods
parameter.
3. Dataset integration
Calculate low-dimensional embeddings and visualize integration results in UMAP
object_integrated <- object_integrated %>% ScaleData() %>%
RunPCA(npcs=ndim) %>% RunUMAP(dims=1:ndim)
p1_int <- DimPlot(object_integrated, group.by = "batch") +
theme(aspect.ratio = 1) + ggtitle("Dataset/batch after integration")
p2_int <- DimPlot(object_integrated, group.by = "final_annotation", label=T, label.size = 5) +
NoLegend() + theme(aspect.ratio = 1) + ggtitle("Cell labels after integration")
p1_int | p2_int
At least visually, cells from different studies appear to be better mixed than the uncorrected data. In a later section we will use quantitative metrics to verify whether integration was successful in removing batch effects.
Semi-supervised integration
When available, cell type annotations can be used to guide the alignment. STACAS will use this information to penalize anchors where cell types are inconsistent.
In this dataset, cells were annotated by the authors of the
benchmark. For your own data, you may want to do manual annotation or
apply one of several cell annotation tools, such as SingleR,
Garnett or scGate. We will be
posting some examples of cell type annotation using scGate
in a different demo.
Run semi-supervised STACAS as one-liner by
indicating the metadata column that contains cell annotations
(cell.labels
in this case):
object_integrated_ss <- obj.list %>%
Run.STACAS(dims = 1:ndim, anchor.features = nfeatures, cell.labels = "final_annotation")
Note that there is no need for ALL cells to be annotated: we
recommend to set labels to NA or unknown for cells
that cannot be confidently annotated, and they won’t be penalized for
label inconsistency. In addition, you can decide how much weight to give
to cell labels with the label.confidence
parameter (from 0
to 1).
Visualize on UMAP space
p1_ss <- DimPlot(object_integrated_ss, group.by = "batch") +
theme(aspect.ratio = 1) +
ggtitle("Dataset/batch after semi-supervised integration")
p2_ss <- DimPlot(object_integrated_ss, group.by = "final_annotation", label=T, label.size = 5) +
NoLegend() + theme(aspect.ratio = 1) + ggtitle("Cell labels after semi-supervised integration")
p1_ss | p2_ss
Metrics of integration quality
Two main aspects should be considered to determine the quality of single-cell data integration: (i) batch mixing and (ii) preservation of biological variability.
Batch mixing measures whether similar cells originating from different batches are well mixed after integration. A common metrics to quantify batch mixing is the integration LISI; here we will use a celltype-aware version of the integration LISI (CiLISI), to account for different composition between datasets, as implemented in the scIntegrationMetrics package.
Preservation of biological variability can be quantified by how close to each other cells of the same type are, and how separated from each other cells of different types are in the joint integrated embedding. A useful metric for preservation of biological variability is Average Silhouette Width (ASW) of cell labels (celltype_ASW) in the corrected PCA space.
if (!require("scIntegrationMetrics", quietly = TRUE))
install_github("carmonalab/scIntegrationMetrics") #calculates LISI and Silhouette
library(scIntegrationMetrics)
integrationMetrics <- list()
integrationMetrics[["uncorrected"]] <- getIntegrationMetrics(object=object,
metrics = c("CiLISI","celltype_ASW"),
meta.label = "final_annotation",
meta.batch = "batch")
integrationMetrics[["STACAS"]] <- getIntegrationMetrics(object=object_integrated,
metrics = c("CiLISI","celltype_ASW"),
meta.label = "final_annotation",
meta.batch = "batch")
integrationMetrics[["ssSTACAS"]] <- getIntegrationMetrics(object=object_integrated_ss,
metrics = c("CiLISI","celltype_ASW"),
meta.label = "final_annotation",
meta.batch = "batch")
if (!require("tidyr", quietly = TRUE))
install.packages("tidyr")
integrationMetricsSummary <- data.frame(unlist(integrationMetrics)) |>
tibble::rownames_to_column() |> rename(value=unlist.integrationMetrics.) |>
separate(rowname, c("Method","Metric"), sep="\\.")
a <- integrationMetricsSummary %>% filter(Metric=="CiLISI") %>%
ggplot(aes(x=Method, y=value, fill=Method)) + geom_bar(stat="identity") +
ggtitle("CiLISI") + theme_bw() +
theme(legend.position="none", axis.text.x=element_blank())
b <- integrationMetricsSummary %>% filter(Metric=="celltype_ASW") %>%
ggplot(aes(x=Method, y=value, fill=Method)) + geom_bar(stat="identity") +
ggtitle("celltype_ASW") + theme_bw() +
theme(axis.text.x=element_blank())
a | b
Compared to the uncorrected data, both STACAS and ssSTACAS achieve better batch mixing (CiLISI), and improved clustering of cells of the same type (celltype_ASW). Semi-supervised integration mixes slightly less different batches compared to unsupervised STACAS, but is better at preserving biological variability (i.e. higher celltype_ASW).
Label transfer between datasets
In some integration tasks, cell type annotations may be available
only for a fraction of the datasets. In such cases, labels can be
“transferred” to unannotated datasets by similarity to annotated cells.
The annotate.by.neighbors
function allows propagating cell
labels to unannotated cells by K-nearest neighbor similarity to
annotated cells.
For example, if one of the datasets were lacking cell type labels (set to NA):
Transfer labels from annotated to non-annotated cells
complete <- annotate.by.neighbors(incomplete, labels.col = "final_annotation")
a <- DimPlot(incomplete, group.by = "final_annotation", label=T, label.size = 5) + NoLegend() +
theme(aspect.ratio = 1) + ggtitle("Incomplete annotation")
b <- DimPlot(complete, group.by = "final_annotation", label=T, label.size = 5) + NoLegend() +
theme(aspect.ratio = 1) + ggtitle("Complete annotation")
a | b
Notes on data integration
- The first critical step in batch effect correction is to ensure that
the gene names in your datasets are harmonized. If
different studies use different naming conventions, synonyms, etc.,
methods for dimensionality reduction will treat these genes as entirely
different variables, and introducing artificial differences between
datasets. It is recommended that your pre-processing pipeline takes care
of resolving ambiguities in gene naming across datasets. A useful
function is
STACAS::StandardizeGeneSymbols()
, but there are also other tools such as HGNChelper
#load conversion table
library(STACAS)
data(EnsemblGeneTable.Hs)
#convert gene names separately for each sample
obj.list <- lapply(obj.list, function(x) {
StandardizeGeneSymbols(x, EnsemblGeneTable = EnsemblGeneTable.Hs)
})
- The calculation of highly variable genes (HVG) is a fundamental step
for dimensionality reduction and integration. We recommend excluding
certain classes of genes, e.g. mitochondrial, ribosomal, heat-shock
genes, from the list of HVG, as their expression are typically more
strongly associated to technical variation than to actual biological
differences. The
FindVariableGenes.STACAS()
function allows providing a list of genes to exclude from HVG; see an example below. We can use the collection of signatures stored in SignatuR package
if (!require("SignatuR", quietly = TRUE))
install_github("carmonalab/SignatuR")
library(SignatuR)
#Retrieve full list of signatures for human
hs.sign <- GetSignature(SignatuR$Hs)
Then we tell STACAS to exclude specific genes from HVG used in
integration, using the parameter genesBlockList
Further reading
The STACAS package and installation instructions are available at: STACAS package
The code for this demo can be found on GitHub
References
Andreatta A., Carmona S. J. (2021). STACAS: Sub-Type Anchor Correction for Alignment in Seurat to integrate single-cell RNA-seq data. - Bioinformatics
Andreatta M, Herault L, Gueguen P, Gfeller D, Berenstein AJ, Carmona SJ (2024) Semi-supervised integration of single-cell transcriptomics data - Nature Communications
Luecken, M. D., Büttner, M., Chaichoompu, K., Danese, A., Interlandi, M., Müller, M. F., … & Theis, F. J. (2022). Benchmarking atlas-level data integration in single-cell genomics. - Nature methods
Hao, Y., Hao, S., Andersen-Nissen, E., Mauck III, W. M., Zheng, S., Butler, A., … & Satija, R. (2021). Integrated analysis of multimodal single-cell data. - Cell
Session info
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/Zurich
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] SignatuR_0.1.1 tidyr_1.3.1 scIntegrationMetrics_1.1
## [4] STACAS_2.2.2 ggplot2_3.4.4 dplyr_1.1.4
## [7] Seurat_5.0.1 SeuratObject_5.0.1 sp_2.1-3
## [10] remotes_2.4.2.1
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 rstudioapi_0.15.0 jsonlite_1.8.8
## [4] magrittr_2.0.3 spatstat.utils_3.0-4 farver_2.1.1
## [7] rmarkdown_2.25 vctrs_0.6.5 ROCR_1.0-11
## [10] spatstat.explore_3.2-6 htmltools_0.5.7 BiocNeighbors_1.20.2
## [13] sass_0.4.8 sctransform_0.4.1 parallelly_1.36.0
## [16] KernSmooth_2.23-22 bslib_0.6.1 htmlwidgets_1.6.4
## [19] ica_1.0-3 plyr_1.8.9 plotly_4.10.4
## [22] zoo_1.8-12 cachem_1.0.8 igraph_2.0.1.1
## [25] mime_0.12 lifecycle_1.0.4 pkgconfig_2.0.3
## [28] Matrix_1.6-5 R6_2.5.1 fastmap_1.1.1
## [31] fitdistrplus_1.1-11 future_1.33.1 shiny_1.8.0
## [34] digest_0.6.34 colorspace_2.1-0 S4Vectors_0.40.2
## [37] patchwork_1.2.0 tensor_1.5 RSpectra_0.16-1
## [40] irlba_2.3.5.1 vegan_2.6-4 labeling_0.4.3
## [43] progressr_0.14.0 fansi_1.0.6 spatstat.sparse_3.0-3
## [46] mgcv_1.9-1 httr_1.4.7 polyclip_1.10-6
## [49] abind_1.4-5 compiler_4.3.0 withr_3.0.0
## [52] BiocParallel_1.36.0 fastDummies_1.7.3 highr_0.10
## [55] R.utils_2.12.3 MASS_7.3-60.0.1 permute_0.9-7
## [58] tools_4.3.0 lmtest_0.9-40 httpuv_1.6.14
## [61] future.apply_1.11.1 goftest_1.2-3 R.oo_1.26.0
## [64] glue_1.7.0 nlme_3.1-164 promises_1.2.1
## [67] grid_4.3.0 Rtsne_0.17 cluster_2.1.6
## [70] reshape2_1.4.4 generics_0.1.3 gtable_0.3.4
## [73] spatstat.data_3.0-4 R.methodsS3_1.8.2 rmdformats_1.0.4
## [76] data.table_1.15.0 utf8_1.2.4 BiocGenerics_0.48.1
## [79] spatstat.geom_3.2-8 RcppAnnoy_0.0.22 ggrepel_0.9.5
## [82] RANN_2.6.1 pillar_1.9.0 stringr_1.5.1
## [85] spam_2.10-0 RcppHNSW_0.5.0 later_1.3.2
## [88] splines_4.3.0 lattice_0.22-5 renv_1.0.3
## [91] survival_3.5-7 deldir_2.0-2 tidyselect_1.2.0
## [94] miniUI_0.1.1.1 pbapply_1.7-2 knitr_1.45
## [97] gridExtra_2.3 bookdown_0.37 scattermore_1.2
## [100] stats4_4.3.0 xfun_0.41 matrixStats_1.2.0
## [103] stringi_1.8.3 lazyeval_0.2.2 yaml_2.3.8
## [106] evaluate_0.23 codetools_0.2-19 data.tree_1.1.0
## [109] tibble_3.2.1 BiocManager_1.30.22 cli_3.6.2
## [112] uwot_0.1.16 xtable_1.8-4 reticulate_1.35.0
## [115] munsell_0.5.0 jquerylib_0.1.4 Rcpp_1.0.12
## [118] globals_0.16.2 spatstat.random_3.2-2 png_0.1-8
## [121] parallel_4.3.0 ellipsis_0.3.2 dotCall64_1.1-1
## [124] listenv_0.9.1 viridisLite_0.4.2 scales_1.3.0
## [127] ggridges_0.5.6 leiden_0.4.3.1 purrr_1.0.2
## [130] rlang_1.1.3 cowplot_1.1.3