The scECODA package provides functions for single-cell Exploratory COmpositional Data Analysis at the population scale. It enables users to explore their dataset by visualizing the sample structure and how groups of samples relate to each other by providing numerous convenient plotting and clustering functions. Additionally, it provides metrics to quantify how strongly groups of samples separate and which cell types are driving this separation (differential abundance analysis).
Check package dependencies and install scECODA
# if (!requireNamespace("renv")) install.packages("renv")
# library(renv)
# renv::restore()
# renv::install("carmonalab/scECODA")
remotes::install_github("carmonalab/scECODA")
library(scECODA)
You can create an ECODA object from a single-cell data (Seurat or
SingleCellExperiment object) with create_ecoda_object().
Alternatively, you can create an ECODA object from a data frame containing cell
counts or frequencies (in % per sample) with create_ecoda_object_helper().
Based on the input, the following data and transformations will be automatically calculated:
Additionally, it will calculate useful metrics, such as:
NOTE: To calculate CLR-transformed cell type abundances, zero counts or relative abundances are not allowed (log of zero is not defined). For this reason, zeros are imputed by adding a pseudocount of +1 to all samples (to not distort ratios) or the smallest value in case the ECODA object was created from a data frame of relative abundances.
# Loading an example Seurat dataset
library(Seurat)
library(SeuratData)
InstallData("pbmc3k")
data("pbmc3k")
seurat_object <- UpdateSeuratObject(pbmc3k)
The cells of this example dataset are annotated but not assigned to samples or biological conditions. Let us artificially assign them by creating a sample and condition column in the metadata.
n_samples <- 40
n_groups <- 2
seurat_object$sample_id <- ""
seurat_object$sample_id <- rep(
x = paste0("sample", seq(n_samples)),
length.out = length(seurat_object$sample_id)
)
seurat_object$condition_id <- ""
seurat_object$condition_id <- c(rep("condition1", n_samples/n_groups),
rep("condition2", n_samples/n_groups))
# Create ECODA object
ecoda_object <- create_ecoda_object(
seurat_object,
sample_col = "sample_id",
celltype_col = "seurat_annotations"
)
# Loading an example SingleCellExperiment dataset
library(scRNAseq)
all.ds <- surveyDatasets()
ds <- 1
sce_object <- fetchDataset(
all.ds@listData[["name"]][ds],
all.ds@listData[["version"]][ds]
)
names(sce_object@colData)
## [1] "sample" "DevelopmentalStage" "DaysPostAmputation"
## [4] "cluster" "CellCyclePhase" "Lane"
## [7] "Condition" "batch"
# Create ECODA object
ecoda_object <- create_ecoda_object(
sce_object,
sample_col = "sample",
celltype_col = "cluster"
)
Metadata can be added optionally.
Here, we load the immune cell composition in samples from a breast cancer cohort (Zhang et al. 2021)
NOTES:
# Load example datasets
data(example_data)
Zhang <- example_data$Zhang
counts <- Zhang$cell_counts_lowresolution
freq <- calc_freq(counts)
metadata <- Zhang$metadata
ecoda_object <- create_ecoda_object_helper(
counts = counts,
metadata = metadata
)
ecoda_object <- create_ecoda_object_helper(
freq = freq,
metadata = metadata
)
To explore a dataset, Principal Component Analysis (PCA) is commonly performed
as a first step. It is very useful to find the major axis of variation between
samples. You can plot a PCA of your ecoda_object with plot_pca(ecoda_object).
How strongly samples separate based on a user defined grouping can be quantified with different metrics, for example:
ecoda_object <- create_ecoda_object_helper(
counts = example_data$Zhang$cell_counts_lowresolution,
metadata = example_data$Zhang$metadata,
)
plot_pca(
ecoda_object,
label_col = "Tissue",
title = "PCA based on cell type composition",
n_hv_feat_show = 5 # Shows the most highly variable features (cell types)
)
# Using only the most highly variable cell types
plot_pca(
ecoda_object,
slot = "clr_hvc",
label_col = "Tissue",
title = "PCA based on highly variable cell types",
n_hv_feat_show = 5
)
# You can get the separatino scores like this:
feat_mat <- ecoda_object@clr
labels <- ecoda_object@metadata$Tissue
anosim_r_score <- calc_anosim(feat_mat, labels)
ari_score <- calc_ari(feat_mat, labels)
modularity_score <- calc_modularity(feat_mat, labels)
silhouette_score <- calc_sil(feat_mat, labels)
print(paste("ANOSIM R:", anosim_r_score))
## [1] "ANOSIM R: 0.531"
print(paste("ARI:", ari_score))
## [1] "ARI: 0.639"
print(paste("Modularity:", modularity_score))
## [1] "Modularity: 0.716"
print(paste("Silhouette:", silhouette_score))
## [1] "Silhouette: 0.237"
Sometimes, looking at only the first two PCA dimensions can be misleading if there are other main sources of variance. For this purpose, we can also plot the PCA in 3D.
plot_pca(
ecoda_object,
label_col = "Tissue",
plotly_3d = TRUE
)
Box plots are another useful tool to visualize the distribution of cell types across samples or differences in cell type composition between samples. This function also allows you do some basic statistical comparisons in cell type composition between groups.
NOTE: It is highly recommended to only do statistical testing on log-ratio-transformed compositional data and not on counts or relative abundances in % (Greenacre 2021).
The main focus of scECODA is exploratory (unsupervised) analysis of samples based on cell type composition. If you want to do down-stream comparative (supervised) differential abundance testing of a priori known groups (including statistical designs with multiple known covariates), there are other tools available for that purpose, e.g. scCODA (Büttner et al. 2021) or tascCODA (“tascCODA: Bayesian Tree-Aggregated Analysis of Compositional Amplicon and Single-Cell Data - PMC,” n.d.).
plot_boxplot(ecoda_object, title = "CLR Abundance by Cell Type")
plot_boxplot(
ecoda_object,
label_col = "Tissue",
title = "CLR Abundance by Cell Type (with Wilcoxon Test)"
)
This can also be plotted as stacked bar plots but it is a bit harder to read.
# Plotting average cell type abundance by experimental group
plot_barplot(
ecoda_object,
label_col = "Tissue",
plot_by = "group",
title = "Mean Relative Abundance by Condition"
)
# Plotting cell type abundance for each sample separately
plot_barplot(
ecoda_object,
label_col = "Tissue",
plot_by = "sample",
title = "Relative Abundance for Each Sample"
)
The clustered heatmap plot is a good alternative to not only visualize how samples and cell types correlate but also to cluster samples. It is important not to re-scale in order to avoid amplifying tiny differences.
It is especially useful, if you have many cell types and subtypes annotated. Box or bar plots can get too crowded in such cases. Heatmaps are better suited to visualize large number of samples and cell types.
Also, it is possible that you see your samples forming separate clusters in the PCA or heatmap even though you do not find any (or only few) significant differences in cell type composition. However, there might be small differences among a large number of cell types which, taken all together, can still drive sample separation and explain differences between your biological groups.
NOTE: you can specify multiple metadata columns in “label_col” to visualize how samples group across multiple covariates simultaneously.
Here is an example for a simple dataset:
plot_heatmap(ecoda_object, label_col = c("Clinical.efficacy.", "Tissue"))
plot_heatmap(
ecoda_object,
label_col = c("Clinical.efficacy.", "Tissue"),
# Additional arguments for pheatmap:
cutree_rows = 3,
cutree_cols = 3
)
And here an example of a large healthy cohort with 868 samples and 69 cell types (Gong et al. 2024):
ecoda_object <- create_ecoda_object_helper(
counts = example_data$GongSharma_full$cell_counts_highresolution,
metadata = example_data$GongSharma_full$metadata
)
plot_heatmap(
ecoda_object,
label_col = c("subject.cmv", "age_group"),
# Additional arguments for pheatmap:
cutree_rows = 3,
cutree_cols = 5,
show_colnames = FALSE
)
Subsetting to only the most highly variable cell types (HVCs) yields a very similar sample clustering, while being much more interpretable.
plot_heatmap(
ecoda_object,
slot = "clr_hvc",
label_col = c("subject.cmv", "age_group"),
# Additional arguments for pheatmap:
cutree_rows = 3,
cutree_cols = 4,
show_colnames = FALSE
)
The cell type correlation plot offers the possibility to visualize which cell types correlate with each other, highlighting possible microenvironment hubs that might be of relevance to explain possible groups of samples in your data:
plot_corr(ecoda_object)
While the correlation plot below showing only the HVCs is easier to read, it is nevertheless interesting to see the full correlation plot above. It seems that there is additional information that might be overlooked:
Besides the main cell type clusters of effector memory cells (separating samples by cytomegalovirus (CMV) infection status) and SOX4+ naive T cells (separating samples by age), there seems to be a cluster of ISG+ cell types, memory T cells and monocytes/DCs whose abundance strongly correlate, respectively.
plot_corr(ecoda_object, slot = "clr_hvc")
As demonstrated in the heatmap of the extensive Gong & Sharma dataset, not all 69 cell subtypes contribute equally to sample separation.
Cell types at the top exhibit high variance across samples, indicating they are the main drivers of group separation, while those toward the bottom show minor variation.
This principle is directly analogous to the selection of highly variable genes (HVGs) in (sc)RNA-seq data: high variance is a powerful proxy for biological relevance.
We expect cell types that are differentially abundant between key biological groups to display this increased variance. scECODA provides convenient functions to calculate and visualize these cell type variances and mean abundance across all samples:
library(ggplot2)
plot_varmean(ecoda_object, label_points = T) +
ggtitle("Default - cell types explaining 50% variance")
plot_varmean(ecoda_object) +
ggtitle("Default - cell types explaining 50% variance")
ecoda_object_new <- find_hvcs(ecoda_object, variance_explained = 0.8)
plot_varmean(ecoda_object_new) +
ggtitle("Cell types explaining 80% variance")
ecoda_object_new <- find_hvcs(ecoda_object, top_n_hvcs = 5)
plot_varmean(ecoda_object_new) +
ggtitle("Top 5 HVCs")
ECODA is much more interpretable (10-100 cell types) compared to gene expression (2’000 - 20’000 genes). However, pseudobulk analysis has the advantage of not requiring cell type annotations.
The scECODA package provides convenient functions to explore sample structure at the population level based on cell type composition but also by default automatically calculates DESeq2-normalized (Love et al. 2015) sample pseudobulk gene expression.
A PCA plot based on pseudobulk can be plotted like this:
# Loading an example SingleCellExperiment dataset
library(scRNAseq)
all.ds <- surveyDatasets()
ds <- 1
sce_object <- fetchDataset(
all.ds@listData[["name"]][ds],
all.ds@listData[["version"]][ds]
)
ecoda_object <- create_ecoda_object(
sce_object,
sample_col = "sample",
celltype_col = "cluster",
)
plot_pca(
ecoda_object,
label_col = "Condition",
title = "PCA based on cell type composition",
n_hv_feat_show = 5
)
plot_pca(
ecoda_object,
slot = "pb",
label_col = "Condition",
title = "PCA based on pseudobulk gene expression"
#, n_hv_feat_show = 5 # optionally, show top n most highly variable genes
)
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.7.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/Zurich
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices datasets utils methods
## [8] base
##
## other attached packages:
## [1] ggplot2_4.0.0 scRNAseq_2.18.0
## [3] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
## [5] Biobase_2.64.0 GenomicRanges_1.56.2
## [7] GenomeInfoDb_1.40.1 IRanges_2.38.1
## [9] S4Vectors_0.42.1 BiocGenerics_0.50.0
## [11] MatrixGenerics_1.16.0 matrixStats_1.5.0
## [13] pbmc3k.SeuratData_3.1.4 SeuratData_0.2.2.9001
## [15] Seurat_5.3.0 SeuratObject_5.2.0
## [17] sp_2.2-0 scECODA_0.9.1
## [19] BiocStyle_2.32.1
##
## loaded via a namespace (and not attached):
## [1] ProtGenerics_1.36.0 spatstat.sparse_3.1-0 bitops_1.0-9
## [4] httr_1.4.7 RColorBrewer_1.1-3 tools_4.4.2
## [7] sctransform_0.4.2 backports_1.5.0 alabaster.base_1.4.2
## [10] R6_2.6.1 vegan_2.7-2 HDF5Array_1.32.1
## [13] lazyeval_0.2.2 uwot_0.2.3 mgcv_1.9-3
## [16] rhdf5filters_1.16.0 permute_0.9-8 withr_3.0.2
## [19] gridExtra_2.3 progressr_0.17.0 cli_3.6.5
## [22] factoextra_1.0.7 spatstat.explore_3.5-3 fastDummies_1.7.5
## [25] labeling_0.4.3 alabaster.se_1.4.1 sass_0.4.10
## [28] S7_0.2.0 spatstat.data_3.1-8 ggridges_0.5.7
## [31] pbapply_1.7-4 Rsamtools_2.20.0 parallelly_1.45.1
## [34] RSQLite_2.4.3 generics_0.1.4 BiocIO_1.14.0
## [37] crosstalk_1.2.2 gtools_3.9.5 ica_1.0-3
## [40] spatstat.random_3.4-2 car_3.1-3 dplyr_1.1.4
## [43] Matrix_1.7-4 abind_1.4-8 lifecycle_1.0.4
## [46] yaml_2.3.10 carData_3.0-5 rhdf5_2.48.0
## [49] SparseArray_1.4.8 BiocFileCache_2.12.0 Rtsne_0.17
## [52] grid_4.4.2 blob_1.2.4 promises_1.4.0
## [55] ExperimentHub_2.12.0 crayon_1.5.3 miniUI_0.1.2
## [58] lattice_0.22-7 cowplot_1.2.0 GenomicFeatures_1.56.0
## [61] KEGGREST_1.44.1 pillar_1.11.1 knitr_1.50
## [64] rjson_0.2.23 future.apply_1.20.0 codetools_0.2-20
## [67] glue_1.8.0 spatstat.univar_3.1-4 data.table_1.17.8
## [70] remotes_2.5.0 vctrs_0.6.5 png_0.1-8
## [73] gypsum_1.0.1 spam_2.11-1 gtable_0.3.6
## [76] cachem_1.1.0 xfun_0.53 S4Arrays_1.4.1
## [79] mime_0.13 survival_3.8-3 pheatmap_1.0.13
## [82] fitdistrplus_1.2-4 ROCR_1.0-11 nlme_3.1-168
## [85] bit64_4.6.0-1 alabaster.ranges_1.4.2 filelock_1.0.3
## [88] RcppAnnoy_0.0.22 bslib_0.9.0 irlba_2.3.5.1
## [91] KernSmooth_2.23-26 otel_0.2.0 DBI_1.2.3
## [94] DESeq2_1.44.0 tidyselect_1.2.1 bit_4.6.0
## [97] compiler_4.4.2 curl_7.0.0 httr2_1.2.1
## [100] DelayedArray_0.30.1 plotly_4.11.0 bookdown_0.45
## [103] rtracklayer_1.64.0 scales_1.4.0 lmtest_0.9-40
## [106] rappdirs_0.3.3 stringr_1.5.2 digest_0.6.37
## [109] goftest_1.2-3 spatstat.utils_3.2-0 alabaster.matrix_1.4.2
## [112] rmarkdown_2.30 XVector_0.44.0 htmltools_0.5.8.1
## [115] pkgconfig_2.0.3 ensembldb_2.28.1 dbplyr_2.5.1
## [118] fastmap_1.2.0 rlang_1.1.6 htmlwidgets_1.6.4
## [121] UCSC.utils_1.0.0 shiny_1.11.1 farver_2.1.2
## [124] jquerylib_0.1.4 zoo_1.8-14 jsonlite_2.0.0
## [127] BiocParallel_1.38.0 mclust_6.1.1 RCurl_1.98-1.17
## [130] magrittr_2.0.4 Formula_1.2-5 GenomeInfoDbData_1.2.12
## [133] dotCall64_1.2 patchwork_1.3.2 Rhdf5lib_1.26.0
## [136] Rcpp_1.1.0 reticulate_1.43.0 stringi_1.8.7
## [139] alabaster.schemas_1.4.0 zlibbioc_1.50.0 MASS_7.3-65
## [142] AnnotationHub_3.12.0 plyr_1.8.9 parallel_4.4.2
## [145] listenv_0.9.1 ggrepel_0.9.6 deldir_2.0-4
## [148] Biostrings_2.72.1 splines_4.4.2 tensor_1.5.1
## [151] locfit_1.5-9.12 igraph_2.2.0 ggpubr_0.6.2
## [154] spatstat.geom_3.6-0 ggsignif_0.6.4 RcppHNSW_0.6.0
## [157] reshape2_1.4.4 BiocVersion_3.19.1 XML_3.99-0.19
## [160] evaluate_1.0.5 renv_1.1.5 BiocManager_1.30.26
## [163] httpuv_1.6.16 RANN_2.6.2 tidyr_1.3.1
## [166] purrr_1.1.0 polyclip_1.10-7 alabaster.sce_1.4.0
## [169] future_1.67.0 scattermore_1.2 broom_1.0.10
## [172] xtable_1.8-4 AnnotationFilter_1.28.0 restfulr_0.0.16
## [175] RSpectra_0.16-2 rstatix_0.7.3 later_1.4.4
## [178] viridisLite_0.4.2 tibble_3.3.0 memoise_2.0.1
## [181] AnnotationDbi_1.66.0 GenomicAlignments_1.40.0 cluster_2.1.8.1
## [184] corrplot_0.95 globals_0.18.0