ECODA can be more sensitive to detect differences between samples then pseudobulk, especially when studying complex conditions and tissues with high-resolution cell type annotation including numerous low abundant cell subtypes.
Additionally, ECODA is much more interpretable (10-100 cell types) compared to gene expression (2’000 - 20’000 genes).
In this example, you will see how a dataset without sample separation looks like and how it compares to a dataset with differentially abundant cell types, in the cell type compositional and gene expressional space.
Also, it will give you a feeling about the sensitivity of cell type composition based sample separation versus pseudobulk based sample separation.
Check package dependencies and install scECODA
# if (!requireNamespace("renv")) install.packages("renv")
# library(renv)
# renv::restore()
# renv::install("carmonalab/scECODA")
remotes::install_github("carmonalab/scECODA")
library(scECODA)
scECODA provides a convenient function to calculate and visualize sample pseudobulks from scRNA-seq data, e.g. from a Seurat object.
In this example you will see:
We use the Seurat pbmc3k dataset.
# Loading an example Seurat dataset
library(Seurat)
library(SeuratData)
InstallData("pbmc3k")
data("pbmc3k")
seurat_object <- UpdateSeuratObject(pbmc3k)
The cells of this example dataset are annotated but not assigned to samples or biological conditions. Let us artificially assign them by creating a sample and condition column in the metadata.
n_samples <- 40
n_groups <- 2
seurat_object$sample_id <- ""
seurat_object$sample_id <- rep(
x = paste0("sample", seq(n_samples)),
length.out = length(seurat_object$sample_id))
seurat_object$condition_id <- ""
seurat_object$condition_id <- c(rep("condition1", n_samples/n_groups),
rep("condition2", n_samples/n_groups))
As we randomly assigned cells and samples to two arbitrary “conditions”, we see no separation between samples. All separation metrics show low scores around zero, indicating indeed a random distribution of our samples between conditions.
# Create ECODA object
ecoda_object <- create_ecoda_object(
seurat_object,
sample_col = "sample_id",
celltype_col = "seurat_annotations"
)
# # Internally this is calculated with this function:
# pb <- calculate_pseudobulk(
# count_matrix = seurat_object[["RNA"]]$counts,
# sample_ids = seurat_object@meta.data[[sample_col]]
# )
plot_pca(
ecoda_object,
label_col = "condition_id",
title = "PCA - Cell type composition"
)
plot_pca(
ecoda_object,
slot = "pb",
label_col = "condition_id",
title = "PCA - Pseudobulk"
)
Observe what happens if you introduce a differential abundance of CD8 T and NK cells between the two groups.
# Multiply CD8 T cells and NK cells within "condition2" by 5:
seurat_object_DiffAbu <- seurat_multiply_cells_in_condition(
seurat_object = seurat_object,
condition_column = "condition_id",
target_condition = "condition2",
celltype_column = "seurat_annotations",
target_celltypes = c("CD8 T", "NK"),
multiplier = 5
)
# Verification (Optional):
# Check the cell counts before:
table(
seurat_object@meta.data$condition,
seurat_object@meta.data$seurat_annotation
)
##
## Naive CD4 T Memory CD4 T CD14+ Mono B CD8 T FCGR3A+ Mono NK DC
## condition1 336 252 246 176 139 80 77 19
## condition2 361 231 234 168 132 82 78 13
##
## Platelet
## condition1 8
## condition2 6
# Check the cell counts after:
table(
seurat_object_DiffAbu@meta.data$condition,
seurat_object_DiffAbu@meta.data$seurat_annotations
)
##
## Naive CD4 T Memory CD4 T CD14+ Mono B CD8 T FCGR3A+ Mono NK DC
## condition1 336 252 246 176 139 80 77 19
## condition2 361 231 234 168 660 82 390 13
##
## Platelet
## condition1 8
## condition2 6
All separation metrics increased and are now close to their maximum value of 1, indicating separation of the two groups of samples.
ecoda_object_DiffAbu <- create_ecoda_object(
seurat_object_DiffAbu,
sample_col = "sample_id",
celltype_col = "seurat_annotations"
)
plot_pca(
ecoda_object_DiffAbu,
label_col = "condition_id",
title = "PCA cell type composition",
n_hv_feat_show = 2
)
plot_pca(
ecoda_object_DiffAbu,
slot = "pb",
label_col = "condition_id",
title = "PCA pseudobulk",
n_hv_feat_show = 2
)
It can be seen that CD8 T and NK cells are significantly higher in the condition2. Interestingly, it can also be seen that all the other cell types seem to be lower abundant, even though their counts were not changed. However, their relative abundance decreased. Remember that parts of a composition are not independent but all are part of a total. If you change one part (e.g. relatively increase), all others change too (e.g. relatively decrease).
plot_boxplot(
ecoda_object,
label_col = "condition_id",
title = "CLR Abundance by Cell Type (with Wilcoxon Test)"
)
plot_boxplot(
ecoda_object_DiffAbu,
label_col = "condition_id",
title = "CLR Abundance by Cell Type (with Wilcoxon Test)"
)
This can also be plotted as stacked bar plots but it is harder to read.
# 1. Plotting Summary by condition_id (your original plot)
plot_barplot(
ecoda_object,
label_col = "condition_id",
plot_by = "group",
title = "Mean Relative Abundance by Condition"
)
plot_barplot(
ecoda_object_DiffAbu,
label_col = "condition_id",
plot_by = "group",
title = "Mean Relative Abundance by Condition"
)
# 2. Plotting Each Sample Separately (the new request)
plot_barplot(
ecoda_object,
label_col = "condition_id",
plot_by = "sample",
title = "Relative Abundance for Each Sample"
)
plot_barplot(
ecoda_object_DiffAbu,
label_col = "condition_id",
plot_by = "sample",
title = "Relative Abundance for Each Sample"
)
Before you introduced an artificial differential abundance, there was no major correlation between any of the cell types. Afterwards, however, it can be clearly seen that CD8 T and NK cells strongly correlate.
plot_corr(ecoda_object)
plot_corr(ecoda_object_DiffAbu)
Pseudobulk falls short of detecting differences between samples if the differences are in very low abundant cell types
To showcase this effect we can again use our semi-synthetic dataset:
First, we make the Seurat object so that it has a lot more cells. For this purpose, we multiply the cells in all samples (both conditions) by 10. Next,
all_celltypes <- colnames(ecoda_object@counts)
low_abundant_celltypes <- c("FCGR3A+ Mono", "B")
high_abundant_celltypes <- all_celltypes[
!all_celltypes %in% low_abundant_celltypes
]
# Let's remove the platelets and DCs as they are zero in about half the samples
# They would introduce noise which would be further amplified when multiplying
seurat_object_clean <- seurat_object
seurat_object_clean$seurat_annotations[
seurat_object_clean$seurat_annotations %in% c("Platelet", "DC")
] <- NA
# We multiply most cell types by 10 in both conditions:
seurat_object_DiffAbu <- seurat_multiply_cells_in_condition(
seurat_object = seurat_object_clean,
condition_column = "condition_id",
target_condition = "condition1",
celltype_column = "seurat_annotations",
target_celltypes = high_abundant_celltypes,
multiplier = 10
)
seurat_object_DiffAbu <- seurat_multiply_cells_in_condition(
seurat_object = seurat_object_DiffAbu,
condition_column = "condition_id",
target_condition = "condition2",
celltype_column = "seurat_annotations",
target_celltypes = high_abundant_celltypes,
multiplier = 10
)
# Next, we introduce a differential abundance in the low abundant cell types
seurat_object_DiffAbu <- seurat_multiply_cells_in_condition(
seurat_object = seurat_object_DiffAbu,
condition_column = "condition_id",
target_condition = "condition2",
celltype_column = "seurat_annotations",
target_celltypes = low_abundant_celltypes,
multiplier = 5
)
# Verification (Optional):
# Check the cell counts before:
table(
seurat_object@meta.data$condition,
seurat_object@meta.data$seurat_annotation
)
##
## Naive CD4 T Memory CD4 T CD14+ Mono B CD8 T FCGR3A+ Mono NK DC
## condition1 336 252 246 176 139 80 77 19
## condition2 361 231 234 168 132 82 78 13
##
## Platelet
## condition1 8
## condition2 6
# Check the cell counts after:
table(
seurat_object_DiffAbu@meta.data$condition,
seurat_object_DiffAbu@meta.data$seurat_annotations
)
##
## Naive CD4 T Memory CD4 T CD14+ Mono B CD8 T FCGR3A+ Mono NK
## condition1 3360 2520 2460 176 1390 80 770
## condition2 3610 2310 2340 840 1320 410 780
##
## DC Platelet
## condition1 0 0
## condition2 0 0
While cell type composition is very sensitive to pick up differential abundances even in low abundant cell types, pseudobulk gene expression completely fails to pick up gene expressional differences in low abundant cell types:
ecoda_object_DiffAbu <- create_ecoda_object(
seurat_object_DiffAbu,
sample_col = "sample_id",
celltype_col = "seurat_annotations"
)
plot_pca(
ecoda_object_DiffAbu,
label_col = "condition_id",
title = "PCA cell type composition",
n_hv_feat_show = 4
)
plot_pca(
ecoda_object_DiffAbu,
slot = "pb",
label_col = "condition_id",
title = "PCA pseudobulk",
n_hv_feat_show = 4
)
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.7.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/Zurich
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] pbmc3k.SeuratData_3.1.4 SeuratData_0.2.2.9001 Seurat_5.3.0
## [4] SeuratObject_5.2.0 sp_2.2-0 scECODA_0.9.1
## [7] BiocStyle_2.32.1
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 splines_4.4.2
## [3] later_1.4.4 tibble_3.3.0
## [5] polyclip_1.10-7 fastDummies_1.7.5
## [7] factoextra_1.0.7 lifecycle_1.0.4
## [9] rstatix_0.7.3 globals_0.18.0
## [11] lattice_0.22-7 MASS_7.3-65
## [13] backports_1.5.0 magrittr_2.0.4
## [15] plotly_4.11.0 sass_0.4.10
## [17] rmarkdown_2.30 jquerylib_0.1.4
## [19] yaml_2.3.10 remotes_2.5.0
## [21] httpuv_1.6.16 otel_0.2.0
## [23] sctransform_0.4.2 spam_2.11-1
## [25] spatstat.sparse_3.1-0 reticulate_1.43.0
## [27] cowplot_1.2.0 pbapply_1.7-4
## [29] RColorBrewer_1.1-3 abind_1.4-8
## [31] zlibbioc_1.50.0 Rtsne_0.17
## [33] GenomicRanges_1.56.2 purrr_1.1.0
## [35] BiocGenerics_0.50.0 rappdirs_0.3.3
## [37] GenomeInfoDbData_1.2.12 IRanges_2.38.1
## [39] S4Vectors_0.42.1 ggrepel_0.9.6
## [41] irlba_2.3.5.1 listenv_0.9.1
## [43] spatstat.utils_3.2-0 pheatmap_1.0.13
## [45] vegan_2.7-2 goftest_1.2-3
## [47] RSpectra_0.16-2 spatstat.random_3.4-2
## [49] fitdistrplus_1.2-4 parallelly_1.45.1
## [51] permute_0.9-8 codetools_0.2-20
## [53] DelayedArray_0.30.1 tidyselect_1.2.1
## [55] UCSC.utils_1.0.0 farver_2.1.2
## [57] matrixStats_1.5.0 stats4_4.4.2
## [59] spatstat.explore_3.5-3 jsonlite_2.0.0
## [61] progressr_0.17.0 Formula_1.2-5
## [63] ggridges_0.5.7 survival_3.8-3
## [65] tools_4.4.2 ica_1.0-3
## [67] Rcpp_1.1.0 glue_1.8.0
## [69] gridExtra_2.3 SparseArray_1.4.8
## [71] xfun_0.53 mgcv_1.9-3
## [73] DESeq2_1.44.0 MatrixGenerics_1.16.0
## [75] GenomeInfoDb_1.40.1 dplyr_1.1.4
## [77] withr_3.0.2 BiocManager_1.30.26
## [79] fastmap_1.2.0 digest_0.6.37
## [81] R6_2.6.1 mime_0.13
## [83] scattermore_1.2 gtools_3.9.5
## [85] tensor_1.5.1 spatstat.data_3.1-8
## [87] tidyr_1.3.1 generics_0.1.4
## [89] renv_1.1.5 data.table_1.17.8
## [91] httr_1.4.7 htmlwidgets_1.6.4
## [93] S4Arrays_1.4.1 uwot_0.2.3
## [95] pkgconfig_2.0.3 gtable_0.3.6
## [97] lmtest_0.9-40 S7_0.2.0
## [99] XVector_0.44.0 htmltools_0.5.8.1
## [101] carData_3.0-5 dotCall64_1.2
## [103] bookdown_0.45 scales_1.4.0
## [105] Biobase_2.64.0 png_0.1-8
## [107] spatstat.univar_3.1-4 corrplot_0.95
## [109] knitr_1.50 reshape2_1.4.4
## [111] nlme_3.1-168 curl_7.0.0
## [113] cachem_1.1.0 zoo_1.8-14
## [115] stringr_1.5.2 KernSmooth_2.23-26
## [117] parallel_4.4.2 miniUI_0.1.2
## [119] pillar_1.11.1 grid_4.4.2
## [121] vctrs_0.6.5 RANN_2.6.2
## [123] promises_1.4.0 ggpubr_0.6.2
## [125] car_3.1-3 xtable_1.8-4
## [127] cluster_2.1.8.1 evaluate_1.0.5
## [129] cli_3.6.5 locfit_1.5-9.12
## [131] compiler_4.4.2 rlang_1.1.6
## [133] crayon_1.5.3 future.apply_1.20.0
## [135] ggsignif_0.6.4 labeling_0.4.3
## [137] mclust_6.1.1 plyr_1.8.9
## [139] stringi_1.8.7 viridisLite_0.4.2
## [141] deldir_2.0-4 BiocParallel_1.38.0
## [143] lazyeval_0.2.2 spatstat.geom_3.6-0
## [145] Matrix_1.7-4 RcppHNSW_0.6.0
## [147] patchwork_1.3.2 future_1.67.0
## [149] ggplot2_4.0.0 shiny_1.11.1
## [151] SummarizedExperiment_1.34.0 ROCR_1.0-11
## [153] igraph_2.2.0 broom_1.0.10
## [155] bslib_0.9.0