1 ECODA vs pseudobulk

ECODA can be more sensitive to detect differences between samples then pseudobulk, especially when studying complex conditions and tissues with high-resolution cell type annotation including numerous low abundant cell subtypes.

Additionally, ECODA is much more interpretable (10-100 cell types) compared to gene expression (2’000 - 20’000 genes).

In this example, you will see how a dataset without sample separation looks like and how it compares to a dataset with differentially abundant cell types, in the cell type compositional and gene expressional space.

Also, it will give you a feeling about the sensitivity of cell type composition based sample separation versus pseudobulk based sample separation.


2 R environment

Check package dependencies and install scECODA

# if (!requireNamespace("renv")) install.packages("renv")
# library(renv)
# renv::restore()
# renv::install("carmonalab/scECODA")

remotes::install_github("carmonalab/scECODA")
library(scECODA)

3 Pseudobulk analysis

scECODA provides a convenient function to calculate and visualize sample pseudobulks from scRNA-seq data, e.g. from a Seurat object.

In this example you will see:

  1. How to calculate pseudobulks
  2. How to plot a PCA from pseudobulks
  3. How pseudobulk compares to cell type compositional analysis (ECODA)
    • When pseudobulk and cell type composition perform similarly
    • Pseudobulk falls short of detecting differences between samples if the differences are in low abundant cell types

3.0.1 When pseudobulk and cell type composition perform similarly

We use the Seurat pbmc3k dataset.

# Loading an example Seurat dataset
library(Seurat)
library(SeuratData)
InstallData("pbmc3k")
data("pbmc3k")
seurat_object <- UpdateSeuratObject(pbmc3k)

The cells of this example dataset are annotated but not assigned to samples or biological conditions. Let us artificially assign them by creating a sample and condition column in the metadata.

n_samples <- 40
n_groups <- 2

seurat_object$sample_id <- ""
seurat_object$sample_id <- rep(
  x = paste0("sample", seq(n_samples)),
  length.out = length(seurat_object$sample_id))
seurat_object$condition_id <- ""
seurat_object$condition_id <- c(rep("condition1", n_samples/n_groups),
                      rep("condition2", n_samples/n_groups))

As we randomly assigned cells and samples to two arbitrary “conditions”, we see no separation between samples. All separation metrics show low scores around zero, indicating indeed a random distribution of our samples between conditions.

# Create ECODA object
ecoda_object <- create_ecoda_object(
  seurat_object,
  sample_col = "sample_id",
  celltype_col = "seurat_annotations"
)

# # Internally this is calculated with this function:
# pb <- calculate_pseudobulk(
#   count_matrix = seurat_object[["RNA"]]$counts,
#   sample_ids = seurat_object@meta.data[[sample_col]]
# )

plot_pca(
  ecoda_object,
  label_col = "condition_id",
  title = "PCA - Cell type composition"
)

plot_pca(
  ecoda_object,
  slot = "pb",
  label_col = "condition_id",
  title = "PCA - Pseudobulk"
)

Observe what happens if you introduce a differential abundance of CD8 T and NK cells between the two groups.

# Multiply CD8 T cells and NK cells within "condition2" by 5:
seurat_object_DiffAbu <- seurat_multiply_cells_in_condition(
  seurat_object = seurat_object,
  condition_column = "condition_id",
  target_condition = "condition2",
  celltype_column = "seurat_annotations",
  target_celltypes = c("CD8 T", "NK"),
  multiplier = 5
)

# Verification (Optional):
# Check the cell counts before:
table(
  seurat_object@meta.data$condition,
  seurat_object@meta.data$seurat_annotation
)
##             
##              Naive CD4 T Memory CD4 T CD14+ Mono   B CD8 T FCGR3A+ Mono  NK  DC
##   condition1         336          252        246 176   139           80  77  19
##   condition2         361          231        234 168   132           82  78  13
##             
##              Platelet
##   condition1        8
##   condition2        6
# Check the cell counts after:
table(
  seurat_object_DiffAbu@meta.data$condition,
  seurat_object_DiffAbu@meta.data$seurat_annotations
)
##             
##              Naive CD4 T Memory CD4 T CD14+ Mono   B CD8 T FCGR3A+ Mono  NK  DC
##   condition1         336          252        246 176   139           80  77  19
##   condition2         361          231        234 168   660           82 390  13
##             
##              Platelet
##   condition1        8
##   condition2        6

All separation metrics increased and are now close to their maximum value of 1, indicating separation of the two groups of samples.

ecoda_object_DiffAbu <- create_ecoda_object(
  seurat_object_DiffAbu,
  sample_col = "sample_id",
  celltype_col = "seurat_annotations"
)
plot_pca(
  ecoda_object_DiffAbu,
  label_col = "condition_id",
  title = "PCA cell type composition",
  n_hv_feat_show = 2
)

plot_pca(
  ecoda_object_DiffAbu,
  slot = "pb",
  label_col = "condition_id",
  title = "PCA pseudobulk",
  n_hv_feat_show = 2
)

It can be seen that CD8 T and NK cells are significantly higher in the condition2. Interestingly, it can also be seen that all the other cell types seem to be lower abundant, even though their counts were not changed. However, their relative abundance decreased. Remember that parts of a composition are not independent but all are part of a total. If you change one part (e.g. relatively increase), all others change too (e.g. relatively decrease).

plot_boxplot(
  ecoda_object,
  label_col = "condition_id",
  title = "CLR Abundance by Cell Type (with Wilcoxon Test)"
)

plot_boxplot(
  ecoda_object_DiffAbu,
  label_col = "condition_id",
  title = "CLR Abundance by Cell Type (with Wilcoxon Test)"
)

This can also be plotted as stacked bar plots but it is harder to read.

# 1. Plotting Summary by condition_id (your original plot)
plot_barplot(
  ecoda_object,
  label_col = "condition_id",
  plot_by = "group",
  title = "Mean Relative Abundance by Condition"
)

plot_barplot(
  ecoda_object_DiffAbu,
  label_col = "condition_id",
  plot_by = "group",
  title = "Mean Relative Abundance by Condition"
)

# 2. Plotting Each Sample Separately (the new request)
plot_barplot(
  ecoda_object,
  label_col = "condition_id",
  plot_by = "sample",
  title = "Relative Abundance for Each Sample"
)

plot_barplot(
  ecoda_object_DiffAbu,
  label_col = "condition_id",
  plot_by = "sample",
  title = "Relative Abundance for Each Sample"
)

Before you introduced an artificial differential abundance, there was no major correlation between any of the cell types. Afterwards, however, it can be clearly seen that CD8 T and NK cells strongly correlate.

plot_corr(ecoda_object)

plot_corr(ecoda_object_DiffAbu)

3.0.2 ECODA outperforms pseudobulk in detecting changes in low abundant cell subtypes

Pseudobulk falls short of detecting differences between samples if the differences are in very low abundant cell types

To showcase this effect we can again use our semi-synthetic dataset:

First, we make the Seurat object so that it has a lot more cells. For this purpose, we multiply the cells in all samples (both conditions) by 10. Next,

all_celltypes <- colnames(ecoda_object@counts)
low_abundant_celltypes <- c("FCGR3A+ Mono", "B")
high_abundant_celltypes <- all_celltypes[
  !all_celltypes %in% low_abundant_celltypes
]

# Let's remove the platelets and DCs as they are zero in about half the samples
# They would introduce noise which would be further amplified when multiplying
seurat_object_clean <- seurat_object
seurat_object_clean$seurat_annotations[
  seurat_object_clean$seurat_annotations %in% c("Platelet", "DC")
] <- NA

# We multiply most cell types by 10 in both conditions:
seurat_object_DiffAbu <- seurat_multiply_cells_in_condition(
  seurat_object = seurat_object_clean,
  condition_column = "condition_id",
  target_condition = "condition1",
  celltype_column = "seurat_annotations",
  target_celltypes = high_abundant_celltypes,
  multiplier = 10
)
seurat_object_DiffAbu <- seurat_multiply_cells_in_condition(
  seurat_object = seurat_object_DiffAbu,
  condition_column = "condition_id",
  target_condition = "condition2",
  celltype_column = "seurat_annotations",
  target_celltypes = high_abundant_celltypes,
  multiplier = 10
)

# Next, we introduce a differential abundance in the low abundant cell types
seurat_object_DiffAbu <- seurat_multiply_cells_in_condition(
  seurat_object = seurat_object_DiffAbu,
  condition_column = "condition_id",
  target_condition = "condition2",
  celltype_column = "seurat_annotations",
  target_celltypes = low_abundant_celltypes,
  multiplier = 5
)


# Verification (Optional):
# Check the cell counts before:
table(
  seurat_object@meta.data$condition,
  seurat_object@meta.data$seurat_annotation
)
##             
##              Naive CD4 T Memory CD4 T CD14+ Mono   B CD8 T FCGR3A+ Mono  NK  DC
##   condition1         336          252        246 176   139           80  77  19
##   condition2         361          231        234 168   132           82  78  13
##             
##              Platelet
##   condition1        8
##   condition2        6
# Check the cell counts after:
table(
  seurat_object_DiffAbu@meta.data$condition,
  seurat_object_DiffAbu@meta.data$seurat_annotations
)
##             
##              Naive CD4 T Memory CD4 T CD14+ Mono    B CD8 T FCGR3A+ Mono   NK
##   condition1        3360         2520       2460  176  1390           80  770
##   condition2        3610         2310       2340  840  1320          410  780
##             
##                DC Platelet
##   condition1    0        0
##   condition2    0        0

While cell type composition is very sensitive to pick up differential abundances even in low abundant cell types, pseudobulk gene expression completely fails to pick up gene expressional differences in low abundant cell types:

ecoda_object_DiffAbu <- create_ecoda_object(
  seurat_object_DiffAbu,
  sample_col = "sample_id",
  celltype_col = "seurat_annotations"
)
plot_pca(
  ecoda_object_DiffAbu,
  label_col = "condition_id",
  title = "PCA cell type composition",
  n_hv_feat_show = 4
)

plot_pca(
  ecoda_object_DiffAbu,
  slot = "pb",
  label_col = "condition_id",
  title = "PCA pseudobulk",
  n_hv_feat_show = 4
)

4 References

Appendix

A Session Info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.7.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Zurich
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
## [1] pbmc3k.SeuratData_3.1.4 SeuratData_0.2.2.9001   Seurat_5.3.0           
## [4] SeuratObject_5.2.0      sp_2.2-0                scECODA_0.9.1          
## [7] BiocStyle_2.32.1       
## 
## loaded via a namespace (and not attached):
##   [1] RcppAnnoy_0.0.22            splines_4.4.2              
##   [3] later_1.4.4                 tibble_3.3.0               
##   [5] polyclip_1.10-7             fastDummies_1.7.5          
##   [7] factoextra_1.0.7            lifecycle_1.0.4            
##   [9] rstatix_0.7.3               globals_0.18.0             
##  [11] lattice_0.22-7              MASS_7.3-65                
##  [13] backports_1.5.0             magrittr_2.0.4             
##  [15] plotly_4.11.0               sass_0.4.10                
##  [17] rmarkdown_2.30              jquerylib_0.1.4            
##  [19] yaml_2.3.10                 remotes_2.5.0              
##  [21] httpuv_1.6.16               otel_0.2.0                 
##  [23] sctransform_0.4.2           spam_2.11-1                
##  [25] spatstat.sparse_3.1-0       reticulate_1.43.0          
##  [27] cowplot_1.2.0               pbapply_1.7-4              
##  [29] RColorBrewer_1.1-3          abind_1.4-8                
##  [31] zlibbioc_1.50.0             Rtsne_0.17                 
##  [33] GenomicRanges_1.56.2        purrr_1.1.0                
##  [35] BiocGenerics_0.50.0         rappdirs_0.3.3             
##  [37] GenomeInfoDbData_1.2.12     IRanges_2.38.1             
##  [39] S4Vectors_0.42.1            ggrepel_0.9.6              
##  [41] irlba_2.3.5.1               listenv_0.9.1              
##  [43] spatstat.utils_3.2-0        pheatmap_1.0.13            
##  [45] vegan_2.7-2                 goftest_1.2-3              
##  [47] RSpectra_0.16-2             spatstat.random_3.4-2      
##  [49] fitdistrplus_1.2-4          parallelly_1.45.1          
##  [51] permute_0.9-8               codetools_0.2-20           
##  [53] DelayedArray_0.30.1         tidyselect_1.2.1           
##  [55] UCSC.utils_1.0.0            farver_2.1.2               
##  [57] matrixStats_1.5.0           stats4_4.4.2               
##  [59] spatstat.explore_3.5-3      jsonlite_2.0.0             
##  [61] progressr_0.17.0            Formula_1.2-5              
##  [63] ggridges_0.5.7              survival_3.8-3             
##  [65] tools_4.4.2                 ica_1.0-3                  
##  [67] Rcpp_1.1.0                  glue_1.8.0                 
##  [69] gridExtra_2.3               SparseArray_1.4.8          
##  [71] xfun_0.53                   mgcv_1.9-3                 
##  [73] DESeq2_1.44.0               MatrixGenerics_1.16.0      
##  [75] GenomeInfoDb_1.40.1         dplyr_1.1.4                
##  [77] withr_3.0.2                 BiocManager_1.30.26        
##  [79] fastmap_1.2.0               digest_0.6.37              
##  [81] R6_2.6.1                    mime_0.13                  
##  [83] scattermore_1.2             gtools_3.9.5               
##  [85] tensor_1.5.1                spatstat.data_3.1-8        
##  [87] tidyr_1.3.1                 generics_0.1.4             
##  [89] renv_1.1.5                  data.table_1.17.8          
##  [91] httr_1.4.7                  htmlwidgets_1.6.4          
##  [93] S4Arrays_1.4.1              uwot_0.2.3                 
##  [95] pkgconfig_2.0.3             gtable_0.3.6               
##  [97] lmtest_0.9-40               S7_0.2.0                   
##  [99] XVector_0.44.0              htmltools_0.5.8.1          
## [101] carData_3.0-5               dotCall64_1.2              
## [103] bookdown_0.45               scales_1.4.0               
## [105] Biobase_2.64.0              png_0.1-8                  
## [107] spatstat.univar_3.1-4       corrplot_0.95              
## [109] knitr_1.50                  reshape2_1.4.4             
## [111] nlme_3.1-168                curl_7.0.0                 
## [113] cachem_1.1.0                zoo_1.8-14                 
## [115] stringr_1.5.2               KernSmooth_2.23-26         
## [117] parallel_4.4.2              miniUI_0.1.2               
## [119] pillar_1.11.1               grid_4.4.2                 
## [121] vctrs_0.6.5                 RANN_2.6.2                 
## [123] promises_1.4.0              ggpubr_0.6.2               
## [125] car_3.1-3                   xtable_1.8-4               
## [127] cluster_2.1.8.1             evaluate_1.0.5             
## [129] cli_3.6.5                   locfit_1.5-9.12            
## [131] compiler_4.4.2              rlang_1.1.6                
## [133] crayon_1.5.3                future.apply_1.20.0        
## [135] ggsignif_0.6.4              labeling_0.4.3             
## [137] mclust_6.1.1                plyr_1.8.9                 
## [139] stringi_1.8.7               viridisLite_0.4.2          
## [141] deldir_2.0-4                BiocParallel_1.38.0        
## [143] lazyeval_0.2.2              spatstat.geom_3.6-0        
## [145] Matrix_1.7-4                RcppHNSW_0.6.0             
## [147] patchwork_1.3.2             future_1.67.0              
## [149] ggplot2_4.0.0               shiny_1.11.1               
## [151] SummarizedExperiment_1.34.0 ROCR_1.0-11                
## [153] igraph_2.2.0                broom_1.0.10               
## [155] bslib_0.9.0