1 Fine-grained Cell Type Annotation is Crucial for Uncovering Biological Insights

Single-cell technologies have moved us beyond bulk gene expression profiling, in which measurements largely reflect the underlying cell type composition of the sample. Grouping single cells into functional states based on their gene or protein expression is now common practice, but the resolution of this grouping, that is, the granularity of the cell type annotation, remains a critical user choice.

When analyzing differences in cell composition across patient cohorts or experimental conditions, a broad, low-resolution annotation (e.g., “T cell” instead of “Effector CD8+ T cell”) risks averaging out subtle yet highly relevant biological signals. A small but highly consequential change in a specific T cell subset can be statistically masked when the abundance of the broader T cell population remains stable.
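The masking effect can be illustrated with a small worked example in base R. All counts, cell type names, and the fine-to-broad label mapping below are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from any dataset in this case study.

```r
# Hypothetical cell counts for two samples (1000 cells each).
counts_fine <- rbind(
  control = c(naive_cd8 = 450, effector_cd8 = 50, b_cell = 300, monocyte = 200),
  treated = c(naive_cd8 = 350, effector_cd8 = 150, b_cell = 300, monocyte = 200)
)

# Hypothetical mapping from each fine-grained label to its broad parent label.
broad_label <- c(
  naive_cd8 = "t_cell", effector_cd8 = "t_cell",
  b_cell = "b_cell", monocyte = "monocyte"
)

# Collapse to the broad annotation by summing columns that share a parent label.
counts_broad <- t(rowsum(t(counts_fine), group = broad_label[colnames(counts_fine)]))

# Convert counts to per-sample proportions.
props_fine <- prop.table(counts_fine, margin = 1)
props_broad <- prop.table(counts_broad, margin = 1)

# Broad view: T cells account for 50% of both samples, so nothing appears to change.
props_broad[, "t_cell"]

# Fine view: effector CD8+ T cells rise from 5% to 15% of all cells.
props_fine[, "effector_cd8"]
```

In this toy setting, the threefold expansion of the effector subset is exactly offset by a loss of naive cells, so the broad "t_cell" proportion is identical in both samples and only the fine-grained annotation exposes the shift.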

This case study demonstrates why the choice of annotation granularity is vital for exploratory compositional data analysis (ECODA). Using the scECODA R package, we will show how:

- A broad, low-resolution annotation fails to detect significant differences or yields biologically uninformative results.
- A fine-grained, high-resolution annotation of the exact same dataset successfully uncovers novel, statistically robust, and biologically meaningful shifts in cell type abundance across samples.

By highlighting this principle, we provide the primary justification for investing effort in high-resolution cell type annotation.


2 R environment

Check package dependencies and install scECODA

# if (!requireNamespace("renv")) install.packages("renv")
# library(renv)
# renv::restore()
# renv::install("carmonalab/scECODA")

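# Install remotes first if it is missing, then install scECODA from GitHub
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("carmonalab/scECODA")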
library(scECODA)

3 Example datasets

The first example is the healthy cohort from Gong et al. (2024). With a broad, low-granularity cell type annotation, the samples do not separate. With a detailed, high-granularity annotation of cell types and subtypes, a grouping structure emerges: samples separate by age group and cytomegalovirus (CMV) infection status based on their high-resolution cell type composition.

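# Load the example datasets bundled with scECODA
data("example_data")

# Healthy cohort (Gong et al. 2024): cell counts at two resolutions plus sample metadata
data <- example_data$GongSharma_full
metadata <- data$metadata
# Combine age group and CMV status into a single label used in the PCA plots
metadata$age_cmv <- paste0(metadata$age_group, " CMV-", metadata$subject.cmv)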

counts_lowres <- data$cell_counts_lowresolution
number_of_celltypes_lr <- ncol(counts_lowres)

ecoda_object_lowres <- create_ecoda_object_helper(
  counts = counts_lowres,
  metadata = metadata
)

counts_highres <- data$cell_counts_highresolution
number_of_celltypes_hr <- ncol(counts_highres)

ecoda_object_highres <- create_ecoda_object_helper(
  counts = counts_highres,
  metadata = metadata
)


# Using only low granularity cell type annotation
plot_pca(
  ecoda_object_lowres,
  label_col = "age_cmv",
#   title = paste("PCA based on low granularity cell type annotation
# Number of cell types:", number_of_celltypes_lr),
  n_hv_feat_show = 5
)

# Using only high granularity cell type annotation
plot_pca(
  ecoda_object_highres,
  label_col = "age_cmv",
#   title = paste("PCA based on high granularity cell type annotation
# Number of cell types:", number_of_celltypes_hr),
  n_hv_feat_show = 9
)
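For readers unfamiliar with PCA on compositional data, the following base R sketch shows one common approach: a centered log-ratio (CLR) transform followed by ordinary PCA. It is an assumption that this resembles what happens behind `plot_pca`; the helper `clr_transform`, the pseudocount value, and the toy counts are hypothetical and not part of scECODA.

```r
# Centered log-ratio (CLR) transform of a samples x cell-types count matrix.
# The pseudocount is an assumed value that avoids log(0) for absent cell types.
clr_transform <- function(counts, pseudocount = 0.5) {
  log_counts <- log(as.matrix(counts) + pseudocount)
  # Subtract each sample's mean log count (the log of its geometric mean).
  sweep(log_counts, 1, rowMeans(log_counts), "-")
}

# Hypothetical counts: four samples, four cell types.
toy_counts <- rbind(
  young_1 = c(naive_t = 400, memory_t = 100, b_cell = 300, monocyte = 200),
  young_2 = c(420, 90, 290, 200),
  old_1   = c(200, 300, 300, 200),
  old_2   = c(180, 330, 290, 200)
)

toy_clr <- clr_transform(toy_counts)

# PCA on the CLR-transformed composition (features centered, not scaled).
toy_pca <- prcomp(toy_clr, center = TRUE, scale. = FALSE)

# Fraction of variance explained by each principal component.
round(toy_pca$sdev^2 / sum(toy_pca$sdev^2), 3)

# Sample coordinates on the first two components.
toy_pca$x[, 1:2]
```

Because the CLR transform expresses every cell type relative to the geometric mean of its sample, the PCA operates on ratios between cell types rather than on raw proportions, which is the usual motivation for using it with compositional data.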

The second example uses the dataset from Adams et al. (2020), which comprises lung samples from normal tissue and from patients with idiopathic pulmonary fibrosis (IPF) or chronic obstructive pulmonary disease (COPD).

data("example_data")

data <- example_data$Adams
main_bio_condition <- data$main_biologicalcondition_columnname
metadata <- data$metadata

counts_lowres <- data$cell_counts_lowresolution
number_of_celltypes_lr <- ncol(counts_lowres)

ecoda_object_lowres <- create_ecoda_object_helper(
  counts = counts_lowres,
  metadata = metadata
)

counts_highres <- data$cell_counts_highresolution
number_of_celltypes_hr <- ncol(counts_highres)

ecoda_object_highres <- create_ecoda_object_helper(
  counts = counts_highres,
  metadata = metadata
)


# Using only low granularity cell type annotation
plot_pca(
  ecoda_object_lowres,
  label_col = main_bio_condition,
  title = "PCA based on low granularity cell type annotation",
  n_hv_feat_show = 5
)

# Using only high granularity cell type annotation
plot_pca(
  ecoda_object_highres,
  label_col = main_bio_condition,
  title = "PCA based on high granularity cell type annotation",
  n_hv_feat_show = 8
)

4 References

Adams, Taylor S., Jonas C. Schupp, Sergio Poli, Ehab A. Ayaub, Nir Neumark, Farida Ahangari, Sarah G. Chu, et al. 2020. “Single-cell RNA-seq reveals ectopic and aberrant lung-resident cell populations in idiopathic pulmonary fibrosis.” Science Advances 6 (28): eaba1983. https://doi.org/10.1126/sciadv.aba1983.
Gong, Qiuyu, Mehul Sharma, Emma L. Kuan, Marla C. Glass, Aishwarya Chander, Mansi Singh, Lucas T. Graybuck, et al. 2024. “Longitudinal multi-omic immune profiling reveals age-related immune cell dynamics in healthy adults.” bioRxiv, September 2024. https://doi.org/10.1101/2024.09.10.612119.

Appendix

A Session Info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.7.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Zurich
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
## [1] scECODA_0.9.1    BiocStyle_2.32.1
## 
## loaded via a namespace (and not attached):
##   [1] remotes_2.5.0               permute_0.9-8              
##   [3] rlang_1.1.6                 magrittr_2.0.4             
##   [5] matrixStats_1.5.0           compiler_4.4.2             
##   [7] mgcv_1.9-3                  vctrs_0.6.5                
##   [9] stringr_1.5.2               pkgconfig_2.0.3            
##  [11] crayon_1.5.3                fastmap_1.2.0              
##  [13] backports_1.5.0             XVector_0.44.0             
##  [15] labeling_0.4.3              rmarkdown_2.30             
##  [17] UCSC.utils_1.0.0            purrr_1.1.0                
##  [19] xfun_0.53                   zlibbioc_1.50.0            
##  [21] cachem_1.1.0                GenomeInfoDb_1.40.1        
##  [23] jsonlite_2.0.0              DelayedArray_0.30.1        
##  [25] BiocParallel_1.38.0         broom_1.0.10               
##  [27] parallel_4.4.2              cluster_2.1.8.1            
##  [29] R6_2.6.1                    bslib_0.9.0                
##  [31] stringi_1.8.7               RColorBrewer_1.1-3         
##  [33] car_3.1-3                   GenomicRanges_1.56.2       
##  [35] jquerylib_0.1.4             Rcpp_1.1.0                 
##  [37] bookdown_0.45               SummarizedExperiment_1.34.0
##  [39] knitr_1.50                  IRanges_2.38.1             
##  [41] Matrix_1.7-4                splines_4.4.2              
##  [43] igraph_2.2.0                tidyselect_1.2.1           
##  [45] abind_1.4-8                 yaml_2.3.10                
##  [47] vegan_2.7-2                 codetools_0.2-20           
##  [49] curl_7.0.0                  lattice_0.22-7             
##  [51] tibble_3.3.0                Biobase_2.64.0             
##  [53] withr_3.0.2                 S7_0.2.0                   
##  [55] evaluate_1.0.5              mclust_6.1.1               
##  [57] pillar_1.11.1               BiocManager_1.30.26        
##  [59] ggpubr_0.6.2                MatrixGenerics_1.16.0      
##  [61] carData_3.0-5               corrplot_0.95              
##  [63] renv_1.1.5                  stats4_4.4.2               
##  [65] plotly_4.11.0               generics_0.1.4             
##  [67] S4Vectors_0.42.1            ggplot2_4.0.0              
##  [69] scales_1.4.0                gtools_3.9.5               
##  [71] glue_1.8.0                  pheatmap_1.0.13            
##  [73] lazyeval_0.2.2              tools_4.4.2                
##  [75] data.table_1.17.8           locfit_1.5-9.12            
##  [77] ggsignif_0.6.4              RANN_2.6.2                 
##  [79] grid_4.4.2                  tidyr_1.3.1                
##  [81] nlme_3.1-168                GenomeInfoDbData_1.2.12    
##  [83] Formula_1.2-5               cli_3.6.5                  
##  [85] S4Arrays_1.4.1              viridisLite_0.4.2          
##  [87] dplyr_1.1.4                 gtable_0.3.6               
##  [89] rstatix_0.7.3               DESeq2_1.44.0              
##  [91] sass_0.4.10                 digest_0.6.37              
##  [93] BiocGenerics_0.50.0         SparseArray_1.4.8          
##  [95] ggrepel_0.9.6               htmlwidgets_1.6.4          
##  [97] farver_2.1.2                htmltools_0.5.8.1          
##  [99] factoextra_1.0.7            lifecycle_1.0.4            
## [101] httr_1.4.7                  MASS_7.3-65