Characterizing intratumor heterogeneity with GeneNMF
Cancer cells are heterogeneous, expressing a variety of gene programs. Understanding the main axes of transcriptional variability in cancer cells can help shed light on the mechanisms of cancer progression and on the reasons for success/failure of cancer treatments. scRNA-seq data are particularly useful to study intratumoral heterogeneity, as they allow identifying “gene programs” consisting of genes that are coexpressed in subsets of cancer cells within individual tumors.
GeneNMF is a computational method aimed at extracting recurrent gene programs in cancer cells across multiple studies. Inspired on previous work on the application of non-negative matrix factorization (NMF) on scRNA-seq data (see e.g. Kotliar et al. (2019), Barkely et al. (2022), Gavish et al. (2023)), GeneNMF decomposes the expression matrix of individual datasets into a few, interpratable dimensions, and subsequently calculates “meta-programs” (or MP) defined as consensus gene sets that are found to be expressed across multiple samples and initializations of the method.
In this demo, we apply GeneNMF on 11 samples of basal cell carcinoma (BCC) malignant cells from two different studies (Ganier et al. (2024) and Yerly et al. (2022)).
Set up the environment
Here are some packages you’ll need for this demo:
BCC dataset
Download the test dataset for this demo from Figshare:
do_download <- FALSE
ddir <- "input"
data.path <- sprintf("%s/Tumor_combination_LY_CG.rds", ddir)
if (do_download) {
dir.create(ddir)
options(timeout = 3000)
download.file("https://figshare.com/ndownloader/files/47742634", destfile = data.path)
}
seu <- readRDS(data.path)
As with most cancer cell datasets, we observe large batch effects between patients.
With this level of batch effects, it is not trivial to analyse multiple samples together. Typical scRNA-seq analysis pipelines call for batch-effect correction methods, which aim at reducing technical variability while preserving real, biological differences (see e.g. Luecken et al. (2022)). However, cancer cells DO indeed have unique transcriptional phenotypes in different patients, making technical and biological variability indistinguishable. Extracting gene programs by NMF in individual patients is attractive because it bypasses the need for batch effect correction - we rather integrate multiple samples at the level of their gene program activities.
Consistent NMF programs across multiple samples
Identification of robust gene programs requires their detection
across samples and variability of input parameters. Perhaps the most
crucial parameter to NMF is the dimensionality k
, which
corresponds to the number of programs of the low-dimensional matrix. To
determine robust programs, we can run NMF over multiple numbers of
k
and determine programs that are consistenly found across
these runs. The multiNMF()
function automatically performs
NMF over a list of samples and for multiple values of
k
:
DefaultAssay(seu) <- "RNA"
seu.list <- SplitObject(seu, split.by = "Sample")
geneNMF.programs <- multiNMF(seu.list, assay="RNA", k=4:9, min.exp = 0.05)
We can now combine the gene programs identified over multiple samples
and numbers of k
into metaprograms (MPs),
i.e. consensus programs that are robustly identified across NMF runs.
Here we will define 10 MPs:
geneNMF.metaprograms <- getMetaPrograms(geneNMF.programs,
metric = "cosine",
weight.explained = 0.5,
nMP=10)
It can be useful to visualize pairwise similarity (in terms of cosine
similarity or Jaccard index) between individual gene programs that
compose meta-programs. We can see “blocks” corresponding to gene
programs of high similarity across datasets and values of
k
. We can then cut the similarity tree at a given height to
find blocks of similar programs and derive consensus gene signatures for
each block. For example, here we cut the tree to the height
corresponding to 10 clusters of programs (i.e. 10 MPs):
We can also inspect useful statistics, such as the “sample coverage” (in what fraction of samples the MP was detected); or the silhouette coefficient (how similar are individual programs in a MP relative to programs in other MPs - the higher the better).
Based on these metrics, one may decide to drop some of the programs, e.g. if they are specific for few samples only (low sample coverage), or have bad internal consistency (low silhouette and meanSimilarity).
## sampleCoverage silhouette meanSimilarity numberGenes numberPrograms
## MP1 1.0000000 0.696882673 0.723 21 50
## MP2 1.0000000 0.297392424 0.433 33 60
## MP3 0.9090909 0.129408777 0.295 142 60
## MP4 0.9090909 -0.073756249 0.161 45 109
## MP5 0.8181818 0.509947198 0.624 23 45
## MP6 0.5454545 0.459199536 0.560 11 20
## MP7 0.5454545 0.448717232 0.543 19 17
## MP8 0.5454545 0.008033304 0.233 52 44
## MP9 0.3636364 0.549435393 0.617 45 12
## MP10 0.2727273 0.530347451 0.599 36 12
Some important parameters
We can also be more strict when we extract MPs from the matrix of
individual programs. For example, we can increase the
min.confidence
parameter to 0.7 to focus on genes
consistently found across at least 70% of the individual programs.
geneNMF.metaprograms <- getMetaPrograms(geneNMF.programs,
metric = "cosine",
weight.explained = 0.5,
nMP=10,
min.confidence = 0.7)
## Dropped 1 empty meta-programs
We see the one on the MPs has now been dropped (we could have also picked it up by its poor metrics just above).
## sampleCoverage silhouette meanSimilarity numberGenes numberPrograms
## MP1 1.0000000 0.696882673 0.723 21 50
## MP2 1.0000000 0.297392424 0.433 21 60
## MP3 0.9090909 0.129408777 0.295 38 60
## MP4 0.8181818 0.509947198 0.624 23 45
## MP5 0.5454545 0.459199536 0.560 9 20
## MP6 0.5454545 0.448717232 0.543 15 17
## MP7 0.5454545 0.008033304 0.233 13 44
## MP8 0.3636364 0.549435393 0.617 36 12
## MP9 0.2727273 0.530347451 0.599 25 12
We see that most MPs are composed of around 10-20 genes (column
numberGenes
in the table above). We can control the size of
the MPs using the weight.explained
parameter. For example,
we can increase weight.explained
to 0.8 to include all
genes that allow to explain 80% of loadings in the consensus gene
set:
geneNMF.metaprograms <- getMetaPrograms(geneNMF.programs,
metric = "cosine",
weight.explained = 0.8,
nMP=10,
min.confidence = 0.7)
## Dropped 1 empty meta-programs
Now we have larger meta-program gene sets:
## sampleCoverage silhouette meanSimilarity numberGenes numberPrograms
## MP1 1.0000000 0.696882673 0.723 38 50
## MP2 1.0000000 0.297392424 0.433 27 60
## MP3 0.9090909 0.129408777 0.295 38 60
## MP4 0.8181818 0.509947198 0.624 37 45
## MP5 0.5454545 0.459199536 0.560 11 20
## MP6 0.5454545 0.448717232 0.543 21 17
## MP7 0.5454545 0.008033304 0.233 13 44
## MP8 0.3636364 0.549435393 0.617 59 12
## MP9 0.2727273 0.530347451 0.599 38 12
Other important parameters to getMetaPrograms()
are
nMP
(the target number of meta-programs) and
specificity.weight
, which controls how much MP signatures
are allowed to overlap. From the default
specificity.weight=5
, lower values (e.g. =3
)
allows more overlapping MPs, higher values (e.g. =8
)
enforces higher sparsity in the decomposition.
Interpretation of gene programs
What are the genes driving each meta-program?
## $MP1
## [1] "TOP2A" "CENPF" "NUSAP1" "PCLAF" "MAD2L1" "TK1"
##
## $MP2
## [1] "KRT6B" "CALML5" "S100A7" "DMKN" "LGALS7B" "KRT1"
##
## $MP3
## [1] "NIN" "FAT3" "PEAK1" "NAV2" "SLC7A2" "INTS6"
##
## $MP4
## [1] "FOSB" "FOS" "EGR1" "ZFP36" "DNAJB1" "NR4A1"
##
## $MP5
## [1] "ACTA2" "TAGLN" "MYLK" "MYL9" "CSRP1" "ANXA3"
##
## $MP6
## [1] "S100A8" "S100A9" "CHI3L1" "TIMP1" "CCL2" "S100A7"
##
## $MP7
## [1] "ATF3" "ZFP36" "SOCS3" "IRF1" "BTG2" "NR4A1"
##
## $MP8
## [1] "CRCT1" "CRNN" "RBP2" "LINC00964" "FOS" "SOX18"
##
## $MP9
## [1] "SFRP1" "ALK" "COL14A1" "DCN" "MEGF6" "CHGA"
We can also inspect their relative weights to see how much each gene contributes to the MP. For example, for MP6:
## S100A8 S100A9 CHI3L1 TIMP1 CCL2 S100A7
## 0.215830597 0.210983115 0.115345508 0.105439904 0.090343059 0.051177038
## TAGLN MT2A INHBA KRT6A KRT6B S100A2
## 0.037815897 0.029383116 0.023189284 0.022741619 0.019063225 0.013392001
## IFITM3 ZFP36 KRT75 MGP SOCS3 ANXA3
## 0.011160279 0.010551846 0.009011202 0.006894923 0.006609389 0.005611245
## KLF10 TMEM45A SULF2
## 0.005390284 0.005292033 0.004774439
To aid the interpretation of gene programs, we can compare them to
known signatures from public databases. The runGSEA()
function can be useful to scan MSigDB and evaluate the overlap of
detected gene programs with signatures in the databases. Here we compare
to the “C5” category, biological process subcategory; but other classes
such as “H” (hallmark gene sets) or “C8” (cell type) may be more
relevant in other contexts.
top_p <- lapply(geneNMF.metaprograms$metaprograms.genes, function(program) {
runGSEA(program, universe=rownames(seu), category = "C5", subcategory = "GO:BP")
})
For example, MP1 appears to be associated with cell cycling / cell division.
## pathway pval padj overlap size
## <char> <num> <num> <int> <int>
## 1: GOBP_CELL_CYCLE_PROCESS 2.071463e-27 1.585705e-23 27 1175
## 2: GOBP_MITOTIC_CELL_CYCLE 9.610340e-26 3.678358e-22 24 866
## 3: GOBP_CELL_CYCLE 1.444483e-24 3.685839e-21 28 1716
## 4: GOBP_ORGANELLE_FISSION 2.410380e-24 3.842722e-21 20 490
## 5: GOBP_MITOTIC_CELL_CYCLE_PROCESS 2.509942e-24 3.842722e-21 22 714
## 6: GOBP_CELL_DIVISION 5.394203e-24 6.882104e-21 21 618
## overlapGenes
## <list>
## 1: ASPM, BR....
## 2: BRCA1, B....
## 3: ASPM, BR....
## 4: ASPM, BR....
## 5: BRCA1, B....
## 6: ASPM, BR....
A word of caution: GSEA can be very useful to gain a broad idea of MP function. However, gene sets in MSigDB are often very large (>1000 genes), noisy sets derived from specific conditions, so results should be interpreted with caution.
Signature scores for gene programs
Now that we derived gene signatures for our MPs, we can use them to
obtain gene sets scores to place each cell on the axis of each MP. We
can apply the UCell
method to derive MP scores between 0
and 1 for each cell.
mp.genes <- geneNMF.metaprograms$metaprograms.genes
seu <- AddModuleScore_UCell(seu, features = mp.genes, ncores=4, name = "")
We can explore whether certain MPs are enriched in individual samples:
Some MPs appear to be equally active across multiple samples; others such as MP4 or MP7 have different distributions in different samples. If clinical data are available for patients, one may use this information to study associations between MPs and relevant biological variables.
Signature scores to define integrated spaces
Individual cells can now be represented in terms of their gene program scores. Importantly, here the gene programs were learned as a consensus of gene programs found across multiple samples – as opposed to calculating NMF once on the whole dataset. This can be an effective strategy to mitigate batch effects, as meta-programs (MPs) are a consensus of gene programs consistently found across individual samples. Let’s store these coordinates in the Seurat object:
matrix <- seu@meta.data[,names(mp.genes)]
#dimred <- scale(matrix)
dimred <- as.matrix(matrix)
colnames(dimred) <- paste0("MP_",seq(1, ncol(dimred)))
#New dim reduction
seu@reductions[["MPsignatures"]] <- new("DimReduc",
cell.embeddings = dimred,
assay.used = "RNA",
key = "MP_",
global = FALSE)
We can also use these scores to generate a UMAP representation and visualize the data in 2D
set.seed(123)
seu <- RunUMAP(seu, reduction="MPsignatures", dims=1:length(seu@reductions[["MPsignatures"]]),
metric = "euclidean", reduction.name = "umap_MP")
We have now integrated the data based on their MP signature scores!
How do the signature scores for the meta-programs look like in the combined space?
Final remarks
NMF can be a powerful tool to extract gene programs for scRNA-seq data in an unbiased manner. Because it is calculated for each sample separately, it bypasses the need to perform batch effect correction to analyse samples jointly.
References
- GeneNMF package - LINK
- We use the ultra-fast RcppML method for NMF - LINK
- Analysis of BCC using GeneNMF - Yerly, Andreatta et al. (2024)
See also
- Decomposing PBMC scRNA-seq data using NMF - DEMO
- More demos are available at the GeneNMF GitHub repo. Questions? open an issue on GitHub.
Session Info
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sonoma 14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/Zurich
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] viridis_0.6.5 viridisLite_0.4.2 fgsea_1.30.0 msigdbr_7.5.1
## [5] GeneNMF_0.6.0 RColorBrewer_1.1-3 dplyr_1.1.4 tidyr_1.3.1
## [9] patchwork_1.2.0 UCell_2.8.0 ggplot2_3.5.1 Seurat_5.1.0
## [13] SeuratObject_5.0.2 sp_2.1-4 renv_1.0.7
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 splines_4.4.0
## [3] later_1.3.2 tibble_3.2.1
## [5] polyclip_1.10-6 fastDummies_1.7.3
## [7] lifecycle_1.0.4 globals_0.16.3
## [9] lattice_0.22-6 MASS_7.3-60.2
## [11] SnowballC_0.7.1 magrittr_2.0.3
## [13] plotly_4.10.4 sass_0.4.9
## [15] rmarkdown_2.27 jquerylib_0.1.4
## [17] yaml_2.3.9 remotes_2.5.0
## [19] httpuv_1.6.15 sctransform_0.4.1
## [21] spam_2.10-0 spatstat.sparse_3.1-0
## [23] reticulate_1.38.0 cowplot_1.1.3
## [25] pbapply_1.7-2 abind_1.4-5
## [27] zlibbioc_1.50.0 Rtsne_0.17
## [29] GenomicRanges_1.56.1 purrr_1.0.2
## [31] BiocGenerics_0.50.0 GenomeInfoDbData_1.2.12
## [33] IRanges_2.38.1 S4Vectors_0.42.1
## [35] ggrepel_0.9.5 irlba_2.3.5.1
## [37] listenv_0.9.1 spatstat.utils_3.0-5
## [39] pheatmap_1.0.12 goftest_1.2-3
## [41] RSpectra_0.16-2 spatstat.random_3.3-1
## [43] fitdistrplus_1.2-1 parallelly_1.37.1
## [45] leiden_0.4.3.1 codetools_0.2-20
## [47] DelayedArray_0.30.1 tidyselect_1.2.1
## [49] UCSC.utils_1.0.0 farver_2.1.2
## [51] matrixStats_1.3.0 stats4_4.4.0
## [53] rmdformats_1.0.4 spatstat.explore_3.3-1
## [55] jsonlite_1.8.8 BiocNeighbors_1.22.0
## [57] progressr_0.14.0 ggridges_0.5.6
## [59] survival_3.5-8 tools_4.4.0
## [61] ica_1.0-3 Rcpp_1.0.13
## [63] glue_1.7.0 gridExtra_2.3
## [65] SparseArray_1.4.8 xfun_0.46
## [67] MatrixGenerics_1.16.0 GenomeInfoDb_1.40.1
## [69] withr_3.0.0 BiocManager_1.30.23
## [71] fastmap_1.2.0 fansi_1.0.6
## [73] RcppML_0.3.7 digest_0.6.36
## [75] R6_2.5.1 mime_0.12
## [77] colorspace_2.1-0 scattermore_1.2
## [79] tensor_1.5 spatstat.data_3.1-2
## [81] utf8_1.2.4 generics_0.1.3
## [83] data.table_1.15.4 httr_1.4.7
## [85] htmlwidgets_1.6.4 S4Arrays_1.4.1
## [87] uwot_0.2.2 pkgconfig_2.0.3
## [89] gtable_0.3.5 lmtest_0.9-40
## [91] SingleCellExperiment_1.26.0 XVector_0.44.0
## [93] htmltools_0.5.8.1 dotCall64_1.1-1
## [95] bookdown_0.40 scales_1.3.0
## [97] Biobase_2.64.0 png_0.1-8
## [99] spatstat.univar_3.0-0 knitr_1.48
## [101] reshape2_1.4.4 nlme_3.1-164
## [103] curl_5.2.1 cachem_1.1.0
## [105] zoo_1.8-12 stringr_1.5.1
## [107] KernSmooth_2.23-22 parallel_4.4.0
## [109] miniUI_0.1.1.1 pillar_1.9.0
## [111] grid_4.4.0 vctrs_0.6.5
## [113] RANN_2.6.1 lsa_0.73.3
## [115] promises_1.3.0 xtable_1.8-4
## [117] cluster_2.1.6 evaluate_0.24.0
## [119] cli_3.6.3 compiler_4.4.0
## [121] rlang_1.1.4 crayon_1.5.3
## [123] future.apply_1.11.2 labeling_0.4.3
## [125] plyr_1.8.9 stringi_1.8.4
## [127] deldir_2.0-4 BiocParallel_1.38.0
## [129] babelgene_22.9 munsell_0.5.1
## [131] lazyeval_0.2.2 spatstat.geom_3.3-2
## [133] Matrix_1.7-0 RcppHNSW_0.6.0
## [135] future_1.33.2 shiny_1.8.1.1
## [137] SummarizedExperiment_1.34.0 highr_0.11
## [139] ROCR_1.0-11 igraph_2.0.3
## [141] bslib_0.7.0 fastmatch_1.1-4