Characterizing intratumor heterogeneity with GeneNMF
Cancer cells are heterogeneous, expressing a variety of gene programs. Understanding the main axes of transcriptional variability in cancer cells can help shed light on the mechanisms of cancer progression and on the reasons for success/failure of cancer treatments. scRNA-seq data are particularly useful to study intratumoral heterogeneity, as they allow identifying “gene programs” consisting of genes that are coexpressed in subsets of cancer cells within individual tumors.
GeneNMF is a computational method aimed at extracting recurrent gene programs in cancer cells across multiple studies. Inspired on previous work on the application of non-negative matrix factorization (NMF) on scRNA-seq data (see e.g. Kotliar et al. (2019), Barkely et al. (2022), Gavish et al. (2023)), GeneNMF decomposes the expression matrix of individual datasets into a few, interpratable dimensions, and subsequently calculates “meta-programs” (or MP) defined as consensus gene sets that are found to be expressed across multiple samples and initializations of the method.
In this demo, we apply GeneNMF on 11 samples of basal cell carcinoma (BCC) malignant cells from two different studies (Ganier et al. (2024) and Yerly et al. (2022)).
Set up the environment
Here are some packages you’ll need for this demo:
BCC dataset
Download the test dataset for this demo from Figshare:
do_download <- FALSE
ddir <- "input"
data.path <- sprintf("%s/Tumor_combination_LY_CG.rds", ddir)
if (do_download) {
dir.create(ddir)
options(timeout = 3000)
download.file("https://figshare.com/ndownloader/files/47742634", destfile = data.path)
}
seu <- readRDS(data.path)
As with most cancer cell datasets, we observe large batch effects between patients.
With this level of batch effects, it is not trivial to analyse multiple samples together. Typical scRNA-seq analysis pipelines call for batch-effect correction methods, which aim at reducing technical variability while preserving real, biological differences (see e.g. Luecken et al. (2022)). However, cancer cells DO indeed have unique transcriptional phenotypes in different patients, making technical and biological variability indistinguishable. Extracting gene programs by NMF in individual patients is attractive because it bypasses the need for batch effect correction - we rather integrate multiple samples at the level of their gene program activities.
Consistent NMF programs across multiple samples
Identification of robust gene programs requires their detection
across samples and variability of input parameters. Perhaps the most
crucial parameter to NMF is the dimensionality k
, which
corresponds to the number of programs of the low-dimensional matrix. To
determine robust programs, we can run NMF over multiple numbers of
k
and determine programs that are consistenly found across
these runs. The multiNMF()
function automatically performs
NMF over a list of samples and for multiple values of
k
:
DefaultAssay(seu) <- "RNA"
seu.list <- SplitObject(seu, split.by = "Sample")
geneNMF.programs <- multiNMF(seu.list, assay="RNA", k=4:9, min.exp = 0.05)
## Warning: The `slot` argument of `GetAssayData()` is deprecated as of SeuratObject 5.0.0.
## ℹ Please use the `layer` argument instead.
## ℹ The deprecated feature was likely used in the Seurat package.
## Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
We can now combine the gene programs identified over multiple samples
and numbers of k
into metaprograms (MPs),
i.e. consensus programs that are robustly identified across NMF runs.
Here we will define 10 MPs:
geneNMF.metaprograms <- getMetaPrograms(geneNMF.programs,
metric = "cosine",
specificity.weight = 5,
weight.explained = 0.5,
nMP=10)
It can be useful to visualize pairwise similarity (in terms of cosine
similarity or Jaccard index) between individual gene programs that
compose meta-programs. We can see “blocks” corresponding to gene
programs of high similarity across datasets and values of
k
. We can then cut the similarity tree at a given height to
find blocks of similar programs and derive consensus gene signatures for
each block. For example, here we cut the tree to the height
corresponding to 10 clusters of programs (i.e. 10 MPs):
We can also inspect useful statistics, such as the “sample coverage” (in what fraction of samples the MP was detected); or the silhouette coefficient (how similar are individual programs in a MP relative to programs in other MPs - the higher the better).
## sampleCoverage silhouette meanSimilarity numberGenes numberPrograms
## MP1 1.0000000 0.696882673 0.723 23 50
## MP2 1.0000000 0.297392424 0.433 35 60
## MP3 0.9090909 0.129408777 0.295 200 60
## MP4 0.9090909 -0.073756249 0.161 163 109
## MP5 0.8181818 0.509947198 0.624 25 45
## MP6 0.5454545 0.459199536 0.560 16 20
## MP7 0.5454545 0.448717232 0.543 27 17
## MP8 0.5454545 0.008033304 0.233 73 44
## MP9 0.3636364 0.549435393 0.617 53 12
## MP10 0.2727273 0.530347451 0.599 45 12
Based on these metrics, one may decide to drop some of the programs, e.g. if they are specific for few samples only (low sample coverage), or have bad internal consistency (low silhouette and meanSimilarity). For example, one may decide to drop MP4 because of its negative silhouette coefficient:
geneNMF.metaprograms.filtered <- dropMetaPrograms(geneNMF.metaprograms, dropMP = "MP4")
ph <- plotMetaPrograms(geneNMF.metaprograms.filtered,
similarity.cutoff = c(0.1,1))
ph
Some important parameters
We can also be more strict when we extract MPs from the matrix of
individual programs. For example, we can increase the
min.confidence
parameter to 0.7 to focus on genes
consistently found across at least 70% of the individual programs.
geneNMF.metaprograms <- getMetaPrograms(geneNMF.programs,
metric = "cosine",
weight.explained = 0.5,
nMP=10,
min.confidence = 0.7)
We see the one on the MPs has now been dropped (we could have also picked it up by its poor metrics just above).
## sampleCoverage silhouette meanSimilarity numberGenes numberPrograms
## MP1 1.0000000 0.696882673 0.723 23 50
## MP2 1.0000000 0.297392424 0.433 32 60
## MP3 0.9090909 0.129408777 0.295 137 60
## MP4 0.9090909 -0.073756249 0.161 29 109
## MP5 0.8181818 0.509947198 0.624 25 45
## MP6 0.5454545 0.459199536 0.560 13 20
## MP7 0.5454545 0.448717232 0.543 22 17
## MP8 0.5454545 0.008033304 0.233 40 44
## MP9 0.3636364 0.549435393 0.617 44 12
## MP10 0.2727273 0.530347451 0.599 36 12
We see that most MPs are composed of around 10-20 genes (column
numberGenes
in the table above). We can control the size of
the MPs using the weight.explained
parameter. For example,
we can increase weight.explained
to 0.8 to include all
genes that allow to explain 80% of loadings in the consensus gene
set:
geneNMF.metaprograms <- getMetaPrograms(geneNMF.programs,
metric = "cosine",
weight.explained = 0.8,
nMP=10,
min.confidence = 0.7)
Now we have larger meta-program gene sets:
## sampleCoverage silhouette meanSimilarity numberGenes numberPrograms
## MP1 1.0000000 0.696882673 0.723 71 50
## MP2 1.0000000 0.297392424 0.433 70 60
## MP3 0.9090909 0.129408777 0.295 146 60
## MP4 0.9090909 -0.073756249 0.161 29 109
## MP5 0.8181818 0.509947198 0.624 70 45
## MP6 0.5454545 0.459199536 0.560 39 20
## MP7 0.5454545 0.448717232 0.543 76 17
## MP8 0.5454545 0.008033304 0.233 56 44
## MP9 0.3636364 0.549435393 0.617 144 12
## MP10 0.2727273 0.530347451 0.599 112 12
Other important parameters to getMetaPrograms()
are
nMP
(the target number of meta-programs) and
specificity.weight
, which controls how much MP signatures
are allowed to overlap. From the default
specificity.weight=5
, lower values (e.g. =3
)
allows more overlapping MPs, higher values (e.g. =8
)
enforces higher sparsity in the decomposition.
Interpretation of gene programs
What are the genes driving each meta-program?
## $MP1
## [1] "TOP2A" "CENPF" "NUSAP1" "PCLAF" "MAD2L1" "MKI67"
##
## $MP2
## [1] "KRT6B" "CALML5" "S100A7" "DMKN" "LGALS7B" "KRT1"
##
## $MP3
## [1] "NIN" "FAT3" "NAV2" "PEAK1" "SLC7A2" "LAMB1"
##
## $MP4
## [1] "TAGLN" "JUNB" "GADD45B" "BGN" "UCP2" "JUN"
##
## $MP5
## [1] "FOS" "FOSB" "EGR1" "ZFP36" "DNAJB1" "NR4A1"
##
## $MP6
## [1] "ACTA2" "TAGLN" "MYLK" "MYL9" "CSRP1" "ANXA3"
##
## $MP7
## [1] "S100A9" "S100A8" "CHI3L1" "TIMP1" "CCL2" "S100A7"
##
## $MP8
## [1] "ATF3" "SOCS3" "IRF1" "ZFP36" "BTG2" "EGR2"
##
## $MP9
## [1] "CRCT1" "CRNN" "RBP2" "LINC00964" "C1orf115" "FOS"
##
## $MP10
## [1] "SFRP1" "ALK" "COL14A1" "DCN" "MEGF6" "CHGA"
We can also inspect their relative weights to see how much each gene contributes to the MP. For example, for MP7:
## S100A9 S100A8 CHI3L1 TIMP1 CCL2 S100A7
## 0.1956438276 0.1494639996 0.0925062002 0.0811392938 0.0625631173 0.0354404096
## TAGLN TNC MT2A KRT6A KRT6B INHBA
## 0.0293119026 0.0212852940 0.0203479861 0.0195110305 0.0165492999 0.0160587201
## LIMA1 S100A2 PDLIM4 PLAU MYL9 CSRP1
## 0.0154813849 0.0123559650 0.0120441654 0.0099147117 0.0086837053 0.0080305866
## IFITM3 MT1X DNAJB1 ZFP36 SOCS3 FOS
## 0.0077285613 0.0074213950 0.0073980531 0.0073072171 0.0068466314 0.0066927376
## KRT75 LGALS7B PRSS22 ITGB6 JUNB MGP
## 0.0062403119 0.0058686473 0.0057165129 0.0053529563 0.0052693373 0.0047747759
## ENO1 IER3 NFKBIA CXCL14 TYMP ANXA3
## 0.0042876971 0.0041198657 0.0041007723 0.0040266645 0.0039501751 0.0038858211
## S100A6 RELB KLF10 TMEM45A C1R DNAJC1
## 0.0038650866 0.0038127494 0.0037328041 0.0036647649 0.0036596337 0.0034270734
## OSMR SULF2 GADD45A DUSP5 SOD2 HSPA1B
## 0.0034090916 0.0033063280 0.0030446309 0.0030363674 0.0029048631 0.0027667004
## S100A10 SGK1 EMP3 CSTB EGLN3 LDHA
## 0.0026511389 0.0024787175 0.0023897651 0.0023690469 0.0023379242 0.0021832317
## LGALS3 ZC3H12A HSPH1 KLF6 VMP1 APOE
## 0.0021797093 0.0020467788 0.0020243595 0.0019551576 0.0018936775 0.0018799266
## C12orf75 ZYX GJB3 RBP1 CKS1B IRF1
## 0.0018569922 0.0018449419 0.0018404853 0.0018040045 0.0017798540 0.0017500961
## SMTN SKA2 BHLHE40 DUSP2 GAPDH TRIB1
## 0.0017119770 0.0016906140 0.0016484753 0.0016477071 0.0016358230 0.0014946950
## SLC2A3 DST SFN MACF1
## 0.0014638653 0.0014510146 0.0010590628 0.0009811312
To aid the interpretation of gene programs, we can compare them to
known signatures from public databases. The runGSEA()
function can be useful to scan MSigDB and evaluate the overlap of
detected gene programs with signatures in the databases. Here we compare
to the “C5” category, biological process subcategory; but other classes
such as “H” (hallmark gene sets) or “C8” (cell type) may be more
relevant in other contexts.
top_p <- lapply(geneNMF.metaprograms$metaprograms.genes, function(program) {
runGSEA(program, universe=rownames(seu), category = "C5", subcategory = "GO:BP")
})
For example, MP1 appears to be associated with cell cycling / cell division.
## pathway pval padj foldEnrichment
## <char> <num> <num> <num>
## 1: GOBP_CELL_CYCLE_PROCESS 1.514700e-41 1.151778e-37 12.373055
## 2: GOBP_CELL_CYCLE 5.845609e-38 2.222500e-34 9.792141
## 3: GOBP_MITOTIC_CELL_CYCLE 3.630314e-34 9.201636e-31 14.037779
## 4: GOBP_CHROMOSOME_ORGANIZATION 2.536133e-33 4.821189e-30 17.626838
## 5: GOBP_MITOTIC_CELL_CYCLE_PROCESS 1.607658e-32 2.444926e-29 15.479497
## 6: GOBP_CHROMOSOME_SEGREGATION 2.546006e-29 3.226639e-26 21.649985
## overlap size overlapGenes
## <int> <int> <list>
## 1: 46 1261 ASPM, AT....
## 2: 47 1628 ASPM, AT....
## 3: 37 894 ATAD5, B....
## 4: 33 635 BRCA2, C....
## 5: 34 745 ATAD5, B....
## 6: 27 423 ASPM, BR....
A word of caution: GSEA can be very useful to gain a broad idea of MP function. However, gene sets in MSigDB are often very large (>1000 genes), noisy sets derived from specific conditions, so results should be interpreted with caution.
Signature scores for gene programs
Now that we derived gene signatures for our MPs, we can use them to
obtain gene sets scores to place each cell on the axis of each MP. We
can apply the UCell
method to derive MP scores between 0
and 1 for each cell.
mp.genes <- geneNMF.metaprograms$metaprograms.genes
seu <- AddModuleScore_UCell(seu, features = mp.genes, ncores=4, name = "")
We can explore whether certain MPs are enriched in individual samples:
## Warning: The `slot` argument of `FetchData()` is deprecated as of SeuratObject 5.0.0.
## ℹ Please use the `layer` argument instead.
## ℹ The deprecated feature was likely used in the Seurat package.
## Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `PackageCheck()` was deprecated in SeuratObject 5.0.0.
## ℹ Please use `rlang::check_installed()` instead.
## ℹ The deprecated feature was likely used in the Seurat package.
## Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Some MPs appear to be equally active across multiple samples; others
such as MP4 or MP7 have different distributions in different samples. If
clinical data are available for patients, one may use this information
to study associations between MPs and relevant biological variables.
Signature scores to define integrated spaces
Individual cells can now be represented in terms of their gene program scores. Importantly, here the gene programs were learned as a consensus of gene programs found across multiple samples – as opposed to calculating NMF once on the whole dataset. This can be an effective strategy to mitigate batch effects, as meta-programs (MPs) are a consensus of gene programs consistently found across individual samples. Let’s store these coordinates in the Seurat object:
matrix <- seu@meta.data[,names(mp.genes)]
#dimred <- scale(matrix)
dimred <- as.matrix(matrix)
colnames(dimred) <- paste0("MP_",seq(1, ncol(dimred)))
#New dim reduction
seu@reductions[["MPsignatures"]] <- new("DimReduc",
cell.embeddings = dimred,
assay.used = "RNA",
key = "MP_",
global = FALSE)
We can also use these scores to generate a UMAP representation and visualize the data in 2D
set.seed(123)
seu <- RunUMAP(seu, reduction="MPsignatures", dims=1:length(seu@reductions[["MPsignatures"]]),
metric = "euclidean", reduction.name = "umap_MP")
We have now integrated the data based on their MP signature scores!
How do the signature scores for the meta-programs look like in the combined space?
library(viridis)
FeaturePlot(seu, features = names(mp.genes), reduction = "umap_MP", ncol=4) &
scale_color_viridis(option="B") &
theme(aspect.ratio = 1, axis.text=element_blank(), axis.ticks=element_blank())
Final remarks
NMF can be a powerful tool to extract gene programs for scRNA-seq data in an unbiased manner. Because it is calculated for each sample separately, it bypasses the need to perform batch effect correction to analyse samples jointly.
References
- GeneNMF package - LINK
- We use the ultra-fast RcppML method for NMF - LINK
- Analysis of BCC using GeneNMF - Yerly, Andreatta et al. (2024)
See also
- Decomposing PBMC scRNA-seq data using NMF - DEMO
- More demos are available at the GeneNMF GitHub repo. Questions? open an issue on GitHub.
Session Info
## R version 4.5.0 (2025-04-11)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/Zurich
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] viridis_0.6.5 viridisLite_0.4.2 fgsea_1.34.0 msigdbr_24.1.0
## [5] GeneNMF_0.9.0 RColorBrewer_1.1-3 dplyr_1.1.4 tidyr_1.3.1
## [9] patchwork_1.3.1 UCell_2.12.0 ggplot2_3.5.2 Seurat_5.3.0
## [13] SeuratObject_5.1.0 sp_2.2-0 renv_1.1.4
##
## loaded via a namespace (and not attached):
## [1] rstudioapi_0.17.1 jsonlite_2.0.0
## [3] magrittr_2.0.3 spatstat.utils_3.1-4
## [5] farver_2.1.2 rmarkdown_2.29
## [7] vctrs_0.6.5 ROCR_1.0-11
## [9] spatstat.explore_3.4-3 S4Arrays_1.8.1
## [11] htmltools_0.5.8.1 curl_6.4.0
## [13] BiocNeighbors_2.2.0 SparseArray_1.8.0
## [15] sass_0.4.10 sctransform_0.4.2
## [17] parallelly_1.45.0 KernSmooth_2.23-26
## [19] bslib_0.9.0 htmlwidgets_1.6.4
## [21] ica_1.0-3 plyr_1.8.9
## [23] plotly_4.11.0 zoo_1.8-14
## [25] cachem_1.1.0 igraph_2.1.4
## [27] mime_0.13 lifecycle_1.0.4
## [29] pkgconfig_2.0.3 Matrix_1.7-3
## [31] R6_2.6.1 fastmap_1.2.0
## [33] GenomeInfoDbData_1.2.14 MatrixGenerics_1.20.0
## [35] fitdistrplus_1.2-3 future_1.58.0
## [37] shiny_1.11.1 digest_0.6.37
## [39] S4Vectors_0.46.0 tensor_1.5.1
## [41] RSpectra_0.16-2 irlba_2.3.5.1
## [43] SnowballC_0.7.1 GenomicRanges_1.60.0
## [45] labeling_0.4.3 progressr_0.15.1
## [47] spatstat.sparse_3.1-0 httr_1.4.7
## [49] polyclip_1.10-7 abind_1.4-8
## [51] compiler_4.5.0 withr_3.0.2
## [53] BiocParallel_1.42.1 fastDummies_1.7.5
## [55] dendextend_1.19.0 MASS_7.3-65
## [57] DelayedArray_0.34.1 tools_4.5.0
## [59] lmtest_0.9-40 httpuv_1.6.16
## [61] future.apply_1.20.0 goftest_1.2-3
## [63] glue_1.8.0 nlme_3.1-168
## [65] promises_1.3.3 grid_4.5.0
## [67] Rtsne_0.17 cluster_2.1.8.1
## [69] reshape2_1.4.4 generics_0.1.4
## [71] gtable_0.3.6 spatstat.data_3.1-6
## [73] rmdformats_1.0.4 RcppML_0.3.7
## [75] data.table_1.17.6 XVector_0.48.0
## [77] BiocGenerics_0.54.0 spatstat.geom_3.4-1
## [79] RcppAnnoy_0.0.22 ggrepel_0.9.6
## [81] RANN_2.6.2 pillar_1.10.2
## [83] stringr_1.5.1 babelgene_22.9
## [85] spam_2.11-1 RcppHNSW_0.6.0
## [87] later_1.4.2 splines_4.5.0
## [89] lattice_0.22-6 survival_3.8-3
## [91] deldir_2.0-4 tidyselect_1.2.1
## [93] SingleCellExperiment_1.30.1 miniUI_0.1.2
## [95] pbapply_1.7-2 knitr_1.50
## [97] gridExtra_2.3 bookdown_0.43
## [99] IRanges_2.42.0 SummarizedExperiment_1.38.1
## [101] scattermore_1.2 stats4_4.5.0
## [103] xfun_0.52 Biobase_2.68.0
## [105] matrixStats_1.5.0 pheatmap_1.0.13
## [107] UCSC.utils_1.4.0 stringi_1.8.7
## [109] lazyeval_0.2.2 yaml_2.3.10
## [111] evaluate_1.0.4 codetools_0.2-20
## [113] lsa_0.73.3 tibble_3.3.0
## [115] BiocManager_1.30.26 cli_3.6.5
## [117] uwot_0.2.3 xtable_1.8-4
## [119] reticulate_1.42.0 jquerylib_0.1.4
## [121] GenomeInfoDb_1.44.0 Rcpp_1.1.0
## [123] globals_0.18.0 spatstat.random_3.4-1
## [125] png_0.1-8 spatstat.univar_3.1-3
## [127] parallel_4.5.0 assertthat_0.2.1
## [129] dotCall64_1.2 listenv_0.9.1
## [131] scales_1.4.0 ggridges_0.5.6
## [133] crayon_1.5.3 purrr_1.0.4
## [135] rlang_1.1.6 fastmatch_1.1-6
## [137] cowplot_1.1.3