Characterizing intratumor heterogeneity with GeneNMF

Cancer cells are heterogeneous, expressing a variety of gene programs. Understanding the main axes of transcriptional variability in cancer cells can help shed light on the mechanisms of cancer progression and on the reasons for success/failure of cancer treatments. scRNA-seq data are particularly useful to study intratumoral heterogeneity, as they allow identifying “gene programs” consisting of genes that are coexpressed in subsets of cancer cells within individual tumors.

GeneNMF is a computational method aimed at extracting recurrent gene programs in cancer cells across multiple studies. Inspired on previous work on the application of non-negative matrix factorization (NMF) on scRNA-seq data (see e.g. Kotliar et al. (2019), Barkely et al. (2022), Gavish et al. (2023)), GeneNMF decomposes the expression matrix of individual datasets into a few, interpratable dimensions, and subsequently calculates “meta-programs” (or MP) defined as consensus gene sets that are found to be expressed across multiple samples and initializations of the method.

In this demo, we apply GeneNMF on 11 samples of basal cell carcinoma (BCC) malignant cells from two different studies (Ganier et al. (2024) and Yerly et al. (2022)).

Set up the environment

Here are some packages you’ll need for this demo:

library(Seurat)
library(ggplot2)
library(UCell)
library(patchwork)
library(tidyr)
library(dplyr)
library(RColorBrewer)

#install.packages("GeneNMF") #from CRAN 
#remotes::install_github("carmonalab/GeneNMF")  # or from GitHub
library(GeneNMF)

BCC dataset

Download the test dataset for this demo from Figshare:

do_download <- FALSE

ddir <- "input"
data.path <- sprintf("%s/Tumor_combination_LY_CG.rds", ddir)

if (do_download) {
  dir.create(ddir)
  options(timeout = 3000)
  download.file("https://figshare.com/ndownloader/files/47742634", destfile = data.path)
}

seu <- readRDS(data.path)

As with most cancer cell datasets, we observe large batch effects between patients.

DimPlot(seu, group.by="patient_bcc") + theme(aspect.ratio = 1)

With this level of batch effects, it is not trivial to analyse multiple samples together. Typical scRNA-seq analysis pipelines call for batch-effect correction methods, which aim at reducing technical variability while preserving real, biological differences (see e.g. Luecken et al. (2022)). However, cancer cells DO indeed have unique transcriptional phenotypes in different patients, making technical and biological variability indistinguishable. Extracting gene programs by NMF in individual patients is attractive because it bypasses the need for batch effect correction - we rather integrate multiple samples at the level of their gene program activities.

Consistent NMF programs across multiple samples

Identification of robust gene programs requires their detection across samples and variability of input parameters. Perhaps the most crucial parameter to NMF is the dimensionality k, which corresponds to the number of programs of the low-dimensional matrix. To determine robust programs, we can run NMF over multiple numbers of k and determine programs that are consistenly found across these runs. The multiNMF() function automatically performs NMF over a list of samples and for multiple values of k:

DefaultAssay(seu) <- "RNA"
seu.list <- SplitObject(seu, split.by = "Sample")

geneNMF.programs <- multiNMF(seu.list, assay="RNA", k=4:9, min.exp = 0.05)
## Warning: The `slot` argument of `GetAssayData()` is deprecated as of SeuratObject 5.0.0.
## ℹ Please use the `layer` argument instead.
## ℹ The deprecated feature was likely used in the Seurat package.
##   Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

We can now combine the gene programs identified over multiple samples and numbers of k into metaprograms (MPs), i.e. consensus programs that are robustly identified across NMF runs. Here we will define 10 MPs:

geneNMF.metaprograms <- getMetaPrograms(geneNMF.programs,
                                        metric = "cosine",
                                        specificity.weight = 5,
                                        weight.explained = 0.5,
                                        nMP=10)

It can be useful to visualize pairwise similarity (in terms of cosine similarity or Jaccard index) between individual gene programs that compose meta-programs. We can see “blocks” corresponding to gene programs of high similarity across datasets and values of k. We can then cut the similarity tree at a given height to find blocks of similar programs and derive consensus gene signatures for each block. For example, here we cut the tree to the height corresponding to 10 clusters of programs (i.e. 10 MPs):

ph <- plotMetaPrograms(geneNMF.metaprograms,
                       similarity.cutoff = c(0.1,1))
ph

We can also inspect useful statistics, such as the “sample coverage” (in what fraction of samples the MP was detected); or the silhouette coefficient (how similar are individual programs in a MP relative to programs in other MPs - the higher the better).

geneNMF.metaprograms$metaprograms.metrics
##      sampleCoverage   silhouette meanSimilarity numberGenes numberPrograms
## MP1       1.0000000  0.696882673          0.723          23             50
## MP2       1.0000000  0.297392424          0.433          35             60
## MP3       0.9090909  0.129408777          0.295         200             60
## MP4       0.9090909 -0.073756249          0.161         163            109
## MP5       0.8181818  0.509947198          0.624          25             45
## MP6       0.5454545  0.459199536          0.560          16             20
## MP7       0.5454545  0.448717232          0.543          27             17
## MP8       0.5454545  0.008033304          0.233          73             44
## MP9       0.3636364  0.549435393          0.617          53             12
## MP10      0.2727273  0.530347451          0.599          45             12

Based on these metrics, one may decide to drop some of the programs, e.g. if they are specific for few samples only (low sample coverage), or have bad internal consistency (low silhouette and meanSimilarity). For example, one may decide to drop MP4 because of its negative silhouette coefficient:

geneNMF.metaprograms.filtered <- dropMetaPrograms(geneNMF.metaprograms, dropMP = "MP4")

ph <- plotMetaPrograms(geneNMF.metaprograms.filtered,
                       similarity.cutoff = c(0.1,1))
ph

Some important parameters

We can also be more strict when we extract MPs from the matrix of individual programs. For example, we can increase the min.confidence parameter to 0.7 to focus on genes consistently found across at least 70% of the individual programs.

geneNMF.metaprograms <- getMetaPrograms(geneNMF.programs,
                                        metric = "cosine",
                                        weight.explained = 0.5,
                                        nMP=10,
                                        min.confidence = 0.7)

We see the one on the MPs has now been dropped (we could have also picked it up by its poor metrics just above).

ph <- plotMetaPrograms(geneNMF.metaprograms,
                       similarity.cutoff = c(0.1,1))
ph

geneNMF.metaprograms$metaprograms.metrics
##      sampleCoverage   silhouette meanSimilarity numberGenes numberPrograms
## MP1       1.0000000  0.696882673          0.723          23             50
## MP2       1.0000000  0.297392424          0.433          32             60
## MP3       0.9090909  0.129408777          0.295         137             60
## MP4       0.9090909 -0.073756249          0.161          29            109
## MP5       0.8181818  0.509947198          0.624          25             45
## MP6       0.5454545  0.459199536          0.560          13             20
## MP7       0.5454545  0.448717232          0.543          22             17
## MP8       0.5454545  0.008033304          0.233          40             44
## MP9       0.3636364  0.549435393          0.617          44             12
## MP10      0.2727273  0.530347451          0.599          36             12

We see that most MPs are composed of around 10-20 genes (column numberGenes in the table above). We can control the size of the MPs using the weight.explained parameter. For example, we can increase weight.explained to 0.8 to include all genes that allow to explain 80% of loadings in the consensus gene set:

geneNMF.metaprograms <- getMetaPrograms(geneNMF.programs,
                                        metric = "cosine",
                                        weight.explained = 0.8,
                                        nMP=10,
                                        min.confidence = 0.7)

Now we have larger meta-program gene sets:

geneNMF.metaprograms$metaprograms.metrics
##      sampleCoverage   silhouette meanSimilarity numberGenes numberPrograms
## MP1       1.0000000  0.696882673          0.723          71             50
## MP2       1.0000000  0.297392424          0.433          70             60
## MP3       0.9090909  0.129408777          0.295         146             60
## MP4       0.9090909 -0.073756249          0.161          29            109
## MP5       0.8181818  0.509947198          0.624          70             45
## MP6       0.5454545  0.459199536          0.560          39             20
## MP7       0.5454545  0.448717232          0.543          76             17
## MP8       0.5454545  0.008033304          0.233          56             44
## MP9       0.3636364  0.549435393          0.617         144             12
## MP10      0.2727273  0.530347451          0.599         112             12

Other important parameters to getMetaPrograms() are nMP (the target number of meta-programs) and specificity.weight, which controls how much MP signatures are allowed to overlap. From the default specificity.weight=5, lower values (e.g. =3) allows more overlapping MPs, higher values (e.g. =8) enforces higher sparsity in the decomposition.

Interpretation of gene programs

What are the genes driving each meta-program?

lapply(geneNMF.metaprograms$metaprograms.genes, head)
## $MP1
## [1] "TOP2A"  "CENPF"  "NUSAP1" "PCLAF"  "MAD2L1" "MKI67" 
## 
## $MP2
## [1] "KRT6B"   "CALML5"  "S100A7"  "DMKN"    "LGALS7B" "KRT1"   
## 
## $MP3
## [1] "NIN"    "FAT3"   "NAV2"   "PEAK1"  "SLC7A2" "LAMB1" 
## 
## $MP4
## [1] "TAGLN"   "JUNB"    "GADD45B" "BGN"     "UCP2"    "JUN"    
## 
## $MP5
## [1] "FOS"    "FOSB"   "EGR1"   "ZFP36"  "DNAJB1" "NR4A1" 
## 
## $MP6
## [1] "ACTA2" "TAGLN" "MYLK"  "MYL9"  "CSRP1" "ANXA3"
## 
## $MP7
## [1] "S100A9" "S100A8" "CHI3L1" "TIMP1"  "CCL2"   "S100A7"
## 
## $MP8
## [1] "ATF3"  "SOCS3" "IRF1"  "ZFP36" "BTG2"  "EGR2" 
## 
## $MP9
## [1] "CRCT1"     "CRNN"      "RBP2"      "LINC00964" "C1orf115"  "FOS"      
## 
## $MP10
## [1] "SFRP1"   "ALK"     "COL14A1" "DCN"     "MEGF6"   "CHGA"

We can also inspect their relative weights to see how much each gene contributes to the MP. For example, for MP7:

geneNMF.metaprograms$metaprograms.genes.weights$MP7
##       S100A9       S100A8       CHI3L1        TIMP1         CCL2       S100A7 
## 0.1956438276 0.1494639996 0.0925062002 0.0811392938 0.0625631173 0.0354404096 
##        TAGLN          TNC         MT2A        KRT6A        KRT6B        INHBA 
## 0.0293119026 0.0212852940 0.0203479861 0.0195110305 0.0165492999 0.0160587201 
##        LIMA1       S100A2       PDLIM4         PLAU         MYL9        CSRP1 
## 0.0154813849 0.0123559650 0.0120441654 0.0099147117 0.0086837053 0.0080305866 
##       IFITM3         MT1X       DNAJB1        ZFP36        SOCS3          FOS 
## 0.0077285613 0.0074213950 0.0073980531 0.0073072171 0.0068466314 0.0066927376 
##        KRT75      LGALS7B       PRSS22        ITGB6         JUNB          MGP 
## 0.0062403119 0.0058686473 0.0057165129 0.0053529563 0.0052693373 0.0047747759 
##         ENO1         IER3       NFKBIA       CXCL14         TYMP        ANXA3 
## 0.0042876971 0.0041198657 0.0041007723 0.0040266645 0.0039501751 0.0038858211 
##       S100A6         RELB        KLF10      TMEM45A          C1R       DNAJC1 
## 0.0038650866 0.0038127494 0.0037328041 0.0036647649 0.0036596337 0.0034270734 
##         OSMR        SULF2      GADD45A        DUSP5         SOD2       HSPA1B 
## 0.0034090916 0.0033063280 0.0030446309 0.0030363674 0.0029048631 0.0027667004 
##      S100A10         SGK1         EMP3         CSTB        EGLN3         LDHA 
## 0.0026511389 0.0024787175 0.0023897651 0.0023690469 0.0023379242 0.0021832317 
##       LGALS3      ZC3H12A        HSPH1         KLF6         VMP1         APOE 
## 0.0021797093 0.0020467788 0.0020243595 0.0019551576 0.0018936775 0.0018799266 
##     C12orf75          ZYX         GJB3         RBP1        CKS1B         IRF1 
## 0.0018569922 0.0018449419 0.0018404853 0.0018040045 0.0017798540 0.0017500961 
##         SMTN         SKA2      BHLHE40        DUSP2        GAPDH        TRIB1 
## 0.0017119770 0.0016906140 0.0016484753 0.0016477071 0.0016358230 0.0014946950 
##       SLC2A3          DST          SFN        MACF1 
## 0.0014638653 0.0014510146 0.0010590628 0.0009811312

To aid the interpretation of gene programs, we can compare them to known signatures from public databases. The runGSEA() function can be useful to scan MSigDB and evaluate the overlap of detected gene programs with signatures in the databases. Here we compare to the “C5” category, biological process subcategory; but other classes such as “H” (hallmark gene sets) or “C8” (cell type) may be more relevant in other contexts.

library(msigdbr)
library(fgsea)
top_p <- lapply(geneNMF.metaprograms$metaprograms.genes, function(program) {
  runGSEA(program, universe=rownames(seu), category = "C5", subcategory = "GO:BP")
})

For example, MP1 appears to be associated with cell cycling / cell division.

head(top_p$MP1)
##                            pathway         pval         padj foldEnrichment
##                             <char>        <num>        <num>          <num>
## 1:         GOBP_CELL_CYCLE_PROCESS 1.514700e-41 1.151778e-37      12.373055
## 2:                 GOBP_CELL_CYCLE 5.845609e-38 2.222500e-34       9.792141
## 3:         GOBP_MITOTIC_CELL_CYCLE 3.630314e-34 9.201636e-31      14.037779
## 4:    GOBP_CHROMOSOME_ORGANIZATION 2.536133e-33 4.821189e-30      17.626838
## 5: GOBP_MITOTIC_CELL_CYCLE_PROCESS 1.607658e-32 2.444926e-29      15.479497
## 6:     GOBP_CHROMOSOME_SEGREGATION 2.546006e-29 3.226639e-26      21.649985
##    overlap  size overlapGenes
##      <int> <int>       <list>
## 1:      46  1261 ASPM, AT....
## 2:      47  1628 ASPM, AT....
## 3:      37   894 ATAD5, B....
## 4:      33   635 BRCA2, C....
## 5:      34   745 ATAD5, B....
## 6:      27   423 ASPM, BR....

A word of caution: GSEA can be very useful to gain a broad idea of MP function. However, gene sets in MSigDB are often very large (>1000 genes), noisy sets derived from specific conditions, so results should be interpreted with caution.

Signature scores for gene programs

Now that we derived gene signatures for our MPs, we can use them to obtain gene sets scores to place each cell on the axis of each MP. We can apply the UCell method to derive MP scores between 0 and 1 for each cell.

mp.genes <- geneNMF.metaprograms$metaprograms.genes
seu <- AddModuleScore_UCell(seu, features = mp.genes, ncores=4, name = "")

We can explore whether certain MPs are enriched in individual samples:

VlnPlot(seu, features=names(mp.genes), group.by = "patient_bcc",
        pt.size = 0, ncol=5)
## Warning: The `slot` argument of `FetchData()` is deprecated as of SeuratObject 5.0.0.
## ℹ Please use the `layer` argument instead.
## ℹ The deprecated feature was likely used in the Seurat package.
##   Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `PackageCheck()` was deprecated in SeuratObject 5.0.0.
## ℹ Please use `rlang::check_installed()` instead.
## ℹ The deprecated feature was likely used in the Seurat package.
##   Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Some MPs appear to be equally active across multiple samples; others such as MP4 or MP7 have different distributions in different samples. If clinical data are available for patients, one may use this information to study associations between MPs and relevant biological variables.

Signature scores to define integrated spaces

Individual cells can now be represented in terms of their gene program scores. Importantly, here the gene programs were learned as a consensus of gene programs found across multiple samples – as opposed to calculating NMF once on the whole dataset. This can be an effective strategy to mitigate batch effects, as meta-programs (MPs) are a consensus of gene programs consistently found across individual samples. Let’s store these coordinates in the Seurat object:

matrix <- seu@meta.data[,names(mp.genes)]

#dimred <- scale(matrix)
dimred <- as.matrix(matrix)

colnames(dimred) <- paste0("MP_",seq(1, ncol(dimred)))
#New dim reduction
seu@reductions[["MPsignatures"]] <- new("DimReduc",
                                         cell.embeddings = dimred,
                                         assay.used = "RNA",
                                         key = "MP_",
                                         global = FALSE)

We can also use these scores to generate a UMAP representation and visualize the data in 2D

set.seed(123)
seu <- RunUMAP(seu, reduction="MPsignatures", dims=1:length(seu@reductions[["MPsignatures"]]),
               metric = "euclidean", reduction.name = "umap_MP")

We have now integrated the data based on their MP signature scores!

DimPlot(seu, reduction = "umap_MP", group.by = "patient_bcc") + theme(aspect.ratio = 1)

How do the signature scores for the meta-programs look like in the combined space?

library(viridis)
FeaturePlot(seu, features = names(mp.genes), reduction = "umap_MP", ncol=4) &
  scale_color_viridis(option="B") &
   theme(aspect.ratio = 1, axis.text=element_blank(), axis.ticks=element_blank())

Final remarks

NMF can be a powerful tool to extract gene programs for scRNA-seq data in an unbiased manner. Because it is calculated for each sample separately, it bypasses the need to perform batch effect correction to analyse samples jointly.

References

See also

Session Info

sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Zurich
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] viridis_0.6.5      viridisLite_0.4.2  fgsea_1.34.0       msigdbr_24.1.0    
##  [5] GeneNMF_0.9.0      RColorBrewer_1.1-3 dplyr_1.1.4        tidyr_1.3.1       
##  [9] patchwork_1.3.1    UCell_2.12.0       ggplot2_3.5.2      Seurat_5.3.0      
## [13] SeuratObject_5.1.0 sp_2.2-0           renv_1.1.4        
## 
## loaded via a namespace (and not attached):
##   [1] rstudioapi_0.17.1           jsonlite_2.0.0             
##   [3] magrittr_2.0.3              spatstat.utils_3.1-4       
##   [5] farver_2.1.2                rmarkdown_2.29             
##   [7] vctrs_0.6.5                 ROCR_1.0-11                
##   [9] spatstat.explore_3.4-3      S4Arrays_1.8.1             
##  [11] htmltools_0.5.8.1           curl_6.4.0                 
##  [13] BiocNeighbors_2.2.0         SparseArray_1.8.0          
##  [15] sass_0.4.10                 sctransform_0.4.2          
##  [17] parallelly_1.45.0           KernSmooth_2.23-26         
##  [19] bslib_0.9.0                 htmlwidgets_1.6.4          
##  [21] ica_1.0-3                   plyr_1.8.9                 
##  [23] plotly_4.11.0               zoo_1.8-14                 
##  [25] cachem_1.1.0                igraph_2.1.4               
##  [27] mime_0.13                   lifecycle_1.0.4            
##  [29] pkgconfig_2.0.3             Matrix_1.7-3               
##  [31] R6_2.6.1                    fastmap_1.2.0              
##  [33] GenomeInfoDbData_1.2.14     MatrixGenerics_1.20.0      
##  [35] fitdistrplus_1.2-3          future_1.58.0              
##  [37] shiny_1.11.1                digest_0.6.37              
##  [39] S4Vectors_0.46.0            tensor_1.5.1               
##  [41] RSpectra_0.16-2             irlba_2.3.5.1              
##  [43] SnowballC_0.7.1             GenomicRanges_1.60.0       
##  [45] labeling_0.4.3              progressr_0.15.1           
##  [47] spatstat.sparse_3.1-0       httr_1.4.7                 
##  [49] polyclip_1.10-7             abind_1.4-8                
##  [51] compiler_4.5.0              withr_3.0.2                
##  [53] BiocParallel_1.42.1         fastDummies_1.7.5          
##  [55] dendextend_1.19.0           MASS_7.3-65                
##  [57] DelayedArray_0.34.1         tools_4.5.0                
##  [59] lmtest_0.9-40               httpuv_1.6.16              
##  [61] future.apply_1.20.0         goftest_1.2-3              
##  [63] glue_1.8.0                  nlme_3.1-168               
##  [65] promises_1.3.3              grid_4.5.0                 
##  [67] Rtsne_0.17                  cluster_2.1.8.1            
##  [69] reshape2_1.4.4              generics_0.1.4             
##  [71] gtable_0.3.6                spatstat.data_3.1-6        
##  [73] rmdformats_1.0.4            RcppML_0.3.7               
##  [75] data.table_1.17.6           XVector_0.48.0             
##  [77] BiocGenerics_0.54.0         spatstat.geom_3.4-1        
##  [79] RcppAnnoy_0.0.22            ggrepel_0.9.6              
##  [81] RANN_2.6.2                  pillar_1.10.2              
##  [83] stringr_1.5.1               babelgene_22.9             
##  [85] spam_2.11-1                 RcppHNSW_0.6.0             
##  [87] later_1.4.2                 splines_4.5.0              
##  [89] lattice_0.22-6              survival_3.8-3             
##  [91] deldir_2.0-4                tidyselect_1.2.1           
##  [93] SingleCellExperiment_1.30.1 miniUI_0.1.2               
##  [95] pbapply_1.7-2               knitr_1.50                 
##  [97] gridExtra_2.3               bookdown_0.43              
##  [99] IRanges_2.42.0              SummarizedExperiment_1.38.1
## [101] scattermore_1.2             stats4_4.5.0               
## [103] xfun_0.52                   Biobase_2.68.0             
## [105] matrixStats_1.5.0           pheatmap_1.0.13            
## [107] UCSC.utils_1.4.0            stringi_1.8.7              
## [109] lazyeval_0.2.2              yaml_2.3.10                
## [111] evaluate_1.0.4              codetools_0.2-20           
## [113] lsa_0.73.3                  tibble_3.3.0               
## [115] BiocManager_1.30.26         cli_3.6.5                  
## [117] uwot_0.2.3                  xtable_1.8-4               
## [119] reticulate_1.42.0           jquerylib_0.1.4            
## [121] GenomeInfoDb_1.44.0         Rcpp_1.1.0                 
## [123] globals_0.18.0              spatstat.random_3.4-1      
## [125] png_0.1-8                   spatstat.univar_3.1-3      
## [127] parallel_4.5.0              assertthat_0.2.1           
## [129] dotCall64_1.2               listenv_0.9.1              
## [131] scales_1.4.0                ggridges_0.5.6             
## [133] crayon_1.5.3                purrr_1.0.4                
## [135] rlang_1.1.6                 fastmatch_1.1-6            
## [137] cowplot_1.1.3