Characterizing intratumor heterogeneity with GeneNMF

Cancer cells are heterogeneous, expressing a variety of gene programs. Understanding the main axes of transcriptional variability in cancer cells can help shed light on the mechanisms of cancer progression and on the reasons for success/failure of cancer treatments. scRNA-seq data are particularly useful to study intratumoral heterogeneity, as they allow identifying “gene programs” consisting of genes that are coexpressed in subsets of cancer cells within individual tumors.

GeneNMF is a computational method aimed at extracting recurrent gene programs in cancer cells across multiple studies. Inspired on previous work on the application of non-negative matrix factorization (NMF) on scRNA-seq data (see e.g. Kotliar et al. (2019), Barkely et al. (2022), Gavish et al. (2023)), GeneNMF decomposes the expression matrix of individual datasets into a few, interpratable dimensions, and subsequently calculates “meta-programs” (or MP) defined as consensus gene sets that are found to be expressed across multiple samples and initializations of the method.

In this demo, we apply GeneNMF on 11 samples of basal cell carcinoma (BCC) malignant cells from two different studies (Ganier et al. (2024) and Yerly et al. (2022)).

Set up the environment

Here are some packages you’ll need for this demo:

library(Seurat)
library(ggplot2)
library(UCell)
library(patchwork)
library(tidyr)
library(dplyr)
library(RColorBrewer)
library(viridis)

#install.packages("GeneNMF") #from CRAN 
#remotes::install_github("carmonalab/GeneNMF")  # or from GitHub
library(GeneNMF)

BCC dataset

Download the test dataset for this demo from Figshare:

do_download <- FALSE

ddir <- "input"
data.path <- sprintf("%s/Tumor_combination_LY_CG.rds", ddir)

if (do_download) {
  dir.create(ddir)
  options(timeout = 3000)
  download.file("https://ndownloader.figshare.com/files/47742634", destfile = data.path)
}

seu <- readRDS(data.path)

As with most cancer cell datasets, we observe large batch effects between patients.

DimPlot(seu, group.by="patient_bcc") + theme(aspect.ratio = 1)

With this level of batch effects, it is not trivial to analyse multiple samples together. Typical scRNA-seq analysis pipelines call for batch-effect correction methods, which aim at reducing technical variability while preserving real, biological differences (see e.g. Luecken et al. (2022)). However, cancer cells DO indeed have unique transcriptional phenotypes in different patients, making technical and biological variability indistinguishable. Extracting gene programs by NMF in individual patients is attractive because it bypasses the need for batch effect correction - we rather integrate multiple samples at the level of their gene program activities.

Consistent NMF programs across multiple samples

Identification of robust gene programs requires their detection across samples and variability of input parameters. Perhaps the most crucial parameter to NMF is the dimensionality k, which corresponds to the number of programs of the low-dimensional matrix. To determine robust programs, we can run NMF over multiple numbers of k and determine programs that are consistenly found across these runs. The multiNMF() function automatically performs NMF over a list of samples and for multiple values of k:

DefaultAssay(seu) <- "RNA"
seu.list <- SplitObject(seu, split.by = "Sample")

geneNMF.programs <- multiNMF(seu.list, assay="RNA", k=4:9, min.exp = 0.05)

We can now combine the gene programs identified over multiple samples and numbers of k into metaprograms (MPs), i.e. consensus programs that are robustly identified across NMF runs. Here we will define 10 MPs:

geneNMF.metaprograms <- getMetaPrograms(geneNMF.programs,
                                        metric = "cosine",
                                        specificity.weight = 5,
                                        weight.explained = 0.5,
                                        nMP=10)

It can be useful to visualize pairwise similarity (in terms of cosine similarity or Jaccard index) between individual gene programs that compose meta-programs. We can see “blocks” corresponding to gene programs of high similarity across datasets and values of k. We can then cut the similarity tree at a given height to find blocks of similar programs and derive consensus gene signatures for each block. For example, here we cut the tree to the height corresponding to 10 clusters of programs (i.e. 10 MPs):

plotMetaPrograms(geneNMF.metaprograms,
                       similarity.cutoff = c(0.1,1))

We can also inspect useful statistics, such as the “sample coverage” (in what fraction of samples the MP was detected); or the silhouette coefficient (how similar are individual programs in a MP relative to programs in other MPs - the higher the better).

geneNMF.metaprograms$metaprograms.metrics
##      sampleCoverage  silhouette meanSimilarity numberGenes numberPrograms
## MP1       1.0000000  0.68161477          0.712          23             51
## MP2       1.0000000  0.31506300          0.436          36             66
## MP3       1.0000000  0.06644191          0.202         200             89
## MP4       0.9090909  0.34425056          0.483          29             54
## MP5       0.9090909 -0.00669364          0.206         149             67
## MP6       0.7272727  0.38534901          0.498          48             26
## MP7       0.4545455  0.43452945          0.527          13             17
## MP8       0.3636364  0.35061600          0.463          61             17
## MP9       0.3636364  0.05643251          0.266         114             30
## MP10      0.1818182  0.48946623          0.579          35             12

Based on these metrics, one may decide to drop some of the programs, e.g. if they are specific for few samples only (low sample coverage), or have bad internal consistency (low silhouette and meanSimilarity). For example, one may decide to drop MP5 because of its negative silhouette coefficient:

geneNMF.metaprograms.filtered <- dropMetaPrograms(geneNMF.metaprograms, dropMP = "MP5")

plotMetaPrograms(geneNMF.metaprograms.filtered,
                       similarity.cutoff = c(0.1,1))

Some important parameters

We can also be more strict when we extract MPs from the matrix of individual programs. For example, we can increase the min.confidence parameter to 0.7 to focus on genes consistently found across at least 70% of the individual programs.

geneNMF.metaprograms <- getMetaPrograms(geneNMF.programs,
                                        metric = "cosine",
                                        weight.explained = 0.5,
                                        nMP=10,
                                        min.confidence = 0.7)
geneNMF.metaprograms$metaprograms.metrics
##      sampleCoverage  silhouette meanSimilarity numberGenes numberPrograms
## MP1       1.0000000  0.68161477          0.712          23             51
## MP2       1.0000000  0.31506300          0.436          36             66
## MP3       1.0000000  0.06644191          0.202          58             89
## MP4       0.9090909  0.34425056          0.483          29             54
## MP5       0.9090909 -0.00669364          0.206          71             67
## MP6       0.7272727  0.38534901          0.498          23             26
## MP7       0.4545455  0.43452945          0.527          13             17
## MP8       0.3636364  0.35061600          0.463          50             17
## MP9       0.3636364  0.05643251          0.266          75             30
## MP10      0.1818182  0.48946623          0.579          34             12

We see that most MPs are composed of around 10-50 genes (column numberGenes in the table above). We can control the size of the MPs using the weight.explained parameter. For example, we can increase weight.explained to 0.8 to include all genes that allow to explain 80% of loadings in the consensus gene set:

geneNMF.metaprograms <- getMetaPrograms(geneNMF.programs,
                                        metric = "cosine",
                                        weight.explained = 0.8,
                                        nMP=10,
                                        min.confidence = 0.7)

Now we have larger meta-program gene sets:

geneNMF.metaprograms$metaprograms.metrics
##      sampleCoverage  silhouette meanSimilarity numberGenes numberPrograms
## MP1       1.0000000  0.68161477          0.712          72             51
## MP2       1.0000000  0.31506300          0.436          84             66
## MP3       1.0000000  0.06644191          0.202          58             89
## MP4       0.9090909  0.34425056          0.483          68             54
## MP5       0.9090909 -0.00669364          0.206          78             67
## MP6       0.7272727  0.38534901          0.498          33             26
## MP7       0.4545455  0.43452945          0.527          49             17
## MP8       0.3636364  0.35061600          0.463         143             17
## MP9       0.3636364  0.05643251          0.266          98             30
## MP10      0.1818182  0.48946623          0.579          97             12

We can also change the annotation colors and/or the palette for program similarity:

anno_colors <- brewer.pal(n=10, name="Paired")
names(anno_colors) <- names(geneNMF.metaprograms$metaprograms.genes)


plotMetaPrograms(geneNMF.metaprograms,
                       palette=viridis(100, option = "H", direction = 1),
                       annotation_colors = anno_colors,
                       similarity.cutoff = c(0,1))

Other important parameters to getMetaPrograms() are nMP (the target number of meta-programs) and specificity.weight, which controls how much MP signatures are allowed to overlap. From the default specificity.weight=5, lower values (e.g. =3) allows more overlapping MPs, higher values (e.g. =8) enforces higher sparsity in the decomposition.

Interpretation of gene programs

What are the genes driving each meta-program?

lapply(geneNMF.metaprograms$metaprograms.genes, head)
## $MP1
## [1] "TOP2A"  "CENPF"  "NUSAP1" "PCLAF"  "MKI67"  "MAD2L1"
## 
## $MP2
## [1] "DMKN"    "KRT6B"   "CALML5"  "LGALS7B" "KRT1"    "S100A7" 
## 
## $MP3
## [1] "MEGF6"  "FAT3"   "SLC7A2" "NIN"    "NAV2"   "PEAK1" 
## 
## $MP4
## [1] "FOS"   "FOSB"  "EGR1"  "NR4A1" "ZFP36" "HSPA6"
## 
## $MP5
## [1] "SFRP1" "RAMP1" "UCP2"  "BGN"   "MYL9"  "JUNB" 
## 
## $MP6
## [1] "TAGLN" "ACTA2" "MYLK"  "MYL9"  "CSRP1" "ANXA3"
## 
## $MP7
## [1] "S100A8" "S100A9" "S100A7" "CHI3L1" "TIMP1"  "CCL2"  
## 
## $MP8
## [1] "CRCT1" "CRNN"  "HHIP"  "RBP2"  "STAC2" "FOS"  
## 
## $MP9
## [1] "ATF3"    "BHLHE40" "MAP3K8"  "EGR2"    "REL"     "BCL6"   
## 
## $MP10
## [1] "CSKMT"      "H2BC9"      "H2BC15"     "DNAJC3-DT"  "ATP2B1-AS1"
## [6] "SYP"

We can also inspect their relative weights to see how much each gene contributes to the MP. For example, for MP7:

geneNMF.metaprograms$metaprograms.genes.weights$MP7
##      S100A8      S100A9      S100A7      CHI3L1       TIMP1        CCL2 
## 0.200183187 0.183839099 0.128107578 0.088213125 0.088027609 0.051224759 
##       KRT6A       TAGLN         TNC        MT2A       INHBA       KRT6B 
## 0.031167401 0.022652270 0.019090175 0.016360064 0.014015517 0.013390960 
##       LIMA1        PLAU       ZFP36      S100A2        EGR1     TMEM45A 
## 0.012697931 0.010524175 0.007496426 0.006802411 0.006239283 0.006131156 
##     LGALS7B       SOCS3        EGR2      PRSS22        MT1X       KRT75 
## 0.005641086 0.005629104 0.005493554 0.004878771 0.004610789 0.004461525 
##      DNAJB1      IFITM3         C1R         FOS       KLF10         MGP 
## 0.004408615 0.004382688 0.003811360 0.003657462 0.003556166 0.003306738 
##       DUSP5        LAD1       ANXA3      NFKBIA        IER5      HSPA1B 
## 0.003020026 0.002964246 0.002951172 0.002816552 0.002320073 0.002273507 
##     GADD45A        LDHA       HSPH1       DUSP2        JUNB     ZC3H12A 
## 0.002192057 0.002168157 0.002082279 0.002051562 0.001925781 0.001836368 
##        MAFF         JUN        GJB3        KLF6     BHLHE40       CKS1B 
## 0.001797796 0.001783716 0.001760626 0.001668978 0.001565345 0.001548408 
##       TRIB1 
## 0.001272365

To aid the interpretation of gene programs, we can compare them to known signatures from public databases. The runGSEA() function can be useful to scan MSigDB and evaluate the overlap of detected gene programs with signatures in the databases. Here we compare to the “C5” category, biological process subcategory; but other classes such as “H” (hallmark gene sets) or “C8” (cell type) may be more relevant in other contexts.

library(msigdbr)
library(fgsea)
top_p <- lapply(geneNMF.metaprograms$metaprograms.genes, function(program) {
  runGSEA(program, universe=rownames(seu), category = "C5", subcategory = "GO:BP")
})

For example, MP1 appears to be associated with cell cycling / cell division.

head(top_p$MP1)
##                            pathway         pval         padj foldEnrichment
##                             <char>        <num>        <num>          <num>
## 1:         GOBP_CELL_CYCLE_PROCESS 2.290305e-38 1.725516e-34      11.182207
## 2:                 GOBP_CELL_CYCLE 1.000977e-36 3.770680e-33       9.274451
## 3:         GOBP_MITOTIC_CELL_CYCLE 2.305595e-33 5.790117e-30      13.407879
## 4: GOBP_MITOTIC_CELL_CYCLE_PROCESS 1.266787e-31 2.385993e-28      14.617038
## 5:    GOBP_CHROMOSOME_ORGANIZATION 2.610448e-29 3.933423e-26      16.668051
## 6:     GOBP_CHROMOSOME_SEGREGATION 1.783188e-28 2.239090e-25      20.203020
##    overlap  size                               overlapGenes
##      <int> <int>                                     <list>
## 1:      45  1346  ASPM,ATAD5,BRCA1,BRCA2,CCNB1,CDK1,...[45]
## 2:      47  1695  ASPM,ATAD5,BRCA1,BRCA2,CCNB1,CDK1,...[47]
## 3:      37   923 ATAD5,BRCA1,BRCA2,CCNB1,CDK1,CENPF,...[37]
## 4:      34   778 ATAD5,BRCA1,BRCA2,CCNB1,CDK1,CENPF,...[34]
## 5:      30   602 BRCA2,CCNB1,CDK1,CENPF,CENPH,CENPK,...[30]
## 6:      27   447  ASPM,BRCA1,CCNB1,CDK1,CENPF,CENPH,...[27]

A word of caution: GSEA can be very useful to gain a broad idea of MP function. However, gene sets in MSigDB are often very large (>1000 genes), noisy sets derived from specific conditions, so results should be interpreted with caution.

Signature scores for gene programs

Now that we derived gene signatures for our MPs, we can use them to obtain gene sets scores to place each cell on the axis of each MP. We can apply the UCell method to derive MP scores between 0 and 1 for each cell.

mp.genes <- geneNMF.metaprograms$metaprograms.genes
seu <- AddModuleScore_UCell(seu, features = mp.genes, ncores=4, name = "")

We can explore whether certain MPs are enriched in individual samples:

VlnPlot(seu, features=names(mp.genes), group.by = "patient_bcc",
        pt.size = 0, ncol=5)

Some MPs appear to be equally active across multiple samples; others such as MP4 or MP7 have different distributions in different samples. If clinical data are available for patients, one may use this information to study associations between MPs and relevant biological variables.

Signature scores to define integrated spaces

Individual cells can now be represented in terms of their gene program scores. Importantly, here the gene programs were learned as a consensus of gene programs found across multiple samples – as opposed to calculating NMF once on the whole dataset. This can be an effective strategy to mitigate batch effects, as meta-programs (MPs) are a consensus of gene programs consistently found across individual samples. Let’s store these coordinates in the Seurat object:

matrix <- seu@meta.data[,names(mp.genes)]

#dimred <- scale(matrix)
dimred <- as.matrix(matrix)

colnames(dimred) <- paste0("MP_",seq(1, ncol(dimred)))
#New dim reduction
seu@reductions[["MPsignatures"]] <- new("DimReduc",
                                         cell.embeddings = dimred,
                                         assay.used = "RNA",
                                         key = "MP_",
                                         global = FALSE)

We can also use these scores to generate a UMAP representation and visualize the data in 2D

set.seed(123)
seu <- RunUMAP(seu, reduction="MPsignatures", dims=1:length(seu@reductions[["MPsignatures"]]),
               metric = "euclidean", reduction.name = "umap_MP")

We have now integrated the data based on their MP signature scores!

DimPlot(seu, reduction = "umap_MP", group.by = "patient_bcc") + theme(aspect.ratio = 1)

How do the signature scores for the meta-programs look like in the combined space?

library(viridis)
FeaturePlot(seu, features = names(mp.genes), reduction = "umap_MP", ncol=4) &
  scale_color_viridis(option="B") &
   theme(aspect.ratio = 1, axis.text=element_blank(), axis.ticks=element_blank())

Final remarks

NMF can be a powerful tool to extract gene programs for scRNA-seq data in an unbiased manner. Because it is calculated for each sample separately, it bypasses the need to perform batch effect correction to analyse samples jointly.

References

See also

Session Info

sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS 26.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Zurich
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] fgsea_1.36.2       msigdbr_26.1.0     GeneNMF_0.9.2      viridis_0.6.5     
##  [5] viridisLite_0.4.3  RColorBrewer_1.1-3 dplyr_1.2.0        tidyr_1.3.2       
##  [9] patchwork_1.3.2    UCell_2.14.0       ggplot2_4.0.2      Seurat_5.4.0      
## [13] SeuratObject_5.3.0 sp_2.2-1           renv_1.1.8        
## 
## loaded via a namespace (and not attached):
##   [1] jsonlite_2.0.0              magrittr_2.0.4             
##   [3] spatstat.utils_3.2-2        farver_2.1.2               
##   [5] rmarkdown_2.30              vctrs_0.7.1                
##   [7] ROCR_1.0-12                 spatstat.explore_3.7-0     
##   [9] S4Arrays_1.10.1             htmltools_0.5.9            
##  [11] curl_7.0.0                  BiocNeighbors_2.4.0        
##  [13] SparseArray_1.10.9          sass_0.4.10                
##  [15] sctransform_0.4.3           parallelly_1.46.1          
##  [17] KernSmooth_2.23-26          bslib_0.10.0               
##  [19] htmlwidgets_1.6.4           ica_1.0-3                  
##  [21] plyr_1.8.9                  plotly_4.12.0              
##  [23] zoo_1.8-15                  cachem_1.1.0               
##  [25] igraph_2.2.2                mime_0.13                  
##  [27] lifecycle_1.0.5             pkgconfig_2.0.3            
##  [29] Matrix_1.7-4                R6_2.6.1                   
##  [31] fastmap_1.2.0               MatrixGenerics_1.22.0      
##  [33] fitdistrplus_1.2-6          future_1.69.0              
##  [35] shiny_1.13.0                digest_0.6.39              
##  [37] colorspace_2.1-2            S4Vectors_0.48.0           
##  [39] tensor_1.5.1                RSpectra_0.16-2            
##  [41] irlba_2.3.7                 SnowballC_0.7.1            
##  [43] GenomicRanges_1.62.1        labeling_0.4.3             
##  [45] progressr_0.18.0            spatstat.sparse_3.1-0      
##  [47] httr_1.4.8                  polyclip_1.10-7            
##  [49] abind_1.4-8                 compiler_4.5.0             
##  [51] withr_3.0.2                 S7_0.2.1                   
##  [53] BiocParallel_1.44.0         fastDummies_1.7.5          
##  [55] dendextend_1.19.1           MASS_7.3-65                
##  [57] DelayedArray_0.36.0         tools_4.5.0                
##  [59] lmtest_0.9-40               otel_0.2.0                 
##  [61] httpuv_1.6.16               future.apply_1.20.2        
##  [63] goftest_1.2-3               glue_1.8.0                 
##  [65] nlme_3.1-168                promises_1.5.0             
##  [67] grid_4.5.0                  Rtsne_0.17                 
##  [69] cluster_2.1.8.2             reshape2_1.4.5             
##  [71] generics_0.1.4              gtable_0.3.6               
##  [73] spatstat.data_3.1-9         rmdformats_1.0.4           
##  [75] RcppML_0.3.7.1              data.table_1.18.2.1        
##  [77] XVector_0.50.0              BiocGenerics_0.56.0        
##  [79] spatstat.geom_3.7-0         RcppAnnoy_0.0.23           
##  [81] ggrepel_0.9.7               RANN_2.6.2                 
##  [83] pillar_1.11.1               stringr_1.6.0              
##  [85] babelgene_22.9              spam_2.11-3                
##  [87] RcppHNSW_0.6.0              later_1.4.8                
##  [89] splines_4.5.0               lattice_0.22-9             
##  [91] survival_3.8-6              deldir_2.0-4               
##  [93] tidyselect_1.2.1            SingleCellExperiment_1.32.0
##  [95] miniUI_0.1.2                pbapply_1.7-4              
##  [97] knitr_1.51                  gridExtra_2.3              
##  [99] bookdown_0.46               Seqinfo_1.0.0              
## [101] IRanges_2.44.0              SummarizedExperiment_1.40.0
## [103] scattermore_1.2             stats4_4.5.0               
## [105] xfun_0.56                   Biobase_2.70.0             
## [107] matrixStats_1.5.0           pheatmap_1.0.13            
## [109] stringi_1.8.7               lazyeval_0.2.2             
## [111] yaml_2.3.12                 evaluate_1.0.5             
## [113] codetools_0.2-20            lsa_0.73.4                 
## [115] tibble_3.3.1                BiocManager_1.30.27        
## [117] cli_3.6.5                   uwot_0.2.4                 
## [119] xtable_1.8-8                reticulate_1.45.0          
## [121] jquerylib_0.1.4             Rcpp_1.1.1                 
## [123] globals_0.19.1              spatstat.random_3.4-4      
## [125] png_0.1-8                   spatstat.univar_3.1-6      
## [127] parallel_4.5.0              assertthat_0.2.1           
## [129] dotCall64_1.2               listenv_0.10.1             
## [131] scales_1.4.0                ggridges_0.5.7             
## [133] purrr_1.2.1                 rlang_1.1.7                
## [135] fastmatch_1.1-8             cowplot_1.2.0