Characterizing intratumor heterogeneity with GeneNMF

Cancer cells are heterogeneous, expressing a variety of gene programs. Understanding the main axes of transcriptional variability in cancer cells can help shed light on the mechanisms of cancer progression and on the reasons for success/failure of cancer treatments. scRNA-seq data are particularly useful to study intratumoral heterogeneity, as they allow identifying “gene programs” consisting of genes that are coexpressed in subsets of cancer cells within individual tumors.

GeneNMF is a computational method aimed at extracting recurrent gene programs in cancer cells across multiple studies. Inspired on previous work on the application of non-negative matrix factorization (NMF) on scRNA-seq data (see e.g. Kotliar et al. (2019), Barkely et al. (2022), Gavish et al. (2023)), GeneNMF decomposes the expression matrix of individual datasets into a few, interpratable dimensions, and subsequently calculates “meta-programs” (or MP) defined as consensus gene sets that are found to be expressed across multiple samples and initializations of the method.

In this demo, we apply GeneNMF on 11 samples of basal cell carcinoma (BCC) malignant cells from two different studies (Ganier et al. (2024) and Yerly et al. (2022)).

Set up the environment

Here are some packages you’ll need for this demo:

library(Seurat)
library(ggplot2)
library(UCell)
library(patchwork)
library(tidyr)
library(dplyr)
library(RColorBrewer)

#install.packages("GeneNMF")
remotes::install_github("carmonalab/GeneNMF")  #from GitHub
library(GeneNMF)

BCC dataset

Download the test dataset for this demo from Figshare:

do_download <- FALSE

ddir <- "input"
data.path <- sprintf("%s/Tumor_combination_LY_CG.rds", ddir)

if (do_download) {
  dir.create(ddir)
  options(timeout = 3000)
  download.file("https://figshare.com/ndownloader/files/47742634", destfile = data.path)
}

seu <- readRDS(data.path)

As with most cancer cell datasets, we observe large batch effects between patients.

DimPlot(seu, group.by="patient_bcc") + theme(aspect.ratio = 1)

With this level of batch effects, it is not trivial to analyse multiple samples together. Typical scRNA-seq analysis pipelines call for batch-effect correction methods, which aim at reducing technical variability while preserving real, biological differences (see e.g. Luecken et al. (2022)). However, cancer cells DO indeed have unique transcriptional phenotypes in different patients, making technical and biological variability indistinguishable. Extracting gene programs by NMF in individual patients is attractive because it bypasses the need for batch effect correction - we rather integrate multiple samples at the level of their gene program activities.

Consistent NMF programs across multiple samples

Identification of robust gene programs requires their detection across samples and variability of input parameters. Perhaps the most crucial parameter to NMF is the dimensionality k, which corresponds to the number of programs of the low-dimensional matrix. To determine robust programs, we can run NMF over multiple numbers of k and determine programs that are consistenly found across these runs. The multiNMF() function automatically performs NMF over a list of samples and for multiple values of k:

DefaultAssay(seu) <- "RNA"
seu.list <- SplitObject(seu, split.by = "Sample")

geneNMF.programs <- multiNMF(seu.list, assay="RNA", k=4:9, min.exp = 0.05)

We can now combine the gene programs identified over multiple samples and numbers of k into metaprograms (MPs), i.e. consensus programs that are robustly identified across NMF runs. Here we will define 10 MPs:

geneNMF.metaprograms <- getMetaPrograms(geneNMF.programs,
                                        metric = "cosine",
                                        weight.explained = 0.5,
                                        nMP=10)

It can be useful to visualize pairwise similarity (in terms of cosine similarity or Jaccard index) between individual gene programs that compose meta-programs. We can see “blocks” corresponding to gene programs of high similarity across datasets and values of k. We can then cut the similarity tree at a given height to find blocks of similar programs and derive consensus gene signatures for each block. For example, here we cut the tree to the height corresponding to 10 clusters of programs (i.e. 10 MPs):

ph <- plotMetaPrograms(geneNMF.metaprograms,
                       similarity.cutoff = c(0.1,1))
ph

We can also inspect useful statistics, such as the “sample coverage” (in what fraction of samples the MP was detected); or the silhouette coefficient (how similar are individual programs in a MP relative to programs in other MPs - the higher the better).

Based on these metrics, one may decide to drop some of the programs, e.g. if they are specific for few samples only (low sample coverage), or have bad internal consistency (low silhouette and meanSimilarity).

geneNMF.metaprograms$metaprograms.metrics
##      sampleCoverage   silhouette meanSimilarity numberGenes numberPrograms
## MP1       1.0000000  0.696882673          0.723          21             50
## MP2       1.0000000  0.297392424          0.433          33             60
## MP3       0.9090909  0.129408777          0.295         142             60
## MP4       0.9090909 -0.073756249          0.161          45            109
## MP5       0.8181818  0.509947198          0.624          23             45
## MP6       0.5454545  0.459199536          0.560          11             20
## MP7       0.5454545  0.448717232          0.543          19             17
## MP8       0.5454545  0.008033304          0.233          52             44
## MP9       0.3636364  0.549435393          0.617          45             12
## MP10      0.2727273  0.530347451          0.599          36             12

Some important parameters

We can also be more strict when we extract MPs from the matrix of individual programs. For example, we can increase the min.confidence parameter to 0.7 to focus on genes consistently found across at least 70% of the individual programs.

geneNMF.metaprograms <- getMetaPrograms(geneNMF.programs,
                                        metric = "cosine",
                                        weight.explained = 0.5,
                                        nMP=10,
                                        min.confidence = 0.7)
## Dropped 1 empty meta-programs

We see the one on the MPs has now been dropped (we could have also picked it up by its poor metrics just above).

ph <- plotMetaPrograms(geneNMF.metaprograms,
                       similarity.cutoff = c(0.1,1))
ph

geneNMF.metaprograms$metaprograms.metrics
##     sampleCoverage  silhouette meanSimilarity numberGenes numberPrograms
## MP1      1.0000000 0.696882673          0.723          21             50
## MP2      1.0000000 0.297392424          0.433          21             60
## MP3      0.9090909 0.129408777          0.295          38             60
## MP4      0.8181818 0.509947198          0.624          23             45
## MP5      0.5454545 0.459199536          0.560           9             20
## MP6      0.5454545 0.448717232          0.543          15             17
## MP7      0.5454545 0.008033304          0.233          13             44
## MP8      0.3636364 0.549435393          0.617          36             12
## MP9      0.2727273 0.530347451          0.599          25             12

We see that most MPs are composed of around 10-20 genes (column numberGenes in the table above). We can control the size of the MPs using the weight.explained parameter. For example, we can increase weight.explained to 0.8 to include all genes that allow to explain 80% of loadings in the consensus gene set:

geneNMF.metaprograms <- getMetaPrograms(geneNMF.programs,
                                        metric = "cosine",
                                        weight.explained = 0.8,
                                        nMP=10,
                                        min.confidence = 0.7)
## Dropped 1 empty meta-programs

Now we have larger meta-program gene sets:

geneNMF.metaprograms$metaprograms.metrics
##     sampleCoverage  silhouette meanSimilarity numberGenes numberPrograms
## MP1      1.0000000 0.696882673          0.723          38             50
## MP2      1.0000000 0.297392424          0.433          27             60
## MP3      0.9090909 0.129408777          0.295          38             60
## MP4      0.8181818 0.509947198          0.624          37             45
## MP5      0.5454545 0.459199536          0.560          11             20
## MP6      0.5454545 0.448717232          0.543          21             17
## MP7      0.5454545 0.008033304          0.233          13             44
## MP8      0.3636364 0.549435393          0.617          59             12
## MP9      0.2727273 0.530347451          0.599          38             12

Other important parameters to getMetaPrograms() are nMP (the target number of meta-programs) and specificity.weight, which controls how much MP signatures are allowed to overlap. From the default specificity.weight=5, lower values (e.g. =3) allows more overlapping MPs, higher values (e.g. =8) enforces higher sparsity in the decomposition.

Interpretation of gene programs

What are the genes driving each meta-program?

lapply(geneNMF.metaprograms$metaprograms.genes, head)
## $MP1
## [1] "TOP2A"  "CENPF"  "NUSAP1" "PCLAF"  "MAD2L1" "TK1"   
## 
## $MP2
## [1] "KRT6B"   "CALML5"  "S100A7"  "DMKN"    "LGALS7B" "KRT1"   
## 
## $MP3
## [1] "NIN"    "FAT3"   "PEAK1"  "NAV2"   "SLC7A2" "INTS6" 
## 
## $MP4
## [1] "FOSB"   "FOS"    "EGR1"   "ZFP36"  "DNAJB1" "NR4A1" 
## 
## $MP5
## [1] "ACTA2" "TAGLN" "MYLK"  "MYL9"  "CSRP1" "ANXA3"
## 
## $MP6
## [1] "S100A8" "S100A9" "CHI3L1" "TIMP1"  "CCL2"   "S100A7"
## 
## $MP7
## [1] "ATF3"  "ZFP36" "SOCS3" "IRF1"  "BTG2"  "NR4A1"
## 
## $MP8
## [1] "CRCT1"     "CRNN"      "RBP2"      "LINC00964" "FOS"       "SOX18"    
## 
## $MP9
## [1] "SFRP1"   "ALK"     "COL14A1" "DCN"     "MEGF6"   "CHGA"

We can also inspect their relative weights to see how much each gene contributes to the MP. For example, for MP6:

geneNMF.metaprograms$metaprograms.genes.weights$MP6
##      S100A8      S100A9      CHI3L1       TIMP1        CCL2      S100A7 
## 0.215830597 0.210983115 0.115345508 0.105439904 0.090343059 0.051177038 
##       TAGLN        MT2A       INHBA       KRT6A       KRT6B      S100A2 
## 0.037815897 0.029383116 0.023189284 0.022741619 0.019063225 0.013392001 
##      IFITM3       ZFP36       KRT75         MGP       SOCS3       ANXA3 
## 0.011160279 0.010551846 0.009011202 0.006894923 0.006609389 0.005611245 
##       KLF10     TMEM45A       SULF2 
## 0.005390284 0.005292033 0.004774439

To aid the interpretation of gene programs, we can compare them to known signatures from public databases. The runGSEA() function can be useful to scan MSigDB and evaluate the overlap of detected gene programs with signatures in the databases. Here we compare to the “C5” category, biological process subcategory; but other classes such as “H” (hallmark gene sets) or “C8” (cell type) may be more relevant in other contexts.

library(msigdbr)
library(fgsea)
top_p <- lapply(geneNMF.metaprograms$metaprograms.genes, function(program) {
  runGSEA(program, universe=rownames(seu), category = "C5", subcategory = "GO:BP")
})

For example, MP1 appears to be associated with cell cycling / cell division.

head(top_p$MP1)
##                            pathway         pval         padj overlap  size
##                             <char>        <num>        <num>   <int> <int>
## 1:         GOBP_CELL_CYCLE_PROCESS 2.071463e-27 1.585705e-23      27  1175
## 2:         GOBP_MITOTIC_CELL_CYCLE 9.610340e-26 3.678358e-22      24   866
## 3:                 GOBP_CELL_CYCLE 1.444483e-24 3.685839e-21      28  1716
## 4:          GOBP_ORGANELLE_FISSION 2.410380e-24 3.842722e-21      20   490
## 5: GOBP_MITOTIC_CELL_CYCLE_PROCESS 2.509942e-24 3.842722e-21      22   714
## 6:              GOBP_CELL_DIVISION 5.394203e-24 6.882104e-21      21   618
##    overlapGenes
##          <list>
## 1: ASPM, BR....
## 2: BRCA1, B....
## 3: ASPM, BR....
## 4: ASPM, BR....
## 5: BRCA1, B....
## 6: ASPM, BR....

A word of caution: GSEA can be very useful to gain a broad idea of MP function. However, gene sets in MSigDB are often very large (>1000 genes), noisy sets derived from specific conditions, so results should be interpreted with caution.

Signature scores for gene programs

Now that we derived gene signatures for our MPs, we can use them to obtain gene sets scores to place each cell on the axis of each MP. We can apply the UCell method to derive MP scores between 0 and 1 for each cell.

mp.genes <- geneNMF.metaprograms$metaprograms.genes
seu <- AddModuleScore_UCell(seu, features = mp.genes, ncores=4, name = "")

We can explore whether certain MPs are enriched in individual samples:

VlnPlot(seu, features=names(mp.genes), group.by = "patient_bcc",
        pt.size = 0, ncol=5)

Some MPs appear to be equally active across multiple samples; others such as MP4 or MP7 have different distributions in different samples. If clinical data are available for patients, one may use this information to study associations between MPs and relevant biological variables.

Signature scores to define integrated spaces

Individual cells can now be represented in terms of their gene program scores. Importantly, here the gene programs were learned as a consensus of gene programs found across multiple samples – as opposed to calculating NMF once on the whole dataset. This can be an effective strategy to mitigate batch effects, as meta-programs (MPs) are a consensus of gene programs consistently found across individual samples. Let’s store these coordinates in the Seurat object:

matrix <- seu@meta.data[,names(mp.genes)]

#dimred <- scale(matrix)
dimred <- as.matrix(matrix)

colnames(dimred) <- paste0("MP_",seq(1, ncol(dimred)))
#New dim reduction
seu@reductions[["MPsignatures"]] <- new("DimReduc",
                                         cell.embeddings = dimred,
                                         assay.used = "RNA",
                                         key = "MP_",
                                         global = FALSE)

We can also use these scores to generate a UMAP representation and visualize the data in 2D

set.seed(123)
seu <- RunUMAP(seu, reduction="MPsignatures", dims=1:length(seu@reductions[["MPsignatures"]]),
               metric = "euclidean", reduction.name = "umap_MP")

We have now integrated the data based on their MP signature scores!

DimPlot(seu, reduction = "umap_MP", group.by = "patient_bcc") + theme(aspect.ratio = 1)

How do the signature scores for the meta-programs look like in the combined space?

library(viridis)
FeaturePlot(seu, features = names(mp.genes), reduction = "umap_MP", ncol=4) &
  scale_color_viridis(option="B") &
   theme(aspect.ratio = 1, axis.text=element_blank(), axis.ticks=element_blank())

Final remarks

NMF can be a powerful tool to extract gene programs for scRNA-seq data in an unbiased manner. Because it is calculated for each sample separately, it bypasses the need to perform batch effect correction to analyse samples jointly.

References

See also

Session Info

sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sonoma 14.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Zurich
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] viridis_0.6.5      viridisLite_0.4.2  fgsea_1.30.0       msigdbr_7.5.1     
##  [5] GeneNMF_0.6.0      RColorBrewer_1.1-3 dplyr_1.1.4        tidyr_1.3.1       
##  [9] patchwork_1.2.0    UCell_2.8.0        ggplot2_3.5.1      Seurat_5.1.0      
## [13] SeuratObject_5.0.2 sp_2.1-4           renv_1.0.7        
## 
## loaded via a namespace (and not attached):
##   [1] RcppAnnoy_0.0.22            splines_4.4.0              
##   [3] later_1.3.2                 tibble_3.2.1               
##   [5] polyclip_1.10-6             fastDummies_1.7.3          
##   [7] lifecycle_1.0.4             globals_0.16.3             
##   [9] lattice_0.22-6              MASS_7.3-60.2              
##  [11] SnowballC_0.7.1             magrittr_2.0.3             
##  [13] plotly_4.10.4               sass_0.4.9                 
##  [15] rmarkdown_2.27              jquerylib_0.1.4            
##  [17] yaml_2.3.9                  remotes_2.5.0              
##  [19] httpuv_1.6.15               sctransform_0.4.1          
##  [21] spam_2.10-0                 spatstat.sparse_3.1-0      
##  [23] reticulate_1.38.0           cowplot_1.1.3              
##  [25] pbapply_1.7-2               abind_1.4-5                
##  [27] zlibbioc_1.50.0             Rtsne_0.17                 
##  [29] GenomicRanges_1.56.1        purrr_1.0.2                
##  [31] BiocGenerics_0.50.0         GenomeInfoDbData_1.2.12    
##  [33] IRanges_2.38.1              S4Vectors_0.42.1           
##  [35] ggrepel_0.9.5               irlba_2.3.5.1              
##  [37] listenv_0.9.1               spatstat.utils_3.0-5       
##  [39] pheatmap_1.0.12             goftest_1.2-3              
##  [41] RSpectra_0.16-2             spatstat.random_3.3-1      
##  [43] fitdistrplus_1.2-1          parallelly_1.37.1          
##  [45] leiden_0.4.3.1              codetools_0.2-20           
##  [47] DelayedArray_0.30.1         tidyselect_1.2.1           
##  [49] UCSC.utils_1.0.0            farver_2.1.2               
##  [51] matrixStats_1.3.0           stats4_4.4.0               
##  [53] rmdformats_1.0.4            spatstat.explore_3.3-1     
##  [55] jsonlite_1.8.8              BiocNeighbors_1.22.0       
##  [57] progressr_0.14.0            ggridges_0.5.6             
##  [59] survival_3.5-8              tools_4.4.0                
##  [61] ica_1.0-3                   Rcpp_1.0.13                
##  [63] glue_1.7.0                  gridExtra_2.3              
##  [65] SparseArray_1.4.8           xfun_0.46                  
##  [67] MatrixGenerics_1.16.0       GenomeInfoDb_1.40.1        
##  [69] withr_3.0.0                 BiocManager_1.30.23        
##  [71] fastmap_1.2.0               fansi_1.0.6                
##  [73] RcppML_0.3.7                digest_0.6.36              
##  [75] R6_2.5.1                    mime_0.12                  
##  [77] colorspace_2.1-0            scattermore_1.2            
##  [79] tensor_1.5                  spatstat.data_3.1-2        
##  [81] utf8_1.2.4                  generics_0.1.3             
##  [83] data.table_1.15.4           httr_1.4.7                 
##  [85] htmlwidgets_1.6.4           S4Arrays_1.4.1             
##  [87] uwot_0.2.2                  pkgconfig_2.0.3            
##  [89] gtable_0.3.5                lmtest_0.9-40              
##  [91] SingleCellExperiment_1.26.0 XVector_0.44.0             
##  [93] htmltools_0.5.8.1           dotCall64_1.1-1            
##  [95] bookdown_0.40               scales_1.3.0               
##  [97] Biobase_2.64.0              png_0.1-8                  
##  [99] spatstat.univar_3.0-0       knitr_1.48                 
## [101] reshape2_1.4.4              nlme_3.1-164               
## [103] curl_5.2.1                  cachem_1.1.0               
## [105] zoo_1.8-12                  stringr_1.5.1              
## [107] KernSmooth_2.23-22          parallel_4.4.0             
## [109] miniUI_0.1.1.1              pillar_1.9.0               
## [111] grid_4.4.0                  vctrs_0.6.5                
## [113] RANN_2.6.1                  lsa_0.73.3                 
## [115] promises_1.3.0              xtable_1.8-4               
## [117] cluster_2.1.6               evaluate_0.24.0            
## [119] cli_3.6.3                   compiler_4.4.0             
## [121] rlang_1.1.4                 crayon_1.5.3               
## [123] future.apply_1.11.2         labeling_0.4.3             
## [125] plyr_1.8.9                  stringi_1.8.4              
## [127] deldir_2.0-4                BiocParallel_1.38.0        
## [129] babelgene_22.9              munsell_0.5.1              
## [131] lazyeval_0.2.2              spatstat.geom_3.3-2        
## [133] Matrix_1.7-0                RcppHNSW_0.6.0             
## [135] future_1.33.2               shiny_1.8.1.1              
## [137] SummarizedExperiment_1.34.0 highr_0.11                 
## [139] ROCR_1.0-11                 igraph_2.0.3               
## [141] bslib_0.7.0                 fastmatch_1.1-4