Decomposing scRNA-seq data using NMF - a demo

Non-negative matrix factorization is a method for the analysis of high dimensional data that allows extracting sparse and meaningful features from a set of non-negative data vectors. It is well suited for decomposing scRNA-seq data, effectively reducing large complex matrices (\(10^4\) of genes times \(10^5\) of cells) into a few interpretable gene programs. It has been especially used to extract recurrent gene programs in cancer cells (see e.g. Barkely et al. (2022) and Gavish et al. (2023)), which are otherwise difficult to integrate and analyse jointly. See also our demo on characterizing intratumor heterogeneity with GeneNMF

Here, to illustrate the methods implemented in the GeneNMF package, we will apply NMF on a single-cell cell dataset of human PBMCs - a downsampled version of the dataset published by Hao et al. (2021).

Set up the environment

Here are some packages you’ll need for this demo:

library(remotes)
remotes::install_github("carmonalab/GeneNMF") #from Github
library(GeneNMF)
library(Seurat)
library(ggplot2)
library(UCell)
library(patchwork)
library(Matrix)
library(RcppML)
library(viridis)

Then download the test dataset for this demo.

options(timeout=1000)

ddir <- "input"
data.path <- sprintf("%s/pbmc_multimodal.downsampled20k.seurat.rds", ddir)

if (!file.exists(data.path)) {
    dir.create(ddir)
    dataUrl <- "https://www.dropbox.com/s/akzu3hp4uz2mpkv/pbmc_multimodal.downsampled20k.seurat.rds?dl=1"
    download.file(dataUrl, data.path)
}

seu <- readRDS(data.path)

NMF for dimensionality reduction

NMF can be applied to reduce the dimensionality of the data from tens of thousand of genes to a few dimensions (similarly to PCA). With the RunNMF() function, it can be directly applied on a Seurat object, and it will save the NMF results as a new dimensionality reduction.

ndim <- 15

seu <- FindVariableFeatures(seu, nfeatures = 1000)
seu <- runNMF(seu, k = ndim, assay="SCT")
seu@reductions$NMF
## A dimensional reduction object with key NMF_ 
##  Number of dimensions: 15 
##  Number of cells: 20000 
##  Projected dimensional reduction calculated:  FALSE 
##  Jackstraw run: FALSE 
##  Computed using assay: SCT

We can also further reduced the dimensionality to 2 dimensions using UMAP; in this space we can visualize all cells in a single plot.

seu <- RunUMAP(seu, reduction = "NMF", dims=1:ndim, reduction.name = "NMF_UMAP", reduction.key = "nmfUMAP_")
DimPlot(seu, reduction = "NMF_UMAP", group.by = "celltype.l1", label=T) + theme(aspect.ratio = 1,
                                                            axis.text = element_blank(),
    axis.title = element_blank(),
    axis.ticks = element_blank()) + ggtitle("NMF UMAP") + NoLegend()

Consistent NMF programs across multiple samples

Identification of robust gene programs requires their detection across samples and variability of input parameters. Perhaps the most crucial parameter to NMF is the dimensionality k, which corresponds to the number of programs of the low-dimensional matrix. To determine robust programs, we can run NMF over multiple numbers of k and determine programs that are consistenly found across these runs. The multiNMF() function automatically performs NMF over a list of samples and for multiple values of k:

seu.list <- SplitObject(seu, split.by = "donor")

geneNMF.programs <- multiNMF(seu.list, assay="SCT", slot="data", k=4:9, nfeatures = 1000)

We can now combine the gene programs identified over multiple samples and numbers of k into metaprograms (MPs), i.e. consensus programs that are robustly identified across NMF runs. Here we will define 10 MPs:

geneNMF.metaprograms <- getMetaPrograms(geneNMF.programs,
                                        nMP=10,
                                        weight.explained = 0.7,
                                        max.genes=100)

It can be useful to visualize pairwise similarity (in terms of cosine similarity or Jaccard index) between individual gene programs that compose meta-programs. We can see “blocks” corresponding to gene programs of high similarity across datasets and values of k. We can then cut the similarity tree at a given height to find blocks of similar programs and derive consensus gene signatures for each block. For example, here we cut the tree to the height corresponding to 10 clusters of programs (i.e. 10 MPs):

ph <- plotMetaPrograms(geneNMF.metaprograms)

We can also inspect useful statistics, such as the “sample coverage” (in what fraction of samples the MP was detected); or the silhouette coefficient (how similar are individual programs in a MP relative to programs in other MPs - the higher the better).

Based on these metrics, one may decide to drop some of the programs, e.g. if they are specific for few samples only (low sample coverage), or have bad internal consistency (low silhouette and meanSimilarity).

geneNMF.metaprograms$metaprograms.metrics
##      sampleCoverage silhouette meanSimilarity numberGenes numberPrograms
## MP1           1.000 0.91762184          0.924          33             44
## MP2           1.000 0.70847579          0.754          40             44
## MP3           1.000 0.67909952          0.733          74             44
## MP4           1.000 0.65178741          0.724          28             48
## MP5           1.000 0.59000367          0.649          74             49
## MP6           0.875 0.94719185          0.950          18             23
## MP7           0.875 0.67804908          0.742          89             18
## MP8           0.875 0.39549793          0.549          39             19
## MP9           0.500 0.60792198          0.662          23             19
## MP10          0.500 0.02332299          0.106           3              4

What are the genes driving each MP?

lapply(geneNMF.metaprograms$metaprograms.genes, head)
## $MP1
## [1] "IGHM"    "MS4A1"   "CD79A"   "BANK1"   "IGHD"    "RALGPS2"
## 
## $MP2
## [1] "IL7R"    "LEF1"    "TCF7"    "MAL"     "TRABD2A" "CAMK4"  
## 
## $MP3
## [1] "CDKN1C" "HES4"   "MS4A7"  "TCF7L2" "SMIM25" "CTSL"  
## 
## $MP4
## [1] "GNLY"   "PRF1"   "FGFBP2" "GZMB"   "KLRF1"  "SPON2" 
## 
## $MP5
## [1] "S100A12" "VCAN"    "CYP1B1"  "CD14"    "S100A8"  "RBP7"   
## 
## $MP6
## [1] "PPBP"   "PF4"    "GNG11"  "CAVIN2" "TUBB1"  "CLU"   
## 
## $MP7
## [1] "CLEC10A" "FCER1A"  "LGALS2"  "PID1"    "FLT3"    "CD1C"   
## 
## $MP8
## [1] "GZMK"  "DUSP2" "KLRG1" "TRGC2" "CD8A"  "GNLY" 
## 
## $MP9
## [1] "CCL3L1" "CCL3"   "IL1B"   "G0S2"   "CCL4L2" "CXCL8" 
## 
## $MP10
## [1] "JCHAIN" "MZB1"   "IGLC2"

Intepretation of gene programs by GSEA

To aid the interpretation of gene programs, we can compare them to known signatures from public databases. The runGSEA() function can be useful to scan msigDB and evaluate the overlap of detected gene programs with signatures in the databases. Here we compare to the “C8” category (cell type signature gene sets); but other classes such as “H” (hallmark gene sets) may be more relevant in other contexts.

library(msigdbr)
library(fgsea)
top_p <- lapply(geneNMF.metaprograms$metaprograms.genes, function(program) {
  runGSEA(program, universe=rownames(seu), category = "C8")
})

For example, MP4 appears to correlate significantly with natural killer cell (NK) signatures:

head(top_p$MP4)
##                                             pathway         pval         padj
##                                              <char>        <num>        <num>
## 1:              TRAVAGLINI_LUNG_NATURAL_KILLER_CELL 1.926720e-53 1.344850e-50
## 2: DURANTE_ADULT_OLFACTORY_NEUROEPITHELIUM_NK_CELLS 2.158335e-43 7.532589e-41
## 3:                         HAY_BONE_MARROW_NK_CELLS 1.237929e-41 2.880248e-39
## 4:                 TRAVAGLINI_LUNG_CD8_NAIVE_T_CELL 2.814477e-38 4.911262e-36
## 5:              RUBENSTEIN_SKELETAL_MUSCLE_NK_CELLS 1.780951e-36 2.486207e-34
## 6:          FAN_EMBRYONIC_CTX_BRAIN_EFFECTOR_T_CELL 5.254691e-34 6.112958e-32
##    overlap  size overlapGenes
##      <int> <int>       <list>
## 1:      25   129 ADGRG1, ....
## 2:      20    82 CCL4, CC....
## 3:      25   359 ADGRG1, ....
## 4:      20   140 ADGRG1, ....
## 5:      21   222 AKR1C3, ....
## 6:      18   127 CCL4, CL....

Signature scores for gene programs

A simple way to evaluate gene programs learned from the data is to calculate gene signature scores with the UCell package.

mp.genes <- geneNMF.metaprograms$metaprograms.genes
seu <- AddModuleScore_UCell(seu, features = mp.genes, assay="SCT", ncores=4, name = "")

We can see how many of the programs are enriched in specific cell subtypes (cell type annotation from the original study).

VlnPlot(seu, features=names(mp.genes), group.by = "celltype.l1",
        pt.size = 0, ncol=5)

Signature scores to define integrated space

Individual cells can now be represented in terms of their gene program scores. Importantly, here the gene programs were learned as a consensus of gene programs found across multiple samples – as opposed to calculating NMF once on the whole dataset. This can be an effective strategy to mitigate batch effects, as meta-programs (MPs) are a consensus of gene programs consistently found across individual samples. Let’s store these coordinates in the Seurat object:

matrix <- seu@meta.data[,names(mp.genes)]

#dimred <- scale(matrix)
dimred <- as.matrix(matrix)

colnames(dimred) <- paste0("MP_",seq(1, ncol(dimred)))
#New dim reduction
seu@reductions[["MPsignatures"]] <- new("DimReduc",
                                         cell.embeddings = dimred,
                                         assay.used = "RNA",
                                         key = "MP_",
                                         global = FALSE)

We can also use these scores to generate a UMAP representation and visualize the data in 2D:

set.seed(123)
seu <- RunUMAP(seu, reduction="MPsignatures", dims=1:length(seu@reductions[["MPsignatures"]]),
               metric = "euclidean", reduction.name = "umap_MP")

How do the signature scores for the meta-programs look like in the combined space?

FeaturePlot(seu, features = names(mp.genes), reduction = "umap_MP", ncol=4) &
  scale_color_viridis(option="B") &
   theme(aspect.ratio = 1, axis.text=element_blank(), axis.ticks=element_blank())

a <- DimPlot(seu, reduction = "umap_MP", group.by = "celltype.l1", label=T) + theme(aspect.ratio = 1,
                                                            axis.text = element_blank(),
    axis.title = element_blank(),
    axis.ticks = element_blank()) + ggtitle("Original cell types") + NoLegend()

b <- DimPlot(seu, reduction = "umap_MP", group.by = "donor", label=T) + theme(aspect.ratio = 1,
                                                            axis.text = element_blank(),
    axis.title = element_blank(),
    axis.ticks = element_blank()) + ggtitle("Donor") + NoLegend()
a | b

Final remarks

NMF can be a powerful tool to extract gene programs for scRNA-seq data in an unbiased manner. Because it is calculated for each sample separately, it bypasses the need to perform batch effect correction to analyse samples jointly. This aspect makes it particularly interesting for the analysis of gene programs in cancer cells (see e.g. Barkely et al. (2022), Gavish et al. (2023) Yerly et al. (2024).

References

  • GeneNMF package - LINK
  • We use the ultra-fast RcppML method for NMF - LINK

See also

  • Characterizing intratumor heterogeneity with GeneNMF - DEMO
  • More demos are available at the GeneNMF GitHub repo. Questions? open an issue on GitHub.

Session Info

sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sonoma 14.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Zurich
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] fgsea_1.30.0       msigdbr_7.5.1      viridis_0.6.5      viridisLite_0.4.2 
##  [5] RcppML_0.3.7       Matrix_1.7-0       patchwork_1.2.0    UCell_2.8.0       
##  [9] ggplot2_3.5.1      Seurat_5.1.0       SeuratObject_5.0.2 sp_2.1-4          
## [13] GeneNMF_0.6.0      remotes_2.5.0      renv_1.0.7        
## 
## loaded via a namespace (and not attached):
##   [1] RcppAnnoy_0.0.22            splines_4.4.0              
##   [3] later_1.3.2                 tibble_3.2.1               
##   [5] polyclip_1.10-6             fastDummies_1.7.3          
##   [7] lifecycle_1.0.4             globals_0.16.3             
##   [9] lattice_0.22-6              MASS_7.3-60.2              
##  [11] SnowballC_0.7.1             magrittr_2.0.3             
##  [13] plotly_4.10.4               sass_0.4.9                 
##  [15] rmarkdown_2.27              jquerylib_0.1.4            
##  [17] yaml_2.3.9                  httpuv_1.6.15              
##  [19] sctransform_0.4.1           spam_2.10-0                
##  [21] spatstat.sparse_3.1-0       reticulate_1.38.0          
##  [23] cowplot_1.1.3               pbapply_1.7-2              
##  [25] RColorBrewer_1.1-3          abind_1.4-5                
##  [27] zlibbioc_1.50.0             Rtsne_0.17                 
##  [29] GenomicRanges_1.56.1        purrr_1.0.2                
##  [31] BiocGenerics_0.50.0         GenomeInfoDbData_1.2.12    
##  [33] IRanges_2.38.1              S4Vectors_0.42.1           
##  [35] ggrepel_0.9.5               irlba_2.3.5.1              
##  [37] listenv_0.9.1               spatstat.utils_3.0-5       
##  [39] pheatmap_1.0.12             goftest_1.2-3              
##  [41] RSpectra_0.16-2             spatstat.random_3.3-1      
##  [43] fitdistrplus_1.2-1          parallelly_1.37.1          
##  [45] leiden_0.4.3.1              codetools_0.2-20           
##  [47] DelayedArray_0.30.1         tidyselect_1.2.1           
##  [49] UCSC.utils_1.0.0            farver_2.1.2               
##  [51] matrixStats_1.3.0           stats4_4.4.0               
##  [53] rmdformats_1.0.4            spatstat.explore_3.3-1     
##  [55] jsonlite_1.8.8              BiocNeighbors_1.22.0       
##  [57] progressr_0.14.0            ggridges_0.5.6             
##  [59] survival_3.5-8              tools_4.4.0                
##  [61] ica_1.0-3                   Rcpp_1.0.13                
##  [63] glue_1.7.0                  gridExtra_2.3              
##  [65] SparseArray_1.4.8           xfun_0.46                  
##  [67] MatrixGenerics_1.16.0       GenomeInfoDb_1.40.1        
##  [69] dplyr_1.1.4                 withr_3.0.0                
##  [71] BiocManager_1.30.23         fastmap_1.2.0              
##  [73] fansi_1.0.6                 digest_0.6.36              
##  [75] R6_2.5.1                    mime_0.12                  
##  [77] colorspace_2.1-0            scattermore_1.2            
##  [79] tensor_1.5                  spatstat.data_3.1-2        
##  [81] utf8_1.2.4                  tidyr_1.3.1                
##  [83] generics_0.1.3              data.table_1.15.4          
##  [85] httr_1.4.7                  htmlwidgets_1.6.4          
##  [87] S4Arrays_1.4.1              uwot_0.2.2                 
##  [89] pkgconfig_2.0.3             gtable_0.3.5               
##  [91] lmtest_0.9-40               SingleCellExperiment_1.26.0
##  [93] XVector_0.44.0              htmltools_0.5.8.1          
##  [95] dotCall64_1.1-1             bookdown_0.40              
##  [97] scales_1.3.0                Biobase_2.64.0             
##  [99] png_0.1-8                   spatstat.univar_3.0-0      
## [101] knitr_1.48                  reshape2_1.4.4             
## [103] nlme_3.1-164                curl_5.2.1                 
## [105] cachem_1.1.0                zoo_1.8-12                 
## [107] stringr_1.5.1               KernSmooth_2.23-22         
## [109] parallel_4.4.0              miniUI_0.1.1.1             
## [111] pillar_1.9.0                grid_4.4.0                 
## [113] vctrs_0.6.5                 RANN_2.6.1                 
## [115] lsa_0.73.3                  promises_1.3.0             
## [117] xtable_1.8-4                cluster_2.1.6              
## [119] evaluate_0.24.0             cli_3.6.3                  
## [121] compiler_4.4.0              rlang_1.1.4                
## [123] crayon_1.5.3                future.apply_1.11.2        
## [125] labeling_0.4.3              plyr_1.8.9                 
## [127] stringi_1.8.4               deldir_2.0-4               
## [129] BiocParallel_1.38.0         babelgene_22.9             
## [131] munsell_0.5.1               lazyeval_0.2.2             
## [133] spatstat.geom_3.3-2         RcppHNSW_0.6.0             
## [135] future_1.33.2               shiny_1.8.1.1              
## [137] highr_0.11                  SummarizedExperiment_1.34.0
## [139] ROCR_1.0-11                 igraph_2.0.3               
## [141] bslib_0.7.0                 fastmatch_1.1-4