Augmenting reference subtype classifications using gene programs

An example quantifying proliferation, response to interferons and tissue-residency gene programs across T cell subtypes

Background

Cells responding to external cues can transiently activate gene programs, or groups of genes that are collectively involved in specific gene processes, such as cell division, or response to interferons. In scRNA-seq data, transcriptional variance can be largely explained by these transient gene expression programs masking differences between different cell types, and leading to the clustering of cells of different types.

Some typical gene programs observed in T cell scRNA-seq data:

Cell cycle (eg. MCM5, TOP2A)
Interferon (IFN) response (eg. MX1, ISG15)
Heat shock protein (HSP) response (eg. HSPA1, HSPA2)
TCR Activation (eg. JUN, FOS, NR4A1)
Tissue-residency a program associated to homing and/or retention of T cells in non-lymphoid tissues (eg. ZNF683, ITGAE)

Here we will show to use reference-based projection to classify cells into robust subtypes, irrespective of the activation of transient gene programs, but still recover this important information to more granularly define cell states.

In this example we will use data from Gueguen et al. 2021.

Data setup

Loading the reference map

First, let’s download and load the CD8 TIL reference map Seurat object

# Load the reference
options(timeout = max(900, getOption("timeout")))
#download.file("https://figshare.com/ndownloader/files/38921366", destfile = "CD8T_human_ref_v1.rds")
ref.cd8 <- load.reference.map("CD8T_human_ref_v1.rds")

## [1] "Loading Custom Reference Atlas..."
## [1] "Loaded Custom Reference map Human CD8 TILs"

# Setup colors
mycols <- ref.cd8@misc$atlas.palette

# Plotting
DimPlot(ref.cd8,  group.by = 'functional.cluster', label = T, repel = T, cols = mycols) + theme(aspect.ratio = 1)

We can verify the reference map only contains “stable” or “basal” cell types (cells strongly expressing transient gene programs, such as cell division, IFN-response, and HSP-response were removed from the reference by design).

Load query data

Download and load Seurat object containing the pre-annotated Gueguen dataset. In this example, the original author annotations are available in the seurat_clusters metadata column

#download.file("https://figshare.com/ndownloader/files/39082049", destfile = "gueguen.cd3.Rds")
gueguen.cd3 <- readRDS("gueguen.cd3.Rds")
DefaultAssay(gueguen.cd3) <- "RNA"
gueguen.cd3$seurat_clusters <- Idents(gueguen.cd3)

# DimPlot
DimPlot(gueguen.cd3, order = T,  label = T, repel = T)

ProjecTILs reference-based classification

Next we will run ProjecTILs ProjecTILs.classifier function. This running mode will add the predicted celltype labels as metadata (i.e. functional.cluster), while keeping intact the original low dimensional embeddings.

Here we will run the prediction separately for each patient/sample contained in the dataset. To this end, we include the parameter split.by indicating the metadata column containing the sample information (in this example orig.ident).

# Projection
gueguen.cd3 <- ProjecTILs.classifier(gueguen.cd3, ref = ref.cd8, filter.cells = T, split.by = 'orig.ident', ncores = 6)

# Plotting
DimPlot(gueguen.cd3, group.by = 'functional.cluster', order = T, cols = mycols, label = T, repel = T)

table(gueguen.cd3$functional.cluster)

## 
##        CD8.CM        CD8.EM      CD8.MAIT CD8.NaiveLike     CD8.TEMRA 
##          2132          3181           147           238           276 
##       CD8.TEX      CD8.TPEX 
##          3340           337

Augmenting cell states resolution by quantifying activation of gene programs

To further sub-classify T cell reference subtypes according to the activation of gene programs, we will use UCell. In this example we will also use SignatuR that stores useful gene signatures (e.g cell cycle).

library(SignatuR)
# Load Human signatures
signatuR.list <- GetSignature(SignatuR$Hs)

# Select cycling signature and merge the two cycling phases
signatures <- list()
signatures[["Cycling"]] <- unlist(signatuR.list[c("cellCycle.G1S","cellCycle.G2M")])
  
# for IFN and tissue-residecy we will simply define them manually
signatures[["IFN"]] <- c("ISG15","IFI6","IFI44L","MX1")
signatures[["Tissue_Resident"]] <-  c("ITGAE")

Now let’s compute the gene signature scores using UCell

gueguen.cd3 <- AddModuleScore_UCell(gueguen.cd3, features = signatures, ncores = 8,
                       assay = "RNA")

We can visualize these scores on the UMAP, and verify their agreement with the original author annotations (CD4/8-MCM5, CD4/8-TOP2A cell clusters with high cycling score, and CD4/8-ISG15 cluster with high IFN response score).

FeaturePlot(gueguen.cd3, features = paste0(names(signatures), "_UCell"),
    order = T, pt.size = 0.3) & scale_color_paletteer_c("pals::coolwarm")

## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.

We can now classify each cell into low/high for each gene program, according to their signature scores.

Signatures scores distributions

We can explore signatures scores distributions and set a threshold to classify cells into negative/positive for each gene program. Here we will use a global threshold of 0.2.

qplot(gueguen.cd3$Cycling_UCell, main = "Cycling") + qplot(gueguen.cd3$IFN_UCell, main = "IFN") + qplot(gueguen.cd3$Tissue_Resident_UCell, main = "Tissue Resident") & theme_bw()

Example with the Cycling signature:

cycling.high <- subset(gueguen.cd3, subset = Cycling_UCell > 0.2)
cycling.low <- subset(gueguen.cd3, subset = Cycling_UCell <= 0.2)


cycling.list <- list(cycling.low,cycling.high)
names(cycling.list) <- c("cycling.low","cycling.high")

Visualize using ProjecTILs plot.states.radar function

plot.states.radar(ref = ref.cd8, query = cycling.list, labels.col = 'functional.cluster', min.cells = 5, genes4radar = c("SELL","LEF1","TCF7","CCR7","KLF2","GZMK", "GZMA", "FCGR3A","FGFBP2","XCL1","XCL2", "KLRB1",'MCM5',"MCM3","TOP2A"))

Here Cycling high and Cycling low cells display gene expression profiles in terms of marker genes, but differ in their expression of cell cycle genes, such as MCM5, MCM3 and TOP2A.

Now we can do this for the 4 signatures that we selected at the beginning.

wrap_plots(pll, guides = "collect")

As previously shown, CD8.TEX and CD8.TPEX populations contain the highest frequency of proliferating, tissue-resident and IFN-responding cells.

Bonus track

Upset plot to check program co-occurences

Let’s focus on the CD8.TEX population, and see how the different gene programs relate using a UpSet plot, using UpSetR package.

gueguen.cd3.TEX <- subset(gueguen.cd3, subset = functional.cluster == "CD8.TEX")
listInput <- list(IFN = which(gueguen.cd3.TEX$IFN.class == "Positive"), 
                  Cycling = which(gueguen.cd3.TEX$Cycling.class == "Positive"),
                  Tissue_resident = which(gueguen.cd3.TEX$Tissue_Resident.class == "Positive"))


upset(fromList(listInput), order.by = "freq") 
grid.text("CD8.TEX gene programs overlap",x = 0.65, y=0.95, gp=gpar(fontsize=12))

We can see that roughly one third of CD8.TEX with high tissue-residency program are also high for IFN-response program.

Comparison with the original annotation

In the original study, cycling cells are defined as “CD8.LAYN” and “CD8.GZMH”

Now let’s compare this with ProjecTILs classifications. First, let’s subset cells in the 2 cycling clusters CD4/8-TOP2A and CD4/8-MCM5.

gueguen.cycling <- subset(gueguen.cd3, subset = seurat_clusters %in% c('CD4/8-TOP2A', "CD4/8-MCM5"))

# Dimplot
DimPlot(gueguen.cycling, group.by = 'functional.cluster', label = T, repel = T, cols = mycols)

## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

We can have a look at clusters proportions in the original cycling clusters. It seems that there is more heterogeneity among cycling cells than previously observed, including Central Memory and Effector Memory T cells.

plot.statepred.composition(ref.cd8, gueguen.cycling, metric = "Percent") +
    ggtitle("Cycling cells") + ylim(0, 75) + theme_bw() + RotatedAxis()

Let’s check radar plots to confirm identities, where we included some markers genes for cell cycle: TOP2A, MCM5 and MCM3.

plot.states.radar(ref = ref.cd8, query = gueguen.cycling, labels.col = 'functional.cluster', min.cells = 5, genes4radar = c("SELL","LEF1","TCF7","CCR7","KLF2","GZMK", "GZMA", "FCGR3A","FGFBP2","XCL1","KLRB1",'MCM5',"MCM3","TOP2A","PDCD1","TOX","HAVCR2"))

The marker gene profiles support a correct classification.

Session Info

sessionInfo()

## R version 4.2.1 (2022-06-23)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices datasets  utils     methods  
## [8] base     
## 
## other attached packages:
##  [1] SignatuR_0.1.0      UpSetR_1.4.0        cowplot_1.1.1      
##  [4] plotly_4.10.1       pheatmap_1.0.12     UCell_2.2.0        
##  [7] paletteer_1.5.0     scales_1.2.1        ggrepel_0.9.2      
## [10] gridExtra_2.3       ProjecTILs_3.0.2    patchwork_1.1.2    
## [13] GEOquery_2.66.0     Biobase_2.58.0      BiocGenerics_0.44.0
## [16] data.table_1.14.6   STACAS_2.0.0        scGate_1.4.1       
## [19] forcats_0.5.2       stringr_1.5.0       dplyr_1.0.10       
## [22] purrr_1.0.1         readr_2.1.3         tidyr_1.2.1        
## [25] tibble_3.1.8        ggplot2_3.4.0       tidyverse_1.3.2    
## [28] SeuratObject_4.1.3  Seurat_4.3.0       
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.2.2                  spatstat.explore_3.0-5     
##   [3] reticulate_1.27             tidyselect_1.2.0           
##   [5] htmlwidgets_1.6.1           BiocParallel_1.32.5        
##   [7] Rtsne_0.16                  munsell_0.5.0              
##   [9] codetools_0.2-18            ica_1.0-3                  
##  [11] umap_0.2.9.0                future_1.30.0              
##  [13] miniUI_0.1.1.1              withr_2.5.0                
##  [15] spatstat.random_3.1-2       colorspace_2.1-0           
##  [17] progressr_0.13.0            highr_0.10                 
##  [19] knitr_1.41                  rstudioapi_0.14            
##  [21] stats4_4.2.1                SingleCellExperiment_1.20.0
##  [23] ROCR_1.0-11                 tensor_1.5                 
##  [25] listenv_0.9.0               labeling_0.4.2             
##  [27] MatrixGenerics_1.10.0       GenomeInfoDbData_1.2.9     
##  [29] polyclip_1.10-4             farver_2.1.1               
##  [31] parallelly_1.34.0           vctrs_0.5.2                
##  [33] generics_0.1.3              xfun_0.36                  
##  [35] timechange_0.2.0            R6_2.5.1                   
##  [37] GenomeInfoDb_1.34.7         rmdformats_1.0.4           
##  [39] pals_1.7                    bitops_1.0-7               
##  [41] spatstat.utils_3.0-1        cachem_1.0.6               
##  [43] DelayedArray_0.24.0         assertthat_0.2.1           
##  [45] promises_1.2.0.1            googlesheets4_1.0.1        
##  [47] gtable_0.3.1                globals_0.16.2             
##  [49] goftest_1.2-3               rlang_1.0.6                
##  [51] splines_4.2.1               lazyeval_0.2.2             
##  [53] gargle_1.2.1                dichromat_2.0-0.1          
##  [55] prismatic_1.1.1             spatstat.geom_3.0-5        
##  [57] broom_1.0.2                 BiocManager_1.30.19        
##  [59] yaml_2.3.7                  reshape2_1.4.4             
##  [61] abind_1.4-5                 modelr_0.1.10              
##  [63] backports_1.4.1             httpuv_1.6.8               
##  [65] tools_4.2.1                 bookdown_0.32              
##  [67] ellipsis_0.3.2              jquerylib_0.1.4            
##  [69] RColorBrewer_1.1-3          ggridges_0.5.4             
##  [71] Rcpp_1.0.10                 plyr_1.8.8                 
##  [73] zlibbioc_1.44.0             RCurl_1.98-1.9             
##  [75] openssl_2.0.5               deldir_1.0-6               
##  [77] pbapply_1.7-0               S4Vectors_0.36.1           
##  [79] zoo_1.8-11                  SummarizedExperiment_1.28.0
##  [81] haven_2.5.1                 cluster_2.1.4              
##  [83] fs_1.6.0                    magrittr_2.0.3             
##  [85] RSpectra_0.16-1             scattermore_0.8            
##  [87] lmtest_0.9-40               reprex_2.0.2               
##  [89] RANN_2.6.1                  googledrive_2.0.0          
##  [91] fitdistrplus_1.1-8          matrixStats_0.63.0         
##  [93] hms_1.1.2                   mime_0.12                  
##  [95] evaluate_0.20               xtable_1.8-4               
##  [97] readxl_1.4.1                IRanges_2.32.0             
##  [99] compiler_4.2.1              maps_3.4.1                 
## [101] KernSmooth_2.23-20          crayon_1.5.2               
## [103] htmltools_0.5.4             later_1.3.0                
## [105] tzdb_0.3.0                  lubridate_1.9.0            
## [107] DBI_1.1.3                   dbplyr_2.3.0               
## [109] MASS_7.3-58.2               data.tree_1.0.0            
## [111] Matrix_1.5-3                cli_3.6.0                  
## [113] parallel_4.2.1              igraph_1.3.5               
## [115] GenomicRanges_1.50.2        pkgconfig_2.0.3            
## [117] sp_1.6-0                    spatstat.sparse_3.0-0      
## [119] xml2_1.3.3                  bslib_0.4.2                
## [121] XVector_0.38.0              rvest_1.0.3                
## [123] digest_0.6.31               pracma_2.4.2               
## [125] sctransform_0.3.5           RcppAnnoy_0.0.20           
## [127] spatstat.data_3.0-0         rmarkdown_2.20             
## [129] cellranger_1.1.0            leiden_0.4.3               
## [131] uwot_0.1.14                 shiny_1.7.4                
## [133] lifecycle_1.0.3             nlme_3.1-161               
## [135] jsonlite_1.8.4              BiocNeighbors_1.16.0       
## [137] mapproj_1.2.9               askpass_1.1                
## [139] viridisLite_0.4.1           limma_3.54.0               
## [141] fansi_1.0.4                 pillar_1.8.1               
## [143] lattice_0.20-45             fastmap_1.1.0              
## [145] httr_1.4.4                  survival_3.5-0             
## [147] glue_1.6.2                  png_0.1-8                  
## [149] stringi_1.7.12              sass_0.4.5                 
## [151] rematch2_2.1.2              renv_0.15.5                
## [153] irlba_2.3.5.1               future.apply_1.10.0