Isolating target cell types from scATAC-seq data using scGate

Introduction

This vignette illustrates the application of scGate to isolate different target cell populations from scATAC-seq data.

For this demo, we will use a multi-modal dataset of paired ATAC and gene expression sequencing for human PBMCs, sequenced by 10x Chromium single-cell multi-ome technology. The data can be obtained directly from the 10x website: single-cell-multiome-atac-gex.

Make sure you download the following files (and save them to the ./data/ folder):

pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5
pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz
pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz.tbi

To translate chromatin accessibility into per-gene measurements, we will apply the simple but effective approach implemented in Signac, which calculates ‘Gene activity’ values by proximity of accessible peaks to gene body in terms of genomic coordinates. Other methods exists to infer per-gene counts from chromatic accessibility, and can be applied in the same manner as a pre-processing step to scGate.

The code for the pre-processing analysis in Seurat and Signac was partly pulled from: Multimodal analysis in Seurat

R environment

renv::restore()
library(ggplot2)
library(dplyr)
library(UCell)
library(scGate)
library(viridis)
library(stringr)   
library(patchwork)

library(Seurat)
library(biovizBase)
library(Signac)
library(EnsDb.Hsapiens.v86)

Load multi-modal data

First we will load the data for this demo, and store it in a Seurat object with two assays: one for gene expression data, the second for ATAC-seq data.

For more information on the generation of ‘ChromatinAssay’ objects refer to the Signac - getting started vignette.

#Before you begin, download and save the input data to the 'data/' folder
ddir <- "./data/"

# The 10x H5file contains both data types
inputdata.10x <- Read10X_h5(sprintf("%s/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5",ddir))

# Extract RNA and ATAC data
rna_counts <- inputdata.10x$`Gene Expression`
atac_counts <- inputdata.10x$Peaks

# Create Seurat object on RNA counts
pbmc <- CreateSeuratObject(counts = rna_counts)
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

# Add in the ATAC-seq data
grange.counts <- StringToGRanges(rownames(atac_counts), sep = c(":", "-"))
grange.use <- seqnames(grange.counts) %in% standardChromosomes(grange.counts)
atac_counts <- atac_counts[as.vector(grange.use), ]
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)

# Formatting genome information and annotations
ucsc.levels <- str_replace(string=paste("chr",seqlevels(annotations),sep=""), pattern="chrMT", replacement="chrM")
seqlevels(annotations) <- ucsc.levels

genome(annotations) <- "hg38"
seqlevelsStyle(annotations) <- 'UCSC'

# Load the fragment file
frag.file <- sprintf("%s/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz",ddir)
chrom_assay <- CreateChromatinAssay(
   counts = atac_counts,
   sep = c(":", "-"),
 #  genome = 'hg38',
   fragments = frag.file,
   min.cells = 10,
   annotation = annotations
 )
pbmc[["ATAC"]] <- chrom_assay

Pre-processing of GEX and ATAC data

A basic QC based can reveal outlier cells, which we will remove

VlnPlot(pbmc, features = c("nCount_ATAC", "nCount_RNA","percent.mt"), ncol = 3,
  log = TRUE, pt.size = 0) + NoLegend()

Set some reasonable thresholds according to observed data distributions, and remove outliers

pbmc <- subset(
  x = pbmc,
  subset = nCount_ATAC < 7e4 &
    nCount_ATAC > 5e3 &
    nCount_RNA < 25000 &
    nCount_RNA > 1000 &
    percent.mt < 20
)

Standard pre-processing for the two data modalities is applied to normalize the data and calculate dimensionality reductions

GEX pre-processing using the SCT normalization protocol

DefaultAssay(pbmc) <- "RNA"
pbmc <- SCTransform(pbmc, verbose = FALSE) %>% RunPCA() %>% RunUMAP(dims = 1:50, reduction.name = 'umap.rna', reduction.key = 'rnaUMAP_')

ATAC pre-processing using the TF-IDF normalization, and singular-value decomposition (SVD) for dimensionality reduction

DefaultAssay(pbmc) <- "ATAC"
pbmc <- RunTFIDF(pbmc)
pbmc <- FindTopFeatures(pbmc, min.cutoff = 'q0')
pbmc <- RunSVD(pbmc)
# The first LSI dimension is excluded, as it typically correlates with sequencing depth
pbmc <- RunUMAP(pbmc, reduction = 'lsi', dims = 2:50, reduction.name = "umap.atac", reduction.key = "atacUMAP_")

The GeneActivity() function from Signac allows calculating per-gene counts from ATAC counts. We will store this information in the ‘GeneActivity’ assay.

gene.activities <- GeneActivity(pbmc) # This step may take several minutes

# Add the gene activity matrix to the Seurat object as a new assay and normalize it
pbmc[['GeneActivity']] <- CreateAssayObject(counts = gene.activities)
pbmc <- NormalizeData(object = pbmc,assay = 'GeneActivity',normalization.method = 'LogNormalize',
                      scale.factor = median(pbmc$nCount_GeneActivity))

Low-dimensional representations

We can visualize Gene Activity counts for some classical immune markers in the UMAP space derived from ATAC counts

DefaultAssay(pbmc) <- 'GeneActivity'

feats <- c('LCK','CD3D','KLRD1','SPI1','MS4A1','CD36')
FeaturePlot(object = pbmc,features = feats ,reduction= "umap.atac",pt.size = 0.1,max.cutoff = 'q95',ncol = 3)

The same can be done for RNA counts

DefaultAssay(pbmc) <- 'SCT'

feats <- c('LCK','CD3D','KLRD1','SPI1','MS4A1','CD36')
FeaturePlot(object = pbmc,features = feats ,reduction= "umap.atac",pt.size = 0.1,max.cutoff = 'q95',ncol = 3)

scGate

On these processed data, we can now apply scGate to isolate target cell population based on ATAC-seq readouts or RNA expression. Importantly, we can use one modalitiy for gating and the other modality as a validation.

To gate based on ATAC counts, we will use the associated ‘Gene Activity’ values, which quantify the amount of accessible peaks per gene. Based on Gene Activity, how many genes are “expressed” in ATAC-seq?

DefaultAssay(pbmc) <- 'GeneActivity'

VlnPlot(pbmc, features=c("nFeature_SCT","nFeature_RNA","nFeature_GeneActivity"), pt.size = 0)

apply(pbmc@meta.data[,c("nFeature_SCT","nFeature_RNA","nFeature_GeneActivity")], MARGIN = 2, median)
##          nFeature_SCT          nFeature_RNA nFeature_GeneActivity 
##                1847.0                1849.5                6695.0

Compared to scRNA-seq, more genes have non-zero counts in ATAC-derived Gene Activities. We will adjust the maxRank parameter of scGate to reflect this different distribution of values, to approximately the median number of detected (non-zero) genes per cell.

# For scRNA-seq, the default maxRank is 1500 (reflecting the number of detected genes).
# Increase it for scATAC-seq based on the distribution of detected genes
maxRank.ATAC <- 7000

To verify presence of genetic signals, and test out signatures for target cell types, we can run UCell with custom signatures.

signatures <- list("Tcell" = c("CD3E","CD3G","CD3D","CD2"),
                   "Bcell"=c("MS4A1"),
                   "NK"=c("KLRD1","NKG7","NCR1","FCGR3A","CD3D-","CD3E-","CD8A-","CD4-"),
                   "Myeloid"=c("SPI1","CD36","LCK-"),
                   "Monocyte"=c("LYZ","CD14","SPI1","MAFB","MSR1"))

pbmc <- AddModuleScore_UCell(pbmc, features = signatures, assay = "GeneActivity", maxRank = maxRank.ATAC)

FeaturePlot(pbmc, features = c("Tcell_UCell","Bcell_UCell","NK_UCell","Myeloid_UCell"), 
            reduction= "umap.atac", max.cutoff = 'q95')

Purifiying B Cells

To filter B cells, we can set up a simple gating model based on the single gene MS4A1. We will specify assay = "GeneActivity" to make scGate filter based on ATAC-derived counts.

# model definition
model <- scGate::gating_model(name="Bcell.ATAC",signature = c("MS4A1"))

# filter data 
pbmc <- scGate(data =pbmc, model = model, assay = "GeneActivity", output.col.name = "Bcell.ATAC", 
                 maxRank = maxRank.ATAC)

Visualize gating results

# UMAP plot of purified population
DimPlot(pbmc, group.by = "Bcell.ATAC", reduction= "umap.atac",
             cols = c(list("Impure" = "gray", "Pure" = "green"))) + ggtitle("scGate B cells on ATAC") + theme(aspect.ratio = 1)

How is the MS4A1 gene detected by the RNA-seq assay?

DefaultAssay(pbmc) <- "SCT"

a <- VlnPlot(pbmc,features = c("MS4A1"), assay = "SCT",
             cols = c(list("Impure" = "gray", "Pure" = "green")),pt.size = 0) 

b <- FeaturePlot(object = pbmc, features = c('MS4A1'), reduction= "umap.atac",pt.size = 0.1,max.cutoff = 'q95') + 
  ggtitle("MS4A1 RNA-seq") + theme(aspect.ratio = 1)

a | b

Purifiying T cells

Next, we aim to filter T cell population by using a signature composed of CD2 and CD3[edg] genes

# model definition
model <- scGate::gating_model(name="Tcell.ATAC",signature = c("CD3E","CD3D","CD3G","CD2"))

#Filtering process
pbmc <- scGate(data = pbmc, model = model, assay = "GeneActivity", output.col.name = "Tcell.ATAC", 
                 maxRank = maxRank.ATAC)

Visualize results

# UMAP plot 
DimPlot(pbmc, group.by = "Tcell.ATAC", reduction= "umap.atac",
             cols = c(list("Impure" = "gray", "Pure" = "green"))) + ggtitle("scGate T cells on ATAC") + theme(aspect.ratio = 1)

We can verify gating quality by RNA expression of key marker genes

# CD2 and CD3 expression levels
a <- VlnPlot(pbmc,features = c("CD3E","CD2"), assay = "SCT",
             cols = c(list("Impure" = "gray", "Pure" = "green")),pt.size = 0)

# Feature plot of 
DefaultAssay(pbmc) <- "SCT"
b <- FeaturePlot(object = pbmc, features = c('CD3E'), reduction= "umap.atac",pt.size = 0.1,max.cutoff = 'q95') + 
  ggtitle("CD3E RNA-seq") + theme(aspect.ratio = 1)

a | b

Purifying Monocytes

Some cells populations may require more complex gating strategies.

To filter CD14+ monocytes, we will set up a two layer gating model: level 1 aims at isolating myeloid cells, level 2 purifies CD14+ monocytes from myeloid cells. This kind of hierarchical gating can be effortlessly designed using scGate, and mimicks gating strategies commonly applied in flow cytometry.

#model definition 
#first layer (Myeloid)
model <- scGate::gating_model(name="Myeloid.ATAC",signature = c("SPI1","CD36","LCK-"))
#second layer (Monocytes)
model <- scGate::gating_model(model,name="Monocyte.ATAC",signature = c("CD14","MSR1","MAFB"),level = 2)

#apply scGate
pbmc <- scGate(data =pbmc, model = model, assay = "GeneActivity", output.col.name = "Monocyte.ATAC", 
                 maxRank = maxRank.ATAC, save.levels = TRUE)

We can visualize gating results for each layer of the model


a <- DimPlot(pbmc, group.by = "Monocyte.ATAC.level1", reduction= "umap.atac",
             cols = c(list("Impure" = "gray", "Pure" = "green"))) + ggtitle("Myeloid cells (ATAC)") + theme(aspect.ratio = 1)

b <- DimPlot(pbmc, group.by = "Monocyte.ATAC", reduction= "umap.atac",
             cols = c(list("Impure" = "gray", "Pure" = "green"))) + ggtitle("CD14 Monocytes (ATAC)") + theme(aspect.ratio = 1)

a | b

Verify RNA expression of classical markers SPI1 (myeloid lineage) and CD14 (CD14 monocytes)

a <- VlnPlot(pbmc,features = c("SPI1","CD14"), assay = "SCT",
             cols = c(list("Impure" = "gray", "Pure" = "green")),pt.size = 0,ncol = 2)

DefaultAssay(pbmc) <- "SCT"
b <- FeaturePlot(object = pbmc, features = c('SPI1'), reduction= "umap.atac",pt.size = 0.1,max.cutoff = 'q95') + 
  ggtitle("SPI1 RNA-seq") + theme(aspect.ratio = 1)

a | b

Purifying NK cells

To isolate NK cells, we can set up a gating model on two levels: level 1 to detect lymphoid cells; level 2 to purify NKs from the lymphoid subset.

# model definition
#Layer 1
model <- scGate::gating_model(name="Lymphocyte.ATAC",signature = c("LCK"), positive = T)  # positive set
model <- scGate::gating_model(model,name="Myeloid.ATAC",signature = c("SPI1"), positive = F)  # negative set

# Layer 2
model <- scGate::gating_model(model,name="NK.ATAC",signature = c("KLRD1","NCAM1","NCR1","CD8A-"), level = 2, positive = T)  # positive set
model <- scGate::gating_model(model,name="Tcell.ATAC",signature = c("CD3G","CD3D","CD3E","CD2"), level = 2, positive = F)   # negative set

#apply scgate
pbmc <- scGate(data =pbmc, model = model, assay = "GeneActivity", output.col.name = "NK.ATAC", maxRank = maxRank.ATAC, save.levels = TRUE)

Visualize the results

# Layer 1 (lymphoid cells)
a <- DimPlot(pbmc, group.by = "NK.ATAC.level1", reduction= "umap.atac",
             cols = c(list("Impure" = "gray", "Pure" = "green"))) + ggtitle("scGate Lymphoid (ATAC)") + theme(aspect.ratio = 1)

# Layer 2 (NK cells)
b <- DimPlot(pbmc, group.by = "NK.ATAC", reduction= "umap.atac",
             cols = c(list("Impure" = "gray", "Pure" = "green"))) + ggtitle("scGate NKs (ATAC)") + theme(aspect.ratio = 1)

a | b

Verify RNA expression of NK markers (KLRD1 and NCAM1) in pure/impure populations

b <- VlnPlot(pbmc,features = c("KLRD1","NCAM1","CD3D"), assay = "SCT",
             cols = c(list("Impure" = "gray", "Pure" = "green")),pt.size = 0, ncol=3)

DefaultAssay(pbmc) <- "SCT"
pll <- FeaturePlot(object = pbmc, features = c('KLRD1',"CD3D"), reduction= "umap.atac",pt.size = 0.1,max.cutoff = 'q95', combine = F)
pll[[1]] <- pll[[1]] + ggtitle("KLRD1 RNA-seq") + theme(aspect.ratio = 1)
pll[[2]] <- pll[[2]] + ggtitle("CD3D RNA-seq") + theme(aspect.ratio = 1)

d <- wrap_plots(pll)

b / d

References

Multi-modal dataset of paired ATAC and gene expression sequencing for human PBMCs from 10x - Download link (requires sign-in)
Andreatta, M., Carmona, S. J. (2021) UCell: Robust and scalable single-cell gene signature scoring Computational and Structural Biotechnology Journal
Stuart, T., Srivastava, A., Madad, S., Lareau, C.A., & Satija, R. (2021) Single-cell chromatin state analysis with Signac Nature Methods

Isolating target cell types from scATAC-seq data using scGate