Isolating target cell types from scATAC-seq data using scGate
Introduction
This vignette illustrates the application of scGate to isolate different target cell populations from scATAC-seq data.
For this demo, we will use a multi-modal dataset of paired ATAC and gene expression sequencing for human PBMCs, sequenced by 10x Chromium single-cell multi-ome technology. The data can be obtained directly from the 10x website: single-cell-multiome-atac-gex.
Make sure you download the following files (and save them to the ./data/ folder):
pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5
pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz
pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz.tbi
To translate chromatin accessibility into per-gene measurements, we will apply the simple but effective approach implemented in Signac, which calculates ‘Gene activity’ values by proximity of accessible peaks to gene body in terms of genomic coordinates. Other methods exists to infer per-gene counts from chromatic accessibility, and can be applied in the same manner as a pre-processing step to scGate.
The code for the pre-processing analysis in Seurat and Signac was partly pulled from: Multimodal analysis in Seurat
R environment
::restore()
renvlibrary(ggplot2)
library(dplyr)
library(UCell)
library(scGate)
library(viridis)
library(stringr)
library(patchwork)
library(Seurat)
library(biovizBase)
library(Signac)
library(EnsDb.Hsapiens.v86)
Load multi-modal data
First we will load the data for this demo, and store it in a Seurat object with two assays: one for gene expression data, the second for ATAC-seq data.
For more information on the generation of ‘ChromatinAssay’ objects refer to the Signac - getting started vignette.
#Before you begin, download and save the input data to the 'data/' folder
<- "./data/"
ddir
# The 10x H5file contains both data types
.10x <- Read10X_h5(sprintf("%s/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5",ddir))
inputdata
# Extract RNA and ATAC data
<- inputdata.10x$`Gene Expression`
rna_counts <- inputdata.10x$Peaks
atac_counts
# Create Seurat object on RNA counts
<- CreateSeuratObject(counts = rna_counts)
pbmc "percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc[[
# Add in the ATAC-seq data
<- StringToGRanges(rownames(atac_counts), sep = c(":", "-"))
grange.counts <- seqnames(grange.counts) %in% standardChromosomes(grange.counts)
grange.use <- atac_counts[as.vector(grange.use), ]
atac_counts <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
annotations
# Formatting genome information and annotations
<- str_replace(string=paste("chr",seqlevels(annotations),sep=""), pattern="chrMT", replacement="chrM")
ucsc.levels seqlevels(annotations) <- ucsc.levels
genome(annotations) <- "hg38"
seqlevelsStyle(annotations) <- 'UCSC'
# Load the fragment file
<- sprintf("%s/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz",ddir)
frag.file <- CreateChromatinAssay(
chrom_assay counts = atac_counts,
sep = c(":", "-"),
# genome = 'hg38',
fragments = frag.file,
min.cells = 10,
annotation = annotations
)"ATAC"]] <- chrom_assay pbmc[[
Pre-processing of GEX and ATAC data
A basic QC based can reveal outlier cells, which we will remove
VlnPlot(pbmc, features = c("nCount_ATAC", "nCount_RNA","percent.mt"), ncol = 3,
log = TRUE, pt.size = 0) + NoLegend()
Set some reasonable thresholds according to observed data distributions, and remove outliers
<- subset(
pbmc x = pbmc,
subset = nCount_ATAC < 7e4 &
> 5e3 &
nCount_ATAC < 25000 &
nCount_RNA > 1000 &
nCount_RNA < 20
percent.mt )
Standard pre-processing for the two data modalities is applied to normalize the data and calculate dimensionality reductions
- GEX pre-processing using the SCT normalization protocol
DefaultAssay(pbmc) <- "RNA"
<- SCTransform(pbmc, verbose = FALSE) %>% RunPCA() %>% RunUMAP(dims = 1:50, reduction.name = 'umap.rna', reduction.key = 'rnaUMAP_') pbmc
- ATAC pre-processing using the TF-IDF normalization, and singular-value decomposition (SVD) for dimensionality reduction
DefaultAssay(pbmc) <- "ATAC"
<- RunTFIDF(pbmc)
pbmc <- FindTopFeatures(pbmc, min.cutoff = 'q0')
pbmc <- RunSVD(pbmc)
pbmc # The first LSI dimension is excluded, as it typically correlates with sequencing depth
<- RunUMAP(pbmc, reduction = 'lsi', dims = 2:50, reduction.name = "umap.atac", reduction.key = "atacUMAP_") pbmc
The GeneActivity()
function from Signac allows
calculating per-gene counts from ATAC counts. We will store this
information in the ‘GeneActivity’ assay.
<- GeneActivity(pbmc) # This step may take several minutes
gene.activities
# Add the gene activity matrix to the Seurat object as a new assay and normalize it
'GeneActivity']] <- CreateAssayObject(counts = gene.activities)
pbmc[[<- NormalizeData(object = pbmc,assay = 'GeneActivity',normalization.method = 'LogNormalize',
pbmc scale.factor = median(pbmc$nCount_GeneActivity))
Low-dimensional representations
We can visualize Gene Activity counts for some classical immune markers in the UMAP space derived from ATAC counts
DefaultAssay(pbmc) <- 'GeneActivity'
<- c('LCK','CD3D','KLRD1','SPI1','MS4A1','CD36')
feats FeaturePlot(object = pbmc,features = feats ,reduction= "umap.atac",pt.size = 0.1,max.cutoff = 'q95',ncol = 3)
The same can be done for RNA counts
DefaultAssay(pbmc) <- 'SCT'
<- c('LCK','CD3D','KLRD1','SPI1','MS4A1','CD36')
feats FeaturePlot(object = pbmc,features = feats ,reduction= "umap.atac",pt.size = 0.1,max.cutoff = 'q95',ncol = 3)
scGate
On these processed data, we can now apply scGate to isolate target cell population based on ATAC-seq readouts or RNA expression. Importantly, we can use one modalitiy for gating and the other modality as a validation.
To gate based on ATAC counts, we will use the associated ‘Gene Activity’ values, which quantify the amount of accessible peaks per gene. Based on Gene Activity, how many genes are “expressed” in ATAC-seq?
DefaultAssay(pbmc) <- 'GeneActivity'
VlnPlot(pbmc, features=c("nFeature_SCT","nFeature_RNA","nFeature_GeneActivity"), pt.size = 0)
apply(pbmc@meta.data[,c("nFeature_SCT","nFeature_RNA","nFeature_GeneActivity")], MARGIN = 2, median)
## nFeature_SCT nFeature_RNA nFeature_GeneActivity
## 1847.0 1849.5 6695.0
Compared to scRNA-seq, more genes have non-zero counts in
ATAC-derived Gene Activities. We will adjust the maxRank
parameter of scGate to reflect this different distribution of values, to
approximately the median number of detected (non-zero) genes per
cell.
# For scRNA-seq, the default maxRank is 1500 (reflecting the number of detected genes).
# Increase it for scATAC-seq based on the distribution of detected genes
<- 7000 maxRank.ATAC
To verify presence of genetic signals, and test out signatures for target cell types, we can run UCell with custom signatures.
<- list("Tcell" = c("CD3E","CD3G","CD3D","CD2"),
signatures "Bcell"=c("MS4A1"),
"NK"=c("KLRD1","NKG7","NCR1","FCGR3A","CD3D-","CD3E-","CD8A-","CD4-"),
"Myeloid"=c("SPI1","CD36","LCK-"),
"Monocyte"=c("LYZ","CD14","SPI1","MAFB","MSR1"))
<- AddModuleScore_UCell(pbmc, features = signatures, assay = "GeneActivity", maxRank = maxRank.ATAC)
pbmc
FeaturePlot(pbmc, features = c("Tcell_UCell","Bcell_UCell","NK_UCell","Myeloid_UCell"),
reduction= "umap.atac", max.cutoff = 'q95')
Purifiying B Cells
To filter B cells, we can set up a simple gating model based on the
single gene MS4A1. We will specify assay = "GeneActivity"
to make scGate filter based on ATAC-derived counts.
# model definition
<- scGate::gating_model(name="Bcell.ATAC",signature = c("MS4A1"))
model
# filter data
<- scGate(data =pbmc, model = model, assay = "GeneActivity", output.col.name = "Bcell.ATAC",
pbmc maxRank = maxRank.ATAC)
Visualize gating results
# UMAP plot of purified population
DimPlot(pbmc, group.by = "Bcell.ATAC", reduction= "umap.atac",
cols = c(list("Impure" = "gray", "Pure" = "green"))) + ggtitle("scGate B cells on ATAC") + theme(aspect.ratio = 1)
How is the MS4A1 gene detected by the RNA-seq assay?
DefaultAssay(pbmc) <- "SCT"
<- VlnPlot(pbmc,features = c("MS4A1"), assay = "SCT",
a cols = c(list("Impure" = "gray", "Pure" = "green")),pt.size = 0)
<- FeaturePlot(object = pbmc, features = c('MS4A1'), reduction= "umap.atac",pt.size = 0.1,max.cutoff = 'q95') +
b ggtitle("MS4A1 RNA-seq") + theme(aspect.ratio = 1)
| b a
Purifiying T cells
Next, we aim to filter T cell population by using a signature composed of CD2 and CD3[edg] genes
# model definition
<- scGate::gating_model(name="Tcell.ATAC",signature = c("CD3E","CD3D","CD3G","CD2"))
model
#Filtering process
<- scGate(data = pbmc, model = model, assay = "GeneActivity", output.col.name = "Tcell.ATAC",
pbmc maxRank = maxRank.ATAC)
Visualize results
# UMAP plot
DimPlot(pbmc, group.by = "Tcell.ATAC", reduction= "umap.atac",
cols = c(list("Impure" = "gray", "Pure" = "green"))) + ggtitle("scGate T cells on ATAC") + theme(aspect.ratio = 1)
We can verify gating quality by RNA expression of key marker genes
# CD2 and CD3 expression levels
<- VlnPlot(pbmc,features = c("CD3E","CD2"), assay = "SCT",
a cols = c(list("Impure" = "gray", "Pure" = "green")),pt.size = 0)
# Feature plot of
DefaultAssay(pbmc) <- "SCT"
<- FeaturePlot(object = pbmc, features = c('CD3E'), reduction= "umap.atac",pt.size = 0.1,max.cutoff = 'q95') +
b ggtitle("CD3E RNA-seq") + theme(aspect.ratio = 1)
| b a
Purifying Monocytes
Some cells populations may require more complex gating strategies.
To filter CD14+ monocytes, we will set up a two layer gating model: level 1 aims at isolating myeloid cells, level 2 purifies CD14+ monocytes from myeloid cells. This kind of hierarchical gating can be effortlessly designed using scGate, and mimicks gating strategies commonly applied in flow cytometry.
#model definition
#first layer (Myeloid)
<- scGate::gating_model(name="Myeloid.ATAC",signature = c("SPI1","CD36","LCK-"))
model #second layer (Monocytes)
<- scGate::gating_model(model,name="Monocyte.ATAC",signature = c("CD14","MSR1","MAFB"),level = 2)
model
#apply scGate
<- scGate(data =pbmc, model = model, assay = "GeneActivity", output.col.name = "Monocyte.ATAC",
pbmc maxRank = maxRank.ATAC, save.levels = TRUE)
We can visualize gating results for each layer of the model
<- DimPlot(pbmc, group.by = "Monocyte.ATAC.level1", reduction= "umap.atac",
a cols = c(list("Impure" = "gray", "Pure" = "green"))) + ggtitle("Myeloid cells (ATAC)") + theme(aspect.ratio = 1)
<- DimPlot(pbmc, group.by = "Monocyte.ATAC", reduction= "umap.atac",
b cols = c(list("Impure" = "gray", "Pure" = "green"))) + ggtitle("CD14 Monocytes (ATAC)") + theme(aspect.ratio = 1)
| b a
Verify RNA expression of classical markers SPI1 (myeloid lineage) and CD14 (CD14 monocytes)
<- VlnPlot(pbmc,features = c("SPI1","CD14"), assay = "SCT",
a cols = c(list("Impure" = "gray", "Pure" = "green")),pt.size = 0,ncol = 2)
DefaultAssay(pbmc) <- "SCT"
<- FeaturePlot(object = pbmc, features = c('SPI1'), reduction= "umap.atac",pt.size = 0.1,max.cutoff = 'q95') +
b ggtitle("SPI1 RNA-seq") + theme(aspect.ratio = 1)
| b a
Purifying NK cells
To isolate NK cells, we can set up a gating model on two levels: level 1 to detect lymphoid cells; level 2 to purify NKs from the lymphoid subset.
# model definition
#Layer 1
<- scGate::gating_model(name="Lymphocyte.ATAC",signature = c("LCK"), positive = T) # positive set
model <- scGate::gating_model(model,name="Myeloid.ATAC",signature = c("SPI1"), positive = F) # negative set
model
# Layer 2
<- scGate::gating_model(model,name="NK.ATAC",signature = c("KLRD1","NCAM1","NCR1","CD8A-"), level = 2, positive = T) # positive set
model <- scGate::gating_model(model,name="Tcell.ATAC",signature = c("CD3G","CD3D","CD3E","CD2"), level = 2, positive = F) # negative set
model
#apply scgate
<- scGate(data =pbmc, model = model, assay = "GeneActivity", output.col.name = "NK.ATAC", maxRank = maxRank.ATAC, save.levels = TRUE) pbmc
Visualize the results
# Layer 1 (lymphoid cells)
<- DimPlot(pbmc, group.by = "NK.ATAC.level1", reduction= "umap.atac",
a cols = c(list("Impure" = "gray", "Pure" = "green"))) + ggtitle("scGate Lymphoid (ATAC)") + theme(aspect.ratio = 1)
# Layer 2 (NK cells)
<- DimPlot(pbmc, group.by = "NK.ATAC", reduction= "umap.atac",
b cols = c(list("Impure" = "gray", "Pure" = "green"))) + ggtitle("scGate NKs (ATAC)") + theme(aspect.ratio = 1)
| b a
Verify RNA expression of NK markers (KLRD1 and NCAM1) in pure/impure populations
<- VlnPlot(pbmc,features = c("KLRD1","NCAM1","CD3D"), assay = "SCT",
b cols = c(list("Impure" = "gray", "Pure" = "green")),pt.size = 0, ncol=3)
DefaultAssay(pbmc) <- "SCT"
<- FeaturePlot(object = pbmc, features = c('KLRD1',"CD3D"), reduction= "umap.atac",pt.size = 0.1,max.cutoff = 'q95', combine = F)
pll 1]] <- pll[[1]] + ggtitle("KLRD1 RNA-seq") + theme(aspect.ratio = 1)
pll[[2]] <- pll[[2]] + ggtitle("CD3D RNA-seq") + theme(aspect.ratio = 1)
pll[[
<- wrap_plots(pll)
d
/ d b
Further reading
The scGate package and installation instructions are available at: scGate package
The code for this demo (and additional demos) can be found on GitHub
References
Multi-modal dataset of paired ATAC and gene expression sequencing for human PBMCs from 10x - Download link (requires sign-in)
Andreatta, M., Carmona, S. J. (2021) UCell: Robust and scalable single-cell gene signature scoring Computational and Structural Biotechnology Journal
Stuart, T., Srivastava, A., Madad, S., Lareau, C.A., & Satija, R. (2021) Single-cell chromatin state analysis with Signac Nature Methods