pyUCell - running time¶
In this notebook, we generate arbitrarily large single-cell datasets to evaluate the running time of pyUCell, and we show the influence of key parameters such as the number of parallel jobs and the chunk size used for parallel processing.
import anndata as ad
import scanpy as sc
import matplotlib.pyplot as plt
import pyucell
import math
import warnings
warnings.filterwarnings("ignore")
Load a test dataset
adata = sc.datasets.pbmc3k()
adata
AnnData object with n_obs × n_vars = 2700 × 32738
var: 'gene_ids'
Make a large dataset (e.g. 100k cells) by concatenating multiple copies of the test object.
def make_large_set(target_n_cells=100_000):
    n_cells = adata.n_obs
    n_repeats = math.ceil(target_n_cells / n_cells)
    adatas = [adata.copy() for _ in range(n_repeats)]
    adata_big = ad.concat(adatas, axis=0, merge="same")
    # Trim to exact size
    adata_big = adata_big[:target_n_cells].copy()
    # Make cell IDs unique
    adata_big.obs_names_make_unique()
    return adata_big
adata_big = make_large_set(target_n_cells=100_000)
adata_big
AnnData object with n_obs × n_vars = 100000 × 32738
var: 'gene_ids'
Define a few simple signatures to test. In the CD4T and CD8T signatures, the '+' and '-' suffixes mark genes expected to be expressed or absent in the cell type, respectively.
signatures = {
    'Tcell': ['CD3D', 'CD3E', 'CD2'],
    'Bcell': ['MS4A1', 'CD79A', 'CD79B'],
    'CD4T': ['CD2', 'CD4+', 'CD40LG+', 'CD8A-', 'CD8B-'],
    'CD8T': ['CD4-', 'CD40LG-', 'CD8A+', 'CD8B+']
}
Run UCell!
%%time
pyucell.compute_ucell_scores(adata_big, signatures=signatures)
CPU times: user 1.83 s, sys: 1.65 s, total: 3.49 s
Wall time: 5.76 s
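The scores are stored as new per-cell columns in adata_big.obs. A quick sanity check, assuming the default naming appends a "_UCell" suffix to each signature name (verify against adata_big.obs.columns for your pyucell version):

# Assumed naming convention: one "<signature>_UCell" column per signature
score_cols = [c for c in adata_big.obs.columns if c.endswith("_UCell")]
adata_big.obs[score_cols].describe()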
Number of jobs¶
By default, pyUCell uses all available cores. We can manually change the number of parallel jobs with the n_jobs parameter.
%%time
pyucell.compute_ucell_scores(adata_big, signatures=signatures, n_jobs=4)
CPU times: user 1.78 s, sys: 681 ms, total: 2.47 s
Wall time: 7.97 s
%%time
pyucell.compute_ucell_scores(adata_big, signatures=signatures, n_jobs=1)
CPU times: user 16.7 s, sys: 181 ms, total: 16.9 s
Wall time: 16.9 s
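Rather than re-running %%time cells by hand, the scan over n_jobs can be automated with the standard-library time module. A minimal sketch (the tested values are arbitrary):

import time

# Time the same scoring call under different levels of parallelism
for n_jobs in [1, 2, 4, 8]:
    t0 = time.perf_counter()
    pyucell.compute_ucell_scores(adata_big, signatures=signatures, n_jobs=n_jobs)
    print(f"n_jobs={n_jobs}: {time.perf_counter() - t0:.2f} s")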
Chunk size¶
UCell processes expression matrices in mini-batches, or “chunks”, typically consisting of a few hundred cells. This approach significantly reduces the method’s memory footprint and enables parallel execution of mini-batches over multiple threads.
%%time
pyucell.compute_ucell_scores(adata_big, signatures=signatures, n_jobs=8, chunk_size=100)
CPU times: user 7.8 s, sys: 2.73 s, total: 10.5 s
Wall time: 17 s
%%time
pyucell.compute_ucell_scores(adata_big, signatures=signatures, n_jobs=8, chunk_size=500)
CPU times: user 1.8 s, sys: 942 ms, total: 2.74 s
Wall time: 5.35 s
%%time
pyucell.compute_ucell_scores(adata_big, signatures=signatures, n_jobs=8, chunk_size=1000)
CPU times: user 918 ms, sys: 773 ms, total: 1.69 s
Wall time: 5.24 s
%%time
pyucell.compute_ucell_scores(adata_big, signatures=signatures, n_jobs=8, chunk_size=10000)
CPU times: user 109 ms, sys: 395 ms, total: 503 ms
Wall time: 6.62 s
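The same pattern extends to chunk_size. The sketch below sweeps a few values and plots wall time against chunk size; the values and the choice of n_jobs=8 are illustrative:

import time

# Sweep chunk sizes and record wall time for each run
chunk_sizes = [100, 500, 1000, 5000, 10000]
wall_times = []
for cs in chunk_sizes:
    t0 = time.perf_counter()
    pyucell.compute_ucell_scores(adata_big, signatures=signatures, n_jobs=8, chunk_size=cs)
    wall_times.append(time.perf_counter() - t0)

# Wall time as a function of chunk size (log-scaled x axis)
plt.plot(chunk_sizes, wall_times, marker="o")
plt.xscale("log")
plt.xlabel("chunk_size")
plt.ylabel("wall time (s)")
plt.show()

Very small chunks pay a per-chunk scheduling overhead, while very large chunks limit parallelism, consistent with the timings above.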
A larger dataset¶
Generate a dataset of half a million cells and test the running time.
adata_big = make_large_set(target_n_cells=500_000)
adata_big
AnnData object with n_obs × n_vars = 500000 × 32738
var: 'gene_ids'
Run on 500k cells and check running time.
%%time
pyucell.compute_ucell_scores(adata_big, signatures=signatures)
CPU times: user 40.2 s, sys: 12.6 s, total: 52.8 s
Wall time: 1min
Conclusion¶
pyUCell scales well to large datasets: it can process on the order of 10^5 cells in seconds to minutes. Parallelization parameters can be adapted to the available computational resources to further improve execution speed.
See also¶
pyUCell GitHub repository: https://github.com/carmonalab/pyucell
pyUCell on PyPI: https://pypi.org/project/pyucell/
pyUCell documentation: https://pyucell.readthedocs.io/en/latest/