pyUCell - running time¶

In this notebook, we generate arbitrarily large single-cell datasets to evaluate the running time of pyUCell. We show the influence of key parameters such as the number of parallel cores and the "chunk size" used for parallel processing.

In [126]:
import anndata as ad
import scanpy as sc
import matplotlib.pyplot as plt
from scipy import sparse
from scipy.stats import rankdata
import numpy as np
import pyucell
import math
import warnings
warnings.filterwarnings("ignore")

Load a test dataset.

In [127]:
adata = sc.datasets.pbmc3k()
In [128]:
adata
Out[128]:
AnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'

Make a large dataset (e.g. 100k cells) by concatenating multiple copies of the test object.

In [129]:
def make_large_set(target_n_cells=100_000):
    """Build a dataset of ~target_n_cells cells by stacking copies of `adata`."""
    n_cells = adata.n_obs
    n_repeats = math.ceil(target_n_cells / n_cells)
    
    adatas = [adata.copy() for _ in range(n_repeats)]
    adata_big = ad.concat(adatas, axis=0, merge="same")
    
    # Trim to the exact target size
    adata_big = adata_big[:target_n_cells].copy()
    # Make cell IDs unique
    adata_big.obs_names_make_unique()
    
    return adata_big
In [130]:
adata_big = make_large_set(target_n_cells=100_000)
adata_big
Out[130]:
AnnData object with n_obs × n_vars = 100000 × 32738
    var: 'gene_ids'
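
As a quick sanity check, the trimmed object should contain exactly the requested number of cells, and obs_names_make_unique should have de-duplicated the cell barcodes:

In [ ]:
# Sanity check: exact cell count and unique barcodes after concatenation
assert adata_big.n_obs == 100_000
assert adata_big.obs_names.is_unique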

Define a few simple signatures to test. A gene may carry a + or - suffix to flag it as a positive or negative marker, respectively, as in the CD4T and CD8T signatures below.

In [131]:
signatures = {
    'Tcell': ['CD3D', 'CD3E', 'CD2'],
    'Bcell': ['MS4A1', 'CD79A', 'CD79B'],
    'CD4T': ['CD2', 'CD4+', 'CD40LG+', 'CD8A-', 'CD8B-'],
    'CD8T': ['CD4-', 'CD40LG-', 'CD8A+', 'CD8B+']
}

Run UCell!

In [132]:
%%time
pyucell.compute_ucell_scores(adata_big, signatures=signatures)
CPU times: user 1.83 s, sys: 1.65 s, total: 3.49 s
Wall time: 5.76 s
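
The computed scores are stored on the AnnData object. A quick way to locate them, assuming (as in UCell's R implementation) that the score columns are written to .obs with "UCell" in their name; check the pyUCell documentation for the exact naming convention:

In [ ]:
# Assumes score columns land in .obs with "UCell" in the name (verify
# against your pyUCell version); yields an empty frame otherwise
score_cols = [c for c in adata_big.obs.columns if "UCell" in c]
adata_big.obs[score_cols].head()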

Number of jobs¶

By default, pyUCell uses all available cores. We can manually change the number of parallel jobs with the n_jobs parameter.

In [133]:
%%time
pyucell.compute_ucell_scores(adata_big, signatures=signatures, n_jobs=4)
CPU times: user 1.78 s, sys: 681 ms, total: 2.47 s
Wall time: 7.97 s
In [134]:
%%time
pyucell.compute_ucell_scores(adata_big, signatures=signatures, n_jobs=1)
CPU times: user 16.7 s, sys: 181 ms, total: 16.9 s
Wall time: 16.9 s
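
To scan several values of n_jobs in one go, rather than with separate %%time cells, a simple loop with time.perf_counter works. The sketch below is illustrative; timings will vary with hardware and system load:

In [ ]:
import time

# Sweep a few n_jobs values and report the wall time of each run
for n in (1, 2, 4, 8):
    t0 = time.perf_counter()
    pyucell.compute_ucell_scores(adata_big, signatures=signatures, n_jobs=n)
    print(f"n_jobs={n}: {time.perf_counter() - t0:.1f} s")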

Chunk size¶

UCell processes expression matrices in mini-batches, or “chunks”, typically consisting of a few hundred cells. This approach significantly reduces the method’s memory footprint and enables parallel execution of mini-batches across multiple threads. The chunk_size parameter controls the number of cells per chunk.
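
To see why chunking bounds memory, recall that UCell is rank-based: for each cell, genes are ranked by expression, which requires dense row-wise operations. The sketch below is a conceptual illustration, not pyUCell's actual internals; it densifies and ranks a single chunk, showing why peak memory scales with the chunk size rather than with the full dataset:

In [ ]:
# Densify and rank one chunk of cells at a time: peak memory then scales
# with chunk_size * n_vars instead of n_obs * n_vars
chunk = adata_big.X[:500].toarray()   # only 500 cells are densified
ranks = rankdata(-chunk, axis=1)      # per-cell ranks; high expression = low rank
ranks.shape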

In [135]:
%%time
pyucell.compute_ucell_scores(adata_big, signatures=signatures, n_jobs=8, chunk_size=100)
CPU times: user 7.8 s, sys: 2.73 s, total: 10.5 s
Wall time: 17 s
In [136]:
%%time
pyucell.compute_ucell_scores(adata_big, signatures=signatures, n_jobs=8, chunk_size=500)
CPU times: user 1.8 s, sys: 942 ms, total: 2.74 s
Wall time: 5.35 s
In [137]:
%%time
pyucell.compute_ucell_scores(adata_big, signatures=signatures, n_jobs=8, chunk_size=1000)
CPU times: user 918 ms, sys: 773 ms, total: 1.69 s
Wall time: 5.24 s
In [138]:
%%time
pyucell.compute_ucell_scores(adata_big, signatures=signatures, n_jobs=8, chunk_size=10000)
CPU times: user 109 ms, sys: 395 ms, total: 503 ms
Wall time: 6.62 s
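
The trade-off is easier to see when plotted. Using the wall times measured above (100k cells, n_jobs=8): very small chunks add per-chunk scheduling overhead, while very large chunks leave workers idle.

In [ ]:
# Wall times measured in the cells above (seconds)
chunk_sizes = [100, 500, 1000, 10000]
wall_times = [17.0, 5.35, 5.24, 6.62]

plt.plot(chunk_sizes, wall_times, marker="o")
plt.xscale("log")
plt.xlabel("chunk_size")
plt.ylabel("wall time (s)")
plt.title("Running time vs. chunk size (100k cells, n_jobs=8)")
plt.show()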

A larger dataset¶

Generate a dataset of half a million cells and test the running time.

In [139]:
adata_big = make_large_set(target_n_cells=500_000)
adata_big
Out[139]:
AnnData object with n_obs × n_vars = 500000 × 32738
    var: 'gene_ids'

Run on 500k cells and check running time.

In [140]:
%%time
pyucell.compute_ucell_scores(adata_big, signatures=signatures)
CPU times: user 40.2 s, sys: 12.6 s, total: 52.8 s
Wall time: 1min

Conclusion¶

pyUCell scales well to large datasets and can process on the order of 10^5 cells in seconds to minutes. Parallelization parameters can be adapted to the available computational resources to further improve execution speed.

See also¶

pyUCell GitHub repository: https://github.com/carmonalab/pyucell

pyUCell on PyPI: https://pypi.org/project/pyucell/

pyUCell documentation: https://pyucell.readthedocs.io/en/latest/
