Conversation
**Codecov Report** ❌ Patch coverage is

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #3983      +/-   ##
==========================================
+ Coverage   77.96%   78.21%   +0.24%
==========================================
  Files         118      119       +1
  Lines       12517    12672     +155
==========================================
+ Hits         9759     9911     +152
- Misses       2758     2761       +3
```

Flags with carried forward coverage won't be shown.
src/scanpy/preprocessing/_simple.py (outdated)

```python
for rowidx, sub_rng in zip(
    under_target, rng.spawn(len(under_target)), strict=True
):
    _downsample_array(
```
I think spawn is more for parallel independent streams. It is correct for reproducibility, but it might be adding performance overhead for an independence that is usually not needed. So far I have seen spawn used either when an rng is given to another function while the current code flow keeps its own rng independent, or in parallel cases; not in a sequential for loop. My argument goes like this: if this for loop were unrolled and the code written sequentially, would we still use spawn? I think no, because we don't do that in sparse multiply, for example.
In fact, I ran an AI-generated script and these were the results. For some reason _FakeRandomGen is super slow, but the numbers also show that spawn has real overhead.
```
python bench_downsample.py
Warming up numba JIT...
Done.

counts_per_cell=50 n_obs=500 n_vars=200 nnz=29840
  legacy (random_state=0)        20.07 ms
  real Generator (with spawn)     8.30 ms
  real Generator (no spawn)       4.53 ms
  --> spawn overhead: +83%   real vs legacy: -59%

counts_per_cell=100 n_obs=5000 n_vars=500 nnz=746289
  legacy (random_state=0)       399.50 ms
  real Generator (with spawn)    86.80 ms
  real Generator (no spawn)      53.74 ms
  --> spawn overhead: +62%   real vs legacy: -78%

counts_per_cell=50 n_obs=20000 n_vars=1000 nnz=5969861
  legacy (random_state=0)      2968.35 ms
  real Generator (with spawn)   320.06 ms
  real Generator (no spawn)     185.98 ms
  --> spawn overhead: +72%   real vs legacy: -89%
```
```
❯ python bench_downsample.py
Warming up numba JIT...
Done.

counts_per_cell=50 n_obs=500 n_vars=200 nnz=29840
  legacy (_FakeRandomGen)        18.96 ms
  real Generator (with spawn)     7.77 ms
  real Generator (no spawn)       4.22 ms
  --> spawn overhead: +84%   real vs legacy: -59%

counts_per_cell=100 n_obs=5000 n_vars=500 nnz=746289
  legacy (_FakeRandomGen)       412.81 ms
  real Generator (with spawn)    88.76 ms
  real Generator (no spawn)      54.64 ms
  --> spawn overhead: +62%   real vs legacy: -78%

counts_per_cell=50 n_obs=20000 n_vars=1000 nnz=5969861
  legacy (_FakeRandomGen)      2956.91 ms
  real Generator (with spawn)   322.48 ms
  real Generator (no spawn)     187.00 ms
  --> spawn overhead: +72%   real vs legacy: -89%
```
the script: https://gist.github.com/selmanozleyen/2046c849bb10f541642cea2ac7daa0db
> For some reason FakeRandomGen is super slow.

That should be caused by the legacy RNG being slow, not by anything we do, but it might make sense to check.
OK, new numbers. random_state means legacy RNG, rng means new RNG; spawn means spawning one rng per cell, reuse means reusing the same rng.
Seems like my changes did introduce a little overhead for the default / legacy case (rightmost column). I wonder if that comes from the fact that the rng no longer happens inside the numba code, or if all the wrapping is that noticeable (it really shouldn’t be).
Otherwise the results match yours: spawning takes less than 2× the time of reusing, and the new generator is faster than the old one.
For scanpy commit af57cffc <main>:
```
=================  =============  =============  ======================  ======================
--                                       layer / param3
-----------------  -------------------------------------------------------------------------
dataset            rng / spawn    rng / reuse    random_state / spawn    random_state / reuse
=================  =============  =============  ======================  ======================
pbmc68k_reduced    n/a            n/a            n/a                     7.56±0.02ms
pbmc3k             n/a            n/a            n/a                     119±0.5ms
=================  =============  =============  ======================  ======================
```
For scanpy commit 93a9d90e <pa/rng>:
```
=================  =============  =============  ======================  ======================
--                                       layer / param3
-----------------  -------------------------------------------------------------------------
dataset            rng / spawn    rng / reuse    random_state / spawn    random_state / reuse
=================  =============  =============  ======================  ======================
pbmc68k_reduced    20.8±0.2ms     11.1±0.5ms     n/a                     17.4±0.9ms
pbmc3k             92.4±3ms       54.3±0.5ms     n/a                     137±1ms
=================  =============  =============  ======================  ======================
```
But the current version doesn't reuse the rng, right? Maybe we can open a performance issue for creating one rng per thread. But this will limit reproducibility to runs with the same n_jobs.
Is this true? I was looking into your comment and found this:

```python
import numpy as np

# High quality entropy created with: f"0x{secrets.randbits(128):x}"
entropy = 0x3034C61A9AE04FF8CB62AB8EC2C4B501

rng = np.random.default_rng(entropy)
# Generate a random number as the first step
first_random = rng.uniform()

rng = np.random.default_rng(entropy)
# Now, spawn as the first step
child_rng1, child_rng2 = rng.spawn(2)
# And then do RNG
assert first_random == rng.uniform()
```

Also, reading through the linked thread, it seems the consensus was
@ilan-gold yeah, I'm sure! I was confused: they have two parts of internal state. Spawning advances the SeedSequence. I can do a writeup when I'm back working.

Ok @flying-sheep, that makes a load of sense. I assumed as much was going on (how else could what I posted be true, unless they were literally tracking the number of children created), and it turns out that is exactly what is happening. Quite literally: after spawning, the child count changes but the entropy is unchanged.
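That claim can be checked directly with NumPy's public `SeedSequence` attributes: spawning bumps a child counter while leaving the entropy untouched:

```python
import numpy as np

ss = np.random.SeedSequence(12345)
assert ss.n_children_spawned == 0

ss.spawn(3)  # creates children with spawn keys (0,), (1,), (2,)

assert ss.n_children_spawned == 3  # the counter advanced...
assert ss.entropy == 12345         # ...but the entropy is unchanged
```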
> Should we store the new RNG in adata.uns? If no, this fixes #1131

No opinion, seems moderately unrelated. I guess with the new structure it's a bit meaningless, because seeds have no impact if you're passing in a Generator, and you can't serialize a Generator.

> Should we keep random_state in the docs?

No

> How should we annotate that the default rng isn’t actually None but “a new instance of _LegacyRandom(0)”, while people can pass rng=None to get the future default behavior?

I noticed there is no default in the docstring for rng like there is for random_state. Why not? It seems like the long-term goal is to unify public APIs. Is there a way to create a default docstring that generalizes, but can refer to an old version of the package/docs as a disclaimer for “default” behavior? Maybe not. It's hard for me to hold everything in my head.

> Should I handle passing rng to the neighbors transformer?

It seems like we don't do it now unless it's a self-constructed transformer, so I probably wouldn't. Is this what you mean?
```python
return _downsample_array_inner(col, cumcounts, sample, inplace=inplace)
```

```python
# TODO: can/should this be parallelized?
```
Seems like a good candidate!

Yeah, I just left it alone to avoid causing any more changes in this PR than there already were.
Not in AnnData, but being serialized in general is an explicit feature of them.
great point, I’ll add that, and that’ll make documenting this easy.
I think we do it when the user passes a transformer class, or when explicitly using a transformer that supports it (sklearn?). My point was: sklearn doesn’t support

Could you check this out? My impression from looking at our code was that there was no mechanism for “updating” a user-passed-in transformer, i.e. giving it an RNG (state). So I'm not sure what this would even look like. You would use https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html#sklearn.base.BaseEstimator.set_params or something?

@ilan-gold ah yeah, I assumed we supported passing a transformer class, but we only support passing an instance.
Benchmark changes
Comparison: https://github.com/scverse/scanpy/compare/af57cffc6eb7fa77618b2ab026231f72cd029c12..d8d0e8047b977ae834479f0e7d835461e6bd23b2 More details: https://github.com/scverse/scanpy/pull/3983/checks?check_run_id=66957280346
src/scanpy/tools/_umap.py (outdated)

```python
    UMAP parameters.
    """
    rng_init, rng_umap = np.random.default_rng(rng).spawn(2)
```
Sorry, why did we undo this? I thought I had said here that it seems like a good idea: #3983 (comment)
When discussing with Selman, I realized that the only stable pattern would be

```python
[root_rng] = rng.spawn(1)  # advance rng’s seed sequence by 1
rng_task1, rng_task2 = root_rng.spawn(2)  # advance root_rng’s seed sequence by 2, can be changed
del rng, root_rng  # make sure they aren’t used anymore
```

which isn’t what we used and is a bit complex for little gain. numpy has no stability guarantee anyway, so what this would allow is to add and use an rng_task3, which would probably change results anyway, as there is no place where we create completely independent results using the RNGs (like predicting two different things and then only storing them without using them further).
At this point I have no opinion here anymore. @selmanozleyen and @ilan-gold please align on something and I’ll do it. (not trying to sound peeved here, just ready to move on)
I think I agree with the current version. If the return value is strongly coupled to multiple rngs, there is little benefit IMO (unless it was computed in parallel), because the initialization is going to affect the core algorithm anyway.
I don't really get why this is better than just spawning two tasks out of rng
I think it should be done only when there is a clear benefit; that's just my opinion, for clarity. Otherwise I would be confused about whether rng_init and rng_umap are supposed to be fully independent or not. I don't think it's better or worse, though.
```python
u, v, u_based_decision=pkg_version("scikit-learn") < Version("1.5.0rc1")
```

```python
random_init = rng.uniform(size=np.min(x.shape))
kw: Any = (
    dict(rng=rng) if SCIPY_1_15 else dict(random_state=_legacy_random_state(rng))
```
Same here, what happened to spawn?
**selmanozleyen** left a comment:
I think adding TODOs and issues will clarify that the spawn usages are for possible parallel implementations.
Also, we need to clarify whether we will create an rng per core or per independent component.
numpy.random.Generator (#3371)

The idea is to make our code behave as closely as possible to how it did, but make it read like modern code:

- turn the `random_state` `int` into an `rng` argument (deduplicating them and using 0 by default)
- … `RandomState`)
- `_FakeRandomGen.wrap_global` and `_if_legacy_apply_global` to conditionally replace `np.random.seed` and other global-state-mutating functions
- `_legacy_random_state` to get back a `random_state` argument for APIs that don’t take `rng`, e.g. `scikit-learn` stuff and `"random_state"` in `adata.uns[thing]["params"]`

I also didn’t convert the `transformer` argument to `neighbors` (yet?), or deprecated stuff like `louvain` or the `external` APIs.

Reviewing

First, a short info about how Generators work:

- they `spawn` independent children (doing so advances their internal state once)
- `rng: SeedLike | RNGLike | None = None` for the argument, `None` meaning random initialization

Now questions to the reviewers:

- Should we store the new RNG in `adata.uns`? If no, this fixes “Passing a RandomState instance can cause failures to save” (#1131)
- Should we keep `random_state` in the docs?
- How should we annotate that the default `rng` isn’t actually `None` but “a new instance of `_LegacyRandom(0)`”, while people can pass `rng=None` to get the future default behavior?
- Should I handle passing `rng` to the neighbors transformer?
- … `rng` can be passed?
- … `np.random.seed()`

TODO:

- `np.random.seed` calls (maybe `if isinstance(rng, _FakeRandomGen): gen = legacy_numpy_gen(rng)` or so?)
- add a `spawn` to `_FakeRandomGen` (that does nothing) and use `spawn` for a tree structure
- `rng.clone()` or similar? numpy/numpy#24086 (comment)
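The `rng: SeedLike | RNGLike | None = None` convention above can be sketched as follows (a generic illustration, not scanpy's actual wrapper; `my_tool` is a made-up name). `np.random.default_rng` does the normalization: it accepts `None` (random initialization), an `int` seed, or an existing `Generator`, which it returns unaltered:

```python
from __future__ import annotations

import numpy as np


def my_tool(data, *, rng: int | np.random.Generator | None = None):
    # normalize whatever the caller passed into a Generator
    rng = np.random.default_rng(rng)
    return rng.permutation(data)


# all three call styles work:
my_tool([1, 2, 3])                                # random initialization
my_tool([1, 2, 3], rng=0)                         # seeded, reproducible
my_tool([1, 2, 3], rng=np.random.default_rng(0))  # pass a Generator through
```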