
feat: support np.random.Generator #3983

Open

flying-sheep wants to merge 31 commits into main from pa/rng

Conversation

@flying-sheep
Member

@flying-sheep flying-sheep commented Feb 23, 2026

The idea is to make our code behave as closely as possible to how it did before, while making it read like modern code:

  1. add decorator that converts random_state into rng argument (deduplicating them and using 0 by default)
  2. add helpers that allow old behavior (e.g. for APIs that only take RandomState)
    • _FakeRandomGen.wrap_global and _if_legacy_apply_global to conditionally replace np.random.seed and other global-state-mutating functions
    • _legacy_random_state to get back a random_state argument for APIs that don’t take rng, e.g. scikit-learn stuff and "random_state" in adata.uns[thing]["params"]
  3. after this PR: make feat: presets #3653 change the default behavior when a preset is passed.

I also didn’t convert the transformer argument to neighbors (yet?), or deprecated stuff like louvain or the external APIs.
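
To make point 1 concrete, here is a minimal, hypothetical sketch of such a decorator (the names `rng_arg` and `noisy_mean` are invented for illustration; the real scanpy helper also preserves legacy `RandomState` semantics via `_LegacyRandom`, which this sketch omits by using a plain `Generator`):

```python
from __future__ import annotations

import functools
import warnings

import numpy as np


def rng_arg(func):
    """Map a deprecated ``random_state`` keyword onto ``rng``, defaulting to seed 0."""

    @functools.wraps(func)
    def wrapper(*args, random_state=None, rng=None, **kwargs):
        if random_state is not None:
            if rng is not None:
                msg = "pass only `rng`, not both `rng` and `random_state`"
                raise TypeError(msg)
            warnings.warn(
                "`random_state` is deprecated, use `rng` instead",
                FutureWarning,
                stacklevel=2,
            )
            rng = random_state
        if rng is None:
            rng = 0  # keep the historical default of a fixed seed
        return func(*args, rng=np.random.default_rng(rng), **kwargs)

    return wrapper


@rng_arg
def noisy_mean(n: int, *, rng: np.random.Generator) -> float:
    return float(rng.standard_normal(n).mean())
```

With this, `noisy_mean(10)`, `noisy_mean(10, rng=0)`, and `noisy_mean(10, random_state=0)` (the last with a `FutureWarning`) all produce the same result, while passing both keywords raises a `TypeError`.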

Reviewing

First, some short info about how Generators work:

  • they can spawn independent children (doing so advances their internal state once)
  • all their other methods advance their internal state
  • they are reproducible in the same environment (when initialized with the same seed of course) but make no reproducibility guarantee across versions
  • the convention is to use rng: SeedLike | RNGLike | None = None for the argument, None meaning random initialization

Now questions to the reviewers:

  • Should we store the new RNG in adata.uns? If not, this fixes “Passing a RandomState instance can cause failures to save” (#1131)
  • Should we keep random_state in the docs?
  • How should we annotate that the default rng isn’t actually None but “a new instance of _LegacyRandom(0)”, while people can still pass rng=None to get the future default behavior?
  • Should I handle passing rng to neighbors transformer?
  • Did I miss other spots where rng can be passed?
  • Did I miss any spots where we called np.random.seed()?

TODO:

  • add the decorator
  • add helpers for restarting (e.g. if 0 was passed, it’d be reused by the functions called in the function body)
  • for the functions that called it: re-add the np.random.seed calls (maybe if isinstance(rng, _FakeRandomGen): gen = legacy_numpy_gen(rng) or so?)
    • partially done, finish the work
  • add spawn to _FakeRandomGen (that does nothing) and use spawn for a tree structure
  • ingest

@codecov

codecov bot commented Feb 24, 2026

Codecov Report

❌ Patch coverage is 92.62537% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.21%. Comparing base (af57cff) to head (d8d0e80).
⚠️ Report is 3 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/scanpy/plotting/_tools/paga.py 68.96% 9 Missing ⚠️
src/scanpy/_utils/random.py 88.88% 7 Missing ⚠️
src/scanpy/tools/_draw_graph.py 78.57% 6 Missing ⚠️
src/scanpy/neighbors/__init__.py 94.44% 1 Missing ⚠️
src/scanpy/preprocessing/_pca/__init__.py 87.50% 1 Missing ⚠️
src/scanpy/tools/_ingest.py 97.22% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3983      +/-   ##
==========================================
+ Coverage   77.96%   78.21%   +0.24%     
==========================================
  Files         118      119       +1     
  Lines       12517    12672     +155     
==========================================
+ Hits         9759     9911     +152     
- Misses       2758     2761       +3     
Flag Coverage Δ
hatch-test.low-vers 77.50% <92.62%> (+0.25%) ⬆️
hatch-test.pre 77.16% <92.03%> (+0.25%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/scanpy/_docs.py 100.00% <100.00%> (ø)
src/scanpy/datasets/_datasets.py 91.26% <100.00%> (+0.36%) ⬆️
src/scanpy/experimental/_docs.py 100.00% <ø> (ø)
src/scanpy/experimental/pp/_normalization.py 94.18% <100.00%> (+0.13%) ⬆️
src/scanpy/experimental/pp/_recipes.py 100.00% <100.00%> (ø)
src/scanpy/preprocessing/_deprecated/sampling.py 100.00% <100.00%> (ø)
src/scanpy/preprocessing/_pca/_compat.py 100.00% <100.00%> (ø)
src/scanpy/preprocessing/_recipes.py 91.07% <100.00%> (+0.33%) ⬆️
src/scanpy/preprocessing/_scrublet/__init__.py 97.05% <100.00%> (+0.31%) ⬆️
src/scanpy/preprocessing/_scrublet/core.py 92.81% <100.00%> (+0.09%) ⬆️
... and 19 more

... and 1 file with indirect coverage changes

@flying-sheep flying-sheep added this to the 1.13.0 milestone Feb 26, 2026
@flying-sheep flying-sheep marked this pull request as ready for review February 26, 2026 12:55
Comment on lines 1068 to 1071
for rowidx, sub_rng in zip(
under_target, rng.spawn(len(under_target)), strict=True
):
_downsample_array(
Member

I think spawn is meant more for parallel independent streams. It’s correct for reproducibility, but it may be paying a performance overhead for independence that usually isn’t needed. So far I’ve only seen spawn used either when an rng is handed to another function while keeping the current code flow’s rng independent, or in parallel cases — not in a sequential for loop. My argument goes like this: if this for loop were unrolled and the code written sequentially, would we still use spawn? I think not, because in sparse multiply we don’t do that, for example.

In fact, I ran an AI-generated script, and these were the results. For some reason _FakeRandomGen is super slow, but they also show that spawn actually has an overhead.

python bench_downsample.py
Warming up numba JIT...
Done.

counts_per_cell=50  n_obs=500 n_vars=200 nnz=29840
  legacy (random_state=0)                20.07 ms
  real Generator (with spawn)             8.30 ms
  real Generator (no spawn)               4.53 ms
  --> spawn overhead: +83%   real vs legacy: -59%

counts_per_cell=100  n_obs=5000 n_vars=500 nnz=746289
  legacy (random_state=0)               399.50 ms
  real Generator (with spawn)            86.80 ms
  real Generator (no spawn)              53.74 ms
  --> spawn overhead: +62%   real vs legacy: -78%

counts_per_cell=50  n_obs=20000 n_vars=1000 nnz=5969861
  legacy (random_state=0)              2968.35 ms
  real Generator (with spawn)           320.06 ms
  real Generator (no spawn)             185.98 ms
  --> spawn overhead: +72%   real vs legacy: -89%
python bench_downsample.py
Warming up numba JIT...
Done.

counts_per_cell=50  n_obs=500 n_vars=200 nnz=29840
  legacy (_FakeRandomGen)                18.96 ms
  real Generator (with spawn)             7.77 ms
  real Generator (no spawn)               4.22 ms
  --> spawn overhead: +84%   real vs legacy: -59%

counts_per_cell=100  n_obs=5000 n_vars=500 nnz=746289
  legacy (_FakeRandomGen)               412.81 ms
  real Generator (with spawn)            88.76 ms
  real Generator (no spawn)              54.64 ms
  --> spawn overhead: +62%   real vs legacy: -78%

counts_per_cell=50  n_obs=20000 n_vars=1000 nnz=5969861
  legacy (_FakeRandomGen)              2956.91 ms
  real Generator (with spawn)           322.48 ms
  real Generator (no spawn)             187.00 ms
  --> spawn overhead: +72%   real vs legacy: -89%

the script: https://gist.github.com/selmanozleyen/2046c849bb10f541642cea2ac7daa0db

Member Author

For some reason FakeRandomGen is super slow.

that should be caused by the legacy rng being slow, not anything we do, but might make sense to check.

Member Author

OK, new numbers. random_state means legacy RNG, rng means new RNG; spawn means spawning one rng per cell, reuse means reusing the same rng.

Seems like my changes did introduce a little bit of overhead for the default / legacy case (rightmost column). I wonder if that comes from the fact that the RNG calls don’t happen inside the numba code anymore, or if all the wrapping is that noticeable (it really shouldn’t be).

Otherwise the results match yours: spawning takes less than 2× the time of reusing, and the new generator is faster than the old one.

For scanpy commit af57cffc <main>:
================= ============= ============= ====================== ======================
--                                              layer / param3                             
----------------- -------------------------------------------------------------------------
     dataset       rng / spawn   rng / reuse   random_state / spawn   random_state / reuse 
================= ============= ============= ====================== ======================
 pbmc68k_reduced       n/a           n/a               n/a                7.56±0.02ms      
      pbmc3k           n/a           n/a               n/a                 119±0.5ms       
================= ============= ============= ====================== ======================
For scanpy commit 93a9d90e <pa/rng>:

================= ============= ============= ====================== ======================
--                                              layer / param3                             
----------------- -------------------------------------------------------------------------
     dataset       rng / spawn   rng / reuse   random_state / spawn   random_state / reuse 
================= ============= ============= ====================== ======================
 pbmc68k_reduced    20.8±0.2ms    11.1±0.5ms           n/a                 17.4±0.9ms      
      pbmc3k         92.4±3ms     54.3±0.5ms           n/a                  137±1ms        
================= ============= ============= ====================== ======================

Member

But the current version doesn’t reuse the rng, right? Maybe we can open a performance issue for creating one rng per thread, but that would limit reproducibility to runs with the same n_jobs.

@ilan-gold
Copy link
Contributor

they can spawn independent children (doing so advances their internal state once)

Is this True? I was looking into your comment and found this:

# High quality entropy created with: f"0x{secrets.randbits(128):x}"
import numpy as np
entropy = 0x3034c61a9ae04ff8cb62ab8ec2c4b501
rng = np.random.default_rng(entropy)
# Generate a random number as the first step
first_random = rng.uniform()
rng = np.random.default_rng(entropy)
# Now, spawn as the first step
child_rng1, child_rng2 = rng.spawn(2)
# And then do RNG
assert first_random == rng.uniform()

Also, reading through the linked thread, it seems the consensus was spawn does not advance internal state, although maybe I'm misunderstanding what you mean by "internal state" (you even say in your comment "but this means that when users pass a Generator, its state isn’t advanced at all (since spawn doesn’t do that)"). Do you mean the internal state of the children?

@flying-sheep
Member Author

@ilan-gold yeah I'm sure! I was confused: they have two parts of internal state. Spawning advances the SeedSequence. I can do a writeup when I'm back working

@ilan-gold
Contributor

OK @flying-sheep, that makes a load of sense. I assumed as much was going on (how else could what I posted be true, unless they were literally tracking the number of children created?), and it turns out that is exactly what is happening. Like, literally:

SeedSequence(
    entropy=64076961259285389890164002958222865665,
    n_children_spawned=2,
)

after spawning, so the entropy is unchanged.

Contributor

@ilan-gold ilan-gold left a comment

Should we store the new RNG in adata.uns? If no, this fixes #1131

No opinion, seems moderately unrelated. I guess with the new structure it's a bit meaningless because seeds have no impact if you're passing in a Generator and you can't serialize a Generator

Should we keep random_state in the docs?

No

How should we annotate that the default rng isn’t actually None but “a new instance of _LegacyRandom(0)” but people can pass rng=None to get the future default behavior?

I noticed there is no default docstring for rng like there is for random_state. Why not? It seems like the long-term goal is to unify public APIs. Is there a way to create a default docstring that generalizes but can refer to an old version of the package/docs as a disclaimer for “default” behavior? Maybe not. It’s hard for me to hold everything in my head.

Should I handle passing rng to neighbors transformer?

It seems like we don't do it now unless it's a self-constructed transformer, so I probably wouldn't. Is this what you mean?

return _downsample_array_inner(col, cumcounts, sample, inplace=inplace)


# TODO: can/should this be parallelized?
Contributor

Seems like a good candidate!

Member Author

yeah, I just left it alone so as not to cause any more changes in this PR than there already were.

@flying-sheep
Member Author

flying-sheep commented Mar 6, 2026

you can't serialize a Generator

Not in AnnData, but being serializable in general is an explicit feature of them.

I noticed there is no default docstring for rng like for random_state. Why not? It seems like the long term goal is to unify public APIs.

great point, I’ll add that, and that’ll make documenting this easy.

It seems like we don't do it now unless it's a self-constructed transformer

I think we do it when the user passes a transformer class, or when explicitly using a transformer that supports it (sklearn?)

My point was: sklearn doesn’t support rng yet; should we pre-empt the possible change and somehow handle both random_state and rng, even though there probably won’t be transformers that accept rng for a while?
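
For APIs that only accept `random_state`, one bridging option is to derive a seed from the Generator. The helper below is an invented stand-in for scanpy’s internal `_legacy_random_state` mentioned in the PR description; the real helper may return a `RandomState` rather than an int:

```python
from __future__ import annotations

import numpy as np


def seed_for_random_state(rng: np.random.Generator) -> int:
    """Draw a 32-bit int seed for APIs that only accept ``random_state``.

    Hypothetical stand-in for scanpy's ``_legacy_random_state`` helper.
    """
    return int(rng.integers(2**32))


rng = np.random.default_rng(0)
seed = seed_for_random_state(rng)
assert 0 <= seed < 2**32
# e.g.: sklearn.cluster.KMeans(n_clusters=3, random_state=seed)
```

Drawing the seed advances `rng`, so repeated bridged calls on the same Generator get different (but seed-reproducible) `random_state` values.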

@ilan-gold
Contributor

I think we do it when the user passed a transformer class or when using an explicitly transformer that supports it (sklearn?)

Could you check this out? My impression looking at our code was that there was no mechanism for “updating” a user-passed-in transformer, i.e., giving it an RNG (state). So I’m not even sure what this would look like. Would you use https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html#sklearn.base.BaseEstimator.set_params or something?

@flying-sheep
Member Author

@ilan-gold ah yeah I assumed we supported passing a transformer class, but we only support passing an instance.

@scverse-benchmark

scverse-benchmark bot commented Mar 12, 2026

Benchmark changes

Change Before [af57cff] After [d8d0e80] Ratio Benchmark (Parameter)
+ 180±3ms 206±1ms 1.15 preprocessing_counts.PreprocessingCountsRngSuite.time_downsample_per_cell('pbmc3k', 'random_state')
+ 11.2±0.05ms 25.0±0.4ms 2.24 preprocessing_counts.PreprocessingCountsRngSuite.time_downsample_per_cell('pbmc68k_reduced', 'random_state')
- 307±4ms 191±0.4ms 0.62 preprocessing_counts.PreprocessingCountsRngSuite.time_downsample_total('pbmc3k', 'random_state')
- 16.0±0.08ms 13.1±0.4ms 0.82 preprocessing_counts.PreprocessingCountsRngSuite.time_downsample_total('pbmc68k_reduced', 'random_state')

Comparison: https://github.com/scverse/scanpy/compare/af57cffc6eb7fa77618b2ab026231f72cd029c12..d8d0e8047b977ae834479f0e7d835461e6bd23b2
Last changed: Fri, 13 Mar 2026 13:53:20 +0000

More details: https://github.com/scverse/scanpy/pull/3983/checks?check_run_id=66957280346

UMAP parameters.

"""
rng_init, rng_umap = np.random.default_rng(rng).spawn(2)
Contributor

Sorry, why did we undo this? I thought I had said here it seems like a good idea: #3983 (comment)

Member Author

@flying-sheep flying-sheep Mar 13, 2026

When discussing with Selman, I realized that the only stable pattern would be

[root_rng] = rng.spawn(1)  # advance rng’s seed sequence by 1
rng_task1, rng_task2 = root_rng.spawn(2)  # advance root_rng’s seed sequence by 2, can be changed
del rng, root_rng  # make sure they aren’t used anymore

which isn’t what we used and is a bit complex for little gain. numpy makes no stability guarantee anyway, so all this pattern would buy us is the ability to add and use an rng_task3 later, and even that would probably change results, as there is no place where we create completely independent results using the RNGs (like predicting two different things and then only storing them without using them further).

At this point I have no opinion here anymore. @selmanozleyen and @ilan-gold please align on something and I’ll do it. (not trying to sound peeved here, just ready to move on)

Member

I think I agree with the current version. If the return value is strongly coupled to multiple rngs, there is little benefit IMO (unless it was computed in parallel), because the initialization is also going to affect the core algorithm anyway.

Contributor

I don’t really get why this is better than just spawning two tasks out of rng.

Member

I think it should be done only when there is a clear benefit; that’s just my opinion, for clarity. Otherwise I would be confused about whether rng_init and rng_umap are supposed to be very independent or not. I don’t think it’s better or worse though.

u, v, u_based_decision=pkg_version("scikit-learn") < Version("1.5.0rc1")
random_init = rng.uniform(size=np.min(x.shape))
kw: Any = (
dict(rng=rng) if SCIPY_1_15 else dict(random_state=_legacy_random_state(rng))
Contributor

Same here, what happened to spawn?

Member

@selmanozleyen selmanozleyen left a comment

I think adding TODOs and issues will clarify that the spawn usages are for possible parallel implementations.

Also, we need to clarify whether we will create one rng per core or one per independent component.



Development

Successfully merging this pull request may close these issues.

  • Switch to numpy.random.Generator
  • Passing a RandomState instance can cause failures to save

3 participants