Conversation
**Codecov Report** ❌ Patch coverage is

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #3983      +/-   ##
==========================================
+ Coverage   77.96%   78.21%   +0.24%
==========================================
  Files         118      119       +1
  Lines       12517    12672     +155
==========================================
+ Hits         9759     9911     +152
- Misses       2758     2761       +3
```

Flags with carried forward coverage won't be shown.
src/scanpy/preprocessing/_simple.py (outdated)

```python
for rowidx, sub_rng in zip(
    under_target, rng.spawn(len(under_target)), strict=True
):
    _downsample_array(
```
I think spawn is more for parallel independent streams. It is correct for reproducibility, but it might be adding performance overhead for an independence that is usually not needed. So far I have seen spawn used either when an rng is given to another function while the current code flow keeps its own rng independent, or in parallel cases; not in a sequential for loop. My argument goes like this: if this for loop were unrolled and the code written sequentially, would we still use spawn? I think no, because we don't do that in sparse multiply, for example.
In fact, I ran an AI-generated script and these were the results. For some reason _FakeRandomGen is super slow, but the numbers also show that spawn has real overhead.
```
python bench_downsample.py
Warming up numba JIT...
Done.

counts_per_cell=50 n_obs=500 n_vars=200 nnz=29840
  legacy (random_state=0)        20.07 ms
  real Generator (with spawn)     8.30 ms
  real Generator (no spawn)       4.53 ms
  --> spawn overhead: +83%   real vs legacy: -59%

counts_per_cell=100 n_obs=5000 n_vars=500 nnz=746289
  legacy (random_state=0)       399.50 ms
  real Generator (with spawn)    86.80 ms
  real Generator (no spawn)      53.74 ms
  --> spawn overhead: +62%   real vs legacy: -78%

counts_per_cell=50 n_obs=20000 n_vars=1000 nnz=5969861
  legacy (random_state=0)      2968.35 ms
  real Generator (with spawn)   320.06 ms
  real Generator (no spawn)     185.98 ms
  --> spawn overhead: +72%   real vs legacy: -89%
```
```
❯ python bench_downsample.py
Warming up numba JIT...
Done.

counts_per_cell=50 n_obs=500 n_vars=200 nnz=29840
  legacy (_FakeRandomGen)        18.96 ms
  real Generator (with spawn)     7.77 ms
  real Generator (no spawn)       4.22 ms
  --> spawn overhead: +84%   real vs legacy: -59%

counts_per_cell=100 n_obs=5000 n_vars=500 nnz=746289
  legacy (_FakeRandomGen)       412.81 ms
  real Generator (with spawn)    88.76 ms
  real Generator (no spawn)      54.64 ms
  --> spawn overhead: +62%   real vs legacy: -78%

counts_per_cell=50 n_obs=20000 n_vars=1000 nnz=5969861
  legacy (_FakeRandomGen)      2956.91 ms
  real Generator (with spawn)   322.48 ms
  real Generator (no spawn)     187.00 ms
  --> spawn overhead: +72%   real vs legacy: -89%
```
the script: https://gist.github.com/selmanozleyen/2046c849bb10f541642cea2ac7daa0db
> For some reason FakeRandomGen is super slow.

That should be caused by the legacy RNG being slow, not by anything we do, but it might make sense to check.
OK, new numbers. random_state means legacy RNG, rng means new RNG; spawn means spawning one rng per cell, reuse means reusing the same rng.
Seems like my changes did introduce a little overhead for the default / legacy case (rightmost column). I wonder if that comes from the fact that the rng no longer happens inside the numba code, or if all the wrapping is that noticeable (it really shouldn’t be).
Otherwise the results match yours: spawning takes less than 2× the time of reusing, and the new generator is faster than the old one.
For scanpy commit af57cffc <main>:
```
=================  =============  =============  ======================  ======================
--                                       layer / param3
-----------------  -------------------------------------------------------------------------
dataset            rng / spawn    rng / reuse    random_state / spawn    random_state / reuse
=================  =============  =============  ======================  ======================
pbmc68k_reduced    n/a            n/a            n/a                     7.56±0.02ms
pbmc3k             n/a            n/a            n/a                     119±0.5ms
=================  =============  =============  ======================  ======================
```
For scanpy commit 93a9d90e <pa/rng>:
```
=================  =============  =============  ======================  ======================
--                                       layer / param3
-----------------  -------------------------------------------------------------------------
dataset            rng / spawn    rng / reuse    random_state / spawn    random_state / reuse
=================  =============  =============  ======================  ======================
pbmc68k_reduced    20.8±0.2ms     11.1±0.5ms     n/a                     17.4±0.9ms
pbmc3k             92.4±3ms       54.3±0.5ms     n/a                     137±1ms
=================  =============  =============  ======================  ======================
```
But the current version doesn't reuse the rng, right? Maybe we can open a performance issue for creating one rng per thread. But this will limit reproducibility to runs with the same n_jobs.
Is this true? I was looking into your comment and found this:

```python
import numpy as np

# High quality entropy created with: f"0x{secrets.randbits(128):x}"
entropy = 0x3034C61A9AE04FF8CB62AB8EC2C4B501

rng = np.random.default_rng(entropy)
# Generate a random number as the first step
first_random = rng.uniform()

rng = np.random.default_rng(entropy)
# Now, spawn as the first step
child_rng1, child_rng2 = rng.spawn(2)
# And then do RNG
assert first_random == rng.uniform()
```

Also, reading through the linked thread, it seems the consensus was
@ilan-gold yeah, I'm sure! I was confused: they have two parts of internal state. Spawning advances the SeedSequence. I can do a writeup when I'm back working.

Ok @flying-sheep, that makes a load of sense. I assumed as much was going on (how else could what I posted be true, unless they were literally tracking the number of children created), and it turns out that is exactly what is happening. Quite literally: after spawning, the child count changes but the entropy is unchanged.
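That claim can be checked directly with NumPy's public `SeedSequence` attributes: spawning bumps a child counter while leaving the entropy untouched:

```python
import numpy as np

ss = np.random.SeedSequence(12345)
assert ss.n_children_spawned == 0

ss.spawn(3)  # creates children with spawn keys (0,), (1,), (2,)

assert ss.n_children_spawned == 3  # the counter advanced...
assert ss.entropy == 12345         # ...but the entropy is unchanged
```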
> Should we store the new RNG in adata.uns? If no, this fixes #1131

No opinion, seems moderately unrelated. I guess with the new structure it's a bit meaningless, because seeds have no impact if you're passing in a Generator, and you can't serialize a Generator.

> Should we keep random_state in the docs?

No

> How should we annotate that the default rng isn’t actually None but “a new instance of _LegacyRandom(0)”, while people can pass rng=None to get the future default behavior?

I noticed there is no default in the docstring for rng like there is for random_state. Why not? It seems like the long-term goal is to unify public APIs. Is there a way to create a default docstring that generalizes, but can refer to an old version of the package/docs as a disclaimer for “default” behavior? Maybe not. It's hard for me to hold everything in my head.

> Should I handle passing rng to the neighbors transformer?

It seems like we don't do it now unless it's a self-constructed transformer, so I probably wouldn't. Is this what you mean?
```python
return _downsample_array_inner(col, cumcounts, sample, inplace=inplace)
```

```python
# TODO: can/should this be parallelized?
```
Seems like a good candidate!

Yeah, I just left it alone to avoid causing any more changes in this PR than there already were.
Not in AnnData, but being serialized in general is an explicit feature of them.
great point, I’ll add that, and that’ll make documenting this easy.
I think we do it when the user passes a transformer class, or when explicitly using a transformer that supports it (sklearn?). My point was: sklearn doesn’t support

Could you check this out? My impression from looking at our code was that there was no mechanism for “updating” a user-passed-in transformer, i.e. giving it an RNG (state). So I'm not sure what this would even look like. You would use https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html#sklearn.base.BaseEstimator.set_params or something?

@ilan-gold ah yeah, I assumed we supported passing a transformer class, but we only support passing an instance.
Benchmark changes
Comparison: https://github.com/scverse/scanpy/compare/af57cffc6eb7fa77618b2ab026231f72cd029c12..d8d0e8047b977ae834479f0e7d835461e6bd23b2 More details: https://github.com/scverse/scanpy/pull/3983/checks?check_run_id=66957280346
src/scanpy/tools/_umap.py (outdated)

```python
    UMAP parameters.
    """
    rng_init, rng_umap = np.random.default_rng(rng).spawn(2)
```
Sorry, why did we undo this? I thought I had said here that it seems like a good idea: #3983 (comment)
When discussing with Selman, I realized that the only stable pattern would be

```python
[root_rng] = rng.spawn(1)  # advance rng’s seed sequence by 1
rng_task1, rng_task2 = root_rng.spawn(2)  # advance root_rng’s seed sequence by 2, can be changed
del rng, root_rng  # make sure they aren’t used anymore
```

which isn’t what we used and is a bit complex for little gain. numpy has no stability guarantee anyway, so what this would allow is to add and use an rng_task3, which would probably change results anyway, as there is no place where we create completely independent results using the RNGs (like predicting two different things and then only storing them without using them further).
At this point I have no opinion here anymore. @selmanozleyen and @ilan-gold please align on something and I’ll do it. (not trying to sound peeved here, just ready to move on)
I think I agree with the current version. If the return value is strongly coupled to multiple rngs, there is little benefit IMO (unless it was computed in parallel), because the initialization is going to affect the core algorithm anyway.
I don't really get why this is better than just spawning two tasks out of rng
I think it should be done only when there is a clear benefit; that's just my opinion, for clarity. Otherwise I would be confused about whether rng_init and rng_umap are supposed to be fully independent or not. I don't think it's better or worse, though.
```python
u, v, u_based_decision=pkg_version("scikit-learn") < Version("1.5.0rc1")
```

```python
random_init = rng.uniform(size=np.min(x.shape))
kw: Any = (
    dict(rng=rng) if SCIPY_1_15 else dict(random_state=_legacy_random_state(rng))
```
Same here, what happened to spawn?
**selmanozleyen** left a comment:
I think adding TODOs and issues will clarify that the spawn usages are for possible parallel implementations.
Also, we need to clarify whether we will create an rng per core or per independent component.
numpy.random.Generator (#3371)

The idea is to make our code behave as closely as possible to how it did, but make it read like modern code:

- turn the `random_state` `int` into an `rng` argument (deduplicating them and using 0 by default)
- … `RandomState`)
- `_FakeRandomGen.wrap_global` and `_if_legacy_apply_global` to conditionally replace `np.random.seed` and other global-state-mutating functions
- `_legacy_random_state` to get back a `random_state` argument for APIs that don’t take `rng`, e.g. `scikit-learn` stuff and `"random_state"` in `adata.uns[thing]["params"]`

I also didn’t convert the `transformer` argument to `neighbors` (yet?), or deprecated stuff like `louvain` or the `external` APIs.

Reviewing

First, a short info about how Generators work:

- they `spawn` independent children (doing so advances their internal state once)
- `rng: SeedLike | RNGLike | None = None` for the argument, `None` meaning random initialization

Now questions to the reviewers:

- Should we store the new RNG in `adata.uns`? If no, this fixes “Passing a RandomState instance can cause failures to save” (#1131)
- Should we keep `random_state` in the docs?
- How should we annotate that the default `rng` isn’t actually `None` but “a new instance of `_LegacyRandom(0)`”, while people can pass `rng=None` to get the future default behavior?
- Should I handle passing `rng` to the neighbors transformer?
- … `rng` can be passed?
- … `np.random.seed()`

TODO:

- `np.random.seed` calls (maybe `if isinstance(rng, _FakeRandomGen): gen = legacy_numpy_gen(rng)` or so?)
- add a `spawn` to `_FakeRandomGen` (that does nothing) and use `spawn` for a tree structure
- `rng.clone()` or similar? numpy/numpy#24086 (comment)
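The `rng: SeedLike | RNGLike | None = None` convention above can be sketched as follows (a generic illustration, not scanpy's actual wrapper; `my_tool` is a made-up name). `np.random.default_rng` does the normalization: it accepts `None` (random initialization), an `int` seed, or an existing `Generator`, which it returns unaltered:

```python
from __future__ import annotations

import numpy as np


def my_tool(data, *, rng: int | np.random.Generator | None = None):
    # normalize whatever the caller passed into a Generator
    rng = np.random.default_rng(rng)
    return rng.permutation(data)


# all three call styles work:
my_tool([1, 2, 3])                                # random initialization
my_tool([1, 2, 3], rng=0)                         # seeded, reproducible
my_tool([1, 2, 3], rng=np.random.default_rng(0))  # pass a Generator through
```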