diff options
Diffstat (limited to 'doc/source/reference/random/parallel.rst')
-rw-r--r-- | doc/source/reference/random/parallel.rst | 83 |
1 files changed, 82 insertions, 1 deletions
diff --git a/doc/source/reference/random/parallel.rst b/doc/source/reference/random/parallel.rst index bff955948..b625d34b7 100644 --- a/doc/source/reference/random/parallel.rst +++ b/doc/source/reference/random/parallel.rst @@ -1,7 +1,7 @@ Parallel Random Number Generation ================================= -There are three strategies implemented that can be used to produce +There are four main strategies implemented that can be used to produce repeatable pseudo-random numbers across multiple processes (local or distributed). @@ -109,6 +109,87 @@ territory ([2]_). .. _`not unique to numpy`: https://www.iro.umontreal.ca/~lecuyer/myftp/papers/parallel-rng-imacs.pdf +.. _sequence-of-seeds: + +Sequence of Integer Seeds +------------------------- + +As discussed in the previous section, `~SeedSequence` can not only take an +integer seed, it can also take an arbitrary-length sequence of (non-negative) +integers. If one exercises a little care, one can use this feature to design +*ad hoc* schemes for getting safe parallel PRNG streams with similar safety +guarantees as spawning. + +For example, one common use case is that a worker process is passed one +root seed integer for the whole calculation and also an integer worker ID (or +something more granular like a job ID, batch ID, or something similar). If +these IDs are created deterministically and uniquely, then one can derive +reproducible parallel PRNG streams by combining the ID and the root seed +integer in a list. + +.. code-block:: python + + # default_rng() and each of the BitGenerators use SeedSequence underneath, so + # they all accept sequences of integers as seeds the same way. + from numpy.random import default_rng + + def worker(root_seed, worker_id): + rng = default_rng([worker_id, root_seed]) + # Do work ... + + root_seed = 0x8c3c010cb4754c905776bdac5ee7501 + results = [worker(root_seed, worker_id) for worker_id in range(10)] + +.. end_block + +This can be used to replace a number of unsafe strategies that have been used +in the past which try to combine the root seed and the ID back into a single +integer seed value. For example, it is common to see users add the worker ID to +the root seed, especially with the legacy `~RandomState` code. + +.. code-block:: python + + # UNSAFE! Do not do this! + worker_seed = root_seed + worker_id + rng = np.random.RandomState(worker_seed) + +.. end_block + +It is true that for any one run of a parallel program constructed this way, +each worker will have distinct streams. However, it is quite likely that +multiple invocations of the program with different seeds will get overlapping +sets of worker seeds. It is not uncommon (in the author's self-experience) to +change the root seed merely by an increment or two when doing these repeat +runs. If the worker seeds are also derived by small increments of the worker +ID, then subsets of the workers will return identical results, causing a bias +in the overall ensemble of results. + +Combining the worker ID and the root seed as a list of integers eliminates this +risk. Lazy seeding practices will still be fairly safe. + +This scheme does require that the extra IDs be unique and deterministically +created. This may require coordination between the worker processes. It is +recommended to place the varying IDs *before* the unvarying root seed. +`~SeedSequence.spawn` *appends* integers after the user-provided seed, so if +you might be mixing both this *ad hoc* mechanism and spawning, or passing your +objects down to library code that might be spawning, then it is a little bit +safer to prepend your worker IDs rather than append them to avoid a collision. + +.. code-block:: python + + # Good. + worker_seed = [worker_id, root_seed] + + # Less good. It will *work*, but it's less flexible. + worker_seed = [root_seed, worker_id] + +.. end_block + +With those caveats in mind, the safety guarantees against collision are about +the same as with spawning, discussed in the previous section. The algorithmic +mechanisms are the same. + + .. _independent-streams: Independent Streams |