Merge pull request #18906 from rkern/enh/pcg64dxsm

ENH: Add PCG64DXSM BitGenerator
author: Charles Harris <charlesr.harris@gmail.com> 2021-05-05 11:15:18 -0600
committer: GitHub <noreply@github.com> 2021-05-05 11:15:18 -0600
commit: 73a0892f50165b5a98d3eca2967956878d684748 (patch)
tree: 22ede3e0db4e91afe3fd26c3afa8091ac7f5cff0 /doc
parent: 0f7f313d847e8e36e9fb3c4e576619a3af3bfe56 (diff)
parent: d97c634a04e539de2b66ccd0fbd526fea310571a (diff)
download: numpy-73a0892f50165b5a98d3eca2967956878d684748.tar.gz
8 files changed, 230 insertions, 15 deletions
diff --git a/doc/release/upcoming_changes/18906.new_function.rst b/doc/release/upcoming_changes/18906.new_function.rst
new file mode 100644
index 000000000..38444009d
--- /dev/null
+++ b/doc/release/upcoming_changes/18906.new_function.rst
@@ -0,0 +1,17 @@
+.. currentmodule:: numpy.random
+
+Add `PCG64DXSM` `BitGenerator`
+------------------------------
+
+Uses of the `PCG64` `BitGenerator` in a massively-parallel context have been
+shown to have statistical weaknesses that were not apparent at the first
+release in numpy 1.17. Most users will never observe this weakness and are
+safe to continue to use `PCG64`. We have introduced a new `PCG64DXSM`
+`BitGenerator` that will eventually become the new default `BitGenerator`
+implementation used by `default_rng` in future releases. `PCG64DXSM` solves
+the statistical weakness while preserving the performance and the features of
+`PCG64`.
+
+See :ref:`upgrading-pcg64` for more details.
+
+.. currentmodule:: numpy
diff --git a/doc/source/reference/random/bit_generators/index.rst b/doc/source/reference/random/bit_generators/index.rst
index 6f8cf02ca..c5c349806 100644
--- a/doc/source/reference/random/bit_generators/index.rst
+++ b/doc/source/reference/random/bit_generators/index.rst
@@ -15,10 +15,13 @@ Supported BitGenerators
 
 The included BitGenerators are:
 
-* PCG-64 - The default. A fast generator that supports many parallel streams
-  and can be advanced by an arbitrary amount. See the documentation for
-  :meth:`~.PCG64.advance`. PCG-64 has a period of :math:`2^{128}`. See the `PCG
-  author's page`_ for more details about this class of PRNG.
+* PCG-64 - The default. A fast generator that can be advanced by an arbitrary
+  amount. See the documentation for :meth:`~.PCG64.advance`. PCG-64 has
+  a period of :math:`2^{128}`. See the `PCG author's page`_ for more details
+  about this class of PRNG.
+* PCG-64 DXSM - An upgraded version of PCG-64 with better statistical
+  properties in parallel contexts. See :ref:`upgrading-pcg64` for more
+  information on these improvements.
 * MT19937 - The standard Python BitGenerator. Adds a `MT19937.jumped`
   function that returns a new generator with state as-if :math:`2^{128}` draws have
   been made.
@@ -43,6 +46,7 @@ The included BitGenerators are:
 
     MT19937 <mt19937>
     PCG64 <pcg64>
+    PCG64DXSM <pcg64dxsm>
     Philox <philox>
     SFC64 <sfc64>
 
diff --git a/doc/source/reference/random/bit_generators/pcg64dxsm.rst b/doc/source/reference/random/bit_generators/pcg64dxsm.rst
new file mode 100644
index 000000000..e37efa5d3
--- /dev/null
+++ b/doc/source/reference/random/bit_generators/pcg64dxsm.rst
@@ -0,0 +1,32 @@
+Permuted Congruential Generator (64-bit, PCG64 DXSM)
+----------------------------------------------------
+
+.. currentmodule:: numpy.random
+
+.. autoclass:: PCG64DXSM
+    :members: __init__
+    :exclude-members: __init__
+
+State
+=====
+
+.. autosummary::
+   :toctree: generated/
+
+   ~PCG64DXSM.state
+
+Parallel generation
+===================
+.. autosummary::
+   :toctree: generated/
+
+   ~PCG64DXSM.advance
+   ~PCG64DXSM.jumped
+
+Extending
+=========
+.. autosummary::
+   :toctree: generated/
+
+   ~PCG64DXSM.cffi
+   ~PCG64DXSM.ctypes
diff --git a/doc/source/reference/random/index.rst b/doc/source/reference/random/index.rst
index fc7743c64..96cd47017 100644
--- a/doc/source/reference/random/index.rst
+++ b/doc/source/reference/random/index.rst
@@ -222,6 +222,9 @@ one of three ways:
 * :ref:`independent-streams`
 * :ref:`parallel-jumped`
 
+Users with a very large amount of parallelism will want to consult
+:ref:`upgrading-pcg64`.
+
 Concepts
 --------
 .. toctree::
@@ -230,6 +233,7 @@ Concepts
    generator
    Legacy Generator (RandomState) <legacy>
    BitGenerators, SeedSequences <bit_generators/index>
+   Upgrading PCG64 with PCG64DXSM <upgrading-pcg64>
 
 Features
 --------
diff --git a/doc/source/reference/random/parallel.rst b/doc/source/reference/random/parallel.rst
index 721584014..7f0207bde 100644
--- a/doc/source/reference/random/parallel.rst
+++ b/doc/source/reference/random/parallel.rst
@@ -88,10 +88,11 @@ territory ([2]_).
        estimate the naive upper bound on a napkin and take comfort knowing
        that the probability is actually lower.
 
-.. [2] In this calculation, we can ignore the amount of numbers drawn from each
-       stream. Each of the PRNGs we provide has some extra protection built in
+.. [2] In this calculation, we can mostly ignore the amount of numbers drawn from each
+       stream. See :ref:`upgrading-pcg64` for the technical details about
+       `PCG64`. The other PRNGs we provide have some extra protection built in
        that avoids overlaps if the `~SeedSequence` pools differ in the
-       slightest bit. `PCG64` has :math:`2^{127}` separate cycles
+       slightest bit. `PCG64DXSM` has :math:`2^{127}` separate cycles
        determined by the seed in addition to the position in the
        :math:`2^{128}` long period for each cycle, so one has to both get on or
        near the same cycle *and* seed a nearby position in the cycle.
@@ -150,12 +151,14 @@ BitGenerator, the size of the jump and the bits in the default unsigned random
 are listed below.
 
 +-----------------+-------------------------+-------------------------+-------------------------+
-| BitGenerator    | Period                  |  Jump Size              | Bits                    |
+| BitGenerator    | Period                  |  Jump Size              | Bits per Draw           |
 +=================+=========================+=========================+=========================+
-| MT19937         | :math:`2^{19937}`       | :math:`2^{128}`         | 32                      |
+| MT19937         | :math:`2^{19937}-1`     | :math:`2^{128}`         | 32                      |
 +-----------------+-------------------------+-------------------------+-------------------------+
 | PCG64           | :math:`2^{128}`         | :math:`~2^{127}` ([3]_) | 64                      |
 +-----------------+-------------------------+-------------------------+-------------------------+
+| PCG64DXSM       | :math:`2^{128}`         | :math:`~2^{127}` ([3]_) | 64                      |
++-----------------+-------------------------+-------------------------+-------------------------+
 | Philox          | :math:`2^{256}`         | :math:`2^{128}`         | 64                      |
 +-----------------+-------------------------+-------------------------+-------------------------+
 
diff --git a/doc/source/reference/random/performance.py b/doc/source/reference/random/performance.py
index 3267197f5..794142836 100644
--- a/doc/source/reference/random/performance.py
+++ b/doc/source/reference/random/performance.py
@@ -3,9 +3,9 @@ from timeit import repeat
 import pandas as pd
 
 import numpy as np
-from numpy.random import MT19937, PCG64, Philox, SFC64
+from numpy.random import MT19937, PCG64, PCG64DXSM, Philox, SFC64
 
-PRNGS = [MT19937, PCG64, Philox, SFC64]
+PRNGS = [MT19937, PCG64, PCG64DXSM, Philox, SFC64]
 
 funcs = {}
 integers = 'integers(0, 2**{bits},size=1000000, dtype="uint{bits}")'
@@ -53,7 +53,7 @@ for key in npfuncs:
     col[key] = 1000 * min(t)
 table['RandomState'] = pd.Series(col)
 
-columns = ['MT19937','PCG64','Philox','SFC64', 'RandomState']
+columns = ['MT19937', 'PCG64', 'PCG64DXSM', 'Philox', 'SFC64', 'RandomState']
 table = pd.DataFrame(table)
 order = np.log(table).mean().sort_values().index
 table = table.T
diff --git a/doc/source/reference/random/performance.rst b/doc/source/reference/random/performance.rst
index 74dad4cc3..812c719f8 100644
--- a/doc/source/reference/random/performance.rst
+++ b/doc/source/reference/random/performance.rst
@@ -5,9 +5,12 @@ Performance
 
 Recommendation
 **************
-The recommended generator for general use is `PCG64`. It is
-statistically high quality, full-featured, and fast on most platforms, but
-somewhat slow when compiled for 32-bit processes.
+
+The recommended generator for general use is `PCG64` or its upgraded variant
+`PCG64DXSM` for heavily-parallel use cases. They are statistically high quality,
+full-featured, and fast on most platforms, but somewhat slow when compiled for
+32-bit processes. See :ref:`upgrading-pcg64` for details on when heavy
+parallelism would indicate using `PCG64DXSM`.
 
 `Philox` is fairly slow, but its statistical properties have
 very high quality, and it is easy to get assuredly-independent stream by using
diff --git a/doc/source/reference/random/upgrading-pcg64.rst b/doc/source/reference/random/upgrading-pcg64.rst
new file mode 100644
index 000000000..9e540ace9
--- /dev/null
+++ b/doc/source/reference/random/upgrading-pcg64.rst
@@ -0,0 +1,152 @@
+.. _upgrading-pcg64:
+
+.. currentmodule:: numpy.random
+
+Upgrading ``PCG64`` with ``PCG64DXSM``
+--------------------------------------
+
+Uses of the `PCG64` `BitGenerator` in a massively-parallel context have been
+shown to have statistical weaknesses that were not apparent at the first
+release in numpy 1.17. Most users will never observe this weakness and are
+safe to continue to use `PCG64`. We have introduced a new `PCG64DXSM`
+`BitGenerator` that will eventually become the new default `BitGenerator`
+implementation used by `default_rng` in future releases. `PCG64DXSM` solves
+the statistical weakness while preserving the performance and the features of
+`PCG64`.
+
+Does this affect me?
+====================
+
+If you
+
+  1. only use a single `Generator` instance,
+  2. only use `RandomState` or the functions in `numpy.random`,
+  3. only use the `PCG64.jumped` method to generate parallel streams,
+  4. explicitly use a `BitGenerator` other than `PCG64`,
+
+then this weakness does not affect you at all. Carry on.
+
+If you use moderate numbers of parallel streams created with `default_rng` or
+`SeedSequence.spawn`, in the 1000s, then the chance of observing this weakness
+is negligibly small. You can continue to use `PCG64` comfortably.
+
+If you use very large numbers of parallel streams, in the millions, and draw
+large amounts of numbers from each, then the chance of observing this weakness
+can become non-negligible, if still small. An example of such a use case would
+be a very large distributed reinforcement learning problem with millions of
+long Monte Carlo playouts each generating billions of random number draws. Such
+use cases should consider using `PCG64DXSM` explicitly or another
+modern `BitGenerator` like `SFC64` or `Philox`, but it is unlikely that any
+old results you may have calculated are invalid. In any case, the weakness is
+a kind of `Birthday Paradox <https://en.wikipedia.org/wiki/Birthday_problem>`_
+collision. That is, a single pair of parallel streams out of the millions,
+considered together, might fail a stringent set of statistical tests of
+randomness. The remaining millions of streams would all be perfectly fine, and
+the effect of the bad pair in the whole calculation is very likely to be
+swamped by the remaining streams in most applications.
+
+.. _upgrading-pcg64-details:
+
+Technical Details
+=================
+
+Like many PRNG algorithms, `PCG64` is constructed from a transition function,
+which advances a 128-bit state, and an output function, that mixes the 128-bit
+state into a 64-bit integer to be output. One of the guiding design principles
+of the PCG family of PRNGs is to balance the computational cost (and
+pseudorandomness strength) between the transition function and the output
+function. The transition function is a 128-bit linear congruential generator
+(LCG), which consists of multiplying the 128-bit state with a fixed
+multiplication constant and then adding a user-chosen increment, in 128-bit
+modular arithmetic. LCGs are well-analyzed PRNGs with known weaknesses, though
+128-bit LCGs are large enough to pass stringent statistical tests on their own,
+with only the trivial output function. The output function of `PCG64` is
+intended to patch up some of those known weaknesses by doing "just enough"
+scrambling of the bits to assist in the statistical properties without adding
+too much computational cost.
+
+One of these known weaknesses is that advancing the state of the LCG by steps
+numbering a power of two (``bg.advance(2**N)``) will leave the lower ``N`` bits
+identical to the state that was just left. For a single stream drawn from
+sequentially, this is of little consequence. The remaining :math:`128-N` bits provide
+plenty of pseudorandomness that will be mixed in for any practical ``N`` that can
+be observed in a single stream, which is why one does not need to worry about
+this if you only use a single stream in your application. Similarly, the
+`PCG64.jumped` method uses a carefully chosen number of steps to avoid creating
+these collisions. However, once you start creating "randomly-initialized"
+parallel streams, either using OS entropy by calling `default_rng` repeatedly
+or using `SeedSequence.spawn`, then we need to consider how many lower bits
+need to "collide" in order to create a bad pair of streams, and then evaluate
+the probability of creating such a collision.
+`Empirically <https://github.com/numpy/numpy/issues/16313>`_, it has been
+determined that if one shares the lower 58 bits of state and shares an
+increment, then the pair of streams, when interleaved, will fail 
+`PractRand <http://pracrand.sourceforge.net/>`_ in
+a reasonable amount of time, after drawing a few gigabytes of data. Following
+the standard Birthday Paradox calculations for a collision of 58 bits, we can
+see that we can create :math:`2^{29}`, or about half a billion, streams which is when
+the probability of such a collision becomes high. Half a billion streams is
+quite high, and the amount of data each stream needs to draw before the
+statistical correlations become apparent to even the strict ``PractRand`` tests
+is in the gigabytes. But this is on the horizon for very large applications
+like distributed reinforcement learning. There are reasons to expect that even
+in these applications a collision probably will not have a practical effect in
+the total result, since the statistical problem is constrained to just the
+colliding pair.
+
+Now, let us consider the case when the increment is not constrained to be the
+same. Our implementation of `PCG64` seeds both the state and the increment;
+that is, two calls to `default_rng` (almost certainly) have different states
+and increments. Upon our first release, we believed that having the seeded
+increment would provide a certain amount of extra protection, that one would
+have to be "close" in both the state space and increment space in order to
+observe correlations (``PractRand`` failures) in a pair of streams. If that were
+true, then the "bottleneck" for collisions would be the 128-bit entropy pool
+size inside of `SeedSequence` (and 128-bit collisions are in the
+"preposterously unlikely" category). Unfortunately, this is not true.
+
+One of the known properties of an LCG is that different increments create
+*distinct* streams, but with a known relationship. Each LCG has an orbit that
+traverses all :math:`2^{128}` different 128-bit states. Two LCGs with different
+increments are related in that one can "rotate" the orbit of the first LCG
+(advance it by a number of steps that we can compute from the two increments)
+such that then both LCGs will always then have the same state, up to an
+additive constant and maybe an inversion of the bits. If you then iterate both
+streams in lockstep, then the states will *always* remain related by that same
+additive constant (and the inversion, if present). Recall that `PCG64` is
+constructed from both a transition function (the LCG) and an output function.
+It was expected that the scrambling effect of the output function would have
+been strong enough to make the distinct streams practically independent (i.e.
+"passing the ``PractRand`` tests") unless the two increments were
+pathologically related to each other (e.g. 1 and 3). The output function XSL-RR
+of the then-standard PCG algorithm that we implemented in `PCG64` turns out to
+be too weak to cover up for the 58-bit collision of the underlying LCG that we
+described above. For any given pair of increments, the size of the "colliding"
+space of states is the same, so for this weakness, the extra distinctness
+provided by the increments does not translate into extra protection from
+statistical correlations that ``PractRand`` can detect.
+
+Fortunately, strengthening the output function is able to correct this weakness
+and *does* turn the extra distinctness provided by differing increments into
+additional protection from these low-bit collisions. To the `PCG author's
+credit <https://github.com/numpy/numpy/issues/13635#issuecomment-506088698>`_,
+she had developed a stronger output function in response to related discussions
+during the long birth of the new `BitGenerator` system. We NumPy developers
+chose to be "conservative" and use the XSL-RR variant that had undergone
+a longer period of testing at that time. The DXSM output function adopts
+a "xorshift-multiply" construction used in strong integer hashes that has much
+better avalanche properties than the XSL-RR output function. While there are
+"pathological" pairs of increments that induce "bad" additive constants that
+relate the two streams, the vast majority of pairs induce "good" additive
+constants that make the merely-distinct streams of LCG states into
+practically-independent output streams. Indeed, now the claim we once made
+about `PCG64` is actually true of `PCG64DXSM`: collisions are possible, but
+both streams have to simultaneously be both "close" in the 128 bit state space
+*and* "close" in the 127-bit increment space, so that would be less likely than
+the negligible chance of colliding in the 128-bit internal `SeedSequence` pool.
+The DXSM output function is more computationally intensive than XSL-RR, but
+some optimizations in the LCG more than make up for the performance hit on most
+machines, so `PCG64DXSM` is a good, safe upgrade. There are, of course, an
+infinite number of stronger output functions that one could consider, but most
+will have a greater computational cost, and the DXSM output function has now
+received many CPU cycles of testing via ``PractRand`` at this time.
author	Charles Harris <charlesr.harris@gmail.com>	2021-05-05 11:15:18 -0600
committer	GitHub <noreply@github.com>	2021-05-05 11:15:18 -0600
commit	73a0892f50165b5a98d3eca2967956878d684748 (patch)
tree	22ede3e0db4e91afe3fd26c3afa8091ac7f5cff0 /doc
parent	0f7f313d847e8e36e9fb3c4e576619a3af3bfe56 (diff)
parent	d97c634a04e539de2b66ccd0fbd526fea310571a (diff)
download	numpy-73a0892f50165b5a98d3eca2967956878d684748.tar.gz