author | Ralf Gommers <ralf.gommers@gmail.com> | 2019-10-31 10:19:37 +0100 |
---|---|---|
committer | GitHub <noreply@github.com> | 2019-10-31 10:19:37 +0100 |
commit | ed7a0771803233665e9c0f072528b564e7489ead (patch) | |
tree | cf6ec49681adf1638cf313c1d55696cde6fe990f /doc/neps | |
parent | c03ce145cc991e1873b1595a128825992300d97a (diff) | |
parent | 1a7c11ee6b4297eb146295918abc86d9151930fc (diff) | |
Merge pull request #14674 from mattip/nep-0034
NEP: add default-dtype-object-deprecation nep 34
Diffstat (limited to 'doc/neps')
-rw-r--r-- | doc/neps/nep-0034.rst | 141 |
1 files changed, 141 insertions, 0 deletions
diff --git a/doc/neps/nep-0034.rst b/doc/neps/nep-0034.rst
new file mode 100644
index 000000000..d9a9c62f2
--- /dev/null
+++ b/doc/neps/nep-0034.rst
@@ -0,0 +1,141 @@
+===========================================================
+NEP 34 — Disallow inferring ``dtype=object`` from sequences
+===========================================================
+
+:Author: Matti Picus
+:Status: Draft
+:Type: Standards Track
+:Created: 2019-10-10
+
+
+Abstract
+--------
+
+When users create arrays with sequences-of-sequences, they sometimes err in
+matching the lengths of the nested sequences_, commonly called "ragged
+arrays". Here we will refer to them as ragged nested sequences. Creating such
+arrays via ``np.array([<ragged_nested_sequence>])`` with no ``dtype`` keyword
+argument will today default to an ``object``-dtype array. Change the behaviour to
+raise a ``ValueError`` instead.
+
+Motivation and Scope
+--------------------
+
+Users who specify lists-of-lists when creating a `numpy.ndarray` via
+``np.array`` may mistakenly pass in lists of different lengths. Currently we
+accept this input and automatically create an array with ``dtype=object``. This
+can be confusing, since it is rarely what is desired. Changing the automatic
+dtype detection to never return ``object`` for ragged nested sequences (defined as a
+recursive sequence of sequences, where not all the sequences on the same
+level have the same length) will force users who actually wish to create
+``object`` arrays to specify that explicitly. Note that ``lists``, ``tuples``,
+and ``np.ndarrays`` are all sequences [0]_. See for instance `issue 5303`_.
+
+Usage and Impact
+----------------
+
+After this change, array creation with ragged nested sequences must explicitly
+define a dtype:
+
+    >>> np.array([[1, 2], [1]])
+    ValueError: cannot guess the desired dtype from the input
+
+    >>> np.array([[1, 2], [1]], dtype=object)
+    # succeeds, with no change from current behaviour
+
+The deprecation will affect any call that internally calls ``np.asarray``. For
+instance, the ``assert_equal`` family of functions calls ``np.asarray``, so
+users will have to change code like::
+
+    np.testing.assert_equal(a, [[1, 2], 3])
+
+to::
+
+    np.testing.assert_equal(a, np.array([[1, 2], 3], dtype=object))
+
+Detailed description
+--------------------
+
+To explicitly set the shape of the object array, since it is sometimes hard to
+determine what shape is desired, one could use:
+
+    >>> arr = np.empty(correct_shape, dtype=object)
+    >>> arr[...] = values
+
+We will also reject mixed sequences of non-sequence and sequence, for instance
+all of these will be rejected:
+
+    >>> arr = np.array([np.arange(10), [10]])
+    >>> arr = np.array([[range(3), range(3), range(3)], [range(3), 0, 0]])
+
+Related Work
+------------
+
+`PR 14341`_ tried to raise an error when ragged nested sequences were specified
+with a numeric dtype ``np.array([[1], [2, 3]], dtype=int)`` but failed due to
+false-positives, for instance ``np.array([1, np.array([5])], dtype=int)``.
+
+.. _`PR 14341`: https://github.com/numpy/numpy/pull/14341
+
+Implementation
+--------------
+
+The code to be changed is inside ``PyArray_GetArrayParamsFromObject`` and the
+internal ``discover_dimensions`` function. See `PR 14794`_.
+
+Backward compatibility
+----------------------
+
+Anyone depending on creating object arrays from ragged nested sequences will
+need to modify their code. There will be a deprecation period during which the
+current behaviour will emit a ``DeprecationWarning``.
+
+Alternatives
+------------
+
+- We could continue with the current situation.
+
+- It was also suggested to add a kwarg ``depth`` to array creation, or perhaps
+  to add another array creation API function ``ragged_array_object``. The goal
+  was to eliminate the ambiguity in creating an object array from ``array([[1,
+  2], [1]], dtype=object)``: should the returned array have a shape of
+  ``(1,)``, or ``(2,)``? This NEP does not deal with that issue, and only
+  deprecates the use of ``array`` with no ``dtype=object`` for ragged nested
+  sequences. Users of ragged nested sequences may face another deprecation
+  cycle in the future. Rationale: we expect that there are very few users who
+  intend to use ragged arrays like that, this was never intended as a use case
+  of NumPy arrays. Users are likely better off with `another library`_ or just
+  using list of lists.
+
+- It was also suggested to deprecate all automatic creation of ``object``-dtype
+  arrays, which would require adding an explicit ``dtype=object`` for something
+  like ``np.array([Decimal(10), Decimal(10)])``. This too is out of scope for
+  the current NEP. Rationale: it's harder to assess the impact of this larger
+  change, we're not sure how many users this may impact.
+
+Discussion
+----------
+
+Comments to `issue 5303`_ indicate this is unintended behaviour as far back as
+2014. Suggestions to change it have been made in the ensuing years, but none
+have stuck. The WIP implementation in `PR 14794`_ seems to point to the
+viability of this approach.
+
+References and Footnotes
+------------------------
+
+.. _`issue 5303`: https://github.com/numpy/numpy/issues/5303
+.. _sequences: https://docs.python.org/3.7/glossary.html#term-sequence
+.. _`PR 14794`: https://github.com/numpy/numpy/pull/14794
+.. _`another library`: https://github.com/scikit-hep/awkward-array
+
+.. [0] ``np.ndarrays`` are not recursed into, rather their shape is used
+   directly. This will not emit warnings::
+
+      ragged = np.array([[1], [1, 2, 3]], dtype=object)
+      np.array([ragged, ragged]) # no dtype needed
+
+Copyright
+---------
+
+This document has been placed in the public domain.
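
The following is a minimal sketch, not part of the commit above, of the two explicit ways the NEP text describes for building an object array from a ragged nested sequence: passing ``dtype=object`` to ``np.array``, and pre-allocating with ``np.empty`` before filling. The helper name ``object_array_1d`` is hypothetical and introduced only for illustration. It fills element by element rather than with ``arr[...] = values``, on the assumption that this keeps the result one-dimensional even when the inner sequences happen to have equal lengths.

```python
import numpy as np


def object_array_1d(items):
    """Hypothetical helper: store each item of ``items`` as one array element."""
    arr = np.empty(len(items), dtype=object)
    for i, item in enumerate(items):
        # Integer indexing on an object array stores the item itself,
        # so no dtype or shape inference happens here.
        arr[i] = item
    return arr


ragged = [[1, 2], [1]]

# Explicit dtype: accepted both before and after the proposed change.
explicit = np.array(ragged, dtype=object)

# Pre-allocated variant: the caller fixes the shape up front.
filled = object_array_1d(ragged)
```

Both forms avoid relying on automatic ``object`` inference. The pre-allocated form may be preferable when the nesting depth is ambiguous, since ``np.array([[1, 2], [3, 4]], dtype=object)`` yields a ``(2, 2)`` array while ``object_array_1d([[1, 2], [3, 4]])`` yields a ``(2,)`` array of lists.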