diff options
author | Stephan Hoyer <shoyer@gmail.com> | 2018-06-25 17:21:58 -0400 |
---|---|---|
committer | GitHub <noreply@github.com> | 2018-06-25 17:21:58 -0400 |
commit | d262c78a3132de42b1813d7f713acc6f92323c1f (patch) | |
tree | 7a201d10b9089e0dda25e541f40aa405d5eafc11 | |
parent | 464f79eb1d05bf938d16b49da1c39a4e02506fa3 (diff) | |
download | numpy-d262c78a3132de42b1813d7f713acc6f92323c1f.tar.gz |
DOC: major revision of NEP 21, advanced indexing (#11414)
* DOC: major revision of NEP 21, advanced indexing
My focus here has been on clarifying the motivation for these changes,
and clearly defining indexing terminology. I've also made the proposal
a bit more definitive about what changes *should* be made, and have
added myself as a co-author.
* Add link to another PR
-rw-r--r-- | doc/neps/nep-0021-advanced-indexing.rst | 359 |
1 files changed, 240 insertions, 119 deletions
diff --git a/doc/neps/nep-0021-advanced-indexing.rst b/doc/neps/nep-0021-advanced-indexing.rst index 92d40d2eb..0279146be 100644 --- a/doc/neps/nep-0021-advanced-indexing.rst +++ b/doc/neps/nep-0021-advanced-indexing.rst @@ -1,8 +1,9 @@ -========================================================== -Implementing intuitive and full featured advanced indexing -========================================================== +========================================= +Simplified and explicit advanced indexing +========================================= :Author: Sebastian Berg +:Author: Stephan Hoyer <shoyer@google.com> :Status: Draft :Type: Standards Track :Created: 2015-08-27 @@ -11,30 +12,175 @@ Implementing intuitive and full featured advanced indexing Abstract -------- -Advanced indexing with multiple array indices is typically confusing to -both new, and in many cases even old, users of NumPy. To avoid this problem -and allow for more and clearer features, we propose to: +NumPy's "advanced" indexing support for indexing array with other arrays is +one of its most powerful and popular features. Unfortunately, the existing +rules for advanced indexing with multiple array indices are typically confusing +to both new, and in many cases even old, users of NumPy. Here we propose an +overhaul and simplification of advanced indexing, including two new "indexer" +attributes ``oindex`` and ``vindex`` to facilitate explicit indexing. -1. Introduce ``arr.oindex[indices]`` which allows advanced indices, but +Background +---------- + +Existing indexing operations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +NumPy arrays currently support a flexible range of indexing operations: + +- "Basic" indexing involving only slices, integers, ``np.newaxis`` and ellipsis + (``...``), e.g., ``x[0, :3, np.newaxis]`` for selecting the first element + from the 0th axis, the first three elements from the 1st axis and inserting a + new axis of size 1 at the end. Basic indexing always return a view of the + indexed array's data. +- "Advanced" indexing, also called "fancy" indexing, includes all cases where + arrays are indexed by other arrays. Advanced indexing always makes a copy: + + - "Boolean" indexing by boolean arrays, e.g., ``x[x > 0]`` for + selecting positive elements. + - "Vectorized" indexing by one or more integer arrays, e.g., ``x[[0, 1]]`` + for selecting the first two elements along the first axis. With multiple + arrays, vectorized indexing uses broadcasting rules to combine indices along + multiple dimensions. This allows for producing a result of arbitrary shape + with arbitrary elements from the original arrays. + - "Mixed" indexing involving any combinations of the other advancing types. + This is no more powerful than vectorized indexing, but is sometimes more + convenient. + +For clarity, we will refer to these existing rules as "legacy indexing". +This is only a high-level summary; for more details, see NumPy's documentation +and and `Examples` below. + +Outer indexing +~~~~~~~~~~~~~~ + +One broadly useful class of indexing operations is not supported: + +- "Outer" or orthogonal indexing treats one-dimensional arrays equivalently to + slices for determining output shapes. The rule for outer indexing is that the + result should be equivalent to independently indexing along each dimension + with integer or boolean arrays as if both the indexed and indexing arrays + were one-dimensional. This form of indexing is familiar to many users of other + programming languages such as MATLAB, Fortran and R. + +The reason why NumPy omits support for outer indexing is that the rules for +outer and vectorized conflict. Consider indexing a 2D array by two 1D integer +arrays, e.g., ``x[[0, 1], [0, 1]]``: + +- Outer indexing is equivalent to combining multiple integer indices with + ``itertools.product()``. The result in this case is another 2D array with + all combinations of indexed elements, e.g., + ``np.array([[x[0, 0], x[0, 1]], [x[1, 0], x[1, 1]]])`` +- Vectorized indexing is equivalent to combining multiple integer indices with + ``zip()``. The result in this case is a 1D array containing the diagonal + elements, e.g., ``np.array([x[0, 0], x[1, 1]])``. + +This difference is a frequent stumbling block for new NumPy users. The outer +indexing model is easier to understand, and is a natural generalization of +slicing rules. But NumPy instead chose to support vectorized indexing, because +it is strictly more powerful. + +It is always possible to emulate outer indexing by vectorized indexing with +the right indices. To make this easier, NumPy includes utility objects and +functions such as ``np.ogrid`` and ``np.ix_``, e.g., +``x[np.ix_([0, 1], [0, 1])]``. However, there are no utilities for emulating +fully general/mixed outer indexing, which could unambiguously allow for slices, +integers, and 1D boolean and integer arrays. + +Mixed indexing +~~~~~~~~~~~~~~ + +NumPy's existing rules for combining multiple types of indexing in the same +operation are quite complex, involving a number of edge cases. + +One reason why mixed indexing is particularly confusing is that at first glance +the result works deceptively like outer indexing. Returning to our example of a +2D array, both ``x[:2, [0, 1]]`` and ``x[[0, 1], :2]`` return 2D arrays with +axes in the same order as the original array. + +However, as soon as two or more non-slice objects (including integers) are +introduced, vectorized indexing rules apply. The axes introduced by the array +indices are at the front, unless all array indices are consecutive, in which +case NumPy deduces where the user "expects" them to be. Consider indexing a 3D +array ``arr`` with shape ``(X, Y, Z)``: + +1. ``arr[:, [0, 1], 0]`` has shape ``(X, 2)``. +2. ``arr[[0, 1], 0, :]`` has shape ``(2, Z)``. +3. ``arr[0, :, [0, 1]]`` has shape ``(2, Y)``, not ``(Y, 2)``! + +These first two cases are intuitive and consistent with outer indexing, but +this last case is quite surprising, even to many higly experienced NumPy users. + +Mixed cases involving multiple array indices are also surprising, and only +less problematic because the current behavior is so useless that it is rarely +encountered in practice. When a boolean array index is mixed with another boolean or +integer array, boolean array is converted to integer array indices (equivalent +to ``np.nonzero()``) and then broadcast. For example, indexing a 2D array of +size ``(2, 2)`` like ``x[[True, False], [True, False]]`` produces a 1D vector +with shape ``(1,)``, not a 2D sub-matrix with shape ``(1, 1)``. + +Mixed indexing seems so tricky that it is tempting to say that it never should +be used. However, it is not easy to avoid, because NumPy implicitly adds full +slices if there are fewer indices than the full dimensionality of the indexed +array. This means that indexing a 2D array like `x[[0, 1]]`` is equivalent to +``x[[0, 1], :]``. These cases are not surprising, but they constrain the +behavior of mixed indexing. + +Indexing in other Python array libraries +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Indexing is a useful and widely recognized mechanism for accessing +multi-dimensional array data, so it is no surprise that many other libraries in +the scientific Python ecosystem also support array indexing. + +Unfortunately, the full complexity of NumPy's indexing rules mean that it is +both challenging and undesirable for other libraries to copy its behavior in all +of its nuance. The only full implementation of NumPy-style indexing is NumPy +itself. This includes projects like dask.array and h5py, which support *most* +types of array indexing in some form, and otherwise attempt to copy NumPy's API +exactly. + +Vectorized indexing in particular can be challenging to implement with array +storage backends not based on NumPy. In contrast, indexing by 1D arrays along +at least one dimension in the style of outer indexing is much more acheivable. +This has led many libraries (including dask and h5py) to attempt to define a +safe subset of NumPy-style indexing that is equivalent to outer indexing, e.g., +by only allowing indexing with an array along at most one dimension. However, +this is quite challenging to do correctly in a general enough way to be useful. +For example, the current versions of dask and h5py both handle mixed indexing +in case 3 above inconsistently with NumPy. This is quite likely to lead to +bugs. + +These inconsistencies, in addition to the broader challenge of implementing +every type of indexing logic, make it challenging to write high-level array +libraries like xarray or dask.array that can interchangeably index many types of +array storage. In contrast, explicit APIs for outer and vectorized indexing in +NumPy would provide a model that external libraries could reliably emulate, even +if they don't support every type of indexing. + +High level changes +------------------ + +Inspired by multiple "indexer" attributes for controlling different types +of indexing behavior in pandas, we propose to: + +1. Introduce ``arr.oindex[indices]`` which allows array indices, but uses outer indexing logic. 2. Introduce ``arr.vindex[indices]`` which use the current "vectorized"/broadcasted logic but with two differences from - fancy indexing: + legacy indexing: - 1. Boolean indices always use the outer indexing logic. - (Multi dimensional booleans should be allowed). - 2. The integer index result dimensions are always the first axes - of the result array. No transpose is done, even for a single - integer array index. - -3. Plain indexing on the array will only give warnings and eventually - errors either: - - * when there is ambiguity between legacy fancy and outer indexing - (note that ``arr[[1, 2], :, 0]`` is such a case, an integer - can be the "second" integer index array), - * when any integer index array is present (possibly additional for - more then one boolean index array). + * Boolean indices are not supported. All indices must be integers, + integer arrays or slices. + * The integer index result dimensions are always the first axes + of the result array. No transpose is done, even for a single + integer array index. + +3. Plain indexing on arrays will start to give warnings and eventually + errors in cases where one of the explicit indexers should be preferred: + + * First, in all cases where legacy and outer indexing would give + different results. + * Later, potentially in all cases involving an integer array. These constraints are sufficient for making indexing generally consistent with expectations and providing a less surprising learning curve with @@ -53,38 +199,6 @@ anyone not intimately familiar with advanced indexing. Detailed Description -------------------- -Old style advanced indexing with multiple array (boolean or integer) indices, -also called "fancy indexing", tends to be very confusing for new users. -While fancy (or legacy) indexing is useful in many cases one would naively -assume that the result of multiple 1-d ranges is analogous to multiple -slices along each dimension (also called "outer indexing"). - -However, legacy fancy indexing with multiple arrays broadcasts these arrays -into a single index over multiple dimensions. There are three main points -of confusion when multiple array indices are involved: - -1. Most new users will usually expect outer indexing (consistent with - slicing). This is also the most common way of handling this in other - packages or languages. -2. The axes introduced by the array indices are at the front, unless - all array indices are consecutive, in which case one can deduce where - the user "expects" them to be: - - * `arr[:, [0, 1], :, [0, 1]]` will have the first dimension shaped 2. - * `arr[:, [0, 1], [0, 1]]` will have the second dimension shaped 2. - -3. When a boolean array index is mixed with another boolean or integer - array, the result is very hard to understand (the boolean array is - converted to integer array indices and then broadcast), and hardly - useful. - There is no well defined broadcast for booleans, so that boolean - indices are logically always "outer" type indices. - - -Furthermore, since other packages use outer type logic, enabling its -use in NumPy, will allow simpler cross package code. - - Proposed rules ~~~~~~~~~~~~~~ @@ -103,29 +217,33 @@ be deduced: no transposing should be done. The axes created by the integer array indices are always inserted at the front, even for a single index. -4. Boolean indexing is conceptionally outer indexing. A broadcasting +4. Boolean indexing is conceptionally outer indexing. Broadcasting together with other advanced indices in the manner of legacy - "fancy indexing" is generally not helpful or well defined. + indexing is generally not helpful or well defined. A user who wishes the "``nonzero``" plus broadcast behaviour can thus - be expected to do this manually. Thus, ``vindex`` would still use - outer type indexing for boolean index arrays. - Note that using this rule a single boolean index can still index into - multiple dimensions at once. + be expected to do this manually. Thus, ``vindex`` does not need to + support boolean index arrays. -5. An ``arr.lindex`` or ``arr.findex`` should likely be implemented to allow - legacy fancy indexing indefinetly. This also gives a simple way to - update fancy indexing code making deprecations to plain indexing - easier. +5. An ``arr.legacy_index`` attribute should be implemented to support + legacy indexing. This gives a simple way to update existing codebases + using legacy indexing, which will make the deprecation of plain indexing + behavior easier. The longer name ``legacy_index`` is intentionally chosen + to be explicit and discourage its use in new code. 6. Plain indexing ``arr[...]`` should return an error for ambiguous cases. For the beginning, this probably means cases where ``arr[ind]`` and ``arr.oindex[ind]`` return different results give deprecation warnings. + This includes every use of vectorized indexing with multiple integer arrays. Due to the transposing behaviour, this means that``arr[0, :, index_arr]`` - will be deprecated, but ``arr[:, 0, index_arr]`` will not for the time - being. + will be deprecated, but ``arr[:, 0, index_arr]`` will not for the time being. + +7. To ensure that existing subclasses of `ndarray` that override indexing + do not inadvertently revert to default behavior for indexing attributes, + these attribute should have explicit checks that disable them if + ``__getitem__`` or ``__setitem__`` has been overriden. Unlike plain indexing, the new indexing attributes are explicitly aimed -at higher dimensional indexing, two more changes should be implemented: +at higher dimensional indexing, several additional changes should be implemented: * The indexing attributes will enforce exact dimension and indexing match. This means that no implicit ellipsis (``...``) will be added. Unless @@ -133,6 +251,9 @@ at higher dimensional indexing, two more changes should be implemented: an array with a specific number of dimensions. This makes the expression more explicit and safeguards against wrong dimensionality of arrays. + There should be no implications for "duck typing" compatibility with + builtin Python sequences, because Python sequences only support a limited + form of "basic indexing" with integers and slices. * The current plain indexing allows for the use of non-tuples for multi-dimensional indexing such as ``arr[[slice(None), 2]]``. @@ -141,22 +262,17 @@ at higher dimensional indexing, two more changes should be implemented: (Whether or not this should be the case for plain indexing is a different issue.) -* The new attributes should not do the getitem to implement setitem - branch, since it is a cludge and not useful for vectorized +* The new attributes should not use getitem to implement setitem, + since it is a cludge and not useful for vectorized indexing. (not implemented yet) Open Questions ~~~~~~~~~~~~~~ -* The names ``oindex`` and ``vindex`` are just suggestions at the time of - writing this, another name NumPy has used for something like ``oindex`` - is ``np.ix_``. See also below. - -* It would be possible to limit the use of boolean indices in ``vindex``, - assuming that they are rare and to some degree special. Alternatively, - boolean indices could be broadcasted as in legacy fancy indexing in - principle. +* The names ``oindex``, ``vindex`` and ``legacy_index`` are just suggestions at + the time of writing this, another name NumPy has used for something like + ``oindex`` is ``np.ix_``. See also below. * ``oindex`` and ``vindex`` could always return copies, even when no array operation occurs. One argument for allowing a view return is that this way @@ -175,25 +291,17 @@ Open Questions * The proposed changes to plain indexing could be postponed indefinitely or not taken in order to not break or force major fixes to existing code bases. -* Should there be/is it possible to have a mechanism to not allow the new - indexing attributes for subclasses unless specifically implemented? - -* Should we try to warn users about the special indexing attributes - not being implemented? If a subclass has its own indexing, inheriting - it from ndarray should be wrong. - - Alternative Names ~~~~~~~~~~~~~~~~~ Possible names suggested (more suggestions will be added). -============== ======== ======= ============ -**Orthogonal** oindex oix -**Vectorized** vindex vix -**Legacy** l/findex legacy_index -============== ======== ======= ============ +============== ============ ======== +**Orthogonal** oindex oix +**Vectorized** vindex vix +**Legacy** legacy_index l/findex +============== ============ ======== Subclasses @@ -205,10 +313,13 @@ provide ``__getitem__`` or ``__setitem__``) the special attributes should just work. Subclasses that *do* provide it must be updated accordingly and should preferably not subclass working versions of these attributes. -All subclasses will inherit the attributes, however, it seems possible -to test ``subclass.__getitem__.__classobj__`` when getting i.e. -``subclass.vindex``. If this is not ``ndarray``, the subclass has special -handling and an ``AttributeError`` can be given. +All subclasses will inherit the attributes, however, the implementation +of ``__getitem__`` on these attributes should test +``subclass.__getitem__ is ndarray.__getitem__``. If not, the +subclass has special handling for indexing and ``NotImplementedError`` +should be raised, requiring that the indexing attributes is also explicitly +overwritten. Likewise, implementations of ``__setitem__`` should check to see +if ``__setitem__`` is overriden. A further question is how to facilitate implementing the special attributes. Also there is the weird functionality where ``__setitem__`` calls @@ -218,28 +329,17 @@ confusing. To facilitate implementations we could provide functions similar to ``operator.itemgetter`` and ``operator.setitem`` for the attributes. -Possibly a MixIn could be provided to help implementation, but further -ideas on this are necessary. - -Related Operation -~~~~~~~~~~~~~~~~~ - -There exist a further indexing or indexing like method. That is the -inverse of a command such as ``np.argmin(arr, axis=axis)``, to pick -the specific elements *along* an axis given an array of (at least -typically) the same size. - -These function are added to NumPy in versoin 15 as `take_along_axis` and -`put_along_axis`. - +Possibly a mixin could be provided to help implementation. These improvements +are not essential to the initial implementation, so they are saved for +future work. Implementation -------------- -Implementation of a special indexing object available through -``arr.oindex``, ``arr.vindex``, and ``arr.lindex`` to allow these indexing -operations. Also starting to deprecate those plain index operations -which are not ambiguous. +Implementation would start with writing special indexing objects available +through ``arr.oindex``, ``arr.vindex``, and ``arr.legacy_index`` to allow these +indexing operations. Also, we would need to start to deprecate those plain index +operations which are not ambiguous. Furthermore, the NumPy code base will need to use the new attributes and tests will have to be adapted. @@ -247,7 +347,10 @@ tests will have to be adapted. Backward compatibility ---------------------- -As a new feature, no backward compatibility issues would arise. +As a new feature, no backward compatibility issues with the new ``vindex`` +and ``oindex`` attributes would arise. To facilitate backwards compatibility +as much as possible, we expect a long deprecation cycle for legacy indexing +behavior and propose the new ``legacy_index`` attribute. Some forward compatibility issues with subclasses that do not specifically implement the new methods may arise. @@ -256,20 +359,37 @@ Alternatives ------------ NumPy may not choose to offer these different type of indexing methods, or -choose to only offer them through specific function instead of the proposed +choose to only offer them through specific functions instead of the proposed notation above. -For variations see also the open questions section above. +We don't think that new functions are a good alternative, because indexing +notation ``[]`` offer some syntactic advantages in Python (i.e., direct +creation of slice objects) compared to functions. + +A more reasonable alternative would be write new wrapper objects for alternative +indexing with functions rather than methods (e.g., ``np.oindex(arr)[indices]`` +instead of ``arr.oindex[indices]``). Functionally, this would be equivalent, +but indexing is such a common operation that we think it is important to +minimize syntax and worth implementing it directly on `ndarray` objects +themselves. Indexing attributes also define a clear interface that is easier +for alternative array implementations to copy, nonwithstanding ongoing +efforts to make it easier to override NumPy functions [2]_. Discussion ---------- -Some discussion can be found on the pull request: +The original discussion about vectorized vs outer/orthogonal indexing arose +on the NumPy mailing list: + + * https://mail.python.org/pipermail/numpy-discussion/2015-April/072550.html + +Some discussion can be found on the original pull request for this NEP: * https://github.com/numpy/numpy/pull/6256 -A python implementation of the indexing operations can be found at: +Python implementations of the indexing operations can be found at: + * https://github.com/numpy/numpy/pull/5749 * https://gist.github.com/shoyer/c700193625347eb68fee4d1f0dc8c0c8 @@ -292,7 +412,7 @@ Example array:: Legacy fancy indexing --------------------- -Note that the same result can be achieved with ``arr.lindex``, but the +Note that the same result can be achieved with ``arr.legacy_index``, but the "future error" will still work in this case. Single index is transposed (this is the same for all indexing types):: @@ -534,4 +654,5 @@ References and Footnotes with this work has waived all copyright and related or neighboring rights to this work. The CC0 license may be found at https://creativecommons.org/publicdomain/zero/1.0/ - +.. [2] e.g., see NEP 18, + http://www.numpy.org/neps/nep-0018-array-function-protocol.html |