DOC: major revision of NEP 21, advanced indexing (#11414)

* DOC: major revision of NEP 21, advanced indexing My focus here has been on clarifying the motivation for these changes, and clearly defining indexing terminology. I've also made the proposal a bit more definitive about what changes *should* be made, and have added myself as a co-author. * Add link to another PR
author: Stephan Hoyer <shoyer@gmail.com> 2018-06-25 17:21:58 -0400
committer: GitHub <noreply@github.com> 2018-06-25 17:21:58 -0400
commit: d262c78a3132de42b1813d7f713acc6f92323c1f (patch)
tree: 7a201d10b9089e0dda25e541f40aa405d5eafc11
parent: 464f79eb1d05bf938d16b49da1c39a4e02506fa3 (diff)
download: numpy-d262c78a3132de42b1813d7f713acc6f92323c1f.tar.gz
1 files changed, 240 insertions, 119 deletions
diff --git a/doc/neps/nep-0021-advanced-indexing.rst b/doc/neps/nep-0021-advanced-indexing.rst
index 92d40d2eb..0279146be 100644
--- a/doc/neps/nep-0021-advanced-indexing.rst
+++ b/doc/neps/nep-0021-advanced-indexing.rst
@@ -1,8 +1,9 @@
-==========================================================
-Implementing intuitive and full featured advanced indexing
-==========================================================
+=========================================
+Simplified and explicit advanced indexing
+=========================================
 
 :Author: Sebastian Berg
+:Author: Stephan Hoyer <shoyer@google.com>
 :Status: Draft
 :Type: Standards Track
 :Created: 2015-08-27
@@ -11,30 +12,175 @@ Implementing intuitive and full featured advanced indexing
 Abstract
 --------
 
-Advanced indexing with multiple array indices is typically confusing to
-both new, and in many cases even old, users of NumPy. To avoid this problem
-and allow for more and clearer features, we propose to:
+NumPy's "advanced" indexing support for indexing array with other arrays is
+one of its most powerful and popular features. Unfortunately, the existing
+rules for advanced indexing with multiple array indices are typically confusing
+to both new, and in many cases even old, users of NumPy. Here we propose an
+overhaul and simplification of advanced indexing, including two new "indexer"
+attributes ``oindex`` and ``vindex`` to facilitate explicit indexing.
 
-1. Introduce ``arr.oindex[indices]`` which allows advanced indices, but
+Background
+----------
+
+Existing indexing operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+NumPy arrays currently support a flexible range of indexing operations:
+
+- "Basic" indexing involving only slices, integers, ``np.newaxis`` and ellipsis
+  (``...``), e.g., ``x[0, :3, np.newaxis]`` for selecting the first element
+  from the 0th axis, the first three elements from the 1st axis and inserting a
+  new axis of size 1 at the end. Basic indexing always return a view of the
+  indexed array's data.
+- "Advanced" indexing, also called "fancy" indexing, includes all cases where
+  arrays are indexed by other arrays. Advanced indexing always makes a copy:
+
+  - "Boolean" indexing by boolean arrays, e.g., ``x[x > 0]`` for
+    selecting positive elements.
+  - "Vectorized" indexing by one or more integer arrays, e.g., ``x[[0, 1]]``
+    for selecting the first two elements along the first axis. With multiple
+    arrays, vectorized indexing uses broadcasting rules to combine indices along
+    multiple dimensions. This allows for producing a result of arbitrary shape
+    with arbitrary elements from the original arrays.
+  - "Mixed" indexing involving any combinations of the other advancing types.
+    This is no more powerful than vectorized indexing, but is sometimes more
+    convenient.
+
+For clarity, we will refer to these existing rules as "legacy indexing".
+This is only a high-level summary; for more details, see NumPy's documentation
+and and `Examples` below.
+
+Outer indexing
+~~~~~~~~~~~~~~
+
+One broadly useful class of indexing operations is not supported:
+
+- "Outer" or orthogonal indexing treats one-dimensional arrays equivalently to
+  slices for determining output shapes. The rule for outer indexing is that the
+  result should be equivalent to independently indexing along each dimension
+  with integer or boolean arrays as if both the indexed and indexing arrays
+  were one-dimensional. This form of indexing is familiar to many users of other
+  programming languages such as MATLAB, Fortran and R.
+
+The reason why NumPy omits support for outer indexing is that the rules for
+outer and vectorized conflict. Consider indexing a 2D array by two 1D integer
+arrays, e.g., ``x[[0, 1], [0, 1]]``:
+
+- Outer indexing is equivalent to combining multiple integer indices with
+  ``itertools.product()``. The result in this case is another 2D array with
+  all combinations of indexed elements, e.g.,
+  ``np.array([[x[0, 0], x[0, 1]], [x[1, 0], x[1, 1]]])``
+- Vectorized indexing is equivalent to combining multiple integer indices with
+  ``zip()``. The result in this case is a 1D array containing the diagonal
+  elements, e.g., ``np.array([x[0, 0], x[1, 1]])``.
+
+This difference is a frequent stumbling block for new NumPy users. The outer
+indexing model is easier to understand, and is a natural generalization of
+slicing rules. But NumPy instead chose to support vectorized indexing, because
+it is strictly more powerful.
+
+It is always possible to emulate outer indexing by vectorized indexing with
+the right indices. To make this easier, NumPy includes utility objects and
+functions such as ``np.ogrid`` and ``np.ix_``, e.g.,
+``x[np.ix_([0, 1], [0, 1])]``. However, there are no utilities for emulating
+fully general/mixed outer indexing, which could unambiguously allow for slices,
+integers, and 1D boolean and integer arrays.
+
+Mixed indexing
+~~~~~~~~~~~~~~
+
+NumPy's existing rules for combining multiple types of indexing in the same
+operation are quite complex, involving a number of edge cases.
+
+One reason why mixed indexing is particularly confusing is that at first glance
+the result works deceptively like outer indexing. Returning to our example of a
+2D array, both ``x[:2, [0, 1]]`` and ``x[[0, 1], :2]`` return 2D arrays with
+axes in the same order as the original array.
+
+However, as soon as two or more non-slice objects (including integers) are
+introduced, vectorized indexing rules apply. The axes introduced by the array
+indices are at the front, unless all array indices are consecutive, in which
+case NumPy deduces where the user "expects" them to be. Consider indexing a 3D
+array ``arr`` with shape ``(X, Y, Z)``:
+
+1. ``arr[:, [0, 1], 0]`` has shape ``(X, 2)``.
+2. ``arr[[0, 1], 0, :]`` has shape ``(2, Z)``.
+3. ``arr[0, :, [0, 1]]`` has shape ``(2, Y)``, not ``(Y, 2)``!
+
+These first two cases are intuitive and consistent with outer indexing, but
+this last case is quite surprising, even to many higly experienced NumPy users.
+
+Mixed cases involving multiple array indices are also surprising, and only
+less problematic because the current behavior is so useless that it is rarely
+encountered in practice. When a boolean array index is mixed with another boolean or
+integer array, boolean array is converted to integer array indices (equivalent
+to ``np.nonzero()``) and then broadcast. For example, indexing a 2D array of
+size ``(2, 2)`` like ``x[[True, False], [True, False]]`` produces a 1D vector
+with shape ``(1,)``, not a 2D sub-matrix with shape ``(1, 1)``.
+
+Mixed indexing seems so tricky that it is tempting to say that it never should
+be used. However, it is not easy to avoid, because NumPy implicitly adds full
+slices if there are fewer indices than the full dimensionality of the indexed
+array. This means that indexing a 2D array like `x[[0, 1]]`` is equivalent to
+``x[[0, 1], :]``. These cases are not surprising, but they constrain the
+behavior of mixed indexing.
+
+Indexing in other Python array libraries
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Indexing is a useful and widely recognized mechanism for accessing
+multi-dimensional array data, so it is no surprise that many other libraries in
+the scientific Python ecosystem also support array indexing.
+
+Unfortunately, the full complexity of NumPy's indexing rules mean that it is
+both challenging and undesirable for other libraries to copy its behavior in all
+of its nuance. The only full implementation of NumPy-style indexing is NumPy
+itself. This includes projects like dask.array and h5py, which support *most*
+types of array indexing in some form, and otherwise attempt to copy NumPy's API
+exactly.
+
+Vectorized indexing in particular can be challenging to implement with array
+storage backends not based on NumPy. In contrast, indexing by 1D arrays along
+at least one dimension in the style of outer indexing is much more acheivable.
+This has led many libraries (including dask and h5py) to attempt to define a
+safe subset of NumPy-style indexing that is equivalent to outer indexing, e.g.,
+by only allowing indexing with an array along at most one dimension. However,
+this is quite challenging to do correctly in a general enough way to be useful.
+For example, the current versions of dask and h5py both handle mixed indexing
+in case 3 above inconsistently with NumPy. This is quite likely to lead to
+bugs.
+
+These inconsistencies, in addition to the broader challenge of implementing
+every type of indexing logic, make it challenging to write high-level array
+libraries like xarray or dask.array that can interchangeably index many types of
+array storage. In contrast, explicit APIs for outer and vectorized indexing in
+NumPy would provide a model that external libraries could reliably emulate, even
+if they don't support every type of indexing.
+
+High level changes
+------------------
+
+Inspired by multiple "indexer" attributes for controlling different types
+of indexing behavior in pandas, we propose to:
+
+1. Introduce ``arr.oindex[indices]`` which allows array indices, but
    uses outer indexing logic.
 2. Introduce ``arr.vindex[indices]`` which use the current
    "vectorized"/broadcasted logic but with two differences from
-   fancy indexing:
+   legacy indexing:
        
-   1. Boolean indices always use the outer indexing logic.
-      (Multi dimensional booleans should be allowed).
-   2. The integer index result dimensions are always the first axes
-      of the result array. No transpose is done, even for a single
-      integer array index.
-
-3. Plain indexing on the array will only give warnings and eventually
-   errors either:
-     
-   * when there is ambiguity between legacy fancy and outer indexing
-     (note that ``arr[[1, 2], :, 0]`` is such a case, an integer
-     can be the "second" integer index array),
-   * when any integer index array is present (possibly additional for
-     more then one boolean index array).
+   * Boolean indices are not supported. All indices must be integers,
+     integer arrays or slices.
+   * The integer index result dimensions are always the first axes
+     of the result array. No transpose is done, even for a single
+     integer array index.
+
+3. Plain indexing on arrays will start to give warnings and eventually
+   errors in cases where one of the explicit indexers should be preferred:
+
+   * First, in all cases where legacy and outer indexing would give
+     different results.
+   * Later, potentially in all cases involving an integer array.
 
 These constraints are sufficient for making indexing generally consistent
 with expectations and providing a less surprising learning curve with
@@ -53,38 +199,6 @@ anyone not intimately familiar with advanced indexing.
 Detailed Description
 --------------------
 
-Old style advanced indexing with multiple array (boolean or integer) indices,
-also called "fancy indexing", tends to be very confusing for new users.
-While fancy (or legacy) indexing is useful in many cases one would naively
-assume that the result of multiple 1-d ranges is analogous to multiple
-slices along each dimension (also called "outer indexing").
-
-However, legacy fancy indexing with multiple arrays broadcasts these arrays
-into a single index over multiple dimensions. There are three main points
-of confusion when multiple array indices are involved:
-
-1. Most new users will usually expect outer indexing (consistent with
-   slicing). This is also the most common way of handling this in other
-   packages or languages.
-2. The axes introduced by the array indices are at the front, unless
-   all array indices are consecutive, in which case one can deduce where
-   the user "expects" them to be:
-
-   * `arr[:, [0, 1], :, [0, 1]]` will have the first dimension shaped 2.
-   * `arr[:, [0, 1], [0, 1]]` will have the second dimension shaped 2.
-
-3. When a boolean array index is mixed with another boolean or integer
-   array, the result is very hard to understand (the boolean array is
-   converted to integer array indices and then broadcast), and hardly
-   useful.
-   There is no well defined broadcast for booleans, so that boolean
-   indices are logically always "outer" type indices.
-
-
-Furthermore, since other packages use outer type logic, enabling its
-use in NumPy, will allow simpler cross package code.
-
-
 Proposed rules
 ~~~~~~~~~~~~~~
 
@@ -103,29 +217,33 @@ be deduced:
    no transposing should be done. The axes created by the integer array
    indices are always inserted at the front, even for a single index.
 
-4. Boolean indexing is conceptionally outer indexing. A broadcasting
+4. Boolean indexing is conceptionally outer indexing. Broadcasting
    together with other advanced indices in the manner of legacy
-   "fancy indexing" is generally not helpful or well defined.
+   indexing is generally not helpful or well defined.
    A user who wishes the "``nonzero``" plus broadcast behaviour can thus
-   be expected to do this manually. Thus, ``vindex`` would still use
-   outer type indexing for boolean index arrays.
-   Note that using this rule a single boolean index can still index into
-   multiple dimensions at once.
+   be expected to do this manually. Thus, ``vindex`` does not need to
+   support boolean index arrays.
 
-5. An ``arr.lindex`` or ``arr.findex`` should likely be implemented to allow
-   legacy fancy indexing indefinetly. This also gives a simple way to
-   update fancy indexing code making deprecations to plain indexing
-   easier.
+5. An ``arr.legacy_index`` attribute should be implemented to support
+   legacy indexing. This gives a simple way to update existing codebases
+   using legacy indexing, which will make the deprecation of plain indexing
+   behavior easier. The longer name ``legacy_index`` is intentionally chosen
+   to be explicit and discourage its use in new code.
 
 6. Plain indexing ``arr[...]`` should return an error for ambiguous cases.
    For the beginning, this probably means cases where ``arr[ind]`` and
    ``arr.oindex[ind]`` return different results give deprecation warnings.
+   This includes every use of vectorized indexing with multiple integer arrays.
    Due to the transposing behaviour, this means that``arr[0, :, index_arr]``
-   will be deprecated, but ``arr[:, 0, index_arr]`` will not for the time
-   being.
+   will be deprecated, but ``arr[:, 0, index_arr]`` will not for the time being.
+
+7. To ensure that existing subclasses of `ndarray` that override indexing
+   do not inadvertently revert to default behavior for indexing attributes,
+   these attribute should have explicit checks that disable them if
+   ``__getitem__`` or ``__setitem__`` has been overriden.
 
 Unlike plain indexing, the new indexing attributes are explicitly aimed
-at higher dimensional indexing, two more changes should be implemented:
+at higher dimensional indexing, several additional changes should be implemented:
 
 * The indexing attributes will enforce exact dimension and indexing match.
   This means that no implicit ellipsis (``...``) will be added. Unless
@@ -133,6 +251,9 @@ at higher dimensional indexing, two more changes should be implemented:
   an array with a specific number of dimensions.
   This makes the expression more explicit and safeguards against wrong
   dimensionality of arrays.
+  There should be no implications for "duck typing" compatibility with
+  builtin Python sequences, because Python sequences only support a limited
+  form of "basic indexing" with integers and slices.
 
 * The current plain indexing allows for the use of non-tuples for
   multi-dimensional indexing such as ``arr[[slice(None), 2]]``.
@@ -141,22 +262,17 @@ at higher dimensional indexing, two more changes should be implemented:
   (Whether or not this should be the case for plain indexing is a
   different issue.)
 
-* The new attributes should not do the getitem to implement setitem
-  branch, since it is a cludge and not useful for vectorized
+* The new attributes should not use getitem to implement setitem,
+  since it is a cludge and not useful for vectorized
   indexing. (not implemented yet)
 
 
 Open Questions
 ~~~~~~~~~~~~~~
 
-* The names ``oindex`` and ``vindex`` are just suggestions at the time of
-  writing this, another name NumPy has used for something like ``oindex``
-  is ``np.ix_``. See also below.
-
-* It would be possible to limit the use of boolean indices in ``vindex``,
-  assuming that they are rare and to some degree special. Alternatively,
-  boolean indices could be broadcasted as in legacy fancy indexing in
-  principle.
+* The names ``oindex``, ``vindex`` and ``legacy_index`` are just suggestions at
+  the time of writing this, another name NumPy has used for something like
+  ``oindex`` is ``np.ix_``. See also below.
 
 * ``oindex`` and ``vindex`` could always return copies, even when no array
   operation occurs. One argument for allowing a view return is that this way
@@ -175,25 +291,17 @@ Open Questions
 * The proposed changes to plain indexing could be postponed indefinitely or
   not taken in order to not break or force major fixes to existing code bases.
 
-* Should there be/is it possible to have a mechanism to not allow the new
-  indexing attributes for subclasses unless specifically implemented?
-
-* Should we try to warn users about the special indexing attributes
-  not being implemented? If a subclass has its own indexing, inheriting
-  it from ndarray should be wrong.
- 
-
 
 Alternative Names
 ~~~~~~~~~~~~~~~~~
 
 Possible names suggested (more suggestions will be added).
 
-==============  ======== ======= ============
-**Orthogonal**  oindex   oix
-**Vectorized**  vindex   vix
-**Legacy**      l/findex         legacy_index
-==============  ======== ======= ============
+==============  ============ ========
+**Orthogonal**  oindex       oix
+**Vectorized**  vindex       vix
+**Legacy**      legacy_index l/findex
+==============  ============ ========
 
 
 Subclasses
@@ -205,10 +313,13 @@ provide ``__getitem__`` or ``__setitem__``) the special attributes should
 just work. Subclasses that *do* provide it must be updated accordingly
 and should preferably not subclass working versions of these attributes.
 
-All subclasses will inherit the attributes, however, it seems possible
-to test ``subclass.__getitem__.__classobj__`` when getting i.e.
-``subclass.vindex``. If this is not ``ndarray``, the subclass has special
-handling and an ``AttributeError`` can be given.
+All subclasses will inherit the attributes, however, the implementation
+of ``__getitem__`` on these attributes should test
+``subclass.__getitem__ is ndarray.__getitem__``. If not, the
+subclass has special handling for indexing and ``NotImplementedError``
+should be raised, requiring that the indexing attributes is also explicitly
+overwritten. Likewise, implementations of ``__setitem__`` should check to see
+if ``__setitem__`` is overriden.
 
 A further question is how to facilitate implementing the special attributes.
 Also there is the weird functionality where ``__setitem__`` calls
@@ -218,28 +329,17 @@ confusing.
 
 To facilitate implementations we could provide functions similar to
 ``operator.itemgetter`` and ``operator.setitem`` for the attributes.
-Possibly a MixIn could be provided to help implementation, but further
-ideas on this are necessary.
-
-Related Operation
-~~~~~~~~~~~~~~~~~
-
-There exist a further indexing or indexing like method. That is the
-inverse of a command such as ``np.argmin(arr, axis=axis)``, to pick
-the specific elements *along* an axis given an array of (at least
-typically) the same size.
-
-These function are added to NumPy in versoin 15 as `take_along_axis` and
-`put_along_axis`.
-
+Possibly a mixin could be provided to help implementation. These improvements
+are not essential to the initial implementation, so they are saved for
+future work.
 
 Implementation
 --------------
 
-Implementation of a special indexing object available through
-``arr.oindex``, ``arr.vindex``, and ``arr.lindex`` to allow these indexing
-operations. Also starting to deprecate those plain index operations
-which are not ambiguous.
+Implementation would start with writing special indexing objects available
+through ``arr.oindex``, ``arr.vindex``, and ``arr.legacy_index`` to allow these
+indexing operations. Also, we would need to start to deprecate those plain index
+operations which are not ambiguous.
 Furthermore, the NumPy code base will need to use the new attributes and
 tests will have to be adapted.
 
@@ -247,7 +347,10 @@ tests will have to be adapted.
 Backward compatibility
 ----------------------
 
-As a new feature, no backward compatibility issues would arise.
+As a new feature, no backward compatibility issues with the new ``vindex``
+and ``oindex`` attributes would arise. To facilitate backwards compatibility
+as much as possible, we expect a long deprecation cycle for legacy indexing
+behavior and propose the new ``legacy_index`` attribute.
 Some forward compatibility issues with subclasses that do not specifically
 implement the new methods may arise.
 
@@ -256,20 +359,37 @@ Alternatives
 ------------
 
 NumPy may not choose to offer these different type of indexing methods, or
-choose to only offer them through specific function instead of the proposed
+choose to only offer them through specific functions instead of the proposed
 notation above.
-For variations see also the open questions section above.
 
+We don't think that new functions are a good alternative, because indexing
+notation ``[]`` offer some syntactic advantages in Python (i.e., direct
+creation of slice objects) compared to functions.
+
+A more reasonable alternative would be write new wrapper objects for alternative
+indexing with functions rather than methods (e.g., ``np.oindex(arr)[indices]``
+instead of ``arr.oindex[indices]``). Functionally, this would be equivalent,
+but indexing is such a common operation that we think it is important to
+minimize syntax and worth implementing it directly on `ndarray` objects
+themselves. Indexing attributes also define a clear interface that is easier
+for alternative array implementations to copy, nonwithstanding ongoing
+efforts to make it easier to override NumPy functions [2]_.
 
 Discussion
 ----------
 
-Some discussion can be found on the pull request:
+The original discussion about vectorized vs outer/orthogonal indexing arose
+on the NumPy mailing list:
+
+ * https://mail.python.org/pipermail/numpy-discussion/2015-April/072550.html
+
+Some discussion can be found on the original pull request for this NEP:
 
  * https://github.com/numpy/numpy/pull/6256
 
-A python implementation of the indexing operations can be found at:
+Python implementations of the indexing operations can be found at:
 
+ * https://github.com/numpy/numpy/pull/5749
  * https://gist.github.com/shoyer/c700193625347eb68fee4d1f0dc8c0c8
 
 
@@ -292,7 +412,7 @@ Example array::
 Legacy fancy indexing
 ---------------------
 
-Note that the same result can be achieved with ``arr.lindex``, but the
+Note that the same result can be achieved with ``arr.legacy_index``, but the
 "future error" will still work in this case.
 
 Single index is transposed (this is the same for all indexing types)::
@@ -534,4 +654,5 @@ References and Footnotes
    with this work has waived all copyright and related or neighboring
    rights to this work. The CC0 license may be found at
    https://creativecommons.org/publicdomain/zero/1.0/
-
+.. [2] e.g., see NEP 18,
+   http://www.numpy.org/neps/nep-0018-array-function-protocol.html
author	Stephan Hoyer <shoyer@gmail.com>	2018-06-25 17:21:58 -0400
committer	GitHub <noreply@github.com>	2018-06-25 17:21:58 -0400
commit	d262c78a3132de42b1813d7f713acc6f92323c1f (patch)
tree	7a201d10b9089e0dda25e541f40aa405d5eafc11
parent	464f79eb1d05bf938d16b49da1c39a4e02506fa3 (diff)
download	numpy-d262c78a3132de42b1813d7f713acc6f92323c1f.tar.gz