summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorStephan Hoyer <shoyer@gmail.com>2018-06-25 17:21:58 -0400
committerGitHub <noreply@github.com>2018-06-25 17:21:58 -0400
commitd262c78a3132de42b1813d7f713acc6f92323c1f (patch)
tree7a201d10b9089e0dda25e541f40aa405d5eafc11
parent464f79eb1d05bf938d16b49da1c39a4e02506fa3 (diff)
downloadnumpy-d262c78a3132de42b1813d7f713acc6f92323c1f.tar.gz
DOC: major revision of NEP 21, advanced indexing (#11414)
* DOC: major revision of NEP 21, advanced indexing My focus here has been on clarifying the motivation for these changes, and clearly defining indexing terminology. I've also made the proposal a bit more definitive about what changes *should* be made, and have added myself as a co-author. * Add link to another PR
-rw-r--r--doc/neps/nep-0021-advanced-indexing.rst359
1 files changed, 240 insertions, 119 deletions
diff --git a/doc/neps/nep-0021-advanced-indexing.rst b/doc/neps/nep-0021-advanced-indexing.rst
index 92d40d2eb..0279146be 100644
--- a/doc/neps/nep-0021-advanced-indexing.rst
+++ b/doc/neps/nep-0021-advanced-indexing.rst
@@ -1,8 +1,9 @@
-==========================================================
-Implementing intuitive and full featured advanced indexing
-==========================================================
+=========================================
+Simplified and explicit advanced indexing
+=========================================
:Author: Sebastian Berg
+:Author: Stephan Hoyer <shoyer@google.com>
:Status: Draft
:Type: Standards Track
:Created: 2015-08-27
@@ -11,30 +12,175 @@ Implementing intuitive and full featured advanced indexing
Abstract
--------
-Advanced indexing with multiple array indices is typically confusing to
-both new, and in many cases even old, users of NumPy. To avoid this problem
-and allow for more and clearer features, we propose to:
+NumPy's "advanced" indexing support for indexing array with other arrays is
+one of its most powerful and popular features. Unfortunately, the existing
+rules for advanced indexing with multiple array indices are typically confusing
+to both new, and in many cases even old, users of NumPy. Here we propose an
+overhaul and simplification of advanced indexing, including two new "indexer"
+attributes ``oindex`` and ``vindex`` to facilitate explicit indexing.
-1. Introduce ``arr.oindex[indices]`` which allows advanced indices, but
+Background
+----------
+
+Existing indexing operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+NumPy arrays currently support a flexible range of indexing operations:
+
+- "Basic" indexing involving only slices, integers, ``np.newaxis`` and ellipsis
+ (``...``), e.g., ``x[0, :3, np.newaxis]`` for selecting the first element
+ from the 0th axis, the first three elements from the 1st axis and inserting a
+ new axis of size 1 at the end. Basic indexing always return a view of the
+ indexed array's data.
+- "Advanced" indexing, also called "fancy" indexing, includes all cases where
+ arrays are indexed by other arrays. Advanced indexing always makes a copy:
+
+ - "Boolean" indexing by boolean arrays, e.g., ``x[x > 0]`` for
+ selecting positive elements.
+ - "Vectorized" indexing by one or more integer arrays, e.g., ``x[[0, 1]]``
+ for selecting the first two elements along the first axis. With multiple
+ arrays, vectorized indexing uses broadcasting rules to combine indices along
+ multiple dimensions. This allows for producing a result of arbitrary shape
+ with arbitrary elements from the original arrays.
+ - "Mixed" indexing involving any combinations of the other advancing types.
+ This is no more powerful than vectorized indexing, but is sometimes more
+ convenient.
+
+For clarity, we will refer to these existing rules as "legacy indexing".
+This is only a high-level summary; for more details, see NumPy's documentation
+and and `Examples` below.
+
+Outer indexing
+~~~~~~~~~~~~~~
+
+One broadly useful class of indexing operations is not supported:
+
+- "Outer" or orthogonal indexing treats one-dimensional arrays equivalently to
+ slices for determining output shapes. The rule for outer indexing is that the
+ result should be equivalent to independently indexing along each dimension
+ with integer or boolean arrays as if both the indexed and indexing arrays
+ were one-dimensional. This form of indexing is familiar to many users of other
+ programming languages such as MATLAB, Fortran and R.
+
+The reason why NumPy omits support for outer indexing is that the rules for
+outer and vectorized conflict. Consider indexing a 2D array by two 1D integer
+arrays, e.g., ``x[[0, 1], [0, 1]]``:
+
+- Outer indexing is equivalent to combining multiple integer indices with
+ ``itertools.product()``. The result in this case is another 2D array with
+ all combinations of indexed elements, e.g.,
+ ``np.array([[x[0, 0], x[0, 1]], [x[1, 0], x[1, 1]]])``
+- Vectorized indexing is equivalent to combining multiple integer indices with
+ ``zip()``. The result in this case is a 1D array containing the diagonal
+ elements, e.g., ``np.array([x[0, 0], x[1, 1]])``.
+
+This difference is a frequent stumbling block for new NumPy users. The outer
+indexing model is easier to understand, and is a natural generalization of
+slicing rules. But NumPy instead chose to support vectorized indexing, because
+it is strictly more powerful.
+
+It is always possible to emulate outer indexing by vectorized indexing with
+the right indices. To make this easier, NumPy includes utility objects and
+functions such as ``np.ogrid`` and ``np.ix_``, e.g.,
+``x[np.ix_([0, 1], [0, 1])]``. However, there are no utilities for emulating
+fully general/mixed outer indexing, which could unambiguously allow for slices,
+integers, and 1D boolean and integer arrays.
+
+Mixed indexing
+~~~~~~~~~~~~~~
+
+NumPy's existing rules for combining multiple types of indexing in the same
+operation are quite complex, involving a number of edge cases.
+
+One reason why mixed indexing is particularly confusing is that at first glance
+the result works deceptively like outer indexing. Returning to our example of a
+2D array, both ``x[:2, [0, 1]]`` and ``x[[0, 1], :2]`` return 2D arrays with
+axes in the same order as the original array.
+
+However, as soon as two or more non-slice objects (including integers) are
+introduced, vectorized indexing rules apply. The axes introduced by the array
+indices are at the front, unless all array indices are consecutive, in which
+case NumPy deduces where the user "expects" them to be. Consider indexing a 3D
+array ``arr`` with shape ``(X, Y, Z)``:
+
+1. ``arr[:, [0, 1], 0]`` has shape ``(X, 2)``.
+2. ``arr[[0, 1], 0, :]`` has shape ``(2, Z)``.
+3. ``arr[0, :, [0, 1]]`` has shape ``(2, Y)``, not ``(Y, 2)``!
+
+These first two cases are intuitive and consistent with outer indexing, but
+this last case is quite surprising, even to many higly experienced NumPy users.
+
+Mixed cases involving multiple array indices are also surprising, and only
+less problematic because the current behavior is so useless that it is rarely
+encountered in practice. When a boolean array index is mixed with another boolean or
+integer array, boolean array is converted to integer array indices (equivalent
+to ``np.nonzero()``) and then broadcast. For example, indexing a 2D array of
+size ``(2, 2)`` like ``x[[True, False], [True, False]]`` produces a 1D vector
+with shape ``(1,)``, not a 2D sub-matrix with shape ``(1, 1)``.
+
+Mixed indexing seems so tricky that it is tempting to say that it never should
+be used. However, it is not easy to avoid, because NumPy implicitly adds full
+slices if there are fewer indices than the full dimensionality of the indexed
+array. This means that indexing a 2D array like `x[[0, 1]]`` is equivalent to
+``x[[0, 1], :]``. These cases are not surprising, but they constrain the
+behavior of mixed indexing.
+
+Indexing in other Python array libraries
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Indexing is a useful and widely recognized mechanism for accessing
+multi-dimensional array data, so it is no surprise that many other libraries in
+the scientific Python ecosystem also support array indexing.
+
+Unfortunately, the full complexity of NumPy's indexing rules mean that it is
+both challenging and undesirable for other libraries to copy its behavior in all
+of its nuance. The only full implementation of NumPy-style indexing is NumPy
+itself. This includes projects like dask.array and h5py, which support *most*
+types of array indexing in some form, and otherwise attempt to copy NumPy's API
+exactly.
+
+Vectorized indexing in particular can be challenging to implement with array
+storage backends not based on NumPy. In contrast, indexing by 1D arrays along
+at least one dimension in the style of outer indexing is much more acheivable.
+This has led many libraries (including dask and h5py) to attempt to define a
+safe subset of NumPy-style indexing that is equivalent to outer indexing, e.g.,
+by only allowing indexing with an array along at most one dimension. However,
+this is quite challenging to do correctly in a general enough way to be useful.
+For example, the current versions of dask and h5py both handle mixed indexing
+in case 3 above inconsistently with NumPy. This is quite likely to lead to
+bugs.
+
+These inconsistencies, in addition to the broader challenge of implementing
+every type of indexing logic, make it challenging to write high-level array
+libraries like xarray or dask.array that can interchangeably index many types of
+array storage. In contrast, explicit APIs for outer and vectorized indexing in
+NumPy would provide a model that external libraries could reliably emulate, even
+if they don't support every type of indexing.
+
+High level changes
+------------------
+
+Inspired by multiple "indexer" attributes for controlling different types
+of indexing behavior in pandas, we propose to:
+
+1. Introduce ``arr.oindex[indices]`` which allows array indices, but
uses outer indexing logic.
2. Introduce ``arr.vindex[indices]`` which use the current
"vectorized"/broadcasted logic but with two differences from
- fancy indexing:
+ legacy indexing:
- 1. Boolean indices always use the outer indexing logic.
- (Multi dimensional booleans should be allowed).
- 2. The integer index result dimensions are always the first axes
- of the result array. No transpose is done, even for a single
- integer array index.
-
-3. Plain indexing on the array will only give warnings and eventually
- errors either:
-
- * when there is ambiguity between legacy fancy and outer indexing
- (note that ``arr[[1, 2], :, 0]`` is such a case, an integer
- can be the "second" integer index array),
- * when any integer index array is present (possibly additional for
- more then one boolean index array).
+ * Boolean indices are not supported. All indices must be integers,
+ integer arrays or slices.
+ * The integer index result dimensions are always the first axes
+ of the result array. No transpose is done, even for a single
+ integer array index.
+
+3. Plain indexing on arrays will start to give warnings and eventually
+ errors in cases where one of the explicit indexers should be preferred:
+
+ * First, in all cases where legacy and outer indexing would give
+ different results.
+ * Later, potentially in all cases involving an integer array.
These constraints are sufficient for making indexing generally consistent
with expectations and providing a less surprising learning curve with
@@ -53,38 +199,6 @@ anyone not intimately familiar with advanced indexing.
Detailed Description
--------------------
-Old style advanced indexing with multiple array (boolean or integer) indices,
-also called "fancy indexing", tends to be very confusing for new users.
-While fancy (or legacy) indexing is useful in many cases one would naively
-assume that the result of multiple 1-d ranges is analogous to multiple
-slices along each dimension (also called "outer indexing").
-
-However, legacy fancy indexing with multiple arrays broadcasts these arrays
-into a single index over multiple dimensions. There are three main points
-of confusion when multiple array indices are involved:
-
-1. Most new users will usually expect outer indexing (consistent with
- slicing). This is also the most common way of handling this in other
- packages or languages.
-2. The axes introduced by the array indices are at the front, unless
- all array indices are consecutive, in which case one can deduce where
- the user "expects" them to be:
-
- * `arr[:, [0, 1], :, [0, 1]]` will have the first dimension shaped 2.
- * `arr[:, [0, 1], [0, 1]]` will have the second dimension shaped 2.
-
-3. When a boolean array index is mixed with another boolean or integer
- array, the result is very hard to understand (the boolean array is
- converted to integer array indices and then broadcast), and hardly
- useful.
- There is no well defined broadcast for booleans, so that boolean
- indices are logically always "outer" type indices.
-
-
-Furthermore, since other packages use outer type logic, enabling its
-use in NumPy, will allow simpler cross package code.
-
-
Proposed rules
~~~~~~~~~~~~~~
@@ -103,29 +217,33 @@ be deduced:
no transposing should be done. The axes created by the integer array
indices are always inserted at the front, even for a single index.
-4. Boolean indexing is conceptionally outer indexing. A broadcasting
+4. Boolean indexing is conceptionally outer indexing. Broadcasting
together with other advanced indices in the manner of legacy
- "fancy indexing" is generally not helpful or well defined.
+ indexing is generally not helpful or well defined.
A user who wishes the "``nonzero``" plus broadcast behaviour can thus
- be expected to do this manually. Thus, ``vindex`` would still use
- outer type indexing for boolean index arrays.
- Note that using this rule a single boolean index can still index into
- multiple dimensions at once.
+ be expected to do this manually. Thus, ``vindex`` does not need to
+ support boolean index arrays.
-5. An ``arr.lindex`` or ``arr.findex`` should likely be implemented to allow
- legacy fancy indexing indefinetly. This also gives a simple way to
- update fancy indexing code making deprecations to plain indexing
- easier.
+5. An ``arr.legacy_index`` attribute should be implemented to support
+ legacy indexing. This gives a simple way to update existing codebases
+ using legacy indexing, which will make the deprecation of plain indexing
+ behavior easier. The longer name ``legacy_index`` is intentionally chosen
+ to be explicit and discourage its use in new code.
6. Plain indexing ``arr[...]`` should return an error for ambiguous cases.
For the beginning, this probably means cases where ``arr[ind]`` and
``arr.oindex[ind]`` return different results give deprecation warnings.
+ This includes every use of vectorized indexing with multiple integer arrays.
Due to the transposing behaviour, this means that``arr[0, :, index_arr]``
- will be deprecated, but ``arr[:, 0, index_arr]`` will not for the time
- being.
+ will be deprecated, but ``arr[:, 0, index_arr]`` will not for the time being.
+
+7. To ensure that existing subclasses of `ndarray` that override indexing
+ do not inadvertently revert to default behavior for indexing attributes,
+ these attribute should have explicit checks that disable them if
+ ``__getitem__`` or ``__setitem__`` has been overriden.
Unlike plain indexing, the new indexing attributes are explicitly aimed
-at higher dimensional indexing, two more changes should be implemented:
+at higher dimensional indexing, several additional changes should be implemented:
* The indexing attributes will enforce exact dimension and indexing match.
This means that no implicit ellipsis (``...``) will be added. Unless
@@ -133,6 +251,9 @@ at higher dimensional indexing, two more changes should be implemented:
an array with a specific number of dimensions.
This makes the expression more explicit and safeguards against wrong
dimensionality of arrays.
+ There should be no implications for "duck typing" compatibility with
+ builtin Python sequences, because Python sequences only support a limited
+ form of "basic indexing" with integers and slices.
* The current plain indexing allows for the use of non-tuples for
multi-dimensional indexing such as ``arr[[slice(None), 2]]``.
@@ -141,22 +262,17 @@ at higher dimensional indexing, two more changes should be implemented:
(Whether or not this should be the case for plain indexing is a
different issue.)
-* The new attributes should not do the getitem to implement setitem
- branch, since it is a cludge and not useful for vectorized
+* The new attributes should not use getitem to implement setitem,
+ since it is a cludge and not useful for vectorized
indexing. (not implemented yet)
Open Questions
~~~~~~~~~~~~~~
-* The names ``oindex`` and ``vindex`` are just suggestions at the time of
- writing this, another name NumPy has used for something like ``oindex``
- is ``np.ix_``. See also below.
-
-* It would be possible to limit the use of boolean indices in ``vindex``,
- assuming that they are rare and to some degree special. Alternatively,
- boolean indices could be broadcasted as in legacy fancy indexing in
- principle.
+* The names ``oindex``, ``vindex`` and ``legacy_index`` are just suggestions at
+ the time of writing this, another name NumPy has used for something like
+ ``oindex`` is ``np.ix_``. See also below.
* ``oindex`` and ``vindex`` could always return copies, even when no array
operation occurs. One argument for allowing a view return is that this way
@@ -175,25 +291,17 @@ Open Questions
* The proposed changes to plain indexing could be postponed indefinitely or
not taken in order to not break or force major fixes to existing code bases.
-* Should there be/is it possible to have a mechanism to not allow the new
- indexing attributes for subclasses unless specifically implemented?
-
-* Should we try to warn users about the special indexing attributes
- not being implemented? If a subclass has its own indexing, inheriting
- it from ndarray should be wrong.
-
-
Alternative Names
~~~~~~~~~~~~~~~~~
Possible names suggested (more suggestions will be added).
-============== ======== ======= ============
-**Orthogonal** oindex oix
-**Vectorized** vindex vix
-**Legacy** l/findex legacy_index
-============== ======== ======= ============
+============== ============ ========
+**Orthogonal** oindex oix
+**Vectorized** vindex vix
+**Legacy** legacy_index l/findex
+============== ============ ========
Subclasses
@@ -205,10 +313,13 @@ provide ``__getitem__`` or ``__setitem__``) the special attributes should
just work. Subclasses that *do* provide it must be updated accordingly
and should preferably not subclass working versions of these attributes.
-All subclasses will inherit the attributes, however, it seems possible
-to test ``subclass.__getitem__.__classobj__`` when getting i.e.
-``subclass.vindex``. If this is not ``ndarray``, the subclass has special
-handling and an ``AttributeError`` can be given.
+All subclasses will inherit the attributes, however, the implementation
+of ``__getitem__`` on these attributes should test
+``subclass.__getitem__ is ndarray.__getitem__``. If not, the
+subclass has special handling for indexing and ``NotImplementedError``
+should be raised, requiring that the indexing attributes is also explicitly
+overwritten. Likewise, implementations of ``__setitem__`` should check to see
+if ``__setitem__`` is overriden.
A further question is how to facilitate implementing the special attributes.
Also there is the weird functionality where ``__setitem__`` calls
@@ -218,28 +329,17 @@ confusing.
To facilitate implementations we could provide functions similar to
``operator.itemgetter`` and ``operator.setitem`` for the attributes.
-Possibly a MixIn could be provided to help implementation, but further
-ideas on this are necessary.
-
-Related Operation
-~~~~~~~~~~~~~~~~~
-
-There exist a further indexing or indexing like method. That is the
-inverse of a command such as ``np.argmin(arr, axis=axis)``, to pick
-the specific elements *along* an axis given an array of (at least
-typically) the same size.
-
-These function are added to NumPy in versoin 15 as `take_along_axis` and
-`put_along_axis`.
-
+Possibly a mixin could be provided to help implementation. These improvements
+are not essential to the initial implementation, so they are saved for
+future work.
Implementation
--------------
-Implementation of a special indexing object available through
-``arr.oindex``, ``arr.vindex``, and ``arr.lindex`` to allow these indexing
-operations. Also starting to deprecate those plain index operations
-which are not ambiguous.
+Implementation would start with writing special indexing objects available
+through ``arr.oindex``, ``arr.vindex``, and ``arr.legacy_index`` to allow these
+indexing operations. Also, we would need to start to deprecate those plain index
+operations which are not ambiguous.
Furthermore, the NumPy code base will need to use the new attributes and
tests will have to be adapted.
@@ -247,7 +347,10 @@ tests will have to be adapted.
Backward compatibility
----------------------
-As a new feature, no backward compatibility issues would arise.
+As a new feature, no backward compatibility issues with the new ``vindex``
+and ``oindex`` attributes would arise. To facilitate backwards compatibility
+as much as possible, we expect a long deprecation cycle for legacy indexing
+behavior and propose the new ``legacy_index`` attribute.
Some forward compatibility issues with subclasses that do not specifically
implement the new methods may arise.
@@ -256,20 +359,37 @@ Alternatives
------------
NumPy may not choose to offer these different type of indexing methods, or
-choose to only offer them through specific function instead of the proposed
+choose to only offer them through specific functions instead of the proposed
notation above.
-For variations see also the open questions section above.
+We don't think that new functions are a good alternative, because indexing
+notation ``[]`` offer some syntactic advantages in Python (i.e., direct
+creation of slice objects) compared to functions.
+
+A more reasonable alternative would be write new wrapper objects for alternative
+indexing with functions rather than methods (e.g., ``np.oindex(arr)[indices]``
+instead of ``arr.oindex[indices]``). Functionally, this would be equivalent,
+but indexing is such a common operation that we think it is important to
+minimize syntax and worth implementing it directly on `ndarray` objects
+themselves. Indexing attributes also define a clear interface that is easier
+for alternative array implementations to copy, nonwithstanding ongoing
+efforts to make it easier to override NumPy functions [2]_.
Discussion
----------
-Some discussion can be found on the pull request:
+The original discussion about vectorized vs outer/orthogonal indexing arose
+on the NumPy mailing list:
+
+ * https://mail.python.org/pipermail/numpy-discussion/2015-April/072550.html
+
+Some discussion can be found on the original pull request for this NEP:
* https://github.com/numpy/numpy/pull/6256
-A python implementation of the indexing operations can be found at:
+Python implementations of the indexing operations can be found at:
+ * https://github.com/numpy/numpy/pull/5749
* https://gist.github.com/shoyer/c700193625347eb68fee4d1f0dc8c0c8
@@ -292,7 +412,7 @@ Example array::
Legacy fancy indexing
---------------------
-Note that the same result can be achieved with ``arr.lindex``, but the
+Note that the same result can be achieved with ``arr.legacy_index``, but the
"future error" will still work in this case.
Single index is transposed (this is the same for all indexing types)::
@@ -534,4 +654,5 @@ References and Footnotes
with this work has waived all copyright and related or neighboring
rights to this work. The CC0 license may be found at
https://creativecommons.org/publicdomain/zero/1.0/
-
+.. [2] e.g., see NEP 18,
+ http://www.numpy.org/neps/nep-0018-array-function-protocol.html