Diffstat (limited to 'doc/neps/nep-0018-array-function-protocol.rst')
-rw-r--r--   doc/neps/nep-0018-array-function-protocol.rst   543
1 files changed, 543 insertions, 0 deletions

diff --git a/doc/neps/nep-0018-array-function-protocol.rst b/doc/neps/nep-0018-array-function-protocol.rst
new file mode 100644
index 000000000..943ca4cbf
--- /dev/null
+++ b/doc/neps/nep-0018-array-function-protocol.rst
@@ -0,0 +1,543 @@

==================================================
NEP: Dispatch Mechanism for NumPy's high level API
==================================================

:Author: Stephan Hoyer <shoyer@google.com>
:Author: Matthew Rocklin <mrocklin@gmail.com>
:Status: Draft
:Type: Standards Track
:Created: 2018-05-29

Abstract
--------

We propose a protocol that allows arguments of NumPy functions to define
how that function operates on them. This allows other libraries that
implement NumPy's high level API to reuse NumPy functions, and allows
libraries that extend NumPy's high level API to apply to more NumPy-like
libraries.

Detailed description
--------------------

NumPy's high level ndarray API has been implemented several times
outside of NumPy itself for different architectures, such as for GPU
arrays (CuPy), sparse arrays (scipy.sparse, pydata/sparse) and parallel
arrays (Dask array), as well as various NumPy-like implementations in
deep learning frameworks like TensorFlow and PyTorch.

Similarly, there are several projects that build on top of the NumPy API
for labeled and indexed arrays (XArray), automatic differentiation
(Autograd, Tangent), higher order array factorizations (TensorLy), etc.,
adding functionality on top of the NumPy API.

We would like to be able to use these libraries together, for example to
place a CuPy array within XArray, or to perform automatic
differentiation on Dask array code. This would be easier to accomplish
if code written for NumPy ndarrays could also be used by other
NumPy-like projects.

For example, we would like the following code example to work equally
well with any NumPy-like array object:

.. code:: python

    def f(x):
        y = np.tensordot(x, x.T)
        return np.mean(np.exp(y))

Some of this is possible today with various protocol mechanisms within
NumPy.

- The ``np.exp`` function checks the ``__array_ufunc__`` protocol
- The ``.T`` method works using Python's method dispatch
- The ``np.mean`` function explicitly checks for a ``.mean`` method on
  the argument

However, other functions, like ``np.tensordot``, do not dispatch, and
instead are likely to coerce to a NumPy array (using the ``__array__``
protocol) or err outright. To achieve enough coverage of the NumPy API
to support downstream projects like XArray and Autograd, we want to
support *almost all* functions within NumPy, which calls for a more
far-reaching protocol than just ``__array_ufunc__``. We would like a
protocol that allows arguments of a NumPy function to take control and
divert execution to another function (for example a GPU or parallel
implementation) in a way that is safe and consistent across projects.

Implementation
--------------

We propose adding support for a new protocol in NumPy,
``__array_function__``.

This protocol is intended to be a catch-all for NumPy functionality that
is not covered by existing protocols, like reductions (like ``np.sum``)
or universal functions (like ``np.exp``). The semantics are very similar
to ``__array_ufunc__``, except the operation is specified by an
arbitrary callable object rather than a ufunc instance and method.

The interface
~~~~~~~~~~~~~

We propose the following signature for implementations of
``__array_function__``:

.. code-block:: python

    def __array_function__(self, func, types, args, kwargs)

- ``func`` is an arbitrary callable exposed by NumPy's public API,
  which was called in the form ``func(*args, **kwargs)``.
- ``types`` is a list of types for all arguments to the original NumPy
  function call that will be checked for an ``__array_function__``
  implementation.
- The tuple ``args`` and dict ``kwargs`` are directly passed on from the
  original call.

Unlike ``__array_ufunc__``, there are no high-level guarantees about the
type of ``func``, or about which of ``args`` and ``kwargs`` may contain
objects implementing the array API. As a convenience for implementors of
``__array_function__``, the ``types`` argument contains a list of all
types that implement the ``__array_function__`` protocol. This allows
downstream implementations to quickly determine whether they are likely
able to support the operation.

Still to be determined: what guarantees can we offer for ``types``?
Should we promise that types are unique, and appear in the order in
which they are checked?

Example for a project implementing the NumPy API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Most implementations of ``__array_function__`` will start with two
checks:

1. Is the given function something that we know how to overload?
2. Are all arguments of a type that we know how to handle?

If these conditions hold, ``__array_function__`` should return the
result from calling its implementation for ``func(*args, **kwargs)``.
Otherwise, it should return the sentinel value ``NotImplemented``,
indicating that the function is not implemented by these types.

.. code:: python

    class MyArray:
        def __array_function__(self, func, types, args, kwargs):
            if func not in HANDLED_FUNCTIONS:
                return NotImplemented
            if not all(issubclass(t, MyArray) for t in types):
                return NotImplemented
            return HANDLED_FUNCTIONS[func](*args, **kwargs)

    HANDLED_FUNCTIONS = {
        np.concatenate: my_concatenate,
        np.broadcast_to: my_broadcast_to,
        np.sum: my_sum,
        ...
    }

Necessary changes within the NumPy codebase itself
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This will require two changes within the NumPy codebase:

1. A function to inspect available inputs, look for the
   ``__array_function__`` attribute on those inputs, and call those
   methods appropriately until one succeeds. This needs to be fast in
   the common all-NumPy case.

   This is one additional function of moderate complexity.
2. Calling this function within all relevant NumPy functions.

   This affects many parts of the NumPy codebase, although with very low
   complexity.

Finding and calling the right ``__array_function__``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Given a NumPy function and its ``*args`` and ``**kwargs`` inputs, we
need to search through ``*args`` and ``**kwargs`` for all appropriate
inputs that might have the ``__array_function__`` attribute. Then we
need to select among those possible methods and execute the right one.
Negotiating between several possible implementations can be complex.

Finding arguments
'''''''''''''''''

Valid arguments may appear directly in ``*args`` and ``**kwargs``, such
as in the case of ``np.tensordot(left, right, out=out)``, or they may be
nested within lists or dictionaries, such as in the case of
``np.concatenate([x, y, z])``. This can be problematic for two reasons:

1. Some functions are given long lists of values, and traversing them
   might be prohibitively expensive.
2. Some functions may have arguments that we don't want to inspect, even
   if they have the ``__array_function__`` method.

To resolve this, we ask each function to provide an explicit list of
arguments that should be traversed. This is the ``relevant_arguments=``
keyword in the examples below.

Trying ``__array_function__`` methods until the right one works
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

Many arguments may implement the ``__array_function__`` protocol. Some
of these may decide that, given the available inputs, they are unable to
determine the correct result. How do we call the right one? If several
are valid, which takes precedence?

The rules for dispatch with ``__array_function__`` match those for
``__array_ufunc__`` (see
`NEP-13 <http://www.numpy.org/neps/nep-0013-ufunc-overrides.html>`_).
In particular:

- NumPy will gather implementations of ``__array_function__`` from all
  specified inputs and call them in order: subclasses before
  superclasses, and otherwise left to right. Note that in some edge
  cases, this differs slightly from the
  `current behavior <https://bugs.python.org/issue30140>`_ of Python.
- Implementations of ``__array_function__`` indicate that they can
  handle the operation by returning any value other than
  ``NotImplemented``.
- If all ``__array_function__`` methods return ``NotImplemented``,
  NumPy will raise ``TypeError``.

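To make these dispatch rules concrete, here is a minimal sketch of what
such a dispatcher could look like. This is illustrative only, not
NumPy's actual implementation: the helper names and the simplistic
treatment of nested arguments are assumptions, and a real version would
need to be considerably faster in the common all-NumPy case.

.. code:: python

    def get_overloaded_types_and_args(relevant_args):
        """Collect argument types (and arguments) that implement
        __array_function__: subclasses before superclasses, and
        otherwise left to right. Nested lists/tuples are only
        traversed one level deep in this sketch."""
        overloaded_types = []
        overloaded_args = []
        for arg in relevant_args:
            candidates = arg if isinstance(arg, (list, tuple)) else [arg]
            for candidate in candidates:
                arg_type = type(candidate)
                if (arg_type not in overloaded_types and
                        hasattr(arg_type, '__array_function__')):
                    overloaded_types.append(arg_type)
                    # Insert before the first superclass already seen, if any.
                    index = len(overloaded_args)
                    for i, old_arg in enumerate(overloaded_args):
                        if issubclass(arg_type, type(old_arg)):
                            index = i
                            break
                    overloaded_args.insert(index, candidate)
        return overloaded_types, overloaded_args

    def do_array_function_dance(func, relevant_arguments, args, kwargs):
        """Call __array_function__ methods in order until one handles
        the operation. Returns (True, result) on success and
        (False, None) if no argument overloads ``func``."""
        types, overloaded_args = get_overloaded_types_and_args(
            relevant_arguments)
        if not overloaded_args:
            return False, None
        for overloaded_arg in overloaded_args:
            result = overloaded_arg.__array_function__(
                func, types, args, kwargs)
            if result is not NotImplemented:
                return True, result
        raise TypeError('no implementation found for {!r} on types {!r}'
                        .format(func, types))
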
Changes within NumPy functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Given a dispatch function along the lines sketched above, for now called
``do_array_function_dance``, we need to call that function from within
every relevant NumPy function. This is a pervasive change, but the code
involved is fairly simple and innocuous, and it should complete quickly
and without effect if no arguments implement the ``__array_function__``
protocol. Let us consider a few examples of NumPy functions and how they
might be affected by this change:

.. code:: python

    def broadcast_to(array, shape, subok=False):
        success, value = do_array_function_dance(
            func=broadcast_to,
            relevant_arguments=[array],
            args=(array,),
            kwargs=dict(shape=shape, subok=subok))
        if success:
            return value

        ...  # continue with the definition of broadcast_to

    def concatenate(arrays, axis=0, out=None):
        success, value = do_array_function_dance(
            func=concatenate,
            relevant_arguments=[arrays, out],
            args=(arrays,),
            kwargs=dict(axis=axis, out=out))
        if success:
            return value

        ...  # continue with the definition of concatenate

The objects passed as ``relevant_arguments`` are those that should be
inspected for ``__array_function__`` implementations.

Alternatively, we could write these overloads with a decorator, e.g.,

.. code:: python

    @overload_for_array_function(['array'])
    def broadcast_to(array, shape, subok=False):
        ...  # continue with the definition of broadcast_to

    @overload_for_array_function(['arrays', 'out'])
    def concatenate(arrays, axis=0, out=None):
        ...  # continue with the definition of concatenate

The decorator ``overload_for_array_function`` would be written in terms
of ``do_array_function_dance``.

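One possible shape for such a decorator, purely as an illustration
(assuming Python 3's ``inspect.signature`` and the
``do_array_function_dance`` helper sketched earlier), might be:

.. code:: python

    import functools
    import inspect

    def overload_for_array_function(relevant_arg_names):
        """Hypothetical decorator sketch: dispatch through
        __array_function__ before falling back to the default NumPy
        implementation."""
        def decorator(implementation):
            signature = inspect.signature(implementation)

            @functools.wraps(implementation)
            def public_api(*args, **kwargs):
                bound = signature.bind(*args, **kwargs)
                bound.apply_defaults()
                relevant_arguments = [bound.arguments[name]
                                      for name in relevant_arg_names]
                # Pass the public (wrapped) function so that overloads
                # can recognize it, e.g., as a key in HANDLED_FUNCTIONS.
                success, value = do_array_function_dance(
                    func=public_api,
                    relevant_arguments=relevant_arguments,
                    args=args, kwargs=kwargs)
                if success:
                    return value
                return implementation(*args, **kwargs)

            return public_api
        return decorator

Compared to the explicit calls above, the decorator leaves each
function's body untouched, at the cost of the introspection concern
discussed next.
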
The downside of this approach would be a loss of introspection
capability for NumPy functions on Python 2, since this requires the use
of ``inspect.Signature`` (only available on Python 3). However, NumPy
won't be supporting Python 2 for `very much longer
<http://www.numpy.org/neps/nep-0014-dropping-python2.7-proposal.html>`_.

Use outside of NumPy
~~~~~~~~~~~~~~~~~~~~

Nothing about this protocol is particular to NumPy itself. Should we
encourage use of the same ``__array_function__`` protocol in third-party
libraries for overloading non-NumPy functions, e.g., for making
array-implementation-generic functionality in SciPy?

This would offer significant advantages (SciPy wouldn't need to invent
its own dispatch system) and no downsides that we can think of, because
every function that dispatches with ``__array_function__`` already needs
to be explicitly recognized. Libraries like Dask, CuPy, and Autograd
already wrap a limited subset of SciPy functionality (e.g.,
``scipy.linalg``) similarly to how they wrap NumPy.

If we want to do this, we should consider exposing the helper function
``do_array_function_dance()`` above as a public API.

Non-goals
---------

We are aiming for a basic strategy that can be applied relatively
mechanistically to almost all functions in NumPy's API in a relatively
short period of time, the development cycle of a single NumPy release.

We hope to get both the ``__array_function__`` protocol and all specific
overloads right on the first try, but our explicit aim here is to get
something that mostly works (and can be iterated upon), rather than to
wait for an optimal implementation. The price of moving fast is that for
now **this protocol should be considered strictly experimental**. We
reserve the right to change the details of this protocol and how
specific NumPy functions use it at any time in the future -- even in
otherwise bug-fix only releases of NumPy.

In particular, we don't plan to write additional NEPs that list all
specific functions to overload, with exactly how they should be
overloaded. We will leave this up to the discretion of committers on
individual pull requests, trusting that they will surface any
controversies for discussion by interested parties.

However, we already know several families of functions that should be
explicitly excluded from ``__array_function__``. These will need their
own protocols:

- universal functions, which already have their own protocol.
- ``array`` and ``asarray``, because they are explicitly intended for
  coercion to actual ``numpy.ndarray`` objects.
- dispatch for methods of any kind, e.g., methods on
  ``np.random.RandomState`` objects.

As a concrete example of how we expect to break behavior in the future,
some functions such as ``np.where`` are currently not NumPy universal
functions, but conceivably could become universal functions in the
future. When/if this happens, we will change such overloads from using
``__array_function__`` to the more specialized ``__array_ufunc__``.


Backward compatibility
----------------------

This proposal does not change existing semantics, except for those
arguments that currently have ``__array_function__`` methods, which
should be rare.

Alternatives
------------

Specialized protocols
~~~~~~~~~~~~~~~~~~~~~

We could (and should) continue to develop protocols like
``__array_ufunc__`` for cohesive subsets of NumPy functionality.

As mentioned above, if this means that some functions that we overload
with ``__array_function__`` should switch to a new protocol instead,
that is explicitly OK for as long as ``__array_function__`` retains its
experimental status.

Separate namespace
~~~~~~~~~~~~~~~~~~

A separate namespace for overloaded functions is another possibility,
either inside or outside of NumPy.

This has the advantage of alleviating any possible concerns about
backwards compatibility and would provide the maximum freedom for quick
experimentation. In the long term, it would provide a clean abstraction
layer, separating NumPy's high level API from default implementations on
``numpy.ndarray`` objects.

The downsides are that this would require an explicit opt-in from all
existing code, e.g., ``import numpy.api as np``, and in the long term
would result in the maintenance of two separate NumPy APIs. Also, many
functions from ``numpy`` itself are already overloaded (but
inadequately), so confusion about high vs. low level APIs in NumPy would
still persist.

Multiple dispatch
~~~~~~~~~~~~~~~~~

An alternative to our suggestion of the ``__array_function__`` protocol
would be implementing NumPy's core functions as
`multi-methods <https://en.wikipedia.org/wiki/Multiple_dispatch>`_.
Although one of us wrote a `multiple dispatch
library <https://github.com/mrocklin/multipledispatch>`_ for Python, we
don't think this approach makes sense for NumPy in the near term.

The main reason is that NumPy already has a well-proven dispatching
mechanism with ``__array_ufunc__``, based on Python's own dispatching
system for arithmetic, and it would be confusing to add another
mechanism that works in a very different way. This would also be a more
invasive change to NumPy itself, which would need to gain a multiple
dispatch implementation.

It is possible that a multiple dispatch implementation for NumPy's high
level API could make sense in the future. Fortunately,
``__array_function__`` does not preclude this possibility, because it
would be straightforward to write a shim for a default
``__array_function__`` implementation in terms of multiple dispatch.

Implementations in terms of a limited core API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The internal implementations of some NumPy functions are extremely
simple. For example:

- ``np.stack()`` is implemented in only a few lines of code by combining
  indexing with ``np.newaxis``, ``np.concatenate`` and the ``shape``
  attribute.
- ``np.mean()`` is implemented internally in terms of ``np.sum()``,
  ``np.divide()``, ``.astype()`` and ``.shape``.

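For illustration, a rough sketch of both functions in terms of such core
operations (simplified relative to NumPy's actual definitions; negative
``axis`` values and dtype handling are ignored):

.. code:: python

    import numpy as np

    def stack_sketch(arrays, axis=0):
        # Simplified np.stack: give every array a new axis, then
        # concatenate along it. Assumes a non-negative axis.
        expanded = [arr[(slice(None),) * axis + (np.newaxis,)]
                    for arr in arrays]
        return np.concatenate(expanded, axis=axis)

    def mean_sketch(a, axis=None):
        # Simplified np.mean: a sum divided by the number of elements
        # that were reduced over. Assumes an array input.
        count = a.size if axis is None else a.shape[axis]
        return np.divide(np.sum(a, axis=axis), count)
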
This suggests the possibility of defining a minimal "core" ndarray
interface, and relying upon it internally in NumPy to implement the full
API. This is an attractive option, because it could significantly reduce
the work required for new array implementations.

However, this also comes with several downsides:

1. The details of how NumPy implements a high-level function in terms of
   overloaded functions now become an implicit part of NumPy's public
   API. For example, refactoring ``stack`` to use ``np.block()`` instead
   of ``np.concatenate()`` internally would now become a breaking
   change.
2. Array libraries may prefer to implement high level functions
   differently than NumPy. For example, a library might prefer to
   implement a fundamental operation like ``mean()`` directly rather
   than relying on ``sum()`` followed by division. More generally, it's
   not clear yet what exactly qualifies as core functionality, and
   figuring this out could be a large project.
3. We don't yet have an overloading system for attributes and methods on
   array objects, e.g., for accessing ``.dtype`` and ``.shape``. This
   should be the subject of a future NEP, but until then we should be
   reluctant to rely on these properties.

Given these concerns, we encourage relying on this approach only in
limited cases.

Coercion to a NumPy array as a catch-all fallback
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With the current design, classes that implement ``__array_function__``
to overload at least one function implicitly declare an intent to
implement the entire NumPy API. It's not possible to implement *only*
``np.concatenate()`` on a type, but fall back to NumPy's default
behavior of casting with ``np.asarray()`` for all other functions.

This could present a backwards compatibility concern that would
discourage libraries from adopting ``__array_function__`` in an
incremental fashion. For example, currently most NumPy functions will
implicitly convert ``pandas.Series`` objects into NumPy arrays, behavior
that assuredly many pandas users rely on. If pandas implemented
``__array_function__`` only for ``np.concatenate``, unrelated NumPy
functions like ``np.nanmean`` would suddenly break on pandas objects by
raising ``TypeError``.

With ``__array_ufunc__``, it's possible to alleviate this concern by
casting all arguments to NumPy arrays and re-calling the ufunc, but the
heterogeneous function signatures supported by ``__array_function__``
make it impossible to implement this generic fallback behavior for
``__array_function__``.

We could resolve this issue by changing the handling of return values in
``__array_function__`` in either of two possible ways:

1. Change the meaning of all arguments returning ``NotImplemented`` to
   indicate that all arguments should be coerced to NumPy arrays
   instead. However, many array libraries (e.g., scipy.sparse) really
   don't want implicit conversions to NumPy arrays, and often avoid
   implementing ``__array__`` for exactly this reason. Implicit
   conversions can result in silent bugs and performance degradation.
2. Use another sentinel value of some sort to indicate that a class
   implementing part of the higher level array API is coercible as a
   fallback, e.g., a return value of ``np.NotImplementedButCoercible``
   from ``__array_function__``.

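Under the second option, a library that only wants to overload
``np.concatenate`` might write something like the sketch below. This is
purely hypothetical: ``NotImplementedButCoercible`` does not exist in
NumPy today and is stubbed out here, and ``my_concatenate`` stands in
for a library-specific implementation.

.. code:: python

    import numpy as np

    # Hypothetical sentinel (not part of NumPy today), stubbed out so
    # that this sketch is self-contained.
    NotImplementedButCoercible = object()

    def my_concatenate(arrays, axis=0):
        # Stand-in for a library-specific concatenate implementation.
        return np.concatenate([np.asarray(a) for a in arrays], axis=axis)

    class MySeries:
        """Hypothetical array-like that overloads only np.concatenate."""

        def __array_function__(self, func, types, args, kwargs):
            if func is np.concatenate:
                return my_concatenate(*args, **kwargs)
            # For every other NumPy function, signal that this object
            # may be coerced with np.asarray() and handled by the
            # default implementation, instead of raising TypeError.
            return NotImplementedButCoercible
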
If we take this second approach, we would need to define additional
rules for how coercible array arguments are coerced, e.g.:

- Would we try for ``__array_function__`` overloads again after coercing
  coercible arguments?
- If so, would we coerce coercible arguments one-at-a-time, or
  all-at-once?

These are slightly tricky design questions, so for now we propose to
defer this issue. We can always implement
``np.NotImplementedButCoercible`` at some later time if it proves
critical to the NumPy community in the future. Importantly, we don't
think this will stop critical libraries that desire to implement most of
the high level NumPy API from adopting this proposal.

NOTE: If you are reading this NEP in its draft state and disagree,
please speak up on the mailing list!

Drawbacks of this approach
--------------------------

Future difficulty extending NumPy's API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One downside of passing all arguments directly on to
``__array_function__`` is that it makes it hard to extend the signatures
of overloaded NumPy functions with new arguments, because adding even an
optional keyword argument would break existing overloads.

This is not a new problem for NumPy. NumPy has occasionally changed the
signature for functions in the past, including functions like
``numpy.sum`` which support overloads.

For adding new keyword arguments that do not change default behavior, we
would only include these as keyword arguments when they have changed
from default values. This is similar to `what NumPy already has
done <https://github.com/numpy/numpy/blob/v1.14.2/numpy/core/fromnumeric.py#L1865-L1867>`_,
e.g., for the optional ``keepdims`` argument in ``sum``:

.. code:: python

    def sum(array, ..., keepdims=np._NoValue):
        kwargs = {}
        if keepdims is not np._NoValue:
            kwargs['keepdims'] = keepdims
        return array.sum(..., **kwargs)

In other cases, such as deprecated arguments, preserving the existing
behavior of overloaded functions may not be possible. Libraries that use
``__array_function__`` should be aware of this risk: we don't propose to
freeze NumPy's API in stone any more than it already is.

Difficulty adding implementation specific arguments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some array implementations generally follow NumPy's API, but have
additional optional keyword arguments (e.g., ``dask.array.sum()`` has
``split_every`` and ``tensorflow.reduce_sum()`` has ``name``). A generic
dispatching library could potentially pass on all unrecognized keyword
arguments directly to the implementation, but extending ``np.sum()`` to
pass on ``**kwargs`` would entail public facing changes in NumPy.
Customizing the detailed behavior of array libraries will require using
library specific functions, which could be limiting in the case of
libraries that consume the NumPy API such as xarray.


Discussion
----------

Various alternatives to this proposal were discussed in a few GitHub
issues:

1. `pydata/sparse #1 <https://github.com/pydata/sparse/issues/1>`_
2. `numpy/numpy #11129 <https://github.com/numpy/numpy/issues/11129>`_

Additionally, it was the subject of `a blogpost
<http://matthewrocklin.com/blog/work/2018/05/27/beyond-numpy>`_.
Following this, it was discussed at a `NumPy developer sprint
<https://scisprints.github.io/#may-numpy-developer-sprint>`_ at the `UC
Berkeley Institute for Data Science (BIDS) <https://bids.berkeley.edu/>`_.


References and Footnotes
------------------------

.. [1] Each NEP must either be explicitly labeled as placed in the
   public domain (see this NEP as an example) or licensed under the
   `Open Publication License`_.

.. _Open Publication License: http://www.opencontent.org/openpub/


Copyright
---------

This document has been placed in the public domain. [1]_