diff options
Diffstat (limited to 'doc/neps/nep-0037-array-module.rst')
-rw-r--r-- | doc/neps/nep-0037-array-module.rst | 550 |
1 files changed, 550 insertions, 0 deletions
diff --git a/doc/neps/nep-0037-array-module.rst b/doc/neps/nep-0037-array-module.rst new file mode 100644 index 000000000..387356490 --- /dev/null +++ b/doc/neps/nep-0037-array-module.rst @@ -0,0 +1,550 @@ +=================================================== +NEP 37 — A dispatch protocol for NumPy-like modules +=================================================== + +:Author: Stephan Hoyer <shoyer@google.com> +:Author: Hameer Abbasi +:Author: Sebastian Berg +:Status: Draft +:Type: Standards Track +:Created: 2019-12-29 + +Abstract +-------- + +NEP-18's ``__array_function__`` has been a mixed success. Some projects (e.g., +dask, CuPy, xarray, sparse, Pint) have enthusiastically adopted it. Others +(e.g., PyTorch, JAX, SciPy) have been more reluctant. Here we propose a new +protocol, ``__array_module__``, that we expect could eventually subsume most +use-cases for ``__array_function__``. The protocol requires explicit adoption +by both users and library authors, which ensures backwards compatibility, and +is also significantly simpler than ``__array_function__``, both of which we +expect will make it easier to adopt. + +Why ``__array_function__`` hasn't been enough +--------------------------------------------- + +There are two broad ways in which NEP-18 has fallen short of its goals: + +1. **Maintainability concerns**. `__array_function__` has significant + implications for libraries that use it: + + - Projects like `PyTorch + <https://github.com/pytorch/pytorch/issues/22402>`_, `JAX + <https://github.com/google/jax/issues/1565>`_ and even `scipy.sparse + <https://github.com/scipy/scipy/issues/10362>`_ have been reluctant to + implement `__array_function__` in part because they are concerned about + **breaking existing code**: users expect NumPy functions like + ``np.concatenate`` to return NumPy arrays. This is a fundamental + limitation of the ``__array_function__`` design, which we chose to allow + overriding the existing ``numpy`` namespace. + - ``__array_function__`` currently requires an "all or nothing" approach to + implementing NumPy's API. There is no good pathway for **incremental + adoption**, which is particularly problematic for established projects + for which adopting ``__array_function__`` would result in breaking + changes. + - It is no longer possible to use **aliases to NumPy functions** within + modules that support overrides. For example, both CuPy and JAX set + ``result_type = np.result_type``. + - Implementing **fall-back mechanisms** for unimplemented NumPy functions + by using NumPy's implementation is hard to get right (but see the + `version from dask <https://github.com/dask/dask/pull/5043>`_), because + ``__array_function__`` does not present a consistent interface. + Converting all arguments of array type requires recursing into generic + arguments of the form ``*args, **kwargs``. + +2. **Limitations on what can be overridden.** ``__array_function__`` has some + important gaps, most notably array creation and coercion functions: + + - **Array creation** routines (e.g., ``np.arange`` and those in + ``np.random``) need some other mechanism for indicating what type of + arrays to create. `NEP 36 <https://github.com/numpy/numpy/pull/14715>`_ + proposed adding optional ``like=`` arguments to functions without + existing array arguments. However, we still lack any mechanism to + override methods on objects, such as those needed by + ``np.random.RandomState``. + - **Array conversion** can't reuse the existing coercion functions like + ``np.asarray``, because ``np.asarray`` sometimes means "convert to an + exact ``np.ndarray``" and other times means "convert to something _like_ + a NumPy array." This led to the `NEP 30 + <https://numpy.org/neps/nep-0030-duck-array-protocol.html>`_ proposal for + a separate ``np.duckarray`` function, but this still does not resolve how + to cast one duck array into a type matching another duck array. + +``get_array_module`` and the ``__array_module__`` protocol +---------------------------------------------------------- + +We propose a new user-facing mechanism for dispatching to a duck-array +implementation, ``numpy.get_array_module``. ``get_array_module`` performs the +same type resolution as ``__array_function__`` and returns a module with an API +promised to match the standard interface of ``numpy`` that can implement +operations on all provided array types. + +The protocol itself is both simpler and more powerful than +``__array_function__``, because it doesn't need to worry about actually +implementing functions. We believe it resolves most of the maintainability and +functionality limitations of ``__array_function__``. + +The new protocol is opt-in, explicit and with local control; see +:ref:`appendix-design-choices` for discussion on the importance of these design +features. + +The array module contract +========================= + +Modules returned by ``get_array_module``/``__array_module__`` should make a +best effort to implement NumPy's core functionality on new array types(s). +Unimplemented functionality should simply be omitted (e.g., accessing an +unimplemented function should raise ``AttributeError``). In the future, we +anticipate codifying a protocol for requesting restricted subsets of ``numpy``; +see :ref:`requesting-restricted-subsets` for more details. + +How to use ``get_array_module`` +=============================== + +Code that wants to support generic duck arrays should explicitly call +``get_array_module`` to determine an appropriate array module from which to +call functions, rather than using the ``numpy`` namespace directly. For +example: + +.. code:: python + + # calls the appropriate version of np.something for x and y + module = np.get_array_module(x, y) + module.something(x, y) + +Both array creation and array conversion are supported, because dispatching is +handled by ``get_array_module`` rather than via the types of function +arguments. For example, to use random number generation functions or methods, +we can simply pull out the appropriate submodule: + +.. code:: python + + def duckarray_add_random(array): + module = np.get_array_module(array) + noise = module.random.randn(*array.shape) + return array + noise + +We can also write the duck-array ``stack`` function from `NEP 30 +<https://numpy.org/neps/nep-0030-duck-array-protocol.html>`_, without the need +for a new ``np.duckarray`` function: + +.. code:: python + + def duckarray_stack(arrays): + module = np.get_array_module(*arrays) + arrays = [module.asarray(arr) for arr in arrays] + shapes = {arr.shape for arr in arrays} + if len(shapes) != 1: + raise ValueError('all input arrays must have the same shape') + expanded_arrays = [arr[module.newaxis, ...] for arr in arrays] + return module.concatenate(expanded_arrays, axis=0) + +By default, ``get_array_module`` will return the ``numpy`` module if no +arguments are arrays. This fall-back can be explicitly controlled by providing +the ``module`` keyword-only argument. It is also possible to indicate that an +exception should be raised instead of returning a default array module by +setting ``module=None``. + +How to implement ``__array_module__`` +===================================== + +Libraries implementing a duck array type that want to support +``get_array_module`` need to implement the corresponding protocol, +``__array_module__``. This new protocol is based on Python's dispatch protocol +for arithmetic, and is essentially a simpler version of ``__array_function__``. + +Only one argument is passed into ``__array_module__``, a Python collection of +unique array types passed into ``get_array_module``, i.e., all arguments with +an ``__array_module__`` attribute. + +The special method should either return an namespace with an API matching +``numpy``, or ``NotImplemented``, indicating that it does not know how to +handle the operation: + +.. code:: python + + class MyArray: + def __array_module__(self, types): + if not all(issubclass(t, MyArray) for t in types): + return NotImplemented + return my_array_module + +Returning custom objects from ``__array_module__`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``my_array_module`` will typically, but need not always, be a Python module. +Returning a custom objects (e.g., with functions implemented via +``__getattr__``) may be useful for some advanced use cases. + +For example, custom objects could allow for partial implementations of duck +array modules that fall-back to NumPy (although this is not recommended in +general because such fall-back behavior can be error prone): + +.. code:: python + + class MyArray: + def __array_module__(self, types): + if all(issubclass(t, MyArray) for t in types): + return ArrayModule() + else: + return NotImplemented + + class ArrayModule: + def __getattr__(self, name): + import base_module + return getattr(base_module, name, getattr(numpy, name)) + +Subclassing from ``numpy.ndarray`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +All of the same guidance about well-defined type casting hierarchies from +NEP-18 still applies. ``numpy.ndarray`` itself contains a matching +implementation of ``__array_module__``, which is convenient for subclasses: + +.. code:: python + + class ndarray: + def __array_module__(self, types): + if all(issubclass(t, ndarray) for t in types): + return numpy + else: + return NotImplemented + +NumPy's internal machinery +========================== + +The type resolution rules of ``get_array_module`` follow the same model as +Python and NumPy's existing dispatch protocols: subclasses are called before +super-classes, and otherwise left to right. ``__array_module__`` is guaranteed +to be called only a single time on each unique type. + +The actual implementation of `get_array_module` will be in C, but should be +equivalent to this Python code: + +.. code:: python + + def get_array_module(*arrays, default=numpy): + implementing_arrays, types = _implementing_arrays_and_types(arrays) + if not implementing_arrays and default is not None: + return default + for array in implementing_arrays: + module = array.__array_module__(types) + if module is not NotImplemented: + return module + raise TypeError("no common array module found") + + def _implementing_arrays_and_types(relevant_arrays): + types = [] + implementing_arrays = [] + for array in relevant_arrays: + t = type(array) + if t not in types and hasattr(t, '__array_module__'): + types.append(t) + # Subclasses before superclasses, otherwise left to right + index = len(implementing_arrays) + for i, old_array in enumerate(implementing_arrays): + if issubclass(t, type(old_array)): + index = i + break + implementing_arrays.insert(index, array) + return implementing_arrays, types + +Relationship with ``__array_ufunc__`` and ``__array_function__`` +---------------------------------------------------------------- + +These older protocols have distinct use-cases and should remain +=============================================================== + +``__array_module__`` is intended to resolve limitations of +``__array_function__``, so it is natural to consider whether it could entirely +replace ``__array_function__``. This would offer dual benefits: (1) simplifying +the user-story about how to override NumPy and (2) removing the slowdown +associated with checking for dispatch when calling every NumPy function. + +However, ``__array_module__`` and ``__array_function__`` are pretty different +from a user perspective: it requires explicit calls to ``get_array_function``, +rather than simply reusing original ``numpy`` functions. This is probably fine +for *libraries* that rely on duck-arrays, but may be frustratingly verbose for +interactive use. + +Some of the dispatching use-cases for ``__array_ufunc__`` are also solved by +``__array_module__``, but not all of them. For example, it is still useful to +be able to define non-NumPy ufuncs (e.g., from Numba or SciPy) in a generic way +on non-NumPy arrays (e.g., with dask.array). + +Given their existing adoption and distinct use cases, we don't think it makes +sense to remove or deprecate ``__array_function__`` and ``__array_ufunc__`` at +this time. + +Mixin classes to implement ``__array_function__`` and ``__array_ufunc__`` +========================================================================= + +Despite the user-facing differences, ``__array_module__`` and a module +implementing NumPy's API still contain sufficient functionality needed to +implement dispatching with the existing duck array protocols. + +For example, the following mixin classes would provide sensible defaults for +these special methods in terms of ``get_array_module`` and +``__array_module__``: + +.. code:: python + + class ArrayUfuncFromModuleMixin: + + def __array_ufunc__(self, ufunc, method, *inputs, **kwargs): + arrays = inputs + kwargs.get('out', ()) + try: + array_module = np.get_array_module(*arrays) + except TypeError: + return NotImplemented + + try: + # Note this may have false positive matches, if ufunc.__name__ + # matches the name of a ufunc defined by NumPy. Unfortunately + # there is no way to determine in which module a ufunc was + # defined. + new_ufunc = getattr(array_module, ufunc.__name__) + except AttributeError: + return NotImplemented + + try: + callable = getattr(new_ufunc, method) + except AttributeError: + return NotImplemented + + return callable(*inputs, **kwargs) + + class ArrayFunctionFromModuleMixin: + + def __array_function__(self, func, types, args, kwargs): + array_module = self.__array_module__(types) + if array_module is NotImplemented: + return NotImplemented + + # Traverse submodules to find the appropriate function + modules = func.__module__.split('.') + assert modules[0] == 'numpy' + for submodule in modules[1:]: + module = getattr(module, submodule, None) + new_func = getattr(module, func.__name__, None) + if new_func is None: + return NotImplemented + + return new_func(*args, **kwargs) + +To make it easier to write duck arrays, we could also add these mixin classes +into ``numpy.lib.mixins`` (but the examples above may suffice). + +Alternatives considered +----------------------- + +Naming +====== + +We like the name ``__array_module__`` because it mirrors the existing +``__array_function__`` and ``__array_ufunc__`` protocols. Another reasonable +choice could be ``__array_namespace__``. + +It is less clear what the NumPy function that calls this protocol should be +called (``get_array_module`` in this proposal). Some possible alternatives: +``array_module``, ``common_array_module``, ``resolve_array_module``, +``get_namespace``, ``get_numpy``, ``get_numpylike_module``, +``get_duck_array_module``. + +.. _requesting-restricted-subsets: + +Requesting restricted subsets of NumPy's API +============================================ + +Over time, NumPy has accumulated a very large API surface, with over 600 +attributes in the top level ``numpy`` module alone. It is unlikely that any +duck array library could or would want to implement all of these functions and +classes, because the frequently used subset of NumPy is much smaller. + +We think it would be useful exercise to define "minimal" subset(s) of NumPy's +API, omitting rarely used or non-recommended functionality. For example, +minimal NumPy might include ``stack``, but not the other stacking functions +``column_stack``, ``dstack``, ``hstack`` and ``vstack``. This could clearly +indicate to duck array authors and users want functionality is core and what +functionality they can skip. + +Support for requesting a restricted subset of NumPy's API would be a natural +feature to include in ``get_array_function`` and ``__array_module__``, e.g., + +.. code:: python + + # array_module is only guaranteed to contain "minimal" NumPy + array_module = np.get_array_module(*arrays, request='minimal') + +To facilitate testing with NumPy and use with any valid duck array library, +NumPy itself would return restricted versions of the ``numpy`` module when +``get_array_module`` is called only on NumPy arrays. Omitted functions would +simply not exist. + +Unfortunately, we have not yet figured out what these restricted subsets should +be, so it doesn't make sense to do this yet. When/if we do, we could either add +new keyword arguments to ``get_array_module`` or add new top level functions, +e.g., ``get_minimal_array_module``. We would also need to add either a new +protocol patterned off of ``__array_module__`` (e.g., +``__array_module_minimal__``), or could add an optional second argument to +``__array_module__`` (catching errors with ``try``/``except``). + +A new namespace for implicit dispatch +===================================== + +Instead of supporting overrides in the main `numpy` namespace with +``__array_function__``, we could create a new opt-in namespace, e.g., +``numpy.api``, with versions of NumPy functions that support dispatching. These +overrides would need new opt-in protocols, e.g., ``__array_function_api__`` +patterned off of ``__array_function__``. + +This would resolve the biggest limitations of ``__array_function__`` by being +opt-in and would also allow for unambiguously overriding functions like +``asarray``, because ``np.api.asarray`` would always mean "convert an +array-like object." But it wouldn't solve all the dispatching needs met by +``__array_module__``, and would leave us with supporting a considerably more +complex protocol both for array users and implementors. + +We could potentially implement such a new namespace *via* the +``__array_module__`` protocol. Certainly some users would find this convenient, +because it is slightly less boilerplate. But this would leave users with a +confusing choice: when should they use `get_array_module` vs. +`np.api.something`. Also, we would have to add and maintain a whole new module, +which is considerably more expensive than merely adding a function. + +Dispatching on both types and arrays instead of only types +========================================================== + +Instead of supporting dispatch only via unique array types, we could also +support dispatch via array objects, e.g., by passing an ``arrays`` argument as +part of the ``__array_module__`` protocol. This could potentially be useful for +dispatch for arrays with metadata, such provided by Dask and Pint, but would +impose costs in terms of type safety and complexity. + +For example, a library that supports arrays on both CPUs and GPUs might decide +on which device to create a new arrays from functions like ``ones`` based on +input arguments: + +.. code:: python + + class Array: + def __array_module__(self, types, arrays): + useful_arrays = tuple(a in arrays if isinstance(a, Array)) + if not useful_arrays: + return NotImplemented + prefer_gpu = any(a.prefer_gpu for a in useful_arrays) + return ArrayModule(prefer_gpu) + + class ArrayModule: + def __init__(self, prefer_gpu): + self.prefer_gpu = prefer_gpu + + def __getattr__(self, name): + import base_module + base_func = getattr(base_module, name) + return functools.partial(base_func, prefer_gpu=self.prefer_gpu) + +This might be useful, but it's not clear if we really need it. Pint seems to +get along OK without any explicit array creation routines (favoring +multiplication by units, e.g., ``np.ones(5) * ureg.m``), and for the most part +Dask is also OK with existing ``__array_function__`` style overides (e.g., +favoring ``np.ones_like`` over ``np.ones``). Choosing whether to place an array +on the CPU or GPU could be solved by `making array creation lazy +<https://github.com/google/jax/pull/1668>`_. + +.. _appendix-design-choices: + +Appendix: design choices for API overrides +------------------------------------------ + +There is a large range of possible design choices for overriding NumPy's API. +Here we discuss three major axes of the design decision that guided our design +for ``__array_module__``. + +Opt-in vs. opt-out for users +============================ + +The ``__array_ufunc__`` and ``__array_function__`` protocols provide a +mechanism for overriding NumPy functions *within NumPy's existing namespace*. +This means that users need to explicitly opt-out if they do not want any +overridden behavior, e.g., by casting arrays with ``np.asarray()``. + +In theory, this approach lowers the barrier for adopting these protocols in +user code and libraries, because code that uses the standard NumPy namespace is +automatically compatible. But in practice, this hasn't worked out. For example, +most well-maintained libraries that use NumPy follow the best practice of +casting all inputs with ``np.asarray()``, which they would have to explicitly +relax to use ``__array_function__``. Our experience has been that making a +library compatible with a new duck array type typically requires at least a +small amount of work to accommodate differences in the data model and operations +that can be implemented efficiently. + +These opt-out approaches also considerably complicate backwards compatibility +for libraries that adopt these protocols, because by opting in as a library +they also opt-in their users, whether they expect it or not. For winning over +libraries that have been unable to adopt ``__array_function__``, an opt-in +approach seems like a must. + +Explicit vs. implicit choice of implementation +============================================== + +Both ``__array_ufunc__`` and ``__array_function__`` have implicit control over +dispatching: the dispatched functions are determined via the appropriate +protocols in every function call. This generalizes well to handling many +different types of objects, as evidenced by its use for implementing arithmetic +operators in Python, but it has two downsides: + +1. *Speed*: it imposes additional overhead in every function call, because each + function call needs to inspect each of its arguments for overrides. This is + why arithmetic on builtin Python numbers is slow. +2. *Readability*: it is not longer immediately evident to readers of code what + happens when a function is called, because the function's implementation + could be overridden by any of its arguments. + +In contrast, importing a new library (e.g., ``import dask.array as da``) with +an API matching NumPy is entirely explicit. There is no overhead from dispatch +or ambiguity about which implementation is being used. + +Explicit and implicit choice of implementations are not mutually exclusive +options. Indeed, most implementations of NumPy API overrides via +``__array_function__`` that we are familiar with (namely, dask, CuPy and +sparse, but not Pint) also include an explicit way to use their version of +NumPy's API by importing a module directly (``dask.array``, ``cupy`` or +``sparse``, respectively). + +Local vs. non-local vs. global control +====================================== + +The final design axis is how users control the choice of API: + +- **Local control**, as exemplified by multiple dispatch and Python protocols for + arithmetic, determines which implementation to use either by checking types + or calling methods on the direct arguments of a function. +- **Non-local control** such as `np.errstate + <https://docs.scipy.org/doc/numpy/reference/generated/numpy.errstate.html>`_ + overrides behavior with global-state via function decorators or + context-managers. Control is determined hierarchically, via the inner-most + context. +- **Global control** provides a mechanism for users to set default behavior, + either via function calls or configuration files. For example, matplotlib + allows setting a global choice of plotting backend. + +Local control is generally considered a best practice for API design, because +control flow is entirely explicit, which makes it the easiest to understand. +Non-local and global control are occasionally used, but generally either due to +ignorance or a lack of better alternatives. + +In the case of duck typing for NumPy's public API, we think non-local or global +control would be mistakes, mostly because they **don't compose well**. If one +library sets/needs one set of overrides and then internally calls a routine +that expects another set of overrides, the resulting behavior may be very +surprising. Higher order functions are especially problematic, because the +context in which functions are evaluated may not be the context in which they +are defined. + +One class of override use cases where we think non-local and global control are +appropriate is for choosing a backend system that is guaranteed to have an +entirely consistent interface, such as a faster alternative implementation of +``numpy.fft`` on NumPy arrays. However, these are out of scope for the current +proposal, which is focused on duck arrays. |