diff options
author | Nathaniel J. Smith <njs@pobox.com> | 2018-03-08 01:20:09 -0800 |
---|---|---|
committer | Stephan Hoyer <shoyer@google.com> | 2018-10-14 14:09:40 -0700 |
commit | f25b464d6ff33f0d2ca7281e72b461fb91dabb43 (patch) | |
tree | b6a191baf23817e66a142f0c82c605cb56a8bb1a /doc/neps | |
parent | 2c4c93af0b2d20d85a7432093f31318cbf3c457f (diff) | |
download | numpy-f25b464d6ff33f0d2ca7281e72b461fb91dabb43.tar.gz |
New NEP for identifying and coercing duck arrays
Diffstat (limited to 'doc/neps')
-rw-r--r-- | doc/neps/nep-0016-abstract-array.rst | 322 | ||||
-rw-r--r-- | doc/neps/nep-0016-benchmark.py | 48 |
2 files changed, 370 insertions, 0 deletions
diff --git a/doc/neps/nep-0016-abstract-array.rst b/doc/neps/nep-0016-abstract-array.rst new file mode 100644 index 000000000..85f0941f6 --- /dev/null +++ b/doc/neps/nep-0016-abstract-array.rst @@ -0,0 +1,322 @@ +==================================================== +An abstract base class for identifying "duck arrays" +==================================================== + +:Author: Nathaniel J. Smith <njs@pobox.com> +:Status: Draft +:Type: Standards Track +:Created: 2018-03-06 + + +Abstract +-------- + +We propose to add an abstract base class ``AbstractArray`` so that +third-party classes can declare their ability to "quack like" an +``ndarray``, and an ``asabstractarray`` function that performs +similarly to ``asarray`` except that it passes through +``AbstractArray`` instances unchanged. + + +Detailed description +-------------------- + +Many functions, in NumPy and in third-party packages, start with some +code like:: + + def myfunc(a, b): + a = np.asarray(a) + b = np.asarray(b) + ... + +This ensures that ``a`` and ``b`` are ``np.ndarray`` objects, so +``myfunc`` can carry on assuming that they'll act like ndarrays both +semantically (at the Python level), and also in terms of how they're +stored in memory (at the C level). But many of these functions only +work with arrays at the Python level, which means that they don't +actually need ``ndarray`` objects *per se*: they could work just as +well with any Python object that "quacks like" an ndarray, such as +sparse arrays, dask's lazy arrays, or xarray's labeled arrays. + +However, currently, there's no way for these libraries to express that +their objects can quack like an ndarray, and there's no way for +functions like ``myfunc`` to express that they'd be happy with +anything that quacks like an ndarray. The purpose of this NEP is to +provide those two features. + +Sometimes people suggest using ``np.asanyarray`` for this purpose, but +unfortunately its semantics are exactly backwards: it guarantees that +the object it returns uses the same memory layout as an ``ndarray``, +but tells you nothing at all about its semantics, which makes it +essentially impossible to use safely in practice. Indeed, the two +``ndarray`` subclasses distributed with NumPy – ``np.matrix`` and +``np.ma.masked_array`` – do have incompatible semantics, and if they +were passed to a function like ``myfunc`` that doesn't check for them +as a special-case, then it may silently return incorrect results. + + +Declaring that an object can quack like an array +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +There are two basic approaches we could use for checking whether an +object quacks like an array. We could check for a special attribute on +the class:: + + def quacks_like_array(obj): + return bool(getattr(type(obj), "__quacks_like_array__", False)) + +Or, we could define an `abstract base class (ABC) +<https://docs.python.org/3/library/collections.abc.html>`__:: + + def quacks_like_array(obj): + return isinstance(obj, AbstractArray) + +If you look at how ABCs work, this is essentially equivalent to +keeping a global set of types that have been declared to implement the +``AbstractArray`` interface, and then checking it for membership. + +Between these, the ABC approach seems to have a number of advantages: + +* It's Python's standard, "one obvious way" of doing this. + +* ABCs can be introspected (e.g. ``help(np.AbstractArray)`` does + something useful). + +* ABCs can provide useful mixin methods. + +* ABCs integrate with other features like mypy type-checking, + ``functools.singledispatch``, etc. + +One obvious thing to check is whether this choice affects speed. Using +the attached benchmark script on a CPython 3.7 prerelease (revision +c4d77a661138d, self-compiled, no PGO), on a Thinkpad T450s running +Linux, we find:: + + np.asarray(ndarray_obj) 330 ns + np.asarray([]) 1400 ns + + Attribute check, success 80 ns + Attribute check, failure 80 ns + + ABC, success via subclass 340 ns + ABC, success via register() 700 ns + ABC, failure 370 ns + +Notes: + +* The first two lines are included to put the other lines in context. + +* This used 3.7 because both ``getattr`` and ABCs are receiving + substantial optimizations in this release, and it's more + representative of the long-term future of Python. (Failed + ``getattr`` doesn't necessarily construct an exception object + anymore, and ABCs were reimplemented in C.) + +* The "success" lines refer to cases where ``quacks_like_array`` would + return True. The "failure" lines are cases where it would return + False. + +* The first measurement for ABCs is subclasses defined like:: + + class MyArray(AbstractArray): + ... + + The second is for subclasses defined like:: + + class MyArray: + ... + + AbstractArray.register(MyArray) + + I don't know why there's such a large difference between these. + +In practice, either way we'd only do the full test after first +checking for well-known types like ``ndarray``, ``list``, etc. `This +is how NumPy currently checks for other double-underscore attributes +<https://github.com/numpy/numpy/blob/master/numpy/core/src/private/get_attr_string.h>`__ +and the same idea applies here to either approach. So these numbers +won't affect the common case, just the case where we actually have an +``AbstractArray``, or else another third-party object that will end up +going through ``__array__`` or ``__array_interface__`` or end up as an +object array. + +So in summary, using an ABC will be slightly slower than using an +attribute, but this doesn't affect the most common paths, and the +magnitude of slowdown is fairly small (~250 ns on an operation that +already takes longer than that). Furthermore, we can potentially +optimize this further (e.g. by keeping a tiny LRU cache of types that +are known to be AbstractArray subclasses, on the assumption that most +code will only use one or two of these types at a time), and it's very +unclear that this even matters – if the speed of ``asarray`` no-op +pass-throughs were a bottleneck that showed up in profiles, then +probably we would have made them faster already! (It would be trivial +to fast-path this, but we don't.) + +Given the semantic and usability advantages of ABCs, this seems like +an acceptable trade-off. + +.. + CPython 3.6 (from Debian):: + + Attribute check, success 110 ns + Attribute check, failure 370 ns + + ABC, success via subclass 690 ns + ABC, success via register() 690 ns + ABC, failure 1220 ns + + +Specification of ``asabstractarray`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Given ``AbstractArray``, the definition of ``asabstractarray`` is simple:: + + def asabstractarray(a, dtype=None): + if isinstance(a, AbstractArray): + if dtype is not None and dtype != a.dtype: + return a.astype(dtype) + return a + return asarray(a, dtype=dtype) + +Things to note: + +* ``asarray`` also accepts an ``order=`` argument, but we don't + include that here because it's about details of memory + representation, and the whole point of this function is that you use + it to declare that you don't care about details of memory + representation. + +* Using the ``astype`` method allows the ``a`` object to decide how to + implement casting for its particular type. + +* For strict compatibility with ``asarray``, we skip calling + ``astype`` when the dtype is already correct. Compare:: + + >>> a = np.arange(10) + + # astype() always returns a view: + >>> a.astype(a.dtype) is a + False + + # asarray() returns the original object if possible: + >>> np.asarray(a, dtype=a.dtype) is a + True + + +What exactly are you promising if you inherit from ``AbstractArray``? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This will presumably be refined over time. The ideal of course is that +your class should be indistinguishable from a real ``ndarray``, but +nothing enforces that except the expectations of users. In practice, +declaring that your class implements the ``AbstractArray`` interface +simply means that it will start passing through ``asabstractarray``, +and so by subclassing it you're saying that if some code works for +``ndarray``\s but breaks for your class, then you're willing to accept +bug reports on that. + +To start with, we should declare ``__array_ufunc__`` to be an abstract +method, and add the ``NDArrayOperatorsMixin`` methods as mixin +methods. + +Declaring ``astype`` as an ``@abstractmethod`` probably makes sense as +well, since it's used by ``asabstractarray``. We might also want to go +ahead and add some basic attributes like ``ndim``, ``shape``, +``dtype``. + +Adding new abstract methods will be a bit trick, because ABCs enforce +these at subclass time; therefore, simply adding a new +`@abstractmethod` will be a backwards compatibility break. If this +becomes a problem then we can use some hacks to implement an +`@upcoming_abstractmethod` decorator that only issues a warning if the +method is missing, and treat it like a regular deprecation cycle. (In +this case, the thing we'd be deprecating is "support for abstract +arrays that are missing feature X".) + + +Naming +~~~~~~ + +The name of the ABC doesn't matter too much, because it will only be +referenced rarely and in relatively specialized situations. The name +of the function matters a lot, because most existing instances of +``asarray`` should be replaced by this, and in the future it's what +everyone should be reaching for by default unless they have a specific +reason to use ``asarray`` instead. This suggests that its name really +should be *shorter* and *more memorable* than ``asarray``... which +is difficult. I've used ``asabstractarray`` in this draft, but I'm not +really happy with it, because it's too long and people are unlikely to +start using it by habit without endless exhortations. + +One option would be to actually change ``asarray``\'s semantics so +that *it* passes through ``AbstractArray`` objects unchanged. But I'm +worried that there may be a lot of code out there that calls +``asarray`` and then passes the result into some C function that +doesn't do any further type checking (because it knows that its caller +has already used ``asarray``). If we allow ``asarray`` to return +``AbstractArray`` objects, and then someone calls one of these C +wrappers and passes it an ``AbstractArray`` object like a sparse +array, then they'll get a segfault. Right now, in the same situation, +``asarray`` will instead invoke the object's ``__array__`` method, or +use the buffer interface to make a view, or pass through an array with +object dtype, or raise an error, or similar. Probably none of these +outcomes are actually desireable in most cases, so maybe making it a +segfault instead would be OK? But it's dangerous given that we don't +know how common such code is. OTOH, if we were starting from scratch +then this would probably be the ideal solution. + +We can't use ``asanyarray`` or ``array``, since those are already +taken. + +Any other ideas? ``np.cast``, ``np.coerce``? + + +Implementation +-------------- + +1. Rename ``NDArrayOperatorsMixin`` to ``AbstractArray`` (leaving + behind an alias for backwards compatibility) and make it an ABC. + +2. Add ``asabstractarray`` (or whatever we end up calling it), and + probably a C API equivalent. + +3. Begin migrating NumPy internal functions to using + ``asabstractarray`` where appropriate. + + +Backward compatibility +---------------------- + +This is purely a new feature, so there are no compatibility issues. +(Unless we decide to change the semantics of ``asarray`` itself.) + + +Rejected alternatives +--------------------- + +One suggestion that has come up is to define multiple abstract classes +for different subsets of the array interface. Nothing in this proposal +stops either NumPy or third-parties from doing this in the future, but +it's very difficult to guess ahead of time which subsets would be +useful. Also, "the full ndarray interface" is something that existing +libraries are written to expect (because they work with actual +ndarrays) and test (because they test with actual ndarrays), so it's +by far the easiest place to start. + + +Links to discussion +------------------- + +TBD + + +Appendix: Benchmark script +-------------------------- + +.. literal-include:: nep-0016-benchmark.py + + +Copyright +--------- + +This document has been placed in the public domain. diff --git a/doc/neps/nep-0016-benchmark.py b/doc/neps/nep-0016-benchmark.py new file mode 100644 index 000000000..ec8e44726 --- /dev/null +++ b/doc/neps/nep-0016-benchmark.py @@ -0,0 +1,48 @@ +import perf +import abc +import numpy as np + +class NotArray: + pass + +class AttrArray: + __array_implementer__ = True + +class ArrayBase(abc.ABC): + pass + +class ABCArray1(ArrayBase): + pass + +class ABCArray2: + pass + +ArrayBase.register(ABCArray2) + +not_array = NotArray() +attr_array = AttrArray() +abc_array_1 = ABCArray1() +abc_array_2 = ABCArray2() + +# Make sure ABC cache is primed +isinstance(not_array, ArrayBase) +isinstance(abc_array_1, ArrayBase) +isinstance(abc_array_2, ArrayBase) + +runner = perf.Runner() +def t(name, statement): + runner.timeit(name, statement, globals=globals()) + +t("np.asarray([])", "np.asarray([])") +arrobj = np.array([]) +t("np.asarray(arrobj)", "np.asarray(arrobj)") + +t("attr, False", + "getattr(not_array, '__array_implementer__', False)") +t("attr, True", + "getattr(attr_array, '__array_implementer__', False)") + +t("ABC, False", "isinstance(not_array, ArrayBase)") +t("ABC, True, via inheritance", "isinstance(abc_array_1, ArrayBase)") +t("ABC, True, via register", "isinstance(abc_array_2, ArrayBase)") + |