summaryrefslogtreecommitdiff
path: root/doc/neps
diff options
context:
space:
mode:
authorNathaniel J. Smith <njs@pobox.com>2018-03-08 01:20:09 -0800
committerStephan Hoyer <shoyer@google.com>2018-10-14 14:09:40 -0700
commitf25b464d6ff33f0d2ca7281e72b461fb91dabb43 (patch)
treeb6a191baf23817e66a142f0c82c605cb56a8bb1a /doc/neps
parent2c4c93af0b2d20d85a7432093f31318cbf3c457f (diff)
downloadnumpy-f25b464d6ff33f0d2ca7281e72b461fb91dabb43.tar.gz
New NEP for identifying and coercing duck arrays
Diffstat (limited to 'doc/neps')
-rw-r--r--doc/neps/nep-0016-abstract-array.rst322
-rw-r--r--doc/neps/nep-0016-benchmark.py48
2 files changed, 370 insertions, 0 deletions
diff --git a/doc/neps/nep-0016-abstract-array.rst b/doc/neps/nep-0016-abstract-array.rst
new file mode 100644
index 000000000..85f0941f6
--- /dev/null
+++ b/doc/neps/nep-0016-abstract-array.rst
@@ -0,0 +1,322 @@
+====================================================
+An abstract base class for identifying "duck arrays"
+====================================================
+
+:Author: Nathaniel J. Smith <njs@pobox.com>
+:Status: Draft
+:Type: Standards Track
+:Created: 2018-03-06
+
+
+Abstract
+--------
+
+We propose to add an abstract base class ``AbstractArray`` so that
+third-party classes can declare their ability to "quack like" an
+``ndarray``, and an ``asabstractarray`` function that performs
+similarly to ``asarray`` except that it passes through
+``AbstractArray`` instances unchanged.
+
+
+Detailed description
+--------------------
+
+Many functions, in NumPy and in third-party packages, start with some
+code like::
+
+ def myfunc(a, b):
+ a = np.asarray(a)
+ b = np.asarray(b)
+ ...
+
+This ensures that ``a`` and ``b`` are ``np.ndarray`` objects, so
+``myfunc`` can carry on assuming that they'll act like ndarrays both
+semantically (at the Python level), and also in terms of how they're
+stored in memory (at the C level). But many of these functions only
+work with arrays at the Python level, which means that they don't
+actually need ``ndarray`` objects *per se*: they could work just as
+well with any Python object that "quacks like" an ndarray, such as
+sparse arrays, dask's lazy arrays, or xarray's labeled arrays.
+
+However, currently, there's no way for these libraries to express that
+their objects can quack like an ndarray, and there's no way for
+functions like ``myfunc`` to express that they'd be happy with
+anything that quacks like an ndarray. The purpose of this NEP is to
+provide those two features.
+
+Sometimes people suggest using ``np.asanyarray`` for this purpose, but
+unfortunately its semantics are exactly backwards: it guarantees that
+the object it returns uses the same memory layout as an ``ndarray``,
+but tells you nothing at all about its semantics, which makes it
+essentially impossible to use safely in practice. Indeed, the two
+``ndarray`` subclasses distributed with NumPy – ``np.matrix`` and
+``np.ma.masked_array`` – do have incompatible semantics, and if they
+were passed to a function like ``myfunc`` that doesn't check for them
+as a special-case, then it may silently return incorrect results.
+
+
+Declaring that an object can quack like an array
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There are two basic approaches we could use for checking whether an
+object quacks like an array. We could check for a special attribute on
+the class::
+
+ def quacks_like_array(obj):
+ return bool(getattr(type(obj), "__quacks_like_array__", False))
+
+Or, we could define an `abstract base class (ABC)
+<https://docs.python.org/3/library/collections.abc.html>`__::
+
+ def quacks_like_array(obj):
+ return isinstance(obj, AbstractArray)
+
+If you look at how ABCs work, this is essentially equivalent to
+keeping a global set of types that have been declared to implement the
+``AbstractArray`` interface, and then checking it for membership.
+
+Between these, the ABC approach seems to have a number of advantages:
+
+* It's Python's standard, "one obvious way" of doing this.
+
+* ABCs can be introspected (e.g. ``help(np.AbstractArray)`` does
+ something useful).
+
+* ABCs can provide useful mixin methods.
+
+* ABCs integrate with other features like mypy type-checking,
+ ``functools.singledispatch``, etc.
+
+One obvious thing to check is whether this choice affects speed. Using
+the attached benchmark script on a CPython 3.7 prerelease (revision
+c4d77a661138d, self-compiled, no PGO), on a Thinkpad T450s running
+Linux, we find::
+
+ np.asarray(ndarray_obj) 330 ns
+ np.asarray([]) 1400 ns
+
+ Attribute check, success 80 ns
+ Attribute check, failure 80 ns
+
+ ABC, success via subclass 340 ns
+ ABC, success via register() 700 ns
+ ABC, failure 370 ns
+
+Notes:
+
+* The first two lines are included to put the other lines in context.
+
+* This used 3.7 because both ``getattr`` and ABCs are receiving
+ substantial optimizations in this release, and it's more
+ representative of the long-term future of Python. (Failed
+ ``getattr`` doesn't necessarily construct an exception object
+ anymore, and ABCs were reimplemented in C.)
+
+* The "success" lines refer to cases where ``quacks_like_array`` would
+ return True. The "failure" lines are cases where it would return
+ False.
+
+* The first measurement for ABCs is subclasses defined like::
+
+ class MyArray(AbstractArray):
+ ...
+
+ The second is for subclasses defined like::
+
+ class MyArray:
+ ...
+
+ AbstractArray.register(MyArray)
+
+ I don't know why there's such a large difference between these.
+
+In practice, either way we'd only do the full test after first
+checking for well-known types like ``ndarray``, ``list``, etc. `This
+is how NumPy currently checks for other double-underscore attributes
+<https://github.com/numpy/numpy/blob/master/numpy/core/src/private/get_attr_string.h>`__
+and the same idea applies here to either approach. So these numbers
+won't affect the common case, just the case where we actually have an
+``AbstractArray``, or else another third-party object that will end up
+going through ``__array__`` or ``__array_interface__`` or end up as an
+object array.
+
+So in summary, using an ABC will be slightly slower than using an
+attribute, but this doesn't affect the most common paths, and the
+magnitude of slowdown is fairly small (~250 ns on an operation that
+already takes longer than that). Furthermore, we can potentially
+optimize this further (e.g. by keeping a tiny LRU cache of types that
+are known to be AbstractArray subclasses, on the assumption that most
+code will only use one or two of these types at a time), and it's very
+unclear that this even matters – if the speed of ``asarray`` no-op
+pass-throughs were a bottleneck that showed up in profiles, then
+probably we would have made them faster already! (It would be trivial
+to fast-path this, but we don't.)
+
+Given the semantic and usability advantages of ABCs, this seems like
+an acceptable trade-off.
+
+..
+ CPython 3.6 (from Debian)::
+
+ Attribute check, success 110 ns
+ Attribute check, failure 370 ns
+
+ ABC, success via subclass 690 ns
+ ABC, success via register() 690 ns
+ ABC, failure 1220 ns
+
+
+Specification of ``asabstractarray``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Given ``AbstractArray``, the definition of ``asabstractarray`` is simple::
+
+ def asabstractarray(a, dtype=None):
+ if isinstance(a, AbstractArray):
+ if dtype is not None and dtype != a.dtype:
+ return a.astype(dtype)
+ return a
+ return asarray(a, dtype=dtype)
+
+Things to note:
+
+* ``asarray`` also accepts an ``order=`` argument, but we don't
+ include that here because it's about details of memory
+ representation, and the whole point of this function is that you use
+ it to declare that you don't care about details of memory
+ representation.
+
+* Using the ``astype`` method allows the ``a`` object to decide how to
+ implement casting for its particular type.
+
+* For strict compatibility with ``asarray``, we skip calling
+ ``astype`` when the dtype is already correct. Compare::
+
+ >>> a = np.arange(10)
+
+ # astype() always returns a view:
+ >>> a.astype(a.dtype) is a
+ False
+
+ # asarray() returns the original object if possible:
+ >>> np.asarray(a, dtype=a.dtype) is a
+ True
+
+
+What exactly are you promising if you inherit from ``AbstractArray``?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This will presumably be refined over time. The ideal of course is that
+your class should be indistinguishable from a real ``ndarray``, but
+nothing enforces that except the expectations of users. In practice,
+declaring that your class implements the ``AbstractArray`` interface
+simply means that it will start passing through ``asabstractarray``,
+and so by subclassing it you're saying that if some code works for
+``ndarray``\s but breaks for your class, then you're willing to accept
+bug reports on that.
+
+To start with, we should declare ``__array_ufunc__`` to be an abstract
+method, and add the ``NDArrayOperatorsMixin`` methods as mixin
+methods.
+
+Declaring ``astype`` as an ``@abstractmethod`` probably makes sense as
+well, since it's used by ``asabstractarray``. We might also want to go
+ahead and add some basic attributes like ``ndim``, ``shape``,
+``dtype``.
+
+Adding new abstract methods will be a bit trick, because ABCs enforce
+these at subclass time; therefore, simply adding a new
+`@abstractmethod` will be a backwards compatibility break. If this
+becomes a problem then we can use some hacks to implement an
+`@upcoming_abstractmethod` decorator that only issues a warning if the
+method is missing, and treat it like a regular deprecation cycle. (In
+this case, the thing we'd be deprecating is "support for abstract
+arrays that are missing feature X".)
+
+
+Naming
+~~~~~~
+
+The name of the ABC doesn't matter too much, because it will only be
+referenced rarely and in relatively specialized situations. The name
+of the function matters a lot, because most existing instances of
+``asarray`` should be replaced by this, and in the future it's what
+everyone should be reaching for by default unless they have a specific
+reason to use ``asarray`` instead. This suggests that its name really
+should be *shorter* and *more memorable* than ``asarray``... which
+is difficult. I've used ``asabstractarray`` in this draft, but I'm not
+really happy with it, because it's too long and people are unlikely to
+start using it by habit without endless exhortations.
+
+One option would be to actually change ``asarray``\'s semantics so
+that *it* passes through ``AbstractArray`` objects unchanged. But I'm
+worried that there may be a lot of code out there that calls
+``asarray`` and then passes the result into some C function that
+doesn't do any further type checking (because it knows that its caller
+has already used ``asarray``). If we allow ``asarray`` to return
+``AbstractArray`` objects, and then someone calls one of these C
+wrappers and passes it an ``AbstractArray`` object like a sparse
+array, then they'll get a segfault. Right now, in the same situation,
+``asarray`` will instead invoke the object's ``__array__`` method, or
+use the buffer interface to make a view, or pass through an array with
+object dtype, or raise an error, or similar. Probably none of these
+outcomes are actually desireable in most cases, so maybe making it a
+segfault instead would be OK? But it's dangerous given that we don't
+know how common such code is. OTOH, if we were starting from scratch
+then this would probably be the ideal solution.
+
+We can't use ``asanyarray`` or ``array``, since those are already
+taken.
+
+Any other ideas? ``np.cast``, ``np.coerce``?
+
+
+Implementation
+--------------
+
+1. Rename ``NDArrayOperatorsMixin`` to ``AbstractArray`` (leaving
+ behind an alias for backwards compatibility) and make it an ABC.
+
+2. Add ``asabstractarray`` (or whatever we end up calling it), and
+ probably a C API equivalent.
+
+3. Begin migrating NumPy internal functions to using
+ ``asabstractarray`` where appropriate.
+
+
+Backward compatibility
+----------------------
+
+This is purely a new feature, so there are no compatibility issues.
+(Unless we decide to change the semantics of ``asarray`` itself.)
+
+
+Rejected alternatives
+---------------------
+
+One suggestion that has come up is to define multiple abstract classes
+for different subsets of the array interface. Nothing in this proposal
+stops either NumPy or third-parties from doing this in the future, but
+it's very difficult to guess ahead of time which subsets would be
+useful. Also, "the full ndarray interface" is something that existing
+libraries are written to expect (because they work with actual
+ndarrays) and test (because they test with actual ndarrays), so it's
+by far the easiest place to start.
+
+
+Links to discussion
+-------------------
+
+TBD
+
+
+Appendix: Benchmark script
+--------------------------
+
+.. literal-include:: nep-0016-benchmark.py
+
+
+Copyright
+---------
+
+This document has been placed in the public domain.
diff --git a/doc/neps/nep-0016-benchmark.py b/doc/neps/nep-0016-benchmark.py
new file mode 100644
index 000000000..ec8e44726
--- /dev/null
+++ b/doc/neps/nep-0016-benchmark.py
@@ -0,0 +1,48 @@
+import perf
+import abc
+import numpy as np
+
+class NotArray:
+ pass
+
+class AttrArray:
+ __array_implementer__ = True
+
+class ArrayBase(abc.ABC):
+ pass
+
+class ABCArray1(ArrayBase):
+ pass
+
+class ABCArray2:
+ pass
+
+ArrayBase.register(ABCArray2)
+
+not_array = NotArray()
+attr_array = AttrArray()
+abc_array_1 = ABCArray1()
+abc_array_2 = ABCArray2()
+
+# Make sure ABC cache is primed
+isinstance(not_array, ArrayBase)
+isinstance(abc_array_1, ArrayBase)
+isinstance(abc_array_2, ArrayBase)
+
+runner = perf.Runner()
+def t(name, statement):
+ runner.timeit(name, statement, globals=globals())
+
+t("np.asarray([])", "np.asarray([])")
+arrobj = np.array([])
+t("np.asarray(arrobj)", "np.asarray(arrobj)")
+
+t("attr, False",
+ "getattr(not_array, '__array_implementer__', False)")
+t("attr, True",
+ "getattr(attr_array, '__array_implementer__', False)")
+
+t("ABC, False", "isinstance(not_array, ArrayBase)")
+t("ABC, True, via inheritance", "isinstance(abc_array_1, ArrayBase)")
+t("ABC, True, via register", "isinstance(abc_array_2, ArrayBase)")
+