author     Charles Harris <charlesr.harris@gmail.com>  2018-09-22 10:37:38 -0500
committer  GitHub <noreply@github.com>                 2018-09-22 10:37:38 -0500
commit     12fc9fc9104fd30be42d532a6ca95bfa5615979d (patch)
tree       f32261ecc92ca46f016cbd9c0c3c0164ad9148e3
parent     25f06a6684b88b925748544f51955d3b0a6db423 (diff)
parent     531e97d39c4de28c9ebeaf2e6ad533d4b9e26142 (diff)
download   numpy-12fc9fc9104fd30be42d532a6ca95bfa5615979d.tar.gz
Merge pull request #12017 from rgommers/missing-data-neps
NEP: add 3 missing data NEPs rescued from 2011-2012
 doc/neps/_static/nep0013_image1.png        | bin  0 -> 18653 bytes
 doc/neps/_static/nep0013_image2.png        | bin  0 -> 6313 bytes
 doc/neps/_static/nep0013_image3.png        | bin  0 -> 11704 bytes
 doc/neps/conf.py                           |    3
 doc/neps/nep-0013-ufunc-overrides.rst      |   30
 doc/neps/nep-0024-missing-data-2.rst       |  210
 doc/neps/nep-0025-missing-data-3.rst       |  469
 doc/neps/nep-0026-missing-data-summary.rst |  727
8 files changed, 1410 insertions, 29 deletions
diff --git a/doc/neps/_static/nep0013_image1.png b/doc/neps/_static/nep0013_image1.png
new file mode 100644
index 000000000..e1b35b738
--- /dev/null
+++ b/doc/neps/_static/nep0013_image1.png
Binary files differ
diff --git a/doc/neps/_static/nep0013_image2.png b/doc/neps/_static/nep0013_image2.png
new file mode 100644
index 000000000..99f51b2fa
--- /dev/null
+++ b/doc/neps/_static/nep0013_image2.png
Binary files differ
diff --git a/doc/neps/_static/nep0013_image3.png b/doc/neps/_static/nep0013_image3.png
new file mode 100644
index 000000000..87a354ad1
--- /dev/null
+++ b/doc/neps/_static/nep0013_image3.png
Binary files differ
diff --git a/doc/neps/conf.py b/doc/neps/conf.py
index 8cfb2b570..bd7503781 100644
--- a/doc/neps/conf.py
+++ b/doc/neps/conf.py
@@ -30,8 +30,7 @@ import os
 # Add any Sphinx extension module names here, as strings. They can be
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
 # ones.
-extensions = ['sphinx.ext.imgmath',
-              'sphinx.ext.graphviz']
+extensions = ['sphinx.ext.imgmath',]
 
 # Add any paths that contain templates here, relative to this directory.
 templates_path = ['../source/_templates/']
diff --git a/doc/neps/nep-0013-ufunc-overrides.rst b/doc/neps/nep-0013-ufunc-overrides.rst
index a51ce3927..0888c7559 100644
--- a/doc/neps/nep-0013-ufunc-overrides.rst
+++ b/doc/neps/nep-0013-ufunc-overrides.rst
@@ -261,16 +261,7 @@ consider carefully if any surprising behavior results.
 
    Type casting hierarchy.
 
-   .. graphviz::
-
-      digraph array_ufuncs {
-         rankdir=BT;
-         A -> C [label="C"];
-         B -> C [label="C"];
-         D -> B [label="B"];
-         ndarray -> C [label="A"];
-         ndarray -> B [label="B"];
-      }
+   .. image:: _static/nep0013_image1.png
 
    The ``__array_ufunc__`` of type A can handle ndarrays returning C,
    B can handle ndarray and D returning B, and C can handle A and B
   returning C,
@@ -286,14 +277,7 @@ consider carefully if any surprising behavior results.
 
    One-cycle in the ``__array_ufunc__`` graph.
 
-   .. graphviz::
-
-      digraph array_ufuncs {
-         rankdir=BT;
-         A -> B [label="B"];
-         B -> A [label="A"];
-      }
-
+   .. image:: _static/nep0013_image2.png
 
    In this case, the ``__array_ufunc__`` relations have a cycle of length 1,
    and a type casting hierarchy does not exist. Binary operations are not
@@ -303,15 +287,7 @@ consider carefully if any surprising behavior results.
 
    Longer cycle in the ``__array_ufunc__`` graph.
 
-   .. graphviz::
-
-      digraph array_ufuncs {
-         rankdir=BT;
-         A -> B [label="B"];
-         B -> C [label="C"];
-         C -> A [label="A"];
-      }
-
+   .. image:: _static/nep0013_image3.png
 
    In this case, the ``__array_ufunc__`` relations have a longer cycle, and
    a type casting hierarchy does not exist. Binary operations are still
diff --git a/doc/neps/nep-0024-missing-data-2.rst b/doc/neps/nep-0024-missing-data-2.rst
new file mode 100644
index 000000000..c8b19561f
--- /dev/null
+++ b/doc/neps/nep-0024-missing-data-2.rst
@@ -0,0 +1,210 @@

=============================================================
NEP 24 — Missing Data Functionality - Alternative 1 to NEP 12
=============================================================
:Author: Nathaniel J. Smith <njs@pobox.com>, Matthew Brett <matthew.brett@gmail.com>
:Status: Deferred
:Type: Standards Track
:Created: 2011-06-30


Abstract
--------

*Context: this NEP was written as an alternative to NEP 12, which at the time of writing had an implementation that was merged into the NumPy master branch.*

The principle of this NEP is to separate the APIs for masking and for missing values, according to:

* The current implementation of masked arrays (NEP 12)
* This proposal.

This discussion is only of the API, and not of the implementation.

Detailed description
--------------------


Rationale
^^^^^^^^^

The purpose of this aNEP is to define two interfaces -- one for handling 'missing values', and one for handling 'masked arrays'.

An ordinary value is something like an integer or a floating point number. A *missing* value is a placeholder for an ordinary value that is for some reason unavailable. For example, in working with statistical data, we often build tables in which each row represents one item, and each column represents properties of that item. For instance, we might take a group of people and for each one record height, age, education level, and income, and then stick these values into a table. But then we discover that our research assistant screwed up and forgot to record the age of one of our individuals. We could throw out the rest of their data as well, but this would be wasteful; even such an incomplete row is still perfectly usable for some analyses (e.g., we can compute the correlation of height and income). The traditional way to handle this would be to stick some particular meaningless value in for the missing data, e.g., recording this person's age as 0. But this is very error prone; we may later forget about these special values while running other analyses, and discover to our surprise that babies have higher incomes than teenagers. (In this case, the solution would be to just leave out all the items where we have no age recorded, but this isn't a general solution; many analyses require something more clever to handle missing values.) So instead of using an ordinary value like 0, we define a special "missing" value, written "NA" for "not available".

Therefore, missing values have the following properties: Like any other value, they must be supported by your array's dtype -- you can't store a floating point number in an array with dtype=int32, and you can't store an NA in it either. You need an array with dtype=NAint32 or something (exact syntax to be determined). Otherwise, they act exactly like any other values. In particular, you can apply arithmetic functions and so forth to them. By default, any function which takes an NA as an argument always returns an NA as well, regardless of the values of the other arguments. This ensures that if we try to compute the correlation of income with age, we will get "NA", meaning "given that some of the entries could be anything, the answer could be anything as well". This reminds us to spend a moment thinking about how we should rephrase our question to be more meaningful. And as a convenience for those times when you do decide that you just want the correlation between the known ages and income, you can skip over the NAs by adding a single argument to your function call.
For floating point computations, NAs and NaNs have (almost?) identical behavior. But they represent different things -- NaN an invalid computation like 0/0, NA a value that is not available -- and distinguishing between these things is useful because in some situations they should be treated differently. (For example, an imputation procedure should replace NAs with imputed values, but probably should leave NaNs alone.) And anyway, we can't use NaNs for integers, or strings, or booleans, so we need NA anyway, and once we have NA support for all these types, we might as well support it for floating point too for consistency.

A masked array is, conceptually, an ordinary rectangular numpy array, which has had an arbitrarily-shaped mask placed over it. The result is, essentially, a non-rectangular view of a rectangular array. In principle, anything you can accomplish with a masked array could also be accomplished by explicitly keeping a regular array and a boolean mask array and using numpy indexing to combine them for each operation, but combining them into a single structure is much more convenient when you need to perform complex operations on the masked view of an array, while still being able to manipulate the mask in the usual ways. Therefore, masks are preserved through indexing, and functions generally treat masked-out values as if they were not even part of the array in the first place. (Maybe this is a good heuristic: a length-4 array in which the last value has been masked out behaves just like an ordinary length-3 array, so long as you don't change the mask.) Except, of course, that you are free to manipulate the mask in arbitrary ways whenever you like; it's just a standard numpy array.

There are some simple situations where one could use either of these tools to get the job done -- or other tools entirely, like using designated surrogate values (age=0), separate mask arrays, etc. But missing values are designed to be particularly helpful in situations where the missingness is an intrinsic feature of the data -- where there's a specific value that **should** exist, and if it did exist it would mean something specific, but it **doesn't**. Masked arrays are designed to be particularly helpful in situations where we just want to temporarily ignore some data that does exist, or generally when we need to work with data that has a non-rectangular shape (e.g., if you make some measurement at each point on a grid laid over a circular agar dish, then the points that fall outside the dish aren't missing measurements, they're just meaningless).
Initialization
^^^^^^^^^^^^^^

First, missing values can be set and displayed as ``np.NA, NA``::

    >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
    array([1., 2., NA, 7.], dtype='NA[<f8]')

As the initialization is not ambiguous, this can be written without the NA dtype::

    >>> np.array([1.0, 2.0, np.NA, 7.0])
    array([1., 2., NA, 7.], dtype='NA[<f8]')

Masked values can be set and displayed as ``np.IGNORE, IGNORE``::

    >>> np.array([1.0, 2.0, np.IGNORE, 7.0], masked=True)
    array([1., 2., IGNORE, 7.], masked=True)

As the initialization is not ambiguous, this can be written without ``masked=True``::

    >>> np.array([1.0, 2.0, np.IGNORE, 7.0])
    array([1., 2., IGNORE, 7.], masked=True)

Ufuncs
^^^^^^

By default, NA values propagate::

    >>> na_arr = np.array([1.0, 2.0, np.NA, 7.0])
    >>> np.sum(na_arr)
    NA('float64')

unless the ``skipna`` flag is set::

    >>> np.sum(na_arr, skipna=True)
    10.0

By default, masking does not propagate::

    >>> masked_arr = np.array([1.0, 2.0, np.IGNORE, 7.0])
    >>> np.sum(masked_arr)
    10.0

unless the ``propmask`` flag is set::

    >>> np.sum(masked_arr, propmask=True)
    IGNORE

An array can be masked, and contain NA values::

    >>> both_arr = np.array([1.0, 2.0, np.IGNORE, np.NA, 7.0])

In the default case, the behavior is obvious::

    >>> np.sum(both_arr)
    NA('float64')

It's also obvious what to do with ``skipna=True``::

    >>> np.sum(both_arr, skipna=True)
    10.0
    >>> np.sum(both_arr, skipna=True, propmask=True)
    IGNORE

To break the tie between NA and IGNORE, NAs propagate harder::

    >>> np.sum(both_arr, propmask=True)
    NA('float64')
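For floating point data, today's NumPy already offers both halves of this propagate-versus-skip split via NaN, which is what this proposal generalizes to every dtype. A rough, runnable analogy -- ``np.nan`` standing in for ``np.NA``; the ``skipna=``/``propmask=`` arguments above are proposed, not implemented::

    >>> import numpy as np
    >>> x = np.array([1.0, 2.0, np.nan, 7.0])
    >>> print(np.sum(x))      # propagates, like the proposed NA default
    nan
    >>> print(np.nansum(x))   # skips, like the proposed skipna=True
    10.0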
Assignment
^^^^^^^^^^

is obvious in the NA case::

    >>> arr = np.array([1.0, 2.0, 7.0])
    >>> arr[2] = np.NA
    TypeError('dtype does not support NA')
    >>> na_arr = np.array([1.0, 2.0, 7.0], dtype='NA[f8]')
    >>> na_arr[2] = np.NA
    >>> na_arr
    array([1., 2., NA], dtype='NA[<f8]')

Direct assignment in the masked case is magic and confusing, and so happens only via the mask::

    >>> masked_arr = np.array([1.0, 2.0, 7.0], masked=True)
    >>> masked_arr[2] = np.NA
    TypeError('dtype does not support NA')
    >>> masked_arr[2] = np.IGNORE
    TypeError('float() argument must be a string or a number')
    >>> masked_arr.visible[2] = False
    >>> masked_arr
    array([1., 2., IGNORE], masked=True)


Copyright
---------

This document has been placed in the public domain.
diff --git a/doc/neps/nep-0025-missing-data-3.rst b/doc/neps/nep-0025-missing-data-3.rst
new file mode 100644
index 000000000..6343759e8
--- /dev/null
+++ b/doc/neps/nep-0025-missing-data-3.rst
@@ -0,0 +1,469 @@

======================================
NEP 25 — NA support via special dtypes
======================================

:Author: Nathaniel J. Smith <njs@pobox.com>
:Status: Deferred
:Type: Standards Track
:Created: 2011-07-08

Abstract
========

*Context: this NEP was written as an additional alternative to NEP 12 (NEP 24 is another alternative), which at the time of writing had an implementation that was merged into the NumPy master branch.*

To try and make more progress on the whole missing values/masked arrays/... debate, it seems useful to have a more technical discussion of the pieces which we *can* agree on. This is the second such discussion; it attempts to nail down the details of how NAs can be implemented using special dtypes.

Rationale
---------

An ordinary value is something like an integer or a floating point number. A missing value is a placeholder for an ordinary value that is for some reason unavailable. For example, in working with statistical data, we often build tables in which each row represents one item, and each column represents properties of that item. For instance, we might take a group of people and for each one record height, age, education level, and income, and then stick these values into a table. But then we discover that our research assistant screwed up and forgot to record the age of one of our individuals. We could throw out the rest of their data as well, but this would be wasteful; even such an incomplete row is still perfectly usable for some analyses (e.g., we can compute the correlation of height and income). The traditional way to handle this would be to stick some particular meaningless value in for the missing data, e.g., recording this person's age as 0. But this is very error prone; we may later forget about these special values while running other analyses, and discover to our surprise that babies have higher incomes than teenagers. (In this case, the solution would be to just leave out all the items where we have no age recorded, but this isn't a general solution; many analyses require something more clever to handle missing values.) So instead of using an ordinary value like 0, we define a special "missing" value, written "NA" for "not available".

There are several possible ways to represent such a value in memory. For instance, we could reserve a specific value (like 0, or a particular NaN, or the smallest negative integer) and then ensure that this value is treated specially by all arithmetic and other operations on our array. Another option would be to add an additional mask array next to our main array, use this to indicate which values should be treated as NA, and then extend our array operations to check this mask array whenever performing computations. Each implementation approach has various strengths and weaknesses, but here we focus on the former (value-based) approach exclusively and leave the possible addition of the latter to future discussion. The core advantages of this approach are (1) it adds no additional memory overhead, (2) it is straightforward to store and retrieve such arrays to disk using existing file storage formats, (3) it allows binary compatibility with R arrays including NA values, (4) it is compatible with the common practice of using NaN to indicate missingness when working with floating point numbers, (5) the dtype is already a place where "weird things can happen" -- there are a wide variety of dtypes that don't act like ordinary numbers (including structs, Python objects, fixed-length strings, ...), so code that accepts arbitrary numpy arrays already has to be prepared to handle these (even if only by checking for them and raising an error). Therefore adding yet more new dtypes has less impact on extension authors than if we change the ndarray object itself.
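The bookkeeping this value-based approach implies can be emulated by hand in current NumPy, which is exactly what the proposed dtypes would automate. A minimal sketch, reserving the most negative int32 as the NA code (the choice R makes for its integer NA)::

    >>> import numpy as np
    >>> NA_I32 = np.int32(-2**31)          # reserved NA bit-pattern
    >>> ages = np.array([25, 32, 45], dtype=np.int32)
    >>> ages[1] = NA_I32                   # person 2's age is unknown
    >>> isna = ages == NA_I32              # by-hand "isna" check
    >>> print(ages[~isna].mean())          # mean over the known ages only
    35.0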
The basic semantics of NA values are as follows. Like any other value, they must be supported by your array's dtype -- you can't store a floating point number in an array with dtype=int32, and you can't store an NA in it either. You need an array with dtype=NAint32 or something (exact syntax to be determined). Otherwise, NA values act exactly like any other values. In particular, you can apply arithmetic functions and so forth to them. By default, any function which takes an NA as an argument always returns an NA as well, regardless of the values of the other arguments. This ensures that if we try to compute the correlation of income with age, we will get "NA", meaning "given that some of the entries could be anything, the answer could be anything as well". This reminds us to spend a moment thinking about how we should rephrase our question to be more meaningful. And as a convenience for those times when you do decide that you just want the correlation between the known ages and income, you can skip over the NAs by adding a single argument to your function call.

For floating point computations, NAs and NaNs have (almost?) identical behavior. But they represent different things -- NaN an invalid computation like 0/0, NA a value that is not available -- and distinguishing between these things is useful because in some situations they should be treated differently. (For example, an imputation procedure should replace NAs with imputed values, but probably should leave NaNs alone.) And anyway, we can't use NaNs for integers, or strings, or booleans, so we need NA anyway, and once we have NA support for all these types, we might as well support it for floating point too for consistency.

General strategy
================

Numpy already has a general mechanism for defining new dtypes and slotting them in so that they're supported by ndarrays, by the casting machinery, by ufuncs, and so on. In principle, we could implement NA-dtypes just using these existing interfaces. But we don't want to do that, because defining all those new ufunc loops etc. from scratch would be a huge hassle, especially since the basic functionality needed is the same in all cases. So we need some generic functionality for NAs -- but it would be better not to bake this in as a single set of special "NA types", since users may well want to define new custom dtypes that have their own NA values, and have them integrate well with the rest of the NA machinery. Our strategy, therefore, is to avoid the `mid-layer mistake`_ by exposing some code for generic NA handling in different situations, which dtypes can selectively use or not as they choose.

.. _mid-layer mistake: https://lwn.net/Articles/336262/

Some example use cases:

1. We want to define a dtype that acts exactly like an int32, except that the most negative value is treated as NA.
2. We want to define a parametrized dtype to represent `categorical data`_, and the bit-pattern to be used for NA depends on the number of categories defined, so our code needs to play an active role handling it rather than simply deferring to the standard machinery.
3. We want to define a dtype that acts like a length-10 string and supports NAs. Since our string may hold arbitrary binary values, we want to actually allocate 11 bytes for it, with the first byte a flag indicating whether this string is NA and the rest containing the string content.
4. We want to define a dtype that allows multiple different types of NA data, which print differently and can be distinguished by the new ufunc that we define called ``is_na_of_type(...)``, but otherwise takes advantage of the generic NA machinery for most operations.

.. _categorical data: http://mail.scipy.org/pipermail/numpy-discussion/2010-August/052401.html
dtype C-level API extensions
============================

The `PyArray_Descr`_ struct gains the following new fields::

    void * NA_value;
    PyArray_Descr * NA_extends;
    int NA_extends_offset;

.. _PyArray_Descr: http://docs.scipy.org/doc/numpy/reference/c-api.types-and-structures.html#PyArray_Descr

The following new flag values are defined::

    NPY_NA_AUTO_ARRFUNCS
    NPY_NA_AUTO_CAST
    NPY_NA_AUTO_UFUNC
    NPY_NA_AUTO_UFUNC_CHECKED
    NPY_NA_AUTO_ALL /* the above flags OR'ed together */

The `PyArray_ArrFuncs`_ struct gains the following new fields::

    void (*isna)(void * src, void * dst, npy_intp n, void * arr);
    void (*clearna)(void * data, npy_intp n, void * arr);

.. _PyArray_ArrFuncs: http://docs.scipy.org/doc/numpy/reference/c-api.types-and-structures.html#PyArray_ArrFuncs

We add at least one new convenience macro::

    #define NPY_NA_SUPPORTED(dtype) ((dtype)->f->isna != NULL)

The general idea is that anywhere where we used to call a dtype-specific function pointer, the code will be modified to instead:

1. Check whether the relevant ``NPY_NA_AUTO_...`` bit is enabled, the NA_extends field is non-NULL, and the function pointer we wanted to call is NULL.
2. If these conditions are met, then use ``isna`` to identify which entries in the array are NA, and handle them appropriately. Then look up whatever function we were *going* to call using this dtype on the ``NA_extends`` dtype instead, and use that to handle the non-NA elements.

For more specifics, see the following sections.

Note that if ``NA_extends`` points to a parametrized dtype, then the dtype object it points to must be fully specified. For example, if it is a string dtype, it must have a non-zero ``elsize`` field.

In order to handle the case where the NA information is stored in a field next to the "real" data, the ``NA_extends_offset`` field is set to a non-zero value; it must point to the location within each element of this dtype where some data of the ``NA_extends`` dtype is found. For example, if we are storing 10-byte strings with an NA indicator byte at the beginning, then we have::

    elsize == 11
    NA_extends_offset == 1
    NA_extends->elsize == 10

When delegating to the ``NA_extends`` dtype, we offset our data pointer by ``NA_extends_offset`` (while keeping our strides the same) so that it sees an array of data of the expected type (plus some superfluous padding). This is basically the same mechanism that record dtypes use, IIUC, so it should be pretty well-tested.
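The flag-plus-payload layout in this example can be previewed with an ordinary structured dtype, which exercises the same offset mechanism; the field names ``na`` and ``val`` below are invented for the sketch::

    >>> import numpy as np
    >>> dt = np.dtype([('na', 'u1'), ('val', 'S10')])
    >>> dt.itemsize                  # the elsize of the NA dtype
    11
    >>> dt.fields['val'][1]          # the would-be NA_extends_offset
    1
    >>> a = np.zeros(2, dtype=dt)
    >>> a[0] = (0, b'hello')         # ordinary value
    >>> a[1] = (1, b'')              # flag byte set: element is "NA"
    >>> bool(a[1]['na'])             # the isna check for element 1
    True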
When delegating to a function that cannot handle "misbehaved" source data (see the ``PyArray_ArrFuncs`` documentation for details), we need to check for alignment issues before delegating (especially with a non-zero ``NA_extends_offset``). If there's a problem, then we need to "clean up" the source data first, using the usual mechanisms for handling misaligned data. (Of course, we should usually set up our dtypes so that there aren't any alignment issues, but if someone screws that up, or decides that reduced memory usage is more important to them than fast inner loops, then we should still handle that gracefully, as we do now.)

The ``NA_value`` and ``clearna`` fields are used for various sorts of casting. ``NA_value`` is a bit-pattern to be used when, for example, assigning from np.NA. ``clearna`` can be a no-op if ``elsize`` and ``NA_extends->elsize`` are the same, but if they aren't then it should clear whatever auxiliary NA storage this dtype uses, so that none of the specified array elements are NA.

Core dtype functions
--------------------

The following functions are defined in ``PyArray_ArrFuncs``. The special behavior described here is enabled by the NPY_NA_AUTO_ARRFUNCS bit in the dtype flags, and only enabled if the given function field is *not* filled in.

``getitem``: Calls ``isna``. If ``isna`` returns true, returns np.NA. Otherwise, delegates to the ``NA_extends`` dtype.

``setitem``: If the input object is ``np.NA``, then runs ``memcpy(data, self->NA_value, arr->dtype->elsize);``. Otherwise, calls ``clearna``, and then delegates to the ``NA_extends`` dtype.

``copyswapn``, ``copyswap``: FIXME: Not sure whether there's any special handling to use for these?

``compare``: FIXME: how should this handle NAs? R's sort function *discards* NAs, which doesn't seem like a good option.

``argmax``: FIXME: what is this used for? If it's the underlying implementation for np.max, then it really needs some way to get a skipna argument. If not, then the appropriate semantics depends on what it's supposed to accomplish...

``dotfunc``: QUESTION: is it actually guaranteed that everything has the same dtype? FIXME: same issues as for ``argmax``.

``scanfunc``: This one's ugly. We may have to explicitly override it in all of our special dtypes, because assuming that we want the option of, say, having the token "NA" represent an NA value in a text file, we need some way to check whether that's there before delegating. But ``ungetc`` is only guaranteed to let us put back 1 character, and we need 2 (or maybe 3 if we actually check for "NA "). The other option would be to read to the next delimiter, check whether we have an NA, and if not then delegate to ``fromstr`` instead of ``scanfunc``, but according to the current API, each dtype might in principle use a totally different rule for defining "the next delimiter". So... any ideas? (FIXME)

``fromstr``: Easy -- check for "NA ", if present then assign ``NA_value``, otherwise call ``clearna`` and delegate.

``nonzero``: FIXME: again, what is this used for? (It seems redundant with using the casting machinery to cast to bool.) Probably it needs to be modified so that it can return NA, though...

``fill``: Use ``isna`` to check if either of the first two values is NA. If so, then fill the rest of the array with ``NA_value``. Otherwise, call ``clearna`` and then delegate.

``fillwithvalue``: Guess this can just delegate?

``sort``, ``argsort``: These should probably arrange to sort NAs to a particular place in the array (either the front or the back -- any opinions?)

``scalarkind``: FIXME: I have no idea what this does.

``castdict``, ``cancastscalarkindto``, ``cancastto``: See the section on casting below.
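In Python terms, the ``getitem``/``setitem`` contract above amounts to something like the following toy model (illustrative only; the real hooks are C function pointers operating on raw element buffers)::

    import numpy as np

    NA = object()                        # stand-in for the np.NA singleton

    class NAInt32:
        """Toy NA dtype wrapping int32 (the NA_extends dtype), using the
        most negative value as the NA_value bit-pattern."""
        na_value = np.int32(-2**31)

        def getitem(self, arr, i):
            if arr[i] == self.na_value:  # the dtype's ``isna`` check
                return NA
            return int(arr[i])           # delegate to the underlying int32

        def setitem(self, arr, i, value):
            if value is NA:
                arr[i] = self.na_value   # write the NA bit-pattern
            else:
                arr[i] = value           # clearna is a no-op here; delegate

    d = NAInt32()
    a = np.zeros(3, dtype=np.int32)
    d.setitem(a, 1, NA)
    assert d.getitem(a, 1) is NA and d.getitem(a, 0) == 0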
Casting
-------

FIXME: this really needs attention from an expert on numpy's casting rules. But I can't seem to find the docs that explain how casting loops are looked up and decided between (e.g., if you're casting from dtype A to dtype B, which dtype's loops are used?), so I can't go into details. But those details are tricky and they matter...

But the general idea is, if you have a dtype with ``NPY_NA_AUTO_CAST`` set, then the following conversions are automatically allowed:

* Casting from the underlying type to the NA-type: this is performed by the usual ``clearna`` + potentially-strided copy dance. Also, ``isna`` is called to check that none of the regular values have been accidentally converted into NA; if so, then an error is raised.
* Casting from the NA-type to the underlying type: allowed in principle, but if ``isna`` returns true for any of the values that are to be converted, then again, an error is raised. (If you want to get around this, use ``np.view(array_with_NAs, dtype=float)``.)
* Casting between the NA-type and other types that do not support NA: this is allowed if the underlying type is allowed to cast to the other type, and is performed by combining a cast to or from the underlying type (using the above rules) with a cast to or from the other type (using the underlying type's rules).
* Casting between the NA-type and other types that do support NA: if the other type has NPY_NA_AUTO_CAST set, then we use the above rules plus the usual dance with ``isna`` on one array being converted to ``NA_value`` elements in the other. If only one of the arrays has NPY_NA_AUTO_CAST set, then it's assumed that that dtype knows what it's doing, and we don't do any magic. (But this is one of the things that I'm not sure makes sense, as per my caveat above.)

Ufuncs
------

All ufuncs gain an additional optional keyword argument, ``skipNA=``, which defaults to False.

If ``skipNA == True``, then the ufunc machinery *unconditionally* calls ``isna`` for any dtype where NPY_NA_SUPPORTED(dtype) is true, and then acts as if any values for which isna returns True were masked out in the ``where=`` argument (see miniNEP 1 for the behavior of ``where=``). If a ``where=`` argument is also given, then it acts as if the ``isna`` values had been ANDed out of the ``where=`` mask, though it does not actually modify the mask. Unlike the other changes below, this is performed *unconditionally* for any dtype which has an ``isna`` function defined; the NPY_NA_AUTO_UFUNC flag is *not* checked.

If NPY_NA_AUTO_UFUNC is set, then ufunc loop lookup is modified so that whenever it checks for the existence of a loop on the current dtype, and does not find one, it also checks for a loop on the ``NA_extends`` dtype. If that loop is found, then it uses it in the normal way, with the exceptions that (1) it is only called for values which are not NA according to ``isna``, (2) if the output array has NPY_NA_AUTO_UFUNC set, then ``clearna`` is called on it before calling the ufunc loop, (3) pointer offsets are adjusted by ``NA_extends_offset`` before calling the ufunc loop. In addition, if NPY_NA_AUTO_UFUNC_CHECKED is set, then after evaluating the ufunc loop we call ``isna`` on the *output* array, and if there are any NAs in the output which were not in the input, then we raise an error. (The intention of this is to catch cases where, say, we represent NA using the most-negative integer, and then someone's arithmetic overflows to create such a value by accident.)

FIXME: We should go into more detail here about how NPY_NA_AUTO_UFUNC works when there are multiple input arrays, of which potentially some have the flag set and some do not.
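The intended effect of ``skipNA=`` on a reduction can be sketched at the Python level, using the most-negative-int32 convention from use case 1 above (a semantic illustration, not the C implementation)::

    import numpy as np

    NA_I32 = np.int32(-2**31)              # assumed NA bit-pattern

    def na_aware_sum(arr, skipna=False):
        isna = arr == NA_I32               # the dtype's ``isna`` hook
        if skipna:
            return arr[~isna].sum()        # reduce over non-NA elements only
        return NA_I32 if isna.any() else arr.sum()

    a = np.array([1, 2, -2**31, 7], dtype=np.int32)
    assert na_aware_sum(a) == NA_I32       # NA propagates by default
    assert na_aware_sum(a, skipna=True) == 10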
Printing
--------

FIXME: There should be some sort of mechanism by which values which are NA are automatically repr'ed as NA, but I don't really understand how numpy printing works, so I'll let someone else fill in this section.

Indexing
--------

Scalar indexing like ``a[12]`` goes via the ``getitem`` function, so according to the proposal as described above, if a dtype delegates ``getitem``, then scalar indexing on NAs will return the object ``np.NA``. (If it doesn't delegate ``getitem``, of course, then it can return whatever it wants.)

This seems like the simplest approach, but an alternative would be to add a special case to scalar indexing, where if an ``NPY_NA_AUTO_INDEX`` flag were set, then it would call ``isna`` on the specified element. If this returned false, it would call ``getitem`` as usual; otherwise, it would return a 0-d array containing the specified element. The problem with this is that it breaks expressions like ``if a[i] is np.NA: ...``. (Of course, there is nothing nearly so convenient as that for NaN values now, but then, NaN values don't have their own global singleton.) So for now we stick to scalar indexing just returning ``np.NA``, but this can be revisited if anyone objects.

Python API for generic NA support
=================================

NumPy will gain a global singleton called numpy.NA, similar to None, but with semantics reflecting its status as a missing value. In particular, trying to treat it as a boolean will raise an exception, and comparisons with it will produce numpy.NA instead of True or False. These basics are adopted from the behavior of the NA value in the R project. To dig deeper into the ideas, http://en.wikipedia.org/wiki/Ternary_logic#Kleene_logic provides a starting point.

Most operations on ``np.NA`` (e.g., ``__add__``, ``__mul__``) are overridden to unconditionally return ``np.NA``.
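A few lines of Python capture the intended flavor of the singleton (a toy model, not the proposed implementation)::

    class NAType:
        """Toy model of the proposed numpy.NA singleton."""
        def __repr__(self):
            return "NA"
        def __bool__(self):
            # Kleene logic: NA's truth value is unknown, so refuse to guess.
            raise TypeError("NA truth value is ambiguous")
        def __add__(self, other):
            return self                  # unknown + anything = unknown
        __radd__ = __mul__ = __rmul__ = __add__
        def __eq__(self, other):
            return self                  # comparisons yield NA, not True/False

    NA = NAType()
    print(NA + 3)     # NA
    print(NA == NA)   # NA -- one unknown need not equal another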
The automagic dtype detection used for expressions like ``np.asarray([1, 2, 3])``, ``np.asarray([1.0, 2.0, 3.0])`` will be extended to recognize the ``np.NA`` value, and use it to automatically switch to a built-in NA-enabled dtype (which one being determined by the other elements in the array). A simple ``np.asarray([np.NA])`` will use an NA-enabled float64 dtype (which is analogous to what you get from ``np.asarray([])``). Note that this means that expressions like ``np.log(np.NA)`` will work: first ``np.NA`` will be coerced to a 0-d NA-float array, and then ``np.log`` will be called on that.

Python-level dtype objects gain the following new fields::

    NA_supported
    NA_value

``NA_supported`` is a boolean which simply exposes the value of the ``NPY_NA_SUPPORTED`` flag; it should be true if this dtype allows for NAs, false otherwise. [FIXME: would it be better to just key this off the existence of the ``isna`` function? Even if a dtype decides to implement all other NA handling itself, it still has to define ``isna`` in order to make ``skipNA=`` work correctly.]

``NA_value`` is a 0-d array of the given dtype, and its sole element contains the same bit-pattern as the dtype's underlying ``NA_value`` field. This makes it possible to determine the default bit-pattern for NA values for this type (e.g., with ``np.view(mydtype.NA_value, dtype=int8)``).

We *do not* expose the ``NA_extends`` and ``NA_extends_offset`` values at the Python level, at least for now; they're considered an implementation detail (and it's easier to expose them later if they're needed than to unexpose them if they aren't).

Two new ufuncs are defined: ``np.isNA`` returns a logical array, with true values wherever the dtype's ``isna`` function returned true. ``np.isnumber`` is only defined for numeric dtypes, and returns True for all elements which are not NA, and for which ``np.isfinite`` would return True.

Builtin NA dtypes
=================

The above describes the generic machinery for NA support in dtypes. It's flexible enough to handle all sorts of situations, but we also want to define a few generally useful NA-supporting dtypes that are available by default.

For each built-in dtype, we define an associated NA-supporting dtype, as follows:

* floats: the associated dtype uses a specific NaN bit-pattern to indicate NA (chosen for R compatibility)
* complex: we do whatever R does (FIXME: look this up -- two NA floats, probably?)
* signed integers: the most-negative signed value is used as NA (chosen for R compatibility)
* unsigned integers: the most-positive value is used as NA (no R compatibility possible)
* strings: the first byte (or, in the case of unicode strings, first 4 bytes) is used as a flag to indicate NA, and the rest of the data gives the actual string (no R compatibility possible)
* objects: two options (FIXME): either we don't include an NA-ful version, or we use np.NA as the NA bit-pattern
* boolean: we do whatever R does (FIXME: look this up -- 0 == FALSE, 1 == TRUE, 2 == NA?)
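For the float bullet above, the concrete bit-pattern R reserves for ``NA_real_`` is -- to the best of our recollection, so treat this as an assumption to verify against R's sources -- a NaN whose low word carries the payload 1954::

    import struct
    import numpy as np

    # Assumed NA_real_ bit-pattern: exponent all ones, payload 0x7A2 == 1954.
    na_real = struct.unpack('<d', struct.pack('<Q', 0x7FF00000000007A2))[0]
    print(np.isnan(na_real))   # True -- to the hardware it is just a NaN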
Each of these dtypes is trivially defined using the above machinery, and they are what the automagic type inference machinery automatically uses (for ``np.asarray([True, np.NA, False])``, etc.).

They can also be accessed via a new function ``np.withNA``, which takes a regular dtype (or an object that can be coerced to a dtype, like 'float') and returns one of the above dtypes. Ideally ``withNA`` should also take some optional arguments that let you describe which values you want to count as NA, etc., but I'll leave that for a future draft (FIXME).

FIXME: If ``d`` is one of the above dtypes, then what should ``d.type`` return?

The NEP also contains a proposal for a somewhat elaborate domain-specific-language for describing NA dtypes. I'm not sure how great an idea that is. (I have a bias against using strings as data structures, and find the already existing strings confusing enough as it is -- also, apparently the NEP version of numpy uses strings like 'f8' when printing dtypes, while my numpy uses object names like 'float64', so I'm not sure what's going on there. ``withNA(float64, arg1=value1)`` seems like a more pleasant way to print a dtype than "NA[f8,value1]", at least to me.) But if people want it, then cool.

Type hierarchy
--------------

FIXME: how should we do subtype checks, etc., for NA dtypes? What does ``issubdtype(withNA(float), float)`` return? How about ``issubdtype(withNA(float), np.floating)``?

Serialization
-------------


Copyright
---------

This document has been placed in the public domain.
diff --git a/doc/neps/nep-0026-missing-data-summary.rst b/doc/neps/nep-0026-missing-data-summary.rst
new file mode 100644
index 000000000..6f7fd0fa1
--- /dev/null
+++ b/doc/neps/nep-0026-missing-data-summary.rst
@@ -0,0 +1,727 @@

====================================================
NEP 26 — Summary of Missing Data NEPs and discussion
====================================================

:Author: Mark Wiebe <mwwiebe@gmail.com>, Nathaniel J. Smith <njs@pobox.com>
:Status: Deferred
:Type: Standards Track
:Created: 2012-04-22

*Context: this NEP was written as a summary of the large number of discussions and proposals (NEP 12, NEP 24, NEP 25) regarding missing data functionality.*

The debate about how NumPy should handle missing data, a subject with many preexisting approaches, requirements, and conventions, has been long and contentious. There has been more than one proposal for how to implement missing data support in NumPy, and there is a testable implementation which is merged into NumPy's current master. The vast number of emails and differing points of view has made it difficult for interested parties to understand the issues and be comfortable with the direction NumPy is going.

Here is our (Mark and Nathaniel's) attempt to summarize the problem, proposals, and points of agreement/disagreement in a single place, to help the community move towards consensus.

The NumPy developers' problem
=============================

For this discussion, "missing data" means array elements which can be indexed (e.g. A[3] in an array A with shape (5,)), but have, in some sense, no value.

It does not refer to compressed or sparse storage techniques where the value for A[3] is not actually stored in memory, but still has a well-defined value like 0.

This is still vague, and to create an actual implementation, it is necessary to answer such questions as:

* What values are computed when doing element-wise ufuncs?
* What values are computed when doing reductions?
* Whether the storage for an element gets overwritten when marking that value missing.
* Whether computations resulting in NaN are automatically treated in the same way as a missing value.
* Whether one interacts with missing values using a placeholder object (e.g. called "NA" or "masked"), or through a separate boolean array.
* Whether there is such a thing as an array object that cannot hold missing array elements.
* How the (C and Python) API is expressed, in terms of dtypes, masks, and other constructs.
* If we decide to answer some of these questions in multiple ways, then that creates the question of whether that requires multiple systems, and if so how they should interact.

There's clearly a very large space of missing-data APIs that *could* be implemented. There is likely at least one user, somewhere, who would find any possible implementation to be just the thing they need to solve some problem. On the other hand, much of NumPy's power and clarity comes from having a small number of orthogonal concepts, such as strided arrays, flexible indexing, broadcasting, and ufuncs, and we'd like to preserve that simplicity.
There has been dissatisfaction among several major groups of NumPy users about the existing status quo of missing data support. In particular, neither the numpy.ma component nor use of floating-point NaNs as a missing data signal fully satisfies the performance requirements and ease of use for these users. The example of R, where missing data is treated via an NA placeholder and is deeply integrated into all computation, is where many of these users point to indicate what functionality they would like. Doing a deep integration of missing data like in R must be considered carefully; it must be clear that it is not being done in a way which sacrifices existing performance or functionality.

Our problem is, how can we choose some incremental additions to NumPy that will make a large class of users happy, be reasonably elegant, complement the existing design, and that we're comfortable we won't regret being stuck with in the long term.

Prior art
=========

So a major (maybe *the* major) problem is figuring out how ambitious the project to add missing data support to NumPy should be, and which kinds of problems are in scope. Let's start with the best understood situation where "missing data" comes into play:

"Statistical missing data"
--------------------------

In statistics, social science, etc., "missing data" is a term of art referring to a specific (but extremely common and important) situation: we have tried to gather some measurements according to some scheme, but some of these measurements are missing. For example, if we have a table listing the height, age, and income of a number of individuals, but one person did not provide their income, then we need some way to represent this::

    Person | Height | Age | Income
    ------------------------------
       1   |   63   |  25 | 15000
       2   |   58   |  32 | <missing>
       3   |   71   |  45 | 30000

The traditional way is to record that income as, say, "-99", and document this in the README along with the data set. Then, you have to remember to check for and handle such incomes specially; if you forget, you'll get superficially reasonable but completely incorrect results, like calculating the average income on this data set as 14967. If you're in one of these fields, then such missing-ness is routine and inescapable, and if you use the "-99" approach then it's a pitfall you have to remember to check for explicitly on literally *every* calculation you ever do. This is, obviously, an unpleasant way to live.
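The pitfall is easy to reproduce with plain NumPy, using the numbers from the table above::

    >>> import numpy as np
    >>> income = np.array([15000, -99, 30000])   # -99 as the missing code
    >>> print(income.mean())                     # silently wrong
    14967.0
    >>> print(income[income != -99].mean())      # right, if you remember
    22500.0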
Let's call this situation the "statistical missing data" situation, just to have a convenient handle for it. (As mentioned, practitioners just call this "missing data", and what to do about it is literally an entire sub-field of statistics; if you google "missing data" then every reference is on how to handle it.) NumPy isn't going to do automatic imputation or anything like that, but it could help a great deal by providing some standard way to at least represent data which is missing in this sense.

The main prior art for how this could be done comes from the S/S+/R family of languages. Their strategy is, for each type they support, to define a special value called "NA". (For ints this is INT_MIN, for floats it's a special NaN value that's distinguishable from other NaNs, ...) Then, they arrange that in computations, this value has a special semantics that we will call "NA semantics".

NA Semantics
------------

The idea of NA semantics is that any computations involving NA values should be consistent with what would have happened if we had known the correct value.

For example, let's say we want to compute the mean income; how might we do this? One way would be to just ignore the missing entry, and compute the mean of the remaining entries. This gives us (15000 + 30000)/2, or 22500.

Is this result consistent with discovering the income of person 2? Let's say we find out that person 2's income is 50000. This means the correct answer is (15000 + 50000 + 30000)/3, or 31666.67, indicating clearly that it is not consistent. Therefore, the mean income is NA, i.e. a specific number whose value we are unable to compute.

This motivates the following rules, which are how R implements NA:

Assignment:
    NA values are understood to represent specific unknown values, and thus should have value-like semantics with respect to assignment and other basic data manipulation operations. Code which does not actually look at the values involved should work the same regardless of whether some of them are missing. For example, one might write::

        income[:] = income[np.argsort(height)]

    to perform an in-place sort of the ``income`` array, and know that the shortest person's income would end up being first. It turns out that the shortest person's income is not known, so the array should end up being ``[NA, 15000, 30000]``, but there's nothing special about NA-ness here.

Propagation:
    In the example above, we concluded that an operation like ``mean`` should produce NA when one of its data values was NA. If you ask me, "what is 3 plus x?", then my only possible answer is "I don't know what x is, so I don't know what 3 + x is either". NA means "I don't know", so 3 + NA is NA.

    This is important for safety when analyzing data: missing data often requires special handling for correctness -- the fact that you are missing information might mean that something you wanted to compute cannot actually be computed, and there are whole books written on how to compensate in various situations. Plus, it's easy to not realize that you have missing data, and write code that assumes you have all the data. Such code should not silently produce the wrong answer.

    There is an important exception to characterizing this as propagation, in the case of boolean values. Consider the calculation::

        v = np.any([False, False, NA, True])

    If we strictly propagate, ``v`` will become NA. However, no matter whether we place True or False into the third array position, ``v`` will then get the value True. The answer to the question "Is the result True consistent with later discovering the value that was missing?" is yes, so it is reasonable to not propagate here, and instead return the value True. This is what R does::

        > any(c(F, F, NA, T))
        [1] TRUE
        > any(c(F, F, NA, F))
        [1] NA

Other:
    NaN and NA are conceptually distinct. 0.0/0.0 is not a mysterious, unknown value -- it's defined to be NaN by IEEE floating point, Not a Number. NAs are numbers (or strings, or whatever), just unknown ones. Another small but important difference is that in Python, ``if NaN: ...`` treats NaN as True (NaN is "truthy"); but ``if NA: ...`` would be an error.

    In R, all reduction operations implement an alternative semantics, activated by passing a special argument (``na.rm=TRUE`` in R). ``sum(a)`` means "give me the sum of all the values" (which is NA if some of the values are NA); ``sum(a, na.rm=TRUE)`` means "give me the sum of all the non-NA values".
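The boolean exception is mechanical enough to state as code; a sketch mirroring the two R calls above, with a plain Python object standing in for NA::

    NA = object()   # opaque "unknown" marker

    def kleene_any(values):
        """R-style any(): a definite True beats NA, NA beats False."""
        saw_na = False
        for v in values:
            if v is NA:
                saw_na = True
            elif v:
                return True      # consistent whatever the NA conceals
        return NA if saw_na else False

    assert kleene_any([False, False, NA, True]) is True
    assert kleene_any([False, False, NA, False]) is NA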
Other prior art
---------------

Once we move beyond the "statistical missing data" case, the correct behavior for missing data becomes less clearly defined. There are many cases where specific elements are singled out to be treated specially or excluded from computations, and these could often be conceptualized as involving 'missing data' in some sense.

In image processing, it's common to use a single image together with one or more boolean masks to e.g. composite subsets of an image. As Joe Harrington pointed out on the list, in the context of processing astronomical images, it's also common to generalize to a floating-point valued mask, or alpha channel, to indicate degrees of "missingness". We think this is out of scope for the present design, but it is an important use case, and ideally NumPy should support natural ways of manipulating such data.

After R, numpy.ma is probably the most mature source of experience on missing-data-related APIs. Its design is quite different from R; it uses different semantics -- reductions skip masked values by default and NaNs convert to masked -- and it uses a different storage strategy via a separate mask. While it seems to be generally considered sub-optimal for general use, it's hard to pin down whether this is because the API is immature but basically good, or the API is fundamentally broken, or the API is great but the code should be faster, or what. We looked at some of those users to try and get a better idea.

Matplotlib is perhaps the best known package to rely on numpy.ma. It seems to use it in two ways. One is as a way for users to indicate what data is missing when passing it to be graphed. (Other ways are also supported, e.g., passing in NaN values gives the same result.) In this regard, matplotlib treats np.ma.masked and NaN values in the same way that R's plotting routines handle NA and NaN values. For these purposes, matplotlib doesn't really care what semantics or storage strategy is used for missing data.

Internally, matplotlib uses numpy.ma arrays to store and pass around separately computed boolean masks containing 'validity' information for each input array in a cheap and non-destructive fashion. Mark's impression from some shallow code review is that mostly it works directly with the data and mask attributes of the masked arrays, not extensively using the particular computational semantics of numpy.ma. So, for this usage they do rely on the non-destructive mask-based storage, but this doesn't say much about what semantics are needed.

Paul Hobson `posted some code`__ on the list that uses numpy.ma for storing arrays of contaminant concentration measurements. Here the mask indicates whether the corresponding number represents an actual measurement, or just the estimated detection limit for a concentration which was too small to detect. Nathaniel's impression from reading through this code is that it also mostly uses the .data and .mask attributes in preference to performing operations on the MaskedArray directly.

__ https://mail.scipy.org/pipermail/numpy-discussion/2012-April/061743.html

So, these examples make it clear that there is demand for a convenient way to keep a data array and a mask array (or even a floating point array) bundled up together and "aligned". But they don't tell us much about what semantics the resulting object should have with respect to ufuncs and friends.

Semantics, storage, API, oh my!
===============================
We think it's useful to draw a clear line between use cases, semantics, and storage. Use cases are situations that users encounter, regardless of what NumPy does; they're the focus of the previous section. When we say *semantics*, we mean the result of different operations as viewed from the Python level without regard to the underlying implementation.

*NA semantics* are the ones described above and used by R::

    1 + NA = NA
    sum([1, 2, NA]) = NA
    NA | False = NA
    NA | True = True

With ``na.rm=TRUE`` or ``skipNA=True``, this switches to::

    1 + NA = illegal   # in R, only reductions take na.rm argument
    sum([1, 2, NA], skipNA=True) = 3

There's also been discussion of what we'll call *ignore semantics*. These are somewhat underdefined::

    sum([1, 2, IGNORED]) = 3
    # Several options here:
    1 + IGNORED = 1
    # or
    1 + IGNORED = <leaves output array untouched>
    # or
    1 + IGNORED = IGNORED

The numpy.ma semantics are::

    sum([1, 2, masked]) = 3
    1 + masked = masked
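These numpy.ma semantics are observable in any current NumPy::

    >>> import numpy as np
    >>> import numpy.ma as ma
    >>> a = ma.masked_array([1.0, 2.0, 7.0], mask=[False, False, True])
    >>> print(a.sum())        # the reduction skips the masked value
    3.0
    >>> print(1 + ma.masked)  # element-wise operations propagate masked
    --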
If either NA or ignore semantics are implemented with masks, then there is a choice of what should be done to the value in the storage for an array element which gets assigned a missing value. Three possibilities are:

* Leave that memory untouched (the choice made in the NEP).
* Do the calculation with the values independently of the mask (perhaps the most useful option for Paul Hobson's use-case above).
* Copy whatever value is stored behind the input missing value into the output (this is what numpy.ma does. Even that is ambiguous in the case of ``masked + masked`` -- in this case numpy.ma copies the value stored behind the leftmost masked value).

When we talk about *storage*, we mean the debate about whether missing values should be represented by designating a particular value of the underlying data-type (the *bitpattern dtype* option, as used in R), or by using a separate *mask* stored alongside the data itself.

For mask-based storage, there is also an important question about what the API looks like for accessing the mask, modifying the mask, and "peeking behind" the mask.

Designs that have been proposed
===============================

One option is to just copy R, by implementing a mechanism whereby dtypes can arrange for certain bitpatterns to be given NA semantics.

One option is to copy numpy.ma closely, but with a more optimized implementation. (Or to simply optimize the existing implementation.)

One option is that described in the NEP_, for which an implementation of mask-based missing data exists. This system is roughly:

.. _NEP: https://github.com/numpy/numpy/blob/master/doc/neps/nep-0012-missing-data.rst

* There is both bitpattern and mask-based missing data, and both have identical interoperable NA semantics.
* Masks are modified by assigning np.NA or values to array elements. The way to peek behind the mask or to unmask values is to keep a view of the array that shares the data pointer but not the mask pointer.
* Mark would like to add a way to access and manipulate the mask more directly, to be used in addition to this view-based API.
* If an array has both a bitpattern dtype and a mask, then assigning np.NA writes to the mask, rather than to the array itself. Writing a bitpattern NA to an array which supports both requires accessing the data by "peeking under the mask".

Another option is that described in the alterNEP_, which is to implement bitpattern dtypes with NA semantics for the "statistical missing data" use case, and to also implement a totally independent API for masked arrays with ignore semantics and all mask manipulation done explicitly through a .mask attribute.

.. _alterNEP: https://gist.github.com/njsmith/1056379

Another option would be to define a minimalist aligned array container that holds multiple arrays and that can be used to pass them around together. It would support indexing (to help with the common problem of wanting to subset several arrays together without their becoming unaligned), but all arithmetic etc. would be done by accessing the underlying arrays directly via attributes. The "prior art" discussion above suggests that something like this holding a .data and a .mask array might actually solve a number of people's problems without requiring any major architectural changes to NumPy. This is similar to a structured array, but with each field in a separately stored array instead of packed together.

Several people have suggested that there should be a single system that has multiple missing values that each have different semantics, e.g., a MISSING value that has NA semantics, and a separate IGNORED value that has ignore semantics.

None of these options are necessarily exclusive.

The debate
==========

We are both dubious of using ignore semantics as a default missing data behavior. **Nathaniel** likes NA semantics because he is most interested in the "statistical missing data" use case, and NA semantics are exactly right for that. **Mark** isn't as interested in that use case in particular, but he likes the NA computational abstraction because it is unambiguous and well-defined in all cases, and has a lot of existing experience to draw from.

What **Nathaniel** thinks, overall:

* The "statistical missing data" use case is clear and compelling; the other use cases certainly deserve our attention, but it's hard to say what they *are* exactly yet, or even if the best way to support them is by extending the ndarray object.
* The "statistical missing data" use case is best served by an R-style system that uses bitpattern storage to implement NA semantics. The main advantage of bitpattern storage for this use case is that it avoids the extra memory and speed overhead of storing and checking a mask (especially for the common case of floating point data, where some tricks with NaNs allow us to effectively hardware-accelerate most NA operations). These concerns alone appear to make a mask-based implementation unacceptable to many NA users, particularly in areas like neuroscience (where memory is tight) or financial modeling (where milliseconds are critical). In addition, the bitpattern approach is less confusing conceptually (e.g., assignment really is just assignment, no magic going on behind the curtain), and it's possible to have in-memory compatibility with R for inter-language calls via rpy2. The main disadvantage of the bitpattern approach is the need to give up a value to represent NA, but this is not an issue for the most important data types (float, bool, strings, enums, objects); really, only integers are affected. And even for integers, giving up a value doesn't really matter for statistical problems. (Occupy Wall Street notwithstanding, no-one's income is 2**63 - 1. And if it were, we'd be switching to floats anyway to avoid overflow.)
Several people have suggested that there should be a single system that has multiple missing values, each with different semantics: e.g., a MISSING value that has NA semantics, and a separate IGNORED value that has ignore semantics.

None of these options is necessarily exclusive of the others.

The debate
==========

We are both dubious of using ignore semantics as the default missing data behavior. **Nathaniel** likes NA semantics because he is most interested in the "statistical missing data" use case, and NA semantics are exactly right for that. **Mark** isn't as interested in that use case in particular, but he likes the NA computational abstraction because it is unambiguous and well-defined in all cases, and has a lot of existing experience to draw from.

What **Nathaniel** thinks, overall:

* The "statistical missing data" use case is clear and compelling; the other use cases certainly deserve our attention, but it's hard to say what they *are* exactly yet, or even whether the best way to support them is by extending the ndarray object.
* The "statistical missing data" use case is best served by an R-style system that uses bitpattern storage to implement NA semantics. The main advantage of bitpattern storage for this use case is that it avoids the extra memory and speed overhead of storing and checking a mask (especially for the common case of floating point data, where some tricks with NaNs allow us to effectively hardware-accelerate most NA operations). These concerns alone appear to make a mask-based implementation unacceptable to many NA users, particularly in areas like neuroscience (where memory is tight) or financial modeling (where milliseconds are critical). In addition, the bitpattern approach is less confusing conceptually (e.g., assignment really is just assignment, with no magic going on behind the curtain), and it's possible to have in-memory compatibility with R for inter-language calls via rpy2. The main disadvantage of the bitpattern approach is the need to give up a value to represent NA, but this is not an issue for the most important data types (float, bool, strings, enums, objects); really, only integers are affected. And even for integers, giving up a value doesn't really matter for statistical problems. (Occupy Wall Street notwithstanding, no-one's income is 2**63 - 1. And if it were, we'd be switching to floats anyway to avoid overflow.)
* Adding new dtypes requires some cooperation with the ufunc and casting machinery, but doesn't require any architectural changes or violations of NumPy's current orthogonality.
* His impression from the mailing list discussion, especially the `"what can we agree on?" thread`__, is that many numpy.ma users specifically like the combination of masked storage, the mask being easily accessible through the API, and ignore semantics. He could be wrong, of course. But he cannot remember seeing anybody besides Mark advocate for the specific combination of masked storage and NA semantics, which makes him nervous.

  __ http://thread.gmane.org/gmane.comp.python.numeric.general/46704

* Also, he personally is not very happy with the idea of having two storage implementations that are almost-but-not-quite identical at the Python level. While there likely are people who would like to temporarily pretend that certain data is "statistically missing data" without making a copy of their array, it's not at all clear that they outnumber the people who would like to use bitpatterns and masks simultaneously for distinct purposes. And honestly, he'd like to be able to just ignore masks if he wants and stick to bitpatterns, which isn't possible if they're coupled together tightly in the API. So he would say the jury is still very much out on whether this aspect of the NEP design is an advantage or a disadvantage. (Certainly he's never heard of any R users complaining that they really wish they had the option of making a different trade-off here.)
* R's NA support is a `headline feature`__ and its target audience considers it a compelling advantage over other platforms like Matlab or Python. Working with statistical missing data is very painful without platform support.

  __ http://www.sr.bham.ac.uk/~ajrs/R/why_R.html

* By comparison, we clearly have much more uncertainty about the use cases that require a mask-based implementation, and it doesn't seem like people will suffer too badly if they are forced for now to settle for using NumPy's excellent mask-based indexing, the new where= support, and even numpy.ma.
* Therefore, bitpatterns with NA semantics seem to meet the criteria: they make a large class of users happy, in an elegant way, they fit into the original design, and we can have reasonable certainty that we understand the problem and use cases well enough to be happy with them in the long run. No mask-based storage proposal meets those criteria yet.

What **Mark** thinks, overall:

* The idea of using NA semantics by default for missing data, inspired by the "statistical missing data" problem, is better than all the other default behaviors which were considered. This applies equally to the bitpattern and the masked approach.

* For NA-style functionality to get proper support from all NumPy features and eventually all third-party libraries, it needs to be in the core. The correct and efficient way to handle missing data differs from algorithm to algorithm, and if fully supporting NumPy requires developers to think about it, NA support will be broader and of higher quality.

* At the same time, providing two different missing data interfaces, one for masks and one for bitpatterns, requires NumPy developers and third-party NumPy plugin developers to separately consider what to do in each case, and to write two additional implementations of their code.
This complicates their job, and could lead to inconsistent support for missing data.

* Providing the ability to work with both masks and bitpatterns through the same C and Python programming interface makes missing data support cleanly orthogonal to all other NumPy features.

* There are many trade-offs of memory usage, performance, correctness, and flexibility between masks and bitpatterns (illustrated concretely after this list). Providing support for both approaches allows users of NumPy to choose the approach which is most compatible with their way of thinking, or has characteristics which best match their use-case. Providing them through the same interface further allows them to try both with minimal effort, and choose the one which performs better or uses the least memory for their programs.

* Memory Usage

  * With bitpatterns, less memory is used for storing a single array containing some NAs.

  * With masks, less memory is used for storing multiple arrays that are identical except for the location of their NAs. (In this case a single data array can be re-used with multiple mask arrays; bitpattern NAs would need to copy the whole data array.)

* Performance

  * With bitpatterns, the floating point types can use native hardware operations, with nearly correct behavior. For fully correct floating point behavior, and for other types, code must be written which specially tests for equality with the missing-data bitpattern.

  * With masks, there is always the overhead of accessing mask memory and testing its truth value. The implementation that currently exists has had no performance tuning, so it only establishes a minimum performance level. Optimal mask-based code is in general going to be slower than optimal bitpattern-based code.

* Correctness

  * Bitpattern integer types must sacrifice a valid value to represent NA. For larger integer types there are arguments that this is OK, but for 8-bit types there is no reasonable choice. In the floating point case, if the performance of native floating point operations is chosen, there is a small inconsistency in that NaN+NA and NA+NaN are different.

  * The mask approach works correctly in all cases.

* Generality

  * The bitpattern approach can work in a fully general way only when there is a specific value which can be given up from the data type. For IEEE floating point, NaN is an obvious choice, and for booleans represented as a byte, there are plenty of choices. For integers, a valid value must be sacrificed to use this approach. Third-party dtypes which plug into NumPy will also have to make a bitpattern choice to support this system, something which may not always be possible.

  * The mask approach works universally with all data types.
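Both storage strategies can be approximated with today's NumPy, which makes the memory and performance trade-offs above easy to try out. This is only an analogy -- NaN as a float-only "bitpattern" and numpy.ma as the mask approach -- not the proposed implementation::

    import numpy as np

    # Bitpattern-style NA, floats only: NaN rides on native hardware ops.
    x = np.array([1.0, 2.0, np.nan])
    np.sum(x)        # nan -- NA-style propagation for free
    np.nansum(x)     # 3.0 -- skipNA-style reduction

    # Mask-style NA via numpy.ma: an extra mask byte per element, but one
    # data buffer can back several differently-masked arrays (no copies).
    data = np.arange(5.0)
    m1 = np.ma.masked_array(data, mask=[0, 1, 0, 0, 0])
    m2 = np.ma.masked_array(data, mask=[0, 0, 0, 1, 1])  # same buffer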
Recommendations for Moving Forward
==================================

**Nathaniel** thinks we should:

* Go ahead and implement bitpattern NAs.
* *Not* implement masked arrays in the core -- or at least, not yet. Instead, we should focus on figuring out how to implement them out-of-core, so that people can try out different approaches without us committing to any one of them, and so that new prototypes can be released more quickly than the NumPy release cycle allows. In any case, we're going to have to figure out how to experiment with such changes out-of-core if NumPy is to continue to evolve without forking -- might as well do it now. The existing code can live in master, disabled, or it can live in a branch -- it'll still be there once we know what we're doing.

**Mark** thinks we should:

* Keep the existing code as is, adding a global run-time experimental flag which disables NA support by default.

A more detailed rationale for this recommendation:

* A solid preliminary NA-mask implementation is currently in NumPy master. This implementation has been extensively tested against scipy and other third-party packages, and has been in master in a stable state for a significant amount of time.
* This implementation integrates deeply with the core, providing an interface which is usable in the same way R's NA support is. It provides a compelling, user-friendly answer to R's NA support.
* The missing data NEP provides a plan for adding bitpattern-based dtype support of NAs, which would operate through the same interface but allow for the same performance/correctness trade-offs that R has made.
* Making it very easy for users to try out this implementation, which has reasonable feature coverage and performance characteristics, is the best way to get more concrete feedback about how NumPy's missing data support should look.

Because of its preliminary state, the existing implementation is marked as experimental in the NumPy documentation. It would be good for it to remain marked as experimental until it is more fleshed out, for example supporting struct and array dtypes and offering a fuller set of NumPy operations.

I think the code should stay as it is, except for the addition of a run-time global NumPy flag, perhaps numpy.experimental.maskna, which defaults to False and can be toggled to True. In its default state, any use of NA features would raise an "ExperimentalError" exception, a measure which would prevent accidental use and communicate the experimental status very clearly.
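As a self-contained sketch of how such gating could work (hypothetical: no such flag or exception ever shipped in NumPy, and the names simply follow the proposal above)::

    class ExperimentalError(Exception):
        """Raised when an experimental feature is used without opting in."""

    class _ExperimentalFlags:
        maskna = False                      # NA support disabled by default

    experimental = _ExperimentalFlags()     # would live at numpy.experimental

    def _check_maskna_allowed():
        # Every NA-feature entry point would call this guard first.
        if not experimental.maskna:
            raise ExperimentalError(
                "NA support is experimental; set "
                "numpy.experimental.maskna = True to enable it")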
The `ABI issues`__ seem very tricky to deal with effectively in the 1.x series of releases, but I believe that with proper implementation-hiding in a 2.0 release, evolving the software to support the various other ABI ideas that have been discussed is feasible. This is the approach I like best.

__ http://thread.gmane.org/gmane.comp.python.numeric.general/49485

**Nathaniel** notes in response that he doesn't really have any objection to shipping experimental APIs in the main numpy distribution *if* we're careful to make sure that they don't "leak out" in a way that leaves us stuck with them. And in principle some sort of "this violates your warranty" global flag could be a way to do that. (In fact, this might also be a useful strategy for the kinds of changes that he favors, of adding minimal hooks to enable us to build prototypes more easily -- we could have some "rapid prototyping only" hooks that let prototype hacks get deeper access to NumPy's internals than we were otherwise ready to support.)

But he wants to point out two things. First, it seems like we still have fundamental questions to answer about the NEP design, like whether masks should have NA semantics or ignore semantics, and there are already plans to substantially change how NEP masks are exposed and accessed. So he isn't sure what we'll learn by asking for feedback on the NEP code in its current state.

And second, given the concerns about their causing (minor) ABI issues, it's not clear that we could really prevent them from leaking out. (He looks forward to 2.0 too, but we're not there yet.) So maybe it would be better if they weren't present in the C API at all, and the hoops required for testers were instead something like, "we have included a hacky pure-Python prototype accessible by typing ``import numpy.experimental.donttrythisathome.NEP`` and would welcome feedback"?

If so, then he should mention that he did implement a horribly kludgy, pure-Python implementation of the NEP API that works with NumPy 1.6.1. This was mostly an experiment to see how feasible such prototyping was, and to test out a possible ufunc override mechanism; but if there's interest, the module is available here: https://github.com/njsmith/numpyNEP

It passes the maskna test-suite, with some minor issues described in a big comment at the top.

**Mark** responds:

I agree that it's important to be careful when adding new features to NumPy, but I also believe it is essential that the project have forward development momentum. A project like NumPy requires developers to write code for advancement to occur, and obstacles that impede the writing of code discourage existing developers from contributing more, and potentially scare away developers who are thinking about joining in.

All software projects, both open source and closed source, must balance short-term practicality against long-term planning. In the case of the missing data development, there was a short-term resource commitment to tackle this problem, which is quite immense in scope. If there isn't a high likelihood of getting a contribution into NumPy that concretely advances towards a solution, I expect that individuals and companies interested in doing such work will have a much harder time justifying a commitment of their resources. For a project which is core to so many other libraries, relying only on the good will of selfless volunteers would mean that NumPy could more easily be overtaken by another project.

In the case of the existing NA contribution at issue, how we resolve this disagreement represents a decision about how NumPy's developers, contributors, and users should interact. If we create a document describing a dispute resolution process, how do we design it so that it doesn't introduce a large burden and excessive uncertainty on developers that could prevent them from productively contributing code?

If we go the route of writing up a decision process which includes such a dispute resolution mechanism, I think the meat of it should be a roadmap that potential contributors and developers can follow to gain influence over NumPy. NumPy development needs broad support beyond code contributions, and tying influence in the project to contributions seems to me like a good way to encourage people to take on tasks like bug triaging/management, continuous integration/build server administration, and the myriad other tasks that help satisfy the project's needs. No specific meritocratic, democratic, consensus-striving system will satisfy everyone, but the vigour of the discussions around governance and process indicates that something at least a little bit more formal than the current status quo is necessary.

In conclusion, I would like the NumPy project to prioritize movement towards a more flexible and modular ABI/API, balanced with strong backwards-compatibility constraints and feature additions that individuals, universities, and companies want to contribute.
I do not believe keeping the NA code in 1.7 as it is, with the small additional measure of requiring it to be enabled by an experimental flag, poses a risk of long-term ABI troubles. The greater risk I see is a continuing lack of developers contributing to the project, and I believe that backing out this code because of these worries would risk reducing developer contributions.


References and Footnotes
------------------------

NEP 12 describes Mark's NA-semantics/mask implementation/view-based mask handling API.

NEP 24 (the "alterNEP") was Nathaniel's initial attempt at separating MISSING and IGNORED handling into bit-patterns versus masks, though there is a lot he would change about the proposal at this point.

NEP 25 ("miniNEP 2") was a later attempt by Nathaniel to sketch out an implementation strategy for NA dtypes.

A further discussion overview page can be found at:
https://github.com/njsmith/numpy/wiki/NA-discussion-status


Copyright
---------

This document has been placed in the public domain.