author     Charles Harris <charlesr.harris@gmail.com>  2018-09-22 10:37:38 -0500
committer  GitHub <noreply@github.com>                 2018-09-22 10:37:38 -0500
commit     12fc9fc9104fd30be42d532a6ca95bfa5615979d (patch)
tree       f32261ecc92ca46f016cbd9c0c3c0164ad9148e3
parent     25f06a6684b88b925748544f51955d3b0a6db423 (diff)
parent     531e97d39c4de28c9ebeaf2e6ad533d4b9e26142 (diff)
download   numpy-12fc9fc9104fd30be42d532a6ca95bfa5615979d.tar.gz
Merge pull request #12017 from rgommers/missing-data-neps
NEP: add 3 missing data NEPs rescued from 2011-2012
 doc/neps/_static/nep0013_image1.png        | bin  0 -> 18653 bytes
 doc/neps/_static/nep0013_image2.png        | bin  0 -> 6313 bytes
 doc/neps/_static/nep0013_image3.png        | bin  0 -> 11704 bytes
 doc/neps/conf.py                           |    3
 doc/neps/nep-0013-ufunc-overrides.rst      |   30
 doc/neps/nep-0024-missing-data-2.rst       |  210
 doc/neps/nep-0025-missing-data-3.rst       |  469
 doc/neps/nep-0026-missing-data-summary.rst |  727
8 files changed, 1410 insertions, 29 deletions
diff --git a/doc/neps/_static/nep0013_image1.png b/doc/neps/_static/nep0013_image1.png
new file mode 100644
index 000000000..e1b35b738
--- /dev/null
+++ b/doc/neps/_static/nep0013_image1.png
Binary files differ
diff --git a/doc/neps/_static/nep0013_image2.png b/doc/neps/_static/nep0013_image2.png
new file mode 100644
index 000000000..99f51b2fa
--- /dev/null
+++ b/doc/neps/_static/nep0013_image2.png
Binary files differ
diff --git a/doc/neps/_static/nep0013_image3.png b/doc/neps/_static/nep0013_image3.png
new file mode 100644
index 000000000..87a354ad1
--- /dev/null
+++ b/doc/neps/_static/nep0013_image3.png
Binary files differ
diff --git a/doc/neps/conf.py b/doc/neps/conf.py
index 8cfb2b570..bd7503781 100644
--- a/doc/neps/conf.py
+++ b/doc/neps/conf.py
@@ -30,8 +30,7 @@ import os
 # Add any Sphinx extension module names here, as strings. They can be
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
 # ones.
-extensions = ['sphinx.ext.imgmath',
-              'sphinx.ext.graphviz']
+extensions = ['sphinx.ext.imgmath',]
 
 # Add any paths that contain templates here, relative to this directory.
 templates_path = ['../source/_templates/']
diff --git a/doc/neps/nep-0013-ufunc-overrides.rst b/doc/neps/nep-0013-ufunc-overrides.rst
index a51ce3927..0888c7559 100644
--- a/doc/neps/nep-0013-ufunc-overrides.rst
+++ b/doc/neps/nep-0013-ufunc-overrides.rst
@@ -261,16 +261,7 @@ consider carefully if any surprising behavior results.
 
    Type casting hierarchy.
 
-   .. graphviz::
-
-      digraph array_ufuncs {
-         rankdir=BT;
-         A -> C [label="C"];
-         B -> C [label="C"];
-         D -> B [label="B"];
-         ndarray -> C [label="A"];
-         ndarray -> B [label="B"];
-      }
+   .. image:: _static/nep0013_image1.png
 
    The ``__array_ufunc__`` of type A can handle ndarrays returning C,
    B can handle ndarray and D returning B, and C can handle A and B
   returning C,
@@ -286,14 +277,7 @@ consider carefully if any surprising behavior results.
 
    One-cycle in the ``__array_ufunc__`` graph.
 
-   .. graphviz::
-
-      digraph array_ufuncs {
-         rankdir=BT;
-         A -> B [label="B"];
-         B -> A [label="A"];
-      }
-
+   .. image:: _static/nep0013_image2.png
 
    In this case, the ``__array_ufunc__`` relations have a cycle of length 1,
    and a type casting hierarchy does not exist. Binary operations are not
@@ -303,15 +287,7 @@ consider carefully if any surprising behavior results.
 
    Longer cycle in the ``__array_ufunc__`` graph.
 
-   .. graphviz::
-
-      digraph array_ufuncs {
-         rankdir=BT;
-         A -> B [label="B"];
-         B -> C [label="C"];
-         C -> A [label="A"];
-      }
-
+   .. image:: _static/nep0013_image3.png
 
    In this case, the ``__array_ufunc__`` relations have a longer cycle, and
    a type casting hierarchy does not exist. Binary operations are still
diff --git a/doc/neps/nep-0024-missing-data-2.rst b/doc/neps/nep-0024-missing-data-2.rst
new file mode 100644
index 000000000..c8b19561f
--- /dev/null
+++ b/doc/neps/nep-0024-missing-data-2.rst
@@ -0,0 +1,210 @@

=============================================================
NEP 24 — Missing Data Functionality - Alternative 1 to NEP 12
=============================================================
:Author: Nathaniel J. Smith <njs@pobox.com>, Matthew Brett <matthew.brett@gmail.com>
:Status: Deferred
:Type: Standards Track
:Created: 2011-06-30


Abstract
--------

*Context: this NEP was written as an alternative to NEP 12, which at the time of writing had an implementation that was merged into the NumPy master branch.*

The principle of this NEP is to separate the APIs for masking and for missing values, according to:

* The current implementation of masked arrays (NEP 12)
* This proposal.

This discussion is only of the API, and not of the implementation.

Detailed description
--------------------


Rationale
^^^^^^^^^

The purpose of this aNEP is to define two interfaces -- one for handling 'missing values', and one for handling 'masked arrays'.

An ordinary value is something like an integer or a floating point number. A *missing* value is a placeholder for an ordinary value that is for some reason unavailable. For example, in working with statistical data, we often build tables in which each row represents one item, and each column represents properties of that item. For instance, we might take a group of people and for each one record height, age, education level, and income, and then stick these values into a table. But then we discover that our research assistant screwed up and forgot to record the age of one of our individuals. We could throw out the rest of their data as well, but this would be wasteful; even such an incomplete row is still perfectly usable for some analyses (e.g., we can compute the correlation of height and income). The traditional way to handle this would be to stick some particular meaningless value in for the missing data, e.g., recording this person's age as 0. But this is very error prone; we may later forget about these special values while running other analyses, and discover to our surprise that babies have higher incomes than teenagers. (In this case, the solution would be to just leave out all the items where we have no age recorded, but this isn't a general solution; many analyses require something more clever to handle missing values.) So instead of using an ordinary value like 0, we define a special "missing" value, written "NA" for "not available".

Therefore, missing values have the following properties: Like any other value, they must be supported by your array's dtype -- you can't store a floating point number in an array with dtype=int32, and you can't store an NA in it either. You need an array with dtype=NAint32 or something (exact syntax to be determined). Otherwise, they act exactly like any other values. In particular, you can apply arithmetic functions and so forth to them. By default, any function which takes an NA as an argument always returns an NA as well, regardless of the values of the other arguments. This ensures that if we try to compute the correlation of income with age, we will get "NA", meaning "given that some of the entries could be anything, the answer could be anything as well". This reminds us to spend a moment thinking about how we should rephrase our question to be more meaningful. And as a convenience for those times when you do decide that you just want the correlation between the known ages and income, you can skip over the NAs by adding a single argument to your function call.
For floating point computations, NAs and NaNs have (almost?) identical behavior. But they represent different things -- NaN an invalid computation like 0/0, NA a value that is not available -- and distinguishing between these things is useful because in some situations they should be treated differently. (For example, an imputation procedure should replace NAs with imputed values, but probably should leave NaNs alone.) And anyway, we can't use NaNs for integers, or strings, or booleans, so we need NA anyway, and once we have NA support for all these types, we might as well support it for floating point too for consistency.

A masked array is, conceptually, an ordinary rectangular numpy array, which has had an arbitrarily-shaped mask placed over it. The result is, essentially, a non-rectangular view of a rectangular array. In principle, anything you can accomplish with a masked array could also be accomplished by explicitly keeping a regular array and a boolean mask array and using numpy indexing to combine them for each operation, but combining them into a single structure is much more convenient when you need to perform complex operations on the masked view of an array, while still being able to manipulate the mask in the usual ways. Therefore, masks are preserved through indexing, and functions generally treat masked-out values as if they were not even part of the array in the first place. (Maybe this is a good heuristic: a length-4 array in which the last value has been masked out behaves just like an ordinary length-3 array, so long as you don't change the mask.) Except, of course, that you are free to manipulate the mask in arbitrary ways whenever you like; it's just a standard numpy array.

There are some simple situations where one could use either of these tools to get the job done -- or other tools entirely, like using designated surrogate values (age=0), separate mask arrays, etc. But missing values are designed to be particularly helpful in situations where the missingness is an intrinsic feature of the data -- where there's a specific value that **should** exist, and if it did exist it would mean something specific, but it **doesn't**. Masked arrays are designed to be particularly helpful in situations where we just want to temporarily ignore some data that does exist, or generally when we need to work with data that has a non-rectangular shape (e.g., if you make some measurement at each point on a grid laid over a circular agar dish, then the points that fall outside the dish aren't missing measurements, they're just meaningless).
Initialization
^^^^^^^^^^^^^^

First, missing values can be set and displayed as ``np.NA, NA``::

    >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
    array([1., 2., NA, 7.], dtype='NA[<f8]')

As the initialization is not ambiguous, this can be written without the NA dtype::

    >>> np.array([1.0, 2.0, np.NA, 7.0])
    array([1., 2., NA, 7.], dtype='NA[<f8]')

Masked values can be set and displayed as ``np.IGNORE, IGNORE``::

    >>> np.array([1.0, 2.0, np.IGNORE, 7.0], masked=True)
    array([1., 2., IGNORE, 7.], masked=True)

As the initialization is not ambiguous, this can be written without ``masked=True``::

    >>> np.array([1.0, 2.0, np.IGNORE, 7.0])
    array([1., 2., IGNORE, 7.], masked=True)

Ufuncs
^^^^^^

By default, NA values propagate::

    >>> na_arr = np.array([1.0, 2.0, np.NA, 7.0])
    >>> np.sum(na_arr)
    NA('float64')

unless the ``skipna`` flag is set::

    >>> np.sum(na_arr, skipna=True)
    10.0

By default, masking does not propagate::

    >>> masked_arr = np.array([1.0, 2.0, np.IGNORE, 7.0])
    >>> np.sum(masked_arr)
    10.0

unless the ``propmask`` flag is set::

    >>> np.sum(masked_arr, propmask=True)
    IGNORE

An array can be masked, and contain NA values::

    >>> both_arr = np.array([1.0, 2.0, np.IGNORE, np.NA, 7.0])

In the default case, the behavior is obvious::

    >>> np.sum(both_arr)
    NA('float64')

It's also obvious what to do with ``skipna=True``::

    >>> np.sum(both_arr, skipna=True)
    10.0
    >>> np.sum(both_arr, skipna=True, propmask=True)
    IGNORE

To break the tie between NA and IGNORE, NAs propagate harder::

    >>> np.sum(both_arr, propmask=True)
    NA('float64')
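For floating point data, today's NumPy already offers both halves of this propagate-versus-skip split via NaN, which is what this proposal generalizes to every dtype. A rough, runnable analogy -- ``np.nan`` standing in for ``np.NA``; the ``skipna=``/``propmask=`` arguments above are proposed, not implemented::

    >>> import numpy as np
    >>> x = np.array([1.0, 2.0, np.nan, 7.0])
    >>> print(np.sum(x))      # propagates, like the proposed NA default
    nan
    >>> print(np.nansum(x))   # skips, like the proposed skipna=True
    10.0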
Assignment
^^^^^^^^^^

is obvious in the NA case::

    >>> arr = np.array([1.0, 2.0, 7.0])
    >>> arr[2] = np.NA
    TypeError('dtype does not support NA')
    >>> na_arr = np.array([1.0, 2.0, 7.0], dtype='NA[f8]')
    >>> na_arr[2] = np.NA
    >>> na_arr
    array([1., 2., NA], dtype='NA[<f8]')

Direct assignment in the masked case is magic and confusing, and so happens only via the mask::

    >>> masked_arr = np.array([1.0, 2.0, 7.0], masked=True)
    >>> masked_arr[2] = np.NA
    TypeError('dtype does not support NA')
    >>> masked_arr[2] = np.IGNORE
    TypeError('float() argument must be a string or a number')
    >>> masked_arr.visible[2] = False
    >>> masked_arr
    array([1., 2., IGNORE], masked=True)


Copyright
---------

This document has been placed in the public domain.
diff --git a/doc/neps/nep-0025-missing-data-3.rst b/doc/neps/nep-0025-missing-data-3.rst
new file mode 100644
index 000000000..6343759e8
--- /dev/null
+++ b/doc/neps/nep-0025-missing-data-3.rst
@@ -0,0 +1,469 @@

======================================
NEP 25 — NA support via special dtypes
======================================

:Author: Nathaniel J. Smith <njs@pobox.com>
:Status: Deferred
:Type: Standards Track
:Created: 2011-07-08

Abstract
========

*Context: this NEP was written as an additional alternative to NEP 12 (NEP 24 is another alternative), which at the time of writing had an implementation that was merged into the NumPy master branch.*

To try and make more progress on the whole missing values/masked arrays/... debate, it seems useful to have a more technical discussion of the pieces which we *can* agree on. This is the second such discussion; it attempts to nail down the details of how NAs can be implemented using special dtypes.

Rationale
---------

An ordinary value is something like an integer or a floating point number. A missing value is a placeholder for an ordinary value that is for some reason unavailable. For example, in working with statistical data, we often build tables in which each row represents one item, and each column represents properties of that item. For instance, we might take a group of people and for each one record height, age, education level, and income, and then stick these values into a table. But then we discover that our research assistant screwed up and forgot to record the age of one of our individuals. We could throw out the rest of their data as well, but this would be wasteful; even such an incomplete row is still perfectly usable for some analyses (e.g., we can compute the correlation of height and income). The traditional way to handle this would be to stick some particular meaningless value in for the missing data, e.g., recording this person's age as 0. But this is very error prone; we may later forget about these special values while running other analyses, and discover to our surprise that babies have higher incomes than teenagers. (In this case, the solution would be to just leave out all the items where we have no age recorded, but this isn't a general solution; many analyses require something more clever to handle missing values.) So instead of using an ordinary value like 0, we define a special "missing" value, written "NA" for "not available".

There are several possible ways to represent such a value in memory. For instance, we could reserve a specific value (like 0, or a particular NaN, or the smallest negative integer) and then ensure that this value is treated specially by all arithmetic and other operations on our array. Another option would be to add an additional mask array next to our main array, use this to indicate which values should be treated as NA, and then extend our array operations to check this mask array whenever performing computations. Each implementation approach has various strengths and weaknesses, but here we focus on the former (value-based) approach exclusively and leave the possible addition of the latter to future discussion. The core advantages of this approach are (1) it adds no additional memory overhead, (2) it is straightforward to store and retrieve such arrays to disk using existing file storage formats, (3) it allows binary compatibility with R arrays including NA values, (4) it is compatible with the common practice of using NaN to indicate missingness when working with floating point numbers, (5) the dtype is already a place where "weird things can happen" -- there are a wide variety of dtypes that don't act like ordinary numbers (including structs, Python objects, fixed-length strings, ...), so code that accepts arbitrary numpy arrays already has to be prepared to handle these (even if only by checking for them and raising an error). Therefore adding yet more new dtypes has less impact on extension authors than if we change the ndarray object itself.
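The bookkeeping this value-based approach implies can be emulated by hand in current NumPy, which is exactly what the proposed dtypes would automate. A minimal sketch, reserving the most negative int32 as the NA code (the choice R makes for its integer NA)::

    >>> import numpy as np
    >>> NA_I32 = np.int32(-2**31)          # reserved NA bit-pattern
    >>> ages = np.array([25, 32, 45], dtype=np.int32)
    >>> ages[1] = NA_I32                   # person 2's age is unknown
    >>> isna = ages == NA_I32              # by-hand "isna" check
    >>> print(ages[~isna].mean())          # mean over the known ages only
    35.0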
The basic semantics of NA values are as follows. Like any other value, they must be supported by your array's dtype -- you can't store a floating point number in an array with dtype=int32, and you can't store an NA in it either. You need an array with dtype=NAint32 or something (exact syntax to be determined). Otherwise, NA values act exactly like any other values. In particular, you can apply arithmetic functions and so forth to them. By default, any function which takes an NA as an argument always returns an NA as well, regardless of the values of the other arguments. This ensures that if we try to compute the correlation of income with age, we will get "NA", meaning "given that some of the entries could be anything, the answer could be anything as well". This reminds us to spend a moment thinking about how we should rephrase our question to be more meaningful. And as a convenience for those times when you do decide that you just want the correlation between the known ages and income, you can skip over the NAs by adding a single argument to your function call.

For floating point computations, NAs and NaNs have (almost?) identical behavior. But they represent different things -- NaN an invalid computation like 0/0, NA a value that is not available -- and distinguishing between these things is useful because in some situations they should be treated differently. (For example, an imputation procedure should replace NAs with imputed values, but probably should leave NaNs alone.) And anyway, we can't use NaNs for integers, or strings, or booleans, so we need NA anyway, and once we have NA support for all these types, we might as well support it for floating point too for consistency.

General strategy
================

Numpy already has a general mechanism for defining new dtypes and slotting them in so that they're supported by ndarrays, by the casting machinery, by ufuncs, and so on. In principle, we could implement NA-dtypes just using these existing interfaces. But we don't want to do that, because defining all those new ufunc loops etc. from scratch would be a huge hassle, especially since the basic functionality needed is the same in all cases. So we need some generic functionality for NAs -- but it would be better not to bake this in as a single set of special "NA types", since users may well want to define new custom dtypes that have their own NA values, and have them integrate well with the rest of the NA machinery. Our strategy, therefore, is to avoid the `mid-layer mistake`_ by exposing some code for generic NA handling in different situations, which dtypes can selectively use or not as they choose.

.. _mid-layer mistake: https://lwn.net/Articles/336262/

Some example use cases:

1. We want to define a dtype that acts exactly like an int32, except that the most negative value is treated as NA.
2. We want to define a parametrized dtype to represent `categorical data`_, and the bit-pattern to be used for NA depends on the number of categories defined, so our code needs to play an active role handling it rather than simply deferring to the standard machinery.
3. We want to define a dtype that acts like a length-10 string and supports NAs. Since our string may hold arbitrary binary values, we want to actually allocate 11 bytes for it, with the first byte a flag indicating whether this string is NA and the rest containing the string content.
4. We want to define a dtype that allows multiple different types of NA data, which print differently and can be distinguished by the new ufunc that we define called ``is_na_of_type(...)``, but otherwise takes advantage of the generic NA machinery for most operations.

.. _categorical data: http://mail.scipy.org/pipermail/numpy-discussion/2010-August/052401.html
dtype C-level API extensions
============================

The `PyArray_Descr`_ struct gains the following new fields::

    void * NA_value;
    PyArray_Descr * NA_extends;
    int NA_extends_offset;

.. _PyArray_Descr: http://docs.scipy.org/doc/numpy/reference/c-api.types-and-structures.html#PyArray_Descr

The following new flag values are defined::

    NPY_NA_AUTO_ARRFUNCS
    NPY_NA_AUTO_CAST
    NPY_NA_AUTO_UFUNC
    NPY_NA_AUTO_UFUNC_CHECKED
    NPY_NA_AUTO_ALL /* the above flags OR'ed together */

The `PyArray_ArrFuncs`_ struct gains the following new fields::

    void (*isna)(void * src, void * dst, npy_intp n, void * arr);
    void (*clearna)(void * data, npy_intp n, void * arr);

.. _PyArray_ArrFuncs: http://docs.scipy.org/doc/numpy/reference/c-api.types-and-structures.html#PyArray_ArrFuncs

We add at least one new convenience macro::

    #define NPY_NA_SUPPORTED(dtype) ((dtype)->f->isna != NULL)

The general idea is that anywhere where we used to call a dtype-specific function pointer, the code will be modified to instead:

1. Check whether the relevant ``NPY_NA_AUTO_...`` bit is enabled, the NA_extends field is non-NULL, and the function pointer we wanted to call is NULL.
2. If these conditions are met, then use ``isna`` to identify which entries in the array are NA, and handle them appropriately. Then look up whatever function we were *going* to call using this dtype on the ``NA_extends`` dtype instead, and use that to handle the non-NA elements.

For more specifics, see the following sections.

Note that if ``NA_extends`` points to a parametrized dtype, then the dtype object it points to must be fully specified. For example, if it is a string dtype, it must have a non-zero ``elsize`` field.

In order to handle the case where the NA information is stored in a field next to the "real" data, the ``NA_extends_offset`` field is set to a non-zero value; it must point to the location within each element of this dtype where some data of the ``NA_extends`` dtype is found. For example, if we are storing 10-byte strings with an NA indicator byte at the beginning, then we have::

    elsize == 11
    NA_extends_offset == 1
    NA_extends->elsize == 10

When delegating to the ``NA_extends`` dtype, we offset our data pointer by ``NA_extends_offset`` (while keeping our strides the same) so that it sees an array of data of the expected type (plus some superfluous padding). This is basically the same mechanism that record dtypes use, IIUC, so it should be pretty well-tested.
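The flag-plus-payload layout in this example can be previewed with an ordinary structured dtype, which exercises the same offset mechanism; the field names ``na`` and ``val`` below are invented for the sketch::

    >>> import numpy as np
    >>> dt = np.dtype([('na', 'u1'), ('val', 'S10')])
    >>> dt.itemsize                  # the elsize of the NA dtype
    11
    >>> dt.fields['val'][1]          # the would-be NA_extends_offset
    1
    >>> a = np.zeros(2, dtype=dt)
    >>> a[0] = (0, b'hello')         # ordinary value
    >>> a[1] = (1, b'')              # flag byte set: element is "NA"
    >>> bool(a[1]['na'])             # the isna check for element 1
    True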
When delegating to a function that cannot handle "misbehaved" source data (see the ``PyArray_ArrFuncs`` documentation for details), we need to check for alignment issues before delegating (especially with a non-zero ``NA_extends_offset``). If there's a problem, then we need to "clean up" the source data first, using the usual mechanisms for handling misaligned data. (Of course, we should usually set up our dtypes so that there aren't any alignment issues, but if someone screws that up, or decides that reduced memory usage is more important to them than fast inner loops, then we should still handle that gracefully, as we do now.)

The ``NA_value`` and ``clearna`` fields are used for various sorts of casting. ``NA_value`` is a bit-pattern to be used when, for example, assigning from np.NA. ``clearna`` can be a no-op if ``elsize`` and ``NA_extends->elsize`` are the same, but if they aren't then it should clear whatever auxiliary NA storage this dtype uses, so that none of the specified array elements are NA.

Core dtype functions
--------------------

The following functions are defined in ``PyArray_ArrFuncs``. The special behavior described here is enabled by the NPY_NA_AUTO_ARRFUNCS bit in the dtype flags, and only enabled if the given function field is *not* filled in.

``getitem``: Calls ``isna``. If ``isna`` returns true, returns np.NA. Otherwise, delegates to the ``NA_extends`` dtype.

``setitem``: If the input object is ``np.NA``, then runs ``memcpy(data, self->NA_value, arr->dtype->elsize);``. Otherwise, calls ``clearna``, and then delegates to the ``NA_extends`` dtype.

``copyswapn``, ``copyswap``: FIXME: Not sure whether there's any special handling to use for these?

``compare``: FIXME: how should this handle NAs? R's sort function *discards* NAs, which doesn't seem like a good option.

``argmax``: FIXME: what is this used for? If it's the underlying implementation for np.max, then it really needs some way to get a skipna argument. If not, then the appropriate semantics depends on what it's supposed to accomplish...

``dotfunc``: QUESTION: is it actually guaranteed that everything has the same dtype? FIXME: same issues as for ``argmax``.

``scanfunc``: This one's ugly. We may have to explicitly override it in all of our special dtypes, because assuming that we want the option of, say, having the token "NA" represent an NA value in a text file, we need some way to check whether that's there before delegating. But ``ungetc`` is only guaranteed to let us put back 1 character, and we need 2 (or maybe 3 if we actually check for "NA "). The other option would be to read to the next delimiter, check whether we have an NA, and if not then delegate to ``fromstr`` instead of ``scanfunc``, but according to the current API, each dtype might in principle use a totally different rule for defining "the next delimiter". So... any ideas? (FIXME)

``fromstr``: Easy -- check for "NA ", if present then assign ``NA_value``, otherwise call ``clearna`` and delegate.

``nonzero``: FIXME: again, what is this used for? (It seems redundant with using the casting machinery to cast to bool.) Probably it needs to be modified so that it can return NA, though...

``fill``: Use ``isna`` to check if either of the first two values is NA. If so, then fill the rest of the array with ``NA_value``. Otherwise, call ``clearna`` and then delegate.

``fillwithvalue``: Guess this can just delegate?

``sort``, ``argsort``: These should probably arrange to sort NAs to a particular place in the array (either the front or the back -- any opinions?)

``scalarkind``: FIXME: I have no idea what this does.

``castdict``, ``cancastscalarkindto``, ``cancastto``: See the section on casting below.
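In Python terms, the ``getitem``/``setitem`` contract above amounts to something like the following toy model (illustrative only; the real hooks are C function pointers operating on raw element buffers)::

    import numpy as np

    NA = object()                        # stand-in for the np.NA singleton

    class NAInt32:
        """Toy NA dtype wrapping int32 (the NA_extends dtype), using the
        most negative value as the NA_value bit-pattern."""
        na_value = np.int32(-2**31)

        def getitem(self, arr, i):
            if arr[i] == self.na_value:  # the dtype's ``isna`` check
                return NA
            return int(arr[i])           # delegate to the underlying int32

        def setitem(self, arr, i, value):
            if value is NA:
                arr[i] = self.na_value   # write the NA bit-pattern
            else:
                arr[i] = value           # clearna is a no-op here; delegate

    d = NAInt32()
    a = np.zeros(3, dtype=np.int32)
    d.setitem(a, 1, NA)
    assert d.getitem(a, 1) is NA and d.getitem(a, 0) == 0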
Casting
-------

FIXME: this really needs attention from an expert on numpy's casting rules. But I can't seem to find the docs that explain how casting loops are looked up and decided between (e.g., if you're casting from dtype A to dtype B, which dtype's loops are used?), so I can't go into details. But those details are tricky and they matter...

But the general idea is, if you have a dtype with ``NPY_NA_AUTO_CAST`` set, then the following conversions are automatically allowed:

* Casting from the underlying type to the NA-type: this is performed by the usual ``clearna`` + potentially-strided copy dance. Also, ``isna`` is called to check that none of the regular values have been accidentally converted into NA; if so, then an error is raised.
* Casting from the NA-type to the underlying type: allowed in principle, but if ``isna`` returns true for any of the values that are to be converted, then again, an error is raised. (If you want to get around this, use ``np.view(array_with_NAs, dtype=float)``.)
* Casting between the NA-type and other types that do not support NA: this is allowed if the underlying type is allowed to cast to the other type, and is performed by combining a cast to or from the underlying type (using the above rules) with a cast to or from the other type (using the underlying type's rules).
* Casting between the NA-type and other types that do support NA: if the other type has NPY_NA_AUTO_CAST set, then we use the above rules plus the usual dance with ``isna`` on one array being converted to ``NA_value`` elements in the other. If only one of the arrays has NPY_NA_AUTO_CAST set, then it's assumed that that dtype knows what it's doing, and we don't do any magic. (But this is one of the things that I'm not sure makes sense, as per my caveat above.)

Ufuncs
------

All ufuncs gain an additional optional keyword argument, ``skipNA=``, which defaults to False.

If ``skipNA == True``, then the ufunc machinery *unconditionally* calls ``isna`` for any dtype where NPY_NA_SUPPORTED(dtype) is true, and then acts as if any values for which isna returns True were masked out in the ``where=`` argument (see miniNEP 1 for the behavior of ``where=``). If a ``where=`` argument is also given, then it acts as if the ``isna`` values had been ANDed out of the ``where=`` mask, though it does not actually modify the mask. Unlike the other changes below, this is performed *unconditionally* for any dtype which has an ``isna`` function defined; the NPY_NA_AUTO_UFUNC flag is *not* checked.

If NPY_NA_AUTO_UFUNC is set, then ufunc loop lookup is modified so that whenever it checks for the existence of a loop on the current dtype, and does not find one, it also checks for a loop on the ``NA_extends`` dtype. If that loop is found, then it uses it in the normal way, with the exceptions that (1) it is only called for values which are not NA according to ``isna``, (2) if the output array has NPY_NA_AUTO_UFUNC set, then ``clearna`` is called on it before calling the ufunc loop, (3) pointer offsets are adjusted by ``NA_extends_offset`` before calling the ufunc loop. In addition, if NPY_NA_AUTO_UFUNC_CHECKED is set, then after evaluating the ufunc loop we call ``isna`` on the *output* array, and if there are any NAs in the output which were not in the input, then we raise an error. (The intention of this is to catch cases where, say, we represent NA using the most-negative integer, and then someone's arithmetic overflows to create such a value by accident.)

FIXME: We should go into more detail here about how NPY_NA_AUTO_UFUNC works when there are multiple input arrays, of which potentially some have the flag set and some do not.
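The intended effect of ``skipNA=`` on a reduction can be sketched at the Python level, using the most-negative-int32 convention from use case 1 above (a semantic illustration, not the C implementation)::

    import numpy as np

    NA_I32 = np.int32(-2**31)              # assumed NA bit-pattern

    def na_aware_sum(arr, skipna=False):
        isna = arr == NA_I32               # the dtype's ``isna`` hook
        if skipna:
            return arr[~isna].sum()        # reduce over non-NA elements only
        return NA_I32 if isna.any() else arr.sum()

    a = np.array([1, 2, -2**31, 7], dtype=np.int32)
    assert na_aware_sum(a) == NA_I32       # NA propagates by default
    assert na_aware_sum(a, skipna=True) == 10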
Printing
--------

FIXME: There should be some sort of mechanism by which values which are NA are automatically repr'ed as NA, but I don't really understand how numpy printing works, so I'll let someone else fill in this section.

Indexing
--------

Scalar indexing like ``a[12]`` goes via the ``getitem`` function, so according to the proposal as described above, if a dtype delegates ``getitem``, then scalar indexing on NAs will return the object ``np.NA``. (If it doesn't delegate ``getitem``, of course, then it can return whatever it wants.)

This seems like the simplest approach, but an alternative would be to add a special case to scalar indexing, where if an ``NPY_NA_AUTO_INDEX`` flag were set, then it would call ``isna`` on the specified element. If this returned false, it would call ``getitem`` as usual; otherwise, it would return a 0-d array containing the specified element. The problem with this is that it breaks expressions like ``if a[i] is np.NA: ...``. (Of course, there is nothing nearly so convenient as that for NaN values now, but then, NaN values don't have their own global singleton.) So for now we stick to scalar indexing just returning ``np.NA``, but this can be revisited if anyone objects.

Python API for generic NA support
=================================

NumPy will gain a global singleton called numpy.NA, similar to None, but with semantics reflecting its status as a missing value. In particular, trying to treat it as a boolean will raise an exception, and comparisons with it will produce numpy.NA instead of True or False. These basics are adopted from the behavior of the NA value in the R project. To dig deeper into the ideas, http://en.wikipedia.org/wiki/Ternary_logic#Kleene_logic provides a starting point.

Most operations on ``np.NA`` (e.g., ``__add__``, ``__mul__``) are overridden to unconditionally return ``np.NA``.
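A few lines of Python capture the intended flavor of the singleton (a toy model, not the proposed implementation)::

    class NAType:
        """Toy model of the proposed numpy.NA singleton."""
        def __repr__(self):
            return "NA"
        def __bool__(self):
            # Kleene logic: NA's truth value is unknown, so refuse to guess.
            raise TypeError("NA truth value is ambiguous")
        def __add__(self, other):
            return self                  # unknown + anything = unknown
        __radd__ = __mul__ = __rmul__ = __add__
        def __eq__(self, other):
            return self                  # comparisons yield NA, not True/False

    NA = NAType()
    print(NA + 3)     # NA
    print(NA == NA)   # NA -- one unknown need not equal another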
The automagic dtype detection used for expressions like ``np.asarray([1, 2, 3])``, ``np.asarray([1.0, 2.0, 3.0])`` will be extended to recognize the ``np.NA`` value, and use it to automatically switch to a built-in NA-enabled dtype (which one being determined by the other elements in the array). A simple ``np.asarray([np.NA])`` will use an NA-enabled float64 dtype (which is analogous to what you get from ``np.asarray([])``). Note that this means that expressions like ``np.log(np.NA)`` will work: first ``np.NA`` will be coerced to a 0-d NA-float array, and then ``np.log`` will be called on that.

Python-level dtype objects gain the following new fields::

    NA_supported
    NA_value

``NA_supported`` is a boolean which simply exposes the value of the ``NPY_NA_SUPPORTED`` flag; it should be true if this dtype allows for NAs, false otherwise. [FIXME: would it be better to just key this off the existence of the ``isna`` function? Even if a dtype decides to implement all other NA handling itself, it still has to define ``isna`` in order to make ``skipNA=`` work correctly.]

``NA_value`` is a 0-d array of the given dtype, and its sole element contains the same bit-pattern as the dtype's underlying ``NA_value`` field. This makes it possible to determine the default bit-pattern for NA values for this type (e.g., with ``np.view(mydtype.NA_value, dtype=int8)``).

We *do not* expose the ``NA_extends`` and ``NA_extends_offset`` values at the Python level, at least for now; they're considered an implementation detail (and it's easier to expose them later if they're needed than to unexpose them if they aren't).

Two new ufuncs are defined: ``np.isNA`` returns a logical array, with true values wherever the dtype's ``isna`` function returned true. ``np.isnumber`` is only defined for numeric dtypes, and returns True for all elements which are not NA, and for which ``np.isfinite`` would return True.

Builtin NA dtypes
=================

The above describes the generic machinery for NA support in dtypes. It's flexible enough to handle all sorts of situations, but we also want to define a few generally useful NA-supporting dtypes that are available by default.

For each built-in dtype, we define an associated NA-supporting dtype, as follows:

* floats: the associated dtype uses a specific NaN bit-pattern to indicate NA (chosen for R compatibility)
* complex: we do whatever R does (FIXME: look this up -- two NA floats, probably?)
* signed integers: the most-negative signed value is used as NA (chosen for R compatibility)
* unsigned integers: the most-positive value is used as NA (no R compatibility possible)
* strings: the first byte (or, in the case of unicode strings, first 4 bytes) is used as a flag to indicate NA, and the rest of the data gives the actual string (no R compatibility possible)
* objects: two options (FIXME): either we don't include an NA-ful version, or we use np.NA as the NA bit-pattern
* boolean: we do whatever R does (FIXME: look this up -- 0 == FALSE, 1 == TRUE, 2 == NA?)
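For the float bullet above, the concrete bit-pattern R reserves for ``NA_real_`` is -- to the best of our recollection, so treat this as an assumption to verify against R's sources -- a NaN whose low word carries the payload 1954::

    import struct
    import numpy as np

    # Assumed NA_real_ bit-pattern: exponent all ones, payload 0x7A2 == 1954.
    na_real = struct.unpack('<d', struct.pack('<Q', 0x7FF00000000007A2))[0]
    print(np.isnan(na_real))   # True -- to the hardware it is just a NaN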
Each of these dtypes is trivially defined using the above machinery, and they are what the automagic type inference machinery automatically uses (for ``np.asarray([True, np.NA, False])``, etc.).

They can also be accessed via a new function ``np.withNA``, which takes a regular dtype (or an object that can be coerced to a dtype, like 'float') and returns one of the above dtypes. Ideally ``withNA`` should also take some optional arguments that let you describe which values you want to count as NA, etc., but I'll leave that for a future draft (FIXME).

FIXME: If ``d`` is one of the above dtypes, then what should ``d.type`` return?

The NEP also contains a proposal for a somewhat elaborate domain-specific-language for describing NA dtypes. I'm not sure how great an idea that is. (I have a bias against using strings as data structures, and find the already existing strings confusing enough as it is -- also, apparently the NEP version of numpy uses strings like 'f8' when printing dtypes, while my numpy uses object names like 'float64', so I'm not sure what's going on there. ``withNA(float64, arg1=value1)`` seems like a more pleasant way to print a dtype than "NA[f8,value1]", at least to me.) But if people want it, then cool.

Type hierarchy
--------------

FIXME: how should we do subtype checks, etc., for NA dtypes? What does ``issubdtype(withNA(float), float)`` return? How about ``issubdtype(withNA(float), np.floating)``?

Serialization
-------------


Copyright
---------

This document has been placed in the public domain.
diff --git a/doc/neps/nep-0026-missing-data-summary.rst b/doc/neps/nep-0026-missing-data-summary.rst
new file mode 100644
index 000000000..6f7fd0fa1
--- /dev/null
+++ b/doc/neps/nep-0026-missing-data-summary.rst
@@ -0,0 +1,727 @@

====================================================
NEP 26 — Summary of Missing Data NEPs and discussion
====================================================

:Author: Mark Wiebe <mwwiebe@gmail.com>, Nathaniel J. Smith <njs@pobox.com>
:Status: Deferred
:Type: Standards Track
:Created: 2012-04-22

*Context: this NEP was written as a summary of the large number of discussions and proposals (NEP 12, NEP 24, NEP 25) regarding missing data functionality.*

The debate about how NumPy should handle missing data, a subject with many preexisting approaches, requirements, and conventions, has been long and contentious. There has been more than one proposal for how to implement missing data support in NumPy, and there is a testable implementation which is merged into NumPy's current master. The vast number of emails and differing points of view has made it difficult for interested parties to understand the issues and be comfortable with the direction NumPy is going.

Here is our (Mark and Nathaniel's) attempt to summarize the problem, proposals, and points of agreement/disagreement in a single place, to help the community move towards consensus.

The NumPy developers' problem
=============================

For this discussion, "missing data" means array elements which can be indexed (e.g. A[3] in an array A with shape (5,)), but have, in some sense, no value.

It does not refer to compressed or sparse storage techniques where the value for A[3] is not actually stored in memory, but still has a well-defined value like 0.

This is still vague, and to create an actual implementation, it is necessary to answer such questions as:

* What values are computed when doing element-wise ufuncs?
* What values are computed when doing reductions?
* Whether the storage for an element gets overwritten when marking that value missing.
* Whether computations resulting in NaN are automatically treated in the same way as a missing value.
* Whether one interacts with missing values using a placeholder object (e.g. called "NA" or "masked"), or through a separate boolean array.
* Whether there is such a thing as an array object that cannot hold missing array elements.
* How the (C and Python) API is expressed, in terms of dtypes, masks, and other constructs.
* If we decide to answer some of these questions in multiple ways, then that creates the question of whether that requires multiple systems, and if so how they should interact.

There's clearly a very large space of missing-data APIs that *could* be implemented. There is likely at least one user, somewhere, who would find any possible implementation to be just the thing they need to solve some problem. On the other hand, much of NumPy's power and clarity comes from having a small number of orthogonal concepts, such as strided arrays, flexible indexing, broadcasting, and ufuncs, and we'd like to preserve that simplicity.
There has been dissatisfaction among several major groups of NumPy users about the existing status quo of missing data support. In particular, neither the numpy.ma component nor use of floating-point NaNs as a missing data signal fully satisfies the performance requirements and ease of use for these users. The example of R, where missing data is treated via an NA placeholder and is deeply integrated into all computation, is where many of these users point to indicate what functionality they would like. Doing a deep integration of missing data like in R must be considered carefully; it must be clear that it is not being done in a way which sacrifices existing performance or functionality.

Our problem is, how can we choose some incremental additions to NumPy that will make a large class of users happy, be reasonably elegant, complement the existing design, and that we're comfortable we won't regret being stuck with in the long term.

Prior art
=========

So a major (maybe *the* major) problem is figuring out how ambitious the project to add missing data support to NumPy should be, and which kinds of problems are in scope. Let's start with the best understood situation where "missing data" comes into play:

"Statistical missing data"
--------------------------

In statistics, social science, etc., "missing data" is a term of art referring to a specific (but extremely common and important) situation: we have tried to gather some measurements according to some scheme, but some of these measurements are missing. For example, if we have a table listing the height, age, and income of a number of individuals, but one person did not provide their income, then we need some way to represent this::

    Person | Height | Age | Income
    ------------------------------
       1   |   63   |  25 | 15000
       2   |   58   |  32 | <missing>
       3   |   71   |  45 | 30000

The traditional way is to record that income as, say, "-99", and document this in the README along with the data set. Then, you have to remember to check for and handle such incomes specially; if you forget, you'll get superficially reasonable but completely incorrect results, like calculating the average income on this data set as 14967. If you're in one of these fields, then such missing-ness is routine and inescapable, and if you use the "-99" approach then it's a pitfall you have to remember to check for explicitly on literally *every* calculation you ever do. This is, obviously, an unpleasant way to live.
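The pitfall is easy to reproduce with plain NumPy, using the numbers from the table above::

    >>> import numpy as np
    >>> income = np.array([15000, -99, 30000])   # -99 as the missing code
    >>> print(income.mean())                     # silently wrong
    14967.0
    >>> print(income[income != -99].mean())      # right, if you remember
    22500.0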
Let's call this situation the "statistical missing data" situation, just to have a convenient handle for it. (As mentioned, practitioners just call this "missing data", and what to do about it is literally an entire sub-field of statistics; if you google "missing data" then every reference is on how to handle it.) NumPy isn't going to do automatic imputation or anything like that, but it could help a great deal by providing some standard way to at least represent data which is missing in this sense.

The main prior art for how this could be done comes from the S/S+/R family of languages. Their strategy is, for each type they support, to define a special value called "NA". (For ints this is INT_MIN, for floats it's a special NaN value that's distinguishable from other NaNs, ...) Then, they arrange that in computations, this value has a special semantics that we will call "NA semantics".

NA Semantics
------------

The idea of NA semantics is that any computations involving NA values should be consistent with what would have happened if we had known the correct value.

For example, let's say we want to compute the mean income; how might we do this? One way would be to just ignore the missing entry, and compute the mean of the remaining entries. This gives us (15000 + 30000)/2, or 22500.

Is this result consistent with discovering the income of person 2? Let's say we find out that person 2's income is 50000. This means the correct answer is (15000 + 50000 + 30000)/3, or 31666.67, indicating clearly that it is not consistent. Therefore, the mean income is NA, i.e. a specific number whose value we are unable to compute.

This motivates the following rules, which are how R implements NA:

Assignment:
    NA values are understood to represent specific unknown values, and thus should have value-like semantics with respect to assignment and other basic data manipulation operations. Code which does not actually look at the values involved should work the same regardless of whether some of them are missing. For example, one might write::

        income[:] = income[np.argsort(height)]

    to perform an in-place sort of the ``income`` array, and know that the shortest person's income would end up being first. It turns out that the shortest person's income is not known, so the array should end up being ``[NA, 15000, 30000]``, but there's nothing special about NA-ness here.

Propagation:
    In the example above, we concluded that an operation like ``mean`` should produce NA when one of its data values was NA. If you ask me, "what is 3 plus x?", then my only possible answer is "I don't know what x is, so I don't know what 3 + x is either". NA means "I don't know", so 3 + NA is NA.

    This is important for safety when analyzing data: missing data often requires special handling for correctness -- the fact that you are missing information might mean that something you wanted to compute cannot actually be computed, and there are whole books written on how to compensate in various situations. Plus, it's easy to not realize that you have missing data, and write code that assumes you have all the data. Such code should not silently produce the wrong answer.

    There is an important exception to characterizing this as propagation, in the case of boolean values. Consider the calculation::

        v = np.any([False, False, NA, True])

    If we strictly propagate, ``v`` will become NA. However, no matter whether we place True or False into the third array position, ``v`` will then get the value True. The answer to the question "Is the result True consistent with later discovering the value that was missing?" is yes, so it is reasonable to not propagate here, and instead return the value True. This is what R does::

        > any(c(F, F, NA, T))
        [1] TRUE
        > any(c(F, F, NA, F))
        [1] NA

Other:
    NaN and NA are conceptually distinct. 0.0/0.0 is not a mysterious, unknown value -- it's defined to be NaN by IEEE floating point, Not a Number. NAs are numbers (or strings, or whatever), just unknown ones. Another small but important difference is that in Python, ``if NaN: ...`` treats NaN as True (NaN is "truthy"); but ``if NA: ...`` would be an error.

    In R, all reduction operations implement an alternative semantics, activated by passing a special argument (``na.rm=TRUE`` in R). ``sum(a)`` means "give me the sum of all the values" (which is NA if some of the values are NA); ``sum(a, na.rm=TRUE)`` means "give me the sum of all the non-NA values".
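The boolean exception is mechanical enough to state as code; a sketch mirroring the two R calls above, with a plain Python object standing in for NA::

    NA = object()   # opaque "unknown" marker

    def kleene_any(values):
        """R-style any(): a definite True beats NA, NA beats False."""
        saw_na = False
        for v in values:
            if v is NA:
                saw_na = True
            elif v:
                return True      # consistent whatever the NA conceals
        return NA if saw_na else False

    assert kleene_any([False, False, NA, True]) is True
    assert kleene_any([False, False, NA, False]) is NA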
Other prior art
---------------

Once we move beyond the "statistical missing data" case, the correct behavior for missing data becomes less clearly defined. There are many cases where specific elements are singled out to be treated specially or excluded from computations, and these could often be conceptualized as involving 'missing data' in some sense.

In image processing, it's common to use a single image together with one or more boolean masks to e.g. composite subsets of an image. As Joe Harrington pointed out on the list, in the context of processing astronomical images, it's also common to generalize to a floating-point valued mask, or alpha channel, to indicate degrees of "missingness". We think this is out of scope for the present design, but it is an important use case, and ideally NumPy should support natural ways of manipulating such data.

After R, numpy.ma is probably the most mature source of experience on missing-data-related APIs. Its design is quite different from R; it uses different semantics -- reductions skip masked values by default and NaNs convert to masked -- and it uses a different storage strategy via a separate mask. While it seems to be generally considered sub-optimal for general use, it's hard to pin down whether this is because the API is immature but basically good, or the API is fundamentally broken, or the API is great but the code should be faster, or what. We looked at some of those users to try and get a better idea.

Matplotlib is perhaps the best known package to rely on numpy.ma. It seems to use it in two ways. One is as a way for users to indicate what data is missing when passing it to be graphed. (Other ways are also supported, e.g., passing in NaN values gives the same result.) In this regard, matplotlib treats np.ma.masked and NaN values in the same way that R's plotting routines handle NA and NaN values. For these purposes, matplotlib doesn't really care what semantics or storage strategy is used for missing data.

Internally, matplotlib uses numpy.ma arrays to store and pass around separately computed boolean masks containing 'validity' information for each input array in a cheap and non-destructive fashion. Mark's impression from some shallow code review is that mostly it works directly with the data and mask attributes of the masked arrays, not extensively using the particular computational semantics of numpy.ma. So, for this usage they do rely on the non-destructive mask-based storage, but this doesn't say much about what semantics are needed.

Paul Hobson `posted some code`__ on the list that uses numpy.ma for storing arrays of contaminant concentration measurements. Here the mask indicates whether the corresponding number represents an actual measurement, or just the estimated detection limit for a concentration which was too small to detect. Nathaniel's impression from reading through this code is that it also mostly uses the .data and .mask attributes in preference to performing operations on the MaskedArray directly.

__ https://mail.scipy.org/pipermail/numpy-discussion/2012-April/061743.html

So, these examples make it clear that there is demand for a convenient way to keep a data array and a mask array (or even a floating point array) bundled up together and "aligned". But they don't tell us much about what semantics the resulting object should have with respect to ufuncs and friends.

Semantics, storage, API, oh my!
===============================
We think it's useful to draw a clear line between use cases, semantics, and storage. Use cases are situations that users encounter, regardless of what NumPy does; they're the focus of the previous section. When we say *semantics*, we mean the result of different operations as viewed from the Python level without regard to the underlying implementation.

*NA semantics* are the ones described above and used by R::

    1 + NA = NA
    sum([1, 2, NA]) = NA
    NA | False = NA
    NA | True = True

With ``na.rm=TRUE`` or ``skipNA=True``, this switches to::

    1 + NA = illegal   # in R, only reductions take na.rm argument
    sum([1, 2, NA], skipNA=True) = 3

There's also been discussion of what we'll call *ignore semantics*. These are somewhat underdefined::

    sum([1, 2, IGNORED]) = 3
    # Several options here:
    1 + IGNORED = 1
    # or
    1 + IGNORED = <leaves output array untouched>
    # or
    1 + IGNORED = IGNORED

The numpy.ma semantics are::

    sum([1, 2, masked]) = 3
    1 + masked = masked
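These numpy.ma semantics are observable in any current NumPy::

    >>> import numpy as np
    >>> import numpy.ma as ma
    >>> a = ma.masked_array([1.0, 2.0, 7.0], mask=[False, False, True])
    >>> print(a.sum())        # the reduction skips the masked value
    3.0
    >>> print(1 + ma.masked)  # element-wise operations propagate masked
    --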
If either NA or ignore semantics are implemented with masks, then there is a choice of what should be done to the value in the storage for an array element which gets assigned a missing value. Three possibilities are:

* Leave that memory untouched (the choice made in the NEP).
* Do the calculation with the values independently of the mask (perhaps the most useful option for Paul Hobson's use-case above).
* Copy whatever value is stored behind the input missing value into the output (this is what numpy.ma does. Even that is ambiguous in the case of ``masked + masked`` -- in this case numpy.ma copies the value stored behind the leftmost masked value).

When we talk about *storage*, we mean the debate about whether missing values should be represented by designating a particular value of the underlying data-type (the *bitpattern dtype* option, as used in R), or by using a separate *mask* stored alongside the data itself.

For mask-based storage, there is also an important question about what the API looks like for accessing the mask, modifying the mask, and "peeking behind" the mask.

Designs that have been proposed
===============================

One option is to just copy R, by implementing a mechanism whereby dtypes can arrange for certain bitpatterns to be given NA semantics.

One option is to copy numpy.ma closely, but with a more optimized implementation. (Or to simply optimize the existing implementation.)

One option is that described in the NEP_, for which an implementation of mask-based missing data exists. This system is roughly:

.. _NEP: https://github.com/numpy/numpy/blob/master/doc/neps/nep-0012-missing-data.rst

* There is both bitpattern and mask-based missing data, and both have identical interoperable NA semantics.
* Masks are modified by assigning np.NA or values to array elements. The way to peek behind the mask or to unmask values is to keep a view of the array that shares the data pointer but not the mask pointer.
* Mark would like to add a way to access and manipulate the mask more directly, to be used in addition to this view-based API.
* If an array has both a bitpattern dtype and a mask, then assigning np.NA writes to the mask, rather than to the array itself. Writing a bitpattern NA to an array which supports both requires accessing the data by "peeking under the mask".

Another option is that described in the alterNEP_, which is to implement bitpattern dtypes with NA semantics for the "statistical missing data" use case, and to also implement a totally independent API for masked arrays with ignore semantics and all mask manipulation done explicitly through a .mask attribute.

.. _alterNEP: https://gist.github.com/njsmith/1056379

Another option would be to define a minimalist aligned array container that holds multiple arrays and that can be used to pass them around together. It would support indexing (to help with the common problem of wanting to subset several arrays together without their becoming unaligned), but all arithmetic etc. would be done by accessing the underlying arrays directly via attributes. The "prior art" discussion above suggests that something like this holding a .data and a .mask array might actually solve a number of people's problems without requiring any major architectural changes to NumPy. This is similar to a structured array, but with each field in a separately stored array instead of packed together.

Several people have suggested that there should be a single system that has multiple missing values that each have different semantics, e.g., a MISSING value that has NA semantics, and a separate IGNORED value that has ignore semantics.

None of these options are necessarily exclusive.

The debate
==========

We are both dubious of using ignore semantics as a default missing data behavior. **Nathaniel** likes NA semantics because he is most interested in the "statistical missing data" use case, and NA semantics are exactly right for that. **Mark** isn't as interested in that use case in particular, but he likes the NA computational abstraction because it is unambiguous and well-defined in all cases, and has a lot of existing experience to draw from.

What **Nathaniel** thinks, overall:

* The "statistical missing data" use case is clear and compelling; the other use cases certainly deserve our attention, but it's hard to say what they *are* exactly yet, or even if the best way to support them is by extending the ndarray object.
* The "statistical missing data" use case is best served by an R-style system that uses bitpattern storage to implement NA semantics. The main advantage of bitpattern storage for this use case is that it avoids the extra memory and speed overhead of storing and checking a mask (especially for the common case of floating point data, where some tricks with NaNs allow us to effectively hardware-accelerate most NA operations). These concerns alone appear to make a mask-based implementation unacceptable to many NA users, particularly in areas like neuroscience (where memory is tight) or financial modeling (where milliseconds are critical). In addition, the bitpattern approach is less confusing conceptually (e.g., assignment really is just assignment, no magic going on behind the curtain), and it's possible to have in-memory compatibility with R for inter-language calls via rpy2. The main disadvantage of the bitpattern approach is the need to give up a value to represent NA, but this is not an issue for the most important data types (float, bool, strings, enums, objects); really, only integers are affected. And even for integers, giving up a value doesn't really matter for statistical problems. (Occupy Wall Street notwithstanding, no-one's income is 2**63 - 1. And if it were, we'd be switching to floats anyway to avoid overflow.)
Several people have suggested that there should be a single system that has multiple missing values, each with different semantics: e.g., a MISSING value that has NA semantics, and a separate IGNORED value that has ignore semantics.

None of these options is necessarily exclusive of the others.

The debate
==========

We are both dubious of using ignore semantics as the default missing data behavior. **Nathaniel** likes NA semantics because he is most interested in the "statistical missing data" use case, and NA semantics are exactly right for that. **Mark** isn't as interested in that use case in particular, but he likes the NA computational abstraction because it is unambiguous and well-defined in all cases, and has a lot of existing experience to draw from.

What **Nathaniel** thinks, overall:

* The "statistical missing data" use case is clear and compelling; the other use cases certainly deserve our attention, but it's hard to say what they *are* exactly yet, or even whether the best way to support them is by extending the ndarray object.
* The "statistical missing data" use case is best served by an R-style system that uses bitpattern storage to implement NA semantics. The main advantage of bitpattern storage for this use case is that it avoids the extra memory and speed overhead of storing and checking a mask (especially for the common case of floating point data, where some tricks with NaNs allow us to effectively hardware-accelerate most NA operations). These concerns alone appear to make a mask-based implementation unacceptable to many NA users, particularly in areas like neuroscience (where memory is tight) or financial modeling (where milliseconds are critical). In addition, the bitpattern approach is less confusing conceptually (e.g., assignment really is just assignment, with no magic going on behind the curtain), and it's possible to have in-memory compatibility with R for inter-language calls via rpy2. The main disadvantage of the bitpattern approach is the need to give up a value to represent NA, but this is not an issue for the most important data types (float, bool, strings, enums, objects); really, only integers are affected. And even for integers, giving up a value doesn't really matter for statistical problems. (Occupy Wall Street notwithstanding, no-one's income is 2**63 - 1. And if it were, we'd be switching to floats anyway to avoid overflow.)
* Adding new dtypes requires some cooperation with the ufunc and casting machinery, but doesn't require any architectural changes or violations of NumPy's current orthogonality.
* His impression from the mailing list discussion, especially the `"what can we agree on?" thread`__, is that many numpy.ma users specifically like the combination of masked storage, the mask being easily accessible through the API, and ignore semantics. He could be wrong, of course. But he cannot remember seeing anybody besides Mark advocate for the specific combination of masked storage and NA semantics, which makes him nervous.

  __ http://thread.gmane.org/gmane.comp.python.numeric.general/46704

* Also, he personally is not very happy with the idea of having two storage implementations that are almost-but-not-quite identical at the Python level. While there likely are people who would like to temporarily pretend that certain data is "statistically missing data" without making a copy of their array, it's not at all clear that they outnumber the people who would like to use bitpatterns and masks simultaneously for distinct purposes. And honestly, he'd like to be able to just ignore masks if he wants and stick to bitpatterns, which isn't possible if they're coupled together tightly in the API. So he would say the jury is still very much out on whether this aspect of the NEP design is an advantage or a disadvantage. (Certainly he's never heard of any R users complaining that they really wish they had the option of making a different trade-off here.)
* R's NA support is a `headline feature`__ and its target audience considers it a compelling advantage over other platforms like Matlab or Python. Working with statistical missing data is very painful without platform support.

  __ http://www.sr.bham.ac.uk/~ajrs/R/why_R.html

* By comparison, we clearly have much more uncertainty about the use cases that require a mask-based implementation, and it doesn't seem like people will suffer too badly if they are forced for now to settle for using NumPy's excellent mask-based indexing, the new where= support, and even numpy.ma.
* Therefore, bitpatterns with NA semantics seem to meet the criteria: they make a large class of users happy, in an elegant way, they fit into the original design, and we can have reasonable certainty that we understand the problem and use cases well enough to be happy with them in the long run. No mask-based storage proposal meets those criteria yet.

What **Mark** thinks, overall:

* The idea of using NA semantics by default for missing data, inspired by the "statistical missing data" problem, is better than all the other default behaviors which were considered. This applies equally to the bitpattern and the masked approach.

* For NA-style functionality to get proper support from all NumPy features and eventually all third-party libraries, it needs to be in the core. The correct and efficient way to handle missing data differs from algorithm to algorithm, and if fully supporting NumPy requires developers to think about it, NA support will be broader and of higher quality.

* At the same time, providing two different missing data interfaces, one for masks and one for bitpatterns, requires NumPy developers and third-party NumPy plugin developers to separately consider what to do in each case, and to write two additional implementations of their code.
This complicates their job, and could lead to inconsistent support for missing data.

* Providing the ability to work with both masks and bitpatterns through the same C and Python programming interface makes missing data support cleanly orthogonal to all other NumPy features.

* There are many trade-offs of memory usage, performance, correctness, and flexibility between masks and bitpatterns (illustrated concretely after this list). Providing support for both approaches allows users of NumPy to choose the approach which is most compatible with their way of thinking, or has characteristics which best match their use-case. Providing them through the same interface further allows them to try both with minimal effort, and choose the one which performs better or uses the least memory for their programs.

* Memory Usage

  * With bitpatterns, less memory is used for storing a single array containing some NAs.

  * With masks, less memory is used for storing multiple arrays that are identical except for the location of their NAs. (In this case a single data array can be re-used with multiple mask arrays; bitpattern NAs would need to copy the whole data array.)

* Performance

  * With bitpatterns, the floating point types can use native hardware operations, with nearly correct behavior. For fully correct floating point behavior, and for other types, code must be written which specially tests for equality with the missing-data bitpattern.

  * With masks, there is always the overhead of accessing mask memory and testing its truth value. The implementation that currently exists has had no performance tuning, so it only establishes a minimum performance level. Optimal mask-based code is in general going to be slower than optimal bitpattern-based code.

* Correctness

  * Bitpattern integer types must sacrifice a valid value to represent NA. For larger integer types there are arguments that this is OK, but for 8-bit types there is no reasonable choice. In the floating point case, if the performance of native floating point operations is chosen, there is a small inconsistency in that NaN+NA and NA+NaN are different.

  * The mask approach works correctly in all cases.

* Generality

  * The bitpattern approach can work in a fully general way only when there is a specific value which can be given up from the data type. For IEEE floating point, NaN is an obvious choice, and for booleans represented as a byte, there are plenty of choices. For integers, a valid value must be sacrificed to use this approach. Third-party dtypes which plug into NumPy will also have to make a bitpattern choice to support this system, something which may not always be possible.

  * The mask approach works universally with all data types.
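Both storage strategies can be approximated with today's NumPy, which makes the memory and performance trade-offs above easy to try out. This is only an analogy -- NaN as a float-only "bitpattern" and numpy.ma as the mask approach -- not the proposed implementation::

    import numpy as np

    # Bitpattern-style NA, floats only: NaN rides on native hardware ops.
    x = np.array([1.0, 2.0, np.nan])
    np.sum(x)        # nan -- NA-style propagation for free
    np.nansum(x)     # 3.0 -- skipNA-style reduction

    # Mask-style NA via numpy.ma: an extra mask byte per element, but one
    # data buffer can back several differently-masked arrays (no copies).
    data = np.arange(5.0)
    m1 = np.ma.masked_array(data, mask=[0, 1, 0, 0, 0])
    m2 = np.ma.masked_array(data, mask=[0, 0, 0, 1, 1])  # same buffer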
Recommendations for Moving Forward
==================================

**Nathaniel** thinks we should:

* Go ahead and implement bitpattern NAs.
* *Not* implement masked arrays in the core -- or at least, not yet. Instead, we should focus on figuring out how to implement them out-of-core, so that people can try out different approaches without us committing to any one of them, and so that new prototypes can be released more quickly than the NumPy release cycle allows. In any case, we're going to have to figure out how to experiment with such changes out-of-core if NumPy is to continue to evolve without forking -- might as well do it now. The existing code can live in master, disabled, or it can live in a branch -- it'll still be there once we know what we're doing.

**Mark** thinks we should:

* Keep the existing code as is, adding a global run-time experimental flag which disables NA support by default.

A more detailed rationale for this recommendation:

* A solid preliminary NA-mask implementation is currently in NumPy master. This implementation has been extensively tested against scipy and other third-party packages, and has been in master in a stable state for a significant amount of time.
* This implementation integrates deeply with the core, providing an interface which is usable in the same way R's NA support is. It provides a compelling, user-friendly answer to R's NA support.
* The missing data NEP provides a plan for adding bitpattern-based dtype support of NAs, which would operate through the same interface but allow for the same performance/correctness trade-offs that R has made.
* Making it very easy for users to try out this implementation, which has reasonable feature coverage and performance characteristics, is the best way to get more concrete feedback about how NumPy's missing data support should look.

Because of its preliminary state, the existing implementation is marked as experimental in the NumPy documentation. It would be good for it to remain marked as experimental until it is more fleshed out, for example supporting struct and array dtypes and offering a fuller set of NumPy operations.

I think the code should stay as it is, except for the addition of a run-time global NumPy flag, perhaps numpy.experimental.maskna, which defaults to False and can be toggled to True. In its default state, any use of NA features would raise an "ExperimentalError" exception, a measure which would prevent accidental use and communicate the experimental status very clearly.
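As a self-contained sketch of how such gating could work (hypothetical: no such flag or exception ever shipped in NumPy, and the names simply follow the proposal above)::

    class ExperimentalError(Exception):
        """Raised when an experimental feature is used without opting in."""

    class _ExperimentalFlags:
        maskna = False                      # NA support disabled by default

    experimental = _ExperimentalFlags()     # would live at numpy.experimental

    def _check_maskna_allowed():
        # Every NA-feature entry point would call this guard first.
        if not experimental.maskna:
            raise ExperimentalError(
                "NA support is experimental; set "
                "numpy.experimental.maskna = True to enable it")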
The `ABI issues`__ seem very tricky to deal with effectively in the 1.x series of releases, but I believe that with proper implementation-hiding in a 2.0 release, evolving the software to support the various other ABI ideas that have been discussed is feasible. This is the approach I like best.

__ http://thread.gmane.org/gmane.comp.python.numeric.general/49485

**Nathaniel** notes in response that he doesn't really have any objection to shipping experimental APIs in the main numpy distribution *if* we're careful to make sure that they don't "leak out" in a way that leaves us stuck with them. And in principle some sort of "this violates your warranty" global flag could be a way to do that. (In fact, this might also be a useful strategy for the kinds of changes that he favors, of adding minimal hooks to enable us to build prototypes more easily -- we could have some "rapid prototyping only" hooks that let prototype hacks get deeper access to NumPy's internals than we were otherwise ready to support.)

But he wants to point out two things. First, it seems like we still have fundamental questions to answer about the NEP design, like whether masks should have NA semantics or ignore semantics, and there are already plans to substantially change how NEP masks are exposed and accessed. So he isn't sure what we'll learn by asking for feedback on the NEP code in its current state.

And second, given the concerns about their causing (minor) ABI issues, it's not clear that we could really prevent them from leaking out. (He looks forward to 2.0 too, but we're not there yet.) So maybe it would be better if they weren't present in the C API at all, and the hoops required for testers were instead something like, "we have included a hacky pure-Python prototype accessible by typing ``import numpy.experimental.donttrythisathome.NEP`` and would welcome feedback"?

If so, then he should mention that he did implement a horribly kludgy, pure-Python implementation of the NEP API that works with NumPy 1.6.1. This was mostly an experiment to see how feasible such prototyping was, and to test out a possible ufunc override mechanism; but if there's interest, the module is available here: https://github.com/njsmith/numpyNEP

It passes the maskna test-suite, with some minor issues described in a big comment at the top.

**Mark** responds:

I agree that it's important to be careful when adding new features to NumPy, but I also believe it is essential that the project have forward development momentum. A project like NumPy requires developers to write code for advancement to occur, and obstacles that impede the writing of code discourage existing developers from contributing more, and potentially scare away developers who are thinking about joining in.

All software projects, both open source and closed source, must balance short-term practicality against long-term planning. In the case of the missing data development, there was a short-term resource commitment to tackle this problem, which is quite immense in scope. If there isn't a high likelihood of getting a contribution into NumPy that concretely advances towards a solution, I expect that individuals and companies interested in doing such work will have a much harder time justifying a commitment of their resources. For a project which is core to so many other libraries, relying only on the good will of selfless volunteers would mean that NumPy could more easily be overtaken by another project.

In the case of the existing NA contribution at issue, how we resolve this disagreement represents a decision about how NumPy's developers, contributors, and users should interact. If we create a document describing a dispute resolution process, how do we design it so that it doesn't introduce a large burden and excessive uncertainty on developers that could prevent them from productively contributing code?

If we go the route of writing up a decision process which includes such a dispute resolution mechanism, I think the meat of it should be a roadmap that potential contributors and developers can follow to gain influence over NumPy. NumPy development needs broad support beyond code contributions, and tying influence in the project to contributions seems to me like a good way to encourage people to take on tasks like bug triaging/management, continuous integration/build server administration, and the myriad other tasks that help satisfy the project's needs. No specific meritocratic, democratic, consensus-striving system will satisfy everyone, but the vigour of the discussions around governance and process indicates that something at least a little bit more formal than the current status quo is necessary.

In conclusion, I would like the NumPy project to prioritize movement towards a more flexible and modular ABI/API, balanced with strong backwards-compatibility constraints and feature additions that individuals, universities, and companies want to contribute.
I do not believe keeping the NA code in 1.7 as it is, with the small additional measure of requiring it to be enabled by an experimental flag, poses a risk of long-term ABI troubles. The greater risk I see is a continuing lack of developers contributing to the project, and I believe that backing out this code because of these worries would risk reducing developer contributions.


References and Footnotes
------------------------

NEP 12 describes Mark's NA-semantics/mask implementation/view-based mask handling API.

NEP 24 (the "alterNEP") was Nathaniel's initial attempt at separating MISSING and IGNORED handling into bit-patterns versus masks, though there is a lot he would change about the proposal at this point.

NEP 25 ("miniNEP 2") was a later attempt by Nathaniel to sketch out an implementation strategy for NA dtypes.

A further discussion overview page can be found at:
https://github.com/njsmith/numpy/wiki/NA-discussion-status


Copyright
---------

This document has been placed in the public domain.