summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
-rw-r--r--doc/neps/nep-0024-missing-data-2.rst210
1 files changed, 210 insertions, 0 deletions
diff --git a/doc/neps/nep-0024-missing-data-2.rst b/doc/neps/nep-0024-missing-data-2.rst
new file mode 100644
index 000000000..c8b19561f
--- /dev/null
+++ b/doc/neps/nep-0024-missing-data-2.rst
@@ -0,0 +1,210 @@
+=============================================================
+NEP 24 — Missing Data Functionality - Alternative 1 to NEP 12
+=============================================================
+
+:Author: Nathaniel J. Smith <njs@pobox.com>, Matthew Brett <matthew.brett@gmail.com>
+:Status: Deferred
+:Type: Standards Track
+:Created: 2011-06-30
+
+
+Abstract
+--------
+
+*Context: this NEP was written as an alternative to NEP 12, which at the time of writing
+had an implementation that was merged into the NumPy master branch.*
+
+The principle of this NEP is to separate the APIs for masking and for missing values, according to
+
+* The current implementation of masked arrays (NEP 12)
+* This proposal.
+
+This discussion is only of the API, and not of the implementation.
+
+Detailed description
+--------------------
+
+
+Rationale
+^^^^^^^^^
+
+The purpose of this aNEP is to define two interfaces -- one for handling
+'missing values', and one for handling 'masked arrays'.
+
+An ordinary value is something like an integer or a floating point number. A
+*missing* value is a placeholder for an ordinary value that is for some
+reason unavailable. For example, in working with statistical data, we often
+build tables in which each row represents one item, and each column
+represents properties of that item. For instance, we might take a group of
+people and for each one record height, age, education level, and income, and
+then stick these values into a table. But then we discover that our research
+assistant screwed up and forgot to record the age of one of our individuals.
+We could throw out the rest of their data as well, but this would be
+wasteful; even such an incomplete row is still perfectly usable for some
+analyses (e.g., we can compute the correlation of height and income). The
+traditional way to handle this would be to stick some particular meaningless
+value in for the missing data, e.g., recording this person's age as 0. But
+this is very error prone; we may later forget about these special values
+while running other analyses, and discover to our surprise that babies have
+higher incomes than teenagers. (In this case, the solution would be to just
+leave out all the items where we have no age recorded, but this isn't a
+general solution; many analyses require something more clever to handle
+missing values.) So instead of using an ordinary value like 0, we define a
+special "missing" value, written "NA" for "not available".
+
+Therefore, missing values have the following properties: Like any other
+value, they must be supported by your array's dtype -- you can't store a
+floating point number in an array with dtype=int32, and you can't store an NA
+in it either. You need an array with dtype=NAint32 or something (exact syntax
+to be determined). Otherwise, they act exactly like any other values. In
+particular, you can apply arithmetic functions and so forth to them. By
+default, any function which takes an NA as an argument always returns an NA
+as well, regardless of the values of the other arguments. This ensures that
+if we try to compute the correlation of income with age, we will get "NA",
+meaning "given that some of the entries could be anything, the answer could
+be anything as well". This reminds us to spend a moment thinking about how we
+should rephrase our question to be more meaningful. And as a convenience for
+those times when you do decide that you just want the correlation between the
+known ages and income, then you can enable this behavior by adding a single
+argument to your function call.
+
+For floating point computations, NAs and NaNs have (almost?) identical
+behavior. But they represent different things -- NaN an invalid computation
+like 0/0, NA a value that is not available -- and distinguishing between
+these things is useful because in some situations they should be treated
+differently. (For example, an imputation procedure should replace NAs with
+imputed values, but probably should leave NaNs alone.) And anyway, we can't
+use NaNs for integers, or strings, or booleans, so we need NA anyway, and
+once we have NA support for all these types, we might as well support it for
+floating point too for consistency.
+
+A masked array is, conceptually, an ordinary rectangular numpy array, which
+has had an arbitrarily-shaped mask placed over it. The result is,
+essentially, a non-rectangular view of a rectangular array. In principle,
+anything you can accomplish with a masked array could also be accomplished by
+explicitly keeping a regular array and a boolean mask array and using numpy
+indexing to combine them for each operation, but combining them into a single
+structure is much more convenient when you need to perform complex operations
+on the masked view of an array, while still being able to manipulate the mask
+in the usual ways. Therefore, masks are preserved through indexing, and
+functions generally treat masked-out values as if they were not even part of
+the array in the first place. (Maybe this is a good heuristic: a length-4
+array in which the last value has been masked out behaves just like an
+ordinary length-3 array, so long as you don't change the mask.) Except, of
+course, that you are free to manipulate the mask in arbitrary ways whenever
+you like; it's just a standard numpy array.
+
+There are some simple situations where one could use either of these tools to
+get the job done -- or other tools entirely, like using designated surrogate
+values (age=0), separate mask arrays, etc. But missing values are designed to
+be particularly helpful in situations where the missingness is an intrinsic
+feature of the data -- where there's a specific value that **should** exist,
+if it did exist we'd it'd mean something specific, but it **doesn't**. Masked
+arrays are designed to be particularly helpful in situations where we just
+want to temporarily ignore some data that does exist, or generally when we
+need to work with data that has a non-rectangular shape (e.g., if you make
+some measurement at each point on a grid laid over a circular agar dish, then
+the points that fall outside the dish aren't missing measurements, they're
+just meaningless).
+
+Initialization
+^^^^^^^^^^^^^^
+
+First, missing values can be set and be displayed as ``np.NA, NA``::
+
+ >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
+ array([1., 2., NA, 7.], dtype='NA[<f8]')
+
+As the initialization is not ambiguous, this can be written without the NA
+dtype::
+
+ >>> np.array([1.0, 2.0, np.NA, 7.0])
+ array([1., 2., NA, 7.], dtype='NA[<f8]')
+
+Masked values can be set and be displayed as ``np.IGNORE, IGNORE``::
+
+ >>> np.array([1.0, 2.0, np.IGNORE, 7.0], masked=True)
+ array([1., 2., IGNORE, 7.], masked=True)
+
+As the initialization is not ambiguous, this can be written without
+``masked=True``::
+
+ >>> np.array([1.0, 2.0, np.IGNORE, 7.0])
+ array([1., 2., IGNORE, 7.], masked=True)
+
+Ufuncs
+^^^^^^
+
+By default, NA values propagate::
+
+ >>> na_arr = np.array([1.0, 2.0, np.NA, 7.0])
+ >>> np.sum(na_arr)
+ NA('float64')
+
+unless the ``skipna`` flag is set::
+
+ >>> np.sum(na_arr, skipna=True)
+ 10.0
+
+By default, masking does not propagate::
+
+ >>> masked_arr = np.array([1.0, 2.0, np.IGNORE, 7.0])
+ >>> np.sum(masked_arr)
+ 10.0
+
+unless the ``propmask`` flag is set::
+
+ >>> np.sum(masked_arr, propmask=True)
+ IGNORE
+
+An array can be masked, and contain NA values::
+
+ >>> both_arr = np.array([1.0, 2.0, np.IGNORE, np.NA, 7.0])
+
+In the default case, the behavior is obvious::
+
+ >>> np.sum(both_arr)
+ NA('float64')
+
+It's also obvious what to do with ``skipna=True``::
+
+ >>> np.sum(both_arr, skipna=True)
+ 10.0
+ >>> np.sum(both_arr, skipna=True, propmask=True)
+ IGNORE
+
+To break the tie between NA and MSK, NAs propagate harder::
+
+ >>> np.sum(both_arr, propmask=True)
+ NA('float64')
+
+Assignment
+^^^^^^^^^^
+
+is obvious in the NA case::
+
+ >>> arr = np.array([1.0, 2.0, 7.0])
+ >>> arr[2] = np.NA
+ TypeError('dtype does not support NA')
+ >>> na_arr = np.array([1.0, 2.0, 7.0], dtype='NA[f8]')
+ >>> na_arr[2] = np.NA
+ >>> na_arr
+ array([1., 2., NA], dtype='NA[<f8]')
+
+Direct assignnent in the masked case is magic and confusing, and so happens only
+via the mask::
+
+ >>> masked_array = np.array([1.0, 2.0, 7.0], masked=True)
+ >>> masked_arr[2] = np.NA
+ TypeError('dtype does not support NA')
+ >>> masked_arr[2] = np.IGNORE
+ TypeError('float() argument must be a string or a number')
+ >>> masked_arr.visible[2] = False
+ >>> masked_arr
+ array([1., 2., IGNORE], masked=True)
+
+
+Copyright
+---------
+
+This document has been placed in the public domain.