NEP: missingdata: Various improvements

author: Mark Wiebe <mwiebe@enthought.com> 2011-06-29 18:17:54 -0500
committer: Charles Harris <charlesr.harris@gmail.com> 2011-07-06 16:24:13 -0600
commit: 6a6f8f52acf36289840866f437dcac9c75798202 (patch)
tree: 8d4e46aa080d83fe115b4d0e44e308f33ad95c67 /doc/neps
parent: 3d9db9b960cbc04a44458ea36f056a8e87196a04 (diff)
download: numpy-6a6f8f52acf36289840866f437dcac9c75798202.tar.gz
1 files changed, 39 insertions, 20 deletions
diff --git a/doc/neps/missing-data.rst b/doc/neps/missing-data.rst
index 17526c3d7..f5100d493 100644
--- a/doc/neps/missing-data.rst
+++ b/doc/neps/missing-data.rst
@@ -48,27 +48,32 @@ chosen in the case of the NA bit pattern solution.
 Definition of Missing Data
 **************************
 
+In order to be able to develop an intuition about what computation
+will be done by various NumPy functions, a consistent conceptual
+model of what a missing element means must be applied.
 Ferreting out the behaviors people need or want when they are working
 with "missing data" seems to be tricky, but I believe that it boils
 down to two different ideas, each of which is internally self-consistent.
 
-The "unknown yet existing data" interpretation, however, can be applied
-rigorously to all computations, while the "data that doesn't exist"
-interpretation makes sense for statistical operations like standard
-deviation but not for linear algebra operations like matrix product.
+One of them, the "unknown yet existing data" interpretation, can be applied
+rigorously to all computations, while the other makes sense for
+some statistical operations like standard deviation but not for
+linear algebra operations like matrix product.
 Thus, making "unknown yet existing data" be the default interpretation
 is superior, providing a consistent model across all computations,
 and for those operations where the other interpretation makes sense,
 an optional parameter "skipna=" can be added.
 
+For people who want the other interpretation to be default, a mechanism
+proposed elsewhere for customizing subclass ufunc behavior with a
+_numpy_ufunc_ member function would allow a subclass with a different
+default to be created.
+
 Unknown Yet Existing Data
 =========================
 
-In order to be able to develop an intuition about what computation
-will be done by various NumPy functions, a consistent conceptual
-model of what a missing element means must be applied. The approach
-taken in the R project is to define a missing element as something which
-does have a valid value, but that value is unknown. This proposal
+This is the approach taken in the R project, defining a missing element
+as something which does have a valid value which isn't known. This proposal
 adopts this behavior as as the default for all operations involving
 missing values.
 
@@ -80,7 +85,7 @@ that is not NA, such as logical_and(NA, False) == False.
 
 Some more complex arithmetic operations, such as matrix products, are
 well defined with this interpretation, and the result should be
-the same as is the missing values were NaNs. Actually implementing
+the same as if the missing values were NaNs. Actually implementing
 such things to the theoretical limit is probably not worth it,
 and in many cases either raising an exception or returning all
 missing values may be preferred to doing precise calculations.
@@ -96,15 +101,17 @@ treated as if they didn't exist in the array, and the operation should
 do its best to interpret what that means according to the data
 that's left. In this case, 'mean(a)' would compute the mean of just
 the values that are unmasked, adjusting both the sum and count it
-uses based on the mask.
+uses based on the mask. To be consistent, the mean of an array of
+all missing values must produce the same result as the mean of a zero-sized
+array without missing value support.
 
 This kind of data can arise when conforming sparsely sampled data
-into a regular sampling pattern, and is a useful interpretation so 
+into a regular sampling pattern, and is a useful interpretation to 
 use when attempting to get best-guess answers for many statistical queries.
 
 In R, many functions take a parameter "na.rm=T" which means to treat
 the data as if the NA values are not part of the data set. This proposal
-defines a standard parameter "skipmissing=True" for this same purpose. 
+defines a standard parameter "skipna=True" for this same purpose. 
 
 ********************************************
 Implementation Techniques For Missing Values
@@ -128,8 +135,8 @@ NA Value Bit Patterns
 =====================
 
 One or more patterns of bits, for example a NaN with
-a particular payload, are chosen to represent the missing value placeholder
-NA.
+a particular payload, are chosen to represent the missing value
+placeholder NA.
 
 A consequence of this approach is that assigning NA changes the bits
 holding the value, so that value is gone.
@@ -207,7 +214,7 @@ like 'arr[0]' would be throwing away the dtype, which is still
 valuable to retain, so 'arr[0]' will return a zero-dimensional
 array either with its value masked, or containing the NA bit pattern
 for the array's dtype. To test if the value is missing, the function
-"np.ismissing(arr[0])" will be provided. One of the key reasons for the
+"np.isna(arr[0])" will be provided. One of the key reasons for the
 NumPy scalars is to allow their values into dictionaries. Having a
 missing value as the key in a dictionary is a bad idea, so the NumPy
 scalars will not support missing values in any form.
@@ -237,6 +244,11 @@ if a valid value in a masked array happens to have the NA bit pattern,
 copying this value to the NA form of the dtype will cause it to
 become NA as well.
 
+When operations are done between arrays with NA dtypes and masked arrays,
+the result will be masked arrays. This is because in some cases the
+NA dtypes cannot represent all the values in the masked array, so
+going to masked arrays is the only way to losslessly preserve the data.
+
 If np.NA or masked values are copied to an array without support for
 missing values enabled, an exception will be raised. Adding a mask to
 the target array would be problematic, because then having a mask
@@ -246,7 +258,7 @@ performance in unexpected ways.
 By default, the string "NA" will be used to represent missing values
 in str and repr outputs. A global configuration will allow
 this to be changed. The array2string function will also gain a
-'maskedstr=' parameter so this could be changed to "<missing>" or
+'nastr=' parameter so this could be changed to "<missing>" or
 other values people may desire.
 
 For floating point numbers, Inf and NaN are separate concepts from
@@ -279,7 +291,7 @@ space optimization, where a bit-level instead of a byte-level mask
 is used to get a factor of eight memory usage improvement.
 
 To access the mask values, there are two functions provided,
-'np.ismissing' and 'np.isavail', which test for NA or available values
+'np.isna' and 'np.isavail', which test for NA or available values
 respectively. These functions work equivalently for masked arrays
 and NA bit pattern dtypes.
 
@@ -323,7 +335,7 @@ New ndarray Methods
 
 New functions added to the numpy namespace are::
 
-    np.ismissing(arr)
+    np.isna(arr)
         Returns a boolean array with True whereever the array is masked
         or matches the NA bit pattern, and False elsewhere
 
@@ -421,7 +433,6 @@ Some examples::
       return mean(axis, dtype, out)
     nan
 
-
 PEP 3118
 ========
 
@@ -552,6 +563,14 @@ cannot hold values, but will conform to the input types in functions like
 maps to [('a', 'NA[f4]'), ('b', 'NA[i4]')]. Thus, to view the memory
 of an 'f8' array 'arr' with 'NA[f8]', you can say arr.view(dtype='NA').
 
+******************************
+C API Access: Masked Iteration
+******************************
+
+TODO: Describe details about how the nditer will be extended to allow
+functions to do masked iteration, transparently working with both
+NA dtypes or masked arrays in one implementation.
+
 ********************
 Rejected Alternative
 ********************