NEP: missing-data: Add glossary of terms, try to clarify them better

This is a step towards having everyone on the list use the same vocabulary with specific nailed-down definitions for the terms.
author: Mark Wiebe <mwiebe@enthought.com> 2011-07-06 10:16:08 -0500
committer: Charles Harris <charlesr.harris@gmail.com> 2011-07-06 16:24:13 -0600
commit: a9be5a1a2e753c52db36f1ae678115fcf046d8a4 (patch)
tree: b9aa238a3dcea835d31fb711ba9cb86508405750 /doc
parent: f2f7bd6510b24f6f1c12642afb888f82da0a353e (diff)
download: numpy-a9be5a1a2e753c52db36f1ae678115fcf046d8a4.tar.gz
1 files changed, 59 insertions, 28 deletions
diff --git a/doc/neps/missing-data.rst b/doc/neps/missing-data.rst
index 0dde5bb0f..2056d3661 100644
--- a/doc/neps/missing-data.rst
+++ b/doc/neps/missing-data.rst
@@ -21,19 +21,19 @@ find the programming interface challenging or inconsistent tend not
 to use it.
 
 This NEP proposes to integrate a mask-based missing data solution
-into NumPy, with an additional NA bit pattern-based missing data solution
+into NumPy, with an additional bitpattern-based missing data solution
 that can be implemented  concurrently or later integrating seamlessly
 with the mask-based solution.
 
-The mask-based solution and the NA bit pattern-based solutions in this
+The mask-based solution and the bitpattern-based solutions in this
 proposal offer the exact same missing value abstraction, with several
 differences in performance, memory overhead, and flexibility.
 
 The mask-based solution is more flexible, supporting all behaviors of the
-NA bit pattern-based solution, but leaving the hidden values untouched
+bitpattern-based solution, but leaving the hidden values untouched
 whenever an element is masked.
 
-The NA bit pattern-based solution requires less memory, is bit-level
+The bitpattern-based solution requires less memory, is bit-level
 compatible with the 64-bit floating point representation used in R, but
 does not preserve the hidden values and in fact requires stealing at
 least one bit pattern from the underlying dtype to represent the missing
@@ -42,7 +42,7 @@ value NA.
 Both solutions are generic in the sense that they can be used with
 custom data types very easily, with no effort in the case of the masked
 solution, and with the requirement that a bit pattern to sacrifice be
-chosen in the case of the NA bit pattern solution.
+chosen in the case of the bitpattern solution.
 
 **************************
 Definition of Missing Data
@@ -69,8 +69,8 @@ proposed elsewhere for customizing subclass ufunc behavior with a
 _numpy_ufunc_ member function would allow a subclass with a different
 default to be created.
 
-Unknown Yet Existing Data
-=========================
+Unknown Yet Existing Data (NA)
+==============================
 
 This is the approach taken in the R project, defining a missing element
 as something which does have a valid value which isn't known, or is
@@ -90,8 +90,8 @@ such things to the theoretical limit is probably not worth it,
 and in many cases either raising an exception or returning all
 missing values may be preferred to doing precise calculations.
 
-Data That Doesn't Exist Or Is Being Skipped
-===========================================
+Data That Doesn't Exist Or Is Being Skipped (IGNORE)
+====================================================
 
 Another useful interpretation is that the missing elements should be
 treated as if they didn't exist in the array, and the operation should
@@ -121,15 +121,15 @@ existing implementations of the techniques, I believe that the design
 choices made in a new implementation must be made based on their merits,
 not by rote copying of previous designs.
 
-Both masks and NA bit patterns have different strong and weak points,
+Both masks and bitpatterns have different strong and weak points,
 depending on the application context. This NEP thus proposes to implement
 both. To enable the writing of generic "missing value" code which does
 not have to worry about whether the arrays it is using have taken one
 or the other approach, the missing value semantics will be identical
 for the two implementations.
 
-NA Value Bit Patterns
-=====================
+Bit Patterns Signalling Missing Values (bitpattern)
+===================================================
 
 One or more patterns of bits, for example a NaN with
 a particular payload, are chosen to represent the missing value
@@ -141,8 +141,8 @@ holding the value, so that value is gone.
 Additionally, for some types such as integers, a good and proper value
 must be sacrificed to enable this functionality.
 
-Array Masks
-===========
+Boolean Masks Signalling Missing Values (mask)
+==============================================
 
 A mask is a parallel array of booleans, either one byte per element or
 one bit per element, allocated alongside the existing array data. In this
@@ -159,6 +159,37 @@ This approach places no limitations on the values of the underlying
 data type, it may take on any binary pattern without affecting the
 NA behavior.
 
+*****************
+Glossary of Terms
+*****************
+
+Because the above discussions of the different concepts and their
+relationships are tricky to understand, here are more succinct
+definitions of the terms used in this NEP.
+
+NA (Not Available)
+    A placeholder for a value which is unknown to computations. That
+    value may be temporarily hidden with a mask, may have been lost
+    due to hard drive corruption, or gone for any number of reasons.
+    This is the same as NA in the R project.
+
+IGNORE (Skip/Ignore)
+    A placeholder which should be treated by computations as if no value does
+    or could exist there. For sums, this means act as if the value
+    were zero, and for products, this means act as if the value were one.
+    It's as if the array were compressed in some fashion to not include
+    that element.
+
+bitpattern
+    A technique for implementing either NA or IGNORE, where a particular
+    set of bit patterns are chosen from all the possible bit patterns of the
+    value's data type to signal that the element is NA or IGNORE.
+
+mask
+    A technique for implementing either NA or IGNORE, where a
+    boolean or enum array parallel to the data array is used to signal
+    which elements are NA or IGNORE.
+
 ********************************
 Missing Values as Seen in Python
 ********************************
@@ -183,7 +214,7 @@ For example,::
 
 produce arrays with values [1.0, 2.0, <inaccessible>, 7.0] /
 mask [Unmasked, Unmasked, Masked, Unmasked], and
-values [1.0, 2.0, <NA bit pattern>, 7.0] respectively.
+values [1.0, 2.0, <NA bitpattern>, 7.0] respectively.
 
 It may be worth overloading the np.NA __call__ method to accept a dtype,
 returning a zero-dimensional array with a missing value of that dtype.
@@ -203,7 +234,7 @@ but with this, they could be printed as::
 
 Assigning a value to an array always causes that element to not be NA,
 transparently unmasking it if necessary. Assigning numpy.NA to the array
-masks that element or assigns the NA bit pattern for the particular dtype.
+masks that element or assigns the NA bitpattern for the particular dtype.
 In the mask-based implementation, the storage behind a missing value may never
 be accessed in any way, other than to unmask it by assigning its value.
 
@@ -211,7 +242,7 @@ While numpy.NA works to mask values, it does not itself have a dtype.
 This means that returning the numpy.NA singleton from an operation
 like 'arr[0]' would be throwing away the dtype, which is still
 valuable to retain, so 'arr[0]' will return a zero-dimensional
-array either with its value masked, or containing the NA bit pattern
+array either with its value masked, or containing the NA bitpattern
 for the array's dtype. To test if the value is missing, the function
 "np.isna(arr[0])" will be provided. One of the key reasons for the
 NumPy scalars is to allow their values into dictionaries. Having a
@@ -236,10 +267,10 @@ from another view which doesn't have them masked. For example::
     >>> # The underlying number 1 value in 'a[0]' was untouched
 
 Copying values between the mask-based implementation and the
-NA bit pattern implementation will transparently do the correct thing,
-turning the NA bit pattern into a masked value, or a masked value
-into the NA bit pattern where appropriate. The one exception is
-if a valid value in a masked array happens to have the NA bit pattern,
+bitpattern implementation will transparently do the correct thing,
+turning the bitpattern into a masked value, or a masked value
+into the bitpattern where appropriate. The one exception is
+if a valid value in a masked array happens to have the NA bitpattern,
 copying this value to the NA form of the dtype will cause it to
 become NA as well.
 
@@ -264,7 +295,7 @@ For floating point numbers, Inf and NaN are separate concepts from
 missing values. If a division by zero occurs in an array with default
 missing value support, an unmasked Inf or NaN will be produced. To
 mask those values, a further 'a[np.logical_not(a.isfinite(a)] = np.NA'
-can achieve that. For the NA bit pattern approach, the parameterized
+can achieve that. For the bitpattern approach, the parameterized
 dtype('NA[f8,InfNan]') described in a later section can be used to get
 these semantics without the extra manipulation.
 
@@ -338,17 +369,17 @@ New functions added to the numpy namespace are::
 
     np.isna(arr)
         Returns a boolean array with True whereever the array is masked
-        or matches the NA bit pattern, and False elsewhere
+        or matches the NA bitpattern, and False elsewhere
 
     np.isavail(arr)
         Returns a boolean array with False whereever the array is masked
-        or matches the NA bit pattern, and True elsewhere
+        or matches the NA bitpattern, and True elsewhere
 
 New functions added to the ndarray are::
 
     arr.copy(..., replacena=None)
         Modification to the copy function which replaces NA values,
-        either masked or with the NA bit pattern, with the 'replacena='
+        either masked or with the NA bitpattern, with the 'replacena='
         parameter suppled. When 'replacena' isn't None, the copied
         array is unmasked and has the 'NA' part stripped from the
         parameterized type ('NA[f8]' becomes just 'f8').
@@ -479,7 +510,7 @@ This allows one to avoid the need to write special case code for each
 ufunc and for each na* dtype, something that is hard to avoid when
 building a separate independent dtype implementation for each na* dtype.
 
-Reliable conversions with the NA bit pattern preserved across primitive
+Reliable conversions with the NA bitpattern preserved across primitive
 types requires consideration as well. Even in the simple case of
 double -> float, where this is supported by hardware, the NA value
 will get lost because the NaN payload is typically not preserved.
@@ -547,8 +578,8 @@ PEP 3118
 
 PEP 3118 doesn't have any mask mechanism, so arrays with masks will
 not be accessible through this interface. Similarly, it doesn't support
-the specification of dtypes with NA bit patterns, so the parameterized NA
-dtypes will also not be accessible through this interface.
+the specification of dtypes with NA or IGNORE bitpatterns, so the
+parameterized NA dtypes will also not be accessible through this interface.
 
 If NumPy did allow access through PEP 3118, this would circumvent the
 missing value abstraction in a very damaging way. Other libraries would