NEP: c-masked-array: Idea for a parameterized dtype with an NA bit pattern

author: Mark Wiebe <mwiebe@enthought.com> 2011-06-25 13:37:58 -0500
committer: Mark Wiebe <mwiebe@enthought.com> 2011-06-27 10:32:42 -0500
commit: 95bb76a6dc8b538fd844d3bcb9b94f12d47d78c7 (patch)
tree: 705411d25ebab94e97d0565e4c052b701cdf2c02 /doc/neps
parent: 63c9d8744e3129dd7ddbc070ac14e6d2f24ea7e7 (diff)
download: numpy-95bb76a6dc8b538fd844d3bcb9b94f12d47d78c7.tar.gz
1 files changed, 109 insertions, 48 deletions
diff --git a/doc/neps/c-masked-array.rst b/doc/neps/c-masked-array.rst
index 743adc23c..36c5445ad 100644
--- a/doc/neps/c-masked-array.rst
+++ b/doc/neps/c-masked-array.rst
@@ -30,9 +30,9 @@ shouldn't, so the ufunc error handling mechanism can't be relied on.
 While no comprehensive benchmarks appear to exist, poor performance is
 sometimes cited as a problem as well.
 
-*************************
-Definition of Masked Data
-*************************
+**************************
+Definition of Missing Data
+**************************
 
 Unknown Yet Existing Data
 =========================
@@ -76,6 +76,14 @@ the data that is available.
 In R, many functions take a parameter "na.rm=T" which means to treat
 the data as if the NA values are not part of the data set.
 
+Data That Is Being Temporarily Ignored
+======================================
+
+Interpreting the meaning of temporarily ignored data requires
+choosing one of the missing data interpretations above.
+Temporarily ignoring data is a common use case for masks, which are
+an elegant mechanism for implementing it.
+
 **************************
 The Mask as Seen in Python
 **************************
@@ -288,41 +296,92 @@ a compatible change.
 Possible Alternative Designs
 ****************************
 
-Data Types With NA Signal Values
-================================
+Parameterized Data Type With NA Signal Values
+=============================================
 
 A masked array isn't the only way to deal with missing data, and
 some systems deal with the problem by defining a special "NA" value,
 for data which is missing. This is distinct from NaN floating point
-values, which are the result of bad floating point calculation values.
+values, which are the result of invalid floating point calculations,
+but many people use NaNs to mark missing data as well.
 
 In the case of IEEE floating point values, it is possible to use a
 particular NaN value, of which there are many, for "NA", distinct
-from NaN. For integers, a reasonable approach would be to use
+from NaN. For signed integers, a reasonable approach would be to use
 the minimum storable value, which doesn't have a corresponding positive
-value, so is perhaps reasonable to dispense with in most contexts.
-
-The trouble with this approach is that it requires a large amount
-of special case code in each data type, and writing a new data type
-supporting missing data requires defining a mechanism for a special
-signal value which may not be possible in general. This causes the
-missing value logic to be replicated many times, something that can be
-error-prone. This is also a lot more code for all the various ufuncs
-than a general masked mechanism which can use the unmasked loop for
-a default implementation.
-
-The masked array approach, on the other hand, works with all data types
-in a uniform fashion, adding the cost of one byte per value storage
-for the mask. The attractiveness of being able to define a new custom
-data type for NumPy and have it automatically work with missing values
-is one of the reasons the masked approach has been chosen over special
-signal values.
-
-Implementing masks as described in this NEP does not preclude also
-creating data types with special "NA" values.
-
-Parameterized Type Which Adds an NA Flag
-========================================
+value. For unsigned integers, the maximum storable value seems most
+reasonable.
+
+To provide a general mechanism, a parameterized type is much more
+attractive than creating separate nafloat32, nafloat64, naint64,
+nauint64, etc. dtypes. If this is viewed as an alternative way of
+treating the mask, except without value preservation, the
+parameterized type can work together with the mask to produce a
+value + mask combination on the fly, and use the exact same
+computational infrastructure as the masked array system. This avoids
+the need to write special case code for each ufunc and each na*
+dtype, something that is hard to avoid when building a separate
+independent implementation for each na* dtype.
+
+Reliable conversions with the NA bit pattern preserved across primitive
+types require consideration as well. Even in the simple case of
+double -> float, where this is supported by hardware, the NA value
+will get lost because the NaN payload is typically not preserved.
+Conversions between different bit patterns specified for the same
+underlying type also need to work properly. With a well-defined interface
+converting to/from a (value,flag) pair, this becomes straightforward
+to support generically.
+
+This approach also provides opportunities for some subtle variations
+with IEEE floats. By default, one exact bit-pattern, a quiet NaN with
+a payload that won't be generated by hardware floating point operations,
+would be used. The choice R has made could be this default.
+
+Additionally, it might be nice to sometimes treat all NaNs as missing values.
+This requires a slightly more complex mapping to convert the floating point
+values into mask/value combinations, and converting back would always
+produce the default NaN used by NumPy. Finally, treating both NaNs
+and Infs as missing values would be just a slight variation of the NaN
+version.
+
+Strings require a slightly different handling, because they
+may be any size. One approach is to use a one-character signal consisting
+of one of the first 32 ASCII/unicode values. There are many possible values
+to use here, like 0x15 'Negative Acknowledgement' or 0x10 'Data Link Escape'.
+
+The Object dtype has an obvious signal, the np.NA singleton itself. Any
+dtype with object semantics won't be able to have this customized, since
+specifying bit patterns applies only to plain binary data, not data
+with object semantics of construction and destruction.
+
+Struct dtypes are more of a core primitive dtype, in the same fashion that
+this parameterized NA-capable dtype is. It won't be possible to put
+these as the parameter for the parameterized NA-dtype.
+
+The dtype names would be parameterized similarly to how datetime64
+is parameterized by the metadata unit. What name to use may require some
+debate, but "NA" seems like a reasonable choice. With the default
+missing value bit-pattern, these dtypes would look like
+np.dtype('NA[float32]'), np.dtype('NA[f8]'), or np.dtype('NA[i8]').
+
+To override the bit pattern that signals a missing value, a raw
+value in the format of a hexadecimal unsigned integer can be given,
+and in the above special cases for floating point, special strings
+can be provided. The defaults for some cases, written explicitly in this
+form, are then::
+
+    np.dtype('NA[?,0x02]')
+    np.dtype('NA[i4,0x80000000]')
+    np.dtype('NA[u4,0xffffffff]')
+    np.dtype('NA[f4,0x7f8007a2]')
+    np.dtype('NA[f8,0x7ff00000000007a2]')  # R-compatible bit pattern
+    np.dtype('NA[S16,0x15]')  # using the NAK character as the signal
+
+    np.dtype('NA[f8,NaN]')  # for any NaN
+    np.dtype('NA[f8,InfNaN]')  # for any NaN or Inf
+
+Parameterized Data Type Which Adds Additional Memory for the NA Flag
+====================================================================
 
 Another alternative to having a separate mask added to the array is
 to introduce a parameterized type, which takes a primitive dtype
@@ -342,23 +401,25 @@ Acknowledgments
 In addition to feedback from Travis Oliphant and others at Enthought,
 this NEP has been revised based on a great deal of feedback from
 the NumPy-Discussion mailing list. The people participating in
-the discussion are:
-
-Nathaniel Smith
-Robert Kern
-Charles Harris
-Gael Varoquaux
-Eric Firing
-Keith Goodman
-Pierre GM
-Christopher Barker
-Josef Perktold
-Benjamin Root
-Laurent Gautier
-Neal Becker
-Bruce Southey
-Matthew Brett
-Wes McKinney
-Lluís
+the discussion are::
+
+    Nathaniel Smith
+    Robert Kern
+    Charles Harris
+    Gael Varoquaux
+    Eric Firing
+    Keith Goodman
+    Pierre GM
+    Christopher Barker
+    Josef Perktold
+    Benjamin Root
+    Laurent Gautier
+    Neal Becker
+    Bruce Southey
+    Matthew Brett
+    Wes McKinney
+    Lluís
+    Olivier Delalleau
+    Alan G Isaac
 
 I apologize if I missed anyone.
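
A note on the (value,flag) interface described in the "Parameterized Data Type With NA Signal Values" section of the patch. The following is a minimal sketch, not part of the commit, of how an exact NA bit pattern in a float64 array could be split into a value + flag pair and written back. The helper names `to_value_flag` and `from_value_flag` and the `any_nan` parameter are hypothetical. The bit pattern is the R-compatible default listed in the patch, and the proposed `NA[...]` dtype itself is not assumed to exist.

```python
import numpy as np

# R-compatible NA bit pattern for float64, as listed in the patch.
NA_BITS_F8 = np.uint64(0x7FF00000000007A2)


def to_value_flag(arr, any_nan=False):
    """Return (values, flags) for a float64 array.

    Hypothetical helper. By default only the exact NA bit pattern is
    flagged; with any_nan=True every NaN is flagged, mirroring the
    'NA[f8,NaN]' variation described in the patch.
    """
    arr = np.asarray(arr, dtype=np.float64)
    if any_nan:
        flags = np.isnan(arr)
    else:
        # Compare raw bits, because NaN != NaN under float comparison.
        flags = arr.view(np.uint64) == NA_BITS_F8
    return arr, flags


def from_value_flag(values, flags):
    """Write the NA bit pattern wherever flags is True (hypothetical helper)."""
    out = np.array(values, dtype=np.float64)  # always a fresh copy
    flags = np.asarray(flags, dtype=bool)
    # The uint64 view shares memory with out, so this stores raw bits.
    out.view(np.uint64)[flags] = NA_BITS_F8
    return out
```

Comparing through an unsigned integer view is what keeps an ordinary NaN produced by a calculation distinct from the NA signal value in this sketch.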
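
A second note, on the integer defaults the patch proposes (the minimum storable value for signed integers, the maximum for unsigned). This is a rough sketch under the assumption that a plain comparison against the sentinel is enough to recover the NA flags. The function names `integer_na_sentinel` and `integer_na_flags` are hypothetical and do not appear in the commit.

```python
import numpy as np


def integer_na_sentinel(dtype):
    """Default NA sentinel as described in the patch (hypothetical helper).

    Signed integers use the minimum storable value, unsigned integers
    use the maximum storable value.
    """
    dt = np.dtype(dtype)
    if dt.kind == "i":
        return np.iinfo(dt).min
    if dt.kind == "u":
        return np.iinfo(dt).max
    raise TypeError("expected an integer dtype, got %s" % dt)


def integer_na_flags(arr):
    """Boolean array that is True wherever arr holds its dtype's sentinel."""
    arr = np.asarray(arr)
    # Cast the sentinel to the array's own scalar type before comparing.
    sentinel = arr.dtype.type(integer_na_sentinel(arr.dtype))
    return arr == sentinel
```

For `i4` the sentinel has the bit pattern 0x80000000 and for `u4` it is 0xffffffff, matching the defaults written out in the patch.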