author | Mark Wiebe <mwiebe@enthought.com> | 2011-06-25 13:37:58 -0500 |
---|---|---|
committer | Mark Wiebe <mwiebe@enthought.com> | 2011-06-27 10:32:42 -0500 |
commit | 95bb76a6dc8b538fd844d3bcb9b94f12d47d78c7 (patch) | |
tree | 705411d25ebab94e97d0565e4c052b701cdf2c02 /doc/neps | |
parent | 63c9d8744e3129dd7ddbc070ac14e6d2f24ea7e7 (diff) | |
download | numpy-95bb76a6dc8b538fd844d3bcb9b94f12d47d78c7.tar.gz |
NEP: c-masked-array: Idea for a parameterized dtype with an NA bit pattern
Diffstat (limited to 'doc/neps')
-rw-r--r-- | doc/neps/c-masked-array.rst | 157 |
1 files changed, 109 insertions, 48 deletions
diff --git a/doc/neps/c-masked-array.rst b/doc/neps/c-masked-array.rst
index 743adc23c..36c5445ad 100644
--- a/doc/neps/c-masked-array.rst
+++ b/doc/neps/c-masked-array.rst
@@ -30,9 +30,9 @@ shouldn't, so the ufunc error handling mechanism can't be relied on.
 While no comprehensive benchmarks appear to exist, poor performance is
 sometimes cited as a problem as well.
 
-*************************
-Definition of Masked Data
-*************************
+**************************
+Definition of Missing Data
+**************************
 
 Unknown Yet Existing Data
 =========================
@@ -76,6 +76,14 @@ the data that is available. In R, many functions take a parameter
 "na.rm=T" which means to treat the data as if the NA values are not
 part of the data set.
 
+Data That Is Being Temporarily Ignored
+======================================
+
+Iterpreting the meaning of temporarily ignored data requires
+choosing between one of the missing data interpretations above.
+This is a common use case for masks, which are an elegant mechanism
+to implement this.
+
 **************************
 The Mask as Seen in Python
 **************************
@@ -288,41 +296,92 @@ a compatible change.
 Possible Alternative Designs
 ****************************
 
-Data Types With NA Signal Values
-================================
+Parameterized Data Type With NA Signal Values
+=============================================
 
 A masked array isn't the only way to deal with missing data, and
 some systems deal with the problem by defining a special "NA" value,
 for data which is missing. This is distinct from NaN floating point
-values, which are the result of bad floating point calculation values.
+values, which are the result of bad floating point calculation values,
+but many people use NaNs for this purpose.
 
 In the case of IEEE floating point values, it is possible to use a
 particular NaN value, of which there are many, for "NA", distinct
-from NaN. For integers, a reasonable approach would be to use
+from NaN. For signed integers, a reasonable approach would be to use
 the minimum storable value, which doesn't have a corresponding positive
-value, so is perhaps reasonable to dispense with in most contexts.
-
-The trouble with this approach is that it requires a large amount
-of special case code in each data type, and writing a new data type
-supporting missing data requires defining a mechanism for a special
-signal value which may not be possible in general. This causes the
-missing value logic to be replicated many times, something that can be
-error-prone. This is also a lot more code for all the various ufuncs
-than a general masked mechanism which can use the unmasked loop for
-a default implementation.
-
-The masked array approach, on the other hand, works with all data types
-in a uniform fashion, adding the cost of one byte per value storage
-for the mask. The attractiveness of being able to define a new custom
-data type for NumPy and have it automatically work with missing values
-is one of the reasons the masked approach has been chosen over special
-signal values.
-
-Implementing masks as described in this NEP does not preclude also
-creating data types with special "NA" values.
-
-Parameterized Type Which Adds an NA Flag
-========================================
+value. For unsigned integers, the maximum storage value seems most
+reasonable.
+
+With the goal of providing a general mechanism, a parameterized type
+mechanism for this is much more attractive than creating separate
+nafloat32, nafloat64, naint64, nauint64, etc dtypes. If this is viewed
+as an alternative way of treating the mask except without value preservation,
+this parameterized type can work together with the mask in a special
+way to produce a value + mask combination on the fly, and use the
+exact same computational infrastructure as the masked array system.
+This allows one to avoid the need to write special case code for each
+ufunc and for each na* dtype, something that is hard to avoid when
+building a separate independent dtype implementation for each na* dtype.
+
+Reliable conversions with the NA bit pattern preserved across primitive
+types requires consideration as well. Even in the simple case of
+double -> float, where this is supported by hardware, the NA value
+will get lost because the NaN payload is typically not preserved.
+The ability to have different bit masks specified for the same underlying
+type also needs to convert properly. With a well-defined interface
+converting to/from a (value,flag) pair, this becomes straightforward
+to support generically.
+
+This approach also provides some opportunities for some subtle variations
+with IEEE floats. By default, one exact bit-pattern, a silent NaN with
+a payload that won't be generated by hardware floating point operations,
+would be used. The choice R has made could be this default.
+
+Additionally, it might be nice to sometimes treat all NaNs as missing values.
+This requires a slightly more complex mapping to convert the floating point
+values into mask/value combinations, and converting back would always
+produce the default NaN used by NumPy. Finally, treating both NaNs
+and Infs as missing values would be just a slight variation of the NaN
+version.
+
+Strings require a slightly different handling, because they
+may be any size. One approach is to use a one-character signal consisting
+of one of the first 32 ASCII/unicode values. There are many possible values
+to use here, like 0x15 'Negative Acknowledgement' or 0x10 'Data Link Escape'.
+
+The Object dtype has an obvious signal, the np.NA singleton itself. Any
+dtype with object semantics won't be able to have this customized, since
+specifying bit patterns applies only to plain binary data, not data
+with object semantics of construction and destructions.
+
+Struct dtypes are more of a core primitive dtype, in the same fashion that
+this parameterized NA-capable dtype is. It won't be possible to put
+these as the parameter for the parameterized NA-dtype.
+
+The dtype names would be parameterized similar to how the datetime64
+is parameterized by the metadata unit. What name to use may require some
+debate, but "NA" seems like a reasonable choice. With the default
+missing value bit-pattern, these dtypes would look like
+np.dtype('NA[float32]'), np.dtype('NA[f8]'), or np.dtype('NA[i64]').
+
+To override the bit pattern that signals a missing value, a raw
+value in the format of a hexadecimal unsigned integer can be given,
+and in the above special cases for floating point, special strings
+can be provided. The defaults for some cases, written explicitly in this
+form, are then::
+
+    np.dtype('NA[?,0x02]')
+    np.dtype('NA[i4,0x80000000]')
+    np.dtype('NA[u4,0xffffffff]')
+    np.dtype('NA[f4,0x7f8007a2')
+    np.dtype('NA[f8,0x7ff00000000007a2') (R-compatible bitpattern)
+    np.dtype('NA[S16,0x15]') (using the NAK character as the signal).
+
+    np.dtype('NA[f8,NaN]') (for any NaN)
+    np.dtype('NA[f8,InfNaN]') (for any NaN or Inf)
+
+Parameterized Data Type Which Adds Additional Memory for the NA Flag
+====================================================================
 
 Another alternative to having a separate mask added to the array is
 to introduced a parameterized type, which takes a primitive dtype
@@ -342,23 +401,25 @@ Acknowledgments
 In addition to feedback Travis Oliphant and others at Enthought,
 this NEP has been revised based on a great deal of feedback from
 the NumPy-Discussion mailing list. The people participating in
-the discussion are:
-
-Nathaniel Smith
-Robert Kern
-Charles Harris
-Gael Varoquaux
-Eric Firing
-Keith Goodman
-Pierre GM
-Christopher Barker
-Josef Perktold
-Benjamin Root
-Laurent Gautier
-Neal Becker
-Bruce Southey
-Matthew Brett
-Wes McKinney
-Lluís
+the discussion are::
+
+    Nathaniel Smith
+    Robert Kern
+    Charles Harris
+    Gael Varoquaux
+    Eric Firing
+    Keith Goodman
+    Pierre GM
+    Christopher Barker
+    Josef Perktold
+    Benjamin Root
+    Laurent Gautier
+    Neal Becker
+    Bruce Southey
+    Matthew Brett
+    Wes McKinney
+    Lluís
+    Olivier Delalleau
+    Alan G Isaac
 
 I apologize if I missed anyone.
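
The following is not part of the commit. It is a minimal Python sketch of the "NA bit pattern" idea the added NEP text describes, using only the default patterns quoted in the diff (0x7ff00000000007a2 for f8, 0x80000000 for i4, 0xffffffff for u4). The helper names `set_na` and `na_mask` are hypothetical, and the proposed `NA[...]` dtype syntax is not used because the NEP only proposes it. The sketch assumes contiguous arrays and compares raw storage bits, which is one possible way to turn stored values into the (value, flag) pair the text mentions.

```python
import numpy as np

# Default NA bit patterns quoted in the NEP text above.
NA_BITS_F8 = 0x7FF00000000007A2   # R-compatible NaN payload for float64
NA_BITS_I4 = 0x80000000           # minimum storable int32
NA_BITS_U4 = 0xFFFFFFFF           # maximum storable uint32


def _as_bits(storage):
    """View a contiguous array as unsigned integers of the same width."""
    return storage.view(np.dtype(f"u{storage.dtype.itemsize}"))


def set_na(storage, where, na_bits):
    """Overwrite the selected elements with the NA bit pattern (hypothetical helper)."""
    bits = _as_bits(storage)
    bits[np.asarray(where, dtype=bool)] = bits.dtype.type(na_bits)


def na_mask(storage, na_bits):
    """Return True where an element holds exactly the NA bit pattern (hypothetical helper)."""
    bits = _as_bits(storage)
    return bits == bits.dtype.type(na_bits)


# float64: one element is marked NA, another is an ordinary NaN.
floats = np.array([1.0, 2.0, np.nan, 4.0])
set_na(floats, [False, True, False, False], NA_BITS_F8)
float_mask = na_mask(floats, NA_BITS_F8)

# int32 and uint32: the sentinel is an ordinary storable value.
ints = np.array([1, -2**31, 3], dtype=np.int32)
int_mask = na_mask(ints, NA_BITS_I4)

uints = np.array([7, 2**32 - 1], dtype=np.uint32)
uint_mask = na_mask(uints, NA_BITS_U4)
```

Because the comparison is on the exact bit pattern, an ordinary NaN is left unflagged, which corresponds to the default behaviour described in the text; the "NaN" and "InfNaN" variants would need a looser test such as `np.isnan` or `~np.isfinite`.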