diff options
Diffstat (limited to 'doc/neps/c-masked-array.rst')
-rw-r--r-- | doc/neps/c-masked-array.rst | 529 |
1 files changed, 0 insertions, 529 deletions
diff --git a/doc/neps/c-masked-array.rst b/doc/neps/c-masked-array.rst deleted file mode 100644 index de08c3d3e..000000000 --- a/doc/neps/c-masked-array.rst +++ /dev/null @@ -1,529 +0,0 @@ -:Title: Missing Data Functionality in NumPy -:Author: Mark Wiebe <mwwiebe@gmail.com> -:Date: 2011-06-23 - -***************** -Table of Contents -***************** - -.. contents:: - -******** -Abstract -******** - -Users interested in dealing with missing data within NumPy are generally -pointed to the masked array subclass of the ndarray, generally known -as 'numpy.ma'. This class has a number of users who depend strongly -on its capabilities, but people who are accustomed to the deep integration -of the missing data placeholder "NA" in the R project and others who -find the programming interface challenging or inconsistent tend not -to use it. - -This NEP proposes to integrate a mask-based missing data solution -into NumPy, with an additional NA bit pattern-based missing data solution -that can be implemented concurrently or later which would integrate seamlessly -with the mask-based solution. - -The mask-based solution and the NA bit pattern-based solutions in this -proposal offer the exact same missing value abstraction, with several -differences in performance, memory overhead, and flexibility. - -The mask-based solution is more flexible, supporting all behaviors of the -NA bit pattern-based solution, but leaving the hidden values untouched -whenever an element is masked. - -The NA bit pattern-based solution requires less memory, is bit-level -compatible with the 64-bit floating point representation used in R, but -does not preserve the hidden values and in fact requires stealing at -least one bit pattern from the underlying dtype to represent the missing -value NA. - -Both solutions are generic in the sense that they can be used with -custom data types very easily, with no effort in the case of the masked -solution, and with the requirement that a bit pattern to sacrifice be -chosen in the case of the NA bit pattern solution. - -************************** -Definition of Missing Data -************************** - -Unknown Yet Existing Data -========================= - -In order to be able to develop an intuition about what computation -will be done by various NumPy functions, a consistent conceptual -model of what a missing element means must be applied. The approach -taken in the R project is to define a missing element as something which -does have a valid value, but that value is unknown. This proposal -adopts this behavior as as the default for all operations involving -missing values. - -In this interpretation, nearly any computation with a missing input produces -a missing output. For example, 'sum(a)' would produce a missing value -if 'a' contained just one missing element. When the output value does -not depend on one of the inputs, it is reasonable to output a value -that is not NA, such as logical_and(NA, False) == False. - -Some more complex arithmetic operations, such as matrix products, are -well defined with this interpretation, and the result should be -the same as is the missing values were NaNs. Actually implementing -such things to the theoretical limit is probably not worth it, -and in many cases either raising an exception or returning all -missing values may be preferred to doing precise calculations. -Care must be taken here when dealing with the values and the masks, -to preserve the semantics that masking a value never touches -the element's backing memory. - -Data That Doesn't Exist -======================= - -Another useful interpretation is that the missing elements should be -treated as if they didn't exist in the array, and the operation should -do its best to interpret what that means according to the data -that's left. In this case, 'mean(a)' would compute the mean of just -the values that are unmasked, adjusting both the sum and count it -uses based on the mask. - -This kind of data can arise when conforming sparsely sampled data -into a regular sampling pattern, and is a useful interpretation so -use when attempting to get best-guess answers for many statistical queries. - -In R, many functions take a parameter "na.rm=T" which means to treat -the data as if the NA values are not part of the data set. This proposal -defines a standard parameter "skipmissing=True" for this same purpose. - -Data That Is Being Temporarily Ignored -====================================== - -It can be useful to temporarily treat some array elements as if they -were NA, possibly in many different configurations. This is a common -use case for masks, and the mask-based implementation of missing values -supports this usage by having the strict requirement that the data -storage backing any missing array elements never be touched. - -In general, this can be done by first creating a view, then either adding -a mask if there isn't one yet, or having the view create its own copy of -the mask instead of retaining a view of the original's mask. - -******************************** -Missing Values as Seen in Python -******************************** - -Working With Missing Values -=========================== - -NumPy will gain a global singleton called numpy.NA, similar to None, -but with semantics reflecting its status as a missing value. In particular, -trying to treat it as a boolean will raise an exception, and comparisons -with it will produce numpy.NA instead of True or False. These basics are -adopted from the behavior of the NA value in the R project. - -For example,:: - - >>> np.array([1.0, 2.0, np.NA, 7.0], masked=True) - array([1., 2., NA, 7.], masked=True) - >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]') - array([1., 2., NA, 7.], dtype='NA[<f8]') - -produce arrays with values [1.0, 2.0, <inaccessible>, 7.0] / -mask [Unmasked, Unmasked, Masked, Unmasked], and -values [1.0, 2.0, <NA bit pattern>, 7.0] respectively. - -It may be worth overloading the np.NA __call__ method to accept a dtype, -returning a zero-dimensional array with a missing value of that dtype. -Without doing this, NA printouts would look like:: - - >>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], masked=True)) - array(NA, dtype='float64', masked=True) - >>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')) - array(NA, dtype='NA[<f8]') - -but with this, they could be printed as:: - >>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], masked=True)) - NA('float64') - >>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')) - NA('NA[<f8]') - -Assigning a value to an array always causes that element to not be NA, -transparently unmasking it if necessary.. Assigning numpy.NA to the array -masks that element or assigns the NA bit pattern for the particular dtype. -In the mask-based implementation, the storage behind a missing value may never -be accessed in any way, other than to unmask it by assigning its value. - -While numpy.NA works to mask values, it does not itself have a dtype. -This means that returning the numpy.NA singleton from an operation -like 'arr[0]' would be throwing away the dtype, which is still -valuable to retain, so 'arr[0]' will return a zero-dimensional -array either with its value masked, or containing the NA bit pattern -for the array's dtype. To test if the value is missing, the function -"np.ismissing(arr[0])" will be provided. One of the key reasons for the -NumPy scalars is to allow their values into dictionaries. Having a -missing value as the key in a dictionary is a bad idea, so the NumPy -scalars will not support missing values in any form. - -All operations which write to masked arrays will not affect the value -unless they also unmask that value. This allows the storage behind -masked elements to still be relied on if they are still accessible -from another view which doesn't have them masked. For example:: - - >>> a = np.array([1,2]) - >>> b = a.view() - >>> b.flags.hasmask = True - >>> b - array([1,2], masked=True) - >>> b[0] = np.NA - >>> b - array([NA,2], masked=True) - >>> a - array([1,2]) - >>> # The underlying number 1 value in 'a[0]' was untouched - -Copying values between the mask-based implementation and the -NA bit pattern implementation will transparently do the correct thing, -turning the NA bit pattern into a masked value, or a masked value -into the NA bit pattern where appropriate. The one exception is -if a valid value in a masked array happens to have the NA bit pattern, -copying this value to the NA form of the dtype will cause it to -become NA as well. - -If np.NA or masked values are copied to an array without support for -missing values enabled, an exception will be raised. Adding a mask to -the target array would be problematic, because then having a mask -would be a "viral" property consuming extra memory and reducing -performance in unexpected ways. - -By default, the string "NA" will be used to represent missing values -in str and repr outputs. A global configuration will allow -this to be changed. The array2string function will also gain a -'maskedstr=' parameter so this could be changed to "<missing>" or -other values people may desire. - -For floating point numbers, Inf and NaN are separate concepts from -missing values. If a division by zero occurs in an array with default -missing value support, an unmasked Inf or NaN will be produced. To -mask those values, a further 'a[np.logical_not(a.isfinite(a)] = np.NA' -can achieve that. For the NA bit pattern approach, the parameterized -dtype('NA[f8,InfNan]') described in a later section can be used to get -these semantics without the extra manipulation. - -A manual loop through a masked array like:: - - for i in xrange(len(a)): - a[i] = np.log(a[i]) - -works even with masked values, because 'a[i]' returns a zero-dimensional -array with a missing value instead of the singleton np.NA for the missing -elements. If np.NA was returned, np.log would have to raise an exception -because it doesn't know the log of which dtype it's meant to call, whether -it's a missing float or a missing string, for example. - -Accessing a Boolean Mask -======================== - -The mask used to implement missing data in the masked approach is not -accessible from Python directly. This is partially due to differing -opinions on whether True in the mask should mean "missing" or "not missing" -Additionally, exposing the mask directly would preclude a potential -space optimization, where a bit-level instead of a byte-level mask -is used to get a factor of eight memory usage improvement. - -To access the mask values, there are two functions provided, -'np.ismissing' and 'np.isavail', which test for NA or available values -respectively. These functions work equivalently for masked arrays -and NA bit pattern dtypes. - -Creating Masked Arrays -====================== - -There are two flags which indicate and control the nature of the mask -used in masked arrays. - -First is 'arr.flags.hasmask', which is True for all masked arrays and -may be set to True to add a mask to an array which does not have one. - -Second is 'arr.flags.ownmask', which is True if the array owns the -memory to the mask, and False if the array has no mask, or has a view -into the mask of another array. If this is set to False in a masked -array, the array will create a copy of the mask so that further modifications -to the mask will not affect the array being viewed. - -Mask Implementation Details -=========================== - -The memory ordering of the mask will always match the ordering of -the array it is associated with. A Fortran-style array will have a -Fortran-style mask, etc. - -When a view of an array with a mask is taken, the view will have -a mask which is also a view of the mask in the original -array. This means unmasking values in views will also unmask them -in the original array, and if a mask is added to an array, it will -not be possible to ever remove that mask except to create a new array -copying the data but not the mask. - -It is still possible to temporarily treat an array with a mask without -giving it one, by first creating a view of the array and then adding a -mask to that view. A data set can be viewed with multiple different -masks simultaneously, by creating multiple views, and giving each view -a mask. - -New ndarray Methods -=================== - -New functions added to the numpy namespace are:: - - np.ismissing(arr) - Returns a boolean array with True whereever the array is masked - or matches the NA bit pattern, and False elsewhere - - np.isavail(arr) - Returns a boolean array with False whereever the array is masked - or matches the NA bit pattern, and True elsewhere - -New functions added to the ndarray are:: - - arr.copy(..., replacena=None) - Modification to the copy function which replaces NA values, - either masked or with the NA bit pattern, with the 'replacena=' - parameter suppled. When 'replacena' isn't None, the copied - array is unmasked and has the 'NA' part stripped from the - parameterized type ('NA[f8]' becomes just 'f8'). - - arr.view(masked=True) - This is a shortcut for 'a = arr.view(); a.flags.hasmask=True'. - -Element-wise UFuncs With Missing Values -======================================= - -As part of the implementation, ufuncs and other operations will -have to be extended to support masked computation. Because this -is a useful feature in general, even outside the context of -a masked array, in addition to working with masked arrays ufuncs -will take an optional 'mask=' parameter which allows the use -of boolean arrays to choose where a computation should be done. -This functions similar to a "where" clause on the ufunc.:: - - >>> np.add(a, b, out=b, mask=(a > threshold)) - -A benefit of having this 'mask=' parameter is that it provides a way -to temporarily treat an object with a mask without ever creating a -masked array object. - -If the 'out' parameter isn't specified, use of the 'mask=' parameter -will produce a array with a mask as the result. - -For boolean operations, the R project special cases logical_and and -logical_or so that logical_and(NA, False) is False, and -logical_or(NA, True) is True. On the other hand, 0 * NA isn't 0, but -here the NA could represent Inf or NaN, in which case 0 * the backing -value wouldn't be 0 anyway. - -For NumPy element-wise ufuncs, the design won't support this ability -for the mask of the output to depend simultaneously on the mask and -the value of the inputs. The NumPy 1.6 nditer, however, makes it -fairly easy to write standalone functions which look and feel just -like ufuncs, but deviate from their behavior. The functions logical_and -and logical_or can be moved into standalone function objects which are -backwards compatible with the current ufuncs. - -Reduction UFuncs With Missing Values -==================================== - -Reduction operations like 'sum', 'prod', 'min', and 'max' will operate -consistently with the idea that a masked value exists, but its value -is unknown. - -An optional parameter 'skipna=' will be added to those functions -which can interpret it appropriately to do the operation as if just -the unmasked values existed. - -With 'skipna=True', when all the input values are masked, -'sum' and 'prod' will produce the additive and multiplicative identities -respectively, while 'min' and 'max' will produce masked values. -Statistics operations which require a count, like 'mean' and 'std' -will also use the unmasked value counts for their calculations if -'skipna=True', and produce masked values when all the inputs are masked. - -Some examples:: - - >>> a = np.array([1., 3., np.NA, 7.], masked=True) - >>> np.sum(a) - array(NA, dtype='<f8', masked=True) - >>> np.sum(a, skipna=True) - 11.0 - >>> np.mean(a) - array(NA, dtype='<f8', masked=True) - >>> np.mean(a) - 3.6666666666666665 - >>> a = np.array([np.NA, np.NA], dtype='f8', masked=True) - >>> np.sum(a, skipna=True) - 0.0 - >>> np.max(a, skipna=True) - array(NA, dtype='<f8', masked=True) - -PEP 3118 -======== - -PEP 3118 doesn't have any mask mechanism, so arrays with masks will -not be accessible through this interface. Similarly, it doesn't support -the specification of dtypes with NA bit patterns, so the parameterized NA -dtypes will also not be accessible through this interface. - -If NumPy did allow access through PEP 3118, this would circumvent the -missing value abstraction in a very damaging way. Other libraries would -try to use masked arrays, and silently get access to the data without -also getting access to the mask or being aware of the missing value -abstraction the mask and data together are following. - -Unresolved Design Questions -=========================== - -The existing masked array implementation has a "hardmask" feature, -which prevents values from ever being unmasked by assigning a value. -This would be an internal array flag, named something like -'arr.flags.hardmask'. - -If the hardmask feature is implemented, boolean indexing could -return a hardmasked array instead of a flattened array with the -arbitrary choice of C-ordering as it currently does. While this -improves the abstraction of the array significantly, it is not -a compatible change. - -********************************** -Alternative Designs Without a Mask -********************************** - -Parameterized Data Type With NA Signal Values -============================================= - -A masked array isn't the only way to deal with missing data, and -some systems deal with the problem by defining a special "NA" value, -for data which is missing. This is distinct from NaN floating point -values, which are the result of bad floating point calculation values, -but many people use NaNs for this purpose. - -In the case of IEEE floating point values, it is possible to use a -particular NaN value, of which there are many, for "NA", distinct -from NaN. For signed integers, a reasonable approach would be to use -the minimum storable value, which doesn't have a corresponding positive -value. For unsigned integers, the maximum storage value seems most -reasonable. - -With the goal of providing a general mechanism, a parameterized type -mechanism for this is much more attractive than creating separate -nafloat32, nafloat64, naint64, nauint64, etc dtypes. If this is viewed -as an alternative way of treating the mask except without value preservation, -this parameterized type can work together with the mask in a special -way to produce a value + mask combination on the fly, and use the -exact same computational infrastructure as the masked array system. -This allows one to avoid the need to write special case code for each -ufunc and for each na* dtype, something that is hard to avoid when -building a separate independent dtype implementation for each na* dtype. - -Reliable conversions with the NA bit pattern preserved across primitive -types requires consideration as well. Even in the simple case of -double -> float, where this is supported by hardware, the NA value -will get lost because the NaN payload is typically not preserved. -The ability to have different bit masks specified for the same underlying -type also needs to convert properly. With a well-defined interface -converting to/from a (value,flag) pair, this becomes straightforward -to support generically. - -This approach also provides some opportunities for some subtle variations -with IEEE floats. By default, one exact bit-pattern, a silent NaN with -a payload that won't be generated by hardware floating point operations, -would be used. The choice R has made could be this default. - -Additionally, it might be nice to sometimes treat all NaNs as missing values. -This requires a slightly more complex mapping to convert the floating point -values into mask/value combinations, and converting back would always -produce the default NaN used by NumPy. Finally, treating both NaNs -and Infs as missing values would be just a slight variation of the NaN -version. - -Strings require a slightly different handling, because they -may be any size. One approach is to use a one-character signal consisting -of one of the first 32 ASCII/unicode values. There are many possible values -to use here, like 0x15 'Negative Acknowledgement' or 0x10 'Data Link Escape'. - -The Object dtype has an obvious signal, the np.NA singleton itself. Any -dtype with object semantics won't be able to have this customized, since -specifying bit patterns applies only to plain binary data, not data -with object semantics of construction and destructions. - -Struct dtypes are more of a core primitive dtype, in the same fashion that -this parameterized NA-capable dtype is. It won't be possible to put -these as the parameter for the parameterized NA-dtype. - -The dtype names would be parameterized similar to how the datetime64 -is parameterized by the metadata unit. What name to use may require some -debate, but "NA" seems like a reasonable choice. With the default -missing value bit-pattern, these dtypes would look like -np.dtype('NA[float32]'), np.dtype('NA[f8]'), or np.dtype('NA[i64]'). - -To override the bit pattern that signals a missing value, a raw -value in the format of a hexadecimal unsigned integer can be given, -and in the above special cases for floating point, special strings -can be provided. The defaults for some cases, written explicitly in this -form, are then:: - - np.dtype('NA[?,0x02]') - np.dtype('NA[i4,0x80000000]') - np.dtype('NA[u4,0xffffffff]') - np.dtype('NA[f4,0x7f8007a2') - np.dtype('NA[f8,0x7ff00000000007a2') (R-compatible bitpattern) - np.dtype('NA[S16,0x15]') (using the NAK character as the signal). - - np.dtype('NA[f8,NaN]') (for any NaN) - np.dtype('NA[f8,InfNaN]') (for any NaN or Inf) - -When no parameter is specified a flexible NA dtype is created, which itself -cannot hold values, but will conform to the input types in funcions like -'np.astype'. The dtype 'f8' maps to 'NA[f8]', and [('a', 'f4'), ('b', 'i4')] -maps to [('a', 'NA[f4]'), ('b', 'NA[i4]')]. Thus, to view the memory -of an 'f8' array 'arr' with 'NA[f8]', you can say arr.view(dtype='NA'). - -Parameterized Data Type Which Adds Additional Memory for the NA Flag -==================================================================== - -Another alternative to having a separate mask added to the array is -to introduced a parameterized type, which takes a primitive dtype -as an argument. The dtype "i8" would turn into "maybe[i8]", and -a byte flag would be appended to the dtype to indicate whether the -value was NA or not. - -This approach adds memory overhead greater or equal to keeping a separate -mask, but has better locality. To keep the dtype aligned, an 'i8' would -need to have 16 bytes to retain proper alignment, a 100% overhead compared -to 12.5% overhead for a separately kept mask. - -*************** -Acknowledgments -*************** - -In addition to feedback Travis Oliphant and others at Enthought, -this NEP has been revised based on a great deal of feedback from -the NumPy-Discussion mailing list. The people participating in -the discussion are:: - - Nathaniel Smith - Robert Kern - Charles Harris - Gael Varoquaux - Eric Firing - Keith Goodman - Pierre GM - Christopher Barker - Josef Perktold - Ben Root - Laurent Gautier - Neal Becker - Bruce Southey - Matthew Brett - Wes McKinney - LluĂs - Olivier Delalleau - Alan G Isaac - -I apologize if I missed anyone. |