:Title: Missing Data Functionality in NumPy :Author: Mark Wiebe :Date: 2011-06-23 ***************** Table of Contents ***************** .. contents:: ******** Abstract ******** Users interested in dealing with missing data within NumPy are generally pointed to the masked array subclass of the ndarray, generally known as 'numpy.ma'. This class has a number of users who depend strongly on its capabilities, but people who are accustomed to the deep integration of the missing data placeholder "NA" in the R project and others who find the programming interface challenging or inconsistent tend not to use it. This NEP proposes to integrate a mask-based missing data solution into NumPy, with an additional NA bit pattern-based missing data solution that can be implemented concurrently or later which would integrate seamlessly with the mask-based solution. The mask-based solution and the NA bit pattern-based solutions in this proposal offer the exact same missing value abstraction, with several differences in performance, memory overhead, and flexibility. The mask-based solution is more flexible, supporting all behaviors of the NA bit pattern-based solution, but leaving the hidden values untouched whenever an element is masked. The NA bit pattern-based solution requires less memory, is bit-level compatible with the 64-bit floating point representation used in R, but does not preserve the hidden values and in fact requires stealing at least one bit pattern from the underlying dtype to represent the missing value NA. Both solutions are generic in the sense that they can be used with custom data types very easily, with no effort in the case of the masked solution, and with the requirement that a bit pattern to sacrifice be chosen in the case of the NA bit pattern solution. ************************** Definition of Missing Data ************************** Unknown Yet Existing Data ========================= In order to be able to develop an intuition about what computation will be done by various NumPy functions, a consistent conceptual model of what a missing element means must be applied. The approach taken in the R project is to define a missing element as something which does have a valid value, but that value is unknown. This proposal adopts this behavior as as the default for all operations involving missing values. In this interpretation, nearly any computation with a missing input produces a missing output. For example, 'sum(a)' would produce a missing value if 'a' contained just one missing element. When the output value does not depend on one of the inputs, it is reasonable to output a value that is not NA, such as logical_and(NA, False) == False. Some more complex arithmetic operations, such as matrix products, are well defined with this interpretation, and the result should be the same as is the missing values were NaNs. Actually implementing such things to the theoretical limit is probably not worth it, and in many cases either raising an exception or returning all missing values may be preferred to doing precise calculations. Care must be taken here when dealing with the values and the masks, to preserve the semantics that masking a value never touches the element's backing memory. Data That Doesn't Exist ======================= Another useful interpretation is that the missing elements should be treated as if they didn't exist in the array, and the operation should do its best to interpret what that means according to the data that's left. In this case, 'mean(a)' would compute the mean of just the values that are unmasked, adjusting both the sum and count it uses based on the mask. This kind of data can arise when conforming sparsely sampled data into a regular sampling pattern, and is a useful interpretation so use when attempting to get best-guess answers for many statistical queries. In R, many functions take a parameter "na.rm=T" which means to treat the data as if the NA values are not part of the data set. This proposal defines a standard parameter "skipmissing=True" for this same purpose. Data That Is Being Temporarily Ignored ====================================== It can be useful to temporarily treat some array elements as if they were NA, possibly in many different configurations. This is a common use case for masks, and the mask-based implementation of missing values supports this usage by having the strict requirement that the data storage backing any missing array elements never be touched. In general, this can be done by first creating a view, then either adding a mask if there isn't one yet, or having the view create its own copy of the mask instead of retaining a view of the original's mask. ******************************** Missing Values as Seen in Python ******************************** Working With Missing Values =========================== NumPy will gain a global singleton called numpy.NA, similar to None, but with semantics reflecting its status as a missing value. In particular, trying to treat it as a boolean will raise an exception, and comparisons with it will produce numpy.NA instead of True or False. These basics are adopted from the behavior of the NA value in the R project. For example,:: >>> np.array([1.0, 2.0, np.NA, 7.0], masked=True) array([1., 2., NA, 7.], masked=True) >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]') array([1., 2., NA, 7.], dtype='NA[, 7.0] / mask [Unmasked, Unmasked, Masked, Unmasked], and values [1.0, 2.0, , 7.0] respectively. It may be worth overloading the np.NA __call__ method to accept a dtype, returning a zero-dimensional array with a missing value of that dtype. Without doing this, NA printouts would look like:: >>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], masked=True)) array(NA, dtype='float64', masked=True) >>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')) array(NA, dtype='NA[>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], masked=True)) NA('float64') >>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')) NA('NA[>> a = np.array([1,2]) >>> b = a.view() >>> b.flags.hasmask = True >>> b array([1,2], masked=True) >>> b[0] = np.NA >>> b array([NA,2], masked=True) >>> a array([1,2]) >>> # The underlying number 1 value in 'a[0]' was untouched Copying values between the mask-based implementation and the NA bit pattern implementation will transparently do the correct thing, turning the NA bit pattern into a masked value, or a masked value into the NA bit pattern where appropriate. The one exception is if a valid value in a masked array happens to have the NA bit pattern, copying this value to the NA form of the dtype will cause it to become NA as well. If np.NA or masked values are copied to an array without support for missing values enabled, an exception will be raised. Adding a mask to the target array would be problematic, because then having a mask would be a "viral" property consuming extra memory and reducing performance in unexpected ways. By default, the string "NA" will be used to represent missing values in str and repr outputs. A global configuration will allow this to be changed. The array2string function will also gain a 'maskedstr=' parameter so this could be changed to "" or other values people may desire. For floating point numbers, Inf and NaN are separate concepts from missing values. If a division by zero occurs in an array with default missing value support, an unmasked Inf or NaN will be produced. To mask those values, a further 'a[np.logical_not(a.isfinite(a)] = np.NA' can achieve that. For the NA bit pattern approach, the parameterized dtype('NA[f8,InfNan]') described in a later section can be used to get these semantics without the extra manipulation. A manual loop through a masked array like:: for i in xrange(len(a)): a[i] = np.log(a[i]) works even with masked values, because 'a[i]' returns a zero-dimensional array with a missing value instead of the singleton np.NA for the missing elements. If np.NA was returned, np.log would have to raise an exception because it doesn't know the log of which dtype it's meant to call, whether it's a missing float or a missing string, for example. Accessing a Boolean Mask ======================== The mask used to implement missing data in the masked approach is not accessible from Python directly. This is partially due to differing opinions on whether True in the mask should mean "missing" or "not missing" Additionally, exposing the mask directly would preclude a potential space optimization, where a bit-level instead of a byte-level mask is used to get a factor of eight memory usage improvement. To access the mask values, there are two functions provided, 'np.ismissing' and 'np.isavail', which test for NA or available values respectively. These functions work equivalently for masked arrays and NA bit pattern dtypes. Creating Masked Arrays ====================== There are two flags which indicate and control the nature of the mask used in masked arrays. First is 'arr.flags.hasmask', which is True for all masked arrays and may be set to True to add a mask to an array which does not have one. Second is 'arr.flags.ownmask', which is True if the array owns the memory to the mask, and False if the array has no mask, or has a view into the mask of another array. If this is set to False in a masked array, the array will create a copy of the mask so that further modifications to the mask will not affect the array being viewed. Mask Implementation Details =========================== The memory ordering of the mask will always match the ordering of the array it is associated with. A Fortran-style array will have a Fortran-style mask, etc. When a view of an array with a mask is taken, the view will have a mask which is also a view of the mask in the original array. This means unmasking values in views will also unmask them in the original array, and if a mask is added to an array, it will not be possible to ever remove that mask except to create a new array copying the data but not the mask. It is still possible to temporarily treat an array with a mask without giving it one, by first creating a view of the array and then adding a mask to that view. A data set can be viewed with multiple different masks simultaneously, by creating multiple views, and giving each view a mask. New ndarray Methods =================== New functions added to the numpy namespace are:: np.ismissing(arr) Returns a boolean array with True whereever the array is masked or matches the NA bit pattern, and False elsewhere np.isavail(arr) Returns a boolean array with False whereever the array is masked or matches the NA bit pattern, and True elsewhere New functions added to the ndarray are:: arr.copy(..., replacena=None) Modification to the copy function which replaces NA values, either masked or with the NA bit pattern, with the 'replacena=' parameter suppled. When 'replacena' isn't None, the copied array is unmasked and has the 'NA' part stripped from the parameterized type ('NA[f8]' becomes just 'f8'). arr.view(masked=True) This is a shortcut for 'a = arr.view(); a.flags.hasmask=True'. Element-wise UFuncs With Missing Values ======================================= As part of the implementation, ufuncs and other operations will have to be extended to support masked computation. Because this is a useful feature in general, even outside the context of a masked array, in addition to working with masked arrays ufuncs will take an optional 'mask=' parameter which allows the use of boolean arrays to choose where a computation should be done. This functions similar to a "where" clause on the ufunc.:: >>> np.add(a, b, out=b, mask=(a > threshold)) A benefit of having this 'mask=' parameter is that it provides a way to temporarily treat an object with a mask without ever creating a masked array object. If the 'out' parameter isn't specified, use of the 'mask=' parameter will produce an array with a mask as the result. For boolean operations, the R project special cases logical_and and logical_or so that logical_and(NA, False) is False, and logical_or(NA, True) is True. On the other hand, 0 * NA isn't 0, but here the NA could represent Inf or NaN, in which case 0 * the backing value wouldn't be 0 anyway. For NumPy element-wise ufuncs, the design won't support this ability for the mask of the output to depend simultaneously on the mask and the value of the inputs. The NumPy 1.6 nditer, however, makes it fairly easy to write standalone functions which look and feel just like ufuncs, but deviate from their behavior. The functions logical_and and logical_or can be moved into standalone function objects which are backwards compatible with the current ufuncs. Reduction UFuncs With Missing Values ==================================== Reduction operations like 'sum', 'prod', 'min', and 'max' will operate consistently with the idea that a masked value exists, but its value is unknown. An optional parameter 'skipna=' will be added to those functions which can interpret it appropriately to do the operation as if just the unmasked values existed. With 'skipna=True', when all the input values are masked, 'sum' and 'prod' will produce the additive and multiplicative identities respectively, while 'min' and 'max' will produce masked values. Statistics operations which require a count, like 'mean' and 'std' will also use the unmasked value counts for their calculations if 'skipna=True', and produce masked values when all the inputs are masked. Some examples:: >>> a = np.array([1., 3., np.NA, 7.], masked=True) >>> np.sum(a) array(NA, dtype='>> np.sum(a, skipna=True) 11.0 >>> np.mean(a) NA('>> np.mean(a, skipna=True) 3.6666666666666665 >>> a = np.array([np.NA, np.NA], dtype='f8', masked=True) >>> np.sum(a, skipna=True) 0.0 >>> np.max(a, skipna=True) array(NA, dtype='>> np.mean(a) NA('>> np.mean(a, skipna=True) /home/mwiebe/virtualenvs/dev/lib/python2.7/site-packages/numpy/core/fromnumeric.py:2374: RuntimeWarning: invalid value encountered in double_scalars return mean(axis, dtype, out) nan PEP 3118 ======== PEP 3118 doesn't have any mask mechanism, so arrays with masks will not be accessible through this interface. Similarly, it doesn't support the specification of dtypes with NA bit patterns, so the parameterized NA dtypes will also not be accessible through this interface. If NumPy did allow access through PEP 3118, this would circumvent the missing value abstraction in a very damaging way. Other libraries would try to use masked arrays, and silently get access to the data without also getting access to the mask or being aware of the missing value abstraction the mask and data together are following. Unresolved Design Questions =========================== The existing masked array implementation has a "hardmask" feature, which prevents values from ever being unmasked by assigning a value. This would be an internal array flag, named something like 'arr.flags.hardmask'. If the hardmask feature is implemented, boolean indexing could return a hardmasked array instead of a flattened array with the arbitrary choice of C-ordering as it currently does. While this improves the abstraction of the array significantly, it is not a compatible change. ********************************** Alternative Designs Without a Mask ********************************** Parameterized Data Type With NA Signal Values ============================================= A masked array isn't the only way to deal with missing data, and some systems deal with the problem by defining a special "NA" value, for data which is missing. This is distinct from NaN floating point values, which are the result of bad floating point calculation values, but many people use NaNs for this purpose. In the case of IEEE floating point values, it is possible to use a particular NaN value, of which there are many, for "NA", distinct from NaN. For signed integers, a reasonable approach would be to use the minimum storable value, which doesn't have a corresponding positive value. For unsigned integers, the maximum storage value seems most reasonable. With the goal of providing a general mechanism, a parameterized type mechanism for this is much more attractive than creating separate nafloat32, nafloat64, naint64, nauint64, etc dtypes. If this is viewed as an alternative way of treating the mask except without value preservation, this parameterized type can work together with the mask in a special way to produce a value + mask combination on the fly, and use the exact same computational infrastructure as the masked array system. This allows one to avoid the need to write special case code for each ufunc and for each na* dtype, something that is hard to avoid when building a separate independent dtype implementation for each na* dtype. Reliable conversions with the NA bit pattern preserved across primitive types requires consideration as well. Even in the simple case of double -> float, where this is supported by hardware, the NA value will get lost because the NaN payload is typically not preserved. The ability to have different bit masks specified for the same underlying type also needs to convert properly. With a well-defined interface converting to/from a (value,flag) pair, this becomes straightforward to support generically. This approach also provides some opportunities for some subtle variations with IEEE floats. By default, one exact bit-pattern, a silent NaN with a payload that won't be generated by hardware floating point operations, would be used. The choice R has made could be this default. Additionally, it might be nice to sometimes treat all NaNs as missing values. This requires a slightly more complex mapping to convert the floating point values into mask/value combinations, and converting back would always produce the default NaN used by NumPy. Finally, treating both NaNs and Infs as missing values would be just a slight variation of the NaN version. Strings require a slightly different handling, because they may be any size. One approach is to use a one-character signal consisting of one of the first 32 ASCII/unicode values. There are many possible values to use here, like 0x15 'Negative Acknowledgement' or 0x10 'Data Link Escape'. The Object dtype has an obvious signal, the np.NA singleton itself. Any dtype with object semantics won't be able to have this customized, since specifying bit patterns applies only to plain binary data, not data with object semantics of construction and destructions. Struct dtypes are more of a core primitive dtype, in the same fashion that this parameterized NA-capable dtype is. It won't be possible to put these as the parameter for the parameterized NA-dtype. The dtype names would be parameterized similar to how the datetime64 is parameterized by the metadata unit. What name to use may require some debate, but "NA" seems like a reasonable choice. With the default missing value bit-pattern, these dtypes would look like np.dtype('NA[float32]'), np.dtype('NA[f8]'), or np.dtype('NA[i64]'). To override the bit pattern that signals a missing value, a raw value in the format of a hexadecimal unsigned integer can be given, and in the above special cases for floating point, special strings can be provided. The defaults for some cases, written explicitly in this form, are then:: np.dtype('NA[?,0x02]') np.dtype('NA[i4,0x80000000]') np.dtype('NA[u4,0xffffffff]') np.dtype('NA[f4,0x7f8007a2') np.dtype('NA[f8,0x7ff00000000007a2') (R-compatible bitpattern) np.dtype('NA[S16,0x15]') (using the NAK character as the signal). np.dtype('NA[f8,NaN]') (for any NaN) np.dtype('NA[f8,InfNaN]') (for any NaN or Inf) When no parameter is specified a flexible NA dtype is created, which itself cannot hold values, but will conform to the input types in functions like 'np.astype'. The dtype 'f8' maps to 'NA[f8]', and [('a', 'f4'), ('b', 'i4')] maps to [('a', 'NA[f4]'), ('b', 'NA[i4]')]. Thus, to view the memory of an 'f8' array 'arr' with 'NA[f8]', you can say arr.view(dtype='NA'). Parameterized Data Type Which Adds Additional Memory for the NA Flag ==================================================================== Another alternative to having a separate mask added to the array is to introduced a parameterized type, which takes a primitive dtype as an argument. The dtype "i8" would turn into "maybe[i8]", and a byte flag would be appended to the dtype to indicate whether the value was NA or not. This approach adds memory overhead greater or equal to keeping a separate mask, but has better locality. To keep the dtype aligned, an 'i8' would need to have 16 bytes to retain proper alignment, a 100% overhead compared to 12.5% overhead for a separately kept mask. *************** Acknowledgments *************** In addition to feedback Travis Oliphant and others at Enthought, this NEP has been revised based on a great deal of feedback from the NumPy-Discussion mailing list. The people participating in the discussion are:: Nathaniel Smith Robert Kern Charles Harris Gael Varoquaux Eric Firing Keith Goodman Pierre GM Christopher Barker Josef Perktold Ben Root Laurent Gautier Neal Becker Bruce Southey Matthew Brett Wes McKinney LluĂ­s Olivier Delalleau Alan G Isaac E. Antero Tammi I apologize if I missed anyone.