author | Mark Wiebe <mwiebe@enthought.com> | 2011-06-24 19:13:30 -0500 |
---|---|---|
committer | Mark Wiebe <mwiebe@enthought.com> | 2011-06-27 10:32:42 -0500 |
commit | 63c9d8744e3129dd7ddbc070ac14e6d2f24ea7e7 (patch) | |
tree | edfbe7e8d092069506dc0f3407d5cd37f704a60e /doc/neps | |
parent | ff71c887a7a28ffb8b228e2377e7c4c7a63538e3 (diff) | |
download | numpy-63c9d8744e3129dd7ddbc070ac14e6d2f24ea7e7.tar.gz |
NEP: c-masked-array: Add a global np.NA singleton, many more changes
Diffstat (limited to 'doc/neps')
-rw-r--r-- | doc/neps/c-masked-array.rst | 336 |
1 files changed, 210 insertions, 126 deletions
diff --git a/doc/neps/c-masked-array.rst b/doc/neps/c-masked-array.rst
index 8ec3f5e15..743adc23c 100644
--- a/doc/neps/c-masked-array.rst
+++ b/doc/neps/c-masked-array.rst
@@ -14,14 +14,9 @@ Abstract

 The existing masked array functionality in NumPy is useful for many
 people, however it has a number of issues that prevent it from being
-the preferred solution in many cases. By implementing mask functionality
-into the core ndarray object, all the current issues with the system
-can be resolved in a high performance and flexible manner.
-
-One key problem is a lack of orthogonality with other features, for
-instance creating a masked array with physical quantities can't be
-done because both are separate subclasses of ndarray. The only reasonable
-way to deal with this is to move the mask into the core ndarray.
+the preferred solution in some important cases. By implementing mask
+functionality into the core ndarray object, all the current issues
+with the system can be resolved in a high performance and flexible manner.

 The integration with ufuncs and other numpy core functions like sum is weak.
 This could be dealt with either through a better function overloading
@@ -35,36 +30,6 @@ shouldn't, so the ufunc error handling mechanism can't be relied on.
 While no comprehensive benchmarks appear to exist, poor performance is
 sometimes cited as a problem as well.

-***********************************************
-Possible Alternative: Data Types With NA Values
-***********************************************
-
-A masked array isn't the only way to deal with missing data, and
-some systems deal with the problem by defining a special "NA" value,
-for data which is missing. This is distinct from NaN floating point
-values, which are the result of bad floating point calculation values.
-
-In the case of IEEE floating point values, it is possible to use a
-particular NaN value, of which there are many, for "NA", distinct
-from NaN. For integers, a reasonable approach would be to use
-the minimum storable value, which doesn't have a corresponding positive
-value, so is perhaps reasonable to dispense with in most contexts.
-
-The trouble with this approach is that it requires a large amount
-of special case code in each data type, and writing a new data type
-supporting missing data requires defining a mechanism for a special
-signal value which may not be possible in general.
-
-The masked array approach, on the other hand, works with all data types
-in a uniform fashion, adding the cost of one byte per value storage
-for the mask. The attractiveness of being able to define a new custom
-data type for NumPy and have it automatically work with missing values
-is one of the reasons the masked approach has been chosen over special
-signal values.
-
-Implementing masks as described in this NEP does not preclude also
-creating data types with special "NA" values.
-
 *************************
 Definition of Masked Data
 *************************
@@ -108,96 +73,121 @@ This approach is useful when working with messy data and the analysis
 being done is trying to produce the best result that's possible from
 the data that is available.

+In R, many functions take a parameter "na.rm=T" which means to treat
+the data as if the NA values are not part of the data set.
+
 **************************
 The Mask as Seen in Python
 **************************

-The 'mask' Property
-===================
-
-The array object will get a new property 'mask', which behaves very
-similar to a boolean array. When this property isn't None, it
-has a shape exactly matching the array's shape, and for struct dtypes,
-has a matching dtype with every type in the struct replaced with bool.
-
-The mask value is True for values that exist in the array, and False
-for values that do not. This is the same convention used in most places
-masks are used, for instance for image masks specifying which are valid
-pixels and which are transparent. This is the reverse of the convention
-in the current masked array subclass, but I think fixing this is worth
-the trouble for the long term benefit.
-
-When an array has no mask, as indicated by the 'mask' property being
-None, a mask may be added by assigning a boolean array broadcastable
-to the shape of the array. If the array already has a mask, this
-operation will raise an exception unless the single value False is
-being assigned, which will mask all the elements. The &= operator,
-however will be allowed, as it can only cause unmasked values to become
-masked.
-
-The memory ordering of the mask will always match the ordering of
-the array it is associated with. A Fortran-style array will have a
-Fortran-style mask, etc.
-
-When a view of an array with a mask is taken, the view will have a mask
-which is also a view of the mask in the original array. This means unmasking
-values in views will also unmask them in the original array, and if
-a mask is added to an array, it will not be possible to ever remove that
-mask except to create a new array copying the data but not the mask.
-
-It is still possible to temporarily treat an array with a mask without
-giving it one, by first creating a view of the array and then adding a
-mask to that view.
-
 Working With Masked Values
 ==========================

-Assigning a value to the array always unmasks that element. There is
-no interface to "unmask" elements except through assigning values.
-The storage behind a masked value may never be accessed in any way,
-other than to unmask it by assigning a value. If a masked view of
-an array is taken, for instance, and another masked array is copied
-over it, any values which stay masked will not have their underlying
-value modified.
-
-If masked values are copied to an array without a mask, an exception will
-be raised. Adding a mask to the target array would be problematic, because
-then having a mask would be a "viral" property consuming extra memory
-and reducing performance in unexpected ways. To assign a value would require
-a default value, which is something that should be explicitly stated,
-so a function like "a.assign_from_masked(b, maskedvalue=3.0)" needs to
-be created.
-
-Except for object arrays, the None value will be used to represent
-missing values in repr and str representations, except array2string
-will gain a 'maskedstr=' parameter so this could be changed to "NA" or
+NumPy will gain a global singleton called numpy.NA, similar to None,
+but with semantics reflecting its status as a missing value. In particular,
+trying to treat it as a boolean will raise an exception, and comparisons
+with it will produce numpy.NA instead of True or False. These basics are
+adopted from the behavior of the NA value in the R project.
+
+Assigning a value to the array always unmasks that element. Assigning
+numpy.NA to the array masks that element. The storage behind a masked
+value may never be accessed in any way, other than to unmask it by
+assigning a value.
+
+Because numpy.NA is a global singleton, it will be possible to test
+whether a value is masked by saying "arr[0] is np.NA".
+
+All operations which write to masked arrays will not affect the value
+unless they also unmask that value. This allows the storage behind
+masked elements to still be relied on if they are still accessible
+from another view which doesn't have them masked.
+
+If np.NA or masked values are copied to an array without a mask, an
+exception will be raised. Adding a validitymask to the target array
+would be problematic, because then having a mask would be a "viral"
+property consuming extra memory and reducing performance in unexpected
+ways. To assign a value would require a default value, which is
+something that should be explicitly stated, so a function like
+"a.assign_from_masked(b, maskedvalue=3.0)" needs to be created.
+
+By default, the string "NA" will be used to represent masked values
+in str and repr outputs. A global default configuration will allow
+this to be changed. The array2string function will also gain a
+'maskedstr=' parameter so this could be changed to "NA" or
 other values people may desire. For example,::

-    >>>np.array([1.0, 2.0, None, 7.0], masked=True)
+    >>>np.array([1.0, 2.0, np.NA, 7.0], masked=True)

 will produce an array with values [1.0, 2.0, <inaccessible>, 7.0], and
-mask [True, True, False, True].
+validitymask [True, True, False, True].

 For floating point numbers, Inf and NaN are separate concepts from
 missing values. If a division by zero occurs, an unmasked Inf or NaN will
-be produced. To mask those values, a further "a.mask &= np.isfinite(a)"
+be produced. To mask those values, a further 'a.validitymask &= np.isfinite(a)'
 can achieve that.

-Scalars will not be modified to have a mask, so this leaves two options
-for what value should be returned when retrieving a single masked value.
-Either 'None', or a zero-dimensional masked array. Based on discussion,
-'None' is unacceptable, for two reasons. First, masked values should
-always compare as False, similar to NaNs. If None, is returned, they
-will compare as True. Second, a manual loop through a masked array
-like::
+A manual loop through a masked array like::

     for i in xrange(len(a)):
         a[i] = np.log(a[i])

-would raise an error, because you can't call 'np.log' on None. With
-a zero-dimensional masked array, having them compare False is easy,
-and calling 'np.log' on a masked value will produce a masked value,
-since 'np.log' is a regular ufunc.
+should work, something that is a little bit tricky because the global
+singleton np.NA has no type, and doesn't follow the type promotion rules.
+A good approach to deal with this needs to be found.
+
+The 'validitymask' Property
+===========================
+
+The array object will get a new property 'validitymask', which behaves very
+similar to a boolean array. When this property isn't None, it
+has a shape exactly matching the array's shape, and for struct dtypes,
+has a matching dtype with every type in the struct replaced with bool.
+
+The reason for calling it 'validitymask' instead of just 'mask' or something
+shorter is that this object is not intended to be the primary way to work
+with masked values. It provides an interface for working with the mask,
+but primarily the mask will be changed transparently based on manipulating
+values and using the global singleton 'numpy.NA'.
+
+The validitymask value is True for values that exist in the array, and False
+for values that do not. This is the same convention used in most places
+masks are used, for instance for image masks specifying which are valid
+pixels and which are transparent. This is the reverse of the convention
+in the current masked array subclass, but I think changing this is worth
+the trouble for the long term benefit.
+
+When an array has no mask, as indicated by the 'arr.flags.hasmask'
+property being False, a mask may be added either by assigning True to
+'arr.flags.hasmask', or assigning a boolean array to 'arr.validitymask'.
+If the array already has a validitymask, this operation will raise an
+exception unless the single value False is being assigned, which will
+mask all the elements. The &= operator, however will be allowed, as
+it can only cause unmasked values to become masked.
+
+The memory ordering of the validitymask will always match the ordering of
+the array it is associated with. A Fortran-style array will have a
+Fortran-style validitymask, etc.
+
+When a view of an array with a validitymask is taken, the view will have
+a validitymask which is also a view of the validitymask in the original
+array. This means unmasking values in views will also unmask them
+in the original array, and if a mask is added to an array, it will
+not be possible to ever remove that mask except to create a new array
+copying the data but not the mask.
+
+It is still possible to temporarily treat an array with a mask without
+giving it one, by first creating a view of the array and then adding a
+mask to that view. A data set can be viewed with multiple different
+masks simultaneously, by creating multiple views, and giving each view
+a mask.
+
+When a validitymask gets added, the array to which it was added owns
+the validitymask. This is indicated by the 'arr.flags.ownmask' flag.
+When a view of an array with a validity mask is taken, the view does
+not own its validitymask. In this case, it is possible to assign
+'arr.flags.ownmask = True', which gives 'arr' its own copy of the
+validitymask it is using, allowing it to be changed without affecting
+the mask of the array being viewed.

 New ndarray Methods
 ===================
@@ -243,18 +233,41 @@ masked array object.
 If the 'out' parameter isn't specified, use of the 'mask=' parameter
 will produce a array with a mask as the result.

-Reduction operations like 'sum', 'prod', 'min', and 'max' will operate as
-if the values were like NaN, producing masked values if any of their
-input values were masked.
+For boolean operations, the R project special cases logical_and and
+logical_or so that logical_and(NA, False) is False, and
+logical_or(NA, True) is True. On the other hand, 0 * NA isn't 0, but
+here the NA could represent Inf or NaN, in which case 0 * the backing
+value wouldn't be 0 anyway.
+
+For NumPy element-wise ufuncs, the design won't support this ability
+for the mask of the output to depend simultaneously on the mask and
+the value of the inputs. The NumPy 1.6 nditer, however, makes it
+fairly easy to write standalone functions which look and feel just
+like ufuncs, but deviate from their behavior. The functions logical_and
+and logical_or can be moved into standalone function objects which are
+backwards compatible with the current ufuncs.
+
+Masked Reduction UFuncs
+=======================
+
+Reduction operations like 'sum', 'prod', 'min', and 'max' will operate
+consistently with the idea that a masked value exists, but its value
+is unknown.
+
+An optional parameter 'skipna=False' will be added to those functions
+which can interpret it appropriately to do the operation as if just
+the unmasked values existed. When all the input values are masked,
+'sum' and 'prod' will produce the additive and multiplicative identities
+respectively, while 'min' and 'max' will produce masked values. With
+this parameter enabled, statistics operations which require a count,
+like 'mean' and 'std' will also use the unmasked value counts for
+their calculations, and produce masked values when all the inputs are masked.
+
+PEP 3118
+========

-An optional parameter to change the interpretation of masked values
-is also needed, to do the operation as if just the unmasked values existed.
-When all the input values are masked, 'sum' and 'prod' will produce
-the additive and multiplicative identities respectively, while 'min'
-and 'max' will produce masked values. With this parameter enabled,
-statistics operations which require a count, like 'mean' and 'std'
-will also use the unmasked value counts for their calculations, and
-produce masked values when all the inputs are masked.
+PEP 3118 doesn't have any mask mechanism, so arrays with masks will
+not be accessible through this interface.

 Unresolved Design Questions
 ===========================
@@ -271,10 +284,81 @@ arbitrary choice of C-ordering as it currently does.
 While this improves the abstraction of the array significantly, it is
 not a compatible change.

-There is some consternation about the conventional True/False
-interpretation of the mask, centered around the name "mask". One
-possibility to deal with this is to call it a "validity mask" in
-all documentation, which more clearly indicates that True means
-valid data. If this isn't sufficient, an alternate name for the
-attribute could be found, like "a.validitymask", "a.validmask",
-or "a.validity".
+****************************
+Possible Alternative Designs
+****************************
+
+Data Types With NA Signal Values
+================================
+
+A masked array isn't the only way to deal with missing data, and
+some systems deal with the problem by defining a special "NA" value,
+for data which is missing. This is distinct from NaN floating point
+values, which are the result of bad floating point calculation values.
+
+In the case of IEEE floating point values, it is possible to use a
+particular NaN value, of which there are many, for "NA", distinct
+from NaN. For integers, a reasonable approach would be to use
+the minimum storable value, which doesn't have a corresponding positive
+value, so is perhaps reasonable to dispense with in most contexts.
+
+The trouble with this approach is that it requires a large amount
+of special case code in each data type, and writing a new data type
+supporting missing data requires defining a mechanism for a special
+signal value which may not be possible in general. This causes the
+missing value logic to be replicated many times, something that can be
+error-prone. This is also a lot more code for all the various ufuncs
+than a general masked mechanism which can use the unmasked loop for
+a default implementation.
+
+The masked array approach, on the other hand, works with all data types
+in a uniform fashion, adding the cost of one byte per value storage
+for the mask. The attractiveness of being able to define a new custom
+data type for NumPy and have it automatically work with missing values
+is one of the reasons the masked approach has been chosen over special
+signal values.
+
+Implementing masks as described in this NEP does not preclude also
+creating data types with special "NA" values.
+
+Parameterized Type Which Adds an NA Flag
+========================================
+
+Another alternative to having a separate mask added to the array is
+to introduced a parameterized type, which takes a primitive dtype
+as an argument. The dtype "i8" would turn into "maybe[i8]", and
+a byte flag would be appended to the dtype to indicate whether the
+value was NA or not.
+
+This approach adds memory overhead greater or equal to keeping a separate
+mask, but has better locality. To keep the dtype aligned, an 'i8' would
+need to have 16 bytes to retain proper alignment, a 100% overhead compared
+to 12.5% overhead for a separately kept mask.
+
+***************
+Acknowledgments
+***************
+
+In addition to feedback Travis Oliphant and others at Enthought,
+this NEP has been revised based on a great deal of feedback from
+the NumPy-Discussion mailing list. The people participating in
+the discussion are:
+
+Nathaniel Smith
+Robert Kern
+Charles Harris
+Gael Varoquaux
+Eric Firing
+Keith Goodman
+Pierre GM
+Christopher Barker
+Josef Perktold
+Benjamin Root
+Laurent Gautier
+Neal Becker
+Bruce Southey
+Matthew Brett
+Wes McKinney
+Lluís
+
+I apologize if I missed anyone.
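
The overhead figures quoted in the "Parameterized Type Which Adds an NA Flag" section of the patch (100% for an inline flag versus 12.5% for a separate mask) can be checked with a small sketch. It is not part of the commit. The `MaybeInt64` structure below is a hypothetical stand-in for the proposed "maybe[i8]" layout, built with the standard `ctypes` module so that the C compiler's alignment rules apply. On platforms where a 64-bit integer is 8-byte aligned the structure is padded to 16 bytes; other ABIs may pad differently.

```python
import ctypes


class MaybeInt64(ctypes.Structure):
    # Hypothetical "maybe[i8]" element: the value followed by a one-byte NA flag.
    # This layout is an assumption for illustration, not an existing NumPy dtype.
    _fields_ = [("value", ctypes.c_int64), ("is_na", ctypes.c_uint8)]


value_size = ctypes.sizeof(ctypes.c_int64)    # bytes for the bare value
flagged_size = ctypes.sizeof(MaybeInt64)      # value + flag + alignment padding
mask_byte = ctypes.sizeof(ctypes.c_uint8)     # one mask byte per element

# Extra storage relative to the bare value, as a fraction.
inline_overhead = (flagged_size - value_size) / value_size
mask_overhead = mask_byte / value_size

print("inline flag: %d bytes per element, overhead %.1f%%"
      % (flagged_size, 100 * inline_overhead))
print("separate mask: %d bytes per element, overhead %.1f%%"
      % (value_size + mask_byte, 100 * mask_overhead))
```

The separate mask always costs one byte per eight-byte value, while the inline flag pays for padding up to the next alignment boundary, which is the trade-off against locality that the patch describes.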