NEP: Design document for adding a mask to the core ndarray

author: Mark Wiebe <mwiebe@enthought.com> 2011-06-23 12:54:52 -0500
committer: Mark Wiebe <mwiebe@enthought.com> 2011-06-27 10:32:41 -0500
commit: 9ec55573493ddd06da6c030fc95b0139a4c3b28e (patch)
tree: a4ffb24279aca598145e861e77f9a8e26266aac1 /doc
parent: 79a1db5a12b6310cc91294a92ecc0b8b6e6745f2 (diff)
download: numpy-9ec55573493ddd06da6c030fc95b0139a4c3b28e.tar.gz
1 files changed, 177 insertions, 0 deletions
diff --git a/doc/neps/c-masked-array.rst b/doc/neps/c-masked-array.rst
new file mode 100644
index 000000000..68149d004
--- /dev/null
+++ b/doc/neps/c-masked-array.rst
@@ -0,0 +1,177 @@
+:Title: Masked Array Functionality in C
+:Author: Mark Wiebe <mwwiebe@gmail.com>
+:Date: 2011-06-23
+
+*****************
+Table of Contents
+*****************
+
+.. contents::
+
+********
+Abstract
+********
+
+The existing masked array functionality in NumPy is useful for many
+people; however, it has a number of issues that prevent it from being
+the preferred solution in many cases. By implementing mask functionality
+in the core ndarray object, all the current issues with the system
+can be resolved in a high performance and flexible manner.
+
+One key problem is a lack of orthogonality with other features. For
+instance, creating a masked array with physical quantities can't be
+done because both are separate subclasses of ndarray. The only reasonable
+way to deal with this is to move the mask into the core ndarray.
+
+The integration with ufuncs and other numpy core functions like sum is weak.
+This could be dealt with either through a better function overloading
+mechanism or moving the mask into the core ndarray.
+
+In the current masked array, calculations are done for the whole array,
+then masks are patched up afterwards. This means that invalid calculations
+sitting in masked elements can raise warnings or exceptions even though they
+shouldn't, so the ufunc error handling mechanism can't be relied on.
+
+While no comprehensive benchmarks appear to exist, poor performance is
+sometimes cited as a problem as well.
+
+***********************************************
+Possible Alternative: Data Types With NA Values
+***********************************************
+
+A masked array isn't the only way to deal with missing data, and
+some systems deal with the problem by defining a special "NA" value
+for data which is missing. This is distinct from NaN floating point
+values, which are the result of invalid floating point calculations.
+
+In the case of IEEE floating point values, it is possible to use a
+particular NaN value, of which there are many, for "NA", distinct
+from NaN. For integers, a reasonable approach would be to use
+the minimum storable value, which doesn't have a corresponding positive
+value, so it is perhaps reasonable to dispense with it in most contexts.
+
+The trouble with this approach is that it requires a large amount
+of special case code in each data type, and writing a new data type
+supporting missing data requires defining a mechanism for a special
+signal value which may not be possible in general.
+
+The masked array approach, on the other hand, works with all data types
+in a uniform fashion, adding the cost of one byte per value storage
+for the mask. The attractiveness of being able to define a new custom
+data type for NumPy and have it automatically work with missing values
+is one of the reasons the masked approach has been chosen over special
+signal values.
+
+**************************
+The Mask as Seen in Python
+**************************
+
+The 'mask' Property
+===================
+
+The array object will get a new property 'mask', which behaves very
+similarly to a boolean array. When this property isn't None, it
+has a shape exactly matching the array's shape, and for struct dtypes,
+has a matching dtype with every type in the struct replaced with bool.
+
+The mask value is True for values that exist in the array, and False
+for values that do not. This is the same convention used in most places
+masks are used, for instance for image masks specifying which are valid
+pixels and which are transparent. This is the reverse of the convention
+in the current masked array subclass, but I think fixing this is worth
+the trouble for the long term benefit.
+
+When an array has no mask, as indicated by the 'mask' property being
+None, a mask may be added by assigning a boolean array broadcastable
+to the shape of the array. If the array already has a mask, this
+operation will raise an exception unless the single value False is
+being assigned, which will mask all the elements. The &= operator,
+however, will be allowed, as it can only cause unmasked values to become
+masked.
+
+The memory ordering of the mask will always match the ordering of
+the array it is associated with. A Fortran-style array will have a
+Fortran-style mask, etc.
+
+When a view of an array with a mask is taken, the view will have a mask
+which is also a view of the mask in the original array. This means unmasking
+values in views will also unmask them in the original array, and if
+a mask is added to an array, it will not be possible to ever remove that
+mask except to create a new array copying the data but not the mask.
+
+Working With Masked Values
+==========================
+
+Assigning a value to the array always unmasks that element. There is
+no interface to "unmask" elements except through assigning values.
+The storage behind a masked value may never be accessed in any way,
+other than to unmask it by assigning a value. If a masked view of
+an array is taken, for instance, and another masked array is copied
+over it, any values which stay masked will not have their underlying
+value modified.
+
+If masked values are copied to an array without a mask, an exception will
+be raised. Adding a mask to the target array would be problematic, because
+then having a mask would be a "viral" property consuming extra memory
+and reducing performance in unexpected ways. To assign a value would require
+a default value, which is something that should be explicitly stated,
+so a function like "a.assign_from_masked(b, fillvalue=3.0)" needs to
+be created.
+
+Except for object arrays, the None value will be used to represent
+missing values in repr and str representations. In addition, array2string
+will gain a 'maskedstr=' parameter so this could be changed to "NA" or
+other values people may desire. For example::
+
+    >>> np.array([1.0, 2.0, None, 7.0], masked=True)
+
+will produce an array with values [1.0, 2.0, <inaccessible>, 7.0], and
+mask [True, True, False, True].
+
+For floating point numbers, Inf and NaN are separate concepts from
+missing values. If a division by zero occurs, an unmasked Inf or NaN will
+be produced. To mask those values, a further "a.mask &= np.isfinite(a)"
+can be applied.
+
+New ndarray Methods
+===================
+
+In addition to the 'mask' property, the ndarray needs several new
+methods to easily work with masked values. The proposed methods for
+an np.array *a* are::
+
+    a.assign_from_masked(b, fillvalue, casting='same_kind')
+        This is equivalent to a[...] = b, with the provided 'fillvalue'
+        being substituted wherever there is missing data. This is
+        intended for use when 'a' has no mask, but 'b' does.
+
+    a.fill_masked(value)
+        This is exactly like a.fill(value), but only modifies the
+        masked elements of 'a'. All values of 'a' become unmasked.
+
+    a.fill_unmasked(value)
+        This is exactly like a.fill(value), but only modifies the
+        unmasked elements of 'a'. The mask remains unchanged.
+
+    a.copy_filled(fillvalue, order='K', ...)
+        Exactly like a.copy(), except always produces an array
+        without a mask and uses 'fillvalue' for any masked values.
+
+Unresolved Design Questions
+===========================
+
+Scalars will not be modified to have a mask, so this leaves two options
+for what value should be returned when retrieving a single masked value:
+either 'None', or a one-dimensional masked array. The former follows
+the convention of returning an immutable value from such accesses,
+while the latter preserves type information, so the correct choice
+will require some discussion to resolve.
+
+The existing masked array implementation has a "hardmask" feature,
+which freezes the mask. Boolean indexing could, for instance, return
+a hardmasked array instead of a flattened array with the arbitrary
+choice of C-ordering as it currently is. This could be an internal
+array flag, with a.mask.harden() and a.mask.soften() performing the
+functions of a.harden_mask() and a.soften_mask() in the current masked
+array. There would also be an a.mask.ishard property.
+
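Note on the proposed semantics: the 'mask' property and the new methods described in the NEP above are a design proposal, so they cannot be called directly. The sketch below only emulates a few of the described behaviours with ordinary NumPy arrays: the True-means-valid convention (the inverse of numpy.ma), the "a.mask &= np.isfinite(a)" idiom, and a stand-in for a.copy_filled(fillvalue). The helper name copy_filled and the separate data/mask arrays are assumptions made for illustration, not part of any NumPy API.

```python
import numpy as np

# Hypothetical stand-ins: data and mask are kept as two ordinary arrays
# because the proposed ndarray 'mask' property is not available.
data = np.array([1.0, 0.0, 5.0, 7.0])
mask = np.array([True, True, False, True])  # True = value exists (NEP convention)

# The existing numpy.ma subclass uses the opposite convention (True = masked),
# so the NEP-style mask has to be inverted when building a MaskedArray.
legacy = np.ma.MaskedArray(data, mask=~mask)

# Division by zero produces an unmasked inf; masking it is a separate step.
with np.errstate(divide="ignore"):
    quotient = 1.0 / data

# Emulates "a.mask &= np.isfinite(a)": &= can only turn True into False,
# so it can mask more values but never unmask any.
mask_q = mask.copy()
mask_q &= np.isfinite(quotient)


def copy_filled(values, valid, fillvalue):
    """Emulate the proposed a.copy_filled(fillvalue) using np.where."""
    return np.where(valid, values, fillvalue)


filled = copy_filled(data, mask, 3.0)
```

With these stand-ins, filled agrees with legacy.filled(3.0) from the current masked array subclass, which suggests the two conventions differ only by the inversion of the mask.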