author Mark Wiebe <mwiebe@enthought.com> 2011-06-27 10:09:49 -0500
committer Mark Wiebe <mwiebe@enthought.com> 2011-06-27 10:32:42 -0500
commit 5becc3883dc177d30c83f778aa954630cd3ad551
tree c2d2c5013d9a32b48790a3cfb0475458073fea59 /doc/neps/missing-data.rst
parent b911992478ab40bfe9127d3f0c6604936ca84706
NEP: missing-data: Rename the NEP as it now describes more generality than just masks
Diffstat (limited to 'doc/neps/missing-data.rst')
-rw-r--r-- doc/neps/missing-data.rst 529
1 files changed, 529 insertions, 0 deletions
diff --git a/doc/neps/missing-data.rst b/doc/neps/missing-data.rst
new file mode 100644
index 000000000..de08c3d3e
--- /dev/null
+++ b/doc/neps/missing-data.rst
@@ -0,0 +1,529 @@
+:Title: Missing Data Functionality in NumPy
+:Author: Mark Wiebe <mwwiebe@gmail.com>
+:Date: 2011-06-23
+
+*****************
+Table of Contents
+*****************
+
+.. contents::
+
+********
+Abstract
+********
+
+Users interested in dealing with missing data within NumPy are generally
+pointed to the masked array subclass of the ndarray, commonly known
+as 'numpy.ma'. This class has a number of users who depend strongly
+on its capabilities, but people who are accustomed to the deep integration
+of the missing data placeholder "NA" in the R project and others who
+find the programming interface challenging or inconsistent tend not
+to use it.
+
+This NEP proposes to integrate a mask-based missing data solution
+into NumPy, with an additional NA bit pattern-based missing data solution
+that can be implemented concurrently or later which would integrate seamlessly
+with the mask-based solution.
+
+The mask-based solution and the NA bit pattern-based solutions in this
+proposal offer the exact same missing value abstraction, with several
+differences in performance, memory overhead, and flexibility.
+
+The mask-based solution is more flexible, supporting all behaviors of the
+NA bit pattern-based solution, but leaving the hidden values untouched
+whenever an element is masked.
+
+The NA bit pattern-based solution requires less memory, is bit-level
+compatible with the 64-bit floating point representation used in R, but
+does not preserve the hidden values and in fact requires stealing at
+least one bit pattern from the underlying dtype to represent the missing
+value NA.
+
+Both solutions are generic in the sense that they can be used with
+custom data types very easily, with no effort in the case of the masked
+solution, and with the requirement that a bit pattern to sacrifice be
+chosen in the case of the NA bit pattern solution.
+
+**************************
+Definition of Missing Data
+**************************
+
+Unknown Yet Existing Data
+=========================
+
+In order to be able to develop an intuition about what computation
+will be done by various NumPy functions, a consistent conceptual
+model of what a missing element means must be applied. The approach
+taken in the R project is to define a missing element as something which
+does have a valid value, but that value is unknown. This proposal
+adopts this behavior as the default for all operations involving
+missing values.
+
+In this interpretation, nearly any computation with a missing input produces
+a missing output. For example, 'sum(a)' would produce a missing value
+if 'a' contained just one missing element. When the output value does
+not depend on one of the inputs, it is reasonable to output a value
+that is not NA, such as logical_and(NA, False) == False.
+
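
The propagation rule above can be sketched in plain Python. This is only an illustration: the `_NAType` class, the `NA` object, and the `na_add` and `na_logical_and` helpers are hypothetical stand-ins invented for this sketch, not the proposed `numpy.NA` or any NumPy API.

```python
from functools import reduce


class _NAType:
    """Hypothetical stand-in for a missing-value singleton."""

    def __repr__(self):
        return "NA"

    def __bool__(self):
        # A missing value has no defined truth value.
        raise TypeError("the truth value of NA is undefined")


NA = _NAType()


def na_add(x, y):
    # The sum depends on both inputs, so any missing input makes it missing.
    return NA if (x is NA or y is NA) else x + y


def na_logical_and(x, y):
    # If either input is known to be False, the result is False
    # no matter what the other (possibly unknown) value is.
    if x is False or y is False:
        return False
    if x is NA or y is NA:
        return NA
    return bool(x and y)


total = reduce(na_add, [1.0, 2.0, NA, 7.0])  # one missing element poisons the sum
```

Here `total` is `NA`, while `na_logical_and(NA, False)` is `False` because the output does not depend on the unknown input.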
+Some more complex arithmetic operations, such as matrix products, are
+well defined with this interpretation, and the result should be
+the same as if the missing values were NaNs. Actually implementing
+such things to the theoretical limit is probably not worth it,
+and in many cases either raising an exception or returning all
+missing values may be preferred to doing precise calculations.
+Care must be taken here when dealing with the values and the masks,
+to preserve the semantics that masking a value never touches
+the element's backing memory.
+
+Data That Doesn't Exist
+=======================
+
+Another useful interpretation is that the missing elements should be
+treated as if they didn't exist in the array, and the operation should
+do its best to interpret what that means according to the data
+that's left. In this case, 'mean(a)' would compute the mean of just
+the values that are unmasked, adjusting both the sum and count it
+uses based on the mask.
+
+This kind of data can arise when conforming sparsely sampled data
+into a regular sampling pattern, and is a useful interpretation to
+use when attempting to get best-guess answers for many statistical queries.
+
+In R, many functions take a parameter "na.rm=T" which means to treat
+the data as if the NA values are not part of the data set. This proposal
+defines a standard parameter "skipna=True" for this same purpose.
+
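
The difference between the two interpretations can be approximated in released NumPy by letting NaN stand in for the missing element. This is an assumption made purely for illustration, since the proposal treats NaN and NA as distinct concepts. The sketch uses the existing `np.nansum` and `np.nanmean` functions to mimic the "data that doesn't exist" reading.

```python
import numpy as np

# NaN is used here only as a stand-in for a missing element.
a = np.array([1.0, 3.0, np.nan, 7.0])

# "Unknown yet existing" reading: the missing input propagates to the result.
propagated_mean = np.mean(a)

# "Doesn't exist" reading: both the sum and the count ignore the missing element.
skipped_sum = np.nansum(a)
skipped_mean = np.nanmean(a)
```

With these inputs the skipped mean divides the sum of the three known values by three rather than four, which is the adjustment of both sum and count described above.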
+Data That Is Being Temporarily Ignored
+======================================
+
+It can be useful to temporarily treat some array elements as if they
+were NA, possibly in many different configurations. This is a common
+use case for masks, and the mask-based implementation of missing values
+supports this usage by having the strict requirement that the data
+storage backing any missing array elements never be touched.
+
+In general, this can be done by first creating a view, then either adding
+a mask if there isn't one yet, or having the view create its own copy of
+the mask instead of retaining a view of the original's mask.
+
+********************************
+Missing Values as Seen in Python
+********************************
+
+Working With Missing Values
+===========================
+
+NumPy will gain a global singleton called numpy.NA, similar to None,
+but with semantics reflecting its status as a missing value. In particular,
+trying to treat it as a boolean will raise an exception, and comparisons
+with it will produce numpy.NA instead of True or False. These basics are
+adopted from the behavior of the NA value in the R project.
+
+For example::
+
+ >>> np.array([1.0, 2.0, np.NA, 7.0], masked=True)
+ array([1., 2., NA, 7.], masked=True)
+ >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
+ array([1., 2., NA, 7.], dtype='NA[<f8]')
+
+produce arrays with values [1.0, 2.0, <inaccessible>, 7.0] /
+mask [Unmasked, Unmasked, Masked, Unmasked], and
+values [1.0, 2.0, <NA bit pattern>, 7.0] respectively.
+
+It may be worth overloading the np.NA __call__ method to accept a dtype,
+returning a zero-dimensional array with a missing value of that dtype.
+Without doing this, NA printouts would look like::
+
+ >>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], masked=True))
+ array(NA, dtype='float64', masked=True)
+ >>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]'))
+ array(NA, dtype='NA[<f8]')
+
+but with this, they could be printed as::
+
+ >>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], masked=True))
+ NA('float64')
+ >>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]'))
+ NA('NA[<f8]')
+
+Assigning a value to an array always causes that element to not be NA,
+transparently unmasking it if necessary. Assigning numpy.NA to the array
+masks that element or assigns the NA bit pattern for the particular dtype.
+In the mask-based implementation, the storage behind a missing value may never
+be accessed in any way, other than to unmask it by assigning its value.
+
+While numpy.NA works to mask values, it does not itself have a dtype.
+This means that returning the numpy.NA singleton from an operation
+like 'arr[0]' would be throwing away the dtype, which is still
+valuable to retain, so 'arr[0]' will return a zero-dimensional
+array either with its value masked, or containing the NA bit pattern
+for the array's dtype. To test if the value is missing, the function
+"np.ismissing(arr[0])" will be provided. One of the key reasons for the
+NumPy scalars is to allow their values into dictionaries. Having a
+missing value as the key in a dictionary is a bad idea, so the NumPy
+scalars will not support missing values in any form.
+
+All operations which write to masked arrays will not affect the value
+unless they also unmask that value. This allows the storage behind
+masked elements to still be relied on if they are still accessible
+from another view which doesn't have them masked. For example::
+
+ >>> a = np.array([1,2])
+ >>> b = a.view()
+ >>> b.flags.hasmask = True
+ >>> b
+ array([1,2], masked=True)
+ >>> b[0] = np.NA
+ >>> b
+ array([NA,2], masked=True)
+ >>> a
+ array([1,2])
+ >>> # The underlying number 1 value in 'a[0]' was untouched
+
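
The existing `numpy.ma` module can illustrate the same idea, although its API and semantics differ from the flags proposed here. The following is a sketch under that assumption: it wraps an array without copying, masks one element, and checks that the original buffer is unchanged.

```python
import numpy as np

a = np.array([1, 2])

# masked_array does not copy its input by default, so 'b' shares a's buffer.
b = np.ma.masked_array(a)

# Masking an element only updates the mask; the data behind it is not written.
b[0] = np.ma.masked

shares_buffer = np.shares_memory(a, b.data)
mask_after = np.ma.getmaskarray(b).tolist()
original_after = a.tolist()
```

After this runs, `a` still holds both of its original values, while `b` reports its first element as masked.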
+Copying values between the mask-based implementation and the
+NA bit pattern implementation will transparently do the correct thing,
+turning the NA bit pattern into a masked value, or a masked value
+into the NA bit pattern where appropriate. The one exception is
+if a valid value in a masked array happens to have the NA bit pattern,
+copying this value to the NA form of the dtype will cause it to
+become NA as well.
+
+If np.NA or masked values are copied to an array without support for
+missing values enabled, an exception will be raised. Adding a mask to
+the target array would be problematic, because then having a mask
+would be a "viral" property consuming extra memory and reducing
+performance in unexpected ways.
+
+By default, the string "NA" will be used to represent missing values
+in str and repr outputs. A global configuration will allow
+this to be changed. The array2string function will also gain a
+'maskedstr=' parameter so this could be changed to "<missing>" or
+other values people may desire.
+
+For floating point numbers, Inf and NaN are separate concepts from
+missing values. If a division by zero occurs in an array with default
+missing value support, an unmasked Inf or NaN will be produced. To
+mask those values, a further assignment such as
+'a[np.logical_not(np.isfinite(a))] = np.NA' can be used.
+For the NA bit pattern approach, the parameterized
+dtype('NA[f8,InfNaN]') described in a later section can be used to get
+these semantics without the extra manipulation.
+
+A manual loop through a masked array like::
+
+    for i in xrange(len(a)):
+        a[i] = np.log(a[i])
+
+works even with masked values, because 'a[i]' returns a zero-dimensional
+array with a missing value instead of the singleton np.NA for the missing
+elements. If np.NA was returned, np.log would have to raise an exception
+because it doesn't know the log of which dtype it's meant to call, whether
+it's a missing float or a missing string, for example.
+
+Accessing a Boolean Mask
+========================
+
+The mask used to implement missing data in the masked approach is not
+accessible from Python directly. This is partially due to differing
+opinions on whether True in the mask should mean "missing" or "not missing".
+Additionally, exposing the mask directly would preclude a potential
+space optimization, where a bit-level instead of a byte-level mask
+is used to get a factor of eight memory usage improvement.
+
+To access the mask values, there are two functions provided,
+'np.ismissing' and 'np.isavail', which test for NA or available values
+respectively. These functions work equivalently for masked arrays
+and NA bit pattern dtypes.
+
+Creating Masked Arrays
+======================
+
+There are two flags which indicate and control the nature of the mask
+used in masked arrays.
+
+First is 'arr.flags.hasmask', which is True for all masked arrays and
+may be set to True to add a mask to an array which does not have one.
+
+Second is 'arr.flags.ownmask', which is True if the array owns the
+memory to the mask, and False if the array has no mask, or has a view
+into the mask of another array. If this is set to False in a masked
+array, the array will create a copy of the mask so that further modifications
+to the mask will not affect the array being viewed.
+
+Mask Implementation Details
+===========================
+
+The memory ordering of the mask will always match the ordering of
+the array it is associated with. A Fortran-style array will have a
+Fortran-style mask, etc.
+
+When a view of an array with a mask is taken, the view will have
+a mask which is also a view of the mask in the original
+array. This means unmasking values in views will also unmask them
+in the original array, and if a mask is added to an array, it will
+not be possible to ever remove that mask except to create a new array
+copying the data but not the mask.
+
+It is still possible to temporarily treat an array with a mask without
+giving it one, by first creating a view of the array and then adding a
+mask to that view. A data set can be viewed with multiple different
+masks simultaneously, by creating multiple views, and giving each view
+a mask.
+
+New ndarray Methods
+===================
+
+New functions added to the numpy namespace are::
+
+    np.ismissing(arr)
+        Returns a boolean array with True wherever the array is masked
+        or matches the NA bit pattern, and False elsewhere
+
+    np.isavail(arr)
+        Returns a boolean array with False wherever the array is masked
+        or matches the NA bit pattern, and True elsewhere
+
+New functions added to the ndarray are::
+
+    arr.copy(..., replacena=None)
+        Modification to the copy function which replaces NA values,
+        either masked or with the NA bit pattern, with the 'replacena='
+        parameter supplied. When 'replacena' isn't None, the copied
+        array is unmasked and has the 'NA' part stripped from the
+        parameterized type ('NA[f8]' becomes just 'f8').
+
+    arr.view(masked=True)
+        This is a shortcut for 'a = arr.view(); a.flags.hasmask=True'.
+
+Element-wise UFuncs With Missing Values
+=======================================
+
+As part of the implementation, ufuncs and other operations will
+have to be extended to support masked computation. Because this
+is a useful feature in general, even outside the context of
+a masked array, in addition to working with masked arrays ufuncs
+will take an optional 'mask=' parameter which allows the use
+of boolean arrays to choose where a computation should be done.
+This functions similarly to a "where" clause on the ufunc::
+
+ >>> np.add(a, b, out=b, mask=(a > threshold))
+
+A benefit of having this 'mask=' parameter is that it provides a way
+to temporarily treat an object with a mask without ever creating a
+masked array object.
+
+If the 'out' parameter isn't specified, use of the 'mask=' parameter
+will produce an array with a mask as the result.
+
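
Released versions of NumPy expose this kind of selective computation through a `where=` keyword on ufuncs, not the `mask=` name proposed here. The sketch below uses that existing keyword. The array contents and the `threshold` value are made up for illustration.

```python
import numpy as np

a = np.array([1.0, 5.0, 2.0, 8.0])
b = np.array([10.0, 20.0, 30.0, 40.0])
threshold = 3.0

# Only positions where a > threshold are computed and written into 'b';
# the remaining elements of 'b' keep their previous values.
np.add(a, b, out=b, where=(a > threshold))
```

Passing `out=` matters here: without it, the positions that are skipped would be left uninitialized in a freshly allocated result.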
+For boolean operations, the R project special cases logical_and and
+logical_or so that logical_and(NA, False) is False, and
+logical_or(NA, True) is True. On the other hand, 0 * NA isn't 0, but
+here the NA could represent Inf or NaN, in which case 0 * the backing
+value wouldn't be 0 anyway.
+
+For NumPy element-wise ufuncs, the design won't support this ability
+for the mask of the output to depend simultaneously on the mask and
+the value of the inputs. The NumPy 1.6 nditer, however, makes it
+fairly easy to write standalone functions which look and feel just
+like ufuncs, but deviate from their behavior. The functions logical_and
+and logical_or can be moved into standalone function objects which are
+backwards compatible with the current ufuncs.
+
+Reduction UFuncs With Missing Values
+====================================
+
+Reduction operations like 'sum', 'prod', 'min', and 'max' will operate
+consistently with the idea that a masked value exists, but its value
+is unknown.
+
+An optional parameter 'skipna=' will be added to those functions
+which can interpret it appropriately to do the operation as if just
+the unmasked values existed.
+
+With 'skipna=True', when all the input values are masked,
+'sum' and 'prod' will produce the additive and multiplicative identities
+respectively, while 'min' and 'max' will produce masked values.
+Statistics operations which require a count, like 'mean' and 'std'
+will also use the unmasked value counts for their calculations if
+'skipna=True', and produce masked values when all the inputs are masked.
+
+Some examples::
+
+ >>> a = np.array([1., 3., np.NA, 7.], masked=True)
+ >>> np.sum(a)
+ array(NA, dtype='<f8', masked=True)
+ >>> np.sum(a, skipna=True)
+ 11.0
+ >>> np.mean(a)
+ array(NA, dtype='<f8', masked=True)
+ >>> np.mean(a, skipna=True)
+ 3.6666666666666665
+ >>> a = np.array([np.NA, np.NA], dtype='f8', masked=True)
+ >>> np.sum(a, skipna=True)
+ 0.0
+ >>> np.max(a, skipna=True)
+ array(NA, dtype='<f8', masked=True)
+
+PEP 3118
+========
+
+PEP 3118 doesn't have any mask mechanism, so arrays with masks will
+not be accessible through this interface. Similarly, it doesn't support
+the specification of dtypes with NA bit patterns, so the parameterized NA
+dtypes will also not be accessible through this interface.
+
+If NumPy did allow access through PEP 3118, this would circumvent the
+missing value abstraction in a very damaging way. Other libraries would
+try to use masked arrays, and silently get access to the data without
+also getting access to the mask or being aware of the missing value
+abstraction the mask and data together are following.
+
+Unresolved Design Questions
+===========================
+
+The existing masked array implementation has a "hardmask" feature,
+which prevents values from ever being unmasked by assigning a value.
+This would be an internal array flag, named something like
+'arr.flags.hardmask'.
+
+If the hardmask feature is implemented, boolean indexing could
+return a hardmasked array instead of a flattened array with the
+arbitrary choice of C-ordering as it currently does. While this
+improves the abstraction of the array significantly, it is not
+a compatible change.
+
+**********************************
+Alternative Designs Without a Mask
+**********************************
+
+Parameterized Data Type With NA Signal Values
+=============================================
+
+A masked array isn't the only way to deal with missing data, and
+some systems deal with the problem by defining a special "NA" value,
+for data which is missing. This is distinct from NaN floating point
+values, which are the result of invalid floating point calculations,
+but many people use NaNs for this purpose.
+
+In the case of IEEE floating point values, it is possible to use a
+particular NaN value, of which there are many, for "NA", distinct
+from NaN. For signed integers, a reasonable approach would be to use
+the minimum storable value, which doesn't have a corresponding positive
+value. For unsigned integers, the maximum storable value seems most
+reasonable.
+
+With the goal of providing a general mechanism, a parameterized type
+mechanism for this is much more attractive than creating separate
+nafloat32, nafloat64, naint64, nauint64, etc dtypes. If this is viewed
+as an alternative way of treating the mask except without value preservation,
+this parameterized type can work together with the mask in a special
+way to produce a value + mask combination on the fly, and use the
+exact same computational infrastructure as the masked array system.
+This allows one to avoid the need to write special case code for each
+ufunc and for each na* dtype, something that is hard to avoid when
+building a separate independent dtype implementation for each na* dtype.
+
+Reliable conversions with the NA bit pattern preserved across primitive
+types require consideration as well. Even in the simple case of
+double -> float, where this is supported by hardware, the NA value
+will get lost because the NaN payload is typically not preserved.
+The ability to have different bit masks specified for the same underlying
+type also needs to convert properly. With a well-defined interface
+converting to/from a (value,flag) pair, this becomes straightforward
+to support generically.
+
+This approach also provides some opportunities for some subtle variations
+with IEEE floats. By default, one exact bit-pattern, a silent NaN with
+a payload that won't be generated by hardware floating point operations,
+would be used. The choice R has made could be this default.
+
+Additionally, it might be nice to sometimes treat all NaNs as missing values.
+This requires a slightly more complex mapping to convert the floating point
+values into mask/value combinations, and converting back would always
+produce the default NaN used by NumPy. Finally, treating both NaNs
+and Infs as missing values would be just a slight variation of the NaN
+version.
+
+Strings require a slightly different handling, because they
+may be any size. One approach is to use a one-character signal consisting
+of one of the first 32 ASCII/unicode values. There are many possible values
+to use here, like 0x15 'Negative Acknowledgement' or 0x10 'Data Link Escape'.
+
+The Object dtype has an obvious signal, the np.NA singleton itself. Any
+dtype with object semantics won't be able to have this customized, since
+specifying bit patterns applies only to plain binary data, not data
+with object semantics of construction and destruction.
+
+Struct dtypes are more of a core primitive dtype, in the same fashion that
+this parameterized NA-capable dtype is. It won't be possible to put
+these as the parameter for the parameterized NA-dtype.
+
+The dtype names would be parameterized similar to how the datetime64
+is parameterized by the metadata unit. What name to use may require some
+debate, but "NA" seems like a reasonable choice. With the default
+missing value bit-pattern, these dtypes would look like
+np.dtype('NA[float32]'), np.dtype('NA[f8]'), or np.dtype('NA[i8]').
+
+To override the bit pattern that signals a missing value, a raw
+value in the format of a hexadecimal unsigned integer can be given,
+and in the above special cases for floating point, special strings
+can be provided. The defaults for some cases, written explicitly in this
+form, are then::
+
+ np.dtype('NA[?,0x02]')
+ np.dtype('NA[i4,0x80000000]')
+ np.dtype('NA[u4,0xffffffff]')
+ np.dtype('NA[f4,0x7f8007a2]')
+ np.dtype('NA[f8,0x7ff00000000007a2]') (R-compatible bitpattern)
+ np.dtype('NA[S16,0x15]') (using the NAK character as the signal).
+
+ np.dtype('NA[f8,NaN]') (for any NaN)
+ np.dtype('NA[f8,InfNaN]') (for any NaN or Inf)
+
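
The default bit patterns listed above can be inspected with ordinary NumPy, without any NA dtype support. This is a minimal sketch: it reinterprets the R-compatible 64-bit pattern as a float to confirm it is one particular NaN, and reads the integer sentinels from `np.iinfo`. The variable names are illustrative only.

```python
import numpy as np

# Reinterpret the R-compatible bit pattern as a 64-bit float.
bits = np.array([0x7FF00000000007A2], dtype=np.uint64)
r_na = bits.view(np.float64)

# It is a NaN as far as IEEE arithmetic is concerned ...
is_nan = bool(np.isnan(r_na[0]))

# ... distinguished only by the payload stored in the 52 mantissa bits.
payload = int(bits[0] & np.uint64(0x000FFFFFFFFFFFFF))

# Integer sentinels: the minimum signed value and the maximum unsigned value.
int32_na = np.iinfo(np.int32).min
uint32_na = np.iinfo(np.uint32).max
```

The payload comes out as 0x7a2, and the two integer sentinels correspond to the 0x80000000 and 0xffffffff patterns given for 'i4' and 'u4'.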
+When no parameter is specified a flexible NA dtype is created, which itself
+cannot hold values, but will conform to the input types in functions like
+'np.astype'. The dtype 'f8' maps to 'NA[f8]', and [('a', 'f4'), ('b', 'i4')]
+maps to [('a', 'NA[f4]'), ('b', 'NA[i4]')]. Thus, to view the memory
+of an 'f8' array 'arr' with 'NA[f8]', you can say arr.view(dtype='NA').
+
+Parameterized Data Type Which Adds Additional Memory for the NA Flag
+====================================================================
+
+Another alternative to having a separate mask added to the array is
+to introduce a parameterized type, which takes a primitive dtype
+as an argument. The dtype "i8" would turn into "maybe[i8]", and
+a byte flag would be appended to the dtype to indicate whether the
+value was NA or not.
+
+This approach adds memory overhead greater than or equal to keeping a separate
+mask, but has better locality. To keep the dtype aligned, an 'i8' would
+need to have 16 bytes to retain proper alignment, a 100% overhead compared
+to 12.5% overhead for a separately kept mask.
+
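
The overhead figures above can be checked with an aligned structured dtype as a stand-in for the hypothetical "maybe[i8]". The field names `value` and `isna` are invented for this sketch, and the 16-byte result assumes a platform where 64-bit integers are 8-byte aligned.

```python
import numpy as np

# An 8-byte value followed by a 1-byte flag, padded to keep alignment.
maybe_i8 = np.dtype([("value", "i8"), ("isna", "?")], align=True)

value_size = np.dtype("i8").itemsize
flag_size = np.dtype("?").itemsize

# Flag stored inline: the padding inflates every element.
inline_overhead = maybe_i8.itemsize / value_size - 1.0

# Flag stored in a separate byte mask: one extra byte per element.
separate_overhead = flag_size / value_size
```

On such a platform the inline layout doubles the storage per element, while a separate byte mask adds one eighth.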
+***************
+Acknowledgments
+***************
+
+In addition to feedback from Travis Oliphant and others at Enthought,
+this NEP has been revised based on a great deal of feedback from
+the NumPy-Discussion mailing list. The people participating in
+the discussion are::
+
+ Nathaniel Smith
+ Robert Kern
+ Charles Harris
+ Gael Varoquaux
+ Eric Firing
+ Keith Goodman
+ Pierre GM
+ Christopher Barker
+ Josef Perktold
+ Ben Root
+ Laurent Gautier
+ Neal Becker
+ Bruce Southey
+ Matthew Brett
+ Wes McKinney
+ Lluís
+ Olivier Delalleau
+ Alan G Isaac
+
+I apologize if I missed anyone.