Diffstat (limited to 'doc/source/reference/arrays.maskna.rst')
-rw-r--r-- doc/source/reference/arrays.maskna.rst | 301
1 files changed, 301 insertions, 0 deletions
diff --git a/doc/source/reference/arrays.maskna.rst b/doc/source/reference/arrays.maskna.rst
new file mode 100644
index 000000000..2faabde83
--- /dev/null
+++ b/doc/source/reference/arrays.maskna.rst
@@ -0,0 +1,301 @@
+.. currentmodule:: numpy
+
+.. _arrays.maskna:
+
+****************
+NA-Masked Arrays
+****************
+
+.. versionadded:: 1.7.0
+
+NumPy 1.7 adds preliminary support for missing values using an interface
+based on an NA (Not Available) placeholder, implemented as masks in the
+core ndarray. This system is highly flexible, allowing NAs to be used
+with any underlying dtype, and supports creating multiple views of the same
+data with different choices of NAs.
+
+Other Missing Data Approaches
+=============================
+
+The previously recommended approach for working with missing values was the
+:mod:`numpy.ma` module, which provides a subclass of ndarray written purely
+in Python. By placing NA-masks directly in the NumPy core, it's possible to
+avoid the need for calling "ma.<func>(arr)" instead of "np.<func>(arr)".
+
+Another approach many people have taken is to use NaN as the
+placeholder for missing values. A few functions, such as
+:func:`numpy.nansum`, behave similarly to using the *skipna* parameter
+of ufunc.reduce.
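+
+As a small illustration of the NaN-based approach, the sketch below compares
+:func:`numpy.sum` with :func:`numpy.nansum` on an array that uses NaN as its
+missing-value marker. The variable names are illustrative only:
+
```python
import numpy as np

# NaN stands in for the missing measurement.
values = np.array([1.0, np.nan, 3.0])

# A plain sum propagates the NaN, much like an NA would propagate.
plain_total = np.sum(values)

# nansum treats NaN as zero, which is comparable to skipping it.
skipping_total = np.nansum(values)

print(plain_total, skipping_total)
```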
+
+As experience with the R language has shown, a programming interface based
+on an NA placeholder is generally more intuitive to work with than direct
+mask manipulation.
+
+Missing Data Model
+==================
+
+The model adopted by NumPy for missing values is that NA is a
+placeholder for a value which is there, but is unknown to computations.
+The value may be temporarily hidden by the mask, or may be unknown
+for any reason, but could be any value the dtype of the array is able
+to hold.
+
+This model affects computations in specific, well-defined ways. Any time
+we have a computation, like *c = NA + 1*, we must reason about whether
+*c* will be an NA or not. The NA is not available now, but maybe a
+measurement will be made later to determine what its value is, so anything
+we calculate must be consistent with it eventually being revealed. One way
+to do this is with thought experiments imagining we have discovered
+the value of this NA. If the NA is 0, then *c* is 1. If the NA is
+100, then *c* is 101. Because the value of *c* is ambiguous, it
+isn't available either, so must be NA as well.
+
+A consequence of separating the NA model from the dtype is that, unlike
+in R, NaNs are not considered to be NA. An NA is a value that is completely
+unknown, whereas a NaN is usually the result of an invalid computation
+as defined in the IEEE 754 floating point arithmetic specification.
+
+Most computations whose input is NA will output NA as well, a property
+known as propagation. Some operations, however, always produce the
+same result no matter what the value of the NA is. The clearest
+example of this is with the logical operations *and* and *or*. Since both
+np.logical_or(True, True) and np.logical_or(False, True) are True,
+all possible boolean values on the left hand side produce the
+same answer. This means that np.logical_or(np.NA, True) can produce
+True instead of the more conservative np.NA. The analogous case for
+np.logical_and is that np.logical_and(np.NA, False) can produce False.
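+
+The reasoning above can be sketched without np.NA by carrying an explicit
+boolean "is NA" array next to each operand. The helper ``na_logical_or``
+below is hypothetical and not part of NumPy; it only illustrates the rule
+that a known True on either side decides the result, under the assumption
+that missing slots hold an arbitrary placeholder value:
+
```python
import numpy as np


def na_logical_or(a, a_is_na, b, b_is_na):
    """Hypothetical helper: logical OR over (value, is_na) pairs."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    a_is_na = np.asarray(a_is_na, dtype=bool)
    b_is_na = np.asarray(b_is_na, dtype=bool)

    # True wherever at least one operand is a known (non-NA) True.
    known_true = (a & ~a_is_na) | (b & ~b_is_na)
    # The result is NA only where an operand is NA and nothing forces True.
    result_is_na = (a_is_na | b_is_na) & ~known_true
    return known_true, result_is_na


# The left operand is NA in both positions; its stored values are placeholders.
values, is_na = na_logical_or([False, False], [True, True],
                              [True, False], [False, False])
print(values, is_na)
```
+
+In this sketch the first element corresponds to "NA or True", which is a
+known True, and the second to "NA or False", which stays NA.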
+
+A similar, but slightly deceptive, example is wanting to treat (NA * 0.0)
+as 0.0 instead of as NA. This is invalid because the NA might be Inf
+or NaN, in which case the result is NaN instead of 0.0. This idea is
+valid for integer dtypes, but NumPy still chooses to return NA because
+checking this special case would adversely affect performance.
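+
+The floating point part of this argument can be checked with ordinary
+arrays. The sketch below multiplies a few candidate values that a float NA
+could stand for by 0.0; the array contents are chosen only for illustration:
+
```python
import numpy as np

# Values a float NA might turn out to be once it is revealed.
candidates = np.array([2.5, np.inf, np.nan])

# Silence the "invalid value" warning NumPy may emit for inf * 0.
with np.errstate(invalid="ignore"):
    products = candidates * 0.0

# Only the finite candidate yields 0.0; inf and nan both yield nan.
print(products)
```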
+
+The NA Object
+=============
+
+In the root numpy namespace, there is a new object NA. Unlike None, this
+is not the only possible instance of an NA, since an NA may have a dtype
+associated with it and has been designed for future expansion to carry a
+multi-NA payload. It can be used in computations like any value::
+
+ >>> np.NA
+ NA
+ >>> np.NA * 3
+ NA(dtype='int64')
+ >>> np.sin(np.NA)
+ NA(dtype='float64')
+
+To check whether a value is NA, use the :func:`numpy.isna` function::
+
+ >>> np.isna(np.NA)
+ True
+ >>> np.isna(1.5)
+ False
+ >>> np.isna(np.nan)
+ False
+ >>> np.isna(np.NA * 3)
+ True
+ >>> (np.NA * 3) is np.NA
+ False
+
+
+Creating NA-Masked Arrays
+=========================
+
+Because having NA support adds some overhead to NumPy arrays, one
+must explicitly request it when creating arrays. There are several ways
+to get an NA-masked array. The easiest way is to include an NA
+value in the list used to construct the array::
+
+ >>> a = np.array([1,3,5])
+ >>> a
+ array([1, 3, 5])
+ >>> a.flags.maskna
+ False
+
+ >>> b = np.array([1,3,np.NA])
+ >>> b
+ array([1, 3, NA])
+ >>> b.flags.maskna
+ True
+
+If one already has an array without an NA-mask, a mask can be added
+by directly setting the *maskna* flag to True. Assigning an NA
+to an array without NA support will raise an error rather than
+automatically creating an NA-mask, with the idea that supporting
+NA should be an explicit user choice::
+
+ >>> a = np.array([1,3,5])
+ >>> a[1] = np.NA
+ Traceback (most recent call last):
+ File "<stdin>", line 1, in <module>
+ ValueError: Cannot assign NA to an array which does not support NAs
+ >>> a.flags.maskna = True
+ >>> a[1] = np.NA
+ >>> a
+ array([1, NA, 5])
+
+Most array construction functions have a new parameter *maskna*, which
+can be set to True to produce an array with an NA-mask::
+
+ >>> np.arange(5., maskna=True)
+ array([ 0., 1., 2., 3., 4.], maskna=True)
+ >>> np.eye(3, maskna=True)
+ array([[ 1., 0., 0.],
+ [ 0., 1., 0.],
+ [ 0., 0., 1.]], maskna=True)
+ >>> np.array([1,3,5], maskna=True)
+ array([1, 3, 5], maskna=True)
+
+Creating NA-Masked Views
+========================
+
+It will sometimes be desirable to view an array with an NA-mask, without
+adding an NA-mask to that array. This is possible by taking an NA-masked
+view of the array. There are two ways to do this, one which simply
+guarantees that the view has an NA-mask, and another which guarantees that the
+view has its own NA-mask, even if the array already had an NA-mask.
+
+Starting with a non-masked array, we can use the :meth:`ndarray.view` method
+to get an NA-masked view::
+
+ >>> a = np.array([1,3,5])
+ >>> b = a.view(maskna=True)
+
+ >>> b[2] = np.NA
+ >>> a
+ array([1, 3, 5])
+ >>> b
+ array([1, 3, NA])
+
+ >>> b[0] = 2
+ >>> a
+ array([2, 3, 5])
+ >>> b
+ array([2, 3, NA])
+
+
+It is important to be cautious here, though, since if the array already
+has a mask, this will also take a view of that mask. This means the original
+array's mask will be affected by assigning NA to the view::
+
+ >>> a = np.array([1,np.NA,5])
+ >>> b = a.view(maskna=True)
+
+ >>> b[2] = np.NA
+ >>> a
+ array([1, NA, NA])
+ >>> b
+ array([1, NA, NA])
+
+ >>> b[1] = 4
+ >>> a
+ array([1, 4, NA])
+ >>> b
+ array([1, 4, NA])
+
+
+To guarantee that the view created has its own NA-mask, there is another
+flag *ownmaskna*. Using this flag will cause a copy of the array's mask
+to be created for the view when the array already has a mask::
+
+ >>> a = np.array([1,np.NA,5])
+ >>> b = a.view(ownmaskna=True)
+
+ >>> b[2] = np.NA
+ >>> a
+ array([1, NA, 5])
+ >>> b
+ array([1, NA, NA])
+
+ >>> b[1] = 4
+ >>> a
+ array([1, NA, 5])
+ >>> b
+ array([1, 4, NA])
+
+
+In general, when an NA-masked view of an array has been taken, any time
+an NA is assigned to an element through the view, the data for that
+element will remain untouched. This mechanism allows for multiple
+temporary views, each containing NAs, of the same original array.
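+
+A loosely comparable effect can be seen with :mod:`numpy.ma`, which is a
+different API from the NA-mask described here. The sketch below assumes only
+that ``np.ma.masked_array`` does not copy its input by default, so the masked
+array shares the data of ``a`` while keeping a separate mask:
+
```python
import numpy as np

a = np.array([1, 3, 5])

# No copy is made by default, so b shares a's data buffer
# while owning a mask that a knows nothing about.
b = np.ma.masked_array(a, mask=[False, False, False])

b[2] = np.ma.masked  # hides the element in b; a's data is untouched
b[0] = 2             # an ordinary assignment writes through to a

print(a)
print(b)
```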
+
+NA-Masked Reductions
+====================
+
+Many of NumPy's reductions like :func:`numpy.sum` and :func:`numpy.std`
+have been extended to work with NA-masked arrays. A consequence of the
+missing value model is that any NA value in an array will cause every
+output that includes that value to become NA::
+
+ >>> a = np.array([[1,2,np.NA,3], [0,np.NA,1,1]])
+ >>> a.sum(axis=0)
+ array([1, NA, NA, 4])
+ >>> a.sum(axis=1)
+ array([NA, NA], dtype=int64)
+
+This is not always the desired result, so NumPy includes a parameter
+*skipna* which causes the NA values to be skipped during computation::
+
+ >>> a = np.array([[1,2,np.NA,3], [0,np.NA,1,1]])
+ >>> a.sum(axis=0, skipna=True)
+ array([1, 2, 1, 4])
+ >>> a.sum(axis=1, skipna=True)
+ array([6, 2])
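+
+For comparison, the same two behaviours can be approximated with
+:mod:`numpy.ma`, whose reductions skip masked elements. This is a sketch of
+the semantics rather than of the NA-mask implementation; the masked slots
+hold a placeholder 0 and the variable names are illustrative:
+
```python
import numpy as np

# Same layout as the example above; masked slots hold a placeholder 0.
a = np.ma.array([[1, 2, 0, 3], [0, 0, 1, 1]],
                mask=[[False, False, True, False],
                      [False, True, False, False]])

# numpy.ma reductions skip masked elements, comparable to skipna=True.
column_sums = a.sum(axis=0)
row_sums = a.sum(axis=1)

# Columns containing a masked element; under NA propagation these are
# the positions that would come out as NA in a.sum(axis=0).
columns_with_missing = np.ma.getmaskarray(a).any(axis=0)

print(column_sums, row_sums, columns_with_missing)
```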
+
+Iterating Over NA-Masked Arrays
+===============================
+
+The :class:`nditer` object can be used to iterate over arrays with
+NA values just like over normal arrays::
+
+ >>> a = np.array([1,3,np.NA])
+ >>> for x in np.nditer(a):
+ ... print x,
+ ...
+ 1 3 NA
+ >>> b = np.zeros(3, maskna=True)
+ >>> for x, y in np.nditer([a,b], op_flags=[['readonly'],
+ ... ['writeonly']]):
+ ... y[...] = -x
+ ...
+ >>> b
+ array([-1., -3., NA])
+
+When using the C-API version of the nditer, one must explicitly
+add the NPY_ITER_USE_MASKNA flag and take care to deal with the NA
+mask appropriately. In the Python interface, this flag is added
+automatically.
+
+Planned Future Additions
+========================
+
+The NA support in 1.7 is fairly preliminary, and is focused on getting
+the basics solid. In particular, this meant refining the C API
+to a level where adding NA support to all of NumPy and to third party
+software using NumPy would be a reasonable task.
+
+The biggest missing feature within the core is supporting NA values with
+structured arrays. The design for this involves a mask slot for each
+field in the structured array, motivated by the fact that many important
+uses of structured arrays involve treating the structured fields like
+another dimension.
+
+Another feature that was discussed during the design process is the ability
+to support more than one NA value. The design created supports this multi-NA
+idea with the addition of a payload to the NA value and to the NA-mask.
+The API has been designed in such a way that adding this feature in a future
+release should be possible without changing existing API functions in any way.
+
+To see a more complete list of what is supported and unsupported in the
+1.7 release of NumPy, please refer to the release notes.
+
+During the design phase of this feature, two implementation approaches
+for NA values were discussed, called "mask" and "bitpattern". What
+has been implemented is the "mask" approach, but the design document,
+or "NEP", describes a way both approaches could coexist
+in NumPy, since each has both pros and cons. This design document is
+available in the file "doc/neps/missing-data.rst" of the NumPy source
+code.