:Title: Missing Data Functionality in NumPy
:Author: Mark Wiebe <mwwiebe@gmail.com>
:Date: 2011-06-23

*****************
Table of Contents
*****************

.. contents::

********
Abstract
********

Users interested in dealing with missing data within NumPy are generally
pointed to the masked array subclass of the ndarray, generally known
as 'numpy.ma'. This class has a number of users who depend strongly
on its capabilities, but people who are accustomed to the deep integration
of the missing data placeholder "NA" in the R project and others who
find the programming interface challenging or inconsistent tend not
to use it.

This NEP proposes to integrate a mask-based missing data solution
into NumPy, with an additional NA bit pattern-based missing data solution
that can be implemented  concurrently or later which would integrate seamlessly
with the mask-based solution.

The mask-based solution and the NA bit pattern-based solutions in this
proposal offer the exact same missing value abstraction, with several
differences in performance, memory overhead, and flexibility.

The mask-based solution is more flexible, supporting all behaviors of the
NA bit pattern-based solution, but leaving the hidden values untouched
whenever an element is masked.

The NA bit pattern-based solution requires less memory, is bit-level
compatible with the 64-bit floating point representation used in R, but
does not preserve the hidden values and in fact requires stealing at
least one bit pattern from the underlying dtype to represent the missing
value NA.

Both solutions are generic in the sense that they can be used with
custom data types very easily, with no effort in the case of the masked
solution, and with the requirement that a bit pattern to sacrifice be
chosen in the case of the NA bit pattern solution.

**************************
Definition of Missing Data
**************************

Unknown Yet Existing Data
=========================

In order to be able to develop an intuition about what computation
will be done by various NumPy functions, a consistent conceptual
model of what a missing element means must be applied. The approach
taken in the R project is to define a missing element as something which
does have a valid value, but that value is unknown. This proposal
adopts this behavior as as the default for all operations involving
missing values.

In this interpretation, nearly any computation with a missing input produces
a missing output. For example, 'sum(a)' would produce a missing value
if 'a' contained just one missing element. When the output value does
not depend on one of the inputs, it is reasonable to output a value
that is not NA, such as logical_and(NA, False) == False.

Some more complex arithmetic operations, such as matrix products, are
well defined with this interpretation, and the result should be
the same as is the missing values were NaNs. Actually implementing
such things to the theoretical limit is probably not worth it,
and in many cases either raising an exception or returning all
missing values may be preferred to doing precise calculations.
Care must be taken here when dealing with the values and the masks,
to preserve the semantics that masking a value never touches
the element's backing memory.

Data That Doesn't Exist
=======================

Another useful interpretation is that the missing elements should be
treated as if they didn't exist in the array, and the operation should
do its best to interpret what that means according to the data
that's left. In this case, 'mean(a)' would compute the mean of just
the values that are unmasked, adjusting both the sum and count it
uses based on the mask.

This kind of data can arise when conforming sparsely sampled data
into a regular sampling pattern, and is a useful interpretation so 
use when attempting to get best-guess answers for many statistical queries.

In R, many functions take a parameter "na.rm=T" which means to treat
the data as if the NA values are not part of the data set. This proposal
defines a standard parameter "skipmissing=True" for this same purpose. 

Data That Is Being Temporarily Ignored
======================================

It can be useful to temporarily treat some array elements as if they
were NA, possibly in many different configurations. This is a common
use case for masks, and the mask-based implementation of missing values
supports this usage by having the strict requirement that the data
storage backing any missing array elements never be touched.

In general, this can be done by first creating a view, then either adding
a mask if there isn't one yet, or having the view create its own copy of
the mask instead of retaining a view of the original's mask.

********************************
Missing Values as Seen in Python
********************************

Working With Missing Values
===========================

NumPy will gain a global singleton called numpy.NA, similar to None,
but with semantics reflecting its status as a missing value. In particular,
trying to treat it as a boolean will raise an exception, and comparisons
with it will produce numpy.NA instead of True or False. These basics are
adopted from the behavior of the NA value in the R project.

For example,::

    >>> np.array([1.0, 2.0, np.NA, 7.0], masked=True)
    array([1., 2., NA, 7.], masked=True)
    >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
    array([1., 2., NA, 7.], dtype='NA[<f8]')

produce arrays with values [1.0, 2.0, <inaccessible>, 7.0] /
mask [Unmasked, Unmasked, Masked, Unmasked], and
values [1.0, 2.0, <NA bit pattern>, 7.0] respectively.

It may be worth overloading the np.NA __call__ method to accept a dtype,
returning a zero-dimensional array with a missing value of that dtype.
Without doing this, NA printouts would look like::

    >>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], masked=True))
    array(NA, dtype='float64', masked=True)
    >>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]'))
    array(NA, dtype='NA[<f8]')

but with this, they could be printed as::

    >>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], masked=True))
    NA('float64')
    >>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]'))
    NA('NA[<f8]')

Assigning a value to an array always causes that element to not be NA,
transparently unmasking it if necessary. Assigning numpy.NA to the array
masks that element or assigns the NA bit pattern for the particular dtype.
In the mask-based implementation, the storage behind a missing value may never
be accessed in any way, other than to unmask it by assigning its value.

While numpy.NA works to mask values, it does not itself have a dtype.
This means that returning the numpy.NA singleton from an operation
like 'arr[0]' would be throwing away the dtype, which is still
valuable to retain, so 'arr[0]' will return a zero-dimensional
array either with its value masked, or containing the NA bit pattern
for the array's dtype. To test if the value is missing, the function
"np.ismissing(arr[0])" will be provided. One of the key reasons for the
NumPy scalars is to allow their values into dictionaries. Having a
missing value as the key in a dictionary is a bad idea, so the NumPy
scalars will not support missing values in any form.

All operations which write to masked arrays will not affect the value
unless they also unmask that value. This allows the storage behind
masked elements to still be relied on if they are still accessible
from another view which doesn't have them masked. For example::

    >>> a = np.array([1,2])
    >>> b = a.view()
    >>> b.flags.hasmask = True
    >>> b
    array([1,2], masked=True)
    >>> b[0] = np.NA
    >>> b
    array([NA,2], masked=True)
    >>> a
    array([1,2])
    >>> # The underlying number 1 value in 'a[0]' was untouched

Copying values between the mask-based implementation and the
NA bit pattern implementation will transparently do the correct thing,
turning the NA bit pattern into a masked value, or a masked value
into the NA bit pattern where appropriate. The one exception is
if a valid value in a masked array happens to have the NA bit pattern,
copying this value to the NA form of the dtype will cause it to
become NA as well.

If np.NA or masked values are copied to an array without support for
missing values enabled, an exception will be raised. Adding a mask to
the target array would be problematic, because then having a mask
would be a "viral" property consuming extra memory and reducing
performance in unexpected ways.

By default, the string "NA" will be used to represent missing values
in str and repr outputs. A global configuration will allow
this to be changed. The array2string function will also gain a
'maskedstr=' parameter so this could be changed to "<missing>" or
other values people may desire.

For floating point numbers, Inf and NaN are separate concepts from
missing values. If a division by zero occurs in an array with default
missing value support, an unmasked Inf or NaN will be produced. To
mask those values, a further 'a[np.logical_not(a.isfinite(a)] = np.NA'
can achieve that. For the NA bit pattern approach, the parameterized
dtype('NA[f8,InfNan]') described in a later section can be used to get
these semantics without the extra manipulation.

A manual loop through a masked array like::

    for i in xrange(len(a)):
        a[i] = np.log(a[i])

works even with masked values, because 'a[i]' returns a zero-dimensional
array with a missing value instead of the singleton np.NA for the missing
elements. If np.NA was returned, np.log would have to raise an exception
because it doesn't know the log of which dtype it's meant to call, whether
it's a missing float or a missing string, for example.

Accessing a Boolean Mask
========================

The mask used to implement missing data in the masked approach is not
accessible from Python directly. This is partially due to differing
opinions on whether True in the mask should mean "missing" or "not missing"
Additionally, exposing the mask directly would preclude a potential
space optimization, where a bit-level instead of a byte-level mask
is used to get a factor of eight memory usage improvement.

To access the mask values, there are two functions provided,
'np.ismissing' and 'np.isavail', which test for NA or available values
respectively. These functions work equivalently for masked arrays
and NA bit pattern dtypes.

Creating Masked Arrays
======================

There are two flags which indicate and control the nature of the mask
used in masked arrays.

First is 'arr.flags.hasmask', which is True for all masked arrays and
may be set to True to add a mask to an array which does not have one.

Second is 'arr.flags.ownmask', which is True if the array owns the
memory to the mask, and False if the array has no mask, or has a view
into the mask of another array. If this is set to False in a masked
array, the array will create a copy of the mask so that further modifications
to the mask will not affect the array being viewed.

Mask Implementation Details
===========================

The memory ordering of the mask will always match the ordering of
the array it is associated with. A Fortran-style array will have a
Fortran-style mask, etc.

When a view of an array with a mask is taken, the view will have
a mask which is also a view of the mask in the original
array. This means unmasking values in views will also unmask them
in the original array, and if a mask is added to an array, it will
not be possible to ever remove that mask except to create a new array
copying the data but not the mask.

It is still possible to temporarily treat an array with a mask without
giving it one, by first creating a view of the array and then adding a
mask to that view. A data set can be viewed with multiple different
masks simultaneously, by creating multiple views, and giving each view
a mask.

New ndarray Methods
===================

New functions added to the numpy namespace are::

    np.ismissing(arr)
        Returns a boolean array with True whereever the array is masked
        or matches the NA bit pattern, and False elsewhere

    np.isavail(arr)
        Returns a boolean array with False whereever the array is masked
        or matches the NA bit pattern, and True elsewhere

New functions added to the ndarray are::

    arr.copy(..., replacena=None)
        Modification to the copy function which replaces NA values,
        either masked or with the NA bit pattern, with the 'replacena='
        parameter suppled. When 'replacena' isn't None, the copied
        array is unmasked and has the 'NA' part stripped from the
        parameterized type ('NA[f8]' becomes just 'f8').

    arr.view(masked=True)
        This is a shortcut for 'a = arr.view(); a.flags.hasmask=True'.

Element-wise UFuncs With Missing Values
=======================================

As part of the implementation, ufuncs and other operations will
have to be extended to support masked computation. Because this
is a useful feature in general, even outside the context of
a masked array, in addition to working with masked arrays ufuncs
will take an optional 'mask=' parameter which allows the use
of boolean arrays to choose where a computation should be done.
This functions similar to a "where" clause on the ufunc.::

    >>> np.add(a, b, out=b, mask=(a > threshold))

A benefit of having this 'mask=' parameter is that it provides a way
to temporarily treat an object with a mask without ever creating a
masked array object.

If the 'out' parameter isn't specified, use of the 'mask=' parameter
will produce an array with a mask as the result.

For boolean operations, the R project special cases logical_and and
logical_or so that logical_and(NA, False) is False, and
logical_or(NA, True) is True. On the other hand, 0 * NA isn't 0, but
here the NA could represent Inf or NaN, in which case 0 * the backing
value wouldn't be 0 anyway.

For NumPy element-wise ufuncs, the design won't support this ability
for the mask of the output to depend simultaneously on the mask and
the value of the inputs. The NumPy 1.6 nditer, however, makes it
fairly easy to write standalone functions which look and feel just
like ufuncs, but deviate from their behavior. The functions logical_and
and logical_or can be moved into standalone function objects which are
backwards compatible with the current ufuncs.

Reduction UFuncs With Missing Values
====================================

Reduction operations like 'sum', 'prod', 'min', and 'max' will operate
consistently with the idea that a masked value exists, but its value
is unknown.

An optional parameter 'skipna=' will be added to those functions
which can interpret it appropriately to do the operation as if just
the unmasked values existed.

With 'skipna=True', when all the input values are masked,
'sum' and 'prod' will produce the additive and multiplicative identities
respectively, while 'min' and 'max' will produce masked values.
Statistics operations which require a count, like 'mean' and 'std'
will also use the unmasked value counts for their calculations if
'skipna=True', and produce masked values when all the inputs are masked.

Some examples::

    >>> a = np.array([1., 3., np.NA, 7.], masked=True)
    >>> np.sum(a)
    array(NA, dtype='<f8', masked=True)
    >>> np.sum(a, skipna=True)
    11.0
    >>> np.mean(a)
    NA('<f8')
    >>> np.mean(a, skipna=True)
    3.6666666666666665

    >>> a = np.array([np.NA, np.NA], dtype='f8', masked=True)
    >>> np.sum(a, skipna=True)
    0.0
    >>> np.max(a, skipna=True)
    array(NA, dtype='<f8', masked=True)
    >>> np.mean(a)
    NA('<f8')
    >>> np.mean(a, skipna=True)
    /home/mwiebe/virtualenvs/dev/lib/python2.7/site-packages/numpy/core/fromnumeric.py:2374: RuntimeWarning: invalid value encountered in double_scalars
      return mean(axis, dtype, out)
    nan


PEP 3118
========

PEP 3118 doesn't have any mask mechanism, so arrays with masks will
not be accessible through this interface. Similarly, it doesn't support
the specification of dtypes with NA bit patterns, so the parameterized NA
dtypes will also not be accessible through this interface.

If NumPy did allow access through PEP 3118, this would circumvent the
missing value abstraction in a very damaging way. Other libraries would
try to use masked arrays, and silently get access to the data without
also getting access to the mask or being aware of the missing value
abstraction the mask and data together are following.

Unresolved Design Questions
===========================

The existing masked array implementation has a "hardmask" feature,
which prevents values from ever being unmasked by assigning a value.
This would be an internal array flag, named something like
'arr.flags.hardmask'.

If the hardmask feature is implemented, boolean indexing could
return a hardmasked array instead of a flattened array with the
arbitrary choice of C-ordering as it currently does. While this
improves the abstraction of the array significantly, it is not
a compatible change.

**********************************
Alternative Designs Without a Mask
**********************************

Parameterized Data Type With NA Signal Values
=============================================

A masked array isn't the only way to deal with missing data, and
some systems deal with the problem by defining a special "NA" value,
for data which is missing. This is distinct from NaN floating point
values, which are the result of bad floating point calculation values,
but many people use NaNs for this purpose.

In the case of IEEE floating point values, it is possible to use a
particular NaN value, of which there are many, for "NA", distinct
from NaN. For signed integers, a reasonable approach would be to use
the minimum storable value, which doesn't have a corresponding positive
value. For unsigned integers, the maximum storage value seems most
reasonable.

With the goal of providing a general mechanism, a parameterized type
mechanism for this is much more attractive than creating separate
nafloat32, nafloat64, naint64, nauint64, etc dtypes. If this is viewed
as an alternative way of treating the mask except without value preservation,
this parameterized type can work together with the mask in a special
way to produce a value + mask combination on the fly, and use the
exact same computational infrastructure as the masked array system.
This allows one to avoid the need to write special case code for each
ufunc and for each na* dtype, something that is hard to avoid when
building a separate independent dtype implementation for each na* dtype.

Reliable conversions with the NA bit pattern preserved across primitive
types requires consideration as well. Even in the simple case of
double -> float, where this is supported by hardware, the NA value
will get lost because the NaN payload is typically not preserved.
The ability to have different bit masks specified for the same underlying
type also needs to convert properly. With a well-defined interface
converting to/from a (value,flag) pair, this becomes straightforward
to support generically.

This approach also provides some opportunities for some subtle variations
with IEEE floats. By default, one exact bit-pattern, a silent NaN with
a payload that won't be generated by hardware floating point operations,
would be used. The choice R has made could be this default.

Additionally, it might be nice to sometimes treat all NaNs as missing values.
This requires a slightly more complex mapping to convert the floating point
values into mask/value combinations, and converting back would always
produce the default NaN used by NumPy. Finally, treating both NaNs
and Infs as missing values would be just a slight variation of the NaN
version.

Strings require a slightly different handling, because they
may be any size. One approach is to use a one-character signal consisting
of one of the first 32 ASCII/unicode values. There are many possible values
to use here, like 0x15 'Negative Acknowledgement' or 0x10 'Data Link Escape'.

The Object dtype has an obvious signal, the np.NA singleton itself. Any
dtype with object semantics won't be able to have this customized, since
specifying bit patterns applies only to plain binary data, not data
with object semantics of construction and destructions.

Struct dtypes are more of a core primitive dtype, in the same fashion that
this parameterized NA-capable dtype is. It won't be possible to put
these as the parameter for the parameterized NA-dtype.

The dtype names would be parameterized similar to how the datetime64
is parameterized by the metadata unit. What name to use may require some
debate, but "NA" seems like a reasonable choice. With the default
missing value bit-pattern, these dtypes would look like
np.dtype('NA[float32]'), np.dtype('NA[f8]'), or np.dtype('NA[i64]').

To override the bit pattern that signals a missing value, a raw
value in the format of a hexadecimal unsigned integer can be given,
and in the above special cases for floating point, special strings
can be provided. The defaults for some cases, written explicitly in this
form, are then::

    np.dtype('NA[?,0x02]')
    np.dtype('NA[i4,0x80000000]')
    np.dtype('NA[u4,0xffffffff]')
    np.dtype('NA[f4,0x7f8007a2')
    np.dtype('NA[f8,0x7ff00000000007a2') (R-compatible bitpattern)
    np.dtype('NA[S16,0x15]') (using the NAK character as the signal).

    np.dtype('NA[f8,NaN]') (for any NaN)
    np.dtype('NA[f8,InfNaN]') (for any NaN or Inf)

When no parameter is specified a flexible NA dtype is created, which itself
cannot hold values, but will conform to the input types in functions like
'np.astype'. The dtype 'f8' maps to 'NA[f8]', and [('a', 'f4'), ('b', 'i4')]
maps to [('a', 'NA[f4]'), ('b', 'NA[i4]')]. Thus, to view the memory
of an 'f8' array 'arr' with 'NA[f8]', you can say arr.view(dtype='NA').

Parameterized Data Type Which Adds Additional Memory for the NA Flag
====================================================================

Another alternative to having a separate mask added to the array is
to introduced a parameterized type, which takes a primitive dtype
as an argument. The dtype "i8" would turn into "maybe[i8]", and
a byte flag would be appended to the dtype to indicate whether the
value was NA or not.

This approach adds memory overhead greater or equal to keeping a separate
mask, but has better locality. To keep the dtype aligned, an 'i8' would
need to have 16 bytes to retain proper alignment, a 100% overhead compared
to 12.5% overhead for a separately kept mask.

***************
Acknowledgments
***************

In addition to feedback Travis Oliphant and others at Enthought,
this NEP has been revised based on a great deal of feedback from
the NumPy-Discussion mailing list. The people participating in
the discussion are::

    Nathaniel Smith
    Robert Kern
    Charles Harris
    Gael Varoquaux
    Eric Firing
    Keith Goodman
    Pierre GM
    Christopher Barker
    Josef Perktold
    Ben Root
    Laurent Gautier
    Neal Becker
    Bruce Southey
    Matthew Brett
    Wes McKinney
    Lluís
    Olivier Delalleau
    Alan G Isaac
    E. Antero Tammi

I apologize if I missed anyone.