Array NA Mask API
==================

.. sectionauthor:: Mark Wiebe

.. index::
   pair: maskna; C-API
   pair: C-API; maskna

.. versionadded:: 1.7

NA Masks in Arrays
------------------

NumPy supports the idea of NA (Not Available) missing values in its
arrays.  In the design document leading up to the implementation, two
mechanisms for this were proposed, NA masks and NA bitpatterns. NA masks
have been implemented as the first representation of these values. This
mechanism supports working with NA values similar to what the R language
provides, and when combined with views, allows one to temporarily mark
elements as NA without affecting the original data.

The C API has been updated with mechanisms to allow NumPy extensions
to work with these masks, and this document provides some examples and
reference for the NA mask-related functions.

The NA Object
-------------

The main *numpy* namespace in Python has a new object called *NA*.
This is an instance of :ctype:`NpyNA`, which is a Python object
representing an NA value. This object is analogous to the NumPy
scalars, and is returned by :cfunc:`PyArray_Return` instead of
a scalar where appropriate.

The global *numpy.NA* object is accessible from C as :cdata:`Npy_NA`.
This is an NA value with no data type or multi-NA payload. Use it
just as you would Py_None, except use :cfunc:`NpyNA_Check` to
see if an object is an :ctype:`NpyNA`, because :cdata:`Npy_NA` isn't
the only instance of NA possible.

If you want to see whether a general PyObject* is NA, you should
use the API function :cfunc:`NpyNA_FromObject` with *suppress_error*
set to true. If this returns NULL, the object is not an NA, and if
it returns an NpyNA instance, the object is NA and you can then
access its *dtype* and *payload* fields as needed.

To make new :ctype:`NpyNA` objects, use
:cfunc:`NpyNA_FromDTypeAndPayload`. The functions
:cfunc:`NpyNA_GetDType`, :cfunc:`NpyNA_IsMultiNA`, and
:cfunc:`NpyNA_GetPayload` provide access to the data members.

Working With NA-Masked Arrays
-----------------------------

The starting point for many C-API functions which manipulate NumPy
arrays is the function :cfunc:`PyArray_FromAny`. This function converts
a general PyObject* object into a NumPy ndarray, based on options
specified in the flags. To avoid surprises, this function does
not allow NA-masked arrays to pass through by default.

To allow third-party code to work with NA-masked arrays which contain
no NAs, :cfunc:`PyArray_FromAny` will make a copy of the array into
a new array without an NA-mask, and return that. This allows for
proper interoperability in cases where it's possible until functions
are updated to provide optimal code paths for NA-masked arrays.

To update a function with NA-mask support, add the flag
:cdata:`NPY_ARRAY_ALLOWNA` when calling :cfunc:`PyArray_FromAny`.
This allows NA-masked arrays to pass through untouched, and will
convert PyObject lists containing NA values into NA-masked arrays
instead of the alternative of switching to object arrays.

To check whether an array has an NA-mask, use the function
:cfunc:`PyArray_HASMASKNA`, which checks the appropriate flag.
There are a number of things that one will typically want to do
when encountering an NA-masked array. We'll go through a few
of these cases.

Forbidding Any NA Values
~~~~~~~~~~~~~~~~~~~~~~~~

The simplest case is to forbid any NA values. Note that it is better
to still be aware of the NA mask and explicitly test for NA values
than to leave out the :cdata:`NPY_ARRAY_ALLOWNA`, because it is possible
to avoid the extra copy that :cfunc:`PyArray_FromAny` will make. The
check for NAs will go something like this::

    PyArrayObject *arr = ...;
    int containsna;

    /* ContainsNA checks HASMASKNA() for you */
    containsna = PyArray_ContainsNA(arr, NULL, NULL);
    /* Error case */
    if (containsna < 0) {
        return NULL;
    }
    /* If it found an NA */
    else if (containsna) {
        PyErr_SetString(PyExc_ValueError,
                "this operation does not support arrays with NA values");
        return NULL;
    }

After this check, you can be certain that the array doesn't contain any
NA values, and can proceed accordingly. For example, if you iterate
over the elements of the array, you may pass the flag
:cdata:`NPY_ITER_IGNORE_MASKNA` to iterate over the data without
touching the NA-mask at all.

Manipulating NA Values
~~~~~~~~~~~~~~~~~~~~~~

The semantics of the NA-mask demand that whenever an array element
is hidden by the NA-mask, no computations are permitted to modify
the data backing that element. The :ctype:`NpyIter` provides
a number of flags to assist with visiting both the array data
and the mask data simultaneously, and preserving the masking semantics
even when buffering is required.

The main flag for iterating over NA-masked arrays is
:cdata:`NPY_ITER_USE_MASKNA`. For each iterator operand which has this
flag specified, a new operand is added to the end of the iterator operand
list, and is set to iterate over the original operand's NA-mask. Operands
which do not have an NA mask are permitted as well when they are flagged
as read-only. The new operand in this case points to a single exposed
mask value and all its strides are zero. The latter feature is useful
when combining multiple read-only inputs, where some of them have masks.

Accumulating NA Values
~~~~~~~~~~~~~~~~~~~~~~

More complex operations, like the NumPy ufunc reduce functions, need
to take extra care to follow the masking semantics. If we accumulate
the NA mask and the data values together, we could discover half way
through that the output is NA, and that we have violated the contract
to never change the underlying output value when it is being assigned
NA.

The solution to this problem is to first accumulate the NA-mask as necessary
to produce the output's NA-mask, then accumulate the data values without
touching NA-masked values in the output. The parameter *preservena* in
functions like :cfunc:`PyArray_AssignArray` can assist when initializing
values in such an algorithm.

Example NA-Masked Operation in C
--------------------------------

As an example, let's implement a simple binary NA-masked operation
for the double dtype. We'll make a divide operation which turns
divide by zero into NA instead of Inf or NaN.

To start, we define the function prototype and some basic
:ctype:`NpyIter` boilerplate setup. We'll make a function which
supports an optional *out* parameter, which may be NULL.::

    static PyArrayObject*
    SpecialDivide(PyArrayObject* a, PyArrayObject* b, PyArrayObject *out)
    {
        NpyIter *iter = NULL;
        PyArrayObject *op[3];
        PyArray_Descr *dtypes[3];
        npy_uint32 flags, op_flags[3];

        /* Iterator construction parameters */
        op[0] = a;
        op[1] = b;
        op[2] = out;

        dtypes[0] = PyArray_DescrFromType(NPY_DOUBLE);
        if (dtypes[0] == NULL) {
            return NULL;
        }
        dtypes[1] = dtypes[0];
        dtypes[2] = dtypes[0];

        flags = NPY_ITER_BUFFERED |
                NPY_ITER_EXTERNAL_LOOP |
                NPY_ITER_GROWINNER |
                NPY_ITER_REFS_OK |
                NPY_ITER_ZEROSIZE_OK;

        /* Every operand gets the flag NPY_ITER_USE_MASKNA */
        op_flags[0] = NPY_ITER_READONLY |
                      NPY_ITER_ALIGNED |
                      NPY_ITER_USE_MASKNA;
        op_flags[1] = op_flags[0];
        op_flags[2] = NPY_ITER_WRITEONLY |
                      NPY_ITER_ALIGNED |
                      NPY_ITER_USE_MASKNA |
                      NPY_ITER_NO_BROADCAST |
                      NPY_ITER_ALLOCATE;

        iter = NpyIter_MultiNew(3, op, flags, NPY_KEEPORDER,
                                NPY_SAME_KIND_CASTING, op_flags, dtypes);
        /* Don't need the dtype reference anymore */
        Py_DECREF(dtypes[0]);
        if (iter == NULL) {
            return NULL;
        }

At this point, the input operands have been validated according to
the casting rule, the shapes of the arrays have been broadcast together,
and any buffering necessary has been prepared. This means we can
dive into the inner loop of this function.::

    ...
        if (NpyIter_GetIterSize(iter) > 0) {
            NpyIter_IterNextFunc *iternext;
            char **dataptr;
            npy_intp *stridesptr, *countptr;

            /* Variables needed for looping */
            iternext = NpyIter_GetIterNext(iter, NULL);
            if (iternext == NULL) {
                NpyIter_Deallocate(iter);
                return NULL;
            }
            dataptr = NpyIter_GetDataPtrArray(iter);
            stridesptr = NpyIter_GetInnerStrideArray(iter);
            countptr = NpyIter_GetInnerLoopSizePtr(iter);

The loop gets a bit messy when dealing with NA-masks, because it
doubles the number of operands being processed in the iterator. Here
we are naming things clearly so that the content of the innermost loop
can be easy to work with.::

    ...
            do {
                /* Data pointers and strides needed for innermost loop */
                char *data_a = dataptr[0], *data_b = dataptr[1];
                char *data_out = dataptr[2];
                char *maskna_a = dataptr[3], *maskna_b = dataptr[4];
                char *maskna_out = dataptr[5];
                npy_intp stride_a = stridesptr[0], stride_b = stridesptr[1];
                npy_intp stride_out = strides[2];
                npy_intp maskna_stride_a = stridesptr[3];
                npy_intp maskna_stride_b = stridesptr[4];
                npy_intp maskna_stride_out = stridesptr[5];
                npy_intp i, count = *countptr;

                for (i = 0; i < count; ++i) {

Here is the code for performing one special division. We use
the functions :cfunc:`NpyMaskValue_IsExposed` and
:cfunc:`NpyMaskValue_Create` to work with the masks, in order to be
as general as possible. These are inline functions, and the compiler
optimizer should be able to produce the same result as if you performed
these operations directly inline here.::

    ...
                    /* If neither of the inputs are NA */
                    if (NpyMaskValue_IsExposed((npy_mask)*maskna_a) &&
                                NpyMaskValue_IsExposed((npy_mask)*maskna_b)) {
                        double a_val = *(double *)data_a;
                        double b_val = *(double *)data_b;
                        /* Do the divide if 'b' isn't zero */
                        if (b_val != 0.0) {
                            *(double *)data_out = a_val / b_val;
                            /* Need to also set this element to exposed */
                            *maskna_out = NpyMaskValue_Create(1, 0);
                        }
                        /* Otherwise output an NA without touching its data */
                        else {
                            *maskna_out = NpyMaskValue_Create(0, 0);
                        }
                    }
                    /* Turn the output into NA without touching its data */
                    else {
                        *maskna_out = NpyMaskValue_Create(0, 0);
                    }

                    data_a += stride_a;
                    data_b += stride_b;
                    data_out += stride_out;
                    maskna_a += maskna_stride_a;
                    maskna_b += maskna_stride_b;
                    maskna_out += maskna_stride_out;
                }
            } while (iternext(iter));
        }

A little bit more boilerplate for returning the result from the iterator,
and the function is done.::

    ...
        if (out == NULL) {
            out = NpyIter_GetOperandArray(iter)[2];
        }
        Py_INCREF(out);
        NpyIter_Deallocate(iter);

        return out;
    }

To run this example, you can create a simple module with a C-file spdiv_mod.c
consisting of::

    #include <Python.h>
    #include <numpy/arrayobject.h>

    /* INSERT SpecialDivide source code here */

    static PyObject *
    spdiv(PyObject *self, PyObject *args, PyObject *kwds)
    {
        PyArrayObject *a, *b, *out = NULL;
        static char *kwlist[] = {"a", "b", "out", NULL};

        if (!PyArg_ParseTupleAndKeywords(args, kwds, "O&O&|O&", kwlist,
                                    &PyArray_Converter, &a,
                                    &PyArray_Converter, &b,
                                    &PyArray_OutputConverter, &out)) {
            return NULL;
        }

        /*
         * The usual NumPy way is to only use PyArray_Return when
         * the 'out' parameter is not provided.
         */
        if (out == NULL) {
            return PyArray_Return(SpecialDivide(a, b, out));
        }
        else {
            return (PyObject *)SpecialDivide(a, b, out);
        }
    }

    static PyMethodDef SpDivMethods[] = {
        {"spdiv", (PyCFunction)spdiv, METH_VARARGS | METH_KEYWORDS, NULL},
        {NULL, NULL, 0, NULL}
    };


    PyMODINIT_FUNC initspdiv_mod(void)
    {
        PyObject *m;

        m = Py_InitModule("spdiv_mod", SpDivMethods);
        if (m == NULL) {
            return;
        }

        /* Make sure NumPy is initialized */
        import_array();
    }

Create a setup.py file like::

    #!/usr/bin/env python
    def configuration(parent_package='',top_path=None):
        from numpy.distutils.misc_util import Configuration
        config = Configuration('.',parent_package,top_path)
        config.add_extension('spdiv_mod',['spdiv_mod.c'])
        return config

    if __name__ == "__main__":
        from numpy.distutils.core import setup
        setup(configuration=configuration)

With these two files in a directory by itself, run::

    $ python setup.py build_ext --inplace

and the file spdiv_mod.so (or .dll) will be placed in the same directory.
Now you can try out this sample, to see how it behaves.::

    >>> import numpy as np
    >>> from spdiv_mod import spdiv

Because we used :cfunc:`PyArray_Return` when wrapping SpecialDivide,
it returns scalars like any typical NumPy function does::

    >>> spdiv(1, 2)
    0.5
    >>> spdiv(2, 0)
    NA(dtype='float64')
    >>> spdiv(np.NA, 1.5)
    NA(dtype='float64')

Here we can see how NAs propagate, and how 0 in the output turns into NA
as desired.::

    >>> a = np.arange(6)
    >>> b = np.array([0,np.NA,0,2,1,0])
    >>> spdiv(a, b)
    array([  NA,   NA,   NA,  1.5,  4. ,   NA])

Finally, we can see the masking behavior by creating a masked
view of an array. The ones in *c_orig* are preserved whereever
NA got assigned.::

    >>> c_orig = np.ones(6)
    >>> c = c_orig.view(maskna=True)
    >>> spdiv(a, b, out=c)
    array([  NA,   NA,   NA,  1.5,  4. ,   NA])
    >>> c_orig
    array([ 1. ,  1. ,  1. ,  1.5,  4. ,  1. ])

NA Object Data Type
-------------------

.. ctype:: NpyNA

    This is the C object corresponding to objects of type
    numpy.NAType. The fields themselves are hidden from consumers of the
    API, you must use the functions provided to create new NA objects
    and get their properties.

    This object contains two fields, a :ctype:`PyArray_Descr *` dtype
    which is either NULL or indicates the data type the NA represents,
    and a payload which is there for the future addition of multi-NA support.

.. cvar:: Npy_NA

    This is a global singleton, similar to Py_None, which is the
    *numpy.NA* object. Note that unlike Py_None, multiple NAs may be
    created, for instance with different multi-NA payloads or with
    different dtypes. If you want to return an NA with no payload
    or dtype, return a new reference to Npy_NA.

NA Object Functions
-------------------

.. cfunction:: NpyNA_Check(obj)

    Evaluates to true if *obj* is an instance of :ctype:`NpyNA`.

.. cfunction:: PyArray_Descr* NpyNA_GetDType(NpyNA* na)

    Returns the *dtype* field of the NA object, which is NULL when
    the NA has no dtype.  Does not raise an error.

.. cfunction:: npy_bool NpyNA_IsMultiNA(NpyNA* na)

    Returns true if the NA has a multi-NA payload, false otherwise.

.. cfunction:: int NpyNA_GetPayload(NpyNA* na)

    Gets the multi-NA payload of the NA, or 0 if *na* doesn't have
    a multi-NA payload.

.. cfunction:: NpyNA* NpyNA_FromObject(PyObject* obj, int suppress_error)

    If *obj* represents an object which is NA, for example if it
    is an :ctype:`NpyNA`, or a zero-dimensional NA-masked array with
    its value hidden by the mask, returns a new reference to an
    :ctype:`NpyNA` object representing *obj*. Otherwise returns
    NULL.

    If *suppress_error* is true, this function doesn't raise an exception
    when the input isn't NA and it returns NULL, otherwise it does.

.. cfunction:: NpyNA* NpyNA_FromDTypeAndPayload(PyArray_Descr *dtype, int multina, int payload)


    Constructs a new :ctype:`NpyNA` instance with the specified *dtype*
    and *payload*. For an NA with no dtype, provide NULL in *dtype*.

    Until multi-NA is implemented, just pass 0 for both *multina*
    and *payload*.

NA Mask Functions
-----------------

A mask dtype can be one of three different possibilities. It can
be :cdata:`NPY_BOOL`, :cdata:`NPY_MASK`, or a struct dtype whose
fields are all mask dtypes.

A mask of :cdata:`NPY_BOOL` can just indicate True, with underlying
value 1, for an element that is exposed, and False, with underlying
value 0, for an element that is hidden.

A mask of :cdata:`NPY_MASK` can additionally carry a payload which
is a value from 0 to 127. This allows for missing data implementations
based on such masks to support multiple reasons for data being missing.

A mask of a struct dtype can only pair up with another struct dtype
with the same field names. In this way, each field of the mask controls
the masking for the corresponding field in the associated data array.

Inline functions to work with masks are as follows.

.. cfunction:: npy_bool NpyMaskValue_IsExposed(npy_mask mask)

    Returns true if the data element corresponding to the mask element
    can be modified, false if not.

.. cfunction:: npy_uint8 NpyMaskValue_GetPayload(npy_mask mask)

    Returns the payload contained in the mask. The return value
    is between 0 and 127.

.. cfunction:: npy_mask NpyMaskValue_Create(npy_bool exposed, npy_int8 payload)

    Creates a mask from a flag indicating whether the element is exposed
    or not and a payload value.

NA Mask Array Functions
-----------------------

.. cfunction:: int PyArray_AllocateMaskNA(PyArrayObject *arr, npy_bool ownmaskna, npy_bool multina, npy_mask defaultmask)

    Allocates an NA mask for the array *arr* if necessary. If *ownmaskna*
    if false, it only allocates an NA mask if none exists, but if
    *ownmaskna* is true, it also allocates one if the NA mask is a view
    into another array's NA mask. Here are the two most common usage
    patterns::

        /* Use this to make sure 'arr' has an NA mask */
        if (PyArray_AllocateMaskNA(arr, 0, 0, 1) < 0) {
            return NULL;
        }

        /* Use this to make sure 'arr' owns an NA mask */
        if (PyArray_AllocateMaskNA(arr, 1, 0, 1) < 0) {
            return NULL;
        }

    The parameter *multina* is provided for future expansion, when
    mult-NA support is added to NumPy. This will affect the dtype of
    the NA mask, which currently must be always NPY_BOOL, but will be
    NPY_MASK for arrays multi-NA when this is implemented.

    When a new NA mask is allocated, and the mask needs to be filled,
    it uses the value *defaultmask*. In nearly all cases, this should be set
    to 1, indicating that the elements are exposed. If a mask is allocated
    just because of *ownmaskna*, the existing mask values are copied
    into the newly allocated mask.

    This function returns 0 for success, -1 for failure.

.. cfunction:: npy_bool PyArray_HasNASupport(PyArrayObject *arr)

    Returns true if *arr* is an array which supports NA. This function
    exists because the design for adding NA proposed two mechanisms
    for NAs in NumPy, NA masks and NA bitpatterns. Currently, just
    NA masks have been implemented, but when NA bitpatterns are implemented
    this would return true for arrays with an NA bitpattern dtype as well.

.. cfunction:: int PyArray_ContainsNA(PyArrayObject *arr, PyArrayObject *wheremask, npy_bool *whichna)

    Checks whether the array *arr* contains any NA values.

    If *wheremask* is non-NULL, it must be an NPY_BOOL mask which can
    broadcast onto *arr*. Whereever the where mask is True, *arr*
    is checked for NA, and whereever it is False, the *arr* value is
    ignored.

    The parameter *whichna* is provided for future expansion to multi-NA
    support. When implemented, this parameter will be a 128 element
    array of npy_bool, with the value True for the NA values that are
    being looked for.

    This function returns 1 when the array contains NA values, 0 when
    it does not, and -1 when a error has occurred.

.. cfunction:: int PyArray_AssignNA(PyArrayObject *arr, NpyNA *na, PyArrayObject *wheremask, npy_bool preservena, npy_bool *preservewhichna)

    Assigns the given *na* value to elements of *arr*.

    If *wheremask* is non-NULL, it must be an NPY_BOOL array broadcastable
    onto *arr*, and only elements of *arr* with a corresponding value
    of True in *wheremask* will have *na* assigned.

    The parameters *preservena* and *preservewhichna* are provided for
    future expansion to multi-NA support. With a single NA value, one
    NA cannot be distinguished from another, so preserving NA values
    does not make sense. With multiple NA values, preserving NA values
    becomes an important concept because that implies not overwriting the
    multi-NA payloads. The parameter *preservewhichna* will be a 128 element
    array of npy_bool, indicating which NA payloads to preserve.

    This function returns 0 for success, -1 for failure.

.. cfunction:: int PyArray_AssignMaskNA(PyArrayObject *arr, npy_mask maskvalue, PyArrayObject *wheremask, npy_bool preservena, npy_bool *preservewhichna)

    Assigns the given NA mask *maskvalue* to elements of *arr*.

    If *wheremask* is non-NULL, it must be an NPY_BOOL array broadcastable
    onto *arr*, and only elements of *arr* with a corresponding value
    of True in *wheremask* will have the NA *maskvalue* assigned.

    The parameters *preservena* and *preservewhichna* are provided for
    future expansion to multi-NA support. With a single NA value, one
    NA cannot be distinguished from another, so preserving NA values
    does not make sense. With multiple NA values, preserving NA values
    becomes an important concept because that implies not overwriting the
    multi-NA payloads. The parameter *preservewhichna* will be a 128 element
    array of npy_bool, indicating which NA payloads to preserve.

    This function returns 0 for success, -1 for failure.