From 770c94ef5ab2478bc9f5451f931613d7424459e1 Mon Sep 17 00:00:00 2001
From: Mark Wiebe <mwwiebe@gmail.com>
Date: Wed, 24 Aug 2011 16:24:52 -0700
Subject: DOC: missingdata: Add example of a C-API function supporting NA masks

---
 doc/source/reference/c-api.maskna.rst | 384 +++++++++++++++++++++++++++++++++-
 1 file changed, 376 insertions(+), 8 deletions(-)

(limited to 'doc/source/reference/c-api.maskna.rst')

diff --git a/doc/source/reference/c-api.maskna.rst b/doc/source/reference/c-api.maskna.rst
index a7613e1b2..09fa44d35 100644
--- a/doc/source/reference/c-api.maskna.rst
+++ b/doc/source/reference/c-api.maskna.rst
@@ -28,7 +28,7 @@ The NA Object
 -------------
 
 The main *numpy* namespace in Python has a new object called *NA*.
-This is an instance of :ctype:`NpyNA *`, which is a Python object
+This is an instance of :ctype:`NpyNA`, which is a Python object
 representing an NA value. This object is analogous to the NumPy
 scalars, and is returned by :cfunc:`PyArray_Return` instead of
 a scalar where appropriate.
@@ -36,7 +36,7 @@ a scalar where appropriate.
 The global *numpy.NA* object is accessible from C as :cdata:`Npy_NA`.
 This is an NA value with no data type or multi-NA payload. Use it
 just as you would Py_None, except use :cfunc:`NpyNA_Check` to
-see if an object is an :ctype:`NpyNA *`, because :cdata:`Npy_NA` isn't
+see if an object is an :ctype:`NpyNA`, because :cdata:`Npy_NA` isn't
 the only instance of NA possible.
 
 If you want to see whether a general PyObject* is NA, you should
@@ -45,18 +45,350 @@ set to true. If this returns NULL, the object is not an NA, and if
 it returns an NpyNA instance, the object is NA and you can then
 access its *dtype* and *payload* fields as needed.
 
-To make new :ctype:`NpyNA *` objects, use
+To make new :ctype:`NpyNA` objects, use
 :cfunc:`NpyNA_FromDTypeAndPayload`, and the functions
 :cfunc:`NpyNA_GetDType`, :cfunc:`NpyNA_IsMultiNA`, and
 :cfunc:`NpyNA_GetPayload` provide access to the data members.
 
+Working With NA-Masked Arrays
+-----------------------------
+
+The starting point for many C-API functions which manipulate NumPy
+arrays is the function :cfunc:`PyArray_FromAny`. This function converts
+a general PyObject* object into a NumPy ndarray, based on options
+specified in the flags. To avoid surprises, this function does
+not allow NA-masked arrays to pass through by default.
+
+To allow third-party code to work with NA-masked arrays which contain
+no NAs, :cfunc:`PyArray_FromAny` will make a copy of the array into
+a new array without an NA-mask, and return that. This allows for
+proper interoperability in cases where it's possible until functions
+are updated to provide optimal code paths for NA-masked arrays.
+
+To update a function with NA-mask support, add the flag
+:cdata:`NPY_ARRAY_ALLOWNA` when calling :cfunc:`PyArray_FromAny`.
+This allows NA-masked arrays to pass through untouched, and will
+convert PyObject lists containing NA values into NA-masked arrays
+instead of the alternative of switching to object arrays.
+
+To check whether an array has an NA-mask, use the function
+:cfunc:`PyArray_HASMASKNA`, which checks the appropriate flag.
+There are a number of things that one will typically want to do
+when encountering an NA-masked array. We'll go through a few
+of these cases.
+
+Forbidding Any NA Values
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The simplest case is to forbid any NA values. Note that it is better
+to still be aware of the NA mask and explicitly test for NA values
+than to leave out the :cdata:`NPY_ARRAY_ALLOWNA`, because it is possible
+to avoid the extra copy that :cfunc:`PyArray_FromAny` will make. The
+check for NAs will go something like this::
+
+    PyArrayObject *arr = ...;
+    int containsna;
+
+    /* ContainsNA checks HASMASKNA() for you */
+    containsna = PyArray_ContainsNA(arr, NULL, NULL);
+    /* Error case */
+    if (containsna < 0) {
+        return NULL;
+    }
+    /* If it found an NA */
+    else if (containsna) {
+        PyErr_SetString(PyExc_ValueError,
+                "this operation does not support arrays with NA values");
+        return NULL;
+    }
+
+After this check, you can be certain that the array doesn't contain any
+NA values, and can proceed accordingly. For example, if you iterate
+over the elements of the array, you may pass the flag
+:cdata:`NPY_ITER_IGNORE_MASKNA` to iterate over the data without
+touching the NA-mask at all.
+
+Manipulating NA Values
+~~~~~~~~~~~~~~~~~~~~~~
+
+The semantics of the NA-mask demand that whenever an array element
+is hidden by the NA-mask, no computations are permitted to modify
+the data backing that element. The :ctype:`NpyIter` provides
+a number of flags to assist with visiting both the array data
+and the mask data simultaneously, and preserving the masking semantics
+even when buffering is required.
+
+The main flag for iterating over NA-masked arrays is
+:cdata:`NPY_ITER_USE_MASKNA`. For each iterator operand which has this
+flag specified, a new operand is added to the end of the iterator operand
+list, and is set to iterate over the original operand's NA-mask. Operands
+which do not have an NA mask are permitted as well when they are flagged
+as read-only. The new operand in this case points to a single exposed
+mask value and all its strides are zero. The latter feature is useful
+when combining multiple read-only inputs, where some of them have masks.
+
+Accumulating NA Values
+~~~~~~~~~~~~~~~~~~~~~~
+
+More complex operations, like the NumPy ufunc reduce functions, need
+to take extra care to follow the masking semantics. If we accumulate
+the NA mask and the data values together, we could discover half way
+through that the output is NA, and that we have violated the contract
+to never change the underlying output value when it is being assigned
+NA.
+
+The solution to this problem is to first accumulate the NA-mask as necessary
+to produce the output's NA-mask, then accumulate the data values without
+touching NA-masked values in the output. The parameter *preservena* in
+functions like :cfunc:`PyArray_AssignArray` can assist when initializing
+values in such an algorithm.
+
 Example NA-Masked Operation in C
 --------------------------------
 
+As an example, let's implement a simple binary NA-masked operation
+for the double dtype. We'll make a divide operation which turns
+divide by zero into NA instead of NaN.
+
+To start, we define the function prototype and some basic
+:ctype:`NpyIter` boilerplate setup. We'll make a function which
+supports an optional *out* parameter, which may be NULL.::
+
+    PyArrayObject*
+    SpecialDivide(PyArrayObject* a, PyArrayObject* b, PyArrayObject *out)
+    {
+        NpyIter *iter = NULL;
+        PyArrayObject *op[3];
+        PyArray_Descr *dtypes[3];
+        npy_uint32 flags, op_flags[3];
+
+        /* Iterator construction parameters */
+        op[0] = a;
+        op[1] = b;
+        op[2] = out;
+
+        dtypes[0] = PyArray_DescrFromType(NPY_DOUBLE);
+        if (dtypes[0] == NULL) {
+            return NULL;
+        }
+        dtypes[1] = dtypes[0];
+        dtypes[2] = dtypes[0];
+
+        flags = NPY_ITER_BUFFERED |
+                NPY_ITER_EXTERNAL_LOOP |
+                NPY_ITER_GROWINNER |
+                NPY_ITER_REFS_OK |
+                NPY_ITER_ZEROSIZE_OK;
+
+        /* Every operand gets the flag NPY_ITER_USE_MASKNA */
+        op_flags[0] = NPY_ITER_READONLY |
+                      NPY_ITER_ALIGNED |
+                      NPY_ITER_USE_MASKNA;
+        op_flags[1] = op_flags[0];
+        op_flags[2] = NPY_ITER_WRITEONLY |
+                      NPY_ITER_ALIGNED |
+                      NPY_ITER_USE_MASKNA |
+                      NPY_ITER_NO_BROADCAST |
+                      NPY_ITER_ALLOCATE;
+
+        iter = NpyIter_MultiNew(3, op, flags, NPY_KEEPORDER,
+                                NPY_SAME_KIND_CASTING, op_flags, dtypes);
+        /* Don't need the dtype reference anymore */
+        Py_DECREF(dtypes[0]);
+        if (iter == NULL) {
+            return NULL;
+        }
+
+At this point, the input operands have been validated according to
+the casting rule, the shapes of the arrays have been broadcast together,
+and any buffering necessary has been prepared. This means we can
+dive into the inner loop of this function.::
+
+    ...
+        if (NpyIter_GetIterSize(iter) > 0) {
+            NpyIter_IterNextFunc *iternext;
+            char **dataptr;
+            npy_intp *stridesptr, *countptr;
+
+            /* Variables needed for looping */
+            iternext = NpyIter_GetIterNext(iter, NULL);
+            if (iternext == NULL) {
+                NpyIter_Deallocate(iter);
+                return NULL;
+            }
+            dataptr = NpyIter_GetDataPtrArray(iter);
+            stridesptr = NpyIter_GetInnerStrideArray(iter);
+            countptr = NpyIter_GetInnerLoopSizePtr(iter);
+
+The loop gets a bit messy when dealing with NA-masks, because it
+doubles the number of operands being processed in the iterator. Here
+we are naming things clearly so that the content of the innermost loop
+can be easy to work with.::
+
+    ...
+            do {
+                /* Data pointers and strides needed for innermost loop */
+                char *data_a = dataptr[0], *data_b = dataptr[1];
+                char *data_out = dataptr[2];
+                char *maskna_a = dataptr[3], *maskna_b = dataptr[4];
+                char *maskna_out = dataptr[5];
+                npy_intp stride_a = stridesptr[0], stride_b = stridesptr[1];
+                npy_intp stride_out = strides[2];
+                npy_intp maskna_stride_a = stridesptr[3];
+                npy_intp maskna_stride_b = stridesptr[4];
+                npy_intp maskna_stride_out = stridesptr[5];
+                npy_intp i, count = *countptr;
+
+                for (i = 0; i < count; ++i) {
+
+Here is the code for performing one special division. We use
+the functions :cfunc:`NpyMaskValue_IsExposed` and
+:cfunc:`NpyMaskValue_Create` to work with the masks, in order to be
+as general as possible. These are inline functions, and the compiler
+optimizer should be able to produce the same result as if you performed
+these operations directly inline here.::
+
+    ...
+                    /* If neither of the inputs are NA */
+                    if (NpyMaskValue_IsExposed((npy_mask)*maskna_a) &&
+                                NpyMaskValue_IsExposed((npy_mask)*maskna_b)) {
+                        double a_val = *(double *)data_a;
+                        double b_val = *(double *)data_b;
+                        /* Do the divide if 'b' isn't zero */
+                        if (b_val != 0.0) {
+                            *(double *)data_out = a_val / b_val;
+                            /* Need to also set this element to exposed */
+                            *maskna_out = NpyMaskValue_Create(1, 0);
+                        }
+                        /* Otherwise output an NA without touching its data */
+                        else {
+                            *maskna_out = NpyMaskValue_Create(0, 0);
+                        }
+                    }
+                    /* Turn the output into NA without touching its data */
+                    else {
+                        *maskna_out = NpyMaskValue_Create(0, 0);
+                    }
+
+                    data_a += stride_a;
+                    data_b += stride_b;
+                    data_out += stride_out;
+                    maskna_a += maskna_stride_a;
+                    maskna_b += maskna_stride_b;
+                    maskna_out += maskna_stride_out;
+                }
+            } while (iternext(iter));
+        }
+
+A little bit more boilerplate for returning the result from the iterator,
+and the function is done.::
+
+    ...
+        if (out == NULL) {
+            out = NpyIter_GetOperandArray(iter)[2];
+        }
+        Py_INCREF(out);
+        NpyIter_Deallocate(iter);
+
+        return out;
+    }
+
+To run this example, you can create a simple module with a C-file spdiv_mod.c
+consisting of::
+
+    #include <Python.h>
+    #include <numpy/arrayobject.h>
+
+    /* INSERT SpecialDivide source code here */
+
+    static PyObject *
+    spdiv(PyObject *self, PyObject *args, PyObject *kwds)
+    {
+        PyArrayObject *a, *b, *out = NULL;
+        static char *kwlist[] = {"a", "b", "out", NULL};
+
+        if (!PyArg_ParseTupleAndKeywords(args, kwds, "O&O&|O&", kwlist,
+                                    &PyArray_Converter, &a,
+                                    &PyArray_Converter, &b,
+                                    &PyArray_OutputConverter, &out)) {
+            return NULL;
+        }
+
+        return (PyObject *)SpecialDivide(a, b, out);
+    }
+
+    static PyMethodDef SpDivMethods[] = {
+        {"spdiv", (PyCFunction)spdiv, METH_VARARGS | METH_KEYWORDS, NULL},
+        {NULL, NULL, 0, NULL}
+    };
+
+
+    PyMODINIT_FUNC initspdiv_mod(void)
+    {
+        PyObject *m;
+
+        m = Py_InitModule("spdiv_mod", SpDivMethods);
+        if (m == NULL) {
+            return;
+        }
+
+        /* Make sure NumPy is initialized */
+        import_array();
+    }
+
+Create a setup.py file like::
+
+    #!/usr/bin/env python
+    def configuration(parent_package='',top_path=None):
+        from numpy.distutils.misc_util import Configuration
+        config = Configuration('.',parent_package,top_path)
+        config.add_extension('spdiv_mod',['spdiv_mod.c'])
+        return config
+
+    if __name__ == "__main__":
+        from numpy.distutils.core import setup
+        setup(configuration=configuration)
+
+With these two files in a directory by itself, run::
+
+    $ python setup.py build_ext --inplace
+
+and the file spdiv_mod.so (or .dll) will be placed in the same directory.
+Now you can try out this sample, to see how it behaves.::
+
+    >>> import numpy as np
+    >>> from spdiv_mod import spdiv
+
+It always produces a masked output array. To get the NumPy scalar returning
+behavior, replace the return in spdiv() with
+"PyArray_Return(SpecialDivide(a, b, out)".::
+
+    >>> spdiv(1,2)
+    array(0.5, maskna=True)
+
+Here we can see how NAs propagate, and how 0 in the output turns into NA
+as desired.::
+
+    >>> a = np.arange(6)
+    >>> b = np.array([0,np.NA,0,2,1,0])
+    >>> spdiv(a,b)
+    array([  NA,   NA,   NA,  1.5,  4. ,   NA])
+
+Finally, we can see the masking behavior by creating a masked
+view of an array. The ones in *c_orig* are preserved whereever
+NA got assigned.::
+
+    >>> c_orig = np.ones(6)
+    >>> c = c_orig.view(maskna=True)
+    >>> spdiv(a,b,out=c)
+    array([  NA,   NA,   NA,  1.5,  4. ,   NA])
+    >>> c_orig
+    array([ 1. ,  1. ,  1. ,  1.5,  4. ,  1. ])
+
 NA Object Data Type
 -------------------
 
-.. ctype:: NpyNA *
+.. ctype:: NpyNA
 
     This is the C object corresponding to objects of type
     numpy.NAType. The fields themselves are hidden from consumers of the
@@ -80,7 +412,7 @@ NA Object Functions
 
 .. cfunction:: NpyNA_Check(obj)
 
-    Evaluates to true if *obj* is an instance of :ctype:`NpyNA *`.
+    Evaluates to true if *obj* is an instance of :ctype:`NpyNA`.
 
 .. cfunction:: PyArray_Descr* NpyNA_GetDType(NpyNA* na)
 
@@ -99,9 +431,9 @@ NA Object Functions
 .. cfunction:: NpyNA* NpyNA_FromObject(PyObject* obj, int suppress_error)
 
     If *obj* represents an object which is NA, for example if it
-    is an :ctype:`NpyNA *`, or a zero-dimensional NA-masked array with
+    is an :ctype:`NpyNA`, or a zero-dimensional NA-masked array with
     its value hidden by the mask, returns a new reference to an
-    :ctype:`NpyNA *` object representing *obj*. Otherwise returns
+    :ctype:`NpyNA` object representing *obj*. Otherwise returns
     NULL.
 
     If *suppress_error* is true, this function doesn't raise an exception
@@ -110,7 +442,7 @@ NA Object Functions
 .. cfunction:: NpyNA* NpyNA_FromDTypeAndPayload(PyArray_Descr *dtype, int multina, int payload)
 
 
-    Constructs a new :ctype:`NpyNA *` instance with the specified *dtype*
+    Constructs a new :ctype:`NpyNA` instance with the specified *dtype*
     and *payload*. For an NA with no dtype, provide NULL in *dtype*.
 
     Until multi-NA is implemented, just pass 0 for both *multina*
@@ -119,6 +451,42 @@ NA Object Functions
 NA Mask Functions
 -----------------
 
+A mask dtype can be one of three different possibilities. It can
+be :cdata:`NPY_BOOL`, :cdata:`NPY_MASK`, or a struct dtype whose
+fields are all mask dtypes.
+
+A mask of :cdata:`NPY_BOOL` can just indicate True, with underlying
+value 1, for an element that is exposed, and False, with underlying
+value 0, for an element that is hidden.
+
+A mask of :cdata:`NPY_MASK` can additionally carry a payload which
+is a value from 0 to 127. This allows for missing data implementations
+based on such masks to support multiple reasons for data being missing.
+
+A mask of a struct dtype can only pair up with another struct dtype
+with the same field names. In this way, each field of the mask controls
+the masking for the corresponding field in the associated data array.
+
+Inline functions to work with masks are as follows.
+
+.. cfunction:: npy_bool NpyMaskValue_IsExposed(npy_mask mask)
+
+    Returns true if the data element corresponding to the mask element
+    can be modified, false if not.
+
+.. cfunction:: npy_uint8 NpyMaskValue_GetPayload(npy_mask mask)
+
+    Returns the payload contained in the mask. The return value
+    is between 0 and 127.
+
+.. cfunction:: npy_mask NpyMaskValue_Create(npy_bool exposed, npy_int8 payload)
+
+    Creates a mask from a flag indicating whether the element is exposed
+    or not and a payload value.
+
+NA Mask Array Functions
+-----------------------
+
 .. cfunction:: int PyArray_AllocateMaskNA(PyArrayObject *arr, npy_bool ownmaskna, npy_bool multina, npy_mask defaultmask)
 
     Allocates an NA mask for the array *arr* if necessary. If *ownmaskna*
-- 
cgit v1.2.1