From 770c94ef5ab2478bc9f5451f931613d7424459e1 Mon Sep 17 00:00:00 2001 From: Mark Wiebe Date: Wed, 24 Aug 2011 16:24:52 -0700 Subject: DOC: missingdata: Add example of a C-API function supporting NA masks --- doc/source/reference/c-api.maskna.rst | 384 +++++++++++++++++++++++++++++++++- 1 file changed, 376 insertions(+), 8 deletions(-) (limited to 'doc/source/reference/c-api.maskna.rst') diff --git a/doc/source/reference/c-api.maskna.rst b/doc/source/reference/c-api.maskna.rst index a7613e1b2..09fa44d35 100644 --- a/doc/source/reference/c-api.maskna.rst +++ b/doc/source/reference/c-api.maskna.rst @@ -28,7 +28,7 @@ The NA Object ------------- The main *numpy* namespace in Python has a new object called *NA*. -This is an instance of :ctype:`NpyNA *`, which is a Python object +This is an instance of :ctype:`NpyNA`, which is a Python object representing an NA value. This object is analogous to the NumPy scalars, and is returned by :cfunc:`PyArray_Return` instead of a scalar where appropriate. @@ -36,7 +36,7 @@ a scalar where appropriate. The global *numpy.NA* object is accessible from C as :cdata:`Npy_NA`. This is an NA value with no data type or multi-NA payload. Use it just as you would Py_None, except use :cfunc:`NpyNA_Check` to -see if an object is an :ctype:`NpyNA *`, because :cdata:`Npy_NA` isn't +see if an object is an :ctype:`NpyNA`, because :cdata:`Npy_NA` isn't the only instance of NA possible. If you want to see whether a general PyObject* is NA, you should @@ -45,18 +45,350 @@ set to true. If this returns NULL, the object is not an NA, and if it returns an NpyNA instance, the object is NA and you can then access its *dtype* and *payload* fields as needed. -To make new :ctype:`NpyNA *` objects, use +To make new :ctype:`NpyNA` objects, use :cfunc:`NpyNA_FromDTypeAndPayload`, and the functions :cfunc:`NpyNA_GetDType`, :cfunc:`NpyNA_IsMultiNA`, and :cfunc:`NpyNA_GetPayload` provide access to the data members. +Working With NA-Masked Arrays +----------------------------- + +The starting point for many C-API functions which manipulate NumPy +arrays is the function :cfunc:`PyArray_FromAny`. This function converts +a general PyObject* object into a NumPy ndarray, based on options +specified in the flags. To avoid surprises, this function does +not allow NA-masked arrays to pass through by default. + +To allow third-party code to work with NA-masked arrays which contain +no NAs, :cfunc:`PyArray_FromAny` will make a copy of the array into +a new array without an NA-mask, and return that. This allows for +proper interoperability in cases where it's possible until functions +are updated to provide optimal code paths for NA-masked arrays. + +To update a function with NA-mask support, add the flag +:cdata:`NPY_ARRAY_ALLOWNA` when calling :cfunc:`PyArray_FromAny`. +This allows NA-masked arrays to pass through untouched, and will +convert PyObject lists containing NA values into NA-masked arrays +instead of the alternative of switching to object arrays. + +To check whether an array has an NA-mask, use the function +:cfunc:`PyArray_HASMASKNA`, which checks the appropriate flag. +There are a number of things that one will typically want to do +when encountering an NA-masked array. We'll go through a few +of these cases. + +Forbidding Any NA Values +~~~~~~~~~~~~~~~~~~~~~~~~ + +The simplest case is to forbid any NA values. Note that it is better +to still be aware of the NA mask and explicitly test for NA values +than to leave out the :cdata:`NPY_ARRAY_ALLOWNA`, because it is possible +to avoid the extra copy that :cfunc:`PyArray_FromAny` will make. The +check for NAs will go something like this:: + + PyArrayObject *arr = ...; + int containsna; + + /* ContainsNA checks HASMASKNA() for you */ + containsna = PyArray_ContainsNA(arr, NULL, NULL); + /* Error case */ + if (containsna < 0) { + return NULL; + } + /* If it found an NA */ + else if (containsna) { + PyErr_SetString(PyExc_ValueError, + "this operation does not support arrays with NA values"); + return NULL; + } + +After this check, you can be certain that the array doesn't contain any +NA values, and can proceed accordingly. For example, if you iterate +over the elements of the array, you may pass the flag +:cdata:`NPY_ITER_IGNORE_MASKNA` to iterate over the data without +touching the NA-mask at all. + +Manipulating NA Values +~~~~~~~~~~~~~~~~~~~~~~ + +The semantics of the NA-mask demand that whenever an array element +is hidden by the NA-mask, no computations are permitted to modify +the data backing that element. The :ctype:`NpyIter` provides +a number of flags to assist with visiting both the array data +and the mask data simultaneously, and preserving the masking semantics +even when buffering is required. + +The main flag for iterating over NA-masked arrays is +:cdata:`NPY_ITER_USE_MASKNA`. For each iterator operand which has this +flag specified, a new operand is added to the end of the iterator operand +list, and is set to iterate over the original operand's NA-mask. Operands +which do not have an NA mask are permitted as well when they are flagged +as read-only. The new operand in this case points to a single exposed +mask value and all its strides are zero. The latter feature is useful +when combining multiple read-only inputs, where some of them have masks. + +Accumulating NA Values +~~~~~~~~~~~~~~~~~~~~~~ + +More complex operations, like the NumPy ufunc reduce functions, need +to take extra care to follow the masking semantics. If we accumulate +the NA mask and the data values together, we could discover half way +through that the output is NA, and that we have violated the contract +to never change the underlying output value when it is being assigned +NA. + +The solution to this problem is to first accumulate the NA-mask as necessary +to produce the output's NA-mask, then accumulate the data values without +touching NA-masked values in the output. The parameter *preservena* in +functions like :cfunc:`PyArray_AssignArray` can assist when initializing +values in such an algorithm. + Example NA-Masked Operation in C -------------------------------- +As an example, let's implement a simple binary NA-masked operation +for the double dtype. We'll make a divide operation which turns +divide by zero into NA instead of NaN. + +To start, we define the function prototype and some basic +:ctype:`NpyIter` boilerplate setup. We'll make a function which +supports an optional *out* parameter, which may be NULL.:: + + PyArrayObject* + SpecialDivide(PyArrayObject* a, PyArrayObject* b, PyArrayObject *out) + { + NpyIter *iter = NULL; + PyArrayObject *op[3]; + PyArray_Descr *dtypes[3]; + npy_uint32 flags, op_flags[3]; + + /* Iterator construction parameters */ + op[0] = a; + op[1] = b; + op[2] = out; + + dtypes[0] = PyArray_DescrFromType(NPY_DOUBLE); + if (dtypes[0] == NULL) { + return NULL; + } + dtypes[1] = dtypes[0]; + dtypes[2] = dtypes[0]; + + flags = NPY_ITER_BUFFERED | + NPY_ITER_EXTERNAL_LOOP | + NPY_ITER_GROWINNER | + NPY_ITER_REFS_OK | + NPY_ITER_ZEROSIZE_OK; + + /* Every operand gets the flag NPY_ITER_USE_MASKNA */ + op_flags[0] = NPY_ITER_READONLY | + NPY_ITER_ALIGNED | + NPY_ITER_USE_MASKNA; + op_flags[1] = op_flags[0]; + op_flags[2] = NPY_ITER_WRITEONLY | + NPY_ITER_ALIGNED | + NPY_ITER_USE_MASKNA | + NPY_ITER_NO_BROADCAST | + NPY_ITER_ALLOCATE; + + iter = NpyIter_MultiNew(3, op, flags, NPY_KEEPORDER, + NPY_SAME_KIND_CASTING, op_flags, dtypes); + /* Don't need the dtype reference anymore */ + Py_DECREF(dtypes[0]); + if (iter == NULL) { + return NULL; + } + +At this point, the input operands have been validated according to +the casting rule, the shapes of the arrays have been broadcast together, +and any buffering necessary has been prepared. This means we can +dive into the inner loop of this function.:: + + ... + if (NpyIter_GetIterSize(iter) > 0) { + NpyIter_IterNextFunc *iternext; + char **dataptr; + npy_intp *stridesptr, *countptr; + + /* Variables needed for looping */ + iternext = NpyIter_GetIterNext(iter, NULL); + if (iternext == NULL) { + NpyIter_Deallocate(iter); + return NULL; + } + dataptr = NpyIter_GetDataPtrArray(iter); + stridesptr = NpyIter_GetInnerStrideArray(iter); + countptr = NpyIter_GetInnerLoopSizePtr(iter); + +The loop gets a bit messy when dealing with NA-masks, because it +doubles the number of operands being processed in the iterator. Here +we are naming things clearly so that the content of the innermost loop +can be easy to work with.:: + + ... + do { + /* Data pointers and strides needed for innermost loop */ + char *data_a = dataptr[0], *data_b = dataptr[1]; + char *data_out = dataptr[2]; + char *maskna_a = dataptr[3], *maskna_b = dataptr[4]; + char *maskna_out = dataptr[5]; + npy_intp stride_a = stridesptr[0], stride_b = stridesptr[1]; + npy_intp stride_out = strides[2]; + npy_intp maskna_stride_a = stridesptr[3]; + npy_intp maskna_stride_b = stridesptr[4]; + npy_intp maskna_stride_out = stridesptr[5]; + npy_intp i, count = *countptr; + + for (i = 0; i < count; ++i) { + +Here is the code for performing one special division. We use +the functions :cfunc:`NpyMaskValue_IsExposed` and +:cfunc:`NpyMaskValue_Create` to work with the masks, in order to be +as general as possible. These are inline functions, and the compiler +optimizer should be able to produce the same result as if you performed +these operations directly inline here.:: + + ... + /* If neither of the inputs are NA */ + if (NpyMaskValue_IsExposed((npy_mask)*maskna_a) && + NpyMaskValue_IsExposed((npy_mask)*maskna_b)) { + double a_val = *(double *)data_a; + double b_val = *(double *)data_b; + /* Do the divide if 'b' isn't zero */ + if (b_val != 0.0) { + *(double *)data_out = a_val / b_val; + /* Need to also set this element to exposed */ + *maskna_out = NpyMaskValue_Create(1, 0); + } + /* Otherwise output an NA without touching its data */ + else { + *maskna_out = NpyMaskValue_Create(0, 0); + } + } + /* Turn the output into NA without touching its data */ + else { + *maskna_out = NpyMaskValue_Create(0, 0); + } + + data_a += stride_a; + data_b += stride_b; + data_out += stride_out; + maskna_a += maskna_stride_a; + maskna_b += maskna_stride_b; + maskna_out += maskna_stride_out; + } + } while (iternext(iter)); + } + +A little bit more boilerplate for returning the result from the iterator, +and the function is done.:: + + ... + if (out == NULL) { + out = NpyIter_GetOperandArray(iter)[2]; + } + Py_INCREF(out); + NpyIter_Deallocate(iter); + + return out; + } + +To run this example, you can create a simple module with a C-file spdiv_mod.c +consisting of:: + + #include + #include + + /* INSERT SpecialDivide source code here */ + + static PyObject * + spdiv(PyObject *self, PyObject *args, PyObject *kwds) + { + PyArrayObject *a, *b, *out = NULL; + static char *kwlist[] = {"a", "b", "out", NULL}; + + if (!PyArg_ParseTupleAndKeywords(args, kwds, "O&O&|O&", kwlist, + &PyArray_Converter, &a, + &PyArray_Converter, &b, + &PyArray_OutputConverter, &out)) { + return NULL; + } + + return (PyObject *)SpecialDivide(a, b, out); + } + + static PyMethodDef SpDivMethods[] = { + {"spdiv", (PyCFunction)spdiv, METH_VARARGS | METH_KEYWORDS, NULL}, + {NULL, NULL, 0, NULL} + }; + + + PyMODINIT_FUNC initspdiv_mod(void) + { + PyObject *m; + + m = Py_InitModule("spdiv_mod", SpDivMethods); + if (m == NULL) { + return; + } + + /* Make sure NumPy is initialized */ + import_array(); + } + +Create a setup.py file like:: + + #!/usr/bin/env python + def configuration(parent_package='',top_path=None): + from numpy.distutils.misc_util import Configuration + config = Configuration('.',parent_package,top_path) + config.add_extension('spdiv_mod',['spdiv_mod.c']) + return config + + if __name__ == "__main__": + from numpy.distutils.core import setup + setup(configuration=configuration) + +With these two files in a directory by itself, run:: + + $ python setup.py build_ext --inplace + +and the file spdiv_mod.so (or .dll) will be placed in the same directory. +Now you can try out this sample, to see how it behaves.:: + + >>> import numpy as np + >>> from spdiv_mod import spdiv + +It always produces a masked output array. To get the NumPy scalar returning +behavior, replace the return in spdiv() with +"PyArray_Return(SpecialDivide(a, b, out)".:: + + >>> spdiv(1,2) + array(0.5, maskna=True) + +Here we can see how NAs propagate, and how 0 in the output turns into NA +as desired.:: + + >>> a = np.arange(6) + >>> b = np.array([0,np.NA,0,2,1,0]) + >>> spdiv(a,b) + array([ NA, NA, NA, 1.5, 4. , NA]) + +Finally, we can see the masking behavior by creating a masked +view of an array. The ones in *c_orig* are preserved whereever +NA got assigned.:: + + >>> c_orig = np.ones(6) + >>> c = c_orig.view(maskna=True) + >>> spdiv(a,b,out=c) + array([ NA, NA, NA, 1.5, 4. , NA]) + >>> c_orig + array([ 1. , 1. , 1. , 1.5, 4. , 1. ]) + NA Object Data Type ------------------- -.. ctype:: NpyNA * +.. ctype:: NpyNA This is the C object corresponding to objects of type numpy.NAType. The fields themselves are hidden from consumers of the @@ -80,7 +412,7 @@ NA Object Functions .. cfunction:: NpyNA_Check(obj) - Evaluates to true if *obj* is an instance of :ctype:`NpyNA *`. + Evaluates to true if *obj* is an instance of :ctype:`NpyNA`. .. cfunction:: PyArray_Descr* NpyNA_GetDType(NpyNA* na) @@ -99,9 +431,9 @@ NA Object Functions .. cfunction:: NpyNA* NpyNA_FromObject(PyObject* obj, int suppress_error) If *obj* represents an object which is NA, for example if it - is an :ctype:`NpyNA *`, or a zero-dimensional NA-masked array with + is an :ctype:`NpyNA`, or a zero-dimensional NA-masked array with its value hidden by the mask, returns a new reference to an - :ctype:`NpyNA *` object representing *obj*. Otherwise returns + :ctype:`NpyNA` object representing *obj*. Otherwise returns NULL. If *suppress_error* is true, this function doesn't raise an exception @@ -110,7 +442,7 @@ NA Object Functions .. cfunction:: NpyNA* NpyNA_FromDTypeAndPayload(PyArray_Descr *dtype, int multina, int payload) - Constructs a new :ctype:`NpyNA *` instance with the specified *dtype* + Constructs a new :ctype:`NpyNA` instance with the specified *dtype* and *payload*. For an NA with no dtype, provide NULL in *dtype*. Until multi-NA is implemented, just pass 0 for both *multina* @@ -119,6 +451,42 @@ NA Object Functions NA Mask Functions ----------------- +A mask dtype can be one of three different possibilities. It can +be :cdata:`NPY_BOOL`, :cdata:`NPY_MASK`, or a struct dtype whose +fields are all mask dtypes. + +A mask of :cdata:`NPY_BOOL` can just indicate True, with underlying +value 1, for an element that is exposed, and False, with underlying +value 0, for an element that is hidden. + +A mask of :cdata:`NPY_MASK` can additionally carry a payload which +is a value from 0 to 127. This allows for missing data implementations +based on such masks to support multiple reasons for data being missing. + +A mask of a struct dtype can only pair up with another struct dtype +with the same field names. In this way, each field of the mask controls +the masking for the corresponding field in the associated data array. + +Inline functions to work with masks are as follows. + +.. cfunction:: npy_bool NpyMaskValue_IsExposed(npy_mask mask) + + Returns true if the data element corresponding to the mask element + can be modified, false if not. + +.. cfunction:: npy_uint8 NpyMaskValue_GetPayload(npy_mask mask) + + Returns the payload contained in the mask. The return value + is between 0 and 127. + +.. cfunction:: npy_mask NpyMaskValue_Create(npy_bool exposed, npy_int8 payload) + + Creates a mask from a flag indicating whether the element is exposed + or not and a payload value. + +NA Mask Array Functions +----------------------- + .. cfunction:: int PyArray_AllocateMaskNA(PyArrayObject *arr, npy_bool ownmaskna, npy_bool multina, npy_mask defaultmask) Allocates an NA mask for the array *arr* if necessary. If *ownmaskna* -- cgit v1.2.1