Diffstat (limited to 'doc/source/reference/c-api.maskna.rst')
-rw-r--r-- doc/source/reference/c-api.maskna.rst 592
1 files changed, 592 insertions, 0 deletions
diff --git a/doc/source/reference/c-api.maskna.rst b/doc/source/reference/c-api.maskna.rst
new file mode 100644
index 000000000..6abb624eb
--- /dev/null
+++ b/doc/source/reference/c-api.maskna.rst
@@ -0,0 +1,592 @@
+Array NA Mask API
+==================
+
+.. sectionauthor:: Mark Wiebe
+
+.. index::
+ pair: maskna; C-API
+ pair: C-API; maskna
+
+.. versionadded:: 1.7
+
+NA Masks in Arrays
+------------------
+
+NumPy supports the idea of NA (Not Available) missing values in its
+arrays. In the design document leading up to the implementation, two
+mechanisms for this were proposed, NA masks and NA bitpatterns. NA masks
+have been implemented as the first representation of these values. This
+mechanism supports working with NA values similar to what the R language
+provides, and when combined with views, allows one to temporarily mark
+elements as NA without affecting the original data.
+
+The C API has been updated with mechanisms to allow NumPy extensions
+to work with these masks, and this document provides some examples and
+reference for the NA mask-related functions.
+
+The NA Object
+-------------
+
+The main *numpy* namespace in Python has a new object called *NA*.
+This is an instance of :ctype:`NpyNA`, which is a Python object
+representing an NA value. This object is analogous to the NumPy
+scalars, and is returned by :cfunc:`PyArray_Return` instead of
+a scalar where appropriate.
+
+The global *numpy.NA* object is accessible from C as :cdata:`Npy_NA`.
+This is an NA value with no data type or multi-NA payload. Use it
+just as you would Py_None, except use :cfunc:`NpyNA_Check` to
+see if an object is an :ctype:`NpyNA`, because :cdata:`Npy_NA` isn't
+the only instance of NA possible.
+
+If you want to see whether a general PyObject* is NA, you should
+use the API function :cfunc:`NpyNA_FromObject` with *suppress_error*
+set to true. If this returns NULL, the object is not an NA, and if
+it returns an NpyNA instance, the object is NA and you can then
+access its *dtype* and *payload* fields as needed.
+
+To make new :ctype:`NpyNA` objects, use
+:cfunc:`NpyNA_FromDTypeAndPayload`. The functions
+:cfunc:`NpyNA_GetDType`, :cfunc:`NpyNA_IsMultiNA`, and
+:cfunc:`NpyNA_GetPayload` provide access to the data members.
+
+Working With NA-Masked Arrays
+-----------------------------
+
+The starting point for many C-API functions which manipulate NumPy
+arrays is the function :cfunc:`PyArray_FromAny`. This function converts
+a general PyObject* object into a NumPy ndarray, based on options
+specified in the flags. To avoid surprises, this function does
+not allow NA-masked arrays to pass through by default.
+
+To allow third-party code to work with NA-masked arrays which contain
+no NAs, :cfunc:`PyArray_FromAny` will copy such an array into
+a new array without an NA-mask, and return that. This keeps existing
+code interoperable where that is possible, until functions
+are updated to provide optimal code paths for NA-masked arrays.
+
+To update a function with NA-mask support, add the flag
+:cdata:`NPY_ARRAY_ALLOWNA` when calling :cfunc:`PyArray_FromAny`.
+This allows NA-masked arrays to pass through untouched, and will
+convert PyObject lists containing NA values into NA-masked arrays
+instead of the alternative of switching to object arrays.
+
+To check whether an array has an NA-mask, use the function
+:cfunc:`PyArray_HASMASKNA`, which checks the appropriate flag.
+There are a number of things that one will typically want to do
+when encountering an NA-masked array. We'll go through a few
+of these cases.
+
+Forbidding Any NA Values
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The simplest case is to forbid any NA values. Note that it is better
+to still be aware of the NA mask and explicitly test for NA values
+than to leave out the :cdata:`NPY_ARRAY_ALLOWNA`, because it is possible
+to avoid the extra copy that :cfunc:`PyArray_FromAny` will make. The
+check for NAs will go something like this::
+
+    PyArrayObject *arr = ...;
+    int containsna;
+
+    /* ContainsNA checks HASMASKNA() for you */
+    containsna = PyArray_ContainsNA(arr, NULL, NULL);
+    /* Error case */
+    if (containsna < 0) {
+        return NULL;
+    }
+    /* If it found an NA */
+    else if (containsna) {
+        PyErr_SetString(PyExc_ValueError,
+                "this operation does not support arrays with NA values");
+        return NULL;
+    }
+
+After this check, you can be certain that the array doesn't contain any
+NA values, and can proceed accordingly. For example, if you iterate
+over the elements of the array, you may pass the flag
+:cdata:`NPY_ITER_IGNORE_MASKNA` to iterate over the data without
+touching the NA-mask at all.
+
+Manipulating NA Values
+~~~~~~~~~~~~~~~~~~~~~~
+
+The semantics of the NA-mask demand that whenever an array element
+is hidden by the NA-mask, no computations are permitted to modify
+the data backing that element. The :ctype:`NpyIter` provides
+a number of flags to assist with visiting both the array data
+and the mask data simultaneously, and preserving the masking semantics
+even when buffering is required.
+
+The main flag for iterating over NA-masked arrays is
+:cdata:`NPY_ITER_USE_MASKNA`. For each iterator operand which has this
+flag specified, a new operand is added to the end of the iterator operand
+list, and is set to iterate over the original operand's NA-mask. Operands
+which do not have an NA mask are permitted as well when they are flagged
+as read-only. The new operand in this case points to a single exposed
+mask value and all its strides are zero. The latter feature is useful
+when combining multiple read-only inputs, where some of them have masks.
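+
+The zero-stride case can be pictured with a small, self-contained sketch in
+plain C. It does not use the NumPy headers: the helper name
+``sketch_count_exposed`` and the one-byte-per-element mask (nonzero meaning
+exposed) are assumptions made for illustration. An operand without an NA
+mask is modelled by pointing at a single exposed byte with a stride of
+zero, so the same loop serves masked and unmasked inputs.
+

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of NumPy: count how many of `count`
 * elements are exposed, walking the mask with a byte stride the same
 * way the inner loop of an iterator would.
 */
static size_t
sketch_count_exposed(const char *mask, ptrdiff_t mask_stride, size_t count)
{
    size_t i, exposed = 0;

    for (i = 0; i < count; ++i) {
        if (*mask != 0) {
            ++exposed;
        }
        /* A stride of zero keeps re-reading the same mask byte */
        mask += mask_stride;
    }
    return exposed;
}
```

+
+Calling this with a real four-element mask and a stride of 1 visits each
+mask byte in turn, while calling it with the address of a single exposed
+byte and a stride of 0 reports every element as exposed.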
+
+Accumulating NA Values
+~~~~~~~~~~~~~~~~~~~~~~
+
+More complex operations, like the NumPy ufunc reduce functions, need
+to take extra care to follow the masking semantics. If we accumulate
+the NA mask and the data values together, we could discover halfway
+through that the output is NA, and that we have violated the contract
+to never change the underlying output value when it is being assigned
+NA.
+
+The solution to this problem is to first accumulate the NA-mask as necessary
+to produce the output's NA-mask, then accumulate the data values without
+touching NA-masked values in the output. The parameter *preservena* in
+functions like :cfunc:`PyArray_AssignArray` can assist when initializing
+values in such an algorithm.
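+
+The two-pass idea can be sketched without any NumPy machinery. The
+following plain C function is only an illustration under stated
+assumptions: the name ``sketch_masked_column_sum`` is hypothetical, the
+arrays are contiguous, and masks hold one byte per element with nonzero
+meaning exposed. It sums a ``rows`` x ``cols`` array down its rows,
+deciding the output mask first and touching output data afterwards.
+

```c
#include <stddef.h>

/*
 * Hypothetical sketch, not NumPy code: sum a rows x cols array down its
 * rows. Masks hold one byte per element, nonzero meaning exposed.
 */
static void
sketch_masked_column_sum(const double *data, const char *mask,
                         size_t rows, size_t cols,
                         double *out, char *out_mask)
{
    size_t i, j;

    /* Pass 1: an output element is exposed only if all its inputs are */
    for (j = 0; j < cols; ++j) {
        out_mask[j] = 1;
    }
    for (i = 0; i < rows; ++i) {
        for (j = 0; j < cols; ++j) {
            if (mask[i * cols + j] == 0) {
                out_mask[j] = 0;
            }
        }
    }

    /* Pass 2: initialize and accumulate only the exposed outputs */
    for (j = 0; j < cols; ++j) {
        if (out_mask[j] != 0) {
            out[j] = 0.0;
        }
    }
    for (i = 0; i < rows; ++i) {
        for (j = 0; j < cols; ++j) {
            if (out_mask[j] != 0) {
                out[j] += data[i * cols + j];
            }
        }
    }
}
```

+
+Because the output mask is final before any data is written, an output
+element that ends up NA keeps whatever value it held before the call.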
+
+Example NA-Masked Operation in C
+--------------------------------
+
+As an example, let's implement a simple binary NA-masked operation
+for the double dtype. We'll make a divide operation which turns
+divide by zero into NA instead of Inf or NaN.
+
+To start, we define the function prototype and some basic
+:ctype:`NpyIter` boilerplate setup. We'll make a function which
+supports an optional *out* parameter, which may be NULL::
+
+    static PyArrayObject*
+    SpecialDivide(PyArrayObject* a, PyArrayObject* b, PyArrayObject *out)
+    {
+        NpyIter *iter = NULL;
+        PyArrayObject *op[3];
+        PyArray_Descr *dtypes[3];
+        npy_uint32 flags, op_flags[3];
+
+        /* Iterator construction parameters */
+        op[0] = a;
+        op[1] = b;
+        op[2] = out;
+
+        dtypes[0] = PyArray_DescrFromType(NPY_DOUBLE);
+        if (dtypes[0] == NULL) {
+            return NULL;
+        }
+        dtypes[1] = dtypes[0];
+        dtypes[2] = dtypes[0];
+
+        flags = NPY_ITER_BUFFERED |
+                NPY_ITER_EXTERNAL_LOOP |
+                NPY_ITER_GROWINNER |
+                NPY_ITER_REFS_OK |
+                NPY_ITER_ZEROSIZE_OK;
+
+        /* Every operand gets the flag NPY_ITER_USE_MASKNA */
+        op_flags[0] = NPY_ITER_READONLY |
+                      NPY_ITER_ALIGNED |
+                      NPY_ITER_USE_MASKNA;
+        op_flags[1] = op_flags[0];
+        op_flags[2] = NPY_ITER_WRITEONLY |
+                      NPY_ITER_ALIGNED |
+                      NPY_ITER_USE_MASKNA |
+                      NPY_ITER_NO_BROADCAST |
+                      NPY_ITER_ALLOCATE;
+
+        iter = NpyIter_MultiNew(3, op, flags, NPY_KEEPORDER,
+                            NPY_SAME_KIND_CASTING, op_flags, dtypes);
+        /* Don't need the dtype reference anymore */
+        Py_DECREF(dtypes[0]);
+        if (iter == NULL) {
+            return NULL;
+        }
+
+At this point, the input operands have been validated according to
+the casting rule, the shapes of the arrays have been broadcast together,
+and any buffering necessary has been prepared. This means we can
+dive into the inner loop of this function::
+
+        ...
+        if (NpyIter_GetIterSize(iter) > 0) {
+            NpyIter_IterNextFunc *iternext;
+            char **dataptr;
+            npy_intp *stridesptr, *countptr;
+
+            /* Variables needed for looping */
+            iternext = NpyIter_GetIterNext(iter, NULL);
+            if (iternext == NULL) {
+                NpyIter_Deallocate(iter);
+                return NULL;
+            }
+            dataptr = NpyIter_GetDataPtrArray(iter);
+            stridesptr = NpyIter_GetInnerStrideArray(iter);
+            countptr = NpyIter_GetInnerLoopSizePtr(iter);
+
+The loop gets a bit messy when dealing with NA-masks, because it
+doubles the number of operands being processed in the iterator. Here
+we are naming things clearly so that the content of the innermost loop
+can be easy to work with::
+
+            ...
+            do {
+                /* Data pointers and strides needed for innermost loop */
+                char *data_a = dataptr[0], *data_b = dataptr[1];
+                char *data_out = dataptr[2];
+                char *maskna_a = dataptr[3], *maskna_b = dataptr[4];
+                char *maskna_out = dataptr[5];
+                npy_intp stride_a = stridesptr[0], stride_b = stridesptr[1];
+                npy_intp stride_out = stridesptr[2];
+                npy_intp maskna_stride_a = stridesptr[3];
+                npy_intp maskna_stride_b = stridesptr[4];
+                npy_intp maskna_stride_out = stridesptr[5];
+                npy_intp i, count = *countptr;
+
+                for (i = 0; i < count; ++i) {
+
+Here is the code for performing one special division. We use
+the functions :cfunc:`NpyMaskValue_IsExposed` and
+:cfunc:`NpyMaskValue_Create` to work with the masks, in order to be
+as general as possible. These are inline functions, and the compiler
+optimizer should be able to produce the same result as if you performed
+these operations directly inline here::
+
+                    ...
+                    /* If neither of the inputs are NA */
+                    if (NpyMaskValue_IsExposed((npy_mask)*maskna_a) &&
+                            NpyMaskValue_IsExposed((npy_mask)*maskna_b)) {
+                        double a_val = *(double *)data_a;
+                        double b_val = *(double *)data_b;
+                        /* Do the divide if 'b' isn't zero */
+                        if (b_val != 0.0) {
+                            *(double *)data_out = a_val / b_val;
+                            /* Need to also set this element to exposed */
+                            *maskna_out = NpyMaskValue_Create(1, 0);
+                        }
+                        /* Otherwise output an NA without touching its data */
+                        else {
+                            *maskna_out = NpyMaskValue_Create(0, 0);
+                        }
+                    }
+                    /* Turn the output into NA without touching its data */
+                    else {
+                        *maskna_out = NpyMaskValue_Create(0, 0);
+                    }
+
+                    data_a += stride_a;
+                    data_b += stride_b;
+                    data_out += stride_out;
+                    maskna_a += maskna_stride_a;
+                    maskna_b += maskna_stride_b;
+                    maskna_out += maskna_stride_out;
+                }
+            } while (iternext(iter));
+        }
+
+A little bit more boilerplate for returning the result from the iterator,
+and the function is done::
+
+        ...
+        if (out == NULL) {
+            out = NpyIter_GetOperandArray(iter)[2];
+        }
+        Py_INCREF(out);
+        NpyIter_Deallocate(iter);
+
+        return out;
+    }
+
+To run this example, you can create a simple module with a C-file spdiv_mod.c
+consisting of::
+
+    #include <Python.h>
+    #include <numpy/arrayobject.h>
+
+    /* INSERT SpecialDivide source code here */
+
+    static PyObject *
+    spdiv(PyObject *self, PyObject *args, PyObject *kwds)
+    {
+        PyArrayObject *a, *b, *out = NULL;
+        static char *kwlist[] = {"a", "b", "out", NULL};
+
+        if (!PyArg_ParseTupleAndKeywords(args, kwds, "O&O&|O&", kwlist,
+                                &PyArray_AllowNAConverter, &a,
+                                &PyArray_AllowNAConverter, &b,
+                                &PyArray_OutputAllowNAConverter, &out)) {
+            return NULL;
+        }
+
+        /*
+         * The usual NumPy way is to only use PyArray_Return when
+         * the 'out' parameter is not provided.
+         */
+        if (out == NULL) {
+            return PyArray_Return(SpecialDivide(a, b, out));
+        }
+        else {
+            return (PyObject *)SpecialDivide(a, b, out);
+        }
+    }
+
+    static PyMethodDef SpDivMethods[] = {
+        {"spdiv", (PyCFunction)spdiv, METH_VARARGS | METH_KEYWORDS, NULL},
+        {NULL, NULL, 0, NULL}
+    };
+
+
+    PyMODINIT_FUNC initspdiv_mod(void)
+    {
+        PyObject *m;
+
+        m = Py_InitModule("spdiv_mod", SpDivMethods);
+        if (m == NULL) {
+            return;
+        }
+
+        /* Make sure NumPy is initialized */
+        import_array();
+    }
+
+Create a setup.py file like::
+
+    #!/usr/bin/env python
+    def configuration(parent_package='',top_path=None):
+        from numpy.distutils.misc_util import Configuration
+        config = Configuration('.',parent_package,top_path)
+        config.add_extension('spdiv_mod',['spdiv_mod.c'])
+        return config
+
+    if __name__ == "__main__":
+        from numpy.distutils.core import setup
+        setup(configuration=configuration)
+
+With these two files in a directory by themselves, run::
+
+ $ python setup.py build_ext --inplace
+
+and the file spdiv_mod.so (or .dll) will be placed in the same directory.
+Now you can try out this sample, to see how it behaves::
+
+ >>> import numpy as np
+ >>> from spdiv_mod import spdiv
+
+Because we used :cfunc:`PyArray_Return` when wrapping SpecialDivide,
+it returns scalars like any typical NumPy function does::
+
+ >>> spdiv(1, 2)
+ 0.5
+ >>> spdiv(2, 0)
+ NA(dtype='float64')
+ >>> spdiv(np.NA, 1.5)
+ NA(dtype='float64')
+
+Here we can see how NAs propagate, and how division by zero turns into NA
+as desired::
+
+ >>> a = np.arange(6)
+ >>> b = np.array([0,np.NA,0,2,1,0])
+ >>> spdiv(a, b)
+ array([ NA, NA, NA, 1.5, 4. , NA])
+
+Finally, we can see the masking behavior by creating a masked
+view of an array. The ones in *c_orig* are preserved wherever
+NA got assigned::
+
+ >>> c_orig = np.ones(6)
+ >>> c = c_orig.view(maskna=True)
+ >>> spdiv(a, b, out=c)
+ array([ NA, NA, NA, 1.5, 4. , NA])
+ >>> c_orig
+ array([ 1. , 1. , 1. , 1.5, 4. , 1. ])
+
+NA Object Data Type
+-------------------
+
+.. ctype:: NpyNA
+
+ This is the C object corresponding to objects of type
+ numpy.NAType. The fields themselves are hidden from consumers of the
+ API; you must use the functions provided to create new NA objects
+ and get their properties.
+
+ This object contains two fields, a :ctype:`PyArray_Descr *` dtype
+ which is either NULL or indicates the data type the NA represents,
+ and a payload which is there for the future addition of multi-NA support.
+
+.. cvar:: Npy_NA
+
+ This is a global singleton, similar to Py_None, which is the
+ *numpy.NA* object. Note that unlike Py_None, multiple NAs may be
+ created, for instance with different multi-NA payloads or with
+ different dtypes. If you want to return an NA with no payload
+ or dtype, return a new reference to Npy_NA.
+
+NA Object Functions
+-------------------
+
+.. cfunction:: NpyNA_Check(obj)
+
+ Evaluates to true if *obj* is an instance of :ctype:`NpyNA`.
+
+.. cfunction:: PyArray_Descr* NpyNA_GetDType(NpyNA* na)
+
+ Returns the *dtype* field of the NA object, which is NULL when
+ the NA has no dtype. Does not raise an error.
+
+.. cfunction:: npy_bool NpyNA_IsMultiNA(NpyNA* na)
+
+ Returns true if the NA has a multi-NA payload, false otherwise.
+
+.. cfunction:: int NpyNA_GetPayload(NpyNA* na)
+
+ Gets the multi-NA payload of the NA, or 0 if *na* doesn't have
+ a multi-NA payload.
+
+.. cfunction:: NpyNA* NpyNA_FromObject(PyObject* obj, int suppress_error)
+
+ If *obj* represents an object which is NA, for example if it
+ is an :ctype:`NpyNA`, or a zero-dimensional NA-masked array with
+ its value hidden by the mask, returns a new reference to an
+ :ctype:`NpyNA` object representing *obj*. Otherwise returns
+ NULL.
+
+ If *suppress_error* is true, this function doesn't raise an exception
+ when the input isn't NA and it returns NULL, otherwise it does.
+
+.. cfunction:: NpyNA* NpyNA_FromDTypeAndPayload(PyArray_Descr *dtype, int multina, int payload)
+
+
+ Constructs a new :ctype:`NpyNA` instance with the specified *dtype*
+ and *payload*. For an NA with no dtype, provide NULL in *dtype*.
+
+ Until multi-NA is implemented, just pass 0 for both *multina*
+ and *payload*.
+
+NA Mask Functions
+-----------------
+
+A mask dtype can be one of three different possibilities. It can
+be :cdata:`NPY_BOOL`, :cdata:`NPY_MASK`, or a struct dtype whose
+fields are all mask dtypes.
+
+A mask of :cdata:`NPY_BOOL` can just indicate True, with underlying
+value 1, for an element that is exposed, and False, with underlying
+value 0, for an element that is hidden.
+
+A mask of :cdata:`NPY_MASK` can additionally carry a payload which
+is a value from 0 to 127. This allows for missing data implementations
+based on such masks to support multiple reasons for data being missing.
+
+A mask of a struct dtype can only pair up with another struct dtype
+with the same field names. In this way, each field of the mask controls
+the masking for the corresponding field in the associated data array.
+
+Inline functions to work with masks are as follows.
+
+.. cfunction:: npy_bool NpyMaskValue_IsExposed(npy_mask mask)
+
+ Returns true if the data element corresponding to the mask element
+ can be modified, false if not.
+
+.. cfunction:: npy_uint8 NpyMaskValue_GetPayload(npy_mask mask)
+
+ Returns the payload contained in the mask. The return value
+ is between 0 and 127.
+
+.. cfunction:: npy_mask NpyMaskValue_Create(npy_bool exposed, npy_int8 payload)
+
+ Creates a mask from a flag indicating whether the element is exposed
+ or not and a payload value.
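+
+To make the exposed flag and the 0 to 127 payload concrete, here is a
+self-contained sketch of one way such a mask byte could be packed. The
+layout (bit 0 for the exposed flag, bits 1 to 7 for the payload) and the
+``sketch_`` names are assumptions for illustration, not the definitions
+from the NumPy headers.
+

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t sketch_mask;   /* stand-in for npy_mask */

/* Assumed layout: bit 0 = exposed flag, bits 1..7 = payload (0..127) */
static sketch_mask
sketch_mask_create(int exposed, unsigned payload)
{
    assert(payload <= 127);
    return (sketch_mask)((exposed ? 1u : 0u) | (payload << 1));
}

/* Nonzero when the data element may be read and modified */
static int
sketch_mask_is_exposed(sketch_mask mask)
{
    return (mask & 0x01u) != 0;
}

/* Recover the payload stored in the upper seven bits */
static unsigned
sketch_mask_payload(sketch_mask mask)
{
    return (unsigned)(mask >> 1);
}
```

+
+With this layout, a payload of zero gives the byte values 1 for exposed
+and 0 for hidden, which matches the True and False values described for
+an :cdata:`NPY_BOOL` mask above.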
+
+NA Mask Array Functions
+-----------------------
+
+.. cfunction:: int PyArray_AllocateMaskNA(PyArrayObject *arr, npy_bool ownmaskna, npy_bool multina, npy_mask defaultmask)
+
+ Allocates an NA mask for the array *arr* if necessary. If *ownmaskna*
+ is false, it only allocates an NA mask if none exists, but if
+ *ownmaskna* is true, it also allocates one if the NA mask is a view
+ into another array's NA mask. Here are the two most common usage
+ patterns::
+
+    /* Use this to make sure 'arr' has an NA mask */
+    if (PyArray_AllocateMaskNA(arr, 0, 0, 1) < 0) {
+        return NULL;
+    }
+
+    /* Use this to make sure 'arr' owns an NA mask */
+    if (PyArray_AllocateMaskNA(arr, 1, 0, 1) < 0) {
+        return NULL;
+    }
+
+ The parameter *multina* is provided for future expansion, when
+ multi-NA support is added to NumPy. This will affect the dtype of
+ the NA mask, which currently must always be NPY_BOOL, but will be
+ NPY_MASK for multi-NA arrays when this is implemented.
+
+ When a new NA mask is allocated, and the mask needs to be filled,
+ it uses the value *defaultmask*. In nearly all cases, this should be set
+ to 1, indicating that the elements are exposed. If a mask is allocated
+ just because of *ownmaskna*, the existing mask values are copied
+ into the newly allocated mask.
+
+ This function returns 0 for success, -1 for failure.
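+
+ The difference between having a mask and owning one can be modelled with
+ a toy structure in plain C. Everything here is hypothetical: the
+ ``sketch_array`` type and ``sketch_allocate_mask`` are not NumPy
+ definitions, and the *multina* parameter is left out. The sketch only
+ mirrors the rules described above for when a mask is allocated, filled
+ with the default, or copied from a borrowed view.
+

```c
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an array's NA-mask state; not a NumPy structure */
typedef struct {
    char *mask;      /* NULL when the array has no NA mask */
    int owns_mask;   /* nonzero when `mask` was allocated for this array */
    size_t size;     /* number of elements */
} sketch_array;

/* Returns 0 for success, -1 for failure */
static int
sketch_allocate_mask(sketch_array *arr, int ownmaskna, char defaultmask)
{
    char *newmask;

    if (arr->mask != NULL && (arr->owns_mask || !ownmaskna)) {
        /* Already has a suitable mask, nothing to do */
        return 0;
    }
    newmask = (char *)malloc(arr->size > 0 ? arr->size : 1);
    if (newmask == NULL) {
        return -1;
    }
    if (arr->mask == NULL) {
        /* Brand-new mask: fill with the default value */
        memset(newmask, defaultmask, arr->size);
    }
    else {
        /* Replacing a borrowed view: keep the existing mask values */
        memcpy(newmask, arr->mask, arr->size);
    }
    arr->mask = newmask;
    arr->owns_mask = 1;
    return 0;
}
```

+
+ In this model, requesting ownership of a mask that is a view into
+ another array's mask produces a private copy with the same values, so
+ later changes no longer affect the other array.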
+
+.. cfunction:: npy_bool PyArray_HasNASupport(PyArrayObject *arr)
+
+ Returns true if *arr* is an array which supports NA. This function
+ exists because the design for adding NA proposed two mechanisms
+ for NAs in NumPy, NA masks and NA bitpatterns. Currently, just
+ NA masks have been implemented, but when NA bitpatterns are implemented
+ this would return true for arrays with an NA bitpattern dtype as well.
+
+.. cfunction:: int PyArray_ContainsNA(PyArrayObject *arr, PyArrayObject *wheremask, npy_bool *whichna)
+
+ Checks whether the array *arr* contains any NA values.
+
+ If *wheremask* is non-NULL, it must be an NPY_BOOL mask which can
+ broadcast onto *arr*. Wherever the where mask is True, *arr*
+ is checked for NA, and wherever it is False, the *arr* value is
+ ignored.
+
+ The parameter *whichna* is provided for future expansion to multi-NA
+ support. When implemented, this parameter will be a 128 element
+ array of npy_bool, with the value True for the NA values that are
+ being looked for.
+
+ This function returns 1 when the array contains NA values, 0 when
+ it does not, and -1 when an error has occurred.
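+
+ The effect of *wheremask* can be illustrated with a plain C model of the
+ check. This is a sketch under simplifying assumptions: the name
+ ``sketch_contains_na`` is hypothetical, both masks are contiguous with
+ one byte per element, and no broadcasting or error reporting is
+ modelled.
+

```c
#include <stddef.h>

/*
 * Hypothetical model, not NumPy code. `mask` holds one byte per element
 * (nonzero = exposed); `where` may be NULL or one byte per element
 * (nonzero = check this element). Returns 1 if an NA is found, else 0.
 */
static int
sketch_contains_na(const char *mask, const char *where, size_t count)
{
    size_t i;

    for (i = 0; i < count; ++i) {
        if (where != NULL && where[i] == 0) {
            continue;   /* element is ignored by the where mask */
        }
        if (mask[i] == 0) {
            return 1;
        }
    }
    return 0;
}
```

+
+ An array whose only NA sits at a position excluded by the where mask is
+ therefore reported as containing no NA values.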
+
+.. cfunction:: int PyArray_AssignNA(PyArrayObject *arr, NpyNA *na, PyArrayObject *wheremask, npy_bool preservena, npy_bool *preservewhichna)
+
+ Assigns the given *na* value to elements of *arr*.
+
+ If *wheremask* is non-NULL, it must be an NPY_BOOL array broadcastable
+ onto *arr*, and only elements of *arr* with a corresponding value
+ of True in *wheremask* will have *na* assigned.
+
+ The parameters *preservena* and *preservewhichna* are provided for
+ future expansion to multi-NA support. With a single NA value, one
+ NA cannot be distinguished from another, so preserving NA values
+ does not make sense. With multiple NA values, preserving NA values
+ becomes an important concept because that implies not overwriting the
+ multi-NA payloads. The parameter *preservewhichna* will be a 128 element
+ array of npy_bool, indicating which NA payloads to preserve.
+
+ This function returns 0 for success, -1 for failure.
+
+.. cfunction:: int PyArray_AssignMaskNA(PyArrayObject *arr, npy_mask maskvalue, PyArrayObject *wheremask, npy_bool preservena, npy_bool *preservewhichna)
+
+ Assigns the given NA mask *maskvalue* to elements of *arr*.
+
+ If *wheremask* is non-NULL, it must be an NPY_BOOL array broadcastable
+ onto *arr*, and only elements of *arr* with a corresponding value
+ of True in *wheremask* will have the NA *maskvalue* assigned.
+
+ The parameters *preservena* and *preservewhichna* are provided for
+ future expansion to multi-NA support. With a single NA value, one
+ NA cannot be distinguished from another, so preserving NA values
+ does not make sense. With multiple NA values, preserving NA values
+ becomes an important concept because that implies not overwriting the
+ multi-NA payloads. The parameter *preservewhichna* will be a 128 element
+ array of npy_bool, indicating which NA payloads to preserve.
+
+ This function returns 0 for success, -1 for failure.