2 files changed, 1102 insertions, 0 deletions
diff --git a/doc/source/reference/c-api.iterator.rst b/doc/source/reference/c-api.iterator.rst
new file mode 100644
index 000000000..adb6f6081
--- /dev/null
+++ b/doc/source/reference/c-api.iterator.rst
@@ -0,0 +1,1101 @@
+Array Iterator API
+==================
+
+.. sectionauthor:: Mark Wiebe
+
+.. versionadded:: 1.6
+
+Array Iterator
+--------------
+
+The array iterator encapsulates many of the key features in ufuncs,
+allowing user code to support features like output parameters,
+preservation of memory layouts, and buffering of data with the wrong
+alignment or type, without requiring difficult coding.
+
+This page documents the API for the iterator.
+The C-API naming convention chosen is based on the one in the numpy-refactor
+branch, so it will integrate naturally into the refactored code base.
+The iterator is named ``NpyIter`` and functions are
+named ``NpyIter_*``.
+
+Converting from Previous NumPy Iterators
+----------------------------------------
+
+The existing iterator API includes functions like ``PyArrayIter_Check``,
+``PyArray_Iter*`` and ``PyArray_ITER_*``.  The multi-iterator API includes
+``PyArray_MultiIter*``, ``PyArray_Broadcast``, and ``PyArray_RemoveSmallest``.  The
+new iterator design replaces all of this functionality with a single object
+and associated API.  One goal of the new API is that all uses of the
+existing iterator should be replaceable with the new iterator without
+significant effort. In 1.6, the major exception to this is the neighborhood
+iterator, which does not have corresponding features in this iterator.
+
+Here is a conversion table for the regular iterator:
+
+===============================  =============================================
+``PyArray_IterNew``              ``NpyIter_New``
+``PyArray_IterAllButAxis``       ``NpyIter_New`` + ``axes`` parameter **or**
+                                 Iterator flag ``NPY_ITER_NO_INNER_ITERATION``
+``PyArray_BroadcastToShape``     **NOT SUPPORTED** (Use the support for
+                                 multiple operands instead.)
+``PyArrayIter_Check``            Will need to add this in Python exposure
+``PyArray_ITER_RESET``           ``NpyIter_Reset``
+``PyArray_ITER_NEXT``            Function pointer from ``NpyIter_GetIterNext``
+``PyArray_ITER_DATA``            ``NpyIter_GetDataPtrArray``
+``PyArray_ITER_GOTO``            ``NpyIter_GotoCoords``
+``PyArray_ITER_GOTO1D``          ``NpyIter_GotoIndex`` or
+                                 ``NpyIter_GotoIterIndex``
+``PyArray_ITER_NOTDONE``         Return value of ``iternext`` function pointer
+===============================  =============================================
+
+For the multi-iterator:
+
+===============================  =============================================
+``PyArray_MultiIterNew``         ``NpyIter_MultiNew``
+``PyArray_MultiIter_RESET``      ``NpyIter_Reset``
+``PyArray_MultiIter_NEXT``       Function pointer from ``NpyIter_GetIterNext``
+``PyArray_MultiIter_DATA``       ``NpyIter_GetDataPtrArray``
+``PyArray_MultiIter_NEXTi``      **NOT SUPPORTED** (always lock-step iteration)
+``PyArray_MultiIter_GOTO``       ``NpyIter_GotoCoords``
+``PyArray_MultiIter_GOTO1D``     ``NpyIter_GotoIndex`` or
+                                 ``NpyIter_GotoIterIndex``
+``PyArray_MultiIter_NOTDONE``    Return value of ``iternext`` function pointer
+``PyArray_Broadcast``            Handled by ``NpyIter_MultiNew``
+``PyArray_RemoveSmallest``       Iterator flag ``NPY_ITER_NO_INNER_ITERATION``
+===============================  =============================================
+
+For other API calls:
+
+===============================  =============================================
+``PyArray_ConvertToCommonType``  Iterator flag ``NPY_ITER_COMMON_DTYPE``
+===============================  =============================================
+
+Simple Iteration Example
+------------------------
+
+The best way to become familiar with the iterator is to look at its
+usage within the NumPy codebase itself. For example, here is a slightly
+tweaked version of the code for ``PyArray_CountNonzero``, which counts the
+number of non-zero elements in an array.
+
+.. code-block:: c
+
+    npy_intp PyArray_CountNonzero(PyArrayObject* self)
+    {
+        /* Nonzero boolean function */
+        PyArray_NonzeroFunc* nonzero = PyArray_DESCR(self)->f->nonzero;
+
+        NpyIter* iter;
+        NpyIter_IterNext_Fn iternext;
+        char** dataptr;
+        npy_intp* strideptr,* innersizeptr;
+        /* Running total of nonzero elements found so far */
+        npy_intp nonzero_count = 0;
+
+        /* Handle zero-sized arrays specially */
+        if (PyArray_SIZE(self) == 0) {
+            return 0;
+        }
+
+        /*
+         * Create and use an iterator to count the nonzeros.
+         *   flag NPY_ITER_READONLY
+         *     - The array is never written to.
+         *   flag NPY_ITER_NO_INNER_ITERATION
+         *     - Inner loop is done outside the iterator for efficiency.
+         *   flag NPY_ITER_REFS_OK
+         *     - Reference types are acceptable.
+         *   order NPY_KEEPORDER
+         *     - Visit elements in memory order, regardless of strides.
+         *       This is good for performance when the specific order
+         *       elements are visited is unimportant.
+         *   casting NPY_NO_CASTING
+         *     - No casting is required for this operation.
+         */
+        iter = NpyIter_New(self, NPY_ITER_READONLY|
+                                 NPY_ITER_NO_INNER_ITERATION|
+                                 NPY_ITER_REFS_OK,
+                            NPY_KEEPORDER, NPY_NO_CASTING,
+                            NULL, 0, NULL, 0);
+        if (iter == NULL) {
+            return -1;
+        }
+
+        /*
+         * The iternext function gets stored in a local variable
+         * so it can be called repeatedly in an efficient manner.
+         */
+        iternext = NpyIter_GetIterNext(iter, NULL);
+        if (iternext == NULL) {
+            NpyIter_Deallocate(iter);
+            return -1;
+        }
+        /* The location of the data pointer which the iterator may update */
+        dataptr = NpyIter_GetDataPtrArray(iter);
+        /* The location of the stride which the iterator may update */
+        strideptr = NpyIter_GetInnerStrideArray(iter);
+        /* The location of the inner loop size which the iterator may update */
+        innersizeptr = NpyIter_GetInnerLoopSizePtr(iter);
+
+        /* The iteration loop */
+        do {
+            /* Get the inner loop data/stride/count values */
+            char* data = *dataptr;
+            npy_intp stride = *strideptr;
+            npy_intp count = *innersizeptr;
+
+            /* This is a typical inner loop for NPY_ITER_NO_INNER_ITERATION */
+            while (count--) {
+                if (nonzero(data, self)) {
+                    ++nonzero_count;
+                }
+                data += stride;
+            }
+
+            /* Increment the iterator to the next inner loop */
+        } while(iternext(iter));
+
+        NpyIter_Deallocate(iter);
+
+        return nonzero_count;
+    }
+
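The inner loop in the example above can be tried in isolation. The following is a minimal sketch, assuming the element type is ``double``; ``count_nonzero_doubles`` is a hypothetical helper written for illustration and is not part of NumPy. It takes the same three values the iterator reports for each inner loop (start pointer, byte stride, element count).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of NumPy: counts the nonzero doubles in
 * one inner loop, given the start pointer, byte stride and element count
 * that an iterator would report for that loop.
 */
static ptrdiff_t count_nonzero_doubles(const char *data, ptrdiff_t stride,
                                       ptrdiff_t count)
{
    ptrdiff_t nonzero_count = 0;

    assert(count >= 0);
    while (count--) {
        double value;
        /* memcpy avoids assuming that data is aligned for double */
        memcpy(&value, data, sizeof(value));
        if (value != 0.0) {
            ++nonzero_count;
        }
        /* The stride is in bytes, so advance the char pointer directly */
        data += stride;
    }
    return nonzero_count;
}
```

Because the stride is expressed in bytes, the same loop handles both contiguous data (stride equal to the item size) and every-other-element views (stride equal to twice the item size) without any special casing.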
+Simple Multi-Iteration Example
+------------------------------
+
+Here is a simple copy function using the iterator.  The ``order`` parameter
+controls the memory layout of the allocated result; typically
+``NPY_KEEPORDER`` is desired.
+
+.. code-block:: c
+
+    PyObject *CopyArray(PyObject *arr, NPY_ORDER order)
+    {
+        NpyIter *iter;
+        NpyIter_IterNext_Fn iternext;
+        PyObject *op[2], *ret;
+        npy_uint32 flags;
+        npy_uint32 op_flags[2];
+        npy_intp itemsize, *innersizeptr, innerstride;
+        char **dataptrarray;
+
+        /*
+         * No inner iteration - inner loop is handled by CopyArray code
+         */
+        flags = NPY_ITER_NO_INNER_ITERATION;
+        /*
+         * Tell the constructor to automatically allocate the output.
+         * The data type of the output will match that of the input.
+         */
+        op[0] = arr;
+        op[1] = NULL;
+        op_flags[0] = NPY_ITER_READONLY;
+        op_flags[1] = NPY_ITER_WRITEONLY | NPY_ITER_ALLOCATE;
+
+        /* Construct the iterator */
+        iter = NpyIter_MultiNew(2, op, flags, order, NPY_NO_CASTING,
+                                op_flags, NULL, 0, NULL, 0);
+        if (iter == NULL) {
+            return NULL;
+        }
+
+        /*
+         * Make a copy of the iternext function pointer and
+         * a few other variables the inner loop needs.
+         */
+        iternext = NpyIter_GetIterNext(iter, NULL);
+        if (iternext == NULL) {
+            NpyIter_Deallocate(iter);
+            return NULL;
+        }
+        innerstride = NpyIter_GetInnerStrideArray(iter)[0];
+        itemsize = NpyIter_GetDescrArray(iter)[0]->elsize;
+        /*
+         * The inner loop size and data pointers may change during the
+         * loop, so just cache the addresses.
+         */
+        innersizeptr = NpyIter_GetInnerLoopSizePtr(iter);
+        dataptrarray = NpyIter_GetDataPtrArray(iter);
+
+        /*
+         * Note that because the iterator allocated the output,
+         * it matches the iteration order and is packed tightly,
+         * so we don't need to check it like the input.
+         */
+        if (innerstride == itemsize) {
+            do {
+                memcpy(dataptrarray[1], dataptrarray[0],
+                                        itemsize * (*innersizeptr));
+            } while (iternext(iter));
+        } else {
+            /* For efficiency, should specialize this based on item size... */
+            npy_intp i;
+            do {
+                npy_intp size = *innersizeptr;
+                char *src = dataptrarray[0], *dst = dataptrarray[1];
+                for(i = 0; i < size; i++, src += innerstride, dst += itemsize) {
+                    memcpy(dst, src, itemsize);
+                }
+            } while (iternext(iter));
+        }
+
+        /* Get the result from the iterator object array */
+        ret = NpyIter_GetOperandArray(iter)[1];
+        Py_INCREF(ret);
+
+        if (NpyIter_Deallocate(iter) != NPY_SUCCEED) {
+            Py_DECREF(ret);
+            return NULL;
+        }
+
+        return ret;
+    }
+
+
+Iterator Pointer Type
+---------------------
+
+The iterator layout is an internal detail, and user code only sees
+an incomplete struct.
+
+.. code-block:: c
+
+    typedef struct NpyIter_InternalOnly NpyIter;
+
+
+Construction and Destruction
+----------------------------
+
+.. cfunction:: NpyIter* NpyIter_New(PyArrayObject* op, npy_uint32 flags, NPY_ORDER order, NPY_CASTING casting, PyArray_Descr* dtype, npy_intp a_ndim, npy_intp* axes, npy_intp buffersize)
+
+    Creates an iterator for the given numpy array object ``op``.
+
+    Flags that may be passed in ``flags`` are any combination
+    of the global and per-operand flags documented in
+    ``NpyIter_MultiNew``, except for ``NPY_ITER_ALLOCATE``.
+
+    Any of the ``NPY_ORDER`` enum values may be passed to ``order``.  For
+    efficient iteration, ``NPY_KEEPORDER`` is the best option, and the other
+    orders enforce the particular iteration pattern.
+
+    Any of the ``NPY_CASTING`` enum values may be passed to ``casting``.
+    The values include ``NPY_NO_CASTING``, ``NPY_EQUIV_CASTING``,
+    ``NPY_SAFE_CASTING``, ``NPY_SAME_KIND_CASTING``, and
+    ``NPY_UNSAFE_CASTING``.  To allow the casts to occur, copying or
+    buffering must also be enabled.
+
+    If ``dtype`` isn't ``NULL``, then it requires that data type.
+    If copying is allowed, it will make a temporary copy if the data
+    is castable.  If ``UPDATEIFCOPY`` is enabled, it will also copy
+    the data back with another cast upon iterator destruction.
+
+    If ``a_ndim`` is greater than zero, ``axes`` must also be provided.
+    In this case, ``axes`` is an ``a_ndim``-sized array of ``op``'s axes.
+    A value of -1 in ``axes`` means ``newaxis``. Within the ``axes``
+    array, axes may not be repeated.
+
+    If ``buffersize`` is zero, a default buffer size is used,
+    otherwise it specifies how big of a buffer to use.  Buffers
+    which are powers of 2 such as 512 or 1024 are recommended.
+
+    Returns NULL if there is an error, otherwise returns the allocated
+    iterator.
+
+    To make an iterator similar to the old iterator, this should work.
+
+    .. code-block:: c
+
+        iter = NpyIter_New(op, NPY_ITER_READWRITE,
+                            NPY_CORDER, NPY_NO_CASTING, NULL, 0, NULL, 0);
+
+    If you want to edit an array with aligned ``double`` code,
+    but the order doesn't matter, you would use this.
+
+    .. code-block:: c
+
+        dtype = PyArray_DescrFromType(NPY_DOUBLE);
+        iter = NpyIter_New(op, NPY_ITER_READWRITE |
+                            NPY_ITER_BUFFERED |
+                            NPY_ITER_NBO |
+                            NPY_ITER_ALIGNED,
+                            NPY_KEEPORDER,
+                            NPY_SAME_KIND_CASTING,
+                            dtype, 0, NULL, 0);
+        Py_DECREF(dtype);
+
+.. cfunction:: NpyIter* NpyIter_MultiNew(npy_intp niter, PyArrayObject** op, npy_uint32 flags, NPY_ORDER order, NPY_CASTING casting, npy_uint32* op_flags, PyArray_Descr** op_dtypes, npy_intp oa_ndim, npy_intp** op_axes, npy_intp buffersize)
+
+    Creates an iterator for broadcasting the ``niter`` array objects provided
+    in ``op``.
+
+    For normal usage, use 0 for ``oa_ndim`` and NULL for ``op_axes``.
+    See below for a description of these parameters, which allow for
+    custom manual broadcasting as well as reordering and leaving out axes.
+
+    Any of the ``NPY_ORDER`` enum values may be passed to ``order``.  For
+    efficient iteration, ``NPY_KEEPORDER`` is the best option, and the other
+    orders enforce the particular iteration pattern.  When using
+    ``NPY_KEEPORDER``, if you also want to ensure that the iteration is
+    not reversed along an axis, you should pass the flag
+    ``NPY_ITER_DONT_NEGATE_STRIDES``.
+
+    Any of the ``NPY_CASTING`` enum values may be passed to ``casting``.
+    The values include ``NPY_NO_CASTING``, ``NPY_EQUIV_CASTING``,
+    ``NPY_SAFE_CASTING``, ``NPY_SAME_KIND_CASTING``, and
+    ``NPY_UNSAFE_CASTING``.  To allow the casts to occur, copying or
+    buffering must also be enabled.
+
+    If ``op_dtypes`` isn't ``NULL``, it specifies a data type or ``NULL``
+    for each ``op[i]``.
+
+    The parameter ``oa_ndim``, when non-zero, specifies the number of
+    dimensions that will be iterated with customized broadcasting.
+    If it is provided, ``op_axes`` must also be provided.
+    These two parameters let you control in detail how the
+    axes of the operand arrays get matched together and iterated.
+    In ``op_axes``, you must provide an array of ``niter`` pointers
+    to ``oa_ndim``-sized arrays of type ``npy_intp``.  If an entry
+    in ``op_axes`` is NULL, normal broadcasting rules will apply.
+    In ``op_axes[j][i]`` is stored either a valid axis of ``op[j]``, or
+    -1 which means ``newaxis``.  Within each ``op_axes[j]`` array, axes
+    may not be repeated.  The following example is how normal broadcasting
+    applies to a 3-D array, a 2-D array, a 1-D array and a scalar.
+
+    .. code-block:: c
+
+        npy_intp oa_ndim = 3;               /* # iteration axes */
+        npy_intp op0_axes[] = {0, 1, 2};    /* 3-D operand */
+        npy_intp op1_axes[] = {-1, 0, 1};   /* 2-D operand */
+        npy_intp op2_axes[] = {-1, -1, 0};  /* 1-D operand */
+        npy_intp op3_axes[] = {-1, -1, -1}; /* 0-D (scalar) operand */
+        npy_intp* op_axes[] = {op0_axes, op1_axes, op2_axes, op3_axes};
+
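To make the ``op_axes`` mapping concrete, here is a minimal sketch of how a broadcast shape could be derived from it. ``broadcast_shape`` is a hypothetical helper written for illustration; it is not part of the NumPy API, it uses plain ``long`` in place of ``npy_intp``, and it only assumes the usual rule that matched dimensions must be equal or have length 1.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of NumPy: computes the broadcast shape
 * implied by op_axes.  op_axes[j][i] is the axis of operand j matched to
 * iteration axis i, or -1 for newaxis; op_shapes[j] is operand j's shape.
 * Returns 0 on success, -1 if the operands cannot be broadcast together.
 */
static int broadcast_shape(int niter, int oa_ndim,
                           const long *const *op_axes,
                           const long *const *op_shapes,
                           long *out_shape)
{
    int i, j;

    assert(niter > 0 && oa_ndim > 0);
    for (i = 0; i < oa_ndim; i++) {
        long dim = 1;
        for (j = 0; j < niter; j++) {
            long axis = op_axes[j][i];
            /* newaxis behaves like a dimension of length 1 */
            long len = (axis < 0) ? 1 : op_shapes[j][axis];
            if (len == 1) {
                continue;
            }
            if (dim == 1) {
                dim = len;
            }
            else if (dim != len) {
                return -1;
            }
        }
        out_shape[i] = dim;
    }
    return 0;
}
```

With the ``op_axes`` arrays from the example above and operand shapes ``(2, 3, 4)``, ``(3, 4)``, ``(4,)`` and ``()``, this sketch produces the iteration shape ``(2, 3, 4)``. The scalar operand never has its shape read, because all of its entries are -1.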
+    If ``buffersize`` is zero, a default buffer size is used,
+    otherwise it specifies how big of a buffer to use.  Buffers
+    which are powers of 2 such as 512 or 1024 are recommended.
+
+    Returns NULL if there is an error, otherwise returns the allocated
+    iterator.
+
+    Flags that may be passed in ``flags``, applying to the whole
+    iterator, are:
+
+        ``NPY_ITER_C_INDEX``, ``NPY_ITER_F_INDEX``
+
+            Causes the iterator to track an index matching C or
+            Fortran order. These options are mutually exclusive.
+
+        ``NPY_ITER_COORDS``
+
+            Causes the iterator to track array coordinates.
+            This prevents the iterator from coalescing axes to
+            produce bigger inner loops.
+
+        ``NPY_ITER_NO_INNER_ITERATION``
+
+            Causes the iterator to skip iteration of the innermost
+            loop, allowing the user of the iterator to handle it.
+
+            This flag is incompatible with ``NPY_ITER_C_INDEX``,
+            ``NPY_ITER_F_INDEX``, and ``NPY_ITER_COORDS``.
+
+        ``NPY_ITER_DONT_NEGATE_STRIDES``
+
+            This only affects the iterator when NPY_KEEPORDER is specified
+            for the order parameter.  By default with NPY_KEEPORDER, the
+            iterator reverses axes which have negative strides, so that
+            memory is traversed in a forward direction.  This disables
+            this step.  Use this flag if you want to use the underlying
+            memory-ordering of the axes, but don't want an axis reversed.
+            This is the behavior of ``numpy.ravel(a, order='K')``, for
+            instance.
+
+        ``NPY_ITER_COMMON_DTYPE``
+
+            Causes the iterator to convert all the operands to a common
+            data type, calculated based on the ufunc type promotion rules.
+            Copying or buffering must be enabled.
+
+            If the common data type is known ahead of time, don't use this
+            flag.  Instead, set the requested dtype for all the operands.
+
+        ``NPY_ITER_REFS_OK``
+
+            Indicates that arrays with reference types (object
+            arrays or structured arrays containing an object type)
+            may be accepted and used in the iterator.  If this flag
+            is enabled, the caller must be sure to check whether
+            ``NpyIter_IterationNeedsAPI(iter)`` is true, in which case
+            it may not release the GIL during iteration.
+
+        ``NPY_ITER_ZEROSIZE_OK``
+
+            Indicates that arrays with a size of zero should be permitted.
+            Since the typical iteration loop does not naturally work with
+            zero-sized arrays, you must check that
+            ``NpyIter_GetIterSize(iter)`` is non-zero before entering the
+            iteration loop.
+
+        ``NPY_ITER_REDUCE_OK``
+
+            Permits writeable operands with a dimension with zero
+            stride and size greater than one.  Note that such operands
+            must be read/write.
+
+            When buffering is enabled, this also switches to a special
+            buffering mode which reduces the loop length as necessary to
+            not trample on values being reduced.
+
+            Note that if you want to do a reduction on an automatically
+            allocated output, you must use ``NpyIter_GetOperandArray``
+            to get its reference, then set every value to the reduction
+            unit before doing the iteration loop.  In the case of a
+            buffered reduction, this means you must also specify the
+            flag ``NPY_ITER_DELAY_BUFALLOC``, then reset the iterator
+            after initializing the allocated operand to prepare the
+            buffers.
+
+        ``NPY_ITER_RANGED``
+
+            Enables support for iteration of sub-ranges of the full
+            ``iterindex`` range ``[0, NpyIter_GetIterSize(iter))``.  Use
+            the function ``NpyIter_ResetToIterIndexRange`` to specify
+            a range for iteration.
+
+            This flag can only be used with ``NPY_ITER_NO_INNER_ITERATION``
+            when ``NPY_ITER_BUFFERED`` is enabled.  This is because
+            without buffering, the inner loop is always the size of the
+            innermost iteration dimension, and allowing it to get cut up
+            would require special handling, effectively making it more
+            like the buffered version.
+
+        ``NPY_ITER_BUFFERED``
+
+            Causes the iterator to store buffering data, and use buffering
+            to satisfy data type, alignment, and byte-order requirements.
+            To buffer an operand, do not specify the ``NPY_ITER_COPY``
+            or ``NPY_ITER_UPDATEIFCOPY`` flags, because they will
+            override buffering.  Buffering is especially useful for Python
+            code using the iterator, allowing for larger chunks
+            of data at once to amortize the Python interpreter overhead.
+
+            If used with ``NPY_ITER_NO_INNER_ITERATION``, the inner loop
+            for the caller may get larger chunks than would be possible
+            without buffering, because of how the strides are laid out.
+
+            Note that if an operand is given the flag ``NPY_ITER_COPY``
+            or ``NPY_ITER_UPDATEIFCOPY``, a copy will be made in preference
+            to buffering.  Buffering will still occur when the array was
+            broadcast so elements need to be duplicated to get a constant
+            stride.
+
+            In normal buffering, the size of each inner loop is equal
+            to the buffer size, or possibly larger if ``NPY_ITER_GROWINNER``
+            is specified.  If ``NPY_ITER_REDUCE_OK`` is enabled and
+            a reduction occurs, the inner loops may become smaller depending
+            on the structure of the reduction.
+
+        ``NPY_ITER_GROWINNER``
+
+            When buffering is enabled, this allows the size of the inner
+            loop to grow when buffering isn't necessary.  This option
+            is best used if you're doing a straight pass through all the
+            data, rather than anything with small cache-friendly arrays
+            of temporary values for each inner loop.
+
+        ``NPY_ITER_DELAY_BUFALLOC``
+
+            When buffering is enabled, this delays allocation of the
+            buffers until one of the ``NpyIter_Reset*`` functions is
+            called.  This flag exists to avoid wasteful copying of
+            buffer data when making multiple copies of a buffered
+            iterator for multi-threaded iteration.
+
+            Another use of this flag is for setting up reduction operations.
+            After the iterator is created, and a reduction output
+            is allocated automatically by the iterator (be sure to use
+            READWRITE access), its value may be initialized to the reduction
+            unit.  Use ``NpyIter_GetOperandArray`` to get the object.
+            Then, call ``NpyIter_Reset`` to allocate and fill the buffers
+            with their initial values.
+
+    Flags that may be passed in ``op_flags[i]``, where ``0 <= i < niter``:
+
+        ``NPY_ITER_READWRITE``, ``NPY_ITER_READONLY``, ``NPY_ITER_WRITEONLY``
+
+            Indicate how the user of the iterator will read or write
+            to ``op[i]``.  Exactly one of these flags must be specified
+            per operand.
+
+        ``NPY_ITER_COPY``
+
+            Allow a copy of ``op[i]`` to be made if it does not
+            meet the data type or alignment requirements as specified
+            by the constructor flags and parameters.
+
+        ``NPY_ITER_UPDATEIFCOPY``
+
+            Triggers ``NPY_ITER_COPY``, and when an array operand
+            is flagged for writing and is copied, causes the data
+            in a copy to be copied back to ``op[i]`` when the iterator
+            is destroyed.
+
+            If the operand is flagged as write-only and a copy is needed,
+            an uninitialized temporary array will be created and then copied
+            back to ``op[i]`` on destruction, instead of doing
+            the unnecessary copy operation.
+
+        ``NPY_ITER_NBO``, ``NPY_ITER_ALIGNED``, ``NPY_ITER_CONTIG``
+
+            Causes the iterator to provide data for ``op[i]``
+            that is in native byte order, aligned according to
+            the dtype requirements, contiguous, or any combination.
+
+            By default, the iterator produces pointers into the
+            arrays provided, which may be aligned or unaligned, and
+            with any byte order.  If copying or buffering is not
+            enabled and the operand data doesn't satisfy the constraints,
+            an error will be raised.
+
+            The contiguous constraint applies only to the inner loop;
+            successive inner loops may have arbitrary pointer changes.
+
+            If the requested data type is in non-native byte order,
+            the NBO flag overrides it and the requested data type is
+            converted to be in native byte order.
+
+        ``NPY_ITER_ALLOCATE``
+
+            This is for output arrays, and requires that the flag
+            ``NPY_ITER_WRITEONLY`` or ``NPY_ITER_READWRITE`` be set.
+            If ``op[i]`` is NULL, creates a new array with the final
+            broadcast dimensions, and a layout matching the iteration
+            order of the iterator.
+
+            When ``op[i]`` is NULL, the requested data type
+            ``op_dtypes[i]`` may be NULL as well, in which case it is
+            automatically generated from the dtypes of the arrays which
+            are flagged as readable.  The rules for generating the dtype
+            are the same as for UFuncs.  Of special note is handling
+            of byte order in the selected dtype.  If there is exactly
+            one input, the input's dtype is used as is.  Otherwise,
+            if more than one input dtype is combined together, the
+            output will be in native byte order.
+
+            After being allocated with this flag, the caller may retrieve
+            the new array by calling ``NpyIter_GetOperandArray`` and
+            getting the i-th object in the returned C array.  The caller
+            must call Py_INCREF on it to claim a reference to the array.
+
+        ``NPY_ITER_NO_SUBTYPE``
+
+            For use with ``NPY_ITER_ALLOCATE``, this flag disables
+            allocating an array subtype for the output, forcing
+            it to be a straight ndarray.
+
+            TODO: Maybe it would be better to introduce a function
+            ``NpyIter_GetWrappedOutput`` and remove this flag?
+
+        ``NPY_ITER_NO_BROADCAST``
+
+            Ensures that the input or output matches the iteration
+            dimensions exactly.
+
+.. cfunction:: NpyIter* NpyIter_Copy(NpyIter* iter)
+
+    Makes a copy of the given iterator.  This function is provided
+    primarily to enable multi-threaded iteration of the data.
+
+    *TODO*: Move this to a section about multithreaded iteration.
+
+    The recommended approach to multithreaded iteration is to
+    first create an iterator with the flags
+    ``NPY_ITER_NO_INNER_ITERATION``, ``NPY_ITER_RANGED``,
+    ``NPY_ITER_BUFFERED``, ``NPY_ITER_DELAY_BUFALLOC``, and
+    possibly ``NPY_ITER_GROWINNER``.  Create a copy of this iterator
+    for each thread (minus one for the first iterator).  Then, take
+    the iteration index range ``[0, NpyIter_GetIterSize(iter))`` and
+    split it up into tasks, for example using a TBB parallel_for loop.
+    When a thread gets a task to execute, it then uses its copy of
+    the iterator by calling ``NpyIter_ResetToIterIndexRange`` and
+    iterating over the full range.
+
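The range-splitting step described above can be sketched on its own. ``task_range`` below is a hypothetical helper written for illustration, not part of NumPy, and it uses ``ptrdiff_t`` in place of ``npy_intp``. It divides ``[0, itersize)`` into ``ntasks`` contiguous half-open ranges whose sizes differ by at most one; each resulting pair would then be passed as ``istart`` and ``iend`` to ``NpyIter_ResetToIterIndexRange`` on that thread's iterator copy.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of NumPy: computes the half-open
 * iterindex range [*istart, *iend) assigned to one task when the full
 * range [0, itersize) is divided among ntasks tasks.
 */
static void task_range(ptrdiff_t itersize, int ntasks, int task,
                       ptrdiff_t *istart, ptrdiff_t *iend)
{
    ptrdiff_t base, extra;

    assert(itersize >= 0 && ntasks > 0 && task >= 0 && task < ntasks);
    base = itersize / ntasks;
    extra = itersize % ntasks;
    /* The first `extra` tasks each take one additional element */
    *istart = task * base + (task < extra ? task : extra);
    *iend = *istart + base + (task < extra ? 1 : 0);
}
```

When there are more tasks than elements, the trailing tasks receive empty ranges, so a caller would want to skip the iteration loop for any task where ``istart == iend``.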
+    When using the iterator in multi-threaded code or in code not
+    holding the Python GIL, care must be taken to only call functions
+    which are safe in that context.  ``NpyIter_Copy`` cannot be safely
+    called without the Python GIL, because it increments Python
+    references.  The ``Reset*`` and some other functions may be safely
+    called by passing in the ``errmsg`` parameter as non-NULL, so that
+    the functions will pass back errors through it instead of setting
+    a Python exception.
+
+.. cfunction:: int NpyIter_RemoveAxis(NpyIter* iter, npy_intp axis)
+
+    Removes an axis from iteration.  This requires that
+    ``NPY_ITER_COORDS`` was set for iterator creation, and does not work
+    if buffering is enabled or an index is being tracked. This function
+    also resets the iterator to its initial state.
+
+    This is useful for setting up an accumulation loop, for example.
+    The iterator can first be created with all the dimensions, including
+    the accumulation axis, so that the output gets created correctly.
+    Then, the accumulation axis can be removed, and the calculation
+    done in a nested fashion.
+
+    **WARNING**: This function may change the internal memory layout of
+    the iterator.  Any cached functions or pointers from the iterator
+    must be retrieved again!
+
+    Returns ``NPY_SUCCEED`` or ``NPY_FAIL``.
+
+
+.. cfunction:: int NpyIter_RemoveCoords(NpyIter* iter)
+
+    If the iterator has coordinates, this strips support for them, and
+    does further iterator optimizations that are possible if coordinates
+    are not needed.  This function also resets the iterator to its initial
+    state.
+
+    **WARNING**: This function may change the internal memory layout of
+    the iterator.  Any cached functions or pointers from the iterator
+    must be retrieved again!
+
+    After calling this function, ``NpyIter_HasCoords(iter)`` will
+    return false.
+
+    Returns ``NPY_SUCCEED`` or ``NPY_FAIL``.
+
+.. cfunction:: int NpyIter_RemoveInnerLoop(NpyIter* iter)
+
+    If RemoveCoords was used, you may want to specify the
+    flag ``NPY_ITER_NO_INNER_ITERATION``.  This flag is not permitted
+    together with ``NPY_ITER_COORDS``, so this function is provided
+    to enable the feature after ``NpyIter_RemoveCoords`` is called.
+    This function also resets the iterator to its initial state.
+
+    **WARNING**: This function changes the internal logic of the iterator.
+    Any cached functions or pointers from the iterator must be retrieved
+    again!
+
+    Returns ``NPY_SUCCEED`` or ``NPY_FAIL``.
+
+.. cfunction:: int NpyIter_Deallocate(NpyIter* iter)
+
+    Deallocates the iterator object.  This additionally frees any
+    copies made, triggering UPDATEIFCOPY behavior where necessary.
+
+    Returns ``NPY_SUCCEED`` or ``NPY_FAIL``.
+
+.. cfunction:: int NpyIter_Reset(NpyIter* iter, char** errmsg)
+
+    Resets the iterator back to its initial state, at the beginning
+    of the iteration range.
+
+    Returns ``NPY_SUCCEED`` or ``NPY_FAIL``.  If errmsg is non-NULL,
+    no Python exception is set when ``NPY_FAIL`` is returned.
+    Instead, \*errmsg is set to an error message.  When errmsg is
+    non-NULL, the function may be safely called without holding
+    the Python GIL.
+
+.. cfunction:: int NpyIter_ResetToIterIndexRange(NpyIter* iter, npy_intp istart, npy_intp iend, char** errmsg)
+
+    Resets the iterator and restricts it to the ``iterindex`` range
+    ``[istart, iend)``.  See ``NpyIter_Copy`` for an explanation of
+    how to use this for multi-threaded iteration.  This requires that
+    the flag ``NPY_ITER_RANGED`` was passed to the iterator constructor.
+
+    If you want to reset both the ``iterindex`` range and the base
+    pointers at the same time, you can do the following to avoid
+    extra buffer copying (be sure to add the return code error checks
+    when you copy this code).
+
+    .. code-block:: c
+
+        /* Set to a trivial empty range */
+        NpyIter_ResetToIterIndexRange(iter, 0, 0, NULL);
+        /* Set the base pointers */
+        NpyIter_ResetBasePointers(iter, baseptrs, NULL);
+        /* Set to the desired range */
+        NpyIter_ResetToIterIndexRange(iter, istart, iend, NULL);
+
+    Returns ``NPY_SUCCEED`` or ``NPY_FAIL``.  If errmsg is non-NULL,
+    no Python exception is set when ``NPY_FAIL`` is returned.
+    Instead, \*errmsg is set to an error message.  When errmsg is
+    non-NULL, the function may be safely called without holding
+    the Python GIL.
+
+.. cfunction:: int NpyIter_ResetBasePointers(NpyIter *iter, char** baseptrs, char** errmsg)
+
+    Resets the iterator back to its initial state, but using the values
+    in ``baseptrs`` for the data instead of the pointers from the arrays
+    being iterated.  This function is intended to be used, together with
+    the ``op_axes`` parameter, by nested iteration code with two or more
+    iterators.
+
+    Returns ``NPY_SUCCEED`` or ``NPY_FAIL``.  If errmsg is non-NULL,
+    no Python exception is set when ``NPY_FAIL`` is returned.
+    Instead, \*errmsg is set to an error message.  When errmsg is
+    non-NULL, the function may be safely called without holding
+    the Python GIL.
+
+    *TODO*: Move the following into a special section on nested iterators.
+
+    Creating iterators for nested iteration requires some care.  All
+    the iterator operands must match exactly, or the calls to
+    ``NpyIter_ResetBasePointers`` will be invalid.  This means that
+    automatic copies and output allocation should not be used haphazardly.
+    It is possible to still use the automatic data conversion and casting
+    features of the iterator by creating one of the iterators with
+    all the conversion parameters enabled, then grabbing the allocated
+    operands with the ``NpyIter_GetOperandArray`` function and passing
+    them into the constructors for the rest of the iterators.
+
+    **WARNING**: When creating iterators for nested iteration,
+    the code must not use a dimension more than once in the different
+    iterators.  If this is done, nested iteration will produce
+    out-of-bounds pointers during iteration.
+
+    **WARNING**: When creating iterators for nested iteration, buffering
+    can only be applied to the innermost iterator.  If a buffered iterator
+    is used as the source for ``baseptrs``, it will point into a small buffer
+    instead of the array and the inner iteration will be invalid.
+
+    The pattern for using nested iterators is as follows.
+
+    .. code-block:: c
+
+        NpyIter *iter1, *iter2;
+        NpyIter_IterNext_Fn iternext1, iternext2;
+        char **dataptrs1;
+
+        /*
+         * With the exact same operands, no copies allowed, and
+         * no axis in op_axes used both in iter1 and iter2.
+         * Buffering may be enabled for iter2, but not for iter1.
+         */
+        iter1 = ...; iter2 = ...;
+
+        iternext1 = NpyIter_GetIterNext(iter1, NULL);
+        iternext2 = NpyIter_GetIterNext(iter2, NULL);
+        dataptrs1 = NpyIter_GetDataPtrArray(iter1);
+
+        do {
+            NpyIter_ResetBasePointers(iter2, dataptrs1, NULL);
+            do {
+                /* Use the iter2 values */
+            } while (iternext2(iter2));
+        } while (iternext1(iter1));
+
+.. cfunction:: int NpyIter_GotoCoords(NpyIter* iter, npy_intp* coords)
+
+    Adjusts the iterator to point to the ``ndim`` coordinates
+    pointed to by ``coords``.  Returns an error if coordinates
+    are not being tracked, the coordinates are out of bounds,
+    or inner loop iteration is disabled.
+
+    Returns ``NPY_SUCCEED`` or ``NPY_FAIL``.
+
+.. cfunction:: int NpyIter_GotoIndex(NpyIter* iter, npy_intp index)
+
+    Adjusts the iterator to point to the ``index`` specified.
+    If the iterator was constructed with the flag
+    ``NPY_ITER_C_INDEX``, ``index`` is the C-order index,
+    and if the iterator was constructed with the flag
+    ``NPY_ITER_F_INDEX``, ``index`` is the Fortran-order
+    index.  Returns an error if there is no index being tracked,
+    the index is out of bounds, or inner loop iteration is disabled.
+
+    Returns ``NPY_SUCCEED`` or ``NPY_FAIL``.
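+
+    The difference between the two index conventions can be sketched
+    in plain C as follows.  This uses the conventional definitions of
+    row-major and column-major flat indices and is not taken from the
+    iterator's implementation; the function names and the ``intp_t``
+    typedef (a stand-in for ``npy_intp``) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for npy_intp so the sketch compiles without NumPy headers. */
typedef ptrdiff_t intp_t;

/* C-order (row-major) flat index: the last coordinate varies fastest. */
intp_t
c_order_index(int ndim, const intp_t *shape, const intp_t *coords)
{
    intp_t index = 0;
    int i;

    for (i = 0; i < ndim; ++i) {
        assert(coords[i] >= 0 && coords[i] < shape[i]);
        index = index * shape[i] + coords[i];
    }
    return index;
}

/* Fortran-order (column-major) flat index: the first coordinate varies fastest. */
intp_t
f_order_index(int ndim, const intp_t *shape, const intp_t *coords)
{
    intp_t index = 0;
    int i;

    for (i = ndim - 1; i >= 0; --i) {
        assert(coords[i] >= 0 && coords[i] < shape[i]);
        index = index * shape[i] + coords[i];
    }
    return index;
}
```

+    For a shape of ``(2, 3)``, the coordinates ``(1, 0)`` map to the
+    C-order index 3 and to the Fortran-order index 1.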
+
+.. cfunction:: npy_intp NpyIter_GetIterSize(NpyIter* iter)
+
+    Returns the number of elements being iterated.  This is the product
+    of all the dimensions in the shape.
+
+.. cfunction:: npy_intp NpyIter_GetIterIndex(NpyIter* iter)
+
+    Gets the ``iterindex`` of the iterator, which is an index matching
+    the iteration order of the iterator.
+
+.. cfunction:: void NpyIter_GetIterIndexRange(NpyIter* iter, npy_intp* istart, npy_intp* iend)
+
+    Gets the ``iterindex`` sub-range that is being iterated.  If
+    ``NPY_ITER_RANGED`` was not specified, this always returns the
+    range ``[0, NpyIter_GetIterSize(iter))``.
+
+.. cfunction:: int NpyIter_GotoIterIndex(NpyIter* iter, npy_intp iterindex)
+
+    Adjusts the iterator to point to the ``iterindex`` specified.
+    The ``iterindex`` is an index matching the iteration order of the iterator.
+    Returns an error if the ``iterindex`` is out of bounds,
+    buffering is enabled, or inner loop iteration is disabled.
+
+    Returns ``NPY_SUCCEED`` or ``NPY_FAIL``.
+
+.. cfunction:: int NpyIter_HasInnerLoop(NpyIter* iter)
+
+    Returns 1 if the iterator handles the inner loop,
+    or 0 if the caller needs to handle it.  This is controlled
+    by the constructor flag ``NPY_ITER_NO_INNER_ITERATION``.
+
+.. cfunction:: int NpyIter_HasCoords(NpyIter* iter)
+
+    Returns 1 if the iterator was created with the
+    ``NPY_ITER_COORDS`` flag, 0 otherwise.
+
+.. cfunction:: int NpyIter_HasIndex(NpyIter* iter)
+
+    Returns 1 if the iterator was created with the
+    ``NPY_ITER_C_INDEX`` or ``NPY_ITER_F_INDEX``
+    flag, 0 otherwise.
+
+.. cfunction:: int NpyIter_IsBuffered(NpyIter* iter)
+
+    Returns 1 if the iterator was created with the
+    ``NPY_ITER_BUFFERED`` flag, 0 otherwise.
+
+.. cfunction:: int NpyIter_IsGrowInner(NpyIter* iter)
+
+    Returns 1 if the iterator was created with the
+    ``NPY_ITER_GROWINNER`` flag, 0 otherwise.
+
+.. cfunction:: npy_intp NpyIter_GetBufferSize(NpyIter* iter)
+
+    If the iterator is buffered, returns the size of the buffer
+    being used, otherwise returns 0.
+
+.. cfunction:: npy_intp NpyIter_GetNDim(NpyIter* iter)
+
+    Returns the number of dimensions being iterated.  If coordinates
+    were not requested in the iterator constructor, this value
+    may be smaller than the number of dimensions in the original
+    objects.
+
+.. cfunction:: npy_intp NpyIter_GetNIter(NpyIter* iter)
+
+    Returns the number of objects being iterated.
+
+.. cfunction:: npy_intp* NpyIter_GetAxisStrideArray(NpyIter* iter, npy_intp axis)
+
+    Gets the array of strides for the specified axis. Requires that
+    the iterator be tracking coordinates, and that buffering not
+    be enabled.
+
+    This may be used when you want to match up operand axes in
+    some fashion, then remove them with ``NpyIter_RemoveAxis`` to
+    handle their processing manually.  By calling this function
+    before removing the axes, you can get the strides for the
+    manual processing.
+
+    Returns ``NULL`` on error.
+
+.. cfunction:: int NpyIter_GetShape(NpyIter* iter, npy_intp* outshape)
+
+    Returns the broadcast shape of the iterator in ``outshape``.
+    This can only be called on an iterator which supports coordinates.
+
+    Returns ``NPY_SUCCEED`` or ``NPY_FAIL``.
+
+.. cfunction:: PyArray_Descr** NpyIter_GetDescrArray(NpyIter* iter)
+
+    This gives back a pointer to the ``niter`` data type Descrs for
+    the objects being iterated.  The result points into ``iter``,
+    so the caller does not gain any references to the Descrs.
+
+    This pointer may be cached before the iteration loop; calling
+    ``iternext`` will not change it.
+
+.. cfunction:: PyObject** NpyIter_GetOperandArray(NpyIter* iter)
+
+    This gives back a pointer to the ``niter`` operand PyObjects
+    that are being iterated.  The result points into ``iter``,
+    so the caller does not gain any references to the PyObjects.
+
+.. cfunction:: PyObject* NpyIter_GetIterView(NpyIter* iter, npy_intp i)
+
+    This gives back a reference to a new ndarray view, which is a view
+    into the i-th object in the array returned by
+    ``NpyIter_GetOperandArray()``, whose dimensions and strides match
+    the internal optimized iteration pattern.  A C-order iteration of
+    this view is equivalent to the iterator's iteration order.
+
+    For example, if an iterator was created with a single array as its
+    input, and it was possible to rearrange all its axes and then
+    collapse it into a single strided iteration, this would return
+    a view that is a one-dimensional array.
+
+.. cfunction:: void NpyIter_GetReadFlags(NpyIter* iter, char* outreadflags)
+
+    Fills ``niter`` flags. Sets ``outreadflags[i]`` to 1 if
+    ``op[i]`` can be read from, and to 0 if not.
+
+.. cfunction:: void NpyIter_GetWriteFlags(NpyIter* iter, char* outwriteflags)
+
+    Fills ``niter`` flags. Sets ``outwriteflags[i]`` to 1 if
+    ``op[i]`` can be written to, and to 0 if not.
+
+Functions For Iteration
+-----------------------
+
+.. cfunction:: NpyIter_IterNext_Fn NpyIter_GetIterNext(NpyIter* iter, char** errmsg)
+
+    Returns a function pointer for iteration.  A specialized version
+    of the function pointer may be calculated by this function
+    instead of being stored in the iterator structure. Thus, to
+    get good performance, it is required that the function pointer
+    be saved in a variable rather than retrieved for each loop iteration.
+
+    Returns NULL if there is an error.  If errmsg is non-NULL,
+    no Python exception is set when NULL is returned.
+    Instead, \*errmsg is set to an error message.  When errmsg is
+    non-NULL, the function may be safely called without holding
+    the Python GIL.
+
+    The typical looping construct is as follows.
+
+    .. code-block:: c
+
+        NpyIter_IterNext_Fn iternext = NpyIter_GetIterNext(iter, NULL);
+        char** dataptr = NpyIter_GetDataPtrArray(iter);
+
+        do {
+            /* use the addresses dataptr[0], ... dataptr[niter-1] */
+        } while(iternext(iter));
+
+    When ``NPY_ITER_NO_INNER_ITERATION`` is specified, the typical
+    inner loop construct is as follows.
+
+    .. code-block:: c
+
+        NpyIter_IterNext_Fn iternext = NpyIter_GetIterNext(iter, NULL);
+        char** dataptr = NpyIter_GetDataPtrArray(iter);
+        npy_intp* stride = NpyIter_GetInnerStrideArray(iter);
+        npy_intp* size_ptr = NpyIter_GetInnerLoopSizePtr(iter), size;
+        npy_intp iiter, niter = NpyIter_GetNIter(iter);
+
+        do {
+            size = *size_ptr;
+            while (size--) {
+                /* use the addresses dataptr[0], ... dataptr[niter-1] */
+                for (iiter = 0; iiter < niter; ++iiter) {
+                    dataptr[iiter] += stride[iiter];
+                }
+            }
+        } while (iternext(iter));
+
+    Observe that we are using the dataptr array inside the iterator, not
+    copying the values to a local temporary.  This is possible because
+    when ``iternext()`` is called, these pointers will be overwritten
+    with fresh values, not incrementally updated.
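+
+    As an illustration of what the inner loop body might look like,
+    here is a minimal sketch that adds two ``double`` operands into a
+    third.  It is written in plain C so it compiles without the NumPy
+    headers; in real code ``dataptr``, ``stride`` and ``size`` would
+    come from ``NpyIter_GetDataPtrArray``,
+    ``NpyIter_GetInnerStrideArray`` and ``NpyIter_GetInnerLoopSizePtr``.
+    The function name and the ``intp_t`` typedef (a stand-in for
+    ``npy_intp``) are hypothetical, and the sketch assumes all three
+    operands are aligned ``double`` data.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for npy_intp so the sketch compiles without NumPy headers. */
typedef ptrdiff_t intp_t;

/*
 * Inner loop for out = a + b on doubles.  dataptr[0] and dataptr[1]
 * are the inputs, dataptr[2] is the output.  Strides are in bytes,
 * which is why the pointers are advanced as char pointers.  The
 * pointers in dataptr are advanced in place.
 */
void
add_doubles_inner_loop(char **dataptr, const intp_t *stride, intp_t size)
{
    assert(size >= 0);

    while (size--) {
        const double a = *(const double *)dataptr[0];
        const double b = *(const double *)dataptr[1];

        *(double *)dataptr[2] = a + b;

        dataptr[0] += stride[0];
        dataptr[1] += stride[1];
        dataptr[2] += stride[2];
    }
}
```

+    A stride of 0 for an operand means the same value is reread on
+    every pass, which is how a broadcast scalar appears to the loop.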
+
+    If a compile-time fixed buffer is being used (both flags
+    ``NPY_ITER_BUFFERED`` and ``NPY_ITER_NO_INNER_ITERATION``), the
+    inner size may be used as a signal as well.  The size is guaranteed
+    to become zero when ``iternext()`` returns false, enabling the
+    following loop construct.  Note that if you use this construct,
+    you should not pass ``NPY_ITER_GROWINNER`` as a flag, because it
+    will cause larger sizes under some circumstances.
+
+    .. code-block:: c
+
+        /* The constructor should have buffersize passed as this value */
+        #define FIXED_BUFFER_SIZE 1024
+
+        NpyIter_IterNext_Fn iternext = NpyIter_GetIterNext(iter, NULL);
+        char **dataptr = NpyIter_GetDataPtrArray(iter);
+        npy_intp *stride = NpyIter_GetInnerStrideArray(iter);
+        npy_intp *size_ptr = NpyIter_GetInnerLoopSizePtr(iter), size;
+        npy_intp i, iiter, niter = NpyIter_GetNIter(iter);
+
+        /* One loop with a fixed inner size */
+        size = *size_ptr;
+        while (size == FIXED_BUFFER_SIZE) {
+            /*
+             * This loop could be manually unrolled by a factor
+             * which divides into FIXED_BUFFER_SIZE
+             */
+            for (i = 0; i < FIXED_BUFFER_SIZE; ++i) {
+                /* use the addresses dataptr[0], ... dataptr[niter-1] */
+                for (iiter = 0; iiter < niter; ++iiter) {
+                    dataptr[iiter] += stride[iiter];
+                }
+            }
+            iternext(iter);
+            size = *size_ptr;
+        }
+
+        /* Finish-up loop with variable inner size */
+        if (size > 0) do {
+            size = *size_ptr;
+            while (size--) {
+                /* use the addresses dataptr[0], ... dataptr[niter-1] */
+                for (iiter = 0; iiter < niter; ++iiter) {
+                    dataptr[iiter] += stride[iiter];
+                }
+            }
+        } while (iternext(iter));
+
+.. cfunction:: NpyIter_GetCoords_Fn NpyIter_GetGetCoords(NpyIter* iter, char** errmsg)
+
+    Returns a function pointer for getting the coordinates
+    of the iterator.  Returns NULL if the iterator does not
+    support coordinates.  It is recommended that this function
+    pointer be cached in a local variable before the iteration
+    loop.
+
+    Returns NULL if there is an error.  If errmsg is non-NULL,
+    no Python exception is set when NULL is returned.
+    Instead, \*errmsg is set to an error message.  When errmsg is
+    non-NULL, the function may be safely called without holding
+    the Python GIL.
+
+.. cfunction:: char** NpyIter_GetDataPtrArray(NpyIter* iter)
+
+    This gives back a pointer to the ``niter`` data pointers.  If
+    ``NPY_ITER_NO_INNER_ITERATION`` was not specified, each data
+    pointer points to the current data item of the iterator.  If
+    no inner iteration was specified, it points to the first data
+    item of the inner loop.
+
+    This pointer may be cached before the iteration loop; calling
+    ``iternext`` will not change it.  This function may be safely
+    called without holding the Python GIL.
+
+.. cfunction:: npy_intp* NpyIter_GetIndexPtr(NpyIter* iter)
+
+    This gives back a pointer to the index being tracked, or NULL
+    if no index is being tracked.  It is only usable if one of
+    the flags ``NPY_ITER_C_INDEX`` or ``NPY_ITER_F_INDEX``
+    was specified during construction.
+
+When the flag ``NPY_ITER_NO_INNER_ITERATION`` is used, the code
+needs to know the parameters for doing the inner loop.  These
+functions provide that information.
+
+.. cfunction:: npy_intp* NpyIter_GetInnerStrideArray(NpyIter* iter)
+
+    Returns a pointer to an array of the ``niter`` strides,
+    one for each iterated object, to be used by the inner loop.
+
+    This pointer may be cached before the iteration loop; calling
+    ``iternext`` will not change it. This function may be safely
+    called without holding the Python GIL.
+
+.. cfunction:: npy_intp* NpyIter_GetInnerLoopSizePtr(NpyIter* iter)
+
+    Returns a pointer to the number of iterations the
+    inner loop should execute.
+
+    This address may be cached before the iteration loop; calling
+    ``iternext`` will not change it.  The value itself may change during
+    iteration, in particular if buffering is enabled.  This function
+    may be safely called without holding the Python GIL.
+
+.. cfunction:: void NpyIter_GetInnerFixedStrideArray(NpyIter* iter, npy_intp* out_strides)
+
+    Gets an array of strides which are fixed, that is, which will not
+    change during the entire iteration.  For strides that may change,
+    the value ``NPY_MAX_INTP`` is placed in the stride.
+
+    Once the iterator is prepared for iteration (after a reset if
+    ``NPY_ITER_DELAY_BUFALLOC`` was used), call this to get the strides
+    which may be used to select a fast inner loop function.  For example,
+    if the stride is 0, that means the inner loop can always load its
+    value into a variable once, then use the variable throughout the loop,
+    or if the stride equals the itemsize, a contiguous version for that
+    operand may be used.
+
+    This function may be safely called without holding the Python GIL.
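+
+    The selection logic described above could be sketched as follows.
+    This is plain C that compiles without the NumPy headers: ``MAX_INTP``
+    stands in for ``NPY_MAX_INTP``, ``intp_t`` for ``npy_intp``, and the
+    enum and function names are hypothetical.  In real code the stride
+    values would be the entries of ``out_strides``.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for npy_intp and NPY_MAX_INTP (assumptions of this sketch). */
typedef ptrdiff_t intp_t;
#define MAX_INTP PTRDIFF_MAX

typedef enum {
    OPERAND_GENERIC,    /* stride may change or is arbitrary */
    OPERAND_SCALAR,     /* stride 0: load the value once */
    OPERAND_CONTIGUOUS  /* stride equals the item size */
} operand_kind;

/* Classify one operand from its fixed stride and item size, in bytes. */
operand_kind
classify_operand(intp_t fixed_stride, intp_t itemsize)
{
    assert(itemsize > 0);

    if (fixed_stride == MAX_INTP) {
        /* The stride is not fixed, so no specialization is safe. */
        return OPERAND_GENERIC;
    }
    if (fixed_stride == 0) {
        return OPERAND_SCALAR;
    }
    if (fixed_stride == itemsize) {
        return OPERAND_CONTIGUOUS;
    }
    return OPERAND_GENERIC;
}
```

+    A caller would classify every operand once, before the iteration
+    loop, and pick a specialized inner loop only when all operands
+    allow it.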
diff --git a/doc/source/reference/c-api.rst b/doc/source/reference/c-api.rst
index 9bcc68b49..7c7775889 100644
--- a/doc/source/reference/c-api.rst
+++ b/doc/source/reference/c-api.rst
@@ -44,6 +44,7 @@ code.
    c-api.config
    c-api.dtype
    c-api.array
+   c-api.iterator
    c-api.ufunc
    c-api.generalized-ufuncs
    c-api.coremath