author    | Mark Wiebe <mwwiebe@gmail.com> | 2011-01-28 16:33:58 -0800
committer | Mark Wiebe <mwwiebe@gmail.com> | 2011-01-28 16:33:58 -0800
commit    | ab3dcf817066e7be08e0c4c6264db36674978284 (patch)
tree      | 12275e8e4917298414a22df212b70f2239801dc7 /doc
parent    | 67e5476a4178de55451501cfb01794c22d340b7a (diff)
parent    | 0046a594071117d8bd379f6e9bd2d2d7a6f9852e (diff)
download  | numpy-ab3dcf817066e7be08e0c4c6264db36674978284.tar.gz
Merge branch 'mw_neps'
Diffstat (limited to 'doc')
-rw-r--r-- | doc/neps/deferred-ufunc-evaluation.rst |  308
-rw-r--r-- | doc/neps/new-iterator-ufunc.rst        | 1981
2 files changed, 2289 insertions, 0 deletions
diff --git a/doc/neps/deferred-ufunc-evaluation.rst b/doc/neps/deferred-ufunc-evaluation.rst new file mode 100644 index 000000000..95633cab5 --- /dev/null +++ b/doc/neps/deferred-ufunc-evaluation.rst @@ -0,0 +1,308 @@ +:Title: Deferred UFunc Evaluation +:Author: Mark Wiebe <mwwiebe@gmail.com> +:Content-Type: text/x-rst +:Created: 30-Nov-2010 + +******** +Abstract +******** + +This NEP describes a proposal to add deferred evaluation to NumPy's +UFuncs. This will allow Python expressions like +"a[:] = b + c + d + e" to be evaluated in a single pass through all +the variables at once, with no temporary arrays. The resulting +performance will likely be comparable to the *numexpr* library, +but with a more natural syntax. + +This idea has some interaction with UFunc error handling and +the UPDATEIFCOPY flag, affecting the design and implementation, +but the result allows for the usage of deferred evaluation +with minimal effort from the Python user's perspective. + +********** +Motivation +********** + +NumPy's style of UFunc execution causes suboptimal performance for +large expressions, because multiple temporaries are allocated and +the inputs are swept through in multiple passes. The *numexpr* library +can outperform NumPy for such large expressions, by doing the execution +in small cache-friendly blocks, and evaluating the whole expression +per element. This results in one sweep through each input, which +is significantly better for the cache. + +For an idea of how to get this kind of behavior in NumPy without +changing the Python code, consider the C++ technique of +expression templates. These can be used to quite arbitrarily +rearrange expressions using +vectors or other data structures, example,:: + + A = B + C + D; + +can be transformed into something equivalent to:: + + for(i = 0; i < A.size; ++i) { + A[i] = B[i] + C[i] + D[i]; + } + +This is done by returning a proxy object that knows how to calculate +the result instead of returning the actual object. With modern C++ +optimizing compilers, the resulting machine code is often the same +as hand-written loops. For an example of this, see the +`Blitz++ Library <http://www.oonumerics.org/blitz/docs/blitz_3.html>`_. +A more recently created library for helping write expression templates +is `Boost Proto <http://beta.boost.org/doc/libs/1_44_0/doc/html/proto.html>`_. + +By using the same idea of returning a proxy object in Python, we +can accomplish the same thing dynamically. The return object is +an ndarray without its buffer allocated, and with enough knowledge +to calculate itself when needed. When a "deferred array" is +finally evaluated, we can use the expression tree made up of +all the operand deferred arrays, effectively creating a single new +UFunc to evaluate on the fly. + + +******************* +Example Python Code +******************* + +Here's how it might be used in NumPy.:: + + # a, b, c are large ndarrays + + with np.deferredstate(True): + + d = a + b + c + # Now d is a 'deferred array,' a, b, and c are marked READONLY + # similar to the existing UPDATEIFCOPY mechanism. + + print d + # Since the value of d was required, it is evaluated so d becomes + # a regular ndarray and gets printed. + + d[:] = a*b*c + # Here, the automatically combined "ufunc" that computes + # a*b*c effectively gets an out= parameter, so no temporary + # arrays are needed whatsoever. 
+ + e = a+b+c*d + # Now e is a 'deferred array,' a, b, c, and d are marked READONLY + + d[:] = a + # d was marked readonly, but the assignment could see that + # this was due to it being a deferred expression operand. + # This triggered the deferred evaluation so it could assign + # the value of a to d. + +There may be some surprising behavior, though.:: + + with np.deferredstate(True): + + d = a + b + c + # d is deferred + + e[:] = d + f[:] = d + g[:] = d + # d is still deferred, and its deferred expression + # was evaluated three times, once for each assignment. + # This could be detected, with d being converted to + # a regular ndarray the second time it is evaluated. + +I believe the usage that should be recommended in the documentation +is to leave the deferred state at its default, except when +evaluating a large expression that can benefit from it.:: + + # calculations + + with np.deferredstate(True): + x = <big expression> + + # more calculations + +This will avoid surprises which would be cause by always keeping +deferred usage True, like floating point warnings or exceptions +at surprising times when deferred expression are used later. +User questions like "Why does my print statement throw a +divide by zero error?" can hopefully be avoided by recommending +this approach. + +******************************** +Proposed Deferred Evaluation API +******************************** + +For deferred evaluation to work, the C API needs to be aware of its +existence, and be able to trigger evaluation when necessary. The +ndarray would gain two new flag. + + ``NPY_ISDEFERRED`` + + Indicates the expression evaluation for this ndarray instance + has been deferred. + + ``NPY_DEFERRED_WASWRITEABLE`` + + Can only be set when ``PyArray_GetDeferredUsageCount(arr) > 0``. + It indicates that when ``arr`` was first used in a deferred + expression, it was a writeable array. If this flag is set, + calling ``PyArray_CalculateAllDeferred()`` will make ``arr`` + writeable again. + +.. note:: QUESTION + + Should NPY_DEFERRED and NPY_DEFERRED_WASWRITEABLE be visible + to Python, or should accessing the flags from python trigger + PyArray_CalculateAllDeferred if necessary? + +The API would be expanded with a number of functions. + +``int PyArray_CalculateAllDeferred()`` + + This function forces all currently deferred calculations to occur. + + For example, if the error state is set to ignore all, and + np.seterr({all='raise'}), this would change what happens + to already deferred expressions. Thus, all the existing + deferred arrays should be evaluated before changing the + error state. + +``int PyArray_CalculateDeferred(PyArrayObject* arr)`` + + If 'arr' is a deferred array, allocates memory for it and + evaluates the deferred expression. If 'arr' is not a deferred + array, simply returns success. Returns NPY_SUCCESS or NPY_FAILURE. + +``int PyArray_CalculateDeferredAssignment(PyArrayObject* arr, PyArrayObject* out)`` + + If 'arr' is a deferred array, evaluates the deferred expression + into 'out', and 'arr' remains a deferred array. If 'arr' is not + a deferred array, copies its value into out. Returns NPY_SUCCESS + or NPY_FAILURE. + +``int PyArray_GetDeferredUsageCount(PyArrayObject* arr)`` + + Returns a count of how many deferred expressions use this array + as an operand. + +The Python API would be expanded as follows. + + ``numpy.setdeferred(state)`` + + Enables or disables deferred evaluation. True means to always + use deferred evaluation. False means to never use deferred + evaluation. 
None means to use deferred evaluation if the error + handling state is set to ignore everything. At NumPy initialization, + the deferred state is None. + + Returns the previous deferred state. + +``numpy.getdeferred()`` + + Returns the current deferred state. + +``numpy.deferredstate(state)`` + + A context manager for deferred state handling, similar to + ``numpy.errstate``. + + +Error Handling +============== + +Error handling is a thorny issue for deferred evaluation. If the +NumPy error state is {all='ignore'}, it might be reasonable to +introduce deferred evaluation as the default, however if a UFunc +can raise an error, it would be very strange for the later 'print' +statement to throw the exception instead of the actual operation which +caused the error. + +What may be a good approach is to by default enable deferred evaluation +only when the error state is set to ignore all, but allow user control with +'setdeferred' and 'getdeferred' functions. True would mean always +use deferred evaluation, False would mean never use it, and None would +mean use it only when safe (i.e. the error state is set to ignore all). + +Interaction With UPDATEIFCOPY +============================= + +The ``NPY_UPDATEIFCOPY`` documentation states: + + The data area represents a (well-behaved) copy whose information + should be transferred back to the original when this array is deleted. + + This is a special flag that is set if this array represents a copy + made because a user required certain flags in PyArray_FromAny and a + copy had to be made of some other array (and the user asked for this + flag to be set in such a situation). The base attribute then points + to the “misbehaved” array (which is set read_only). When the array + with this flag set is deallocated, it will copy its contents back to + the “misbehaved” array (casting if necessary) and will reset the + “misbehaved” array to NPY_WRITEABLE. If the “misbehaved” array was + not NPY_WRITEABLE to begin with then PyArray_FromAny would have + returned an error because NPY_UPDATEIFCOPY would not have been possible. + +The current implementation of UPDATEIFCOPY assumes that it is the only +mechanism mucking with the writeable flag in this manner. These mechanisms +must be aware of each other to work correctly. Here's an example of how +they might go wrong: + +1. Make a temporary copy of 'arr' with UPDATEIFCOPY ('arr' becomes read only) +2. Use 'arr' in a deferred expression (deferred usage count becomes one, + NPY_DEFERRED_WASWRITEABLE is **not** set, since 'arr' is read only) +3. Destroy the temporary copy, causing 'arr' to become writeable +4. Writing to 'arr' destroys the value of the deferred expression + +To deal with this issue, we make these two states mutually exclusive. + +* Usage of UPDATEIFCOPY checks the ``NPY_DEFERRED_WASWRITEABLE`` flag, + and if it's set, calls ``PyArray_CalculateAllDeferred`` to flush + all deferred calculation before proceeding. +* The ndarray gets a new flag ``NPY_UPDATEIFCOPY_TARGET`` indicating + the array will be updated and made writeable at some point in the + future. If the deferred evaluation mechanism sees this flag in + any operand, it triggers immediate evaluation. + +Other Implementation Details +============================ + +When a deferred array is created, it gets references to all the +operands of the UFunc, along with the UFunc itself. The +'DeferredUsageCount' is incremented for each operand, and later +gets decremented when the deferred expression is calculated or +the deferred array is destroyed. 
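
As a rough illustration of this bookkeeping, here is a small Python sketch; all
names are hypothetical, and it evaluates node by node rather than in the single
fused pass the proposal describes, so it only shows when the usage counts change.::

    import numpy as np

    # Hypothetical sketch: a deferred array records the ufunc and its operands,
    # and a per-operand usage count tracks how many pending expressions use it.
    deferred_usage_count = {}      # id(operand) -> number of pending uses

    class DeferredArray:
        def __init__(self, ufunc, operands):
            self.ufunc = ufunc         # e.g. np.add
            self.operands = operands   # ndarrays or nested DeferredArrays
            for op in operands:
                deferred_usage_count[id(op)] = deferred_usage_count.get(id(op), 0) + 1

        def evaluate(self, out=None):
            # A real implementation would fuse the expression tree into one
            # pass; here we simply walk it and decrement the usage counts.
            args = [op.evaluate() if isinstance(op, DeferredArray) else op
                    for op in self.operands]
            for op in self.operands:
                deferred_usage_count[id(op)] -= 1
            return self.ufunc(*args, out=out)

    a, b, c = np.arange(3.0), np.ones(3), np.full(3, 2.0)
    d = DeferredArray(np.add, [DeferredArray(np.add, [a, b]), c])   # a + b + c
    print(d.evaluate())    # [3. 4. 5.]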
+ +A global list of weak references to all the deferred arrays +is tracked, in order of creation. When ``PyArray_CalculateAllDeferred`` +gets called, the newest deferred array is calculated first. +This may release references to other deferred arrays contained +in the deferred expression tree, which then +never have to be calculated. + +Further Optimization +==================== + +Instead of conservatively disabling deferred evaluation when any +errors are not set to 'ignore', each UFunc could give a set +of possible errors it generates. Then, if all those errors +are set to 'ignore', deferred evaluation could be used even +if other errors are not set to ignore. + +Once the expression tree is explicitly stored, it is possible to +do transformations on it. For example add(add(a,b),c) could +be transformed into add3(a,b,c), or add(multiply(a,b),c) could +become fma(a,b,c) using the CPU fused multiply-add instruction +where available. + +While I've framed deferred evaluation as just for UFuncs, it could +be extended to other functions, such as dot(). For example, chained +matrix multiplications could be reordered to minimize the size +of intermediates, or peep-hole style optimizer passes could search +for patterns that match optimized BLAS/other high performance +library calls. + +For operations on really large arrays, integrating a JIT like LLVM into +this system might be a big benefit. The UFuncs and other operations +would provide bitcode, which could be inlined together and optimized +by the LLVM optimizers, then executed. In fact, the iterator itself +could also be represented in bitcode, allowing LLVM to consider +the entire iteration while doing its optimization. diff --git a/doc/neps/new-iterator-ufunc.rst b/doc/neps/new-iterator-ufunc.rst new file mode 100644 index 000000000..52a44413d --- /dev/null +++ b/doc/neps/new-iterator-ufunc.rst @@ -0,0 +1,1981 @@ +:Title: Optimizing Iterator/UFunc Performance +:Author: Mark Wiebe <mwwiebe@gmail.com> +:Content-Type: text/x-rst +:Created: 25-Nov-2010 + +***************** +Table of Contents +***************** + +.. contents:: + +******** +Abstract +******** + +This NEP proposes to replace the NumPy iterator and multi-iterator +with a single new iterator, designed to be more flexible and allow for +more cache-friendly data access. The new iterator also subsumes much +of the core ufunc functionality, making it easy to get the current +ufunc benefits in contexts which don't precisely fit the ufunc mold. +Key benefits include: + +* automatic reordering to find a cache-friendly access pattern +* standard and customizable broadcasting +* automatic type/byte-order/alignment conversions +* optional buffering to minimize conversion memory usage +* optional output arrays, with automatic allocation when unsupplied +* automatic output or common type selection + +A large fraction of this iterator design has already been implemented with +promising results. Construction overhead is slightly greater (a.flat: +0.5 us, newiter(a): 1.4 us and broadcast(a,b): 1.4 us, newiter([a,b]): +2.2 us), but, as shown in an example, it is already possible to improve +on the performance of the built-in NumPy mechanisms in pure Python code +together with the iterator. One example rewrites np.add, getting a +four times improvement with some Fortran-contiguous arrays, and +another improves image compositing code from 1.4s to 180ms. 
+ +The implementation attempts to take into account +the design decisions made in the NumPy 2.0 refactor, to make its future +integration into libndarray relatively simple. + +********** +Motivation +********** + +NumPy defaults to returning C-contiguous arrays from UFuncs. This can +result in extremely poor memory access patterns when dealing with data +that is structured differently. A simple timing example illustrates +this with a more than eight times performance hit from adding +Fortran-contiguous arrays together. All timings are done using Numpy +2.0dev (Nov 22, 2010) on an Athlon 64 X2 4200+, with a 64-bit OS.:: + + In [1]: import numpy as np + In [2]: a = np.arange(1000000,dtype=np.float32).reshape(10,10,10,10,10,10) + In [3]: b, c, d = a.copy(), a.copy(), a.copy() + + In [4]: timeit a+b+c+d + 10 loops, best of 3: 28.5 ms per loop + + In [5]: timeit a.T+b.T+c.T+d.T + 1 loops, best of 3: 237 ms per loop + + In [6]: timeit a.T.ravel('A')+b.T.ravel('A')+c.T.ravel('A')+d.T.ravel('A') + 10 loops, best of 3: 29.6 ms per loop + +In this case, it is simple to recover the performance by switching to +a view of the memory, adding, then reshaping back. To further examine +the problem and see how it isn’t always as trivial to work around, +let’s consider simple code for working with image buffers in NumPy. + +Image Compositing Example +========================= + +For a more realistic example, consider an image buffer. Images are +generally stored in a Fortran-contiguous order, and the colour +channel can be treated as either a structured 'RGB' type or an extra +dimension of length three. The resulting memory layout is neither C- +nor Fortran-contiguous, but is easy to work with directly in NumPy, +because of the flexibility of the ndarray. This appears ideal, because +it makes the memory layout compatible with typical C or C++ image code, +while simultaneously giving natural access in Python. Getting the color +of pixel (x,y) is just ‘image[x,y]’. + +The performance of this layout in NumPy turns out to be very poor. +Here is code which creates two black images, and does an ‘over’ +compositing operation on them.:: + + In [9]: image1 = np.zeros((1080,1920,3), dtype=np.float32).swapaxes(0,1) + In [10]: alpha1 = np.zeros((1080,1920,1), dtype=np.float32).swapaxes(0,1) + In [11]: image2 = np.zeros((1080,1920,3), dtype=np.float32).swapaxes(0,1) + In [12]: alpha2 = np.zeros((1080,1920,1), dtype=np.float32).swapaxes(0,1) + In [13]: def composite_over(im1, al1, im2, al2): + ....: return (im1 + (1-al1)*im2, al1 + (1-al1)*al2) + + In [14]: timeit composite_over(image1,alpha1,image2,alpha2) + 1 loops, best of 3: 3.51 s per loop + +If we give up the convenient layout, and use the C-contiguous default, +the performance is about seven times better.:: + + In [16]: image1 = np.zeros((1080,1920,3), dtype=np.float32) + In [17]: alpha1 = np.zeros((1080,1920,1), dtype=np.float32) + In [18]: image2 = np.zeros((1080,1920,3), dtype=np.float32) + In [19]: alpha2 = np.zeros((1080,1920,1), dtype=np.float32) + + In [20]: timeit composite_over(image1,alpha1,image2,alpha2) + 1 loops, best of 3: 581 ms per loop + +But this is not all, since it turns out that broadcasting the alpha +channel is exacting a performance price as well. 
If we use an alpha +channel with 3 values instead of one, we get:: + + In [21]: image1 = np.zeros((1080,1920,3), dtype=np.float32) + In [22]: alpha1 = np.zeros((1080,1920,3), dtype=np.float32) + In [23]: image2 = np.zeros((1080,1920,3), dtype=np.float32) + In [24]: alpha2 = np.zeros((1080,1920,3), dtype=np.float32) + + In [25]: timeit composite_over(image1,alpha1,image2,alpha2) + 1 loops, best of 3: 313 ms per loop + +For a final comparison, let’s see how it performs when we use +one-dimensional arrays to ensure just a single loop does the +calculation.:: + + In [26]: image1 = np.zeros((1080*1920*3), dtype=np.float32) + In [27]: alpha1 = np.zeros((1080*1920*3), dtype=np.float32) + In [28]: image2 = np.zeros((1080*1920*3), dtype=np.float32) + In [29]: alpha2 = np.zeros((1080*1920*3), dtype=np.float32) + + In [30]: timeit composite_over(image1,alpha1,image2,alpha2) + 1 loops, best of 3: 312 ms per loop + +To get a reference performance number, I implemented this simple operation +straightforwardly in C (careful to use the same compile options as NumPy). +If I emulated the memory allocation and layout of the Python code, the +performance was roughly 0.3 seconds, very much in line with NumPy’s +performance. Combining the operations into one pass reduced the time +to roughly 0.15 seconds. + +A slight variation of this example is to use a single memory block +with four channels (1920,1080,4) instead of separate image and alpha. +This is more typical in image processing applications, and here’s how +that looks with a C-contiguous layout.:: + + In [31]: image1 = np.zeros((1080,1920,4), dtype=np.float32) + In [32]: image2 = np.zeros((1080,1920,4), dtype=np.float32) + In [33]: def composite_over(im1, im2): + ....: ret = (1-im1[:,:,-1])[:,:,np.newaxis]*im2 + ....: ret += im1 + ....: return ret + + In [34]: timeit composite_over(image1,image2) + 1 loops, best of 3: 481 ms per loop + +To see the improvements that implementation of the new iterator as +proposed can produce, go to the example continued after the +proposed API, near the bottom of the document. + +************************* +Improving Cache-Coherency +************************* + +In order to get the best performance from UFunc calls, the pattern of +memory reads should be as regular as possible. Modern CPUs attempt to +predict the memory read/write pattern and fill the cache ahead of time. +The most predictable pattern is for all the inputs and outputs to be +sequentially processed in the same order. + +I propose that by default, the memory layout of the UFunc outputs be as +close to that of the inputs as possible. Whenever there is an ambiguity +or a mismatch, it defaults to a C-contiguous layout. + +To understand how to accomplish this, we first consider the strides of +all the inputs after the shapes have been normalized for broadcasting. +By determining whether a set of strides are compatible and/or ambiguous, +we can determine an output memory layout which maximizes coherency. + +In broadcasting, the input shapes are first transformed to broadcast +shapes by prepending singular dimensions, then the broadcast strides +are created, where any singular dimension’s stride is set to zero. + +Strides may be negative as well, and in certain cases this can be +normalized to fit the following discussion. If all the strides for a +particular axis are negative or zero, the strides for that dimension +can be negated after adjusting the base data pointers appropriately. 
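
This broadcast-stride construction is easy to prototype. The following toy
Python helper (not part of the proposal) uses an itemsize of 1 and reproduces
the example table that follows.::

    def broadcast_strides(shape, broadcast_ndim):
        # C-contiguous strides for the original shape, with itemsize 1.
        strides, s = [], 1
        for dim in reversed(shape):
            strides.insert(0, s)
            s *= dim
        # Prepend singular dimensions, then zero the stride of every
        # length-1 dimension so it broadcasts.
        shape = (1,) * (broadcast_ndim - len(shape)) + tuple(shape)
        strides = [0] * (broadcast_ndim - len(strides)) + strides
        return shape, tuple(0 if dim == 1 else st
                            for dim, st in zip(shape, strides))

    for shp in [(5, 3, 7), (5, 3, 1), (1, 7)]:
        print(shp, '->', broadcast_strides(shp, 3))
    # (5, 3, 7) -> ((5, 3, 7), (21, 7, 1))
    # (5, 3, 1) -> ((5, 3, 1), (3, 1, 0))
    # (1, 7)    -> ((1, 1, 7), (0, 0, 1))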
+ +Here's an example of how three inputs with C-contiguous layouts result in +broadcast strides. To simplify things, the examples use an itemsize of 1. + +================== ======== ======= ======= +Input shapes: (5,3,7) (5,3,1) (1,7) +Broadcast shapes: (5,3,7) (5,3,1) (1,1,7) +Broadcast strides: (21,7,1) (3,1,0) (0,0,1) +================== ======== ======= ======= + +*Compatible Strides* - A set of strides are compatible if there exists +a permutation of the axes such that the strides are decreasing for every +stride in the set, excluding entries that are zero. + +The example above satisfies the definition with the identity permutation. +In the motivation image example, the strides are slightly different if +we separate the colour and alpha information or not. The permutation +which demonstrates compatibility here is the transposition (0,1). + +============================= ===================== ===================== +Input/Broadcast shapes: Image (1920, 1080, 3) Alpha (1920, 1080, 1) +Broadcast strides (separate): (3,5760,1) (1,1920,0) +Broadcast strides (together): (4,7680,1) (4,7680,0) +============================= ===================== ===================== + +*Ambiguous Strides* - A set of compatible strides are ambiguous if +more than one permutation of the axes exists such that the strides are +decreasing for every stride in the set, excluding entries that are zero. + +This typically occurs when every axis has a 0-stride somewhere in the +set of strides. The simplest example is in two dimensions, as follows. + +================== ===== ===== +Broadcast shapes: (1,3) (5,1) +Broadcast strides: (0,1) (1,0) +================== ===== ===== + +There may, however, be unambiguous compatible strides without a single +input forcing the entire layout, as in this example: + +================== ======= ======= +Broadcast shapes: (1,3,4) (5,3,1) +Broadcast strides: (0,4,1) (3,1,0) +================== ======= ======= + +In the face of ambiguity, we have a choice to either completely throw away +the fact that the strides are compatible, or try to resolve the ambiguity +by adding an additional constraint. I think the appropriate choice +is to resolve it by picking the memory layout closest to C-contiguous, +but still compatible with the input strides. + +Output Layout Selection Algorithm +================================= + +The output ndarray memory layout we would like to produce is as follows: + +=============================== ============================================= +Consistent/Unambiguous strides: The single consistent layout +Consistent/Ambiguous strides: The consistent layout closest to C-contiguous +Inconsistent strides: C-contiguous +=============================== ============================================= + +Here is pseudo-code for an algorithm to compute the permutation for the +output layout.:: + + perm = range(ndim) # Identity, i.e. C-contiguous + # Insertion sort, ignoring 0-strides + # Note that the sort must be stable, and 0-strides may + # be reordered if necessary, but should be moved as little + # as possible. + for i0 = 1 to ndim-1: + # ipos is where perm[i0] will get inserted + ipos = i0 + j0 = perm[i0] + for i1 = i0-1 to 0: + j1 = perm[i1] + ambig, shouldswap = True, False + # Check whether any strides are ordered wrong + for strides in broadcast_strides: + if strides[j0] != 0 and strides[j1] != 0: + if strides[j0] > strides[j1]: + # Only set swap if it's still ambiguous. 
+ if ambig: + shouldswap = True + else: + # Set swap even if it's not ambiguous, + # because not swapping is the choice + # for conflicts as well. + shouldswap = False + ambig = False + # If there was an unambiguous comparison, either shift ipos + # to i1 or stop looking for the comparison + if not ambig: + if shouldswap: + ipos = i1 + else: + break + # Insert perm[i0] into the right place + if ipos != i0: + for i1 = i0-1 to ipos: + perm[i1+1] = perm[i1] + perm[ipos] = j0 + # perm is now the closest consistent ordering to C-contiguous + return perm + +********************* +Coalescing Dimensions +********************* + +In many cases, the memory layout allows for the use of a one-dimensional +loop instead of tracking multiple coordinates within the iterator. +The existing code already exploits this when the data is C-contiguous, +but since we're reordering the axes, we can apply this optimization +more generally. + +Once the iteration strides have been sorted to be monotonically +decreasing, any dimensions which could be coalesced are side by side. +If for all the operands, incrementing by strides[i+1] shape[i+1] times +is the same as incrementing by strides[i], or strides[i+1]*shape[i+1] == +strides[i], dimensions i and i+1 can be coalesced into a single dimension. + +Here is pseudo-code for coalescing.:: + + # Figure out which pairs of dimensions can be coalesced + can_coalesce = [False]*ndim + for strides, shape in zip(broadcast_strides, broadcast_shape): + for i = 0 to ndim-2: + if strides[i+1]*shape[i+1] == strides[i]: + can_coalesce[i] = True + # Coalesce the types + new_ndim = ndim - count_nonzero(can_coalesce) + for strides, shape in zip(broadcast_strides, broadcast_shape): + j = 0 + for i = 0 to ndim-1: + # Note that can_coalesce[ndim-1] is always False, so + # there is no out-of-bounds access here. + if can_coalesce[i]: + shape[i+1] = shape[i]*shape[i+1] + else: + strides[j] = strides[i] + shape[j] = shape[i] + j += 1 + +************************* +Inner Loop Specialization +************************* + +Specialization is handled purely by the inner loop function, so this +optimization is independent of the others. Some specialization is +already done, like for the reduce operation. The idea is mentioned in +http://projects.scipy.org/numpy/wiki/ProjectIdeas, “use intrinsics +(SSE-instructions) to speed up low-level loops in NumPy.” + +Here are some possibilities for two-argument functions, +covering the important cases of add/subtract/multiply/divide. + +* The first or second argument is a single value (i.e. a 0 stride + value) and does not alias the output. arr = arr + 1; arr = 1 + arr + + * Can load the constant once instead of reloading it from memory every time + +* The strides match the size of the data type. C- or + Fortran-contiguous data, for example + + * Can do a simple loop without using strides + +* The strides match the size of the data type, and they are + both 16-byte aligned (or differ from 16-byte aligned by the same offset) + + * Can use SSE to process multiple values at once + +* The first input and the output are the same single value + (i.e. a reduction operation). + + * This is already specialized for many UFuncs in the existing code + +The above cases are not generally mutually exclusive, for example a +constant argument may be combined with SSE when the strides match the +data type size, and reductions can be optimized with SSE as well. 
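
As a purely illustrative rendering of this case analysis, a dispatcher could
classify the inner loop from the operand strides roughly as follows; the helper
and its names are hypothetical, not proposed API.::

    import numpy as np

    def classify_add_loop(itemsize, s_in1, s_in2, s_out):
        # Order matters: a reduction also has a zero-stride input.
        if s_in1 == 0 and s_out == 0:
            return "reduction"           # first input and output are one value
        if s_in1 == 0 or s_in2 == 0:
            return "constant_operand"    # the constant is loaded once
        if s_in1 == s_in2 == s_out == itemsize:
            return "contiguous"          # unit stride, a candidate for SSE
        return "generic_strided"

    a = np.empty(100, dtype=np.float32)            # itemsize 4, stride 4
    b = np.broadcast_to(np.float32(1.0), (100,))   # broadcast scalar, stride 0
    print(classify_add_loop(4, a.strides[0], a.strides[0], a.strides[0]))  # contiguous
    print(classify_add_loop(4, a.strides[0], b.strides[0], a.strides[0]))  # constant_operand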
+ +********************** +Implementation Details +********************** + +Except for inner loop specialization, the discussed +optimizations significantly affect ufunc_object.c and the +PyArrayIterObject/PyArrayMultiIterObject used to do the broadcasting. +In general, it should be possible to emulate the current behavior where it +is desired, but I believe the default should be to produce and manipulate +memory layouts which will give the best performance. + +To support the new cache-friendly behavior, we introduce a new +option ‘K’ (for “keep”) for any ``order=`` parameter. + +The proposed ‘order=’ flags become as follows: + +=== ===================================================================================== +‘C’ C-contiguous layout +‘F’ Fortran-contiguous layout +‘A’ ‘F’ if the input(s) have a Fortran-contiguous layout, ‘C’ otherwise (“Any Contiguous”) +‘K’ a layout equivalent to ‘C’ followed by some permutation of the axes, as close to the layout of the input(s) as possible (“Keep Layout”) +=== ===================================================================================== + +Or as an enum:: + + /* For specifying array memory layout or iteration order */ + typedef enum { + /* Fortran order if inputs are all Fortran, C otherwise */ + NPY_ANYORDER=-1, + /* C order */ + NPY_CORDER=0, + /* Fortran order */ + NPY_FORTRANORDER=1, + /* An order as close to the inputs as possible */ + NPY_KEEPORDER=2 + } NPY_ORDER; + + +Perhaps a good strategy is to first implement the capabilities discussed +here without changing the defaults. Once they are implemented and +well-tested, the defaults can change from ``order='C'`` to ``order='K'`` +everywhere appropriate. UFuncs additionally should gain an ``order=`` +parameter to control the layout of their output(s). + +The iterator can do automatic casting, and I have created a sequence +of progressively more permissive casting rules. Perhaps for 2.0, NumPy +could adopt this enum as its prefered way of dealing with casting.:: + + /* For specifying allowed casting in operations which support it */ + typedef enum { + /* Only allow identical types */ + NPY_NO_CASTING=0, + /* Allow identical and byte swapped types */ + NPY_EQUIV_CASTING=1, + /* Only allow safe casts */ + NPY_SAFE_CASTING=2, + /* Allow safe casts and casts within the same kind */ + NPY_SAME_KIND_CASTING=3, + /* Allow any casts */ + NPY_UNSAFE_CASTING=4 + } NPY_CASTING; + +Iterator Rewrite +================ + +Based on an analysis of the code, it appears that refactoring the existing +iteration objects to implement these optimizations is prohibitively +difficult. Additionally, some usage of the iterator requires modifying +internal values or flags, so code using the iterator would have to +change anyway. Thus we propose creating a new iterator object which +subsumes the existing iterator functionality and expands it to account +for the optimizations. + +High level goals for the replacement iterator include: + +* Small memory usage and a low number of memory allocations. +* Simple cases (like flat arrays) should have very little overhead. +* Combine single and multiple iteration into one object. + +Capabilities that should be provided to user code: + +* Iterate in C, Fortran, or “Fastest” (default) order. +* Track a C-style or Fortran-style flat index if requested + (existing iterator always tracks a C-style index). This can be done + independently of the iteration order. 
+* Track the coordinates if requested (the existing iterator requires + manually changing an internal iterator flag to guarantee this). +* Skip iteration of the last internal dimension so that it can be + processed with an inner loop. +* Jump to a specific coordinate in the array. +* Iterate an arbitrary subset of axes (to support, for example, reduce + with multiple axes at once). +* Ability to automatically allocate output parameters if a NULL input + is provided, These outputs should have a memory layout matching + the iteration order, and are the mechanism for the ``order='K'`` + support. +* Automatic copying and/or buffering of inputs which do not satisfy + type/byte-order/alignment requirements. The caller's iteration inner + loop should be the same no matter what buffering or copying is done. + +Notes for implementation: + +* User code must never touch the inside of the iterator. This allows + for drastic changes of the internal memory layout in the future, if + higher-performance implementation strategies are found. +* Use a function pointer instead of a macro for iteration. + This way, specializations can be created for the common cases, + like when ndim is small, for different flag settings, and when the + number of arrays iterated is small. Also, an iteration pattern + can be prescribed that makes a copy of the function pointer first + to allow the compiler to keep the function pointer + in a register. +* Dynamically create the memory layout, to minimize the number of + cache lines taken up by the iterator (for LP64, + sizeof(PyArrayIterObject) is about 2.5KB, and a binary operation + like plus needs three of these for the Multi-Iterator). +* Isolate the C-API object from Python reference counting, so that + it can be used naturally from C. The Python object then becomes + a wrapper around the C iterator. This is analogous to the + PEP 3118 design separation of Py_buffer and memoryview. + +Proposed Iterator Memory Layout +=============================== + +The following struct describes the iterator memory. All items +are packed together, which means that different values of the flags, +ndim, and niter will produce slightly different layouts. :: + + struct { + /* Flags indicate what optimizations have been applied, and + * affect the layout of this struct. */ + uint32 itflags; + /* Number of iteration dimensions. If FLAGS_HASCOORDS is set, + * it matches the creation ndim, otherwise it may be smaller. */ + uint16 ndim; + /* Number of objects being iterated. This is fixed at creation time. */ + uint16 niter; + + /* The number of times the iterator will iterate */ + intp itersize; + + /* The permutation is only used when FLAGS_HASCOORDS is set, + * and is placed here so its position depends on neither ndim + * nor niter. */ + intp perm[ndim]; + + /* The data types of all the operands */ + PyArray_Descr *dtypes[niter]; + /* Backups of the starting axisdata 'ptr' values, to support Reset */ + char *resetdataptr[niter]; + /* Backup of the starting index value, to support Reset */ + npy_intp resetindex; + + /* When the iterator is destroyed, Py_XDECREF is called on all + these objects */ + PyObject *objects[niter]; + + /* Flags indicating read/write status and buffering + * for each operand. */ + uint8 opitflags[niter]; + /* Padding to make things intp-aligned again */ + uint8 padding[]; + + /* If some or all of the inputs are being buffered */ + #if (flags&FLAGS_BUFFERED) + struct buffer_data { + /* The size of the buffer, and which buffer we're on. 
+ * the i-th iteration has i = buffersize*bufferindex+pos + */ + intp buffersize; + /* For tracking position inside the buffer */ + intp size, pos; + /* The strides for the pointers */ + intp stride[niter]; + /* Pointers to the data for the current iterator position. + * The buffer_data.value ptr[i] equals either + * axis_data[0].ptr[i] or buffer_data.buffers[i] depending + * on whether copying to the buffer was necessary. + */ + char* ptr[niter]; + /* Functions to do the copyswap and casting necessary */ + transferfn_t readtransferfn[niter]; + void *readtransferdata[niter]; + transferfn_t writetransferfn[niter]; + void *writetransferdata[niter]; + /* Pointers to the allocated buffers for operands + * which the iterator determined needed buffering + */ + char *buffers[niter]; + }; + #endif /* FLAGS_BUFFERED */ + + /* Data per axis, starting with the most-frequently + * updated, and in decreasing order after that. */ + struct axis_data { + /* The shape of this axis */ + intp shape; + /* The current coordinate along this axis */ + intp coord; + /* The operand and index strides for this axis + intp stride[niter]; + {intp indexstride;} #if (flags&FLAGS_HASINDEX); + /* The operand pointers and index values for this axis */ + char* ptr[niter]; + {intp index;} #if (flags&FLAGS_HASINDEX); + }[ndim]; + }; + +The array of axis_data structs is ordered to be in increasing rapidity +of increment updates. If the ``perm`` is the identity, this means it’s +reversed from the C-order. This is done so data items touched +most often are closest to the beginning of the struct, where the +common properties are, resulting in increased cache coherency. +It also simplifies the iternext call, while making getcoord and +related functions slightly more complicated. + +Proposed Iterator API +===================== + +The existing iterator API includes functions like PyArrayIter_Check, +PyArray_Iter* and PyArray_ITER_*. The multi-iterator array includes +PyArray_MultiIter*, PyArray_Broadcast, and PyArray_RemoveSmallest. The +new iterator design replaces all of this functionality with a single object +and associated API. One goal of the new API is that all uses of the +existing iterator should be replaceable with the new iterator without +significant effort. + +The C-API naming convention chosen is based on the one in the numpy-refactor +branch, where libndarray has the array named ``NpyArray`` and functions +named ``NpyArray_*``. The iterator is named ``NpyIter`` and functions are +named ``NpyIter_*``. + +The Python exposure has the iterator named ``np.newiter``. One possible +release strategy for this iterator would be to release a 1.X (1.6?) version +with the iterator added, but not used by the NumPy code. Then, 2.0 can +be release with it fully integrated. If this strategy is chosen, the +naming convention and API should be finalized as much as possible before +the 1.X release. The name ``np.iter`` can't be used because it conflicts +with the Python built-in ``iter``. I would suggest the name ``np.nditer`` +within Python, as it is currently unused. + +In addition to the performance goals set out for the new iterator, +it appears the API can be refactored to better support some common +NumPy programming idioms. + +By moving some functionality currently in the UFunc code into the +iterator, it should make it easier for extension code which wants +to emulate UFunc behavior in cases which don't quite fit the +UFunc paradigm. In particular, emulating the UFunc buffering behavior +is not a trivial enterprise. 
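
For instance, a pure-Python emulation of np.add built on the proposed exposure
might look roughly like the sketch below. The flag spellings ('external_loop'
for skipping the inner dimension, 'buffered', and the per-operand
readonly/writeonly/allocate flags) are assumptions modelled on the C-level
flags described later in this document.::

    import numpy as np

    def my_add(x, y, out=None):
        # Hypothetical usage of the proposed Python exposure (np.nditer).
        it = np.nditer([x, y, out],
                       flags=['external_loop', 'buffered'],
                       op_flags=[['readonly'], ['readonly'],
                                 ['writeonly', 'allocate']])
        for a, b, c in it:
            # Each chunk arrives contiguous, broadcast-matched, and already
            # converted, so the inner loop stays trivial.
            c[...] = a + b
        return it.operands[2]

    a = np.arange(6, dtype=np.float32).reshape(2, 3)
    b = np.arange(3)
    print(my_add(a, b))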
+ +Old -> New Iterator API Conversion +---------------------------------- + +For the regular iterator: + +=============================== ============================================= +``PyArray_IterNew`` ``NpyIter_New`` +``PyArray_IterAllButAxis`` ``NpyIter_New`` + ``axes`` parameter **or** + Iterator flag ``NPY_ITER_NO_INNER_ITERATION`` +``PyArray_BroadcastToShape`` **NOT SUPPORTED** (but could be, if needed) +``PyArrayIter_Check`` Will need to add this in Python exposure +``PyArray_ITER_RESET`` ``NpyIter_Reset`` +``PyArray_ITER_NEXT`` Function pointer from ``NpyIter_GetIterNext`` +``PyArray_ITER_DATA`` ``NpyIter_GetDataPtrArray`` +``PyArray_ITER_GOTO`` ``NpyIter_GotoCoords`` +``PyArray_ITER_GOTO1D`` ``NpyIter_GotoIndex`` +``PyArray_ITER_NOTDONE`` Return value of ``iternext`` function pointer +=============================== ============================================= + +For the multi-iterator: + +=============================== ============================================= +``PyArray_MultiIterNew`` ``NpyIter_MultiNew`` +``PyArray_MultiIter_RESET`` ``NpyIter_Reset`` +``PyArray_MultiIter_NEXT`` Function pointer from ``NpyIter_GetIterNext`` +``PyArray_MultiIter_DATA`` ``NpyIter_GetDataPtrArray`` +``PyArray_MultiIter_NEXTi`` **NOT SUPPORTED** (always lock-step iteration) +``PyArray_MultiIter_GOTO`` ``NpyIter_GotoCoords`` +``PyArray_MultiIter_GOTO1D`` ``NpyIter_GotoIndex`` +``PyArray_MultiIter_NOTDONE`` Return value of ``iternext`` function pointer +``PyArray_Broadcast`` Handled by ``NpyIter_MultiNew`` +``PyArray_RemoveSmallest`` Iterator flag ``NPY_ITER_NO_INNER_ITERATION`` +=============================== ============================================= + +For other API calls: + +=============================== ============================================= +``PyArray_ConvertToCommonType`` Iterator flag ``NPY_ITER_COMMON_DTYPE`` +=============================== ============================================= + + +Iterator Pointer Type +--------------------- + +The iterator structure is internally generated, but a type is still needed +to provide warnings and/or errors when the wrong type is passed to +the API. We do this with a typedef of an incomplete struct + +``typedef struct NpyIter_InternalOnly NpyIter;`` + + +Construction and Destruction +---------------------------- + +``NpyIter* NpyIter_New(PyArrayObject* op, npy_uint32 flags, NPY_ORDER order, NPY_CASTING casting, PyArray_Descr* dtype, npy_intp a_ndim, npy_intp *axes, npy_intp buffersize)`` + + Creates an iterator for the given numpy array object ``op``. + + Flags that may be passed in ``flags`` are any combination + of the global and per-operand flags documented in + ``NpyIter_MultiNew``, except for ``NPY_ITER_ALLOCATE``. + + Any of the ``NPY_ORDER`` enum values may be passed to ``order``. For + efficient iteration, ``NPY_KEEPORDER`` is the best option, and the other + orders enforce the particular iteration pattern. + + Any of the ``NPY_CASTING`` enum values may be passed to ``casting``. + The values include ``NPY_NO_CASTING``, ``NPY_EQUIV_CASTING``, + ``NPY_SAFE_CASTING``, ``NPY_SAME_KIND_CASTING``, and + ``NPY_UNSAFE_CASTING``. To allow the casts to occur, copying or + buffering must also be enabled. + + If ``dtype`` isn't ``NULL``, then it requires that data type. + If copying is allowed, it will make a temporary copy if the data + is castable. If ``UPDATEIFCOPY`` is enabled, it will also copy + the data back with another cast upon iterator destruction. + + If ``a_ndim`` is greater than zero, ``axes`` must also be provided. 
+ In this case, ``axes`` is an ``a_ndim``-sized array of ``op``'s axes. + A value of -1 in ``axes`` means ``newaxis``. Within the ``axes`` + array, axes may not be repeated. + + If ``buffersize`` is zero, a default buffer size is used, + otherwise it specifies how big of a buffer to use. Buffers + which are powers of 2 such as 512 or 1024 are recommended. + + Returns NULL if there is an error, otherwise returns the allocated + iterator. + + To make an iterator similar to the old iterator, this should work.:: + + iter = NpyIter_New(op, NPY_ITER_READWRITE, + NPY_CORDER, NPY_NO_CASTING, NULL, 0, NULL); + + If you want to edit an array with aligned ``double`` code, + but the order doesn't matter, you would use this.:: + + dtype = PyArray_DescrFromType(NPY_DOUBLE); + iter = NpyIter_New(op, NPY_ITER_READWRITE | + NPY_ITER_BUFFERED | + NPY_ITER_NBO, + NPY_ITER_ALIGNED, + NPY_KEEPORDER, + NPY_SAME_KIND_CASTING, + dtype, 0, NULL); + Py_DECREF(dtype); + +``NpyIter* NpyIter_MultiNew(npy_intp niter, PyArrayObject** op, npy_uint32 flags, NPY_ORDER order, NPY_CASTING casting, npy_uint32 *op_flags, PyArray_Descr** op_dtypes, npy_intp oa_ndim, npy_intp **op_axes, npy_intp buffersize)`` + + Creates an iterator for broadcasting the ``niter`` array objects provided + in ``op``. + + For normal usage, use 0 for ``oa_ndim`` and NULL for ``op_axes``. + See below for a description of these parameters, which allow for + custom manual broadcasting as well as reordering and leaving out axes. + + Any of the ``NPY_ORDER`` enum values may be passed to ``order``. For + efficient iteration, ``NPY_KEEPORDER`` is the best option, and the other + orders enforce the particular iteration pattern. When using + ``NPY_KEEPORDER``, if you also want to ensure that the iteration is + not reversed along an axis, you should pass the flag + ``NPY_ITER_DONT_REVERSE_AXES``. + + Any of the ``NPY_CASTING`` enum values may be passed to ``casting``. + The values include ``NPY_NO_CASTING``, ``NPY_EQUIV_CASTING``, + ``NPY_SAFE_CASTING``, ``NPY_SAME_KIND_CASTING``, and + ``NPY_UNSAFE_CASTING``. To allow the casts to occur, copying or + buffering must also be enabled. + + If ``op_dtypes`` isn't ``NULL``, it specifies a data type or ``NULL`` + for each ``op[i]``. + + The parameter ``oa_ndim``, when non-zero, specifies the number of + dimensions that will be iterated with customized broadcasting. + If it is provided, ``op_axes`` must also be provided. + These two parameters let you control in detail how the + axes of the operand arrays get matched together and iterated. + In ``op_axes``, you must provide an array of ``niter`` pointers + to ``oa_ndim``-sized arrays of type ``npy_intp``. If an entry + in ``op_axes`` is NULL, normal broadcasting rules will apply. + In ``op_axes[j][i]`` is stored either a valid axis of ``op[j]``, or + -1 which means ``newaxis``. Within each ``op_axes[j]`` array, axes + may not be repeated. The following example is how normal broadcasting + applies to a 3-D array, a 2-D array, a 1-D array and a scalar.:: + + npy_intp oa_ndim = 3; /* # iteration axes */ + npy_intp op0_axes[] = {0, 1, 2}; /* 3-D operand */ + npy_intp op1_axes[] = {-1, 0, 1}; /* 2-D operand */ + npy_intp op2_axes[] = {-1, -1, 0}; /* 1-D operand */ + npy_intp op3_axes[] = {-1, -1, -1} /* 0-D (scalar) operand */ + npy_intp *op_axes[] = {op0_axes, op1_axes, op2_axes, op3_axes}; + + If ``buffersize`` is zero, a default buffer size is used, + otherwise it specifies how big of a buffer to use. 
Buffers + which are powers of 2 such as 512 or 1024 are recommended. + + Returns NULL if there is an error, otherwise returns the allocated + iterator. + + Flags that may be passed in ``flags``, applying to the whole + iterator, are: + + ``NPY_ITER_C_INDEX``, ``NPY_ITER_F_INDEX`` + + Causes the iterator to track an index matching C or + Fortran order. These options are mutually exclusive. + + ``NPY_ITER_COORDS`` + + Causes the iterator to track array coordinates. + This prevents the iterator from coalescing axes to + produce bigger inner loops. + + ``NPY_ITER_NO_INNER_ITERATION`` + + Causes the iterator to skip iteration of the innermost + loop, allowing the user of the iterator to handle it. + + This flag is incompatible with ``NPY_ITER_C_INDEX``, + ``NPY_ITER_F_INDEX``, and ``NPY_ITER_COORDS``. + + ``NPY_ITER_DONT_REVERSE_AXES`` + + This only affects the iterator when NPY_KEEPORDER is specified + for the order parameter. By default with NPY_KEEPORDER, the + iterator reverses axes which have negative strides, so that + memory is traversed in a forward direction. This disables + this step. Use this flag if you want to use the underlying + memory-ordering of the axes, but don't want an axis reversed. + This is the behavior of ``numpy.ravel(a, order='K')``, for + instance. + + ``NPY_ITER_COMMON_DTYPE`` + + Causes the iterator to convert all the operands to a common + data type, calculated based on the ufunc type promotion rules. + The flags for each operand must be set so that the appropriate + casting is permitted, and copying or buffering must be enabled. + + If the common data type is known ahead of time, don't use this + flag. Instead, set the requested dtype for all the operands. + + ``NPY_ITER_REFS_OK`` + + Indicates that arrays with reference types (object + arrays or structured arrays containing an object type) + may be accepted and used in the iterator. If this flag + is enabled, the caller must be sure to check whether + ``NpyIter_IterationNeedsAPI(iter)`` is true, in which case + it may not release the GIL during iteration. + + ``NPY_ITER_ZEROSIZE_OK`` + + Indicates that arrays with a size of zero should be permitted. + Since the typical iteration loop does not naturally work with + zero-sized arrays, you must check that the IterSize is non-zero + before entering the iteration loop. + + ``NPY_ITER_REDUCE_OK`` + + Permits writeable operands with a dimension with zero + stride and size greater than one. Note that such operands + must be read/write. + + When buffering is enabled, this also switches to a special + buffering mode which reduces the loop length as necessary to + not trample on values being reduced. + + Note that if you want to do a reduction on an automatically + allocated output, you must use ``NpyIter_GetOperandArray`` + to get its reference, then set every value to the reduction + unit before doing the iteration loop. In the case of a + buffered reduction, this means you must also specify the + flag ``NPY_ITER_DELAY_BUFALLOC``, then reset the iterator + after initializing the allocated operand to prepare the + buffers. + + ``NPY_ITER_RANGED`` + + Enables support for iteration of sub-ranges of the full + ``iterindex`` range ``[0, NpyIter_IterSize(iter))``. Use + the function ``NpyIter_ResetToIterIndexRange`` to specify + a range for iteration. + + This flag can only be used with ``NPY_ITER_NO_INNER_ITERATION`` + when ``NPY_ITER_BUFFERED`` is enabled. 
This is because + without buffering, the inner loop is always the size of the + innermost iteration dimension, and allowing it to get cut up + would require special handling, effectively making it more + like the buffered version. + + ``NPY_ITER_BUFFERED`` + + Causes the iterator to store buffering data, and use buffering + to satisfy data type, alignment, and byte-order requirements. + To buffer an operand, do not specify the ``NPY_ITER_COPY`` + or ``NPY_ITER_UPDATEIFCOPY`` flags, because they will + override buffering. Buffering is especially useful for Python + code using the iterator, allowing for larger chunks + of data at once to amortize the Python interpreter overhead. + + If used with ``NPY_ITER_NO_INNER_ITERATION``, the inner loop + for the caller may get larger chunks than would be possible + without buffering, because of how the strides are laid out. + + Note that if an operand is given the flag ``NPY_ITER_COPY`` + or ``NPY_ITER_UPDATEIFCOPY``, a copy will be made in preference + to buffering. Buffering will still occur when the array was + broadcast so elements need to be duplicated to get a constant + stride. + + In normal buffering, the size of each inner loop is equal + to the buffer size, or possibly larger if ``NPY_ITER_GROWINNER`` + is specified. If ``NPY_ITER_REDUCE_OK`` is enabled and + a reduction occurs, the inner loops may become smaller depending + on the structure of the reduction. + + ``NPY_ITER_GROWINNER`` + + When buffering is enabled, this allows the size of the inner + loop to grow when buffering isn't necessary. This option + is best used if you're doing a straight pass through all the + data, rather than anything with small cache-friendly arrays + of temporary values for each inner loop. + + ``NPY_ITER_DELAY_BUFALLOC`` + + When buffering is enabled, this delays allocation of the + buffers until one of the ``NpyIter_Reset*`` functions is + called. This flag exists to avoid wasteful copying of + buffer data when making multiple copies of a buffered + iterator for multi-threaded iteration. + + Another use of this flag is for setting up reduction operations. + After the iterator is created, and a reduction output + is allocated automatically by the iterator (be sure to use + READWRITE access), its value may be initialized to the reduction + unit. Use ``NpyIter_GetOperandArray`` to get the object. + Then, call ``NpyIter_Reset`` to allocate and fill the buffers + with their initial values. + + Flags that may be passed in ``op_flags[i]``, where ``0 <= i < niter``: + + ``NPY_ITER_READWRITE``, ``NPY_ITER_READONLY``, ``NPY_ITER_WRITEONLY`` + + Indicate how the user of the iterator will read or write + to ``op[i]``. Exactly one of these flags must be specified + per operand. + + ``NPY_ITER_COPY`` + + Allow a copy of ``op[i]`` to be made if it does not + meet the data type or alignment requirements as specified + by the constructor flags and parameters. + + ``NPY_ITER_UPDATEIFCOPY`` + + Triggers ``NPY_ITER_COPY``, and when an array operand + is flagged for writing and is copied, causes the data + in a copy to be copied back to ``op[i]`` when the iterator + is destroyed. + + If the operand is flagged as write-only and a copy is needed, + an uninitialized temporary array will be created and then copied + to back to ``op[i]`` on destruction, instead of doing + the unecessary copy operation. 
+ + ``NPY_ITER_NBO``, ``NPY_ITER_ALIGNED``, ``NPY_ITER_CONTIG`` + + Causes the iterator to provide data for ``op[i]`` + that is in native byte order, aligned according to + the dtype requirements, contiguous, or any combination. + + By default, the iterator produces pointers into the + arrays provided, which may be aligned or unaligned, and + with any byte order. If copying or buffering is not + enabled and the operand data doesn't satisfy the constraints, + an error will be raised. + + The contiguous constraint applies only to the inner loop, + successive inner loops may have arbitrary pointer changes. + + If the requested data type is in non-native byte order, + the NBO flag overrides it and the requested data type is + converted to be in native byte order. + + ``NPY_ITER_ALLOCATE`` + + This is for output arrays, and requires that the flag + ``NPY_ITER_WRITEONLY`` be set. If ``op[i]`` is NULL, + creates a new array with the final broadcast dimensions, + and a layout matching the iteration order of the iterator. + + When ``op[i]`` is NULL, the requested data type + ``op_dtypes[i]`` may be NULL as well, in which case it is + automatically generated from the dtypes of the arrays which + are flagged as readable. The rules for generating the dtype + are the same is for UFuncs. Of special note is handling + of byte order in the selected dtype. If there is exactly + one input, the input's dtype is used as is. Otherwise, + if more than one input dtypes are combined together, the + output will be in native byte order. + + After being allocated with this flag, the caller may retrieve + the new array by calling ``NpyIter_GetOperandArray`` and + getting the i-th object in the returned C array. The caller + must call Py_INCREF on it to claim a reference to the array. + + ``NPY_ITER_NO_SUBTYPE`` + + For use with ``NPY_ITER_ALLOCATE``, this flag disables + allocating an array subtype for the output, forcing + it to be a straight ndarray. + + TODO: Maybe it would be better to introduce a function + ``NpyIter_GetWrappedOutput`` and remove this flag? + + ``NPY_ITER_NO_BROADCAST`` + + Ensures that the input or output matches the iteration + dimensions exactly. + + ``NPY_ITER_WRITEABLE_REFERENCES`` + + By default, the iterator fails on creation if the iterator + has a writeable operand where the data type involves Python + references. Adding this flag indicates that the code using + the iterator is aware of this possibility and handles it + correctly. + +``NpyIter *NpyIter_Copy(NpyIter *iter)`` + + Makes a copy of the given iterator. This function is provided + primarily to enable multi-threaded iteration of the data. + + *TODO*: Move this to a section about multithreaded iteration. + + The recommended approach to multithreaded iteration is to + first create an iterator with the flags + ``NPY_ITER_NO_INNER_ITERATION``, ``NPY_ITER_RANGED``, + ``NPY_ITER_BUFFERED``, ``NPY_ITER_DELAY_BUFALLOC``, and + possibly ``NPY_ITER_GROWINNER``. Create a copy of this iterator + for each thread (minus one for the first iterator). Then, take + the iteration index range ``[0, NpyIter_GetIterSize(iter))`` and + split it up into tasks, for example using a TBB parallel_for loop. + When a thread gets a task to execute, it then uses its copy of + the iterator by calling ``NpyIter_ResetToIterIndexRange`` and + iterating over the full range. + + When using the iterator in multi-threaded code or in code not + holding the Python GIL, care must be taken to only call functions + which are safe in that context. 
``NpyIter_Copy`` cannot be safely + called without the Python GIL, because it increments Python + references. The ``Reset*`` and some other functions may be safely + called by passing in the ``errmsg`` parameter as non-NULL, so that + the functions will pass back errors through it instead of setting + a Python exception. + +``int NpyIter_UpdateIter(NpyIter *iter, npy_intp i, npy_uint32 op_flags, NPY_CASTING casting, PyArray_Descr *dtype)`` **UNIMPLEMENTED** + + Updates the i-th operand within the iterator to possibly have a new + data type or more restrictive flag attributes. A use-case for + this is to allow the automatic allocation to determine an + output data type based on the standard NumPy type promotion rules, + then use this function to convert the inputs and possibly the + automatic output to a different data type during processing. + + This operation can only be done if ``NPY_ITER_COORDS`` was passed + as a flag to the iterator. If coordinates are not needed, + call the function ``NpyIter_RemoveCoords()`` once no more calls to + ``NpyIter_UpdateIter`` are needed. + + If the i-th operand has already been copied, an error is thrown. To + avoid this, leave all the flags out except the read/write indicators + for any operand that later has ``NpyIter_UpdateIter`` called on it. + + The flags that may be passed in ``op_flags`` are + ``NPY_ITER_COPY``, ``NPY_ITER_UPDATEIFCOPY``, + ``NPY_ITER_NBO``, ``NPY_ITER_ALIGNED``, ``NPY_ITER_CONTIG``. + +``int NpyIter_RemoveAxis(NpyIter *iter, npy_intp axis)`` + + Removes an axis from iteration. This requires that + ``NPY_ITER_COORDS`` was set for iterator creation, and does not work + if buffering is enabled or an index is being tracked. This function + also resets the iterator to its initial state. + + This is useful for setting up an accumulation loop, for example. + The iterator can first be created with all the dimensions, including + the accumulation axis, so that the output gets created correctly. + Then, the accumulation axis can be removed, and the calculation + done in a nested fashion. + + **WARNING**: This function may change the internal memory layout of + the iterator. Any cached functions or pointers from the iterator + must be retrieved again! + + Returns ``NPY_SUCCEED`` or ``NPY_FAIL``. + + +``int NpyIter_RemoveCoords(NpyIter *iter)`` + + If the iterator has coordinates, this strips support for them, and + does further iterator optimizations that are possible if coordinates + are not needed. This function also resets the iterator to its initial + state. + + **WARNING**: This function may change the internal memory layout of + the iterator. Any cached functions or pointers from the iterator + must be retrieved again! + + After calling this function, ``NpyIter_HasCoords(iter)`` will + return false. + + Returns ``NPY_SUCCEED`` or ``NPY_FAIL``. + +``int NpyIter_RemoveInnerLoop(NpyIter *iter)`` + + If UpdateIter/RemoveCoords was used, you may want to specify the + flag ``NPY_ITER_NO_INNER_ITERATION``. This flag is not permitted + together with ``NPY_ITER_COORDS``, so this function is provided + to enable the feature after ``NpyIter_RemoveCoords`` is called. + This function also resets the iterator to its initial state. + + **WARNING**: This function changes the internal logic of the iterator. + Any cached functions or pointers from the iterator must be retrieved + again! + + Returns ``NPY_SUCCEED`` or ``NPY_FAIL``. + +``int NpyIter_Deallocate(NpyIter *iter)`` + + Deallocates the iterator object. 
This additionally frees any
+    copies made, triggering UPDATEIFCOPY behavior where necessary.
+
+    Returns ``NPY_SUCCEED`` or ``NPY_FAIL``.
+
+``int NpyIter_Reset(NpyIter *iter, char **errmsg)``
+
+    Resets the iterator back to its initial state, at the beginning
+    of the iteration range.
+
+    Returns ``NPY_SUCCEED`` or ``NPY_FAIL``. If errmsg is non-NULL,
+    no Python exception is set when ``NPY_FAIL`` is returned.
+    Instead, \*errmsg is set to an error message. When errmsg is
+    non-NULL, the function may be safely called without holding
+    the Python GIL.
+
+``int NpyIter_ResetToIterIndexRange(NpyIter *iter, npy_intp istart, npy_intp iend, char **errmsg)``
+
+    Resets the iterator and restricts it to the ``iterindex`` range
+    ``[istart, iend)``. See ``NpyIter_Copy`` for an explanation of
+    how to use this for multi-threaded iteration. This requires that
+    the flag ``NPY_ITER_RANGED`` was passed to the iterator constructor.
+
+    If you want to reset both the ``iterindex`` range and the base
+    pointers at the same time, you can do the following to avoid
+    extra buffer copying (be sure to add the return code error checks
+    when you copy this code).::
+
+        /* Set to a trivial empty range */
+        NpyIter_ResetToIterIndexRange(iter, 0, 0, NULL);
+        /* Set the base pointers */
+        NpyIter_ResetBasePointers(iter, baseptrs, NULL);
+        /* Set to the desired range */
+        NpyIter_ResetToIterIndexRange(iter, istart, iend, NULL);
+
+    Returns ``NPY_SUCCEED`` or ``NPY_FAIL``. If errmsg is non-NULL,
+    no Python exception is set when ``NPY_FAIL`` is returned.
+    Instead, \*errmsg is set to an error message. When errmsg is
+    non-NULL, the function may be safely called without holding
+    the Python GIL.
+
+``int NpyIter_ResetBasePointers(NpyIter *iter, char **baseptrs, char **errmsg)``
+
+    Resets the iterator back to its initial state, but using the values
+    in ``baseptrs`` for the data instead of the pointers from the arrays
+    being iterated. This function is intended to be used, together with
+    the ``op_axes`` parameter, by nested iteration code with two or more
+    iterators.
+
+    Returns ``NPY_SUCCEED`` or ``NPY_FAIL``. If errmsg is non-NULL,
+    no Python exception is set when ``NPY_FAIL`` is returned.
+    Instead, \*errmsg is set to an error message. When errmsg is
+    non-NULL, the function may be safely called without holding
+    the Python GIL.
+
+    *TODO*: Move the following into a special section on nested iterators.
+
+    Creating iterators for nested iteration requires some care. All
+    the iterator operands must match exactly, or the calls to
+    ``NpyIter_ResetBasePointers`` will be invalid. This means that
+    automatic copies and output allocation should not be used haphazardly.
+    It is still possible to use the automatic data conversion and casting
+    features of the iterator by creating one of the iterators with
+    all the conversion parameters enabled, then grabbing the allocated
+    operands with the ``NpyIter_GetOperandArray`` function and passing
+    them into the constructors for the rest of the iterators.
+
+    **WARNING**: When creating iterators for nested iteration,
+    the code must not use a dimension more than once in the different
+    iterators. If this is done, nested iteration will produce
+    out-of-bounds pointers during iteration.
+
+    **WARNING**: When creating iterators for nested iteration, buffering
+    can only be applied to the innermost iterator. If a buffered iterator
+    is used as the source for ``baseptrs``, it will point into a small buffer
+    instead of the array and the inner iteration will be invalid.
+
+    The pattern for using nested iterators is as follows.::
+
+        NpyIter *iter1, *iter2;
+        NpyIter_IterNext_Fn iternext1, iternext2;
+        char **dataptrs1;
+
+        /*
+         * With the exact same operands, no copies allowed, and
+         * no axis in op_axes used both in iter1 and iter2.
+         * Buffering may be enabled for iter2, but not for iter1.
+         */
+        iter1 = ...; iter2 = ...;
+
+        iternext1 = NpyIter_GetIterNext(iter1, NULL);
+        iternext2 = NpyIter_GetIterNext(iter2, NULL);
+        dataptrs1 = NpyIter_GetDataPtrArray(iter1);
+
+        do {
+            NpyIter_ResetBasePointers(iter2, dataptrs1, NULL);
+            do {
+                /* Use the iter2 values */
+            } while (iternext2(iter2));
+        } while (iternext1(iter1));
+
+``int NpyIter_GotoCoords(NpyIter *iter, npy_intp *coords)``
+
+    Adjusts the iterator to point to the ``ndim`` coordinates
+    pointed to by ``coords``. Returns an error if coordinates
+    are not being tracked, the coordinates are out of bounds,
+    or inner loop iteration is disabled.
+
+    Returns ``NPY_SUCCEED`` or ``NPY_FAIL``.
+
+``int NpyIter_GotoIndex(NpyIter *iter, npy_intp index)``
+
+    Adjusts the iterator to point to the ``index`` specified.
+    If the iterator was constructed with the flag
+    ``NPY_ITER_C_INDEX``, ``index`` is the C-order index,
+    and if the iterator was constructed with the flag
+    ``NPY_ITER_F_INDEX``, ``index`` is the Fortran-order
+    index. Returns an error if there is no index being tracked,
+    the index is out of bounds, or inner loop iteration is disabled.
+
+    Returns ``NPY_SUCCEED`` or ``NPY_FAIL``.
+
+``npy_intp NpyIter_GetIterSize(NpyIter *iter)``
+
+    Returns the number of elements being iterated. This is the product
+    of all the dimensions in the shape.
+
+``npy_intp NpyIter_GetReduceBlockSizeFactor(NpyIter *iter)`` **UNIMPLEMENTED**
+
+    This provides a factor that must divide into the blocksize used
+    for ranged iteration to safely multithread a reduction. If
+    the iterator has no reduction, it returns 1.
+
+    When using ranged iteration to multithread a reduction, there are
+    two possible ways to do the reduction:
+
+    If there is a big reduction to a small output, make a temporary
+    array initialized to the reduction unit for each thread, then have
+    each thread reduce into its temporary. When that is complete,
+    combine the temporaries together. You can detect this case by
+    observing that ``NpyIter_GetReduceBlockSizeFactor`` returns a
+    large value, for instance half or a third of ``NpyIter_GetIterSize``.
+    You should also check that the output is small just to be sure.
+
+    If there are many small reductions to a big output, and the reduction
+    dimensions are inner dimensions, ``NpyIter_GetReduceBlockSizeFactor``
+    will return a small number, and as long as the block size you choose
+    for multithreading is ``NpyIter_GetReduceBlockSizeFactor(iter)*n``
+    for some ``n``, the operation will be safe.
+
+    The bad case is when a reduction dimension is the outermost
+    loop in the iterator. For example, if you have a C-order
+    array with shape (3,1000,1000), and you reduce on dimension 0,
+    ``NpyIter_GetReduceBlockSizeFactor`` will return a size equal to
+    ``NpyIter_GetIterSize`` for ``NPY_KEEPORDER`` or ``NPY_CORDER``
+    iteration orders. Iterating in this order is bad for the CPU cache;
+    perhaps in the future another order possibility could be provided,
+    maybe ``NPY_REDUCEORDER``, which pushes the reduction axes to the
+    inner loop but otherwise is the same as ``NPY_KEEPORDER``.
+
+``npy_intp NpyIter_GetIterIndex(NpyIter *iter)``
+
+    Gets the ``iterindex`` of the iterator, which is an index matching
+    the iteration order of the iterator.
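+
+    To make the ``iterindex`` machinery concrete, here is a minimal
+    sketch (not part of the reference text above) of how a caller might
+    split the full range between two iterators for two threads, following
+    the multi-threaded pattern described under ``NpyIter_Copy``. It
+    assumes ``iter`` was created with ``NPY_ITER_RANGED``,
+    ``NPY_ITER_BUFFERED``, ``NPY_ITER_DELAY_BUFALLOC`` and
+    ``NPY_ITER_NO_INNER_ITERATION``.::
+
+        /* Sketch only: divide the iterindex range between two iterators */
+        int split_in_two(NpyIter *iter, NpyIter **out_second)
+        {
+            char *errmsg = NULL;
+            npy_intp size = NpyIter_GetIterSize(iter);
+            NpyIter *iter2 = NpyIter_Copy(iter);   /* requires the GIL */
+
+            if (iter2 == NULL) {
+                return NPY_FAIL;
+            }
+            /* With a non-NULL errmsg, these calls are safe without the GIL */
+            if (NpyIter_ResetToIterIndexRange(iter, 0, size/2,
+                                                &errmsg) != NPY_SUCCEED ||
+                NpyIter_ResetToIterIndexRange(iter2, size/2, size,
+                                                &errmsg) != NPY_SUCCEED) {
+                NpyIter_Deallocate(iter2);
+                return NPY_FAIL;
+            }
+            *out_second = iter2;
+            return NPY_SUCCEED;
+        }
+
+    Each thread then retrieves its own ``iternext`` function and data
+    pointer array from its iterator and loops over its sub-range, exactly
+    as in the looping constructs shown below.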
+
+``void NpyIter_GetIterIndexRange(NpyIter *iter, npy_intp *istart, npy_intp *iend)``
+
+    Gets the ``iterindex`` sub-range that is being iterated. If
+    ``NPY_ITER_RANGED`` was not specified, this always returns the
+    range ``[0, NpyIter_GetIterSize(iter))``.
+
+``int NpyIter_GotoIterIndex(NpyIter *iter, npy_intp iterindex)``
+
+    Adjusts the iterator to point to the ``iterindex`` specified.
+    The ``iterindex`` is an index matching the iteration order of the
+    iterator. Returns an error if the ``iterindex`` is out of bounds,
+    buffering is enabled, or inner loop iteration is disabled.
+
+    Returns ``NPY_SUCCEED`` or ``NPY_FAIL``.
+
+``int NpyIter_HasInnerLoop(NpyIter *iter)``
+
+    Returns 1 if the iterator handles the inner loop,
+    or 0 if the caller needs to handle it. This is controlled
+    by the constructor flag ``NPY_ITER_NO_INNER_ITERATION``.
+
+``int NpyIter_HasCoords(NpyIter *iter)``
+
+    Returns 1 if the iterator was created with the
+    ``NPY_ITER_COORDS`` flag, 0 otherwise.
+
+``int NpyIter_HasIndex(NpyIter *iter)``
+
+    Returns 1 if the iterator was created with the
+    ``NPY_ITER_C_INDEX`` or ``NPY_ITER_F_INDEX``
+    flag, 0 otherwise.
+
+``int NpyIter_IsBuffered(NpyIter *iter)``
+
+    Returns 1 if the iterator was created with the
+    ``NPY_ITER_BUFFERED`` flag, 0 otherwise.
+
+``int NpyIter_IsGrowInner(NpyIter *iter)``
+
+    Returns 1 if the iterator was created with the
+    ``NPY_ITER_GROWINNER`` flag, 0 otherwise.
+
+``npy_intp NpyIter_GetBufferSize(NpyIter *iter)``
+
+    If the iterator is buffered, returns the size of the buffer
+    being used, otherwise returns 0.
+
+``npy_intp NpyIter_GetNDim(NpyIter *iter)``
+
+    Returns the number of dimensions being iterated. If coordinates
+    were not requested in the iterator constructor, this value
+    may be smaller than the number of dimensions in the original
+    objects.
+
+``npy_intp NpyIter_GetNIter(NpyIter *iter)``
+
+    Returns the number of objects being iterated.
+
+``npy_intp *NpyIter_GetAxisStrideArray(NpyIter *iter, npy_intp axis)``
+
+    Gets the array of strides for the specified axis. Requires that
+    the iterator be tracking coordinates, and that buffering not
+    be enabled.
+
+    This may be used when you want to match up operand axes in
+    some fashion, then remove them with ``NpyIter_RemoveAxis`` to
+    handle their processing manually. By calling this function
+    before removing the axes, you can get the strides for the
+    manual processing.
+
+    Returns ``NULL`` on error.
+
+``int NpyIter_GetShape(NpyIter *iter, npy_intp *outshape)``
+
+    Returns the broadcast shape of the iterator in ``outshape``.
+    This can only be called on an iterator which supports coordinates.
+
+    Returns ``NPY_SUCCEED`` or ``NPY_FAIL``.
+
+``PyArray_Descr **NpyIter_GetDescrArray(NpyIter *iter)``
+
+    This gives back a pointer to the ``niter`` data type Descrs for
+    the objects being iterated. The result points into ``iter``,
+    so the caller does not gain any references to the Descrs.
+
+    This pointer may be cached before the iteration loop; calling
+    ``iternext`` will not change it.
+
+``PyObject **NpyIter_GetOperandArray(NpyIter *iter)``
+
+    This gives back a pointer to the ``niter`` operand PyObjects
+    that are being iterated. The result points into ``iter``,
+    so the caller does not gain any references to the PyObjects.
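+
+    For example, a small sketch (along the lines of the copy example
+    later in this document) of claiming a reference to an output that
+    the iterator allocated as operand ``i`` looks like this.::
+
+        /* Operand i was requested with NPY_ITER_WRITEONLY | NPY_ITER_ALLOCATE */
+        PyObject *ret = NpyIter_GetOperandArray(iter)[i];
+        /*
+         * The pointer held by the iterator is borrowed, so claim a
+         * reference before deallocating the iterator.
+         */
+        Py_INCREF(ret);
+        if (NpyIter_Deallocate(iter) != NPY_SUCCEED) {
+            Py_DECREF(ret);
+            ret = NULL;
+        }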
+
+``PyObject *NpyIter_GetIterView(NpyIter *iter, npy_intp i)``
+
+    This gives back a reference to a new ndarray view, which is a view
+    into the i-th object in the array ``NpyIter_GetOperandArray()``,
+    whose dimensions and strides match the internal optimized
+    iteration pattern. A C-order iteration of this view is equivalent
+    to the iterator's iteration order.
+
+    For example, if an iterator was created with a single array as its
+    input, and it was possible to rearrange all its axes and then
+    collapse it into a single strided iteration, this would return
+    a view that is a one-dimensional array.
+
+``void NpyIter_GetReadFlags(NpyIter *iter, char *outreadflags)``
+
+    Fills ``niter`` flags. Sets ``outreadflags[i]`` to 1 if
+    ``op[i]`` can be read from, and to 0 if not.
+
+``void NpyIter_GetWriteFlags(NpyIter *iter, char *outwriteflags)``
+
+    Fills ``niter`` flags. Sets ``outwriteflags[i]`` to 1 if
+    ``op[i]`` can be written to, and to 0 if not.
+
+Functions For Iteration
+-----------------------
+
+``NpyIter_IterNext_Fn NpyIter_GetIterNext(NpyIter *iter, char **errmsg)``
+
+    Returns a function pointer for iteration. A specialized version
+    of the function pointer may be calculated by this function
+    instead of being stored in the iterator structure. Thus, to
+    get good performance, it is required that the function pointer
+    be saved in a variable rather than retrieved for each loop iteration.
+
+    Returns NULL if there is an error. If errmsg is non-NULL,
+    no Python exception is set when NULL is returned.
+    Instead, \*errmsg is set to an error message. When errmsg is
+    non-NULL, the function may be safely called without holding
+    the Python GIL.
+
+    The typical looping construct is as follows.::
+
+        NpyIter_IterNext_Fn iternext = NpyIter_GetIterNext(iter, NULL);
+        char **dataptr = NpyIter_GetDataPtrArray(iter);
+
+        do {
+            /* use the addresses dataptr[0], ... dataptr[niter-1] */
+        } while(iternext(iter));
+
+    When ``NPY_ITER_NO_INNER_ITERATION`` is specified, the typical
+    inner loop construct is as follows.::
+
+        NpyIter_IterNext_Fn iternext = NpyIter_GetIterNext(iter, NULL);
+        char **dataptr = NpyIter_GetDataPtrArray(iter);
+        npy_intp *stride = NpyIter_GetInnerStrideArray(iter);
+        npy_intp *size_ptr = NpyIter_GetInnerLoopSizePtr(iter), size;
+        npy_intp iiter, niter = NpyIter_GetNIter(iter);
+
+        do {
+            size = *size_ptr;
+            while (size--) {
+                /* use the addresses dataptr[0], ... dataptr[niter-1] */
+                for (iiter = 0; iiter < niter; ++iiter) {
+                    dataptr[iiter] += stride[iiter];
+                }
+            }
+        } while (iternext(iter));
+
+    Observe that we are using the dataptr array inside the iterator, not
+    copying the values to a local temporary. This is possible because
+    when ``iternext()`` is called, these pointers will be overwritten
+    with fresh values, not incrementally updated.
+
+    If a compile-time fixed buffer is being used (both flags
+    ``NPY_ITER_BUFFERED`` and ``NPY_ITER_NO_INNER_ITERATION``), the
+    inner size may be used as a signal as well. The size is guaranteed
+    to become zero when ``iternext()`` returns false, enabling the
+    following loop construct.
Note that if you use this construct,
+    you should not pass ``NPY_ITER_GROWINNER`` as a flag, because it
+    will cause larger sizes under some circumstances.::
+
+        /* The constructor should have buffersize passed as this value */
+        #define FIXED_BUFFER_SIZE 1024
+
+        NpyIter_IterNext_Fn iternext = NpyIter_GetIterNext(iter, NULL);
+        char **dataptr = NpyIter_GetDataPtrArray(iter);
+        npy_intp *stride = NpyIter_GetInnerStrideArray(iter);
+        npy_intp *size_ptr = NpyIter_GetInnerLoopSizePtr(iter), size;
+        npy_intp i, iiter, niter = NpyIter_GetNIter(iter);
+
+        /* One loop with a fixed inner size */
+        size = *size_ptr;
+        while (size == FIXED_BUFFER_SIZE) {
+            /*
+             * This loop could be manually unrolled by a factor
+             * which divides into FIXED_BUFFER_SIZE
+             */
+            for (i = 0; i < FIXED_BUFFER_SIZE; ++i) {
+                /* use the addresses dataptr[0], ... dataptr[niter-1] */
+                for (iiter = 0; iiter < niter; ++iiter) {
+                    dataptr[iiter] += stride[iiter];
+                }
+            }
+            iternext(iter);
+            size = *size_ptr;
+        }
+
+        /* Finish-up loop with variable inner size */
+        if (size > 0) do {
+            size = *size_ptr;
+            while (size--) {
+                /* use the addresses dataptr[0], ... dataptr[niter-1] */
+                for (iiter = 0; iiter < niter; ++iiter) {
+                    dataptr[iiter] += stride[iiter];
+                }
+            }
+        } while (iternext(iter));
+
+``NpyIter_GetCoords_Fn NpyIter_GetGetCoords(NpyIter *iter, char **errmsg)``
+
+    Returns a function pointer for getting the coordinates
+    of the iterator. Returns NULL if the iterator does not
+    support coordinates. It is recommended that this function
+    pointer be cached in a local variable before the iteration
+    loop.
+
+    Returns NULL if there is an error. If errmsg is non-NULL,
+    no Python exception is set when NULL is returned.
+    Instead, \*errmsg is set to an error message. When errmsg is
+    non-NULL, the function may be safely called without holding
+    the Python GIL.
+
+``char **NpyIter_GetDataPtrArray(NpyIter *iter)``
+
+    This gives back a pointer to the ``niter`` data pointers. If
+    ``NPY_ITER_NO_INNER_ITERATION`` was not specified, each data
+    pointer points to the current data item of the iterator. If
+    no inner iteration was specified, it points to the first data
+    item of the inner loop.
+
+    This pointer may be cached before the iteration loop; calling
+    ``iternext`` will not change it. This function may be safely
+    called without holding the Python GIL.
+
+``npy_intp *NpyIter_GetIndexPtr(NpyIter *iter)``
+
+    This gives back a pointer to the index being tracked, or NULL
+    if no index is being tracked. It is only usable if one of
+    the flags ``NPY_ITER_C_INDEX`` or ``NPY_ITER_F_INDEX``
+    was specified during construction.
+
+When the flag ``NPY_ITER_NO_INNER_ITERATION`` is used, the code
+needs to know the parameters for doing the inner loop. These
+functions provide that information.
+
+``npy_intp *NpyIter_GetInnerStrideArray(NpyIter *iter)``
+
+    Returns a pointer to an array of the ``niter`` strides,
+    one for each iterated object, to be used by the inner loop.
+
+    This pointer may be cached before the iteration loop; calling
+    ``iternext`` will not change it. This function may be safely
+    called without holding the Python GIL.
+
+``npy_intp* NpyIter_GetInnerLoopSizePtr(NpyIter *iter)``
+
+    Returns a pointer to the number of iterations the
+    inner loop should execute.
+
+    This address may be cached before the iteration loop; calling
+    ``iternext`` will not change it. The value itself may change during
+    iteration, in particular if buffering is enabled. This function
+    may be safely called without holding the Python GIL.
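+
+    As an illustration of how these inner loop parameters fit together,
+    here is a sketch (not from the reference text) of a typed inner loop
+    that adds two operands into a third. It assumes an iterator
+    constructed with ``NPY_ITER_NO_INNER_ITERATION`` over three float32
+    operands.::
+
+        NpyIter_IterNext_Fn iternext = NpyIter_GetIterNext(iter, NULL);
+        char **dataptr = NpyIter_GetDataPtrArray(iter);
+        npy_intp *stride = NpyIter_GetInnerStrideArray(iter);
+        npy_intp *size_ptr = NpyIter_GetInnerLoopSizePtr(iter);
+
+        do {
+            /* Local copies of the pointers; the strides may differ per operand */
+            char *a = dataptr[0], *b = dataptr[1], *out = dataptr[2];
+            npy_intp n = *size_ptr;
+
+            while (n--) {
+                *(float *)out = *(float *)a + *(float *)b;
+                a += stride[0];
+                b += stride[1];
+                out += stride[2];
+            }
+        } while (iternext(iter));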
+
+``void NpyIter_GetInnerFixedStrideArray(NpyIter *iter, npy_intp *out_strides)``
+
+    Gets an array of strides which are fixed, or will not change during
+    the entire iteration. For strides that may change, the value
+    NPY_MAX_INTP is placed in the stride.
+
+    Once the iterator is prepared for iteration (after a reset if
+    ``NPY_ITER_DELAY_BUFALLOC`` was used), call this to get the strides
+    which may be used to select a fast inner loop function. For example,
+    if the stride is 0, that means the inner loop can always load its
+    value into a variable once, then use the variable throughout the loop,
+    or if the stride equals the itemsize, a contiguous version for that
+    operand may be used.
+
+    This function may be safely called without holding the Python GIL.
+
+Examples
+--------
+
+A copy function using the iterator. The ``order`` parameter
+is used to control the memory layout of the allocated
+result.
+
+If the input is a reference type, this function will fail.
+To fix this, the code must be changed to specially handle writeable
+references, and add ``NPY_ITER_WRITEABLE_REFERENCES`` to the flags.::
+
+    /* NOTE: This code has not been compiled/tested */
+    PyObject *CopyArray(PyObject *arr, NPY_ORDER order)
+    {
+        NpyIter *iter;
+        NpyIter_IterNext_Fn iternext;
+        PyObject *op[2], *ret;
+        npy_uint32 flags;
+        npy_uint32 op_flags[2];
+        npy_intp itemsize, *innersizeptr, innerstride;
+        char **dataptrarray;
+
+        /*
+         * No inner iteration - inner loop is handled by CopyArray code
+         */
+        flags = NPY_ITER_NO_INNER_ITERATION;
+        /*
+         * Tell the constructor to automatically allocate the output.
+         * The data type of the output will match that of the input.
+         */
+        op[0] = arr;
+        op[1] = NULL;
+        op_flags[0] = NPY_ITER_READONLY;
+        op_flags[1] = NPY_ITER_WRITEONLY | NPY_ITER_ALLOCATE;
+
+        /* Construct the iterator */
+        iter = NpyIter_MultiNew(2, op, flags, order, NPY_NO_CASTING,
+                                op_flags, NULL, 0, NULL);
+        if (iter == NULL) {
+            return NULL;
+        }
+
+        /*
+         * Make a copy of the iternext function pointer and
+         * a few other variables the inner loop needs.
+         */
+        iternext = NpyIter_GetIterNext(iter, NULL);
+        innerstride = NpyIter_GetInnerStrideArray(iter)[0];
+        itemsize = NpyIter_GetDescrArray(iter)[0]->elsize;
+        /*
+         * The inner loop size and data pointers may change during the
+         * loop, so just cache the addresses.
+         */
+        innersizeptr = NpyIter_GetInnerLoopSizePtr(iter);
+        dataptrarray = NpyIter_GetDataPtrArray(iter);
+
+        /*
+         * Note that because the iterator allocated the output,
+         * it matches the iteration order and is packed tightly,
+         * so we don't need to check it like the input.
+         */
+        if (innerstride == itemsize) {
+            do {
+                memcpy(dataptrarray[1], dataptrarray[0],
+                                        itemsize * (*innersizeptr));
+            } while (iternext(iter));
+        } else {
+            /* Should specialize this further based on item size... */
+            npy_intp i;
+            do {
+                npy_intp size = *innersizeptr;
+                char *src = dataptrarray[0], *dst = dataptrarray[1];
+                for(i = 0; i < size; i++, src += innerstride, dst += itemsize) {
+                    memcpy(dst, src, itemsize);
+                }
+            } while (iternext(iter));
+        }
+
+        /* Get the result from the iterator object array */
+        ret = NpyIter_GetOperandArray(iter)[1];
+        Py_INCREF(ret);
+
+        if (NpyIter_Deallocate(iter) != NPY_SUCCEED) {
+            Py_DECREF(ret);
+            return NULL;
+        }
+
+        return ret;
+    }
+
+Python Lambda UFunc Example
+---------------------------
+
+To show how the new iterator allows the definition of efficient UFunc-like
+functions in pure Python, we demonstrate the function ``luf``, which
+makes a lambda-expression act like a UFunc.
This is very similar to the +``numexpr`` library, but only takes a few lines of code. + +First, here is the definition of the ``luf`` function.:: + + def luf(lamdaexpr, *args, **kwargs): + """Lambda UFunc + + e.g. + c = luf(lambda i,j:i+j, a, b, order='K', + casting='safe', buffersize=8192) + + c = np.empty(...) + luf(lambda i,j:i+j, a, b, out=c, order='K', + casting='safe', buffersize=8192) + """ + + nargs = len(args) + op = args + (kwargs.get('out',None),) + it = np.newiter(op, ['buffered','no_inner_iteration'], + [['readonly','nbo_aligned']]*nargs + + [['writeonly','allocate','no_broadcast']], + order=kwargs.get('order','K'), + casting=kwargs.get('casting','safe'), + buffersize=kwargs.get('buffersize',0)) + while not it.finished: + it[-1] = lamdaexpr(*it[:-1]) + it.iternext() + + return it.operands[-1] + +Then, by using ``luf`` instead of straight Python expressions, we +can gain some performance from better cache behavior.:: + + In [2]: a = np.random.random((50,50,50,10)) + In [3]: b = np.random.random((50,50,1,10)) + In [4]: c = np.random.random((50,50,50,1)) + + In [5]: timeit 3*a+b-(a/c) + 1 loops, best of 3: 138 ms per loop + + In [6]: timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c) + 10 loops, best of 3: 60.9 ms per loop + + In [7]: np.all(3*a+b-(a/c) == luf(lambda a,b,c:3*a+b-(a/c), a, b, c)) + Out[7]: True + + +Python Addition Example +----------------------- + +The iterator has been mostly written and exposed to Python. To +see how it behaves, let's see what we can do with the np.add ufunc. +Even without changing the core of NumPy, we will be able to use +the iterator to make a faster add function. + +The Python exposure supplies two iteration interfaces, one which +follows the Python iterator protocol, and another which mirrors the +C-style do-while pattern. The native Python approach is better +in most cases, but if you need the iterator's coordinates or +index, use the C-style pattern. + +Here is how we might write an ``iter_add`` function, using the +Python iterator protocol.:: + + def iter_add_py(x, y, out=None): + addop = np.add + + it = np.newiter([x,y,out], [], + [['readonly'],['readonly'],['writeonly','allocate']]) + + for (a, b, c) in it: + addop(a, b, c) + + return it.operands[2] + +Here is the same function, but following the C-style pattern.:: + + def iter_add(x, y, out=None): + addop = np.add + + it = np.newiter([x,y,out], [], + [['readonly'],['readonly'],['writeonly','allocate']]) + + while not it.finished: + addop(it[0], it[1], it[2]) + it.iternext() + + return it.operands[2] + +Some noteworthy points about this function: + +* Cache np.add as a local variable to reduce namespace lookups +* Inputs are readonly, output is writeonly, and will be allocated + automatically if it is None. +* Uses np.add's out parameter to avoid an extra copy. + +Let's create some test variables, and time this function as well as the +built-in np.add.:: + + In [1]: a = np.arange(1000000,dtype='f4').reshape(100,100,100) + In [2]: b = np.arange(10000,dtype='f4').reshape(1,100,100) + In [3]: c = np.arange(10000,dtype='f4').reshape(100,100,1) + + In [4]: timeit iter_add(a, b) + 1 loops, best of 3: 7.03 s per loop + + In [5]: timeit np.add(a, b) + 100 loops, best of 3: 6.73 ms per loop + +At a thousand times slower, this is clearly not very good. One feature +of the iterator, designed to help speed up the inner loops, is the flag +``no_inner_iteration``. This is the same idea as the old iterator's +``PyArray_IterAllButAxis``, but slightly smarter. 
Let's modify +``iter_add`` to use this feature.:: + + def iter_add_noinner(x, y, out=None): + addop = np.add + + it = np.newiter([x,y,out], ['no_inner_iteration'], + [['readonly'],['readonly'],['writeonly','allocate']]) + + for (a, b, c) in it: + addop(a, b, c) + + return it.operands[2] + +The performance improves dramatically.:: + + In[6]: timeit iter_add_noinner(a, b) + 100 loops, best of 3: 7.1 ms per loop + +The performance is basically as good as the built-in function! It +turns out this is because the iterator was able to coalesce the last two +dimensions, resulting in 100 adds of 10000 elements each. If the +inner loop doesn't become as large, the performance doesn't improve +as dramatically. Let's use ``c`` instead of ``b`` to see how this works.:: + + In[7]: timeit iter_add_noinner(a, c) + 10 loops, best of 3: 76.4 ms per loop + +It's still a lot better than seven seconds, but still over ten times worse +than the built-in function. Here, the inner loop has 100 elements, +and it's iterating 10000 times. If we were coding in C, our performance +would already be as good as the built-in performance, but in Python +there is too much overhead. + +This leads us to another feature of the iterator, its ability to give +us views of the iterated memory. The views it gives us are structured +so that processing them in C-order, like the built-in NumPy code does, +gives the same access order as the iterator itself. Effectively, we +are using the iterator to solve for a good memory access pattern, then +using other NumPy machinery to efficiently execute it. Let's +modify ``iter_add`` once again.:: + + def iter_add_itview(x, y, out=None): + it = np.newiter([x,y,out], [], + [['readonly'],['readonly'],['writeonly','allocate']]) + + (a, b, c) = it.itviews + np.add(a, b, c) + + return it.operands[2] + +Now the performance pretty closely matches the built-in function's.:: + + In [8]: timeit iter_add_itview(a, b) + 100 loops, best of 3: 6.18 ms per loop + + In [9]: timeit iter_add_itview(a, c) + 100 loops, best of 3: 6.69 ms per loop + +Let us now step back to a case similar to the original motivation for the +new iterator. Here are the same calculations in Fortran memory order instead +Of C memory order.:: + + In [10]: a = np.arange(1000000,dtype='f4').reshape(100,100,100).T + In [12]: b = np.arange(10000,dtype='f4').reshape(100,100,1).T + In [11]: c = np.arange(10000,dtype='f4').reshape(1,100,100).T + + In [39]: timeit np.add(a, b) + 10 loops, best of 3: 34.3 ms per loop + + In [41]: timeit np.add(a, c) + 10 loops, best of 3: 31.6 ms per loop + + In [44]: timeit iter_add_itview(a, b) + 100 loops, best of 3: 6.58 ms per loop + + In [43]: timeit iter_add_itview(a, c) + 100 loops, best of 3: 6.33 ms per loop + +As you can see, the performance of the built-in function dropped +significantly, but our newly-written add function maintained essentially +the same performance. 
As one final test, let's try several adds chained +together.:: + + In [4]: timeit np.add(np.add(np.add(a,b), c), a) + 1 loops, best of 3: 99.5 ms per loop + + In [9]: timeit iter_add_itview(iter_add_itview(iter_add_itview(a,b), c), a) + 10 loops, best of 3: 29.3 ms per loop + +Also, just to check that it's doing the same thing,:: + + In [22]: np.all( + ....: iter_add_itview(iter_add_itview(iter_add_itview(a,b), c), a) == + ....: np.add(np.add(np.add(a,b), c), a) + ....: ) + + Out[22]: True + +Image Compositing Example Revisited +----------------------------------- + +For motivation, we had an example that did an 'over' composite operation +on two images. Now let's see how we can write the function with +the new iterator. + +Here is one of the original functions, for reference, and some +random image data.:: + + In [5]: rand1 = np.random.random_sample(1080*1920*4).astype(np.float32) + In [6]: rand2 = np.random.random_sample(1080*1920*4).astype(np.float32) + In [7]: image1 = rand1.reshape(1080,1920,4).swapaxes(0,1) + In [8]: image2 = rand2.reshape(1080,1920,4).swapaxes(0,1) + + In [3]: def composite_over(im1, im2): + ....: ret = (1-im1[:,:,-1])[:,:,np.newaxis]*im2 + ....: ret += im1 + ....: return ret + + In [4]: timeit composite_over(image1,image2) + 1 loops, best of 3: 1.39 s per loop + +Here's the same function, rewritten to use a new iterator. Note how +easy it was to add an optional output parameter.:: + + In [5]: def composite_over_it(im1, im2, out=None, buffersize=4096): + ....: it = np.newiter([im1, im1[:,:,-1], im2, out], + ....: ['buffered','no_inner_iteration'], + ....: [['readonly']]*3+[['writeonly','allocate']], + ....: op_axes=[None,[0,1,np.newaxis],None,None], + ....: buffersize=buffersize) + ....: while not it.finished: + ....: np.multiply(1-it[1], it[2], it[3]) + ....: it[3] += it[0] + ....: it.iternext() + ....: return it.operands[3] + + In [6]: timeit composite_over_it(image1, image2) + 1 loops, best of 3: 197 ms per loop + +A big speed improvement, over even the best previous attempt using +straight NumPy and a C-order array! 
By playing with the buffer size, we can
+see how the speed improves until we hit the limits of the CPU cache
+in the inner loop.::
+
+    In [7]: timeit composite_over_it(image1, image2, buffersize=2**7)
+    1 loops, best of 3: 1.23 s per loop
+
+    In [8]: timeit composite_over_it(image1, image2, buffersize=2**8)
+    1 loops, best of 3: 699 ms per loop
+
+    In [9]: timeit composite_over_it(image1, image2, buffersize=2**9)
+    1 loops, best of 3: 418 ms per loop
+
+    In [10]: timeit composite_over_it(image1, image2, buffersize=2**10)
+    1 loops, best of 3: 287 ms per loop
+
+    In [11]: timeit composite_over_it(image1, image2, buffersize=2**11)
+    1 loops, best of 3: 225 ms per loop
+
+    In [12]: timeit composite_over_it(image1, image2, buffersize=2**12)
+    1 loops, best of 3: 194 ms per loop
+
+    In [13]: timeit composite_over_it(image1, image2, buffersize=2**13)
+    1 loops, best of 3: 180 ms per loop
+
+    In [14]: timeit composite_over_it(image1, image2, buffersize=2**14)
+    1 loops, best of 3: 192 ms per loop
+
+    In [15]: timeit composite_over_it(image1, image2, buffersize=2**15)
+    1 loops, best of 3: 280 ms per loop
+
+    In [16]: timeit composite_over_it(image1, image2, buffersize=2**16)
+    1 loops, best of 3: 328 ms per loop
+
+    In [17]: timeit composite_over_it(image1, image2, buffersize=2**17)
+    1 loops, best of 3: 345 ms per loop
+
+And finally, to double check that it's working, we can compare the two
+functions.::
+
+    In [18]: np.all(composite_over(image1, image2) ==
+       ...:        composite_over_it(image1, image2))
+    Out[18]: True
+
+Image Compositing With NumExpr
+------------------------------
+
+As a test of the iterator, numexpr has been enhanced to allow use of
+the iterator instead of its internal broadcasting code. First, let's
+implement the composite operation with numexpr.::
+
+    In [22]: def composite_over_ne(im1, im2, out=None):
+       ....:     ima = im1[:,:,-1][:,:,np.newaxis]
+       ....:     return ne.evaluate("im1+(1-ima)*im2")
+
+    In [23]: timeit composite_over_ne(image1,image2)
+    1 loops, best of 3: 1.25 s per loop
+
+This beats the straight NumPy operation, but isn't very good. Switching
+to the iterator version of numexpr, we get a big improvement over the
+straight Python function using the iterator. Note that this is on
+a dual-core machine.::
+
+    In [29]: def composite_over_ne_it(im1, im2, out=None):
+       ....:     ima = im1[:,:,-1][:,:,np.newaxis]
+       ....:     return ne.evaluate_iter("im1+(1-ima)*im2")
+
+    In [30]: timeit composite_over_ne_it(image1,image2)
+    10 loops, best of 3: 67.2 ms per loop
+
+    In [31]: ne.set_num_threads(1)
+
+    In [32]: timeit composite_over_ne_it(image1,image2)
+    10 loops, best of 3: 91.1 ms per loop