summaryrefslogtreecommitdiff
path: root/doc/neps
diff options
context:
space:
mode:
Diffstat (limited to 'doc/neps')
-rw-r--r--doc/neps/deferred-ufunc-evaluation.rst308
1 files changed, 308 insertions, 0 deletions
diff --git a/doc/neps/deferred-ufunc-evaluation.rst b/doc/neps/deferred-ufunc-evaluation.rst
new file mode 100644
index 000000000..f16f41768
--- /dev/null
+++ b/doc/neps/deferred-ufunc-evaluation.rst
@@ -0,0 +1,308 @@
+:Title: Deferred UFunc Evaluation
+:Author: Mark Wiebe <mwwiebe@gmail.com>
+:Content-Type: text/x-rst
+:Created: 30-Nov-2010
+
+********
+Abstract
+********
+
+This NEP describes a proposal to add deferred evaluation to NumPy's
+UFuncs. This will allow Python expressions like
+"a[:] = b + c + d + e" to be evaluated in a single pass through all
+the variables at once, with no temporary arrays. The resulting
+performance will likely be comparable to the *numexpr* library,
+but with a more natural syntax.
+
+This idea has some interaction with UFunc error handling and
+the UPDATEIFCOPY flag, affecting the design and implementation,
+but the result allows for the usage of deferred evaluation
+with minimal effort from the Python user's perspective.
+
+**********
+Motivation
+**********
+
+NumPy's style of UFunc execution causes suboptimal performance for
+large expressions, because multiple temporaries are allocated and
+the inputs are swept through in multiple passes. The *numexpr* library
+can outperform NumPy for such large expressions, by doing the execution
+in small cache-friendly blocks, and evaluating the whole expression
+per element. This results in one sweep through each input, which
+is significantly better for the cache.
+
+For an idea of how to get this kind of behavior in NumPy without
+changing the Python code, consider the C++ technique of
+expression templates. These can be used to quite arbitrarily
+rearrange expressions using
+vectors or other data structures, example,::
+
+ A = B + C + D;
+
+can be transformed into something equivalent to::
+
+ for(i = 0; i < A.size; ++i) {
+ A[i] = B[i] + C[i] + D[i];
+ }
+
+This is done by returning a proxy object that knows how to calculate
+the result instead of returning the actual object. With modern C++
+optimizing compilers, the resulting machine code is often the same
+as hand-written loops. For an example of this, see the
+`Blitz++ Library <http://www.oonumerics.org/blitz/docs/blitz_3.html>`_.
+A more recently created library for helping write expression templates
+is `Boost Proto <http://beta.boost.org/doc/libs/1_44_0/doc/html/proto.html>`_.
+
+By using the same idea of returning a proxy object in Python, we
+can accomplish the same thing dynamically. The return object is
+an ndarray without its buffer allocated, and with enough knowledge
+to calculate itself when needed. When a "deferred array" is
+finally evaluated, we can use the expression tree made up of
+all the operand deferred arrays, effectively creating a single new
+UFunc to evaluate on the fly.
+
+
+*******************
+Example Python Code
+*******************
+
+Here's how it might be used in NumPy.::
+
+ # a, b, c are large ndarrays
+
+ with np.deferredstate(True):
+
+ d = a + b + c
+ # Now d is a 'deferred array,' a, b, and c are marked READONLY
+ # similar to the existing UPDATEIFCOPY mechanism.
+
+ print d
+ # Since the value of d was required, it is evaluated so d becomes
+ # a regular ndarray and gets printed.
+
+ d[:] = a*b*c
+ # Here, the automatically combined "ufunc" that computes
+ # a*b*c effectively gets an out= parameter, so no temporary
+ # arrays are needed whatsoever.
+
+ e = a+b+c*d
+ # Now e is a 'deferred array,' a, b, c, and d are marked READONLY
+
+ d[:] = a
+ # d was marked readonly, but the assignment could see that
+ # this was due to it being a deferred expression operand.
+ # This triggered the deferred evaluation so it could assign
+ # the value of a to d.
+
+There may be some surprising behavior, though.::
+
+ with np.deferredstate(True):
+
+ d = a + b + c
+ # d is deferred
+
+ e[:] = d
+ f[:] = d
+ g[:] = d
+ # d is still deferred, and its deferred expression
+ # was evaluated three times, once for each assignment.
+ # This could be detected, with d being converted to
+ # a regular ndarray the second time it is evaluated.
+
+I believe the usage that should be recommended in the documentation
+is to leave the deferred state at its default, except when
+evaluating a large expression that can benefit from it.::
+
+ # calculations
+
+ with np.deferredstate(True):
+ x = <big expression>
+
+ # more calculations
+
+This will avoid surprises which would be cause by always keeping
+deferred usage True, like floating point warnings or exceptions
+at surprising times when deferred expression are used later.
+User questions like "Why does my print statement throw a
+divide by zero error?" can hopefully be avoided by recommending
+this approach.
+
+********************************
+Proposed Deferred Evaluation API
+********************************
+
+For deferred evaluation to work, the C API needs to be aware of its
+existence, and be able to trigger evaluation when necessary. The
+ndarray would gain two new flag.
+
+ ``NPY_ISDEFERRED``
+
+ Indicates the expression evaluation for this ndarray instance
+ has been deferred.
+
+ ``NPY_DEFERRED_WASWRITEABLE``
+
+ Can only be set when ``PyArray_GetDeferredUsageCount(arr) > 0``.
+ It indicates that when ``arr`` was first used in a deferred
+ expression, it was a writeable array. If this flag is set,
+ calling ``PyArray_CalculateAllDeferred()`` will make ``arr``
+ writeable again.
+
+.. note:: QUESTION
+
+ Should NPY_DEFERRED and NPY_DEFERRED_WASWRITEABLE be visible
+ to Python, or should accessing the flags from python trigger
+ PyArray_CalculateAllDeferred if necessary?
+
+The API would be expanded with a number of functions.
+
+``int PyArray_CalculateAllDeferred()``
+
+ This function forces all currently deferred calculations to occur.
+
+ For example, if the error state is set to ignore all, and
+ np.seterr({all='raise'}), this would change what happens
+ to already deferred expressions. Thus, all the existing
+ deferred arrays should be evaluated before changing the
+ error state.
+
+``int PyArray_CalculateDeferred(PyArrayObject* arr)``
+
+ If 'arr' is a deferred array, allocates memory for it and
+ evaluates the deferred expression. If 'arr' is not a deferred
+ array, simply returns success. Returns NPY_SUCCESS or NPY_FAILURE.
+
+``int PyArray_CalculateDeferredAssignment(PyArrayObject* arr, PyArrayObject* out)``
+
+ If 'arr' is a deferred array, evaluates the deferred expression
+ into 'out', and 'arr' remains a deferred array. If 'arr' is not
+ a deferred array, copies its value into out. Returns NPY_SUCCESS
+ or NPY_FAILURE.
+
+``int PyArray_GetDeferredUsageCount(PyArrayObject* arr)``
+
+ Returns a count of how many deferred expressions use this array
+ as an operand.
+
+The Python API would be expanded as follows.
+
+ ``numpy.setdeferred(state)``
+
+ Enables or disables deferred evaluation. True means to always
+ use deferred evaluation. False means to never use deferred
+ evaluation. None means to use deferred evaluation if the error
+ handling state is set to ignore everything. At NumPy initialization,
+ the deferred state is None.
+
+ Returns the previous deferred state.
+
+``numpy.getdeferred()``
+
+ Returns the current deferred state.
+
+``numpy.deferredstate(state)``
+
+ A context manager for deferred state handling, similar to
+ n``umpy.errstate``.
+
+
+Error Handling
+==============
+
+Error handling is a thorny issue for deferred evaluation. If the
+NumPy error state is {all='ignore'}, it might be reasonable to
+introduce deferred evaluation as the default, however if a UFunc
+can raise an error, it would be very strange for the later 'print'
+statement to throw the exception instead of the actual operation which
+caused the error.
+
+What may be a good approach is to by default enable deferred evaluation
+only when the error state is set to ignore all, but allow user control with
+'setdeferred' and 'getdeferred' functions. True would mean always
+use deferred evaluation, False would mean never use it, and None would
+mean use it only when safe (i.e. the error state is set to ignore all).
+
+Interaction With UPDATEIFCOPY
+=============================
+
+The ``NPY_UPDATEIFCOPY`` documentation states:
+
+ The data area represents a (well-behaved) copy whose information
+ should be transferred back to the original when this array is deleted.
+
+ This is a special flag that is set if this array represents a copy
+ made because a user required certain flags in PyArray_FromAny and a
+ copy had to be made of some other array (and the user asked for this
+ flag to be set in such a situation). The base attribute then points
+ to the “misbehaved” array (which is set read_only). When the array
+ with this flag set is deallocated, it will copy its contents back to
+ the “misbehaved” array (casting if necessary) and will reset the
+ “misbehaved” array to NPY_WRITEABLE. If the “misbehaved” array was
+ not NPY_WRITEABLE to begin with then PyArray_FromAny would have
+ returned an error because NPY_UPDATEIFCOPY would not have been possible.
+
+The current implementation of UPDATEIFCOPY assumes that it is the only
+mechanism mucking with the writeable flag in this manner. These mechanisms
+must be aware of each other to work correctly. Here's an example of how
+they might go wrong:
+
+1. Make a temporary copy of 'arr' with UPDATEIFCOPY ('arr' becomes read only)
+2. Use 'arr' in a deferred expression (deferred usage count becomes one,
+ NPY_DEFERRED_WASWRITEABLE is **not** set, since 'arr' is read only)
+3. Destroy the temporarry copy, causing 'arr' to become writeable
+4. Writing to 'arr' destroys the value of the deferred expression
+
+To deal with this issue, we make these two states mutually exclusive.
+
+* Usage of UPDATEIFCOPY checks the ``NPY_DEFERRED_WASWRITEABLE`` flag,
+ and if it's set, calls ``PyArray_CalculateAllDeferred`` to flush
+ all deferred calculation before proceeding.
+* The ndarray gets a new flag ``NPY_UPDATEIFCOPY_TARGET`` indicating
+ the array will be updated and made writeable at some point in the
+ future. If the deferred evaluation mechanism sees this flag in
+ any operand, it triggers immediate evaluation.
+
+Other Implementation Details
+============================
+
+When a deferred array is created, it gets references to all the
+operands of the UFunc, along with the UFunc itself. The
+'DeferredUsageCount' is incremented for each operand, and later
+gets decremented when the deferred expression is calculated or
+the deferred array is destroyed.
+
+A global list of weak references to all the deferred arrays
+is tracked, in order of creation. When ``PyArray_CalculateAllDeferred``
+gets called, the newest deferred array is calculated first.
+This may release references to other deferred arrays contained
+in the deferred expression tree, which then
+never have to be calculated.
+
+Further Optimization
+====================
+
+Instead of conservatively disabling deferred evaluation when any
+errors are not set to 'ignore', each UFunc could give a set
+of possible errors it generates. Then, if all those errors
+are set to 'ignore', deferred evaluation could be used even
+if other errors are not set to ignore.
+
+Once the expression tree is explicitly stored, it is possible to
+do transformations on it. For example add(add(a,b),c) could
+be transformed into add3(a,b,c), or add(multiply(a,b),c) could
+become fma(a,b,c) using the CPU fused multiply-add instruction
+where available.
+
+While I've framed deferred evaluation as just for UFuncs, it could
+be extended to other functions, such as dot(). For example, chained
+matrix multiplications could be reordered to minimize the size
+of intermediates, or peep-hole style optimizer passes could search
+for patterns that match optimized BLAS/other high performance
+library calls.
+
+For operations on really large arrays, integrating a JIT like LLVM into
+this system might be a big benefit. The UFuncs and other operations
+would provide bitcode, which could be inlined together and optimized
+by the LLVM optimizers, then executed. In fact, the iterator itself
+could also be represented in bitcode, allowing LLVM to consider
+the entire iteration while doing its optimization.