1 files changed, 308 insertions, 0 deletions
diff --git a/doc/neps/deferred-ufunc-evaluation.rst b/doc/neps/deferred-ufunc-evaluation.rst
new file mode 100644
index 000000000..f16f41768
--- /dev/null
+++ b/doc/neps/deferred-ufunc-evaluation.rst
@@ -0,0 +1,308 @@
+:Title: Deferred UFunc Evaluation
+:Author: Mark Wiebe <mwwiebe@gmail.com>
+:Content-Type: text/x-rst
+:Created: 30-Nov-2010
+
+********
+Abstract
+********
+
+This NEP describes a proposal to add deferred evaluation to NumPy's
+UFuncs.  This will allow Python expressions like
+"a[:] = b + c + d + e" to be evaluated in a single pass through all
+the variables at once, with no temporary arrays.  The resulting
+performance will likely be comparable to the *numexpr* library,
+but with a more natural syntax.
+
+This idea has some interaction with UFunc error handling and
+the UPDATEIFCOPY flag, affecting the design and implementation,
+but the result allows for the usage of deferred evaluation
+with minimal effort from the Python user's perspective.
+
+**********
+Motivation
+**********
+
+NumPy's style of UFunc execution causes suboptimal performance for
+large expressions, because multiple temporaries are allocated and
+the inputs are swept through in multiple passes.  The *numexpr* library
+can outperform NumPy for such large expressions, by doing the execution
+in small cache-friendly blocks, and evaluating the whole expression
+per element.  This results in one sweep through each input, which
+is significantly better for the cache.
+
+For an idea of how to get this kind of behavior in NumPy without
+changing the Python code, consider the C++ technique of
+expression templates. These can be used to quite arbitrarily
+rearrange expressions using
+vectors or other data structures, example,::
+
+    A = B + C + D;
+
+can be transformed into something equivalent to::
+
+    for(i = 0; i < A.size; ++i) {
+        A[i] = B[i] + C[i] + D[i];
+    }
+
+This is done by returning a proxy object that knows how to calculate
+the result instead of returning the actual object.  With modern C++
+optimizing compilers, the resulting machine code is often the same
+as hand-written loops.  For an example of this, see the
+`Blitz++ Library <http://www.oonumerics.org/blitz/docs/blitz_3.html>`_.
+A more recently created library for helping write expression templates
+is `Boost Proto <http://beta.boost.org/doc/libs/1_44_0/doc/html/proto.html>`_.
+
+By using the same idea of returning a proxy object in Python, we
+can accomplish the same thing dynamically.  The return object is
+an ndarray without its buffer allocated, and with enough knowledge
+to calculate itself when needed.  When a "deferred array" is
+finally evaluated, we can use the expression tree made up of
+all the operand deferred arrays, effectively creating a single new
+UFunc to evaluate on the fly.
+
+
+*******************
+Example Python Code
+*******************
+
+Here's how it might be used in NumPy.::
+
+    # a, b, c are large ndarrays
+
+    with np.deferredstate(True):
+
+        d = a + b + c
+        # Now d is a 'deferred array,' a, b, and c are marked READONLY
+        # similar to the existing UPDATEIFCOPY mechanism.
+
+        print d
+        # Since the value of d was required, it is evaluated so d becomes
+        # a regular ndarray and gets printed.
+
+        d[:] = a*b*c
+        # Here, the automatically combined "ufunc" that computes
+        # a*b*c effectively gets an out= parameter, so no temporary
+        # arrays are needed whatsoever.
+
+        e = a+b+c*d
+        # Now e is a 'deferred array,' a, b, c, and d are marked READONLY
+
+        d[:] = a
+        # d was marked readonly, but the assignment could see that
+        # this was due to it being a deferred expression operand.
+        # This triggered the deferred evaluation so it could assign
+        # the value of a to d.
+
+There may be some surprising behavior, though.::
+
+    with np.deferredstate(True):
+    
+        d = a + b + c
+        # d is deferred
+
+        e[:] = d
+        f[:] = d
+        g[:] = d
+        # d is still deferred, and its deferred expression
+        # was evaluated three times, once for each assignment.
+        # This could be detected, with d being converted to
+        # a regular ndarray the second time it is evaluated.
+
+I believe the usage that should be recommended in the documentation
+is to leave the deferred state at its default, except when
+evaluating a large expression that can benefit from it.::
+
+    # calculations
+
+    with np.deferredstate(True):
+        x = <big expression>
+
+    # more calculations
+
+This will avoid surprises which would be cause by always keeping
+deferred usage True, like floating point warnings or exceptions
+at surprising times when deferred expression are used later.
+User questions like "Why does my print statement throw a
+divide by zero error?" can hopefully be avoided by recommending
+this approach.
+
+********************************
+Proposed Deferred Evaluation API
+********************************
+
+For deferred evaluation to work, the C API needs to be aware of its
+existence, and be able to trigger evaluation when necessary.  The
+ndarray would gain two new flag.
+
+    ``NPY_ISDEFERRED``
+
+        Indicates the expression evaluation for this ndarray instance
+        has been deferred.
+
+    ``NPY_DEFERRED_WASWRITEABLE``
+
+        Can only be set when ``PyArray_GetDeferredUsageCount(arr) > 0``.
+        It indicates that when ``arr`` was first used in a deferred
+        expression, it was a writeable array.  If this flag is set,
+        calling ``PyArray_CalculateAllDeferred()`` will make ``arr``
+        writeable again.
+
+.. note:: QUESTION
+
+    Should NPY_DEFERRED and NPY_DEFERRED_WASWRITEABLE be visible
+    to Python, or should accessing the flags from python trigger
+    PyArray_CalculateAllDeferred if necessary?
+
+The API would be expanded with a number of functions.
+
+``int PyArray_CalculateAllDeferred()``
+
+    This function forces all currently deferred calculations to occur.
+    
+    For example, if the error state is set to ignore all, and
+    np.seterr({all='raise'}), this would change what happens
+    to already deferred expressions.  Thus, all the existing
+    deferred arrays should be evaluated before changing the
+    error state.
+
+``int PyArray_CalculateDeferred(PyArrayObject* arr)``
+
+    If 'arr' is a deferred array, allocates memory for it and
+    evaluates the deferred expression.  If 'arr' is not a deferred
+    array, simply returns success.  Returns NPY_SUCCESS or NPY_FAILURE.
+
+``int PyArray_CalculateDeferredAssignment(PyArrayObject* arr, PyArrayObject* out)``
+
+    If 'arr' is a deferred array, evaluates the deferred expression
+    into 'out', and 'arr' remains a deferred array.  If 'arr' is not
+    a deferred array, copies its value into out.  Returns NPY_SUCCESS
+    or NPY_FAILURE.
+
+``int PyArray_GetDeferredUsageCount(PyArrayObject* arr)``
+
+    Returns a count of how many deferred expressions use this array
+    as an operand.
+
+The Python API would be expanded as follows.
+ 
+ ``numpy.setdeferred(state)``
+
+    Enables or disables deferred evaluation. True means to always
+    use deferred evaluation.  False means to never use deferred
+    evaluation.  None means to use deferred evaluation if the error
+    handling state is set to ignore everything.  At NumPy initialization,
+    the deferred state is None.
+
+    Returns the previous deferred state.
+
+``numpy.getdeferred()``
+
+    Returns the current deferred state.
+
+``numpy.deferredstate(state)``
+
+    A context manager for deferred state handling, similar to
+    n``umpy.errstate``.
+
+
+Error Handling
+==============
+
+Error handling is a thorny issue for deferred evaluation.  If the
+NumPy error state is {all='ignore'}, it might be reasonable to
+introduce deferred evaluation as the default, however if a UFunc
+can raise an error, it would be very strange for the later 'print'
+statement to throw the exception instead of the actual operation which
+caused the error.
+
+What may be a good approach is to by default enable deferred evaluation
+only when the error state is set to ignore all, but allow user control with
+'setdeferred' and 'getdeferred' functions.  True would mean always
+use deferred evaluation, False would mean never use it, and None would
+mean use it only when safe (i.e. the error state is set to ignore all).
+
+Interaction With UPDATEIFCOPY
+=============================
+
+The ``NPY_UPDATEIFCOPY`` documentation states:
+
+    The data area represents a (well-behaved) copy whose information
+    should be transferred back to the original when this array is deleted.
+
+    This is a special flag that is set if this array represents a copy
+    made because a user required certain flags in PyArray_FromAny and a
+    copy had to be made of some other array (and the user asked for this
+    flag to be set in such a situation). The base attribute then points
+    to the “misbehaved” array (which is set read_only). When the array
+    with this flag set is deallocated, it will copy its contents back to
+    the “misbehaved” array (casting if necessary) and will reset the
+    “misbehaved” array to NPY_WRITEABLE. If the “misbehaved” array was
+    not NPY_WRITEABLE to begin with then PyArray_FromAny would have
+    returned an error because NPY_UPDATEIFCOPY would not have been possible.
+
+The current implementation of UPDATEIFCOPY assumes that it is the only
+mechanism mucking with the writeable flag in this manner.  These mechanisms
+must be aware of each other to work correctly.  Here's an example of how
+they might go wrong:
+
+1. Make a temporary copy of 'arr' with UPDATEIFCOPY ('arr' becomes read only)
+2. Use 'arr' in a deferred expression (deferred usage count becomes one,
+   NPY_DEFERRED_WASWRITEABLE is **not** set, since 'arr' is read only)
+3. Destroy the temporarry copy, causing 'arr' to become writeable
+4. Writing to 'arr' destroys the value of the deferred expression
+
+To deal with this issue, we make these two states mutually exclusive.
+
+* Usage of UPDATEIFCOPY checks the ``NPY_DEFERRED_WASWRITEABLE`` flag,
+  and if it's set, calls ``PyArray_CalculateAllDeferred`` to flush
+  all deferred calculation before proceeding.
+* The ndarray gets a new flag ``NPY_UPDATEIFCOPY_TARGET`` indicating
+  the array will be updated and made writeable at some point in the
+  future.  If the deferred evaluation mechanism sees this flag in
+  any operand, it triggers immediate evaluation.
+
+Other Implementation Details
+============================
+
+When a deferred array is created, it gets references to all the
+operands of the UFunc, along with the UFunc itself.  The 
+'DeferredUsageCount' is incremented for each operand, and later
+gets decremented when the deferred expression is calculated or
+the deferred array is destroyed.
+
+A global list of weak references to all the deferred arrays
+is tracked, in order of creation.  When ``PyArray_CalculateAllDeferred``
+gets called, the newest deferred array is calculated first.
+This may release references to other deferred arrays contained
+in the deferred expression tree, which then
+never have to be calculated.
+
+Further Optimization
+====================
+
+Instead of conservatively disabling deferred evaluation when any
+errors are not set to 'ignore', each UFunc could give a set
+of possible errors it generates.  Then, if all those errors
+are set to 'ignore', deferred evaluation could be used even
+if other errors are not set to ignore.
+
+Once the expression tree is explicitly stored, it is possible to
+do transformations on it.  For example add(add(a,b),c) could
+be transformed into add3(a,b,c), or add(multiply(a,b),c) could
+become fma(a,b,c) using the CPU fused multiply-add instruction
+where available.
+
+While I've framed deferred evaluation as just for UFuncs, it could
+be extended to other functions, such as dot().  For example, chained
+matrix multiplications could be reordered to minimize the size
+of intermediates, or peep-hole style optimizer passes could search
+for patterns that match optimized BLAS/other high performance
+library calls.
+
+For operations on really large arrays, integrating a JIT like LLVM into
+this system might be a big benefit.  The UFuncs and other operations
+would provide bitcode, which could be inlined together and optimized
+by the LLVM optimizers, then executed.  In fact, the iterator itself
+could also be represented in bitcode, allowing LLVM to consider
+the entire iteration while doing its optimization.