summaryrefslogtreecommitdiff
path: root/doc/source/dev/internals.code-explanations.rst
blob: eaa629523c176a78dc409e05c73bba3089a0e74d (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
.. currentmodule:: numpy

.. _c-code-explanations:

*************************
NumPy C code explanations
*************************

    Fanaticism consists of redoubling your efforts when you have forgotten
    your aim.
    --- *George Santayana*

    An authority is a person who can tell you more about something than
    you really care to know.
    --- *Unknown*

This page attempts to explain the logic behind some of the new
pieces of code. The purpose behind these explanations is to enable
somebody to be able to understand the ideas behind the implementation
somewhat more easily than just staring at the code. Perhaps in this
way, the algorithms can be improved on, borrowed from, and/or
optimized by more people.


Memory model
============

.. index::
   pair: ndarray; memory model

One fundamental aspect of the :class:`ndarray` is that an array is seen as a
"chunk" of memory starting at some location. The interpretation of
this memory depends on the :term:`stride` information. For each dimension in
an :math:`N`-dimensional array, an integer (:term:`stride`) dictates how many
bytes must be skipped to get to the next element in that dimension.
Unless you have a single-segment array, this :term:`stride` information must
be consulted when traversing through an array. It is not difficult to
write code that accepts strides, you just have to use ``char*``
pointers because strides are in units of bytes. Keep in mind also that
strides do not have to be unit-multiples of the element size. Also,
remember that if the number of dimensions of the array is 0 (sometimes
called a ``rank-0`` array), then the :term:`strides <stride>` and
:term:`dimensions <dimension>` variables are ``NULL``.

Besides the structural information contained in the strides and
dimensions members of the :c:type:`PyArrayObject`, the flags contain
important information about how the data may be accessed. In particular,
the :c:data:`NPY_ARRAY_ALIGNED` flag is set when the memory is on a
suitable boundary according to the datatype array. Even if you have
a :term:`contiguous` chunk of memory, you cannot just assume it is safe to
dereference a datatype-specific pointer to an element. Only if the
:c:data:`NPY_ARRAY_ALIGNED` flag is set, this is a safe operation. On
some platforms it will work but on others, like Solaris, it will cause
a bus error. The :c:data:`NPY_ARRAY_WRITEABLE` should also be ensured
if you plan on writing to the memory area of the array. It is also
possible to obtain a pointer to an unwritable memory area. Sometimes,
writing to the memory area when the :c:data:`NPY_ARRAY_WRITEABLE` flag is not
set will just be rude. Other times it can cause program crashes (*e.g.*
a data-area that is a read-only memory-mapped file).


Data-type encapsulation
=======================

.. seealso:: :ref:`arrays.dtypes`

.. index::
   single: dtype

The :ref:`datatype <arrays.dtypes>` is an important abstraction of the
:class:`ndarray`. Operations
will look to the datatype to provide the key functionality that is
needed to operate on the array. This functionality is provided in the
list of function pointers pointed to by the ``f`` member of the
:c:type:`PyArray_Descr` structure. In this way, the number of datatypes can be
extended simply by providing a :c:type:`PyArray_Descr` structure with suitable
function pointers in the ``f`` member. For built-in types, there are some
optimizations that bypass this mechanism, but the point of the datatype
abstraction is to allow new datatypes to be added.

One of the built-in datatypes, the :class:`void` datatype allows for
arbitrary :term:`structured types <structured data type>` containing 1 or more
fields as elements of the array. A :term:`field` is simply another datatype
object along with an offset into the current structured type. In order to
support arbitrarily nested fields, several recursive implementations of
datatype access are implemented for the void type. A common idiom is to cycle
through the elements of the dictionary and perform a specific operation based on
the datatype object stored at the given offset. These offsets can be
arbitrary numbers. Therefore, the possibility of encountering misaligned
data must be recognized and taken into account if necessary.


N-D Iterators
=============

.. seealso:: :ref:`arrays.nditer`

.. index::
   single: array iterator

A very common operation in much of NumPy code is the need to iterate
over all the elements of a general, strided, N-dimensional array. This
operation of a general-purpose N-dimensional loop is abstracted in the
notion of an iterator object. To write an N-dimensional loop, you only
have to create an iterator object from an ndarray, work with the
:c:member:`dataptr <PyArrayIterObject.dataptr>` member of the iterator object
structure and call the macro :c:func:`PyArray_ITER_NEXT` on the iterator
object to move to the next element. The ``next`` element is always in
C-contiguous order. The macro works by first special-casing the C-contiguous,
1-D, and 2-D cases which work very simply.

For the general case, the iteration works by keeping track of a list
of coordinate counters in the iterator object. At each iteration, the
last coordinate counter is increased (starting from 0). If this
counter is smaller than one less than the size of the array in that
dimension (a pre-computed and stored value), then the counter is
increased and the :c:member:`dataptr <PyArrayIterObject.dataptr>` member is
increased by the strides in that
dimension and the macro ends. If the end of a dimension is reached,
the counter for the last dimension is reset to zero and the
:c:member:`dataptr <PyArrayIterObject.dataptr>` is
moved back to the beginning of that dimension by subtracting the
strides value times one less than the number of elements in that
dimension (this is also pre-computed and stored in the
:c:member:`backstrides <PyArrayIterObject.backstrides>`
member of the iterator object). In this case, the macro does not end,
but a local dimension counter is decremented so that the next-to-last
dimension replaces the role that the last dimension played and the
previously-described tests are executed again on the next-to-last
dimension. In this way, the :c:member:`dataptr <PyArrayIterObject.dataptr>`
is adjusted appropriately for arbitrary striding.

The :c:member:`coordinates <PyArrayIterObject.coordinates>` member of the
:c:type:`PyArrayIterObject` structure maintains
the current N-d counter unless the underlying array is C-contiguous in
which case the coordinate counting is bypassed. The
:c:member:`index <PyArrayIterObject.index>` member of
the :c:type:`PyArrayIterObject` keeps track of the current flat index of the
iterator. It is updated by the :c:func:`PyArray_ITER_NEXT` macro.


Broadcasting
============

.. seealso:: :ref:`basics.broadcasting`

.. index::
   single: broadcasting

In Numeric, the ancestor of NumPy, broadcasting was implemented in several
lines of code buried deep in ``ufuncobject.c``. In NumPy, the notion of
broadcasting has been abstracted so that it can be performed in multiple places.
Broadcasting is handled by the function :c:func:`PyArray_Broadcast`. This
function requires a :c:type:`PyArrayMultiIterObject` (or something that is a
binary equivalent) to be passed in. The :c:type:`PyArrayMultiIterObject` keeps
track of the broadcast number of dimensions and size in each
dimension along with the total size of the broadcast result. It also
keeps track of the number of arrays being broadcast and a pointer to
an iterator for each of the arrays being broadcast.

The :c:func:`PyArray_Broadcast` function takes the iterators that have already
been defined and uses them to determine the broadcast shape in each
dimension (to create the iterators at the same time that broadcasting
occurs then use the :c:func:`PyArray_MultiIterNew` function).
Then, the iterators are
adjusted so that each iterator thinks it is iterating over an array
with the broadcast size. This is done by adjusting the iterators
number of dimensions, and the :term:`shape` in each dimension. This works
because the iterator strides are also adjusted. Broadcasting only
adjusts (or adds) length-1 dimensions. For these dimensions, the
strides variable is simply set to 0 so that the data-pointer for the
iterator over that array doesn't move as the broadcasting operation
operates over the extended dimension.

Broadcasting was always implemented in Numeric using 0-valued strides
for the extended dimensions. It is done in exactly the same way in
NumPy. The big difference is that now the array of strides is kept
track of in a :c:type:`PyArrayIterObject`, the iterators involved in a
broadcast result are kept track of in a :c:type:`PyArrayMultiIterObject`,
and the :c:func:`PyArray_Broadcast` call implements the
:ref:`general-broadcasting-rules`.


Array Scalars
=============

.. seealso:: :ref:`arrays.scalars`

.. index::
   single: array scalars

The array scalars offer a hierarchy of Python types that allow a one-to-one
correspondence between the datatype stored in an array and the
Python-type that is returned when an element is extracted from the
array. An exception to this rule was made with object arrays. Object
arrays are heterogeneous collections of arbitrary Python objects. When
you select an item from an object array, you get back the original
Python object (and not an object array scalar which does exist but is
rarely used for practical purposes).

The array scalars also offer the same methods and attributes as arrays
with the intent that the same code can be used to support arbitrary
dimensions (including 0-dimensions). The array scalars are read-only
(immutable) with the exception of the void scalar which can also be
written to so that structured array field setting works more naturally
(``a[0]['f1'] = value``).


Indexing
========

.. seealso:: :ref:`basics.indexing`, :ref:`arrays.indexing`

.. index::
   single: indexing

All Python indexing operations ``arr[index]`` are organized by first preparing
the index and finding the index type. The supported index types are:

* integer
* :const:`newaxis`
* :term:`python:slice`
* :py:data:`Ellipsis`
* integer arrays/array-likes (advanced)
* boolean (single boolean array); if there is more than one boolean array as
  the index or the shape does not match exactly, the boolean array will be
  converted to an integer array instead.
* 0-d boolean (and also integer); 0-d boolean arrays are a special
  case that has to be handled in the advanced indexing code. They signal
  that a 0-d boolean array had to be interpreted as an integer array.

As well as the scalar array special case signaling that an integer array
was interpreted as an integer index, which is important because an integer
array index forces a copy but is ignored if a scalar is returned (full integer
index). The prepared index is guaranteed to be valid with the exception of
out of bound values and broadcasting errors for advanced indexing. This
includes that an :py:data:`Ellipsis` is added for incomplete indices for
example when a two-dimensional array is indexed with a single integer.

The next step depends on the type of index which was found. If all
dimensions are indexed with an integer a scalar is returned or set. A
single boolean indexing array will call specialized boolean functions.
Indices containing an :py:data:`Ellipsis` or :term:`python:slice` but no
advanced indexing will always create a view into the old array by calculating
the new strides and memory offset.  This view can then either be returned or,
for assignments, filled using ``PyArray_CopyObject``. Note that
``PyArray_CopyObject`` may also be called on temporary arrays in other branches
to support complicated assignments when the array is of object :class:`dtype`.

Advanced indexing
-----------------

By far the most complex case is advanced indexing, which may or may not be
combined with typical view-based indexing. Here integer indices are
interpreted as view-based. Before trying to understand this, you may want
to make yourself familiar with its subtleties. The advanced indexing code
has three different branches and one special case:

* There is one indexing array and it, as well as the assignment array, can
  be iterated trivially. For example, they may be contiguous. Also, the
  indexing array must be of :class:`intp` type and the value array in
  assignments should be of the correct type. This is purely a fast path.
* There are only integer array indices so that no subarray exists.
* View-based and advanced indexing is mixed. In this case, the view-based
  indexing defines a collection of subarrays that are combined by the
  advanced indexing. For example, ``arr[[1, 2, 3], :]`` is created by
  vertically stacking the subarrays ``arr[1, :]``, ``arr[2, :]``, and
  ``arr[3, :]``.
* There is a subarray but it has exactly one element. This case can be handled
  as if there is no subarray but needs some care during setup.

Deciding what case applies, checking broadcasting, and determining the kind
of transposition needed are all done in :c:func:`PyArray_MapIterNew`. After
setting up, there are two cases. If there is no subarray or it only has one
element, no subarray iteration is necessary and an iterator is prepared
which iterates all indexing arrays *as well as* the result or value array.
If there is a subarray, there are three iterators prepared. One for the
indexing arrays, one for the result or value array (minus its subarray),
and one for the subarrays of the original and the result/assignment array.
The first two iterators give (or allow calculation) of the pointers into
the start of the subarray, which then allows restarting the subarray
iteration.

When advanced indices are next to each other transposing may be necessary.
All necessary transposing is handled by :c:func:`PyArray_MapIterSwapAxes` and
has to be handled by the caller unless :c:func:`PyArray_MapIterNew` is asked to
allocate the result.

After preparation, getting and setting are relatively straightforward,
although the different modes of iteration need to be considered. Unless
there is only a single indexing array during item getting, the validity of
the indices is checked beforehand. Otherwise, it is handled in the inner
loop itself for optimization.

.. _ufuncs-internals:

Universal functions
===================

.. seealso:: :ref:`ufuncs`, :ref:`ufuncs-basics`

.. index::
   single: ufunc

Universal functions are callable objects that take :math:`N` inputs
and produce :math:`M` outputs by wrapping basic 1-D loops that work
element-by-element into full easy-to-use functions that seamlessly
implement :ref:`broadcasting <basics.broadcasting>`,
:ref:`type-checking <ufuncs.casting>`,
:ref:`buffered coercion <use-of-internal-buffers>`, and
:ref:`output-argument handling <ufuncs-output-type>`. New universal functions
are normally created in C, although there is a mechanism for creating ufuncs
from Python functions (:func:`frompyfunc`). The user must supply a 1-D loop that
implements the basic function taking the input scalar values and
placing the resulting scalars into the appropriate output slots as
explained in implementation.


Setup
-----

Every :class:`ufunc` calculation involves some overhead related to setting up
the calculation. The practical significance of this overhead is that
even though the actual calculation of the ufunc is very fast, you will
be able to write array and type-specific code that will work faster
for small arrays than the ufunc. In particular, using ufuncs to
perform many calculations on 0-D arrays will be slower than other
Python-based solutions (the silently-imported ``scalarmath`` module exists
precisely to give array scalars the look-and-feel of ufunc based
calculations with significantly reduced overhead).

When a :class:`ufunc` is called, many things must be done. The information
collected from these setup operations is stored in a loop object. This
loop object is a C-structure (that could become a Python object but is
not initialized as such because it is only used internally). This loop
object has the layout needed to be used with :c:func:`PyArray_Broadcast`
so that the broadcasting can be handled in the same way as it is handled in
other sections of code.

The first thing done is to look up in the thread-specific global
dictionary the current values for the buffer-size, the error mask, and
the associated error object. The state of the error mask controls what
happens when an error condition is found. It should be noted that
checking of the hardware error flags is only performed after each 1-D
loop is executed. This means that if the input and output arrays are
contiguous and of the correct type so that a single 1-D loop is
performed, then the flags may not be checked until all elements of the
array have been calculated. Looking up these values in a thread-specific
dictionary takes time which is easily ignored for all but
very small arrays.

After checking, the thread-specific global variables, the inputs are
evaluated to determine how the ufunc should proceed and the input and
output arrays are constructed if necessary. Any inputs which are not
arrays are converted to arrays (using context if necessary). Which of
the inputs are scalars (and therefore converted to 0-D arrays) is
noted.

Next, an appropriate 1-D loop is selected from the 1-D loops available
to the :class:`ufunc` based on the input array types. This 1-D loop is selected
by trying to match the signature of the datatypes of the inputs
against the available signatures. The signatures corresponding to
built-in types are stored in the :attr:`ufunc.types` member of the ufunc
structure. The signatures corresponding to user-defined types are stored in a
linked list of function information with the head element stored as a
``CObject`` in the ``userloops`` dictionary keyed by the datatype number
(the first user-defined type in the argument list is used as the key).
The signatures are searched until a signature is found to which the
input arrays can all be cast safely (ignoring any scalar arguments
which are not allowed to determine the type of the result). The
implication of this search procedure is that "lesser types" should be
placed below "larger types" when the signatures are stored. If no 1-D
loop is found, then an error is reported. Otherwise, the ``argument_list``
is updated with the stored signature --- in case casting is necessary
and to fix the output types assumed by the 1-D loop.

If the ufunc has 2 inputs and 1 output and the second input is an
``Object`` array then a special-case check is performed so that
``NotImplemented`` is returned if the second input is not an ndarray, has
the :obj:`~numpy.class.__array_priority__` attribute, and has an ``__r{op}__``
special method. In this way, Python is signaled to give the other object a
chance to complete the operation instead of using generic object-array
calculations. This allows (for example) sparse matrices to override
the multiplication operator 1-D loop.

For input arrays that are smaller than the specified buffer size,
copies are made of all non-contiguous, misaligned, or out-of-byteorder
arrays to ensure that for small arrays, a single loop is
used. Then, array iterators are created for all the input arrays and
the resulting collection of iterators is broadcast to a single shape.

The output arguments (if any) are then processed and any missing
return arrays are constructed. If any provided output array doesn't
have the correct type (or is misaligned) and is smaller than the
buffer size, then a new output array is constructed with the special
:c:data:`NPY_ARRAY_WRITEBACKIFCOPY` flag set. At the end of the function,
:c:func:`PyArray_ResolveWritebackIfCopy` is called so that 
its contents will be copied back into the output array.
Iterators for the output arguments are then processed.

Finally, the decision is made about how to execute the looping
mechanism to ensure that all elements of the input arrays are combined
to produce the output arrays of the correct type. The options for loop
execution are one-loop (for :term`contiguous`, aligned, and correct data
type), strided-loop (for non-contiguous but still aligned and correct
data type), and a buffered loop (for misaligned or incorrect data
type situations). Depending on which execution method is called for,
the loop is then set up and computed.


Function call
-------------

This section describes how the basic universal function computation loop is
set up and executed for each of the three different kinds of execution. If
:c:data:`NPY_ALLOW_THREADS` is defined during compilation, then as long as
no object arrays are involved, the Python Global Interpreter Lock (GIL) is
released prior to calling the loops.  It is re-acquired if necessary to
handle error conditions. The hardware error flags are checked only after
the 1-D loop is completed.


One loop
~~~~~~~~

This is the simplest case of all. The ufunc is executed by calling the
underlying 1-D loop exactly once. This is possible only when we have
aligned data of the correct type (including byteorder) for both input
and output and all arrays have uniform strides (either :term:`contiguous`,
0-D, or 1-D). In this case, the 1-D computational loop is called once
to compute the calculation for the entire array. Note that the
hardware error flags are only checked after the entire calculation is
complete.


Strided loop
~~~~~~~~~~~~

When the input and output arrays are aligned and of the correct type,
but the striding is not uniform (non-contiguous and 2-D or larger),
then a second looping structure is employed for the calculation. This
approach converts all of the iterators for the input and output
arguments to iterate over all but the largest dimension. The inner
loop is then handled by the underlying 1-D computational loop. The
outer loop is a standard iterator loop on the converted iterators. The
hardware error flags are checked after each 1-D loop is completed.


Buffered loop
~~~~~~~~~~~~~

This is the code that handles the situation whenever the input and/or
output arrays are either misaligned or of the wrong datatype
(including being byteswapped) from what the underlying 1-D loop
expects. The arrays are also assumed to be non-contiguous. The code
works very much like the strided-loop except for the inner 1-D loop is
modified so that pre-processing is performed on the inputs and post-processing
is performed on the outputs in ``bufsize`` chunks (where
``bufsize`` is a user-settable parameter). The underlying 1-D
computational loop is called on data that is copied over (if it needs
to be). The setup code and the loop code is considerably more
complicated in this case because it has to handle:

- memory allocation of the temporary buffers

- deciding whether or not to use buffers on the input and output data
  (misaligned and/or wrong datatype)

- copying and possibly casting data for any inputs or outputs for which
  buffers are necessary.

- special-casing ``Object`` arrays so that reference counts are properly
  handled when copies and/or casts are necessary.

- breaking up the inner 1-D loop into ``bufsize`` chunks (with a possible
  remainder).

Again, the hardware error flags are checked at the end of each 1-D
loop.


Final output manipulation
-------------------------

Ufuncs allow other array-like classes to be passed seamlessly through
the interface in that inputs of a particular class will induce the
outputs to be of that same class. The mechanism by which this works is
the following. If any of the inputs are not ndarrays and define the
:obj:`~numpy.class.__array_wrap__` method, then the class with the largest
:obj:`~numpy.class.__array_priority__` attribute determines the type of all the
outputs (with the exception of any output arrays passed in). The
:obj:`~numpy.class.__array_wrap__` method of the input array will be called
with the ndarray being returned from the ufunc as its input. There are two
calling styles of the :obj:`~numpy.class.__array_wrap__` function supported.
The first takes the ndarray as the first argument and a tuple of "context" as
the second argument. The context is (ufunc, arguments, output argument
number). This is the first call tried. If a ``TypeError`` occurs, then the
function is called with just the ndarray as the first argument.


Methods
-------

There are three methods of ufuncs that require calculation similar to
the general-purpose ufuncs. These are :meth:`ufunc.reduce`,
:meth:`ufunc.accumulate`, and :meth:`ufunc.reduceat`. Each of these
methods requires a setup command followed by a
loop. There are four loop styles possible for the methods
corresponding to no-elements, one-element, strided-loop, and buffered-loop.
These are the same basic loop styles as implemented for the
general-purpose function call except for the no-element and one-element
cases which are special-cases occurring when the input array
objects have 0 and 1 elements respectively.


Setup
~~~~~

The setup function for all three methods is ``construct_reduce``.
This function creates a reducing loop object and fills it with the
parameters needed to complete the loop. All of the methods only work
on ufuncs that take 2-inputs and return 1 output. Therefore, the
underlying 1-D loop is selected assuming a signature of ``[otype,
otype, otype]`` where ``otype`` is the requested reduction
datatype. The buffer size and error handling are then retrieved from
(per-thread) global storage. For small arrays that are misaligned or
have incorrect datatype, a copy is made so that the un-buffered
section of code is used. Then, the looping strategy is selected. If
there is 1 element or 0 elements in the array, then a simple looping
method is selected. If the array is not misaligned and has the
correct datatype, then strided looping is selected. Otherwise,
buffered looping must be performed. Looping parameters are then
established, and the return array is constructed.  The output array is
of a different :term:`shape` depending on whether the method is
:meth:`reduce <ufunc.reduce>`, :meth:`accumulate <ufunc.accumulate>`, or
:meth:`reduceat <ufunc.reduceat>`. If an output array is already provided, then
its shape is checked. If the output array is not C-contiguous,
aligned, and of the correct data type, then a temporary copy is made
with the :c:data:`NPY_ARRAY_WRITEBACKIFCOPY` flag set. In this way, the methods
will be able to work with a well-behaved output array but the result will be
copied back into the true output array when
:c:func:`PyArray_ResolveWritebackIfCopy` is called at function completion.
Finally, iterators are set up to loop over the correct :term:`axis`
(depending on the value of axis provided to the method) and the setup
routine returns to the actual computation routine.


:meth:`Reduce <ufunc.reduce>` 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. index::
   triple: ufunc; methods; reduce

All of the ufunc methods use the same underlying 1-D computational
loops with input and output arguments adjusted so that the appropriate
reduction takes place. For example, the key to the functioning of
:meth:`reduce <ufunc.reduce>` is that the 1-D loop is called with the output
and the second input pointing to the same position in memory and both having
a step-size of 0. The first input is pointing to the input array with a
step-size given by the appropriate stride for the selected axis. In this
way, the operation performed is

.. math::
   :nowrap:

   \begin{align*}
   o & = & i[0] \\
   o & = & i[k]\textrm{<op>}o\quad k=1\ldots N
   \end{align*}

where :math:`N+1` is the number of elements in the input, :math:`i`,
:math:`o` is the output, and :math:`i[k]` is the
:math:`k^{\textrm{th}}` element of :math:`i` along the selected axis.
This basic operation is repeated for arrays with greater than 1
dimension so that the reduction takes place for every 1-D sub-array
along the selected axis. An iterator with the selected dimension
removed handles this looping.

For buffered loops, care must be taken to copy and cast data before
the loop function is called because the underlying loop expects
aligned data of the correct datatype (including byteorder). The
buffered loop must handle this copying and casting prior to calling
the loop function on chunks no greater than the user-specified
``bufsize``.


:meth:`Accumulate <ufunc.accumulate>`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. index::
   triple: ufunc; methods; accumulate

The :meth:`accumulate <ufunc.accumulate>` method is very similar to
the :meth:`reduce <ufunc.reduce>` method in that
the output and the second input both point to the output. The
difference is that the second input points to memory one stride behind
the current output pointer. Thus, the operation performed is

.. math::
   :nowrap:

   \begin{align*}
   o[0] & = & i[0] \\
   o[k] & = & i[k]\textrm{<op>}o[k-1]\quad k=1\ldots N.
   \end{align*}

The output has the same shape as the input and each 1-D loop operates
over :math:`N` elements when the shape in the selected axis is :math:`N+1`.
Again, buffered loops take care to copy and cast the data before
calling the underlying 1-D computational loop.


:meth:`Reduceat <ufunc.reduceat>`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. index::
   triple: ufunc; methods; reduceat
   single: ufunc

The :meth:`reduceat <ufunc.reduceat>` function is a generalization of both the
:meth:`reduce <ufunc.reduce>` and :meth:`accumulate <ufunc.accumulate>`
functions. It implements a :meth:`reduce <ufunc.reduce>` over ranges of
the input array specified by indices. The extra indices argument is checked to
be sure that every input is not too large for the input array along
the selected dimension before the loop calculations take place. The
loop implementation is handled using code that is very similar to the
:meth:`reduce <ufunc.reduce>` code repeated as many times as there are elements
in the indices input. In particular: the first input pointer passed to the
underlying 1-D computational loop points to the input array at the
correct location indicated by the index array. In addition, the output
pointer and the second input pointer passed to the underlying 1-D loop
point to the same position in memory. The size of the 1-D
computational loop is fixed to be the difference between the current
index and the next index (when the current index is the last index,
then the next index is assumed to be the length of the array along the
selected dimension). In this way, the 1-D loop will implement a
:meth:`reduce <ufunc.reduce>` over the specified indices.

Misaligned or a loop datatype that does not match the input and/or
output datatype is handled using buffered code wherein data is
copied to a temporary buffer and cast to the correct datatype if
necessary prior to calling the underlying 1-D function. The temporary
buffers are created in (element) sizes no bigger than the user
settable buffer-size value. Thus, the loop must be flexible enough to
call the underlying 1-D computational loop enough times to complete
the total calculation in chunks no bigger than the buffer-size.