.. currentmodule:: numpy

.. _numpy-internals:

*************************************
Internal organization of NumPy arrays
*************************************

It helps to understand a bit about how NumPy arrays are handled under the
covers to better understand NumPy itself. This section will not go into great
detail. Those wishing to understand the full details are referred to Travis
Oliphant's book `Guide to NumPy <http://web.mit.edu/dvp/Public/numpybook.pdf>`_.

NumPy arrays consist of two major components: the raw array data (from now on,
referred to as the data buffer), and the information about the raw array data.
The data buffer is typically what people think of as arrays in C or Fortran,
a :term:`contiguous` (and fixed) block of memory containing fixed-sized data
items. Each NumPy array also contains a significant set of data that describes
how to interpret the data in the data buffer. This extra information includes
(among other things):

 1) The basic data element's size in bytes.
 2) The start of the data within the data buffer (an offset relative to the
    beginning of the data buffer).
 3) The number of :term:`dimensions <dimension>` and the size of each dimension.
 4) The separation between elements for each dimension (the :term:`stride`).
    This does not have to be a multiple of the element size.
 5) The byte order of the data (which may not be the native byte order).
 6) Whether the buffer is read-only.
 7) Information (via the :class:`dtype` object) about the interpretation of the
    basic data element. The basic data element may be as simple as an int or a
    float, or it may be a compound object (e.g.,
    :term:`struct-like <structured data type>`), a fixed character field,
    or Python object pointers.
 8) Whether the array is to be interpreted as :term:`C-order <C order>`
    or :term:`Fortran-order <Fortran order>`.
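
For example, much of this metadata can be inspected directly from Python (a
small illustrative session; the attributes shown are part of the public
:class:`ndarray` interface)::

    >>> import numpy as np
    >>> a = np.arange(12, dtype=np.int32).reshape(3, 4)
    >>> a.itemsize                 # size of one element in bytes
    4
    >>> a.shape, a.ndim            # size of each dimension and their number
    ((3, 4), 2)
    >>> a.strides                  # bytes to step along each dimension
    (16, 4)
    >>> a.dtype                    # how each element is interpreted
    dtype('int32')
    >>> a.flags['WRITEABLE'], a.flags['C_CONTIGUOUS']
    (True, True)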

This arrangement allows for very flexible use of arrays. One thing that it
allows is simple changes to the metadata to change the interpretation of the
array buffer. Changing the byte order of the array is a simple change
involving no rearrangement of the data. The :term:`shape` of the array can be
changed very easily without changing anything in the data buffer and without
copying any data at all.
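
Both kinds of change can be confirmed to involve no copying at all, for
instance with :func:`shares_memory` (a minimal sketch)::

    >>> import numpy as np
    >>> a = np.arange(6, dtype='<i4')
    >>> swapped = a.view(a.dtype.newbyteorder())   # reinterpret the byte order
    >>> reshaped = a.reshape(2, 3)                 # reinterpret the shape
    >>> np.shares_memory(a, swapped), np.shares_memory(a, reshaped)
    (True, True)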

Among other things that this arrangement makes possible is the creation of a
new array metadata object that uses the same data buffer: a new :term:`view`
of that data buffer that has a different interpretation of the buffer (e.g.,
different shape, offset, byte order, strides, etc.) but shares the same data
bytes. Many operations in NumPy do just this, such as
:term:`slicing <python:slice>`. Other operations, such as transpose, don't
move data elements around in the array, but rather change the information
about the shape and strides so that the indexing of the array changes, but the
data in the array doesn't move.
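
For example, a slice is just such a view, and the transpose shares the same
buffer as well (a small sketch)::

    >>> import numpy as np
    >>> a = np.arange(6).reshape(2, 3)
    >>> v = a[0]            # slicing returns a view of the first row
    >>> v[0] = 99           # modifying the view ...
    >>> a                   # ... is visible through the original array
    array([[99,  1,  2],
           [ 3,  4,  5]])
    >>> np.shares_memory(a, a.T)   # the transpose reuses the same buffer too
    True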

Typically these new versions of the array metadata, built on the same data
buffer, are new views into the data buffer. There is a different
:class:`ndarray` object, but it uses the same data buffer. This is why it is
necessary to force copies through the use of the :func:`copy` method if one
really wants to make a new and independent copy of the data buffer.
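
For example (a minimal sketch)::

    >>> import numpy as np
    >>> a = np.arange(5)
    >>> v = a[1:4]          # a view: same data buffer
    >>> c = a[1:4].copy()   # an independent copy of those elements
    >>> np.shares_memory(a, v), np.shares_memory(a, c)
    (True, False)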

New views into arrays mean that the reference count of the object owning the
data buffer increases. Simply deleting the original array object will not
remove the data buffer if other views of it still exist.
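
This can be seen through the :attr:`ndarray.base` attribute of a view (a small
sketch)::

    >>> import numpy as np
    >>> a = np.arange(5)
    >>> v = a[::2]          # the view keeps a reference to the original array
    >>> v.base is a
    True
    >>> del a               # the buffer lives on as long as the view does
    >>> v
    array([0, 2, 4])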

Multidimensional array indexing order issues
============================================

.. seealso:: :ref:`basics.indexing`

What is the right way to index multi-dimensional arrays? Before you jump to
conclusions about the one true way to index multi-dimensional arrays, it pays
to understand why this is a confusing issue. This section will try to explain
in detail how NumPy indexing works and why we adopt the convention we do for
images, and when it may be appropriate to adopt other conventions.

The first thing to understand is that there are two conflicting conventions
for indexing 2-dimensional arrays. Matrix notation uses the first index to
indicate which row is being selected and the second index to indicate which
column is selected. This is opposite to the geometrically oriented convention
for images, where people generally think of the first index as representing
the x position (i.e., column) and the second as representing the y position
(i.e., row). This alone is the source of much confusion; matrix-oriented users
and image-oriented users expect two different things with regard to indexing.
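
As a concrete illustration of the two expectations (an informal sketch)::

    >>> import numpy as np
    >>> a = np.arange(12).reshape(3, 4)
    >>> print(a[1, 2])   # matrix convention: row 1, column 2
    6
    >>> print(a[2, 1])   # the element an image user thinking (x=1, y=2) wants
    9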

The second issue to understand is how indices correspond to the order in
which the array is stored in memory. In Fortran, the first index is the most
rapidly varying index when moving through the elements of a two-dimensional
array as it is stored in memory. If you adopt the matrix convention for
indexing, then this means the matrix is stored one column at a time (since the
first index moves to the next row as it changes). Thus Fortran is considered a
column-major language. C has just the opposite convention. In C, the last
index changes most rapidly as one moves through the array as stored in memory.
Thus C is a row-major language; the matrix is stored by rows. Note that in
both cases this presumes that the matrix convention for indexing is being
used, i.e., for both Fortran and C, the first index is the row. Note also that
this convention implies that the indexing convention is invariant and that the
data order changes to keep that so.
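
In NumPy this difference shows up in the strides of arrays created with the
``'C'`` and ``'F'`` order options (a small sketch; :func:`zeros` produces
8-byte floats by default)::

    >>> import numpy as np
    >>> c = np.zeros((3, 4), order='C')   # row-major: rows are contiguous
    >>> f = np.zeros((3, 4), order='F')   # column-major: columns are contiguous
    >>> c.strides
    (32, 8)
    >>> f.strides
    (8, 24)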

But that's not the only way to look at it. Suppose one has large
two-dimensional arrays (images or matrices) stored in data files. Suppose the
data are stored by rows rather than by columns. If we are to preserve our
index convention (whether matrix or image), that means that, depending on the
language we use, we may be forced to reorder the data when reading it into
memory. For example, if we read row-ordered data into memory without
reordering, it will match the matrix indexing convention for C, but not for
Fortran. Conversely, it will match the image indexing convention for Fortran,
but not for C. For C, if one is using data stored in row order and one wants
to preserve the image index convention, the data must be reordered when read
into memory.

In the end, what you do for Fortran or C depends on
which is more important, not reordering data or preserving the indexing
convention. For large images, reordering data is potentially expensive, and
often the indexing convention is inverted to avoid that.

The situation with
NumPy makes this issue yet more complicated. The internal machinery of NumPy
arrays is flexible enough to accept any ordering of indices. One can simply
reorder indices by manipulating the internal :term:`stride` information for
arrays without reordering the data at all. NumPy will know how to map the new
index order to the data without moving the data.
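
For example, transposing an array only rewrites its stride information; the
bytes themselves stay where they are (a small sketch, assuming the default
8-byte integer dtype)::

    >>> import numpy as np
    >>> a = np.arange(6).reshape(2, 3)   # default (C-order) layout
    >>> a.strides
    (24, 8)
    >>> a.T.strides                      # transposing just swaps the strides
    (8, 24)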

So if this is true, why not choose
the index order that matches what you most expect? In particular, why not define
row-ordered images to use the image convention? (This is sometimes referred
to as the Fortran convention vs the C convention, thus the 'C' and 'FORTRAN'
order options for array ordering in NumPy.) The drawback of doing this is
potential performance penalties. It's common to access the data sequentially,
either implicitly in array operations or explicitly by looping over rows of an
image. When that is done, the data will be accessed in a non-optimal order.
As the first index is incremented, what is actually happening is that elements
spaced far apart in memory are being sequentially accessed, with usually poor
memory access speeds. For example, consider a two-dimensional image ``im``
defined so that ``im[0, 10]`` represents the value at ``x = 0``, ``y = 10``.
To be consistent with usual Python behavior, ``im[0]`` would then represent a
column at ``x = 0``. Yet that data would be spread over the whole array since
the data are stored in row order. Despite the flexibility of NumPy's indexing,
it can't really paper over the fact that basic operations are rendered
inefficient because of data order or that getting contiguous subarrays is
still awkward (e.g., ``im[:, 0]`` for the first row, vs ``im[0]``). Thus one
can't use an idiom such as ``for row in im``; ``for col in im`` does work, but
doesn't yield contiguous column data.
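
The contiguity issue is easy to see with the default row-major layout: a slice
along the first index is contiguous in memory, while a slice along the last is
not (a small sketch)::

    >>> import numpy as np
    >>> im = np.arange(12).reshape(3, 4)    # stored in row (C) order
    >>> im[0].flags['C_CONTIGUOUS']         # one row: contiguous bytes
    True
    >>> im[:, 0].flags['C_CONTIGUOUS']      # one column: strided access
    False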

As it turns out, NumPy is smart enough when dealing with
:ref:`ufuncs <ufuncs-internals>` to determine which index is the most rapidly
varying one in memory and to use that for the innermost loop. Thus for ufuncs,
there is no large intrinsic advantage to either approach in most cases. On the
other hand, use of :attr:`ndarray.flat` with a Fortran-ordered array will lead
to non-optimal memory access, as adjacent elements in the flattened array
(iterator, actually) are not contiguous in memory.
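
For example, :attr:`ndarray.flat` always iterates in C index order, so on a
Fortran-ordered array its successive elements are not adjacent in memory (a
small sketch)::

    >>> import numpy as np
    >>> f = np.array([[1, 2, 3], [4, 5, 6]], order='F')
    >>> f.flat[:]            # iteration (index) order: row by row
    array([1, 2, 3, 4, 5, 6])
    >>> f.ravel(order='K')   # memory order: column by column
    array([1, 4, 2, 5, 3, 6])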

Indeed, the fact is that Python indexing on lists and other sequences
naturally leads to an outside-to-inside ordering (the first index gets the
largest grouping, the next index the next largest, and the last index gets the
smallest element). Since image data are normally stored in rows, this
corresponds to the position within rows being the last item indexed.
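
The same outside-to-inside pattern is visible with plain nested lists (a small
sketch)::

    >>> rows = [[0, 1, 2],
    ...         [3, 4, 5]]
    >>> rows[1]        # the first index selects the largest grouping: a row
    [3, 4, 5]
    >>> rows[1][2]     # the last index selects a single element within it
    5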

If you do want to use Fortran ordering, realize that there are two approaches
to consider: 1) accept that the first index is just not the most rapidly
changing in memory and have all your I/O routines reorder your data when going
from memory to disk or vice versa, or 2) use NumPy's mechanism for mapping the
first index to the most rapidly varying data. We recommend the former if
possible. The disadvantage of the latter is that many of NumPy's functions
will yield arrays without Fortran ordering unless you are careful to use the
``order`` keyword. Doing this would be highly inconvenient.
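
For example, array-creation functions default to C order, so Fortran ordering
has to be requested explicitly every time (a small sketch)::

    >>> import numpy as np
    >>> np.ones((3, 4)).flags['F_CONTIGUOUS']             # default: C order
    False
    >>> np.ones((3, 4), order='F').flags['F_CONTIGUOUS']  # must be asked for
    True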

Otherwise, we recommend simply learning to reverse the usual order of indices
when accessing elements of an array. Granted, it goes against the grain, but
it is more in line with Python semantics and the natural order of the data.