Title: A Simple File Format for NumPy Arrays
Discussions-To: numpy-discussion@mail.scipy.org
Version: $Revision$
Last-Modified: $Date$
Author: Robert Kern <robert.kern@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 20-Dec-2007


Abstract

    We propose a standard binary file format (NPY) for persisting
    a single arbitrary NumPy array on disk.  The format stores all of
    the shape and dtype information necessary to reconstruct the array
    correctly even on another machine with a different architecture.
    The format is designed to be as simple as possible while achieving
    its limited goals.  The implementation is intended to be pure
    Python and distributed as part of the main numpy package.


Rationale

    A lightweight, omnipresent system for saving NumPy arrays to disk
    is a frequent need.  Python in general has pickle [1] for saving
    most Python objects to disk.  This often works well enough with
    NumPy arrays for many purposes, but it has a few drawbacks:

    - Dumping or loading a pickle file requires duplicating the data
      in memory.  For large arrays, this can be a showstopper.

    - The array data is not directly accessible through
      memory-mapping.  Now that numpy has that capability, it has
      proved very useful for loading large amounts of data (or more to
      the point: avoiding loading large amounts of data when you only
      need a small part).

    Both of these problems can be addressed by dumping the raw bytes
    to disk using ndarray.tofile() and numpy.fromfile().  However,
    these have their own problems:

    - The data which is written has no information about the shape or
      dtype of the array.

    - They are incapable of handling object arrays.

    The NPY file format is an evolutionary advance over these two
    approaches.  Its design is mostly limited to solving the problems
    with pickles and tofile()/fromfile().  It does not intend to solve
    more complicated problems for which more complicated formats like
    HDF5 [2] are a better solution.


Use Cases

    - Neville Newbie has just started to pick up Python and NumPy.  He
      has not installed many packages yet, nor learned the standard
      library, but he has been playing with NumPy at the interactive
      prompt to do small tasks.  He gets a result that he wants to
      save.

    - Annie Analyst has been using large nested record arrays to
      represent her statistical data.  She wants to convince her
      R-using colleague, David Doubter, that Python and NumPy are
      awesome by sending him her analysis code and data.  She needs
      the data to load at interactive speeds.  Since David does not
      usually use Python, having to install large packages would put
      him off.

    - Simon Seismologist is developing new seismic processing tools.
      One of his algorithms requires large amounts of intermediate
      data to be written to disk.  The data does not really fit into
      the industry-standard SEG-Y schema, but he already has a nice
      record-array dtype that he uses for it internally.

    - Polly Parallel wants to split up a computation on her multicore
      machine as simply as possible.  Parts of the computation can be
      split up among different processes without any communication
      between processes; they just need to fill in the appropriate
      portion of a large array with their results.  Having several
      child processes memory-mapping a common array is a good way to
      achieve this.


Requirements

    The format MUST be able to:

    - Represent all NumPy arrays including nested record
      arrays and object arrays.

    - Represent the data in its native binary form.

    - Be contained in a single file.

    - Support Fortran-contiguous arrays directly.

    - Store all of the necessary information to reconstruct the array
      including shape and dtype on a machine of a different
      architecture.  Both little-endian and big-endian arrays must be
      supported and a file with little-endian numbers will yield
      a little-endian array on any machine reading the file.  The
      types must be described in terms of their actual sizes.  For
      example, if a machine with a 64-bit C "long int" writes out an
      array with "long ints", a reading machine with 32-bit C "long
      ints" will yield an array with 64-bit integers.

    - Be reverse engineered.  Datasets often live longer than the
      programs that created them.  A competent developer should be
      able to create a solution in his preferred programming language to
      read most NPY files that he has been given without much
      documentation.

    - Allow memory-mapping of the data.

    - Be read from a filelike stream object instead of an actual file.
      This allows the implementation to be tested easily and makes the
      system more flexible.  NPY files can be stored in ZIP files and
      easily read from a ZipFile object.

    - Store object arrays.  Since general Python objects are
      complicated and can only be reliably serialized by pickle (if at
      all), many of the other requirements are waived for files
      containing object arrays.  Files with object arrays do not have
      to be mmapable since that would be technically impossible.  We
      cannot expect the pickle format to be reverse engineered without
      knowledge of pickle.  However, one should at least be able to
      read and write object arrays with the same generic interface as
      other arrays.

    - Be read and written using APIs provided in the numpy package
      itself without any other libraries.  The implementation inside
      numpy may be in C if necessary.

    The format explicitly *does not* need to:

    - Support multiple arrays in a file.  Since we require filelike
      objects to be supported, one could use the API to build an ad
      hoc format that supported multiple arrays.  However, solving the
      general problem and use cases is beyond the scope of the format
      and the API for numpy.

    - Fully handle arbitrary subclasses of numpy.ndarray.  Subclasses
      will be accepted for writing, but only the array data will be
      written out.  A regular numpy.ndarray object will be created
      upon reading the file.  The API can be used to build a format
      for a particular subclass, but that is out of scope for the
      general NPY format.
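
    As an illustration of the filelike-stream and memory-mapping
    requirements above, the following sketch assumes the numpy.save()
    and numpy.load() convenience functions found in current numpy
    releases; it is an example, not part of the specification.

        import io
        import numpy as np

        # Round-trip through an in-memory file-like object rather than
        # an actual file on disk.
        buf = io.BytesIO()
        np.save(buf, np.arange(10))
        buf.seek(0)
        arr = np.load(buf)

        # Memory-map an on-disk NPY file instead of reading it eagerly.
        np.save("example.npy", np.arange(10))
        mapped = np.load("example.npy", mmap_mode="r")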


Format Specification: Version 1.0

    The first 6 bytes are a magic string: exactly "\x93NUMPY".

    The next 1 byte is an unsigned byte: the major version number of
    the file format, e.g. \x01.

    The next 1 byte is an unsigned byte: the minor version number of
    the file format, e.g. \x00.  Note: the version of the file format
    is not tied to the version of the numpy package.

    The next 2 bytes form a little-endian unsigned short int: the
    length of the header data HEADER_LEN.

    The next HEADER_LEN bytes form the header data describing the
    array's format.  It is an ASCII string which contains a Python
    literal expression of a dictionary.  It is terminated by a newline
    ('\n') and padded with spaces ('\x20') to make the total length of
    the magic string + 4 + HEADER_LEN be evenly divisible by 16 for
    alignment purposes.
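
    For illustration only, the preamble and header string can be parsed
    along the following lines (the reference implementation ships as
    format.py in numpy, as noted under Implementation below; the helper
    name here is an ad hoc choice for this sketch):

        import struct

        def read_preamble(fp):
            # Magic string, then one unsigned byte each for the major
            # and minor version numbers.
            magic = fp.read(6)
            if magic != b"\x93NUMPY":
                raise ValueError("not an NPY file")
            major, minor = fp.read(1)[0], fp.read(1)[0]
            # HEADER_LEN is a little-endian unsigned short int.
            (header_len,) = struct.unpack("<H", fp.read(2))
            header = fp.read(header_len).decode("ascii")
            return (major, minor), header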

    The dictionary contains three keys:

        "descr" : dtype.descr
            An object that can be passed as an argument to the
            numpy.dtype() constructor to create the array's dtype.

        "fortran_order" : bool
            Whether the array data is Fortran-contiguous or not.
            Since Fortran-contiguous arrays are a common form of
            non-C-contiguity, we allow them to be written directly to
            disk for efficiency.

        "shape" : tuple of int
            The shape of the array.

    For repeatability and readability, this dictionary is formatted
    using pprint.pformat() so the keys are in alphabetic order.
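
    As an illustration of the write side, a conforming header could be
    assembled as in the sketch below (an assumption-laden example with
    an ad hoc helper name, not the reference implementation):

        import pprint
        import struct

        def build_header(descr, fortran_order, shape):
            # Format the dictionary; pprint.pformat() sorts the keys.
            d = {"descr": descr, "fortran_order": fortran_order,
                 "shape": shape}
            header = pprint.pformat(d, width=10000)
            # Pad with spaces and terminate with a newline so that
            # len(magic string) + 4 + HEADER_LEN is divisible by 16.
            preamble_len = 6 + 2 + 2
            pad = (16 - (preamble_len + len(header) + 1) % 16) % 16
            header = header + " " * pad + "\n"
            return (b"\x93NUMPY" + bytes([1, 0])
                    + struct.pack("<H", len(header))
                    + header.encode("ascii"))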

    Following the header comes the array data.  If the dtype contains
    Python objects (i.e. dtype.hasobject is True), then the data is
    a Python pickle of the array.  Otherwise the data is the
    contiguous (either C- or Fortran-, depending on fortran_order)
    bytes of the array.  Consumers can figure out the number of bytes
    by multiplying the number of elements given by the shape (noting
    that shape=() means there is 1 element) by dtype.itemsize.
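
    For non-object dtypes, a consumer could reconstruct the array from
    a parsed header roughly as follows; ast.literal_eval() is used here
    as one way to evaluate the dictionary literal, and the helper name
    is again an ad hoc choice for this sketch:

        import ast
        import numpy as np

        def read_array(fp, header):
            # 'header' is the ASCII header string described above;
            # 'fp' is positioned at the start of the array data.
            d = ast.literal_eval(header)
            dtype = np.dtype(d["descr"])
            shape = d["shape"]
            count = 1
            for dim in shape:
                count *= dim              # shape == () leaves count == 1
            data = fp.read(count * dtype.itemsize)
            arr = np.frombuffer(data, dtype=dtype, count=count)
            order = "F" if d["fortran_order"] else "C"
            return arr.reshape(shape, order=order)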


Conventions

    We recommend using the ".npy" extension for files following this
    format.  This is by no means a requirement; applications may wish
    to use this file format but use an extension specific to the
    application.  In the absence of an obvious alternative, however,
    we suggest using ".npy".

    For a simple way to combine multiple arrays into a single file,
    one can use ZipFile to contain multiple ".npy" files.  We
    recommend using the file extension ".npz" for these archives.
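
    In current numpy releases, numpy.savez() provides this convenience
    directly; for illustration, roughly the same result can be obtained
    with the standard-library zipfile module (the file and array names
    below are arbitrary):

        import io
        import zipfile
        import numpy as np

        a, b = np.arange(3), np.eye(2)
        with zipfile.ZipFile("bundle.npz", "w") as zf:
            for name, arr in [("a", a), ("b", b)]:
                buf = io.BytesIO()
                np.save(buf, arr)          # a complete ".npy" stream
                zf.writestr(name + ".npy", buf.getvalue())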


Alternatives

    The author believes that this system (or one along these lines) is
    about the simplest system that satisfies all of the requirements.
    However, one must always be wary of introducing a new binary
    format to the world.

    HDF5 [2] is a very flexible format that should be able to
    represent all of NumPy's arrays in some fashion.  It is probably
    the only widely-used format that can faithfully represent all of
    NumPy's array features.  It has seen substantial adoption by the
    scientific community in general and the NumPy community in
    particular.  It is an excellent solution for a wide variety of
    array storage problems with or without NumPy.

    HDF5 is a complicated format that more or less implements
    a hierarchical filesystem-in-a-file.  This fact makes satisfying
    some of the Requirements difficult.  To the author's knowledge, as
    of this writing, there is no application or library that reads or
    writes even a subset of HDF5 files that does not use the canonical
    libhdf5 implementation.  This implementation is a large library
    that is not always easy to build.  It would be infeasible to
    include it in numpy.

    It might be feasible to target an extremely limited subset of
    HDF5.  Namely, there would be only one object in it: the array.
    Using contiguous storage for the data, one should be able to
    implement just enough of the format to provide the same metadata
    that the proposed format does.  One could still meet all of the
    technical requirements like mmapability.

    We would accrue a substantial benefit by being able to generate
    files that could be read by other HDF5 software.  Furthermore, by
    providing the first non-libhdf5 implementation of HDF5, we would
    be able to encourage more adoption of simple HDF5 in applications
    where it was previously infeasible because of the size of the
    library.  The basic work may encourage similar dead-simple
    implementations in other languages and further expand the
    community.

    The remaining concern is about reverse engineerability of the
    format.  Even the simple subset of HDF5 would be very difficult to
    reverse engineer given just a file by itself.  However, given the
    prominence of HDF5, this might not be a substantial concern.

    In conclusion, we are going forward with the design laid out in
    this document.  If someone writes code to handle the simple subset
    of HDF5 that would be useful to us, we may consider a revision of
    the file format.


Implementation

    The current implementation is in the trunk of the numpy SVN
    repository and will be part of the 1.0.5 release.

        http://svn.scipy.org/svn/numpy/trunk

    Specifically, the file format.py in this directory implements the
    format as described here.


References

    [1] http://docs.python.org/lib/module-pickle.html

    [2] http://hdf.ncsa.uiuc.edu/products/hdf5/index.html


Copyright

    This document has been placed in the public domain.



Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: