author: Ralf Gommers <ralf.gommers@googlemail.com> 2014-04-21 18:01:10 +0200
committer: Ralf Gommers <ralf.gommers@googlemail.com> 2014-04-21 18:05:11 +0200
commit: 7d3e739a342121da74837a73a0374ed148aafe86
tree: fdf7e582d0728608638a362682e0076fee809ad7 /doc/neps/npy-format.rst
parent: b66af02c283ddd1e33436ad3a14adec0bb09785f
DOC: fix reST formatting of npy-format NEP.
Diffstat (limited to 'doc/neps/npy-format.rst')
-rw-r--r-- doc/neps/npy-format.rst 407
1 files changed, 201 insertions, 206 deletions
diff --git a/doc/neps/npy-format.rst b/doc/neps/npy-format.rst
index 95b6f755e..f7bb799f2 100644
--- a/doc/neps/npy-format.rst
+++ b/doc/neps/npy-format.rst
@@ -2,296 +2,291 @@
A Simple File Format for NumPy Arrays
=====================================
-Discussions-To: numpy-discussion@mail.scipy.org
-Version: $Revision$
-Last-Modified: $Date$
Author: Robert Kern <robert.kern@gmail.com>
Status: Draft
-Type: Standards Track
-Content-Type: text/plain
Created: 20-Dec-2007
Abstract
--------
- We propose a standard binary file format (NPY) for persisting
- a single arbitrary NumPy array on disk. The format stores all of
- the shape and dtype information necessary to reconstruct the array
- correctly even on another machine with a different architecture.
- The format is designed to be as simple as possible while achieving
- its limited goals. The implementation is intended to be pure
- Python and distributed as part of the main numpy package.
+We propose a standard binary file format (NPY) for persisting
+a single arbitrary NumPy array on disk. The format stores all of
+the shape and dtype information necessary to reconstruct the array
+correctly even on another machine with a different architecture.
+The format is designed to be as simple as possible while achieving
+its limited goals. The implementation is intended to be pure
+Python and distributed as part of the main numpy package.
Rationale
---------
- A lightweight, omnipresent system for saving NumPy arrays to disk
- is a frequent need. Python in general has pickle [1] for saving
- most Python objects to disk. This often works well enough with
- NumPy arrays for many purposes, but it has a few drawbacks:
+A lightweight, omnipresent system for saving NumPy arrays to disk
+is a frequent need. Python in general has pickle [1] for saving
+most Python objects to disk. This often works well enough with
+NumPy arrays for many purposes, but it has a few drawbacks:
- - Dumping or loading a pickle file require the duplication of the
- data in memory. For large arrays, this can be a showstopper.
+- Dumping or loading a pickle file require the duplication of the
+ data in memory. For large arrays, this can be a showstopper.
- - The array data is not directly accessible through
- memory-mapping. Now that numpy has that capability, it has
- proved very useful for loading large amounts of data (or more to
- the point: avoiding loading large amounts of data when you only
- need a small part).
+- The array data is not directly accessible through
+ memory-mapping. Now that numpy has that capability, it has
+ proved very useful for loading large amounts of data (or more to
+ the point: avoiding loading large amounts of data when you only
+ need a small part).
- Both of these problems can be addressed by dumping the raw bytes
- to disk using ndarray.tofile() and numpy.fromfile(). However,
- these have their own problems:
+Both of these problems can be addressed by dumping the raw bytes
+to disk using ndarray.tofile() and numpy.fromfile(). However,
+these have their own problems:
- - The data which is written has no information about the shape or
- dtype of the array.
+- The data which is written has no information about the shape or
+ dtype of the array.
- - It is incapable of handling object arrays.
+- It is incapable of handling object arrays.
- The NPY file format is an evolutionary advance over these two
- approaches. Its design is mostly limited to solving the problems
- with pickles and tofile()/fromfile(). It does not intend to solve
- more complicated problems for which more complicated formats like
- HDF5 [2] are a better solution.
+The NPY file format is an evolutionary advance over these two
+approaches. Its design is mostly limited to solving the problems
+with pickles and tofile()/fromfile(). It does not intend to solve
+more complicated problems for which more complicated formats like
+HDF5 [2] are a better solution.
Use Cases
---------
- - Neville Newbie has just started to pick up Python and NumPy. He
- has not installed many packages, yet, nor learned the standard
- library, but he has been playing with NumPy at the interactive
- prompt to do small tasks. He gets a result that he wants to
- save.
-
- - Annie Analyst has been using large nested record arrays to
- represent her statistical data. She wants to convince her
- R-using colleague, David Doubter, that Python and NumPy are
- awesome by sending him her analysis code and data. She needs
- the data to load at interactive speeds. Since David does not
- use Python usually, needing to install large packages would turn
- him off.
-
- - Simon Seismologist is developing new seismic processing tools.
- One of his algorithms requires large amounts of intermediate
- data to be written to disk. The data does not really fit into
- the industry-standard SEG-Y schema, but he already has a nice
- record-array dtype for using it internally.
-
- - Polly Parallel wants to split up a computation on her multicore
- machine as simply as possible. Parts of the computation can be
- split up among different processes without any communication
- between processes; they just need to fill in the appropriate
- portion of a large array with their results. Having several
- child processes memory-mapping a common array is a good way to
- achieve this.
+- Neville Newbie has just started to pick up Python and NumPy. He
+ has not installed many packages, yet, nor learned the standard
+ library, but he has been playing with NumPy at the interactive
+ prompt to do small tasks. He gets a result that he wants to
+ save.
+
+- Annie Analyst has been using large nested record arrays to
+ represent her statistical data. She wants to convince her
+ R-using colleague, David Doubter, that Python and NumPy are
+ awesome by sending him her analysis code and data. She needs
+ the data to load at interactive speeds. Since David does not
+ use Python usually, needing to install large packages would turn
+ him off.
+
+- Simon Seismologist is developing new seismic processing tools.
+ One of his algorithms requires large amounts of intermediate
+ data to be written to disk. The data does not really fit into
+ the industry-standard SEG-Y schema, but he already has a nice
+ record-array dtype for using it internally.
+
+- Polly Parallel wants to split up a computation on her multicore
+ machine as simply as possible. Parts of the computation can be
+ split up among different processes without any communication
+ between processes; they just need to fill in the appropriate
+ portion of a large array with their results. Having several
+ child processes memory-mapping a common array is a good way to
+ achieve this.
Requirements
------------
- The format MUST be able to:
+The format MUST be able to:
- - Represent all NumPy arrays including nested record
- arrays and object arrays.
+- Represent all NumPy arrays including nested record
+ arrays and object arrays.
- - Represent the data in its native binary form.
+- Represent the data in its native binary form.
- - Be contained in a single file.
+- Be contained in a single file.
- - Support Fortran-contiguous arrays directly.
+- Support Fortran-contiguous arrays directly.
- - Store all of the necessary information to reconstruct the array
- including shape and dtype on a machine of a different
- architecture. Both little-endian and big-endian arrays must be
- supported and a file with little-endian numbers will yield
- a little-endian array on any machine reading the file. The
- types must be described in terms of their actual sizes. For
- example, if a machine with a 64-bit C "long int" writes out an
- array with "long ints", a reading machine with 32-bit C "long
- ints" will yield an array with 64-bit integers.
+- Store all of the necessary information to reconstruct the array
+ including shape and dtype on a machine of a different
+ architecture. Both little-endian and big-endian arrays must be
+ supported and a file with little-endian numbers will yield
+ a little-endian array on any machine reading the file. The
+ types must be described in terms of their actual sizes. For
+ example, if a machine with a 64-bit C "long int" writes out an
+ array with "long ints", a reading machine with 32-bit C "long
+ ints" will yield an array with 64-bit integers.
- - Be reverse engineered. Datasets often live longer than the
- programs that created them. A competent developer should be
- able to create a solution in his preferred programming language to
- read most NPY files that he has been given without much
- documentation.
+- Be reverse engineered. Datasets often live longer than the
+ programs that created them. A competent developer should be
+ able to create a solution in his preferred programming language to
+ read most NPY files that he has been given without much
+ documentation.
- - Allow memory-mapping of the data.
+- Allow memory-mapping of the data.
- - Be read from a filelike stream object instead of an actual file.
- This allows the implementation to be tested easily and makes the
- system more flexible. NPY files can be stored in ZIP files and
- easily read from a ZipFile object.
+- Be read from a filelike stream object instead of an actual file.
+ This allows the implementation to be tested easily and makes the
+ system more flexible. NPY files can be stored in ZIP files and
+ easily read from a ZipFile object.
- - Store object arrays. Since general Python objects are
- complicated and can only be reliably serialized by pickle (if at
- all), many of the other requirements are waived for files
- containing object arrays. Files with object arrays do not have
- to be mmapable since that would be technically impossible. We
- cannot expect the pickle format to be reverse engineered without
- knowledge of pickle. However, one should at least be able to
- read and write object arrays with the same generic interface as
- other arrays.
+- Store object arrays. Since general Python objects are
+ complicated and can only be reliably serialized by pickle (if at
+ all), many of the other requirements are waived for files
+ containing object arrays. Files with object arrays do not have
+ to be mmapable since that would be technically impossible. We
+ cannot expect the pickle format to be reverse engineered without
+ knowledge of pickle. However, one should at least be able to
+ read and write object arrays with the same generic interface as
+ other arrays.
- - Be read and written using APIs provided in the numpy package
- itself without any other libraries. The implementation inside
- numpy may be in C if necessary.
+- Be read and written using APIs provided in the numpy package
+ itself without any other libraries. The implementation inside
+ numpy may be in C if necessary.
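The requirement that types be described by their actual sizes can be illustrated with a small sketch. It uses only Python's standard ``struct`` module; the sample value and the format codes are illustrative choices and are not taken from the NEP.

```python
import struct

value = 2**40  # does not fit in a 32-bit integer

# Explicit size and byte order: always 8 bytes, little-endian, on any machine.
portable = struct.pack("<q", value)

# Native C "long": 4 or 8 bytes depending on the platform and compiler.
native_long_size = struct.calcsize("l")

# A reader on any architecture recovers the same 64-bit value.
restored = struct.unpack("<q", portable)[0]
```

A description such as "C long" would leave the reader guessing between the two native sizes, whereas an explicit 8-byte little-endian description does not.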
- The format explicitly *does not* need to:
+The format explicitly *does not* need to:
- - Support multiple arrays in a file. Since we require filelike
- objects to be supported, one could use the API to build an ad
- hoc format that supported multiple arrays. However, solving the
- general problem and use cases is beyond the scope of the format
- and the API for numpy.
+- Support multiple arrays in a file. Since we require filelike
+ objects to be supported, one could use the API to build an ad
+ hoc format that supported multiple arrays. However, solving the
+ general problem and use cases is beyond the scope of the format
+ and the API for numpy.
- - Fully handle arbitrary subclasses of numpy.ndarray. Subclasses
- will be accepted for writing, but only the array data will be
- written out. A regular numpy.ndarray object will be created
- upon reading the file. The API can be used to build a format
- for a particular subclass, but that is out of scope for the
- general NPY format.
+- Fully handle arbitrary subclasses of numpy.ndarray. Subclasses
+ will be accepted for writing, but only the array data will be
+ written out. A regular numpy.ndarray object will be created
+ upon reading the file. The API can be used to build a format
+ for a particular subclass, but that is out of scope for the
+ general NPY format.
Format Specification: Version 1.0
---------------------------------
- The first 6 bytes are a magic string: exactly "\x93NUMPY".
+The first 6 bytes are a magic string: exactly "\x93NUMPY".
- The next 1 byte is an unsigned byte: the major version number of
- the file format, e.g. \x01.
+The next 1 byte is an unsigned byte: the major version number of
+the file format, e.g. \x01.
- The next 1 byte is an unsigned byte: the minor version number of
- the file format, e.g. \x00. Note: the version of the file format
- is not tied to the version of the numpy package.
+The next 1 byte is an unsigned byte: the minor version number of
+the file format, e.g. \x00. Note: the version of the file format
+is not tied to the version of the numpy package.
- The next 2 bytes form a little-endian unsigned short int: the
- length of the header data HEADER_LEN.
+The next 2 bytes form a little-endian unsigned short int: the
+length of the header data HEADER_LEN.
- The next HEADER_LEN bytes form the header data describing the
- array's format. It is an ASCII string which contains a Python
- literal expression of a dictionary. It is terminated by a newline
- ('\n') and padded with spaces ('\x20') to make the total length of
- the magic string + 4 + HEADER_LEN be evenly divisible by 16 for
- alignment purposes.
+The next HEADER_LEN bytes form the header data describing the
+array's format. It is an ASCII string which contains a Python
+literal expression of a dictionary. It is terminated by a newline
+('\n') and padded with spaces ('\x20') to make the total length of
+the magic string + 4 + HEADER_LEN be evenly divisible by 16 for
+alignment purposes.
- The dictionary contains three keys:
+The dictionary contains three keys:
- "descr" : dtype.descr
- An object that can be passed as an argument to the
- numpy.dtype() constructor to create the array's dtype.
+ "descr" : dtype.descr
+ An object that can be passed as an argument to the
+ numpy.dtype() constructor to create the array's dtype.
- "fortran_order" : bool
- Whether the array data is Fortran-contiguous or not.
- Since Fortran-contiguous arrays are a common form of
- non-C-contiguity, we allow them to be written directly to
- disk for efficiency.
+ "fortran_order" : bool
+ Whether the array data is Fortran-contiguous or not.
+ Since Fortran-contiguous arrays are a common form of
+ non-C-contiguity, we allow them to be written directly to
+ disk for efficiency.
- "shape" : tuple of int
- The shape of the array.
+ "shape" : tuple of int
+ The shape of the array.
- For repeatability and readability, this dictionary is formatted
- using pprint.pformat() so the keys are in alphabetic order.
+For repeatability and readability, this dictionary is formatted
+using pprint.pformat() so the keys are in alphabetic order.
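The header layout described above might be sketched as follows, using only the standard library. The helper name ``build_header`` is hypothetical, and placing the space padding before the terminating newline is an assumption, since the text does not fix the order.

```python
import pprint
import struct

MAGIC = b"\x93NUMPY"

def build_header(descr, fortran_order, shape):
    """Return magic + version + HEADER_LEN + padded header (format 1.0)."""
    meta = {"descr": descr, "fortran_order": fortran_order, "shape": shape}
    # pformat sorts dictionary keys, which gives a repeatable header text.
    text = pprint.pformat(meta).encode("ascii")
    prefix_len = len(MAGIC) + 4  # magic, 2 version bytes, 2 length bytes
    # Pad so that prefix + header (including the newline) is a multiple of 16.
    pad = -(prefix_len + len(text) + 1) % 16
    header = text + b" " * pad + b"\n"
    version = bytes([1, 0])  # major 1, minor 0
    return MAGIC + version + struct.pack("<H", len(header)) + header

blob = build_header("<f8", False, (3, 4))
```

Because the padded length is a multiple of 16, the array data that follows starts on a 16-byte boundary within the file.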
- Following the header comes the array data. If the dtype contains
- Python objects (i.e. dtype.hasobject is True), then the data is
- a Python pickle of the array. Otherwise the data is the
- contiguous (either C- or Fortran-, depending on fortran_order)
- bytes of the array. Consumers can figure out the number of bytes
- by multiplying the number of elements given by the shape (noting
- that shape=() means there is 1 element) by dtype.itemsize.
+Following the header comes the array data. If the dtype contains
+Python objects (i.e. dtype.hasobject is True), then the data is
+a Python pickle of the array. Otherwise the data is the
+contiguous (either C- or Fortran-, depending on fortran_order)
+bytes of the array. Consumers can figure out the number of bytes
+by multiplying the number of elements given by the shape (noting
+that shape=() means there is 1 element) by dtype.itemsize.
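A reader for this layout could look roughly like the sketch below. It is not the numpy implementation: ``read_header`` is a hypothetical helper, and the item size is derived from a simple string ``descr`` such as ``'<f8'``, whereas a real consumer would pass ``descr`` to ``numpy.dtype()`` and use ``dtype.itemsize``.

```python
import ast
import io
import struct

MAGIC = b"\x93NUMPY"

def read_header(stream):
    """Parse a version 1.0 preamble; return (header dict, element count)."""
    if stream.read(6) != MAGIC:
        raise ValueError("not an NPY file")
    major, minor = struct.unpack("BB", stream.read(2))
    if (major, minor) != (1, 0):
        raise ValueError("unsupported format version")
    (header_len,) = struct.unpack("<H", stream.read(2))
    # The header is a Python literal; literal_eval never executes code.
    header = ast.literal_eval(stream.read(header_len).decode("ascii"))
    count = 1  # shape == () therefore yields one element
    for n in header["shape"]:
        count *= n
    return header, count

# Build a tiny in-memory file holding a 2x3 array of little-endian float64.
text = b"{'descr': '<f8', 'fortran_order': False, 'shape': (2, 3)}"
pad = -(10 + len(text) + 1) % 16
header_bytes = text + b" " * pad + b"\n"
sample = MAGIC + b"\x01\x00" + struct.pack("<H", len(header_bytes)) + header_bytes
sample += struct.pack("<6d", *range(6))

stream = io.BytesIO(sample)
header, count = read_header(stream)
itemsize = int(header["descr"][2:])  # simplification: only handles '<f8'-style descr
data = stream.read(count * itemsize)
values = struct.unpack("<%dd" % count, data)
```

The sketch reads from a file-like object rather than a path, in line with the requirement that the format be readable from a stream.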
Conventions
-----------
- We recommend using the ".npy" extension for files following this
- format. This is by no means a requirement; applications may wish
- to use this file format but use an extension specific to the
- application. In the absence of an obvious alternative, however,
- we suggest using ".npy".
+We recommend using the ".npy" extension for files following this
+format. This is by no means a requirement; applications may wish
+to use this file format but use an extension specific to the
+application. In the absence of an obvious alternative, however,
+we suggest using ".npy".
- For a simple way to combine multiple arrays into a single file,
- one can use ZipFile to contain multiple ".npy" files. We
- recommend using the file extension ".npz" for these archives.
+For a simple way to combine multiple arrays into a single file,
+one can use ZipFile to contain multiple ".npy" files. We
+recommend using the file extension ".npz" for these archives.
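The ".npz" convention could be sketched with the standard ``zipfile`` module as below. The helper ``npy_bytes`` is hypothetical and only encodes one-dimensional little-endian float64 data; in practice the ".npy" members would be produced by numpy's own writer.

```python
import io
import struct
import zipfile

def npy_bytes(values):
    """Encode a 1-D sequence of floats as a version 1.0 NPY blob ('<f8')."""
    text = "{'descr': '<f8', 'fortran_order': False, 'shape': (%d,)}" % len(values)
    text = text.encode("ascii")
    pad = -(10 + len(text) + 1) % 16  # 10 = magic + version + length field
    header = text + b" " * pad + b"\n"
    preamble = b"\x93NUMPY\x01\x00" + struct.pack("<H", len(header))
    return preamble + header + struct.pack("<%dd" % len(values), *values)

# Write two arrays into one archive held in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("x.npy", npy_bytes([0.0, 1.0, 2.0]))
    archive.writestr("y.npy", npy_bytes([10.0, 20.0]))

# Read the archive back: each member is an ordinary NPY file.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    x_blob = archive.read("x.npy")
```

Each member can then be parsed exactly like a standalone ".npy" file, so the archive needs no format of its own.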
Alternatives
------------
- The author believes that this system (or one along these lines) is
- about the simplest system that satisfies all of the requirements.
- However, one must always be wary of introducing a new binary
- format to the world.
-
- HDF5 [2] is a very flexible format that should be able to
- represent all of NumPy's arrays in some fashion. It is probably
- the only widely-used format that can faithfully represent all of
- NumPy's array features. It has seen substantial adoption by the
- scientific community in general and the NumPy community in
- particular. It is an excellent solution for a wide variety of
- array storage problems with or without NumPy.
-
- HDF5 is a complicated format that more or less implements
- a hierarchical filesystem-in-a-file. This fact makes satisfying
- some of the Requirements difficult. To the author's knowledge, as
- of this writing, there is no application or library that reads or
- writes even a subset of HDF5 files that does not use the canonical
- libhdf5 implementation. This implementation is a large library
- that is not always easy to build. It would be infeasible to
- include it in numpy.
-
- It might be feasible to target an extremely limited subset of
- HDF5. Namely, there would be only one object in it: the array.
- Using contiguous storage for the data, one should be able to
- implement just enough of the format to provide the same metadata
- that the proposed format does. One could still meet all of the
- technical requirements like mmapability.
-
- We would accrue a substantial benefit by being able to generate
- files that could be read by other HDF5 software. Furthermore, by
- providing the first non-libhdf5 implementation of HDF5, we would
- be able to encourage more adoption of simple HDF5 in applications
- where it was previously infeasible because of the size of the
- library. The basic work may encourage similar dead-simple
- implementations in other languages and further expand the
- community.
-
- The remaining concern is about reverse engineerability of the
- format. Even the simple subset of HDF5 would be very difficult to
- reverse engineer given just a file by itself. However, given the
- prominence of HDF5, this might not be a substantial concern.
-
- In conclusion, we are going forward with the design laid out in
- this document. If someone writes code to handle the simple subset
- of HDF5 that would be useful to us, we may consider a revision of
- the file format.
+The author believes that this system (or one along these lines) is
+about the simplest system that satisfies all of the requirements.
+However, one must always be wary of introducing a new binary
+format to the world.
+
+HDF5 [2] is a very flexible format that should be able to
+represent all of NumPy's arrays in some fashion. It is probably
+the only widely-used format that can faithfully represent all of
+NumPy's array features. It has seen substantial adoption by the
+scientific community in general and the NumPy community in
+particular. It is an excellent solution for a wide variety of
+array storage problems with or without NumPy.
+
+HDF5 is a complicated format that more or less implements
+a hierarchical filesystem-in-a-file. This fact makes satisfying
+some of the Requirements difficult. To the author's knowledge, as
+of this writing, there is no application or library that reads or
+writes even a subset of HDF5 files that does not use the canonical
+libhdf5 implementation. This implementation is a large library
+that is not always easy to build. It would be infeasible to
+include it in numpy.
+
+It might be feasible to target an extremely limited subset of
+HDF5. Namely, there would be only one object in it: the array.
+Using contiguous storage for the data, one should be able to
+implement just enough of the format to provide the same metadata
+that the proposed format does. One could still meet all of the
+technical requirements like mmapability.
+
+We would accrue a substantial benefit by being able to generate
+files that could be read by other HDF5 software. Furthermore, by
+providing the first non-libhdf5 implementation of HDF5, we would
+be able to encourage more adoption of simple HDF5 in applications
+where it was previously infeasible because of the size of the
+library. The basic work may encourage similar dead-simple
+implementations in other languages and further expand the
+community.
+
+The remaining concern is about reverse engineerability of the
+format. Even the simple subset of HDF5 would be very difficult to
+reverse engineer given just a file by itself. However, given the
+prominence of HDF5, this might not be a substantial concern.
+
+In conclusion, we are going forward with the design laid out in
+this document. If someone writes code to handle the simple subset
+of HDF5 that would be useful to us, we may consider a revision of
+the file format.
Implementation
--------------
- The current implementation is included in the 1.0.5 release of numpy.
+The current implementation is included in the 1.0.5 release of numpy.
- http://github.com/numpy/numpy/blob/v1.5.0/numpy/lib/format.py
+ http://github.com/numpy/numpy/blob/v1.5.0/numpy/lib/format.py
- Specifically, the file format.py in this directory implements the
- format as described here.
+Specifically, the file format.py in this directory implements the
+format as described here.
References
----------
- [1] http://docs.python.org/lib/module-pickle.html
+[1] http://docs.python.org/lib/module-pickle.html
- [2] http://hdf.ncsa.uiuc.edu/products/hdf5/index.html
+[2] http://hdf.ncsa.uiuc.edu/products/hdf5/index.html
Copyright
---------
- This document has been placed in the public domain.
+This document has been placed in the public domain.