DOC: fix reST formatting of npy-format NEP.

author: Ralf Gommers <ralf.gommers@googlemail.com> 2014-04-21 18:01:10 +0200
committer: Ralf Gommers <ralf.gommers@googlemail.com> 2014-04-21 18:05:11 +0200
commit: 7d3e739a342121da74837a73a0374ed148aafe86 (patch)
tree: fdf7e582d0728608638a362682e0076fee809ad7 /doc/neps/npy-format.rst
parent: b66af02c283ddd1e33436ad3a14adec0bb09785f (diff)
download: numpy-7d3e739a342121da74837a73a0374ed148aafe86.tar.gz
1 files changed, 201 insertions, 206 deletions
diff --git a/doc/neps/npy-format.rst b/doc/neps/npy-format.rst
index 95b6f755e..f7bb799f2 100644
--- a/doc/neps/npy-format.rst
+++ b/doc/neps/npy-format.rst
@@ -2,296 +2,291 @@
 A Simple File Format for NumPy Arrays
 =====================================
 
-Discussions-To: numpy-discussion@mail.scipy.org
-Version: $Revision$
-Last-Modified: $Date$
 Author: Robert Kern <robert.kern@gmail.com>
 Status: Draft
-Type: Standards Track
-Content-Type: text/plain
 Created: 20-Dec-2007
 
 
 Abstract
 --------
 
-    We propose a standard binary file format (NPY) for persisting
-    a single arbitrary NumPy array on disk.  The format stores all of
-    the shape and dtype information necessary to reconstruct the array
-    correctly even on another machine with a different architecture.
-    The format is designed to be as simple as possible while achieving
-    its limited goals.  The implementation is intended to be pure
-    Python and distributed as part of the main numpy package.
+We propose a standard binary file format (NPY) for persisting
+a single arbitrary NumPy array on disk.  The format stores all of
+the shape and dtype information necessary to reconstruct the array
+correctly even on another machine with a different architecture.
+The format is designed to be as simple as possible while achieving
+its limited goals.  The implementation is intended to be pure
+Python and distributed as part of the main numpy package.
 
 
 Rationale
 ---------
 
-    A lightweight, omnipresent system for saving NumPy arrays to disk
-    is a frequent need.  Python in general has pickle [1] for saving
-    most Python objects to disk.  This often works well enough with
-    NumPy arrays for many purposes, but it has a few drawbacks:
+A lightweight, omnipresent system for saving NumPy arrays to disk
+is a frequent need.  Python in general has pickle [1] for saving
+most Python objects to disk.  This often works well enough with
+NumPy arrays for many purposes, but it has a few drawbacks:
 
-    - Dumping or loading a pickle file require the duplication of the
-      data in memory.  For large arrays, this can be a showstopper.
+- Dumping or loading a pickle file require the duplication of the
+  data in memory.  For large arrays, this can be a showstopper.
 
-    - The array data is not directly accessible through
-      memory-mapping.  Now that numpy has that capability, it has
-      proved very useful for loading large amounts of data (or more to
-      the point: avoiding loading large amounts of data when you only
-      need a small part).
+- The array data is not directly accessible through
+  memory-mapping.  Now that numpy has that capability, it has
+  proved very useful for loading large amounts of data (or more to
+  the point: avoiding loading large amounts of data when you only
+  need a small part).
 
-    Both of these problems can be addressed by dumping the raw bytes
-    to disk using ndarray.tofile() and numpy.fromfile().  However,
-    these have their own problems:
+Both of these problems can be addressed by dumping the raw bytes
+to disk using ndarray.tofile() and numpy.fromfile().  However,
+these have their own problems:
 
-    - The data which is written has no information about the shape or
-      dtype of the array.
+- The data which is written has no information about the shape or
+  dtype of the array.
 
-    - It is incapable of handling object arrays.
+- It is incapable of handling object arrays.
 
-    The NPY file format is an evolutionary advance over these two
-    approaches.  Its design is mostly limited to solving the problems
-    with pickles and tofile()/fromfile().  It does not intend to solve
-    more complicated problems for which more complicated formats like
-    HDF5 [2] are a better solution.
+The NPY file format is an evolutionary advance over these two
+approaches.  Its design is mostly limited to solving the problems
+with pickles and tofile()/fromfile().  It does not intend to solve
+more complicated problems for which more complicated formats like
+HDF5 [2] are a better solution.
 
 
 Use Cases
 ---------
 
-    - Neville Newbie has just started to pick up Python and NumPy.  He
-      has not installed many packages, yet, nor learned the standard
-      library, but he has been playing with NumPy at the interactive
-      prompt to do small tasks.  He gets a result that he wants to
-      save.
-
-    - Annie Analyst has been using large nested record arrays to
-      represent her statistical data.  She wants to convince her
-      R-using colleague, David Doubter, that Python and NumPy are
-      awesome by sending him her analysis code and data.  She needs
-      the data to load at interactive speeds.  Since David does not
-      use Python usually, needing to install large packages would turn
-      him off.
-
-    - Simon Seismologist is developing new seismic processing tools.
-      One of his algorithms requires large amounts of intermediate
-      data to be written to disk.  The data does not really fit into
-      the industry-standard SEG-Y schema, but he already has a nice
-      record-array dtype for using it internally.
-
-    - Polly Parallel wants to split up a computation on her multicore
-      machine as simply as possible.  Parts of the computation can be
-      split up among different processes without any communication
-      between processes; they just need to fill in the appropriate
-      portion of a large array with their results.  Having several
-      child processes memory-mapping a common array is a good way to
-      achieve this.
+- Neville Newbie has just started to pick up Python and NumPy.  He
+  has not installed many packages, yet, nor learned the standard
+  library, but he has been playing with NumPy at the interactive
+  prompt to do small tasks.  He gets a result that he wants to
+  save.
+
+- Annie Analyst has been using large nested record arrays to
+  represent her statistical data.  She wants to convince her
+  R-using colleague, David Doubter, that Python and NumPy are
+  awesome by sending him her analysis code and data.  She needs
+  the data to load at interactive speeds.  Since David does not
+  use Python usually, needing to install large packages would turn
+  him off.
+
+- Simon Seismologist is developing new seismic processing tools.
+  One of his algorithms requires large amounts of intermediate
+  data to be written to disk.  The data does not really fit into
+  the industry-standard SEG-Y schema, but he already has a nice
+  record-array dtype for using it internally.
+
+- Polly Parallel wants to split up a computation on her multicore
+  machine as simply as possible.  Parts of the computation can be
+  split up among different processes without any communication
+  between processes; they just need to fill in the appropriate
+  portion of a large array with their results.  Having several
+  child processes memory-mapping a common array is a good way to
+  achieve this.
 
 
 Requirements
 ------------
 
-    The format MUST be able to:
+The format MUST be able to:
 
-    - Represent all NumPy arrays including nested record
-      arrays and object arrays.
+- Represent all NumPy arrays including nested record
+  arrays and object arrays.
 
-    - Represent the data in its native binary form.
+- Represent the data in its native binary form.
 
-    - Be contained in a single file.
+- Be contained in a single file.
 
-    - Support Fortran-contiguous arrays directly.
+- Support Fortran-contiguous arrays directly.
 
-    - Store all of the necessary information to reconstruct the array
-      including shape and dtype on a machine of a different
-      architecture.  Both little-endian and big-endian arrays must be
-      supported and a file with little-endian numbers will yield
-      a little-endian array on any machine reading the file.  The
-      types must be described in terms of their actual sizes.  For
-      example, if a machine with a 64-bit C "long int" writes out an
-      array with "long ints", a reading machine with 32-bit C "long
-      ints" will yield an array with 64-bit integers.
+- Store all of the necessary information to reconstruct the array
+  including shape and dtype on a machine of a different
+  architecture.  Both little-endian and big-endian arrays must be
+  supported and a file with little-endian numbers will yield
+  a little-endian array on any machine reading the file.  The
+  types must be described in terms of their actual sizes.  For
+  example, if a machine with a 64-bit C "long int" writes out an
+  array with "long ints", a reading machine with 32-bit C "long
+  ints" will yield an array with 64-bit integers.
 
-    - Be reverse engineered.  Datasets often live longer than the
-      programs that created them.  A competent developer should be
-      able to create a solution in his preferred programming language to
-      read most NPY files that he has been given without much
-      documentation.
+- Be reverse engineered.  Datasets often live longer than the
+  programs that created them.  A competent developer should be
+  able to create a solution in his preferred programming language to
+  read most NPY files that he has been given without much
+  documentation.
 
-    - Allow memory-mapping of the data.
+- Allow memory-mapping of the data.
 
-    - Be read from a filelike stream object instead of an actual file.
-      This allows the implementation to be tested easily and makes the
-      system more flexible.  NPY files can be stored in ZIP files and
-      easily read from a ZipFile object.
+- Be read from a filelike stream object instead of an actual file.
+  This allows the implementation to be tested easily and makes the
+  system more flexible.  NPY files can be stored in ZIP files and
+  easily read from a ZipFile object.
 
-    - Store object arrays.  Since general Python objects are
-      complicated and can only be reliably serialized by pickle (if at
-      all), many of the other requirements are waived for files
-      containing object arrays.  Files with object arrays do not have
-      to be mmapable since that would be technically impossible.  We
-      cannot expect the pickle format to be reverse engineered without
-      knowledge of pickle.  However, one should at least be able to
-      read and write object arrays with the same generic interface as
-      other arrays.
+- Store object arrays.  Since general Python objects are
+  complicated and can only be reliably serialized by pickle (if at
+  all), many of the other requirements are waived for files
+  containing object arrays.  Files with object arrays do not have
+  to be mmapable since that would be technically impossible.  We
+  cannot expect the pickle format to be reverse engineered without
+  knowledge of pickle.  However, one should at least be able to
+  read and write object arrays with the same generic interface as
+  other arrays.
 
-    - Be read and written using APIs provided in the numpy package
-      itself without any other libraries.  The implementation inside
-      numpy may be in C if necessary.
+- Be read and written using APIs provided in the numpy package
+  itself without any other libraries.  The implementation inside
+  numpy may be in C if necessary.
 
-    The format explicitly *does not* need to:
+The format explicitly *does not* need to:
 
-    - Support multiple arrays in a file.  Since we require filelike
-      objects to be supported, one could use the API to build an ad
-      hoc format that supported multiple arrays.  However, solving the
-      general problem and use cases is beyond the scope of the format
-      and the API for numpy.
+- Support multiple arrays in a file.  Since we require filelike
+  objects to be supported, one could use the API to build an ad
+  hoc format that supported multiple arrays.  However, solving the
+  general problem and use cases is beyond the scope of the format
+  and the API for numpy.
 
-    - Fully handle arbitrary subclasses of numpy.ndarray.  Subclasses
-      will be accepted for writing, but only the array data will be
-      written out.  A regular numpy.ndarray object will be created
-      upon reading the file.  The API can be used to build a format
-      for a particular subclass, but that is out of scope for the
-      general NPY format.
+- Fully handle arbitrary subclasses of numpy.ndarray.  Subclasses
+  will be accepted for writing, but only the array data will be
+  written out.  A regular numpy.ndarray object will be created
+  upon reading the file.  The API can be used to build a format
+  for a particular subclass, but that is out of scope for the
+  general NPY format.
 
 
 Format Specification: Version 1.0
 ---------------------------------
 
-    The first 6 bytes are a magic string: exactly "\x93NUMPY".
+The first 6 bytes are a magic string: exactly "\x93NUMPY".
 
-    The next 1 byte is an unsigned byte: the major version number of
-    the file format, e.g. \x01.
+The next 1 byte is an unsigned byte: the major version number of
+the file format, e.g. \x01.
 
-    The next 1 byte is an unsigned byte: the minor version number of
-    the file format, e.g. \x00.  Note: the version of the file format
-    is not tied to the version of the numpy package.
+The next 1 byte is an unsigned byte: the minor version number of
+the file format, e.g. \x00.  Note: the version of the file format
+is not tied to the version of the numpy package.
 
-    The next 2 bytes form a little-endian unsigned short int: the
-    length of the header data HEADER_LEN.
+The next 2 bytes form a little-endian unsigned short int: the
+length of the header data HEADER_LEN.
 
-    The next HEADER_LEN bytes form the header data describing the
-    array's format.  It is an ASCII string which contains a Python
-    literal expression of a dictionary.  It is terminated by a newline
-    ('\n') and padded with spaces ('\x20') to make the total length of
-    the magic string + 4 + HEADER_LEN be evenly divisible by 16 for
-    alignment purposes.
+The next HEADER_LEN bytes form the header data describing the
+array's format.  It is an ASCII string which contains a Python
+literal expression of a dictionary.  It is terminated by a newline
+('\n') and padded with spaces ('\x20') to make the total length of
+the magic string + 4 + HEADER_LEN be evenly divisible by 16 for
+alignment purposes.
 
-    The dictionary contains three keys:
+The dictionary contains three keys:
 
-        "descr" : dtype.descr
-            An object that can be passed as an argument to the
-            numpy.dtype() constructor to create the array's dtype.
+    "descr" : dtype.descr
+        An object that can be passed as an argument to the
+        numpy.dtype() constructor to create the array's dtype.
 
-        "fortran_order" : bool
-            Whether the array data is Fortran-contiguous or not.
-            Since Fortran-contiguous arrays are a common form of
-            non-C-contiguity, we allow them to be written directly to
-            disk for efficiency.
+    "fortran_order" : bool
+        Whether the array data is Fortran-contiguous or not.
+        Since Fortran-contiguous arrays are a common form of
+        non-C-contiguity, we allow them to be written directly to
+        disk for efficiency.
 
-        "shape" : tuple of int
-            The shape of the array.
+    "shape" : tuple of int
+        The shape of the array.
 
-    For repeatability and readability, this dictionary is formatted
-    using pprint.pformat() so the keys are in alphabetic order.
+For repeatability and readability, this dictionary is formatted
+using pprint.pformat() so the keys are in alphabetic order.
 
-    Following the header comes the array data.  If the dtype contains
-    Python objects (i.e. dtype.hasobject is True), then the data is
-    a Python pickle of the array.  Otherwise the data is the
-    contiguous (either C- or Fortran-, depending on fortran_order)
-    bytes of the array.  Consumers can figure out the number of bytes
-    by multiplying the number of elements given by the shape (noting
-    that shape=() means there is 1 element) by dtype.itemsize.
+Following the header comes the array data.  If the dtype contains
+Python objects (i.e. dtype.hasobject is True), then the data is
+a Python pickle of the array.  Otherwise the data is the
+contiguous (either C- or Fortran-, depending on fortran_order)
+bytes of the array.  Consumers can figure out the number of bytes
+by multiplying the number of elements given by the shape (noting
+that shape=() means there is 1 element) by dtype.itemsize.
 
 
 Conventions
 -----------
 
-    We recommend using the ".npy" extension for files following this
-    format.  This is by no means a requirement; applications may wish
-    to use this file format but use an extension specific to the
-    application.  In the absence of an obvious alternative, however,
-    we suggest using ".npy".
+We recommend using the ".npy" extension for files following this
+format.  This is by no means a requirement; applications may wish
+to use this file format but use an extension specific to the
+application.  In the absence of an obvious alternative, however,
+we suggest using ".npy".
 
-    For a simple way to combine multiple arrays into a single file,
-    one can use ZipFile to contain multiple ".npy" files.  We
-    recommend using the file extension ".npz" for these archives.
+For a simple way to combine multiple arrays into a single file,
+one can use ZipFile to contain multiple ".npy" files.  We
+recommend using the file extension ".npz" for these archives.
 
 
 Alternatives
 ------------
 
-    The author believes that this system (or one along these lines) is
-    about the simplest system that satisfies all of the requirements.
-    However, one must always be wary of introducing a new binary
-    format to the world.
-
-    HDF5 [2] is a very flexible format that should be able to
-    represent all of NumPy's arrays in some fashion.  It is probably
-    the only widely-used format that can faithfully represent all of
-    NumPy's array features.  It has seen substantial adoption by the
-    scientific community in general and the NumPy community in
-    particular.  It is an excellent solution for a wide variety of
-    array storage problems with or without NumPy.
-
-    HDF5 is a complicated format that more or less implements
-    a hierarchical filesystem-in-a-file.  This fact makes satisfying
-    some of the Requirements difficult.  To the author's knowledge, as
-    of this writing, there is no application or library that reads or
-    writes even a subset of HDF5 files that does not use the canonical
-    libhdf5 implementation.  This implementation is a large library
-    that is not always easy to build.  It would be infeasible to
-    include it in numpy.
-
-    It might be feasible to target an extremely limited subset of
-    HDF5.  Namely, there would be only one object in it: the array.
-    Using contiguous storage for the data, one should be able to
-    implement just enough of the format to provide the same metadata
-    that the proposed format does.  One could still meet all of the
-    technical requirements like mmapability.
-
-    We would accrue a substantial benefit by being able to generate
-    files that could be read by other HDF5 software.  Furthermore, by
-    providing the first non-libhdf5 implementation of HDF5, we would
-    be able to encourage more adoption of simple HDF5 in applications
-    where it was previously infeasible because of the size of the
-    library.  The basic work may encourage similar dead-simple
-    implementations in other languages and further expand the
-    community.
-
-    The remaining concern is about reverse engineerability of the
-    format.  Even the simple subset of HDF5 would be very difficult to
-    reverse engineer given just a file by itself.  However, given the
-    prominence of HDF5, this might not be a substantial concern.
-
-    In conclusion, we are going forward with the design laid out in
-    this document.  If someone writes code to handle the simple subset
-    of HDF5 that would be useful to us, we may consider a revision of
-    the file format.
+The author believes that this system (or one along these lines) is
+about the simplest system that satisfies all of the requirements.
+However, one must always be wary of introducing a new binary
+format to the world.
+
+HDF5 [2] is a very flexible format that should be able to
+represent all of NumPy's arrays in some fashion.  It is probably
+the only widely-used format that can faithfully represent all of
+NumPy's array features.  It has seen substantial adoption by the
+scientific community in general and the NumPy community in
+particular.  It is an excellent solution for a wide variety of
+array storage problems with or without NumPy.
+
+HDF5 is a complicated format that more or less implements
+a hierarchical filesystem-in-a-file.  This fact makes satisfying
+some of the Requirements difficult.  To the author's knowledge, as
+of this writing, there is no application or library that reads or
+writes even a subset of HDF5 files that does not use the canonical
+libhdf5 implementation.  This implementation is a large library
+that is not always easy to build.  It would be infeasible to
+include it in numpy.
+
+It might be feasible to target an extremely limited subset of
+HDF5.  Namely, there would be only one object in it: the array.
+Using contiguous storage for the data, one should be able to
+implement just enough of the format to provide the same metadata
+that the proposed format does.  One could still meet all of the
+technical requirements like mmapability.
+
+We would accrue a substantial benefit by being able to generate
+files that could be read by other HDF5 software.  Furthermore, by
+providing the first non-libhdf5 implementation of HDF5, we would
+be able to encourage more adoption of simple HDF5 in applications
+where it was previously infeasible because of the size of the
+library.  The basic work may encourage similar dead-simple
+implementations in other languages and further expand the
+community.
+
+The remaining concern is about reverse engineerability of the
+format.  Even the simple subset of HDF5 would be very difficult to
+reverse engineer given just a file by itself.  However, given the
+prominence of HDF5, this might not be a substantial concern.
+
+In conclusion, we are going forward with the design laid out in
+this document.  If someone writes code to handle the simple subset
+of HDF5 that would be useful to us, we may consider a revision of
+the file format.
 
 
 Implementation
 --------------
 
-    The current implementation is included in the 1.0.5 release of numpy.
+The current implementation is included in the 1.0.5 release of numpy.
 
-        http://github.com/numpy/numpy/blob/v1.5.0/numpy/lib/format.py
+    http://github.com/numpy/numpy/blob/v1.5.0/numpy/lib/format.py
 
-    Specifically, the file format.py in this directory implements the
-    format as described here.
+Specifically, the file format.py in this directory implements the
+format as described here.
 
 
 References
 ----------
 
-    [1] http://docs.python.org/lib/module-pickle.html
+[1] http://docs.python.org/lib/module-pickle.html
 
-    [2] http://hdf.ncsa.uiuc.edu/products/hdf5/index.html
+[2] http://hdf.ncsa.uiuc.edu/products/hdf5/index.html
 
 
 Copyright
 ---------
 
-    This document has been placed in the public domain.
+This document has been placed in the public domain.
author	Ralf Gommers <ralf.gommers@googlemail.com>	2014-04-21 18:01:10 +0200
committer	Ralf Gommers <ralf.gommers@googlemail.com>	2014-04-21 18:05:11 +0200
commit	7d3e739a342121da74837a73a0374ed148aafe86 (patch)
tree	fdf7e582d0728608638a362682e0076fee809ad7 /doc/neps/npy-format.rst
parent	b66af02c283ddd1e33436ad3a14adec0bb09785f (diff)
download	numpy-7d3e739a342121da74837a73a0374ed148aafe86.tar.gz