author | Jarrod Millman <millman@berkeley.edu> | 2008-08-31 03:28:56 +0000 |
---|---|---|
committer | Jarrod Millman <millman@berkeley.edu> | 2008-08-31 03:28:56 +0000 |
commit | eae7e11f672e1c1d3bd66959005db1ff2c4a2c2c (patch) | |
tree | 740ac6a71aa1ac7a7ee2d7924056a893f6331a8b /doc/neps | |
parent | a33cf5171aa0325e9b036eae8ada70f78a6c3912 (diff) | |
download | numpy-eae7e11f672e1c1d3bd66959005db1ff2c4a2c2c.tar.gz |
moving and adding neps
Diffstat (limited to 'doc/neps')
-rw-r--r-- | doc/neps/datetime.rst | 354
-rw-r--r-- | doc/neps/npy-format.txt | 294
-rw-r--r-- | doc/neps/pep_buffer.txt | 869
3 files changed, 1517 insertions, 0 deletions
diff --git a/doc/neps/datetime.rst b/doc/neps/datetime.rst
new file mode 100644
index 000000000..75cc7db88
--- /dev/null
+++ b/doc/neps/datetime.rst
@@ -0,0 +1,354 @@
+====================================================================
+ A (second) proposal for implementing some date/time types in NumPy
+====================================================================
+
+:Author: Francesc Alted i Abad
+:Contact: faltet@pytables.com
+:Author: Ivan Vilata i Balaguer
+:Contact: ivan@selidor.net
+:Date: 2008-07-16
+
+
+Executive summary
+=================
+
+A date/time mark is very handy to have in the many fields where one
+has to deal with data sets.  While Python has several modules that
+define a date/time type (like the built-in ``datetime`` [1]_ or
+``mx.DateTime`` [2]_), NumPy currently lacks one.
+
+In this document we propose the addition of a series of date/time
+types to fill this gap.  The requirements for the proposed types are
+twofold: 1) they have to be fast to operate with, and 2) they have to
+be as compatible as possible with the existing ``datetime`` module
+that comes with Python.
+
+
+Types proposed
+==============
+
+To start with, it is virtually impossible to come up with a single
+date/time type that fills the needs of every use case.  So, after
+pondering different possibilities, we have settled on *two* different
+types, namely ``datetime64`` and ``timedelta64`` (these names are
+preliminary and can be changed), which can have different resolutions
+so as to cover different needs.
+
+**Important note:** the resolution is conceived here as metadata that
+*complements* a date/time dtype, *without changing the base type*.
+
+A detailed description of the proposed types follows.
+
+
+``datetime64``
+--------------
+
+It represents a time that is absolute (i.e. not relative).  It is
+implemented internally as an ``int64`` type.  The internal epoch is
+the POSIX epoch (see [3]_).
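[Editorial note: the mapping from an absolute time to the internal ``int64`` can be sketched in a few lines of plain Python.  This is an illustrative model only; the helper name and the unit table below are ours, not part of the proposed API.]

```python
# Illustrative model of the proposed internal representation of a
# ``datetime64`` value: an int64 count of time units since the POSIX epoch.
import datetime

# Seconds per unit for a few of the proposed resolution codes (assumption
# for illustration; the proposal defines the codes, not this table).
_UNIT_SECONDS = {"s": 1.0, "ms": 1e-3, "us": 1e-6}

def to_datetime64_ticks(dt, res="us"):
    """Return the int64 tick count for *dt* at resolution *res*."""
    epoch = datetime.datetime(1970, 1, 1)
    seconds = (dt - epoch).total_seconds()
    return int(round(seconds / _UNIT_SECONDS[res]))

# One day after the epoch is 86400 ticks at one-second resolution.
print(to_datetime64_ticks(datetime.datetime(1970, 1, 2), res="s"))  # 86400
```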
+
+Resolution
+~~~~~~~~~~
+
+It accepts different resolutions, and for each of these resolutions
+it will support a different time span.  The table below describes the
+supported resolutions with their corresponding time spans.
+
++------+-------------+--------------------------+
+| Code | Meaning     | Time span (years)        |
++======+=============+==========================+
+| Y    | year        | [9.2e18 BC, 9.2e18 AD]   |
++------+-------------+--------------------------+
+| Q    | quarter     | [2.3e18 BC, 2.3e18 AD]   |
++------+-------------+--------------------------+
+| M    | month       | [7.6e17 BC, 7.6e17 AD]   |
++------+-------------+--------------------------+
+| W    | week        | [1.7e17 BC, 1.7e17 AD]   |
++------+-------------+--------------------------+
+| d    | day         | [2.5e16 BC, 2.5e16 AD]   |
++------+-------------+--------------------------+
+| h    | hour        | [1.0e15 BC, 1.0e15 AD]   |
++------+-------------+--------------------------+
+| m    | minute      | [1.7e13 BC, 1.7e13 AD]   |
++------+-------------+--------------------------+
+| s    | second      | [2.9e11 BC, 2.9e11 AD]   |
++------+-------------+--------------------------+
+| ms   | millisecond | [2.9e8 BC, 2.9e8 AD]     |
++------+-------------+--------------------------+
+| us   | microsecond | [290301 BC, 294241 AD]   |
++------+-------------+--------------------------+
+| ns   | nanosecond  | [1678 AD, 2262 AD]       |
++------+-------------+--------------------------+
+
+Building a ``datetime64`` dtype
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The proposed ways to specify the resolution in the dtype constructor
+are:
+
+Using a parameter in the constructor::
+
+  dtype('datetime64', res="us")  # the default res. is microseconds
+
+Using the long string notation::
+
+  dtype('datetime64[us]')   # equivalent to dtype('datetime64')
+
+Using the short string notation::
+
+  dtype('T8[us]')   # equivalent to dtype('T8')
+
+Compatibility issues
+~~~~~~~~~~~~~~~~~~~~
+
+This will be fully compatible with the ``datetime`` class of the
+``datetime`` module of Python only when using a resolution of
+microseconds.  For other resolutions, the conversion process will
+lose precision or overflow as needed.
+
+
+``timedelta64``
+---------------
+
+It represents a time that is relative (i.e. not absolute).  It is
+implemented internally as an ``int64`` type.
+
+Resolution
+~~~~~~~~~~
+
+It accepts different resolutions, and for each of these resolutions
+it will support a different time span.
The table below describes the
+supported resolutions with their corresponding time spans.
+
++------+-------------+------------------+
+| Code | Meaning     | Time span        |
++======+=============+==================+
+| W    | week        | +- 1.7e17 years  |
++------+-------------+------------------+
+| D    | day         | +- 2.5e16 years  |
++------+-------------+------------------+
+| h    | hour        | +- 1.0e15 years  |
++------+-------------+------------------+
+| m    | minute      | +- 1.7e13 years  |
++------+-------------+------------------+
+| s    | second      | +- 2.9e11 years  |
++------+-------------+------------------+
+| ms   | millisecond | +- 2.9e8 years   |
++------+-------------+------------------+
+| us   | microsecond | +- 2.9e5 years   |
++------+-------------+------------------+
+| ns   | nanosecond  | +- 292 years     |
++------+-------------+------------------+
+| ps   | picosecond  | +- 106 days      |
++------+-------------+------------------+
+| fs   | femtosecond | +- 2.6 hours     |
++------+-------------+------------------+
+| as   | attosecond  | +- 9.2 seconds   |
++------+-------------+------------------+
+
+Building a ``timedelta64`` dtype
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The proposed ways to specify the resolution in the dtype constructor
+are:
+
+Using a parameter in the constructor::
+
+  dtype('timedelta64', res="us")  # the default res. is microseconds
+
+Using the long string notation::
+
+  dtype('timedelta64[us]')   # equivalent to dtype('timedelta64')
+
+Using the short string notation::
+
+  dtype('t8[us]')   # equivalent to dtype('t8')
+
+Compatibility issues
+~~~~~~~~~~~~~~~~~~~~
+
+This will be fully compatible with the ``timedelta`` class of the
+``datetime`` module of Python only when using a resolution of
+microseconds.  For other resolutions, the conversion process will
+lose precision or overflow as needed.
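[Editorial note: the spans in the two tables above follow from one piece of arithmetic: an ``int64`` holds +-2**63 units, so the representable half-range at a given resolution is 2**63 times the unit length.  The sketch below reproduces that arithmetic; the unit lengths and the 365.25-day year are our assumptions for illustration.]

```python
# Reproduce the time-span arithmetic behind the resolution tables.
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed Julian year

UNIT_SECONDS = {
    "s": 1.0,
    "ms": 1e-3,
    "us": 1e-6,
    "ns": 1e-9,
}

def span_years(res):
    """Half-range, in years, of an int64 counter at resolution *res*."""
    return 2**63 * UNIT_SECONDS[res] / SECONDS_PER_YEAR

# At nanosecond resolution the counter covers roughly +-292 years,
# which is why the table bottoms out at [1678 AD, 2262 AD].
print("%.1f" % span_years("ns"))
```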
+
+
+Example of use
+==============
+
+Here is an example of use of ``datetime64``::
+
+  In [10]: t = numpy.zeros(5, dtype="datetime64[ms]")
+
+  In [11]: t[0] = datetime.datetime.now()  # setter in action
+
+  In [12]: t[0]
+  Out[12]: '2008-07-16T13:39:25.315'  # representation in ISO 8601 format
+
+  In [13]: print t
+  [2008-07-16T13:39:25.315  1970-01-01T00:00:00.0
+   1970-01-01T00:00:00.0  1970-01-01T00:00:00.0  1970-01-01T00:00:00.0]
+
+  In [14]: t[0].item()  # getter in action
+  Out[14]: datetime.datetime(2008, 7, 16, 13, 39, 25, 315000)
+
+  In [15]: print t.dtype
+  datetime64[ms]
+
+And here is an example of use of ``timedelta64``::
+
+  In [8]: t1 = numpy.zeros(5, dtype="datetime64[s]")
+
+  In [9]: t2 = numpy.ones(5, dtype="datetime64[s]")
+
+  In [10]: t = t2 - t1
+
+  In [11]: t[0] = 24  # setter in action (setting to 24 seconds)
+
+  In [12]: t[0]
+  Out[12]: 24  # representation as an int64
+
+  In [13]: print t
+  [24  1  1  1  1]
+
+  In [14]: t[0].item()  # getter in action
+  Out[14]: datetime.timedelta(0, 24)
+
+  In [15]: print t.dtype
+  timedelta64[s]
+
+
+Operating with date/time arrays
+===============================
+
+``datetime64`` vs ``datetime64``
+--------------------------------
+
+The only operation allowed between absolute dates is subtraction::
+
+  In [10]: numpy.ones(5, "T8") - numpy.zeros(5, "T8")
+  Out[10]: array([1, 1, 1, 1, 1], dtype=timedelta64[us])
+
+But not other operations::
+
+  In [11]: numpy.ones(5, "T8") + numpy.zeros(5, "T8")
+  TypeError: unsupported operand type(s) for +: 'numpy.ndarray' and 'numpy.ndarray'
+
+``datetime64`` vs ``timedelta64``
+---------------------------------
+
+It will be possible to add and subtract relative times from absolute
+dates::
+
+  In [10]: numpy.zeros(5, "T8[Y]") + numpy.ones(5, "t8[Y]")
+  Out[10]: array([1971, 1971, 1971, 1971, 1971], dtype=datetime64[Y])
+
+  In [11]: numpy.ones(5, "T8[Y]") - 2 * numpy.ones(5, "t8[Y]")
+  Out[11]: array([1969, 1969, 1969, 1969, 1969], dtype=datetime64[Y])
+
+But not other operations::
+
+  In [12]: numpy.ones(5, "T8[Y]") * numpy.ones(5, "t8[Y]")
+  TypeError: unsupported operand type(s) for *: 'numpy.ndarray' and 'numpy.ndarray'
+
+``timedelta64`` vs anything
+---------------------------
+
+Finally, it will be possible to operate with relative times as if
+they were regular int64 dtypes *as long as* the result can be
+converted back into a ``timedelta64``::
+
+  In [10]: numpy.ones(5, 't8')
+  Out[10]: array([1, 1, 1, 1, 1], dtype=timedelta64[us])
+
+  In [11]: (numpy.ones(5, 't8[M]') + 2) ** 3
+  Out[11]: array([27, 27, 27, 27, 27], dtype=timedelta64[M])
+
+But::
+
+  In [12]: numpy.ones(5, 't8') + 1j
+  TypeError: The result cannot be converted into a ``timedelta64``
+
+
+dtype/resolution conversions
+============================
+
+For changing the date/time dtype of an existing array, we propose to
+use the ``.astype()`` method.  This will be mainly useful for
+changing resolutions.
+
+For example, for absolute dates::
+
+  In[10]: t1 = numpy.zeros(5, dtype="datetime64[s]")
+
+  In[11]: print t1
+  [1970-01-01T00:00:00  1970-01-01T00:00:00  1970-01-01T00:00:00
+   1970-01-01T00:00:00  1970-01-01T00:00:00]
+
+  In[12]: print t1.astype('datetime64[d]')
+  [1970-01-01  1970-01-01  1970-01-01  1970-01-01  1970-01-01]
+
+For relative times::
+
+  In[10]: t1 = numpy.ones(5, dtype="timedelta64[s]")
+
+  In[11]: print t1
+  [1 1 1 1 1]
+
+  In[12]: print t1.astype('timedelta64[ms]')
+  [1000 1000 1000 1000 1000]
+
+Changing directly from/to relative to/from absolute dtypes will not
+be supported::
+
+  In[13]: numpy.zeros(5, dtype="datetime64[s]").astype('timedelta64')
+  TypeError: data type cannot be converted to the desired type
+
+
+Final considerations
+====================
+
+Why the ``origin`` metadata disappeared
+---------------------------------------
+
+During the discussion of the date/time dtypes in the NumPy list, the
+idea of having an ``origin`` metadata that complemented the definition
+of the absolute ``datetime64`` was initially
found to be useful.
+
+However, after thinking more about this, Ivan and I found that the
+combination of an absolute ``datetime64`` with a relative
+``timedelta64`` offers the same functionality while removing the
+need for the additional ``origin`` metadata.  This is why we have
+removed it from this proposal.
+
+
+Resolution and dtype issues
+---------------------------
+
+The date/time dtype's resolution metadata cannot be used in general
+as part of typical dtype usage.  For example, in::
+
+  numpy.zeros(5, dtype=numpy.datetime64)
+
+we have yet to find a sensible way to pass the resolution.  Perhaps
+the following would work::
+
+  numpy.zeros(5, dtype=numpy.datetime64(res='Y'))
+
+but we are not sure whether this would collide with the spirit of the
+NumPy dtypes.
+
+At any rate, one can always do::
+
+  numpy.zeros(5, dtype=numpy.dtype('datetime64', res='Y'))
+
+BTW, prior to all of this, one should also elucidate whether::
+
+  numpy.dtype('datetime64', res='Y')
+
+or::
+
+  numpy.dtype('datetime64[Y]')
+  numpy.dtype('T8[Y]')
+
+would be a consistent way to instantiate a dtype in NumPy.  We do
+really think that could be a good way, but we would need to hear the
+opinion of the expert.  Travis?
+
+
+
+.. [1] http://docs.python.org/lib/module-datetime.html
+.. [2] http://www.egenix.com/products/python/mxBase/mxDateTime
+.. [3] http://en.wikipedia.org/wiki/Unix_time
+
+
+.. Local Variables:
+.. mode: rst
+.. coding: utf-8
+.. fill-column: 72
+.. End:
diff --git a/doc/neps/npy-format.txt b/doc/neps/npy-format.txt
new file mode 100644
index 000000000..836468096
--- /dev/null
+++ b/doc/neps/npy-format.txt
@@ -0,0 +1,294 @@
+Title: A Simple File Format for NumPy Arrays
+Discussions-To: numpy-discussion@mail.scipy.org
+Version: $Revision$
+Last-Modified: $Date$
+Author: Robert Kern <robert.kern@gmail.com>
+Status: Draft
+Type: Standards Track
+Content-Type: text/plain
+Created: 20-Dec-2007
+
+
+Abstract
+
+    We propose a standard binary file format (NPY) for persisting
+    a single arbitrary NumPy array on disk.  The format stores all of
+    the shape and dtype information necessary to reconstruct the array
+    correctly even on another machine with a different architecture.
+    The format is designed to be as simple as possible while achieving
+    its limited goals.  The implementation is intended to be pure
+    Python and distributed as part of the main numpy package.
+
+
+Rationale
+
+    A lightweight, omnipresent system for saving NumPy arrays to disk
+    is a frequent need.  Python in general has pickle [1] for saving
+    most Python objects to disk.  This often works well enough with
+    NumPy arrays for many purposes, but it has a few drawbacks:
+
+    - Dumping or loading a pickle file requires the duplication of the
+      data in memory.  For large arrays, this can be a showstopper.
+
+    - The array data is not directly accessible through
+      memory-mapping.  Now that numpy has that capability, it has
+      proved very useful for loading large amounts of data (or more to
+      the point: avoiding loading large amounts of data when you only
+      need a small part).
+
+    Both of these problems can be addressed by dumping the raw bytes
+    to disk using ndarray.tofile() and numpy.fromfile().  However,
+    these have their own problems:
+
+    - The data which is written has no information about the shape or
+      dtype of the array.
+
+    - It is incapable of handling object arrays.
+
+    The NPY file format is an evolutionary advance over these two
+    approaches.
Its design is mostly limited to solving the problems + with pickles and tofile()/fromfile(). It does not intend to solve + more complicated problems for which more complicated formats like + HDF5 [2] are a better solution. + + +Use Cases + + - Neville Newbie has just started to pick up Python and NumPy. He + has not installed many packages, yet, nor learned the standard + library, but he has been playing with NumPy at the interactive + prompt to do small tasks. He gets a result that he wants to + save. + + - Annie Analyst has been using large nested record arrays to + represent her statistical data. She wants to convince her + R-using colleague, David Doubter, that Python and NumPy are + awesome by sending him her analysis code and data. She needs + the data to load at interactive speeds. Since David does not + use Python usually, needing to install large packages would turn + him off. + + - Simon Seismologist is developing new seismic processing tools. + One of his algorithms requires large amounts of intermediate + data to be written to disk. The data does not really fit into + the industry-standard SEG-Y schema, but he already has a nice + record-array dtype for using it internally. + + - Polly Parallel wants to split up a computation on her multicore + machine as simply as possible. Parts of the computation can be + split up among different processes without any communication + between processes; they just need to fill in the appropriate + portion of a large array with their results. Having several + child processes memory-mapping a common array is a good way to + achieve this. + + +Requirements + + The format MUST be able to: + + - Represent all NumPy arrays including nested record + arrays and object arrays. + + - Represent the data in its native binary form. + + - Be contained in a single file. + + - Support Fortran-contiguous arrays directly. 
+
+    - Store all of the necessary information to reconstruct the array
+      including shape and dtype on a machine of a different
+      architecture.  Both little-endian and big-endian arrays must be
+      supported, and a file with little-endian numbers will yield
+      a little-endian array on any machine reading the file.  The
+      types must be described in terms of their actual sizes.  For
+      example, if a machine with a 64-bit C "long int" writes out an
+      array with "long ints", a reading machine with 32-bit C "long
+      ints" will yield an array with 64-bit integers.
+
+    - Be reverse engineered.  Datasets often live longer than the
+      programs that created them.  A competent developer should be
+      able to create a solution in his preferred programming language
+      to read most NPY files that he has been given without much
+      documentation.
+
+    - Allow memory-mapping of the data.
+
+    - Be read from a filelike stream object instead of an actual file.
+      This allows the implementation to be tested easily and makes the
+      system more flexible.  NPY files can be stored in ZIP files and
+      easily read from a ZipFile object.
+
+    - Store object arrays.  Since general Python objects are
+      complicated and can only be reliably serialized by pickle (if at
+      all), many of the other requirements are waived for files
+      containing object arrays.  Files with object arrays do not have
+      to be mmapable since that would be technically impossible.  We
+      cannot expect the pickle format to be reverse engineered without
+      knowledge of pickle.  However, one should at least be able to
+      read and write object arrays with the same generic interface as
+      other arrays.
+
+    - Be read and written using APIs provided in the numpy package
+      itself without any other libraries.  The implementation inside
+      numpy may be in C if necessary.
+
+    The format explicitly *does not* need to:
+
+    - Support multiple arrays in a file.
Since we require filelike + objects to be supported, one could use the API to build an ad + hoc format that supported multiple arrays. However, solving the + general problem and use cases is beyond the scope of the format + and the API for numpy. + + - Fully handle arbitrary subclasses of numpy.ndarray. Subclasses + will be accepted for writing, but only the array data will be + written out. A regular numpy.ndarray object will be created + upon reading the file. The API can be used to build a format + for a particular subclass, but that is out of scope for the + general NPY format. + + +Format Specification: Version 1.0 + + The first 6 bytes are a magic string: exactly "\x93NUMPY". + + The next 1 byte is an unsigned byte: the major version number of + the file format, e.g. \x01. + + The next 1 byte is an unsigned byte: the minor version number of + the file format, e.g. \x00. Note: the version of the file format + is not tied to the version of the numpy package. + + The next 2 bytes form a little-endian unsigned short int: the + length of the header data HEADER_LEN. + + The next HEADER_LEN bytes form the header data describing the + array's format. It is an ASCII string which contains a Python + literal expression of a dictionary. It is terminated by a newline + ('\n') and padded with spaces ('\x20') to make the total length of + the magic string + 4 + HEADER_LEN be evenly divisible by 16 for + alignment purposes. + + The dictionary contains three keys: + + "descr" : dtype.descr + An object that can be passed as an argument to the + numpy.dtype() constructor to create the array's dtype. + + "fortran_order" : bool + Whether the array data is Fortran-contiguous or not. + Since Fortran-contiguous arrays are a common form of + non-C-contiguity, we allow them to be written directly to + disk for efficiency. + + "shape" : tuple of int + The shape of the array. 
+ + For repeatability and readability, this dictionary is formatted + using pprint.pformat() so the keys are in alphabetic order. + + Following the header comes the array data. If the dtype contains + Python objects (i.e. dtype.hasobject is True), then the data is + a Python pickle of the array. Otherwise the data is the + contiguous (either C- or Fortran-, depending on fortran_order) + bytes of the array. Consumers can figure out the number of bytes + by multiplying the number of elements given by the shape (noting + that shape=() means there is 1 element) by dtype.itemsize. + + +Conventions + + We recommend using the ".npy" extension for files following this + format. This is by no means a requirement; applications may wish + to use this file format but use an extension specific to the + application. In the absence of an obvious alternative, however, + we suggest using ".npy". + + For a simple way to combine multiple arrays into a single file, + one can use ZipFile to contain multiple ".npy" files. We + recommend using the file extension ".npz" for these archives. + + +Alternatives + + The author believes that this system (or one along these lines) is + about the simplest system that satisfies all of the requirements. + However, one must always be wary of introducing a new binary + format to the world. + + HDF5 [2] is a very flexible format that should be able to + represent all of NumPy's arrays in some fashion. It is probably + the only widely-used format that can faithfully represent all of + NumPy's array features. It has seen substantial adoption by the + scientific community in general and the NumPy community in + particular. It is an excellent solution for a wide variety of + array storage problems with or without NumPy. + + HDF5 is a complicated format that more or less implements + a hierarchical filesystem-in-a-file. This fact makes satisfying + some of the Requirements difficult. 
To the author's knowledge, as + of this writing, there is no application or library that reads or + writes even a subset of HDF5 files that does not use the canonical + libhdf5 implementation. This implementation is a large library + that is not always easy to build. It would be infeasible to + include it in numpy. + + It might be feasible to target an extremely limited subset of + HDF5. Namely, there would be only one object in it: the array. + Using contiguous storage for the data, one should be able to + implement just enough of the format to provide the same metadata + that the proposed format does. One could still meet all of the + technical requirements like mmapability. + + We would accrue a substantial benefit by being able to generate + files that could be read by other HDF5 software. Furthermore, by + providing the first non-libhdf5 implementation of HDF5, we would + be able to encourage more adoption of simple HDF5 in applications + where it was previously infeasible because of the size of the + library. The basic work may encourage similar dead-simple + implementations in other languages and further expand the + community. + + The remaining concern is about reverse engineerability of the + format. Even the simple subset of HDF5 would be very difficult to + reverse engineer given just a file by itself. However, given the + prominence of HDF5, this might not be a substantial concern. + + In conclusion, we are going forward with the design laid out in + this document. If someone writes code to handle the simple subset + of HDF5 that would be useful to us, we may consider a revision of + the file format. + + +Implementation + + The current implementation is in the trunk of the numpy SVN + repository and will be part of the 1.0.5 release. + + http://svn.scipy.org/svn/numpy/trunk + + Specifically, the file format.py in this directory implements the + format as described here. 
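[Editorial note: the version 1.0 header layout specified above is simple enough to sketch directly.  The snippet below builds a header by hand and parses it back; it is a stand-in for illustration, not numpy's format.py, and it makes simplifying assumptions (ASCII-only header, eval() of a trusted dictionary literal).]

```python
# Build and parse the NPY version 1.0 header described in the spec:
# magic (6 bytes) + major/minor version (2 bytes) + little-endian
# unsigned short HEADER_LEN (2 bytes) + dict literal padded to a
# 16-byte boundary and terminated by a newline.
import struct

MAGIC = b"\x93NUMPY"

def build_header(descr, fortran_order, shape):
    body = ("{'descr': %r, 'fortran_order': %r, 'shape': %r}"
            % (descr, fortran_order, shape)).encode("ascii")
    # pad with spaces so magic(6) + version(2) + HEADER_LEN(2) + body
    # + newline is evenly divisible by 16
    pad = (16 - (6 + 2 + 2 + len(body) + 1) % 16) % 16
    body += b" " * pad + b"\n"
    return MAGIC + bytes([1, 0]) + struct.pack("<H", len(body)) + body

def parse_header(data):
    assert data[:6] == MAGIC
    version = (data[6], data[7])
    (hlen,) = struct.unpack("<H", data[8:10])
    # the header is a Python dict literal; a trusted source is assumed
    return version, eval(data[10:10 + hlen].decode("ascii"))

raw = build_header("<f8", False, (3, 4))
version, info = parse_header(raw)
print(version, info["shape"])  # (1, 0) (3, 4)
```

A consumer in any language only needs the fixed 10-byte prefix and a small literal parser, which is what makes the format easy to reverse engineer.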
+
+
+References
+
+    [1] http://docs.python.org/lib/module-pickle.html
+
+    [2] http://hdf.ncsa.uiuc.edu/products/hdf5/index.html
+
+
+Copyright
+
+    This document has been placed in the public domain.
+
+
+
+Local Variables:
+mode: indented-text
+indent-tabs-mode: nil
+sentence-end-double-space: t
+fill-column: 70
+coding: utf-8
+End:
diff --git a/doc/neps/pep_buffer.txt b/doc/neps/pep_buffer.txt
new file mode 100644
index 000000000..a154d2792
--- /dev/null
+++ b/doc/neps/pep_buffer.txt
@@ -0,0 +1,869 @@
+:PEP: 3118
+:Title: Revising the buffer protocol
+:Version: $Revision$
+:Last-Modified: $Date$
+:Authors: Travis Oliphant <oliphant@ee.byu.edu>, Carl Banks <pythondev@aerojockey.com>
+:Status: Draft
+:Type: Standards Track
+:Content-Type: text/x-rst
+:Created: 28-Aug-2006
+:Python-Version: 3000
+
+Abstract
+========
+
+This PEP proposes re-designing the buffer interface (PyBufferProcs
+function pointers) to improve the way Python allows memory sharing
+in Python 3.0.
+
+In particular, it is proposed that the character buffer portion
+of the API be eliminated and the multiple-segment portion be
+re-designed in conjunction with allowing for strided memory
+to be shared.  In addition, the new buffer interface will
+allow the sharing of any multi-dimensional nature of the
+memory and what data-format the memory contains.
+
+This interface will allow any extension module to either
+create objects that share memory or create algorithms that
+use and manipulate raw memory from arbitrary objects that
+export the interface.
+
+
+Rationale
+=========
+
+The Python 2.X buffer protocol allows different Python types to
+exchange a pointer to a sequence of internal buffers.  This
+functionality is *extremely* useful for sharing large segments of
+memory between different high-level objects, but it is too limited and
+has issues:
+
+1. There is the little-used "sequence-of-segments" option
+   (bf_getsegcount) that is not well motivated.
+
+2. There is the apparently redundant character-buffer option
+   (bf_getcharbuffer).
+
+3. There is no way for a consumer to tell the buffer-API-exporting
+   object it is "finished" with its view of the memory and
+   therefore no way for the exporting object to be sure that it is
+   safe to reallocate the pointer to the memory that it owns (for
+   example, the array object reallocating its memory after sharing
+   it with the buffer object which held the original pointer led
+   to the infamous buffer-object problem).
+
+4. Memory is just a pointer with a length.  There is no way to
+   describe what is "in" the memory (float, int, C-structure, etc.)
+
+5. There is no shape information provided for the memory.  But,
+   several array-like Python types could make use of a standard
+   way to describe the shape-interpretation of the memory
+   (wxPython, GTK, pyQT, CVXOPT, PyVox, Audio and Video
+   Libraries, ctypes, NumPy, data-base interfaces, etc.)
+
+6. There is no way to share discontiguous memory (except through
+   the sequence of segments notion).
+
+   There are two widely used libraries that use the concept of
+   discontiguous memory: PIL and NumPy.  Their view of discontiguous
+   arrays is different, though.  The proposed buffer interface allows
+   sharing of either memory model.  Exporters will use only one
+   approach and consumers may choose to support discontiguous
+   arrays of each type however they choose.
+
+   NumPy uses the notion of constant striding in each dimension as its
+   basic concept of an array.  With this concept, a simple sub-region
+   of a larger array can be described without copying the data.
+   Thus, stride information is the additional information that must be
+   shared.
+
+   The PIL uses a more opaque memory representation.  Sometimes an
+   image is contained in a contiguous segment of memory, but sometimes
+   it is contained in an array of pointers to the contiguous segments
+   (usually lines) of the image.
The PIL is where the idea of multiple
+   buffer segments in the original buffer interface came from.
+
+   NumPy's strided memory model is used more often in computational
+   libraries and because it is so simple it makes sense to support
+   memory sharing using this model.  The PIL memory model is sometimes
+   used in C-code where a 2-d array can then be accessed using double
+   pointer indirection: e.g. image[i][j].
+
+   The buffer interface should allow the object to export either of
+   these memory models.  Consumers are free to either require
+   contiguous memory or write code to handle one or both of these
+   memory models.
+
+Proposal Overview
+=================
+
+* Eliminate the char-buffer and multiple-segment sections of the
+  buffer-protocol.
+
+* Unify the read/write versions of getting the buffer.
+
+* Add a new function to the interface that should be called when
+  the consumer object is "done" with the memory area.
+
+* Add a new variable to allow the interface to describe what is in
+  memory (unifying what is currently done now in struct and
+  array).
+
+* Add a new variable to allow the protocol to share shape
+  information.
+
+* Add a new variable for sharing stride information.
+
+* Add a new mechanism for sharing arrays that must
+  be accessed using pointer indirection.
+
+* Fix all objects in the core and the standard library to conform
+  to the new interface.
+
+* Extend the struct module to handle more format specifiers.
+
+* Extend the buffer object into a new memory object which places
+  a Python veneer around the buffer interface.
+
+* Add a few functions to make it easy to copy contiguous data
+  in and out of objects supporting the buffer interface.
+
+Specification
+=============
+
+While the new specification allows for complicated memory sharing,
+simple contiguous buffers of bytes can still be obtained from an
+object.
In fact, the new protocol allows a standard mechanism for +doing this even if the original object is not represented as a +contiguous chunk of memory. + +The easiest way to obtain a simple contiguous chunk of memory is +to use the provided C-API to obtain a chunk of memory. + + +Change the PyBufferProcs structure to + +:: + + typedef struct { + getbufferproc bf_getbuffer; + releasebufferproc bf_releasebuffer; + } + + +:: + + typedef int (*getbufferproc)(PyObject *obj, PyBuffer *view, int flags) + +This function returns 0 on success and -1 on failure (and raises an +error). The first variable is the "exporting" object. The second +argument is the address to a bufferinfo structure. If view is NULL, +then no information is returned but a lock on the memory is still +obtained. In this case, the corresponding releasebuffer should also +be called with NULL. + +The third argument indicates what kind of buffer the exporter is +allowed to return. It essentially tells the exporter what kind of +memory area the consumer can deal with. It also indicates what +members of the PyBuffer structure the consumer is going to care about. + +The exporter can use this information to simplify how much of the PyBuffer +structure is filled in and/or raise an error if the object can't support +a simpler view of its memory. + +Thus, the caller can request a simple "view" and either receive it or +have an error raised if it is not possible. + +All of the following assume that at least buf, len, and readonly +will always be utilized by the caller. + +Py_BUF_SIMPLE + + The returned buffer will be assumed to be readable (the object may + or may not have writeable memory). Only the buf, len, and readonly + variables may be accessed. The format will be assumed to be + unsigned bytes . This is a "stand-alone" flag constant. It never + needs to be \|'d to the others. The exporter will raise an + error if it cannot provide such a contiguous buffer. 
+
+Py_BUF_WRITEABLE
+
+   The returned buffer must be writeable.  If it is not writeable,
+   then raise an error.
+
+Py_BUF_READONLY
+
+   The returned buffer must be readonly.  If the object is already
+   read-only or it can make its memory read-only (and there are no
+   other views on the object) then it should do so and return the
+   buffer information.  If the object does not have read-only memory
+   (or cannot make it read-only), then an error should be raised.
+
+Py_BUF_FORMAT
+
+   The returned buffer must have true format information.  This would
+   be used when the consumer is going to be checking for what 'kind'
+   of data is actually stored.  An exporter should always be able
+   to provide this information if requested.
+
+Py_BUF_SHAPE
+
+   The returned buffer must have shape information.  The memory will
+   be assumed C-style contiguous (last dimension varies the fastest).
+   The exporter may raise an error if it cannot provide this kind
+   of contiguous buffer.
+
+Py_BUF_STRIDES (implies Py_BUF_SHAPE)
+
+   The returned buffer must have strides information.  This would be
+   used when the consumer can handle strided, discontiguous arrays.
+   Handling strides automatically assumes you can handle shape.
+   The exporter may raise an error if it cannot provide a
+   strided-only representation of the data (i.e. without the
+   suboffsets).
+
+Py_BUF_OFFSETS (implies Py_BUF_STRIDES)
+
+   The returned buffer must have suboffsets information.  This would
+   be used when the consumer can handle indirect array referencing
+   implied by these suboffsets.
+
+Py_BUF_FULL (Py_BUF_OFFSETS | Py_BUF_WRITEABLE | Py_BUF_FORMAT)
+
+Thus, the consumer simply wanting a contiguous chunk of bytes from
+the object would use Py_BUF_SIMPLE, while a consumer that understands
+how to make use of the most complicated cases could use
+Py_BUF_OFFSETS.
+
+If format information is going to be probed, then Py_BUF_FORMAT must
+be \|'d to the flags, otherwise the consumer assumes it is unsigned
+bytes.
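[Editorial note: the implication chain among the flags above (STRIDES implies SHAPE, OFFSETS implies STRIDES) can be modeled as bit masks that include their prerequisites.  The numeric values below are inventions for illustration; only the implication structure comes from the text.]

```python
# Toy model of the proposed flag hierarchy, not CPython's actual constants.
PY_BUF_SIMPLE    = 0
PY_BUF_WRITEABLE = 0x0001
PY_BUF_FORMAT    = 0x0004
PY_BUF_SHAPE     = 0x0010
PY_BUF_STRIDES   = 0x0020 | PY_BUF_SHAPE    # strides implies shape
PY_BUF_OFFSETS   = 0x0040 | PY_BUF_STRIDES  # suboffsets implies strides
PY_BUF_FULL      = PY_BUF_OFFSETS | PY_BUF_WRITEABLE | PY_BUF_FORMAT

def wants(flags, capability):
    """True if the consumer's *flags* request includes *capability*."""
    return flags & capability == capability

# A consumer asking for suboffsets implicitly asks for shape as well:
print(wants(PY_BUF_OFFSETS, PY_BUF_SHAPE))  # True
```

An exporter would inspect the request this way and fill in only the members of the structure the consumer cares about, raising an error for capabilities it cannot provide.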
There is a C-API that simple exporting objects can use to fill in the
buffer info structure correctly according to the provided flags if a
contiguous chunk of "unsigned bytes" is all that can be exported.


The bufferinfo structure is::

    typedef struct bufferinfo {
         void *buf;
         Py_ssize_t len;
         int readonly;
         const char *format;
         int ndims;
         Py_ssize_t *shape;
         Py_ssize_t *strides;
         Py_ssize_t *suboffsets;
         int itemsize;
         void *internal;
    } PyBuffer;

Before calling getbufferproc, the bufferinfo structure does not need
to be initialized.  Upon return from getbufferproc, the bufferinfo
structure is filled in with relevant information about the buffer.
This same bufferinfo structure must be passed to bf_releasebuffer (if
available) when the consumer is done with the memory.  The caller is
responsible for keeping a reference to obj until releasebuffer is
called (i.e. this call does not alter the reference count of obj).

The members of the bufferinfo structure are:

buf
    a pointer to the start of the memory for the object

len
    the total bytes of memory the object uses.  This should be the
    same as the product of the shape array multiplied by the number of
    bytes per item of memory.

readonly
    an integer variable to hold whether or not the memory is
    readonly.  1 means the memory is readonly, zero means the
    memory is writeable.

format
    a NULL-terminated format-string (following the struct-style syntax
    including extensions) indicating what is in each element of
    memory.  The number of elements is len / itemsize, where itemsize
    is the number of bytes implied by the format.  For standard
    unsigned bytes use a format string of "B".

ndims
    a variable storing the number of dimensions the memory represents.
    Must be >= 0.

shape
    an array of ``Py_ssize_t`` of length ``ndims`` indicating the
    shape of the memory as an N-D array.  Note that
    ``shape[0] * ... * shape[ndims-1] * itemsize == len``.
If ndims is 0 (indicating a scalar), then this must be NULL.

strides
    an array of ``Py_ssize_t`` of length ``ndims`` (or NULL if ndims
    is 0) indicating the number of bytes to skip to get to the next
    element in each dimension.  If this member is not requested by the
    caller (Py_BUF_STRIDES is not set), then it will not be used and
    the consumer is assuming the array is C-style contiguous.  If this
    is not the case, then an error should be raised.  If this member
    is requested by the caller (Py_BUF_STRIDES is set), then it must
    be filled in.

suboffsets
    an array of ``Py_ssize_t`` of length ``ndims``.  If a suboffset
    number is >= 0, then the value stored along the indicated
    dimension is a pointer and the suboffset value dictates how many
    bytes to add to the pointer after de-referencing.  A suboffset
    value that is negative indicates that no de-referencing should
    occur (striding in a contiguous memory block).  If all suboffsets
    are negative (i.e. no de-referencing is needed), then this must be
    NULL.

    For clarity, here is a function that returns a pointer to the
    element in an N-D array pointed to by an N-dimensional index when
    there are both strides and suboffsets::

      void *get_item_pointer(int ndim, void *buf, Py_ssize_t *strides,
                             Py_ssize_t *suboffsets, Py_ssize_t *indices) {
          char *pointer = (char*)buf;
          int i;
          for (i = 0; i < ndim; i++) {
              pointer += strides[i] * indices[i];
              if (suboffsets[i] >= 0) {
                  pointer = *((char**)pointer) + suboffsets[i];
              }
          }
          return (void*)pointer;
      }

    Notice the suboffset is added "after" the dereferencing occurs.
    Thus slicing in the ith dimension would add to the suboffsets in
    the (i-1)st dimension.  Slicing in the first dimension would
    change the location of the starting pointer directly (i.e.
buf would be modified).

itemsize
    This is storage for the itemsize of each element of the shared
    memory.  It can be obtained using PyBuffer_SizeFromFormat, but an
    exporter may know it without making this call, and thus storing it
    is more convenient and faster.

internal
    This is for use internally by the exporting object.  For example,
    this might be re-cast as an integer by the exporter and used to
    store flags about whether or not the shape, strides, and
    suboffsets arrays must be freed when the buffer is released.  The
    consumer should never touch this value.


The exporter is responsible for making sure the memory pointed to by
buf, format, shape, strides, and suboffsets is valid until
releasebuffer is called.  If the exporter wants to be able to change
shape, strides, and/or suboffsets before releasebuffer is called then
it should allocate those arrays when getbuffer is called (pointing to
them in the buffer-info structure provided) and free them when
releasebuffer is called.


The same bufferinfo struct should be used in the release-buffer
interface call.  The caller is responsible for the memory of the
bufferinfo structure itself.

``typedef int (*releasebufferproc)(PyObject *obj, PyBuffer *view)``
    Callers of getbufferproc must make sure that this function is
    called when memory previously acquired from the object is no
    longer needed.  The exporter of the interface must make sure that
    any memory pointed to in the bufferinfo structure remains valid
    until releasebuffer is called.

    Both of these routines are optional for a type object.

    If the releasebuffer function is not provided then it does not
    ever need to be called.

Exporters will need to define a releasebuffer function if they can
re-allocate their memory, strides, shape, suboffsets, or format
variables which they might share through the struct bufferinfo.
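The pointer arithmetic in get_item_pointer above can be exercised on
plain memory without any Python headers.  The sketch below is a local
copy of that rule (``ptrdiff_t`` stands in for ``Py_ssize_t``); it
works both for an ordinary strided block (all suboffsets negative) and
for a PIL-style array of separately allocated lines (first suboffset
>= 0):

```c
#include <stddef.h>

/* Local copy of the indexing rule described in the text; ptrdiff_t
   stands in for Py_ssize_t outside of a Python build. */
static void *get_item_pointer(int ndim, void *buf, ptrdiff_t *strides,
                              ptrdiff_t *suboffsets, ptrdiff_t *indices)
{
    char *pointer = (char *)buf;
    int i;
    for (i = 0; i < ndim; i++) {
        pointer += strides[i] * indices[i];
        if (suboffsets[i] >= 0) {
            /* This dimension stores pointers: follow one, then
               advance by the suboffset. */
            pointer = *((char **)pointer) + suboffsets[i];
        }
    }
    return (void *)pointer;
}
```

For a contiguous 2x3 array of int, strides are ``{3*sizeof(int),
sizeof(int)}`` with suboffsets ``{-1, -1}``; for an array of line
pointers, strides are ``{sizeof(int*), sizeof(int)}`` with suboffsets
``{0, -1}`` so the line pointer is de-referenced before striding
within the line.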
Several mechanisms could be used to keep track of how many getbuffer
calls have been made and shared.  Either a single variable could be
used to keep track of how many "views" have been exported, or a
linked-list of bufferinfo structures filled in could be maintained in
each object.

All that is specifically required by the exporter, however, is to
ensure that any memory shared through the bufferinfo structure remains
valid until releasebuffer is called on the bufferinfo structure.


New C-API calls are proposed
============================

::

    int PyObject_CheckBuffer(PyObject *obj)

Return 1 if the getbuffer function is available, otherwise 0.

::

    int PyObject_GetBuffer(PyObject *obj, PyBuffer *view,
                           int flags)

This is a C-API version of the getbuffer function call.  It checks to
make sure object has the required function pointer and issues the
call.  Returns -1 and raises an error on failure and returns 0 on
success.

::

    int PyObject_ReleaseBuffer(PyObject *obj, PyBuffer *view)

This is a C-API version of the releasebuffer function call.  It checks
to make sure the object has the required function pointer and issues
the call.  Returns 0 on success and -1 (with an error raised) on
failure.  This function always succeeds if there is no releasebuffer
function for the object.

::

    PyObject *PyObject_GetMemoryView(PyObject *obj)

Return a memory-view object from an object that defines the buffer
interface.

A memory-view object is an extended buffer object that could replace
the buffer object (but doesn't have to).  Its C-structure is

::

    typedef struct {
        PyObject_HEAD
        PyObject *base;
        int ndims;
        Py_ssize_t *starts;  /* slice starts */
        Py_ssize_t *stops;   /* slice stops */
        Py_ssize_t *steps;   /* slice steps */
    } PyMemoryViewObject;

This is functionally similar to the current buffer object except only
a reference to base is kept.
The actual memory for base must be re-grabbed using the
buffer-protocol, whenever it is needed.

The getbuffer and releasebuffer for this object use the underlying
base object (adjusted using the slice information).  If the number of
dimensions of the base object (or the strides or the size) has changed
when a new view is requested, then the getbuffer will trigger an
error.

This memory-view object will support multi-dimensional slicing.
Slices of the memory-view object are other memory-view objects.  When
an "element" from the memory-view is returned, it is always a tuple of
a bytes object and a format string, which can then be interpreted
using the struct module if desired.

::

    int PyBuffer_SizeFromFormat(const char *)

Return the implied itemsize of the data-format area from a
struct-style description.

::

    int PyObject_GetContiguous(PyObject *obj, void **buf, Py_ssize_t *len,
                               char **format, char fortran)

Return a contiguous chunk of memory representing the buffer.  If a
copy is made then return 1.  If no copy was needed return 0.  If an
error occurred in probing the buffer interface, then return -1.  The
contiguous chunk of memory is pointed to by ``*buf`` and the length of
that memory is ``*len``.  If the object is multi-dimensional, then if
fortran is 'F', the first dimension of the underlying array will vary
the fastest in the buffer.  If fortran is 'C', then the last dimension
will vary the fastest (C-style contiguous).  If fortran is 'A', then
it does not matter and you will get whatever the object decides is
more efficient.

::

    int PyObject_CopyToObject(PyObject *obj, void *buf, Py_ssize_t len,
                              char fortran)

Copy ``len`` bytes of data pointed to by the contiguous chunk of
memory pointed to by ``buf`` into the buffer exported by obj.  Return
0 on success and return -1 and raise an error on failure.  If the
object does not have a writeable buffer, then an error is raised.
If fortran is 'F', then if the object is multi-dimensional, the data
will be copied into the array in Fortran-style (first dimension varies
the fastest).  If fortran is 'C', then the data will be copied into
the array in C-style (last dimension varies the fastest).  If fortran
is 'A', then it does not matter and the copy will be made in whatever
way is more efficient.

::

    void PyBuffer_FreeMem(void *buf)

This function frees the memory returned by PyObject_GetContiguous if a
copy was made.  Do not call this function unless
PyObject_GetContiguous returns 1, indicating that new memory was
created.


These last three C-API calls allow a standard way of getting data in
and out of Python objects into contiguous memory areas no matter how
it is actually stored.  These calls use the extended buffer interface
to perform their work.

::

    int PyBuffer_IsContiguous(PyBuffer *view, char fortran);

Return 1 if the memory defined by the view object is C-style (fortran
= 'C') or Fortran-style (fortran = 'F') contiguous.  Return 0
otherwise.

::

    void PyBuffer_FillContiguousStrides(int ndims, Py_ssize_t *shape,
                                        int itemsize,
                                        Py_ssize_t *strides, char fortran)

Fill the strides array with the byte-strides of a contiguous (C-style
if fortran is 'C' or Fortran-style if fortran is 'F') array of the
given shape with the given number of bytes per element.

::

    int PyBuffer_FillInfo(PyBuffer *view, void *buf,
                          Py_ssize_t len, int readonly, int infoflags)

Fills in a buffer-info structure correctly for an exporter that can
only share a contiguous chunk of memory of "unsigned bytes" of the
given length.  Returns 0 on success and -1 (with an error raised) on
failure.


Additions to the struct string-syntax
=====================================

The struct string-syntax is missing some characters to fully
implement data-format descriptions already available elsewhere (in
ctypes and NumPy for example).
The Python 2.5 specification is at
http://docs.python.org/lib/module-struct.html

Here are the proposed additions:


================ ===========
Character        Description
================ ===========
't'              bit (number before states how many bits)
'?'              platform _Bool type
'g'              long double
'c'              ucs-1 (latin-1) encoding
'u'              ucs-2
'w'              ucs-4
'O'              pointer to Python Object
'Z'              complex (whatever the next specifier is)
'&'              specific pointer (prefix before another character)
'T{}'            structure (detailed layout inside {})
'(k1,k2,...,kn)' multi-dimensional array of whatever follows
':name:'         optional name of the preceding element
'X{}'            pointer to a function (optional function
                 signature inside {})
' \n\t'          ignored (allow better readability)
                 -- this may already be true
================ ===========

The struct module will be changed to understand these as well and
return appropriate Python objects on unpacking.  Unpacking a
long-double will return a decimal object or a ctypes long-double.
Unpacking 'u' or 'w' will return Python unicode.  Unpacking a
multi-dimensional array will return a list (of lists if >1d).
Unpacking a pointer will return a ctypes pointer object.  Unpacking a
function pointer will return a ctypes call-object (perhaps).
Unpacking a bit will return a Python Bool.  White-space in the
struct-string syntax will be ignored if it isn't already.  Unpacking a
named object will return some kind of named-tuple-like object that
acts like a tuple but whose entries can also be accessed by name.
Unpacking a nested structure will return a nested tuple.

Endian-specification ('!', '@', '=', '>', '<', '^') is also allowed
inside the string so that it can change if needed.  The
previously-specified endian string is in force until changed.  The
default endian is '@', which means native data-types and alignment.
If unaligned, native data-types are requested, then the endian
specification is '^'.
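The '(k1,k2,...,kn)' construct above describes a C-contiguous
subarray (last dimension varies the fastest).  The element count and
byte strides it implies can be computed with a few lines of plain C;
the helper names below are this example's own, not part of the
proposal.  For a shape of 16 by 4 with 8-byte elements (e.g. doubles),
the byte strides come out to 32 and 8 and the block holds 64 elements:

```c
#include <stddef.h>

/* Illustrative helper: byte strides of a C-contiguous array with the
   given shape and itemsize (last dimension varies the fastest). */
static void c_contiguous_strides(int ndims, const ptrdiff_t *shape,
                                 int itemsize, ptrdiff_t *strides)
{
    ptrdiff_t step = itemsize;
    int i;
    for (i = ndims - 1; i >= 0; i--) {
        strides[i] = step;     /* bytes between neighbors in dim i  */
        step *= shape[i];      /* one full slab of dimension i      */
    }
}

/* Illustrative helper: total number of elements in the shape. */
static ptrdiff_t element_count(int ndims, const ptrdiff_t *shape)
{
    ptrdiff_t n = 1;
    int i;
    for (i = 0; i < ndims; i++)
        n *= shape[i];
    return n;
}
```

This is the same computation PyBuffer_FillContiguousStrides (proposed
earlier) is meant to perform for the fortran = 'C' case.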
According to the struct module, a number can precede a character code
to specify how many of that type there are.  The (k1,k2,...,kn)
extension also allows specifying if the data is supposed to be viewed
as a (C-style contiguous, last-dimension varies the fastest)
multi-dimensional array of a particular format.

Functions should be added to ctypes to create a ctypes object from a
struct description, and to add long-double and ucs-2 types to ctypes.

Examples of Data-Format Descriptions
====================================

Here are some examples of C-structures and how they would be
represented using the struct-style syntax.

<named> is the constructor for a named-tuple (not-specified yet).

float
    'f' <--> Python float
complex double
    'Zd' <--> Python complex
RGB Pixel data
    'BBB' <--> (int, int, int)
    'B:r: B:g: B:b:' <--> <named>((int, int, int), ('r','g','b'))

Mixed endian (weird but possible)
    '>i:big: <i:little:' <--> <named>((int, int), ('big', 'little'))

Nested structure
    ::

      struct {
           int ival;
           struct {
               unsigned short sval;
               unsigned char bval;
               unsigned char cval;
           } sub;
      }
      """i:ival:
         T{
            H:sval:
            B:bval:
            B:cval:
         }:sub:
      """

Nested array
    ::

      struct {
           int ival;
           double data[16*4];
      }
      """i:ival:
         (16,4)d:data:
      """


Code to be affected
===================

All objects and modules in Python that export or consume the old
buffer interface will be modified.  Here is a partial list.

* buffer object
* bytes object
* string object
* array module
* struct module
* mmap module
* ctypes module

Anything else using the buffer API.


Issues and Details
==================

It is intended that this PEP will be back-ported to Python 2.6 by
adding the C-API and the two functions to the existing buffer
protocol.
+ +The proposed locking mechanism relies entirely on the exporter object +to not invalidate any of the memory pointed to by the buffer structure +until a corresponding releasebuffer is called. If it wants to be able +to change its own shape and/or strides arrays, then it needs to create +memory for these in the bufferinfo structure and copy information +over. + +The sharing of strided memory and suboffsets is new and can be seen as +a modification of the multiple-segment interface. It is motivated by +NumPy and the PIL. NumPy objects should be able to share their +strided memory with code that understands how to manage strided memory +because strided memory is very common when interfacing with compute +libraries. + +Also, with this approach it should be possible to write generic code +that works with both kinds of memory. + +Memory management of the format string, the shape array, the strides +array, and the suboffsets array in the bufferinfo structure is always +the responsibility of the exporting object. The consumer should not +set these pointers to any other memory or try to free them. + +Several ideas were discussed and rejected: + + Having a "releaser" object whose release-buffer was called. This + was deemed unacceptable because it caused the protocol to be + asymmetric (you called release on something different than you + "got" the buffer from). It also complicated the protocol without + providing a real benefit. + + Passing all the struct variables separately into the function. + This had the advantage that it allowed one to set NULL to + variables that were not of interest, but it also made the function + call more difficult. The flags variable allows the same + ability of consumers to be "simple" in how they call the protocol. + +Code +======== + +The authors of the PEP promise to contribute and maintain the code for +this proposal but will welcome any help. + + + + +Examples +========= + +Ex. 
1
-----------

This example shows how an image object that uses contiguous lines
might expose its buffer.

::

    struct rgba {
        unsigned char r, g, b, a;
    };

    struct ImageObject {
        PyObject_HEAD
        ...
        struct rgba** lines;
        Py_ssize_t height;
        Py_ssize_t width;
        Py_ssize_t shape_array[2];
        Py_ssize_t stride_array[2];
        Py_ssize_t view_count;
    };

"lines" points to a malloced 1-D array of (struct rgba*).  Each
pointer in THAT block points to a separately malloced array of
(struct rgba).

In order to access, say, the red value of the pixel at x=30, y=50,
you'd use "lines[50][30].r".

So what does ImageObject's getbuffer do?  Leaving error checking out::

    int Image_getbuffer(PyObject *self, PyBuffer *view, int flags) {

        static Py_ssize_t suboffsets[2] = { 0, -1 };

        view->buf = self->lines;
        view->len = self->height * self->width * sizeof(struct rgba);
        view->readonly = 0;
        view->ndims = 2;
        self->shape_array[0] = self->height;
        self->shape_array[1] = self->width;
        view->shape = self->shape_array;
        self->stride_array[0] = sizeof(struct rgba*);
        self->stride_array[1] = sizeof(struct rgba);
        view->strides = self->stride_array;
        view->suboffsets = suboffsets;

        self->view_count++;

        return 0;
    }


    int Image_releasebuffer(PyObject *self, PyBuffer *view) {
        self->view_count--;
        return 0;
    }

Note that the first suboffset is 0 (each entry in the first dimension
is a line pointer that must be de-referenced) while the second is -1
(within a line, the memory is an ordinary contiguous block).


Ex. 2
-----------

This example shows how an object that wants to expose a contiguous
chunk of memory (which will never be re-allocated while the object is
alive) would do that.

::

    int myobject_getbuffer(PyObject *self, PyBuffer *view, int flags) {

        void *buf;
        Py_ssize_t len;
        int readonly = 0;

        buf = /* Point to buffer */;
        len = /* Set to size of buffer */;
        readonly = /* Set to 1 if readonly */;

        return PyBuffer_FillInfo(view, buf, len, readonly, flags);
    }

No releasebuffer is necessary because the memory will never be
re-allocated, so the locking mechanism is not needed.

Ex. 3
-----------

A consumer that wants only a simple contiguous chunk of bytes from a
Python object, obj, would do the following:

::

    PyBuffer view;

    if (PyObject_GetBuffer(obj, &view, Py_BUF_SIMPLE) < 0) {
        /* error return */
    }

    /* Now, view.buf is the pointer to memory,
       view.len is the length, and
       view.readonly is whether or not the memory is read-only.
     */


    /* After using the information and you don't need it anymore */

    if (PyObject_ReleaseBuffer(obj, &view) < 0) {
        /* error return */
    }


Ex. 4
-----------

A consumer that wants to be able to use any object's memory but is
writing an algorithm that only handles contiguous memory could do the
following:

::

    void *buf;
    Py_ssize_t len;
    char *format;

    if (PyObject_GetContiguous(obj, &buf, &len, &format, 'A') < 0) {
        /* error return */
    }

    /* process memory pointed to by buf if format is correct */

    /* Optional:

       if, after processing, we want to copy data from the buffer back
       into the object, we could do
     */

    if (PyObject_CopyToObject(obj, buf, len, 'A') < 0) {
        /* error return */
    }


Copyright
=========

This PEP is placed in the public domain.