From f2392bc1a8a0275576899fd720cb237c72ec50ba Mon Sep 17 00:00:00 2001
From: Travis Oliphant
Date: Thu, 11 Jun 2009 22:16:39 +0000
Subject: Working on date-time...

---
 doc/neps/datetime-proposal.rst | 591 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 591 insertions(+)
 create mode 100644 doc/neps/datetime-proposal.rst

diff --git a/doc/neps/datetime-proposal.rst b/doc/neps/datetime-proposal.rst
new file mode 100644
index 000000000..da854d773
--- /dev/null
+++ b/doc/neps/datetime-proposal.rst
@@ -0,0 +1,591 @@
+====================================================================
+ A proposal for implementing some date/time types in NumPy
+====================================================================
+
+:Author: Travis Oliphant
+:Contact: oliphant@enthought.com
+:Date: 2009-06-09
+
+Based on the original (third) proposal by
+
+:Author: Francesc Alted i Abad
+:Contact: faltet@pytables.com
+:Author: Ivan Vilata i Balaguer
+:Contact: ivan@selidor.net
+:Date: 2008-07-30
+
+
+
+Executive summary
+=================
+
+A date/time mark is something very handy to have in many fields where
+one has to deal with data sets. While Python has several modules that
+define a date/time type (like the integrated ``datetime`` [1]_ or
+``mx.DateTime`` [2]_), NumPy lacks one.
+
+In this document, we are proposing the addition of date/time
+types to fill this gap. The requirements for the proposed types are
+twofold: 1) they have to be fast to operate with and 2) they have to
+be as compatible as possible with the existing ``datetime`` module that
+comes with Python.
+
+
+Types proposed
+==============
+
+To start with, it is virtually impossible to come up with a single
+date/time type that fills the needs of every use case. So, there are
+two different types: ``timedelta64`` and ``datetime64``. The
+``timedelta64`` represents a relative time difference (i.e. between two
+events).
+
+The ``datetime64`` represents an absolute time.
+Internally it is represented as the number of time units between the
+intended time and the epoch (midnight at the start of January 1, 1970).
+
+
+
+After pondering different possibilities, we
+have settled on *two* different types, namely ``datetime64`` and
+``timedelta64`` (these names are preliminary and can be changed), that
+can have different time units so as to cover different needs.
+
+.. Important:: the time unit is conceived here as metadata that
+  *complements* a date/time dtype, *without changing the base type*.  It
+  provides information about the *meaning* of the stored numbers, not
+  about their *structure*.
+
+Now follows a detailed description of the proposed types.
+
+
+``datetime64``
+--------------
+
+It represents a time that is absolute (i.e. not relative). It is
+implemented internally as an ``int64`` type. The internal epoch is the
+POSIX epoch (see [3]_). Like POSIX, the representation of a date
+doesn't take leap seconds into account.
+
+In time unit *conversions* and time *representations* (but not in other
+time computations), the value -2**63 (0x8000000000000000) is interpreted
+as an invalid or unknown date, *Not a Time* or *NaT*.  See the section
+on time unit conversions for more information.
+
+Time units
+~~~~~~~~~~
+
+It accepts different time units, each of them implying a different time
+span. The table below describes the time units supported with their
+corresponding time spans.
+
+======== ================ ==========================
+       Time unit               Time span (years)
+------------------------- --------------------------
+  Code       Meaning
+======== ================ ==========================
+   Y       year             [9.2e18 BC, 9.2e18 AD]
+   M       month            [7.6e17 BC, 7.6e17 AD]
+   W       week             [1.7e17 BC, 1.7e17 AD]
+   B       business day     [3.5e16 BC, 3.5e16 AD]
+   D       day              [2.5e16 BC, 2.5e16 AD]
+   h       hour             [1.0e15 BC, 1.0e15 AD]
+   m       minute           [1.7e13 BC, 1.7e13 AD]
+   s       second           [2.9e11 BC, 2.9e11 AD]
+   ms      millisecond      [ 2.9e8 BC,  2.9e8 AD]
+   us      microsecond      [290301 BC, 294241 AD]
+   ns      nanosecond       [  1678 AD,   2262 AD]
+======== ================ ==========================
+
+The value of an absolute date is thus *an integer number of units of the
+chosen time unit* elapsed since the internal epoch. When working with
+business days, Saturdays and Sundays are simply excluded from the count
+(i.e. day 3 in business days is not Saturday 1970-01-03, but Monday
+1970-01-05).
+
+Building a ``datetime64`` dtype
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The proposed ways to specify the time unit in the dtype constructor are:
+
+Using the long string notation::
+
+  dtype('datetime64[us]')
+
+Using the short string notation::
+
+  dtype('M8[us]')
+
+Note that a time unit should always be specified, as there is no
+default.
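The time spans tabulated for each unit can be sanity-checked from the range of a signed 64-bit integer. The following is only a rough sketch and not part of the proposal: it assumes a mean Gregorian year of 365.2425 days, covers only the fixed-length units, and ``span_in_years`` is a hypothetical helper name.

```python
# Rough check of the representable span of a signed 64-bit tick count.
INT64_MAX = 2**63 - 1

# Assumption: a mean Gregorian year of 365.2425 days.
SECONDS_PER_YEAR = 365.2425 * 86400

# Length of one tick of each fixed-length unit, in seconds.
SECONDS_PER_UNIT = {
    "W": 7 * 86400,
    "D": 86400,
    "h": 3600,
    "m": 60,
    "s": 1,
    "ms": 1e-3,
    "us": 1e-6,
    "ns": 1e-9,
}


def span_in_years(unit):
    """Return the approximate half-range, in years, for the given unit."""
    return INT64_MAX * SECONDS_PER_UNIT[unit] / SECONDS_PER_YEAR


for unit in SECONDS_PER_UNIT:
    print("%2s: +/- %.1e years" % (unit, span_in_years(unit)))
```

For nanoseconds this gives roughly 292 years on either side of the epoch, which agrees with the 1678 to 2262 range listed in the table.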
+
+
+Setting and getting values
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The objects with this dtype can be set in a series of ways::
+
+  t = numpy.ones(3, dtype='M8[s]')
+  t[0] = 1217439060    # assign to July 30th, 2008 at 17:31:00
+  t[1] = datetime.datetime(2008, 7, 30, 17, 31, 1)  # with datetime module
+  t[2] = '2008-07-30T17:31:02'    # with ISO 8601
+
+And can be retrieved in different ways too::
+
+  str(t[0])  -->  2008-07-30T17:31:00
+  repr(t[1]) -->  datetime64(1217439061, 's')
+  str(t[0].item()) --> 2008-07-30 17:31:00  # datetime module object
+  repr(t[0].item()) --> datetime.datetime(2008, 7, 30, 17, 31)  # idem
+  str(t)  -->  [2008-07-30T17:31:00  2008-07-30T17:31:01  2008-07-30T17:31:02]
+  repr(t)  -->  array([1217439060, 1217439061, 1217439062],
+                      dtype='datetime64[s]')
+
+Comparisons
+~~~~~~~~~~~
+
+Comparisons will be supported too::
+
+  numpy.array(['1980'], 'M8[Y]') == numpy.array(['1979'], 'M8[Y]')
+  --> [False]
+
+or by applying broadcasting::
+
+  numpy.array(['1979', '1980'], 'M8[Y]') == numpy.datetime64('1980', 'Y')
+  --> [False, True]
+
+The following should work too::
+
+  numpy.array(['1979', '1980'], 'M8[Y]') == '1980-01-01'
+  --> [False, True]
+
+because the right hand expression can be broadcast into an array of 2
+elements of dtype 'M8[Y]'.
+
+Compatibility issues
+~~~~~~~~~~~~~~~~~~~~
+
+This will be fully compatible with the ``datetime`` class of the
+``datetime`` module of Python only when using a time unit of
+microseconds. For other time units, the conversion process will lose
+precision or will overflow as needed. The conversion from/to a
+``datetime`` object doesn't take leap seconds into account.
+
+
+``timedelta64``
+---------------
+
+It represents a time that is relative (i.e. not absolute). It is
+implemented internally as an ``int64`` type.
+
+In time unit *conversions* and time *representations* (but not in other
+time computations), the value -2**63 (0x8000000000000000) is interpreted
+as an invalid or unknown time, *Not a Time* or *NaT*.
+See the section on time unit conversions for more information.
+
+Time units
+~~~~~~~~~~
+
+It accepts different time units, each of them implying a different time
+span. The table below describes the time units supported with their
+corresponding time spans.
+
+======== ================ ==========================
+       Time unit                   Time span
+------------------------- --------------------------
+  Code       Meaning
+======== ================ ==========================
+   Y       year             +- 9.2e18 years
+   M       month            +- 7.6e17 years
+   W       week             +- 1.7e17 years
+   B       business day     +- 3.5e16 years
+   D       day              +- 2.5e16 years
+   h       hour             +- 1.0e15 years
+   m       minute           +- 1.7e13 years
+   s       second           +- 2.9e11 years
+   ms      millisecond      +- 2.9e8 years
+   us      microsecond      +- 2.9e5 years
+   ns      nanosecond       +- 292 years
+   ps      picosecond       +- 106 days
+   fs      femtosecond      +- 2.6 hours
+   as      attosecond       +- 9.2 seconds
+======== ================ ==========================
+
+The value of a time delta is thus *an integer number of units of the
+chosen time unit*.
+
+Building a ``timedelta64`` dtype
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The proposed ways to specify the time unit in the dtype constructor are:
+
+Using the long string notation::
+
+  dtype('timedelta64[us]')
+
+Using the short string notation::
+
+  dtype('m8[us]')
+
+Note that a time unit should always be specified, as there is no
+default.
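To make the two notations concrete, here is a minimal pure-Python sketch of how such dtype strings could be split into a base type and a time unit. It is an illustration only, not how NumPy parses dtypes: ``parse_time_dtype`` and the constants are hypothetical names, and the rule that ``ps``, ``fs`` and ``as`` are valid only for ``timedelta64`` is an assumption read off the two unit tables.

```python
import re

# Unit codes as listed in the proposal's two tables (assumed exhaustive).
DATETIME_UNITS = {"Y", "M", "W", "B", "D", "h", "m", "s", "ms", "us", "ns"}
TIMEDELTA_UNITS = DATETIME_UNITS | {"ps", "fs", "as"}

# Short and long notations name the same base type.
BASE_TYPES = {
    "M8": "datetime64",
    "datetime64": "datetime64",
    "m8": "timedelta64",
    "timedelta64": "timedelta64",
}

_SPEC = re.compile(r"^(M8|m8|datetime64|timedelta64)\[([A-Za-z]+)\]$")


def parse_time_dtype(spec):
    """Hypothetical helper: split a dtype string into (base type, unit)."""
    match = _SPEC.match(spec)
    if match is None:
        # A bare 'm8' or 'M8' lands here: the unit is mandatory.
        raise ValueError("expected 'type[unit]', got %r" % (spec,))
    base = BASE_TYPES[match.group(1)]
    unit = match.group(2)
    allowed = DATETIME_UNITS if base == "datetime64" else TIMEDELTA_UNITS
    if unit not in allowed:
        raise ValueError("unit %r is not valid for %s" % (unit, base))
    return base, unit


print(parse_time_dtype("timedelta64[us]"))
print(parse_time_dtype("m8[us]"))
```

Both calls yield the same pair, reflecting that the short notation is only an alias for the long one.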
+
+Setting and getting values
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The objects with this dtype can be set in a series of ways::
+
+  t = numpy.ones(3, dtype='m8[ms]')
+  t[0] = 12    # assign to 12 ms
+  t[1] = datetime.timedelta(0, 0, 13000)   # 13 ms
+  t[2] = '0:00:00.014'    # 14 ms
+
+And can be retrieved in different ways too::
+
+  str(t[0])  -->  0:00:00.012
+  repr(t[1]) -->  timedelta64(13, 'ms')
+  str(t[0].item()) --> 0:00:00.012000   # datetime module object
+  repr(t[0].item()) --> datetime.timedelta(0, 0, 12000)  # idem
+  str(t)     -->  [0:00:00.012  0:00:00.013  0:00:00.014]
+  repr(t)    -->  array([12, 13, 14], dtype="timedelta64[ms]")
+
+Comparisons
+~~~~~~~~~~~
+
+Comparisons will be supported too::
+
+  numpy.array([12, 13, 14], 'm8[ms]') == numpy.array([12, 13, 13], 'm8[ms]')
+  --> [True, True, False]
+
+or by applying broadcasting::
+
+  numpy.array([12, 13, 14], 'm8[ms]') == numpy.timedelta64(13, 'ms')
+  --> [False, True, False]
+
+The following should work too::
+
+  numpy.array([12, 13, 14], 'm8[ms]') == '0:00:00.012'
+  --> [True, False, False]
+
+because the right hand expression can be broadcast into an array of 3
+elements of dtype 'm8[ms]'.
+
+Compatibility issues
+~~~~~~~~~~~~~~~~~~~~
+
+This will be fully compatible with the ``timedelta`` class of the
+``datetime`` module of Python only when using a time unit of
+microseconds. For other units, the conversion process will lose
+precision or will overflow as needed.
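The compatibility limits can be seen with the standard library alone. The sketch below uses only ``datetime.timedelta``; the truncating helper ``nanoseconds_to_timedelta`` is hypothetical and shows just one possible way a finer unit could lose precision, not the conversion the proposal would implement.

```python
import datetime

INT64_MAX = 2**63 - 1

# Microseconds: the largest int64 tick count still fits in a timedelta.
widest = datetime.timedelta(microseconds=INT64_MAX)
print(widest.days)


def nanoseconds_to_timedelta(ticks):
    """Hypothetical conversion that drops anything below one microsecond."""
    return datetime.timedelta(microseconds=ticks // 1000)


# Nanoseconds: 1500 ns cannot be represented exactly, so precision is lost.
print(nanoseconds_to_timedelta(1500))

# Milliseconds: large tick counts exceed the range of datetime.timedelta.
try:
    datetime.timedelta(milliseconds=INT64_MAX)
except OverflowError as exc:
    print("overflow:", exc)
```

Units finer than a microsecond run into the precision limit, while coarser units run into the range limit of ``datetime.timedelta``.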
+
+
+Examples of use
+===============
+
+Here is an example of use for ``datetime64``::
+
+  In [5]: numpy.datetime64(42, 'us')
+  Out[5]: datetime64(42, 'us')
+
+  In [6]: print numpy.datetime64(42, 'us')
+  1970-01-01T00:00:00.000042  # representation in ISO 8601 format
+
+  In [7]: print numpy.datetime64(367.7, 'D')  # decimal part is lost
+  1971-01-03  # still ISO 8601 format
+
+  In [8]: numpy.datetime64('2008-07-18T12:23:18', 'm')  # from ISO 8601
+  Out[8]: datetime64(20273063, 'm')
+
+  In [9]: print numpy.datetime64('2008-07-18T12:23:18', 'm')
+  2008-07-18T12:23
+
+  In [10]: t = numpy.zeros(5, dtype="datetime64[ms]")
+
+  In [11]: t[0] = datetime.datetime.now()  # setter in action
+
+  In [12]: print t
+  [2008-07-16T13:39:25.315  1970-01-01T00:00:00.000
+   1970-01-01T00:00:00.000  1970-01-01T00:00:00.000
+   1970-01-01T00:00:00.000]
+
+  In [13]: repr(t)
+  Out[13]: array([1216215565315, 0, 0, 0, 0], dtype="datetime64[ms]")
+
+  In [14]: t[0].item()     # getter in action
+  Out[14]: datetime.datetime(2008, 7, 16, 13, 39, 25, 315000)
+
+  In [15]: print t.dtype
+  datetime64[ms]
+
+And here is an example of use for ``timedelta64``::
+
+  In [5]: numpy.timedelta64(10, 'us')
+  Out[5]: timedelta64(10, 'us')
+
+  In [6]: print numpy.timedelta64(10, 'us')
+  0:00:00.000010
+
+  In [7]: print numpy.timedelta64(3600.2, 'm')  # decimal part is lost
+  2 days, 12:00
+
+  In [8]: t1 = numpy.zeros(5, dtype="datetime64[ms]")
+
+  In [9]: t2 = numpy.ones(5, dtype="datetime64[ms]")
+
+  In [10]: t = t2 - t1
+
+  In [11]: t[0] = datetime.timedelta(0, 24)  # setter in action
+
+  In [12]: print t
+  [0:00:24.000  0:00:00.001  0:00:00.001  0:00:00.001  0:00:00.001]
+
+  In [13]: print repr(t)
+  array([24000, 1, 1, 1, 1], dtype="timedelta64[ms]")
+
+  In [14]: t[0].item()     # getter in action
+  Out[14]: datetime.timedelta(0, 24)
+
+  In [15]: print t.dtype
+  timedelta64[ms]
+
+
+Operating with date/time arrays
+===============================
+
+``datetime64`` vs ``datetime64``
+--------------------------------
+
+The only arithmetic operation allowed between absolute dates is
+subtraction::
+
+  In [10]: numpy.ones(3, "M8[s]") - numpy.zeros(3, "M8[s]")
+  Out[10]: array([1, 1, 1], dtype=timedelta64[s])
+
+But not other operations::
+
+  In [11]: numpy.ones(3, "M8[s]") + numpy.zeros(3, "M8[s]")
+  TypeError: unsupported operand type(s) for +: 'numpy.ndarray' and 'numpy.ndarray'
+
+Comparisons between absolute dates are allowed.
+
+Casting rules
+~~~~~~~~~~~~~
+
+When operating on two absolute times with different time units
+(basically, only subtraction is allowed), an exception will be raised.
+This is because the ranges and time spans of the different time units
+can be very different, and it is not at all clear which time unit the
+user would prefer. For example, this should be
+allowed::
+
+  >>> numpy.ones(3, dtype="M8[Y]") - numpy.zeros(3, dtype="M8[Y]")
+  array([1, 1, 1], dtype="timedelta64[Y]")
+
+But the next should not::
+
+  >>> numpy.ones(3, dtype="M8[Y]") - numpy.zeros(3, dtype="M8[ns]")
+  raise numpy.IncompatibleUnitError  # what unit to choose?
+
+
+``datetime64`` vs ``timedelta64``
+---------------------------------
+
+It will be possible to add and subtract relative times from absolute
+dates::
+
+  In [10]: numpy.zeros(5, "M8[Y]") + numpy.ones(5, "m8[Y]")
+  Out[10]: array([1971, 1971, 1971, 1971, 1971], dtype=datetime64[Y])
+
+  In [11]: numpy.ones(5, "M8[Y]") - 2 * numpy.ones(5, "m8[Y]")
+  Out[11]: array([1969, 1969, 1969, 1969, 1969], dtype=datetime64[Y])
+
+But not other operations::
+
+  In [12]: numpy.ones(5, "M8[Y]") * numpy.ones(5, "m8[Y]")
+  TypeError: unsupported operand type(s) for *: 'numpy.ndarray' and 'numpy.ndarray'
+
+Casting rules
+~~~~~~~~~~~~~
+
+In this case the absolute time should have priority for determining the
+time unit of the outcome. That is what people want most of the time.
+For example, this would allow one to do::
+
+  >>> series = numpy.array(['1970-01-01', '1970-02-01', '1970-09-01'],
+                           dtype='datetime64[D]')
+  >>> series2 = series + numpy.timedelta64(2, 'Y')  # Add 2 relative years
+  >>> series2
+  array(['1972-01-01', '1972-02-01', '1972-09-01'],
+        dtype='datetime64[D]')  # the 'D'ay time unit has been chosen
+
+
+``timedelta64`` vs ``timedelta64``
+----------------------------------
+
+Finally, it will be possible to operate with relative times as if they
+were regular int64 dtypes *as long as* the result can be converted back
+into a ``timedelta64``::
+
+  In [10]: numpy.ones(3, 'm8[us]')
+  Out[10]: array([1, 1, 1], dtype="timedelta64[us]")
+
+  In [11]: (numpy.ones(3, 'm8[M]') + 2) ** 3
+  Out[11]: array([27, 27, 27], dtype="timedelta64[M]")
+
+But::
+
+  In [12]: numpy.ones(5, 'm8[us]') + 1j
+  TypeError: the result cannot be converted into a ``timedelta64``
+
+Casting rules
+~~~~~~~~~~~~~
+
+When combining two ``timedelta64`` dtypes with different time units, the
+outcome will use the shorter of the two ("keep the precision" rule). For
+example::
+
+  In [10]: numpy.ones(3, 'm8[s]') + numpy.ones(3, 'm8[m]')
+  Out[10]: array([61, 61, 61], dtype="timedelta64[s]")
+
+However, because the exact duration of a relative year or a relative
+month cannot be known, when these time units appear in one
+of the operands, the operation will not be allowed::
+
+  In [11]: numpy.ones(3, 'm8[Y]') + numpy.ones(3, 'm8[D]')
+  raise numpy.IncompatibleUnitError  # how to convert relative years to days?
+
+To be able to perform the above operation, a new NumPy function,
+called ``change_timeunit``, is proposed.
+Its signature will be::
+
+  change_timeunit(time_object, new_unit, reference)
+
+where 'time_object' is the time object whose unit is to be changed,
+'new_unit' is the desired new time unit, and 'reference' is an absolute
+date (NumPy datetime64 scalar) that makes the conversion of relative
+times possible when a time unit contains an uncertain number of smaller
+time units (relative years or months cannot be expressed in days
+without one).
+
+With this, the above operation can be done as follows::
+
+  In [10]: t_years = numpy.ones(3, 'm8[Y]')
+
+  In [11]: t_days = numpy.change_timeunit(t_years, 'D', '2001-01-01')
+
+  In [12]: t_days + numpy.ones(3, 'm8[D]')
+  Out[12]: array([366, 366, 366], dtype="timedelta64[D]")
+
+
+dtype vs time units conversions
+===============================
+
+For changing the date/time dtype of an existing array, we propose to use
+the ``.astype()`` method. This will be mainly useful for changing time
+units.
+
+For example, for absolute dates::
+
+  In[10]: t1 = numpy.zeros(5, dtype="datetime64[s]")
+
+  In[11]: print t1
+  [1970-01-01T00:00:00  1970-01-01T00:00:00  1970-01-01T00:00:00
+   1970-01-01T00:00:00  1970-01-01T00:00:00]
+
+  In[12]: print t1.astype('datetime64[D]')
+  [1970-01-01  1970-01-01  1970-01-01  1970-01-01  1970-01-01]
+
+For relative times::
+
+  In[10]: t1 = numpy.ones(5, dtype="timedelta64[s]")
+
+  In[11]: print t1
+  [1 1 1 1 1]
+
+  In[12]: print t1.astype('timedelta64[ms]')
+  [1000 1000 1000 1000 1000]
+
+Casting directly between relative and absolute dtypes will not be
+supported::
+
+  In[13]: numpy.zeros(5, dtype="datetime64[s]").astype('timedelta64')
+  TypeError: data type cannot be converted to the desired type
+
+Business days have the peculiarity that they do not cover a continuous
+line of time (they have gaps at weekends). Thus, when converting from
+any ordinary time to business days, it can happen that the original time
+is not representable.
+In that case, the result of the conversion is *Not a Time* (*NaT*)::
+
+  In[10]: t1 = numpy.arange(5, dtype="datetime64[D]")
+
+  In[11]: print t1
+  [1970-01-01  1970-01-02  1970-01-03  1970-01-04  1970-01-05]
+
+  In[12]: t2 = t1.astype("datetime64[B]")
+
+  In[13]: print t2  # 1970 begins on a Thursday
+  [1970-01-01  1970-01-02  NaT  NaT  1970-01-05]
+
+When converting back to ordinary days, NaT values are left untouched
+(this happens in all time unit conversions)::
+
+  In[14]: t3 = t2.astype("datetime64[D]")
+
+  In[15]: print t3
+  [1970-01-01  1970-01-02  NaT  NaT  1970-01-05]
+
+
+Final considerations
+====================
+
+Why the ``origin`` metadata disappeared
+---------------------------------------
+
+During the discussion of the date/time dtypes on the NumPy list, the
+idea of having an ``origin`` metadata that complemented the definition
+of the absolute ``datetime64`` was initially found to be useful.
+
+However, after thinking more about this, we found that the combination
+of an absolute ``datetime64`` with a relative ``timedelta64`` does offer
+the same functionality while removing the need for the additional
+``origin`` metadata. This is why we have removed it from this proposal.
+
+Operations with mixed time units
+--------------------------------
+
+Whenever an operation between two time values of the same dtype with the
+same unit is accepted, the same operation with time values of different
+units should be possible (e.g. adding a time delta in seconds and one in
+microseconds), resulting in an adequate time unit. The exact semantics
+of this kind of operation are defined in the "Casting rules"
+subsections of the "Operating with date/time arrays" section.
+
+Due to the peculiarities of business days, it is most probable that
+operations mixing business days with other time units will not be
+allowed.
+
+Why there is not a ``quarter`` time unit?
+-----------------------------------------
+
+This proposal tries to focus on the most commonly used set of time units
+to operate with, and the ``quarter`` can be considered more of a derived
+unit. Besides, the use of a ``quarter`` normally requires that it can
+start at whatever month of the year, and as we are not including support
+for a time ``origin`` metadata, this is not a viable avenue here.
+Finally, if we were to add the ``quarter`` then people would expect to
+find a ``biweekly``, ``semester`` or ``biyearly``, to give some
+examples of other derived units, and we find this a bit too overwhelming
+for this proposal's purposes.
+
+
+.. [1] http://docs.python.org/lib/module-datetime.html
+.. [2] http://www.egenix.com/products/python/mxBase/mxDateTime
+.. [3] http://en.wikipedia.org/wiki/Unix_time
+
+
+.. Local Variables:
+.. mode: rst
+.. coding: utf-8
+.. fill-column: 72
+.. End:
+
--
cgit v1.2.1