Diffstat (limited to 'doc/neps')
-rw-r--r-- doc/neps/groupby_additions.rst | 112
1 file changed, 112 insertions, 0 deletions
diff --git a/doc/neps/groupby_additions.rst b/doc/neps/groupby_additions.rst
new file mode 100644
index 000000000..28e1c29ac
--- /dev/null
+++ b/doc/neps/groupby_additions.rst
@@ -0,0 +1,112 @@
+====================================================================
+ A proposal for adding groupby functionality to NumPy
+====================================================================
+
+:Author: Travis Oliphant
+:Contact: oliphant@enthought.com
+:Date: 2010-04-27
+
+
+Executive summary
+=================
+
+NumPy provides tools for handling data and doing calculations in much
+the same way as relational algebra allows. However, the common group-by
+functionality is not easily handled. The reduce methods of NumPy's
+ufuncs are a natural place to put this groupby behavior. This NEP
+describes two additional methods for ufuncs (reduceby and reducein) and
+two additional functions (segment and edges) which can help add this
+functionality.
+
+Example Use Case
+================
+Suppose you have a NumPy structured array containing information about
+the number of purchases at several stores over multiple days. To be clear, the
+structured array data-type is::
+
+    dt = [('year', 'i2'), ('month', 'i1'), ('day', 'i1'), ('time', float),
+          ('store', 'i4'), ('SKU', 'S6'), ('number', 'i4')]
+
+Suppose there is a 1-d NumPy array of this data-type and you would like
+to compute various statistics (max, min, mean, sum, etc.) on the number
+of products sold, by product, by month, by store, etc.
+
+Currently, this could be done by using reduce methods on the number
+field of the array, coupled with in-place sorting, unique with
+return_inverse=True and bincount, etc. However, for such a common
+data-analysis need, it would be nice to have standard and more direct
+ways to get the results.
+
+
+Ufunc methods proposed
+======================
+
+It is proposed to add two new reduce-style methods to the ufuncs:
+reduceby and reducein. The reducein method is intended to be a
+simpler-to-use version of reduceat, while the reduceby method is
+intended to provide group-by capability on reductions.
+
+reducein::
+
+ <ufunc>.reducein(arr, indices, axis=0, dtype=None, out=None)
+
+ Perform a local reduce with slices specified by pairs of indices.
+
+ The reduction occurs along the provided axis, using the provided
+ data-type to calculate intermediate results, storing the result into
+ the array out (if provided).
+
+ The indices array provides the start and end indices for the
+ reduction. If the length of the indices array is odd, then the
+ final index provides the beginning point for the final reduction
+ and the ending point is the end of arr.
+
+ This generalizes, along the given axis, the behavior:
+
+     [<ufunc>.reduce(arr[indices[2*i]:indices[2*i+1]])
+      for i in range(len(indices)//2)]
+
+ This assumes indices is of even length.
+
+ Example:
+ >>> a = [0,1,2,4,5,6,9,10]
+ >>> add.reducein(a,[0,3,2,5,-2])
+ [3, 11, 19]
+
+ Notice that sum(a[0:3]) = 3; sum(a[2:5]) = 11; and sum(a[-2:]) = 19.
+
+reduceby::
+
+ <ufunc>.reduceby(arr, by, dtype=None, out=None)
+
+ Perform a reduction in arr over unique non-negative integers in by.
+
+
+ Let N=arr.ndim and M=by.ndim. Then, by.shape[:N] == arr.shape.
+ In addition, let I be an N-length index tuple, then by[I]
+ contains the location in the output array for the reduction to
+ be stored. Notice that if N == M, then by[I] is a non-negative
+ integer, while if N < M, then by[I] is an array of indices into
+ the output array.
+
+ The reduction is computed on groups specified by unique indices
+ into the output array. The index is either the single
+ non-negative integer (if N == M) or, if N < M, the entire
+ (M-N+1)-length index by[I] considered as a whole.
+
+
+Functions proposed
+==================
+
+segment::
+
+
+edges::
+
+
+.. Local Variables:
+.. mode: rst
+.. coding: utf-8
+.. fill-column: 72
+.. End:
+
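
The semantics that the patch above describes for reducein and reduceby can be approximated with calls that already exist in NumPy. The following is only a rough sketch, not the proposed implementation: `reducein_sketch` and `reduceby_sketch` are hypothetical helper names, both handle only the 1-d case (and, for reduceby, only N == M), and the reduceby emulation assumes a ufunc that has an identity element, such as `np.add` or `np.multiply`.

```python
import numpy as np


def reducein_sketch(ufunc, arr, indices):
    """Hypothetical 1-d emulation of the proposed <ufunc>.reducein."""
    arr = np.asarray(arr)
    indices = list(indices)
    if len(indices) % 2:
        # Odd length: the final index starts a slice that runs to the end of arr.
        indices.append(len(arr))
    pairs = zip(indices[0::2], indices[1::2])
    return np.array([ufunc.reduce(arr[start:stop]) for start, stop in pairs])


def reduceby_sketch(ufunc, arr, by):
    """Hypothetical emulation of the proposed <ufunc>.reduceby for N == M."""
    arr = np.asarray(arr)
    by = np.asarray(by)
    if ufunc.identity is None:
        # Assumption of this sketch: the output needs a neutral starting value.
        raise ValueError("this sketch needs a ufunc with an identity")
    # One output slot per non-negative integer label, initialised to the identity.
    out = np.full(by.max() + 1, ufunc.identity, dtype=arr.dtype)
    # ufunc.at is unbuffered, so repeated labels accumulate instead of overwriting.
    ufunc.at(out, by.ravel(), arr.ravel())
    return out


# The reducein example from the proposal.
a = [0, 1, 2, 4, 5, 6, 9, 10]
print(reducein_sketch(np.add, a, [0, 3, 2, 5, -2]).tolist())  # [3, 11, 19]

# Group-by sum of units sold per SKU; unique(..., return_inverse=True)
# turns arbitrary labels into the non-negative integers reduceby expects.
sku = np.array(["A1", "B2", "A1", "B2", "C3"])
number = np.array([5, 3, 2, 7, 1])
labels, by = np.unique(sku, return_inverse=True)
print(labels.tolist(), reduceby_sketch(np.add, number, by).tolist())
# ['A1', 'B2', 'C3'] [7, 10, 1]
```

For sums specifically, `np.bincount(by, weights=number)` gives the same grouping but returns floats; the `ufunc.at` route is used here because it works for any binary ufunc with an identity.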