author | Sebastian Berg <sebastian@sipsolutions.net> | 2022-06-10 06:48:15 -0700 |
---|---|---|
committer | GitHub <noreply@github.com> | 2022-06-10 07:48:15 -0600 |
commit | 71f7f7c9df592094c3ca8cfb45cea2211de24a58 (patch) | |
tree | 40ab82b516a2a43dada998dfa56fb2a9a4858102 /benchmarks | |
parent | 7d39b0f1135859598ed6bd086a11124a52d8a969 (diff) | |
download | numpy-71f7f7c9df592094c3ca8cfb45cea2211de24a58.tar.gz |
ENH: Implement string comparison ufuncs (or almost) (#21041)
* ENH: Implement string comparison ufuncs (or almost)
This makes all comparison operators and ufuncs work on strings
using the ufunc machinery.
It requires a half-manual "ufunc" to keep supporting void comparisons
and especially `np.compare_chararrays` (that one may have a bit more
overhead now).
In general the new code should be much faster, and it leaves a lot of
room for easier optimizations. It is also much simpler since it can outsource
some complexities to the ufunc/iterator machinery.
This further fixes a couple of bugs with byte-swapped strings.
The backward compatibility related change is that using the normal
ufunc machinery means that string comparisons between string and
unicode now give a `FutureWarning` (instead of just False).
* MAINT: Do not use C99 tagged struct init in C++
C++ does not like it (at least not before C++20)... GCC and clang
don't seem to mind, but MSVC seems to.
* BENCH: Add basic string comparison benchmarks
* DOC,STY: Fixup string-comparisons comments based on review
Thanks to Marten's comments, a few clarifications and slight fixups.
* ENH: Use `memcmp` because it may be faster for the byte case
* TST: Improve string and unicode comparison tests.
* MAINT: Use switch statement based on review
As suggested by Serge.
Co-authored-by: Serge Guelton <serge.guelton@telecom-bretagne.eu>
* TST: Make unicode byte-swap test slightly more concrete
The issue is that the `view` needs to use native byte-order, so
just ensure native byte-order for the view, and then do another cast
to get it right.
* BUG: Add `np.compare_chararrays` to test and fix typo
* TST: Add test for empty string comparisons
* TST: Fixup string test based on Marten's review
* MAINT: Move definitions back into string_ufuncs.h
* MAINT: Use enum class for comparison operator templating
This removes the need for a dynamic (or static) assert in the
switch statement.
* Template version of add_loop to avoid redundant code
* STY: Fixup style, two spaces, error is -1
* STY: Small `string_ufuncs.cpp` fixups based on Serge's review
* MAINT: Fix merge conflict (ensure_dtype_nbo was removed)
Co-authored-by: Serge Guelton <serge.guelton@telecom-bretagne.eu>
Diffstat (limited to 'benchmarks')
-rw-r--r-- | benchmarks/benchmarks/bench_strings.py | 45 |
1 files changed, 45 insertions, 0 deletions
diff --git a/benchmarks/benchmarks/bench_strings.py b/benchmarks/benchmarks/bench_strings.py
new file mode 100644
index 000000000..e500d7f3f
--- /dev/null
+++ b/benchmarks/benchmarks/bench_strings.py
@@ -0,0 +1,45 @@
+from __future__ import absolute_import, division, print_function
+
+from .common import Benchmark
+
+import numpy as np
+import operator
+
+
+_OPERATORS = {
+    '==': operator.eq,
+    '!=': operator.ne,
+    '<': operator.lt,
+    '<=': operator.le,
+    '>': operator.gt,
+    '>=': operator.ge,
+}
+
+
+class StringComparisons(Benchmark):
+    # Basic string comparison speed tests
+    params = [
+        [100, 10000, (1000, 20)],
+        ['U', 'S'],
+        [True, False],
+        ['==', '!=', '<', '<=', '>', '>=']]
+    param_names = ['shape', 'dtype', 'contig', 'operator']
+    int64 = np.dtype(np.int64)
+
+    def setup(self, shape, dtype, contig, operator):
+        self.arr = np.arange(np.prod(shape)).astype(dtype).reshape(shape)
+        self.arr_identical = self.arr.copy()
+        self.arr_different = self.arr[::-1].copy()
+
+        if not contig:
+            self.arr = self.arr[..., ::2]
+            self.arr_identical = self.arr_identical[..., ::2]
+            self.arr_different = self.arr_different[..., ::2]
+
+        self.operator = _OPERATORS[operator]
+
+    def time_compare_identical(self, shape, dtype, contig, operator):
+        self.operator(self.arr, self.arr_identical)
+
+    def time_compare_different(self, shape, dtype, contig, operator):
+        self.operator(self.arr, self.arr_different)
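
To try the benchmark's setup outside the asv harness, a rough standalone sketch follows. It mirrors the `setup` logic from the diff for a single parameter combination; the helper name `make_arrays` and the use of `timeit` are assumptions for this example and are not part of the commit.

```python
import operator
import timeit

import numpy as np


def make_arrays(shape, dtype, contig):
    """Hypothetical helper mirroring StringComparisons.setup from the diff."""
    arr = np.arange(np.prod(shape)).astype(dtype).reshape(shape)
    identical = arr.copy()
    different = arr[::-1].copy()  # reversed along the first axis
    if not contig:
        # Take every second element along the last axis to get strided views.
        arr = arr[..., ::2]
        identical = identical[..., ::2]
        different = different[..., ::2]
    return arr, identical, different


arr, same, diff = make_arrays((1000, 20), "U", contig=False)

# Time one operator on one parameter combination (asv sweeps all of them).
elapsed = timeit.timeit(lambda: operator.eq(arr, same), number=20)
print(arr.shape, arr.flags.c_contiguous, elapsed)
```

The "different" case uses a reversed copy, so for these shapes no element matches its counterpart, whereas the "identical" case compares every string to its full length.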