delta/python-packages/numpy.git - github.com: numpy/numpy.git

diff options

author	Developer-Ecosystem-Engineering <65677710+Developer-Ecosystem-Engineering@users.noreply.github.com>	2021-11-18 12:51:00 -0800
committer	Developer-Ecosystem-Engineering <65677710+Developer-Ecosystem-Engineering@users.noreply.github.com>	2021-11-18 12:51:00 -0800
commit	2ff7ab64d4e7d5928e96ca95b85350aa9caa2b63 (patch)
tree	0628e1e3bbbd38bf921d5046d62dbbbcc96f2258 /numpy/array_api/_set_functions.py
parent	a5130840fe6e2aa0ff12d95ccdfdadc5589bb88a (diff)
download	numpy-2ff7ab64d4e7d5928e96ca95b85350aa9caa2b63.tar.gz

Reorganize NEON min/max implementation to be more generic

Thank you @seiko2plus for the excellent example. Reorganized code so that it can be used for other architectures. Core implementations and unroll factors should be the same as before for ARM NEON. Beyond reorganizing, we've added default implementations using universal intrinsics for non-ARM-NEON. Additionally, we've moved most min, max, fmin, fmax implementations to a new dispatchable source file: numpy/core/src/umath/loops_minmax.dispatch.c.src **Testing** - Apple silicon M1 native (arm64 / aarch64) -- No test failures - Apple silicon M1 Rosetta (x86_64) -- No new test failures - iMacPro1,1 (AVX512F) -- No test failures **Benchmarks** - Apple silicon M1 native (arm64 / aarch64) - Similar improvements as before reorg (comparison below) - x86_64 (both Apple silicon M1 Rosetta and iMacPro1,1 AVX512F) - Some x86_64 benchmarks are better, some are worse Apple silicon M1 native (arm64 / aarch64) comparison to original implementation / before reorg: ``` before after ratio [559ddede] [a3463b09] <gh-issue-17989/improve-neon-min-max> <gh-issue-17989/feedback/round-1> + 6.45±0.04μs 7.07±0.09μs 1.10 bench_lib.Nan.time_nanargmin(200, 0.1) + 32.1±0.3μs 35.2±0.2μs 1.10 bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 2, 1, 'd') + 29.1±0.02μs 31.8±0.05μs 1.10 bench_core.Core.time_array_int_l1000 + 69.0±0.2μs 75.3±3μs 1.09 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 2, 4, 'f') + 92.0±1μs 99.5±0.5μs 1.08 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 4, 4, 'd') + 9.29±0.1μs 9.99±0.5μs 1.08 bench_ma.UFunc.time_1d(True, True, 10) + 338±0.6μs 362±10μs 1.07 bench_function_base.Sort.time_sort('quick', 'int16', ('random',)) + 4.21±0.03μs 4.48±0.2μs 1.07 bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 100, <class 'str'>) + 12.3±0.06μs 13.1±0.7μs 1.06 bench_function_base.Median.time_even_small + 1.27±0μs 1.35±0.06μs 1.06 bench_itemselection.PutMask.time_dense(False, 'float16') + 139±1ns 147±6ns 1.06 bench_core.Core.time_array_1 + 33.7±0.01μs 35.5±2μs 1.05 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 2, 4, 'f') + 69.4±0.1μs 73.1±0.2μs 1.05 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 4, 4, 'f') + 225±0.09μs 237±9μs 1.05 bench_random.Bounded.time_bounded('PCG64', [<class 'numpy.uint32'>, 2047]) - 15.7±0.5μs 14.9±0.03μs 0.95 bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'numpy.int64'>) - 34.2±2μs 32.0±0.03μs 0.94 bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 4, 2, 'f') - 1.03±0.05ms 955±3μs 0.92 bench_lib.Nan.time_nanargmax(200000, 50.0) - 6.97±0.08μs 6.43±0.02μs 0.92 bench_ma.UFunc.time_scalar(True, False, 10) - 5.41±0μs 4.98±0.01μs 0.92 bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('subtract', 2, 'F') - 22.4±0.01μs 20.6±0.02μs 0.92 bench_core.Core.time_array_float64_l1000 - 1.51±0.01ms 1.38±0ms 0.92 bench_core.CorrConv.time_correlate(1000, 10000, 'same') - 10.1±0.2μs 9.27±0.09μs 0.92 bench_ufunc.UFunc.time_ufunc_types('invert') - 8.50±0.02μs 7.80±0.09μs 0.92 bench_indexing.ScalarIndexing.time_assign_cast(1) - 29.5±0.2μs 26.6±0.03μs 0.90 bench_ma.Concatenate.time_it('masked', 100) - 2.09±0.02ms 1.87±0ms 0.90 bench_ma.UFunc.time_2d(True, True, 1000) - 298±10μs 267±0.3μs 0.89 bench_app.MaxesOfDots.time_it - 10.7±0.2μs 9.60±0.02μs 0.89 bench_ma.UFunc.time_1d(True, True, 100) - 567±3μs 505±2μs 0.89 bench_lib.Nan.time_nanargmax(200000, 90.0) - 342±0.9μs 282±5μs 0.83 bench_lib.Nan.time_nanargmax(200000, 2.0) - 307±0.7μs 244±0.8μs 0.80 bench_lib.Nan.time_nanargmax(200000, 0.1) - 309±1μs 241±0.1μs 0.78 bench_lib.Nan.time_nanargmax(200000, 0) ```

Diffstat (limited to 'numpy/array_api/_set_functions.py')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: