author | Developer-Ecosystem-Engineering <65677710+Developer-Ecosystem-Engineering@users.noreply.github.com> | 2021-11-18 12:51:00 -0800 |
---|---|---|
committer | Developer-Ecosystem-Engineering <65677710+Developer-Ecosystem-Engineering@users.noreply.github.com> | 2021-11-18 12:51:00 -0800 |
commit | 2ff7ab64d4e7d5928e96ca95b85350aa9caa2b63 (patch) | |
tree | 0628e1e3bbbd38bf921d5046d62dbbbcc96f2258 /numpy/array_api/_set_functions.py | |
parent | a5130840fe6e2aa0ff12d95ccdfdadc5589bb88a (diff) | |
download | numpy-2ff7ab64d4e7d5928e96ca95b85350aa9caa2b63.tar.gz |
Reorganize NEON min/max implementation to be more generic
Thank you @seiko2plus for the excellent example.
Reorganized the code so that it can be used for other architectures. Core implementations and unroll factors should be the same as before for ARM NEON. Beyond reorganizing, we've added default implementations using universal intrinsics for targets other than ARM NEON. Additionally, we've moved most min, max, fmin, fmax implementations to a new dispatchable source file: numpy/core/src/umath/loops_minmax.dispatch.c.src
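For readers unfamiliar with the two families of functions being moved, the following is a minimal scalar sketch of how they differ in NaN handling, as NumPy documents it: `maximum` propagates NaNs, while `fmax` ignores a NaN when the other operand is a number. The helper names `maximum_ref`, `fmax_ref`, and `reduce_ref` are hypothetical and exist only for illustration; this is not the SIMD code in the dispatch file.

```python
import math


def maximum_ref(a, b):
    """NaN-propagating maximum (hypothetical scalar model of np.maximum)."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return a if a > b else b


def fmax_ref(a, b):
    """NaN-ignoring maximum (hypothetical scalar model of np.fmax)."""
    if math.isnan(a):
        return b  # also covers the case where both are NaN
    if math.isnan(b):
        return a
    return a if a > b else b


def reduce_ref(op, values):
    """Fold a binary operation over a non-empty list, left to right."""
    acc = values[0]
    for v in values[1:]:
        acc = op(acc, v)
    return acc


data = [1.0, math.nan, 3.0, 2.0]
print(reduce_ref(maximum_ref, data))  # nan
print(reduce_ref(fmax_ref, data))     # 3.0
```

The same distinction applies to `minimum` versus `fmin`, with the comparison reversed.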
**Testing**
- Apple silicon M1 native (arm64 / aarch64) -- No test failures
- Apple silicon M1 Rosetta (x86_64) -- No new test failures
- iMacPro1,1 (AVX512F) -- No test failures
**Benchmarks**
- Apple silicon M1 native (arm64 / aarch64)
  - Improvements similar to those before the reorg (comparison below)
- x86_64 (both Apple silicon M1 Rosetta and iMacPro1,1 AVX512F)
  - Some x86_64 benchmarks are better, some are worse
Apple silicon M1 native (arm64 / aarch64) comparison to original implementation / before reorg:
```
before after ratio
[559ddede] [a3463b09]
<gh-issue-17989/improve-neon-min-max> <gh-issue-17989/feedback/round-1>
+ 6.45±0.04μs 7.07±0.09μs 1.10 bench_lib.Nan.time_nanargmin(200, 0.1)
+ 32.1±0.3μs 35.2±0.2μs 1.10 bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 2, 1, 'd')
+ 29.1±0.02μs 31.8±0.05μs 1.10 bench_core.Core.time_array_int_l1000
+ 69.0±0.2μs 75.3±3μs 1.09 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 2, 4, 'f')
+ 92.0±1μs 99.5±0.5μs 1.08 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 4, 4, 'd')
+ 9.29±0.1μs 9.99±0.5μs 1.08 bench_ma.UFunc.time_1d(True, True, 10)
+ 338±0.6μs 362±10μs 1.07 bench_function_base.Sort.time_sort('quick', 'int16', ('random',))
+ 4.21±0.03μs 4.48±0.2μs 1.07 bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 100, <class 'str'>)
+ 12.3±0.06μs 13.1±0.7μs 1.06 bench_function_base.Median.time_even_small
+ 1.27±0μs 1.35±0.06μs 1.06 bench_itemselection.PutMask.time_dense(False, 'float16')
+ 139±1ns 147±6ns 1.06 bench_core.Core.time_array_1
+ 33.7±0.01μs 35.5±2μs 1.05 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 2, 4, 'f')
+ 69.4±0.1μs 73.1±0.2μs 1.05 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 4, 4, 'f')
+ 225±0.09μs 237±9μs 1.05 bench_random.Bounded.time_bounded('PCG64', [<class 'numpy.uint32'>, 2047])
- 15.7±0.5μs 14.9±0.03μs 0.95 bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'numpy.int64'>)
- 34.2±2μs 32.0±0.03μs 0.94 bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 4, 2, 'f')
- 1.03±0.05ms 955±3μs 0.92 bench_lib.Nan.time_nanargmax(200000, 50.0)
- 6.97±0.08μs 6.43±0.02μs 0.92 bench_ma.UFunc.time_scalar(True, False, 10)
- 5.41±0μs 4.98±0.01μs 0.92 bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('subtract', 2, 'F')
- 22.4±0.01μs 20.6±0.02μs 0.92 bench_core.Core.time_array_float64_l1000
- 1.51±0.01ms 1.38±0ms 0.92 bench_core.CorrConv.time_correlate(1000, 10000, 'same')
- 10.1±0.2μs 9.27±0.09μs 0.92 bench_ufunc.UFunc.time_ufunc_types('invert')
- 8.50±0.02μs 7.80±0.09μs 0.92 bench_indexing.ScalarIndexing.time_assign_cast(1)
- 29.5±0.2μs 26.6±0.03μs 0.90 bench_ma.Concatenate.time_it('masked', 100)
- 2.09±0.02ms 1.87±0ms 0.90 bench_ma.UFunc.time_2d(True, True, 1000)
- 298±10μs 267±0.3μs 0.89 bench_app.MaxesOfDots.time_it
- 10.7±0.2μs 9.60±0.02μs 0.89 bench_ma.UFunc.time_1d(True, True, 100)
- 567±3μs 505±2μs 0.89 bench_lib.Nan.time_nanargmax(200000, 90.0)
- 342±0.9μs 282±5μs 0.83 bench_lib.Nan.time_nanargmax(200000, 2.0)
- 307±0.7μs 244±0.8μs 0.80 bench_lib.Nan.time_nanargmax(200000, 0.1)
- 309±1μs 241±0.1μs 0.78 bench_lib.Nan.time_nanargmax(200000, 0)
```
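As a reading aid for the table above: the `ratio` column is the "after" time divided by the "before" time, so rows marked `+` (ratio above 1) are slowdowns and rows marked `-` (ratio below 1) are speedups. The sketch below recomputes two ratios from the displayed values; the helper name `ratio` is hypothetical. Because the displayed timings are already rounded, recomputing other rows this way can differ from the table in the last digit.

```python
def ratio(before, after):
    """Return after/before, formatted to two decimals like the table."""
    return f"{after / before:.2f}"


# bench_lib.Nan.time_nanargmax(200000, 0): 309 us before, 241 us after
print(ratio(309, 241))    # 0.78

# bench_lib.Nan.time_nanargmin(200, 0.1): 6.45 us before, 7.07 us after
print(ratio(6.45, 7.07))  # 1.10
```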