authorSayed Adel <seiko@imavr.com>2021-11-23 03:56:47 +0200
committerSayed Adel <seiko@imavr.com>2021-12-08 22:18:07 +0200
commit83f2db74867425bd1a504dc5e388404cba222185 (patch)
treec0b1e8f22b83f934c85a1ec86f35db8cc129ebdb /doc/source/reference
parent9fd4162476e4499c71567f79197cc2f9f9076219 (diff)
DOC, SIMD: Improve build options and move them into a separated page
Diffstat (limited to 'doc/source/reference')
-rw-r--r--doc/source/reference/simd/build-options.rst373
-rw-r--r--doc/source/reference/simd/index.rst4
-rw-r--r--doc/source/reference/simd/log_example.txt79
3 files changed, 456 insertions, 0 deletions
diff --git a/doc/source/reference/simd/build-options.rst b/doc/source/reference/simd/build-options.rst
new file mode 100644
index 000000000..18341022b
--- /dev/null
+++ b/doc/source/reference/simd/build-options.rst
@@ -0,0 +1,373 @@
+*****************
+CPU build options
+*****************
+
+Description
+-----------
+
+The following options are mainly used to change the default behavior of optimizations
+that target certain CPU features:
+
+- ``--cpu-baseline``: minimal set of required CPU features.
+ Default value is ``min`` which provides the minimum CPU features that can
+ safely run on a wide range of platforms within the processor family.
+
+ .. note::
+
+ At runtime, NumPy modules will fail to load (raising a Python runtime error)
+ if any of the specified features is not supported by the target CPU.
+
+- ``--cpu-dispatch``: dispatched set of additional CPU features.
+ Default value is ``max -xop -fma4`` which enables all CPU
+ features, except for AMD legacy features (in the case of x86).
+
+ .. note::
+
+ At runtime, NumPy modules will skip any specified features
+ that are not available on the target CPU.
+
+These options are accessible through the Distutils commands
+`build`, `build_clib` and `build_ext`. They accept a set of
+:ref:`CPU features <opt-supported-features>`, groups of features that
+gather several features, or :ref:`special options <opt-special-options>` that
+perform a series of procedures.
+
+.. note::
+
+ If `build_clib` or `build_ext` are not specified by the user,
+ the arguments of `build` will be used instead, which also holds the default values.
+
+To customize both `build_ext` and `build_clib`::
+
+ cd /path/to/numpy
+ python setup.py build --cpu-baseline="avx2 fma3" install --user
+
+To customize only `build_ext`::
+
+ cd /path/to/numpy
+ python setup.py build_ext --cpu-baseline="avx2 fma3" install --user
+
+To customize only `build_clib`::
+
+ cd /path/to/numpy
+ python setup.py build_clib --cpu-baseline="avx2 fma3" install --user
+
+You can also customize the CPU build options through the pip command::
+
+ pip install --no-use-pep517 --global-option=build \
+ --global-option="--cpu-baseline=avx2 fma3" \
+ --global-option="--cpu-dispatch=max" ./
+
+Quick Start
+-----------
+
+In general, the default settings tend to not impose certain CPU features that
+may not be available on some older processors. Raising the ceiling of the
+baseline features will often improve performance and may also reduce
+binary size.
+
+
+The following are the most common scenarios that may require changing
+the default settings:
+
+
+I am building NumPy for my local use
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+And I do not intend to export the build to other users or target a
+different CPU than what the host has.
+
+Set `native` for the baseline, or manually specify the CPU features if the
+`native` option isn't supported by your platform::
+
+ python setup.py build --cpu-baseline="native" bdist
+
+Building NumPy with extra CPU features isn't necessary for this case,
+since all supported features are already defined within the baseline features::
+
+ python setup.py build --cpu-baseline=native --cpu-dispatch=none bdist
+
+.. note::
+
+ A fatal error will be raised if `native` isn't supported by the host platform.
+
+I do not want to support the old processors of the `x86` architecture
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Most CPUs nowadays support at least the `AVX` and `F16C` features, so raise the baseline to them::
+
+ python setup.py build --cpu-baseline="avx f16c" bdist
+
+.. note::
+
+ `--cpu-baseline` combines all implied features, so there's no need
+ to add the SSE features explicitly.
+
+
+I'm facing the same case above but with `ppc64` architecture
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Then raise the ceiling of the baseline features to Power8::
+
+ python setup.py build --cpu-baseline="vsx2" bdist
+
+Having issues with `AVX512` features?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You may have some reservations about including `AVX512` or
+any other CPU feature and want to exclude it from the dispatched features::
+
+ python setup.py build --cpu-dispatch="max -avx512f -avx512cd \
+ -avx512_knl -avx512_knm -avx512_skx -avx512_clx -avx512_cnl -avx512_icl" \
+ bdist
+
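The sketch below is not part of the patch. It only illustrates how the long exclusion list in the command above could be generated instead of typed by hand. The helper name `dispatch_without` is hypothetical, and the feature names are copied from the example command.

```python
# Hypothetical helper (not a NumPy API): builds a --cpu-dispatch value that
# starts from a base option and subtracts each listed feature.
AVX512_FEATURES = [
    "avx512f", "avx512cd", "avx512_knl", "avx512_knm",
    "avx512_skx", "avx512_clx", "avx512_cnl", "avx512_icl",
]


def dispatch_without(features, base="max"):
    """Return e.g. 'max -avx512f -avx512cd ...' for the given feature names."""
    return " ".join([base] + ["-" + name.lower() for name in features])


dispatch_value = dispatch_without(AVX512_FEATURES)
# Show the full command line that the documentation example spells out.
print(f'python setup.py build --cpu-dispatch="{dispatch_value}" bdist')
```

Generating the value this way keeps the list in one place if more features need to be excluded later.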
+.. _opt-supported-features:
+
+Supported Features
+------------------
+
+A feature name can denote a single feature or a group of features,
+as shown in the following tables of supported features:
+
+.. note::
+
+ The following features may not be supported by all compilers,
+ and some compilers may produce a different set of implied features
+ when it comes to features like ``AVX512``, ``AVX2``, and ``FMA3``.
+ See :ref:`opt-platform-differences` for more details.
+
+.. include:: generated_tables/cpu_features.inc
+
+.. _opt-special-options:
+
+Special Options
+---------------
+
+- ``NONE``: enable no features.
+
+- ``NATIVE``: Enables all CPU features supported by the host CPU.
+ This operation is based on the compiler flags (`-march=native`, `-xHost`, `/QxHost`).
+
+- ``MIN``: Enables the minimum CPU features that can safely run on a wide range of platforms:
+
+ .. table::
+ :align: left
+
+ ======================================  =======================================
+ For Arch                                Implies
+ ======================================  =======================================
+ x86 (32-bit mode)                       ``SSE`` ``SSE2``
+ x86_64                                  ``SSE`` ``SSE2`` ``SSE3``
+ IBM/POWER (big-endian mode)             ``NONE``
+ IBM/POWER (little-endian mode)          ``VSX`` ``VSX2``
+ ARMHF                                   ``NONE``
+ ARM64 A.K.A. AARCH64                    ``NEON`` ``NEON_FP16`` ``NEON_VFPV4``
+                                         ``ASIMD``
+ ======================================  =======================================
+
+- ``MAX``: Enables all CPU features supported by the compiler and platform.
+
+- ``Operators-/+``: remove or add features, useful with options ``MAX``, ``MIN`` and ``NATIVE``.
+
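The sketch below is not part of the patch and is not NumPy's implementation. It is a minimal model of how the special options and the `-`/`+` operators described above could combine, assuming tokens are applied left to right. The `SUPPORTED` list is made-up sample data standing in for what a compiler and platform probe would report, `NATIVE` is left out because it depends on compiler flags, and the function name `resolve` is hypothetical.

```python
# Illustrative data only: a made-up "supported" list for x86_64.
SUPPORTED = ["SSE", "SSE2", "SSE3", "SSSE3", "SSE41", "POPCNT", "SSE42",
             "AVX", "F16C", "XOP", "FMA4", "FMA3", "AVX2"]
# Taken from the x86_64 row of the MIN table.
MIN_X86_64 = ["SSE", "SSE2", "SSE3"]


def resolve(option_string, supported=SUPPORTED, minimum=MIN_X86_64):
    """Expand NONE/MIN/MAX and apply the -/+ operators, left to right."""
    enabled = []
    for token in option_string.upper().split():
        remove = token.startswith("-")
        name = token.lstrip("+-")
        if name == "NONE":
            group = []
        elif name == "MIN":
            group = minimum
        elif name == "MAX":
            group = supported
        else:
            group = [name]
        for feature in group:
            if remove:
                if feature in enabled:
                    enabled.remove(feature)
            elif feature in supported and feature not in enabled:
                # Assumption: unsupported names are silently skipped here.
                enabled.append(feature)
    return enabled


default_dispatch = resolve("max -xop -fma4")
```

With the sample data, `default_dispatch` holds every entry of `SUPPORTED` except `XOP` and `FMA4`, which mirrors the intent of the default `--cpu-dispatch` value.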
+Behaviors
+---------
+
+- CPU features and other options are case-insensitive, for example::
+
+ python setup.py build --cpu-dispatch="SSE41 avx2 FMA3"
+
+- The order of the requested optimizations doesn't matter::
+
+ python setup.py build --cpu-dispatch="SSE41 AVX2 FMA3"
+ # equivalent to
+ python setup.py build --cpu-dispatch="FMA3 AVX2 SSE41"
+
+- Either commas or spaces or '+' can be used as a separator,
+ for example::
+
+ python setup.py build --cpu-dispatch="avx2 avx512f"
+ # or
+ python setup.py build --cpu-dispatch=avx2,avx512f
+ # or
+ python setup.py build --cpu-dispatch="avx2+avx512f"
+
+ All of these work, but arguments should be enclosed in quotes or escaped
+ by a backslash if any spaces are involved.
+
+- ``--cpu-baseline`` combines all implied CPU features, for example::
+
+ python setup.py build --cpu-baseline=sse42
+ # equivalent to
+ python setup.py build --cpu-baseline="sse sse2 sse3 ssse3 sse41 popcnt sse42"
+
+- ``--cpu-baseline`` will be treated as "native" if the compiler native flag
+ `-march=native`, `-xHost` or `/QxHost` is enabled through the environment variable
+ `CFLAGS`::
+
+ export CFLAGS="-march=native"
+ python setup.py install --user
+ # is equivalent to
+ python setup.py build --cpu-baseline=native install --user
+
+- ``--cpu-baseline`` skips any specified features that aren't supported
+ by the target platform or compiler rather than raising fatal errors.
+
+ .. note::
+
+ Since ``--cpu-baseline`` combines all implied features, the largest
+ supported subset of the implied features will be enabled rather than skipping all of them.
+ For example::
+
+ # Requesting `AVX2,FMA3` but the compiler only supports **SSE** features
+ python setup.py build --cpu-baseline="avx2 fma3"
+ # is equivalent to
+ python setup.py build --cpu-baseline="sse sse2 sse3 ssse3 sse41 popcnt sse42"
+
+- ``--cpu-dispatch`` does not combine any of the implied CPU features,
+ so you must add them unless you want to disable one or all of them::
+
+ # Only dispatches AVX2 and FMA3
+ python setup.py build --cpu-dispatch=avx2,fma3
+ # Dispatches AVX and SSE features
+ python setup.py build --cpu-dispatch=ssse3,sse41,sse42,avx,avx2,fma3
+
+- ``--cpu-dispatch`` skips any specified baseline features and also skips
+ any features not supported by the target platform or compiler without raising
+ fatal errors.
+
+Finally, you should always check the final report in the build log
+to verify the enabled features. See :ref:`opt-build-report` for more details.
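The sketch below is not part of the patch and does not reproduce NumPy's parser. It models three of the behaviors listed above: case-insensitivity, interchangeable separators, and the expansion of implied features for `--cpu-baseline`. The `IMPLIES` table is hand-written from the `sse42` equivalence shown above and is deliberately partial. The names `tokenize` and `expand_baseline` are hypothetical.

```python
import re

# Partial, hand-written implication table based on the sse42 example.
IMPLIES = {
    "SSE": [],
    "SSE2": ["SSE"],
    "SSE3": ["SSE2"],
    "SSSE3": ["SSE3"],
    "SSE41": ["SSSE3"],
    "POPCNT": ["SSE41"],
    "SSE42": ["POPCNT"],
}


def tokenize(value):
    """Split on commas, whitespace or '+'; case and order are ignored."""
    return {tok.upper() for tok in re.split(r"[\s,+]+", value) if tok}


def expand_baseline(value):
    """Add every implied feature, as --cpu-baseline is described to do."""
    pending = list(tokenize(value))
    result = set()
    while pending:
        feature = pending.pop()
        if feature not in result:
            result.add(feature)
            # Follow the chain of implied features transitively.
            pending.extend(IMPLIES.get(feature, []))
    return result


baseline = expand_baseline("sse42")
dispatch = tokenize("avx2,fma3")  # no expansion, as for --cpu-dispatch
```

Returning sets makes the order-independence explicit: `tokenize("SSE41 avx2 FMA3")` and `tokenize("fma3,AVX2+sse41")` compare equal.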
+
+.. _opt-platform-differences:
+
+Platform differences
+--------------------
+
+Some exceptional conditions force us to link some features together when it comes to
+certain compilers or architectures, making it impossible to build them separately.
+
+These conditions can be divided into two parts, as follows:
+
+**Architectural compatibility**
+
+Certain CPU features need to be aligned because they are guaranteed to be supported by
+successive generations of the same architecture. Some cases:
+
+ - On ppc64le, ``VSX(ISA 2.06)`` and ``VSX2(ISA 2.07)`` imply one another, since the
+ first generation that supports little-endian mode is Power-8 ``(ISA 2.07)``.
+ - On AArch64, ``NEON NEON_FP16 NEON_VFPV4 ASIMD`` imply each other, since they are part of the
+ hardware baseline.
+
+For example::
+
+ # On ARMv8/A64, specifying NEON enables Advanced SIMD
+ # and all predecessor extensions
+ python setup.py build --cpu-baseline=neon
+ # which is equivalent to
+ python setup.py build --cpu-baseline="neon neon_fp16 neon_vfpv4 asimd"
+
+.. note::
+
+ Please take a close look at :ref:`opt-supported-features`
+ to determine the features that imply one another.
+
+**Compilation compatibility**
+
+Some compilers don't provide independent support for all CPU features. For instance,
+**Intel**'s compiler doesn't provide separate flags for ``AVX2`` and ``FMA3``.
+This makes sense since all Intel CPUs that come with ``AVX2`` also support ``FMA3``,
+but this approach is incompatible with other **x86** CPUs from **AMD** or **VIA**.
+
+For example::
+
+ # Specifying AVX2 forces FMA3 to be enabled on Intel compilers
+ python setup.py build --cpu-baseline=avx2
+ # which is equivalent to
+ python setup.py build --cpu-baseline="avx2 fma3"
+
+
+The following tables only show how some compilers differ from the
+general context shown in the :ref:`opt-supported-features` tables:
+
+.. note::
+
+ Feature names with strikethrough represent the unsupported CPU features.
+
+.. raw:: html
+
+ <style>
+ .enabled-feature {color:green; font-weight:bold;}
+ .disabled-feature {color:red; text-decoration: line-through;}
+ </style>
+
+.. role:: enabled
+ :class: enabled-feature
+
+.. role:: disabled
+ :class: disabled-feature
+
+.. include:: generated_tables/compilers-diff.inc
+
+.. _opt-build-report:
+
+Build report
+------------
+
+In most cases, the CPU build options do not produce any fatal errors that halt the build.
+Most of the errors that may appear in the build log serve as strong warnings that the compiler
+lacks some expected CPU features.
+
+So we strongly recommend checking the final report log, to be aware of which CPU features
+are enabled and which are not.
+
+You can find the final report of CPU optimizations at the end of the build log,
+and here is how it looks on x86_64/gcc:
+
+.. raw:: html
+
+ <style>#build-report .highlight-bash pre{max-height:450px; overflow-y: scroll;}</style>
+
+.. literalinclude:: log_example.txt
+ :language: bash
+
+As you see, there is a separate report for each of `build_ext` and `build_clib`
+that includes several sections, and each section has several values, representing the following:
+
+**Platform**:
+
+- :enabled:`Architecture`: The architecture name of the target CPU. It should be one of
+ (x86, x64, ppc64, ppc64le, armhf, aarch64, unknown)
+
+- :enabled:`Compiler`: The compiler name. It should be one of
+ (gcc, clang, msvc, icc, iccw, unix-like)
+
+**CPU baseline**:
+
+- :enabled:`Requested`: The features and options passed to ``--cpu-baseline``, as-is.
+- :enabled:`Enabled`: The final set of enabled CPU features.
+- :enabled:`Flags`: The compiler flags that were used to compile all NumPy `C/C++` sources,
+ except for the temporary sources that were used for generating
+ the binary objects of dispatched features.
+- :enabled:`Extra checks`: List of internal checks that activate certain functionality
+ or intrinsics related to the enabled features, useful for debugging when it comes
+ to developing SIMD kernels.
+
+**CPU dispatch**:
+
+- :enabled:`Requested`: The features and options passed to ``--cpu-dispatch``, as-is.
+- :enabled:`Enabled`: The final set of enabled CPU features.
+- :enabled:`Generated`: At the beginning of the next row of this property,
+ the features for which optimizations have been generated are shown in the
+ form of several sections with similar properties explained as follows:
+
+ - :enabled:`One or multiple dispatched feature`: The implied CPU features.
+ - :enabled:`Flags`: The compiler flags that were used for these features.
+ - :enabled:`Extra checks`: Similar to the baseline but for these dispatched features.
+ - :enabled:`Detect`: Set of CPU features that need to be detected at runtime in order to
+ execute the generated optimizations.
+ - The lines that come after the above property and end with a ':' on a separate line,
+ represent the paths of c/c++ sources that define the generated optimizations.
+
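The sketch below is not part of the patch. It shows one way the `Enabled` rows described above could be pulled out of a saved build log for a quick check. The sample text is abridged from the example log in this commit, the function name `enabled_features` is hypothetical, and the regular expressions assume the layout shown in that example.

```python
import re

# Abridged sample in the layout of the example report; real logs are longer.
SAMPLE_REPORT = """\
CPU baseline  :
  Requested   : 'min'
  Enabled     : SSE SSE2 SSE3
  Flags       : -msse -msse2 -msse3
  Extra checks: none

CPU dispatch  :
  Requested   : 'max -xop -fma4'
  Enabled     : SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2
  Generated   : none
"""


def enabled_features(report):
    """Map each 'CPU ...' section name to its list of enabled features."""
    result = {}
    section = None
    for line in report.splitlines():
        # A section header looks like "CPU baseline  :" with nothing after it.
        header = re.match(r"^(CPU \w+)\s*:\s*$", line)
        if header:
            section = header.group(1)
            continue
        field = re.match(r"^\s*Enabled\s*:\s*(.*)$", line)
        if field and section:
            result[section] = field.group(1).split()
    return result


features = enabled_features(SAMPLE_REPORT)
```

Because the log contains one report for `build_ext` and one for `build_clib`, a full log would need to be split per report first; this sketch keeps only the last value seen for each section.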
+Runtime Trace
+-------------
+TODO
diff --git a/doc/source/reference/simd/index.rst b/doc/source/reference/simd/index.rst
index 4115338e9..c84d948a9 100644
--- a/doc/source/reference/simd/index.rst
+++ b/doc/source/reference/simd/index.rst
@@ -33,5 +33,9 @@ The optimization process in NumPy is carried out in three layers:
NumPy community had a deep discussion before implementing this work,
please check `NEP-38`_ for more clarification.
+.. toctree::
+
+ build-options
+
.. _`NEP-38`: https://numpy.org/neps/nep-0038-SIMD-optimizations.html
diff --git a/doc/source/reference/simd/log_example.txt b/doc/source/reference/simd/log_example.txt
new file mode 100644
index 000000000..b0c732433
--- /dev/null
+++ b/doc/source/reference/simd/log_example.txt
@@ -0,0 +1,79 @@
+########### EXT COMPILER OPTIMIZATION ###########
+Platform :
+ Architecture: x64
+ Compiler : gcc
+
+CPU baseline :
+ Requested : 'min'
+ Enabled : SSE SSE2 SSE3
+ Flags : -msse -msse2 -msse3
+ Extra checks: none
+
+CPU dispatch :
+ Requested : 'max -xop -fma4'
+ Enabled : SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_KNL AVX512_KNM AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL
+ Generated :
+ :
+ SSE41 : SSE SSE2 SSE3 SSSE3
+ Flags : -msse -msse2 -msse3 -mssse3 -msse4.1
+ Extra checks: none
+ Detect : SSE SSE2 SSE3 SSSE3 SSE41
+ : build/src.linux-x86_64-3.9/numpy/core/src/umath/loops_arithmetic.dispatch.c
+ : numpy/core/src/umath/_umath_tests.dispatch.c
+ :
+ SSE42 : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT
+ Flags : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2
+ Extra checks: none
+ Detect : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42
+ : build/src.linux-x86_64-3.9/numpy/core/src/_simd/_simd.dispatch.c
+ :
+ AVX2 : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C
+ Flags : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mavx2
+ Extra checks: none
+ Detect : AVX F16C AVX2
+ : build/src.linux-x86_64-3.9/numpy/core/src/umath/loops_arithm_fp.dispatch.c
+ : build/src.linux-x86_64-3.9/numpy/core/src/umath/loops_arithmetic.dispatch.c
+ : numpy/core/src/umath/_umath_tests.dispatch.c
+ :
+ (FMA3 AVX2) : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C
+ Flags : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2
+ Extra checks: none
+ Detect : AVX F16C FMA3 AVX2
+ : build/src.linux-x86_64-3.9/numpy/core/src/_simd/_simd.dispatch.c
+ : build/src.linux-x86_64-3.9/numpy/core/src/umath/loops_exponent_log.dispatch.c
+ : build/src.linux-x86_64-3.9/numpy/core/src/umath/loops_trigonometric.dispatch.c
+ :
+ AVX512F : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2
+ Flags : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2 -mavx512f
+ Extra checks: AVX512F_REDUCE
+ Detect : AVX512F
+ : build/src.linux-x86_64-3.9/numpy/core/src/_simd/_simd.dispatch.c
+ : build/src.linux-x86_64-3.9/numpy/core/src/umath/loops_arithm_fp.dispatch.c
+ : build/src.linux-x86_64-3.9/numpy/core/src/umath/loops_arithmetic.dispatch.c
+ : build/src.linux-x86_64-3.9/numpy/core/src/umath/loops_exponent_log.dispatch.c
+ : build/src.linux-x86_64-3.9/numpy/core/src/umath/loops_trigonometric.dispatch.c
+ :
+ AVX512_SKX : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD
+ Flags : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2 -mavx512f -mavx512cd -mavx512vl -mavx512bw -mavx512dq
+ Extra checks: AVX512BW_MASK AVX512DQ_MASK
+ Detect : AVX512_SKX
+ : build/src.linux-x86_64-3.9/numpy/core/src/_simd/_simd.dispatch.c
+ : build/src.linux-x86_64-3.9/numpy/core/src/umath/loops_arithmetic.dispatch.c
+ : build/src.linux-x86_64-3.9/numpy/core/src/umath/loops_exponent_log.dispatch.c
+CCompilerOpt.cache_flush[804] : write cache to path -> /home/seiko/work/repos/numpy/build/temp.linux-x86_64-3.9/ccompiler_opt_cache_ext.py
+
+########### CLIB COMPILER OPTIMIZATION ###########
+Platform :
+ Architecture: x64
+ Compiler : gcc
+
+CPU baseline :
+ Requested : 'min'
+ Enabled : SSE SSE2 SSE3
+ Flags : -msse -msse2 -msse3
+ Extra checks: none
+
+CPU dispatch :
+ Requested : 'max -xop -fma4'
+ Enabled : SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_KNL AVX512_KNM AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL
+ Generated : none