author | Matti Picus <matti.picus@gmail.com> | 2020-07-12 16:30:42 +0300 |
---|---|---|
committer | GitHub <noreply@github.com> | 2020-07-12 08:30:42 -0500 |
commit | b234742e2e26ff886f52864c73460fe4916a66d0 (patch) | |
tree | 5f6f90804adf53329d03edefadffd6d1e07a6a1b /doc/source/reference/simd | |
parent | 62fa23c44fb49c1d238e1de4f791ffc3ca4b1d11 (diff) | |
DOC: Add SIMD optimization documentation (gh-15551)
Add documentation for the new build infrastructure and API developed to enable
universal intrinsics. Written by @seiko2plus with some fixes by @mattip.
* DOC: add SIMD optimization doc (seiko2plus)
* DOC: reformat as valid RST
* trim whitespace
* first part of Understanding CPU Dispatching
* update build options and remove implied features, gonna update it later
* add more explanations for the dispatcher and fix doc style
* fix up style
* add figure
* Improve and more explanations for Understanding CPU Dispatching
* fix up syntax
* DOC: tweak formatting
* DOC: more tweaks
* fix rst formatting
* DOC: Generate CPU features tables from CCompilerOpt
* DOC: move files around
* DOC: add comment to top of file
* DOC: rebuild tables, fix links
* DOC: minor copyedits
Co-authored-by: Sayed Adel <seiko@imavr.com>
Co-authored-by: Ross Barnowski <rossbar@berkeley.edu>
Diffstat (limited to 'doc/source/reference/simd')
-rw-r--r-- | doc/source/reference/simd/simd-optimizations-tables.inc | 110 | ||||
-rw-r--r-- | doc/source/reference/simd/simd-optimizations.py | 166 | ||||
-rw-r--r-- | doc/source/reference/simd/simd-optimizations.rst | 497 |
3 files changed, 773 insertions, 0 deletions
diff --git a/doc/source/reference/simd/simd-optimizations-tables.inc b/doc/source/reference/simd/simd-optimizations-tables.inc new file mode 100644 index 000000000..d5b82ee0c --- /dev/null +++ b/doc/source/reference/simd/simd-optimizations-tables.inc @@ -0,0 +1,110 @@ +.. generated via source/reference/simd/simd-optimizations.py + +``X86`` - CPU feature names +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. table:: + :align: left + + ======== ================================================================================================================= + Name Implies + ======== ================================================================================================================= + SSE ``SSE`` ``SSE2`` + SSE2 ``SSE`` ``SSE2`` + SSE3 ``SSE`` ``SSE2`` + SSSE3 ``SSE`` ``SSE2`` ``SSE3`` + SSE41 ``SSE`` ``SSE2`` ``SSE3`` ``SSSE3`` + POPCNT ``SSE`` ``SSE2`` ``SSE3`` ``SSSE3`` ``SSE41`` + SSE42 ``SSE`` ``SSE2`` ``SSE3`` ``SSSE3`` ``SSE41`` ``POPCNT`` + AVX ``SSE`` ``SSE2`` ``SSE3`` ``SSSE3`` ``SSE41`` ``POPCNT`` ``SSE42`` + XOP ``SSE`` ``SSE2`` ``SSE3`` ``SSSE3`` ``SSE41`` ``POPCNT`` ``SSE42`` ``AVX`` + FMA4 ``SSE`` ``SSE2`` ``SSE3`` ``SSSE3`` ``SSE41`` ``POPCNT`` ``SSE42`` ``AVX`` + F16C ``SSE`` ``SSE2`` ``SSE3`` ``SSSE3`` ``SSE41`` ``POPCNT`` ``SSE42`` ``AVX`` + FMA3 ``SSE`` ``SSE2`` ``SSE3`` ``SSSE3`` ``SSE41`` ``POPCNT`` ``SSE42`` ``AVX`` ``F16C`` + AVX2 ``SSE`` ``SSE2`` ``SSE3`` ``SSSE3`` ``SSE41`` ``POPCNT`` ``SSE42`` ``AVX`` ``F16C`` + AVX512F ``SSE`` ``SSE2`` ``SSE3`` ``SSSE3`` ``SSE41`` ``POPCNT`` ``SSE42`` ``AVX`` ``F16C`` ``FMA3`` ``AVX2`` + AVX512CD ``SSE`` ``SSE2`` ``SSE3`` ``SSSE3`` ``SSE41`` ``POPCNT`` ``SSE42`` ``AVX`` ``F16C`` ``FMA3`` ``AVX2`` ``AVX512F`` + ======== ================================================================================================================= + +``X86`` - Group names +~~~~~~~~~~~~~~~~~~~~~ + +.. 
table:: + :align: left + + ========== ===================================================== =========================================================================================================================================================================== + Name Gather Implies + ========== ===================================================== =========================================================================================================================================================================== + AVX512_KNL ``AVX512ER`` ``AVX512PF`` ``SSE`` ``SSE2`` ``SSE3`` ``SSSE3`` ``SSE41`` ``POPCNT`` ``SSE42`` ``AVX`` ``F16C`` ``FMA3`` ``AVX2`` ``AVX512F`` ``AVX512CD`` + AVX512_KNM ``AVX5124FMAPS`` ``AVX5124VNNIW`` ``AVX512VPOPCNTDQ`` ``SSE`` ``SSE2`` ``SSE3`` ``SSSE3`` ``SSE41`` ``POPCNT`` ``SSE42`` ``AVX`` ``F16C`` ``FMA3`` ``AVX2`` ``AVX512F`` ``AVX512CD`` ``AVX512_KNL`` + AVX512_SKX ``AVX512VL`` ``AVX512BW`` ``AVX512DQ`` ``SSE`` ``SSE2`` ``SSE3`` ``SSSE3`` ``SSE41`` ``POPCNT`` ``SSE42`` ``AVX`` ``F16C`` ``FMA3`` ``AVX2`` ``AVX512F`` ``AVX512CD`` + AVX512_CLX ``AVX512VNNI`` ``SSE`` ``SSE2`` ``SSE3`` ``SSSE3`` ``SSE41`` ``POPCNT`` ``SSE42`` ``AVX`` ``F16C`` ``FMA3`` ``AVX2`` ``AVX512F`` ``AVX512CD`` ``AVX512_SKX`` + AVX512_CNL ``AVX512IFMA`` ``AVX512VBMI`` ``SSE`` ``SSE2`` ``SSE3`` ``SSSE3`` ``SSE41`` ``POPCNT`` ``SSE42`` ``AVX`` ``F16C`` ``FMA3`` ``AVX2`` ``AVX512F`` ``AVX512CD`` ``AVX512_SKX`` + AVX512_ICL ``AVX512VBMI2`` ``AVX512BITALG`` ``AVX512VPOPCNTDQ`` ``SSE`` ``SSE2`` ``SSE3`` ``SSSE3`` ``SSE41`` ``POPCNT`` ``SSE42`` ``AVX`` ``F16C`` ``FMA3`` ``AVX2`` ``AVX512F`` ``AVX512CD`` ``AVX512_SKX`` ``AVX512_CLX`` ``AVX512_CNL`` + ========== ===================================================== =========================================================================================================================================================================== + +``IBM/POWER`` ``big-endian`` - CPU feature names 
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. table:: + :align: left + + ==== ================ + Name Implies + ==== ================ + VSX + VSX2 ``VSX`` + VSX3 ``VSX`` ``VSX2`` + ==== ================ + +``IBM/POWER`` ``little-endian mode`` - CPU feature names +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. table:: + :align: left + + ==== ================ + Name Implies + ==== ================ + VSX ``VSX`` ``VSX2`` + VSX2 ``VSX`` ``VSX2`` + VSX3 ``VSX`` ``VSX2`` + ==== ================ + +``ARMHF`` - CPU feature names +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. table:: + :align: left + + ========== =========================================================== + Name Implies + ========== =========================================================== + NEON + NEON_FP16 ``NEON`` + NEON_VFPV4 ``NEON`` ``NEON_FP16`` + ASIMD ``NEON`` ``NEON_FP16`` ``NEON_VFPV4`` + ASIMDHP ``NEON`` ``NEON_FP16`` ``NEON_VFPV4`` ``ASIMD`` + ASIMDDP ``NEON`` ``NEON_FP16`` ``NEON_VFPV4`` ``ASIMD`` + ASIMDFHM ``NEON`` ``NEON_FP16`` ``NEON_VFPV4`` ``ASIMD`` ``ASIMDHP`` + ========== =========================================================== + +``ARM64`` ``AARCH64`` - CPU feature names +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. table:: + :align: left + + ========== =========================================================== + Name Implies + ========== =========================================================== + NEON ``NEON`` ``NEON_FP16`` ``NEON_VFPV4`` ``ASIMD`` + NEON_FP16 ``NEON`` ``NEON_FP16`` ``NEON_VFPV4`` ``ASIMD`` + NEON_VFPV4 ``NEON`` ``NEON_FP16`` ``NEON_VFPV4`` ``ASIMD`` + ASIMD ``NEON`` ``NEON_FP16`` ``NEON_VFPV4`` ``ASIMD`` + ASIMDHP ``NEON`` ``NEON_FP16`` ``NEON_VFPV4`` ``ASIMD`` + ASIMDDP ``NEON`` ``NEON_FP16`` ``NEON_VFPV4`` ``ASIMD`` + ASIMDFHM ``NEON`` ``NEON_FP16`` ``NEON_VFPV4`` ``ASIMD`` ``ASIMDHP`` + ========== =========================================================== + +
\ No newline at end of file diff --git a/doc/source/reference/simd/simd-optimizations.py b/doc/source/reference/simd/simd-optimizations.py new file mode 100644 index 000000000..628356163 --- /dev/null +++ b/doc/source/reference/simd/simd-optimizations.py @@ -0,0 +1,166 @@ +""" +Generate CPU features tables from CCompilerOpt +""" +from os import sys, path +gen_path = path.dirname(path.realpath(__file__)) +from numpy.distutils.ccompiler_opt import CCompilerOpt + +class FakeCCompilerOpt(CCompilerOpt): + fake_info = "" + def __init__(self, *args, **kwargs): + no_cc = None + CCompilerOpt.__init__(self, no_cc, **kwargs) + def dist_compile(self, sources, flags, **kwargs): + return sources + def dist_info(self): + return FakeCCompilerOpt.fake_info + @staticmethod + def dist_log(*args, stderr=False): + # avoid printing + pass + def feature_test(self, name, force_flags=None): + # To speed up + return True + + def gen_features_table(self, features, ignore_groups=True, + field_names=["Name", "Implies"], **kwargs): + rows = [] + for f in features: + is_group = "group" in self.feature_supported.get(f, {}) + if ignore_groups and is_group: + continue + implies = self.feature_sorted(self.feature_implies(f)) + implies = ' '.join(['``%s``' % i for i in implies]) + rows.append([f, implies]) + return self.gen_rst_table(field_names, rows, **kwargs) + + def gen_gfeatures_table(self, features, + field_names=["Name", "Gather", "Implies"], + **kwargs): + rows = [] + for f in features: + gather = self.feature_supported.get(f, {}).get("group", None) + if not gather: + continue + implies = self.feature_sorted(self.feature_implies(f)) + implies = ' '.join(['``%s``' % i for i in implies]) + gather = ' '.join(['``%s``' % i for i in gather]) + rows.append([f, gather, implies]) + return self.gen_rst_table(field_names, rows, **kwargs) + + + def gen_rst_table(self, field_names, rows, margin_left=2): + assert(not rows or len(field_names) == len(rows[0])) + rows.append(field_names) + fld_len = 
len(field_names) + cls_len = [max(len(c[i]) for c in rows) for i in range(fld_len)] + del rows[-1] + padding = 0 + cformat = ' '.join('{:<%d}' % (i+padding) for i in cls_len) + border = cformat.format(*['='*i for i in cls_len]) + + rows = [cformat.format(*row) for row in rows] + # header + rows = [border, cformat.format(*field_names), border] + rows + # footer + rows += [border] + # add left margin + rows = [(' ' * margin_left) + r for r in rows] + return '\n'.join(rows) + +if __name__ == '__main__': + margin_left = 4*1 + ############### x86 ############### + FakeCCompilerOpt.fake_info = "x86_64 gcc" + x64_gcc = FakeCCompilerOpt(cpu_baseline="max") + x86_tables = """\ +``X86`` - CPU feature names +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. table:: + :align: left + +{x86_features} + +``X86`` - Group names +~~~~~~~~~~~~~~~~~~~~~ + +.. table:: + :align: left + +{x86_gfeatures} + +""".format( + x86_features = x64_gcc.gen_features_table( + x64_gcc.cpu_baseline_names(), margin_left=margin_left + ), + x86_gfeatures = x64_gcc.gen_gfeatures_table( + x64_gcc.cpu_baseline_names(), margin_left=margin_left + ) + ) + ############### Power ############### + FakeCCompilerOpt.fake_info = "ppc64 gcc" + ppc64_gcc = FakeCCompilerOpt(cpu_baseline="max") + FakeCCompilerOpt.fake_info = "ppc64le gcc" + ppc64le_gcc = FakeCCompilerOpt(cpu_baseline="max") + ppc64_tables = """\ +``IBM/POWER`` ``big-endian`` - CPU feature names +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. table:: + :align: left + +{ppc64_features} + +``IBM/POWER`` ``little-endian mode`` - CPU feature names +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. 
table:: + :align: left + +{ppc64le_features} + +""".format( + ppc64_features = ppc64_gcc.gen_features_table( + ppc64_gcc.cpu_baseline_names(), margin_left=margin_left + ), + ppc64le_features = ppc64le_gcc.gen_features_table( + ppc64le_gcc.cpu_baseline_names(), margin_left=margin_left + ) + ) + ############### Arm ############### + FakeCCompilerOpt.fake_info = "armhf gcc" + armhf_gcc = FakeCCompilerOpt(cpu_baseline="max") + FakeCCompilerOpt.fake_info = "aarch64 gcc" + aarch64_gcc = FakeCCompilerOpt(cpu_baseline="max") + arm_tables = """\ +``ARMHF`` - CPU feature names +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. table:: + :align: left + +{armhf_features} + +``ARM64`` ``AARCH64`` - CPU feature names +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. table:: + :align: left + +{aarch64_features} + + """.format( + armhf_features = armhf_gcc.gen_features_table( + armhf_gcc.cpu_baseline_names(), margin_left=margin_left + ), + aarch64_features = aarch64_gcc.gen_features_table( + aarch64_gcc.cpu_baseline_names(), margin_left=margin_left + ) + ) + # TODO: diff the difference among all supported compilers + with open(path.join(gen_path, 'simd-optimizations-tables.inc'), 'wt') as fd: + fd.write(f'.. generated via {__file__}\n\n') + fd.write(x86_tables) + fd.write(ppc64_tables) + fd.write(arm_tables) diff --git a/doc/source/reference/simd/simd-optimizations.rst b/doc/source/reference/simd/simd-optimizations.rst new file mode 100644 index 000000000..eb7eb2a83 --- /dev/null +++ b/doc/source/reference/simd/simd-optimizations.rst @@ -0,0 +1,497 @@ +****************** +SIMD Optimizations +****************** + +NumPy provides a set of macros that define `Universal Intrinsics`_ to +abstract out typical platform-specific intrinsics so SIMD code needs to be +written only once. There are three layers: + +- Code is *written* using the universal intrinsic macros, with guards that + will enable use of the macros only when the compiler recognizes them. 
+ In NumPy, these are used to construct multiple ufunc loops. Current policy is + to create three loops: One loop is the default and uses no intrinsics. One + uses the minimum intrinsics required on the architecture. And the third is + written using the maximum set of intrinsics possible. +- At *compile* time, a distutils command is used to define the minimum and + maximum features to support, based on user choice and compiler support. The + appropriate macros are overlayed with the platform / architecture intrinsics, + and the three loops are compiled. +- At *runtime import*, the CPU is probed for the set of supported intrinsic + features. A mechanism is used to grab the pointer to the most appropriate + function, and this will be the one called for the function. + + +Build options for compilation +============================= + +- ``--cpu-baseline``: minimal set of required optimizations. Default + value is ``min`` which provides the minimum CPU features that can + safely run on a wide range of platforms within the processor family. + +- ``--cpu-dispatch``: dispatched set of additional optimizations. + The default value for ``x86`` is ``max -xop -fma4`` which enables all CPU + features, except for AMD legacy features. + +The command arguments are available in ``build``, ``build_clib``, and +``build_ext``. +if ``build_clib`` or ``build_ext`` are not specified by the user, the arguments of +``build`` will be used instead, which also holds the default values. + +Optimization names can be CPU features or groups of features that gather +several features or special options to perform a series of procedures. + + +The following tables show the current supported optimizations sorted from the lowest to the highest interest. + +.. 
include:: simd-optimizations-tables.inc + +Special options +~~~~~~~~~~~~~~~ + +- ``NONE``: enable no features + +- ``NATIVE``: Enables all CPU features that are supported by the current + machine; this operation is based on the compiler flags (``-march=native, -xHost, /QxHost``) + +- ``MIN``: Enables the minimum CPU features that can safely run on a wide range of platforms: + + .. table:: + :align: left + + ====================================== ======================================= + For Arch Returns + ====================================== ======================================= + ``x86`` ``SSE`` ``SSE2`` + ``x86`` ``64-bit mode`` ``SSE`` ``SSE2`` ``SSE3`` + ``IBM/POWER`` ``big-endian mode`` ``NONE`` + ``IBM/POWER`` ``little-endian mode`` ``VSX`` ``VSX2`` + ``ARMHF`` ``NONE`` + ``ARM64`` ``AARCH64`` ``NEON`` ``NEON_FP16`` ``NEON_VFPV4`` + ``ASIMD`` + ====================================== ======================================= + +- ``MAX``: Enables all CPU features supported by the compiler and platform. + +- ``Operators-/+``: remove or add features, useful with options ``MAX``, ``MIN`` and ``NATIVE``. + +NOTES +~~~~~~~~~~~~~ +- CPU features and other options are case-insensitive. + +- The order of the requested optimizations doesn't matter. + +- Either commas or spaces can be used as a separator, e.g. ``--cpu-dispatch``\ = + "avx2 avx512f" or ``--cpu-dispatch``\ = "avx2, avx512f" both work, but the + arguments must be enclosed in quotes. + +- The operator ``+`` is only added for nominal reasons. For example: + ``--cpu-baseline="min avx2"`` is equivalent to ``--cpu-baseline="min + avx2"``. + ``--cpu-baseline="min,avx2"`` is equivalent to ``--cpu-baseline="min,+avx2"`` + +- If the CPU feature is not supported by the user platform or + compiler, it will be skipped rather than raising a fatal error. 
+ +- Any CPU feature specified to ``--cpu-dispatch`` will be skipped if + it's part of the CPU baseline features + +- The ``--cpu-baseline`` argument force-enables implied features, + e.g. ``--cpu-baseline``\ ="sse42" is equivalent to + ``--cpu-baseline``\ ="sse sse2 sse3 ssse3 sse41 popcnt sse42" + +- The value of ``--cpu-baseline`` will be treated as "native" if + compiler native flag ``-march=native`` or ``-xHost`` or ``/QxHost`` is + enabled through environment variable ``CFLAGS`` + +- The validation process for the requested optimizations when it comes to + ``--cpu-baseline`` isn't strict. For example, if the user requested + ``AVX2`` but the compiler doesn't support it then we just skip it and return + the maximum optimization that the compiler can handle depending on the + implied features of ``AVX2``, let us assume ``AVX``. + +- The user should always check the final report through the build log + to verify the enabled features. + +Special cases +~~~~~~~~~~~~~ + +Behaviors and Errors +~~~~~~~~~~~~~~~~~~~~ + + + +Usage and Examples +~~~~~~~~~~~~~~~~~~ + +Report and Trace +~~~~~~~~~~~~~~~~ + +Understanding CPU Dispatching, How the NumPy dispatcher works? +============================================================== + +The NumPy dispatcher is based on multi-source compiling, which means taking +a certain source and compiling it multiple times with different compiler +flags and also with different **C** definitions that affect the code +paths to enable certain instruction-sets for each compiled object +depending on the required optimizations, then combining the returned +objects together. + +.. 
figure:: ../figures/opt-infra.png + +This mechanism should support all compilers and it doesn't require any +compiler-specific extension, but at the same time it adds a few steps to +normal compilation that are explained as follows: + +1- Configuration +~~~~~~~~~~~~~~~~ + +The user configures the required optimizations before starting to build the +source files via the two command arguments as explained above: + +- ``--cpu-baseline``: minimal set of required optimizations. + +- ``--cpu-dispatch``: dispatched set of additional optimizations. + + +2- Discovering the environment +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In this part, we check the compiler and platform architecture +and cache some of the intermediary results to speed up rebuilding. + +3- Validating the requested optimizations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +By testing them against the compiler, and seeing what the compiler can +support according to the requested optimizations. + +4- Generating the main configuration header +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The generated header ``_cpu_dispatch.h`` contains all the definitions and +headers of instruction-sets for the required optimizations that have been +validated during the previous step. + +It also contains extra C definitions that are used for defining NumPy's +Python-level module attributes ``__cpu_baseline__`` and ``__cpu_dispatch__``. + +**What is in this header?** + +The example header was dynamically generated by gcc on an X86 machine. +The compiler supports ``--cpu-baseline="sse sse2 sse3"`` and +``--cpu-dispatch="ssse3 sse41"``, and the result is below. + +.. code:: c + + // The header should be located at numpy/numpy/core/src/common/_cpu_dispatch.h + /**NOTE + ** C definitions prefixed with "NPY_HAVE_" represent + ** the required optimizations. + ** + ** C definitions prefixed with 'NPY__CPU_TARGET_' are protected and + ** shouldn't be used by any NumPy C sources. 
+ */ + /******* baseline features *******/ + /** SSE **/ + #define NPY_HAVE_SSE 1 + #include <xmmintrin.h> + /** SSE2 **/ + #define NPY_HAVE_SSE2 1 + #include <emmintrin.h> + /** SSE3 **/ + #define NPY_HAVE_SSE3 1 + #include <pmmintrin.h> + + /******* dispatch-able features *******/ + #ifdef NPY__CPU_TARGET_SSSE3 + /** SSSE3 **/ + #define NPY_HAVE_SSSE3 1 + #include <tmmintrin.h> + #endif + #ifdef NPY__CPU_TARGET_SSE41 + /** SSE41 **/ + #define NPY_HAVE_SSE41 1 + #include <smmintrin.h> + #endif + +**Baseline features** are the minimal set of required optimizations configured +via ``--cpu-baseline``. They have no preprocessor guards and they're +always on, which means they can be used in any source. + +Does this mean NumPy's infrastructure passes the compiler's flags of +baseline features to all sources? + +Definitely, yes. But the :ref:`dispatch-able sources <dispatchable-sources>` are +treated differently. + +What if the user specifies certain **baseline features** during the +build but at runtime the machine doesn't support even these +features? Will the compiled code be called via one of these definitions, or +maybe the compiler itself auto-generated/vectorized certain piece of code +based on the provided command line compiler flags? + +During the loading of the NumPy module, there's a validation step +which detects this behavior. It will raise a Python runtime error to inform the +user. This is to prevent the CPU reaching an illegal instruction error causing +a segfault. + +**Dispatch-able features** are our dispatched set of additional optimizations +that were configured via ``--cpu-dispatch``. They are not activated by +default and are always guarded by other C definitions prefixed with +``NPY__CPU_TARGET_``. C definitions ``NPY__CPU_TARGET_`` are only +enabled within **dispatch-able sources**. + +.. 
_dispatchable-sources: + +5- Dispatch-able sources and configuration statements +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Dispatch-able sources are special **C** files that can be compiled multiple +times with different compiler flags and also with different **C** +definitions. These affect code paths to enable certain +instruction-sets for each compiled object according to "**the +configuration statements**" that must be declared between a **C** +comment\ ``(/**/)`` and start with a special mark **@targets** at the +top of each dispatch-able source. At the same time, dispatch-able +sources will be treated as normal **C** sources if the optimization was +disabled by the command argument ``--disable-optimization`` . + +**What are configuration statements?** + +Configuration statements are sort of keywords combined together to +determine the required optimization for the dispatch-able source. + +Example: + +.. code:: c + + /*@targets avx2 avx512f vsx2 vsx3 asimd asimdhp */ + // C code + +The keywords mainly represent the additional optimizations configured +through ``--cpu-dispatch``, but it can also represent other options such as: + +- Target groups: pre-configured configuration statements used for + managing the required optimizations from outside the dispatch-able source. + +- Policies: collections of options used for changing the default + behaviors or forcing the compilers to perform certain things. + +- "baseline": a unique keyword represents the minimal optimizations + that configured through ``--cpu-baseline`` + +**Numpy's infrastructure handles dispatch-able sources in four steps**: + +- **(A) Recognition**: Just like source templates and F2PY, the + dispatch-able sources requires a special extension ``*.dispatch.c`` + to mark C dispatch-able source files, and for C++ + ``*.dispatch.cpp`` or ``*.dispatch.cxx`` + **NOTE**: C++ not supported yet. 
+ +- **(B) Parsing and validating**: In this step, the + dispatch-able sources that had been filtered by the previous step + are parsed and validated by the configuration statements for each one + of them one by one in order to determine the required optimizations. + +- **(C) Wrapping**: This is the approach taken by NumPy's + infrastructure, which has proved to be sufficiently flexible in order + to compile a single source multiple times with different **C** + definitions and flags that affect the code paths. The process is + achieved by creating a temporary **C** source for each required + optimization that related to the additional optimization, which + contains the declarations of the **C** definitions and includes the + involved source via the **C** directive **#include**. For more + clarification take a look at the following code for AVX512F : + + .. code:: c + + /* + * this definition is used by NumPy utilities as suffixes for the + * exported symbols + */ + #define NPY__CPU_TARGET_CURRENT AVX512F + /* + * The following definitions enable + * definitions of the dispatch-able features that are defined within the main + * configuration header. These are definitions for the implied features. 
+ */ + #define NPY__CPU_TARGET_SSE + #define NPY__CPU_TARGET_SSE2 + #define NPY__CPU_TARGET_SSE3 + #define NPY__CPU_TARGET_SSSE3 + #define NPY__CPU_TARGET_SSE41 + #define NPY__CPU_TARGET_POPCNT + #define NPY__CPU_TARGET_SSE42 + #define NPY__CPU_TARGET_AVX + #define NPY__CPU_TARGET_F16C + #define NPY__CPU_TARGET_FMA3 + #define NPY__CPU_TARGET_AVX2 + #define NPY__CPU_TARGET_AVX512F + // our dispatch-able source + #include "/the/absolute/path/of/hello.dispatch.c" + +- **(D) Dispatch-able configuration header**: The infrastructure + generates a config header for each dispatch-able source. This header + mainly contains two abstract **C** macros used for identifying the + generated objects, so they can be used for runtime dispatching + certain symbols from the generated objects by any **C** source. It is + also used for forward declarations. + + The generated header takes the name of the dispatch-able source, + replacing its extension with '**.h**'. For example, + assume we have a dispatch-able source called **hello.dispatch.c** that + contains the following: + + .. code:: c + + // hello.dispatch.c + /*@targets baseline sse42 avx512f */ + #include <stdio.h> + #include "numpy/utils.h" // NPY_CAT, NPY_TOSTR + + #ifndef NPY__CPU_TARGET_CURRENT + // wrapping the dispatch-able source only happens to the additional optimizations + // but if the keyword 'baseline' is provided within the configuration statements, + // the infrastructure will add extra compiling for the dispatch-able source by + // passing it as-is to the compiler without any changes. 
+ #define CURRENT_TARGET(X) X + #define NPY__CPU_TARGET_CURRENT baseline // for printing only + #else + // since we reached this point, that means we're dealing with + // the additional optimizations, so it could be SSE42 or AVX512F + #define CURRENT_TARGET(X) NPY_CAT(NPY_CAT(X, _), NPY__CPU_TARGET_CURRENT) + #endif + // Macro 'CURRENT_TARGET' adds the current target as a suffix to the exported symbols, + // to avoid linking duplications. NumPy already has a macro called + // 'NPY_CPU_DISPATCH_CURFX' similar to it, located at + // numpy/numpy/core/src/common/npy_cpu_dispatch.h + // NOTE: we tend not to add suffixes to the baseline exported symbols + void CURRENT_TARGET(simd_whoami)(const char *extra_info) + { + printf("I'm " NPY_TOSTR(NPY__CPU_TARGET_CURRENT) ", %s\n", extra_info); + } + + Now assume you attached **hello.dispatch.c** to the source tree, then + the infrastructure should generate a temporary config header called + **hello.dispatch.h** that can be reached by any source in the source + tree, and it should contain the following code : + + .. code:: c + + #ifndef NPY__CPU_DISPATCH_EXPAND_ + // To expand the macro calls in this header + #define NPY__CPU_DISPATCH_EXPAND_(X) X + #endif + // Undefining the following macros, due to the possibility of including config headers + // multiple times within the same source and since each config header represents + // different required optimizations according to the specified configuration + // statements in the dispatch-able source that it is derived from. + #undef NPY__CPU_DISPATCH_BASELINE_CALL + #undef NPY__CPU_DISPATCH_CALL + // nothing strange here, just a normal preprocessor callback + // enabled only if 'baseline' is specified within the configuration statements + #define NPY__CPU_DISPATCH_BASELINE_CALL(CB, ...) 
\ + NPY__CPU_DISPATCH_EXPAND_(CB(__VA_ARGS__)) + // 'NPY__CPU_DISPATCH_CALL' is an abstract macro is used for dispatching + // the required optimizations that specified within the configuration statements. + // + // @param CHK, Expected a macro that can be used to detect CPU features + // in runtime, which takes a CPU feature name without string quotes and + // returns the testing result in a shape of boolean value. + // NumPy already has macro called "NPY_CPU_HAVE", which fit this requirment. + // + // @param CB, a callback macro that expected to be called multiple times depending + // on the required optimizations, the callback should receive the following arguments: + // 1- The pending calls of @param CHK filled up with the required CPU features, + // that need to be tested first in runtime before executing call belong to + // the compiled object. + // 2- The required optimization name, same as in 'NPY__CPU_TARGET_CURRENT' + // 3- Extra arguments in the macro itself + // + // By default the callback calls are sorted depending on the highest interest + // unless the policy "$keep_sort" was in place within the configuration statements + // see "Dive into the CPU dispatcher" for more clarification. + #define NPY__CPU_DISPATCH_CALL(CHK, CB, ...) \ + NPY__CPU_DISPATCH_EXPAND_(CB((CHK(AVX512F)), AVX512F, __VA_ARGS__)) \ + NPY__CPU_DISPATCH_EXPAND_(CB((CHK(SSE)&&CHK(SSE2)&&CHK(SSE3)&&CHK(SSSE3)&&CHK(SSE41)), SSE41, __VA_ARGS__)) + + An example of using the config header in light of the above: + + .. code:: c + + // NOTE: The following macros are only defined for demonstration purposes only. + // NumPy already has a collections of macros located at + // numpy/numpy/core/src/common/npy_cpu_dispatch.h, that covers all dispatching + // and declarations scenarios. 
+ + #include "numpy/npy_cpu_features.h" // NPY_CPU_HAVE + #include "numpy/utils.h" // NPY_CAT, NPY_EXPAND + + // An example for setting a macro that calls all the exported symbols at once + // after checking if they're supported by the running machine. + #define DISPATCH_CALL_ALL(FN, ARGS) \ + NPY__CPU_DISPATCH_CALL(NPY_CPU_HAVE, DISPATCH_CALL_ALL_CB, FN, ARGS) \ + NPY__CPU_DISPATCH_BASELINE_CALL(DISPATCH_CALL_BASELINE_ALL_CB, FN, ARGS) + // The preprocessor callbacks. + // The same suffixes as we define it in the dispatch-able source. + #define DISPATCH_CALL_ALL_CB(CHECK, TARGET_NAME, FN, ARGS) \ + if (CHECK) { NPY_CAT(NPY_CAT(FN, _), TARGET_NAME) ARGS; } + #define DISPATCH_CALL_BASELINE_ALL_CB(FN, ARGS) \ + FN NPY_EXPAND(ARGS); + + // An example for setting a macro that calls the exported symbols of highest + // interest optimization, after checking if they're supported by the running machine. + #define DISPATCH_CALL_HIGH(FN, ARGS) \ + if (0) {} \ + NPY__CPU_DISPATCH_CALL(NPY_CPU_HAVE, DISPATCH_CALL_HIGH_CB, FN, ARGS) \ + NPY__CPU_DISPATCH_BASELINE_CALL(DISPATCH_CALL_BASELINE_HIGH_CB, FN, ARGS) + // The preprocessor callbacks + // The same suffixes as we define it in the dispatch-able source. + #define DISPATCH_CALL_HIGH_CB(CHECK, TARGET_NAME, FN, ARGS) \ + else if (CHECK) { NPY_CAT(NPY_CAT(FN, _), TARGET_NAME) ARGS; } + #define DISPATCH_CALL_BASELINE_HIGH_CB(FN, ARGS) \ + else { FN NPY_EXPAND(ARGS); } + + // NumPy has a macro called 'NPY_CPU_DISPATCH_DECLARE' can be used + // for forward declrations any kind of prototypes based on + // 'NPY__CPU_DISPATCH_CALL' and 'NPY__CPU_DISPATCH_BASELINE_CALL'. + // However in this example, we just handle it manually. 
+ void simd_whoami(const char *extra_info); + void simd_whoami_AVX512F(const char *extra_info); + void simd_whoami_SSE41(const char *extra_info); + + void trigger_me(void) + { + // bring the auto-generated config header + // which contains config macros 'NPY__CPU_DISPATCH_CALL' and + // 'NPY__CPU_DISPATCH_BASELINE_CALL'. + // it is highly recommended to include the config header before executing + // the dispatching macros in case there's another header in the scope. + #include "hello.dispatch.h" + DISPATCH_CALL_ALL(simd_whoami, ("all")) + DISPATCH_CALL_HIGH(simd_whoami, ("the highest interest")) + // An example of including multiple config headers in the same source + // #include "hello2.dispatch.h" + // DISPATCH_CALL_HIGH(another_function, ("the highest interest")) + } + + +Dive into the CPU dispatcher +============================ + +The baseline +~~~~~~~~~~~~ + +Dispatcher +~~~~~~~~~~ + +Groups and Policies +~~~~~~~~~~~~~~~~~~~ + +Examples +~~~~~~~~ + +Report and Trace +~~~~~~~~~~~~~~~~ + + +.. _`Universal Intrinsics`: https://numpy.org/neps/nep-0038-SIMD-optimizations.html |
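The notes in the added simd-optimizations.rst state that ``--cpu-baseline`` force-enables implied features, so that "sse42" is equivalent to "sse sse2 sse3 ssse3 sse41 popcnt sse42". The following is a minimal Python sketch of that expansion, not the code in this commit. It assumes a hypothetical, simplified DIRECT_IMPLIES table instead of the real data held by CCompilerOpt, and expand_baseline is an illustrative name rather than a NumPy API. It ignores the special options (MIN, MAX, NATIVE) and the -/+ operators.

```python
# Hypothetical, simplified stand-in for the x86 feature data in CCompilerOpt:
# each feature lists only its *direct* prerequisite (an assumption made for
# illustration; the real tables are richer).
DIRECT_IMPLIES = {
    "SSE": [],
    "SSE2": ["SSE"],
    "SSE3": ["SSE2"],
    "SSSE3": ["SSE3"],
    "SSE41": ["SSSE3"],
    "POPCNT": ["SSE41"],
    "SSE42": ["POPCNT"],
}
# Position in this list stands in for "interest", lowest first.
INTEREST_ORDER = list(DIRECT_IMPLIES)


def expand_baseline(option):
    """Expand a --cpu-baseline style string into the full, sorted feature list."""
    # Commas and spaces are both accepted as separators; names are
    # case-insensitive, as the notes describe.
    requested = option.replace(",", " ").upper().split()
    enabled = set()
    stack = list(requested)
    while stack:
        name = stack.pop()
        if name not in DIRECT_IMPLIES:
            # Unknown or unsupported names are skipped rather than raising.
            continue
        if name in enabled:
            continue
        enabled.add(name)
        # Follow the implication chain so prerequisites are enabled too.
        stack.extend(DIRECT_IMPLIES[name])
    return [f for f in INTEREST_ORDER if f in enabled]


if __name__ == "__main__":
    print(" ".join(expand_baseline("sse42")).lower())
```

Under these assumptions, expanding "sse42" yields the same seven names the notes list, and the order of the requested names does not affect the result because the output is always re-sorted by interest.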