libsimdpp
0.9.3
libsimdpp is a header-only, zero-overhead C++ wrapper around SIMD intrinsics. It supports multiple instruction sets via a single interface. The same source code may be compiled for different instruction sets and linked into the same resulting binary. The library provides a convenient dynamic dispatch mechanism to select the fastest version of a function for the target processor.
To use the library, define one or more macros that specify the instruction set (architecture) of the target processor and then include simdpp/simd.h. The following instruction sets are supported:
- NONE_NULL: The instructions are not vectorized and use standard C++. This instruction set is used if no SIMD instruction set is selected (no macro defined).
- X86_SSE2: The x86/x86_64 SSE and SSE2 instruction sets are used. Macro: SIMDPP_ARCH_X86_SSE2.
- X86_SSE3: The x86/x86_64 SSE3 instruction set is used. The SSE and SSE2 instruction set support is required implicitly (no need to define the macros for these instruction sets). Macro: SIMDPP_ARCH_X86_SSE3.
- X86_SSSE3: The x86/x86_64 SSSE3 instruction set is used. The SSE, SSE2 and SSE3 instruction set support is required implicitly (no need to define the macros for these instruction sets). Macro: SIMDPP_ARCH_X86_SSSE3.
- X86_SSE4.1: The x86/x86_64 SSE4.1 instruction set is used. The SSE, SSE2 and SSE3 instruction set support is required implicitly (no need to define the macros for these instruction sets). Macro: SIMDPP_ARCH_X86_SSE4_1.
- X86_AVX: The x86/x86_64 AVX instruction set is used. The SSE, SSE2, SSE3 and SSSE3 instruction set support is required implicitly (no need to define the macros for these instruction sets). Macro: SIMDPP_ARCH_X86_AVX.
- X86_AVX2: The x86/x86_64 AVX2 instruction set is used. The SSE, SSE2, SSE3, SSSE3 and AVX instruction set support is required implicitly (no need to define the macros for these instruction sets). Macro: SIMDPP_ARCH_X86_AVX2.
- X86_FMA3: The Intel x86/x86_64 FMA3 instruction set is used. The SSE, SSE2 and SSE3 instruction set support is required implicitly (no need to define the macros for these instruction sets). This instruction set must not be combined with X86_FMA4. Macro: SIMDPP_ARCH_X86_FMA3.
- X86_FMA4: The AMD x86/x86_64 FMA4 instruction set is used. The SSE, SSE2 and SSE3 instruction set support is required implicitly (no need to define the macros for these instruction sets). This instruction set must not be combined with X86_FMA3. Macro: SIMDPP_ARCH_X86_FMA4.
- X86_XOP: The AMD x86/x86_64 XOP instruction set is used. The SSE, SSE2 and SSE3 instruction set support is required implicitly (no need to define the macros for these instruction sets). Macro: SIMDPP_ARCH_X86_XOP.
- ARM_NEON: The ARM NEON instruction set. The VFP co-processor is used for any floating-point functionality (NEON does not require the implementation to be IEEE-754 compliant, whereas VFP does). Macro: SIMDPP_ARCH_ARM_NEON.
- ARM_NEON_FLT_SP: Performs 32-bit floating-point computations on the NEON vector unit. The NEON instruction set support is required implicitly (no need to define the macro for that instruction set). Macro: SIMDPP_ARCH_ARM_NEON_FLT_SP.
- POWER_ALTIVEC: The POWER Altivec instruction set. 64-bit floating-point operations are not supported. Macro: SIMDPP_ARCH_POWER_ALTIVEC.
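As a rough illustration of how the selection macros listed above reach the code, the sketch below reports which of them are visible to a translation unit. The helper `selected_arch_macros` is hypothetical and not part of libsimdpp. It only inspects the macro names from the list, so it compiles without the library. In real code the macro would be defined (or passed as a compiler option such as `-DSIMDPP_ARCH_X86_SSE2`) before including simdpp/simd.h.

```cpp
#include <string>

// The list above states that FMA3 and FMA4 must not be combined,
// so a build can reject that configuration early.
#if defined(SIMDPP_ARCH_X86_FMA3) && defined(SIMDPP_ARCH_X86_FMA4)
#error "X86_FMA3 and X86_FMA4 must not be combined"
#endif

// Hypothetical helper (not a libsimdpp function): returns a comma-separated
// list of the instruction sets whose selection macros are defined, or
// "NONE_NULL" when no macro is defined.
std::string selected_arch_macros()
{
    std::string out;
    auto add = [&out](const char* name) {
        if (!out.empty()) out += ",";
        out += name;
    };
#ifdef SIMDPP_ARCH_X86_SSE2
    add("X86_SSE2");
#endif
#ifdef SIMDPP_ARCH_X86_SSE3
    add("X86_SSE3");
#endif
#ifdef SIMDPP_ARCH_X86_SSSE3
    add("X86_SSSE3");
#endif
#ifdef SIMDPP_ARCH_X86_SSE4_1
    add("X86_SSE4.1");
#endif
#ifdef SIMDPP_ARCH_X86_AVX
    add("X86_AVX");
#endif
#ifdef SIMDPP_ARCH_X86_AVX2
    add("X86_AVX2");
#endif
#ifdef SIMDPP_ARCH_X86_FMA3
    add("X86_FMA3");
#endif
#ifdef SIMDPP_ARCH_X86_FMA4
    add("X86_FMA4");
#endif
#ifdef SIMDPP_ARCH_X86_XOP
    add("X86_XOP");
#endif
#ifdef SIMDPP_ARCH_ARM_NEON
    add("ARM_NEON");
#endif
#ifdef SIMDPP_ARCH_ARM_NEON_FLT_SP
    add("ARM_NEON_FLT_SP");
#endif
#ifdef SIMDPP_ARCH_POWER_ALTIVEC
    add("POWER_ALTIVEC");
#endif
    if (out.empty()) add("NONE_NULL");  // no macro defined: scalar fallback
    return out;
}
```

Compiled without any selection macro, the helper reports NONE_NULL, which matches the default described in the list.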
Instruction counts
In this documentation all functions that map to more than one instruction are marked as such by listing the number of instructions that are used to implement a function. The instructions are counted as follows:
- Any register-register moves and copies that do not cross the processor domains are ignored.
- Non-vector domain instructions are ignored except when they move data to or from memory or the vector domain.
- If the implementation of a function depends on template arguments (for example, an element selector), then the instruction count is defined as a range with both lower and upper bounds.
- If the function loads or computes static data, then the instruction count is defined as a range. The lower count is calculated as if the loads from memory or the computation did not happen (for example, if the function was used in a small loop and there were enough registers to cache the data). The upper count is calculated as if they happened on every call.
If the instruction count is not listed for a specific architecture, then the function maps directly to one instruction. This rule does not apply to the following architectures: X86_FMA3, X86_FMA4 and X86_XOP. For these, if the instruction count is not listed, the function should be treated as not supported on that architecture.
Note that the instruction count is a very imprecise way to measure performance. It is provided only as a quick way to estimate how well specific functionality maps to the target architecture.
Dynamic dispatch
If the user wants to include several versions of the same code, compiled for different architectures, in the same executable, then all such code must be put into the SIMDPP_ARCH_NAMESPACE namespace. This macro evaluates to an identifier that is unique for each architecture.
In addition, the source file must not define any of the architecture selection macros; they must be supplied via compiler options. The code for the NONE_NULL architecture must be linked into the resulting executable.
To use the dynamic dispatch mechanism, declare the function within the SIMDPP_ARCH_NAMESPACE namespace and then use one of the SIMDPP_MAKE_DISPATCHER_*** macros.
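The idea behind the mechanism can be sketched without the library. The block below is a conceptual illustration only and does not use the libsimdpp API. The namespaces `arch_null`, `arch_sse2` and `arch_sse4_1` are hypothetical stand-ins for what SIMDPP_ARCH_NAMESPACE would name in separately compiled objects, and the `runnable` flags stand in for a runtime CPU feature query.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins: in a real build each of these would be the same
// source file compiled with a different architecture macro.
namespace arch_null   { int arch_id() { return 0; } }
namespace arch_sse2   { int arch_id() { return 1; } }
namespace arch_sse4_1 { int arch_id() { return 2; } }

using Fn = int (*)();

// One candidate version of the dispatched function.
struct Version {
    int rank;       // higher rank means a preferred ("faster") instruction set
    bool runnable;  // assumed result of a runtime CPU feature check
    Fn fn;          // the version compiled for that instruction set
};

// Returns the highest-ranked version that the processor can run.
Fn select_version(const std::vector<Version>& versions)
{
    const Version* best = nullptr;
    for (const Version& v : versions) {
        if (v.runnable && (best == nullptr || v.rank > best->rank)) {
            best = &v;
        }
    }
    // The scalar fallback is always runnable, so a version must exist.
    assert(best != nullptr);
    return best->fn;
}

// Builds the candidate table for a processor with the given (pretend) features.
std::vector<Version> make_versions(bool has_sse2, bool has_sse4_1)
{
    return {
        {0, true,       &arch_null::arch_id},
        {1, has_sse2,   &arch_sse2::arch_id},
        {2, has_sse4_1, &arch_sse4_1::arch_id},
    };
}
```

Calling `select_version(make_versions(true, false))()` runs the SSE2 stand-in. The always-runnable scalar entry mirrors the requirement that the NONE_NULL code be linked into the executable.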
Dynamic dispatch example
The following example demonstrates the simplest usage of dynamic dispatch:
If compiled, the above example selects the "fastest" of the SSE2, SSE3 or SSE4.1 instruction sets, whichever is available on the target processor, and outputs an integer that identifies that instruction set.
Note that the object files must be linked directly into the executable. If static libraries are used, the linker may discard the static dispatcher registration code and break the mechanism. To prevent this behavior, -Wl,--whole-archive or an equivalent flag must be used.
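To see why a static library can break registration, consider the general self-registration pattern sketched below. This is an assumption about the style of mechanism involved, not libsimdpp's actual code, and `Registrar`, `registry` and `registered_count` are hypothetical names. Each version is registered by the constructor of a static object that nothing else in the program refers to. A linker pulls a member out of a static library only when it resolves an undefined symbol, so an object file containing only such a registrar may be omitted and its constructor never runs.

```cpp
#include <cstddef>
#include <vector>

using Fn = int (*)();

// Function-local static: safely constructed on first use, regardless of the
// order in which the static registrars below are initialized.
std::vector<Fn>& registry()
{
    static std::vector<Fn> fns;
    return fns;
}

// Hypothetical registrar: its constructor records one function version.
struct Registrar {
    explicit Registrar(Fn fn) { registry().push_back(fn); }
};

namespace arch_null {
int arch_id() { return 0; }
// Runs before main(). No other code names this object, which is exactly
// why a linker may drop the object file when it lives in a static library.
static Registrar registrar(&arch_id);
}

namespace arch_sse2 {
int arch_id() { return 1; }
static Registrar registrar(&arch_id);
}

// Number of versions that actually registered themselves.
std::size_t registered_count() { return registry().size(); }
```

When both definitions are linked directly into the executable, as here, both registrars run and `registered_count()` is 2.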
CMake
For CMake users, cmake/SimdppMultiarch.cmake contains several useful functions:
- simdpp_get_compilable_archs: checks which architectures are supported by the compiler.
- simdpp_get_runnable_archs: checks which architectures are supported by both the compiler and the current processor.
- simdpp_multiversion: given a list of architectures (possibly generated by simdpp_get_compilable_archs or simdpp_get_runnable_archs), automatically configures compilation of additional objects. The user only needs to add the returned list of source files to add_library or add_executable.
The above example may be built with a CMakeLists.txt as simple as the following:
Generated on Thu Oct 31 2013 04:08:52 for libsimdpp by Doxygen 1.8.3.1