libsimdpp  0.9.3
libsimdpp Documentation

libsimdpp is a header-only, zero-overhead C++ wrapper around SIMD intrinsics. It supports multiple instruction sets via a single interface. The same source code may be compiled for different instruction sets and linked into the same resulting binary. The library provides a convenient dynamic dispatch mechanism that selects the fastest version of a function for the target processor.

To use the library, define one or more macros that specify the instruction set (architecture) of the target processor and then include simdpp/simd.h. The following instruction sets are supported:

  • NONE_NULL:

    No SIMD instructions are used; the implementation falls back to standard C++. This instruction set is selected when no SIMD instruction set is requested (no macro defined).

  • X86_SSE2:

    The x86/x86_64 SSE and SSE2 instruction sets are used.

    Macro: SIMDPP_ARCH_X86_SSE2.

  • X86_SSE3:

    The x86/x86_64 SSE3 instruction set is used. The SSE and SSE2 instruction set support is required implicitly (no need to define the macros for these instruction sets).

    Macro: SIMDPP_ARCH_X86_SSE3.

  • X86_SSSE3:

    The x86/x86_64 SSSE3 instruction set is used. The SSE, SSE2 and SSE3 instruction set support is required implicitly (no need to define the macros for these instruction sets).

    Macro: SIMDPP_ARCH_X86_SSSE3.

  • X86_SSE4.1:

    The x86/x86_64 SSE4.1 instruction set is used. The SSE, SSE2 and SSE3 instruction set support is required implicitly (no need to define the macros for these instruction sets).

    Macro: SIMDPP_ARCH_X86_SSE4_1.

  • X86_AVX:

    The x86/x86_64 AVX instruction set is used. The SSE, SSE2, SSE3 and SSSE3 instruction set support is required implicitly (no need to define the macros for these instruction sets).

    Macro: SIMDPP_ARCH_X86_AVX.

  • X86_AVX2:

    The x86/x86_64 AVX2 instruction set is used. The SSE, SSE2, SSE3, SSSE3 and AVX instruction set support is required implicitly (no need to define the macros for these instruction sets).

    Macro: SIMDPP_ARCH_X86_AVX2.

  • X86_FMA3:

    The Intel x86/x86_64 FMA3 instruction set is used. The SSE, SSE2, SSE3 instruction set support is required implicitly (no need to define the macros for these instruction sets). This instruction set must not be combined with X86_FMA4.

    Macro: SIMDPP_ARCH_X86_FMA3.

  • X86_FMA4:

    The AMD x86/x86_64 FMA4 instruction set is used. The SSE, SSE2, SSE3 instruction set support is required implicitly (no need to define the macros for these instruction sets). This instruction set must not be combined with X86_FMA3.

    Macro: SIMDPP_ARCH_X86_FMA4.

  • X86_XOP:

    The AMD x86/x86_64 XOP instruction set is used. The SSE, SSE2, SSE3 instruction set support is required implicitly (no need to define the macros for these instruction sets).

    Macro: SIMDPP_ARCH_X86_XOP.

  • ARM_NEON:

    The ARM NEON instruction set. The VFP co-processor is used for any floating-point functionality (NEON does not require the implementation to be IEEE-754 compliant, whereas VFP does).

    Macro: SIMDPP_ARCH_ARM_NEON.

  • ARM_NEON_FLT_SP:

    Performs 32-bit floating-point computations on the NEON vector unit. The NEON instruction set support is required implicitly (no need to define the macro for that instruction set).

    Macro: SIMDPP_ARCH_ARM_NEON_FLT_SP.

  • POWER_ALTIVEC:

    The POWER Altivec instruction set. 64-bit floating point operations are not supported.

    Macro: SIMDPP_ARCH_POWER_ALTIVEC.

Instruction counts

In this documentation, all functions that map to more than one instruction are marked as such by listing the number of instructions used to implement them. The instructions are counted as follows:

  • Any register-register moves and copies that do not cross processor domains are ignored.
  • Non-vector domain instructions are ignored, except when they move data to or from memory or the vector domain.
  • If the implementation of a function depends on template arguments (for example, an element selector), then the instruction count is given as a range with both lower and upper bounds.
  • If the function loads or computes static data, then the instruction count is given as a range. The lower bound is calculated as if the loads from memory or the computation did not happen (for example, if the function were used in a small loop and there were enough registers to cache the data); the upper bound assumes they happen on every invocation.

If no instruction count is listed for a specific architecture, then the function maps directly to one instruction. This rule does not apply to the following architectures:

X86_FMA3, X86_FMA4 and X86_XOP.

For these, if no instruction count is listed, the function should be treated as unsupported on that architecture.

Note that instruction count is a very imprecise way to measure performance. It is provided only as a quick way to estimate how well specific functionality maps to the target architecture.

Dynamic dispatch

If the user wants to include several versions of the same code, compiled for different architectures, in the same executable, then all such code must be put into the SIMDPP_ARCH_NAMESPACE namespace. This macro evaluates to an identifier that is unique for each architecture.

In addition to the above, the source file must not define any of the architecture-selection macros; they must be supplied via the compiler options. The code for the NONE_NULL architecture must be linked into the resulting executable.

To use the dynamic dispatch mechanism, declare the function within the SIMDPP_ARCH_NAMESPACE namespace and then use one of the SIMDPP_MAKE_DISPATCHER_*** macros.

Dynamic dispatch example

The following example demonstrates the simplest usage of dynamic dispatch:

// test.h
void print_arch();

// test.cc
#include "test.h"
#include <simdpp/simd.h>
#include <iostream>

namespace SIMDPP_ARCH_NAMESPACE {

void print_arch()
{
    std::cout << static_cast<unsigned>(simdpp::this_compile_arch()) << '\n';
}

} // namespace SIMDPP_ARCH_NAMESPACE

SIMDPP_MAKE_DISPATCHER_VOID0(print_arch);

// main.cc
#include "test.h"

int main()
{
    print_arch();
}
# Makefile
CXXFLAGS=-std=c++11

test: main.o test_null.o test_sse2.o test_sse3.o test_sse4_1.o
	g++ $^ -lpthread -o test

main.o: main.cc
	g++ main.cc $(CXXFLAGS) -c -o main.o

# inclusion of NONE_NULL is mandatory
test_null.o: test.cc
	g++ test.cc -c $(CXXFLAGS) -o test_null.o

test_sse2.o: test.cc
	g++ test.cc -c $(CXXFLAGS) -DSIMDPP_ARCH_X86_SSE2 -msse2 -o test_sse2.o

test_sse3.o: test.cc
	g++ test.cc -c $(CXXFLAGS) -DSIMDPP_ARCH_X86_SSE3 -msse3 -o test_sse3.o

test_sse4_1.o: test.cc
	g++ test.cc -c $(CXXFLAGS) -DSIMDPP_ARCH_X86_SSE4_1 -msse4.1 -o test_sse4_1.o

When compiled and run, the above example selects the fastest of the SSE2, SSE3 and SSE4.1 instruction sets that is available on the target processor and outputs an integer that identifies that instruction set.

Note that the object files must be linked directly into the executable. If static libraries are used, the linker may discard the static dispatcher registration code and break the mechanism. To prevent this behavior, -Wl,--whole-archive or an equivalent flag must be used.

CMake

For CMake users, cmake/SimdppMultiarch.cmake contains several useful functions:

  • simdpp_get_compilable_archs: checks what architectures are supported by the compiler.
  • simdpp_get_runnable_archs: checks what architectures are supported by both the compiler and the current processor.
  • simdpp_multiarch: given a list of architectures (possibly generated by simdpp_get_compilable_archs or simdpp_get_runnable_archs), automatically configures compilation of additional objects. The user only needs to add the returned list of source files to add_library or add_executable.

The above example may be built with a CMakeLists.txt as simple as the following:

cmake_minimum_required(VERSION 2.8.0)
project(test)
include(SimdppMultiarch)
simdpp_get_runnable_archs(RUNNABLE_ARCHS)
simdpp_multiarch(GEN_ARCH_FILES test.cc ${RUNNABLE_ARCHS})
add_executable(test main.cc ${GEN_ARCH_FILES})
target_link_libraries(test pthread)
set_target_properties(test PROPERTIES COMPILE_FLAGS "-std=c++11")