Join us at Microsoft Ignite in Chicago, Nov 19-21. Booth 614! ->->

Blog / Engineering

CPU Dispatch in ClickHouse

author avatar
Maksim Kita
Oct 3, 2023 - 20 minutes read

Overview

In this post, I will describe how vectorization works, what CPU dispatch is, how to find places for CPU dispatch optimizations and how we use CPU dispatch in ClickHouse.

First, let’s describe our problem. Hardware vendors constantly add new instructions to the instruction set of modern CPUs. And we often want to use the latest instructions for optimizations, and the most important are SIMD instructions. But the main issue with this is the compatibility. For example, if your program is compiled with AVX2 instruction set, and your CPU supports only SSE4.2, then if it runs such a program, you will get an illegal instruction signal (SIGILL).

Also, an important thing to note is that data structures and algorithms can be specifically designed around SIMD instructions, for example, modern integer compression codecs, or ported to such instructions later on, for example, JSON parsing.

To improve performance while retaining compatibility with older hardware, parts of your code can be compiled for different instruction sets, and then at runtime the program can dispatch execution to the most performant variant.

For any examples in this post I will use the clang-15 compiler.

Vectorization basics

Vectorization is an optimization where your data is processed using vector operations instead of scalar operations. Modern CPUs have specific instructions that allow you to process your data in vectors using SIMD instructions. Such optimizations can be performed manually, or compilers can perform automatic vectorization.

Let’s consider such code example:

void plus(int64_t * __restrict a, int64_t * __restrict b, int64_t * __restrict c, size_t size)
{
   for (size_t i = 0; i < size; ++i) {
       c[i] = b[i] + a[i];
   }
}

We have a plus function that takes 3 pointers to a, b and c arrays and size of these arrays. This function computes the sum of a and b array elements and writes it into array c.

If we compile this code without loop unrolling by specifying the option fno-unroll-loops, and with AVX2 support specifying via the option -mavx2, the following assembly will be generated:

$ /usr/bin/clang++-15 -mavx2 -fno-unroll-loops -O3 -S vectorization_example.cpp
# %bb.0:
	testq	%rcx, %rcx
	je	.LBB0_7
# %bb.1:
	cmpq	$4, %rcx
	jae	.LBB0_3
# %bb.2:
	xorl	%r8d, %r8d
	jmp	.LBB0_6
.LBB0_3:
	movq	%rcx, %r8
	andq	$-4, %r8
	xorl	%eax, %eax
	.p2align	4, 0x90
.LBB0_4:                                # =>This Inner Loop Header: Depth=1
	vmovdqu	(%rdi,%rax,8), %ymm0
	vpaddq	(%rsi,%rax,8), %ymm0, %ymm0
	vmovdqu	%ymm0, (%rdx,%rax,8)
	addq	$4, %rax
	cmpq	%rax, %r8
	jne	.LBB0_4
# %bb.5:
	cmpq	%rcx, %r8
	je	.LBB0_7
	.p2align	4, 0x90
.LBB0_6:                                # =>This Inner Loop Header: Depth=1
	movq	(%rdi,%r8,8), %rax
	addq	(%rsi,%r8,8), %rax
	movq	%rax, (%rdx,%r8,8)
	incq	%r8
	cmpq	%r8, %rcx
	jne	.LBB0_6
.LBB0_7:
	vzeroupper
	retq

In the final assembly, there are two loops. The vectorized loop that processes 4 elements at a time:

.LBB0_4:                                # =>This Inner Loop Header: Depth=1
	vmovdqu	(%rdi,%rax,8), %ymm0
	vpaddq	(%rsi,%rax,8), %ymm0, %ymm0
	vmovdqu	%ymm0, (%rdx,%rax,8)
	addq	$4, %rax
	cmpq	%rax, %r8
	jne	.LBB0_4

And the Scalar loop:

.LBB0_6:                                # =>This Inner Loop Header: Depth=1
	movq	(%rdi,%r8,8), %rax
	addq	(%rsi,%r8,8), %rax
	movq	%rax, (%rdx,%r8,8)
	incq	%r8
	cmpq	%r8, %rcx
	jne	.LBB0_6

At the beginning of the functions assembly, there is a check depending on the arrays size which determines which loop is chosen:

# %bb.1:
	cmpq	$4, %rcx
	jae	.LBB0_3
# %bb.2:
	xorl	%r8d, %r8d
	jmp	.LBB0_6

Additionally, an important thing to note is vzeroupper instruction. The compiler inserts this to avoid the penalty of mixing SSE and VEX AVX instructions. You can read more about it in Agner Fog Optimizing subroutines in assembly language: An optimization guide for x86 platforms. 13.2 Mixing VEX and SSE code.

Another important thing to note is the __restrict keyword on the input array pointers. It tells the compiler that the function arguments do not alias each other. This means in particular that they do not point to overlapping memory regions. If __restrict is not specified, the compiler will either not vectorize the loop at all or vectorize it only after doing a costly runtime check at the beginning of the function to make sure that the arrays do indeed not overlap.

Additionally, if we compile this example without fno-unroll-loops and look at the generated loop, we will see that the compiler unrolled the vectorized loop, which now processes 16 elements at a time.

.LBB0_4:                                # =>This Inner Loop Header: Depth=1
	vmovdqu	(%rdi,%rax,8), %ymm0
	vmovdqu	32(%rdi,%rax,8), %ymm1
	vmovdqu	64(%rdi,%rax,8), %ymm2
	vmovdqu	96(%rdi,%rax,8), %ymm3
	vpaddq	(%rsi,%rax,8), %ymm0, %ymm0
	vpaddq	32(%rsi,%rax,8), %ymm1, %ymm1
	vpaddq	64(%rsi,%rax,8), %ymm2, %ymm2
	vpaddq	96(%rsi,%rax,8), %ymm3, %ymm3
	vmovdqu	%ymm0, (%rdx,%rax,8)
	vmovdqu	%ymm1, 32(%rdx,%rax,8)
	vmovdqu	%ymm2, 64(%rdx,%rax,8)
	vmovdqu	%ymm3, 96(%rdx,%rax,8)
	addq	$16, %rax
	cmpq	%rax, %r8
	jne	.LBB0_4

There is a very useful tool that can help you identify places where the compiler does or does not perform vectorization to avoid assembly checking. You can add -Rpass=loop-vectorize, -Rpass-missed=loop-vectorize and -Rpass-analysis=loop-vectorize options to clang. There are similar options for gcc.

If we compile our example with these options, there will be the following output:

$ /usr/bin/clang++-15 -mavx2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize -O3

vectorization_example.cpp:7:5: remark: vectorized loop (vectorization width: 4, interleaved count: 4) [-Rpass=loop-vectorize]
    for (size_t i = 0; i < size; ++i) {

Now let’s consider another example:

class SumFunction
{
public:
    void sumIf(int64_t * values, int8_t * filter, size_t size);

    int64_t sum = 0;
};

void SumFunction::sumIf(int64_t * values, int8_t * filter, size_t size)
{
    for (size_t i = 0; i < size; ++i) {
        sum += filter[i] ? 0 : values[i];
    }
}
/usr/bin/clang++-15 -mavx2 -O3 -Rpass-analysis=loop-vectorize -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -c vectorization_example.cpp

...

vectorization_example.cpp:28:9: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
        for (size_t i = 0; i < size; ++i) {

In case the compiler cannot perform vectorization. There are two possible scenarios:

  1. You can try to modify your code, so it can be vectorized. In some complex scenarios, you may need to redesign your data representation. I highly encourage you to check the LLVM documentation and gcc documentation that can help you understand in which cases auto-vectorization can or cannot be performed.
  2. You can vectorize the loop manually using intrinsics. This option is less preferred because of the additional maintenance.

To fix the issue in our example, we need to make a local sum inside the function:

class SumFunction
{
public:
    void sumIf(int64_t * values, int8_t * filter, size_t size);

    int64_t sum = 0;
};

void SumFunction::sumIf(int64_t * values, int8_t * filter, size_t size)
{
    int64_t local_sum = 0;

    for (size_t i = 0; i < size; ++i) {
        local_sum += filter[i] ? 0 : values[i];
    }

    sum += local_sum;
}

Such code example is vectorized by the compiler:

/usr/bin/clang++-15 -mavx2 -O3 -Rpass-analysis=loop-vectorize -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -c vectorization_example.cpp

vectorization_example.cpp:31:5: remark: vectorized loop (vectorization width: 4, interleaved count: 4) [-Rpass=loop-vectorize]
    for (size_t i = 0; i < size; ++i) {

In the resulting assembly vectorized loop looks like this:

.LBB0_5:                                # =>This Inner Loop Header: Depth=1
	vmovd	(%rdx,%rax), %xmm5              # xmm5 = mem[0],zero,zero,zero
	vmovd	4(%rdx,%rax), %xmm6             # xmm6 = mem[0],zero,zero,zero
	vmovd	8(%rdx,%rax), %xmm7             # xmm7 = mem[0],zero,zero,zero
	vmovd	12(%rdx,%rax), %xmm1            # xmm1 = mem[0],zero,zero,zero
	vpcmpeqb	%xmm5, %xmm8, %xmm5
	vpmovsxbq	%xmm5, %ymm5
	vpcmpeqb	%xmm6, %xmm8, %xmm6
	vpmovsxbq	%xmm6, %ymm6
	vpcmpeqb	%xmm7, %xmm8, %xmm7
	vpmovsxbq	%xmm7, %ymm7
	vpcmpeqb	%xmm1, %xmm8, %xmm1
	vpmaskmovq	-96(%r8,%rax,8), %ymm5, %ymm5
	vpmovsxbq	%xmm1, %ymm1
	vpmaskmovq	-64(%r8,%rax,8), %ymm6, %ymm6
	vpaddq	%ymm0, %ymm5, %ymm0
	vpmaskmovq	-32(%r8,%rax,8), %ymm7, %ymm5
	vpaddq	%ymm2, %ymm6, %ymm2
	vpmaskmovq	(%r8,%rax,8), %ymm1, %ymm1
	vpaddq	%ymm3, %ymm5, %ymm3
	vpaddq	%ymm4, %ymm1, %ymm4
	addq	$16, %rax
	cmpq	%rax, %r9
	jne	.LBB0_5

CPU Dispatch basics

CPU dispatch is a technique when there are multiple compiled versions of your code for different CPU features, and in runtime, your program detects which CPU features your machine has and uses the most performant version in runtime. The most important instruction sets you want to check are SSE4.2, AVX, AVX2, and AVX-512.

To implement CPU dispatch, first, we need to use the CPUID instruction to check if particular features are supported for the current CPU.

You can call the cpuid instruction using inline assembly or using a cpuid.h header where these functions are defined:

/* x86-64 uses %rbx as the base register, so preserve it. */
#define __cpuid(__leaf, __eax, __ebx, __ecx, __edx) \
   __asm("  xchgq  %%rbx,%q1\n" \
         "  cpuid\n" \
         "  xchgq  %%rbx,%q1" \
       : "=a"(__eax), "=r" (__ebx), "=c"(__ecx), "=d"(__edx) \
       : "0"(__leaf))

#define __cpuid_count(__leaf, __count, __eax, __ebx, __ecx, __edx) \
   __asm("  xchgq  %%rbx,%q1\n" \
         "  cpuid\n" \
         "  xchgq  %%rbx,%q1" \
       : "=a"(__eax), "=r" (__ebx), "=c"(__ecx), "=d"(__edx) \
       : "0"(__leaf), "2"(__count))
#endif

Next, to check if some CPU feature is supported, you need to check Intel Software Optimization Reference Manual Chapter 5 manual for specific instructions. For example, for SSE4.2:

bool hasSSE42()
{
    uint32_t eax = 0;
    uint32_t ebx = 0;
    uint32_t ecx = 0;
    uint32_t edx = 0;

    __cpuid(0x1, eax, ebx, ecx, edx);

    return (ecx >> 20) & 1ul;
}

Now we need to compile our function with different instructions. In clang, there is target attribute that can do exactly that. In gcc, there is the same attribute. For example:

void plusDefault(int64_t * __restrict a, int64_t * __restrict b, int64_t * __restrict c, size_t size)
{
    for (size_t i = 0; i < size; ++i) {
        c[i] = a[i] + b[i];
    }
}

__attribute__((target("sse,sse2,sse3,ssse3,sse4,avx,avx2")))
void plusAVX2(int64_t * __restrict a, int64_t * __restrict b, int64_t * __restrict c, size_t size)
{
    for (size_t i = 0; i < size; ++i) {
        c[i] = a[i] + b[i];
    }
}

__attribute__((target("sse,sse2,sse3,ssse3,sse4,avx,avx2,avx512f")))
void plusAVX512(int64_t * __restrict a, int64_t * __restrict b, int64_t * __restrict c, size_t size)
{
    for (size_t i = 0; i < size; ++i) {
        c[i] = a[i] + b[i];
    }
}

In this example, we compile our plus function additionally for AVX2 and AVX-512. In the final assembly, we can check that the compiler uses AVX2 to vectorize the loop of the plusAVX2 function:

...

.globl	_Z8plusAVX2PlS_S_m              # -- Begin function _Z8plusAVX2PlS_S_m

...

.LBB4_4:                                # =>This Inner Loop Header: Depth=1
	vmovdqu	(%rsi,%rax,8), %ymm0
	vmovdqu	32(%rsi,%rax,8), %ymm1
	vmovdqu	64(%rsi,%rax,8), %ymm2
	vmovdqu	96(%rsi,%rax,8), %ymm3
	vpaddq	(%rdi,%rax,8), %ymm0, %ymm0
	vpaddq	32(%rdi,%rax,8), %ymm1, %ymm1
	vpaddq	64(%rdi,%rax,8), %ymm2, %ymm2
	vpaddq	96(%rdi,%rax,8), %ymm3, %ymm3
	vmovdqu	%ymm0, (%rdx,%rax,8)
	vmovdqu	%ymm1, 32(%rdx,%rax,8)
	vmovdqu	%ymm2, 64(%rdx,%rax,8)
	vmovdqu	%ymm3, 96(%rdx,%rax,8)
	addq	$16, %rax
	cmpq	%rax, %r8
	jne	.LBB4_4

...

and AVX-512 for vectorized loop of plusAVX512:

...

.globl	_Z10plusAVX512PlS_S_m    # -- Begin function _Z10plusAVX512PlS_S_m

...

.LBB5_4:    # =>This Inner Loop Header: Depth=1
	vmovdqu64	(%rsi,%rax,8), %zmm0
	vmovdqu64	64(%rsi,%rax,8), %zmm1
	vmovdqu64	128(%rsi,%rax,8), %zmm2
	vmovdqu64	192(%rsi,%rax,8), %zmm3
	vpaddq	(%rdi,%rax,8), %zmm0, %zmm0
	vpaddq	64(%rdi,%rax,8), %zmm1, %zmm1
	vpaddq	128(%rdi,%rax,8), %zmm2, %zmm2
	vpaddq	192(%rdi,%rax,8), %zmm3, %zmm3
	vmovdqu64	%zmm0, (%rdx,%rax,8)
	vmovdqu64	%zmm1, 64(%rdx,%rax,8)
	vmovdqu64	%zmm2, 128(%rdx,%rax,8)
	vmovdqu64	%zmm3, 192(%rdx,%rax,8)
	addq	$32, %rax
	cmpq	%rax, %r8
	jne	.LBB5_4

...

Now we have everything we need to perform a CPU dispatch:

void plus(int64_t * __restrict a, int64_t * __restrict b, int64_t * __restrict c, size_t size)
{
    if (hasAVX512()) {
        plusAVX512(a, b, c, size);
    } else if (hasAVX2()) {
        plusAVX2(a, b, c, size);
    } else {
        plusDefault(a, b, c, size);
    }
}

In this example, we created a plus function that dispatches to a concrete implementation based on the available instruction set. Such a CPU dispatch method is also called dispatch on each call. There are other methods that you can read about in Agner Fog Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms. 13.1 CPU dispatch strategies.

Dispatch on each call is the most flexible method because it will allow you to work with template functions and class member functions, or choose an implementation based on some statistics collected in runtime. The only downside is branching, although this overhead is negligible if your function performs a lot of work.

CPU Dispatch Optimization Places

Now how to find places where SIMD optimizations can be applied?

  1. If you know what loops are hot in your program, you can try to apply CPU dispatch for them.
  2. If you have performance tests, you can compile your program with AVX, AVX2, AVX-512 and compare performance reports to figure out which places can be optimized in your program using CPU dispatch.

This technique with performance tests can be applied not only to CPU dispatch but to many other useful optimizations. The main idea is to compile your code with different configurations (compilers, compiler options, libraries, allocators), and if there is a performance improvement in some places you can manually optimize them. For example:

  1. Try different allocators and different libraries.
  2. Try different compiler options (loop unrolling, inline threshold).
  3. Enable AVX/AVX2/AVX-512 for build.

CPU Dispatch in ClickHouse

We would like to recognize the efforts of Dmitriy Kovalkov who added the CPU dispatching framework to ClickHouse. It served as a basis for the subsequent work described in this article.

First, I want to show how we designed our dispatch framework in ClickHouse.

enum class TargetArch : UInt32
{
    Default  = 0,         /// Without any additional compiler options.
    SSE42    = (1 << 0),  /// SSE4.2
    AVX      = (1 << 1),
    AVX2     = (1 << 2),
    AVX512F  = (1 << 3),
    AVX512BW    = (1 << 4),
    AVX512VBMI  = (1 << 5),
    AVX512VBMI2 = (1 << 6),
};

/// Runtime detection.
bool isArchSupported(TargetArch arch);

We define an enum TargetArch for the target architecture, and inside the isArchSupported function we use CPUID instruction set checks that we have already discussed. Then we define a bunch of BEGIN_INSTRUCTION_SET_SPECIFIC_CODE sections that apply a target attribute to the whole code block.

For example, for clang:

#   define BEGIN_AVX512F_SPECIFIC_CODE \
_Pragma("clang attribute push(__attribute__((target(\"sse,sse2,sse3,ssse3,sse4,\
    popcnt,avx,avx2, avx512f\"))), apply_to=function)")
\
#   define BEGIN_AVX2_SPECIFIC_CODE \
_Pragma("clang attribute push(__attribute__((target(\"sse,sse2,sse3,ssse3,sse4,\
    popcnt, avx,avx2\"))), apply_to=function)") \
\
#   define END_TARGET_SPECIFIC_CODE \
_Pragma("clang attribute pop")

Then for each instruction set, we define a separate namespace TargetSpecific::INSTRUCTION_SET. Examples for AVX2 and AVX512:

#define DECLARE_AVX2_SPECIFIC_CODE(...) \
BEGIN_AVX2_SPECIFIC_CODE \
namespace TargetSpecific::AVX2 { \
    DUMMY_FUNCTION_DEFINITION \
    using namespace DB::TargetSpecific::AVX2; \
    __VA_ARGS__ \
} \
END_TARGET_SPECIFIC_CODE

#define DECLARE_AVX512F_SPECIFIC_CODE(...) \
BEGIN_AVX512F_SPECIFIC_CODE \
namespace TargetSpecific::AVX512F { \
    DUMMY_FUNCTION_DEFINITION \
    using namespace DB::TargetSpecific::AVX512F; \
    __VA_ARGS__ \
} \
END_TARGET_SPECIFIC_CODE

It can be used like this:

DECLARE_DEFAULT_CODE (
    int funcImpl() {
        return 1;
    }
) // DECLARE_DEFAULT_CODE

DECLARE_AVX2_SPECIFIC_CODE (
    int funcImpl() {
        return 2;
    }
) // DECLARE_AVX2_SPECIFIC_CODE

/// Dispatcher function
int dispatchFunc() {
#if USE_MULTITARGET_CODE
    if (isArchSupported(TargetArch::AVX2))
        return TargetSpecific::AVX2::funcImpl();
#endif
    return TargetSpecific::Default::funcImpl();
}

The examples above work well with standalone functions, but when we have class member functions, they do not work because these cannot be wrapped into the namespace. For such cases, we have another bunch of macros. We need to insert a specific attribute before the class member function name and generate functions with different names, ideally with suffixes like SSE42, AVX2, AVX512. We can split a function into a header and body using the MULTITARGET_FUNCTION_HEADER and MULTITARGET_FUNCTION_BODY macros. Specific attributes are then inserted before the function name. For example, for AVX-512 (BW), AVX-512 (F), AVX2 and SSE4.2 it can look like this:

/// Function header
#define MULTITARGET_FUNCTION_HEADER(...) __VA_ARGS__

/// Function body
#define MULTITARGET_FUNCTION_BODY(...) __VA_ARGS__

#define MULTITARGET_FUNCTION_AVX512BW_AVX512F_AVX2_SSE42(FUNCTION_HEADER, name, FUNCTION_BODY) \
    FUNCTION_HEADER \
    \
    AVX512BW_FUNCTION_SPECIFIC_ATTRIBUTE \
    name##AVX512BW \
    FUNCTION_BODY \
    \
    FUNCTION_HEADER \
    \
    AVX512_FUNCTION_SPECIFIC_ATTRIBUTE \
    name##AVX512 \
    FUNCTION_BODY \
    \
    FUNCTION_HEADER \
    \
    AVX2_FUNCTION_SPECIFIC_ATTRIBUTE \
    name##AVX2 \
    FUNCTION_BODY \
    \
    FUNCTION_HEADER \
    \
    SSE42_FUNCTION_SPECIFIC_ATTRIBUTE \
    name##SSE42 \
    FUNCTION_BODY \
    \
    FUNCTION_HEADER \
    \
    name \
    FUNCTION_BODY \

We use CPU dispatch in computationally intensive places, for example, in hashing, geometry functions, string processing functions, random numbers generation functions, unary functions, and aggregate functions. For example, let’s see how we use CPU dispatch in aggregate functions. In ClickHouse, if there is GROUP BY without keys, for example SELECT sum(value), avg(value) FROM test_table, the aggregate functions process data directly in a batch. For the sum function, there is the following implementation:

template <typename Value>
void NO_INLINE addManyImpl(const Value * __restrict ptr, size_t start, size_t end)
{
    ptr += start;
    size_t count = end - start;
    const auto * end_ptr = ptr + count;

    /// Loop
    T local_sum{};
    while (ptr < end_ptr)
    {
        Impl::add(local_sum, *ptr);
        ++ptr;
    }
    Impl::add(sum, local_sum);
}

After we wrap this loop into our dispatch framework, the function code will look like this:

MULTITARGET_FUNCTION_AVX512BW_AVX512F_AVX2_SSE42(
MULTITARGET_FUNCTION_HEADER(
template <typename Value>
void NO_SANITIZE_UNDEFINED NO_INLINE
), addManyImpl,
MULTITARGET_FUNCTION_BODY((const Value * __restrict ptr, size_t start, size_t end)
{
    ptr += start;
    size_t count = end - start;
    const auto * end_ptr = ptr + count;

    /// Loop
    T local_sum{};
    while (ptr &lt end_ptr)
    {
        Impl::add(local_sum, *ptr);
        ++ptr;
    }
    Impl::add(sum, local_sum);
}))

We can now dispatch to the right implementation based on the fastest available CPU instruction set:

template <typename Value>
void NO_INLINE addMany(const Value * __restrict ptr, size_t start, size_t end)
{
#if USE_MULTITARGET_CODE
 if (isArchSupported(TargetArch::AVX512BW))
    {
        addManyImplAVX512BW(ptr, start, end);
        return;
    } 
    else if (isArchSupported(TargetArch::AVX512F))
    {
        addManyImplAVX512F(ptr, start, end);
        return;
    }
    else if (isArchSupported(TargetArch::AVX2))
    {
        addManyImplAVX2(ptr, start, end);
        return;
    }
    else if (isArchSupported(TargetArch::SSE42))
    {
        addManyImplSSE42(ptr, start, end);
        return;
    }
#endif

    addManyImpl(ptr, start, end);
}

Here is a small part of the performance report after this optimization was applied:

QueryOld (s)New (s)Ratio of speedup(-) or slowdown(+)Relative difference (new - old) / old
SELECT sum(toNullable(toUInt8(number))) FROM numbers(100000000)0.1100.077-1.428x-0.300
SELECT sum(number) FROM numbers(100000000)0.0440.035-1.228x-0.185
SELECT sumOrNull(number) FROM numbers(100000000)0.0440.036-1.226x-0.183
SELECT avg(number) FROM numbers(100000000)0.4160.341-1.219x-0.180

In general, such optimization for sum and avg aggregate functions improves performance 1.2-1.6 times. Similar optimizations can be applied to other aggregate functions as well. Now let’s take a look at a CPU dispatch optimization in unary functions:

template <typename A, typename Op>
struct UnaryOperationImpl
{
    using ResultType = typename Op::ResultType;
    using ColVecA = ColumnVectorOrDecimal<A>;
    using ColVecC = ColumnVectorOrDecimal<ResultType>
    using ArrayA = typename ColVecA::Container;
    using ArrayC = typename ColVecC::Container;

    static void vector(const ArrayA & a, ArrayC & c)
    {
        /// Loop Op::apply is template for operation
        size_t size = a.size();
        for (size_t i = 0; i < size; ++i)
            c[i] = Op::apply(a[i]);
    }

    static void constant(A a, ResultType & c)
    {
        c = Op::apply(a);
    }
};

In the example, there is a loop that applies some template operation using Op::apply on elements of array a and writes the result into array c. After we wrap this loop into our dispatch framework, the loop code will look like this:

MULTITARGET_FUNCTION_WRAPPER_AVX2_SSE42(
MULTITARGET_FH(static void NO_INLINE),
vectorImpl,
MULTITARGET_FB((const ArrayA & a, ArrayC & c) /// NOLINT
{
    /// Loop Op::apply is template for operation
    size_t size = a.size();
    for (size_t i = 0; i < size; ++i)
        c[i] = Op::apply(a[i]);
}))

Now we need to dispatch to the appropriate function based on the currently available CPU instruction set:

static void NO_INLINE vector(const ArrayA & a, ArrayC & c)
{
#if USE_MULTITARGET_CODE
    if (isArchSupported(TargetArch::AVX2))
    {
        vectorImplAVX2(a, c);
        return;
    }
    else if (isArchSupported(TargetArch::SSE42))
    {
        vectorImplSSE42(a, c);
        return;
    }
#endif

    vectorImpl(a, c);
}

Here is a small part of the performance report after this optimization was applied:

QueryOld (s)New (s)Ratio of speedup(-) or slowdown(+)Relative difference (new - old) / old
SELECT roundDuration(toInt32(number))) FROM numbers(100000000)1.6320.229-7.119x-0.860
SELECT intExp2(toInt32(number)) FROM numbers(100000000)0.1480.105-1.413x-0.293
SELECT roundToExp2(toUInt8(number)) FROM numbers(100000000)0.1440.102-1.41x-0.291

In general, such optimization for unary functions improves performance 1.15-2 times. For some specific functions like roundDuration such optimizations improved performance 2-7 times.

Summary

The compiler can vectorize even complex loops using SIMD instructions. Additionally, you can vectorize your code manually or design SIMD-oriented algorithms. But the biggest problem is that if you want to use modern instruction sets, it can decrease your program or library portability. Runtime CPU dispatching can help you eliminate this problem, with the cost of compiling parts of your code multiple times for different architectures. You can find places to improve performance using performance tests and compiling your code with different configurations. For CPU dispatch optimizations, you can compile your code with AVX, AVX2, AVX512 and manually apply CPU dispatch in places where there are performance improvements. In ClickHouse, we designed a framework specifically for such optimizations and have improved performance in a lot of places as a result.

Share this post

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!
Loading form...
Follow us
Twitter imageSlack imageGitHub image
Telegram imageMeetup imageRss image