Accelerate C Code with AVX2 Instructions

Instruction Set Extension

Modern CPUs, such as Intel’s Broadwell and Skylake, usually ship with instruction set extensions such as SSE2 and AVX2. These extensions provide complex, often multi-cycle instructions that let programmers speed up their code at the hardware level.
Among these extensions, AVX is probably the most useful. Intel AVX, short for Intel Advanced Vector Extensions, lets the CPU apply the same operation to several data elements at once. This is especially useful for vector operations, hence the name. AVX was first introduced with Sandy Bridge and was extended to AVX2 when Intel released the Haswell architecture. The extension is still evolving: Skylake CPUs in the Xeon family already support AVX-512, which doubles the vector width compared with Haswell’s AVX2, from 256 to 512 bits.

Registers

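AVX2 uses the ymm registers, which extend the xmm registers. There are 16 of them (ymm0 to ymm15) in 64-bit mode and 8 in 32-bit mode. Each ymm register is 256 bits wide, and its lowest 128 bits are the corresponding xmm register. AVX-512 introduces the zmm registers. Just as an xmm register is the low half of a ymm register, each ymm register is the lowest 256 bits of the corresponding zmm register.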
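The relationship between the two register widths can be seen from C as well. The sketch below is only an illustration: the function name split_halves is made up for this example, and the target attribute is a GCC/Clang feature assumed here so that the function builds without extra flags.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: split eight 32-bit integers held in one 256-bit value
   into two 128-bit halves.
   Assumption: GCC or Clang; the target attribute enables AVX2 for this function only. */
__attribute__((target("avx2")))
void split_halves(const int32_t in[8], int32_t low[4], int32_t high[4]) {
    __m256i v = _mm256_loadu_si256((const __m256i *)in);

    /* The low 128 bits are what the matching xmm register sees; this cast
       reinterprets the value and normally costs no instruction. */
    __m128i lo = _mm256_castsi256_si128(v);

    /* The upper 128 bits have no xmm alias, so they need an explicit extract. */
    __m128i hi = _mm256_extracti128_si256(v, 1);

    _mm_storeu_si128((__m128i *)low, lo);
    _mm_storeu_si128((__m128i *)high, hi);
}
```

Calling split_halves on the values 0 to 7 leaves 0 to 3 in low and 4 to 7 in high, because the element at the lowest address lands in the lowest bits of the register.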

Instructions

There are a lot of them, and they are all documented in Intel’s official software developer’s manual, which runs to more than 3000 pages. You can find it on 01.org.
Here are some examples:

vpxor ymm, ymm, ymm // 256-bit XOR
vpaddq ymm, ymm, ymm // add four 64-bit integers
vpaddd ymm, ymm, ymm // add eight 32-bit integers
vpcmpeqq ymm, ymm, ymm // compare four 64-bit integers for equality and store the result mask in a ymm register
vmovdqu ymm, m256 // load an unaligned 256-bit value from memory into a register (the reverse form stores it back)
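For readers who prefer C to assembly, here is a rough sketch of how two of these instructions are reached through compiler intrinsics. The function names add8_i32 and cmpeq4_i64 are invented for this example, and the instruction named in each comment is what compilers usually emit, not a guarantee. The target attribute is a GCC/Clang assumption.

```c
#include <immintrin.h>
#include <stdint.h>

/* Add eight 32-bit integers lane by lane (hypothetical helper). */
__attribute__((target("avx2")))
void add8_i32(const int32_t *a, const int32_t *b, int32_t *out) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);  /* usually vmovdqu */
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);  /* usually vmovdqu */
    __m256i sum = _mm256_add_epi32(va, vb);               /* usually vpaddd */
    _mm256_storeu_si256((__m256i *)out, sum);             /* usually vmovdqu */
}

/* Compare four 64-bit integers for equality (hypothetical helper).
   Each output lane is all ones (-1) where the inputs match and 0 elsewhere. */
__attribute__((target("avx2")))
void cmpeq4_i64(const int64_t *a, const int64_t *b, int64_t *out) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i mask = _mm256_cmpeq_epi64(va, vb);            /* usually vpcmpeqq */
    _mm256_storeu_si256((__m256i *)out, mask);
}
```

The comparison result is a mask, not a boolean: it is meant to be fed into further vector operations such as a blend or an AND.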

Bad News & Good News

Compilers do not emit these instructions by default. That is reasonable: a compiler has to take compatibility into account, because its output must run on CPUs from different vendors that only share the common x86_64 baseline.
The good news is that although compilers like GCC will not use these instructions on their own unless told to target them, you can ask for them explicitly. Both GCC and Visual C++ provide a set of headers that make programming with these instructions easier, so there is no need to write assembly or inline assembly directly.
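Because of the compatibility concern above, one common pattern is to keep a portable version of a routine and pick the AVX2 version only when the CPU reports support at run time. The sketch below assumes GCC or Clang: the target attribute and __builtin_cpu_supports are extensions of those compilers, and the names add4_i64, add4_i64_avx2 and add4_i64_scalar are made up for this example.

```c
#include <immintrin.h>
#include <stdint.h>

/* AVX2 path: add four 64-bit integers with a single vector add.
   The target attribute enables AVX2 for this function only. */
__attribute__((target("avx2")))
static void add4_i64_avx2(const int64_t *a, const int64_t *b, int64_t *out) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_add_epi64(va, vb));
}

/* Portable path: works on any x86_64 CPU. */
static void add4_i64_scalar(const int64_t *a, const int64_t *b, int64_t *out) {
    for (int i = 0; i < 4; i++) {
        out[i] = a[i] + b[i];
    }
}

/* Dispatcher: asks the CPU at run time whether AVX2 is available. */
void add4_i64(const int64_t *a, const int64_t *b, int64_t *out) {
    if (__builtin_cpu_supports("avx2")) {
        add4_i64_avx2(a, b, out);
    } else {
        add4_i64_scalar(a, b, out);
    }
}
```

If the program only ever runs on AVX2-capable machines, the simpler route is to compile the whole file with GCC's -mavx2 option and drop both the attribute and the dispatcher.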

Work with GCC

GCC has several headers that wrap these instructions, for example immintrin.h, emmintrin.h and xmmintrin.h, but you only need to include immintrin.h because it pulls in most of the others.
Using AVX instructions through these headers is quite simple; with GCC, compile with -mavx2 so that the AVX2 intrinsics are enabled. But before that, we have to know how to store data in those special registers, in other words, how to represent these registers.

Data Type

The data types that represent a ymm register are __m256 (eight floats), __m256d (four doubles) and __m256i (integers). There are also types like __m128, representing xmm, and __m512, representing zmm.
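To make the difference between the three 256-bit types concrete, here is a small sketch that uses one of each. The function names are hypothetical, and the target attribute again assumes GCC or Clang. Note that for __m256i the lane width is not part of the type; it is chosen by the intrinsic (epi32 below means 32-bit lanes).

```c
#include <immintrin.h>
#include <stdint.h>

/* __m256 holds eight single-precision floats. */
__attribute__((target("avx2")))
void scale8_f32(const float *in, float factor, float *out) {
    __m256 v = _mm256_loadu_ps(in);
    __m256 f = _mm256_set1_ps(factor);        /* broadcast factor to all 8 lanes */
    _mm256_storeu_ps(out, _mm256_mul_ps(v, f));
}

/* __m256d holds four double-precision floats. */
__attribute__((target("avx2")))
void scale4_f64(const double *in, double factor, double *out) {
    __m256d v = _mm256_loadu_pd(in);
    __m256d f = _mm256_set1_pd(factor);       /* broadcast factor to all 4 lanes */
    _mm256_storeu_pd(out, _mm256_mul_pd(v, f));
}

/* __m256i holds integers; here it is treated as eight 32-bit lanes. */
__attribute__((target("avx2")))
void offset8_i32(const int32_t *in, int32_t offset, int32_t *out) {
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    __m256i o = _mm256_set1_epi32(offset);    /* broadcast offset to all 8 lanes */
    _mm256_storeu_si256((__m256i *)out, _mm256_add_epi32(v, o));
}
```

All three functions take plain pointers on purpose, so callers that are not compiled for AVX2 never have to pass a 256-bit value by value.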

Functions & Macros

Most functions can be found here.
Here are some examples:

__m256i _mm256_loadu_si256 (__m256i const * mem_addr) // load 256 bits of integer data from (possibly unaligned) memory
void _mm256_storeu_si256 (__m256i * mem_addr, __m256i a) // store the 256 bits in a to (possibly unaligned) memory
__m256i _mm256_xor_si256 (__m256i a, __m256i b) // compute the bitwise XOR of a and b

Note that the data types begin with two underscores, while the functions begin with one. Also, although the compiler allocates registers and spills variables automatically, you are better off keeping these decisions in mind yourself, because these functions and macros sit at a very low level.

Example

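#include <immintrin.h>

// XOR the four 64-bit integers at addr_a with the four at addr_b
// and write the result to addr_target.
// Unaligned loads and stores are used, so no special alignment is required.
void xor4ll(const void* addr_a, const void* addr_b, void* addr_target) {
    __m256i a = _mm256_loadu_si256((const __m256i*)addr_a);
    __m256i b = _mm256_loadu_si256((const __m256i*)addr_b);
    __m256i c = _mm256_xor_si256(a, b);
    _mm256_storeu_si256((__m256i*)addr_target, c);
}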
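Real buffers are rarely exactly four elements long. As a possible next step, the same load, XOR and store sequence can be put in a loop, with a scalar loop handling whatever is left over. This is a sketch, not part of the original example: the name xor_u64 is made up, and the target attribute assumes GCC or Clang (with -mavx2 on the command line it can be dropped).

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst[i] = a[i] ^ b[i] for i in [0, n).
   Assumption: the CPU supports AVX2; the caller is responsible for checking. */
__attribute__((target("avx2")))
void xor_u64(const uint64_t *a, const uint64_t *b, uint64_t *dst, size_t n) {
    size_t i = 0;

    /* Main loop: four 64-bit integers (256 bits) per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(dst + i), _mm256_xor_si256(va, vb));
    }

    /* Tail: the remaining zero to three elements, one at a time. */
    for (; i < n; i++) {
        dst[i] = a[i] ^ b[i];
    }
}
```

With n = 10, for instance, the vector loop covers elements 0 to 7 in two iterations and the scalar loop finishes elements 8 and 9.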