Introduction
Welcome to GCC* 13.1, released on Apr. 26, 2023! The GNU Compiler Collection (GCC) team at Intel has worked closely with the GCC community and contributed innovations and performance enhancements to GCC 13. This continues and reflects our long-standing collaboration with the open source community, enabling developers across the software ecosystem.
Support for many new features of upcoming generations of the Intel® Xeon® Scalable Processor has been added in GCC 13, including:
- AVX-IFMA: Intel® Advanced Vector Extensions (Intel® AVX) Integer Fused Multiply Add (IFMA)
- AVX-VNNI-INT8: Intel® AVX Vector Neural Network Instructions (VNNI) INT8
- AVX-NE-CONVERT: A new set of instructions that convert low-precision floating-point formats such as BF16 and FP16 to the higher-precision FP32 format, as well as convert FP32 elements to BF16. These instructions give the platform improved AI capabilities and better compatibility.
- CMPccXADD: compare and add instruction using an implicit lock
- WRMSRNS: non-serializing write to model specific registers
- MSRLIST: R/W support for list of model specific registers
- RAO-INT: remote atomic ADD, AND, OR, XOR operations
- AMX-FP16: tile computational operations on FP16 numbers for Intel® Advanced Matrix Extensions (Intel® AMX)
- AMX-COMPLEX: matrix multiply operations on complex elements for Intel® AMX
- PREFETCHIT0/1: improved prefetch instructions that prefetch code into the cache hierarchy
Note: For instruction set details, please refer to the Intel® Architecture Instruction Set Extensions and Future Features Programming Reference (March 2023 edition).
The heuristics for function inlining and loop unrolling have been updated, which leads to performance improvements measured with the SPECrate* 2017 benchmark.
Several auto-vectorization enhancements targeting the new vector neural network instructions (AVX-VNNI-INT8) have been built on GCC 13's vectorization framework. We also contributed many patches that improve code quality and performance in the x86 backend.
In addition, the -mtune=alderlake microarchitecture tuning was refined, yielding significant performance gains.
Generation-to-Generation Performance Improvement
The heuristic for function inlining has been updated in GCC 13. Figure 1 shows that performance improves by 2.17% with -Ofast on Intel® 64 architecture, as measured by SPECrate* 2017.
Figure 1. GCC 13 -Ofast SPECrate 2017 (64-core) estimated improvement ratio vs GCC 12.2 on 4th Gen Intel® Xeon® Scalable Processor (formerly code-named Sapphire Rapids)
GCC 13 -O2 performance has also improved, by 0.5% on SPECrate 2017. Several benchmarks gain additional performance if the default architecture option is raised from x86-64 to x86-64-v2, which additionally assumes SSE4.2, SSSE3, POPCNT, and CMPXCHG16B, among other features. For more information about the microarchitecture levels, see the x86-64 psABI. Figure 2 shows more details of the improvement.
Figure 2. GCC 13 -O2 SPECrate 2017 (64-core) estimated improvement ratio vs GCC 12.2 on 4th Gen Intel Xeon Processor.
Support for Next-Generation Intel Xeon Scalable Processors
To take advantage of the newly added architecture-specific features and optimizations supported in GCC 13, use the following compiler options:
- AVX-IFMA intrinsics are available via the -mavxifma compiler option (a usage sketch follows this list).
- AVX-VNNI-INT8 intrinsics are available via the -mavxvnniint8 compiler option.
- AVX-NE-CONVERT intrinsics are available via the -mavxneconvert compiler option.
- CMPccXADD intrinsics are available via the -mcmpccxadd compiler option.
- AMX-FP16 intrinsics are available via the -mamx-fp16 compiler option.
- PREFETCHI intrinsics are available via the -mprefetchi compiler option.
- RAO-INT intrinsics are available via the -mraoint compiler option.
- AMX-COMPLEX intrinsics are available via the -mamx-complex compiler option.
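As an illustration, here is a minimal sketch of calling one of the new intrinsics after enabling the matching option. The intrinsic name _mm256_madd52lo_avx_epu64 is the AVX-IFMA form listed in the Intel® Intrinsics Guide; verify the exact name and signature there before relying on it.
#include <immintrin.h>

/* Minimal AVX-IFMA sketch (compile with: gcc -O2 -mavxifma).
   Multiplies the low 52 bits of each 64-bit lane of x and y and
   accumulates the low 52 bits of the product into acc.  The
   intrinsic name follows the Intel Intrinsics Guide and should be
   checked against your immintrin.h.  */
__m256i
madd52lo (__m256i acc, __m256i x, __m256i y)
{
  return _mm256_madd52lo_avx_epu64 (acc, x, y);
}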
The GCC -march instruction set options for upcoming Intel processors are as follows:
- GCC now supports the Intel® CPU code-named Raptor Lake through -march=raptorlake.
- GCC now supports the Intel® CPU code-named Meteor Lake through -march=meteorlake.
- GCC now supports the Intel® CPU code-named Sierra Forest through -march=sierraforest. The switch enables the AVX-IFMA, AVX-VNNI-INT8, AVX-NE-CONVERT, and CMPccXADD ISA extensions.
- GCC now supports the Intel® CPU code-named Grand Ridge through -march=grandridge. The switch enables the AVX-IFMA, AVX-VNNI-INT8, AVX-NE-CONVERT, CMPccXADD, and RAO-INT ISA extensions.
- GCC now supports the Intel® CPU code-named Granite Rapids through -march=graniterapids. The switch enables the AMX-FP16, AMX-COMPLEX, and PREFETCHI ISA extensions.
For more details of the new intrinsics, see the Intel® Intrinsics Guide.
Auto-vectorization Enhancements
Besides the intrinsics for the new instructions, auto-vectorization has been enhanced to generate vector neural network instructions (AVX-VNNI-INT8). Several other auto-vectorization improvements have also been added to GCC 13.
Auto-vectorization for AVX-VNNI-INT8 vpdpbssd
Auto-vectorization was enhanced to perform idiom recognition (such as the dot-product idiom) that triggers instruction generation for Intel® AVX-VNNI-INT8; vpdpbssd multiplies pairs of signed bytes and accumulates four adjacent products into each 32-bit lane. Figure 3 shows the compiler generating the vpdpbssd instruction followed by a sum reduction.
int sdot_prod_qi (char *restrict a, char *restrict b,
                  int c, int n)
{
  /* Signed 8-bit dot product accumulated into a 32-bit sum.  The
     trip count is a compile-time constant, so the loop can be fully
     vectorized into a single vpdpbssd.  */
  for (int i = 0; i < 32; i++)
    c += ((int) a[i] * (int) b[i]);
  return c;
}
sdot_prod_qi:
vmovdqu ymm0, YMMWORD PTR [rsi]
vpxor xmm1, xmm1, xmm1
vpdpbssd ymm1, ymm0, YMMWORD PTR [rdi]
vmovdqa xmm0, xmm1
vextracti128 xmm1, ymm1, 0x1
vpaddd xmm0, xmm0, xmm1
vpsrldq xmm1, xmm0, 8
vpaddd xmm0, xmm0, xmm1
vpsrldq xmm1, xmm0, 4
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
add eax, edx
vzeroupper
ret
Figure 3: Intel® AVX-VNNI-INT8 idiom recognition in GCC (-O2 -mavxvnniint8) auto-vectorization
Auto-vectorization for Nonlinear Induction
Previously, GCC auto-vectorization could handle only linear induction variables, such as
i = i + 1.
A nonlinear induction variable prevented the whole loop from being vectorized, which could hurt performance significantly if the loop was a hotspot. With this enhancement, auto-vectorization can handle several forms of nonlinear induction.
Figure 4 shows a nonlinear induction (t alternating between t and -t) that is now vectorized.
void
foo (int *p, int t)
{
  /* t flips sign every iteration: a nonlinear induction.  The
     vectorizer builds the constant pattern {t, -t, t, -t}.  */
  for (int i = 0; i != 256; i++)
    {
      p[i] = t;
      t = -t;
    }
}
foo:
mov eax, esi
movd xmm0, esi
neg eax
pinsrd xmm0, eax, 1
lea rax, [rdi+1024]
punpcklqdq xmm0, xmm0
.L2:
movups XMMWORD PTR [rdi], xmm0
add rdi, 32
movups XMMWORD PTR [rdi-16], xmm0
cmp rax, rdi
jne .L2
ret
Figure 4: GCC (-O2 -march=x86-64-v2) nonlinear induction auto-vectorization example.
Auto-Vectorization for AVX-512 vcvttps2udq
Previously, auto-vectorization did not take advantage of the vcvttps2udq instruction available under Intel AVX-512 and instead emulated the float-to-unsigned conversion with an inefficient sequence of vcvttps2dq instructions. It has now been enhanced to generate a single AVX-512 vcvttps2udq.
void
foo (unsigned *p, float *q)
{
  /* Implicit float -> unsigned conversion on every element.  */
  for (int i = 0; i != 256; i++)
    p[i] = q[i];
}
foo:
xor eax, eax
.L2:
vcvttps2udq ymm0, YMMWORD PTR [rsi+rax]
vmovdqu YMMWORD PTR [rdi+rax], ymm0
add rax, 32
cmp rax, 1024
jne .L2
vzeroupper
ret
Figure 5: Auto-Vectorization for AVX-512 vcvttps2udq (-O2 -mavx512vl -mprefer-vector-width=256)
Codegen Optimization for the x86 Backend
New Option -mdaz-ftz
In Intel processors, the flush-to-zero (FTZ) and denormals-are-zero (DAZ) flags in the MXCSR register control how denormal floating-point values are handled. Floating-point computations using Intel® Streaming SIMD Extensions (Intel® SSE) and Intel® AVX instructions, both scalar and vector, are faster when the FTZ and DAZ flags are set, which improves application performance. The new -mdaz-ftz option links in startup code that sets the FTZ and DAZ bits, and it can improve SPECrate 2017 Floating Point suite results by 1.67%. Figure 6 shows more details about the improvement.
Figure 6: GCC 13 -O2 -march=x86-64-v2 -mdaz-ftz SPECrate 2017 Floating Point (64-core) estimated improvement ratio vs GCC 13 -O2 -march=x86-64-v2 on current Intel Xeon processor.
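The effect of the option can be observed with a small test of our own (a hedged sketch, not part of the benchmark suite). When the program is built with -mdaz-ftz, the startup code sets the FTZ and DAZ bits in MXCSR, so the denormal result below is flushed to zero; without the option, a tiny nonzero value is printed.
#include <float.h>
#include <stdio.h>

/* Sketch: observe denormal flushing.  Build twice:
     gcc -O2 demo.c            -> prints about 5.88e-39
     gcc -O2 -mdaz-ftz demo.c  -> prints 0 (denormals flushed)  */
int
main (void)
{
  volatile float tiny = FLT_MIN;  /* smallest normal float          */
  volatile float r = tiny / 2.0f; /* exact result would be denormal */
  printf ("%g\n", (double) r);
  return 0;
}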
Memcmp Is Optimized with the vptest Instruction
If the compared length fits within a vector register, memcmp can be inlined and optimized using the vptest instruction.
_Bool f256 (char *a)
{
  /* 31 characters plus the terminating NUL make exactly 32 bytes,
     which fit in one 256-bit register.  */
  char t[] = "0123456789012345678901234567890";
  return __builtin_memcmp (a, &t[0], sizeof (t)) == 0;
}
f256:
vmovdqa ymm1, YMMWORD PTR .LC1[rip]
vpxor ymm0, ymm1, YMMWORD PTR [rdi]
vptest ymm0, ymm0
sete al
vzeroupper
ret
.LC1:
.quad 3978425819141910832
.quad 3833745473465760056
.quad 3689065127958034230
.quad 13573712489362740
Figure 7: AVX vptest optimization (-O2 -mavx -mmove-max=256 -mstore-max=256)
Microarchitecture Tuning
As measured by SPECrate 2017 Floating Point suite results, the -mtune=alderlake update significantly improved performance: by 4.5% on E-cores and 3.12% on P-cores. Figures 8 and 9 show more details about the improvement.
Figure 8: Update -mtune=alderlake improves SPECrate 2017 Floating Point suite results by 4.5% Geomean on E-core.
Figure 9: Update -mtune=alderlake improves SPECrate 2017 Floating Point suite results by 3.12% Geomean on P-core.
Future Work
The GCC team at Intel continues to enhance vectorization, improve performance, bring new features to Intel® platforms, and fix issues to increase code quality in upcoming GNU compiler releases. We look forward to continuing our active contributions to this and other open source projects.