Build Innovation and Performance with GCC* 13

Introduction

Welcome to GCC* 13.1, released on Apr. 26, 2023! The GNU Compiler Collection (GCC) team at Intel has worked closely with the GCC community and contributed innovations and performance enhancements to GCC 13. This continues and reflects our long-standing close collaboration with the open source community, enabling developers across the software ecosystem.

Support for many new features of upcoming generations of the Intel® Xeon® Scalable Processor has been added in GCC 13, including
 

  • AVX-IFMA: Intel® Advanced Vector Extensions (Intel® AVX) Integer Fused Multiply Add (IFMA), the VEX-encoded counterpart of the Intel® AVX-512 IFMA instructions
  • AVX-VNNI-INT8: Intel® AVX Vector Neural Network Instructions (VNNI) INT8
  • AVX-NE-CONVERT: a new set of instructions that convert low-precision floating point values (BF16/FP16) to higher-precision FP32, and convert FP32 elements to BF16. These instructions give the platform improved AI capabilities and better compatibility.
  • CMPccXADD: compare and add instruction using an implicit lock
  • WRMSRNS: non-serializing write to model specific registers
  • MSRLIST: R/W support for list of model specific registers
  • RAO-INT: remote atomic ADD, AND, OR, XOR operations
  • AMX-FP16: tile computational operations on FP16 numbers for Intel® Advanced Matrix Extensions (Intel® AMX)
  • AMX-COMPLEX: matrix multiply operations on complex elements for Intel® AMX
  • PREFETCHIT0/1: improved memory prefetch instruction
     

Note: Please refer to the latest edition of the Intel® Architecture Instruction Set Extensions and Future Features Programming Reference from March 2023 for instruction set details.

The heuristics for function inlining and loop unrolling have been updated, which leads to performance improvements as measured by the SPECrate* 2017 benchmark.

Several auto-vectorization enhancements have been developed for new vector neural network instructions (AVX-VNNI-INT8) in the GCC 13 compiler based on the vectorization framework. We also contributed many patches, improving quality and performance in the backend.

Also, -mtune=alderlake architecture optimizations were fine-tuned with significant performance gains.

Generation-to-Generation Performance Improvement

The heuristic for function inlining has been updated in GCC 13. Figure 1 shows performance is improved by 2.17% for -Ofast on Intel® 64 Architecture as measured by SPECrate* 2017.

Figure 1. GCC 13 -Ofast SPECrate 2017 (64-core) estimated improvement ratio vs GCC 12.2 on 4th Gen Intel® Xeon® Scalable Processor (formerly code-named Sapphire Rapids)

GCC 13 -O2 performance has also improved by 0.5% on SPECrate 2017. Several benchmarks gain further if the default architecture compilation option is raised from -march=x86-64 to -march=x86-64-v2. For more details about the architecture levels, see the x86-64 psABI. Figure 2 shows the improvement in more detail.

Figure 2. GCC 13 -O2 SPECrate 2017 (64-core) estimated improvement ratio vs GCC 12.2 on 4th Gen Intel Xeon Processor.

Support for Next Generation Intel Xeon Scalable Processors

To take advantage of the newly added architecture specific features and optimizations supported in GCC 13, select the following compiler options:
 

  • AVX-IFMA intrinsics are available via the -mavxifma compiler option switch.
  • AVX-VNNI-INT8 intrinsics are available via the -mavxvnniint8 compiler option switch.
  • AVX-NE-CONVERT intrinsics are available via the -mavxneconvert compiler option switch.
  • CMPccXADD intrinsics are available via the -mcmpccxadd compiler option switch.
  • AMX-FP16 intrinsics are available via the -mamx-fp16 compiler option switch.
  • PREFETCHI intrinsics are available via the -mprefetchi compiler option switch.
  • RAO-INT intrinsics are available via the -mraoint compiler option switch.
  • AMX-COMPLEX intrinsics are available via the -mamx-complex compiler option switch.
     

The -march GCC compiler instruction set options for upcoming Intel processors are as follows:
 

  • GCC now supports the Intel® CPU code-named Raptor Lake through -march=raptorlake.
  • GCC now supports the Intel® CPU code-named Meteor Lake through -march=meteorlake.
  • GCC now supports the Intel® CPU code-named Sierra Forest through -march=sierraforest. The switch enables the AVX-IFMA, AVX-VNNI-INT8, AVX-NE-CONVERT, and CMPccXADD ISA extensions.
  • GCC now supports the Intel® CPU code-named Grand Ridge through -march=grandridge. The switch enables the AVX-IFMA, AVX-VNNI-INT8, AVX-NE-CONVERT, CMPccXADD, and RAO-INT ISA extensions.
  • GCC now supports the Intel® CPU code-named Granite Rapids through -march=graniterapids. The switch enables the AMX-FP16, AMX-COMPLEX and PREFETCHI ISA extensions.
     

For more details of the new intrinsics, see the Intel® Intrinsics Guide.
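As a minimal sketch of how one of these intrinsics might be used, the function below computes a 32-element signed-byte dot product. The helper name dot32_i8 is hypothetical. The intrinsic path is compiled only when the compiler defines __AVXVNNIINT8__ (for example with -mavxvnniint8); otherwise a plain scalar loop is used, so the code builds with any C compiler.

```c
#include <stdint.h>
#if defined(__AVXVNNIINT8__)
#include <immintrin.h>
#endif

/* Hypothetical helper: dot product of 32 signed bytes, added to acc. */
int32_t dot32_i8(const int8_t *a, const int8_t *b, int32_t acc)
{
#if defined(__AVXVNNIINT8__)
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    /* Each 32-bit lane accumulates four signed byte products. */
    __m256i sum = _mm256_dpbssd_epi32(_mm256_setzero_si256(), va, vb);
    int32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, sum);
    for (int i = 0; i < 8; i++)
        acc += lanes[i];
    return acc;
#else
    /* Portable fallback with the same result. */
    for (int i = 0; i < 32; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
#endif
}
```

A binary built with -mavxvnniint8 requires a processor that implements the extension; the fallback path has no such requirement.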

Auto-vectorization Enhancement

Besides the intrinsics for the new instructions, auto-vectorization has been enhanced to generate vector neural network instructions (AVX-VNNI-INT8). Several other auto-vectorization improvements have also been added to GCC 13.

Auto-vectorization for AVX-VNNI-INT8 vpdpbssd

Auto-vectorization was enhanced to perform idiom recognition (such as the dot-product idiom) that triggers instruction generation for Intel® AVX-VNNI-INT8. Figure 3 shows the compiler generating the vpdpbssd instruction followed by a sum reduction.

int sdot_prod_qi (char * restrict a,
                   char *restrict b, int c, int n) {
  for (int i = 0; i < 32; i++) {
   c += ((int) a[i] * (int) b[i]);
  }
  return c;
}

sdot_prod_qi:
        vmovdqu ymm0, YMMWORD PTR [rsi]
        vpxor   xmm1, xmm1, xmm1
        vpdpbssd        ymm1, ymm0, YMMWORD PTR [rdi]
        vmovdqa xmm0, xmm1
        vextracti128    xmm1, ymm1, 0x1
        vpaddd  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 8
        vpaddd  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 4
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        add     eax, edx
        vzeroupper
        ret

Figure 3: Intel® AVX-VNNI-INT8 idiom recognition in GCC auto-vectorization (-O2 -mavxvnniint8)
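The sum reduction that follows vpdpbssd in Figure 3 can be read as a three-step fold of the eight 32-bit lanes. The scalar model below is only an illustration of that sequence; the function name hsum8_i32 is hypothetical, and the comments map each step to the corresponding instructions in the listing.

```c
#include <stdint.h>

/* Scalar model of the horizontal sum in Figure 3 (illustrative only). */
int32_t hsum8_i32(const int32_t v[8])
{
    int32_t x[4];

    /* vextracti128 + vpaddd: add the upper 128-bit half to the lower half. */
    for (int i = 0; i < 4; i++)
        x[i] = v[i] + v[i + 4];

    /* vpsrldq by 8 bytes + vpaddd: fold lanes 2 and 3 onto lanes 0 and 1. */
    x[0] += x[2];
    x[1] += x[3];

    /* vpsrldq by 4 bytes + vpaddd: fold lane 1 onto lane 0. */
    x[0] += x[1];

    /* vmovd: lane 0 now holds the total. */
    return x[0];
}
```

In the generated code the incoming value of c is added last (add eax, edx), after the vector lanes have been reduced to a single scalar.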

Auto-vectorization for Nonlinear Induction

The original GCC auto-vectorization can only handle linear induction variables, such as

i = i + 1.

When a nonlinear induction variable was used, the whole loop was prevented from being vectorized, which could hurt performance considerably if the loop was a hotspot. With the enhancement, auto-vectorization can now handle several kinds of nonlinear induction.

Figure 4 shows that the nonlinear induction is now vectorized.

void
foo (int* p, int t)
{
    for (int i = 0; i != 256; i++)
    {
      p[i] = t;
      t = -t;
    }
}

foo:
        mov     eax, esi
        movd    xmm0, esi
        neg     eax
        pinsrd  xmm0, eax, 1
        lea     rax, [rdi+1024]
        punpcklqdq      xmm0, xmm0
.L2:
        movups  XMMWORD PTR [rdi], xmm0
        add     rdi, 32
        movups  XMMWORD PTR [rdi-16], xmm0
        cmp     rax, rdi
        jne     .L2
        ret

Figure 4: GCC (-O2 -march=x86-64-v2) nonlinear induction auto-vectorization example.
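Negation is one kind of nonlinear induction. The sketch below shows three more loop shapes (multiply, left shift, right shift) that, to our understanding of the upstream change, fall into the same category; treat the exact set of supported forms as an assumption and confirm with -fopt-info-vec on your compiler. The function names are hypothetical, and unsigned arithmetic is used so that wraparound is well defined.

```c
#include <stdint.h>

/* Multiplicative induction: t is scaled by a constant each iteration. */
void fill_mul(uint32_t *p, uint32_t t)
{
    for (int i = 0; i != 256; i++) {
        p[i] = t;
        t *= 3;
    }
}

/* Left-shift induction: t doubles each iteration. */
void fill_shl(uint32_t *p, uint32_t t)
{
    for (int i = 0; i != 256; i++) {
        p[i] = t;
        t <<= 1;
    }
}

/* Right-shift induction: t halves each iteration. */
void fill_shr(uint32_t *p, uint32_t t)
{
    for (int i = 0; i != 256; i++) {
        p[i] = t;
        t >>= 1;
    }
}
```

Whether a given loop is actually vectorized depends on the optimization level and target options in use.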

Auto-Vectorization for AVX-512 vcvttps2udq

Previously, auto-vectorization did not recognize that vcvttps2udq is available under Intel AVX-512 and emulated the float-to-unsigned conversion with a less efficient sequence built on vcvttps2dq. It now generates a single AVX-512 vcvttps2udq instruction.

void
foo (unsigned* p, float* q)
{
    for (int i = 0; i != 256; i++)
      p[i] = q[i];
}

foo:
        xor     eax, eax
.L2:
        vcvttps2udq     ymm0, YMMWORD PTR [rsi+rax]
        vmovdqu YMMWORD PTR [rdi+rax], ymm0
        add     rax, 32
        cmp     rax, 1024
        jne     .L2
        vzeroupper
        ret

Figure 5: Auto-Vectorization for AVX-512 vcvttps2udq (-O2 -mavx512vl -mprefer-vector-width=256)
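To see why emulation costs extra instructions, consider a scalar sketch of one common way to build a float-to-unsigned conversion out of a signed one. This is the general idea only, not the exact sequence GCC used to emit; the function name is hypothetical, and the input is assumed to lie in [0, 2^32).

```c
#include <stdint.h>

/* Float-to-unsigned conversion built from a signed conversion.
   Assumes 0 <= x < 2^32; other inputs are out of scope for this sketch. */
uint32_t f32_to_u32_via_signed(float x)
{
    const float two31 = 2147483648.0f; /* 2^31 */

    if (x < two31)
        return (uint32_t)(int32_t)x;   /* already fits a signed conversion */

    /* Shift into signed range, convert, then restore the top bit. */
    return (uint32_t)(int32_t)(x - two31) + 0x80000000u;
}
```

A vectorized version of this needs a compare, a subtract, a blend, and an add around the signed conversion, which is the overhead a single vcvttps2udq avoids.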

Codegen Optimization for the x86 Backend

New Option -mdaz-ftz

In Intel processors, the flush-to-zero (FTZ) and denormals-are-zero (DAZ) flags in the MXCSR register are used to control floating-point calculations. Intel® Streaming SIMD Extensions (Intel® SSE) and Intel® AVX instructions, including scalar and vector instructions, benefit from enabling the FTZ and DAZ flags.

Floating-point computations using Intel SSE and Intel AVX instructions run faster when the FTZ and DAZ flags are enabled, because denormal inputs and results are treated as zero instead of being handled on a slower path. The new -mdaz-ftz option sets the FTZ and DAZ bits, which can improve SPECrate 2017 Floating Point suite results by 1.67%. Figure 6 shows more details about the improvement.

Figure 6: GCC 13 -O2 -march=x86-64-v2 -mdaz-ftz SPECrate 2017 Floating Point (64-core) estimated improvement ratio vs GCC 13 -O2 -march=x86-64-v2 on current Intel Xeon Processor.
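The effect of the two flags can be observed by setting the same MXCSR bits by hand. The sketch below is illustrative only and is not how -mdaz-ftz is implemented: it enables FTZ (bit 15) and DAZ (bit 6) with _mm_setcsr, multiplies the smallest normal float by 0.5, and checks whether the denormal result was flushed to zero. The function name is hypothetical; on targets other than x86-64 it returns -1.

```c
#include <float.h>
#if defined(__x86_64__)
#include <xmmintrin.h> /* _mm_getcsr, _mm_setcsr, _MM_FLUSH_ZERO_ON */
#endif

/* Returns 1 if a denormal result was flushed to zero with FTZ/DAZ set,
   0 if it was not, and -1 when not built for x86-64. */
int ftz_flushes_denormal_result(void)
{
#if defined(__x86_64__)
    unsigned int saved = _mm_getcsr();

    /* FTZ is MXCSR bit 15 (_MM_FLUSH_ZERO_ON); DAZ is bit 6 (0x0040). */
    _mm_setcsr(saved | _MM_FLUSH_ZERO_ON | 0x0040u);

    /* volatile keeps the compiler from folding the multiply at build time. */
    volatile float tiny = FLT_MIN;  /* smallest normal float */
    volatile float half = 0.5f;
    volatile float r = tiny * half; /* denormal unless flushed */
    int flushed = (r == 0.0f);

    _mm_setcsr(saved);              /* restore the caller's settings */
    return flushed;
#else
    return -1;
#endif
}
```

With the flags cleared, the same multiplication yields a non-zero denormal value, which is the case that takes the slower path.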

Memcmp Is Optimized with vptest Instruction

If the length of the compared memory fits within a vector register, memcmp can be inlined and optimized using the vptest instruction.

_Bool f256(char *a)
{
  char t[] = "0123456789012345678901234567890";
  return __builtin_memcmp(a, &t[0], sizeof(t)) == 0;
}

f256:
        vmovdqa ymm1, YMMWORD PTR .LC1[rip]
        vpxor   ymm0, ymm1, YMMWORD PTR [rdi]
        vptest  ymm0, ymm0
        sete    al
        vzeroupper
        ret
.LC1:
        .quad   3978425819141910832
        .quad   3833745473465760056
        .quad   3689065127958034230
        .quad   13573712489362740

Figure 7: AVX vptest optimization (-O2 -mavx -mmove-max=256 -mstore-max=256)
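The xor-then-test pattern in Figure 7 has a straightforward scalar analogue, sketched below with four 64-bit words standing in for one 256-bit register. The helper name equal32 is hypothetical; the point is only to show why a single all-zero check on the XOR result is enough to decide equality.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if the 32 bytes at a and b are identical, 0 otherwise. */
int equal32(const void *a, const void *b)
{
    uint64_t x[4], y[4], diff = 0;

    memcpy(x, a, sizeof x);
    memcpy(y, b, sizeof y);

    /* Like vpxor: any non-zero bit marks a position where the inputs differ. */
    for (int i = 0; i < 4; i++)
        diff |= x[i] ^ y[i];

    /* Like vptest + sete: an all-zero result means the buffers are equal. */
    return diff == 0;
}
```

Because only equality is needed, no byte-by-byte ordering has to be computed, which is what makes the branch-free vector form possible.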

Microarchitecture Tuning

As measured by the SPECrate 2017 Floating Point suite, the -mtune=alderlake update significantly improved performance: by 4.5% on E-core and 3.13% on P-core. Figures 8 and 9 show more details about the improvement.

Figure 8: The updated -mtune=alderlake improves SPECrate 2017 Floating Point suite results by 4.5% Geomean on E-core.

Figure 9: The updated -mtune=alderlake improves SPECrate 2017 Floating Point suite results by 3.12% Geomean on P-core.

Future Work

The GCC team at Intel is continuing to enhance vectorization, improve performance, bring more new features to Intel® platforms, and fix issues to increase the code quality in coming GNU compiler releases. We are looking forward to continuing active contribution to this and other open source projects.