Requirements for Vectorizing Loops with #pragma SIMD

ID 782267
Updated 6/26/2023
Version: Republished 06/23/2023
Public

Enhance Performance with Loop Vectorization

  • Pragma SIMD allows a wider variety of loops to be vectorized, including loops containing multiple branches or function calls.

  • For further information about the SIMD pragma/directive, see the Intel® DPC++/C++ Compiler Developer Guide and Reference and the Intel® Fortran Compiler Developer Guide and Reference.

The types of loop that can be vectorized automatically by the Intel® DPC++/C++ Compiler, the Intel® Fortran Compiler Classic, and the Intel® Fortran Compiler are described in the article Requirements for Vectorizable Loops.

Various coding techniques, pragmas and command line options are available to help the compiler to vectorize, as described in the Intel Compiler user guides. The SIMD pragma or directive, also described there, asks the compiler to relax some of the requirements in that article and to make every possible effort to vectorize a loop. If an ASSERT clause is present, the compilation will fail if the loop is not successfully vectorized. This has led to the nickname the "vectorize or die" pragma.

#pragma simd (!DIR$ SIMD for Fortran) behaves somewhat like a combination of #pragma vector always and #pragma ivdep, but is more powerful. The compiler does not try to assess whether vectorization is likely to lead to a performance gain, it does not check for aliasing or dependencies that might cause incorrect results after vectorization, and it does not protect against illegal memory references. #pragma ivdep overrides potential dependencies, but the compiler still performs a dependency analysis and will not vectorize if it finds a proven dependency that would affect results. With #pragma simd, the compiler does no such analysis and tries to vectorize regardless. It is the programmer's responsibility to ensure that there are no backward dependencies that might impact correctness.

The semantics of #pragma simd are rather similar to those of the OpenMP* pragma #pragma omp parallel for. It accepts optional clauses such as REDUCTION, PRIVATE, FIRSTPRIVATE and LASTPRIVATE. SIMD-specific clauses are VECTORLENGTH (which implies the loop unroll factor) and LINEAR, which can specify different strides for different variables. Pragma SIMD allows a wider variety of loops to be vectorized, including loops containing multiple branches or function calls.
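As a rough illustration of the clauses mentioned above, the following sketch applies #pragma simd to two simple loops. The function names dot and saxpy are hypothetical and chosen only for this example, and the clause spellings follow the Intel syntax as commonly documented; check the compiler reference for the exact form your compiler version accepts. A compiler that does not recognize #pragma simd ignores it (possibly with a warning), so the code still runs as ordinary scalar loops.

```c
/* Dot product. The REDUCTION clause states that `sum` is accumulated
   across iterations, so partial sums may be kept per vector lane. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
#pragma simd reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}

/* y[i] += a * x[i]. With #pragma simd the compiler performs no aliasing
   or dependency checks, so the CALLER must guarantee that x and y do not
   overlap. VECTORLENGTH(4) requests four elements per vector iteration. */
void saxpy(float a, const float *x, float *y, int n)
{
#pragma simd vectorlength(4)
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

For instance, calling dot on {1, 2, 3, 4} and {4, 3, 2, 1} yields 20. If saxpy were called with y pointing one element ahead of x, each iteration would read a value written by the previous one, and vectorizing would silently change the result; that is exactly the kind of backward dependency the programmer has to rule out.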

Nevertheless, the technology underlying the SIMD pragma is still that of the compiler vectorizer, and some restrictions remain on what types of loop can be vectorized:

  • The loop must be countable, i.e. the number of iterations must be known before the loop starts to execute, though it need not be known at compile time. Consequently, there must be no data-dependent exit conditions, such as break (C/C++) or EXIT (Fortran) statements. This also excludes most "while" loops. Typical diagnostics:

error: invalid simd pragma

warning #8410: Directive SIMD must be followed by counted DO loop.

  • Certain special, non-mathematical operators are not supported, and also certain combinations of operators and of data types, with diagnostic messages such as

"operation not supported", "unsupported reduction", "unsupported data type".

  • Very complex array subscripts or pointer arithmetic may not be vectorized; a typical diagnostic message is "dereference too complex".
  • Loops with very low trip counts may not be vectorized. Typical diagnostic:

remark: loop was not vectorized: low trip count. 

  • Extremely large loop bodies (very many lines and symbols) may not be vectorized. The compiler has internal limits that prevent it from vectorizing loops that would require a very large number of vector registers, with many spills and restores to and from memory. 
  • SIMD directives may not be applied to loops containing C++ exception handling code.
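The countability requirement in the list above can often be met by separating the data-dependent search from the arithmetic. The sketch below is one possible way to do this, not a prescribed pattern; the function names length_until_negative and sum_prefix are hypothetical. The first loop has a data-dependent exit and is left scalar. The second loop has a trip count that is fixed before it starts (though only known at run time), so it is a candidate for #pragma simd.

```c
/* Scalar scan with a data-dependent exit condition. A loop like this is
   not countable and cannot carry #pragma simd. */
int length_until_negative(const float *a, int max_n)
{
    int n = 0;
    while (n < max_n && a[n] >= 0.0f)
        ++n;
    return n;
}

/* Sum of the leading non-negative elements of a[0..max_n-1]. */
float sum_prefix(const float *a, int max_n)
{
    /* The trip count is determined here, before the counted loop begins. */
    int n = length_until_negative(a, max_n);
    float sum = 0.0f;
#pragma simd reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
```

With the input {1, 2, 3, -1, 5}, length_until_negative returns 3 and sum_prefix returns 6. Whether the split pays off depends on how expensive the loop body is relative to the extra scan.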

A number of the requirements detailed in Requirements for Vectorizable Loops are relaxed for #pragma simd, in addition to the above-mentioned ones relating to dependencies and performance estimates. Non-inner loops may be vectorized in certain cases; more mixing of different data types is allowed; function calls are possible and more complex control flow is supported. Nevertheless, the advice in the above article should be followed where possible, since it is likely to improve performance.
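As a hedged sketch of the more complex control flow and function calls described above, the loop below combines an if / else-if chain with a call to a small helper. The names soft_clip and shape are hypothetical, and the example assumes the helper is simple enough for the compiler to inline; whether a particular compiler actually vectorizes this loop has to be confirmed from its vectorization report.

```c
/* Hypothetical helper: maps v smoothly into the open interval (-1, 1).
   Kept small so that the compiler can inline it into the loop body. */
static inline float soft_clip(float v)
{
    float mag = (v < 0.0f) ? -v : v;
    return v / (1.0f + mag);
}

/* Clamp values outside [lo, hi] and soft-clip the rest. The caller must
   ensure that `in` and `out` do not partially overlap, because
   #pragma simd skips the compiler's own dependency analysis. */
void shape(const float *in, float *out, int n, float lo, float hi)
{
#pragma simd
    for (int i = 0; i < n; ++i) {
        float v = in[i];
        if (v < lo)
            out[i] = lo;
        else if (v > hi)
            out[i] = hi;
        else
            out[i] = soft_clip(v);
    }
}
```

For the input {-2, 0, 1, 5} with lo = -1 and hi = 3, the output is {-1, 0, 0.5, 3}. In vectorized form, branches like these are typically executed under masks, so every path may be evaluated for every lane; keeping the branches cheap follows the article's advice on performance.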

Side effects

With #pragma simd, loops are vectorized under the "fast" floating-point model, corresponding to /fp:fast (-fp-model=fast). The command line option /fp:precise (-fp-model precise) is not respected by a loop vectorized with #pragma simd; such a loop might not give identical results to a loop without #pragma simd. For further information about the floating-point model, see Consistency of Floating-Point Results using the Intel Compiler.
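One reason results can differ is that a vectorized reduction adds the elements in a different order from the scalar loop. The sketch below imitates this in plain C, without any pragma: sum_four_lanes keeps four partial sums, much as a four-wide vector reduction might, and combines them at the end. Both function names are hypothetical, and the lane count of four is an assumption made for illustration; the order a real compiler uses is not specified here.

```c
#include <stddef.h>

/* Strict left-to-right summation, as in the unvectorized loop. */
double sum_sequential(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Emulates a four-lane vector reduction: lane j accumulates elements
   j, j+4, j+8, ... and the lanes are combined after the main loop. */
double sum_four_lanes(const double *a, size_t n)
{
    double lane[4] = {0.0, 0.0, 0.0, 0.0};
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (int j = 0; j < 4; ++j)
            lane[j] += a[i + j];
    double s = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    for (; i < n; ++i)      /* remainder elements, summed in order */
        s += a[i];
    return s;
}
```

For the eight values {1e16, 1, 1, 1, -1e16, 1, 1, 1}, and assuming arithmetic is carried out in IEEE 754 double precision, sum_sequential returns 3 because each 1 added to 1e16 is rounded away, whereas sum_four_lanes returns 6, which happens to be the exact sum. Neither order is more accurate in general; the point is only that the two orders need not agree.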
