What to Do When Auto-Vectorization Fails?

Introduction

The following article is a follow up and a detailed analysis of a problem reported on the Intel® Developer Zone forum¹ dedicated to the Intel® C++ Compiler ².

An Intel Developer Zone user implemented a simple program as part of a code modernization workshop, and a problem with an inner for-loop was detected. Here is a piece of the code related to the problem:

	...
	for (std::size_t i = 0; i < nb_cluster; ++i) {
	float x = point[k].red - centroid[i].red;
	float y = point[k].green - centroid[i].green;
	float z = point[k].blue - centroid[i].blue;
	float distance = std::pow(x, 2) + std::pow(y, 2) + std::pow(z, 2);
	if (distance < best_distance) {
	best_distance = distance;
	best_centroid = i;
	}
	...

Note: This is not an auto-vectorized inner for-loop from KmcTestAppV1.cpp.

The Intel DZ user suspects that the inner for-loop was not auto-vectorized because a variable 'i' was declared as a 'std::size_t' data type, that is as 'unsigned int'.

Unmodified source code⁶ is attached. See KmcTestAppV1.cpp for more details.

Note that this article is not a tutorial on vectorization or parallelization techniques. However, a brief overview of these techniques will be given in the next part of this article.

A brief overview of vectorization and parallelization techniques

Modern software is complex and in order to achieve peak performance, especially when doing data-intensive processing, the vectorization and parallelization capabilities of modern CPUs, which could have many cores with several Logical Processing Units (LPUs) and Vector Processing Units (VPUs), need to be fully used.

VPUs allow different operations to be performed on multiple values of a data set simultaneously, and this technique, called vectorization, increases the performance of the processing when compared to the same processing implemented in a scalar, or sequential, way.

Parallelization is another technique, which allows different parts of a data set to be processed at the same time by different LPUs.

When vectorization and parallelization are combined, the performance of the processing can be boosted significantly.

Generic vectorization rules

You need to take into account the following generic rules related to vectorization of source codes:

A modern C/C++ compiler needs to be used with vectorization support.
Two types of vectorization techniques can be used: auto-vectorization (AV) and explicit vectorization (EV).
Only relatively simple inner for-loops can be vectorized.
Some inner for-loops cannot be vectorized with AV or EV techniques, because complex C or C++ constructions are used, for example, Standard Template Library classes or C++ operators.
It is recommended to review and analyze all cases when a modern C/C++ compiler cannot vectorize inner for-loops.

How an inner for-loop counter variable should be declared

The AV technique is considered the most effective for simple inner for-loops, because no code modifications are required, and AV of modern C/C++ compilers is enabled by default when optimization options 'O2' or 'O3' are used.

In more complex cases EV can be used to force vectorization using intrinsic functions, or vectorization #pragma directives, but it requires some modifications of inner for-loops.

A question can be asked: How should an inner for-loop counter variable be declared?

Two possible declarations can be considered:

Case A - Variable 'i' is declared as 'int'

	...
	for( int i = 0; i < n; i += 1 )
	{
		A[i] = A[i] + B[i];
	}
	...

and

Case B - Variable 'i' is declared as 'unsigned int'

	...
	for( unsigned int i = 0; i < n; i += 1 )
	{
		A[i] = A[i] + B[i];
	}
	...

In Case A the variable 'i' is declared as a signed data type 'int'.

In Case B the variable 'i' is declared as an unsigned data type 'unsigned int'.

Cases A and B could be combined in a simple test program ³ to evaluate vectorization capabilities of a C/C++ compiler:

////////////////////////////////////////////////////////////////////////////////////////////////////
// TestApp.cpp - To generate assembly listings an option '-S' needs to be used.
// Linux:
//		icpc -O3 -xAVX -qopt-report=1 TestApp.cpp -o TestApp.out
//		g++ -O3 -mavx -ftree-vectorizer-verbose=1 TestApp.cpp -o TestApp.out
// Windows:
//		icl   -O3 /QxAVX /Qvec-report=1 TestApp.cpp TestApp.exe
//		g++ -O3 -mavx -ftree-vectorizer-verbose=1 TestApp.cpp -o TestApp.exe

#include <stdio.h>
#include <stdlib.h>
//

////////////////////////////////////////////////////////////////////////////////////////////////////

	typedef float			RTfnumber;

	typedef int				RTiterator;			// Uncomment for Test A
	typedef int				RTinumber;
//	typedef unsigned int	RTiterator;			// Uncomment for Test B
//	typedef unsigned int	RTinumber;

////////////////////////////////////////////////////////////////////////////////////////////////////

	const RTinumber iDsSize = 1024;

////////////////////////////////////////////////////////////////////////////////////////////////////

int main( void )
{
	RTfnumber fDsA[ iDsSize ];
	RTfnumber fDsB[ iDsSize ];

	RTiterator i;

	for( i = 0; i < iDsSize; i += 1 )
		fDsA[i] = ( RTfnumber )( i );
	for( i = 0; i < iDsSize; i += 1 )
		fDsB[i] = ( RTfnumber )( i );

	for( i = 0; i < 16; i += 1 )
		printf( "%4.1f ", fDsA[i] );
	printf( "\n" );
	for( i = 0; i < 16; i += 1 )
		printf( "%4.1f ", fDsB[i] );
	printf( "\n" );

	for( i = 0; i < iDsSize; i += 1 )
		fDsA[i] = fDsA[i] + fDsB[i];			// Line 49

	for( i = 0; i < 16; i += 1 )
		printf( "%4.1f ", fDsA[i] );
	printf( "\n" );

	return ( int )1;
}

It turns out that these two for-loops (see Line 49 in the code sample above) are easily vectorizible⁴ (instructions with a prefix 'v' are used, like vmovups, vaddps, and so on) and the Intel C++ Compiler generated identical vectorization reports regardless of how the variable 'i' is declared:

Vectorization report for cases A and B

...
	Begin optimization report for: main()
	Report from: Interprocedural optimizations [ipo]
	INLINE REPORT: (main())
	Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]
	LOOP BEGIN at TestApp.cpp(37,2)
		remark #25045: Fused Loops: ( 37 39 )
		remark #15301: FUSED LOOP WAS VECTORIZED
	LOOP END
	LOOP BEGIN at TestApp.cpp(39,2)
	LOOP END
	LOOP BEGIN at TestApp.cpp(42,2)
		remark #25460: No loop optimizations reported
	LOOP END
	LOOP BEGIN at TestApp.cpp(45,2)
		remark #25460: No loop optimizations reported
	LOOP END
	LOOP BEGIN at TestApp.cpp(49,2)
		remark #15300: LOOP WAS VECTORIZED
	LOOP END
	LOOP BEGIN at TestApp.cpp(52,2)
		remark #25460: No loop optimizations reported
	LOOP END
...

The vectorization reports⁴ show that a for-loop at Line 49 ³ was vectorized:

	...
	LOOP BEGIN at TestApp.cpp(49,2)
		remark #15300: LOOP WAS VECTORIZED
	LOOP END
	...

However, the Intel C++ Compiler considers these two for-loops as different C language constructions and generates different vectorized binary codes.

Here are the two core pieces of assembler listings, related to the for-loop at Line 49³, for both cases:

Case A - Assembler listing (option '-S' needs to be used when compiling TestApp.cpp)

...
..B1.12:								# Preds ..B1.12 ..B1.11
	vmovups		(%rsp,%rax,4), %ymm0					#50.13
	vmovups		32(%rsp,%rax,4), %ymm2					#50.13
	vmovups		64(%rsp,%rax,4), %ymm4					#50.13
	vmovups		96(%rsp,%rax,4), %ymm6					#50.13
	vaddps		4128(%rsp,%rax,4), %ymm2, %ymm3			#50.23
	vaddps		4096(%rsp,%rax,4), %ymm0, %ymm1			#50.23
	vaddps		4160(%rsp,%rax,4), %ymm4, %ymm5			#50.23
	vaddps		4192(%rsp,%rax,4), %ymm6, %ymm7			#50.23
	vmovups		%ymm1, (%rsp,%rax,4)					#50.3
	vmovups		%ymm3, 32(%rsp,%rax,4)					#50.3
	vmovups		%ymm5, 64(%rsp,%rax,4)					#50.3
	vmovups		%ymm7, 96(%rsp,%rax,4)					#50.3
	addq		$32, %rax#49.2
	cmpq		$1024, %rax								#49.2
	jb			..B1.12						# Prob 99%	#49.2
...

Note: See TestApp.icc.itype.s^5.1 for a complete assembler listing.

Case B - Assembler listing (option '-S' needs to be used when compiling TestApp.cpp)

...
..B1.12:								# Preds ..B1.12 ..B1.11
	lea			8(%rax), %edx							#50.13
	lea			16(%rax), %ecx							#50.13
	lea			24(%rax), %esi							#50.13
	vmovups		(%rsp,%rax,4), %ymm0					#50.13
	vaddps		4096(%rsp,%rax,4), %ymm0, %ymm1			#50.23
	vmovups		%ymm1, (%rsp,%rax,4)					#50.3
	addl		$32, %eax								#49.2
	vmovups		(%rsp,%rdx,4), %ymm2					#50.13
	cmpl		$1024, %eax								#49.2
	vaddps		4096(%rsp,%rdx,4), %ymm2, %ymm3			#50.23
	vmovups		%ymm3, (%rsp,%rdx,4)					#50.3
	vmovups		(%rsp,%rcx,4), %ymm4					#50.13
	vaddps		4096(%rsp,%rcx,4), %ymm4, %ymm5			#50.23
	vmovups		%ymm5, (%rsp,%rcx,4)					#50.3
	vmovups		(%rsp,%rsi,4), %ymm6					#50.13
	vaddps		4096(%rsp,%rsi,4), %ymm6, %ymm7			#50.23
	vmovups		%ymm7, (%rsp,%rsi,4)					#50.3
	jb			..B1.12						# Prob 99%	#49.2
...

Note: See TestApp.icc.utype.s^5.2 for a complete assembler listing.

It is finally clear that the problem where the inner for-loop is not auto-vectorized (see beginning of the forum posting¹) is not related to how the variable 'i' is declared, and that something else is affecting a vectorization engine of the Intel C++ Compiler.

In order to pinpoint a root cause of the vectorization problem a question needs to be asked: What compiler messages will be generated when AV or EV techniques cannot be applied?

A small list of some “loop was not vectorized” messages of the Intel C++ Compiler when AV or EV techniques can't be applied is as follows:

...loop was not vectorized: not inner loop.
...loop was not vectorized: existence of vector dependence.
...loop was not vectorized: statement cannot be vectorized.
...loop was not vectorized: unsupported reduction.
...loop was not vectorized: unsupported loop structure.
...loop was not vectorized: vectorization possible but seems inefficient.
...loop was not vectorized: statement cannot be vectorized.
...loop was not vectorized: nonstandard loop is not a vectorization candidate.
...loop was not vectorized: dereference too complex.
...loop was not vectorized: statement cannot be vectorized.
...loop was not vectorized: conditional assignment to a scalar.
...warning #13379: loop was not vectorized with "simd".
...loop skipped: multiversioned.

One message deserves special attention:

...loop was not vectorized: unsupported loop structure.

It is seen in KmcTestAppV1.cpp⁶ that the inner for-loop has three parts:

Part 1 - Initialization of x, y, and z variables

...
float x = point[k].red - centroid[i].red;
float y = point[k].green - centroid[i].green;
float z = point[k].blue - centroid[i].blue;
...

Part 2 - Calculation of a distance between points x, y, and z

...
float distance = std::pow(x, 2) + std::pow(y, 2) + std::pow(z, 2);
...

Part 3 - Update of a 'best_distance' variable

...
if (distance < best_distance) {
best_distance = distance;
best_centroid = i;
}
...

Because all these parts are in the same inner for-loop, the Intel C++ Compiler cannot match its structure to a predefined vectorization template. However, Part 3, with a conditional if-statement, is the root cause of the vectorization problem.

A possible solution of the vectorization problem is to split the inner for-loop into three parts as follows:

...														// Calculate Distance
for( i = 0; i < nb_cluster; i += 1 )
{
	float x = point[k].red - centroid[i].red;
	float y = point[k].green - centroid[i].green;
	float z = point[k].blue - centroid[i].blue;			// Performance improvement: ( x * x ) is
	distance[i] = ( x * x ) + ( y * y ) + ( z * z );	// used instead of std::pow(x, 2), etc
}
														// Best Distance
for( i = 0; i < nb_cluster; i += 1 )
{
	best_distance = ( distance[i] < best_distance ) ? ( float )distance[i] : best_distance;
}
														// Best Centroid
for( i = 0; i < nb_cluster; i += 1 )
{
	cluster[k] = ( distance[i] < best_distance ) ? ( float )i : best_centroid;
}
...

The most important two modifications are related to the conditional if-statement in the for-loop. It was modified from a generic form:

...
if( A < B )
{
	D = val1
	C = val3
}
...

to a form that uses two conditional operators ( ? : ):

...
D = ( A < B ) ? ( val1 ) : ( val2 )
...
C = ( A < B ) ? ( val3 ) : ( val4 )
...

also known as ternary operators. Now a modern C/C++ compiler can match this C language construction to a pre-defined vectorization template.

Performance evaluation of unmodified and modified source code

A performance evaluation of both versions of the program for 1,000,000 points, 1,000 clusters, and 10 iterations was completed and the results are as follows:

...>KmcTestAppV1.exe
		Time: 111.50

Note: Original version⁶.

...>KmcTestAppV2.exe			
	Time:  20.48

Note: Optimized and vectorized version⁷.

The optimized and vectorized version⁷ is about 5.5x faster than the original version of the program (see ¹ or ⁶). Times are in seconds.

Conclusion

If a modern C/C++ compiler fails to vectorize a for-loop it is important to evaluate its complexity. In the case of the Intel C++ Compiler, 'opt-report=n' option needs to be used (n greater than 3).

In most cases, a C/C++ compiler cannot vectorize the for-loop because it cannot match its structure to a predefined vectorization template. For example, in the case of the Intel C++ Compiler, the following vectorization messages would be reported:

...loop was not vectorized: unsupported reduction.

...loop was not vectorized: unsupported loop structure.

If this is the case, you need to modify the for-loop to simplify its structure, consider EV techniques using #pragma directives, like #pragma simd, or consider reimplementation of the required functionality using intrinsic functions.

About the author

Sergey Kostrov is a highly experienced C/C++ software engineer and Intel® Black Belt. He is an expert in design and implementation of highly portable C/C++ software for embedded and desktop platforms, scientific algorithms and high-performance computing of big data sets.

Downloads

WhatToDoWhenAVFails.zip

List of all files (sources, assembly listings and vectorization reports):

KmcTestAppV1.cpp
KmcTestAppV2.cpp
TestApp.cpp
TestApp.icc.itype.rpt
TestApp.icc.utype.rpt
TestApp.icc.itype.s
TestApp.icc.utype.s

选择您的语言

使用 Intel.com 搜索

快速链接

最近搜索

高级搜索

仅搜索