What to Do When Auto Vectorization Fails?

Published Date
02 - Mar - 2017
| Last Updated
03 - Mar - 2017
 
What to Do When Auto Vectorization Fails?

Introduction

The following article is a follow up and a detailed analysis of a problem reported on the Intel® Developer Zone (Intel® DZ) forum1 dedicated to the Intel® C++ Compiler 2.

An Intel DZ user implemented a simple program as part of a code modernization workshop, and a problem with an inner for-loop was detected. Here is a piece of the code related to the problem:

...
for (std::size_t i = 0; i < nb_cluster; ++i) {
float x = point[k].red - centroid[i].red;
float y = point[k].green - centroid[i].green;
float z = point[k].blue - centroid[i].blue;
float distance = std::pow(x, 2) + std::pow(y, 2) + std::pow(z, 2);
if (distance < best_distance) {
best_distance = distance;
best_centroid = i;
}
...

Note: This is not an auto-vectorized inner for-loop from KmcTestAppV1.cpp.

The Intel DZ user suspects that the inner for-loop was not auto-vectorized because a variable 'i' was declared as a 'std::size_t' data type, that is as 'unsigned int'.

Unmodified source code6 is attached. See KmcTestAppV1.cpp for more details.

Note that this article is not a tutorial on vectorization or parallelization techniques. However, a brief overview of these techniques will be given in the next part of this article.

A brief overview of vectorization and parallelization techniques

Modern software is complex and in order to achieve peak performance, especially when doing data-intensive processing, the vectorization and parallelization capabilities of modern CPUs, which could have many cores with several Logical Processing Units (LPUs) and Vector Processing Units (VPUs), need to be fully used.

VPUs allow different operations to be performed on multiple values of a data set simultaneously, and this technique, called vectorization, increases the performance of the processing when compared to the same processing implemented in a scalar, or sequential, way.

Parallelization is another technique, which allows different parts of a data set to be processed at the same time by different LPUs.

When vectorization and parallelization are combined, the performance of the processing can be boosted significantly.

Generic vectorization rules

You need to take into account the following generic rules related to vectorization of source codes:

  • A modern C/C++ compiler needs to be used with vectorization support.
  • Two types of vectorization techniques can be used: auto-vectorization (AV) and explicit vectorization (EV).
  • Only relatively simple inner for-loops can be vectorized.
  • Some inner for-loops cannot be vectorized with AV or EV techniques, because complex C or C++ constructions are used, for example, Standard Template Library classes or C++ operators.
  • It is recommended to review and analyze all cases when a modern C/C++ compiler cannot vectorize inner for-loops.

How an inner for-loop counter variable should be declared

The AV technique is considered the most effective for simple inner for-loops, because no code modifications are required, and AV of modern C/C++ compilers is enabled by default when optimization options 'O2' or 'O3' are used.

In more complex cases EV can be used to force vectorization using intrinsic functions, or vectorization#pragma directives, but it requires some modifications of inner for-loops.

A question can be asked: How should an inner for-loop counter variable be declared?

Two possible declarations can be considered:

Case A - Variable 'i' is declared as 'int'

...
for( int i = 0; i < n; i += 1 )
{
    A[i] = A[i] + B[i];
}
...
and

Case B - Variable 'i' is declared as 'unsigned int'

...
for( unsigned int i = 0; i < n; i += 1 )
{
    A[i] = A[i] + B[i];
}
...

In Case A the variable 'i' is declared as a signed data type 'int'.

In Case B the variable 'i' is declared as an unsigned data type 'unsigned int'.

Cases A and B could be combined in a simple test program 3 to evaluate vectorization capabilities of a C/C++ compiler:

////////////////////////////////////////////////////////////////////////////////////////////////////
// TestApp.cpp - To generate assembly listings an option '-S' needs to be used.
// Linux:
//      icpc -O3 -xAVX -qopt-report=1 TestApp.cpp -o TestApp.out
//      g++ -O3 -mavx -ftree-vectorizer-verbose=1 TestApp.cpp -o TestApp.out
// Windows:
//      icl   -O3 /QxAVX /Qvec-report=1 TestApp.cpp TestApp.exe
//      g++ -O3 -mavx -ftree-vectorizer-verbose=1 TestApp.cpp -o TestApp.exe
#include <stdio.h>
#include <stdlib.h>
//
////////////////////////////////////////////////////////////////////////////////////////////////////
    typedef float           RTfnumber;
    typedef int             RTiterator;         // Uncomment for Test A
    typedef int             RTinumber;
//  typedef unsigned int    RTiterator;         // Uncomment for Test B
//  typedef unsigned int    RTinumber;
////////////////////////////////////////////////////////////////////////////////////////////////////
    const RTinumber iDsSize = 1024;
////////////////////////////////////////////////////////////////////////////////////////////////////
int main( void )
{
    RTfnumber fDsA[ iDsSize ];
    RTfnumber fDsB[ iDsSize ];
    RTiterator i;
    for( i = 0; i < iDsSize; i += 1 )
        fDsA[i] = ( RTfnumber )( i );
    for( i = 0; i < iDsSize; i += 1 )
        fDsB[i] = ( RTfnumber )( i );
    for( i = 0; i < 16; i += 1 )
        printf( "%4.1f ", fDsA[i] );
    printf( "\n" );
    for( i = 0; i < 16; i += 1 )
        printf( "%4.1f ", fDsB[i] );
    printf( "\n" );
    for( i = 0; i < iDsSize; i += 1 )
        fDsA[i] = fDsA[i] + fDsB[i];            // Line 49
    for( i = 0; i < 16; i += 1 )
        printf( "%4.1f ", fDsA[i] );
    printf( "\n" );
    return ( int )1;
}

It turns out that these two for-loops (see Line 49 in the code sample above) are easily vectorizible4(instructions with a prefix 'v' are used, like vmovups, vaddps, and so on) and the Intel C++ Compiler generated identical vectorization reports regardless of how the variable 'i' is declared:

Vectorization report for cases A and B

...
    Begin optimization report for: main()
    Report from: Interprocedural optimizations [ipo]
    INLINE REPORT: (main())
    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]
    LOOP BEGIN at TestApp.cpp(37,2)
        remark #25045: Fused Loops: ( 37 39 )
        remark #15301: FUSED LOOP WAS VECTORIZED
    LOOP END
    LOOP BEGIN at TestApp.cpp(39,2)
    LOOP END
    LOOP BEGIN at TestApp.cpp(42,2)
        remark #25460: No loop optimizations reported
    LOOP END
    LOOP BEGIN at TestApp.cpp(45,2)
        remark #25460: No loop optimizations reported
    LOOP END
    LOOP BEGIN at TestApp.cpp(49,2)
        remark #15300: LOOP WAS VECTORIZED
    LOOP END
    LOOP BEGIN at TestApp.cpp(52,2)
        remark #25460: No loop optimizations reported
    LOOP END
...

The vectorization reports4 show that a for-loop at Line 49 3 was vectorized:

...
LOOP BEGIN at TestApp.cpp(49,2)
    remark #15300: LOOP WAS VECTORIZED
LOOP END
...

However, the Intel C++ Compiler considers these two for-loops as different C language constructions and generates different vectorized binary codes.

Here are the two core pieces of assembler listings, related to the for-loop at Line 493, for both cases:

Case A - Assembler listing (option '-S' needs to be used when compiling TestApp.cpp)

...
..B1.12:                                # Preds ..B1.12 ..B1.11
    vmovups     (%rsp,%rax,4), %ymm0                    #50.13
    vmovups     32(%rsp,%rax,4), %ymm2                  #50.13
    vmovups     64(%rsp,%rax,4), %ymm4                  #50.13
    vmovups     96(%rsp,%rax,4), %ymm6                  #50.13
    vaddps      4128(%rsp,%rax,4), %ymm2, %ymm3         #50.23
    vaddps      4096(%rsp,%rax,4), %ymm0, %ymm1         #50.23
    vaddps      4160(%rsp,%rax,4), %ymm4, %ymm5         #50.23
    vaddps      4192(%rsp,%rax,4), %ymm6, %ymm7         #50.23
    vmovups     %ymm1, (%rsp,%rax,4)                    #50.3
    vmovups     %ymm3, 32(%rsp,%rax,4)                  #50.3
    vmovups     %ymm5, 64(%rsp,%rax,4)                  #50.3
    vmovups     %ymm7, 96(%rsp,%rax,4)                  #50.3
    addq        $32, %rax#49.2
    cmpq        $1024, %rax                             #49.2
    jb          ..B1.12                     # Prob 99%  #49.2
...

Note: See TestApp.icc.itype.s5.1 for a complete assembler listing.

Case B - Assembler listing (option '-S' needs to be used when compiling TestApp.cpp)

...
..B1.12:                                # Preds ..B1.12 ..B1.11
    lea         8(%rax), %edx                           #50.13
    lea         16(%rax), %ecx                          #50.13
    lea         24(%rax), %esi                          #50.13
    vmovups     (%rsp,%rax,4), %ymm0                    #50.13
    vaddps      4096(%rsp,%rax,4), %ymm0, %ymm1         #50.23
    vmovups     %ymm1, (%rsp,%rax,4)                    #50.3
    addl        $32, %eax                               #49.2
    vmovups     (%rsp,%rdx,4), %ymm2                    #50.13
    cmpl        $1024, %eax                             #49.2
    vaddps      4096(%rsp,%rdx,4), %ymm2, %ymm3         #50.23
    vmovups     %ymm3, (%rsp,%rdx,4)                    #50.3
    vmovups     (%rsp,%rcx,4), %ymm4                    #50.13
    vaddps      4096(%rsp,%rcx,4), %ymm4, %ymm5         #50.23
    vmovups     %ymm5, (%rsp,%rcx,4)                    #50.3
    vmovups     (%rsp,%rsi,4), %ymm6                    #50.13
    vaddps      4096(%rsp,%rsi,4), %ymm6, %ymm7         #50.23
    vmovups     %ymm7, (%rsp,%rsi,4)                    #50.3
    jb          ..B1.12                     # Prob 99%  #49.2
...

Note: See TestApp.icc.utype.s5.2 for a complete assembler listing.

It is finally clear that the problem where the inner for-loop is not auto-vectorized (see beginning of the forum posting1) is not related to how the variable 'i' is declared, and that something else is  affecting a vectorization engine of the Intel C++ Compiler.

In order to pinpoint a root cause of the vectorization problem a question needs to be asked: What compiler messages will be generated when AV or EV techniques cannot be applied?

A small list of some “loop was not vectorized” messages of the Intel C++ Compiler when AV or EV techniques can't be applied is as follows:

...loop was not vectorized: not inner loop.
...loop was not vectorized: existence of vector dependence.
...loop was not vectorized: statement cannot be vectorized.
...loop was not vectorized: unsupported reduction.
...loop was not vectorized: unsupported loop structure.
...loop was not vectorized: vectorization possible but seems inefficient.
...loop was not vectorized: statement cannot be vectorized.
...loop was not vectorized: nonstandard loop is not a vectorization candidate.
...loop was not vectorized: dereference too complex.
...loop was not vectorized: statement cannot be vectorized.
...loop was not vectorized: conditional assignment to a scalar.
...warning #13379: loop was not vectorized with "simd".
...loop skipped: multiversioned.

One message deserves special attention:

...loop was not vectorized: unsupported loop structure.

It is seen in KmcTestAppV1.cpp6 that the inner for-loop has three parts:

Part 1 - Initialization of x, y, and z variables

...
float x = point[k].red - centroid[i].red;
float y = point[k].green - centroid[i].green;
float z = point[k].blue - centroid[i].blue;
...

Part 2 - Calculation of a distance between points x, y, and z

...
float distance = std::pow(x, 2) + std::pow(y, 2) + std::pow(z, 2);
...

Part 3 - Update of a 'best_distance' variable

...
if (distance < best_distance) {
best_distance = distance;
best_centroid = i;
}
...

Because all these parts are in the same inner for-loop, the Intel C++ Compiler cannot match its structure to a predefined vectorization template. However, Part 3, with a conditional if-statement, is the root cause of the vectorization problem.

A possible solution of the vectorization problem is to split the inner for-loop into three parts as follows:

...                                                     // Calculate Distance
for( i = 0; i < nb_cluster; i += 1 )
{
    float x = point[k].red - centroid[i].red;
    float y = point[k].green - centroid[i].green;
    float z = point[k].blue - centroid[i].blue;         // Performance improvement: ( x * x ) is
    distance[i] = ( x * x ) + ( y * y ) + ( z * z );    // used instead of std::pow(x, 2), etc
}
                                                        // Best Distance
for( i = 0; i < nb_cluster; i += 1 )
{
    best_distance = ( distance[i] < best_distance ) ? ( float )distance[i] : best_distance;
}
                                                        // Best Centroid
for( i = 0; i < nb_cluster; i += 1 )
{
    cluster[k] = ( distance[i] < best_distance ) ? ( float )i : best_centroid;
}
...

The most important two modifications are related to the conditional if-statement in the for-loop. It was modified from a generic form:

...
if( A < B )
{
    D = val1
    C = val3
}
...

to a form that uses two conditional operators ( ? : ):

...
D = ( A < B ) ? ( val1 ) : ( val2 )
...
C = ( A < B ) ? ( val3 ) : ( val4 )
...

also known as ternary operators. Now a modern C/C++ compiler can match this C language construction to a pre-defined vectorization template.

Performance evaluation of unmodified and modified source code

A performance evaluation of both versions of the program for 1,000,000 points, 1,000 clusters, and 10 iterations was completed and the results are as follows:

...>KmcTestAppV1.exe
        Time: 111.50

Note: Original version6.

...>KmcTestAppV2.exe        
    Time:  20.48

Note: Optimized and vectorized version7.

The optimized and vectorized version7 is about 5.5x faster than the original version of the program (see 1or 6). Times are in seconds.

Conclusion

If a modern C/C++ compiler fails to vectorize a for-loop it is important to evaluate its complexity. In the case of the Intel C++ Compiler, 'opt-report=n' option needs to be used (n greater than 3).

In most cases, a C/C++ compiler cannot vectorize the for-loop because it cannot match its structure to a predefined vectorization template. For example, in the case of the Intel C++ Compiler, the following vectorization messages would be reported:

...loop was not vectorized: unsupported reduction.
or
...loop was not vectorized: unsupported loop structure.

If this is the case, you need to modify the for-loop to simplify its structure, consider EV techniques using#pragma directives, like #pragma simd, or consider reimplementation of the required functionality using intrinsic functions.

For more such intel Modern Code and tools from Intel, please visit the Intel® Modern Code

Source:https://software.intel.com/en-us/articles/what-to-do-when-auto-vectorization-fails