Fortran Array Data and Arguments and Vectorization

ID 标签 688534
已更新 3/4/2019
版本 Latest
公共

author-image

作者

Vectorization Essentials, Fortran Array Data and Arguments and Vectorization

Overview

This document provides examples to various array types in Fortran and their usage as local variables, function/subroutine parameters, as well as, examples to Fortran pointers and their usage. This document shows how various array data types and arguments are vectorized by the compiler. This includes examples of high level source code and explains the code generated by the compiler for these cases.

Topics

Fortran Array Data and Arguments, Vectorization Examples

The modern Fortran language has various types of arrays and an array sectioning feature that enables subsections of arrays to be passed as parameters to functions or be pointed by Fortran pointers. For loops that operate on arrays, the compiler can generate unit stride vector code, non-unit stride code, or multi-version code that defers the identification of which code version to execute to runtime. The compiler creates code for a loop based on the type of array chosen by the programmer, whether array sectioning is used for that array, and whether the arrays are passed as function parameters.

To use the SIMD (single instruction, multiple data) instructions, code with non-unit stride memory accesses uses gather/scatter instructions (vgather/vscatter). Code with unit stride memory accesses can use much more efficient loads and stores. Code with unit stride memory accesses can use aligned load/store instructions (vmovaps or equivalent) or unaligned load/store instructions (vloadunpack/vpackstore) based on the address alignment of the target memory location.

In the case where compiler generates unit stride vectorized code, it is also very important to have the arrays aligned to achieve good performance (i.e., use aligned load/store instructions). The programmer can instruct the compiler to allocate aligned memory for arrays using compiler options (-align array64byte) or directives (!dir$ attribute align). However, this is not sufficient for generating aligned vector code for loops. The programmer should also tell the compiler for each loop which arrays will be accessed in an aligned fashion by using directives (!dir$ vector aligned, !dir$ assume_aligned). This guide contains detailed information on alignment in the article titled Data Alignment to Assist Vectorization.

For loops with unaligned arrays, the compiler typically generates a "peel loop". A peel loop precedes the main loop and operates on data elements of the arrays until one of the arrays is aligned (a vectorized peel loop uses gather/scatter instructions). After the peel loop, a "kernel loop" that uses aligned memory accesses for this array operates on the aligned data. Following the peel loop and kernel loop, a remainder loop operates on any remaining unaligned data at the end of the array(s) from where the kernel loop left off until all iterations are finished (a vectorized remainder loop uses gather/scatter instructions). For the kernel loop, the compiler can further generate a dynamic check for the alignment of a second array and a multi-version loop (one loop uses aligned accesses for this array and the other unaligned accesses).

1: Explicit Shape Arrays

For subroutines with arguments that are explicit shape arrays, the compiler generates code in the subroutine/function that assumes that the data for these arguments is contiguous.

Calling procedure: The code generated automatically by the compiler for the call to this subroutine will insure that array data passed is contiguous. If the actual argument is indeed contiguous or is a contiguous array section, a pointer to the actual array or contiguous array section is passed as the argument. If the data is non-contiguous, such as an array section of non-contiguous data, the caller will generate an array temporary, gather the non-contiguous source data into this contiguous array temporary and pass a pointer to this contiguous temporary data. The actual subroutine/function call uses the temporary arrays as the arguments. When the call returns, the calling code then unpacks the temporary data into the original non-contiguous source array section and destroys the temporary array. This incurs the overhead of the array temporary plus the overhead of the memory movement needed for the gather and scatter operations.

Example 1.1: Explicit shape arrays passed as parameters.

subroutine explicit1(A, B, C)
  real, intent(in),    dimension(400,500) :: A
  real, intent(out),   dimension(500)     :: B
  real, intent(inout), dimension (400)    :: C
  integer i
 
!..loop 1 
  do i=1,500
    B(i) = A(3,i)
  end do

!...loop 2
  do i=1,400
   C(i) = C(i) + A(i, 400)
  end do
end 

Loop 1:

  • Source code has unit stride access on B, non-unit stride access on A.
  • Loop is peeled until B is aligned.
  • Peel loop is vectorized, but uses vgather for load A (unaligned), vscatter for store B (unaligned).
  • Kernel loop uses vmovaps for store B (aligned), vgather for load A (unaligned).
  • Remainder loop is vectorized: uses vmovaps for store B (aligned), vgather for load A (unaligned).

Loop 2:

  • Source code has unit-stride access on both load A and load/store C.
  • Loop is peeled until store C is aligned. Peeled loop vectorized.
  • Load A can be aligned or unaligned. Compiler generates multi-version code for the kernel loop:
    • Version1: loadunpack for load A (unaligned), vaddps for load C (aligned), vmovaps for store C (aligned)
    • Version2: vmovaps for load A (aligned), vaddps for load C (aligned), vmovaps for store C (aligned).
  • Remainder loop is vectorized.

Example 1.2: Aligned explicit shape arrays passed as parameters.

subroutine explicit2(A, B, C)

  real, intent(in), dimension(400,500) :: A
  real, intent(out), dimension(500)    :: B
  real, intent(inout), dimension(400)  :: C   
  !dir$ assume_aligned A(1,1):64
  !dir$ assume_aligned B(1):64
  !dir$ assume_aligned C(1):64
 
!...loop 1
  do i=1,500
    B(i) = A(3,i)
  end do
 
!...loop 2
  do i=1,400
    C(i) = C(i) + A(i, 400)
  end do
end

The assume_aligned directive is used to declare that the three arrays are aligned. The information provided by this directive is effective on both loops in the subroutine.

Loop 1:

  • No Peel loop needed as arrays are already aligned.
  • Kernel loop is vectorized using vmovaps for store B, vgatherd for load A.
  • Remainder loop has only 4 iterations (known at compile time). Not vectorized as it is not efficient to vectorize it.

Loop 2:

  • No peel loop needed.
  • Kernel loop vectorized using vmovaps all three memory-accesses - two loads and one store (A and C).
  • No remainder loop needed (400*4 is a multiple of 64).

Example 1.3: Aligned explicit shape arrays passed as parameters.

subroutine explicit3(A, B, C)

  real, intent(in), dimension(400,500) :: A
  real, intent(out), dimension(500)    :: B
  real, intent(inout), dimension(400)  :: C
  
  !dir$ vector aligned
  do i=1,500
    B(i) = A(3,i)
  end do

  !dir$ vector aligned
  do i=1,400
   C(i) = C(i) + A(i, 400)
  end do
end

This method is identical to the method in the above example. The !dir$ vector aligned directive indicates that all memory accesses in the given loop are aligned. This directive must be repeated for both loops.

2: Adjustable Arrays

Similar to what we just described for explicit shape arrays, the generated code for subroutines/functions with adjustable arrays as parameters assume that these parameters are contiguous (have unit stride). At the call sites in the calling routines, a section of a larger array could have been passed to the called procedure as an argument. Therefore, at the call sites, the compiler generates code to pack strided array arguments into contiguous temporary arrays before the call and unpacks them back to original stride after the call (gather/scatter of actual data to/from temporary arrays). The actual subroutine/function call uses the temporary arrays as the arguments. This incurs the overhead of the array temporary plus the overhead of the memory movement needed for the gather and scatter operations.

Example 2.1: 1D adjustable arrays passed as parameters.

subroutine adjustable1(Y, Z, N)
 real, intent(inout), dimension(N) :: Y
 real, intent(in), dimension(N)    :: Z
 integer, intent(in)               :: N

 Y = Y + Z
 
 return
end
  • Loop is peeled until store Y is aligned.
  • Kernel loop may have load Z aligned or unaligned. Compiler generates multi-version code:
    • Version 1: loadunpack for load Z (unaligned), vaddps for load Y (aligned), vmovaps for store Y (aligned)
    • Version 2: vmovaps for load Z (aligned), vaddps for load Y (aligned), vmovaps for store Y (aligned)
  • Remainder loop is vectorized.

Example 2.2: 2D adjustable arrays passed as parameters.

subroutine adjustable2(Y, Z, M, N)
  real, intent(inout), dimension(M, N) :: Y
  real, intent(in), dimension(M, N)    :: Z
  integer, intent(in)                  :: M
  integer, intent(in)                  :: N
  
  Y = Y + Z
  
  return
end
  • Two loops corresponding to the two array dimensions are collapsed by the compiler into a single loop.
  • Loop is peeled until store Y is aligned.
  • Kernel loop may have load Z aligned or unaligned. Compiler generates multi-version code:
    • Version 1: vloadunpack for load Z (aligned), vaddps for load Y (aligned), vmovaps for store Y (aligned).
    • Version 2: vmovaps for load Z (aligned) vaddps for load Y (aligned), vmovaps for store Y (aligned)
  • Remainder loop is vectorized.

Example 2.3: Aligned 1D adjustable arrays passed as parameters.

subroutine adjustable3(YY, ZZ, NN)

  real, intent(inout), dimension(NN) :: YY
  real, intent(in), dimension(NN)    :: ZZ
  integer, intent(in)                :: NN
  !dir$ assume_aligned YY:64,ZZ:64

  YY = YY + ZZ
  
  return
end
  • The assume_aligned directive is used to tell the compiler that in this function, YY and ZZ refer to locations aligned to 64 bytes.
  • No peel loop as YY and ZZ are aligned.
  • Kernel loop vectorized, uses vmovaps for load ZZ (aligned), vaddps for load YY (aligned) and vmovaps for store YY (aligned).
  • Remainder loop is vectorized (unaligned).

Example 2.4: Aligned 2D adjustable array example.

subroutine adjustable4(YY, ZZ, MM, NN)
  real, intent(inout), dimension(MM, NN) :: YY
  real, intent(in), dimension(MM, NN)    :: ZZ
  integer, intent(in)                    :: MM
  integer, intent(in)                    :: NN
  !dir$ assume_aligned YY:64, ZZ:64
  
   YY = YY + ZZ

   return
end
  • The assume aligned directive is used to tell the compiler that in this function, YY and ZZ refer to locations aligned to 64 bytes.
  • Two loops collapsed by the compiler into one loop.
  • No peel loop as YY and ZZ are aligned.
  • Kernel loop vectorized, uses vmovaps for load ZZ (aligned), vaddps for load YY (aligned) and vmovaps for store YY (aligned).
  • Remainder loop is vectorized (unaligned).

Example 2.5: Aligned 1D adjustable arrays passed as parameters.

subroutine adjustable5(YY, ZZ, NN)

  real, intent(inout), dimension(NN) :: YY
  real, intent(in), dimension(NN)    :: ZZ
  integer, intent(in)                :: NN
  
  !dir$ vector aligned
  YY = YY + ZZ
  return
end
  • The vector aligned directive is used to tell the compiler that in this function, YY and ZZ are aligned.
  • No peel loop as YY and ZZ are aligned.
  • Kernel loop vectorized, uses vmovaps for load ZZ (aligned), vaddps for load YY (aligned) and vmovaps for store YY (aligned).
  • Remainder loop is vectorized (unaligned).

3: Assumed Shape Arrays

When assumed shape arrays are used as procedure parameters, unlike the adjustable array examples above, the size of the arrays are not passed by the programmer as explicit parameters. Furthermore, the code generated for the function/subroutine does not assume that the passed parameters are contiguous (unit stride). The parameters can be non-contiguous (strided, i.e., sections of other arrays). In this case, the compiler does not generate code for packing/unpacking strided arrays into/from temporary contiguous arrays before/after the call sites. Instead, it generates code such that, for each dimension of each array, the lower/upper bounds and the stride information is passed from the caller to the callee together with the array base addresses. Then, the code generated on the callee uses these bounds and strides, thereby eliminating the need for any packing/unpacking at the caller. This is more efficient for the call and return.

When the caller and callee are to be compiled separately, in order for the compiler to know that a function to be called has an assumed shape array (and hence does not need packing/unpacking but needs transfer of bounds and strides), there needs to be an interface declaration on the caller side (unless the callee is an internal procedure in which case the interface will be implicit). This interface declares the corresponding function arguments to be assumed shape arrays. Note that the performance of unit-stride vectorized code (applicable when the arrays are actually contiguous at runtime) is better than non-unit stride vectorized code.

You can use the Fortran 2008 CONTIGUOUS attribute with an assumed shape array to tell the compiler that the data occupies a contiguous block. This allows compiler to make optimizations for assumed-shaped arrays.This attribute is described in more detail in a later section in this article.

Note that contiguous and non-contiguous arrays can both be passed to these subroutines/functions without packing/unpacking. So the compiler cannot blindly generate unit-stride code for the assumed-shape arrays while compiling the callee subroutine. Instead, in most such cases, the compiler generates multi-version code:
1) a version that is vectorized such that all assumed shape arrays are accessed assuming they are unit-stride, and
2) a version that is vectorized such that all assumed shape arrays are accessed assuming they are non-unit-stride.
At runtime within the procedure the strides of all arrays are checked and the correct version of the vectorized region is selected.

Note that the second case is a conservative fall-back case. If only one (out of possibly many) actual assumed shape array argument is non-unit-stride, it does not matter whether the rest of the arguments are unit stride and the second version with non-unit-stride vectorization for all arrays is executed.

Example 3.1: 1D assumed shape array as a parameter.

subroutine assumed_shape1(Y)

 real, intent(inout), dimension(:) :: Y
 
 Y = Y + 1
 
 return
end
  • Multi-version code for the array stride is generated:
    • Version 1: Vector code assuming Y is unit stride.
      • Peel loop until Y is aligned (gather/scatter).
      • Kernel loop uses vaddps for load Y (aligned), vmovaps for store Y (aligned).
      • Remainder loop is vectorized (Gather/scatter).
    • Version 2: Vector code assuming Y is not unit stride.
      • No peel loop
      • Kernel loop uses vgatherd for load Y, vscatterd for store Y.
      • Remainder loop is scalar.

Example 3.2: 2D assumed shape array as a parameter.

subroutine assumed_shape2(Z)

 real, intent(inout), dimension(:,:) :: Z
 
 Z = Z + 1
 
 return
end
  • Two-deep loop nest is not collapsed into 1-D loop. The outer loop is already non-unit stride and is not vectorized.
  • Two versions are generated for the inner loop for the array strides: :
    • Version 1: Vector code assuming Z is unit stride in the first dimension (inner-loop dimension).
      • Peel loop until Z is aligned (gather/scatter).
      • Kernel loop uses vaddps for load Z (aligned), vmovaps for store Z (aligned).
      • Remainder loop is vectorized (gather/scatter).
    • Version 2: Vector code assuming Z is non-unit stride in the first dimension (inner-loop dimension).
      • No peel loop.
      • Kernel loop uses vgather for load Z, vscatter for store Z.
      • Two remainder loops: gather/scatter with half vector size and a scalar loop.

Example 3.3: Two 1D assumed shape arrays as parameters.

subroutine assumed_shape3(A, B)

 real, intent(out), dimension(:) :: A
 real, intent(in), dimension(:)  :: B
 
 A = B + 1
 
 return
end
  • Multi-version code is generated for the array strides:
    • Version 1: Vectorized for A and B both with unit stride.
      • Peel loop until A is aligned.
      • Kernel loop has another multi-version code:
        • Version 1a: Assumes B is aligned.
        • Version 1b: Assumes B is unaligned.
      • Remainder loop vectorized.
    • Version 2: Vector code with neither A nor B unit stride. This case is entered if at least one of the arrays is non-unit stride.
      • Uses gather and scatter for all arrays.
      • Remainder loop scalar.

Example 3.4: Three 1D assumed shape arrays as parameters.

subroutine assumed_shape4(A, B, C)

 real, intent(out), dimension(:) :: A
 real, intent(in), dimension(:) :: B
 real, intent(in), dimension(:) :: C
 A = B + C
 
 return
end
  • Multi-version code is generated for the array strides:
    • Version 1: Vectorized for A, B, and C all unit stride
      • Peel loop until A is aligned.
      • Kernel loop has multi-version:
        • Version 1a: Assume B is aligned. Assume C is unaligned.
        • Version 1b: Assume B is unaligned. Assume C is unaligned.
        • Note that a 3rd level of multi-versioning is not created for alignment of C to prevent too deep multi-versioning.
    • Version 2: Vector code with all arrays non-unit stride. This case is entered if at least one of the arrays is non-unit stride.
      • Uses gather/scatter for all arrays.
      • Remainder loop scalar.

Example 3.5: Three aligned 1D assumed shape arrays as parameters.

subroutine assumed_shape5(A, B, C)

 real, intent(out), dimension(:) :: A
 real, intent(in), dimension(:)  :: B
 real, intent(in), dimension(:)  :: C
 !dir$ assume_aligned A:64,B:64,C:64
   
  A = B + C
  
  return
end
  • The assume_aligned directive is used to indicate that all thre arrays are aligned.
  • Multi-version code is generated for the stride issue:
    • Version 1: Vector code with A,B,C all unit-stride.
      • No peel loop as arrays are aligned.
      • Kernel loop does not need multi-versioning because all arrays are already asserted to be aligned.
      • Remainder loop uses gather/scatter.
    • Version 2: Vector code with none of the arrays unit stride.
      • Uses gather/scatter. (Arrays being aligned does not matter)
      • Remainder loop scalar.

4. Assumed Size Arrays

The code generated for subroutines/functions with assumed size array parameters assume that these parameters are unit-stride. Similar to passing adjustable arrays as arguments explained above, at each call site, the compiler generates code to pack any non-unit stride array parameters into unit-stride temporary arrays before the call and unpack them back to their original strided locations after the call.

Example 4.1: 1D assumed size array passed as a parameter.

subroutine assumed_size1(Y)

 real, intent(inout), dimension(*) :: Y
 !The upper bound is known and hardcoded by the programmer
 !Y(:) = Y(:) + 1 => Illegal because size of Y is not known by the compiler
 
 Y(1:500) = Y(1:500) + 1
 
 return
end subroutine
  • Loop peeled until Y is aligned using gather/scatter.
  • Kernel loop vectorized with vaddps for load Y (aligned) and vmovaps for store Y (aligned).
  • Remainder loop is vectorized using gather/scatter.

Example 4.2: 2D assumed size array passed as a parameter.

subroutine assumed_size2(Y)

 real, intent(inout), dimension(20,*) :: Y
 !Inner dimension size (i.e., 20) is provided to the compiler, so it can generate the
 !appropriate code for it. Outer dimension is unknown to the compiler, so it must be a
 !constant or a range of constants given by the programmer.
 !Y(:,:) => Illegal because 2nd dimension size of Y is unknown.
 
 Y(:,1:10) = Y(:,1:10) + 1
 
 return
end subroutine
  • Loop peeled until Y is aligned using gather/scatter.
  • Kernel loop vectorized using vaddps for load Y (aligned) and vmovaps for store Y (aligned).
  • Remainder loop vectorized using gather/scatter.

Example 4.3: Aligned 1D assumed size array passed as a parameter.

subroutine assumed_size3(Y)

 real, intent(inout), dimension(*) :: Y
 !dir$ assume_aligned Y:64
 
 Y(1:500) = Y(1:500) + 1
 
 return
end subroutine
  • The assume aligned directive is used to tell the compiler that in this subroutine, Y is aligned to 64B.
  • No peel loop, Y is already aligned.
  • Kernel loop vectorized using vaddps for load Y (aligned) and vmovaps for store Y (aligned).
  • Remainder loop has only 4 iterations, so it is not vectorized (not profitable). It is completely unrolled and is scalar.

Example 4.4: Aligned 1D assumed size array passed as a parameter.

subroutine assumed_size4(Y)

 real, intent(inout), dimension(*) :: Y
 !dir$ vector aligned
 
 Y(1:500) = Y(1:500) + 1
 
 return
end subroutine
  • The vector aligned directive is used to tell the compiler that in this subroutine, Y is aligned.
  • Result is the same as the above example.

5. Allocatable Arrays

The generated code for subroutines/functions with allocatable array parameters assumes that these parameters are unit-stride. Similar to passing adjustable arrays as arguments explained above, at each call site, the compiler generates code to pack any non-unit stride array parameters into unit-stride temporary arrays before the call and unpack them back to their original strided locations after the call.

Example 5.1: 1D allocatable array as a local variable.

subroutine allocatable_array1(N)

 integer N
 real, allocatable :: Y(:)

  allocate (Y(N))
!..loop 1 
  Y = 1.2
  call dummy_call() 

!..loop 2
  Y = Y + 1

  call dummy_call()
  deallocate(Y)
  return
end

Loop 1:

  • Loop peeled until Y is aligned. Peel loop vectorized using vscatter.
  • Kernel loop vectorized using vmovaps (aligned).
  • Remainder loop vectorized using vscatter.

Loop 2:

  • Loop peeled until Y is aligned. Peel loop vectorized using vgather/vscatter.
  • Kernel loop vectorized with vaddps for load Y (aligned) and vmovaps for store Y (aligned).
  • Remainder loop vectorized using vgather/vscatter.

Example 5.2: 2D allocatable array as a local variable.

subroutine allocatable_array2(N)

  integer N
  real, allocatable :: Z(:,:)
  
  allocate(Z(N,N))
!..loop 1
  Z = 2.3
  call dummy_call()

!..loop 2
  Z = Z + 1
  call dummy_call()
  
  deallocate(Z)
  return
end

Loop 1:

  • Loop peeled until Z is aligned. Peel loop vectorized using vscatter.
  • Kernel loop vectorized using vmovaps for store Z (aligned).
  • Remainder loop vectorized using vscatter.

Loop 2:

  • Loop peeled until Z is aligned. Peel loop vectorized using vgather/vscatter.
  • Kernel loop vectorized with vaddps for load Z (aligned) and vmovaps for store Z (aligned).
  • Remainder loop vectorized using vgather/vscatter.

Example 5.3: 1D allocatable array as a parameter.

subroutine allocatable_array3(Y)

  real, intent(inout), allocatable, dimension(:) :: Y
!..loop 1
  Y = 1.2
  call dummy_call()
 
!..loop 2 
  Y = Y + 1
  call dummy_call()
  
  return
end

Loop 1:

  • Loop peeled until Y is aligned. Peel loop vectorized using scatter.
  • Kernel loop vectorized using vmovaps for store Y (aligned).
  • Remainder loop vectorized using vscatter.

Loop 2:

  • Loop peeled until Y is aligned. Peel loop vectorized using gather/scatter.
  • Kernel loop vectorized using vaddps for load Y (aligned), vmovaps for store Y (aligned).
  • Remainder loop vectorized using vgather/vscatter.

Example 5.4: 2D allocatable array as a parameter.

subroutine allocatable_array5(Z)

  real, intent(inout), allocatable, dimension(:,:) :: Z
!..loop 1 
  Z = 2.3
  call dummy_call()
  
!..loop 2
  Z = Z + 1
  call dummy_call()
  
  return
end
  • Same as above example. Peel loop has gather/scatter. Kernel loop has aligned accesses. Remainder loop has gather/scatter.

Example 5.5: Aligned 1D allocatable array as a parameter.

subroutine allocatable_array6(Y)

  real, intent(inout), allocatable, dimension(:) :: Y
  !dir$ assume_aligned Y:64
  Y = 1.2
  call dummy_call()
  Y = Y + 1
  call dummy_call()
  
  return
end

Example 5.6: Aligned 1D allocatable array as a parameter.

subroutine allocatable_array6(Y)

  real, intent(inout), allocatable, dimension(:) :: Y
  !dir$ vector aligned
  
  Y = 1.2
  call dummy_call()
  !dir$ vector aligned
  Y = Y + 1
  call dummy_call()
  
  return
end
  • The vector aligned directive is used to generate aligned memory accesses for the two loops.
  • No peel loop as Y is told by the programmer to be aligned. (The programmer needs to ensure Y is actually aligned)
  • Kernel loop vectorized with aligned loads/stores.
  • Remainder loop vectorized.

6. Pointers

A Fortran pointer can be pointing to a section of an array with a possible non-unit stride in one or more dimensions. Therefore, the vector code generated for routines that have pointer arguments are multi-versioned:
1) unit stride vectorized version
2) non-unit stride vectorized version.
Furthermore, multi-versioning can be generated for the alignments of the arrays.

In case of a pointer assignment, the stride information (unit stride or non-unit stride) is used by the compiler to eliminate one of the two versions as it would never be executed. However, this stride information is lost when there is a function call as the function can modify the pointer and make the current stride information invalid.

You can use the Fortran 2008 CONTIGUOUS attribute with a pointer array to tell the compiler that the data occupies a contiguous block. This attribute is described in more detail in a later section in this article.

Example 6.1: 1D pointer passed as a parameter.

subroutine pointer1(Y)
  real, intent(inout), pointer, dimension(:) :: Y
  
  Y = Y + 1
  
  return
end
  • Two versions are created:
    • Version 1: Vectorization with unit stride accesses.
      • Loop is peeled until Y is aligned using vgather/vscatter.
      • Kernel loop vectorized with vmovaps (aligned).
      • Remainder loop vectorized with vscatter.
    • Version 2: Vectorization with non-unit-stride accesses. Uses vgather/vscatter.

Example 6.2: Three 1D pointers passed as parameters.

subroutine pointer2(A,B,C)

  real, intent(inout), pointer, dimension(:) :: A,B,C
  
  A = A + B + C + 1
  
  return
end
  • Loop is split into two loops:
    • First loop writes into a temporary array A_tmp that is aligned to 64B.
    • Second loop copies from A_tmp (aligned) into A (alignment unknown).
  • We also have multi version code for unit stride vs non-unit-stride for both of these two loops.
    • Loop 2a: (A_tmp = A + B + C + 1)
      • Version 1: Vectorized assuming A, B, and C are all unit-stride.
        • No peel loop as A_tmp is already aligned.
        • Kernel loop vectorized assuming A,B,C are unaligned.
        • Remainder loop unaligned using vgather/vscatter.
      • Version 2: Vectorized assuming A,B, and C are all non-unit-stride.
        • Peel loop is scalar.
        • Kernel loop uses vgather/vscatter.
        • Remainder loop is scalar.
    • Loop 2b: (A = A_tmp)
      • Version 1: Vectorized assuming A is unit stride. A_tmp is already known to be unit stride and aligned.
        • Peel loop is scalar. Peels until A is aligned.
        • For kernel loop:
          • Version 1a: Vectorized assuming A is aligned, A_tmp is aligned. (i.e., peel loop did not execute any iterations)
          • Version 1b: Vectorized assuming A is aligned, A_tmp is unaligned.
        • Remainder loop is scalar.
      • Version 2: Vectorized assuming A is not unit stride.
        • Peel loop is scalar.
        • Kernel loop vectorized using vmovaps for load A_tmp (aligned) and vscatter for store A (unaligned and strided).
        • Remainder loop is scalar.

Example 6.3: Pointer assignment with stride information propagated by the compiler.

module mod1

  implicit none
  real, target, allocatable :: A1(:,:)
  real, pointer :: Z1(:,:)
contains
  subroutine pointer_array3(N)
    integer N
    Z1 => A1(:, 1:N:2)
    Z1 = Z1 + 1
    return
  end subroutine
end module
  • Z1 is a section of A1. Its inner dimension is contiguous but outer dimension is non-contiguous.
  • This information is used in the loop and only a vector version of the inner loop that assumes contiguous layout is generated.
  • Loop peeled until A1 is aligned. Peel loop vectorized using vgather/vscatter,
  • Kernel loop vectorized using aligned loads and stores.
  • Remainder loop vectorized using vgather/vscatter.

Example 6.4: Pointer example with stride information lost due to function call.

module mod2

  implicit none
  real, target, allocatable :: A2(:,:)
  real, pointer :: Z2(:,:)
contains
  subroutine pointer_array4(N)
    integer N
    Z2 => A2(:, 1:N:2)
    !..loop 1
    Z2 = 2.3
    
    dummy_call()
    
    !..loop 2
    Z2 = Z2 + 1
    
    return

  end subroutine
end module
  • Z2 is a section of A2. Its inner dimension is contiguous but outer dimension is non-contiguous.
  • Loop 1:
    • A2 stride information is used in the loop and only a vector version of the inner loop that assumes contiguous layout is generated.
    • Loop peeled until A2 is aligned. Peel loop vectorized using vgather/vscatter,
    • Kernel loop vectorized using aligned loads and stores.
    • Remainder loop vectorized using vgather/vscatter.
  • Loop 2:
    • The function call to dummy_call() can potentially modify the location pointed by A2. Therefore, the stride information on A2 is lost.
    • Multi-version code for the stride of A2 is generated:
      • Version 1: Vector code with unit-stride for A2 (for inner loop)
        • Loop peeled until A2 is aligned. Peel loop vectorized using gather/scatter.
        • Kernel loop vectorized using aligned loads and stores.
        • Remainder loop vectorized using gather/scatter.
      • Version 2: Vector code with non-unit-stride for A2 (for inner loop)
        • Kernel loop vectorized using gather/scatter.
        • Remainder loop has two versions: a vectorized version with gather/scatter and a scalar version.

7. Indirect Array Access

Indirect accesses to arrays is very common in applications with sparse arrays, in adaptive mesh based applications, and in many N-body applications. In this case, an index array is sequentially accessed and used as the offset to a base array/matrix. The memory accesses are non-contiguous and require gather/scatter instructions to be used.

Example 7.1: Indirect array access using Fortran array notation.

subroutine indirect1(A, B, ind, N)

  real A(N),B(N)
  integer N
  integer ind(N)
  
  A(ind(:)) = A(ind(:)) + B(ind(:)) + 1
  
  return
end
  • Arrays A and B are accessed indirectly through the index array 'ind'.
  • The Fortran array notation semantics requires that the result of executing this assignment statement be equivalent to evaluating the right hand side for all array elements and assigning the result to the target. As there is a loop carried flow dependence between the right hand side and the left hand side of this assignment over A, this assignment is vectorized by using a temporary array to store the entire right hand side result and then a second loop to copy it to the A() array. As a result, the compiler generates two loops:
    • T(:) = A(ind(:)) + B(ind(:)) + 1
    • A(ind(:)) = T(:).
  • Loop 1:
    • No peel loop. T is already allocated as aligned.
    • Kernel loop is vectorized using gather for A and B, unaligned load for ind, aligned store for T.
    • Remainder loop is vectorized using gather for A and B, scatter for T.
  • Loop 2:
    • No peel loop. T is already aligned.
    • Kernel loop is vectorized using aligned load for T, scatter for A, unaligned load for ind.
    • Remainder loop is scalar.

Example 7.2: Indirect array access using Fortran do loop.

subroutine indirect2(A, B, ind, N)
  real A(N),B(N)
  integer N
  integer ind(N)
  integer i
  !dir$ ivdep
  
  do i=1,N
    A(ind(i)) = A(ind(i)) + B(ind(i)) + 1
  end do
  
  return
end
  • Arrays A and B are accessed indirectly through the index array ind.
  • The semantics of the do-loop is different from the above Fortran array notation. The do-loop implies sequential execution of the loop iterations, which would prevent vectorization of this loop due to loop carried flow dependence between the right hand side and the left hand side of this assignment over A. However, the ivdep directive provided by the programmer indicates that it is safe to ignore the loop carried flow dependences and vectorize this code.
  • Due to the semantics of this loop and the directive, the compiler does not need to generate a temporary array and two loops.
    • Loop is peeled until ind is aligned. Peel loop is vectorized using vgather for load A, B, ind, and vscatter for store A.
    • Kernel loop is vectorized using vgather for A,B, aligned load for ind, and vscatter for store A.
    • Remainder loop is vectorized using vgather for load A,B, aligned masked load for ind, and vscatter for store A.

8. Global Arrays in Fortran90 modules
 
Modules provide an effective method for defining global data. You can specify the align clause inside the module to align arrays globally:
 
Example 8.1: Global arrays declared in modules with known size.

module mymod
  !dir$ attributes align:64 :: a
  !dir$ attributes align:64 :: b
  real (kind=8) :: a(1000), b(1000)
end module mymod
 
subroutine add_them()
use mymod
implicit none
 
!  array syntax shown.
!...No explicit directive needed to tell the compiler that A and B
!  are aligned, the USE brings that information
 
  a = a + b
end subroutine add_them

The attributes align directive inside the module is used to tell the compiler that A and B are aligned to 64B.

No peel loop is generated, A and B are already aligned. Note that there is no separate vector-aligned directive required before the loop here.

Kernel loop vectorized using vmovapd (aligned).

Example 8.2: Global allocatable arrays declared in modules, but allocated elsewhere.

module mymod
  real, allocatable :: a(:), b(:)
end module mymod
 
subroutine add_them()
use mymod
implicit none
 
!dir$ vector aligned
  a = a + b
end subroutine add_them

The attributes align directive (inside the module) is not enough here - since the actual allocation for A and B happens later when ALLOCATE is called.

If the vector aligned directive is used as shown above, no peel loop is generated, compiler assumes A and B are already aligned.

If the vector aligned directive is removed, compiler does not have the right alignment information, peel loop would be generated for aligning the arrays.

Kernel loop vectorized using vmovaps, vaddps (aligned).

Note that changing the module to add the attributes align clause will not help here:

module mymod
  real, allocatable :: a(:), b(:)
  !dir$ attributes align:64 :: a                                                             
  !dir$ attributes align:64 :: b                                                             
end module mymod

This module definition doesn't say whether the allocated data pointer is 64byte aligned. [Think about pointer variable is located at 64byte boundary and the pointer value stored in there may or may not be 64byte aligned]

9. Use of -align array64byte option to align all arrays

The compiler option -align arraynbyte works on all Fortran arrays {except arrays in COMMON}. It aligns the start of arrays on an n-byte boundary. n can be 8, 16, 32, 64, 128, or 256. The default value for n is 8. Arrays do not have padding between their elements.  This has to do with fixed bounds, assume-shape and –size, automatic, allocatable or pointer attribute {however, this does NOT apply to Cray pointer arrays.  These have no alignment control}.
 
Fortran arrays come in three memory styles:
a)      Static memory – start of array in memory is aligned on n-byte boundary
b)      Stack memory – local, automatic, temporary – start of array in stack is aligned on n-byte boundary
c)      Heap memory – allocatable or pointer attribute {not Cray pointer array} – memory allocated by ALLOCATE statement that calls aligned_malloc so start of array in heap is aligned on n-byte boundary.
 
-align arraynbyte is one way to align all arrays in a compilation.  If you want to align individual arrays instead, use the ALIGN attribute for each individual array. In addition, the programmer should also explicitly inform the compiler in front of each loop which arrays will be accessed in an aligned fashion using directives (!dir$ vector aligned, !dir$ assume_aligned). 

10. Fortran 2008 CONTIGUOUS Attribute

This is an array attribute that tells the compiler that the data occupies a contiguous block. This allows compiler to make optimizations for pointers and assumed-shaped arrays. By default, compiler has to assume they may be non-contiguous. But adding this attribute will give extra information and compiler is free to assume that the data access will be contiguous. Note that the results are indeterminate (wrong answers, segmentation faults) if the user assertion is wrong and the data is NOT contiguous at runtime.

real, pointer, contiguous :: ptr(:)
real, contiguous :: arrayarg(:,:)

The POINTER target must be contiguous. The actual argument corresponding to the assumed-shape array must be contiguous.

There is also a Fortran 2008 intrinsic of the form:

is_contiguous() that returns a logical return value
IF ( is_contiguous(thisarray) ) THEN
   ptr => thisarray

Here is an example usage of contiguous attribute:

subroutine foo1(X, Y, Z, low1, up1)
real, contiguous, intent(inout), dimension(:) :: X
real, contiguous, intent(in), dimension(:) :: Y, Z
integer i, low1, up1

do i = low1, up1
   X(i) = X(i) + Y(i)*Z(i)
enddo

return
end

User has added the CONTIGUOUS attribute that asserts that all call-sites of foo1 use contiguous sections for the three array parameters. Compiler takes advantage of this attribute and generates unit-strided vectorization for the loop without any loop-multiversioning for strides.

11. Use of Fortran Array Notation for Unit-strided Vectorization

Fortran language semantics allow unit-stride vectorization for lots of array types such as allocatable arrays, adjustable arrays, explicit arrays, assumed-size arrays, etc. This still requires vector-loop index to be in the last dimension. When this condition is not true, it may be possible to refactor the code to use F90 array notation for higher performance. Here is an example.

Original source:

do I=1,m
    do index=1,n
         A(I,j,k,index) = B(I,j,k,index) +  C(I,j,k,index) * D
    enddo
enddo
  • The inner-loop gets vectorized with non-unit stride vectorization, since index is not in the innermost dimension.
  • Results in (less efficient) gathers/scatters.

Modified source after refactoring:

do I=1,m,VLEN
    do index=1,n
         A(I:I+VLEN,j,k,index) = B(I:I+VLEN,j,k,index) + C(I:I+VLEN,j,k,index) * D
    enddo
enddo
​​
  • Use of F90 array notation helps here to vectorize the inner-loop with unit-stride.

Take Aways

This document provided examples to various array types in Fortran and their usage as local variables, function/subroutine parameters, as well as examples to Fortran pointers and their usage. This document shows how various array data types and arguments are vectorized by the compiler. This includes examples of high level source code and explains the code generated by the compiler for these cases.

There were quite a few examples - what are the take aways?

  1. Indirect array accesses are costly and lead to inefficient code. Yes, the compiler can still vectorize loops with indirect memory reference with vgather/vscatter THIS DOES NOT MEAN IT'S EFFICIENT CODE!!! Efficient code is unit stride accesses. If your code is chasing pointers or using indirect references, it is less than optimal no matter whether vector code is created or not. Avoid indirect memory referencing if at all possible OR understand that the compiler cannot miraculously make this code efficient and fast.

  2. Avoid array temporaries for argument passing: this can be done by passing contiguous array data, using explicit interfaces, and assumed shape arrays. Better yet, put data in modules and USE the module rather than passing arguments explicitly.

  3. Although Fortran pointers are far more restrictive than C pointers, they are still aliases. Potential overlap of pointer based variables on the left-hand side (LHS) expression and the right-hand side (RHS) will cause array temporaries to be created to hold the RHS expression as it's evaluated prior to storing to the LHS. Avoid pointers and use allocatable arrays if at all possible

NEXT STEPS

It is essential that you read this guide from start to finish using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on the Intel® Xeon architecture. The paths provided in this guide reflect the steps necessary to get best possible application performance.

Back the main chapter Vectorization Essentials.

"