Intel® FPGA SDK for OpenCL™ Pro Edition: 最佳实践实践指南

ID 683521
日期 9/26/2022
Public
文档目录

3.6.2. Load-Store Unit修改程序

编译器会根据内核中的存储器访问模式修改某些LSU。

高速缓存(Cache)

突发合并LSU有时可能包含cache。当存储器访问模式依赖于数据或者看似重复时,就会创建cache。即使需要负载需要相同的数据,cache也不能与其他负载共享。当内核启动时,cache被刷写(flushed),并且比没有cache的等效LSU消耗更多的硬件资源。仅针对非易失性全局指针推断Cache。

kernel void cached (global int * restrict in, 
                    global int * restrict out, 
                    int N) {
  int gid = get_global_id(0);
  for (int i = 0; i < N; i++) {
    out[N*gid + i] = in[i];
  }
}

写确认(write-ack)

存在数据依赖项时,突发合并(Burst-coalesced store)LSU有时需要write-acknowledgment(写确认)信号。带有写确认(write-acknowledge)信号的LSU需要额外的硬件资源。如果多个写确认同一存储器,则吞吐量可能会减少。

kernel void write_ack (global int * restrict in, 
                       global int * restrict out, 
                       int N) {
  for (int i = 0; i < N; i++) {
    if (i < 2)
      out[i] = 0;            // Burst-coalesced write-ack LSU
    out[i] = in[i];
  }
}       

Nonaligned(未对齐)

当突发合并的LSU可以访问未与外部存储器字长对齐的存储器时,将创建一个未对齐的LSU。需要额外的硬件资源来实现未对齐的LSU。如果未对齐的LSU接受太多未对齐的请求,则其吞吐量可能会降低。

kernel void non_aligned (global int * restrict in, 
                         global int * restrict out) {
  int i = get_global_id(0);
  
  // Three loads are statically coalesced into one, 
  // creating a burst-coalesced non-aligned LSU.
  int a1 = in[3*i+0];
  int a2 = in[3*i+1];
  int a3 = in[3*i+2];
  
  // Three stores statically coalesced into one, 
  // creating a burst-coalesced non-aligned LSU.
  out[3*i+0] = a3;
  out[3*i+1] = a2;
  out[3*i+2] = a1;
}

Never-stall(永无停顿)

如果流水线LSU在没有仲裁的情况下被连接到局部存储器,则会创建永无停顿(never-stall)的LSU,因为对存储器的所有访问都以编译器已知的固定周期数。

__attribute((reqd_work_group_size(1024,1,1)))
kernel void never_stall (global int* restrict in, 
                         global int* restrict out, 
                         int N) {
  local int lmem[1024];
  int gi = get_global_id(0);
  int li = get_local_id(0);

  lmem[li] = in[gi];                   // Pipelined never-stall LSU
  barrier(CLK_GLOBAL_MEM_FENCE);
  out[gi] = lmem[li] ^ lmem[li + 1];
}