Challenges, tips, and known issues when debugging heterogeneous programs using DPC++ or OpenMP offload

ID: 689824
Updated: 11/17/2020
Version: Latest
Public

Introduction to cross-architecture programming

There are several methods to enable your program for cross-architecture support with Intel® oneAPI Toolkits. One method is to use the provided libraries, also called the API-based programming model. Another is the direct programming model using OpenMP* offload directives or Data Parallel C++ (DPC++). This second method is the focus of this article.
 
When creating a heterogeneous program with oneAPI, debugging and testing are part of getting it to work correctly on all of the supported targets. Debugging such a program is more complicated than debugging a traditional CPU program. My suggestion is to test and debug the code on a CPU device first, to make sure your program is algorithmically correct, and then to switch to a different target such as an Intel Gen9 GPU or newer.
 
Intel® Distribution for GDB* is part of the Intel oneAPI Base Toolkit. You can learn how to use it from the Getting Started document for Linux or for Windows. But when debugging offload code with large kernels, there are always challenges. Please read on for more details.

Challenges in Debugging Applications with Large Kernels

Normally, after an application with DPC++ kernels or OpenMP offload directives is built, the kernel code is compiled into byte code by the Intel DPC++/C++ compiler. When the application runs, that byte code is then JIT-compiled into device code for the selected target. The JIT compilation happens every time the application runs, because each run could use a different target device. The same is true when debugging: every time you rerun the program inside a debugger, the JIT compilation happens again, so for programs with large kernels you may see a significant slowdown inside the debugger.
 
One way to speed this up is to use Ahead-of-Time (AOT) compilation. You can find more information in the Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference - Ahead of Time Compilation. AOT support for OpenMP offload will be available in a future update.

Tips for Debugging Cross-architecture Programs

  1. Start with CPU Target for Program Correctness
    There are several ways to set the target device to CPU.
    • If you use "default_selector", set the environment variable SYCL_DEVICE_TYPE=CPU
    • Or use "cpu_selector" in the code directly, as in the sketch below
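
    A minimal sketch of the second option, using "cpu_selector" to force the CPU device (written against the SYCL selector classes available in DPC++ at the time of writing):

      #include <CL/sycl.hpp>
      #include <iostream>

      int main() {
        // Pick the CPU device explicitly instead of relying on default_selector.
        sycl::cpu_selector selector;
        sycl::queue q(selector);
        std::cout << "Running on: "
                  << q.get_device().get_info<sycl::info::device::name>()
                  << std::endl;
        // ... submit kernels to q as usual ...
        return 0;
      }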
       
  2. "printf" can be Your Friend
    There are times when "printf" is the easiest way to debug. "printf" is also supported in kernel code, under the namespace "sycl::ONEAPI::experimental", as in the sketch below.
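
    A minimal sketch, assuming the experimental printf mentioned above is available under "sycl::ONEAPI::experimental" (the format string must be placed in the constant address space for the device compiler):

      #include <CL/sycl.hpp>

      // Format strings for the experimental printf must live in the constant
      // address space when compiled for the device.
      #ifdef __SYCL_DEVICE_ONLY__
      #define CONSTANT __attribute__((opencl_constant))
      #else
      #define CONSTANT
      #endif

      int main() {
        sycl::queue q;
        q.submit([&](sycl::handler &h) {
          h.parallel_for(sycl::range<1>(4), [=](sycl::id<1> i) {
            static const CONSTANT char fmt[] = "work-item: %d\n";
            sycl::ONEAPI::experimental::printf(fmt, (int)i[0]);
          });
        }).wait();
        return 0;
      }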
     
  3. Environment Variables to Control Verbose Log Information
    For DPC++ programs
    Env Var Name        Value                 Note
    SYCL_DEVICE_TYPE    CPU|GPU|HOST|ACC
    SYCL_BE             PI_OPENCL|PI_LEVEL0   Change the backend between Level Zero and OpenCL
    ZE_DEBUG            1                     Print out trace information from Level Zero
    SYCL_PI_TRACE       1|2|-1
    • 1 - print the basic trace log of the DPC++ runtime plugin
    • 2 - print all API traces of the DPC++ runtime plugin
    • -1 - all of "2" above, plus more debug messages
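
    For example, launching a DPC++ application on Linux as "SYCL_DEVICE_TYPE=CPU SYCL_PI_TRACE=1 ./myapp" (where "myapp" is a placeholder for your binary) forces the CPU device and prints the basic plugin trace.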


    For OpenMP offload programs

    Env Var Name              Value          Note
    LIBOMPTARGET_DEVICETYPE   CPU|GPU|HOST
    LIBOMPTARGET_DEBUG        1              Print out verbose debug information
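
    For example, running an OpenMP offload application on Linux as "LIBOMPTARGET_DEVICETYPE=CPU LIBOMPTARGET_DEBUG=1 ./myapp" (with "myapp" as a placeholder for your binary) selects the CPU offload device and prints the libomptarget debug log.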

  4. Use Ahead Of Time compilation (AOT) to move JIT compilation issues to AOT compilation issues
    Detailed AOT documentation can be found online in the Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference - Ahead of Time Compilation. By using AOT, you move any JIT compilation issues to compile time, which can shorten the development cycle.
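
    For example, AOT compilation for DPC++ is driven by the compiler's "-fsycl-targets" option together with a device-specific target; the exact target names and backend options depend on your compiler version, so please check the guide above for the spelling that applies to your release.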

Known Issues Related to Debugging and Workarounds

  1. DPC++ extension "subgroup" is not supported at "-O0" (Linux) or "/Od" (Windows) on CPU only

    This feature is still being worked on and will be supported in a future update.
    As of now, if your program contains "subgroup" code and is built with "-O0", you will get a segmentation fault or core dump at runtime on CPU (see the sketch below). The same program will run fine on GPU.
    This is only an issue with "-O0" when running on CPU; it works fine with "-O2" on CPU.

    Workaround: There is no workaround for this issue.  
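
    For reference, a minimal sketch of the kind of sub-group usage that hits this issue, assuming a compiler that exposes "nd_item::get_sub_group()":

      #include <CL/sycl.hpp>

      int main() {
        sycl::queue q;
        int data[8] = {0};
        {
          sycl::buffer<int, 1> buf(data, sycl::range<1>(8));
          q.submit([&](sycl::handler &h) {
            auto acc = buf.get_access<sycl::access::mode::write>(h);
            h.parallel_for(
                sycl::nd_range<1>(sycl::range<1>(8), sycl::range<1>(8)),
                [=](sycl::nd_item<1> item) {
                  // Sub-group extension: crashes at runtime on CPU when built
                  // with -O0; runs as expected with -O2 on CPU or on GPU.
                  auto sg = item.get_sub_group();
                  acc[item.get_global_id(0)] = (int)sg.get_local_id()[0];
                });
          }).wait();
        }
        return 0;
      }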

  2. Barrier support with "-g" or "/Zi" is not fully implemented

    There is a known issue with barrier support (work-group barrier, cl::sycl::nd_item::barrier): the code for the barrier is not generated, so when you debug the code you will not be able to set a breakpoint on the barrier line.
    This issue will be fixed in a future update.

    Workaround: Set a breakpoint on the line before or after the barrier, add a dummy line after the barrier (as in the sketch below), or use the debugging tips above.
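
    A minimal sketch of the workaround, assuming an nd_range kernel (variable names are placeholders):

      #include <CL/sycl.hpp>

      int main() {
        sycl::queue q;
        int data[64] = {0};
        {
          sycl::buffer<int, 1> buf(data, sycl::range<1>(64));
          q.submit([&](sycl::handler &h) {
            auto acc = buf.get_access<sycl::access::mode::read_write>(h);
            h.parallel_for(
                sycl::nd_range<1>(sycl::range<1>(64), sycl::range<1>(16)),
                [=](sycl::nd_item<1> item) {
                  size_t gid = item.get_global_id(0);
                  acc[gid] = (int)gid;                  // breakpoint before the barrier works
                  item.barrier(sycl::access::fence_space::local_space);  // breakpoint here is not hit
                  int dummy = 0;                        // dummy line: set a breakpoint here instead
                  (void)dummy;
                  acc[gid] += 1;                        // or break on the line after the barrier
                });
          }).wait();
        }
        return 0;
      }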

  3. Very limited support for debug information generation for kernel code at "-O2 -g" or "/O2 /Zi"

    For those who want to debug optimized kernel code on GPU, this can pose a big challenge.
    Currently, debug information for kernel code at "-O2 -g" is not fully implemented, so debugging in a debugger such as the Intel® Distribution for GDB* is not recommended.

    Workaround: These issues do not exist on CPU, so you can debug your program on CPU for functional correctness, or use the debugging tips above.

  4. Slow JIT compilation time on GPU at "-O0 -g" (Linux) or "/Od /Zi" (Windows)

    JIT compilation for GPU under "-O0 -g" is significantly slower, which may give the impression that the program hangs.

    Workaround: This issue does not exist on CPU, so you can debug your program on CPU for functional correctness, or use the other debugging tips above without "-g" or "/Zi".

"