How to Debug the GPU Device Code

author-image

作者

Overview

This article demonstrates how to debug the device code in a Data Parallel C++ (DPC++) program.

"

Introduction

DPC++ is a combination of the C++ language and SYCL* parallel programming model to distribute computation across processing elements in a device.

DPC++ is single source; device and host code can be included in the same source file. A DPC++ compiler generates code for both the host and device.

DPC++ enables developers to port their code across multiple back end platforms like CPUs, GPUs, and FPGAs, and tune performance for a specific accelerator.

"

What Is Debugging?

Debugging is the process of finding and resolving bugs (defects or problems that prevent correct operation) within computer programs, software, or systems.

You can debug a C++ program using a GNU* Debugger (GDB*)  or by printing the values to the terminal.

For kernel and device code, use GPU debugging. You have three options:

  • Output stream method
  • Experimental printf method
  • Using the Intel® Distribution for GDB*
"

Prerequisites

Before you start the debugging process:

"

Different Methods to Debug the Kernel Code

Output Stream Method

Use std::cout to print and debug the host code. But, std::cout cannot be used inside the kernel nor with device code. You cannot print the kernel values directly to the terminal using the std::cout and printf functions. Instead of the std::cout function, you can use the sycl output stream inside the kernel. 

The output stream is a part of the cl::sycl::stream class. The output stream object is similar to the std::cout function in C++. It can be accessible inside the kernel by each thread that parallel_for launches.

How to Use Output Stream inside the Kernel

The following is an example program to demonstrate how to use output stream inside a kernel:

          #include <CL/sycl.hpp>

          using namespace sycl;



          int main() {

          int N=8;

          queue q(default_selector{});



          q.submit([&] (handler &h){

          sycl::stream out(1024, 256, h);

          h.parallel_for(N ,[=](id<1> i){

          out << "Hello stream!" << sycl::endl;

          } );

          q.wait();

          } );

           return 0;

}

The previous program prints the “Hello stream!” message from the kernel to the standard output. It created a queue with a default selector. After creating the queue, submit the lambda functions and command group handlers to it. Create output stream object using command group handler(h):

Stream out(1024, 256,h);  

This creates an out object with a 1024 buffer size and 128 message size for the command group handler.

Now, write the kernel function inside the parallel_for. Use the out object inside the device code to print the values:

out << "Hello stream!" << sycl::endl;

To get synchronized results, the host program must wait until the kernel finishes the execution. When the SYCL scope ends, the data from the kernel is passed to std::cout.

The following image shows the output of the "Hello stream!" program:

Figure 1.

Synchronize Output Stream Variables

When data streams inside the kernel, the data is stored inside a cl_mem object until the kernel finishes its execution. After the device code executes, as the kernel is destroyed, and then the data from the kernel is passed to std::cout.

There is no guarantee about when the output displays on std::cout, only that it prints when the command group executes.

Limitations

You must call std::cout at least once in the host program after executing the kernel in Windows* (Microsoft Visual Studio*). It works as expected in a Windows command prompt and on Linux*.

Experimental Printf Method

This method is like the OpenCL™ standard C-style printf function. The format string is interpreted according to OpenCL standard specifications and all the data types have a fixed size. Sometimes it might give incorrect results, except for the char datatype. (Some data types have different sizes in the OpenCL standard and C++. ) So, you need to type_cast the variable manually.

For example: The OpenCL standard has a 64-bit-long data type. But, C++ is 32 bit. So, you need to type_cast every time as explained in the following section: How to Use Experimental Printf inside the Kernel.

The OpenCL standard printf is not directly available for the device code. So, you need to add the __SYCL_DEVICE_ONLY__ preprocessor macro. The printf function is available in the cl::sycl::ext::oneapi::experimental namespace.

How to Use Experimental Printf inside the Kernel

The following code sample is an example program that demonstrates how to use experimental printf inside a kernel.

           #include<CL/sycl.hpp>

           #include<iostream>

           #include<chrono>

           #ifdef __SYCL_DEVICE_ONLY__

             #define CONSTANT __attribute__((opencl_constant))

           #else

             #define CONSTANT

           #endif

           using namespace sycl;

           int main()

           {

           constexpr int N=8;

           int i;

           std::array<int,N> a;

           std::array<int,N> b;

           std::array<int, N> c;

           for(i=0;i<N;i++)

           {

           a[i]=i;

           b[i]=i;

           }

           buffer a_buffer(a);

           buffer b_buffer(b);

           buffer c_buffer(c);

           queue q (default_selector{});

           q.submit([&] (handler &h){

           accessor acc_a(a_buffer,h);

           accessor acc_b(b_buffer,h);

           accessor acc_c(c_buffer,h);

           h.parallel_for(N ,[=](id<1> i){

                    acc_c[i]=acc_a[i]+acc_b[i];

           static const CONSTANT char FMT[]="inside kernel: %d :%d \n" ;

                     sycl::ext::oneapi::experimental:: printf ( FMT, int (i), acc_c[i] );

           });

           });

           q.wait();



           return 0;

           }

In the previous code sample, the vector-add is a simple program that adds two large vectors of integers. You are debugging the vector addition program using a C-style printf format for the kernel code. To use the printf function inside the kernel, add the macros in the host code:

#ifdef __SYCL_DEVICE_ONLY__

 #define CONSTANT __attribute__((opencl_constant))

#else

 #define CONSTANT

#endif

Create the vectors of size N and initialize them. Then make a queue with the required selector(default/cpu/gpu). Create the buffers for the vectors, and then get the access buffers. You need to create the accessor for the buffers. Next, submit the lambda functions and command group handlers to the queue.

Write the kernel function inside the parallel_for using a command group handler. Create a static char constant, which must be used to debug inside the kernel. Use the printf function and if needed, type cast the variables.

static const CONSTANT char FMT[] = " format specifier";

        sycl::ext::oneapi::experimental::printf (FMT, <variable>);

To get synchronized results, host programs need to wait until the kernel finishes the execution. When the SYCL scope ends, the debug data is copied back from device to host.

The output of the program gives garbage values when you do not typecast the variable i in the parallel_for (as shown in Figure 2). 

static const CONSTANT char FMT[]="inside kernel: %d :%d \n";

       sycl::ext::oneapi::experimental::printf(FMT,i,acc_c[i]);

Figure 2 shows the code inside the kernel without typecasting the variable(thread_id).

Figure 2.

 

The output of the program after typecasting the variable i in the parallel_for is shown in Figure 3. 

static const CONSTANT char FMT[] = "inside kernel: %d :%d \n" ;

       sycl::ext::oneapi::experimental::printf(FMT,int(i),acc_c[i]);

Figure 3 shows the code inside the kernel after typecasting the variable(thread_id).

Figure 3.

Limitations

  • You must call std::cout at least once in the host program after executing the kernel in Windows.
  • It's not supported in Visual Studio, but it works correctly in the Intel oneAPI command prompt for Windows.

Use the Intel® Distribution for GDB*

GDB is only used to debug the C/C++ code (host code), not the kernel code. The Intel Distribution for GDB (a part of the Base Kit) debugs the host program and device program inside the kernel. 

For more information, see Get Started on Linux or Get Started on Windows.

How to Use Output Stream and Experimental Printf

Learn how to run the programs mentioned in How to Use Output Stream inside the Kernel or How to Use Experimental Printf inside the Kernel.

Run the Programs on Linux*

  1. The source code is located in /opt/intel/oneAPI/setvars.sh
  2. Save the previous code as <filename>.cpp
  3. Compile the program using <filename.cpp> -o test
  4. Run the executable binary.

Run the Methods on Windows* Using Visual Studio*

  1. Open Visual Studio, and then select Create a new project.

     
  2. Select DPC++ Console Application.

     
  3. Copy the code from How to Use Output Stream inside the Kernel or How to Use Experimental Printf inside the Kernel.
  4. To select the compiler in the platform toolset:
    1. Go to Project > Properties.
    2. Under Configuration Properties, select General.
    3. For Platform Toolset, select Intel® oneAPI DPC++ Complier 2022.

       
  5. To build the program, select Build.
  6. Select Local Windows Debugger.

     

Figure 4 shows the output:

Figure 4.

Run the Methods on Windows Using a Command Prompt

  1. Copy the code and save the program as <filename>.cpp
  2. Compile the code using the command <filename>.cpp
  3. Run the executable <filename>.exe
    The following image shows output of the vector add code using experimental printf.

"

Code Samples

For the source code used in this page, download Debug_Articles. It has two files:

  • Stream_out.cpp has source code for the output stream method.
  • Experimental_printf.cpp has source code for the experimental printf method.
"