Find Bugs Quickly Using Sanitizers with the Intel® oneAPI DPC++/C++ Compiler

In this tutorial, you will learn how to use sanitizers with the Intel® oneAPI DPC++/C++ Compiler to efficiently and accurately detect common issues in your C/C++ code.

Sanitizers help identify and pinpoint undesirable or undefined behavior in your code. They are enabled with a compiler option switch that instruments your program, adding additional safety checks to the binary.

As a side effect, there will be performance and file size overhead. The impact on performance and executable size depends on the specific sanitizer used and on the characteristics of the program being analyzed, such as its size, the amount of memory it allocates, and the number of threads it uses. Thus, sanitizers should only be used for debugging and code verification, not in production code.

However, the benefits of using these separate sanitizer-enabled build steps are tremendous. They help to detect and prevent bugs and security vulnerabilities. Their use can be vital for regular software testing in a CI/CD DevOps environment.

They also provide a convenient way for software developers to verify code changes before submitting them to a repository branch.  

In fact, sanitizers, as used with LLVM-based compilers like Clang* or the Intel® oneAPI DPC++/C++ Compiler, are fairly lightweight. This is especially true if you compare them with other open-source software testing solutions like Valgrind* or commercial code analyzer solutions for functional testing and coding standards compliance like Parasoft’s Insure++*, PVS Studio*, AbsInt Astrée*, or QA Systems Cantata*. Usually, sanitizers increase execution time by a factor of 2-3, while Valgrind can introduce overheads of up to 100x.

This makes sanitizers quite useful for testing or debugging a program as part of your regular software development flow or for identifying runtime issues that occur late in the execution of a larger application.

If, instead, you compare with more traditional interactive debug approaches like the use of GDB*, there does, of course, remain one drawback: the use of sanitizers requires recompilation of the program. Ideally, if your program depends on other shared libraries, these, too, should be recompiled with sanitizers enabled (except for the standard libc/libc++, of course). The benefit is that the code instrumentation will do the bug-hunting for you.

New with the Intel® oneAPI DPC++/C++ Compiler 2025.0 and Intel® Fortran Compiler 2025.0:
Find and resolve coding issues while you are writing your application. These releases provide even more comprehensive feedback for
  • Data Race Detection with ThreadSanitizer
  • Memory Access Safety Issue Detection on SYCL accelerator offload code using DeviceSanitizer

In this tutorial, we will take a closer look at the following sanitizers:

  1. AddressSanitizer - detect memory safety bugs
  2. UndefinedBehaviorSanitizer - detect undefined behavior bugs
  3. MemorySanitizer - detect use of uninitialized memory bugs
  4. ThreadSanitizer - detect data races
  5. Device-Side AddressSanitizer - detect memory safety bugs in SYCL device code

The example source code used throughout this tutorial can be found in the archive file sanitizers-tutorial.tgz

1. Detecting Memory Safety Bugs With the AddressSanitizer

To demonstrate the different capabilities of sanitizers, we will use a small program that prints the Fibonacci sequence, a sequence in which each number is the sum of the two preceding ones starting with 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, …

Code Sample

Let us start with the following code example, which can be found as fibonacci_v1.c inside the tutorial source archive.

#include <stdlib.h>
#include <stdio.h>

/**
 * Fill array arr of length n
 * with the first n fibonacci numbers
 */
void set_fibonacci_list(int *arr, int n) {
  arr[0] = 0;
  arr[1] = 1;

  for (int i = 2; i < n; i++) {
    arr[i] = arr[i-1] + arr[i-2];
  }
}

/**
 * Print the first n fibonacci numbers
 * */
void print_fibonacci(int n) {
  int fibos[n];

  set_fibonacci_list(fibos, n);

  printf("Fibonacci Sequence\n");
  printf("==================\n");
  for (int i = 0; i < n; i++) {
    printf("%d\n", fibos[i]);
  }
  printf("==================\n");
  if (n > 1 && fibos[n-2] != 0) {
    printf("Golden ratio approximation: %g\n", ((double)fibos[n-1])/fibos[n-2]);
  }
}

int main(int argc, char *argv[]) {
  if (argc != 2) {
    printf("Usage: %s NUM\n", argv[0]);
    return 1;
  }

  print_fibonacci(atoi(argv[1]));

  return 0;
}

Figure 1. Initial Fibonacci sequence example source code

This program takes the number of Fibonacci numbers to print as a command line parameter. It then computes the Fibonacci sequence in the function set_fibonacci_list and prints it to the screen in the function print_fibonacci.

Running the Sanitizer

We will now use the AddressSanitizer to detect potential memory-related bugs in this program. The AddressSanitizer can detect multiple kinds of memory safety bugs, including out-of-bounds accesses on the stack and heap and use-after-free bugs.

To compile the program using the AddressSanitizer, use the following command:

$ icx src/fibonacci_v1.c -O0 -g -fsanitize=address -fno-omit-frame-pointer -o fibonacci_v1_with_asan

The compiler option -fsanitize=address activates the sanitizer.

Flags -O0 -g -fno-omit-frame-pointer are added to get the best diagnostic output in case we indeed find a coding issue, but these options are not mandatory.

Note that -g implicitly sets -O0 and -fno-omit-frame-pointer. So, these options are only listed to provide you with the complete set of parameters.

There are additional sanitizer-related flags that you can pass to the command line. Please refer to the Clang Compiler User’s Manual for a complete list.

For comparison, we can also compile a version without the sanitizer:

$ icx src/fibonacci_v1.c -O0 -g -o fibonacci_v1

Now, you can run both executables with some value for N. They both should print the same output:

$ ./fibonacci_v1 10
Fibonacci Sequence
==================
0
1
1
2
3
5
8
13
21
34
==================
Golden ratio approximation: 1.61905
$ ./fibonacci_v1_with_asan 10
Fibonacci Sequence
==================
0
1
1
2
3
5
8
13
21
34
==================
Golden ratio approximation: 1.61905

However, the program contains a bug:

When n < 2, in the set_fibonacci_list function, we assign the initial Fibonacci values to indices that are out of bounds!

Let us try to run the programs with 0 as the argument and see what happens:

$ ./fibonacci_v1 0
Fibonacci Sequence
==================
==================
$ ./fibonacci_v1_with_asan 0
====================================================
==9006==ERROR: AddressSanitizer: dynamic-stack-buffer-overflow on address 0x7ffd21da5a20 at pc 0x000000506601 bp 0x7ffd21da5990 sp 0x7ffd21da5988                                 
...

This illustrates the power of the AddressSanitizer. Normal program execution did not fail in this example. So, we might have easily missed the bug. In other configurations, the program might have crashed. But the crash could have also happened at a later point. In the worst case, a program does not crash but produces wrong results!

AddressSanitizer, on the other hand, immediately detects the error and aborts the execution, showing a verbose diagnostic report. This report includes:

  • The type, location and register values of the bug:
    dynamic-stack-buffer-overflow on address 0x7ffd21da5a20 at pc 0x000000506601 bp 0x7ffd21da5990 sp 0x7ffd21da5988
  • A traceback of where the bug occurred (here the debug compiler flags help):
    WRITE of size 8 at 0x7ffd21da5a20 thread T0
        #0 0x506600 in set_fibonacci_list /home/user/sanitizer_tutorial/src/fibonacci_v1.c:10:10
        #1 0x5067c1 in print_fibonacci /home/user/sanitizer_tutorial/src/fibonacci_v1.c:24:3
        #2 0x50691e in main /home/user/sanitizer_tutorial/src/fibonacci_v1.c:42:3
        #3 0x7feff05077b2 in __libc_start_main (/lib64/libc.so.6+0x237b2) (BuildId: ade58d86662aceee2210a9ef12018705e978965d)
        #4 0x41eb2d in _start (/home/user/sanitizer_tutorial/fibonacci_v1_with_asan+0x41eb2d) 

     

Valgrind, by contrast, does not detect the bug in this case. This is because Valgrind's Memcheck tool does not detect stack-based buffer overflows:

$ valgrind ./fibonacci_v1 0
==166581== Memcheck, a memory error detector
==166581== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==166581== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==166581== Command: ./fibonacci_v1 0
==166581==
Fibonacci Sequence
==================
==================
==166581==
==166581== HEAP SUMMARY:
==166581==     in use at exit: 0 bytes in 0 blocks
==166581==   total heap usage: 1 allocs, 1 frees, 1,024 bytes allocated
==166581==
==166581== All heap blocks were freed -- no leaks are possible
==166581==
==166581== For lists of detected and suppressed errors, rerun with: -s
==166581== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Fixing the Issue

Let us fix the memory safety bug we detected by adjusting the set_fibonacci_list function:

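void set_fibonacci_list(int *arr, int n) {
  // Only write arr[0] if the array has at least one element
  if (n > 0) {
    arr[0] = 0;
  }
  // Only write arr[1] if the array has at least two elements
  if (n > 1) {
    arr[1] = 1;
  }
  for (int i = 2; i < n; i++) {
    arr[i] = arr[i-1] + arr[i-2];
  }
}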

Figure 2. Fix in set_fibonacci_list function for n<2

You can find the new program in src/fibonacci_v2.c inside the tutorial source archive.

After recompilation, we can now check that the bug is gone by rerunning the sanitized version:

$ ./fibonacci_v2_with_asan 0
Fibonacci Sequence
==================
==================

Perfect! We fixed the program.
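
One detail worth keeping in mind: print_fibonacci still declares int fibos[n], and in C a variable-length array whose length is not greater than zero is itself undefined behavior. As a minimal sketch, assuming a heap allocation is acceptable, the hypothetical helper make_fibonacci_list below (it is not part of the tutorial archive) rejects non-positive n before touching any memory:

```c
#include <stdlib.h>

/*
 * Hypothetical helper, not part of the tutorial archive.
 * Returns a heap-allocated array holding the first n Fibonacci numbers,
 * or NULL if n is not positive or the allocation fails.
 * The caller owns the array and must free() it.
 * Assumes n is small enough that every sum still fits in an int.
 */
int *make_fibonacci_list(int n) {
  if (n <= 0) {
    return NULL;  /* nothing to compute, and no zero-length array */
  }

  int *arr = malloc((size_t)n * sizeof *arr);
  if (arr == NULL) {
    return NULL;
  }

  arr[0] = 0;
  if (n > 1) {
    arr[1] = 1;
  }
  for (int i = 2; i < n; i++) {
    arr[i] = arr[i-1] + arr[i-2];
  }
  return arr;
}
```

A caller would check the result for NULL, print the n entries, and then release the array with free().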

2. Detecting Undefined Behavior With the UndefinedBehaviorSanitizer (UBSan)

After fixing memory-related bugs in the Fibonacci program, we can now do some more basic manual functional testing with the program.

Observing an Issue

For example, we can try to use larger values for N:

$ ./fibonacci_v2_with_asan 100
Fibonacci Sequence
==================
0
1
...
-889489150
==================
Golden ratio approximation: 9.81579

We can see that the output is wrong: The golden ratio seems very off, and Fibonacci numbers should never be negative!

Running the Sanitizer

To find out what is going wrong, let us now use the UndefinedBehaviorSanitizer (UBSan), a sanitizer that can detect types of undefined behavior in your program:

$ icx src/fibonacci_v2.c -O0 -g -fsanitize=undefined -fno-omit-frame-pointer -o fibonacci_v2_with_ubsan

Use the -fsanitize=undefined flag to enable UBSan. UBSan will catch a set of common undefined behavior types. Please refer to the UndefinedBehaviorSanitizer documentation to learn how to enable checks on different or additional undefined behavior types.

Let us run our sanitizer-enabled binary:

$ ./fibonacci_v2_with_ubsan 100
src/fibonacci_v2.c:17:23: runtime error: signed integer overflow: 1836311903 + 1134903170 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/fibonacci_v2.c:17:23 in
Fibonacci Sequence
==================
0
...

Identifying the Cause

UBSan has successfully identified the problem: the Fibonacci sequence grows quickly and leads to a signed integer overflow, which is undefined behavior according to the C standard.

Similar to the AddressSanitizer, we are getting verbose diagnostic output:

  • The type of undefined behavior (signed integer overflow)
  • Additional information about the problem (1836311903 + 1134903170 cannot be represented in type 'int')
  • The location of the bug (undefined-behavior src/fibonacci_v2.c:17:23)

Note that in contrast to the AddressSanitizer, the program is not aborted on detecting undefined behavior.
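
The overflow that UBSan reports can also be tested for before the addition is performed. The following is a minimal sketch, not code from the tutorial archive: checked_add and count_fibonacci_fitting_int are hypothetical helper names, and the check relies only on INT_MAX and INT_MIN from limits.h.

```c
#include <limits.h>

/*
 * Hypothetical helper: stores a + b in *sum and returns 1 if the result
 * is representable as an int. Otherwise returns 0 and leaves *sum alone.
 * The comparison is arranged so that it cannot overflow itself.
 */
int checked_add(int a, int b, int *sum) {
  if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
    return 0;
  }
  *sum = a + b;
  return 1;
}

/*
 * Counts how many Fibonacci numbers, starting with 0 and 1,
 * can be represented in an int on the current platform.
 */
int count_fibonacci_fitting_int(void) {
  int prev = 0, curr = 1, next = 0;
  int count = 2;  /* 0 and 1 always fit */

  while (checked_add(prev, curr, &next)) {
    prev = curr;
    curr = next;
    count++;
  }
  return count;
}
```

With a 32-bit int, checked_add(1836311903, 1134903170, &sum) returns 0, which is exactly the pair from the UBSan report, and count_fibonacci_fitting_int() returns 47.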

3. Detecting Uninitialized Memory Usage with the MemorySanitizer

The MemorySanitizer allows you to catch bugs caused by uninitialized memory usage. You can enable the sanitizer via the -fsanitize=memory flag.

Important Notes on MemorySanitizer Usage:
• The sanitizer does not fail immediately on uninitialized memory reads. It only fails once a branch, syscall, or dynamic call depends directly or indirectly on uninitialized memory.
• All project dependencies should be recompiled with MemorySanitizer. Otherwise, there might be a large number of false positives.

Let us look closer at how the MemorySanitizer can detect coding issues in our program.

One alternative way to fix the integer overflow bug from the previous section is to limit the number of Fibonacci numbers computed (maximum of 47).

Let's say that naively, we add the limitation to our set_fibonacci_list function:

void set_fibonacci_list(int *arr, int n) {
  if (n > 0) {
    arr[0] = 0;
  }
  if (n > 1) {
    arr[1] = 1;
  }
  // Naive limit: only the first 47 entries are ever written
  if (n > 47) {
    n = 47;
  }
  for (int i = 2; i < n; i++) {
    arr[i] = arr[i-1] + arr[i-2];
  }
}

Figure 3. Limit the number of items in set_fibonacci_list function to n=47

You can find the newly updated program in src/fibonacci_v3.c inside the tutorial source archive.

Observing an Issue

Now we can re-compile and re-run the program:

$ icx src/fibonacci_v3.c -O0 -g -fsanitize=undefined -fno-omit-frame-pointer -o fibonacci_v3_with_ubsan

$ ./fibonacci_v3_with_ubsan 100
Fibonacci Sequence
==================
0
1
...
==================
Golden ratio approximation: -0.000100335
...

The good news is that UBSan does not complain anymore, meaning the program no longer contains a signed integer overflow. The bad news is that we still have negative Fibonacci numbers in our list, and the golden ratio approximation is still off.

Another observation is that the output of the program changes non-deterministically from run to run.

This is a hint that there might be some uninitialized memory usage.

Running the Sanitizer

We can use the MemorySanitizer to double-check that. Use the following command to compile the program with the MemorySanitizer:

$ icx src/fibonacci_v3.c -O0 -g -fsanitize=memory -fsanitize-memory-track-origins=2 -fno-omit-frame-pointer -o fibonacci_v3_with_msan

The -fsanitize=memory flag enables the MemorySanitizer. To additionally track from which variable the uninitialized memory was derived, you can optionally pass the -fsanitize-memory-track-origins=2 flag.

Running the memory-sanitizer enabled program yields:

$ ./fibonacci_v3_with_msan 100
Fibonacci Sequence
==================
0
1
1
...
==177412==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x4afd9e in print_fibonacci /home/user/sanitizer_tutorial/src/fibonacci_v3.c:36:5
    #1 0x4b03ec in main /home/user/sanitizer_tutorial/src/fibonacci_v3.c:53:3
    #2 0x7f1eac55a7b2 in __libc_start_main (/lib64/libc.so.6+0x237b2) (BuildId: ade58d86662aceee2210a9ef12018705e978965d)
    #3 0x41f2dd in _start (/home/user/sanitizer_tutorial/fibonacci_v3_with_msan+0x41f2dd)

  An uninitialized value was created by an allocation of 'vla' in the stack frame of function 'print_fibonacci'
    #0 0x4af6c0 in print_fibonacci /home/user/sanitizer_tutorial/src/fibonacci_v3.c:29:3

SUMMARY: MemorySanitizer: use-of-uninitialized-value /home/user/sanitizer_tutorial/src/fibonacci_v3.c:36:5 in print_fibonacci
Exiting

Identifying the Cause

We can see that MemorySanitizer reports a use-of-uninitialized-value bug. The reason is that while we only fill the first 47 entries of the Fibonacci array, we still print and use the unassigned (and, thus, uninitialized) array values.

To fix the problem, we should add an extra check to the main function:

  if (atoi(argv[1]) > 47) {
    printf("Please provide a number of 47 or less.\n");
    return 1;
  }

Figure 4. Add check for input parameter for Fibonacci sequence exceeding n=47

You can find the fixed resulting program in src/fibonacci_v4.c inside the tutorial source archive.
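
Note that atoi reports no errors: input that is not a number silently becomes 0. As a hedged sketch of a stricter check, the hypothetical helper parse_count below (not part of the tutorial archive) uses strtol to accept only a whole decimal number between 0 and a caller-supplied maximum:

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of the tutorial archive.
 * Parses text as a decimal number in the range [0, max].
 * Returns 1 and stores the value in *out on success.
 * Returns 0 and leaves *out alone on any error.
 */
int parse_count(const char *text, int max, int *out) {
  char *end = NULL;

  errno = 0;
  long value = strtol(text, &end, 10);

  if (end == text || *end != '\0') {
    return 0;  /* no digits at all, or trailing characters */
  }
  if (errno == ERANGE || value < 0 || value > max) {
    return 0;  /* out of range for long or for the allowed interval */
  }

  *out = (int)value;
  return 1;
}
```

In main, a call such as parse_count(argv[1], 47, &n) could then replace both the atoi call and the separate range check.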

Congratulations, you have fixed all bugs in this program!
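
An alternative to rejecting large inputs is to let the fill function report how many entries it actually wrote, so that the caller never reads past them. This is only a sketch of that idea: set_fibonacci_list_capped and MAX_FIBONACCI_COUNT are hypothetical names, and the limit of 47 assumes a 32-bit int.

```c
/* Assumption: int is 32 bits wide, so only the first 47 numbers fit. */
#define MAX_FIBONACCI_COUNT 47

/*
 * Hypothetical variant of set_fibonacci_list.
 * Fills at most MAX_FIBONACCI_COUNT entries of arr and returns the
 * number of entries that were written. Entries beyond that count
 * are left untouched and must not be read by the caller.
 */
int set_fibonacci_list_capped(int *arr, int n) {
  if (n > MAX_FIBONACCI_COUNT) {
    n = MAX_FIBONACCI_COUNT;
  }
  if (n > 0) {
    arr[0] = 0;
  }
  if (n > 1) {
    arr[1] = 1;
  }
  for (int i = 2; i < n; i++) {
    arr[i] = arr[i-1] + arr[i-2];
  }
  return n > 0 ? n : 0;
}
```

The printing loop and the golden ratio computation would then use the returned count instead of the requested n.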

4. Detecting Data Races with the ThreadSanitizer

The ThreadSanitizer allows you to catch data races in OpenMP and threaded applications. You can enable the sanitizer via the -fsanitize=thread flag.

Important Notes on ThreadSanitizer Usage:
    • ThreadSanitizer can detect data races only at run time, and only in code that is actually executed. So, it is recommended to increase code coverage with a wide variety of tests that use different inputs to exercise each code path.
    • Just like for MemorySanitizer, all project dependencies should be recompiled with ThreadSanitizer. Otherwise, there might be a large number of false positives.
    • Since ThreadSanitizer maps a significant amount of virtual memory (without reserving it), tools like ulimit may not work as usual.
    • Static linking with the standard C or C++ libraries (libc/libstdc++) is not supported.
    • Non-position-independent executables are not supported.

Code Sample

Let us start with the following code example, which can be found as MergesortOMP inside the tutorial source archive. The MergesortOMP sample is a top-down implementation of a comparison-based sorting algorithm known as merge sort.

void MergeSortOpenMP(int a[], int tmp_a[], int first, int last) {
  if (first < last) {
    int middle = (first + last + 1) / 2;  // = first + (last - first + 1) / 2;
    if (last - first < task_threshold) {
      MergeSort(a, tmp_a, first, middle - 1);
      MergeSort(a, tmp_a, middle, last);
    } else {
#pragma omp task
      MergeSortOpenMP(a, tmp_a, first, middle - 1);
#pragma omp task
      MergeSortOpenMP(a, tmp_a, middle, last);
#pragma omp taskwait
    }
    Merge(a, tmp_a, first, middle, last);
    my_counter++;
  }
}

Figure 5. Initial MergesortOMP example source code

This program has three test modes: it can run the serial implementation, the OpenMP task-based implementation, or both. It prompts the user for the test mode to run. The serial implementation can be used as a reference, while the OpenMP implementation contains a race condition.

Observing an Issue

Let us first compile the code without the sanitizer:

$ icpx -std=c++11 -g -qopenmp -o merge_sort_v1 src/merge_sort_v1.cpp

Now, we can run it both in serial and in OpenMP mode. Both runs should report a successful sort:

$ ./merge_sort_v1
N = 1000000
Merge Sort Sample
[0] all tests
[1] serial
[2] OpenMP Task
1

Serial version:
Shuffling the array
Sorting
Sort succeeded in 0.294302 seconds. 

$ ./merge_sort_v1
N = 1000000
Merge Sort Sample
[0] all tests
[1] serial
[2] OpenMP Task
2

OpenMP version:
Shuffling the array
Sorting
Sort succeeded in 0.0802705 seconds.

Running the Sanitizer

We will now use the ThreadSanitizer to detect potential data race conditions in this program. To compile the program using the ThreadSanitizer, use the following command:

$ icpx -std=c++11 -g -qopenmp -fsanitize=thread -fPIE -pie -o merge_sort_v1_tsan src/merge_sort_v1.cpp

The compiler option -fsanitize=thread activates the sanitizer, and -fPIE -pie flags generate a position-independent executable. Non-position-independent executables have fixed addresses for code and data. ThreadSanitizer’s instrumentation may need to insert extra checks and maintain metadata about memory accesses. This is more easily done when the code is position-independent, allowing it to be loaded at any address.

When the -fsanitize=thread flag is used, and the code is not already position-independent, the compiler will implicitly add the -fPIE flag for compilation and the -pie flag for linking.  So, these options are only listed to provide the complete set of parameters.

Although both runs above succeeded, the code contains a data race in the OpenMP implementation, which can be detected using ThreadSanitizer. Let us run the version compiled with sanitizer support in OpenMP mode:

$ TSAN_OPTIONS='ignore_noninstrumented_modules=1:halt_on_error=1:abort_on_error=1'  ./merge_sort_v1_tsan

N = 1000000
Merge Sort Sample
[0] all tests
[1] serial
[2] OpenMP Task
2

OpenMP version:
Shuffling the array
Sorting
==================
WARNING: ThreadSanitizer: data race (pid=2193569)
  Write of size 4 at 0x555556a6dd28 by thread T27:
    #0 MergeSortOpenMP(int*, int*, int, int) /home/MergesortOMP/src/merge_sort_v1.cpp:112:15 (merge_sort_v1_tsan+0xe9859)
    #1 MergeSortOpenMP(int*, int*, int, int) (.extracted) /home/MergesortOMP/src/merge_sort_v1.cpp:106:7 (merge_sort_v1_tsan+0xea350)
    #2 __kmp_invoke_task(int, kmp_task*, kmp_taskdata*) /tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm1.t1..h1.w1-anompbdwlin05/../../src/kmp_tasking.cpp:1658:9 (libiomp5.so+0xf9623)
    #3 main.extracted /home/MergesortOMP/src/merge_sort_v1.cpp (merge_sort_v1_tsan+0xea4f6)
    #4 __kmp_invoke_microtask <null> (libiomp5.so+0x145602)
    #5 main /home/MergesortOMP/src/merge_sort_v1.cpp:231:1 (merge_sort_v1_tsan+0xea09e)

  Previous write of size 4 at 0x555556a6dd28 by thread T19:
    #0 MergeSortOpenMP(int*, int*, int, int) /home/MergesortOMP/src/merge_sort_v1.cpp:112:15 (merge_sort_v1_tsan+0xe9859)
    #1 MergeSortOpenMP(int*, int*, int, int) (.extracted) /home/MergesortOMP/src/merge_sort_v1.cpp:106:7 (merge_sort_v1_tsan+0xea350)
    #2 __kmp_invoke_task(int, kmp_task*, kmp_taskdata*) /tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm1.t1..h1.w1-anompbdwlin05/../../src/kmp_tasking.cpp:1658:9 (libiomp5.so+0xf9623)
    #3 main.extracted /home/MergesortOMP/src/merge_sort_v1.cpp (merge_sort_v1_tsan+0xea4f6)
    #4 __kmp_invoke_microtask <null> (libiomp5.so+0x145602)
    #5 main /home/MergesortOMP/src/merge_sort_v1.cpp:231:1 (merge_sort_v1_tsan+0xea09e)

  Location is global 'my_counter' of size 4 at 0x555556a6dd28 (merge_sort_v1_tsan+0x1519d28)

  Thread T27 (tid=2193609, running) created by main thread at:
    #0 pthread_create <null> (merge_sort_v1_tsan+0x38585)
    #1 __kmp_create_worker /tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm1.t1..h1.w1-anompbdwlin05/../../src/z_Linux_util.cpp:974:7 (libiomp5.so+0x146126)

  Thread T19 (tid=2193601, running) created by main thread at:
    #0 pthread_create <null> (merge_sort_v1_tsan+0x38585)
    #1 __kmp_create_worker /tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm1.t1..h1.w1-anompbdwlin05/../../src/z_Linux_util.cpp:974:7 (libiomp5.so+0x146126)

SUMMARY: ThreadSanitizer: data race /home/MergesortOMP/src/merge_sort_v1.cpp:112:15 in MergeSortOpenMP(int*, int*, int, int)
==================

Aborted

First, let us take a closer look at the TSAN_OPTIONS environment variable. It sets three options:

  • ignore_noninstrumented_modules=1 helps ThreadSanitizer to work with non-instrumented libraries. It will, for example, suppress reports from OpenMP runtime code.
  • halt_on_error=1 forces the sanitizer to exit after the first reported error. It is particularly useful for debugging multithreaded applications, like OpenMP programs, because otherwise the sanitizer will report the error for each thread, making the output very hard to read.
  • abort_on_error=1 makes the sanitizer call abort() instead of _exit() after printing the error report. Only the first process encountering the error condition will be stopped, not the entire application.
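
If you prefer not to set the environment variable for every run, the sanitizer runtime also looks for a function named __tsan_default_options in the program and reads default options from the string it returns. The following is a minimal sketch using the same three options as above. It is not part of the MergesortOMP sample, and when added to a C++ source file such as merge_sort_v1.cpp the function would need extern "C" linkage.

```c
/*
 * Optional hook read by the ThreadSanitizer runtime at startup.
 * The returned string uses the same syntax as TSAN_OPTIONS.
 * In a build without -fsanitize=thread this is just an unused function.
 */
const char *__tsan_default_options(void) {
  return "ignore_noninstrumented_modules=1:"
         "halt_on_error=1:"
         "abort_on_error=1";
}
```

Values passed through TSAN_OPTIONS at run time are expected to take precedence over these built-in defaults, so the environment variable can still be used for one-off experiments.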

Identifying the Cause

Now let us take a closer look at the ThreadSanitizer report:

  • The sanitizer was able to catch the data race:

     WARNING: ThreadSanitizer: data race (pid=2193569)

  • And the location of the potentially incorrect code is:
     #0 MergeSortOpenMP(int*, int*, int, int) /home/MergesortOMP/src/merge_sort_v1.cpp:112:15 (merge_sort_v1_tsan+0xe9859)
     #1 MergeSortOpenMP(int*, int*, int, int) (.extracted) /home/MergesortOMP/src/merge_sort_v1.cpp:106:7 (merge_sort_v1_tsan+0xea350)

So, we should look at lines 112 and 106 of the source code:

…
(105) #pragma omp task
(106) MergeSortOpenMP(a, tmp_a, first, middle - 1);
… 
(112) my_counter++;

 

The sanitizer is pointing to line 112 at the top of the stack. So, we may need to learn more about my_counter:

(10) int my_counter = 0;

It is a global variable. The increment sits in the MergeSortOpenMP function, which is called recursively from within OpenMP task regions. This means that multiple threads increment the same global variable concurrently. If you run with only TSAN_OPTIONS='ignore_noninstrumented_modules=1', you will see that the same issue is reported for many threads.

The #pragma omp atomic directive prevents simultaneous updates. You can find the fixed program in src/merge_sort_v2.cpp inside the tutorial source archive.

Let us look at the fixed MergeSortOpenMP function:

void MergeSortOpenMP(int a[], int tmp_a[], int first, int last) {
  if (first < last) {
    int middle = (first + last + 1) / 2;  // = first + (last - first + 1) / 2;
    if (last - first < task_threshold) {
      MergeSort(a, tmp_a, first, middle - 1);
      MergeSort(a, tmp_a, middle, last);
    } else {
#pragma omp task
      MergeSortOpenMP(a, tmp_a, first, middle - 1);
#pragma omp task
      MergeSortOpenMP(a, tmp_a, middle, last);
#pragma omp taskwait
    }
    Merge(a, tmp_a, first, middle, last);
    #pragma omp atomic
    my_counter++;
  }
}

Figure 6. Fixed MergesortOMP example source code (comments removed)

Now we are ready to compile the fixed version and run it once again under the sanitizer:

$ icpx -std=c++11 -g -qopenmp -fsanitize=thread -fPIE -pie -o merge_sort_v2_tsan src/merge_sort_v2.cpp 
$ TSAN_OPTIONS='ignore_noninstrumented_modules=1:suppress_equal_pcs=1' ./merge_sort_v2_tsan
N = 1000000
Merge Sort Sample
[0] all tests
[1] serial
[2] OpenMP Task
2

OpenMP version:
Shuffling the array
Sorting
Sort succeeded in 0.419602 seconds.

The initial version of this code sample can be found at the oneAPI-samples GitHub repository: oneAPI-samples/DirectProgramming/C++/GraphTraversal/MergesortOMP.
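
To see the counter fix in isolation, here is a minimal sketch that is separate from the MergesortOMP sample. The names shared_counter and count_with_atomic are hypothetical. When compiled with OpenMP enabled (for example with -qopenmp), the loop iterations are distributed over several threads. Without OpenMP the pragmas are ignored and the loop simply runs serially.

```c
/* Shared by all threads, like my_counter in the sample. */
static int shared_counter = 0;

/*
 * Hypothetical example: increments shared_counter once per iteration
 * from a parallel loop and returns the final value.
 * The atomic directive makes each increment an indivisible update,
 * so no increments are lost and no data race is reported.
 */
int count_with_atomic(int iterations) {
  shared_counter = 0;

#pragma omp parallel for
  for (int i = 0; i < iterations; i++) {
#pragma omp atomic
    shared_counter++;
  }

  return shared_counter;
}
```

Removing the atomic directive from this sketch reintroduces the same kind of unsynchronized update that ThreadSanitizer flagged at my_counter++.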

5. Detecting Memory Safety Bugs in SYCL* Code with the Device-Side AddressSanitizer


The AddressSanitizer, a tool for detecting memory errors in C/C++ code, now includes support for SYCL device code. To activate this feature for the device code, use the flag -Xarch_device -fsanitize=address. The flag -Xarch_host -fsanitize=address should be used to identify memory access problems in the host code. This new SYCL accelerator extension thus provides a Device-Side AddressSanitizer. It is introduced with Intel® oneAPI DPC++/C++ Compiler in version 2025.0.

Notes on Device-Side AddressSanitizer Usage:
    • Device-side AddressSanitizer can increase private memory usage and trigger a UR_RESULT_ERROR_INVALID_WORK_GROUP_SIZE error. To resolve this, reduce the SYCL local workgroup size.
    • Device-side AddressSanitizer forces in-order execution of kernels.

Code Sample

Within the tutorial source archive, in the vector-add directory, there is a simple SYCL program designed to perform vector addition on a selected accelerator.

This example is presented in two distinct versions, each illustrating one of the two memory models available in SYCL. The initial version employs buffers, which are abstractions resembling containers for memory, and the SYCL runtime fully handles their management.

void VectorAdd(queue &q, const IntVector &a_vector, const IntVector &b_vector,
               IntVector &sum_parallel) {
  range<1> num_items{a_vector.size()};
  buffer a_buf(a_vector);
  buffer b_buf(b_vector);
  buffer sum_buf(sum_parallel.data(), num_items);

  for (size_t i = 0; i < num_repetitions; i++ ) {
    q.submit([&](handler &h) {
      accessor a(a_buf, h, read_only);
      accessor b(b_buf, h, read_only);
      accessor sum(sum_buf, h, write_only, no_init);
      h.parallel_for(num_items, [=](auto i) { sum[i] = a[i] + b[i]; });
    });
  };
  // Wait until compute tasks on GPU are done
  q.wait();
}

Figure 7. vector-add-buffers example source code

The second model employs Unified Shared Memory (USM), which relies on malloc functions and offers developers greater control.

void VectorAdd(queue &q, const int *a, const int *b, int *sum, size_t size) {
  range<1> num_items{size+12};
  auto e = q.parallel_for(num_items, [=](auto i) { sum[i] = a[i] + b[i]; });
  e.wait();
}

Figure 8. vector-add-usm example source code

Exploring the USM-Based Vector Addition Example

Let's begin by examining the USM code. As usual, we will first compile the code without the sanitizer to observe any potential issues up front and get an idea of what we may be looking for.

$ icpx -fsycl -g -O0 vector-add-usm_v1.cpp -o vector-add-usm_v1

Now, we can run it on a GPU using the Intel® oneAPI Level Zero backend:

$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./vector-add-usm_v1
Running on device: Intel(R) Data Center GPU Max 1550
Vector size: 10000
[0]: 0 + 0 = 0
[1]: 1 + 1 = 2
[2]: 2 + 2 = 4
...
[9999]: 9999 + 9999 = 19998
An exception is caught while adding two vectors.
terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Native API failed. Native API returns: 11 (UR_RESULT_ERROR_INVALID_MEM_OBJECT)

Even though the printed results appear correct, the program then terminates with a runtime exception (UR_RESULT_ERROR_INVALID_MEM_OBJECT). To uncover and diagnose this issue, we will proceed to the next step.

Running the Sanitizer

We will now use the DeviceSanitizer to detect potential memory issues in this program. To compile the program with the DeviceSanitizer support, use the following command:

$ icpx -fsycl -g -O0 -Xarch_device -fsanitize=address vector-add-usm_v1.cpp -o vector-add-usm_v1_asan

The compiler option -Xarch_device -fsanitize=address activates the sanitizer for device code analysis. With the compilation complete, we can execute the binary to check for any memory issues:

$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./vector-add-usm_v1_asan
Running on device: Intel(R) Data Center GPU Max 1550
Vector size: 10000

====ERROR: DeviceSanitizer: out-of-bounds-access on Shared USM
READ of size 4 at kernel <typeinfo name for VectorAdd(sycl::_V1::queue&, int const*, int const*, int*, unsigned long)::'lambda'(auto)> LID(0, 0, 0) GID(10000, 0, 0)
  #0 auto VectorAdd(sycl::_V1::queue&, int const*, int const*, int*, unsigned long)::'lambda'(auto)::operator()<sycl::_V1::item<1, true>>(auto) const  /home/vector-add-usm_v1.cpp:62

The output confirms that the program is running on an Intel(R) Data Center GPU Max 1550 with a vector size of 10,000. However, the DeviceSanitizer has detected an out-of-bounds access error on Shared USM at vector-add-usm_v1.cpp:62.

With this information, we are now ready to investigate further and diagnose the root cause of the memory error reported by the DeviceSanitizer.

Identifying the Cause

To identify the cause of the memory error reported by the DeviceSanitizer, we must closely examine the error message's details and review the code's relevant sections. The error message provides us with specific clues: it indicates an out-of-bounds access during a read operation, and it points to the exact location in the code where the issue occurred:

(62)  auto e = q.parallel_for(num_items, [=](auto i) { sum[i] = a[i] + b[i]; });

The code includes both kernel and host segments. An out-of-bounds error in this scenario might arise due to various factors, but two primary reasons are most common:

1. Incorrect Memory Allocation

If the arrays a, b, or sum were not allocated with the correct size before calling this kernel, any access beyond the allocated size would result in out-of-bounds access. Let us check this hypothesis and look at the allocations in the main function:

    int *a = malloc_shared<int>(array_size, q);
    int *b = malloc_shared<int>(array_size, q);
    int *sum_sequential = malloc_shared<int>(array_size, q);
    int *sum_parallel = malloc_shared<int>(array_size, q);

However, all four arrays are allocated with the same size, array_size, which is also the size passed to VectorAdd:

VectorAdd(q, a, b, sum_parallel, array_size);

2. Incorrect Range Specification:

If num_items is larger than the allocated size of a, b, or sum, the kernel will attempt to access elements beyond the allocated memory, leading to an out-of-bounds error. In this code, the range is specified by num_items:

range<1> num_items{size+12};
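
The effect of such an oversized range can be illustrated on the host. The sketch below is plain C, not SYCL: each loop iteration stands in for one work-item, and vector_add_guarded is a hypothetical name. It shows how many work-items fall outside the allocation and how an index guard would keep them from touching memory.

```c
#include <stddef.h>

/*
 * Host-only emulation, not SYCL code.
 * launched_items plays the role of the kernel range, size is the
 * number of elements actually allocated for a, b, and sum.
 * Returns the number of work-items whose index was out of bounds
 * and was therefore skipped by the guard.
 */
size_t vector_add_guarded(const int *a, const int *b, int *sum,
                          size_t size, size_t launched_items) {
  size_t skipped = 0;

  for (size_t i = 0; i < launched_items; i++) {
    if (i >= size) {
      skipped++;  /* without this guard, the access below would be out of bounds */
      continue;
    }
    sum[i] = a[i] + b[i];
  }
  return skipped;
}
```

Called with launched_items equal to size + 12, the function reports 12 skipped work-items, which corresponds to the indices starting at GID 10000 in the DeviceSanitizer report.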

And this is exactly the bug. Let’s fix it and run with the sanitizer one more time:

$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./vector-add-usm_v1_asan 
====ERROR: DeviceSanitizer: bad-free on address 0x387ae30
 #0 in main /home/vector-add-usm_v1.cpp:153:5
 #1 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0xe5f3622a1ca)
 #2 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0xe5f3622a28b)
 #3 in _start (./vector-add-usm_v1_asan+0x4034e5)

0x387ae30 may be allocated on Host Memory

A new bug is detected in vector-add-usm_v1.cpp:153:

(153) free(c, q);

To address this issue, we must examine how the array c was allocated:

(110) int *c = new int[array_size];

Since c is allocated using the C++ new[] operator rather than a SYCL memory allocation function, it must be deallocated using the corresponding C++ delete[] operator, not the SYCL free function.
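The pairing rule can be sketched in plain host C++. The function names below are hypothetical and not taken from the sample; the first version releases the array by hand, and the second lets std::unique_ptr<int[]> call delete[] automatically.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Host-only sketch: fill a heap array with 0..n-1 and return the sum.
long long sum_of_indices(std::size_t n) {
  assert(n <= 65536 && "keep the values within int range for this sketch");
  int *c = new int[n];  // allocated with new[] ...
  long long total = 0;
  for (std::size_t i = 0; i < n; ++i) {
    c[i] = static_cast<int>(i);
    total += c[i];
  }
  delete[] c;  // ... so it is released with delete[], not with sycl::free
  return total;
}

// Same computation, but the array is owned by std::unique_ptr<int[]>,
// which calls delete[] when it goes out of scope.
long long sum_of_indices_owned(std::size_t n) {
  assert(n <= 65536 && "keep the values within int range for this sketch");
  std::unique_ptr<int[]> c(new int[n]);
  long long total = 0;
  for (std::size_t i = 0; i < n; ++i) {
    c[i] = static_cast<int>(i);
    total += c[i];
  }
  return total;
}
```

The owning-pointer variant removes the need to remember which deallocation function matches the allocation.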

After fixing the issue, we recompile and see another bad-free:

$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./vector-add-usm_v1_asan
Running on device: Intel(R) Data Center GPU Max 1550
Vector size: 10000
[0]: 0 + 0 = 0
[1]: 1 + 1 = 2
[2]: 2 + 2 = 4
...
[9999]: 9999 + 9999 = 19998

====ERROR: DeviceSanitizer: bad-free on address 0xc433b840400
 #0 in main /home/vector-add-usm_v1.cpp:155:5
 #1 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0xc434c02a1ca)
 #2 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0xc434c02a28b)
 #3 in _start (./vector-add-usm_v1_asan+0x4034e5)

0xc433b840400 is located inside of Shared USM region [0xc433b820400, 0xc433b82a040)
allocated here:
 #0 in int* sycl::_V1::malloc_shared<int>(unsigned long, sycl::_V1::device const&, sycl::_V1::context const&, sycl::_V1::property_list const&, sycl::_V1::detail::code_location const&) /opt/intel/oneapi/compiler/latest/bin/compiler/../../include/sycl/usm.hpp:229:3
 #1 in int* sycl::_V1::malloc_shared<int>(unsigned long, sycl::_V1::queue const&, sycl::_V1::property_list const&, sycl::_V1::detail::code_location const&) /opt/intel/oneapi/compiler/latest/bin/compiler/../../include/sycl/usm.hpp:237:10
 #2 in main /home/vector-add-usm_v1.cpp:112:25
 #3 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0xc434c02a1ca)
 #4 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0xc434c02a28b)
 #5 in _start (./vector-add-usm_v1_asan+0x4034e5)

An exception is caught while adding two vectors.
terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Native API failed. Native API returns: 45 (UR_RESULT_ERROR_INVALID_ARGUMENT)

Upon reviewing the code in vector-add-usm_v1.cpp:112, we observe the following lines:

(111)     int *sum_sequential = malloc_shared<int>(array_size, q);
(112)     int *sum_parallel = malloc_shared<int>(array_size, q);
...
(155)     free(sum_sequential, q2);

The sum_sequential array is allocated using the malloc_shared function with the queue q, but the code attempts to deallocate it using a different queue, q2. This mismatch matters because a USM allocation is tied to the SYCL context of the queue that created it. Memory allocated in one context must be freed in the same context to ensure proper resource management and avoid errors such as the one above.

Freeing sum_sequential with the correct queue is necessary, but it is not the last issue. Running the program again after this fix reveals further problems:

$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./vector-add-usm_v1_asan 
Running on device: Intel(R) Data Center GPU Max 1550
Vector size: 10000
[0]: 0 + 0 = 0
[1]: 1 + 1 = 2
[2]: 2 + 2 = 4
...
[9999]: 9999 + 9999 = 19998
Vector add successfully completed on device.

====ERROR: DeviceSanitizer: detected memory leaks of Shared USM
Direct leak of 40000 byte(s) at 0x12f842280400 allocated from:
 #0 in int* sycl::_V1::malloc_shared<int>(unsigned long, sycl::_V1::device const&, sycl::_V1::context const&, sycl::_V1::property_list const&, sycl::_V1::detail::code_location const&) /opt/intel/oneapi/compiler/latest/bin/compiler/../../include/sycl/usm.hpp:229:3
 #1 in int* sycl::_V1::malloc_shared<int>(unsigned long, sycl::_V1::queue const&, sycl::_V1::property_list const&, sycl::_V1::detail::code_location const&) /opt/intel/oneapi/compiler/latest/bin/compiler/../../include/sycl/usm.hpp:237:10
 #2 in main /home/vector-add-usm_v1.cpp:108:14
 #3 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0x12f84e02a1ca)
 #4 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0x12f84e02a28b)
 #5 in _start (./vector-add-usm_v1_asan+0x4034e5)


====ERROR: DeviceSanitizer: detected memory leaks of Shared USM
Direct leak of 40000 byte(s) at 0x12f8422a0400 allocated from:
 #0 in int* sycl::_V1::malloc_shared<int>(unsigned long, sycl::_V1::device const&, sycl::_V1::context const&, sycl::_V1::property_list const&, sycl::_V1::detail::code_location const&) /opt/intel/oneapi/compiler/latest/bin/compiler/../../include/sycl/usm.hpp:229:3
 #1 in int* sycl::_V1::malloc_shared<int>(unsigned long, sycl::_V1::queue const&, sycl::_V1::property_list const&, sycl::_V1::detail::code_location const&) /opt/intel/oneapi/compiler/latest/bin/compiler/../../include/sycl/usm.hpp:237:10
 #2 in main /home/vector-add-usm_v1.cpp:107:14
 #3 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0x12f84e02a1ca)
 #4 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0x12f84e02a28b)
 #5 in _start (./vector-add-usm_v1_asan+0x4034e5)

The output now includes ‘Vector add successfully completed on device’, so the kernel ran to completion, but two more bugs remain. The sanitizer detected two memory leaks, from the allocations at vector-add-usm_v1.cpp:107 and vector-add-usm_v1.cpp:108.

These lines are:

int *a = malloc_shared<int>(array_size, q);
int *b = malloc_shared<int>(array_size, q);

These are the allocations that leak. The most common cause of a memory leak is simply forgetting to free memory allocated with malloc_shared, malloc_device, or malloc_host. In SYCL, memory allocated with these functions must be explicitly deallocated using the corresponding sycl::free function. The end of the program contains only these calls:

free(sum_sequential, q);
free(sum_parallel, q);

The code currently lacks the deallocation for a and b. After adding the corresponding free calls for both arrays, the code executes as intended. The corrected code version can be found in the source file vector-add-usm_v2.cpp.
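One way to avoid forgetting a free call is to tie each allocation to an owner object. The sketch below shows that pattern in standard C++, using std::malloc and std::free as stand-ins for malloc_shared and sycl::free so that it runs without a SYCL device. In real SYCL code the deleter would also need to hold the queue or context. All names here (live_allocations, CountingFree, OwnedArray, make_array, vector_add_last) are hypothetical and not part of the sample.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>

// Number of allocations not yet released; a stand-in for what a leak
// checker would report at program exit.
inline int &live_allocations() {
  static int count = 0;
  return count;
}

// Deleter that releases the memory and updates the counter.
struct CountingFree {
  void operator()(int *p) const {
    std::free(p);
    --live_allocations();
  }
};

using OwnedArray = std::unique_ptr<int[], CountingFree>;

// Allocates n ints and hands ownership to an OwnedArray.
OwnedArray make_array(std::size_t n) {
  int *p = static_cast<int *>(std::malloc(n * sizeof(int)));
  if (p == nullptr) throw std::bad_alloc();
  ++live_allocations();
  return OwnedArray(p);
}

// Adds two vectors on the host and returns the last element of the sum.
// a, b, and sum are all released automatically when the function returns.
int vector_add_last(std::size_t n) {
  assert(n > 0 && "need at least one element");
  OwnedArray a = make_array(n);
  OwnedArray b = make_array(n);
  OwnedArray sum = make_array(n);
  for (std::size_t i = 0; i < n; ++i) {
    a[i] = static_cast<int>(i);
    b[i] = static_cast<int>(i);
    sum[i] = a[i] + b[i];
  }
  return sum[n - 1];
}
```

With this structure there is no separate list of free calls to keep in sync with the allocations, which is the kind of omission that produced the leaks above.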

Congratulations, all the issues in the USM-based vector-add example are fixed.

Exploring the Buffer-Based Vector Addition Example

In addition, we can examine the version of this example that utilizes buffers. The kernel invocation from vector-add-buffers_v1.cpp has been previously illustrated in Figure 8. We will now proceed to compile and execute the code on an Intel GPU, utilizing the Level Zero backend.

$ icpx -fsycl -g -O0 vector-add-buffers_v1.cpp -o vector-add-buffers_v1
$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./vector-add-buffers_v1
Running on device: Intel(R) Data Center GPU Max 1550
Vector size: 10000
[0]: 0 + 0 = 0
[1]: 1 + 1 = 2
[2]: 2 + 2 = 4
...
[9999]: 9999 + 9999 = 19998
Vector add successfully completed on device.

While the code ran without any runtime failures and provided correct results, hidden memory issues might have gone undetected with the present compiler and driver version, and these could cause crashes with future versions of the compiler or runtime environment.

Rebuilding with AddressSanitizer enabled for the device code uncovers an out-of-bounds memory access in the buffer:

$ icpx -fsycl -g -O0 -Xarch_device -fsanitize=address vector-add-buffers_v1.cpp -o vector-add-buffers_v1_asan
$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./vector-add-buffers_v1_asan 

Running on device: Intel(R) Data Center GPU Max 1550
Vector size: 10030

====ERROR: DeviceSanitizer: out-of-bounds-access on Memory Buffer
READ of size 4 at kernel <typeinfo name for VectorAdd(sycl::_V1::queue&, std::vector<int, std::allocator<int>> const&, std::vector<int,er&)::operator()(sycl::_V1::handler&) const::'lambda'(auto)> LID(4, 0, 0) GID(10000, 0, 0)
#0 auto VectorAdd(sycl::_V1::queue&, std::vector<int, std::allocator<int>> const&, std::vector<int, std::allocator<int>> const&, std:r&) const::'lambda'(auto)::ope /home/vector-add-buffers_v1.cpp:87

The line in question is 87:

(87) h.parallel_for(task_size, [=](auto i) { sum[i] = a[i] + b[i]; });

The task size is defined in line 72 as:

(72) auto task_size = num_items + 2*(i + 1);

The root cause of the issue is that the task size is incorrectly calculated; it should be exactly num_items, which is determined by the size of the array:

(59) range<1> num_items{a_vector.size()};

After correcting the task size to match num_items, the vector addition code based on buffers will function correctly. The correct code version can be found in vector-add-buffers_v1.cpp in the source code archive.

We have explored the role sanitizers play in identifying and resolving memory issues within both USM and buffer-based SYCL code, ensuring error-free parallel computing applications.

Summary and Next Steps

In this tutorial, we introduced you to the fundamentals of using sanitizers with the Intel oneAPI DPC++/C++ Compiler. As the examples showed, sanitizers can catch multiple bugs even in a simple program.

Using sanitizers can effectively catch issues early in the development process, saving time and reducing the likelihood of costly errors in production code.

Accelerate your software development and speed up your application's code correctness and functional safety testing by catching issues early. The Intel oneAPI DPC++/C++ Compiler enables you to do this for CPU and GPU accelerator offload code, giving you the confidence you desire to go into validation and production.

Useful Resources 

Here are some detailed resources for you to explore the oneAPI DPC++/C++ Compiler: 

Get The Software  

You can install the Intel® oneAPI DPC++/C++ Compiler as a part of the Intel® oneAPI Base Toolkit or the Intel® oneAPI HPC Toolkit. You can also download a standalone version of the compiler or test it across Intel® CPUs and GPUs on the Intel® Developer Cloud platform.