Intel® oneAPI DPC++/C++ Compiler Release Notes

This document summarizes new and changed product features and includes notes about features and problems not described in the product documentation.

Where to Find the Release

Download the Intel® oneAPI toolkit from the Intel® oneAPI Base Toolkit Download page, then follow the installation instructions provided there.

oneAPI 2024.1, Compiler Release 2024.1

New Features and Improvements

Added -fpreview-breaking-changes option.
Enabled early AOT abilities when creating objects instead of linking with the -fno-sycl-rdc option.
Added support for -faltmathlib.
Added support for -ftarget-register-alloc-mode.
Added -fsycl-add-default-spec-consts-image command line option to Clang Driver.
Added diagnostics if SYCL-specific command line options are used without -fsycl.
Enabled INTEL_SYCL_OPAQUEPOINTER_READY.
Supported multiple SYCL targets with native_cpu in the same compiler invocation.
Made -fintelfpga flag imply -fsycl.
Ensured debug format for SYCL device on Windows is the expected format.
Disabled parallel for range rounding at -O0.
Fixed missing optnone attribute in -O0 mode on non-FPGA target to improve debuggability.
Added support for local accessors to copy_from/copy_to for ESIMD.
Added support for local accessors to atomic update for ESIMD.
Added support for local accessors to lsc for ESIMD.
Added new raw_send APIs moving compile time params to template params for ESIMD.
Implemented sycl_ext_oneapi_auto_local_range.
Implemented sycl_ext_oneapi_sub_group_mask version 2.
Implemented sycl_ext_oneapi_queue_priority.
Implemented optional kernel feature for joint_matrix and joint_matrix_mad.
Implemented sycl_ext_oneapi_prefetch.
Improved error message when running ESIMD code on unsupported devices.
Added ext_oneapi_non_uniform_groups aspect.
Stopped throwing exception when passing empty accessor to handler::require().
Moved addc() and subb() functions out of experimental namespace.
Moved BFN function from experimental namespace.
Improved error message when queue::memcpy receives a null pointer by reporting source code information (file and line number).
Improved accuracy of double tanpi host implementation.
Updated annotated_ptr API to match spec changes.
Implemented info::event_profiling::command_submit support for platforms that use OpenCL version less than 2.1.
Fixed required alignment in buffers.
Improved error message when using sycl_ext_oneapi_graph features that are not yet implemented.
Supported get_multi_ptr in ESIMD context.
Accepted integral types in parallel_for with offsets.
Added support for assert() in ESIMD kernels.
Added missing Intel archs to sycl_ext_oneapi_device_architecture.
Enabled range rounding for unnamed lambdas.
Fixed loading of plugins when path to them contains non-ANSI symbols.
Improved the error message when the command group function object passed to queue::submit has the wrong signature.
Improved handler::fill performance on 2D and 3D buffers.
Added defaults for streaming and register_map fpga_kernel_properties.
Moved GRF property conflict check to compile time.
Added error when using invoke_simd with callables with struct arguments.
Made sycl_ext_oneapi_graph functions thread-safe.
Improved diagnostic for nd_item passed to parallel_for with range.
Implemented missing exceptions defined by sycl_ext_oneapi_graph specification.
Implemented runtime query for matrix combinations.
Supported basic kernel launches over 64-bit global ranges.
Clarified runtime error for nested kernel submissions.
Fixed compiler crash when spec constants are structs with padding in the middle.
Added support for specialization constants in sycl_ext_oneapi_graph nodes.
Improved vec::convert implementation to use a single instruction instead of per-element.
Made annotated_arg<T> implicitly device copyable if T is device copyable.
On Intel® GPUs that use the Unified Runtime Level Zero Adapter, a memory allocation now fails gracefully by default if the requested size exceeds the maximum size of a single allocation allowed on the device. In some Intel® GPUs (not including Intel® Data Center GPU Max), this size limit would be >=4GB for a single allocation size. To override this limit, users can enable "relaxed allocation limits" by setting the environment variable UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1. Because limit checking is relaxed when this variable is set, users are then responsible for ensuring that their workload does not use more memory than the Global Memory Size.
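As a rough illustration of that responsibility, the following sketch checks a requested allocation size before the allocation is attempted. The helper names relaxed_limits_enabled and request_is_safe are hypothetical and are not part of any Intel API. In a SYCL program, the two limits would typically come from the sycl::info::device::max_mem_alloc_size and sycl::info::device::global_mem_size device queries; here they are passed in as plain numbers so the example stays standard C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Returns true when UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS is set to "1"
// in the environment of the current process.
inline bool relaxed_limits_enabled() {
  const char *value = std::getenv("UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS");
  return value != nullptr && std::strcmp(value, "1") == 0;
}

// Hypothetical pre-check (not an Intel API): decides whether a request of
// `requested` bytes is reasonable to attempt.
inline bool request_is_safe(std::size_t requested,
                            std::size_t max_single_alloc,
                            std::size_t global_mem_size,
                            bool relaxed) {
  // Never ask for more than the device's global memory.
  if (requested > global_mem_size) return false;
  // Without relaxed limits, also respect the single-allocation limit.
  return relaxed || requested <= max_single_alloc;
}

// Usage sketch with made-up sizes.
inline void allocation_check_demo() {
  assert(request_is_safe(100, 200, 1000, false));
  assert(!request_is_safe(300, 200, 1000, false));
  assert(request_is_safe(300, 200, 1000, true));
}
```

A caller could pass relaxed_limits_enabled() as the last argument so the check follows the environment the program was started in.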
New OpenMP Features:
The reduction clause now supports the task modifier.
The scope construct now supports the allocate clause.
The linear clause now supports the step modifier.
The proc_bind clause now supports the primary policy. The old proc_bind(master) form is deprecated.
The syntax of the linear clause now deprecates the linear-modifier(list) form.
The syntax of the map clause now deprecates the form of modifiers without comma separators.
New device and host memory allocators were added to the runtime.
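A minimal sketch of the proc_bind(primary) spelling that replaces proc_bind(master) is shown below. The function name sum_of_squares is illustrative only. If the file is built without an OpenMP option (for example -fiopenmp with icpx), the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums the squares of the elements. proc_bind(primary) requests that the
// threads of the team be placed with the primary thread; the reduction
// clause combines the per-thread partial sums into `total`.
inline long sum_of_squares(const std::vector<int> &values) {
  long total = 0;
#pragma omp parallel for proc_bind(primary) reduction(+ : total)
  for (std::size_t i = 0; i < values.size(); ++i) {
    total += static_cast<long>(values[i]) * values[i];
  }
  return total;
}

// Usage sketch.
inline void proc_bind_demo() {
  assert(sum_of_squares({1, 2, 3}) == 14);
}
```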
New FPGA Features:
The fpga_crossgen command option previously called --emulation_model is now --cpp_model. The old option name is deprecated and will be removed in a future release.
For FPGA, added the annotated_ptr class to help you constrain memory accesses to build more efficient FPGA hardware.
When compiling for FPGA, if you do not specify the -Xstarget compiler command option, the default compiler target is now Agilex™ 7 devices (agilex7). In previous versions, the default compiler target was the Intel® PAC with Intel® Arria® 10 GX FPGA board (intel_a10gx_pac:pac_a10).
For FPGA, the mmhost macro is removed. Use the annotated_arg class instead.
For FPGA, the register_map_interface macro is removed. A memory-mapped interface is now the default, so an explicit declaration is no longer required.
For FPGA, the streaming_interface macro is removed. Use the streaming_interface kernel property instead.
For FPGA memory-mapped agent kernel invocation interfaces, the register map header files have changed. You must update or re-add your oneAPI designs in Platform Designer and update any code that refers to the macros included in the register map header file. The changes are as follows:
- The register map header file name has changed. It is now register_map_offsets.h (instead of register_map_offsets.hpp).
- Some macro names have changed.
The -Xstarget=agilex compiler command flag is no longer supported. Use -Xstarget=agilex7 instead.

Bug Fixes

Fixed backend option passing issues with multiple targets.
Fixed triple in offload mismatch warning.
Fixed stack corruption in sycl::modf for sycl::half type.
Fixed nextafter(half) precision on host.
Corrected exception thrown when kernel is submitted with a large (unsupported) reqd_work_group_size attribute.
Adjusted range definition so it has a default constructor in accordance with the SYCL spec.
Added missing return statement in the multi_ptr<void..> conversion operator.
Optimized kernel fusion implementation so that some resources can be released earlier.
Fixed an integer overflow that happened when sub-buffer size exceeds INT_MAX.
Fixed std::enable_if typo for marray and multi_ptr.
Fixed an issue with reqd_work_group_size attribute values bigger than INT_MAX being incorrectly handled.
Fixed UB in group sort scratch memory access.
Fixed a crash when default-constructed accessor is passed to a kernel.
Stopped sycl_ext_oneapi_graph subgraph nodes from creating incorrect dependencies on subsequent usages of the subgraph.
Fixed ambiguous log2 call in host code.
OpenMP-related fixes:
Fixed an unsafe access in the offload runtime when unloading the library.
Fixed a target-simd issue with masked load/store intrinsics that caused JIT to crash.
Fixed a problem where firstprivate of an array of pointers was only copying the first element.
Fixed an offload performance issue related to dynamic reduction buffer support in atomic-free reduction.
Fixed a hang in offload reduction code caused by the reduction variable incorrectly mapped to host memory.
Fixed stability issues related to loop-rotation optimization aimed at reducing compile time of OpenMP code.
Improved OMPT support to distinguish the type of OpenMP work-sharing construct being traced.
Fixed incorrect alignment of some reduction variables that led to segfault.
Fixed a problem with tree-like user-defined reduction of non-POD variables that caused a segfault.
Fixed an issue where target data was using different device numbers for entry and exit, causing offload runtime to crash.
Fixed a correctness issue with offload where num_teams clause was not honored under atomic-free reduction.
Fixed memory leak in offload compilation during metadata generation for offload entries.
FPGA-related fixes:
For FPGA, fixed an issue where DSP utilization increased disproportionately as variable size increased for ac_fixed variables. DSP utilization is now proportional to ac_fixed size across all sizes.
In the FPGA Optimization report, nodes that represent RTL libraries are now rendered correctly. Specifically, inputs to the RTL library now show edges going into the RTL library node.
For FPGA, fixed an issue where taking the modulo of an ac_int variable by a constant whose width is larger than the ac_int variable resulted in zero. Taking the modulo in this situation now returns correct results.

Special Flag -fpreview-breaking-changes

The 2024.1 release introduces a special compiler flag, -fpreview-breaking-changes. When -fpreview-breaking-changes is specified, the compiler driver and compilation phases break backward compatibility, but users can access and evaluate features that are not yet enabled by default in the current release. When this option is used in conjunction with -fsycl, the driver links against an alternate form of libsycl known as "libsycl-preview." The following changes are enabled by -fpreview-breaking-changes:

Provides new SYCL math builtins.
Fixes sycl::vec ABI inconsistencies and makes it trivially copyable.
Changes the return type of sycl::vec::operator!.
Updates sycl.hpp so that it no longer includes <complex> and <cmath>.
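
One practical consequence of the last item is that code which relied on sycl.hpp to pull in <cmath> or <complex> may need to include those headers directly when built with -fpreview-breaking-changes. This is an inference from the item above rather than documented migration guidance, and the function names below are illustrative only. The sketch is plain C++ and does not depend on SYCL.

```cpp
#include <cassert>
#include <cmath>    // std::sqrt, std::fabs: included explicitly
#include <complex>  // std::complex, std::abs: included explicitly

// Euclidean length of a 2D vector; needs <cmath>.
inline double vector_length(double x, double y) {
  return std::sqrt(x * x + y * y);
}

// Magnitude of a complex number; needs <complex>.
inline double magnitude(const std::complex<double> &z) {
  return std::abs(z);
}

// Usage sketch: both results are 5 up to rounding.
inline void explicit_include_demo() {
  assert(std::fabs(vector_length(3.0, 4.0) - 5.0) < 1e-12);
  assert(std::fabs(magnitude(std::complex<double>(3.0, 4.0)) - 5.0) < 1e-12);
}
```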

Deprecations

Deprecated experimental functions for ESIMD: quot, div, mod, dp*, line, lrp.
Removed ESIMD Emulator plugin.
Deprecated sycl::abs(genfloat).
Online compiler has been deprecated and replaced by kernel compiler with several improvements.

Known Issues and Limitations

A MESA OpenCL implementation that provides no devices on a system may cause incorrect device discovery. As a workaround, disable such an OpenCL implementation by removing /etc/OpenCL/vendor/mesa.icd.
-fsycl-dead-args-optimization cannot eliminate the offset of an accessor even when the accessor is created with no offset specified.
SYCL 2020 barriers show worse performance than SYCL 1.2.1 barriers.
When using fallback assert in a separate compilation flow, you must explicitly link against lib/libsycl-fallback-cassert.o or lib/libsycl-fallback-cassert.spv.
Alignment of allocation requests is limited to 64KB, which is the only alignment supported by Level Zero.
User-defined functions with the name and signature matching those of any OpenCL C built-in function (i.e. an exact match of arguments, return type doesn't matter) can lead to Undefined Behavior.
A DPC++ system that has FPGAs installed does not support multi-process execution. Creating a context opens the device associated with the context and places a lock on it for that process. No other process may use that device. Some queries about the device through device.get_info<>() also open up the device and lock it to that process since the runtime needs to query the actual device to obtain that information.
The format of the object files produced by the compiler can change between versions. The workaround is to rebuild the application.
Using sycl::kernel_bundle API to refer to a kernel defined in another translation unit leads to undefined behavior.
Linkage errors with the following message: error LNK2005: "bool const std::_Is_integral<bool>" (??$_Is_integral@_N@std@@3_NB) already defined can happen when a SYCL application is built using MS Visual Studio 2019 version below 16.3.0 and user specifies -std=c++14 or /std:c++14.
Printing internal defines isn't supported on Windows.
The support of accessor and local_accessor for ESIMD is still limited compared to SYCL.
sycl_ext_oneapi_root_group implementation has the following limitations: max_num_work_group_sync query always returns 1 and the implemented barrier has work group scope since all work items in a root group are currently in the same work group.
Feeding ONEAPI_DEVICE_SELECTOR bad input, such as ONEAPI_DEVICE_SELECTOR="", causes SYCL runtime to not select or initialize any device at all.
When passing Level Zero host/shared memory to the SYCL Level Zero backend as native buffers, if this memory is to be used on a SYCL device which is a Discrete GPU, the original memory allocated will not be updated after the SYCL buffer goes out of scope. One must setup a host accessor for the new SYCL buffer to copy back the memory to the host before the SYCL buffer is out of scope in order to retrieve the memory changed on the device.
If users attempt AOT compilation on Windows and compile with the -fno-sycl-rdc option, they might see a binary that is larger than expected.
OpenMP known issues:
OpenMP offload optimization reports are missing some “constructs transformed” messages.
The collapse clause is not yet supported for triangular loops, but the compiler does not emit an error message.
Prefetching in every loop iteration in offload code may sometimes cause incorrect results due to inconsistent address spaces.
An unsafe optimization across barriers may occur in some corner cases, resulting in incorrect results.
Offloading MKL or SYCL kernels using interop objects created with OpenMP (pragma omp dispatch) could result in a segmentation fault when the program is exiting. This only occurs on Intel® Data Center GPU Flex Series and not on Intel® Data Center GPU Max Series.
FPGA known issues:
For FPGA host pipes that map to streaming interfaces with sideband signals, you must configure your StreamingBeat type with the use_empty template parameter set to true whenever the width of the _data_t template parameter does not match the bits_per_symbol property of the associated host pipe. While this condition is enforced when you compile for simulation or hardware, it is not enforced when you compile for emulation.
For FPGA devices, you might run into performance issues when using switch statements instead of if statements. If the cases of the switch statement access external memories at different buffer_locations then the compiler might not be able to resolve the address space to the loads and stores, which results in the creation of extra loads and stores to dynamically resolve the address space at run time.
If you encounter these issues, use if statements instead of switch statements.
The atomic_ref class is not supported for FPGA devices.
In the FPGA SYCL* HLS flow, the compiler might generate a wider than requested address bus for the Avalon MM Host interfaces when the ring interconnect is used to connect the LSUs. You can ignore the extra MSBs on the bus by leaving them unconnected.
For FPGA devices, using the annotated_ptr class on a struct data type can result in the compiler not capturing the property information on the annotated_ptr class.
For example, consider using annotated_ptr to constrain a linked list node and its next node to different buffer locations. In the following code example, the compiler may not capture buffer_location<2> for p2, and therefore it creates a store unit to access each of the global memories for the last assignment:
```
struct ListNode {
  int data;
  ListNode *next; 
};
struct MyIP {
  ListNode *head;
  MyIP(ListNode *_head) : head(_head) {}
  void operator()() const {
    annotated_ptr<ListNode, decltype(properties{buffer_location<1>})> p1(head);
    const ListNode &n1 = *p1;
    
    annotated_ptr<ListNode, decltype(properties{buffer_location<2>})> p2(n1.next);
    ListNode &n2 = *p2;
    n2.data = 5;
  }
};
```
Designs with host pipe reads and writes in an unrolled loop cause a compiler error message that contains text similar to the following text:
…pipe 'acl_c_MyID_pipe_channel' must be accessed from both endpoints…
If you receive this error message, unroll the loop manually to resolve this error. When compiling for emulation, you might not receive an error message for this issue.
When you use the -fsycl-device-code-split=per_kernel compiler command option for a design that launches and collects multiple kernels, the first kernel that is returned provides correct results. However, subsequent kernels may intermittently return incorrect results.
When compiling for FPGA simulation, the <project_name>.prj folder is missing the following files and folders. To generate the missing files and folders, compile for FPGA hardware.
- The ip folder is missing.
- The top_<project_name>_di.ip file is missing.
In the FPGA SYCL* HLS flow, when you compile to hardware with Quartus® Prime Version 22.2 or earlier and your design has more than 32 kernels per device image, you will hit a failure in qsys-generate.
In later versions of Quartus® Prime software, this limit is increased to 2048 kernels per device image.
The atomic_fence function has the following constraints on FPGA devices:
- The memory_scope::system constraint is not supported, and its use is not diagnosed. The compiler treats the memory_scope::system constraint as a memory_scope::device constraint because that is the broadest scope supported for FPGA.
- Queries to atomic_fence_scope_capabilities do not reflect the FPGA support for the atomic_fence function. When you query the SYCL* atomic fence capabilities with <device>.get_info<sycl::info::device::atomic_fence_scope_capabilities>() calls, the FPGA runtime returns the following support:
  - sycl::memory_scope::work_item
  - sycl::memory_scope::sub_group
  - sycl::memory_scope::work_group
    This list is missing the supported sycl::memory_scope::device constraint.
For FPGA pipelined kernels in simulation, the reported II may not reflect the lowest II achievable by the hardware because the runtime cannot feed data to the simulator fast enough. One possible workaround which allows lower II to be achieved is to use pipelined kernels with streaming arguments only. If wall clock time is not a restriction, using the -Xsghdl=0 compiler command option should slow down the simulator sufficiently for the runtime to feed it data at the lowest achievable II.
When applying memory attributes such as the [[intel::fpga_register]] attribute to member variables of structs, you might get an error message similar to the following example error message:
Unable to implement variable "var" in registers as requested since the compiler was not able to break up the struct in which this member variable with the register attribute resides.
You can avoid this error message in the following ways:
- Break up the struct or move the member variables that you want to apply the attribute to outside of the struct definition.
- If you want all member variables of the struct implemented in registers, apply the [[intel::fpga_register]] attribute to the instantiation of the struct instead of the struct definition.
- Avoid default initializing struct or class member variable arrays (for example, int arr[10] = {}). Instead, manually initialize the array in the struct or class member variable (in internal memory), and always access the array with square bracket notation (that is, use arr[i] instead of *(arr + i)).
You cannot use both accessors and USM pointers in the same SYCL HLS design. If you do, you receive the following error message:
Compiler Error: Redeclared MM Host interface ID 0 conflicts with another.
The depth that is reported in the system viewer for pipes that link two task_sequences may be incorrect. To check the actual depth of the pipe, grep the RTL for a module with the name *iowr*<pipe_name>*fifo_inst (where * is a wildcard), then check the FIFO_DEPTH parameter of the corresponding acl_stream_fifo module that is being instantiated.
Converting an ap_float number to an ac_fixed data type in SYCL device code in the form of ApFloatT x = (AcFixedT) y; may produce incorrect results in the FPGA emulation flow. This type of conversion works correctly in FPGA simulation and hardware compilation flows.
A DPC++ system that has FPGAs installed does not support multiprocess execution.
A DPC++ program that runs kernels on one or more FPGA devices does not support multithreaded execution.
When doing a multistep compilation for FPGA where the object file name and the executable file name differ, the second compilation fails.
For example, the following compilation sequence fails:
icpx -fsycl -fintelfpga -fsycl-link=early -Xshardware -Xsboard=pac_a10 vector_add.cpp *.cpp -o A.a
icpx -fsycl -fintelfpga -Xshardware -Xsboard=pac_a10 A.a -o B.exe
To prevent this, use the same file name for both compilations:
icpx -fsycl -fintelfpga -fsycl-link=early -Xshardware -Xsboard=pac_a10 vector_add.cpp *.cpp -o B.a
icpx -fsycl -fintelfpga -Xshardware -Xsboard=pac_a10 B.a -o B.exe
In the FPGA acceleration flow, pipes that use protocol<protocol_name::avalon_mm_uses_ready> produce a compiler error. To prevent this error, pass data between host and kernel (while it is running) through a different method, such as host USM.
For Intel Stratix 10 FPGA reference boards, a rare failure can occur when initializing internal memory where the memory is initialized into an unknown state that can cause unexpected behavior.
As a workaround, compile your design with the -Xsbsp-flow=flat compiler option to avoid this issue.
For large FPGA simulations, such as those that target Agilex™ 7 boards, you might receive a linker error that contains a PC-relative offset overflow message.
If you receive this message, compile your simulation with the -fsycl-link-huge-device-code compiler command option.
For FPGA kernels, when you specify the no-interleaving=default property, the buffers are not burst-interleaved. Their addresses are still assigned according to the mem_channel properties.
The expected behavior is that the mem_channel property should be ignored, and buffers should be burst-interleaved when no-interleaving flag is not provided.
There is currently no workaround.
For FPGA devices, channel widths are limited to 4096 bits in Quartus® Prime Version 21.4 and earlier. If you exceed the channel width limitation, you receive an error like the following message:
<name>_pipe_channel_read: dataBitsPerSymbol 5120 is out of range: 1-4096
To avoid this error, ensure that you use Quartus® Prime Version 22.1 or later.
In the FPGA Optimization Report, designs with multiple lambda kernels will report inaccurate results unless the lambda kernels are all given unique names. For information about how to name lambda kernels, refer to "Suggested Coding Styles" in the Intel oneAPI FPGA Handbook (https://www.intel.com/content/www/cn/zh/docs/oneapi-fpga-add-on/developer-guide/current/suggested-kernel-coding-styles.html).
For lambda kernels generated in a loop, use templated classes to give the kernels procedurally generated names.
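The idea of procedurally generated kernel names can be sketched in standard C++17 as follows. The names StageKernel, launch, and launch_all are hypothetical; launch is only a stand-in for a SYCL submission such as a single_task call that takes the kernel name as its template argument, so the sketch compiles without a SYCL toolchain.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

// Declared but never defined: the template is used purely as a name, and
// each ID yields a distinct type.
template <std::size_t ID> class StageKernel;

// Stand-in for a SYCL submission that takes the kernel name as a template
// argument. Here it simply runs the callable on the host.
template <typename KernelName, typename Func>
void launch(Func &&f) {
  f();
}

// Compile-time "loop": expands to one launch per ID, each with its own
// unique kernel name type StageKernel<ID>.
template <std::size_t... IDs>
int launch_all(std::index_sequence<IDs...>) {
  int launched = 0;
  (launch<StageKernel<IDs>>([&launched] { ++launched; }), ...);
  return launched;
}

// Usage sketch: four lambdas, four distinct names.
inline void kernel_name_demo() {
  static_assert(!std::is_same<StageKernel<0>, StageKernel<1>>::value,
                "each ID must produce a different name type");
  assert(launch_all(std::make_index_sequence<4>{}) == 4);
}
```

Because every lambda is tied to a different StageKernel<ID> type, a report that keys on kernel names can tell the kernels apart.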
For ap_float data types, the -fp-model=fast compiler command option does not enable dot product inference. There is currently no workaround for this issue.
For simulation in the FPGA Acceleration flow, a design that uses USM pointers to be allocated to <global_mem_name> but has no annotation or buffer location property that informs the compiler of where it is to be allocated fails at run time with an error when the board_spec.xml file for the target board meets the following conditions:
- It defines more than one device global memory
- The difference between the minimum and the maximum addresses across all the global memories is larger than the size of some global memory named <global_mem_name>
- The minimum address of <global_mem_name> is larger than the size of itself
  The error is similar to the following error message:
  Error: Out of bounds memory write attempted at address <some_address>, for <size> bytes, max_size = <global_mem_name_size>
  To work around this error, add an annotation or buffer location property to the USM pointer that specifies where it is to be allocated.
For FPGA, counting the leading zeros of an unsigned native integer type using a loop like the one in the following example can lead to a compiler error such as Compiler Error: undefined reference to 'llvm.ctlz.iN':
```
unsigned int leading_zeros = 0;
while (number) {
  leading_zeros += 1;
  number >>= 1;
}
```
You can work around this issue by using the built-in function to count the leading zeros: __builtin_clz(unsigned) or __builtin_clzll(unsigned long long). When counting the leading zeros of unsigned char or unsigned short using the built-in functions, deduct the number of bits extended during type conversion from the return value.
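A minimal sketch of that workaround follows. The wrapper names (leading_zeros_u32, leading_zeros_u16, leading_zeros_u8) are illustrative, and the explicit zero check is an addition made here because the result of __builtin_clz is undefined for an argument of 0.

```cpp
#include <cassert>
#include <climits>

// Leading zeros of an unsigned int. __builtin_clz(0) is undefined, so the
// zero case returns the full bit width explicitly.
inline int leading_zeros_u32(unsigned int x) {
  const int width = static_cast<int>(sizeof(unsigned int) * CHAR_BIT);
  return x == 0 ? width : __builtin_clz(x);
}

// unsigned short is widened to unsigned int for the call, so subtract the
// bits that the widening added.
inline int leading_zeros_u16(unsigned short x) {
  const int extended = static_cast<int>(
      (sizeof(unsigned int) - sizeof(unsigned short)) * CHAR_BIT);
  return leading_zeros_u32(x) - extended;
}

// Same adjustment for unsigned char.
inline int leading_zeros_u8(unsigned char x) {
  const int extended = static_cast<int>(
      (sizeof(unsigned int) - sizeof(unsigned char)) * CHAR_BIT);
  return leading_zeros_u32(x) - extended;
}

// Usage sketch, assuming CHAR_BIT == 8.
inline void leading_zeros_demo() {
  assert(leading_zeros_u8(1) == 7);     // 0000 0001
  assert(leading_zeros_u8(0x80) == 0);  // 1000 0000
}
```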
On Windows, compiling FPGA designs in a directory with a long path name might fail, and you might see the following error:
```
dpcpp: error: fpga compiler command failed with exit code 1 (use -v to see invocation)
NMAKE : fatal error U1077: '…\oneAPI\compiler\latest\windows\bin\dpcpp.EXE' : return code '0x1'
```
As a workaround, either compile the design in a directory with a short path name or reset TMP and TEMP environment variables to point to a shorter path (for example, C:\temp).
When compiling for FPGA, the compiler might pack structs differently on Windows than on Linux. This difference can result in structs with members that might not be well-aligned for optimal memory accesses. As a result, some designs that compile with an II=1 on Linux might have, for example, II=10 on Windows.
As a workaround, force an alignment on the misaligned structs, as shown in the following example:
```
//Code with misaligned struct
struct Item {
  bool valid;
  int value1;
  unsigned char value2;
};

//Forced alignment of struct
struct Item {
  bool valid;
  bool __empty__[3];
  int value1;
  unsigned char value2;
  unsigned char __empty2__[3];
};
```
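One way to confirm that a padded struct has the same layout on both platforms is to check it at compile time. The sketch below is an assumption-laden illustration rather than part of the documented workaround: it assumes a 1-byte bool and a 4-byte int, and the names PaddedItem, pad0, and pad1 are hypothetical.

```cpp
#include <cstddef>  // offsetof

// Explicitly padded variant of the struct from the example above.
struct PaddedItem {
  bool valid;
  bool pad0[3];           // fills bytes 1..3 so value1 starts at byte 4
  int value1;
  unsigned char value2;
  unsigned char pad1[3];  // rounds the total size up to 12 bytes
};

// The build fails on any platform where the layout differs from the
// expected one, instead of silently producing a different II.
static_assert(offsetof(PaddedItem, value1) == 4, "value1 must start at byte 4");
static_assert(offsetof(PaddedItem, value2) == 8, "value2 must start at byte 8");
static_assert(sizeof(PaddedItem) == 12, "PaddedItem must occupy 12 bytes");
```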
On Windows, the standalone Intel® oneAPI FPGA Reports Tool application might fail to run on a mapped network drive and display the GPU process launch failed error message on the console. As a workaround for this issue, copy the Intel® oneAPI FPGA Reports Tool from the mapped network drive to your local computer and run it locally.
Due to a known issue pertaining to HTML files within the Jupyter Notebook, you cannot launch the FPGA Optimization Report in a Jupyter Notebook. As a workaround for this issue, either use the Intel oneAPI FPGA Reports Tool or copy the FPGA optimization reports directory to a local file system and launch it using a supported browser.
When compiling for FPGA, the runtime ignores the sycl::property::buffer::mem_channel buffer property. Irrespective of whether you specify the property or not, all buffer allocations are allocated to the first memory channel. Currently, there is no known workaround for this issue.
The FPGA SYCL* HLS encryption flow is not fully supported on Windows systems.
You might encounter functional failures in the FPGA emulation flow when resetting a device_global and a new device_image is loaded without the device_image scope property. Currently, there is no known workaround for this issue.

oneAPI 2024.0, Compiler Release 2024.0.3

Bug fixes for installation using the Conda package manager
Includes the fixes that are part of 2024.0.2

oneAPI 2024.0, Compiler Release 2024.0, 2024.0.1, 2024.0.2

New in this release

Compiler Release 2024.0.2

Bug fixes
Includes the fixes that are part of 2024.0.1.

Compiler Release 2024.0.1

SYCL Bindless textures have been fixed and will now work correctly on NVidia® hardware via the Codeplay® NVidia® plugin.
OpenMP* runtime has been updated to support Intel® Core™ Ultra devices.

oneAPI 2024.0, Compiler Release 2024.0

New Features and Improvements

The Level Zero driver switched the default from Intel® Data Center GPU Max Series "cards-as-devices" to "tiles-as-devices" in 2024.0.
Starting with this release, the Intel Level Zero and OpenCL GPU driver exposes multi-tile devices like Intel® Data Center GPU Max Series differently, which also affects the way these devices are exposed in SYCL and OpenMP. Prior to this change, each card was exposed as a root device and tiles were exposed as sub-devices. Now, each tile is exposed as a root device by default. This also affects how root devices can be partitioned into sub-devices. The old behavior can be enabled via the ZE_FLAT_DEVICE_HIERARCHY environment variable.
As a result, use of the environment variables ONEAPI_DEVICE_SELECTOR and ZE_AFFINITY_MASK may need to change because the number of root devices and the availability of sub-devices is different than in prior releases.
Enabled Pre-Compiled header (PCH) inclusion while performing host compilation with -fsycl.
Added support for the -ftarget-export-symbols option, which allows exporting symbols for AOT.
Added support for reqd_sub_group_size to sycl::is_compatible and implemented device code split based on reqd-sub-group-size optional kernel feature.
Added support for -ftarget-compile-fast for spir64_gen and JIT mode.
Implemented adding C++ libraries to the linker with -fsycl.
Added support for multiple call operators in a kernel functor.
Added support to propagate compile flags to device backend compiler.
Added new FPGA loop attribute enable_loop_pipelining.
Enabled sycl_ext_oneapi_annotated_arg and sycl_ext_oneapi_annotated_ptr experimental extensions.
Implemented sycl_ext_intel_queue_immediate_command_list extension.
Implemented sycl_ext_oneapi_copy_optimize experimental extension.
Added initial implementation of the experimental SYCL Graph extension sycl_ext_oneapi_graph.
Added dimensions member to item/range-like types.
Added support of queue::priority_* properties for OpenCL backend.
Implemented initial version of the SYCL Native CPU Plugin designed in SYCLNativeCPU.
Added __imf_max/min/hadd/fast_* functions to imf device libraries.
Introduced and implemented new sycl::ext::oneapi::experimental::info::device::architecture device descriptor and device::ext_oneapi_architecture_is(ext::oneapi::experimental::architecture) host API as part of sycl_ext_oneapi_device_architecture extension (for Level Zero and OpenCL).
Added experimental implementation of sycl_ext_intel_grf_size.
Experimental sycl_ext_oneapi_device_global extension is supported now.
Enabled media API that works with image accessors for ESIMD. Those accessors do not depend on stateful/stateless mode.
Allowed implicit conversion from std::experimental::simd_mask to ESIMD::simd_mask.
Added host_task image accessor support.
Started to support 64-bit offsets with accessors in stateless mode for ESIMD APIs.
Added __imf_llabs to the imf device libraries.
Added explicit conversion of multi_ptr<T> to multi_ptr<const T>.
Added support for scalar logical operators with group algorithms.
Implemented device query for 64-bit atomic support in ESIMD emulator.
Added support of 16-bit data for lsc_atomic_update and lsc_slm_atomic_update ESIMD API.
Added initial implementation of sycl_ext_oneapi_root_group. There are severe limitations: max_num_work_group_sync query always returns 1 and the implemented barrier has work group scope since all work items in a root group are currently in the same work group.
Added support for tf32 type using the unified interface for SYCL Matrix extension.
Implemented Host Pipes described in sycl_ext_intel_dataflow_pipes extension.
Added support for more math builtins for bfloat16 together with the extension specification sycl_ext_oneapi_bfloat16_math_functions update.
Added decorated async_work_group_copy overloads.
Added initial implementation of the Unified Runtime plugin and routed to it with SYCL_PREFER_UR.
Added support for accessors to atomic_update ESIMD API.
Enabled pretty-printing of reference objects by GDB.
Added Xmethods that help GDB to support reading local_accessors on GPU from SLM.
Enabled passing local accessors to ESIMD kernel and via invoke_simd() API, enabled usage of get_pointer() and operator[] for accessors in ESIMD kernel. The support of accessor and local_accessor is still limited compared to SYCL.
Implemented unpadding for 2d block load/store ESIMD API.
Added SYCL 2020 image classes.
Added interfaces for SYCL 2020 image accessors. Only the host accessor interfaces are currently functional.
Added XPTI notifications for SYCL 2020 images as well as for the associated accessor classes.
Started to print device aspects in sycl-ls --verbose.
Enabled group algorithms to be used with tangle_group and opportunistic_group arguments.
Implemented info::device::backend_version query.
Added fixed_size_group support to algorithms.
Added simple abs(int) to imf libdevice.
Added ballot_group support to algorithms.
Implemented interoperability support for images for the Level Zero backend.
Added marray support for math builtin functions.
Enabled inline assembly support in ESIMD kernels.
Enhanced interop queue interface to choose standard or immediate command lists.
Enabled double type for atomic_update() ESIMD API.
Added support for addc and subb operations for ESIMD.
Allowed zero-sized 3D accessors.
Deprecated -fsycl-link-huge-device-code in favor of a new option, -flink-huge-device-code. The new option is identical in functionality but is also allowed with -fopenmp-targets.
Optimized size of produced binaries when device_global is used.
Started to emit an error when PCH is triggered in SYCL mode.
Improved FPGA archive device unbundling with AOCO.
Moved imf abs to a separate device library for Deep Learning.
Fixed the bug report URL for DPC++.
Started to link with bfloat16 related device libraries only when they are used.
Started to properly pass -fsycl-esimd-force-stateless-mem to the host code compilation.
Combined ADLS and RPLS device architectures.
Implemented multi_ptr default to be legacy to avoid code break with SYCL 1.2.1.
Started to use aggregate initialization for group_local_memory arrays according to updated sycl_ext_oneapi_local_memory.
Deprecated ESIMD Emulator.
Deprecated ext::oneapi::sub_group.
Improved error message related to ext_intel_free_memory aspect.
Removed unnecessary barrier after region copy in the Level Zero backend.
Fixed get_info<device::free_memory> to check its aspect.
Removed the workaround for release of auxiliary buffers.
Enabled optimization for read-only buffer accesses from multiple devices in a context.
Removed old specialization constants extension and implementation.
Improved is_compatible to check whether a specific target is defined with -fsycl-targets and to adjust the result accordingly.
Improved quotient approximation for host sycl::remquo.
Allowed accessors constructed with zero-size buffers.
Improved resource recycling for the Level Zero backend by cleaning up events on in-order queue wait.
Added code_location parameter to the rest of sycl::queue methods, which improves error reporting.
Enabled xpti::node_create signal emit for parallel_for that bypasses graph.
Improved the accuracy of host sycl::cospi.
Replaced usage of outdated memset OpenCL API clEnqueueMemsetINTEL with clEnqueueMemFillINTEL.
Added memory pointer to XPTI memory allocation event metadata to allow more informative XPTI memory tracing.
Implemented recycling of immediate command lists for queues in a context for the Level Zero backend.
Optimized ext_oneapi_submit_barrier() for in-order queue for the Level Zero backend.
Changed the SYCL_PI_LEVEL_ZERO_USM_RESIDENT default to force device allocations only.
Started to report false for aspect::image on all devices.
Removed lambda parameter from "if_architecture_is" according to sycl_ext_oneapi_device_architecture.
Improved error reporting when reqd_work_group_size is not supported by a device.
Adjusted static restriction checks for block_2d APIs.
Disallowed local accessor misuse according to SYCL 2020.
Implemented passing address space information to SPIR-V Joint Matrix Load/Store intrinsics.
Enabled immediate command lists by default for the Level Zero backend.
Improved SYCL RT performance by removing extra map lookup for eliminated kernel arguments during enqueue of cached kernels.
Changed the default to SYCL_PI_LEVEL_ZERO_USM_RESIDENT=2.
Added heuristics to reduce overhead from immediate command list cleanup in the Level Zero backend.
Renamed cluster_group to fixed_size_group.
Added error for invalid uniform arguments in InvokeSIMD.
Reduced overhead from queue profiling by using steady_clock::now() to retrieve host time as piGetDeviceAndHostTimer has large overhead.
Implemented graceful handling of the case that Level Zero runtime was already unloaded when we do cleanup.
Improved error messages for InvokeSIMD.
Updated native_specialization_constant() for the case when no specialization constants are present.
Optimized reuse of command lists in the Level Zero backend.
Changed the behavior of sycl::maximum and sycl::minimum to be consistent with std::max and std::min.
Switched to sycl::exception instead of sycl::runtime_error in the majority of SYCL device headers. sycl::runtime_error is deprecated in SYCL 2020.
Added USM calls parameter verification layer to sycl-trace which provides improved diagnostic on accessing invalid device memory under USM.
Started to print supported SG sizes in sycl-ls --verbose.
Added support for the following FPGA variable-precision data type conversions:
- ac_int to ap_float
- ap_float to ac_int
- ac_fixed to ap_float
- ap_float to ac_fixed
Added command to extract FPGA hardware configuration file (.aocx) from your multiarchitecture binary.
Removed the -Xsfp-relaxed FPGA option. This option is no longer supported.
If you want an accumulator to be inferred even when using -fp-model=precise, rewrite your code to expose the accumulation.
Removed restriction on simulating FPGA host pipes with protocol_name::avalon_mm.
Replaced the -Xsoptimize=throughput option with the -Xsoptimize=throughput-area-balanced option.
Added the annotated_arg class. Use this class to create a memory-mapped host interface to your FPGA kernel.
Deprecated the [[intel::max_global_work_dim(0)]] kernel attribute. The compiler adds this attribute automatically for any single-task kernel, so adding this attribute explicitly is no longer required.
Deprecated the register_map_interface macro for FPGA kernels. A memory-mapped interface is now the default, so an explicit declaration is no longer required.
Added the streaming_interface FPGA kernel property to replace the streaming_interface macro. The macro is now deprecated.
Added the pipelined FPGA kernel property to replace the streaming_pipelined_interface macro. The macro is now deprecated.

Bug Fixes

Fixed uses_aspects to be applied to function declarations and not only function definitions.
Fixed ivdep attribute in template contexts.
Fixed option restriction for device with Windows.
Fixed handling unsupported attributes for ESIMD.
Fixed optimization option processing for device options.
Enabled proper behavior of optional kernel features with SYCL_EXTERNAL.
Fixed "SLM init call is supported only in kernels" message for slm_init() inside an ESIMD kernel when compiling with -O0 option.
Fixed memory leak in Base64::decode.
Fixed crash when using libraries with -fsycl.
Added predefines for /MDd to fix certain build issues when using reductions on Windows.
Fixed compilation issue when kernel name is a class defined with "final" keyword.
Fixed an issue with setting VCSLMSize attribute which was causing JIT compilation failure when using slm_init() in ESIMD kernel.
Fixed option for undefined symbols for hostdep link to work with gold linker.
Fixed directory field of DIFile.
Fixed the bug in the Level Zero backend where zeModuleDestroy is called although ext::oneapi::level_zero::ownership::keep is passed to make_kernel_bundle, leading to a double-free corruption.
Addressed specification mismatches for multi_ptr.
Fixed corner case when using short or char with exclusive scan.
Updated the legacy multi_ptr construction from local_accessor's to conform with the construction method used by other accessors.
Made get_pointer noexcept to comply with SYCL 2020.
Started to return the correct sycl::errc in case of invalid sycl::reqd_work_group_size.
Fixed sycl::remquo truncation error.
Fixed linkage errors when trying to get mem_channel property or check it using has_property().
Fixed dynamic loading of ocloc tool for online_compiler.
Fixed global memory reporting for Arc.
Fixed sycl::sub_group to follow by-value semantics according to SYCL 2020.
Fixed invalid value returned by event.get_info<sycl::info::event::command_execution_status>() for OpenCL backend.
Fixed assignment operator for ESIMD globals.
Fixed the Level Zero backend to not destroy build log on program build failure.
Fixed error appearing when validation layer is enabled for the Level Zero with ZE_DEBUG=6.
Allowed group algorithms to accept a function object with an explicit type.
Fixed sycl::is_compatible() for the case when vector of kernels ids is empty.
Fixed multi_ptr ctor for extended address spaces.
Fixed trivially_copyable to device_copyable for fill and copy.
Fixed range deductions in reduction parallel_for.
Adjusted multi_ptr deduction guides to comply with SYCL 2020.
Fixed gather/scatter with accessors when passing scalar for ESIMD backend.
Fixed lost data during implicit conversion in local and host accessors.
Fixed memory leak because of unnecessary kernel retain.
Fixed the pointer type contained by a read-only accessor to const.
Fixed max_sub_group_size query for devices without sub groups.
Removed undefined behavior in host-side abs_diff.
Fixed vec::as<vec<bool, N>>().
Fixed segfault when using ZE_DEBUG and an exception is thrown on device selection phase.
Removed an unnecessary multi_ptr conversion which had conflict with an existing one resulting in ambiguity on implicit conversion.
Fixed lost data in accessor implicit conversion.
Fixed stack smashing which happened in some device info queries on OpenCL backend.
Stopped shipping the Level Zero loader and headers with the DPCPP toolchain installation.
Corrected the Intel® Data Center GPU Max Series device id check in the Level Zero backend.
Fixed ambiguity for bit_cast.
Fixed nan/inf handling in sycl::stream when in fast-math mode.
Allowed for different types in group algorithms reduce, exclusive_scan, and inclusive_scan.
Fixed bug in get methods in the config_2d_mem_access class for 2D block stateless load/store API.
Fixed compilation error for sycl::fabs builtin in fast-math mode.
Allowed host_task deduction tags in placeholder accessors.
Added iterator operations for zero-dimension accessors to comply with SYCL 2020.
Fixed rounding issue in __imf_vavgs.
Fixed bug in fill operation for zero-dimensional accessor.
Fixed arguments passed to ocloc via -ftarget-compile-fast.
Implemented implicit conversion for local and host accessors to comply with SYCL 2020.
Started to throw an exception when an empty accessor calls require() to comply with SYCL 2020.
Fixed operator~ for sycl::vec<bool, N>.
Fixed sub_group shuffle for vectors of long long and half.
Fixed static destruction order issue in OpenCL extension fptr cache.
Fixed leak of active barriers' events in the Level Zero backend.
Implemented proper queries for aspect::ext_oneapi_srgb.
Added vec assignment from scalar and more vec modulus overloads to comply with SYCL 2020.
Fixed handling of mem_channel buffer property.
Fixed operator& and operator[] in local_accessor<const T>.
Fixed the third argument type for select builtin.
Fixed native_specialization_constant() API implementation.
Moved Level Zero specific collectors to dynamic libraries loaded by request in sycl-trace tool to support the case when Level Zero driver is not available.
Added missing marray relational functions: any, all, bitselect; fixed scalar select relational function; aligned scalar abs integer function with SYCL 2020 and fixed math functions which take multi_ptr argument.
Added operator[] and element_type to swizzle vec to align with SYCL 2020.
Fixed buffer range in atomic_memory_order_acq_rel.
Started to throw for invalid global_work_size query.
Fixed the Level Zero backend to report events as submitted, not running, until they are completed.
Allowed raw send ESIMD API to use non-standard types like sycl::half.
Started to throw for invalid info::kernel::num_args query to comply with SYCL 2020.
Updated group_broadcast to support vec types.
Fixed reductions to avoid implicit atomic64 requirements.
Added a partial profiling workaround for acc devices because queue profiling is no longer supported for OpenCL version < 2.1 after recent changes regarding command submit profiling info.
Fixed program build API for the Level Zero backend to retain the build log when program build failed.
Implemented missing std::hash specializations for local_accessor and host_accessor to comply with SYCL 2020.
Fixed global_work_size kernel query.
Fixed incorrect sycl::vec<bool, N> constructor behavior.
Fixed weak_object for host_accessor and stream.
Fixed incorrect behaviors in some operations using sycl::vec<bool, N>.
Fixed integration footer for specialization_id.
Fixed compilation break occurring when bfloat16 constructor is used in a kernel.
Fixed crash in subgroup info queries when running on OpenCL backend which doesn't support subgroups.
Added missing support for target::host_task specialized accessor constructor using mode_target Tag.
Fixed identity-less reductions with unwritten reducers.
Fixed dangling pointer issue in xpti tracing.
Fixed PI event leak in memcpy2d device-host fallback.
Fixed weak_object and owner_less for device objects.
Added noexcept specifier to vec::byte_size to comply with SYCL 2020.
Fixed undefined behaviour in vector printf specifier.
Fixed handler::fill so it works even if pattern is not supported by a backend natively.
Fixed mechanism to throw exception when placeholder accessor passed to a command.
Fixed empty zero-dimensional accessor access range.
Fixed incorrect write back in a case when a sycl::buffer is constructed with a const T* as host data.
Fixed empty accessor default constructor to not create a placeholder to comply with SYCL 2020.
Removed deprecated piclCreateProgramWithSource.
Removed deprecated barrier API.
Removed deprecated interop_task.
Removed deprecated sycl::group_local_memory.
Removed deprecated sycl::detail::bitcast.
Removed deprecated piEnqueueNativeKernel.
Removed deprecated backend enum values: level_zero, esimd_cpu.
Removed non-standard RT namespace from sycl namespace.
Removed deprecated ESIMD APIs.
Removed DISABLE_SYCL_INSTRUMENTATION_METADATA macro and _CODELOC* macro usage from API.
Removed getOSModuleHandle usage.
Removed deprecated sycldevice triple support.
Dropped support for sycl_ext_oneapi_extended_atomics extension and sycl_ext_oneapi_group_algorithms extension.
Removed unneeded backwards compatibility of make_queue and get_native.
Removed support for binaries generated for triples with "sycldevice" environment component.
Updated sycl::exception to be SYCL 2020 compliant.
Replaced deprecated sycl::runtime_error with SYCL 2020 compliant sycl::exception in SYCL RT.
Renamed win_proxy_loader to pi_win_proxy_loader.
Promoted the return type changes of SYCL relational builtins that changed between SYCL 1.2.1 and SYCL 2020 out from the guard of SYCL2020_CONFORMANT_APIS.
Fixed get_pointer to return T* for target::device specialized accessor according to specification.
Fixed max_work_item_sizes return type from id to range according to SYCL 2020.
Removed the deprecated experimental set_kernel_properties API and the use_double_grf/use_large_grf properties. Use the new API provided in the sycl_ext_intel_grf_size extension instead.
Fixed an issue preventing the support of FPGA Task Sequence functions with struct return values.
Fixed an FPGA issue causing an assert message for the loop with local memory LSUs on the loop’s II critical path due to loop-carried memory dependency when compiling a hyper-optimized loop using the [[intel::max_reinvocation_delay]] FPGA loop attribute.
Fixed an issue where FPGA RTL libraries cause an Intel® Quartus® Prime compilation of a SYCL* HLS IP core to fail late in the compilation process.
Fixed an FPGA issue where the hw.tcl file that is generated as part of the SYCL* HLS flow did not map signals correctly.
Fixed an error that occurred when you use FPGA mmhost macros on kernel arguments that are used inside a lambda within the kernel function. In addition to fixing this error, the mmhost macros are now deprecated.
Fixed an FPGA issue where the modulefiles-setup.sh script was not supported for FPGA.

Known Issues and Limitations

There is a known issue with bindless textures in SYCL. The team is working to resolve it, and a patch release (2024.0.1) will be posted to address it.
Copy operations inside a graph node of Intel® Arc™ Graphics can fail because of a possible synchronization issue that also affects accessors.
Having MESA OpenCL implementation which provides no devices on a system may cause incorrect device discovery. As a workaround such an OpenCL implementation can be disabled by removing /etc/OpenCL/vendor/mesa.icd.
-fsycl-dead-args-optimization can't help eliminate the offset of an accessor even though it's created with no offset specified.
SYCL 2020 barriers show worse performance than SYCL 1.2.1 barriers do.
When using fallback assert in separate compilation flow it requires explicit linking against lib/libsycl-fallback-cassert.o or lib/libsycl-fallback-cassert.spv.
Alignment of allocation requests is limited to 64KB, which is the only alignment supported by Level Zero.
The format of the object files produced by the compiler can change between versions. The workaround is to rebuild the application.
Using sycl::kernel_bundle API to refer to a kernel defined in another translation unit leads to undefined behavior.
Linkage errors with the following message: error LNK2005: "bool const std::_Is_integral<bool>" (??$_Is_integral@_N@std@@3_NB) already defined can happen when a SYCL application is built using MS Visual Studio 2019 version below 16.3.0 and user specifies "-std=c++14" or "/std:c++14".
Printing internal defines isn't supported on Windows.
The support of accessor and local_accessor for ESIMD is still limited compared to SYCL.
sycl_ext_oneapi_root_group implementation has the following limitations: max_num_work_group_sync query always returns "1" and the implemented barrier has work group scope since all work items in a root group are currently in the same work group.
Unnecessary handling of devices may cause overhead and may lead to unintended interactions with buffers. SYCL Graph finalizes for each device associated with a context, rather than only for the specific user-defined device.
Applications running on "pre-Intel® Data Center GPU Max Series" GPUs using larger-than-4GB allocations need to set the following environment variable so Intel Graphics Compiler (IGC) can create the correct compilation unit: SYCL_PROGRAM_COMPILE_OPTIONS=-ze-intel-greater-than-4GB-buffer-required
Not using this environment variable on applications using larger-than-4GB allocations may produce unexpected behavior. Using larger-than-4GB allocations on "pre-Intel® Data Center GPU Max Series" GPUs disables certain pointer arithmetic optimizations added by the graphics compiler, so a difference in performance in the application may be seen.
If a user builds a SYCL program with the '-g' option and runs it with a gcc version greater than 13.1.0 on a CPU device, the program may crash during the shutdown stage. Although this issue is not an Intel® product defect, users should not run a SYCL program built with the '-g' option on a CPU device if the gcc version used is greater than 13.1.0.
In 2024.0 release, we have fine-tuned the Address, Leak, and Thread sanitizers (triggered with the compiler flag -fsanitize) to focus on detecting issues in a user’s program, as opposed to those, whether legit or not, in the various runtime libraries that the DPC++ runtime depends on. Note that for the Thread Sanitizer, it may occasionally detect issues in SYCL header files while we continue working on fixing or suppressing those issues in future releases. The Memory sanitizer, on the other hand, will still report issues in the runtime libraries before reaching the user code. Note that these sanitizers only sanitize the host-side code in your DPC++ program; the kernels are not sanitized.
The DPC++ 2024.0 compiler has been updated to conform to a change in the SYCL 2020 specification, but this update can break some SYCL programs that work fine with DPC++ 2023.0. Specifically, programs that use sycl::minimum or sycl::maximum can be affected. The reason is that before, sycl::maximum was defined using operator>. Now, it's defined using operator< to be consistent with std::max.
Using sycl::image L0 interop when compiling at -O0 might result in issues. To work around this, use a different optimization level.
Starting from CMake 3.27, CMake support for Intel® oneAPI DPC++/C++ Compiler and Intel® Fortran Compiler has been updated to use the compiler driver for linking instead of the linker. This change enables use cases for building SYCL applications and libraries and enables Interprocedural Optimization (IPO) on Windows*. Also starting from CMake 3.25, due to limitations in the way CMake tracks linker flags, a CMake project with mixed C/C++ and Fortran code where one of the compilers is LLVM-based (e.g. icx, ifx) and another is the classic version (e.g. icc, ifort) will result in invalid linking flags. As an example, CMake cannot build a project using "icx" as the C/C++ compiler and "ifort" as the Fortran compiler. A workaround is to use only LLVM-based compilers when building mixed language applications.
A DPC++ system that has FPGAs installed does not support multiprocess execution.
A DPC++ program that runs kernels on one or more FPGA devices does not support multithreaded execution.
When compiling for FPGA emulation, debug support on Windows is unavailable when using device-side libraries.
On Windows, compiling FPGA designs in a directory with a long path name might fail, and you might see the following error:
```
dpcpp: error: fpga compiler command failed with exit code 1 (use -v to see invocation)
NMAKE : fatal error U1077: '…\oneAPI\compiler\latest\windows\bin\dpcpp.EXE' : return code '0x1'
```
As a workaround, either compile the design in a directory with a short path name or reset TMP and TEMP environment variables to point to a shorter path (for example, C:\temp).
When using the atomic_fence function for FPGA, the memory_scope::system constraint is not supported, and its use is not diagnosed. The compiler treats the memory_scope::system constraint as a memory_scope::device constraint because that is the broadest scope supported for FPGA.
When compiling for FPGA, the compiler might pack structs differently on Windows than on Linux. This difference can result in structs with members that might not be well-aligned for optimal memory accesses. As a result, some designs that compile with an II=1 on Linux might have, for example, II=10 on Windows.
As a workaround, force an alignment on the misaligned structs, as shown in the following example:
```
//Code with misaligned struct
struct Item {
  bool valid;
  int value1;
  unsigned char value2;
};

//Forced alignment of struct
struct Item {
  bool valid;
  bool __empty__[3];
  int value1;
  unsigned char value2;
  unsigned char __empty2__[3];
};
```
The FPGA optimization report reports incorrect area utilization data from Quartus compiles for Intel® Quartus® Prime Pro Edition software versions 23.1 and later. Currently, there is no known workaround for this issue.
On Windows, the standalone Intel® oneAPI FPGA Reports Tool application might fail to run on a mapped network drive and display the GPU process launch failed error message on the console. As a workaround for this issue, copy the Intel® oneAPI FPGA Reports Tool from the mapped network drive to your local computer and run it locally.
Due to a known issue pertaining to HTML files within the Jupyter Notebook, you cannot launch the FPGA Optimization Report in a Jupyter Notebook. As a workaround for this issue, either use the Intel oneAPI FPGA Reports Tool or copy the FPGA optimization reports directory to a local file system and launch it using a supported browser.
When compiling for the FPGA optimization report flow, the list of optimization flags used for a compile may be incomplete or unavailable in the FPGA Reports Summary Page under certain circumstances, such as when using the -g0 flag. As a workaround for this issue, avoid using the -g0 flag in your compilation command. Also, if you use the -ghdl option, ensure that it is the last argument in your command.
When generating FPGA optimization reports, the compiler might crash for any design with pipes having a capacity 0. If only a few pipes (but not all pipes) have a capacity of 0 in the design, then only those with a capacity will appear in the Area report. As a workaround for the compiler crash, assign a capacity (for example, 1) to one of the pipes with capacity 0.
When compiling for FPGA, if you specify output target names that are pure numbers or that start with a number, the compiler errors out and might display an error message, as shown in the following example:
```
icpx -fsycl -fintelfpga -Xssimulation basic.cpp -o 2
aoc: Compiling for Simulator.
Error: Simulation system generation FAILED.
Refer to 2.prj/2.log for details.

llvm-foreach:
icpx: error: fpga compiler command failed with exit code 1 (use -v to see invocation)
```
The error message is not representative. As a workaround for this issue, specify target names that are not pure numbers and that do not start with a number.
When compiling for FPGA, the compiler ignores the sycl::property::buffer::mem_channel buffer property. Irrespective of whether you specify the property or not, all buffer allocations are allocated to the first memory channel. Currently, there is no known workaround for this issue.
When running pipelined kernels in the FPGA simulation flow, the simulation runtime may not launch kernel invocations fast enough to achieve the lowest possible kernel initiation interval that the generated RTL can achieve. Currently, there is no known workaround for this issue.
When compiling for FPGA, the compiler might ignore non-RTL source library functions when the library archive file also contains RTL source objects and report the following error message:
```
Compiler Error: undefined reference to <non-RTL source library function>
```
As a workaround for this issue, avoid placing RTL and non-RTL source library objects in the same archive file.
The compiler is not constrained to the specified LSU style when requesting a particular LSU style using the FPGA LSU controls for a struct data type. Instead, it chooses the best LSU style for the access pattern. As a workaround for this issue, avoid using LSU controls with the struct data type and use simple data types instead.
The FPGA SYCL HLS encryption flow is not fully supported on Windows systems.
When compiling for FPGA, the compiler might crash if any global memory in the board_spec.xml file does not have a name field. Ensure that all global memories in the board_spec.xml file have a name field. For example, <global_mem name="DDR" ... >

When compiling for FPGA and linking multiple fat static libraries containing the device code (produced using the -fsycl-link=image flag), only the device code from the first library is included in the fat executable, and the following error message is returned:


```
what(): native api failed. native api returns: -46 (pi_error_invalid_kernel_name)
terminate called after throwing an instance of 'sycl::_v1::exception
```

As a workaround for this issue, dynamically link the host code instead of linking statically.
Example compilation commands:
```
icpx -fsycl main.cpp -c -o main.o
icpx -fsycl -fintelfpga -fpic -shared add_kernel.cpp -o libadd_kernel.so
icpx -fsycl -fintelfpga -fpic -shared sub_kernel.cpp -o libsub_kernel.so
icpx -fsycl -fintelfpga main.o -L. -ladd_kernel -lsub_kernel -o hot_swapper
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:. ./hot_swapper
```

If your design includes a device_global memory greater than 1024 bits in size and you have not initialized it in the kernel, then you might see incorrect behavior when compiling for the simulator. Memory size greater than 1024 bits can happen due to the following reasons:
- device_global is an array with a size greater than 1024 bits.
- device_global is a scalar that uses ac_int (or other large types) with a size greater than 1024 bits.
  
  This is caused by a bug in the Intel Quartus Prime Pro Edition software that occurs when initializing memory using a MIF file. As a workaround for this issue, zero-initialize the contents of the device_global memory before accessing the memory.
You might encounter functional failures in the FPGA emulation flow when resetting a device_global and a new device_image is loaded without the device_image scope property. Currently, there is no known workaround for this issue.
Using atomic_ref in FPGA flows to access memory data requires much more width on the interface than the width of the datatype.
The FPGA mmhost macro does not work in kernel lambdas. The mmhost macro is deprecated and will be removed in a future release.
When doing a multistep compilation for FPGA where you create an FPGA early image file (.a) that you later compile to a multiarchitecture binary and you change the file name between the two compilations, the second compilation will fail.
For example, the following compilation sequence fails:
```
icpx -fsycl -fintelfpga -fsycl-link=early -Xshardware -Xsboard=pac_a10 vector_add.cpp *.cpp -o A.a
icpx -fsycl -fintelfpga -Xshardware -Xsboard=pac_a10 A.a -o B.exe
```
To prevent this, use the same file name for both compilations:
```
icpx -fsycl -fintelfpga -fsycl-link=early -Xshardware -Xsboard=pac_a10 vector_add.cpp *.cpp -o B.a
icpx -fsycl -fintelfpga -Xshardware -Xsboard=pac_a10 B.a -o B.exe
```
For FPGA compilation, when you use the annotated_arg class in SYCL* HLS kernels, the address width of the lowest number buffer location must be at least 11 bits wide. Otherwise, the compiler returns an error.
For FPGA compilation, the amount of memory that can be allocated through a single call of a SYCL* allocation API to a buffer_location is limited by the total size of the lowest numbered buffer location.

To avoid a null pointer being returned by any of the SYCL* allocation APIs due to this limitation, set the address width of the lowest numbered buffer_location to at least ceil(log2(the largest allocation size made + 1024)) bits. In hardware, tie the unused address bits to 0 if on input, and leave them dangling if on output.

The following allocation APIs are affected:
- sycl::malloc_shared
- sycl::malloc_host
- sycl::malloc_device
- sycl::aligned_alloc_shared
- sycl::aligned_alloc_host
- sycl::aligned_alloc_device
When running FPGA pipelined kernels in simulation flow, you might not achieve the lowest II in the waveforms.
In the FPGA acceleration flow, pipes that use protocol<protocol_name::avalon_mm_uses_ready> produce a compiler error. To prevent this error, pass data between host and kernel (while it is running) through a different method, such as host USM.
For Intel® FPGA PAC D5005, previously known as Intel® PAC with Intel Stratix® 10 SX FPGA, there is a known issue where a sequence of store operations in your FPGA kernel followed by a load operation produces incorrect results. The PAC BSP has been discontinued. Contact your Intel representative for the availability of any replacement BSPs for the Intel FPGA PAC D5005.
For example, in the following code, the data[1] = data[0]; operation occurs before the store operations on earlier lines. This incorrect order of operations results in an incorrect value for data[1].
```
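#include <iostream>
#include <sycl/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>

int main() {
  sycl::queue q{sycl::ext::intel::fpga_selector_v};
  volatile int *data = sycl::malloc_host<int>(2, q);
  data[0] = 0;
  q.parallel_for(sycl::range<1>(128), [=](sycl::id<1> index) {
    if (index == 0) {
      // Sequence of store operations to data[0]
      #pragma unroll
      for (int i = 0; i < 100; ++i)
        *data = i;
      // Load that follows the stores; on the affected board it can execute first
      data[1] = data[0];
    }
  }).wait();
  std::cout << "Result: " << data[1] << "\n";
}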
```
For Intel Stratix 10 FPGA reference boards, a rare failure can occur when initializing internal memory where the memory is initialized into an unknown state that can cause unexpected behavior.
As a workaround, compile your design with the -Xsbsp-flow=flat compiler option to avoid this issue.
For simulations of large FPGA designs, such as those that target Intel Agilex 7 boards, you might receive a linker error that contains a PC-relative offset overflow message.
If you receive this message, compile your simulation with the -fsycl-link-huge-device-code compiler command option.
For FPGA kernels, when you specify the no-interleaving=default property, the buffers are not burst-interleaved. Their addresses are still assigned according to the mem_channel properties.
The expected behavior is that the mem_channel property is ignored and buffers are burst-interleaved when the no-interleaving flag is not provided.
There is currently no workaround.
For FPGA devices, channel widths are limited to 4096 bits in Intel Quartus Prime Version 21.4 and earlier. If you exceed the channel width limitation, you receive an error like the following message:
<name>_pipe_channel_read: dataBitsPerSymbol 5120 is out of range: 1-4096
To avoid this error, ensure that you use Intel Quartus Prime software Version 22.1 or later.
When using the FPGA aocl command on a Microsoft* Windows* system, you might get the following error:
```
aocl.exe: Unable to determine the execution environment of the Intel(R) FPGA SDK for OpenCL(TM).
aocl.exe:     Detailed error: Could not determine the path to SDK internal Perl executable
```
If you get this error, complete the following steps:
1. Copy aocl.exe from <oneAPI-install-location>\oneAPI\2024.0\bin to <oneAPI-install-location>\oneAPI\compiler\2024.0\opt\oclfpga\bin
2. Add <oneAPI-install-location>\oneAPI\compiler\2024.0\opt\oclfpga\bin to your PATH environment variable before <oneAPI-install-location>\oneAPI\2024.0\bin
In the FPGA Optimization Report, designs with multiple lambda kernels report inaccurate results unless the lambda kernels are all given unique names. For information about how to name lambda kernels, refer to "Suggested Coding Styles" in the Intel oneAPI FPGA Handbook (https://www.intel.com/content/www/cn/zh/docs/oneapi-fpga-add-on/developer-guide/current/suggested-kernel-coding-styles.html).
For lambda kernels generated in a loop, use templated classes to give the kernels procedurally generated names.
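The idea of procedurally generated kernel names can be sketched in plain C++17 as follows. The names LoopKernel, launch, and launch_all are hypothetical. launch is only a host-side stand-in for a SYCL* submission call that takes the kernel name as a template argument, and it is constexpr solely so that the sketch can be checked at compile time. The point is that each index of the compile-time loop produces a distinct name type.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Name template: every value of Index yields a distinct kernel-name type.
template <std::size_t Index>
class LoopKernel;

// Hypothetical stand-in for a SYCL submission that is given a kernel name as
// a template argument. It only invokes the callable on the host.
template <typename Name, typename Func>
constexpr int launch(Func f) {
  return f();
}

// Compile-time "loop": expands to one launch per index, each with its own
// LoopKernel<I> name, and sums the values returned by the lambdas.
template <std::size_t... Is>
constexpr int launch_all(std::index_sequence<Is...>) {
  return (launch<LoopKernel<Is>>([] { return static_cast<int>(Is) * 10; }) + ... + 0);
}

// Different indices give different name types.
static_assert(!std::is_same<LoopKernel<0>, LoopKernel<1>>::value,
              "names must be unique");
```

Calling launch_all(std::make_index_sequence<4>{}) instantiates four launches named LoopKernel<0> through LoopKernel<3>. In a real design, the stand-in would be replaced by the actual kernel submission.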
For FPGA ap_float data types, the -fp-model=fast compiler command option does not enable dot product inference.
To enable dot product inference for ap_float data types, use the -Xsffp-reassociate compiler command option.
When you use Intel® Quartus® Prime Version 23.2 or later as part of the SYCL* HLS flow, a simulation compilation fails when your design contains a memory interface that satisfies the following conditions:
- The memory interface uses a global memory ring interconnect.
- The memory interface port direction is read/write or read-only.
- The memory interface has a fixed latency (that is, the latency is not explicitly set to 0).
  
  The error is similar to the following error message:
```
Error: mm_agent_ks_mem0_rw.mm_agent_ks_mem0_rw.s0: Agent with readdatavalid must use waitrequest.
```
For simulation in the FPGA Acceleration flow, a design fails at run time with an error if it uses USM pointers that are to be allocated to <global_mem_name> without an annotation or buffer location property that tells the compiler where to allocate them, and the board_spec.xml file for the target board meets the following conditions:
- It defines more than one device global memory.
- The difference between the minimum and the maximum addresses across all the global memories is larger than the size of some global memory named <global_mem_name>.
- The minimum address of <global_mem_name> is larger than its own size.
  
  The error is similar to the following error message:
```
Error: Out of bounds memory write attempted at address <some_address>, for <size> bytes, max_size = <global_mem_name_size>
```
  To work around this error, add an annotation or buffer location property to the USM pointer that specifies where it is to be allocated.
For FPGA, counting the number of bits in an unsigned integer, excluding leading zeros, with a loop like the one in the following example can lead to a compiler error such as "Compiler Error: undefined reference to 'llvm.ctlz.iN'":
```
unsigned int leading_zeros = 0;
while (number) {
  leading_zeros += 1;
  number >>= 1;
}
```
You can work around this issue by first getting the number of leading zeros using the built-in functions __builtin_clz(unsigned) or __builtin_clzll(unsigned long long). When counting the leading zeros of an unsigned char or unsigned short value using the built-in functions, deduct the number of bits added during type conversion from the return value.
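A minimal sketch of this workaround, assuming a GCC- or Clang-compatible compiler that provides __builtin_clz, is shown below. The helper names significant_bits and leading_zeros_u8 are hypothetical. Because the result of __builtin_clz is undefined for an input of 0, the sketch handles that case separately.

```cpp
#include <climits>

// Number of bits in `number` excluding leading zeros (0 when number == 0),
// computed from the leading-zero count instead of a shift loop.
constexpr int significant_bits(unsigned int number) {
  return number == 0
             ? 0
             : static_cast<int>(sizeof(unsigned int) * CHAR_BIT) -
                   __builtin_clz(number);
}

// Leading zeros of an 8-bit value. The argument is widened to unsigned int
// before the built-in is called, so the bits added by that conversion are
// deducted from the result.
constexpr int leading_zeros_u8(unsigned char value) {
  return value == 0
             ? CHAR_BIT
             : __builtin_clz(static_cast<unsigned int>(value)) -
                   static_cast<int>(
                       (sizeof(unsigned int) - sizeof(unsigned char)) * CHAR_BIT);
}
```

The same deduction applies to unsigned short, with sizeof(unsigned short) in place of sizeof(unsigned char).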
In the FPGA SYCL* HLS flow, if your design contains a "write-only" pointer-type kernel argument defined by an annotated_arg<> with buffer location "X" in any kernel, compiling for simulation fails if all the following conditions are true:
- None of the pointers with buffer location "X" are accessed in any kernel body.
- There are no unannotated pointer-type kernel arguments defined or used in any of the kernels in the design.
  
  For example, compiling the following design for simulation causes the compilation to fail:
```
struct kernelA {
  annotated_arg<int*, properties{buffer_location<0>, readwrite_mode_write}> a;
  void operator()() const {}  // a is never accessed, so compiling for simulation fails
};
```

System Requirements

Intel® oneAPI DPC++/C++ Compiler System Requirements

Additional Documentation

Previous oneAPI Releases

Notices and Disclaimers

Intel optimizations, for Intel compilers or other products, may not optimize to the same degree for non-Intel products.

Intel technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from a course of performance, course of dealing, or usage in trade.
