Intel® oneAPI DPC++/C++ Compiler Release Notes

ID 768207
Updated 12/21/2023
Version 2024.0.2
Public

This document summarizes new and changed product features and includes notes about features and problems not described in the product documentation.

Where to Find the Release

Download the Intel® oneAPI toolkit from the Base Toolkit Download page and follow the installation instructions provided there.

Compiler Release 2024.0.2

  • Bug fixes
  • Includes the fixes that are part of 2024.0.1.

Compiler Release 2024.0.1

  • SYCL bindless textures have been fixed and now work correctly on NVIDIA® hardware via the Codeplay® NVIDIA® plugin.
  • OpenMP* runtime has been updated to support Intel® Core™ Ultra devices.

oneAPI 2024.0, Compiler Release 2024.0

New Features and Improvements

  • In 2024.0, the Level Zero driver switched the default for Intel® Data Center GPU Max Series from "cards-as-devices" to "tiles-as-devices".
  • Starting with this release, the Intel Level Zero and OpenCL GPU driver exposes multi-tile devices like Intel® Data Center GPU Max Series differently, which also affects the way these devices are exposed in SYCL and OpenMP. Prior to this change, each card was exposed as a root device and tiles were exposed as sub-devices. Now, each tile is exposed as a root device by default. This also affects how root devices can be partitioned into sub-devices. The old behavior can be enabled via the ZE_FLAT_DEVICE_HIERARCHY environment variable.
    As a result, use of the environment variables ONEAPI_DEVICE_SELECTOR and ZE_AFFINITY_MASK may need to change because the number of root devices and the availability of sub-devices is different than in prior releases.  
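
    As a rough illustration of why device indices shift, the following C++ sketch counts root devices under both models. The GpuSystem struct, the Hierarchy enum, and all function names are hypothetical and not part of any Intel API. Two further assumptions are labeled in the comments: that the value COMPOSITE of ZE_FLAT_DEVICE_HIERARCHY selects the old cards-as-devices behavior, and that tiles are numbered card by card under the new default.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical description of a multi-tile system (not an Intel API).
struct GpuSystem {
    int cards;           // physical cards installed
    int tiles_per_card;  // tiles on each card
};

// Illustrative names for the two exposure models described above.
enum class Hierarchy { TilesAsDevices, CardsAsDevices };

// Number of root devices an application would see under each model.
constexpr int root_device_count(GpuSystem s, Hierarchy h) {
    return h == Hierarchy::TilesAsDevices ? s.cards * s.tiles_per_card
                                          : s.cards;
}

// ASSUMPTION: tiles are enumerated card by card, so tile t of card c
// becomes root device c * tiles_per_card + t under tiles-as-devices.
constexpr int flat_root_index(int card, int tile, int tiles_per_card) {
    return card * tiles_per_card + tile;
}

// ASSUMPTION: the value "COMPOSITE" restores cards-as-devices; any other
// value, or an unset variable, is treated here as the 2024.0 default.
inline Hierarchy hierarchy_from_environment() {
    const char* value = std::getenv("ZE_FLAT_DEVICE_HIERARCHY");
    const bool composite = value != nullptr && std::string(value) == "COMPOSITE";
    return composite ? Hierarchy::CardsAsDevices : Hierarchy::TilesAsDevices;
}
```

    For a system with two cards of two tiles each, this model reports four root devices by default and two under the old behavior. Under the ordering assumption, the tiles of the second card would appear as root devices 2 and 3, which is why selectors written for earlier releases may need to be revisited.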

  • Enabled precompiled header (PCH) inclusion while performing host compilation with -fsycl.
  • Added support for the -ftarget-export-symbols option, which allows exporting symbols for AOT compilation.
  • Added support for reqd_sub_group_size to sycl::is_compatible and implemented device code split based on reqd-sub-group-size optional kernel feature.
  • Added support for -ftarget-compile-fast for spir64_gen and JIT mode.
  • Implemented adding C++ libraries to the linker with -fsycl.
  • Added support for multiple call operators in a kernel functor.
  • Added support to propagate compile flags to device backend compiler.
  • Added new FPGA loop attribute enable_loop_pipelining.
  • Enabled sycl_ext_oneapi_annotated_arg and sycl_ext_oneapi_annotated_ptr experimental extensions. 
  • Implemented sycl_ext_intel_queue_immediate_command_list extension.
  • Implemented sycl_ext_oneapi_copy_optimize experimental extension.
  • Added initial implementation of the experimental SYCL Graph extension sycl_ext_oneapi_graph.
  • Added dimensions member to item/range-like types.
  • Added support of queue::priority_* properties for OpenCL backend.
  • Implemented initial version of the SYCL Native CPU Plugin designed in SYCLNativeCPU.
  • Added __imf_max/min/hadd/fast_* functions to imf device libraries.
  • Introduced and implemented new sycl::ext::oneapi::experimental::info::device::architecture device descriptor and device::ext_oneapi_architecture_is(ext::oneapi::experimental::architecture) host API as part of sycl_ext_oneapi_device_architecture extension (for Level Zero and OpenCL). 
  • Added experimental implementation of sycl_ext_intel_grf_size.
  • Experimental sycl_ext_oneapi_device_global extension is supported now. 
  • Enabled media API that works with image accessors for ESIMD. Those accessors do not depend on stateful/stateless mode.
  • Allowed implicit conversion from std::experimental::simd_mask to ESIMD::simd_mask.
  • Added host_task image accessor support.
  • Started to support 64-bit offsets with accessors in stateless mode for ESIMD APIs.
  • Added __imf_llabs to the imf device libraries. 
  • Added explicit conversion of multi_ptr<T> to multi_ptr<const T>.
  • Added support for scalar logical operators with group algorithms.
  • Implemented device query for 64-bit atomic support in ESIMD emulator.
  • Added support of 16-bit data for lsc_atomic_update and lsc_slm_atomic_update ESIMD API.
  • Added initial implementation of sycl_ext_oneapi_root_group. There are severe limitations: max_num_work_group_sync query always returns 1 and the implemented barrier has work group scope since all work items in a root group are currently in the same work group.
  • Added support for tf32 type using the unified interface for SYCL Matrix extension.
  • Implemented Host Pipes described in sycl_ext_intel_dataflow_pipes extension.
  • Added support for more math builtins for bfloat16 together with the extension specification sycl_ext_oneapi_bfloat16_math_functions update.
  • Added decorated async_work_group_copy overloads.
  • Added initial implementation of the Unified Runtime plugin and routed to it with SYCL_PREFER_UR.
  • Added support for accessors to atomic_update ESIMD API.
  • Enabled pretty-printing of reference objects by GDB.
  • Added Xmethods that help GDB to support reading local_accessors on GPU from SLM.
  • Enabled passing local accessors to ESIMD kernel and via invoke_simd() API, and enabled usage of get_pointer() and operator[] for accessors in ESIMD kernel. The support of accessor and local_accessor is still limited compared to SYCL.
  • Implemented unpadding for 2d block load/store ESIMD API.
  • Added SYCL 2020 image classes.
  • Added interfaces for SYCL 2020 image accessors. Only the host accessor interfaces are currently functional.
  • Added XPTI notifications for SYCL 2020 images as well as for the associated accessor classes.
  • Started to print device aspects in sycl-ls --verbose.
  • Enabled group algorithms to be used with tangle_group and opportunistic_group arguments.
  • Implemented info::device::backend_version query.
  • Added fixed_size_group support to algorithms.
  • Added simple abs(int) to imf libdevice.
  • Added ballot_group support to algorithms.
  • Implemented interoperability support for images for the Level Zero backend.
  • Added marray support for math builtin functions.
  • Enabled inline assembly support in ESIMD kernels.
  • Enhanced interop queue interface to choose standard or immediate command lists.
  • Enabled double type for atomic_update() ESIMD API.
  • Added support for addc and subb operations for ESIMD.
  • Allowed zero-sized 3D accessors.
  • Deprecated -fsycl-link-huge-device-code in favor of a new option, -flink-huge-device-code. The new option is identical in functionality but is also allowed with -fopenmp-targets.
  • Optimized size of produced binaries when device_global is used. 
  • Started to emit an error when PCH is triggered in SYCL mode.
  • Improved FPGA archive device unbundling with AOCO.
  • Moved imf abs to a separate device library for Deep Learning. 
  • Fixed the bug report URL for DPC++. 
  • Started to link with bfloat16 related device libraries only when they are used.
  • Started to properly pass -fsycl-esimd-force-stateless-mem to the host code compilation.
  • Combined ADLS and RPLS device architectures. 
  • Made multi_ptr default to legacy to avoid breaking SYCL 1.2.1 code.
  • Started to use aggregate initialization for group_local_memory arrays according to updated sycl_ext_oneapi_local_memory.
  • Deprecated ESIMD Emulator.
  • Deprecated ext::oneapi::sub_group.
  • Improved error message related to ext_intel_free_memory aspect. 
  • Removed unnecessary barrier after region copy in the Level Zero backend.
  • Fixed get_info<device::free_memory> to check its aspect.
  • Removed the workaround for release of auxiliary buffers.
  • Enabled optimization for read-only buffer accesses from multiple devices in a context.
  • Removed old specialization constants extension and implementation.
  • Improved is_compatible to check whether a specific target is defined with -fsycl-targets and to adjust the result accordingly.
  • Improved quotient approximation for host sycl::remquo.
  • Allowed accessors to be constructed with zero-size buffers.
  • Improved resources recycling for the Level Zero backend by cleaning up events on in-order queue wait.
  • Added code_location parameter to the rest of sycl::queue methods, which improves error reporting.
  • Enabled xpti::node_create signal emit for parallel_for that bypasses graph.
  • Improved the accuracy of host sycl::cospi.
  • Replaced usage of outdated memset OpenCL API clEnqueueMemsetINTEL with clEnqueueMemFillINTEL.
  • Added memory pointer to XPTI memory allocation event metadata to allow more informative XPTI memory tracing.
  • Implemented recycling of immediate command lists for queues in a context for the Level Zero backend.
  • Optimized ext_oneapi_submit_barrier() for in-order queue for the Level Zero backend.
  • Changed the SYCL_PI_LEVEL_ZERO_USM_RESIDENT default to force device allocations only.
  • Started to report false for aspect::image on all devices.
  • Removed lambda parameter from "if_architecture_is" according to sycl_ext_oneapi_device_architecture.
  • Improved error reporting when reqd_work_group_size is not supported by a device. 
  • Adjusted static restriction checks for block_2d APIs.
  • Disallowed local accessor misuse according to SYCL 2020.
  • Implemented passing address space information to SPIRV Joint Matrix Load/Store intrinsics.
  • Enabled immediate command lists by default for the Level Zero backend.
  • Improved SYCL RT performance by removing extra map lookup for eliminated kernel arguments during enqueue of cached kernels.
  • Changed the default to SYCL_PI_LEVEL_ZERO_USM_RESIDENT=2.
  • Added heuristics to reduce overhead from immediate command list cleanup in the Level Zero backend.
  • Renamed cluster_group to fixed_size_group.
  • Added error for invalid uniform arguments in InvokeSIMD.
  • Reduced overhead from queue profiling by using steady_clock::now() to retrieve host time as piGetDeviceAndHostTimer has large overhead.
  • Implemented graceful handling of the case where the Level Zero runtime was already unloaded at cleanup time.
  • Improved error messages for InvokeSIMD.
  • Updated native_specialization_constant() for the case when no specialization constants are present.
  • Optimized reuse of command lists in the Level Zero backend.
  • Changed the behavior of sycl::maximum and sycl::minimum to be consistent with std::max and std::min.
  • Switched to sycl::exception instead of sycl::runtime_error in the majority of SYCL device headers. sycl::runtime_error is deprecated in SYCL 2020.
  • Added USM calls parameter verification layer to sycl-trace which provides improved diagnostic on accessing invalid device memory under USM.
  • Started to print supported SG sizes in sycl-ls --verbose.
  • Added support for the following FPGA variable-precision data type conversions:
    • ac_int to ap_float
    • ap_float to ac_int
    • ac_fixed to ap_float
    • ap_float to ac_fixed
  • Added command to extract FPGA hardware configuration file (.aocx) from your multiarchitecture binary.
  • Removed the -Xsfp-relaxed FPGA option. This option is no longer supported.
    If you want an accumulator to be inferred even when using -fp-model=precise, rewrite your code to expose the accumulation.
  • Removed restriction on simulating FPGA host pipes with protocol_name::avalon_mm.
  • Replaced the -Xsoptimize=throughput option with the -Xsoptimize=throughput-area-balanced option.
  • Added the annotated_arg class. Use this class to create a memory-mapped host interface to your FPGA kernel.
  • Deprecated the [[intel::max_global_work_dim(0)]] kernel attribute. The compiler adds this attribute automatically for any single-task kernel, so adding this attribute explicitly is no longer required.
  • Deprecated the register_map_interface macro for FPGA kernels. A memory-mapped interface is now the default, so an explicit declaration is no longer required.
  • Added the streaming_interface FPGA kernel property to replace the streaming_interface macro. The macro is now deprecated.
  • Added the pipelined FPGA kernel property to replace the streaming_pipelined_interface macro. The macro is now deprecated.

Bug Fixes

  • Fixed uses_aspects to be applied to function declarations and not only function definitions.
  • Fixed ivdep attribute in template contexts.
  • Fixed option restriction for device with Windows.
  • Fixed handling unsupported attributes for ESIMD.
  • Fixed optimization option processing for device options.
  • Enabled proper behavior of optional kernel features with SYCL_EXTERNAL.
  • Fixed "SLM init call is supported only in kernels" message for slm_init() inside an ESIMD kernel when compiling with -O0 option.
  • Fixed memory leak in Base64::decode.
  • Fixed crash when using libraries with -fsycl.
  • Added predefines for /MDd to fix certain build issues when using reductions on Windows.
  • Fixed compilation issue when kernel name is a class defined with "final" keyword.
  • Fixed an issue with setting VCSLMSize attribute which was causing JIT compilation failure when using slm_init() in ESIMD kernel.
  • Fixed option for undefined symbols for hostdep link to work with gold linker.
  • Fixed directory field of DIFile.
  • Fixed the bug in the Level Zero backend where zeModuleDestroy is called although ext::oneapi::level_zero::ownership::keep is passed to make_kernel_bundle, leading to a double-free corruption.
  • Addressed specification mismatches for multi_ptr.
  • Fixed corner case when using short or char with exclusive scan.
  • Updated the legacy multi_ptr construction from local_accessor's to conform with the construction method used by other accessors.
  • Made get_pointer noexcept to comply with SYCL 2020.
  • Started to return the correct sycl::errc in case of invalid sycl::reqd_work_group_size.
  • Fixed sycl::remquo truncation error.
  • Fixed linkage errors when trying to get mem_channel property or check it using has_property().
  • Fixed dynamic loading of ocloc tool for online_compiler.
  • Fixed global memory reporting for Arc.
  • Fixed sycl::sub_group to follow by-value semantics according to SYCL 2020.
  • Fixed invalid value returned by event.get_info<sycl::info::event::command_execution_status>() for OpenCL backend.
  • Fixed assignment operator for ESIMD globals.
  • Fixed the Level Zero backend to not destroy build log on program build failure.
  • Fixed error appearing when validation layer is enabled for the Level Zero backend with ZE_DEBUG=6.
  • Allowed group algorithms to accept a function object with an explicit type.
  • Fixed sycl::is_compatible() for the case when vector of kernels ids is empty.
  • Fixed multi_ptr ctor for extended address spaces.
  • Fixed trivially_copyable to device_copyable for fill and copy.
  • Fixed range deductions in reduction parallel_for.
  • Adjusted multi_ptr deduction guides to comply with SYCL 2020.
  • Fixed gather/scatter with accessors when passing scalar for ESIMD backend.
  • Fixed lost data during implicit conversion in local and host accessors.
  • Fixed memory leak because of unnecessary kernel retain.
  • Fixed the pointer type contained by a read-only accessor to const.
  • Fixed max_sub_group_size query for devices without sub groups.
  • Removed undefined behavior in host-side abs_diff.
  • Fixed vec::as<vec<bool, N>>().
  • Fixed segfault when using ZE_DEBUG and an exception is thrown during the device selection phase.
  • Removed an unnecessary multi_ptr conversion which had conflict with an existing one resulting in ambiguity on implicit conversion.
  • Fixed lost data in accessor implicit conversion.
  • Fixed stack smashing which happened in some device info queries on OpenCL backend.
  • Stopped shipping the Level Zero loader and headers with the DPCPP toolchain installation.
  • Corrected the Intel® Data Center GPU Max Series device id check in the Level Zero backend.
  • Fixed ambiguity for bit_cast.
  • Fixed nan/inf handling in sycl::stream when in fast-math mode.
  • Allowed for different types in group algorithms reduce, exclusive_scan, and inclusive_scan.
  • Fixed bug in get methods in the config_2d_mem_access class for 2D block stateless load/store API.
  • Fixed compilation error for sycl::fabs builtin in fast-math mode.
  • Allowed host_task deduction tags in placeholder accessors.
  • Added iterator operations for zero-dimension accessors to comply with SYCL 2020.
  • Fixed rounding issue in __imf_vavgs.
  • Fixed bug in fill operation for zero-dimensional accessor.
  • Fixed arguments passed to ocloc via -ftarget-compile-fast.
  • Implemented implicit conversion for local and host accessors to comply with SYCL 2020.
  • Started to throw an exception when an empty accessor calls require() to comply with SYCL 2020.
  • Fixed operator~ for sycl::vec<bool, N>.
  • Fixed sub_group shuffle for vectors of long long and half.
  • Fixed static destruction order issue in OpenCL extension fptr cache.
  • Fixed leak of active barriers' events in the Level Zero backend.
  • Implemented proper queries for aspect::ext_oneapi_srgb.
  • Added vec assignment from scalar and more vec modulus overloads to comply with SYCL 2020.
  • Fixed handling of mem_channel buffer property.
  • Fixed operator& and operator[] in local_accessor<const T>.
  • Fixed the third argument type for select builtin.
  • Fixed native_specialization_constant() API implementation.
  • Moved Level Zero specific collectors to dynamic libraries loaded on request in the sycl-trace tool to support the case when the Level Zero driver is not available.
  • Added missing marray relational functions: any, all, bitselect; fixed scalar select relational function; aligned scalar abs integer function with SYCL 2020; and fixed math functions which take a multi_ptr argument.
  • Added operator[] and element_type to swizzle vec to align with SYCL 2020.
  • Fixed buffer range in atomic_memory_order_acq_rel.
  • Started to throw for invalid global_work_size query.
  • Fixed the Level Zero backend to report events as submitted, not running, until they are completed. 
  • Allowed raw send ESIMD API to use non-standard types like sycl::half.
  • Started to throw for invalid info::kernel::num_args query to comply with SYCL 2020.
  • Updated group_broadcast to support vec types.
  • Fixed reductions to avoid implicit atomic64 requirements. 
  • Added a partial profiling workaround for acc devices because queue profiling is no longer supported for OpenCL version < 2.1 after recent changes regarding command submit profiling info.
  • Fixed program build API for the Level Zero backend to retain the build log when program build failed.
  • Implemented missing std::hash specializations for local_accessor and host_accessor to comply with SYCL 2020.
  • Fixed global_work_size kernel query.
  • Fixed incorrect sycl::vec<bool, N> constructor behavior. 
  • Fixed weak_object for host_accessor and stream.
  • Fixed incorrect behaviors in some operations using sycl::vec<bool, N>.
  • Fixed integration footer for specialization_id.
  • Fixed compilation break occurring when bfloat16 constructor is used in a kernel.
  • Fixed crash in subgroup info queries when running on OpenCL backend which doesn't support subgroups.
  • Added missing support for target::host_task specialized accessor constructor using mode_target tag.
  • Fixed identityless reductions with unwritten reducers.
  • Fixed dangling pointer issue in xpti tracing.
  • Fixed PI event leak in memcpy2d device-host fallback.
  • Fixed weak_object and owner_less for device objects.
  • Added noexcept specifier to vec::byte_size to comply with SYCL 2020.
  • Fixed undefined behaviour in vector printf specifier.
  • Fixed handler::fill so it works even if pattern is not supported by a backend natively.
  • Fixed mechanism to throw an exception when a placeholder accessor is passed to a command.
  • Fixed empty zero-dimensional accessor access range.
  • Fixed incorrect write back in a case when a sycl::buffer is constructed with a const T* as host data.
  • Fixed empty accessor default constructor to not create a placeholder to comply with SYCL 2020.
  • Removed deprecated piclCreateProgramWithSource.
  • Removed deprecated barrier API.
  • Removed deprecated interop_task.
  • Removed deprecated sycl::group_local_memory.
  • Removed deprecated sycl::detail::bitcast. 
  • Removed deprecated piEnqueueNativeKernel.
  • Removed deprecated backend enum values: level_zero, cuda, esimd_cpu, hip.
  • Removed non-standard RT namespace from sycl namespace.
  • Removed deprecated ESIMD APIs.
  • Removed DISABLE_SYCL_INSTRUMENTATION_METADATA macro and _CODELOC* macro usage from API.
  • Removed getOSModuleHandle usage.
  • Removed deprecated sycldevice triple support.
  • Dropped support for sycl_ext_oneapi_extended_atomics extension and sycl_ext_oneapi_group_algorithms extension.
  • Removed unneeded backwards compatibility of make_queue and get_native.
  • Removed support for binaries generated for triples with "sycldevice" environment component.
  • Updated sycl::exception to be SYCL 2020 compliant.
  • Replaced deprecated sycl::runtime_error with SYCL 2020 compliant sycl::exception in SYCL RT.
  • Renamed win_proxy_loader to pi_win_proxy_loader.
  • Promoted the return type changes of SYCL relational builtins that changed between SYCL 1.2.1 and SYCL 2020 out from the guard of SYCL2020_CONFORMANT_APIS.
  • Fixed get_pointer to return T* for target::device specialized accessor according to specification.
  • Fixed max_work_item_sizes return type from id to range according to SYCL 2020.
  • Removed the deprecated experimental set_kernel_properties API and the use_double_grf/use_large_grf properties. Use the new API provided in the sycl_ext_intel_grf_size extension instead.
  • Fixed an issue preventing the support of FPGA Task Sequence functions with struct return values.
  • Fixed an FPGA issue causing an assert message for the loop with local memory LSUs on the loop’s II critical path due to loop-carried memory dependency when compiling a hyper-optimized loop using the [[intel::max_reinvocation_delay]] FPGA loop attribute.
  • Fixed an issue where FPGA RTL libraries cause an Intel® Quartus® Prime compilation of a SYCL* HLS IP core to fail late in the compilation process.
  • Fixed an FPGA issue where the hw.tcl file that is generated as part of the SYCL* HLS flow did not map signals correctly.
  • Fixed an error that occurred when you use FPGA mmhost macros on kernel arguments that are used inside a lambda within the kernel function. In addition to fixing this error, the mmhost macros are now deprecated.
  • Fixed an FPGA issue where the modulefiles-setup.sh script was not supported for FPGA.

Known Issues and Limitations

  • There is a known issue with bindless textures in SYCL. The team is working to resolve it, and a patch release (2024.0.1) will be posted to address it.

  • Copy operations inside a graph node of Intel® Arc™ Graphics can fail because of a possible synchronization issue that also affects accessors.

  • A MESA OpenCL implementation that provides no devices may cause incorrect device discovery on a system. As a workaround, disable such an OpenCL implementation by removing /etc/OpenCL/vendors/mesa.icd.

  • -fsycl-dead-args-optimization cannot eliminate the offset of an accessor even when the accessor is created with no offset specified.

  • SYCL 2020 barriers show worse performance than SYCL 1.2.1 barriers do.

  • Using fallback assert in a separate compilation flow requires explicit linking against lib/libsycl-fallback-cassert.o or lib/libsycl-fallback-cassert.spv.

  • Limit alignment of allocation requests at 64KB which is the only alignment supported by Level Zero.

  • The format of the object files produced by the compiler can change between versions. The workaround is to rebuild the application.

  • Using sycl::kernel_bundle API to refer to a kernel defined in another translation unit leads to undefined behavior.

  • Linkage errors with the following message: error LNK2005: "bool const std::_Is_integral<bool>" (??$_Is_integral@_N@std@@3_NB) already defined can happen when a SYCL application is built using MS Visual Studio 2019 version below 16.3.0 and user specifies "-std=c++14" or "/std:c++14".

  • Printing internal defines isn't supported on Windows.

  • The support of accessor and local_accessor for ESIMD is still limited compared to SYCL.

  • sycl_ext_oneapi_root_group implementation has the following limitations: max_num_work_group_sync query always returns "1" and the implemented barrier has work group scope since all work items in a root group are currently in the same work group.

  • SYCL Graph finalizes for every device associated with a context rather than only for the specific user-defined device. This unnecessary handling of devices may cause overhead and may lead to unintended interactions with buffers.
  • Applications running on "pre-Intel® Data Center GPU Max Series" GPUs using larger-than-4GB allocations need to set the following environment variable so Intel Graphics Compiler (IGC) can create the correct compilation unit: SYCL_PROGRAM_COMPILE_OPTIONS=-ze-intel-greater-than-4GB-buffer-required 
    Not using this environment variable on applications using larger-than-4GB allocations may produce unexpected behavior. Using larger-than-4GB allocations on "pre-Intel® Data Center GPU Max Series" GPUs disables certain pointer arithmetic optimizations added by the graphics compiler, so a difference in performance in the application may be seen.
  • If a user builds a SYCL program with the '-g' option and runs it with a gcc version greater than 13.1.0 on a CPU device, the program may crash during the shutdown stage. Although this issue is not an Intel® product defect, users should not run a SYCL program built with the '-g' option on a CPU device if the gcc version used is greater than 13.1.0.

  • In 2024.0 release, we have fine-tuned the Address, Leak, and Thread sanitizers (triggered with the compiler flag -fsanitize) to focus on detecting issues in a user’s program, as opposed to those, whether legit or not, in the various runtime libraries that the DPC++ runtime depends on. Note that for the Thread Sanitizer, it may occasionally detect issues in SYCL header files while we continue working on fixing or suppressing those issues in future releases. The Memory sanitizer, on the other hand, will still report issues in the runtime libraries before reaching the user code. Note that these sanitizers only sanitize the host-side code in your DPC++ program; the kernels are not sanitized.

  • The DPC++ 2024.0 compiler has been updated to conform to a change in the SYCL 2020 specification, but this update can break some SYCL programs that work fine with DPC++ 2023.0. Specifically, programs that use sycl::minimum or sycl::maximum can be affected. The reason is that before, sycl::maximum was defined using operator>. Now, it's defined using operator< to be consistent with std::max.
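
    To make the difference concrete, the following C++ sketch models the two definitions with plain function templates. The names maximum_via_greater, maximum_via_less, and Ranked are hypothetical stand-ins, and the exact expressions are an assumption based on the description above and on how std::max is specified; they are not copied from the SYCL headers.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical model of the earlier definition, built on operator>.
template <typename T>
T maximum_via_greater(const T& lhs, const T& rhs) {
    return (lhs > rhs) ? lhs : rhs;
}

// Hypothetical model of the 2024.0 definition, built on operator<
// in the same way std::max is specified.
template <typename T>
T maximum_via_less(const T& lhs, const T& rhs) {
    return (lhs < rhs) ? rhs : lhs;
}

// A user type that provides only operator<. It works with the operator<
// based definition but would not compile with the operator> based one.
struct Ranked {
    int rank;
    bool operator<(const Ranked& other) const { return rank < other.rank; }
};

// Any comparison with NaN is false, so the two definitions pick
// different operands when the first argument is NaN.
inline bool definitions_differ_for_nan_first() {
    const double not_a_number = std::numeric_limits<double>::quiet_NaN();
    const double via_greater = maximum_via_greater(not_a_number, 1.0);  // yields 1.0
    const double via_less = maximum_via_less(not_a_number, 1.0);        // yields NaN
    return via_greater == 1.0 && std::isnan(via_less);
}
```

    In this model, programs can be affected in two ways: a type that defines only operator> no longer satisfies the requirement, and inputs for which neither comparison is true (such as NaN, or equivalent but distinguishable values) can select a different operand than before.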

  • Using sycl::image L0 interop when compiling at -O0 might result in issues. To work around this, use a different optimization level.

  • Starting from CMake 3.27, CMake support for Intel® oneAPI DPC++/C++ Compiler and Intel® Fortran Compiler has been updated to use the compiler driver for linking instead of the linker. This change enables use cases for building SYCL applications and libraries and enables Interprocedural Optimization (IPO) on Windows*. Also starting from CMake 3.25, due to limitations in the way CMake tracks linker flags, a CMake project with mixed C/C++ and Fortran code where one of the compilers is LLVM-based (e.g. icx, ifx) and another is the classic version (e.g. icc, ifort) will result in invalid linking flags. As an example, CMake cannot build a project using "icx" as the C/C++ compiler and "ifort" as the Fortran compiler. A workaround is to use only LLVM-based compilers when building mixed language applications.

  • A DPC++ system that has FPGAs installed does not support multiprocess execution.

  • A DPC++ program that runs kernels on one or more FPGA devices does not support multithreaded execution.
  • When compiling for FPGA emulation, debug support on Windows is unavailable when using device-side libraries.
  • On Windows, compiling FPGA designs in a directory with a long path name might fail, and you might see the following error: 
    dpcpp: error: fpga compiler command failed with exit code 1 (use -v to see invocation)
    NMAKE : fatal error U1077: '…\oneAPI\compiler\latest\windows\bin\dpcpp.EXE' : return code '0x1'
    As a workaround, either compile the design in a directory with a short path name or reset TMP and TEMP environment variables to point to a shorter path (for example, C:\temp).
  • When using the atomic_fence function for FPGA, the memory_scope::system constraint is not supported, and its use is not diagnosed. The compiler treats the memory_scope::system constraint as a memory_scope::device constraint because that is the broadest scope supported for FPGA.
  • When compiling for FPGA, the compiler might pack structs differently on Windows than on Linux. This difference can result in structs with members that might not be well-aligned for optimal memory accesses. As a result, some designs that compile with an II=1 on Linux might have, for example, II=10 on Windows.
    As a workaround, force an alignment on the misaligned structs, as shown in the following example:
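    // Code with misaligned struct
    struct Item {
      bool valid;
      int value1;
      unsigned char value2;
    };

    // Forced alignment of struct: explicit padding members keep each field on a 4-byte boundary
    struct Item {
      bool valid;
      bool __empty__[3];            // pads valid out to 4 bytes
      int value1;
      unsigned char value2;
      unsigned char __empty2__[3];  // pads value2 out to 4 bytes
    };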
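
    One way to confirm that a manually padded struct has the intended layout on every platform is to check it at compile time. The sketch below is a minimal example, assuming a 1-byte bool and a 4-byte int. The PaddedItem name and the pad0/pad1 member names are hypothetical; they replace the double-underscore names because such identifiers are reserved in C++.

```cpp
#include <cassert>
#include <cstddef>

// Same fields as the forced-alignment struct, with hypothetical
// padding member names.
struct PaddedItem {
    bool valid;
    bool pad0[3];           // explicit padding up to the 4-byte boundary
    int value1;
    unsigned char value2;
    unsigned char pad1[3];  // explicit tail padding
};

// The layout checks below only make sense under these size assumptions.
static_assert(sizeof(bool) == 1 && sizeof(int) == 4,
              "example assumes 1-byte bool and 4-byte int");

// If any of these fail, the compiler laid the struct out differently
// than intended and the build stops instead of silently degrading.
static_assert(offsetof(PaddedItem, value1) == 4, "value1 must start at byte 4");
static_assert(offsetof(PaddedItem, value2) == 8, "value2 must start at byte 8");
static_assert(sizeof(PaddedItem) == 12, "struct must occupy exactly 12 bytes");
```

    Building the same file on Linux and on Windows then either succeeds on both or fails with a message that names the field whose offset changed.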
  • The FPGA optimization report reports incorrect area utilization data from Quartus compiles for Intel® Quartus® Prime Pro Edition software versions 23.1 and later. Currently, there is no known workaround for this issue. 
  • On Windows, the standalone Intel® oneAPI FPGA Reports Tool application might fail to run on a mapped network drive and display the GPU process launch failed error message on the console. As a workaround for this issue, copy the Intel® oneAPI FPGA Reports Tool from the mapped network drive to your local computer and run it locally.
  • Due to a known issue pertaining to HTML files within the Jupyter Notebook, you cannot launch the FPGA Optimization Report in a Jupyter Notebook. As a workaround for this issue, either use the Intel oneAPI FPGA Reports Tool or copy the FPGA optimization reports directory to a local file system and launch it using a supported browser. 
  • When compiling for the FPGA optimization report flow, the list of optimization flags used for a compile may be incomplete or unavailable in the FPGA Reports Summary Page under certain circumstances, such as when using the -g0 flag. As a workaround for this issue, avoid using the -g0 flag in your compilation command. Also, if you use the -ghdl option, ensure that it is the last argument in your command.
  • When generating FPGA optimization reports, the compiler might crash for any design with pipes having a capacity of 0. If only a few pipes (but not all pipes) have a capacity of 0 in the design, then only those with a nonzero capacity will appear in the Area report. As a workaround for the compiler crash, assign a capacity (for example, 1) to one of the pipes with capacity 0.
  • When compiling for FPGA, if you specify output target names that are pure numbers or that start with a number, the compiler errors out and might display an error message, as shown in the following example:
    icpx -fsycl -fintelfpga -Xssimulation basic.cpp -o 2
    aoc: Compiling for Simulator.
    Error: Simulation system generation FAILED.
    Refer to 2.prj/2.log for details.
    
    llvm-foreach:
    icpx: error: fpga compiler command failed with exit code 1 (use -v to see invocation)

    The error message is not representative. As a workaround for this issue, specify target names that are not pure numbers and that do not start with a number.
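    A build script can screen output names before invoking the compiler. The following is a minimal sketch, not part of any Intel tool: the helper name is_safe_fpga_output_name is hypothetical, and it assumes that only the final path component of the -o argument matters, which the note above does not state.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true when the file-name part of an output
// target is non-empty and does not begin with a decimal digit. This also
// rejects pure numbers, because they start with a digit.
// Assumption: only the last path component is relevant to the issue.
bool is_safe_fpga_output_name(const std::string& name) {
  const std::size_t slash = name.find_last_of("/\\");
  const std::string base =
      (slash == std::string::npos) ? name : name.substr(slash + 1);
  if (base.empty()) return false;
  return !std::isdigit(static_cast<unsigned char>(base.front()));
}
```

    With this check, a name such as basic is accepted, while 2 or 2fast is rejected before the compile starts.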

  • When compiling for FPGA, the compiler ignores the sycl::property::buffer::mem_channel buffer property. Irrespective of whether you specify the property or not, all buffer allocations are allocated to the first memory channel. Currently, there is no known workaround for this issue.  
  • When running pipelined kernels in the FPGA simulation flow, the simulation runtime may not launch kernel invocations fast enough to achieve the lowest possible kernel initiation interval that the generated RTL can achieve. Currently, there is no known workaround for this issue.
  • When compiling for FPGA, the compiler might ignore non-RTL source library functions when the library archive file also contains RTL source objects and report the following error message: 
    Compiler Error: undefined reference to <non-RTL source library function>
    As a workaround for this issue, avoid placing RTL and non-RTL source library objects in the same archive file.
  • The compiler is not constrained to the specified LSU style when requesting a particular LSU style using the FPGA LSU controls for a struct data type. Instead, it chooses the best LSU style for the access pattern. As a workaround for this issue, avoid using LSU controls with the struct data type and use simple data types instead. 
  • The FPGA SYCL HLS encryption flow is not fully supported on Windows systems. 
  • When compiling for FPGA, the compiler might crash if any global memory in the board_spec.xml file does not have a name field. Ensure that all global memories in the board_spec.xml file have a name field. For example, <global_mem name="DDR" ... >
  • When compiling for FPGA and linking multiple fat static libraries containing the device code (produced using the -fsycl-link=image flag), only the device code from the first library is included in the fat executable, and the following error message is returned:
    > what(): native api failed. native api returns: -46 (pi_error_invalid_kernel_name)
    > terminate called after throwing an instance of 'sycl::_v1::exception
    As a workaround for this issue, dynamically link the host code instead of linking statically.
    Example compilation commands:
    icpx -fsycl main.cpp -c -o main.o
    icpx -fsycl -fintelfpga -fpic -shared add_kernel.cpp -o libadd_kernel.so
    icpx -fsycl -fintelfpga -fpic -shared sub_kernel.cpp -o libsub_kernel.so
    icpx -fsycl -fintelfpga main.o -L. -ladd_kernel -lsub_kernel -o hot_swapper
    LD_LIBRARY_PATH=$LD_LIBRARY_PATH:. ./hot_swapper
  • If your design includes a device_global memory greater than 1024 bits in size and you have not initialized it in the kernel, then you might see incorrect behavior when compiling for the simulator. Memory size greater than 1024 bits can happen due to the following reasons: 
    • device_global is an array with a size greater than 1024 bits.
    • device_global is a scalar that uses ac_int (or other large types) with a size greater than 1024 bits.

      This is caused by a bug in the Intel Quartus Prime Pro Edition software that occurs when initializing memory using a MIF file. As a workaround for this issue, zero-initialize the contents of the device_global memory before accessing the memory.
  • You might encounter functional failures in the FPGA emulation flow when a device_global is reset and a new device_image is loaded without the device_image scope property. Currently, there is no known workaround for this issue.
  • Using atomic_ref in FPGA flows to access memory data requires much more width on the interface than the width of the datatype.
  • The FPGA mmhost macro does not work in kernel lambdas. The mmhost macro is deprecated and will be removed in a future release.
  • When doing a multistep compilation for FPGA where you create an FPGA early image file (.a) that you later compile to a multiarchitecture binary, the second compilation fails if you change the file name between the two compilations.
    For example, the following compilation sequence fails:
    icpx -fsycl -fintelfpga -fsycl-link=early -Xshardware -Xsboard=pac_a10 vector_add.cpp *.cpp -o A.a
    icpx -fsycl -fintelfpga -Xshardware -Xsboard=pac_a10 A.a -o B.exe
    To prevent this, use the same file name for both compilations:
    icpx -fsycl -fintelfpga -fsycl-link=early -Xshardware -Xsboard=pac_a10 vector_add.cpp *.cpp -o B.a
    icpx -fsycl -fintelfpga -Xshardware -Xsboard=pac_a10 B.a -o B.exe
  • For FPGA compilation, when you use the annotated_arg class in SYCL* HLS kernels, the address width of the lowest numbered buffer location must be at least 11 bits. Otherwise, the compiler returns an error.
  • For FPGA compilation, the amount of memory that can be allocated through a single call of a SYCL* allocation API to a buffer_location is limited by the total size of the lowest numbered buffer location.

    To avoid a null pointer being returned by any of the SYCL* allocation APIs due to this limitation, set the address width of the lowest numbered buffer_location to at least ceil(log2(the largest allocation size made + 1024)) bits. In hardware, tie the unused address bits to 0 if on input, and leave them dangling if on output.

    The following allocation APIs are affected:
      • sycl::malloc_shared
      • sycl::malloc_host
      • sycl::malloc_device
      • sycl::aligned_alloc_shared
      • sycl::aligned_alloc_host
      • sycl::aligned_alloc_device
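    The address-width rule above, ceil(log2(the largest allocation size made + 1024)), can be evaluated ahead of time. The following is a minimal sketch: the function name min_address_width_bits is hypothetical, and it assumes the allocation size is expressed in bytes, which the note does not state explicitly.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: smallest address width (in bits) that satisfies
// ceil(log2(largest_allocation + 1024)).
// Assumptions: the size is in bytes, and size + 1024 does not overflow
// a 64-bit unsigned integer.
constexpr unsigned min_address_width_bits(std::uint64_t largest_allocation) {
  const std::uint64_t span = largest_allocation + 1024;
  unsigned bits = 0;
  // Find the smallest width such that 2^bits >= span.
  while (bits < 64 && (std::uint64_t{1} << bits) < span) {
    ++bits;
  }
  return bits;
}
```

    For example, a largest allocation of 1 MiB (1048576) gives 1049600 after adding 1024, which needs 21 address bits under these assumptions.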
  • When running FPGA pipelined kernels in simulation flow, you might not achieve the lowest II in the waveforms.
  • In the FPGA acceleration flow, pipes that use protocol<protocol_name::avalon_mm_uses_ready> produce a compiler error. To prevent this error, pass data between host and kernel (while it is running) through a different method, such as host USM.
  • For Intel® FPGA PAC D5005, previously known as Intel® PAC with Intel Stratix® 10 SX FPGA, there is a known issue where a sequence of store operations in your FPGA kernel followed by a load operation produces incorrect results. The PAC BSP has been discontinued. Contact your Intel representative for the availability of any replacement BSPs for the Intel FPGA PAC D5005.
    For example, in the following code, the data[1] = data[0]; operation occurs before the store operations on earlier lines. This incorrect order of operations results in an incorrect value for data[1].
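    #include <iostream>
    #include <sycl/sycl.hpp>
    #include <sycl/ext/intel/fpga_extensions.hpp>

    int main() {
      sycl::queue q{sycl::ext::intel::fpga_selector_v};
      volatile int *data = sycl::malloc_host<int>(2, q);
      data[0] = 0;
      q.parallel_for(sycl::range<1>(128), [=] (sycl::id<1> index) {
        if (index == 0) {
          // Sequence of store operations to data[0]
          #pragma unroll
          for (int i = 0; i < 100; ++i)
            *data = i;
          // Load that should observe the final store
          data[1] = data[0];
        }
      }).wait();
      std::cout << "Result: " << data[1] << "\n";
    }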

     

  • For Intel Stratix 10 FPGA reference boards, a rare failure can occur when initializing internal memory where the memory is initialized into an unknown state that can cause unexpected behavior.
    As a workaround, compile your design with the -Xsbsp-flow=flat compiler option to avoid this issue.
  • For simulations of large FPGA designs, such as those that target Intel Agilex 7 boards, you might receive a linker error that contains a PC-relative offset overflow message.
    If you receive this message, compile your simulation with the -fsycl-link-huge-device-code compiler command option.
  • For FPGA kernels, when you specify the no-interleaving=default property, the buffers are not burst-interleaved. Their addresses are still assigned according to the mem_channel properties.
    The expected behavior is that the mem_channel property should be ignored, and buffers should be burst-interleaved when the no-interleaving flag is not provided.
    There is currently no workaround.
  • For FPGA devices, channel widths are limited to 4096 bits in Intel Quartus Prime Version 21.4 and earlier. If you exceed the channel width limitation, you receive an error like the following message:
    <name>_pipe_channel_read: dataBitsPerSymbol 5120 is out of range: 1-4096
    To avoid this error, ensure that you use Intel Quartus Prime software Version 22.1 or later.
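    When you cannot upgrade, a compile-time check can flag payload types that exceed the limit. The following is a minimal sketch: the names kMaxChannelBits, payload_bits, and fits_legacy_channel_width are hypothetical, and it assumes that sizeof(T) * CHAR_BIT approximates the channel data width, which may not hold for every data type.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>

// Limit quoted above for Intel Quartus Prime Version 21.4 and earlier.
constexpr std::size_t kMaxChannelBits = 4096;

// Assumption: the in-memory size of T approximates the channel width.
template <typename T>
constexpr std::size_t payload_bits() {
  return sizeof(T) * CHAR_BIT;
}

template <typename T>
constexpr bool fits_legacy_channel_width() {
  return payload_bits<T>() <= kMaxChannelBits;
}

// Illustrative payload types (hypothetical, not from the release notes).
struct WidePayload { std::int32_t lanes[160]; };    // 160 * 32 = 5120 bits
struct NarrowPayload { std::int32_t lanes[128]; };  // 128 * 32 = 4096 bits

static_assert(!fits_legacy_channel_width<WidePayload>(),
              "5120 bits exceeds the 4096-bit limit");
static_assert(fits_legacy_channel_width<NarrowPayload>(),
              "4096 bits is within the limit");
```

    The 5120-bit WidePayload type matches the width shown in the example error message.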
  • When using the FPGA aocl command on a Microsoft* Windows* system, you might get the following error:
    aocl.exe: Unable to determine the execution environment of the Intel(R) FPGA SDK for OpenCL(TM).
    aocl.exe:     Detailed error: Could not determine the path to SDK internal Perl executable
    If you get this error, complete the following steps:
    1. Copy aocl.exe from <oneAPI-install-location>\oneAPI\2024.0\bin to <oneAPI-install-location>\oneAPI\compiler\2024.0\opt\oclfpga\bin
    2. Add <oneAPI-install-location>\oneAPI\compiler\2024.0\opt\oclfpga\bin to your PATH environment variable before <oneAPI-install-location>\oneAPI\2024.0\bin
  • In the FPGA Optimization Report, designs with multiple lambda kernels report inaccurate results unless the lambda kernels are all given unique names. For information about how to name lambda kernels, refer to "Suggested Coding Styles" in the Intel oneAPI FPGA Handbook (https://www.intel.com/content/www/cn/zh/docs/oneapi-fpga-add-on/developer-guide/current/suggested-kernel-coding-styles.html).
    For lambda kernels generated in a loop, use templated classes to give the kernels procedurally generated names.
  • For FPGA ap_float data types, the -fp-model=fast compiler command option does not enable dot product inference.
    To enable dot product inference for ap_float data types, use the -Xsffp-reassociate compiler command option.
  • When you use Intel® Quartus® Prime Version 23.2 or later as part of the SYCL* HLS flow, a simulation compilation fails when your design contains a memory interface that satisfies the following conditions:
    • The memory interface uses a global memory ring interconnect.
    • The memory interface port direction is read/write or read-only.
    • The memory interface has a fixed latency (that is, the latency is not explicitly set to 0)

      The error is similar to the following error message:
      Error: mm_agent_ks_mem0_rw.mm_agent_ks_mem0_rw.s0: Agent with readdatavalid must use waitrequest.
  • For simulation in the FPGA Acceleration flow, a design that uses USM pointers to be allocated to <global_mem_name> but has no annotation or buffer location property that informs the compiler of where it is to be allocated fails at run time with an error when the board_spec.xml file for the target board meets the following conditions:
    • It defines more than one device global memory.
    • The difference between the minimum and the maximum addresses across all the global memories is larger than the size of some global memory named <global_mem_name>.
    • The minimum address of <global_mem_name> is larger than the size of itself.

      The error is similar to the following error message:
      Error: Out of bounds memory write attempted at address <some_address>, for <size> bytes, max_size = <global_mem_name_size>
      To work around this error, add an annotation or buffer location property to the USM pointer that specifies where it is to be allocated.
  • For FPGA, using a loop like the following example to count the bits in an unsigned integer, excluding leading zeros, can lead to a compiler error such as "Compiler Error: undefined reference to 'llvm.ctlz.iN'":
    unsigned int leading_zeros = 0;
    while (number) {
      leading_zeros += 1;
      number >>= 1;
    }

    You can work around this issue by first getting the number of leading zeros using the built-in functions __builtin_clz(unsigned) or __builtin_clzll(unsigned long long). When counting the leading zeros of an unsigned char or unsigned short using the built-in functions, deduct the number of bits extended during type conversion from the return value.
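    The workaround can be sketched as follows. The helper names bit_length and leading_zeros_u8 are hypothetical, and the code assumes a compiler that provides the GCC/Clang __builtin_clz built-in. Because the result of __builtin_clz(0) is undefined, zero is handled separately.

```cpp
#include <cassert>
#include <climits>

// Number of bits in `number` excluding leading zeros, computed from the
// leading-zero count instead of a shift loop.
inline unsigned bit_length(unsigned number) {
  if (number == 0) return 0;  // __builtin_clz(0) is undefined
  const unsigned width = static_cast<unsigned>(sizeof(unsigned) * CHAR_BIT);
  return width - static_cast<unsigned>(__builtin_clz(number));
}

// Leading zeros of an unsigned char. The value is widened to unsigned
// before the built-in is called, so the bits added by that conversion
// are deducted from the result.
inline unsigned leading_zeros_u8(unsigned char value) {
  constexpr unsigned kExtendedBits = static_cast<unsigned>(
      (sizeof(unsigned) - sizeof(unsigned char)) * CHAR_BIT);
  if (value == 0) return CHAR_BIT;  // all bits of the narrow type are zero
  return static_cast<unsigned>(__builtin_clz(static_cast<unsigned>(value))) -
         kExtendedBits;
}
```

    Here bit_length returns the same count that the loop in the example produces, without using the shift loop.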

  • In the FPGA SYCL* HLS flow, if your design contains a "write-only" pointer-type kernel argument defined by an annotated_arg<> with buffer location "X" in any kernel, compiling for simulation fails if all the following conditions are true:
    • None of the pointers with buffer location "X" are accessed in any kernel body.
    • There are no unannotated pointer-type kernel arguments defined or used in any of the kernels in the design.

      For example, compiling the following design for simulation causes the compilation to fail:
      struct kernelA {
        annotated_arg<int*, properties{buffer_location<0>, readwrite_mode_write}> a;
        void operator()() {}  // a is never used, so compiling for simulation fails
      };

System Requirements

Additional Documentation

Previous oneAPI Releases

Notices and Disclaimers

Intel technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from a course of performance, course of dealing, or usage in trade.