Intel® oneAPI Math Kernel Library (oneMKL) 2023 Release Notes

2023.2.2

System Requirements

This is a bugfix release for macOS*.

Fixed Issues

Removed the deprecation warning for SSE4.2 on macOS*.

2023.2.1

System Requirements

This is a bugfix release for macOS*.

Fixed Issues

Fixed inaccurate macOS SDK version in oneMKL binary.

2023.2

System Requirements Bug Fix Log

NOTE for Intel® Distribution for HPCG Benchmark: Currently HPCG is available for CPU only.

Features

General
- Features
  - Extended SYCL backend support for OpenMP offload functionality. Now available for BLAS, Sparse BLAS, LAPACK, DFT and RNG.
Intel® Distribution for LINPACK* Benchmark
- Features
  - Introduced Intel® Distribution for LINPACK* Benchmark for Intel GPUs, Intel® Optimized HPL-AI* Benchmark for CPU and Intel GPUs.
BLAS
- Features
  - Introduced alternate mode functionality for complex GEMM that can provide improved performance, potentially in exchange for reduced accuracy.
- Optimizations
  - Improved general performance of BLAS level-3 routines on Intel® Data Center GPU Max and Flex Series.
  - Improved performance of oneapi::mkl::blas::{row,column}_major::{symm,hemm} on Intel® Data Center GPU Max Series.
Sparse BLAS
- Features
  - Introduced SYCL API for modifying the diagonal values of a sparse matrix handle.
  - Introduced support for transpose sparse matrix-dense vector product through oneapi::mkl::sparse::gemv on GPU.
- Optimizations
  - Improved performance of oneapi::mkl::sparse::gemv on Intel® Data Center GPU Max Series and Intel® Iris® Xe MAX Graphics.
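
To make the transposed sparse matrix-dense vector product concrete, the following is a minimal pure-Python sketch of y = A^T x for a matrix stored in zero-based CSR form. It is not the oneapi::mkl::sparse::gemv API; the function name csr_gemv_transposed and its argument layout are hypothetical and chosen only for illustration.

```python
def csr_gemv_transposed(n_rows, n_cols, row_ptr, col_idx, values, x):
    """Return y = A^T x for an n_rows x n_cols matrix A in zero-based CSR form.

    Illustrative only: this is not the oneMKL API or its implementation.
    """
    if len(x) != n_rows:
        raise ValueError("x must have one entry per row of A")
    y = [0.0] * n_cols
    for i in range(n_rows):
        # Row i of A contributes values[k] * x[i] to entry col_idx[k] of y,
        # so the transpose is applied without ever forming A^T explicitly.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[col_idx[k]] += values[k] * x[i]
    return y


# A = [[1, 0, 2],
#      [0, 3, 0]]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
values = [1.0, 2.0, 3.0]

y = csr_gemv_transposed(2, 3, row_ptr, col_idx, values, [10.0, 100.0])
print(y)
```

The scatter-style update is the reason a transposed product usually needs different handling on parallel hardware than the non-transposed one: several rows can write to the same output entry.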
LAPACK
- Features
  - Introduced SYCL USM APIs to compute the solution to a system of linear equations (oneapi::mkl::lapack::gesv).
  - Introduced C and Fortran OpenMP* offload APIs for triangular matrix inverse (?trtri).
- Optimizations
  - Improved performance of pivoting and non-pivoting versions of batched LU factorization (oneapi::mkl::lapack::{getrf_batch, getrfnp_batch}) for both strided and group SYCL APIs on Intel® Data Center GPU Max Series.
  - Improved performance of Cholesky factorization (oneapi::mkl::lapack::potrf) and triangular matrix inverse (oneapi::mkl::lapack::trtri) on Intel GPUs.
DFT
- Features
  - Enabled dynamic linking for FFTW OpenMP* offload 32-bit integer APIs (lp64 interface) on Windows*.
- Optimizations
  - Improved performance of 1D complex FFT of power of two sizes from 2^19 to 2^26 on Intel® Data Center GPU Max Series.
  - Improved performance of 1D complex FFT of size less than 4096 on Intel® Data Center GPU Max Series.
  - Improved performance of 2D and 3D complex FFT of size less than 300 in each dimension on Intel® Data Center GPU Max Series.
  - Improved performance of 2D and 3D real double precision FFT that can fit into the last level of cache on Intel® Data Center GPU Max Series.
Vector Math
- Features
  - Introduced FP16 (sycl::half) support for the SYCL API.
  - Introduced accuracy selection at runtime in the SYCL Device API.
  - Enabled explicit 64-bit (ILP64) interface with suffix “_64”, allowing 64-bit and 32-bit indices to be used from a single LP64 interface library.
- Optimizations
  - Performance optimizations for single precision sinpi, cospi, sincospi, sincos, remainder, fmod on Intel GPUs by up to 15%, and on Intel® Xeon® CPUs by up to 30%.

Known Issues and Limitations

oneMKL DFT SYCL APIs may fail to compute correct results for 2D and 3D real FFT when using user allocated SYCL buffer workspace and the OpenCL™ runtime.
oneMKL DFT SYCL APIs using SYCL buffer for data input do not support SYCL sub-buffer inputs for a range of large power of two sizes (2^21 – 2^26) 1D complex FFT.
The getri_batch_usm and getri_oop_batch_usm LAPACK examples that are located at ${MKLROOT}/examples/dpcpp/lapack may fail on Intel® Iris® Xe MAX Graphics on Windows in debug_mode.
On Intel® Iris® Xe MAX Graphics, {c,s}getrfnp_batch functions may hang or have a segmentation fault. As a workaround, use the {c,s}getrfnp_batch_strided functions instead.
OpenMP* offload of Fortran BLAS functions ctrsv, ctpsv, and ctbsv to GPU under Windows in static linking mode may crash. As a workaround, use dynamic linking mode.
OpenMP* offload of Fortran LAPACK functions cpotrf, cpotri, cpotrs, ctrtri, spotrf, spotri, spotrs, strtri to GPU under Windows in static linking mode may crash. As a workaround, use dynamic linking mode.
SYCL BLAS and BLAS-like group batch APIs may crash intermittently on Intel GPU under Windows with oneAPI Level Zero backend.
Calling mkl_sparse_optimize together with mkl_sparse_mv_hint through the Sparse BLAS C OpenMP* offload interface with asynchronous execution sometimes hangs on GPU. As a workaround, use synchronous execution for the optimize step.
oneapi::mkl::sparse::trsv with sycl::buffer inputs may crash with a segmentation fault when any of the CSR matrix data arrays and the solution vectors are themselves sub-buffer(s) of a sycl::buffer.
The mrg32k3a random number engine may fail on Intel® Arc™ A-Series Graphics GPUs on Windows when the /Od option is enabled.
Random number generators Device APIs with enabled Vector Math Device APIs underneath do not work on Intel GPUs that don’t have native double precision support due to Vector Math restrictions.
oneMKL SYCL DLL could leak memory after unloading on Windows. The problem can be avoided by adding mkl_free_buffer before unloading the DLL.
Using the Clang compiler 14.0.0 on macOS* to link against the oneMKL Intel threading library may emit warnings similar to:
  ld: warning: alignment (4) of atom '.2.16_2_kmpc_loc_struct_pack.60' is too small and may result in unaligned pointers
Workaround: Set the environment variable MACOSX_DEPLOYMENT_TARGET to 10.14.
Intel® VML functions may raise spurious FP exceptions even if the (default) VML_ERRMODE_EXCEPT is not set. Recommendation: do not unmask FP exceptions before calling VML functions.

Deprecations

As the industry has largely shifted to 64-bit architecture over the last decade, Intel® oneMKL 32-bit binaries will be deprecated in the upcoming Intel® oneMKL 2024.0 release and are targeted for removal after a one-year deprecation notice period. If you are currently using the 32-bit binaries, please consider upgrading to our 64-bit options. You can share your feedback or any concerns on the oneMKL Community Forum or through Priority Support.
PGI* support is deprecated and will be removed in the oneMKL 2025.0 release.

2023.1

System Requirements Bug Fix Log

Features

BLAS and Sparse BLAS
- Features
  - Improved parameter checking and error handling in BLAS Level-2 Routines.
- Optimizations
  - Improved general performance for BLAS Level-3 Routines on Intel® Data Center GPU Max Series.
  - Improved performance of solving a triangular matrix equation (oneapi::mkl::blas::{row,column}_major::trsm) on Intel GPU.
  - Improved performance of oneapi::mkl::sparse::{matmat, gemm, gemv} for a wider range of matrix sizes on Intel® Data Center GPU Max Series.
LAPACK
- Features
  - Introduced batch strided least squares solver (?gels_batch_strided) functionality with C and Fortran APIs on CPU.
- Optimizations
  - Improved performance of Cholesky inverse (oneapi::mkl::lapack::potri), triangular matrix inverse (oneapi::mkl::lapack::trtri), and batch group LU inverse (oneapi::mkl::lapack::getri_batch) on Intel® Data Center GPU Max Series.
  - Improved performance of real precision generalized eigensolvers {d,s}gges and {d,s}ggev on CPU.
  - Improved performance of tridiagonal eigensolver (?steqr) for small sizes on CPU.
DFT
- Features
  - Added Fortran OpenMP offload support for real FFT with FFTW3 APIs.
  - Added support for the FWD_DISTANCE and BWD_DISTANCE parameters in the SYCL DFT get_value API.
  - Added OpenMP 5.1 dispatch construct support for out of place FFT with the DFTI C OpenMP offload APIs.
- Optimizations
  - Improved performance of FFT commit/planning stage for selected 1D and 2D complex FFT of power of two sizes on Intel Data Center GPU Max Series.
  - Improved performance of 1D, 2D and 3D single precision real FFT that do not exceed 192 MB and whose first or second dimension can be expressed as two non-prime factors that are no greater than 64 on Intel® Data Center GPU Max Series.
  - Improved performance of 1D double precision complex FFT of power of two sizes, from 512 to 4096, on Intel® Data Center GPU Max Series.
  - Improved performance of 2D and 3D complex FFT of power of two sizes, from 64 to 512, in each dimension on Intel® Data Center GPU Max Series.
  - Improved performance of 1D complex FFT of sizes between 20000 and 40000 with a large power of 7 factor on Intel® Data Center GPU Max Series.
  - Improved performance of 1D and 2D complex FFT of sizes 4096, 8192 and 16384 in each dimension on Intel® Data Center GPU Max Series.
Vector Statistics
- Features
  - Introduced new Gaussian ICDF-based method in Random Number Generators (RNG) Device APIs.
  - Introduced uniform_bits distribution in RNG Device APIs.
  - Enabled _64 C-APIs for the set of RNG and Summary Statistics functions in order to use LP64 and ILP64 interfaces in the same application.
  - Enabled SYCL-based interoperability object support for OpenMP Offload.
  - Enabled RNG Device APIs with Vector Math Device APIs for Gaussian, Lognormal and Exponential distributions that may improve performance on Intel® Data Center GPU Max Series.
- Optimizations
  - Improved Device API performance of philox4x32x10 and mrg32k3a RNG engines on Intel® Data Center GPU Max Series.
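
As background for the ICDF-based Gaussian method mentioned above, here is a minimal sketch of the general inverse-CDF technique: draw uniform numbers in (0, 1) and map each through the inverse normal CDF. It uses only the Python standard library and says nothing about how the oneMKL RNG Device API implements the method; the helper name gaussian_icdf is hypothetical.

```python
import random
from statistics import NormalDist


def gaussian_icdf(uniforms, mean=0.0, stddev=1.0):
    """Map uniform samples in the open interval (0, 1) to Gaussian samples.

    Generic inverse-CDF transform, shown for illustration only.
    """
    dist = NormalDist(mean, stddev)
    return [dist.inv_cdf(u) for u in uniforms]


rng = random.Random(2023)  # fixed seed so the sketch is reproducible
uniforms = []
while len(uniforms) < 5:
    u = rng.random()       # random() returns values in [0, 1)
    if u > 0.0:            # inv_cdf is undefined at exactly 0
        uniforms.append(u)

samples = gaussian_icdf(uniforms)
print(samples)
```

One practical property of the inverse-CDF approach is that it consumes exactly one uniform number per Gaussian output and preserves ordering, which keeps the mapping between the underlying engine stream and the generated values simple.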

Known Issues and Limitations

MKLConfig.cmake does not support the oneMKL Conda package structure.
Using the Clang compiler 14.0.0 on macOS to link against the oneMKL Intel threading library may emit warnings similar to:
  ld: warning: alignment (4) of atom '.2.16_2_kmpc_loc_struct_pack.60' is too small and may result in unaligned pointers
Workaround: Set the environment variable MACOSX_DEPLOYMENT_TARGET to 10.14.
oneMKL SYCL DLL for Debug CRT "mkl_sycld.3.dll" could fail with error code 0xc0000409 during DLL unloading if BLAS Level-3, LAPACK, or DFT functionality was used on Intel GPU with oneAPI Level Zero backend. The problem can be avoided by adding mkl_free_buffer before the end of an application or before unloading the DLL. 
oneMKL caches for GPU functionality might not be cleaned up completely by mkl_free_buffer.
Some LAPACK functions may experience segmentation faults when using the OpenCL backend. As a workaround, use the oneAPI Level Zero backend instead.
On Intel® Data Center GPU Max Series, the OpenMP offload of dgetrf_batch_strided function may produce incorrect results for size m=n=64.
On Intel® Iris® Xe MAX Graphics on Windows, single precision complex versions of certain batched LAPACK functions (getri, getrs) may throw an exception. As a workaround, call the non-batched version of the function.
On Intel® Iris® Xe MAX Graphics, {c,s}getrfnp_batch functions may hang or have a segmentation fault. As a workaround, use the {c,s}getrfnp_batch_strided functions instead.
FFT on Intel® Advanced Vector Extensions 512 (Intel® AVX-512) CPUs and Linux 32-bit OS might not give correct result when the system is heavily oversubscribed.
RNG Device APIs with enabled Vector Math Device APIs underneath do not work on Intel GPUs that don’t have native double precision support due to Vector Math restrictions. 
The example located at ${MKLROOT}/examples/dpcpp_device/rng/uniform may fail on Intel® Arc™ A-Series Graphics GPUs on Windows when the /Od option is enabled.
Sparse BLAS SYCL functionality may result in crashes when used on Windows with debug mode.
Sporadic failures, such as segmentation faults or PI_ERROR_INVALID_EVENT_WAIT_LIST and PI_ERROR_INVALID_VALUE errors, are currently known to occur in some Sparse BLAS functionality such as oneapi::mkl::sparse::{optimize_trsv, matmat, omatcopy}.
Some BLAS functions may experience accuracy issues or segmentation faults when using the OpenCL backend. As a workaround, use the oneAPI Level Zero backend instead.
oneMKL Vector Math functions for OpenMP Offload for C and Fortran cannot be used with the single dynamic library (mkl_rt).
Intel® VML functions may raise spurious FP exceptions even if the (default) VML_ERRMODE_EXCEPT is not set. Recommendation: do not unmask FP exceptions before calling VML functions.

2023.0

System Requirements Bug Fix Log

Features

BLAS and BLAS-like Extensions
- Features
  - New alternate computation mode functionality for Level-3 routines that can provide improved performance, potentially in exchange for reduced accuracy.
  - More flexible implementation of omatadd that now supports addition operations that update (rather than replace) the output matrix.
  - Improved argument checking and error messages for the BLAS-like extensions omatcopy, imatcopy, omatadd, and their batched variants.
- Optimizations
  - Improved performance for many Level-2 and Level-3 routines on Intel GPUs.
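
The "update rather than replace" behavior of omatadd noted above can be pictured with a small pure-Python sketch of the operation C = alpha*op(A) + beta*op(B), where the current contents of C are passed in as the second operand. This is an assumption-laden illustration, not the oneMKL interface: the function omatadd_sketch, its keyword arguments, and the use of nested lists are all hypothetical, and any aliasing restrictions of the real routine are not modeled.

```python
def transpose(m):
    """Return the transpose of a row-major nested-list matrix."""
    return [list(col) for col in zip(*m)]


def omatadd_sketch(alpha, a, beta, b, trans_a=False, trans_b=False):
    """Return alpha*op(A) + beta*op(B) as a new nested-list matrix.

    Illustrative only; the real omatadd works on device memory and has its
    own parameter list.
    """
    op_a = transpose(a) if trans_a else a
    op_b = transpose(b) if trans_b else b
    same_shape = len(op_a) == len(op_b) and all(
        len(ra) == len(rb) for ra, rb in zip(op_a, op_b)
    )
    if not same_shape:
        raise ValueError("op(A) and op(B) must have the same shape")
    return [
        [alpha * x + beta * y for x, y in zip(ra, rb)]
        for ra, rb in zip(op_a, op_b)
    ]


a = [[1.0, 2.0], [3.0, 4.0]]
c = [[10.0, 20.0], [30.0, 40.0]]

# Update form: the old C is an input, so the result is 2*A^T + 1*C.
c = omatadd_sketch(2.0, a, 1.0, c, trans_a=True)
print(c)
```

With beta set to 0 the same call degenerates to the "replace" form, in which the previous contents of the output play no role.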
LAPACK
- Features
  - Introduced TRTRI (triangular matrix inverse) for SYCL USM APIs.
  - Extended SYCL interface for GELS_BATCH strided functionality and C/Fortran OpenMP offload APIs to support single and double precision complex cases on GPU. Note that transposed case is not supported.
- Optimizations
  - Improved performance of getrs_batch group and in-place getri_batch strided SYCL APIs on Intel GPUs.
  - Improved performance of ?GETRI, ?POTRI, and ?TRTRI inverse routines on CPU with OpenMP threading.
Sparse
- Features
  - Added a new in-place sparse matrix sorting C++ SYCL API, oneapi::mkl::sparse::sort_matrix(). The output sparse matrix from oneapi::mkl::sparse::matmat() and mkl_sparse_sp2m() C OpenMP is no longer guaranteed to be sorted, giving more performance and flexibility to users to sort only as needed.
  - Added a new out-of-place sparse matrix copy/transpose C++ SYCL API, oneapi::mkl::sparse::omatcopy(), with support for operations oneapi::mkl::transpose::nontrans and oneapi::mkl::transpose::trans.
  - Added new C/Fortran Inspector-Executor Sparse BLAS APIs: mkl_sparse_set_sorv_hint() and mkl_sparse_?_sorv() with an optimized sequential algorithm for applying the SOR preconditioner for a matrix system, Ax = b, where x and b are vectors and A is a sparse matrix in CSR format.
- Optimizations
  - Improved performance of C++ SYCL API oneapi::mkl::sparse::gemv() on Intel GPUs.
  - Improved performance for C++ SYCL API oneapi::mkl::sparse::gemm() on Intel GPUs.
  - Improved performance for C++ SYCL API oneapi::mkl::sparse::matmat() on Intel GPUs.
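
For readers unfamiliar with the SOR preconditioner mentioned for the system Ax = b with A in CSR format, the sketch below shows one textbook forward SOR sweep in pure Python. It illustrates the sequential algorithm only and does not reproduce the signature or options of mkl_sparse_?_sorv(); the name sor_forward_sweep and the relaxation factor value are assumptions made for the example.

```python
def sor_forward_sweep(row_ptr, col_idx, values, b, x, omega):
    """Apply one in-place forward SOR sweep to x for A x = b (zero-based CSR).

    Textbook formulation, shown for illustration only.
    """
    for i in range(len(b)):
        diag = None
        off_diag_sum = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j == i:
                diag = values[k]
            else:
                # Entries with j < i already hold this sweep's updated values,
                # which is why the algorithm is inherently sequential.
                off_diag_sum += values[k] * x[j]
        if not diag:
            raise ValueError("row %d has a missing or zero diagonal" % i)
        x[i] = (1.0 - omega) * x[i] + omega * (b[i] - off_diag_sum) / diag
    return x


# A = [[4, 1],
#      [1, 3]],  b = [1, 2]; the exact solution is x = [1/11, 7/11].
row_ptr = [0, 2, 4]
col_idx = [0, 1, 0, 1]
values = [4.0, 1.0, 1.0, 3.0]
b = [1.0, 2.0]

x = [0.0, 0.0]
for _ in range(50):
    sor_forward_sweep(row_ptr, col_idx, values, b, x, omega=1.1)
print(x)
```

Setting omega to 1 reduces the sweep to Gauss-Seidel; when SOR is used as a preconditioner, typically only one or a few sweeps are applied per outer iteration rather than iterating to convergence as done here.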
DFT
- Features
  - Extended DFT SYCL APIs (oneapi::mkl::dft::descriptor::set_workspace) to let users manage the temporary memory used for the FFT computation on GPU.
  - Enabled C OpenMP offload functionality for 1/2/3D real FFT using FFTW3 APIs.
  - Enabled support for OpenMP 5.1 dispatch construct for C and Fortran DFTi and FFTW3 OpenMP offload.
  - Removed support for deprecated real FFT packed storage formats.
- Optimizations
  - Improved performance for 1/2D complex scaled FFTs on CPU.
Vector Math
- Features
  - Vector Math supports calls from user kernels through the new Device API. Please see oneapi/mkl/vm/device/vm.hpp for details. Only static linking is supported, due to toolchain limitations.
  - Vector Math supports the new sycl::span API. Please see oneapi/mkl/vm/span.hpp for details. This interface accepts spans from all types of pointers, including stack and heap.
- Optimizations
  - The double precision complementary error function (erfc) has improved performance on Intel GPUs.
  - We added the double precision gamma function (tgamma) for Intel GPUs.
  - The OpenMP offload interoperability layer was reworked to remove rare instabilities.
Vector Statistics
- Features
  - Introduced sycl::span APIs for the multinomial, poisson_v and gaussian_mv random number distributions.
Library Engineering
- Features
  - Updated the version of mkl_sycl dynamic library to 3 (libmkl_sycl.so.3 on Linux and mkl_sycl.3.dll on Windows).

Known Issues and Limitations

Using a space as the delimiter for TARGET_DOMAINS and TARGET_FUNCTIONS might not work correctly on some systems during the oneMKL examples CMake configuration step. To fix the configuration step, use ";" as the delimiter instead.
The Fortran vslstream2file example may fail with the ifx compiler when the single dynamic library (mkl_rt) is used. As a workaround, use static or dynamic linking.
C DFTi APIs do not support OpenMP 5.1 dispatch construct for out of place FFTs.
FFTs with 4 GiB or more of total data are not supported on GPU devices.
FFT on Intel® Advanced Vector Extensions 512 (Intel® AVX-512) CPUs and Linux 32-bit OS might raise a floating point exception.
The oneapi::mkl::sparse::MATMAT() C++ SYCL API with sycl::buffer inputs sometimes shows incorrect output with sparse matrix products that result in a very large (>500 million non-zeros) output matrix C = A * B. A workaround is to use USM data.
The oneapi::mkl::sparse::TRSV() C++ SYCL API with sycl::buffer inputs sometimes shows incorrect output. The work-around is to use USM inputs and outputs.
Use Release mode as a workaround for non-functional Debug mode MKL_SPARSE_?_TRSV/MKL_SPARSE_SP2M C OpenMP offload APIs and sparse::TRSV/sparse::MATMAT on Win32e.
OpenMP Offload for Vector Math functions for C and Fortran cannot be used in "dynamic" linking mode.
The Vector Math Device API can be used only with static linkage of the MKL SYCL library (libmkl_sycl.a, mkl_sycl.lib).
The Vector Math Device API incorrectly assumes that single-precision versions of ASINH, ATAN2PI, CDFNORM, CDFNORMINV, COSD, DIV, ERFC, ERFCINV, FRAC, HYPOT, INVCBRT, INVSQRT, LGAMMA, LOG10, LOG2, SIND, SINH, TAND, TGAMMA for certain accuracies are present in the library under a certain name, while the actual functions in the libmkl_sycl.a have a different name. This is known and will be fixed for the next release.
Due to a known issue in the Intel OpenMP runtime, certain LAPACK routines may crash when run on the CPU when the application is compiled with the -fopenmp-targets=spir64 flag. Group batched LU-related routines GETR{F,FNP,I_OOP}_BATCH and QR-related auxiliary routines {OR,UN}M{QR,RQ,QL,LQ} are known to be impacted.
The LAPACK GETRI (inverse after LU factorization) routine may produce incorrect results for larger sizes on Intel® Gen9.
Calling SkipAhead service routine for mt19937 random number engine with the number of skipped elements more than 2^23 may cause an illegal instruction when executing on Intel® Advanced Vector Extensions 512 (Intel® AVX-512) architectures lower than the 3rd generation of Intel® Xeon® Scalable server processors.

Deprecation

The undocumented LAPACK routines {S,D}COMBSSQ, which were removed from Netlib LAPACK 3.10.1, are deprecated in Intel oneMKL 2023 and will be removed in a future release.
Intel® oneAPI Math Kernel Library (oneMKL) graph routines that have been in preview mode are now deprecated.

Previous oneAPI Releases

Notices and Disclaimers

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
