Intel® oneAPI Math Kernel Library (oneMKL) provides ARS-5 and Philox4x32-10 counter-based basic random number generators, introduced by John K. Salmon, Mark A. Moraes, Ron O. Dror, and David E. Shaw in “Parallel random numbers: as easy as 1, 2, 3” [1].

ARS-5 is a basic random number generator based on the Advanced Encryption Standard (AES); it applies five iterations of the algorithm (referred to below as rounds) to the generator’s state. Philox4x32-10 relies on a substitution-permutation network (SP network) that produces a highly diffusive bijection by applying 10 rounds to four 32-bit inputs (see “Parallel random numbers: as easy as 1, 2, 3” [1] for details). Both generators produce a random number sequence with a period of 2^130 ≈ 1.4 * 10^39.
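
The period figure can be read as follows, assuming (as described in [1]) that each generator advances a 128-bit counter and returns four 32-bit values per counter value. This is an illustrative reading of the number, not a statement taken from the oneMKL documentation:

```latex
\underbrace{2^{128}}_{\text{counter values}} \times \underbrace{4}_{\text{32-bit outputs per counter value}}
  = 2^{130} = \left(2^{10}\right)^{13} = 1024^{13} \approx 1.36 \times 10^{39}
```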

ARS-5 and Philox4x32-10 have good statistical properties: according to the authors of [1], counter-based random number generators pass the “TestU01 BigCrush” test battery, and oneMKL independently verified them with the “Diehard” test battery [2]. They also generate random number sequences quickly (details can be found in oneMKL Vector Statistics Performance Data [3]). In addition, their small state size makes them easy to vectorize and parallelize on different hardware. This point is also discussed in the KB article “New counter-based Random Number Generators in Intel® Math Kernel Library” [4].
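
The link between small state and easy parallelization can be illustrated with a minimal sketch. It assumes a toy mixing function (the SplitMix64 finalizer) as a stand-in for the real bijection; it is not ARS-5 or Philox4x32-10, and the names `toy_mix`, `toy_cbrng`, `toy_generate`, and `toy_generate_chunked` are hypothetical. The point is only that element i of the sequence depends on nothing but the key and the counter value i, so any part of the sequence can be produced independently:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a counter-based bijection (SplitMix64 finalizer).
// NOT ARS-5 or Philox4x32-10; it only illustrates the stateless structure.
inline std::uint64_t toy_mix(std::uint64_t x) {
    x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ULL;
    x ^= x >> 27; x *= 0x94D049BB133111EBULL;
    x ^= x >> 31;
    return x;
}

// Hypothetical helper: the output depends only on (key, counter),
// so there is no running state to carry from one call to the next.
inline std::uint64_t toy_cbrng(std::uint64_t key, std::uint64_t counter) {
    return toy_mix(toy_mix(key) ^ counter);
}

// Fill out[0..count) with sequence elements first, first + 1, ...
inline void toy_generate(std::uint64_t key, std::uint64_t first,
                         std::uint64_t* out, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = toy_cbrng(key, first + i);
    }
}

// Produce n elements in up to `chunks` independent pieces. Each piece starts
// at its own counter offset, as a separate thread or work-item could.
inline std::vector<std::uint64_t> toy_generate_chunked(std::uint64_t key,
                                                       std::size_t n,
                                                       std::size_t chunks) {
    if (chunks == 0) chunks = 1;  // avoid division by zero
    std::vector<std::uint64_t> out(n);
    const std::size_t step = (n + chunks - 1) / chunks;
    for (std::size_t begin = 0; begin < n; begin += step) {
        const std::size_t count = (n - begin < step) ? (n - begin) : step;
        toy_generate(key, begin, out.data() + begin, count);
    }
    return out;
}
```

Calling `toy_generate_chunked(key, n, 7)` yields the same values as one sequential `toy_generate(key, 0, ...)` pass over n elements, because each chunk only needs its starting counter value. With a stateful generator, each chunk would first have to advance through all preceding elements or rely on a dedicated skip-ahead routine.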

Recently, the oneMKL ARS-5 engine was further optimized with the Vector Advanced Encryption Standard (VAES) instruction set, the vector AES encryption and decryption instructions introduced in Ice Lake (more information can be found in the Intel® Intrinsics Guide [5]). The performance of the ARS-5 and Philox4x32-10 engines on Intel® Xeon® Platinum 8280L (Cascade Lake Server) and Intel® Xeon® Platinum 8380 (Ice Lake Server) is compared below:

Assumptions: sequential (single thread) generation mode; measured region – generation of single precision random numbers uniformly distributed with a = 0, b = 1.

With the VAES instruction set, ARS-5 runs up to 3.9 times faster on Ice Lake Server hardware. The Philox4x32-10 engine, which does not use VAES, also runs about 1.12 times faster because of other hardware characteristics (cache size and number of execution ports).

Starting with the oneMKL 2021.1 release, the ARS-5 and Philox4x32-10 generators are also available through Data Parallel C++ (DPC++) APIs. Both engines support CPU devices, and the Philox4x32-10 engine also supports Intel GPU devices:

#include <vector>
#include <CL/sycl.hpp>
#include "oneapi/mkl.hpp"

int main() {
    sycl::queue queue;
    const size_t n = 10000;
    // create USM allocator
    sycl::usm_allocator<double, sycl::usm::alloc::shared> allocator(queue);
    // create vector with USM allocator
    std::vector<double, decltype(allocator)> r(n, allocator);
    // create basic random number generator object
    // for the ARS-5 engine, the call would be: oneapi::mkl::rng::ars5 engine(queue);
    oneapi::mkl::rng::philox4x32x10 engine(queue);
    // create distribution object; its element type must match the output vector
    oneapi::mkl::rng::uniform<double> distr;
    // perform generation into the USM memory owned by r
    auto event = oneapi::mkl::rng::generate(distr, engine, n, r.data());
    // sycl::event object is returned by generate function for synchronization
    event.wait(); // synchronization can be also done by queue.wait()
    return 0;
}

You can also execute the Philox4x32-10 engine on GPUs through OpenMP offload APIs and DPC++ device APIs (which can be called from DPC++ kernels [6]). Code examples are presented below.

OpenMP offload APIs usage example:

#include "mkl.h"
#include "mkl_omp_offload.h"

int main() {
    int dnum = 0;
    const MKL_INT n = 10000;
    float* r_dev = (float*)mkl_malloc((n) * sizeof(float), 64);
    VSLStreamStatePtr stream_dev;
    float a = 0.0f, b = 1.0f;
    // initialize Basic Random Number Generator
    vslNewStream(&stream_dev, VSL_BRNG_PHILOX4X32X10, 1);
    // map the output array to the device for the duration of the region
    #pragma omp target data map(tofrom:r_dev[0:n]) device(dnum)
    {
        // run RNG on gpu, use standard oneMKL interface within a variant dispatch construct
        #pragma omp target variant dispatch device(dnum) use_device_ptr(r_dev)
        {
            vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream_dev, n, r_dev, a, b);
        }
    }
    // deinitialize: delete the stream and free the output array
    vslDeleteStream(&stream_dev);
    mkl_free(r_dev);
    return 0;
}

Device DPC++ APIs usage example:

#include <iostream>
#include <vector>
#include <CL/sycl.hpp>
#include "oneapi/mkl/rng/device.hpp"

int main() {
    sycl::queue queue;
    const size_t n = 10000;
    std::vector<float> r_dev(n);
    // submit a kernel to generate on device
    {
        sycl::buffer<float, 1> r_buf(r_dev.data(), r_dev.size());
        try {
            queue.submit([&](sycl::handler& cgh) {
                auto r_acc = r_buf.template get_access<sycl::access::mode::write>(cgh);
                cgh.parallel_for(sycl::range<1>(n), [=](sycl::item<1> item) {
                    // seed 1; the offset gives each work-item its own position in the sequence
                    oneapi::mkl::rng::device::philox4x32x10<> engine(1, item.get_id(0));
                    oneapi::mkl::rng::device::uniform<float> distr;
                    float res = oneapi::mkl::rng::device::generate(distr, engine);
                    r_acc[item.get_id(0)] = res;
                });
            });
        }
        catch (sycl::exception const& e) {
            std::cout << "\t\tSYCL exception\n" << e.what() << std::endl;
        }
    } // buffer life-time ends, results are copied back to r_dev
    return 0;
}


