Extending R* with Intel® oneAPI Math Kernel Library (oneMKL)

ID 标签 659424
已更新 8/6/2024
版本 Latest
公共

author-image

作者

Introduction

The Using Intel oneMKL with R article discusses building R with the Intel® oneAPI Math Kernel Library (oneMKL) BLAS and LAPACK to improve the performance of those parts of R that rely on matrix computations. Since the BLAS and LAPACK routines are a long-standing set of functions with a standard API, the R build scripts have a built-in facility to use these. Setting R to use oneMKL can provide significant performance improvement on parts of R, but there are other functions in oneMKL that can provide great performance as well. This article will give you a quick tutorial on how to extend R by writing a wrapper for a function in oneMKL and call it from your R script. 

The wrapper

Let's begin by selecting a simple function in oneMKL that is not a part of the BLAS or LAPACK. For this exercise we'll choose the vdPow function which takes two arrays, x and y, and creates a new array z whose nth element is x[n] to the y[n] power.

#include "R.h" 
#include "Rdefines.h" 
#include "Rinternals.h" 
#include "mkl.h" 
SEXP mkl_vdpow(SEXP N, SEXP X, SEXP Y) {   
  SEXP Z;   
  int *n;   
  int nn;   
  double *x, *y, *z;     
/* Convert input arguments to C types */   
  PROTECT(N = AS_INTEGER(N));   
  PROTECT(X = AS_NUMERIC(X));   
  PROTECT(Y = AS_NUMERIC(Y));     
  n = INTEGER_POINTER(N);   
  x = NUMERIC_POINTER(X);   
  y = NUMERIC_POINTER(Y);     
/* Allocate memory to store results */   
  nn = *n;   
  PROTECT(Z = NEW_NUMERIC(nn));   
  z = NUMERIC_POINTER(Z);     
  vdPow(nn, x, y, z);     
  UNPROTECT(4);     
  return Z; 
}

Building the wrapper

On Windows this is easily done by obtaining the Windows package and Rtools from CRAN. I downloaded R 4.4.1 and Rtools version 4.4 and then created a file to set my environment as follows:

	SET "MINGW_ROOT=C:\rtools44\x86_64-w64-mingw32.static.posix"

	SET "PATH=%MINGW_ROOT%\bin;%MINGW_ROOT%\libexec\gcc\x86_64-w64-mingw32.static.posix\13.2.0;%PATH%"

    "C:\Program Files (x86)\Intel\oneAPI\compiler\2024.2\env\vars.bat"

    "C:\Program Files (x86)\Intel\oneAPI\mkl\2024.2\env\vars.bat"

In the above instructions, the compiler environment variables are sourced for the OpenMP* runtime.

I then build using the following commands:

    SET "R_HOME=C:\Program Files\R\R-4.4.1"

    SET "PATH=%R_HOME%\bin\x64;%PATH%"

	gcc -c -O2 -Wall -I"%R_HOME%\include" -I"%MKLROOT%\include" pow_wrp.c -o pow_wrp.o

	gcc -shared -Wl,-soname,pow_wrp.so -o pow_wrp.so pow_wrp.o "%R_HOME%\bin\x64\R.dll" "%MKLROOT%\lib\mkl_rt.lib"

 

Calling the function from R

Here then is a simple script that calls the function we've just built

 

dyn.load("c:/full_path_or_directory_on_path/pow_wrp.so")

mkl_pow <- function(n, x, y) .External("mkl_vdpow", n, x, y)

n <- 1000

x <- runif(n, min=2, max=10)

y <- runif(n, min=-2, max=-1)

z <- mkl_pow(n, x, y)

 

Performance

Let's see what sort of performance difference the use of Intel® oneMKL can make by comparing this new power function with the simple means available within R. To do the similar operation in R we could use a loop to perform each scalar operation. The following is the script running with Rscript on my laptop (Intel® oneMKL 2024.2.0, R-4.4.1, 4-core Intel® Core™ i7-1187G7 processor (3.00GHz) 16GB RAM, Windows* 11):

 

dyn.load("c:/test/pow_R/pow_wrp.so")

mkl_pow <- function(n, x, y) .Call("mkl_vdpow", n, x, y)

n <- 1000000
x <- runif(n, min=2, max=10)
y <- runif(n, min=-2, max=-1)

start <- proc.time()
z <- mkl_pow(n, x, y)
end1 <- proc.time() - start

end1

i <- n

start <- proc.time()
repeat{ 
    z[i] <- x[i]^y[i] 
    i <- i - 1 
    if (i==0) break() }
end2 <- proc.time() - start

end2

The output of the above script in the Command Prompt is as following,

C:\test\pow_R>Rscript test.R
   user  system elapsed
   0.00    0.00    0.01
   user  system elapsed
   0.42    0.00    0.22

For this very large problem size the oneMKL function makes a large difference. The loop approach takes 0.22 seconds while the oneMKL function takes 0.01 seconds. For smaller problems the performance will be less, but for systems with higher core-counts, compute intensive functions can show even greater speed-up. 

Given the potential for large performance improvements, it may be worthwhile finding out the functions available in oneMKL