Accelerating Python* Today

Getting Ready for the Cambrian Explosion in Accelerator Architectures

Python* continually surprises many with how versatile and performant it can be. I’m a dyed-in-the-wool C and Fortran programmer, with substantial C++ smarts too, because they allow me to get high performance. Python offers this too, with a convenience that separates it from the previously mentioned languages. So, I’m a Python fan, too.

Python can deliver performance because it has key libraries that are well optimized, and there is support for just-in-time compilation (at run time) for key code that was not precompiled. However, my Python code tends to slow down when I reach for larger data sets or more complex algorithms. In this article, we’ll review:

  1. Why the “Heterogeneous Future” is so important to think about
  2. Two key challenges we need to solve in an open solution
  3. Parallel execution to use available CPU performance better
  4. Using an accelerator to push performance even higher

Item 3 alone gave me a 12x speedup, and item 4 offers even more when an accelerator is available. These easy-to-use techniques can be invaluable for a Python programmer when a performance boost is needed. The techniques shared here let us forge ahead without waiting too long for results.

Thinking about the “Heterogeneous Future”

While understanding heterogeneity is not critical if all we want to do is make Python code run faster, it is worth highlighting a substantial shift that is happening in computing now. Computers have become faster every year. At first, clever and more complex architectures drove these performance gains. Then, from about 1989–2006, the increasing clock frequency was the key driver. Suddenly in 2006, raising clock rates no longer made sense and architectural changes were once again needed to increase performance.

Multicore processors offered more performance by increasing the number of (homogeneous) cores in a processor. Unlike rising clock rates, getting additional performance from multicore required software to change to harness the new performance. Herb Sutter’s classic article, “The Free Lunch Is Over,” highlighted the need for concurrency. While necessary, this shift complicated our lives as software developers.

Next, accelerators emerged to augment CPU computations with computations on a specialized device. The most successful of these so far has been the GPU. GPUs were originally introduced to offload graphics processing on its way to a computer’s display. Several programming models emerged to harness this additional computing power for general computation, with the results sent back to the program running on the CPU instead of to the display. The most successful model so far has been CUDA* for NVIDIA GPUs. Today, the (heterogeneous) processors in a single system are no longer equivalent. Yet, popular programming languages generally assume a single compute device, so the term “offload” is used when we select part of our code to run on a different compute device.

A few years ago, two industry legends, John Hennessy and David Patterson, announced that we were entering “A New Golden Age for Computer Architecture.” Heterogeneous computing is exploding thanks to the emergence of many ideas for domain-specific processors. Some will succeed and many will fail, but computing is forever changed because it is no longer about doing all computations on a single device.

Two Key Challenges Solved by One Good Solution

While CUDA is a popular option today, it is limited to NVIDIA GPUs. However, we need open solutions to handle the wave of new accelerator architectures that are arriving from multiple vendors. Programs running on heterogeneous platforms need a way to discover what devices are available at runtime. They also need a way to offload computation to those devices.

CUDA ignores device discovery by assuming that only NVIDIA GPUs are available. Python users can choose CuPy to leverage GPUs using CUDA (NVIDIA) or ROCm* (AMD). While CuPy is a solid option, it doesn’t improve CPU performance or generalize to other vendors or architectures. We would do better with a programming solution that is portable across vendors and can support new hardware innovations. However, before we get too excited about accelerator offload, let’s be sure we are getting the most out of the host CPU. Once we understand how to get parallelism and compiled code on the CPU, we will be better positioned to use the parallelism in an accelerator as well.

Numba is an open-source, NumPy-aware optimizing (just-in-time) compiler for Python, developed by Anaconda. Under the covers, it uses the LLVM compiler to generate machine code from Python bytecode. Numba can compile a large subset of numerically focused Python, including many NumPy functions. Numba also has support for automatic parallelization of loops, generation of GPU-accelerated code, and creation of universal functions (ufuncs) and C callbacks.
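As a small illustration of the last point, here is a minimal sketch (not from the article) of turning a scalar Python function into a NumPy ufunc with Numba; the function name and signature are chosen only for illustration:

import numpy as np
from numba import vectorize

# Compile a true NumPy ufunc from a scalar Python function
@vectorize(["float64(float64, float64)"])
def scaled_add(x, y):
    return 0.5 * (x + y)

a = np.arange(1_000_000, dtype=np.float64)
b = np.ones_like(a)
c = scaled_add(a, b)   # broadcasts element-wise, just like a built-in ufunc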

The Numba auto-parallelizer was contributed by Intel. It can be enabled by setting the parallel=True option in the @numba.jit (or @njit) decorator. The auto-parallelizer analyzes data-parallel code regions in the compiled function and schedules them for parallel execution. There are two types of operations that Numba can automatically parallelize:

  1. Implicitly data-parallel regions, such as NumPy array expressions, NumPy ufuncs, NumPy reduction functions
  2. Explicitly data-parallel loops that are specified using the numba.prange expression

For example, consider the following simple Python loop:

def f1(a, b, c, N):
    # A plain Python loop: each iteration runs serially in the interpreter
    for i in range(N):
        c[i] = a[i] + b[i]

We can make it explicitly parallel by changing the serial range (range) to a parallel range (prange) and adding the @njit decorator (njit = Numba JIT; with parallel=True, it compiles a parallel version):

from numba import njit, prange

@njit(parallel=True)
def add(a, b, c, N):
    # prange marks the iterations as independent, so Numba can run them in parallel
    for i in prange(N):
        c[i] = a[i] + b[i]

Run time improved from 24.3 seconds to 1.9 seconds when I ran it, but results can vary depending on the system. To try it, clone the oneAPI-samples repository (git clone) and open the AI-and-Analytics/Jupyter/Numba_DPPY_Essentials_training/Welcome.ipynb notebook. An easy way to do this is by getting a free account on the Intel® DevCloud for oneAPI.
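For comparison, the first kind of operation in the list above (implicitly data-parallel regions) needs no prange at all. A minimal sketch, with names and array sizes chosen only for illustration:

import numpy as np
from numba import njit

@njit(parallel=True)
def add_arrays(a, b):
    # A whole-array expression is an implicitly data-parallel region;
    # the auto-parallelizer spreads it across the available cores.
    return a + b

a = np.ones(10_000_000)
b = np.ones(10_000_000)
c = add_arrays(a, b)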

Using an Accelerator to Push Performance Even Higher

An accelerator can be highly effective when an application has sufficient work to merit the overhead of offloading. The first step is to compile selected computations (a kernel) so they can be offloaded. Extending the prior example, we use the Numba data-parallel extensions (numba-dpex) to designate an offload kernel. (For more details, see the Jupyter notebook training.)

import numba_dppy as dppy

@dppy.kernel
def add(a, b, c):
    # Each work item handles one element, indexed by its global ID
    i = dppy.get_global_id(0)
    c[i] = a[i] + b[i]

The kernel code is compiled and parallelized, just as @njit prepared it to run on the CPU earlier, but this time it is ready for offload to a device. It is compiled into an intermediate language (SPIR-V) that the runtime maps to a device when the kernel is submitted for execution. This gives us a vendor-agnostic solution for accelerator offload.

The array arguments to the kernel can be NumPy arrays or Unified Shared Memory (USM) arrays (an array type explicitly placed in Unified Shared Memory) depending on what we feel fits our programming needs best. Our choice will affect how we set up the data and invoke the kernels.
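For example, here is a minimal sketch of creating a USM-backed array with dpctl.tensor; the device filter string and usm_type value are assumptions chosen for illustration:

import numpy as np
import dpctl.tensor as dpt

a_np = np.arange(1024, dtype=np.float32)

# Copy host data into a USM allocation tied to a SYCL device
a_usm = dpt.asarray(a_np, usm_type="device", device="gpu")

# Copy results back to a standard NumPy array when we need them on the host
a_host = dpt.asnumpy(a_usm)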

Next, we’ll take advantage of a C++ solution for open multivendor, multiarchitecture programming called SYCL*, using the open source data-parallel control (dpctl: C and Python bindings for SYCL). (See GitHub docs and Interfacing SYCL and Python for XPU Programming for more information.) These enable Python programs to access SYCL devices, queues, and memory resources and execute Python array/tensor operations. This avoids reinventing solutions, reduces how much we must learn, and allows a high level of compatibility as well.

Connecting to a device is as simple as:

import dpctl

device = dpctl.select_default_device()
print("Using device ...")
device.print_device_info()

The default device can be set with an environment variable SYCL_DEVICE_FILTER if we want to control device selection without changing this simple program. The dpctl library also supports programmatic controls to review and select an available device based on hardware properties.
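A minimal sketch of that programmatic discovery with dpctl; the filter strings shown are examples, not requirements:

import dpctl

# Enumerate every SYCL device the runtime can see
for d in dpctl.get_devices():
    print(d.name, d.device_type)

# Or construct a device from a filter string such as "gpu" or "cpu"
gpu_device = dpctl.SyclDevice("gpu")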

The kernel can be invoked (offloaded and run) on the device with a couple lines of Python code:

with dpctl.device_context(device):
    # global_size is the total number of work items, e.g., len(a)
    add[global_size, dppy.DEFAULT_LOCAL_SIZE](a, b, c)

Our use of device_context has the runtime do all the necessary data copies (our data was still in standard NumPy arrays) to make it all work. The dpctl library also supports the ability to allocate and manage USM memory for devices explicitly. That could be valuable when we get deep into optimization, but the simplicity of letting the runtime handle it for standard NumPy arrays is hard to beat.

Asynchronous vs. Synchronous

Python coding style is easily supported by the synchronous mechanisms shown above. Asynchronous capabilities, and their advantages (reducing or hiding latencies in data movement and kernel invocations), are also available if we’re willing to change our Python code a little. See the example code at dpctl gemv example to learn more about asynchronous execution.

What about CuPy?

CuPy is a reimplementation of a large subset of NumPy. The CuPy array library acts as a drop-in replacement for running existing NumPy/SciPy code on NVIDIA CUDA or AMD ROCm platforms. However, the massive programming effort required to reimplement CuPy for each new platform is a considerable barrier to supporting multiple vendors from the same Python program, so it does not address the two key challenges mentioned earlier. For device discovery, CuPy simply requires a CUDA-enabled GPU. For memory, it offers little direct control, though it does perform memory pooling automatically to reduce the number of calls to cudaMalloc. For kernel offload, it offers no control over device selection and will fail if no CUDA-enabled GPU is available. We get better application portability by using a solution that addresses these challenges of heterogeneity.
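For reference, the drop-in style looks like this (a minimal sketch; cupy must be installed and a CUDA or ROCm GPU must be available):

import numpy as np
import cupy as cp

x_cpu = np.arange(16, dtype=np.float32)
x_gpu = cp.asarray(x_cpu)        # copy the array to the GPU
y_gpu = cp.sqrt(x_gpu) + 1.0     # familiar NumPy-style API, executed on the GPU
y_cpu = cp.asnumpy(y_gpu)        # copy the result back to the host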

What about scikit-learn?

Python programming in general is well suited for compute-follows-data, and using enabled routines is beautifully simple. The dpctl library supports a tensor array type that we connect with a specific device. In our program, if we cast our data to a device tensor [e.g., dpctl.tensor.asarray(data, device="gpu:0")], it will be associated with and placed on that device. With a patched version of scikit-learn that recognizes these device tensors, any patched method that involves such a tensor is automatically computed on the device.
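A minimal sketch of the pattern, assuming the patched scikit-learn is the Intel Extension for Scikit-learn (sklearnex) and that the installed version accepts dpctl device tensors as described above; the device string and estimator are illustrative only:

import numpy as np
import dpctl.tensor as dpt
from sklearnex import patch_sklearn

patch_sklearn()                      # swap in the optimized scikit-learn implementations
from sklearn.cluster import KMeans   # import estimators after patching

data = np.random.rand(10_000, 8).astype(np.float32)
data_dev = dpt.asarray(data, device="gpu:0")   # place the data on the device

# Compute follows the data: the patched estimator runs where data_dev lives
model = KMeans(n_clusters=4).fit(data_dev)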

It is an excellent use of dynamic typing in Python to sense where the data is located and direct the computation to be done where the data resides. Our Python code changes very little; the only change is recasting our data as device tensors. Based on feedback from users so far, we expect compute-follows-data to be the most popular model for Python users.

Open, Multivendor, Multiarchitecture – Learning Together

Python can be an instrument to embrace the power of hardware diversity and harness the impending Cambrian Explosion in accelerators. Numba data-parallel Python combined with dpctl and compute-follows-data patched scikit-learn are worth considering because they are vendor and architecture agnostic.

While Numba offers great support for NumPy, we can consider what more can be done for SciPy and other Python needs in the future. The fragmentation of array APIs in Python has generated interest in array-API standardization for Python (read a nice summary) because of the desire to share workloads with devices other than the CPU. A standard array API goes a long way in helping efforts like Numba and dpctl increase their scope and impact. NumPy and CuPy have embraced array-API, and both dpctl and PyTorch* are working to adopt it. As more libraries go in this direction, the task of supporting heterogeneous computing (accelerators of all types) becomes more tractable.
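To see why this matters, here is a minimal sketch of the library-agnostic pattern the array API standard enables; it assumes x comes from a library that implements the standard:

def normalize(x):
    # Ask the array which array-API namespace produced it (NumPy, CuPy,
    # dpctl.tensor, ...) and do all the math through that namespace, so
    # the computation happens wherever the data already lives.
    xp = x.__array_namespace__()
    return (x - xp.mean(x)) / xp.std(x)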

Simply using dpctl.device_context is not sufficient in more sophisticated Python codes with multiple threads or asynchronous tasks. (See the GitHub issue.) It is likely better to pursue a compute-follows-data policy, at least in more complex threaded Python code. It may become the preferred option over the device_context style of programming.

There is a lot of opportunity for us to all contribute and refine ways to accelerate Python together. It’s all open source and works quite well today.

Learn More

For learning, there is nothing better than jumping in and trying it out yourself. Here are some suggestions for online resources to help.

For Numba and dpctl, there is a 90-minute video talk covering these concepts: Data Parallel Essentials for Python.

Losing Your Loops: Fast Numerical Computing with NumPy by Jake VanderPlas (author of the Python Data Science Handbook) is a delightfully useful video on how to use NumPy effectively.

The heterogeneous Python capabilities described in this article are all open source and come prebuilt in the Intel® oneAPI Base and Intel® AI Analytics toolkits. A SYCL-enabled NumPy is hosted on GitHub. Numba compiler extensions for kernel programming and automatic offload capabilities are also hosted on GitHub. The open source data-parallel control library (dpctl: C and Python bindings for SYCL) has GitHub docs and a paper, Interfacing SYCL and Python for XPU Programming. Together, these enable Python programs to access SYCL devices, queues, and memory, and to execute Python array/tensor operations using SYCL resources.

Exceptions are indeed supported, including asynchronous errors from device code. Asynchronous errors are intercepted once they are rethrown as synchronous exceptions by asynchronous error-handler functions. This behavior comes courtesy of the Python extension generators; the community documentation for Cython and pybind11 explains it well.