An Introduction to Accelerator and Parallel Programming

Accelerators are continuing to grow in importance and usage, what does this mean for developers?

Get the Latest on All Things CODE

author-image

作者

Computers are a means to an end. They allow us to have faster solutions to complex problems, provide the ability to store and retrieve information across the globe, provide the backbone for remarkable technologies like robotics and self-driving cars (sort of) and AI, and hopefully uplift the lives of everyone on the planet.

As the problems to solve have become more complex, computer architecture, programming languages, and programming models have continued to evolve. This has led to the growth of hardware accelerators and domain-specific programming models.

Professor David Patterson from UC Berkeley (the author of all the computer architecture books I had in college) has talked extensively about domain-specific architectures and accelerators. Today, when we talk about a hardware accelerator, we are often talking about a GPU. However, there are myriad different types of accelerators that have arisen to solve various problems—including deep learning and AI—which utilize hardware specifically designed to perform large-scale matrix operations, the heart of DL workloads.

In addition, there are hardware-acceleration technologies built into traditional CPUs like Intel® Advanced Vector Extensions (Intel® AVX) and Intel® Advanced Matrix Extensions (Intel® AMX).

With the rise of new accelerators, there is always the challenge of how to program for them. Most accelerators currently available are based on parallel execution and, hence, some form of parallel programming.

This is the first in a series of articles where I will talk about parallel programming and various ways to enable it on modern accelerators.

Parallelism Overview

Parallel programming is how we write code to express parallelism in any code/algorithm to get it to run on an accelerator or multiple CPUs. But what is parallelism?

Parallelism is when parts of a program can run at the same time as another part of the program. Typically, we break this down into two categories: task parallelism and data parallelism.

Task Parallelism

Task parallelism is when multiple functions can be independently performed at the same time. An example would be preparing for a party. One person may get balloons, another person may get ice cream, and a third person may get the cake. While the party won’t be complete without all three things, three different people can do what they need to do independently of everyone else as long as they are done before the party and meet in the right place.

If we map this to a computer program, each person is analogous to some compute hardware (CPU/GPU/FPGA), and picking up the various items are the tasks to be run.

Data Parallelism

Data parallelism is when the same function can be performed independently on several pieces of identical data. Imagine the person in our example above goes to get ice cream, but four other people also want to get ice cream. Potentially, all five people can get ice cream from the same store freezer at the same time.

In this example, we again have the analogy of people to compute hardware, each one tasked with getting ice cream. The ice cream is our (delicious) analogy for data.

These examples are simplistic but hopefully instructive. There are more subtle types of parallelism, especially in the area of computer architecture, but these simple analogies may help you understand how to think about parallelism when doing parallel programming.

Available Resources, Available Parallelism, and Contention

Some important things to consider with parallelism are available resources and the available parallelism in a given solution space.

  • We may be limited by the number of resources to complete a task – If I only have three people, I can only do three tasks at a time.
  • We may be limited by the available tasks – If I only need balloons and cake for my party, a third person to perform a task doesn’t help me.
  • We may be limited by the available data – If I have five people wanting to get ice cream, but only three containers of ice cream are left in the store freezer, two people will have nothing to do.

Another separate issue to consider is contention. Resources are generally limited and often our attempts to parallelize a problem run into issues with resource access.

Let’s imagine we go to the store to get our ice cream and there are 100 containers of ice cream in the freezer, but only three people can stand in front of the freezer at once. That means that even if 100 people are there to get ice cream, getting ice cream still has, at most, a parallelism of three because of the limited access to the freezer.

This analogy applies to various pieces of computer hardware. One example: Think of the freezer access to be similar to a memory bus. If it is not wide enough to allow all the data to flow to the accelerator fast enough, your accelerator is stuck waiting for access to the data to do what it needs to do.

A Simple Code Example

To make this slightly more concrete, let’s show some code for the data parallel shopping problem above. We start by creating a shopper class. The job of this class is to perform the work of shopping. In this case, I’m using a naive matrix-multiply to be the cost of shopping.

#ifdef _WIN32
#include <Windows.h>
#else
#include <unistd.h>
#endif

#define DIMENSION 64

class Shopper {
public:
    Shopper(){}

    #pragma clang optimize off
    void Shop()
    {
        float a[DIMENSION][DIMENSION];
        float b[DIMENSION][DIMENSION];
        float mul[DIMENSION][DIMENSION];
        for (auto i = 0; i < DIMENSION; i++)
        {
            for (auto j = 0; j < DIMENSION; j++)
            {
                a[i][j] = 123.0f;
                b[i][j] = 456.0f;
                mul[i][j] = 0.0f;
            }
        }

        for (auto i = 0; i < DIMENSION; ++i)
        {
            for (auto j = 0; j < DIMENSION; ++j)
            {
                mul[i][j] = 0;
                for (auto k = 0; k < DIMENSION; ++k)
                {
                    mul[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }
};

shopper.hp

Parallelism with OpenMP®

Now that I have defined the act of shopping, let’s look at how to make this code run in parallel using OpenMP®, a portable solution that allows programmers to add parallelism to their programs. Pragma directives are added to existing code to tell the compiler how to run parts of the code in parallel.

#include <iostream>
#include <array>
#include <chrono>
#include "shopper.hpp"

#define SHOPPER_COUNT 65536

void Shop(std::array<Shopper, SHOPPER_COUNT> &shoppers) {
    #pragma omp parallel for
    for(int i = 0; i < SHOPPER_COUNT; ++i)
    {
        shoppers[i].Shop();
    }
}

int main()
{
    std::array<Shopper, SHOPPER_COUNT> shoppers;

    auto start = std::chrono::steady_clock::now();
    Shop(shoppers);    
    auto end = std::chrono::steady_clock::now();

    std::cout << "Elapsed time in milliseconds: "
        << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
        << " ms" << std::endl;

    return 0;
}

grocery-omp.cpp

This simple code consists of four simple pieces:

  1. Line 18: Create 65536 shoppers
  2. Line 21: Call the Shop function
  3. Line 10–12: Ask each shopper to go shopping
  4. Line 9: a pragma to tell the compiler when using OpenMP to run iterations of this loop in parallel

The rest of it is just timers so we can see the benefits of running the code in parallel.

The nice thing about the OpenMP code is you can tell the compiler to use/not use OpenMP, which then results in a serial program.

For this article I’m using an HP Envy* laptop, which has an Intel® Core™ i7–12700H processor with 32GB of memory and an integrated Intel® Iris® Xe GPU. The system is running Ubuntu* 22.04, a 6.0 kernel, and the Intel® oneAPI DPC++/C++ Compiler.

First, I enabled the Intel compiler environment with this command:

> /opt/intel/oneapi/setvars.sh

The following commands compile the code:

> icx -lstdc++ grocery-omp.cpp -o serial-test
> icx -lstdc++ -fiopenmp grocery-omp.cpp -o omp-test

The last command includes -fiopenmp flag to tell the compiler to enable parallelism using OpenMP. Running the executables, I get this output on my system:

> ./serial-test
Elapsed time in milliseconds: 27361 ms
> ./omp-test
Elapsed time in milliseconds: 4002 ms

The OpenMP run utilizes several of my CPU cores to do the work simultaneously, which results in a 6x–7x speed up in the OpenMP run.

Parallelism with oneAPI/SYCL

OpenMP is a fantastic way to enable parallelism through a directive approach. C++ with SYCL, part of the oneAPI specification, allows us to express parallelism using an explicit approach.

#include <CL/sycl.hpp>
#include <iostream>
#include <array>
#include <chrono>
#include "shopper.hpp"

#define SHOPPER_COUNT 65536

void Shop(sycl::queue &q, std::array<Shopper, SHOPPER_COUNT> &shoppers) {
    sycl::range<1> num_items{SHOPPER_COUNT};
    sycl::buffer buf(shoppers.data(), num_items);

    // Submit a command group to the queue by a lambda function that contains the
    // data access permission and device computation (kernel).
    q.submit([&](sycl::handler &h) {
        // The accessor is used to store (with write permission) the data.
        sycl::accessor item(buf, h, sycl::write_only, sycl::no_init);

        // Use parallel_for to run vector addition in parallel on device. This
        // executes the kernel.
        //    1st parameter is the number of work items.
        //    2nd parameter is the kernel, a lambda that specifies what to do per
        //    work item. The parameter of the lambda is the work item id.
        // DPC++ supports unnamed lambda kernel by default.
        h.parallel_for(num_items, [=](auto i) { item[i].Shop(); });
    });
}

int main()
{
    std::array<Shopper, SHOPPER_COUNT> shoppers;

    // The default device selector will select the most performant device.
    sycl::default_selector d_selector;

    auto start = std::chrono::steady_clock::now();

    try {
        sycl::queue q(d_selector);

        // Print out the device information used for the kernel code.
        std::cout << "Running on device: "
                << q.get_device().get_info<sycl::info::device::name>() << "\n";

        // Go shopping using SYCL
        Shop(q, shoppers);
    } catch (std::exception const &e) {
        std::cout << "An exception was caught when shopping.\n";
        std::terminate();
    }
    
    auto end = std::chrono::steady_clock::now();

    std::cout << "Elapsed time in milliseconds: "
        << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
        << " ms" << std::endl;

    return 0;
}

grocery-sycl.cpp

The SYCL code is larger than the previous code base for this example. However, we have the same core code:

  1. Line 31: Create 65536 shoppers
  2. Line 46: Call the Shop function
  3. Line 25: Ask each shopper to go shopping

Unlike the OpenMP example where we use directives, SYCL allows users to explicitly define the parallel behavior of their program via code and C++ constructs. This provides runtime flexibility, which is unavailable in OpenMP.

Depending on your use case, one paradigm may make more sense than another. In this case, we have a way to take a single binary and run it in multiple ways using Intel’s oneAPI SYCL runtime and the SYCL_DEVICE_FILTER environment variable:

> SYCL_DEVICE_FILTER=host:host:0 ./sycl-test
Running on device: SYCL host device
Elapsed time in milliseconds: 27201 ms
> SYCL_DEVICE_FILTER=opencl:cpu:1 ./sycl-test
Running on device: 12th Gen Intel(R) Core(TM) i7-12700H
Elapsed time in milliseconds: 4197 ms
> SYCL_DEVICE_FILTER=opencl:gpu:3 ./sycl-test
Running on device: Intel(R) Graphics [0x46a6]
Elapsed time in milliseconds: 3988 ms

The values for SYCL_DEVICE_FILTER come from running the sycl-ls command, which comes as part of the Intel® oneAPI Base Toolkit.

The cool thing about the C++ with SYCL implementation is that the binary can direct the work to multiple devices.

Conclusion

Understanding parallelism and writing parallel code is a complicated problem. This primer describes a simple example of how parallelism can be enabled using OpenMP and C++ with SYCL. There are lots of other considerations when creating a parallel program, such as how multiple compute resources share memory and variables, and how to balance work properly across multiple resources.

I will address those in future articles. But for now, I encourage you to try grabbing the simple example above and see how it works on your systems.

Thanks for reading!