Why SYCL*? The Easy Path to Multiarchitecture Freedom

Development in a Heterogeneous-Compute World

At Intel® Innovation 2022, I had the pleasure of talking about SYCL* and the need for an open, flexible ecosystem centered on heterogeneous compute.

As processor technology evolved from sequential execution to threading, instruction caching, and reordering, and then to multicore CPUs for high-performance, latency-sensitive, general-purpose compute … something else happened. GPUs evolved to the point where it became obvious that their highly parallel compute-unit design—which was ideal for rendering and media processing—was also ideal for something else: asynchronous parallel execution of compute kernels.

The upshot? The combination of CPU and GPU can accelerate almost any workload that benefits from serious data parallelism.

In the blink of an eye, the world of high-performance computing reached new heights of performance and architectural diversity.

Figure 1.  The Evolution of Parallel Compute

This, however, also created a new problem: software source code became (and continues to become) ever more specific to a unique combination of hardware components and hardware-specific language constructs. Moving an application from one setup to another becomes ever more challenging, negatively impacting development time, quality management, and time-to-market.

The fix is to implement your code once while still being able to leverage the strengths of different architectures, whether CPUs, GPUs, custom ASICs, or FPGAs.

Easy Reuse on Many Architectures

Figure 2. The Importance of Being Vendor-Agnostic

As codebases and available hardware platforms evolve, you may want the ability to simply recompile your code or, even better, have it automatically dispatched correctly to a varying set of available resources.

This requires a commonly accepted language baseline. It also requires an approach that can reuse existing libraries and standards while isolating hardware architecture dependencies in a modular fashion.

Effectively, we want to abstract away all hardware dependencies.

Which brings us to C++.

Why C++?

Confronted with the need for wide acceptance and for reuse of a large ecosystem of existing codebases and libraries, we decided not to pursue every software developer’s dream; that is, we decided it was not feasible to create our own programming language.

We needed an approach that was open, standards-based, and ready to be adopted and embraced by a vast ecosystem of developers. Only then could the dream of write once, run anywhere come true.

So we looked at the language that forms the backbone of most modern software: C++.

C++ is inherently high-performance. Modern C++ has developed three fundamental concepts that make it ideal for taking on the challenge of enabling innovation in software and hardware.

  1. Zero cost abstractions. This means we can abstract away some of the details of different hardware without paying a performance penalty.
  2. Separation of concerns. This means we can separate the performance details for each processor from the work of making the software do what users want it to do.
  3. Composability. This means we can take different components from different experts and “compose” them together to make a complete solution.

All these concepts allow us to have different experts working together as a team to solve big challenges.
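
To make these ideas concrete, here is a tiny standard C++ sketch of my own (an illustration, not from the talk). The generic combine function below is written once, specialized and inlined by the compiler at no runtime cost (a zero-cost abstraction), and composes with any operation you hand it:

#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

// A zero-cost abstraction: the algorithm is written once, generically.
// The compiler specializes and inlines it for each type and operation,
// so the abstraction itself adds no runtime overhead.
template <typename Range, typename T, typename Op>
T combine(const Range& r, T init, Op op) {
  return std::accumulate(std::begin(r), std::end(r), init, op);
}

int main() {
  std::vector<int> v{1, 2, 3, 4};
  // Composability: the same generic algorithm composes with any operation
  std::cout << combine(v, 0, std::plus<>{}) << "\n";       // prints 10
  std::cout << combine(v, 1, std::multiplies<>{}) << "\n"; // prints 24
}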

Using these modern C++ concepts with the latest processors requires a C++ compiler that supports all the recent language features while also generating code for every processor in the system.

This was a challenge because C++ was designed to generate code for only one processor at a time and provides no standard way to link code for two different processors together. Solving it would require many people working together: C++ compiler developers, processor vendors, and software developers. Quite an organizational challenge!

We became convinced that the answer was to define an industry standard that expands on C++ to support the latest processors. An open industry standard that would allow people to write their code and then distribute it across the whole platform. A royalty-free standard that could be brought to many different processors.

What Will SYCL Do for Me?

If you want to create high-performance software for the future, SYCL will let you write software that performs very well on today’s accelerator hardware (such as GPUs or FPGAs) but will also work on the accelerators of tomorrow.

Please check out my conversation with James Reinders (Principal Engineer at Intel’s Software and Advanced Technology Group) below:

The Queue Manages your Offload

At its core, SYCL has the concept of a queue, which you create with a C++ constructor. The queue is the mechanism by which you talk to a device asynchronously: you tell the queue that you want to share data, or that you want computation offloaded.

// Standard SYCL 2020 header
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::device d;
  // Exception checking for GPU availability; fall back to the CPU
  try {
    d = sycl::device(sycl::gpu_selector_v);
  } catch (sycl::exception const &e) {
    std::cout << "Cannot select a GPU\n" << e.what() << "\n";
    std::cout << "Using a CPU device\n";
    d = sycl::device(sycl::cpu_selector_v);
  }

  std::cout << "Using " << d.get_info<sycl::info::device::name>() << "\n";

  // The queue built on this device is the channel for all offload
  sycl::queue q{d};
}

Figure 3. The SYCL Queue Concept
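
Beyond device selection, a queue submission can be sketched end to end. The following example is my own illustration (not from the talk) and uses SYCL 2020 unified shared memory rather than buffers; it assumes the chosen device supports malloc_shared allocations:

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q;  // default device selection

  constexpr size_t n = 8;
  // Unified shared memory: one pointer visible to both host and device
  int *data = sycl::malloc_shared<int>(n, q);
  for (size_t i = 0; i < n; ++i) data[i] = static_cast<int>(i);

  // The queue dispatches the kernel to the device asynchronously
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    data[i] *= 2;
  }).wait();  // block until the kernel completes

  for (size_t i = 0; i < n; ++i) std::cout << data[i] << " ";
  std::cout << "\n";

  sycl::free(data, q);
}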

For data input and output parameter sharing, one common approach is the use of buffers (see Figure 4). Buffers give you control over data management as part of the program-execution flow.

#include <sycl/sycl.hpp>
#include <functional>
#include <iostream>
#include <vector>
 
int main() {
 
    sycl::queue Q;
    std::cout << "Running on: " << Q.get_device().get_info<sycl::info::device::name>() << std::endl;
 
    // Initialize to zero: the buffer's starting value is folded into the reduction
    int sum = 0;
    std::vector<int> data{1, 1, 1, 1, 1, 1, 1, 1};
 
    sycl::buffer<int> sum_buf(&sum, 1);
    sycl::buffer<int> data_buf(data);
 
    Q.submit([&](sycl::handler& h)
    {
        // The accessor declares how the kernel uses the buffer
        sycl::accessor buf_acc{data_buf, h, sycl::read_only};
 
        h.parallel_for(sycl::range<1>{8},
                       sycl::reduction(sum_buf, h, std::plus<>()),
                       [=](sycl::id<1> idx, auto& sum)
        {
            sum += buf_acc[idx];
        });
    });
        
    // The host accessor waits for the kernel to finish before reading
    sycl::host_accessor result{sum_buf, sycl::read_only};
    std::cout << "Sum equals " << result[0] << std::endl;
 
    return 0;
}

Figure 4. The Use of Buffers for Data Sharing

Level Zero to the Rescue

The promise of SYCL and oneAPI is that we support the processors of the future. To deliver on this promise, we are doing three things to make it easy to add more hardware support.

  1. Almost everything we provide is open source so it can be brought to new hardware.
  2. We have an open-governance model that allows hardware vendors and software developers to work together to define the specs in a way that meets the needs of multiple processor architectures, as well as the software people want to accelerate.
  3. We have been working on defining low-level interfaces to the hardware (oneAPI Level Zero) that are easy to bring to new processors; see the sketch after this list.
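
To give a feel for how low in the stack these interfaces sit, here is a rough sketch of my own (not official sample code) that enumerates devices through the Level Zero API; it assumes the Level Zero loader library is installed:

// Minimal Level Zero sketch: enumerate drivers and their devices
#include <level_zero/ze_api.h>
#include <iostream>
#include <vector>

int main() {
  if (zeInit(ZE_INIT_FLAG_GPU_ONLY) != ZE_RESULT_SUCCESS) {
    std::cout << "No Level Zero driver available\n";
    return 1;
  }

  // First call gets the count, second call fills the handles
  uint32_t driverCount = 0;
  zeDriverGet(&driverCount, nullptr);
  std::vector<ze_driver_handle_t> drivers(driverCount);
  zeDriverGet(&driverCount, drivers.data());

  for (auto drv : drivers) {
    uint32_t deviceCount = 0;
    zeDeviceGet(drv, &deviceCount, nullptr);
    std::vector<ze_device_handle_t> devices(deviceCount);
    zeDeviceGet(drv, &deviceCount, devices.data());

    for (auto dev : devices) {
      ze_device_properties_t props{};
      props.stype = ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES;
      zeDeviceGetProperties(dev, &props);
      std::cout << "Found device: " << props.name << "\n";
    }
  }
}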

Expect to see us develop this further in 2023. And if you’re a hardware designer or high-performance software developer:

Join Us!

SYCL gives you the basis for a completely open, standards-based ecosystem, and oneAPI provides an open framework for the future of heterogeneous software development, all built on C++.

It gives you all of this fully in sync with the C++ programming philosophy.

Join us in making this future happen and help to build out the open ecosystem for accelerated computing.

  • Check out how SYCL and oneAPI fit into your next heterogeneous software development project.
  • Join the SYCL working group (in Khronos) and/or the oneAPI Community Forum. It’s an exciting initiative to be involved in, driven by a group of friendly, motivated people making it happen.


Get the Software

Download Data Parallel C++, the oneAPI implementation of SYCL, as part of the Intel® oneAPI Base Toolkit.