C for Metal - The Precious Metal for Computing on Intel® Graphics Technology

ID 659633
Updated 1/22/2020
Version Latest
Public


C for Metal is a general-purpose graphics processing unit (GPU) programming language with an explicit single instruction, multiple data (SIMD) programming model that lets users achieve maximal performance on Intel® Processor Graphics. It is not limited to media processing. This article introduces the Intel® C for Metal Development Package, a publicly available software development package.

How many Intel processor cores does your computer have? If you use an Intel® CPU-based system, in the vast majority of cases you need to add one to your answer: from most entry-level devices with Intel Atom® processors, through the whole range of Intel® Core™ processors, to some servers with Intel® Xeon® E3 processors, an integrated graphics core is onboard. Intel® Graphics Technology is essentially a full-fledged processor and, accordingly, is capable not only of displaying an image on the screen and accelerating video, but also of performing ordinary general-purpose calculations. How can it be used effectively? This paper helps to explain.

First, let’s briefly see why it is worth using an Intel® GPU at all. Of course, the CPU performance of a typical system usually exceeds its GPU performance significantly in terms of gigaflops; that is reasonable and expected, since that is what the central processor is for.

But it is worth noting that the performance of Intel integrated GPUs has grown much more, in percentage terms, over the past decade than the corresponding CPU performance, and this trend will certainly continue with the advent of new discrete graphics cards from Intel. In addition, the GPU, by virtue of its architecture (many vector execution units), is much better suited to a certain type of task: image processing and, in fact, any uniform operations on data arrays. The GPU does this with full internal parallelization, consumes less energy than the CPU, and in some cases even surpasses it in absolute throughput. Finally, the GPU and CPU can work in parallel, each on its own tasks, ensuring maximum performance and minimum power consumption for the entire system.

The simplest way to use the general-purpose computing capabilities of the GPU, one that does not require any special knowledge of graphics (say, of Direct3D* or OpenGL* shaders), is via OpenCL™ software or the Intel® oneAPI DPC++ Library (oneDPL).

OpenCL kernels are platform independent and will automatically execute on all computing devices available in the system, such as the CPU, GPU, and FPGA. But the price of such versatility is performance far from the maximum possible on each type of device, and especially on the integrated Intel GPU. For example, when executing code that transposes a 16x16 byte matrix on any Intel GPU, direct GPU programming generates around eight times fewer instructions than the OpenCL code.

All statements made about OpenCL software are true of oneDPL as well, due to the same underlying technology.

In addition, some of the functionality required to implement common algorithms, such as wide filters that use data from a large group of pixels in a single transformation, is not supported by the OpenCL software.

Therefore, if you need maximum speed on the GPU, or something more complicated than working independently with each element of an array and its closest neighbors, then the Intel® C for Metal Development Package, a tool for developing applications running on Intel Graphics Technology, will help you.

Intel® C for Metal Development Package - Welcome to the Forge

For many years, the Intel C for Metal Development Package was used internally at Intel to develop media processing products for the Intel GPU. In 2018, it was released to the public as open source.

In terms of performance and functionality, C for Metal can be considered an assembly language for Intel graphics cards; in terms of usage pattern and convenience, an analogue of the OpenCL platform or Intel® oneAPI Data Parallel C++ for Intel graphics cards.

A C for Metal application, just like an OpenCL application, contains two parts: the administrative part, executed on the CPU, and the part doing the real work, executed on the GPU. Not surprisingly, the first part is called the host, and the second is called the kernel.

Kernels are functions that process a given block of pixels (or just data). They are written in the C for Metal language and compiled into the Intel GPU instruction set using the Intel® C for Metal Compiler.

The host is a kind of kernel team manager. It administers the data transfer process between the CPU and the GPU and performs other managerial work through the Intel C for Metal Runtime library and the Intel GPU media driver.

The following is the detailed workflow:

  • The host code is compiled by any x86 C/C++ compiler along with the entire application.
  • The kernel code is compiled by the Intel C for Metal Compiler into a binary file in a common, hardware-independent instruction set.
  • At run time, this common instruction set is just-in-time (JIT) translated into code for the specific Intel GPU.
  • The host calls the Intel C for Metal Runtime library to communicate with the GPU and the operating system.

A couple more important and useful points:

Surfaces used for data representation can be shared with Microsoft DirectX* 11 and DirectX 9 on Windows*, and with VA-API on Linux*.

The GPU can load and store data from and to both video memory and system memory shared with the CPU. The Intel C for Metal Development Package includes special functions for GPU to CPU and CPU to GPU data transfer. At the same time, since the system memory is genuinely shared, no real copying inside it is required; so-called zero-copy access is provided for this case.
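
As an illustration, here is a minimal zero-copy sketch, assuming the CreateBufferUP entry point of the Intel C for Metal Runtime ("UP" stands for user-provided memory); the size and alignment values are illustrative:

    // wrap an existing page-aligned system-memory allocation so that GPU
    // kernels access it directly, with no real GPU <-> CPU copy
    const unsigned int size = 4096;                // illustrative size
    void *sys_mem = _aligned_malloc(size, 0x1000); // page-aligned allocation
    CmBufferUP *p_buffer_up = nullptr;
    cm_result_check(p_cm_device->CreateBufferUP(size, sys_mem, p_buffer_up));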

Intel® C for Metal Development Package - In the Vent of the Volcano

Even from the name C for Metal, one could conclude that the internal language implementation corresponds to the internal structure of Intel Graphics Technology. That is, it takes into account that the code will be executed on several dozen GPU execution units, while each of them is a vector processor capable of executing several threads simultaneously.

The C for Metal language itself can be considered C++ with some limitations and extensions. Compared to C++, C for Metal lacks pointers, memory allocation, and static variables, and recursive functions are not supported. Instead, the key language feature is an explicit SIMD programming model: vector data types (vector, matrix, and surface); vector operations on these data types; vector if/else conditions, performed independently for each element of a vector; and built-in features for accessing Intel GPU fixed-function hardware.
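
To give a flavor of the model, here is a small hypothetical sketch of element-wise vector code; the vector type and the merge operation are described in the language specification mentioned below, and all names and values here are illustrative:

    // element-wise clamp to 255: all 16 lanes are processed in SIMD fashion
    vector<int, 16> v;                 // 16-element vector
    vector<ushort, 16> mask = v > 255; // per-element comparison yields a mask
    v.merge(255, mask);                // replace only elements where the mask is set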

Work with vectors, matrices, and surfaces in real tasks is facilitated by subsets: from the basic objects mentioned previously, you can select only the blocks of interest or, as a special case, individual elements by mask.

For example, let's look at C for Metal code that implements a linear filter, replacing the RGB color of each pixel by the average of that pixel and its eight neighbors in the picture.


If the colors (data) in the matrix are laid out as R8G8B8, the corresponding calculation, with the input image split into blocks of 6x8 pixels (6x24 bytes of data elements), looks as follows:

    // declare the 8x32 input matrix
    matrix<uchar, 8, 32> in;
    // the 6x24 result matrix
    matrix<uchar, 6, 24> out;
    matrix<float, 6, 24> m;
    // load the input matrix
    read(inBuf, h_pos*24, v_pos*6, in);
    // sum the RGB values of each pixel and its eight neighbors
    m  = in.select<6,1,24,1>(1,3);
    m += in.select<6,1,24,1>(0,0);
    m += in.select<6,1,24,1>(0,3);
    m += in.select<6,1,24,1>(0,6);
    m += in.select<6,1,24,1>(1,0);
    m += in.select<6,1,24,1>(1,6);
    m += in.select<6,1,24,1>(2,0);
    m += in.select<6,1,24,1>(2,3);
    m += in.select<6,1,24,1>(2,6);
    // calculate the average; division by 9 is approximated by * 0.111f
    out = m * 0.111f;
    // save the result
    write(outBuf, h_pos*24, v_pos*6, out);
    

Here, the matrix size is defined in the form <data type, height, width>.

The select<v_size, v_stride, h_size, h_stride>(i, j) operation returns the submatrix starting at element (i, j), where v_size is the number of selected rows, v_stride is the distance between selected rows, h_size is the number of selected columns, and h_stride is the distance between selected columns.
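
For instance, under the semantics just described, a hypothetical selection of every other row and every other column of a small matrix looks like this:

    // pick 2 rows with stride 2 (rows 0 and 2) and 4 columns with stride 2
    // (columns 1, 3, 5, and 7) from a 4x8 source matrix
    matrix<uchar, 4, 8> src;
    matrix<uchar, 2, 4> sub = src.select<2,2,4,2>(0,1);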

Note that the 8x32 size of the input matrix is chosen because, although an 8x30 block is sufficient from the algorithmic point of view for calculating the values of all pixels in the 6x24 block, the data block is read not byte by byte, but in 32-bit dword elements, so the width is rounded up to 32 bytes.

The code above is, in fact, a full-fledged C for Metal kernel. As mentioned, it is compiled by the Intel C for Metal Compiler in two stages (precompilation and subsequent JIT translation). The Intel C for Metal Compiler is built on the basis of LLVM; if desired, you can study its source and build it yourself.
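
In the package's tutorials, such a body sits inside an entry function marked _GENX_MAIN_, with the block position derived from the thread coordinates. The following sketch of the wrapper assumes the cm/cm.h header and the thread-origin intrinsics used by those tutorials:

    #include <cm/cm.h>

    extern "C" _GENX_MAIN_ void linear(SurfaceIndex inBuf, SurfaceIndex outBuf)
    {
        // each hardware thread processes one 6x8-pixel block, addressed by
        // the thread's position in the thread space set up by the host
        uint h_pos = get_thread_origin_x();
        uint v_pos = get_thread_origin_y();
        // ... the matrix code shown above goes here ...
    }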

Let’s see what a host does. It invokes Intel C for Metal Runtime library functions that:

  • Create, initialize, and, when necessary, delete the GPU device (CmDevice), as well as the surfaces containing user data used in kernels (CmSurface).
  • Work with kernels: load them from precompiled .isa files and prepare their arguments, indicating the part of the data that each kernel is to work with.
  • Create and manage the kernel execution queue.
  • Manage the operation scheme of threads executing each kernel on the GPU.
  • Manage events (CmEvent) that represent GPU and CPU synchronization objects.
  • Transfer data between the GPU and the CPU, or rather, between system and video memory.
  • Report errors and measure the operating time of the kernels.

Following is an example of the simplest host code:

    // Create the CmDevice
    CmDevice *p_cm_device = nullptr;
    unsigned int version = 0;
    cm_result_check(::CreateCmDevice(p_cm_device, version));

    // Load the precompiled kernel code from hello_world_genx.isa
    std::string isa_code = isa::loadFile("hello_world_genx.isa");

    // Create a CmProgram from the isa code
    CmProgram *p_program = nullptr;
    cm_result_check(p_cm_device->LoadProgram(const_cast<char*>(isa_code.data()),
                                             isa_code.size(),
                                             p_program));

    // Create the hello_world kernel
    CmKernel *p_kernel = nullptr;
    cm_result_check(p_cm_device->CreateKernel(p_program,
                                              "hello_world",
                                              p_kernel));

    // Create the thread space for CmKernel execution
    // (thread_width and thread_height are defined by the application)
    CmThreadSpace *p_thread_space = nullptr;
    cm_result_check(p_cm_device->CreateThreadSpace(thread_width,
                                                   thread_height,
                                                   p_thread_space));

    // Set the kernel input
    cm_result_check(p_kernel->SetKernelArg(0,
                                           sizeof(thread_width),
                                           &thread_width));

    // Create a CmTask (a container of kernel pointers) and add kernels to it.
    // We need the CmTask to add kernels to the execution queue.
    CmTask *p_task = nullptr;
    cm_result_check(p_cm_device->CreateTask(p_task));
    cm_result_check(p_task->AddKernel(p_kernel));

    // Create the queue
    CmQueue *p_queue = nullptr;
    cm_result_check(p_cm_device->CreateQueue(p_queue));

    // Enqueue starts the task execution on the GPU
    CmEvent *p_event = nullptr;
    cm_result_check(p_queue->Enqueue(p_task, p_event, p_thread_space));

    // Wait until the execution result is ready
    cm_result_check(p_event->WaitForTaskFinished());
  
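The host example above omits user data. For completeness, here is a hedged sketch of surface creation and GPU <-> CPU transfer, assuming the CmSurface2D API of the runtime; width, height, host_input, and host_output are illustrative:

    // create a 2D surface and move pixel data in and out
    CmSurface2D *p_surface = nullptr;
    cm_result_check(p_cm_device->CreateSurface2D(width,
                                                 height,
                                                 CM_SURFACE_FORMAT_A8R8G8B8,
                                                 p_surface));
    // copy input pixels from the CPU to the surface
    cm_result_check(p_surface->WriteSurface(host_input, nullptr));
    // ... enqueue a task whose kernels read from and write to the surface ...
    // copy the result back once the task's completion event has signaled
    cm_result_check(p_surface->ReadSurface(host_output, p_event));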

As you can see, there is nothing complicated in creating and using kernels and a host. Everything is straightforward.

The only real-world difficulty to warn about: in the currently available public version of the Intel C for Metal Development Package, the only way to debug kernels is via printf messages. The correct usage is shown in the Hello, World example.

Intel® C for Metal Development Package — Not a Heavy Metal

Now let’s see how it works in practice. The Intel C for Metal Development Package is available for Windows and Linux; for both operating systems it contains the Intel C for Metal Compiler, documentation, tutorials, and usage examples. A detailed description of these examples can be downloaded separately. Download the latest Intel C for Metal Development Package.

For Linux, the kit also includes a user-mode media driver for VA-API with an integrated Intel® C for Metal Runtime library.

For Windows, the regular Intel® Graphics Driver for Windows should be used with the Intel C for Metal Runtime: the Intel C for Metal Runtime library is included as the corresponding DLL in this driver package, and the Intel C for Metal Development Package contains only the .lib file for linking against the Intel Graphics Driver. If the driver is missing from your system for some reason, it can be downloaded from the Intel website. Correct operation of C for Metal is guaranteed in drivers beginning with version 15.60, released in 2017.

The source code of the components mentioned can be found in the corresponding GitHub* repositories.

The rest of this section is Windows specific, but the generic principles for working with C for Metal described below are also applicable to Linux.

For regular work with the Intel C for Metal Development Package, you need Microsoft Visual Studio* 2015 or later and CMake* 3.2 or later. Take into account that the configuration and script files of the C for Metal samples and tutorials are designed for Visual Studio 2015; to use newer versions of Visual Studio, you will have to adjust the paths to the corresponding Visual Studio components yourself.

So, getting to know the Intel C for Metal Development Package for Windows:

  1. Download the Intel C for Metal Development Package (zip file)
  2. Unpack it
  3. Run (preferably in the Visual Studio command line) the setupenv.bat environment configuration script with the following three arguments (see the example after this list):
  • The target Intel GPU generation (corresponding to the processor in which the GPU is integrated: the default value is gen9).
  • Compilation platform: x86 or x64.
  • The DirectX version for surfaces sharing with Intel C for Metal Runtime: dx9 or dx11.
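
For example, a hypothetical invocation targeting a Gen9 GPU, a 64-bit build, and DirectX 11 sharing, following the argument order above:

    setupenv.bat gen9 x64 dx11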

Now you can build all the samples in the examples folder directly; the build_all.bat script will do it. You can also generate the example projects for Microsoft Visual Studio with the create_vs.bat script, passing the name of a specific example as its argument.

As you can see, a C for Metal application is shaped as an executable file containing the host part, plus an .isa file containing the corresponding precompiled GPU part.

The examples included in the Intel C for Metal Development Package are quite diverse: from the simplest Hello, World, which shows the basic principles of C for Metal operation, to the rather complicated implementation of the maximum-flow/minimum-cut graph algorithm (the max-flow min-cut problem) used in image segmentation and stitching. All C for Metal examples are well documented right in the code and in the separate description, the C for Metal Programming reference. It is reasonable to dive into C for Metal this way: sequentially studying and launching the examples, then modifying them to fit your needs.

For a general understanding of all the existing C for Metal features, it is strongly recommended that you study the informal C for Metal specification, cmlangspec.html, in the \documents\compiler\html\cmlangspec folder.

In addition to the C for Metal language itself, a fixed-function C for Metal API is included: it exposes functionality implemented in GPU hardware, with access to the texture samplers, motion estimation, and some video analytics capabilities, all of which can be used for general-purpose calculations.

C for Metal - Strike while the Iron is Hot

Speaking about the performance of C for Metal applications, note that the examples provided in the Intel C for Metal Development Package measure their own execution time, so by running them on the target system and comparing their functionality with the tasks you need to implement, you can evaluate whether C for Metal is appropriate.

The general considerations on C for Metal performance are quite simple:

  • When offloading calculations to the GPU, remember the overhead of CPU <-> GPU data transfer and of synchronizing the two devices. Therefore, an example such as Hello, World may not be a good candidate for a C for Metal implementation. But algorithms for computer vision, AI, and any non-trivial processing of data arrays, especially with data reordering in the process or at the output, are exactly what C for Metal is for.
  • In addition, when designing C for Metal code, it is necessary to take into account the internal structure of the GPU; that is, it is advisable to create a sufficient number (more than 1000) of GPU threads and load them all with work (see the sizing sketch after this list). At the same time, it is a good idea to divide images into small blocks for processing. The specific way of partitioning, as well as the choice of a specific processing algorithm, is not a trivial task if maximum performance is to be achieved; however, this applies to any way of working with any GPU.
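
As a rough illustration of the thread count recommendation, here is a hypothetical sizing calculation for the 6x8-pixel blocks of the linear filter above; the image dimensions are illustrative and assumed divisible by the block size:

    // one GPU thread per 6x8-pixel block
    unsigned int image_width  = 1920, image_height = 1080;
    unsigned int thread_width  = image_width / 8;  // 240 blocks per row
    unsigned int thread_height = image_height / 6; // 180 block rows
    // 240 * 180 = 43200 threads, far more than the recommended 1000+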

Do you have OpenCL code whose performance does not please you? Or CUDA* code that you want to run on a much larger number of platforms? Then the Intel C for Metal Development Package is worth a look.

Intel C for Metal Development Package is a living and evolving product. You could become not just a C for Metal user but a developer as well: the corresponding repositories on GitHub* are waiting for your commits. All the information necessary for both roles is in this article and in the readme files on GitHub. And if something is missing, it will appear after your requests.

We would like to acknowledge Zvi Danovich, PhD, application engineer, Intel, for his valuable contribution to the creation and improvement of this article.