Prerequisites
C++ knowledge | familiarity with C++20 and its Ranges library is recommended, but you can also learn them by starting with this library |
SYCL compiler | Intel's C++ compiler (icpx) is recommended and is required for this tutorial. g++ (>= 10) is also supported, but then GPU usage is not possible. |
MPI | Intel MPI or MPICH should work |
Linux | Distributed Ranges is not yet tested on other platforms |
The easiest way to get the environment ready is either
- install and enable Intel's oneAPI HPC Toolkit, or
- use Intel Developer Cloud
Introduction
The Distributed Ranges Library is a C++20 library for multi-CPU and multi-GPU computing environments. It provides STL-like algorithms, data structures and views tailored for use in multi-node HPC systems and servers with many CPUs and/or GPUs. It takes advantage of parallel processing and MPI communication in a distributed memory model as well as parallel processing in a shared memory model with many GPUs. Familiarity with C++ templates and the std::ranges or ranges-v3 libraries helps in using distributed ranges, but this guide will help you get started even if you have never used them before.
Disclaimer: The library is still in an experimental phase, and some code may not be as stable as, for example, oneAPI components. However, the code covered in this tutorial and its examples works well.
Building and running Hello World
The easiest way to start writing your own application that uses distributed-ranges is to clone (or fork and clone) the distributed-ranges-tutorial repository and modify the examples provided. The cmake files provided in this repo download the distributed-ranges headers and all dependencies as source code, and build the examples. There is no need to install anything else.
Prepare environment: SYCL C++ Compiler and MPI
There are many ways to do this. If you choose to install the oneAPI HPC Toolkit system-wide, prepare the environment by running
. /opt/intel/oneapi/setvars.sh
If you choose to use the Intel Developer Cloud, the environment is already set up for you.
Get code and build it
git clone https://github.com/oneapi-src/distributed-ranges-tutorial.git
cd distributed-ranges-tutorial
CXX=icpx cmake -B build
cmake --build build
Running
Once the build is finished, let's run the first example using 3 processes:
mpirun -n 3 build/src/example1
The program should display a Hello World message. If you see it, it means everything is working and you are ready to learn and experiment with the library.
Hello World example
We'll explain the code of the first example you just ran in the previous section.
Here is the code:
#include <dr/mhp.hpp>
#include <fmt/core.h>

namespace mhp = dr::mhp;

int main(int argc, char **argv) {
  mhp::init(sycl::default_selector_v);

  mhp::distributed_vector<char> dv(81);
  std::string decoded_string(80, 0);

  mhp::copy(
      0,
      std::string("Mjqqt%|twqi&%Ymnx%nx%ywfsxrnxnts%kwtr%ymj%tsj%fsi%tsq~%"
                  "Inxywngzyji%Wfsljx%wjfqr&"),
      dv.begin());

  mhp::for_each(dv, [](char &val) { val -= 5; });

  mhp::copy(0, dv, decoded_string.begin());

  if (mhp::rank() == 0)
    fmt::print("{}\n", decoded_string);

  mhp::finalize();

  return 0;
}
It runs in SPMD mode, which means the same code is executed by all processes on all MPI nodes involved in the algorithm. The algorithm is as follows:
- Create a distributed structure - distributed_vector, consisting of parts (called segments) that are allocated on different devices (e.g. GPUs).
mhp::distributed_vector<char> dv(81);
The size of this structure is 81; if, for example, there are 3 MPI processes, then each of them will allocate 81/3 = 27 elements.
- Split and copy parts of an encoded string from host memory on process number 0 (rank 0) to all devices (e.g. GPUs).
mhp::copy( 0, std::string("Mjqqt%|twqi&%Ymnx%nx%ywfsxrnxnts%kwtr%ymj%tsj%fsi%tsq~%" "Inxywngzyji%Wfsljx%wjfqr&"), dv.begin());
- Perform decoding in parallel on each device; each device decodes its own part.
mhp::for_each(dv, [](char &val) { val -= 5; });
Notice how nice this looks: just one simple line expresses the algorithm running on each node with different data, and it also hides the SYCL parallelism that happens within each node. (For comparison, a plain single-process version of this step is sketched right after this walkthrough.)
- Copy all the decoded parts back to host memory on process number 0.
mhp::copy(0, dv, decoded_string.begin());
- Display the decoded message on a selected host process.
if (mhp::rank() == 0) fmt::print("{}\n", decoded_string);
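For comparison, the same decoding step written as plain single-process C++ looks as follows. It uses only the standard library, with no distributed-ranges calls and no SYCL offload; the shortened sample string is chosen just for this sketch. The mhp::for_each line above does the same work, but across all segments on all devices at once.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
  // Single-process version of the decoding step: shift every character by -5.
  std::string encoded = "Mjqqt%|twqi&"; // shortened sample, used only in this sketch
  std::vector<char> buf(encoded.begin(), encoded.end());

  std::for_each(buf.begin(), buf.end(), [](char &val) { val -= 5; });

  std::cout << std::string(buf.begin(), buf.end()) << '\n'; // prints "Hello world!"
  return 0;
}

The distributed version replaces the std::vector with a distributed_vector, std::for_each with mhp::for_each, and adds the copy calls that move the data between host memory on rank 0 and the devices.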
Segments example
The second example (see its code here) shows the distributed nature of the data structures. The distributed_vector has segments located on each of the nodes that execute the example. The nodes introduce themselves at the beginning; you can try different numbers of MPI processes when calling mpirun. The iota() function knows what a distributed_vector is and fills the segments accordingly. Then node 0 prints general information about the vector, and each node presents the size and content of its local part. (A minimal sketch of a similar program is shown after the output below.)
You can run it with
mpirun -n 3 build/src/example2
... and see the output:
Hello, World! Distributed ranges process is running on rank 0 / 3 on host xxx
Hello, World! Distributed ranges process is running on rank 1 / 3 on host xxx
Hello, World! Distributed ranges process is running on rank 2 / 3 on host xxx
Created distributed vector of size 100 with 3 segments.
Rank 0 owns segment of size 34 and content [1, 2, 3, ..., 34]
Rank 1 owns segment of size 34 and content [35, 36, 37, ..., 68]
Rank 2 owns segment of size 32 and content [69, 70, 71, ..., 100]
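The actual source of example2 lives in the repository and may differ in details; the following is only a minimal sketch of a program with similar behaviour. It assumes that mhp::iota takes a distributed_vector and a starting value (as described above) and that fmt can print a distributed_vector through the library's formatter.

#include <dr/mhp.hpp>
#include <fmt/core.h>

namespace mhp = dr::mhp;

int main() {
  mhp::init(sycl::default_selector_v);

  // Each of the N ranks owns a block of roughly 100/N consecutive elements (its segment).
  mhp::distributed_vector<int> dv(100);

  // iota() is aware of the distribution and fills every segment in place.
  mhp::iota(dv, 1);

  // Only rank 0 prints; the distributed_vector formatter gathers the elements.
  if (mhp::rank() == 0)
    fmt::print("Created distributed vector of size {}: {}\n", dv.size(), dv);

  mhp::finalize();
  return 0;
}

Running it with a different number of MPI processes changes only the segment sizes; the source code stays the same.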
Cellular Automaton Example
The example simulates the Elementary 1D Cellular Automaton (ECA). A description of what the automaton is and how it works can be found on Wikipedia. A visualisation of how the automaton works is available on the ASU team website.
The ECA calculates the new value of a cell using the old value of the cell and the old values of the cell's neighbours. Therefore a halo of 1-cell width is used to get access to the values of neighbouring cells when the loop reaches the ends of each local segment of the vector. Additionally, the example presents the use of a subrange and the transform() function, which places transformed values of the input structure into the output structure, element by element. The transform operation is given as the lambda newvalue.
Note: after each step of the loop, the vector content is printed with fmt::print(). The formatter function for distributed_vector is rather slow, as it retrieves the vector element by element, both from the local node and from remote nodes. Use it only for debugging.
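To make the update rule itself concrete, here is a small sketch of an elementary cellular automaton in plain, single-process C++. It does not use the distributed-ranges API at all, and the rule number 90 and the fixed-zero boundary are arbitrary choices for this illustration; in example3 the same kind of per-cell computation is the body of the newvalue lambda passed to transform(), and the halo supplies the neighbour values at the segment boundaries.

#include <cstddef>
#include <iostream>
#include <vector>

// An 8-bit rule number encodes the new cell value for each of the
// 8 possible (left, centre, right) neighbourhoods. Rule 90 is an
// arbitrary choice made for this sketch.
constexpr unsigned rule = 90;

int main() {
  // What a per-cell newvalue computation looks like for an elementary CA.
  auto newvalue = [](unsigned left, unsigned centre, unsigned right) {
    unsigned pattern = (left << 2) | (centre << 1) | right; // 0..7
    return (rule >> pattern) & 1u;
  };

  std::vector<unsigned> cells(31, 0), next(31, 0);
  cells[15] = 1; // start from a single live cell in the middle

  for (int step = 0; step < 10; ++step) {
    // Update interior cells only; the two boundary cells stay 0 here.
    // In the distributed version a 1-cell halo provides the neighbour
    // values that live in a segment owned by another process.
    for (std::size_t i = 1; i + 1 < cells.size(); ++i)
      next[i] = newvalue(cells[i - 1], cells[i], cells[i + 1]);
    cells.swap(next);

    for (unsigned c : cells)
      std::cout << (c ? '#' : '.');
    std::cout << '\n';
  }
  return 0;
}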
More Examples
Many examples can be found
- in the examples directory of the distributed-ranges repository ...
- ... and in its unit tests.
The distributed-ranges-tutorial repository, on which this guide is based, only provides a few initial examples to help you get started with the library. Please explore the two places above to find many ways to use the various algorithms, views and data structures that exist in the library.
How to contact us
If you have a question, feedback, bug report, feature request, etc., please open an issue in the distributed-ranges repository.