
Get Started with Distributed Ranges

This guide will show you how to

  • easily create a data structure that is distributed across a few devices, such as GPUs or computers

  • run some algorithms on this data structure


Prerequisites 

 
  • C++ knowledge: familiarity with C++20 and its Ranges library is recommended, but you can also learn them along the way using this library.

  • SYCL compiler: Intel's C++ compiler is recommended and is required by this tutorial. g++ (>= 10) is also supported, but then GPU usage is not possible.

  • MPI: Intel MPI or MPICH should work.

  • Linux: Distributed Ranges is not yet tested on other platforms.
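
If you have not used the C++20 Ranges library before, the following tiny standard-library example (no Distributed Ranges involved) shows the composable, pipeline style that the library builds on:

#include <iostream>
#include <ranges>
#include <vector>

int main() {
  std::vector<int> v{1, 2, 3, 4, 5, 6};
  // Keep the even numbers and square them, lazily, as a view pipeline.
  auto evens_squared = v | std::views::filter([](int x) { return x % 2 == 0; })
                         | std::views::transform([](int x) { return x * x; });
  for (int x : evens_squared)
    std::cout << x << ' ';  // prints: 4 16 36
  std::cout << '\n';
}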

 

The easiest way to have the environment ready is either to install the oneAPI HPC Toolkit or to use the Intel Developer Cloud; both options are covered in the Prepare environment section below.

Introduction

The Distributed Ranges Library is a C++20 library for multi-CPU and multi-GPU computing environments. It provides STL-like algorithms, data structures and views tailored for use in multi-node HPC systems and servers with many CPUs and/or GPUs. It takes advantage of parallel processing and MPI communication in a distributed memory model as well as parallel processing in a shared memory model with many GPUs. Familiarity with C++ templates and the std::ranges or ranges-v3 libraries helps in using distributed ranges, but this guide will help you get started even if you have never used them before.
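
To give a flavour of what "STL-like" means here, compare a standard-library loop with its distributed counterpart. This is only a fragment that uses calls appearing later in this guide (mhp::distributed_vector, mhp::iota, mhp::for_each) and assumes the usual mhp::init()/mhp::finalize() setup shown in the Hello World example below:

// Standard C++ on a single process:
std::vector<int> v(100);
std::iota(v.begin(), v.end(), 0);                 // 0, 1, 2, ...
std::for_each(v.begin(), v.end(), [](int &x) { x *= 2; });

// Distributed Ranges: the same idea, but the data is spread across
// processes and devices, and the loop runs in parallel on each of them.
mhp::distributed_vector<int> dv(100);
mhp::iota(dv, 0);
mhp::for_each(dv, [](int &x) { x *= 2; });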

Disclaimer: The library is still in an experimental phase; some features may not be as stable as those in, e.g., oneAPI. The code covered in this tutorial and its examples, however, works well.

Building and running Hello World

The easiest way to start writing your own application that uses distributed-ranges is to clone (or fork and clone) the distributed-ranges-tutorial repository and modify the examples provided. The CMake files in this repository download the distributed-ranges headers and all dependencies as source code and build the examples; there is no need to install anything else.

Prepare environment: SYCL C++ Compiler and MPI

There are many ways to do this. If you choose to install the oneAPI HPC Toolkit system-wide, prepare the environment by running

. /opt/intel/oneapi/setvars.sh

If you choose to use the Intel Developer Cloud, the environment is already set up for you.

Get code and build it

git clone https://github.com/oneapi-src/distributed-ranges-tutorial.git
cd distributed-ranges-tutorial
CXX=icpx cmake -B build
cmake --build build

Running

Once the build is finished, run the first example using 3 processes:

mpirun -n 3 build/src/example1

The program should display a Hello World message. If you see it, it means everything is working and you are ready to learn and experiment with the library.

Hello World example

We'll explain the code of the first example you just ran in the previous section.

Here is the code:

#include <dr/mhp.hpp>
#include <fmt/core.h>

namespace mhp = dr::mhp;

int main(int argc, char **argv) {

  mhp::init(sycl::default_selector_v);

  mhp::distributed_vector<char> dv(81);
  std::string decoded_string(80, 0);

  mhp::copy(
      0,
      std::string("Mjqqt%|twqi&%Ymnx%nx%ywfsxrnxnts%kwtr%ymj%tsj%fsi%tsq~%"
                  "Inxywngzyji%Wfsljx%wjfqr&"),
      dv.begin());

  mhp::for_each(dv, [](char &val) { val -= 5; });
  mhp::copy(0, dv, decoded_string.begin());

  if (mhp::rank() == 0)
    fmt::print("{}\n", decoded_string);

  mhp::finalize();

  return 0;
}

It runs in SPMD mode, which means the same code is executed by all processes on all MPI nodes involved in the algorithm. The algorithm is as follows:

  1. Create a distributed structure - distributed_vector, consisting of parts (called segments) that are allocated on different devices (e.g. GPUs).
    mhp::distributed_vector<char> dv(81);

    The size of this structure is 81; if, for example, there are 3 MPI processes, then each of them will allocate 81/3 = 27 elements.

  2. Split and copy parts of an encoded string from host memory on process number 0 (rank 0) to all devices (e.g. GPUs).
    mhp::copy(
       0,
       std::string("Mjqqt%|twqi&%Ymnx%nx%ywfsxrnxnts%kwtr%ymj%tsj%fsi%tsq~%"
                   "Inxywngzyji%Wfsljx%wjfqr&"),
       dv.begin());
  3. Perform decoding in parallel on each device - each device decodes its own part (each character of the message was encoded by shifting it up by 5, so subtracting 5 restores the original text).
    mhp::for_each(dv, [](char &val) { val -= 5; });

    Notice how concise this is: a single line expresses the algorithm running on each node with its own data, and it also encapsulates the SYCL parallelism that happens within each node. (A transform-based variant of this step is sketched just after this list.)

  4. Copy all the decoded parts back to host memory on process number 0.
    mhp::copy(0, dv, decoded_string.begin());
  5. Display the decoded message on a selected host process.
    if (mhp::rank() == 0)
       fmt::print("{}\n", decoded_string);
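
As a small variation, steps 3 and 4 could also be combined using the transform() algorithm that appears later in the cellular-automaton example. A minimal sketch, reusing dv and decoded_string from the listing above and assuming transform's range-to-output-iterator form:

// Sketch: decode into a second distributed_vector instead of modifying dv
// in place, then gather the result on rank 0 (assumes mhp::transform(in,
// out_iterator, op), as used in the cellular-automaton example).
mhp::distributed_vector<char> decoded(dv.size());
mhp::transform(dv, decoded.begin(),
               [](char val) { return static_cast<char>(val - 5); });
mhp::copy(0, decoded, decoded_string.begin());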
    

Segments example

The second example (see its code here) shows the distributed nature of the data structures. The distributed_vector has segments located on each of the nodes that execute the example. The nodes introduce themselves at the beginning; you can try different numbers of MPI processes when calling mpirun. The iota() function is aware of what a distributed_vector is and fills its segments accordingly. Then node 0 prints general information about the vector, and each node presents the size and content of its local part.

You can run it with

mpirun -n 3 build/src/example2

... and see the output:
Hello, World! Distributed ranges proces is running on rank 0 / 3 on host xxx
Hello, World! Distributed ranges proces is running on rank 1 / 3 on host xxx
Hello, World! Distributed ranges proces is running on rank 2 / 3 on host xxx
Created distributed vector of size 100 with 3 segments.
Rank 0 owns segment of size 34 and content [1, 2, 3, ..., 34]
Rank 1 owns segment of size 34 and content [35, 36, 37, ..., 68]
Rank 2 owns segment of size 32 and content [69, 70, 71, ..., 100]
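
The exact code lives in the repository; the rough pattern it illustrates can be sketched with only the calls introduced so far (a hypothetical fragment, not the code of example2; it assumes mhp::iota(range, start_value)):

// Hypothetical sketch: fill a distributed vector with 1..100 across all
// ranks, then gather it into host memory on rank 0 and print it there.
mhp::distributed_vector<int> dv(100);
mhp::iota(dv, 1);                      // each rank fills its own segments
std::vector<int> gathered(dv.size());
mhp::copy(0, dv, gathered.begin());    // collect everything on rank 0
if (mhp::rank() == 0) {
  for (int x : gathered)
    fmt::print("{} ", x);
  fmt::print("\n");
}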

Cellular Automaton Example

The example simulates an elementary one-dimensional cellular automaton (ECA). A description of what the automaton is and how it works can be found on Wikipedia, and a visualization of the automaton in action is available on the ASU team website.

The ECA calculates the new value of a cell using the old value of the cell and the old values of its neighbours. Therefore a halo of 1-cell width is used to access the values of neighbouring cells when the loop reaches the end of each local segment of the vector. Additionally, the example presents the use of a subrange and the transform() function, which places transformed values of the input structure into the output structure, element by element. The transform operation is given as the lambda newvalue.
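
To make the rule concrete, here is the neighbourhood logic of an elementary automaton in plain C++, independent of the library. The rule number is an arbitrary choice for illustration and is not necessarily the one the example uses:

#include <cstdint>

// The new value of a cell depends only on the old values of
// (left, center, right). Rule 30 is chosen purely for illustration.
constexpr std::uint8_t rule = 30;

constexpr int eca_step(int left, int center, int right) {
  int pattern = (left << 2) | (center << 1) | right;  // 3-bit neighbourhood, 0..7
  return (rule >> pattern) & 1;                       // look up that bit of the rule
}

// e.g. eca_step(0, 0, 1) == 1 and eca_step(1, 1, 1) == 0 for rule 30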

Note: after each iteration of the loop, the vector content is printed with fmt::print(). The formatter for distributed_vector is rather slow, as it retrieves the vector element by element, both from the local node and from remote nodes, so use it only for debugging.

More Examples

Many more examples can be found in the main distributed-ranges repository.

The distributed-ranges-tutorial repository, on which this guide is based, provides only a few initial examples to help you get started with the library. Please explore the distributed-ranges repository itself to find many ways to use the various algorithms, views and data structures that exist in the library.

How to contact us

If you have a question, feedback, bug report, feature request, etc., please open an issue in the distributed-ranges repository.