Unlock Composable Parallelism in Python

Hi, I'm Anton Malakhov. In this video, I will be talking about composable parallelism for numeric libraries in Python. Please don't forget to check out the links below to learn more.

When you want your program to run faster, what should you do? For starters, you want to make sure you use all the available cores and their vector processing units. With the Intel Distribution for Python, you can be sure that accelerated packages like NumPy use all the available resources.
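
As a concrete illustration (my example, not from the video), a single NumPy call like the matrix product below is parallelized internally by the underlying BLAS library, e.g. Intel MKL in the Intel Distribution for Python:

```python
import numpy as np

# One NumPy call: the matrix product runs in parallel inside the
# optimized BLAS (e.g. Intel MKL), using all cores and vector units
# without any extra code at the application level.
a = np.random.rand(4096, 4096)
b = np.random.rand(4096, 4096)
c = a @ b
```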

However, there is still a lot of performance left on the table. Let's imagine we have a serial program that runs on the timeline from here to here as a single line. When we accelerate some NumPy calculations by running them in parallel, we see parallel lines on the timeline, with the number of cores used on the y-axis.

We cannot make everything parallel with NumPy alone. There is always memory copying, bookkeeping, and serial logic inside and outside of NumPy. Therefore, we have serial regions in the application, which limit scalability.

If you had an infinite number of parallel processors, which nobody ever has, what would the application timeline look like? All the parallel regions would shrink to take no time, leaving all the serial regions as they are. This is your scalability limit.
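
The limit being described is Amdahl's law, a standard result not named in the video: if a fraction p of the runtime is parallelizable and n processors are available, the overall speedup is

```latex
S(n) = \frac{1}{(1 - p) + p/n},
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p}
```

For example, a program that is 90% parallel (p = 0.9) can never run more than 10x faster, no matter how many cores you add.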

However, if you introduce parallelism at the application level, you effectively eliminate some serial regions, and that improves the scalability of your program. But in Python, it may be a little tricky because of the global interpreter lock (GIL) and the overheads of communication between multiple processes. There are multiple libraries that can help you along the way: start with Python's multiprocessing module and joblib, which make application-level parallelism almost implicit.
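
Since the video points to multiprocessing and joblib, here is a minimal sketch of application-level parallelism with joblib; the workload and sizes are illustrative assumptions, not from the video:

```python
import numpy as np
from joblib import Parallel, delayed

def task(seed):
    # Each task runs a NumPy computation that is itself internally
    # parallel (threaded BLAS/LAPACK under the hood).
    rng = np.random.default_rng(seed)
    a = rng.random((1024, 1024))
    return np.linalg.svd(a, compute_uv=False)

# Application-level parallelism: 4 worker processes run tasks
# concurrently, on top of NumPy's internal threading.
results = Parallel(n_jobs=4)(delayed(task)(s) for s in range(16))
```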

Let's return to our application, and this time assume we have made it parallel. Now we have multiple application-level lines, each executing calls to NumPy that are themselves already parallel.

Running too many threads can actually slow your program down, or even crash it because of resource exhaustion. One approach to this issue is to limit the number of OpenMP and application threads so that the total number of threads running in parallel does not exceed what makes sense for the machine. This is not trivial to implement: you need to calculate the right thread counts, take care of thread affinity, and use a threading library like Intel Threading Building Blocks.
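
A common manual version of this limit is capping the OpenMP thread count per process; the numbers and script name below are illustrative assumptions:

```bash
# With 4 worker processes on a 16-core machine, cap each process at
# 4 OpenMP threads so 4 x 4 stays within the core count. Thread
# affinity (pinning) would still be a separate step.
OMP_NUM_THREADS=4 python your_script.py
```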

Let me introduce here a new SMP module for Python. It implements all of the above and makes static multi-processing and multi-threading easier for parallel applications. All you have to do is run Python with -m smp and your script name.
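
The invocation, as described in the video (the script name is a placeholder):

```bash
# Static multi-processing: the smp module partitions cores among the
# worker processes and limits each one's thread pool to its share.
python -m smp your_script.py
```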

This might still not be enough, and the application can end up undersubscribed if it consists of different parallel patterns across processes and an unbalanced workload. Static partitioning might be inefficient here. Then you need a dynamic approach.

The TBB module for Python, which implements a dynamic approach to multi-threading, is already available. Now we have introduced support for multiprocessing as well. This helps multiple processes talk to each other to coordinate the total number of threads.
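
Usage mirrors the SMP module; the --ipc flag for cross-process coordination follows Intel's published examples, so treat it as an assumption rather than something stated in the video:

```bash
# Dynamic multi-threading through TBB's work-stealing scheduler:
python -m tbb your_script.py

# With multiprocessing support, processes coordinate the total
# thread count over inter-process communication:
python -m tbb --ipc your_script.py
```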

There is more. OpenMP might still be the best choice for many applications. We released an experimental OpenMP runtime which is able to coordinate threads, addressing the known OpenMP issue of quadratic oversubscription.
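
In Intel's published materials this experimental runtime is enabled through an environment variable; the exact name and value below (KMP_COMPOSABILITY) are an assumption based on those materials, not something stated in the video:

```bash
# Assumed knob for the composable OpenMP runtime:
KMP_COMPOSABILITY=mode=exclusive python your_script.py
```

To learn more, check out the links in the description below. Thanks for watching.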