Intel® Distribution of Modin Usage and Performance Tuning Guide

Rachel Oberman

This article provides best practices and advice for using Intel® Distribution of Modin.

To learn more about Intel® Distribution of Modin and how to get started, please visit the Intel® Distribution of Modin Getting Started Guide.

When To Use Intel® Distribution of Modin

The Intel® Distribution of Modin* is a performant, parallel, and distributed dataframe system designed to make data scientists more productive with the tools they already love. It requires only a single-line code change, includes exclusive optimizations for Intel hardware, and is fully compatible with the pandas API.

Typically, developers who use default pandas function calls to process and analyze their data hit performance and memory bottlenecks once the data grows beyond the size that pandas can comfortably handle.

Intel® Distribution of Modin aims to solve this problem. When developers hit this bottleneck, we recommend Intel® Distribution of Modin so that they can keep using pandas API calls while scaling their pandas dataframes across all available cores, with only a small code change.

For more information on Intel® Distribution of Modin, please visit the Intel® Distribution of Modin homepage.

Using Modin

Intel® Distribution of Modin is compatible with three different backend compute engines that are used by Modin to distribute and optimize pandas API calls and computations:

Ray Backend (most recommended) – The Ray* backend is the recommended backend engine for Intel® Distribution of Modin. It has the most pandas API functionality enabled as well as the most stable implementation with Intel® Distribution of Modin.
OmniSci Backend – In partnership with OmniSci* (now HEAVY.AI*), Intel® Distribution of Modin supports OmniSci as a backend. OmniSci is a very performant framework for end-to-end analytics that has been optimized to harness the computing power of existing and emerging Intel® hardware. Please note that this backend is currently experimental.
Dask Backend – The Dask* backend is recommended for workloads running on Windows operating systems or on Intel® DevCloud for oneAPI.
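
The backend guidance above can be sketched in code. This is a minimal sketch, assuming the engine is selected through Modin's MODIN_ENGINE environment variable; the helper name choose_engine is hypothetical and not part of Modin. It covers only the Ray and Dask recommendations and leaves the experimental OmniSci backend aside.

```python
import os
import sys


def choose_engine(platform=sys.platform):
    """Return "dask" on Windows and "ray" elsewhere, following the guidance above."""
    return "dask" if platform.startswith("win") else "ray"


# The variable must be set before the first `import modin.pandas`.
os.environ["MODIN_ENGINE"] = choose_engine()

# import modin.pandas as pd  # uncomment where Modin is installed
```

Setting the variable in the shell before starting Python has the same effect and keeps the choice out of the application code.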

To learn more about how to get started with using Modin, please reference the Intel® Distribution of Modin Getting Started Guide.

Performance Tuning with Intel® Distribution of Modin

How to Tune Modin Function Calls with pandas

Although Modin provides out-of-the-box performance benefits over pandas, some operations are not yet covered by Modin's optimizations. If you experience performance setbacks, Modin and pandas let you switch between the two frameworks at run time. Some of the common reasons can be found here.

To use both frameworks efficiently, we recommend converting the Modin object to a pandas object. The sample code below, which can be used as a workaround, first builds and saves a dataframe with Modin.

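import os

import ray
ray.init()  # start Ray before Modin is imported
import modin.pandas as pd

# self.df_log, savePath, and logName are defined by the surrounding application code.
df_log = pd.concat([self.df_log])
df_log.to_csv(os.path.join(savePath, logName + '_structured.csv'), index=False)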

To process df_log with pandas, convert the Modin object to a pandas object by calling its "_to_pandas()" method.

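import pandas as pd  # rebinds pd from modin.pandas to plain pandas

# Convert the Modin Series to a pandas Series, then count the values with pandas.
occ_dict = dict(df_log['EventTemplate']._to_pandas().value_counts())
df_event = pd.DataFrame()  # a plain pandas DataFrame
df_event['EventTemplate'] = df_log['EventTemplate'].unique()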

Controlling the Number of Cores

If you would like to control the number of cores that Intel® Distribution of Modin will utilize, versus the default of using all available cores on a device, then please visit the relevant documentation.
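
As a minimal sketch of one way to do this, the example below caps the core count through the MODIN_CPUS environment variable, which Modin reads at import time. The helper name limit_modin_cpus is hypothetical, and clamping the request to the cores reported by the operating system is a choice made for this example only.

```python
import os


def limit_modin_cpus(requested):
    """Clamp the request to the cores available and export it as MODIN_CPUS."""
    available = os.cpu_count() or 1
    cpus = max(1, min(requested, available))
    os.environ["MODIN_CPUS"] = str(cpus)
    return cpus


# Call this before the first `import modin.pandas`.
limit_modin_cpus(4)

# import modin.pandas as pd  # uncomment where Modin is installed
```

Modin also exposes this setting in its modin.config module; please check the documentation for the release you are using before relying on either form.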

More Performance Tuning Information

For more information on performance tuning with Intel® Distribution of Modin, please visit the relevant open-source documentation.

Reasons Why Modin May Default Back to pandas

If Intel® Distribution of Modin falls back to default pandas functionality, it is likely for one of the following reasons:

The function is already optimized by pandas, and using Intel® Distribution of Modin will not provide any further performance improvement at this time.
The method is not yet implemented by Intel® Distribution of Modin for the backend being used.

For more information, please visit the Modin documentation section: Defaulting to pandas.

If the Supported APIs documentation states that the function is implemented for the Intel® Distribution of Modin backend engine you are using, then please raise an issue on the Modin GitHub.
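
Before raising an issue, it can help to confirm which calls actually fall back. The sketch below uses only the Python standard library and assumes that Modin reports a fallback as a warning whose text contains "defaulting to pandas"; that wording is an assumption and may differ between releases. The helper run_and_collect_fallbacks and the stand-in function fake_modin_call are hypothetical.

```python
import warnings


def run_and_collect_fallbacks(func, *args, **kwargs):
    """Run func and return (result, messages) for warnings that mention a pandas fallback."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning raised inside func
        result = func(*args, **kwargs)
    messages = [
        str(w.message)
        for w in caught
        if "defaulting to pandas" in str(w.message).lower()
    ]
    return result, messages


def fake_modin_call():
    """Stand-in for a Modin call; replace it with a real dataframe operation."""
    warnings.warn("`DataFrame.example` defaulting to pandas implementation.", UserWarning)
    return 42


result, messages = run_and_collect_fallbacks(fake_modin_call)
print(result, messages)
```

An empty message list suggests that the call ran on the Modin backend without falling back.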

Using Default pandas Implementation

Intel® Distribution of Modin is meant to effortlessly speed up pandas workloads by distributing pandas data and computation.

When the dataset is very small, we recommend that developers use the default pandas import and calls first instead of Intel® Distribution of Modin. At this size, the performance benefits of Intel® Distribution of Modin are negligible because the pandas package is targeted at small datasets.

At this data size, users may also see a slight slowdown with Intel® Distribution of Modin compared to default pandas. This is because Modin incurs additional overhead to distribute the data before calling pandas functions. pandas does not have this overhead, and the discrepancy disappears as the dataset size scales up.
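
One way to apply this advice is to pick the library from the size of the input file. The sketch below is illustrative only: the 100 MB cutoff is a hypothetical value, not a measured or documented threshold, and the helper name dataframe_module_for is not part of Modin or pandas.

```python
import os
import tempfile

SMALL_DATASET_BYTES = 100 * 1024 * 1024  # hypothetical cutoff; measure your own workload


def dataframe_module_for(path, cutoff=SMALL_DATASET_BYTES):
    """Return the name of the module to import for a file of this size."""
    return "pandas" if os.path.getsize(path) < cutoff else "modin.pandas"


# Write a tiny CSV file so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as handle:
    handle.write(b"a,b\n1,2\n")
    small_csv = handle.name

module_name = dataframe_module_for(small_csv)
print(module_name)  # pandas
os.remove(small_csv)

# pd = importlib.import_module(module_name)  # then use pd as usual
```

Because Modin is compatible with the pandas API, the rest of the script can stay the same whichever module is imported.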

About Modin Warnings

Please note that a series of non-critical warnings from Intel® Distribution of Modin does not mean that you are using the package incorrectly. These warnings are the "verbose" log that Intel® Distribution of Modin generates automatically, and they can be helpful when debugging problems. A future release will provide a "verbose" mode option to turn most of these warnings on and off.
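
Until such an option is available, Python's standard warnings module can hide selected messages. This is a minimal sketch, assuming the warnings to hide contain the text "defaulting to pandas"; that wording is an assumption, and the helper name silence_fallback_warnings is hypothetical.

```python
import warnings


def silence_fallback_warnings():
    """Ignore warnings whose message mentions a pandas fallback; leave all others visible."""
    warnings.filterwarnings("ignore", message=".*defaulting to pandas.*")


silence_fallback_warnings()
```

Filtering by message rather than by category keeps unrelated warnings visible, which matters because the log can be helpful for debugging.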
