Modin* Usage and Performance Tuning Guide

This guide collects best practices and performance tuning advice for using Modin*.

When To Use Modin*

Modin is a performant, parallel, and distributed dataframe system that lets data scientists stay productive with the tools they already use by changing a single line of code. The library is fully compatible with the pandas API.

Developers who process and analyze data with default pandas function calls typically run into performance and memory bottlenecks once the data grows beyond what pandas can comfortably handle.

Modin aims to solve this problem. When developers hit this bottleneck, we recommend Modin so that they can keep using pandas API calls while scaling their pandas DataFrames quickly, all with just a few lines of code.

For more information on Modin, please visit the documentation.

Using Modin*

Modin is compatible with three different backend compute engines, which it uses to distribute and optimize pandas API calls and computations. To check out the latest backends available, please visit the documentation.
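As a minimal sketch of how an engine could be selected, the helper below sets the MODIN_ENGINE environment variable before Modin is imported. The helper name select_modin_engine and the list of engine names are assumptions made for illustration; check the documentation for the engines your Modin version actually supports.

```python
import os

# Engine names assumed for this sketch; check which ones your Modin version supports.
KNOWN_ENGINES = {"ray", "dask", "unidist"}


def select_modin_engine(engine: str = "ray") -> str:
    """Request a Modin engine through the MODIN_ENGINE environment variable."""
    name = engine.strip().lower()
    if name not in KNOWN_ENGINES:
        raise ValueError(
            f"unknown engine {engine!r}; expected one of {sorted(KNOWN_ENGINES)}"
        )
    os.environ["MODIN_ENGINE"] = name
    return name


# Call this before the first `import modin.pandas`, so the setting is in
# place when Modin initializes.
select_modin_engine("ray")
# import modin.pandas as pd  # requires Modin and the chosen engine to be installed
```

Setting the variable in the shell before starting Python has the same effect and keeps the choice out of the application code.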

Performance Tuning with Modin*

How to Tune Modin Function Calls with pandas

Modin delivers performance benefits over pandas out of the box, but some operations are not yet optimized or supported in Modin. If you experience a performance setback, you can switch between Modin and pandas objects at run time. Common reasons for such setbacks are described in the section Reasons Why Modin May Default Back to pandas.

To use both frameworks efficiently, convert the Modin object to a pandas object where needed. The sample code below creates a Modin DataFrame as the starting point for this workaround:

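import os

import ray
ray.init()  # start the Ray engine before Modin is used
import modin.pandas as pd

# self.df_log, savePath, and logName are defined by the surrounding application.
df_log = pd.concat([self.df_log])
df_log.to_csv(os.path.join(savePath, logName + '_structured.csv'), index=False)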

To process df_log with pandas, convert the Modin object to a pandas object by calling its "_to_pandas()" method:

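import pandas as pd  # from here on, pd refers to plain pandas, not Modin

# Convert the Modin Series to a pandas Series before counting values.
occ_dict = dict(df_log['EventTemplate']._to_pandas().value_counts())
df_event = pd.DataFrame()  # plain pandas DataFrame
df_event['EventTemplate'] = df_log['EventTemplate'].unique()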

Controlling the Number of Cores

By default, Modin uses all available cores on a device. To control the number of cores Modin uses, please visit the relevant documentation.
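One way to cap the core count, sketched here under the assumption that your Modin version reads the MODIN_CPUS environment variable at start-up, is to set that variable before Modin is imported. The helper name limit_modin_cpus is hypothetical.

```python
import os


def limit_modin_cpus(requested: int) -> int:
    """Clamp the requested core count to the machine and export MODIN_CPUS."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    cpus = max(1, min(requested, available))
    os.environ["MODIN_CPUS"] = str(cpus)
    return cpus


# Call this before the first `import modin.pandas`.
limit_modin_cpus(4)
# import modin.pandas as pd  # requires Modin to be installed
```

Clamping to the available core count avoids asking for more workers than the machine can run in parallel.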

More Performance Tuning Information

For more information on performance tuning with Modin, please visit the relevant open-source documentation.

Reasons Why Modin May Default Back to pandas

If Modin falls back to default pandas functionality, it is likely for one of the following reasons:

  • The function is already optimized by pandas and using Modin will not provide any more performance improvements at this time.
  • The method is not yet implemented by Modin for the backend being used.

For more information, please visit the Modin documentation section: Defaulting to pandas.

If, according to the Supported APIs documentation, the function should be implemented by the Modin backend engine that you are using, then please raise an issue on the Modin GitHub.
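To see which calls fall back, one option is to record the warnings raised while an operation runs. The sketch below assumes that Modin reports a fallback through Python's standard warnings module and that the message contains the phrase "defaulting to pandas"; both the helper name find_pandas_fallbacks and the marker text are assumptions to verify against your Modin version.

```python
import warnings


def find_pandas_fallbacks(operation, marker="defaulting to pandas"):
    """Run operation() and return the warning messages that contain marker."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        operation()
    needle = marker.lower()
    return [str(w.message) for w in caught if needle in str(w.message).lower()]


def fake_operation():
    # Stand-in for a Modin call; the message text is illustrative only.
    warnings.warn("`DataFrame.example` defaulting to pandas implementation.")


fallbacks = find_pandas_fallbacks(fake_operation)
print(len(fallbacks))
```

With a real workload, pass a function or lambda that wraps the Modin call you want to inspect.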

Using Default pandas Implementation

Modin is meant to effortlessly speed up pandas workloads by distributing pandas data and computation.

When the dataset is very small, we recommend that developers start with the default pandas import and calls instead of Modin. At this size, Modin's performance benefits are negligible because pandas is targeted at small datasets.

Users may even see a slight slowdown with Modin on such small datasets compared to default pandas. Modin needs additional overhead to distribute the data before calling pandas functions, whereas pandas does not. The discrepancy disappears as the dataset size scales up.
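A simple way to apply this advice is to choose the library from the input size. The sketch below is illustrative only: the 100 MB cutoff and both helper names are hypothetical, and a suitable threshold should come from benchmarking your own workload.

```python
import importlib
import os

SMALL_DATA_BYTES = 100 * 1024 * 1024  # hypothetical cutoff; tune by benchmarking


def pick_library_name(size_bytes: int, threshold: int = SMALL_DATA_BYTES) -> str:
    """Return the module to import for a dataset of the given size."""
    return "pandas" if size_bytes < threshold else "modin.pandas"


def import_dataframe_library(csv_path: str, threshold: int = SMALL_DATA_BYTES):
    """Import pandas or Modin depending on the size of the file on disk."""
    name = pick_library_name(os.path.getsize(csv_path), threshold)
    return importlib.import_module(name)  # the chosen library must be installed


print(pick_library_name(10 * 1024))  # a 10 KB input stays on plain pandas
```

Because Modin follows the pandas API, the rest of the code can use the returned module as pd in either case.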

About Modin Warnings

Please note that seeing a series of non-critical warnings when using Modin does not mean that you are using the package incorrectly. This is the “verbose” log that Modin generates automatically, and it can be helpful when debugging problems.
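If the output is too noisy for a particular step, the warnings can be silenced locally rather than globally. This is a minimal sketch that assumes the messages are raised as UserWarning through Python's standard warnings module; the helper name quiet_user_warnings is hypothetical.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def quiet_user_warnings():
    """Temporarily ignore UserWarning messages inside a with-block."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=UserWarning)
        yield


with quiet_user_warnings():
    # Stand-in for a Modin call that would otherwise print a warning.
    warnings.warn("example non-critical warning")
```

The previous warning filters are restored when the block exits, so the log stays available for debugging everywhere else.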

For More Information