An Easy Introduction to Modin*: A Step-by-Step Guide to Accelerating Pandas

Modin* is an open source project that speeds up data preparation and manipulation, a crucial initial phase in every data science workflow. Developed by Devin Petersohn during his work in the RISELab at UC Berkeley, it is a drop-in replacement for the widely used Python* library Pandas, with wide coverage of the Pandas API. It accelerates Pandas’ data handling operations by partitioning a DataFrame along both rows and columns, which gives Modin the scalability and flexibility to perform parallel and distributed computations efficiently.
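To make the idea of splitting along both rows and columns concrete, here is a minimal sketch using NumPy and a thread pool. It is an illustration of block partitioning in general, not Modin's actual internals; the helper names `partition` and `column_sums_partitioned` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def partition(arr, row_parts, col_parts):
    """Split a 2-D array into a grid of blocks along both axes."""
    return [np.array_split(rows, col_parts, axis=1)
            for rows in np.array_split(arr, row_parts, axis=0)]


def column_sums_partitioned(arr, row_parts=2, col_parts=2):
    """Compute column sums block by block, then combine the partial results."""
    grid = partition(arr, row_parts, col_parts)
    with ThreadPoolExecutor() as pool:
        # Submit one independent task per block; each yields partial column sums.
        futures = [[pool.submit(block.sum, axis=0) for block in row] for row in grid]
    # Stitch the column blocks of each row band side by side...
    stitched = [np.concatenate([f.result() for f in row]) for row in futures]
    # ...then add up the row bands to get the final column sums.
    return np.sum(stitched, axis=0)


data = np.arange(24).reshape(6, 4)
result = column_sums_partitioned(data)
print(result)
```

Because each block is processed independently, the same decomposition could in principle be handed to separate cores or machines, which is the property the paragraph above describes.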

Intel, one of the largest maintainers and contributors to the Modin project, has created its own build of this performant dataframe system: the Intel® Distribution of Modin. Available both as a stand-alone component and through the Intel® AI Analytics Toolkit, powered by oneAPI, the Intel® Distribution of Modin aims to provide the best end-to-end experience of expedited analytics on Intel platforms. This article covers both the classical Modin framework and its Intel® distribution.

Before proceeding, get yourself acquainted with the Pandas library to better understand the power of Modin.

Comparing Modin Vs Pandas

Pandas is a very popular open-source Python library for data manipulation and analysis. It has several benefits, including speed, flexibility, and ease of use. There are, however, certain limitations of Pandas that Modin tries to address:

  • Pandas runs most operations on a single core at a time, whereas Modin automatically distributes the computation across all the cores on your machine. As modern processors become increasingly multi-core, this provides significant advantages.

Figure 1. Pandas using a single core

Image source: Official Modin documentation

Figure 2. Modin using all the available cores

 

  • Pandas’ execution speed may deteriorate, or the application may run out of memory while handling voluminous data. Modin, on the other hand, can efficiently handle large datasets (~1TB+).
  • Pandas usually copies the data regardless of whether a change is made in place. Modin, unlike Pandas, uses immutable data structures and maintains a mutable pointer to them. So, for in-place operations, Modin only adjusts the pointer among the DataFrames sharing common memory blocks. Memory layouts therefore remain unmodified, resulting in better memory management than in Pandas.
  • Instead of having duplicate methods of performing the same operation as in Pandas, Modin has a single internal implementation of each behavior while still covering the whole Pandas API.
  • Pandas uses over 200 algebraic operators, while Modin can perform the same operations with an internal algebra comprising merely 15 operators. This makes Modin easier to use, as a developer needs to learn and remember fewer operators.
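The "immutable data, mutable pointer" idea from the list above can be pictured with a toy class. This is only a conceptual sketch in plain Python, not how Modin is implemented; the class `Frame` and its methods are hypothetical names chosen for illustration.

```python
class Frame:
    """Toy frame: a mutable handle that points at immutable column data."""

    def __init__(self, columns):
        # Tuples cannot be modified after creation, so blocks are never edited.
        self._data = {name: tuple(values) for name, values in columns.items()}

    def copy(self):
        clone = Frame({})
        # New handle, same tuple objects: no column data is duplicated.
        clone._data = dict(self._data)
        return clone

    def scale_inplace(self, name, factor):
        # "In place" from the caller's view: build a new block and re-point
        # this frame's handle; blocks shared with other frames stay untouched.
        self._data[name] = tuple(v * factor for v in self._data[name])

    def column(self, name):
        return self._data[name]


a = Frame({"x": [1, 2, 3], "y": [4, 5, 6]})
b = a.copy()
a.scale_inplace("x", 10)

print(a.column("x"))                      # column x as seen through a
print(b.column("x"))                      # b still sees the original block
print(a.column("y") is b.column("y"))     # untouched block is still shared
```

Only the pointer for column `x` in `a` changes; the untouched column `y` remains a single shared block.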

Key Advantages of Modin

Modin provides several performance and productivity benefits to developers who are currently using Pandas.

  • Ease of use: Modify a single line of code to accelerate your Pandas application.
  • Productivity: Accelerate Pandas with minor code changes and no need to learn a whole new API. This also allows easy integration with the Python ecosystem.
  • Robustness, light weight, and scalability: Modin can process MBs and even TBs of data on a single machine. It also scales seamlessly across multiple cores with the Dask and Ray distributed execution engines.
  • Performance: Modin can deliver orders-of-magnitude performance gains over Pandas for some operations.
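The "single line" mentioned under ease of use is the import statement. A minimal sketch follows; the `try`/`except` fallback is an addition for illustration so the snippet also runs where Modin is not installed, and it assumes an execution engine is available when Modin is present.

```python
try:
    import modin.pandas as pd  # the one changed line: was "import pandas as pd"
except ImportError:
    import pandas as pd        # fallback so the sketch runs without Modin

# Everything below is ordinary Pandas code and is unchanged by the swap.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
col_means = df.mean(axis=0)
print(col_means)
```

The rest of the application keeps using the `pd` name, which is why existing Pandas code needs no further edits.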

Intel® Distribution of Modin

The Intel® Distribution of Modin is a performant, parallel, and distributed dataframe system that accelerates Pandas while using a fully compatible API. It provides all the advantages of stock Modin while also leveraging the OmniSci* framework in the back end to provide accelerated analytics on Intel® platforms.

Modin has a hierarchical architecture resembling that of a database management system. Such a structure enables optimization and swapping of individual components without disturbing the rest of the system.

Figure 3. Layers and components of Modin (Intel® distribution) architecture

Image source: Official Intel® Distribution of Modin documentation

The user sends interactive or batch commands to Modin through its API. Modin then executes them using one of its supported execution engines. The following top-down model shows how Modin works.

Figure 4. System-level view of Modin architecture

Image source: Official documentation

To understand the architecture in detail, visit the documentation of Modin.

Supported specifications for Intel® Distribution of Modin

  • Programming language: Python
  • Operating System (OS): Windows* and Linux*
  • Processors: Intel® Core™ and Intel® Xeon® processors

Installation methods

Modin can be installed with ease using one of the following ways:

Code implementation

Here’s a practical demonstration of how some Pandas operations can be expedited using Modin. A custom dataset of dimensions \(2^{15} \times 2^{10}\), filled with random integers between 10 and 1000, is used to compare Pandas and Modin performance.

A stepwise explanation of the code is as follows:

  1. Install Modin
    !pip install modin[all]

With [all], Modin gets installed with all the supported execution engines. A plain "pip install modin" installs Modin itself and relies on an engine already present in the environment. You can also work with a specific engine by naming it in the installation step, e.g., modin[dask] or modin[ray].
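When more than one engine is installed, the one Modin uses can be chosen explicitly. The sketch below uses the `MODIN_ENGINE` environment variable and the `modin.config` module; the assumption is that it runs before Modin is first used in the process. The `ImportError` fallback is only there so the snippet runs without Modin.

```python
import os

# Assumption: this runs before Modin needs an engine, since the setting
# is read when Modin first selects one.
os.environ["MODIN_ENGINE"] = "dask"  # or "ray"

try:
    import modin.config as modin_cfg
    # Equivalent in-code switch: modin_cfg.Engine.put("dask")
    selected = modin_cfg.Engine.get()
except ImportError:
    # Modin is not installed here; report the requested engine instead.
    selected = os.environ["MODIN_ENGINE"]

print("Execution engine:", selected)
```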

  2. Import Modin and other required libraries.
    import modin.pandas as md
    import numpy as np
    import pandas as pd
  3. Create a custom array of random integers using NumPy’s randint() function.
    arr = np.random.randint(low=10, high=1000, size=(2**15, 2**10))

The ‘low’ and ‘high’ parameters bound the random integers: ‘low’ is the smallest value that can be drawn (inclusive) and ‘high’ is one above the largest (exclusive), so the values here range from 10 to 999. The ‘size’ parameter specifies the (rows, columns) shape of the array.
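A quick check on a small array shows how the randint() parameters behave. The 4 x 3 shape is an arbitrary choice for illustration; the bounds are the ones used in this walkthrough.

```python
import numpy as np

# Same bounds as the walkthrough, but a tiny shape so it is easy to inspect.
sample = np.random.randint(low=10, high=1000, size=(4, 3))

print(sample.shape)          # (4, 3): rows, then columns
print(sample.min() >= 10)    # True: low is inclusive
print(sample.max() <= 999)   # True: high is exclusive
```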

  4. Store the data into a CSV file.
    np.savetxt("data.csv", arr, delimiter=",")

A condensed portion of the dataset appears as follows:

Figure 5. Part of the custom dataset
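The timed operations that follow use two DataFrames, p_df and m_df, and the results summary includes a "Read csv file" row. A minimal sketch of how they could be loaded is shown here. It assumes both are read from the saved CSV with read_csv() and that header=None is passed (np.savetxt() writes no header row); neither detail is confirmed by the article. A small array and a temporary path stand in for the full dataset, and the import fallback only keeps the snippet runnable without Modin.

```python
import os
import tempfile

import numpy as np
import pandas as pd

try:
    import modin.pandas as md
except ImportError:
    md = pd  # fallback so the sketch runs where Modin is absent

# Small stand-in for the article's 2**15 x 2**10 dataset.
arr = np.random.randint(low=10, high=1000, size=(8, 4))
csv_path = os.path.join(tempfile.mkdtemp(), "data.csv")
np.savetxt(csv_path, arr, delimiter=",")

# Assumption: header=None, so the first data row is not consumed as a header.
p_df = pd.read_csv(csv_path, header=None)   # Pandas DataFrame
m_df = md.read_csv(csv_path, header=None)   # Modin DataFrame (same call)

print(p_df.shape, m_df.shape)
```

In a notebook, each read_csv() call can be prefixed with %time in the same way as the operations below.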
  1. Mean operation to calculate the average value of each column
    • Using Pandas:
      %time p_df.mean(axis=0)
    • Using Modin:
      %time m_df.mean(axis=0)
  2. Concatenate a DataFrame with itself
    • Using Pandas:
      %time pd.concat([p_df, p_df], axis=0)
    • Using Modin:
      %time md.concat([m_df, m_df], axis=0)
  3. applymap() method to apply an operation elementwise to the whole DataFrame
    • Using Pandas:
      %time p_df.applymap(lambda i:i*2)
      This line of code multiplies each element of the DataFrame by 2.
    • Using Modin:
      %time m_df.applymap(lambda i:i*2)

Below is a summary of recorded wall clock time and speedup achieved by Modin for each operation:

DataFrame operation    Pandas (median of 4 runs)    Modin (1st run)    Modin (median of 4 runs)    Speedup
Read csv file          10.800s                      15.400s            9.190s                      1.175x
Mean                   0.064s                       0.066s             0.048s                      1.333x
Concatenate            0.162s                       0.002s             0.001s                      162.000x
applymap()             8.570s                       0.034s             0.029s                      295.517x

Note: The speedup is computed as (Pandas’ wall time) / (Modin’s wall time), where both wall times are medians over 4 runs of each operation.
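The speedup column can be reproduced directly from the reported medians. The snippet below simply applies the formula in the note to the figures from the summary above.

```python
# Median wall times in seconds from the summary above: (Pandas, Modin).
medians = {
    "Read csv file": (10.800, 9.190),
    "Mean": (0.064, 0.048),
    "Concatenate": (0.162, 0.001),
    "applymap()": (8.570, 0.029),
}

# Speedup = Pandas median wall time / Modin median wall time.
speedups = {op: round(pandas_t / modin_t, 3)
            for op, (pandas_t, modin_t) in medians.items()}

for op, speedup in speedups.items():
    print(f"{op}: {speedup:.3f}x")
```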

The code sample explained above is available as a notebook. Note that the results may vary between runs of the notebook depending on the hardware configuration and software versions used, the random generation of data, and the initialization Modin performs in its first iteration.

Testing Details

The following are the details of the test run described above:

Testing date: 05/21/2022
Test done by: Intel Corporation
Hardware configuration used:
Intel® Xeon® CPU @ 2.30GHz (single core)
Intel® Xeon® CPU @ 2.30GHz (single core)
Processor Broadwell (CPU family: 6, Model: 79)

CPU MHz: 2200.214
Cache size: 56320 kB
Vendor ID: GenuineIntel
Stepping: 0
fpu: yes
fpu_exception: yes
cpuid level: 13
bogomips: 4400.42
clflush size: 64
cache alignment: 64
address size: 46 bits physical, 48 bits virtual

Operating system: Linux (Ubuntu* 18.04.3 LTS (Bionic Beaver))

RAM: 12.68 GB
Disk: 107.72 GB

Software versions used:
Coding environment: Google Colab
Programming language: Python 3.7.13
NumPy 1.21.6
Pandas 1.3.5
Modin 0.12.1

Potential limitations of Modin

  • When Modin is run for the first time, some operations may take a long time to execute (at times even longer than with Pandas), as Modin performs some initialization in the first iteration. Over multiple runs, however, its acceleration becomes clear.
  • For small datasets of, say, a few KBs, the gain in execution speed over Pandas may be low or insignificant; the larger the dataset, the more effective the acceleration.
  • Visit this page to know about some scenarios where Modin may result in slower execution than Pandas.
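Because of the first-run initialization cost noted above, it can help to time an operation several times and discard a warm-up call. The helper below is a hypothetical sketch using only the standard library; `median_wall_time` is not part of Modin or Pandas, and the warm-up policy is an assumption rather than the method used for the results in this article.

```python
import statistics
import time


def median_wall_time(fn, runs=4, warmup=1):
    """Median wall-clock seconds over `runs` calls, after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()  # absorbs one-time start-up cost such as engine initialization
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload; in practice this would be e.g. lambda: m_df.mean(axis=0).
elapsed = median_wall_time(lambda: sum(i * 2 for i in range(100_000)))
print(f"median wall time: {elapsed:.6f}s")
```

Outside a notebook, where the %time magic is unavailable, a helper like this gives comparable wall-clock numbers for Pandas and Modin calls.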

Next steps

We hope that this post has piqued your interest in Modin by giving you a quick overview of its features and advantages while illustrating how it can easily expedite some of the crucial Pandas operations. There are several other operations covering a majority of the Pandas API for which Modin can give you faster outputs with minor code modifications. Get started with using Intel Distribution of Modin today as part of your AI workflows, and if interested we encourage you to contribute to the still relatively new and developing Modin project. We also encourage you to check out Intel’s other AI Tools and Framework optimizations and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio.

Useful Resources

To plunge deeper into Modin, refer to the following links:

Acknowledgment:

We would like to thank Vadim Sherman, Andres Guzman-ballen, and Igor Zamyatin for their contributions to the blog and Rachel Oberman, Preethi Venkatesh, Praveen Kundurthy, John Kinsky, Jimmy Wei, Louie Tsai, Tom Lenth, Jeff Reilly, Keenan Connolly, and Katia Gondarenko for their review and approval help.
