Python* Data Science at Scale: Speed Up Your End-to-End Workflow

Overview

Data scientists and AI developers need the ability to explore and experiment with extremely large datasets as they converge on novel solutions for deployment in production applications. Exploration and experimentation mean a lot of iteration, which is only feasible with fast turnaround times. Model training performance is an important part of this, but the entire end-to-end process must be addressed. Loading, exploring, cleaning, and adding features to large datasets can be so time-consuming that it limits exploration and experimentation. Responsiveness during inference is also often crucial once a model is deployed.

Many of the solutions for large-scale AI development require installing new packages and rewriting code to use their APIs. For instance, data scientists and AI developers often use pandas to load data for machine learning applications. But once the dataset reaches about 100 MB or larger, loading and cleaning the data slows down considerably because pandas runs on a single core.

As a result, developers must change their workflow to use different data loading and preprocessing tools, such as Apache Spark*, which requires data scientists to learn the Spark API and overhaul their code to integrate it. This is usually an inopportune time to make such changes, and it is not a good use of data scientists' and AI developers' skills.

Intel has been working to improve the performance of popular Python* libraries while maintaining the usability of Python, by implementing the key underlying algorithms in native code using oneAPI performance libraries. This delivers concurrency at multiple levels, such as vectorization, multithreading, and multiprocessing, with minimal impact on existing code. For example:

Intel® Distribution of Modin* scales pandas DataFrames to multiple cores with a single line of code change.
Intel® Optimization for PyTorch* and Intel® Optimization for TensorFlow* accelerate deep learning training and inference.
Intel® Extension for Scikit-learn* and XGBoost optimized for Intel architecture speed up machine learning algorithms with no code changes.

In this session, see how to accelerate your end-to-end workflow with these technologies via a demonstration using the full New York City taxi fare dataset.

Presenters

Rachel Oberman, technical consulting engineer, Intel
Todd Tomashek, machine learning engineer, Intel
Albert DeFusco, principal data scientist, Anaconda*

Featured Software

Get these Intel-optimized versions of your Python libraries as part of the AI Tools, or download them as stand-alone components:

AI Tools

Accelerate data science and AI pipelines, from preprocessing through machine learning, and provide interoperability for efficient model development.

Intel® Distribution of Modin

Scale your pandas workflows to distributed DataFrame processing by changing only a single line of code.
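
As a rough illustration of the single-line change, the sketch below swaps the pandas import for `modin.pandas` and leaves the rest of the code untouched. The column names (`fare_amount`, `trip_distance`) and the tiny in-memory table are hypothetical stand-ins for the taxi fare data, and the fallback to plain pandas is only there so the snippet still runs where Modin is not installed.

```python
# The one-line change: import modin.pandas instead of pandas.
# The fallback keeps this sketch runnable without Modin installed.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd

# Hypothetical miniature stand-in for a taxi fare table.
trips = pd.DataFrame(
    {
        "fare_amount": [7.5, 12.0, 30.0],
        "trip_distance": [1.5, 3.0, 10.0],
    }
)

# Ordinary pandas-style cleaning and feature engineering, unchanged.
trips = trips[trips["trip_distance"] > 0]
trips = trips.assign(fare_per_mile=trips["fare_amount"] / trips["trip_distance"])

mean_fare_per_mile = float(trips["fare_per_mile"].mean())
print(mean_fare_per_mile)
```

On a table this small there is nothing to gain from parallelism; any benefit would be expected only on datasets large enough to keep multiple cores busy.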

Intel® Extension for Scikit-learn*

Speed up and scale your scikit-learn* workflows for CPUs and GPUs across single- and multi-node configurations with this Python* module for machine learning.
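
One common way to enable the extension from inside a script is to call `patch_sklearn()` from the `sklearnex` package before importing any scikit-learn estimators; the estimator code itself stays as it was. The following is a minimal sketch under that assumption, using made-up toy data, with a fallback to stock scikit-learn when the extension is not installed.

```python
# Enable the extension if it is available; otherwise use stock scikit-learn.
try:
    from sklearnex import patch_sklearn

    patch_sklearn()  # must run before scikit-learn estimators are imported
except ImportError:
    pass

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): y = 2 * x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Standard scikit-learn usage, unchanged by the patch.
model = LinearRegression().fit(X, y)
prediction = float(model.predict(np.array([[5.0]]))[0])
print(prediction)
```

The model-building lines are identical with or without the patch, which is the sense in which existing scikit-learn code does not need to be rewritten.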
