An Easy Introduction to Scikit-learn*: A Comprehensive Guide to the Library and Intel Extension

What is Scikit-learn?

Scikit-learn* (also commonly referred to as sklearn) is a popular Python* module for machine learning built on top of the NumPy, SciPy, and matplotlib libraries. It is a simple and efficient tool for data mining and data analysis and features various classification, regression, and clustering algorithms including support vector machines, random forests, gradient boosting, and k-means.

Scikit-learn offers a high degree of flexibility in fine-tuning models and supports both supervised and unsupervised machine learning. It is one of the most widely used classical ML frameworks because of its user-friendly interface and broad selection of algorithms. The scikit-learn project has continued to evolve over the years with several new releases and features.

Intel provides Intel® Extension for Scikit-learn to speed up scikit-learn workflows and applications on Intel® architectures across single-node and multi-node configurations.

This blog introduces you to the Intel extension, provides a step-by-step code walkthrough, and highlights the performance benefits of using it.

Intel® Extension for Scikit-learn*

Intel’s extension seamlessly scales up your scikit-learn applications by dynamically patching scikit-learn estimators, improving performance for a range of ML algorithms. It is available both standalone and as part of AI Tools, so you can use it alongside your existing AI packages.

Features

  • Scale up your scikit-learn algorithms by replacing current estimators with mathematically-equivalent accelerated versions.
  • Execute on your choice of Intel® CPU or Intel® GPU because the accelerations are powered by Intel® oneAPI Data Analytics Library.
  • Select how to apply the accelerations:
    • Patch all compatible algorithms from the command line with no code changes.
    • Add two lines of code to patch all compatible algorithms in your Python script.
    • Specify in your script to patch only selected algorithms.
    • Globally patch and unpatch your environment for all uses of scikit-learn.

Getting Started

Installation

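The extension is published as the scikit-learn-intelex package and can be installed with pip (pip install scikit-learn-intelex) or with conda (conda install scikit-learn-intelex -c conda-forge). It is also included in AI Tools.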

Check out this page to learn about system requirements and supported configurations.

Patching

Patching scikit-learn with Intel Extension for Scikit-learn means replacing the stock estimators in a scikit-learn workflow with the enhanced versions provided by the extension. In simple terms, patching enables the extension optimizations. If the requested algorithm parameters are not supported by the extension, it falls back to the original scikit-learn implementation and returns that result.

There are different ways to patch scikit-learn with the extension:

  • Without editing the code of a scikit-learn application, by running it through the sklearnex module from the command line:
    python -m sklearnex my_application.py
  • Directly from the script:
    from sklearnex import patch_sklearn
    patch_sklearn()
  • By importing the desired estimator from the sklearnex module in your script:
    from sklearnex.neighbors import NearestNeighbors
  • Through global patching, which enables patching for your scikit-learn installation for all further runs:
    python -m sklearnex.glob patch_sklearn

Note: Remember to import scikit-learn after these lines. Otherwise, the patching will not affect the original scikit-learn estimators.
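As a minimal sketch of the script-level options above, the following patches only two estimators by name and imports them afterwards. The choice of SVC and KMeans and the try/except fallback to stock scikit-learn are assumptions made for illustration, so that the snippet still runs when the extension is not installed.

```python
# Sketch: patch selected estimators first, then import scikit-learn.
try:
    from sklearnex import patch_sklearn

    # Patch only the named estimators. Calling patch_sklearn() with no
    # arguments patches every supported algorithm instead.
    patch_sklearn(["SVC", "KMeans"])
    extension_available = True
except ImportError:
    # Assumption for this sketch: continue with stock scikit-learn.
    extension_available = False

# Import the estimators only after the patching step above.
from sklearn.cluster import KMeans
from sklearn.svm import SVC

# The defining module indicates which implementation is bound to each name.
print("extension available:", extension_available)
print("SVC comes from:", SVC.__module__)
print("KMeans comes from:", KMeans.__module__)
```

If the extension is installed, the printed module paths should point at the extension rather than at sklearn; the exact module names depend on the installed version.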

You can undo the patch at any time. Unpatching restores the stock scikit-learn implementations in place of the patched algorithms, and it requires the scikit-learn estimators to be re-imported:

from sklearnex import unpatch_sklearn
unpatch_sklearn()
# Re-import scikit-learn algorithms after the unpatch:
from sklearn.cluster import KMeans
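The patch and unpatch round trip can be checked by looking at the module that defines the estimator before and after. This is only a sketch: the no-op stand-in functions are an assumption added so the example also runs without the extension installed.

```python
# Sketch: patch, inspect, unpatch, re-import, inspect again.
try:
    from sklearnex import patch_sklearn, unpatch_sklearn
except ImportError:
    # Assumption for this sketch: no-op stand-ins when the extension is absent.
    def patch_sklearn():
        pass

    def unpatch_sklearn():
        pass

patch_sklearn()
from sklearn.cluster import KMeans
patched_module = KMeans.__module__

unpatch_sklearn()
# Re-import so the name is bound to the stock estimator again.
from sklearn.cluster import KMeans
stock_module = KMeans.__module__

print("while patched:", patched_module)
print("after unpatch:", stock_module)
```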

Visit this link to learn which scikit-learn algorithms are impacted by applying Intel Extension for Scikit-learn.

Code Sample

This sample code shows how to use the support vector machine (SVM) classifier from Intel Extension for Scikit-learn on a digit recognition problem. It also shows how to train a model and save it to a file. The other machine learning algorithms available in scikit-learn can be used in a similar way.

The Intel extension depends on daal4py, a simplified Python API to Intel oneAPI Data Analytics Library that gives data scientists and machine learning users fast access to the framework.

In this code example, we recognize handwritten digits using the SVM classification algorithm. The handwritten digits dataset is one of the scikit-learn toy datasets. It contains 1797 input images, each represented by 64 pixel features (an 8x8 matrix). The output has 10 classes, one for each digit (0-9).
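The dataset dimensions described above can be confirmed directly with scikit-learn; this short check uses only load_digits and NumPy.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# 1797 samples, each flattened to 64 pixel features.
print(digits.data.shape)         # (1797, 64)
# The same samples kept as 8x8 images.
print(digits.images.shape)       # (1797, 8, 8)
# Ten target classes, one per digit.
print(np.unique(digits.target))  # [0 1 2 3 4 5 6 7 8 9]
```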

The following steps are implemented in the code sample:

  1. Import all necessary packages. Here, Intel Extension for Scikit-learn dynamically patches scikit-learn estimators to use Intel oneAPI Data Analytics Library as the underlying solver.
  2. Load the dataset and split it into training and test sets.
  3. Train the model and save the trained model to a file.
  4. Predict the digit for test images using the trained model.
  5. Get the accuracy of the trained model on test data and export the results to a CSV file.
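The five steps above can be sketched roughly as follows. This is not the original sample: the file names, the temporary output directory, the 80/20 split, the gamma value, and the fallback to stock scikit-learn when the extension is missing are all assumptions chosen for illustration.

```python
import csv
import tempfile
from pathlib import Path

import joblib

# Step 1: patch scikit-learn before importing its estimators.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()
except ImportError:
    pass  # Assumption: fall back to stock scikit-learn if the extension is absent.

from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 2: load the digits dataset and split it into training and test sets.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0
)

# Step 3: train the classifier and save it to a file (hypothetical file name).
out_dir = Path(tempfile.mkdtemp())
model_path = out_dir / "digits_svc.joblib"
clf = svm.SVC(gamma=0.001)
clf.fit(X_train, y_train)
joblib.dump(clf, model_path)

# Step 4: reload the saved model and predict the test images.
loaded_clf = joblib.load(model_path)
y_pred = loaded_clf.predict(X_test)

# Step 5: compute accuracy and export true and predicted labels to CSV.
accuracy = accuracy_score(y_test, y_pred)
csv_path = out_dir / "predictions.csv"
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["true_label", "predicted_label"])
    writer.writerows(zip(y_test.tolist(), y_pred.tolist()))

print(f"accuracy: {accuracy:.3f}")
print("model saved to:", model_path)
print("results saved to:", csv_path)
```

Saving with joblib and reloading before prediction keeps training and inference separable, which is one common way to reuse a trained model in a later run.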

Try out the code sample on the Intel® Developer Cloud and Jupyter Notebook.

Performance Benefits

Intel Extension for Scikit-learn and stock scikit-learn were compared on ML training and inference workloads. The results show that orders-of-magnitude acceleration can be achieved with the Intel extension (Figures 1 and 2).

Figure 1. Speedup with Intel Extension for Scikit-learn over the original package for FP32 (floating point 32) Workloads

Figure 2. Speedup with Intel Extension for Scikit-learn over the original package for FP64 (floating point 64) Workloads

In Figure 1, execution times with patched scikit-learn improve by up to 636X for training and up to 2286X for inference.

In Figure 2, execution times with patched scikit-learn improve by up to 243X for training and up to 517X for inference.
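A speedup figure of this kind is the ratio of the stock execution time to the accelerated execution time. The sketch below shows one way to measure that ratio on your own machine. It is not the benchmark behind Figures 1 and 2: the helper name time_fit_predict, the use of the small digits dataset, and the SVC estimator are assumptions, and the numbers it prints will not match the figures.

```python
import time

from sklearn.datasets import load_digits
from sklearn.svm import SVC as StockSVC

try:
    # Importing from sklearnex gives the accelerated estimator directly,
    # without patching scikit-learn globally.
    from sklearnex.svm import SVC as AcceleratedSVC
except ImportError:
    AcceleratedSVC = None  # Assumption: skip the comparison if not installed.


def time_fit_predict(estimator, X, y):
    """Return (fit_seconds, predict_seconds) for one fit and one predict call."""
    start = time.perf_counter()
    estimator.fit(X, y)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    estimator.predict(X)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds


X, y = load_digits(return_X_y=True)

stock_fit, stock_predict = time_fit_predict(StockSVC(), X, y)
print(f"stock: fit {stock_fit:.4f}s, predict {stock_predict:.4f}s")

if AcceleratedSVC is not None:
    fast_fit, fast_predict = time_fit_predict(AcceleratedSVC(), X, y)
    # Speedup = stock time / accelerated time.
    print(f"training speedup: {stock_fit / fast_fit:.1f}X")
    print(f"inference speedup: {stock_predict / fast_predict:.1f}X")
```

A single timed run on a small dataset is noisy; repeating the measurement and using larger workloads gives a more representative ratio.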

What’s Next?

Scikit-learn is a simple and versatile tool for your classical machine learning workflows. The optimizations in the Intel Extension for Scikit-learn can deliver orders of magnitude of drop-in performance acceleration for your applications and reduce the demand for compute power of AI workloads.

Learn about features and download the latest and previous releases of Intel Extension for Scikit-learn on GitHub, and feel free to contribute to the project. We also encourage you to check out Intel’s other AI/ML framework optimizations and end-to-end portfolio of tools and incorporate them into your AI workflow. You can also learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio and helps you prepare, build, deploy, and scale your AI solutions.

Useful Resources