As the number of use cases for processing large data sets grows within AI and data analytics, so does the complexity of the hardware and architecture requirements. Depending on the task, you may need a mix of architectures (scalar, vector, matrix, and spatial) running on CPUs, GPUs, AI accelerators, or other accelerators. Matching each workload to the right hardware is vital for speeding up big data processing and reducing cost and time to completion.
The challenge is writing code that supports each of these architectures. Code that runs on CPUs will require different tools than code that runs on GPUs or other accelerators. As a project scales from single-node to distributed processing, new tools and APIs are often required. This could mean extra development time on top of code duplication to port code to different architectures.
To solve this, oneAPI provides a programming model that enables architecture-agnostic code. This allows developers to reuse code that can be executed, for example, on both a CPU and a GPU using SYCL*, the language that enables mappings between different architectures.
Intel® oneAPI Data Analytics Library (oneDAL) is a library with all the building blocks required to create distributed-data pipelines to transform, process, and model data, complete with all the architectural flexibility of oneAPI. This can be achieved using Intel® Distribution for Python*, C++, or Java APIs that can connect to familiar data sources such as Spark* and Hadoop*.
What You Can Do with oneDAL
oneDAL includes optimized versions of popular machine learning and analytics algorithms to help build fast data applications that can run on different architectures. These functions span all analytics stages including data management, analysis, training, and prediction.
Data Management
Data management is a vital step in any big data processing platform. It encompasses ingestion of data from different sources and processing that data so it’s ready for downstream use. Without this, analysts and data scientists would have to fetch and process the data themselves, which is fine for smaller data sets but becomes slower and more difficult with large volumes of data in different formats.
oneDAL has tools for transferring out-of-memory data sources, such as databases and text files, into memory for use in analysis, training, or prediction stages. And if the data source cannot fit into memory, the algorithms in oneDAL also support streaming data into memory.
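To make the streaming idea concrete, here is a minimal plain-Python sketch of processing a source chunk by chunk so that the whole data set never has to sit in memory. It is not the oneDAL API: the helper names (`read_chunks`, `streaming_mean_variance`), the `amount` column, and the sample data are assumptions made for illustration, and the running statistics use Welford's online update.

```python
import csv
import io
from itertools import islice


def read_chunks(file_obj, column, chunk_size):
    """Yield lists of floats from one CSV column, chunk_size rows at a time."""
    reader = csv.DictReader(file_obj)
    while True:
        chunk = [float(row[column]) for row in islice(reader, chunk_size)]
        if not chunk:
            return
        yield chunk


def streaming_mean_variance(chunks):
    """Welford's online update: keeps only a count, a mean, and a sum of squares."""
    count, mean, m2 = 0, 0.0, 0.0
    for chunk in chunks:
        for x in chunk:
            count += 1
            delta = x - mean
            mean += delta / count
            m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0  # sample variance
    return count, mean, variance


# Hypothetical sales data; a real source would be a large file or a database cursor.
SALES_CSV = "amount\n10.0\n12.0\n14.0\n16.0\n18.0\n"

count, mean, variance = streaming_mean_variance(
    read_chunks(io.StringIO(SALES_CSV), column="amount", chunk_size=2)
)
print(count, mean, variance)
```

Only one chunk is held in memory at a time, so the same loop works whether the source has five rows or five billion.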
Data scientists often spend large amounts of time preparing the data for analysis or machine learning (ML). This includes converting data to numeric representation, adding or removing data, normalizing it, or computing statistics. oneDAL offers algorithms that accelerate these preparation tasks, speeding the turnaround of steps that are often performed interactively.
For example, assume you’re building an ML model using sales data from a database. First, you would use the oneDAL data management tools to stream the data source into memory to increase processing performance. Then you could use some of the data preparation tools to convert non-numeric data, such as sales product categories, into a numerical format more suitable for the model. You could also calculate some basic statistics like the mean, variance, and covariance on some of the sales values, as they are useful for feeding into ML models.
The final oneDAL data management step then enables you to stream this data into the ML algorithm of your choice.
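The preparation steps in this example can be sketched in plain Python. This is an illustrative sketch only, not oneDAL code: the records, the `encode_categories` and `covariance` helpers, and the column meanings are all assumptions chosen to show what "convert categories to numbers, then compute basic statistics" looks like.

```python
from statistics import mean, variance

# Hypothetical sales records: (product category, units sold, revenue).
records = [
    ("toys", 2, 20.0),
    ("books", 4, 36.0),
    ("toys", 6, 64.0),
    ("games", 8, 80.0),
]


def encode_categories(values):
    """Map each distinct label to an integer code, in order of first appearance."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return encoded, codes


def covariance(xs, ys):
    """Sample covariance (n - 1 denominator) of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


categories, units, revenue = zip(*records)
encoded, codes = encode_categories(categories)

units_mean = mean(units)
units_variance = variance(units)            # sample variance
units_revenue_cov = covariance(units, revenue)

print(encoded, codes)
print(units_mean, units_variance, units_revenue_cov)
```

Integer codes imply an ordering between categories, so depending on the model a one-hot encoding may be the better choice; the integer mapping is used here only because it is the shortest way to show the conversion.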
Analysis
One of the main purposes of big data processes is to support the analysis of large amounts of data from various sources. Data analysis is the key step that turns data into information that facilitates making key business decisions.
As noted in the previous section, the oneDAL data management tools stream pre-prepared data into memory to be accessed by downstream analysis or machine learning algorithms. There are many different analysis algorithms available, such as k-means clustering and outlier detection, which you can implement in C++, Java, or Python*. Using Intel® oneAPI DPC++/C++ Compiler with either C++ or Python allows data to be processed in parallel.
For example, if you wanted to segment users based on a number of attributes such as age, gender, and average spending, then you could use the k-means clustering analysis algorithm. By using the implementation in oneDAL, especially using C++ or Python with the Intel compiler, you get parallel data processing out of the box. This is useful for running analysis algorithms on large data sets as parallel processing greatly improves execution time.
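For readers unfamiliar with k-means, the following is a small plain-Python sketch of the standard Lloyd's iteration applied to a user-segmentation problem. It is a serial teaching version, not oneDAL's optimized or parallel implementation, and the user data (age, average spending) is hypothetical.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]  # keep the old centroid if a cluster empties
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments have stabilized
            break
        centroids = new_centroids
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]
    return centroids, labels


# Hypothetical users described by (age, average spending).
users = [(22, 30.0), (25, 35.0), (24, 28.0), (58, 210.0), (61, 190.0), (55, 205.0)]

centroids, labels = kmeans(users, k=2)
print(centroids, labels)
```

The assignment step is independent for every point, which is why the algorithm parallelizes well on large data sets. In practice you would also scale the features first so that a large-valued attribute such as spending does not dominate the distance.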
Machine Learning Model Training
Once you have a dataset ready to use for your machine learning algorithm, training is often the most compute-intensive step as it requires all the relevant data to be processed upfront in order for a predictive model to be produced.
There are two main complexities associated with ML model training: managing the input data and training time. For the former, you’ve already seen how oneDAL can help manage and prepare data so it’s ready to be used for training ML models; having clean data is vital to improving the accuracy of any model.
oneDAL can also help overcome the challenge of long training times. Similar to the analysis algorithms, oneDAL provides machine learning training algorithms with out-of-the-box support for parallel processing and hardware acceleration through oneAPI. For instance, you can download Intel® Extension for Scikit-learn* to accelerate the algorithms in the Python scikit-learn library.
Let’s say you have already built a support vector classifier in Python using scikit-learn to identify anomalies in medical images. It currently takes hours to train, and you can’t improve the model accuracy without adding more data and increasing training time.
By simply switching to the Intel-optimized scikit-learn library with little to no other code changes, your training runs could be 25 to 77 times faster based on Intel’s benchmarks. This will reduce training time, and cost if you are using pay-as-you-go cloud resources. More importantly, it creates the possibility to further improve the model by using more data or by tuning the model, as you now have more headroom in training time.
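The switch described here can look like the sketch below, assuming scikit-learn is installed and, optionally, the `scikit-learn-intelex` package that provides the `sklearnex` module. The synthetic data set stands in for image-derived features and is an assumption, as are the hyperparameters; no speedup is measured or claimed by this snippet.

```python
# Optional acceleration: patch scikit-learn before importing its estimators.
try:
    from sklearnex import patch_sklearn  # Intel Extension for Scikit-learn

    patch_sklearn()
except ImportError:
    pass  # Extension not installed; stock scikit-learn is used unchanged.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for features extracted from images (not real medical data).
X, y = make_classification(
    n_samples=400, n_features=20, class_sep=2.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The estimator code is the same whether or not the patch was applied.
model = SVC(kernel="rbf", C=1.0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

The patch has to run before the `sklearn` imports so that the accelerated estimators are the ones picked up; everything after the `try` block is ordinary scikit-learn code.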
Machine Learning Model Prediction
The final stage of a machine learning pipeline is to use the model generated in training to make predictions. These predictions are used to make decisions, such as which TV shows to recommend to users or which segment a user belongs to. Naturally, the training and prediction stages are closely coupled. However, oneDAL separates them because training is more computationally intensive; it requires more resources than the prediction stage. It also allows for pre-trained models to be used within oneDAL.
The main benefit of using oneDAL for prediction is that a trained model can be deployed to a variety of architectures, as oneDAL takes care of abstracting the different architectural requirements. This enables code re-use for applications performing predictions and opens up the use of these predictions to many different systems.
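The separation between training and prediction can be illustrated with a small plain-Python sketch. It uses a nearest-centroid classifier and JSON serialization purely as stand-ins; neither is oneDAL's model format or API, and the segment labels and data are hypothetical.

```python
import json
import math


def train(samples, labels):
    """Training stage: compute one centroid per class label."""
    sums, counts = {}, {}
    for features, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, features):
    """Prediction stage: return the label of the nearest stored centroid."""
    return min(
        model, key=lambda label: math.dist(tuple(features), tuple(model[label]))
    )


# Hypothetical training data: (age, average spending) per user segment.
samples = [(22, 30.0), (25, 35.0), (58, 210.0), (61, 190.0)]
labels = ["budget", "budget", "premium", "premium"]

# Train once on the machine with the most compute, then persist the model.
serialized = json.dumps(train(samples, labels))

# Later, possibly on a different system: load the model and predict without retraining.
loaded_model = json.loads(serialized)
prediction = predict(loaded_model, (57, 200.0))
print(prediction)
```

Because the prediction step needs only the stored model and a cheap distance computation, it can run on far more modest hardware than the training step.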
Conclusion
The requirements for AI and machine learning applications vary widely, driving the need to develop on and deploy to a wide variety of architectures. The growth in heterogeneous compute systems provides more opportunities to accelerate processing of these algorithms. One benefit of acceleration is the ability to process big data. oneDAL includes machine learning algorithms optimized for a variety of architectures, but with the same API, meaning you can use the same application code for whatever type of system your project requires.
Furthermore, oneDAL accelerates many of your familiar tools and libraries, enabling you to reuse your existing code. Plus, oneDAL comes with all the benefits of oneAPI: the underlying architecture is abstracted away from the developer, allowing for faster development and testing of applications across different systems.