Intel® VTune™ Profiler Case Study with Intel® Distribution for Python for Kernel Density Estimation from scikit-learn*

ID 769194
Updated 12/8/2020
Version Latest




  • Kernel Density Estimation (KDE) from scikit-learn* can be used to measure the difference between data sets (e.g., 5G/LTE cell behavior over time) and to detect anomalies. 
  • KDE running on Normal Python* shows a high Front-End Bound. 
  • Scikit-learn* KDE uses the KD-tree algorithm, which introduces high L3 latency. 
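As a rough illustration of the anomaly-detection use mentioned above, the following sketch fits a scikit-learn KDE with the KD-tree backend on reference data and flags new points whose estimated log-density is unusually low. The synthetic data, the bandwidth, and the 1% threshold are assumptions made for this example; they are not taken from the case study.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Hypothetical reference data standing in for "normal" behavior (one feature).
reference = rng.normal(loc=0.0, scale=1.0, size=(2000, 1))

# KDE with the KD-tree backend, the algorithm discussed in this case study.
kde = KernelDensity(kernel="gaussian", bandwidth=0.3, algorithm="kd_tree")
kde.fit(reference)

# score_samples returns the log-density of each point under the fitted model.
new_points = np.array([[0.1], [8.0]])
log_density = kde.score_samples(new_points)

# Illustrative rule: anything below the 1st percentile of the reference
# log-densities is treated as an anomaly candidate.
threshold = np.quantile(kde.score_samples(reference), 0.01)
is_anomaly = log_density < threshold

print(dict(zip(new_points.ravel().tolist(), is_anomaly.tolist())))
```

A point near the bulk of the reference data receives a high log-density, while a point far outside it falls below the threshold.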


  • Adopt Intel® Distribution for Python* from Intel® oneAPI Base Toolkit. 
  • Sort the input data before running KDE. 
  • Measure the performance data with Intel® VTune™ Profiler.
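The sorting step can be sketched as follows, assuming one-dimensional input (the case study does not state the data shape). The sample size, bandwidth, and evaluation grid are hypothetical and kept small so the example runs quickly; the study itself reports results at much larger sizes.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(42)
data = rng.normal(size=2_000)  # hypothetical 1-D samples

# Sort the input with NumPy's quicksort before handing it to the KDE.
sorted_data = np.sort(data, kind="quicksort")

# Fixed grid on which both models are evaluated.
grid = np.linspace(-3.0, 3.0, 50).reshape(-1, 1)

def fit_and_score(samples, points):
    """Fit a KD-tree based KDE on 1-D samples and return log-densities."""
    X = samples.reshape(-1, 1)
    kde = KernelDensity(kernel="gaussian", bandwidth=0.2, algorithm="kd_tree")
    kde.fit(X)
    return kde.score_samples(points)

scores_unsorted = fit_and_score(data, grid)
scores_sorted = fit_and_score(sorted_data, grid)

# Sorting reorders the samples only, so the estimated density is unchanged.
print(np.allclose(scores_unsorted, scores_sorted))
```

Sorting changes only the order in which samples are laid out in memory, not the estimate, which is why it can be applied purely as a performance measure.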


  • Intel® Xeon® Platinum 8280 CPU @ 2.7 GHz with 28 cores per socket 
  • 112 logical cores (hyperthreading) 
  • 128 GB DRAM 
  • Ubuntu* 18.04.3 LTS, 64-bit 
  • Normal Python*: conda create -n no_mkl_python nomkl python=3.6 scipy numpy pandas numexpr scikit-learn=0.21.3 
  • Intel® Distribution for Python*: conda create -n intel_python -c intel python=3.6 scipy numpy pandas numexpr scikit-learn=0.21.3 

Performance Results

Normal Python* vs. Intel® Distribution for Python*

  • With 500,000 input data points, Intel® Distribution for Python* is 38% faster. 

  • Intel® Distribution for Python* has improved instruction decoding, which improves the Front-End Bound metrics. 

  • Judging by the metrics that show the most improvement, DSB (Decoded Stream Buffer) caching appears to be improved, likely due to better control flow. 

Figure 1. Normal Python* vs. Intel® Distribution for Python* performance comparison summary


Intel® Distribution for Python* without sorting - VTune™ Profiler Summary

  • The L3 latency metric is at 100% under Back-End Bound, so the L3 cache provides no benefit in this case. 

  • The Memory Bandwidth and Memory Latency metrics also show noticeable values. 

Figure 2. Intel® Distribution for Python* without sorting performance summary


Intel® Distribution for Python* with sorting - VTune™ Profiler Summary

  • L3 latency decreases to 0. 

  • Memory Bound drops from 38.6% to 1.6%. 

  • CPI Rate also decreases. 

  • DRAM bandwidth drops from 59.1% to 0.4%. 

  • DRAM memory latency drops from 18.8% to 1.0%. 

Figure 3. Intel® Distribution for Python* with sorting performance summary


Intel® Distribution for Python* without sorting - Bottleneck

  • The recursive KD_Tree function shows a CPU time of 277.3. 

  • L3 latency is at 100%. 

Figure 4. Intel® Distribution for Python* without sorting bottom-up 


Intel® Distribution for Python* with sorting - Solving the bottleneck

  • The recursive KD_Tree function shows a CPU time of 112.3. 

  • L3 latency is at 0%. 

  • L3 caching is clearly effective in this case, giving more than a 2x speedup in this function. 

Figure 5. Intel® Distribution for Python* with sorting bottom-up
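The "more than 2x" figure follows directly from the two CPU times reported for the recursive KD_Tree function. A small worked calculation, using only the numbers quoted above (the variable names are illustrative):

```python
# CPU times reported for the recursive KD_Tree function, in the profile's units.
cpu_time_unsorted = 277.3
cpu_time_sorted = 112.3

# Speedup is the ratio of the old time to the new time.
speedup = cpu_time_unsorted / cpu_time_sorted

# Fraction of the original CPU time that sorting removed.
reduction = 1 - cpu_time_sorted / cpu_time_unsorted

print(f"speedup: {speedup:.2f}x, CPU time reduction: {reduction:.1%}")
```

This gives a speedup of roughly 2.47x, or about 59.5% less CPU time in that function.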


Kernel Density Estimation results

  • Estimated time for one KDE run in four different cases. 

  • Intel® Distribution for Python* with sorted input shows the best performance. 

Figure 6. KDE performance comparison


NumPy QuickSort time results

  • Estimated time for the sort. 

  • Normal Python* shows stable sort overhead, and the sort overhead is significantly smaller than the reduction in KDE calculation time.

Figure 7. NumPy quicksort comparison
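One way to compare sort overhead against KDE time on your own system is a simple wall-clock measurement like the sketch below. The helper `timed`, the data size, and the bandwidth are assumptions made for illustration; this is not the measurement harness used in the case study, and absolute timings will vary by machine and Python distribution.

```python
import time

import numpy as np
from sklearn.neighbors import KernelDensity

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def run_kde(samples):
    """Fit a KD-tree based KDE on 1-D samples and score the same samples."""
    X = samples.reshape(-1, 1)
    kde = KernelDensity(kernel="gaussian", bandwidth=0.2, algorithm="kd_tree")
    kde.fit(X)
    return kde.score_samples(X)

rng = np.random.default_rng(7)
data = rng.normal(size=5_000)  # hypothetical size, kept small for a quick run

# Time the NumPy quicksort and the KDE on the sorted input separately.
sorted_data, sort_seconds = timed(np.sort, data, kind="quicksort")
_, kde_seconds = timed(run_kde, sorted_data)

print(f"sort: {sort_seconds:.6f} s, KDE on sorted input: {kde_seconds:.6f} s")
```

Repeating the measurement with `data` in place of `sorted_data`, and at larger sizes, shows whether the sort cost is recovered by a faster KDE on a given machine.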



  • Normal Python* tends to run faster with small data sizes (up to 15,000). 

  • Intel® Distribution for Python* outperforms Normal Python* once the data size grows above 100,000. 

  • The performance gap grows larger with larger data sets. 

  • Sorting starts to become beneficial for Intel® Distribution for Python* at data sizes of about 100,000 and above. 

  • The sorting time under Intel® Distribution for Python* is far smaller than the performance benefit of adopting sorting, so the tradeoff is worthwhile. 

Intel software tools can improve your solution and productivity. Download Intel® oneAPI Base Toolkit for Intel® VTune™ Profiler and Intel® Distribution for Python* today.