8 Ways to Analyze, Tune, and Maximize Your Application Performance With Intel® VTune™ Profiler

Today's developers have a wide range of hardware and software tool options when building and maintaining an application. At the software level, they can choose programming languages such as C, C++, SYCL*, C#, Java*, Python*, Fortran, Assembly, and Google Go*; the .NET framework and other software options; or a combination of these. At the hardware level, applications can be deployed on several architectures, such as CPUs, GPUs, and FPGAs.

With such diverse tools and platforms in play, factors such as memory usage, energy consumption, and latency determine how well an application performs.

For developers looking to maximize application performance, the Intel® oneAPI tools include Intel® VTune™ Profiler, a one-stop solution for finding and fixing performance bottlenecks across many combinations of software and hardware.

Not only does VTune Profiler pinpoint which code may be holding back your workload performance, it also:

  • describes the nature of the issues it identifies and recommends improvements and resolutions;
  • helps you analyze several aspects affecting application performance, be it a source code issue or an architectural bottleneck, covering both hardware- and software-level performance problems; and
  • allows you to fine-tune all portions of the application, regardless of which parts are accelerated by specialized hardware.

This blog introduces the major types of application performance analysis that you can conduct using VTune Profiler. (Note: Resources for a deeper dive into the tool are listed at the end of the blog.)

Advantages of Intel® VTune™ Profiler

In a nutshell, VTune Profiler delivers insights into how your application executes. It is a program that measures the behavior of another program, hence the name “profiler”. It runs on Windows* and Linux* hosts and can profile code executing on additional platforms such as Android*, QNX*, and FreeBSD*, across architectures such as CPUs, GPUs, and FPGAs.

It helps you with tasks such as:

  • Assess parallel performance of your program
  • Analyze hotspots
  • Detect anomalies
  • Find cache misses
  • Determine memory consumption
  • Detect performance bottlenecks in I/O-intensive applications and much more

Let’s dive into the eight crucial types of application performance analysis that you can perform. 

1. Hotspot Analysis for CPU Utilization Issues

One of the most important things VTune Profiler helps you locate and analyze is hotspots: the sections of your program that consume the most execution time.

Hotspot analysis is a starting point for understanding application flow and analyzing algorithms. VTune Profiler reports the top hotspots, CPU time and utilization (for the whole application and per hotspot function), frame rate, memory bandwidth, and the parent and child functions of a particular function along with their performance metrics.

Note:

Hotspot Analysis is available with or without the event counters of the processor’s Performance Monitoring Unit (PMU). It can therefore be used with OS timer-based sampling even if the developer does not have administrator or root permissions on the test system. Hardware event-based sampling, by contrast, requires the ability to install a collection driver on the platform under test. For most users, this requires no input and is handled automatically by the Intel VTune Profiler installer.

Source code-level analysis requires symbol information for the application binary as well as access to the source files on the file system.

The performance results available in the Hotspots viewpoint are then interpreted following a systematic sequence of steps:

  1. Define the elapsed time (start-to-termination time of the application) as a baseline to compare the application’s performance before and after its optimization
  2. Identify the top hotspots – functions taking the maximum execution time
  3. Identify issues with calling sequences of functions in an algorithm
  4. Analyze source code incorporating a hotspot function

   → Explore hotspot analysis with VTune Profiler in detail.
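
The first two interpretation steps above (take a baseline elapsed time, then rank functions by time) can be illustrated without VTune Profiler at all. The following is a minimal sketch that uses Python's built-in `cProfile` module as a stand-in profiler; the functions `slow_sum`, `fast_sum`, and `workload` are hypothetical examples, and the output is not what VTune Profiler's Hotspots viewpoint looks like.

```python
import cProfile
import pstats
import time


def slow_sum(n):
    """Hypothetical function that is deliberately expensive (the intended hotspot)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_sum(n):
    """Hypothetical function that is cheap by comparison."""
    return sum(range(n))


def workload():
    slow_sum(200_000)
    fast_sum(1_000)


def top_hotspots(func, limit=3):
    """Profile func and return (elapsed_seconds, [(function_name, self_time), ...])."""
    profiler = cProfile.Profile()
    start = time.perf_counter()
    profiler.enable()
    func()
    profiler.disable()
    elapsed = time.perf_counter() - start  # step 1: baseline elapsed time

    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers)
    rows = [(key[2], value[2]) for key, value in stats.stats.items()]
    rows.sort(key=lambda row: row[1], reverse=True)  # step 2: rank by self time
    return elapsed, rows[:limit]


if __name__ == "__main__":
    elapsed, hotspots = top_hotspots(workload)
    print(f"baseline elapsed: {elapsed:.4f}s")
    for name, self_time in hotspots:
        print(f"{name:40s} {self_time:.4f}s")
```

Re-running the same measurement after an optimization and comparing against the baseline is the same loop you would follow with VTune Profiler results.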

2.  Anomaly Detection

Apart from hotspots, there exist some short-term, non-deterministic issues in frequently recurring code sections such as loop iterations. Such anomalies can cause unrecoverable issues and hence require timely detection.

VTune Profiler allows you to identify performance anomalies using Intel® Processor Trace technology and the Instrumentation and Tracing Technology (ITT) API.

   → Dive deeper into anomaly detection analysis.

Note: Anomaly detection analysis is a preview feature which is not guaranteed to be available in future versions of VTune Profiler.
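
To make the idea of an anomaly in a recurring code section concrete, here is a minimal sketch that times each iteration of a loop and flags iterations that take far longer than the median. This is only an illustration of the concept: it is not how VTune Profiler detects anomalies (which relies on Intel Processor Trace and ITT API markers), and the function names and the threshold `factor` are assumptions made for this example.

```python
import statistics
import time


def time_iterations(step, count):
    """Run step(i) for i in range(count) and return the per-iteration durations."""
    durations = []
    for i in range(count):
        start = time.perf_counter()
        step(i)
        durations.append(time.perf_counter() - start)
    return durations


def find_anomalous_iterations(durations, factor=5.0):
    """Return indices of iterations slower than factor times the median duration."""
    if not durations:
        return []
    baseline = statistics.median(durations)
    return [i for i, d in enumerate(durations) if d > factor * baseline]


if __name__ == "__main__":
    # Hypothetical step: iteration 7 is artificially slow.
    def step(i):
        time.sleep(0.05 if i == 7 else 0.001)

    print(find_anomalous_iterations(time_iterations(step, 10)))
```

A median-based threshold is used here because a single slow iteration barely moves the median, whereas it can noticeably shift a mean.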

3.  Memory Consumption and Allocations Analysis

In addition to identifying the code sections that can cause potential performance issues, analyzing memory usage by various parts of the application can help further in performance optimization.
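
As a small, language-level illustration of attributing memory usage to parts of a program, the sketch below uses Python's standard `tracemalloc` module to list the source lines that allocated the most memory. It is a stand-in for the concept only, not VTune Profiler's Memory Consumption analysis, and `build_table` is a hypothetical workload.

```python
import tracemalloc


def build_table(rows):
    """Hypothetical workload that allocates a list of strings."""
    return [str(i) * 10 for i in range(rows)]


def top_allocation_sites(func, *args, limit=3):
    """Run func under tracemalloc; return its result and the largest allocation sites.

    Each site is a (filename, line_number, size_in_bytes) tuple.
    """
    tracemalloc.start()
    try:
        result = func(*args)  # keep the result alive so its memory is still counted
        snapshot = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    stats = snapshot.statistics("lineno")  # sorted from biggest to smallest
    sites = [
        (stat.traceback[0].filename, stat.traceback[0].lineno, stat.size)
        for stat in stats[:limit]
    ]
    return result, sites


if __name__ == "__main__":
    _, sites = top_allocation_sites(build_table, 50_000)
    for filename, lineno, size in sites:
        print(f"{filename}:{lineno} {size / 1024:.1f} KiB")
```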

VTune Profiler allows you to identify the functions that consume the most memory and flags continuously growing memory allocation, which can lead to code stability issues. You can also extract information about the memory allocation stack and the source module of each such function.

   → Memory consumption and allocations view.

4.  Triage Hardware Concerns and Memory Issues

VTune Profiler provides a top-down microarchitecture analysis so that you can focus your optimization efforts on the specific parts of the microarchitecture that cause hardware inefficiencies. Once you have analyzed the time-consuming portions of your code, the Microarchitecture Exploration Analysis lets you detect hardware-level performance issues by showing how your code passes through the core pipeline.
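
For readers unfamiliar with the top-down method, the sketch below shows the arithmetic behind its first level, which splits pipeline slots into four categories: Front-End Bound, Bad Speculation, Retiring, and Back-End Bound. It assumes the commonly published level-1 formulas for a core that issues four micro-operations per cycle; the exact events and formulas VTune Profiler uses vary by microarchitecture, and the parameter names and counter values here are hypothetical.

```python
def top_down_level1(cycles, uops_not_delivered, uops_issued, uops_retired,
                    recovery_cycles, width=4):
    """Split pipeline slots into the four top-level top-down categories.

    All arguments are raw counter totals (hypothetical names):
      cycles             - unhalted core cycles
      uops_not_delivered - slots in which the front end delivered no micro-op
      uops_issued        - micro-ops issued to the back end
      uops_retired       - micro-ops that retired (useful work)
      recovery_cycles    - cycles spent recovering from mis-speculation
      width              - pipeline slots per cycle (assumed to be 4)
    """
    slots = width * cycles
    frontend_bound = uops_not_delivered / slots
    retiring = uops_retired / slots
    bad_speculation = (uops_issued - uops_retired + width * recovery_cycles) / slots
    # Whatever is left over is attributed to the back end.
    backend_bound = 1.0 - (frontend_bound + bad_speculation + retiring)
    return {
        "frontend_bound": frontend_bound,
        "bad_speculation": bad_speculation,
        "retiring": retiring,
        "backend_bound": backend_bound,
    }


if __name__ == "__main__":
    # Hypothetical counter values, for illustration only.
    breakdown = top_down_level1(
        cycles=1000,
        uops_not_delivered=800,
        uops_issued=2200,
        uops_retired=2000,
        recovery_cycles=50,
    )
    for category, fraction in breakdown.items():
        print(f"{category:16s} {fraction:.1%}")
```

A large Back-End Bound share is the usual cue to look at memory-related metrics next, while a large Retiring share suggests the pipeline is mostly doing useful work.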

VTune Profiler also helps you analyze memory-related bottlenecks such as cache misses, high memory bandwidth usage, and non-uniform memory access (NUMA) issues through metrics that report memory loads and stores, last-level cache (LLC) misses, memory requests to remote DRAM, and much more. (See the complete list here.)

   → Learn more about how you can analyze hardware problems, cache misses, and bandwidth issues using VTune Profiler.

5.  Analyze Threading Efficiency and HPC Performance for Parallel Computations

Parallelism is at the heart of compute-intensive applications, and its efficiency depends largely on how effectively the application utilizes hardware resources. VTune Profiler helps you estimate that efficiency through threading and HPC performance analyses.

Threading analysis lets you identify the root causes of poor utilization of the available processor cores such as inefficient use of threading runtimes and thread contention issues during synchronization. In doing so it assists in improving workload balancing across the processor cores and hyper-threads.
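
Thread contention during synchronization is easy to reproduce in miniature. The following sketch starts several threads that compete for one lock and records how long each thread waited before acquiring it. It only illustrates what a wait-time metric measures; it is not VTune Profiler's Threading analysis, and the function name and default values are assumptions for this example.

```python
import threading
import time


def measure_lock_wait(thread_count=4, hold_seconds=0.01):
    """Return the time each thread spent waiting to acquire a shared lock."""
    lock = threading.Lock()
    waits = [0.0] * thread_count

    def worker(index):
        start = time.perf_counter()
        with lock:
            # Time between requesting the lock and getting it is pure wait time.
            waits[index] = time.perf_counter() - start
            time.sleep(hold_seconds)  # stand-in for work inside the critical section

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return waits


if __name__ == "__main__":
    for index, wait in enumerate(measure_lock_wait()):
        print(f"thread {index}: waited {wait * 1000:.1f} ms")
```

Because every thread serializes on the same lock, wait time grows with the number of threads ahead in the queue, which is the kind of pattern that points to an oversized critical section.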

It also highlights opportunities for better management of execution core affinity for threads and tasks. You can analyze aspects such as application thread count, thread wait time on I/O or synchronization objects, spin time, and overhead time to determine how effectively the cores are utilized. This can be very useful for validating the efficiency of a threading design proposed by a tool such as Intel® Advisor.

   → Threading analysis with Intel VTune Profiler.

For compute-intensive and data throughput-focused applications, VTune Profiler allows you to estimate how effectively the CPU, memory, and floating-point hardware resources are being used. The VTune Profiler data collector gathers the performance data of a specific application for you, and it can also collect limited performance details for the entire system if required.

   → Get detailed information on threading analysis and HPC performance characterization analysis.

6.  I/O Performance Analysis

VTune Profiler lets you analyze hardware and software bottlenecks in I/O-intensive applications. The analysis is organized into the following main categories of performance metrics:

  • The platform-level metrics category covers hardware event-based metrics for platform I/O, Intel® Ultra Path Interconnect (UPI), persistent memory (PMEM), and dynamic RAM (DRAM).
  • Within that category, platform I/O includes Intel® Data Direct I/O Technology (Intel® DDIO) utilization efficiency and memory-mapped I/O traffic, covering data flow between the CPU and PCIe devices, memory-mapped devices, and integrated accelerators.
  • The OS- and API-level metrics category allows you to analyze Linux kernel I/O, Storage Performance Development Kit (SPDK), and Data Plane Development Kit (DPDK) applications.

Note: The complete range of I/O analysis metrics is available on Intel® Xeon® processors only.

   → Learn more about analyzing platform performance, DPDK applications, and SPDK applications to improve I/O performance.

7.  GPU Offload Analysis

If your application spreads diverse workloads across CPU and GPU cores, GPU offload analysis with VTune Profiler allows you to analyze all the workloads within a unified time domain. It:

  • Shows how effectively your application uses Intel® Media SDK, SYCL*, and OpenCL™.
  • Helps you examine the code execution on both CPU and GPU and determine whether the application is CPU or GPU bound.
  • Estimates GPU usage time for your application, energy consumed by the GPU, and data transfer efficiency between the host CPU and the GPU device.
  • Lets you analyze the software queued up for GPU engines at any specific point in time.

   → Learn about GPU offload analysis with VTune Profiler.

8.  Python Performance Analysis

With VTune Profiler, you can optimize your Python* applications by identifying time-consuming code sequences and critical call paths. It lets you know whether your application uses the CPU resources effectively. For instance, you can check if more time is spent on native extension execution instead of glue code interpretation.

Through Hotspots, Memory Consumption, and Threading Analysis, you can fix performance bottlenecks in your workload and make efficient use of the available hardware.

Note: Similar principles for data collection and VTune Profiler usage also apply to other interpreted languages mentioned at the beginning of this blog such as Java*.
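
The three analyses named above can be launched from the VTune Profiler command line. The sketch below builds such command lines for a Python script without running them. It assumes the `vtune` executable is on your PATH and uses the `-collect`, `-knob`, and `-result-dir` options with the analysis names `hotspots`, `memory-consumption`, and `threading`; verify these against `vtune -help collect` for your installed version. The helper `vtune_collect_command`, the script name `my_script.py`, and the result directory names are hypothetical.

```python
import shlex
import sys


def vtune_collect_command(analysis, target, result_dir=None, knobs=None):
    """Build (but do not run) a vtune command line as a list of arguments.

    analysis   - analysis type name, e.g. "hotspots"
    target     - the command to profile, as a list of arguments
    result_dir - optional directory for the collected result
    knobs      - optional dict of analysis-specific settings
    """
    command = ["vtune", "-collect", analysis]
    for name, value in (knobs or {}).items():
        command += ["-knob", f"{name}={value}"]
    if result_dir:
        command += ["-result-dir", result_dir]
    command += ["--", *target]  # everything after "--" is the profiled command
    return command


if __name__ == "__main__":
    for analysis in ("hotspots", "memory-consumption", "threading"):
        command = vtune_collect_command(
            analysis,
            [sys.executable, "my_script.py"],  # hypothetical script
            result_dir=f"r_{analysis}",
        )
        print(shlex.join(command))
```

To actually start a collection, the returned list can be passed to `subprocess.run`.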

What’s Next?

This blog briefly outlined some of the major features of VTune Profiler. You can also use it for source code analysis, energy analysis, power and thermal throttling analysis, and management of data views and result files.

Get started with VTune Profiler today!

Explore a wide range of its amazing features for application performance analysis. For practical demonstrations on fixing various application performance bottlenecks, don’t forget to go through our comprehensive VTune Profiler Training Video Series!

Useful Resources

Here are the links to find out more about VTune Profiler and get started with profiling and tuning your workload:

Get The Software

You can install VTune Profiler as a part of the Intel® oneAPI Base Toolkit or download its stand-alone version for free! We also encourage you to check out other AI, HPC, and Rendering tools in Intel’s oneAPI-powered software portfolio.

Acknowledgement

We would like to thank Chandan Damannagari for his contribution to this blog.