Introduction to VR Application Performance Tuning

ID: 659759
Updated: 10/19/2018
Version: Latest
Public

Introduction

This guide introduces virtual reality (VR) developers to rudimentary analysis and optimization techniques they can apply to their projects. It is not intended as a comprehensive reference or replacement for existing guides but rather a quick-start tutorial on VR application performance analysis and optimization, with a sample of engineer-vetted recipes for achieving results quickly.

Use it as a primer on:

  • Full Stack Analysis. Learn about the tools and methods used by Intel engineers to examine complex performance dependencies that go beyond those of traditional applications.
  • VR Application Tuning Recipes. Quickly achieve results with step-by-step tutorials to:
    • Detect performance issues.
    • Identify their root cause.
    • Correct the problem.
  • Solution Scenarios. Highlights of lessons learned by Intel performance engineers when solving real problems for customers.
  • Resources. Where to learn more.

While currently focused on Windows* tools and methods, many of the techniques apply to tuning Linux* and macOS* VR applications. This guide will be updated frequently to reflect our evolving knowledge in the rapidly changing world of VR development. We welcome your suggestions and feedback.

Full Stack Analysis Tools and Methods

The increasing complexity of VR applications requires sophisticated performance analysis. No single tool will render a complete picture, so you must use multiple tools and methods to gain a comprehensive understanding of an application’s performance characteristics. We chose to cover the tools we frequently use and have found to be the most useful across various VR software development kits (SDKs) and development frameworks. However, many more tools are available for the type of analysis we cover in this guide.

pyramid chart
Figure 1. Continuum of analysis.

Figure 1 illustrates the various performance-analysis levels across the software stack. Less time is spent on high-level analysis at the system level; the most time is spent on low-level analysis at the microarchitecture (uArch) level. In other words, the time spent increases as the complexity of the collected data, and the time required to analyze it, increases. Note that any tool may lead you to the cause of a bottleneck, and it is not always necessary to work through all the analysis levels.

Analysis Levels

System

At the system level, general metrics are needed to properly aim and drive optimization decisions. With tools such as Windows TypePerf (see the VR Application Tuning Recipes section), you can characterize the application load on a system quickly and easily, which immediately helps identify obvious issues. Table 1 highlights a few metrics and rules of thumb to follow.

Power

If you’re concerned about power consumption, various tools measure the power used throughout the workload. Intel® Power Gadget monitors the energy model-specific registers (MSRs) reported by the CPU, computes the watts of power consumed by the CPU, and then reports the measurements in a CSV file. The tool also reports CPU utilization, frequency, and temperature, as well as GPU metrics on Intel graphics. (You will need to determine where your application draws power by lining up workload markers with the reported timestamps in the CSV file.)
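As a minimal illustration, the Python sketch below averages the package power reported in a Power Gadget log over a workload window. The column names ("System Time", "Processor Power_0(Watt)") are assumptions that vary by tool version and processor, so check the header row of your own CSV and adjust.

import csv

# Minimal sketch: average the package power reported in an Intel Power Gadget
# log over a workload window. Column names are assumed; verify them against
# the header row of your own CSV.
def average_power(log_path, start_time, end_time,
                  time_col="System Time", power_col="Processor Power_0(Watt)"):
    samples = []
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            # Power Gadget appends summary lines without these columns; skip them.
            if not row.get(time_col) or not row.get(power_col):
                continue
            if start_time <= row[time_col] <= end_time:
                samples.append(float(row[power_col]))
    return sum(samples) / len(samples) if samples else 0.0

# Example: average package power between two workload markers you noted.
# print(average_power("PwrData.csv", "14:02:10", "14:03:40"))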

OS stacks

After you have analyzed the system and identified possible bottlenecks, capture the OS stack information to further examine threads, modules, and functions (should symbols be available). Windows Performance Analyzer (WPA), a tool developed by Microsoft, post-processes samples captured by Windows Performance Recorder (WPR). WPA displays the captured data in graphs and tables through a graphical user interface, and presents data such as system activity, computation, storage, memory usage, and complete call stacks (if symbols are available).

uArch & GPU

For the CPU, when call stacks aren’t deep enough to identify and fix bottlenecks, use Intel® VTune™ Amplifier to identify serial and parallel bottlenecks. Analysis from the VTune Amplifier helps you study algorithm choices, and understand where and how your application can benefit from available hardware resources. VTune Amplifier also provides a view into the assembly code by disassembling the target application. The tool presents hardware performance-counter data in graphs and tables, and can cross-link into source code (if symbols are available) and assembly. Using VTune Amplifier can also help reveal deep microarchitectural issues that might be limiting your application’s performance when scaled across threads or run as part of a larger stack.

As for the GPU, it’s sometimes necessary to understand the performance of the GPU at a low level to identify bottlenecks. Use the Intel® Graphics Performance Analyzers (Intel® GPA) to study the GPU and determine if your application is GPU-bound or CPU-bound. Going deeper, Intel GPA can also identify hotspots in the graphics pipeline and provide draw-call analysis at the frame level, allowing you to study a particular frame and identify areas to optimize.

Tools

TypePerf

TypePerf, built into Windows, can be executed from the command prompt. It samples at 1Hz, and provides enough granularity to identify potential issues with minimal overhead. It can also be run throughout the duration of most workloads.

A typical command looks like this:

typeperf -cf typeperfinput.txt -o workload_perfmon.csv

The -cf switch specifies a text file listing all the metrics to collect (see the Appendix for the list we use). The -o switch identifies the output file. Typically, users open the resulting CSV file in a spreadsheet application to graph its values and identify system issues.
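If you prefer to script the collection, the following Python sketch writes a small counter file and launches TypePerf via subprocess. The counter list mirrors Table 1 (extend it with the full set from the Appendix); the file names and sample counts are illustrative.

import subprocess

# Write a counter file for TypePerf and start a timed collection.
counters = [
    r"\Processor(_Total)\% Processor Time",
    r"\Processor(_Total)\% User Time",
    r"\Processor(_Total)\% Privileged Time",
    r"\System\Context Switches/sec",
    r"\System\System Calls/sec",
    r"\Processor(_Total)\Interrupts/sec",
    r"\Memory\Page Faults/sec",
]

with open("typeperfinput.txt", "w") as f:
    f.write("\n".join(counters))

# -si 1 samples once per second; -sc 90 collects 90 samples (about 90 seconds).
subprocess.run([
    "typeperf", "-cf", "typeperfinput.txt",
    "-o", "workload_perfmon.csv",
    "-si", "1", "-sc", "90",
])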

Table 1. Typical core TypePerf metrics 

Stat | Value | Notes
Processor(_Total)\% Processor Time | 60.88 |
Processor(_Total)\% User Time | 38.17 |
Processor(_Total)\% Privileged Time | 22.71 |
Processor(_Total)\% Interrupt Time | 0.30 |
Processor(_Total)\% DPC Time | 0.42 |
System\Context Switches/sec | 19559.96 | Context switches are high
System\System Calls/sec | 1015040.67 | System calls are extremely high!
Processor(_Total)\Interrupts/sec | 19417.73 | Interrupts are high at 20k/s
Memory\Demand Zero Faults/sec | 21291.03 |
Memory\Page Faults/sec | 24319.89 |

CPU utilization graph
Figure 2. Typical chart of TypePerf metrics

Interpreting this data quickly is a skill that takes time to perfect. Table 2 below contains our guidelines for examining the core set of TypePerf metrics.

Table 2. Core TypePerf metrics

Metric | Description | Guideline
Context Switches | A logical core switches from executing one thread to another | Less than 10k/s per active running thread
System Calls | Calls to operating system services | Less than 50k/s per active running thread
Page Faults | A thread references a page that is not in the OS's current working set | Less than 50k/s per active thread (if the majority are soft faults)
Hardware Interrupts | Devices interrupt the processor when they have completed a task or require attention | Less than 6-7k/s
Processor Queue | Threads in the queue ready to be executed | More than 1 indicates a bottleneck
Average Disk Queue | Average number of read and write requests queued for the selected disk during the sample interval | More than 1 means you are partially gated on disk I/O; more than 2 means you are completely gated by disk I/O
Processor Frequency | Current operating frequency of the processor | Gives no insight into turbo state (P-state)
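As a rough illustration of applying these guidelines, the Python sketch below averages a few counters from a TypePerf CSV and flags any that exceed the Table 2 rules of thumb. The thresholds assume a single active thread; scale them by your application's active thread count.

import csv
import statistics

# Flag TypePerf averages that exceed the Table 2 rules of thumb. TypePerf
# prefixes each column with the machine name, so match counters by substring.
RULES = {
    "Context Switches/sec": 10_000,
    "System Calls/sec": 50_000,
    "Page Faults/sec": 50_000,
    "Interrupts/sec": 7_000,
}

def check_counters(csv_path):
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    for column in rows[0].keys():
        for name, limit in RULES.items():
            if name in column:
                values = [float(r[column]) for r in rows if r[column].strip()]
                avg = statistics.mean(values)
                status = "OK" if avg <= limit else "ABOVE GUIDELINE"
                print(f"{name:22s} avg={avg:12.1f} limit={limit:8d} {status}")

# check_counters("workload_perfmon.csv")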

Windows performance analyzer

WPA (figure 3) works with Windows Performance Recorder (WPR), a tool based on Event Tracing for Windows (ETW). WPR records system events during testing, and WPA displays the captured data (CPU, GPU, etc.) in graphs and tables for analysis. WPA can be obtained free of charge through the Windows Assessment and Deployment Kit (Windows ADK).

performance analyzer results
Figure 3. Windows* performance analyzer displays CPU and GPU usage data

WPR can also be executed from the command line through the log.cmd file, installed under the gpuview folder within the Windows Performance Toolkit folder. Call log.cmd from the command prompt to both start and stop collection. After stopping data collection, a set of traces is created in the gpuview folder; merged.etl is the file of interest. Open this file in WPA for analysis. To install WPR and WPA, get the Windows* Assessment and Deployment Kit (Windows ADK).

Intel® VTune™ Amplifier

VTune Amplifier 2018 (figure 4) allows you to find serial and parallel code bottlenecks, and speed execution. Use this tool to analyze algorithm choices, and understand where and how your application can benefit from available hardware resources. Download a trial version of VTune Amplifier.

hotspots in the Intel V Tune Amplifier U I
Figure 4. Intel® VTune™ Amplifier 2018 lets you identify and analyze code bottlenecks to optimize performance on modern CPUs

PresentMon

Use PresentMon to trace ETW events related to swap chain presentation on Windows. It can capture and analyze key performance metrics for graphics applications (for example, CPU and display frame durations and latencies). It works across all graphics APIs and supports UWP applications.

PresentMon is an open-source tool developed by Intel and available on GitHub*.

Intel® Power Gadget

Monitor power usage on Intel® Core™ processors with Intel Power Gadget (figure 5). The tool runs on Windows and Mac OS X* and includes an application, driver, and libraries to monitor and estimate real-time processor package power information.

monitor power use
Figure 5. Intel® Power Gadget uses the integrated energy counters in the processor to monitor an application’s power consumption

To generate a CSV log file with Intel Power Gadget, click the “Start Log” button. A red “Rec” flashes to indicate logging has started. At the end of your workload, click “Stop Log” to complete the data capture; by default, the log is saved in your Documents folder.

Download Intel Power Gadget.

Intel® Graphics Performance Analyzers (Intel® GPA)

Detect bottlenecks at the frame level and apply real-time experiments on frames with Intel Graphics Performance Analyzers (Intel GPA, figure 6). Download Intel Graphics Performance Analyzers.

frame performance review
Figure 6. Intel® Graphics Performance Analyzers lets you detect and mitigate bottlenecks at the frame level

VR Application Tuning Recipes

When optimizing VR applications, use the key metrics in the "Key performance metrics" section to gauge initial performance and track the impact of changes at each level.

Key performance metrics

Frame rate

Defined as the number of frames rendered each second, commonly referred to as fps (frames per second). The higher the fps, the better.

Late Stage Reprojection (LSR)

LSR applies to Windows Mixed Reality (WMR) applications only. Capture LSR to determine whether an image can be re-projected before being displayed on the headset. This helps prevent motion sickness. The higher the LSR, the better.

Frame time

The total time to render a frame. For a target frame rate of 90 fps, the work must be completed in 11.1 ms, or 16.6 ms for 60 fps. The lower the frame time, the better.

To determine the budget of time required for a certain fps, use the calculation:
1000 / (Target fps) = X ms
For example: 1000 / 90 fps = 11.1 ms
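Expressed as a small Python helper using the same formula:

# Frame-time budget in milliseconds for a target frame rate.
def frame_budget_ms(target_fps):
    return 1000.0 / target_fps

for fps in (60, 90):
    print(f"{fps} fps -> {frame_budget_ms(fps):.1f} ms per frame")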

CPU and GPU utilization

The total work handled by the processing units over a given period of time.

Defining success

A firm understanding of the key performance metrics of a VR application, combined with the maximum achievable performance of your target hardware, determines your performance ceiling. Calculate this ceiling for various metrics to put the actual measured performance in context. For example, if you achieve frame rates and Late Stage Reprojection (LSR) values that reach or exceed 98 percent of the target platform and hardware spec's performance ceiling, your application performs well. Put another way, with a maximum achievable refresh rate of 90 Hz, aim for an average frame rate and LSR rate of at least 88.2 fps. The term Late Stage Reprojection was coined by Microsoft and is used for Windows Mixed Reality applications. However, Oculus* and Vive* applications have their own reprojection nomenclature and solutions. For the purposes of this guide, we will refer to LSR for WMR applications to describe ideal performance.

Performance tiers

Many of the major VR platforms offer different levels of performance (see table 3).

Microsoft, for example, defines two tiers of Mixed Reality PCs:  Windows Mixed Reality Ultra PCs (WMR Ultra PCs) and Windows Mixed Reality PCs (WMR Mainstream PCs) – the key differences being their minimum hardware requirements and the maximum achievable headset frame rate. WMR Mainstream PCs support 60Hz and WMR Ultra PCs support 90Hz.

For other minimum hardware requirements, see the Oculus* Rift support page and the list of recommended specs for HTC Vive*.

Table 3. Maximum achievable fps performance 

        | WMR Mainstream PCs | WMR Ultra PCs | Oculus Rift | HTC Vive & Vive* Pro
Max FPS | 60 | 90 | 90 | 90

VR tuning flow

Once you’re familiar with the key performance metrics and general indicators that your application runs as intended, start testing based on minimum specifications. Your goal is to deliver a positive user experience on each target platform.

Benchmarking and tuning will determine if there are major performance issues with your VR application. Figure 7 shows the general flow of the analysis and tuning process. The tools employed are shown at the top of the diagram.

VR analysis flowchart
Figure 7. Flowchart of VR application analysis and tuning

Get started

Use PresentMon to reveal how the application behaves, and then determine the need for further optimizations. Generate graphs with the PresentMon data (figure 8) to understand application behavior throughout your test. The graphical representation of this data lets you identify any instantaneous buffering and stuttering issues, as well as performance spikes. Keep in mind that taking an average of these metrics may not reveal specific issues and how they impact the user experience.

Notice that while the average frame rate is above 55 fps (figure 8), there are frequent spikes. The large spikes represent a significant number of frames dropping during that period, which translates into stuttering and poor UX.

dropped frames
Figure 8. Graph generated based on collected PresentMon data

Step-by-step:

  1. Download the latest version of PresentMon.
  2. Collect at least a 90-second sample from your application using the command line.
  3. Example capture command for sampleapp.exe: .\PresentMon64-1.3.0.exe -timed 90 -process_name sampleapp.exe -verbose
    1. For Windows Mixed Reality applications, use the -include_mixed_reality option for additional metrics.
  4. Process the data. We use Excel* to get a graphical view of application performance, but any general spreadsheet application (or a script; see the sketch after figure 9) works.
    1. Open the .csv that PresentMon generated.
    2. In the Application column (Column A), make sure the rows are filtered to only the VR application that is being analyzed.
    3. Create a new column, name it fps, use the calculation: =1000/[MsBetweenPresents] and apply it to all rows.
  5. Graph the data.
    1. In Excel, highlight the newly created fps column, then choose Insert > 2D Line. Notice the average displayed at the bottom right of the window.

excel report
Figure 9. Graph generation in Excel*
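As an alternative to the Excel workflow above, the Python sketch below derives the same per-frame fps series directly from the PresentMon CSV. The column names ("Application", "MsBetweenPresents") match recent PresentMon releases; older builds may differ, so check your CSV header.

import csv
import statistics

# Compute per-frame fps from a PresentMon CSV for one application.
def fps_series(csv_path, app_name):
    series = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["Application"].lower() != app_name.lower():
                continue
            ms = float(row["MsBetweenPresents"])
            if ms > 0:
                series.append(1000.0 / ms)
    return series

fps = fps_series("PresentMon-sampleapp.csv", "sampleapp.exe")
print(f"frames: {len(fps)}  average fps: {statistics.mean(fps):.1f}")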

Analyze the PresentMon data

Determining performance issues becomes simple when looking at the App fps graph (figure 10). If the application performs in an ideal manner, the fps average will meet or exceed 98 percent of the max achievable fps, and the fps standard deviation will be low.

ideal power consumption graph
Figure 10. An example of an ideal application where the max achievable fps is 60 fps with a stable 10s average
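To put numbers behind the "98 percent of the maximum achievable fps" rule of thumb, a small extension of the previous sketch can summarize the series; the 50-percent drop threshold here is an arbitrary choice for illustration.

import statistics

# Summarize an fps series against the 98-percent rule of thumb described above.
# Reuses fps_series() from the previous sketch; max_fps comes from Table 3.
def summarize(fps, max_fps):
    avg = statistics.mean(fps)
    stdev = statistics.stdev(fps)
    dropped = sum(1 for f in fps if f < 0.5 * max_fps)
    print(f"average fps : {avg:.1f} (target >= {0.98 * max_fps:.1f})")
    print(f"std dev     : {stdev:.1f} (lower is better)")
    print(f"frames below half of max fps: {dropped}")

# summarize(fps_series("PresentMon-sampleapp.csv", "sampleapp.exe"), max_fps=90)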

While the average fps could be in the high 50s, you may experience intermittent frame drops. Figure 11 shows:

(a) frames dropping every 90 seconds due to a dynamic quality setting change.

(b) frames dropping and then stabilizing after a few seconds, illustrating likely buffering issues.

frames dropping
Figure 11a. Frames dropping every 90 seconds

Buffering issues graph
Figure 11b. Buffering issues that smooth over time

If your application misses the fps target on a consistent basis, use Intel GPA to investigate.

Intel® Graphics Performance Analyzers

Capture frames using Intel GPA to continue analyzing your VR application. Identify the largest ergs. This helps reveal:

  • Unnecessary background rendering
  • Inefficiently batched draw calls
  • Other issues that could degrade performance

In VR applications, ergs may appear twice. This is because they're rendered once per eye, unless single-pass stereo rendering was used, in which case the erg will appear only once. We recommend using single-pass stereo rendering to lower CPU and GPU utilization. Read more about single-pass stereo rendering from Unity*.

Figure 12 shows a common scenario in which a media player menu is being rendered behind the video, which the user will never see. GPA is able to show the draw calls and the render target, confirming not only that the player was being drawn, but also that the user never sees it in the final output.

media player menu
Figure 12. A media-player menu being drawn behind video

Figures 13a, 13b, and 13c reveal how Intel GPA can detect a static RenderScale that's been set too high for the target hardware. We noticed in this instance that the RenderScale had been set to 1.3 and then downscaled to the target size. Changing the RenderScale back to 1.0 showed a significant performance increase.

In fact, on lower-end hardware, dynamically lowering the RenderScale to 0.7 or slightly higher can result in minimal quality degradation (depending on the use-case) while being able to maintain a stable fps, which is more important in VR. Watch this video to learn more about fighting VR sickness.

Resolution changes displayed in UI
Figure 13a. Resolution goes from 1664x1664 to 1280x1280 between render targets, which indicated that the RenderScale must have been higher than 1.0

PresentMon data shows the increased performance after reverting the RenderScale to 1.0.

graph
Figure 13b. With a RenderScale of 1.3, frame-rates average 48.9 fps

graph
Figure 13c. After adjusting RenderScale to 1.0, frame-rates average 58.6 fps

Using Intel GPA, look at the largest ergs to examine the draw call and understand what it's doing. Then, look for the bottleneck—for example, does the shader stall due to large textures? Finally, experiment on the erg to see if the change improves performance. Learn more about how to pinpoint performance bottlenecks within a frame in this guide to Get Started with Intel GPA.

Step-by-step:

  1. Download Intel Graphics Performance Analyzers.
  2. Analyze frames by launching your VR application through Intel GPA.
  3. Once launched, press (Ctrl+C) or (Cmd+C) to capture a frame in a scene with suspected performance issues.
  4. View the captured frames in Graphics Frame Analyzer to inspect them at the frame level.

If optimizations at the application level using GPA do not meet your requirements, a system-level exam with Windows Performance Recorder may help uncover issues.

Windows* Performance Recorder (WPR)

Record data during testing with WPR and analyze it with WPA. WPA shows CPU and GPU usage along with a graphical representation of events. In addition to GPUView and Xperf, WPA and WPR come with the Windows Performance Toolkit, available as part of the Windows ADK.

Be sure to select the GPU activity option under Resource Analysis in the WPR GUI as this is not selected by default.

Windows performance recorder

Figure 14 shows a WPA trace captured when an app named WWAHost.exe (highlighted in yellow) was running. The CPU Usage (sampled) field shows % weight for the selected executable at 10.56 percent. You'll find GPU usage under the Video twirl-down menu in the Graph explorer on the left.

analyze UI
Figure 14. WPA lets you analyze CPU and GPU usage

Solution Scenarios

As performance engineers, we help other developers optimize their applications. The following list highlights some of the techniques most relevant to optimizing VR applications.

GPU usage tuning:

GPA bottlenecks

Look for potential culprits. Unnecessary elements, such as the portal, may be rendering in the background, and unused textures may be loading, both of which impact performance. Use Graphics Frame Analyzer to examine which textures are being fetched, how draw calls are batched, and whether there are unnecessary clear calls; determine whether essential details are actually being rendered to the screen, and identify the draw calls with the biggest impact.

In Frame Analyzer, set both the X and Y axis metrics to GPU Duration, as shown below. This lets you detect which draw calls are taking the longest to render. Then experiment to pinpoint the pipeline bottleneck.

show GPU information
Figure 15. Change the X and Y dropdown to show GPU duration to reveal which ergs are taking the most time.

WPA

Issues

Figure 16 shows a compositor pushing GPU utilization to almost 100 percent. This indicates a GPU-bound application. In such scenarios, check CPU usage; if it’s not at capacity, try offloading some of the GPU workload to the CPU.

GPU analysis results
Figure 16. When you see GPU utilization at nearly 100 percent (highlighted in orange), it means your application is GPU-bound

Improve startup time

Many factors impact VR application startup time—numerous large asset files take an enormous toll. Even with high-performance hardware, it might take 30 to 60 seconds or more to launch a VR app. Let's explore a startup-time problem scenario, its root cause, and a workaround.

Problem scenario: Serialized network operations

This problem can show up anywhere in a VR application, but it's most painful during startup. Consider figure 17, an example we collected with WPR and viewed in WPA.

serialized network in analysis

Figure 17. Serialized network operations can add significant latency to a VR application’s startup time

In this example, a nine-second window with an idle CPU and GPU dominates the startup time. This red flag means the bottleneck is something other than the processors. The developers had no idea anything was out of the ordinary—they assumed there was no way to reduce startup time.

This illustrates an important rule: a well-tuned application should always be CPU- or GPU-limited, the only exception being when the desired user experience is achieved before hitting a bottleneck.

Once we identified the behavior with WPA, the root cause of the nine-second gap was easy to find. Analysis of the source code revealed the culprit was an update operation. The intention was to run it on its own thread, but it was being run on the UI thread. The workaround took 10 minutes and three lines of code: we spawned a separate thread to offload the update operation and unblock the UI thread.

That work resulted in the startup profile in figure 18.

Utilization report
Figure 18. Eliminating the source of the startup bottleneck reduced the startup time by 11 seconds

Note that the gap in CPU usage is gone. The startup time dropped from roughly 26 seconds to less than 16 seconds. We achieved this by performing the update operation asynchronously.

Always parallelize network operations on separate threads to ensure that they won't block the main or UI thread during startup unless required. Sometimes the best optimization opportunities are hiding in plain sight.
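The application in this scenario was not written in Python, but as a language-neutral illustration of the pattern, the sketch below moves a blocking update check off the startup path onto its own thread. The function names are hypothetical stand-ins, not the customer's actual code.

import threading

# check_for_updates() and load_assets_and_show_ui() are hypothetical stand-ins
# for the blocking update operation and the normal startup work.
def check_for_updates():
    ...  # long-running network operation

def load_assets_and_show_ui():
    ...  # normal startup work, unchanged

def start_application():
    updater = threading.Thread(target=check_for_updates, daemon=True)
    updater.start()            # kick off the update check asynchronously
    load_assets_and_show_ui()  # startup no longer waits on the network
    # join the thread (or poll a flag) later, once the result is needed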

Reduce CPU usage for video playback

Problem: The application is showing high CPU usage for video playback.

Solution: Check that you're using hardware decode by opening Task Manager and going to the Performance tab (see figure 19).

task manager performance report

task manager performance report
Figure 19. No hardware decode (above left); hardware decode on (above right)

If hardware acceleration is unavailable in your editor, check its settings for an option to enable it. In Unity, we recommend using an H.264 source. If there’s still no hardware decode with your H.264 video in Unity, use this sample to integrate it into your application.

Summary

This guide is a starting point that introduces VR application developers to a few basic methods, tools, and techniques for optimizing VR applications. Every application is different, and the recipes and solutions included here might not suit every application in every case. Depending on your situation, try analyzing your app with the different tools and methodologies described in the sections above. Links to all the tools mentioned in this guide are listed in the appendix, which remains a useful reference for those who want to go deeper into specific optimization topics.

Learn More

Tools

Additional Resources

Best practices checklist

Quality and rendering settings: WMR mainstream PCs

Setting | Recommendation
HDR | Disable (RGBA 8 or 10)
MSAA | Disable
Media Input | 2K
Anisotropic Filtering | Disable
V-Sync | Disable
Shadow Cascades | Disable
Textures | Low or Medium
Single-Pass Stereo Rendering | Enable
Render Scale | 0.7 - 1.0

Unity* application tuning tips:

If the vertex shaders are complex, with math functions such as pow, exp, log, cos, sin, and tan:

Consider using lookup textures as an alternative to complex math calculations where possible (see this guide to post-processing with User LUT).

If there is a sampler bottleneck, the texture sampler is starving the EUs (execution units) due to slow retrieval:

  • Consider reducing the size of the texture by using a lower resolution or color precision such as RGBA8.
  • Consider using a different filtering algorithm. For example, anisotropic filtering is more expensive compared to a simpler algorithm such as bilinear filtering.
    • To see if this is an issue, try the 2x2 textures experiment and see if the change positively affects the duration of the frame.
  • If too many draw calls are causing significant overhead on the CPU side, consider using static batching for non-moving GameObjects if the application uses Unity.
    • Check Static (under inspector) for non-moving GameObjects.
  • For targeting a wide range of system specifications, consider dynamically changing the RenderScale setting based on the fps. If there’s no noticeable difference going down to 0.7 or 0.8, it can be a good option for low-spec systems. Reducing graphics quality is a better alternative than dropping frames.
  • Use Single-Pass Stereo Rendering to reduce CPU processing time.