Intel® Arc™ A-series Graphics Gaming API Developer and Optimization Guide

ID 737258
Updated 6/20/2022
Version Latest
Public

The new Intel® Arc™ A-series discrete graphics (formerly code named Alchemist) implements the Xe-HPG microarchitecture (high-performance graphics) and hosts multiple advancements of interest to developers. To gain peak performance, software must account for the new architecture and developers must make the right choices regarding APIs. This document contains developer guidance and optimization methods, as well as best practices, to most effectively harness the architecture’s capabilities and achieve peak performance. 

Xe-HPG Feature Highlights

Xe-HPG offers updates and enhancements to features such as Microsoft DirectX* 12 Ultimate, high-dynamic range (HDR) support, Adaptive-Sync, and DirectStorage.

DirectX 12 Ultimate* Support

Xe-HPG will support features available with DirectX 12 Ultimate*, such as:

  • Ray Tracing
  • Mesh Shading
  • Variable Rate Shading (VRS) Tier 2
  • Sampler Feedback

To take advantage of DirectX 12 Ultimate, applications must support feature level 12_2.

Hardware Ray-Tracing Support

Xe-HPG supports hardware accelerated ray tracing in both DirectX 12 (1.0 and 1.1) and Vulkan* (Vulkan RT). Ray tracing is a technique that can be used to simulate physical light behavior in 3D applications. It can be used to enable global illumination, realistic shadows, reflections, ambient occlusion, and other techniques. A separate guide dedicated to optimizations and best practices for our hardware ray-tracing support will be published to the Intel® Developer Zone.

For Vulkan, ray tracing is supported on Xe-HPG through the following Khronos* extensions:

  • VK_KHR_ray_tracing_pipeline
  • VK_KHR_acceleration_structure
  • VK_KHR_ray_query
  • VK_KHR_pipeline_library
  • VK_KHR_deferred_host_operations

Variable-Rate Shading

Variable-rate shading (VRS) gives programmers the ability to vary the shading rate independently of the render target resolution and rasterization rate. Among other use cases, this feature allows developers to reduce the number of pixel shader invocations for content that has slowly varying shading parameters, or for pixels that may be blurred later in the rendering pipeline. The feature enables developers to direct shader resources to the pixels that matter most in their content. This can provide a better visual solution than rendering at a lower resolution and then upscaling, since it preserves the depth and stencil at full pixel rate. Xe-HPG hardware supports VRS via DirectX 12 Tier 2, and via the VK_KHR_fragment_shading_rate extension for Vulkan. For more information on enabling VRS in your application, refer to the white papers Getting Started with Variable-Rate Shading on Intel® Processor Graphics and Velocity and Luminance Adaptive Rasterization Using VRS Tier 2.
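As a rough illustration of the savings described above, the following sketch estimates pixel shader invocations for a full-screen pass at a given coarse shading rate. The function name is hypothetical, and the model is a simplification that assumes a single rate for the whole pass and ignores primitive edges; it is not a description of how the hardware counts invocations.

```cpp
#include <cassert>
#include <cstdint>

// Estimated pixel shader invocations for a full-screen pass when each
// invocation shades a rateX x rateY block of pixels (1x1 = full rate).
// Simplification: one rate for the whole pass, primitive edges ignored.
inline std::uint64_t EstimateInvocations(std::uint32_t width, std::uint32_t height,
                                         std::uint32_t rateX, std::uint32_t rateY) {
    assert(rateX > 0 && rateY > 0);
    // Round up so partially filled coarse pixels at the border still count.
    const std::uint64_t coarseW = (width + rateX - 1) / rateX;
    const std::uint64_t coarseH = (height + rateY - 1) / rateY;
    return coarseW * coarseH;
}
```

Under these assumptions, a 1920 x 1080 pass at a 2x2 rate needs 518,400 invocations instead of 2,073,600, while depth and stencil are still evaluated at full pixel rate.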

Mesh Shading

Mesh shading is a new technique that can replace the traditional geometry pipelines. For instance, the input assembly, vertex shader, hull shader, domain shader, tessellator, and geometry shader are traditionally used to feed primitive data into the rasterizer. With mesh shading, these steps can be replaced with one or two stages, and primitives can be generated in a compute shader fashion. This enables developers to increase flexibility and performance when defining 3D primitives, including procedural generation of 3D geometries, primitive culling, and other techniques.

DirectX* 12 View Instancing and Vulkan* Multiview

View Instancing is based on the observation that there are cases of redundant geometry processing where there is a shared use of the geometry between different views, based on the position of the camera. Examples of this include each side of a cube map, cascades of shadow maps, and/or stereo view.

  • Normally the draw calls will be sent from the CPU to the GPU multiple times, which can become a bottleneck for scenes with many objects or draw calls. Xe-HPG introduces support for hardware that enables replicating geometry for multiple views in a single pass. Converting multiple passes, or instances where the geometry is processed the same way, to a single pass avoids redundant CPU and GPU work.
  • Xe-HPG supports Tier 2 level view instancing on Direct3D* v12, and works in conjunction with VRS to enable further performance gains.

DirectX 12 Sampler Feedback

Sampler feedback is a hardware-accelerated feature introduced in DirectX 12 Ultimate and supported by Xe-LP, Xe-HPG and later. Conceptually, it is the reverse of texture sampling: in contrast to the shader Sample() intrinsic, which reads a number of texels from a texture and returns an average value, the new WriteSamplerFeedback() shader intrinsic writes to a binary resource "marking" the texels that would have been read. This enables two important usages:

  1. Sampler Feedback Streaming: When objects are drawn, we can simultaneously collect data about what texture data was required to draw the scene. By dynamically streaming only those necessary portions of resources just-in-time, we can draw a scene that accesses more textures than could simultaneously fit in physical graphics memory. Sampler feedback includes a "min mip feedback" feature to facilitate this usage: a single-layer texture where each texel represents a region of the streaming texture with an integer (byte) representing the minimum mip that was sampled in that region. If the sampler feedback region size is set to match the tile dimensions of a partially resident resource, for example a DirectX 12 reserved resource, the result is a map that informs what tiles to stream. Min-mip feedback maps are very small—for example, for a 16Kx16K BC7 reserved resource, the corresponding min mip feedback map with region size 256x256 is only 64x64 bytes or 4 KB in size; much smaller than the 350 MB texture it represents.
  2. Texture Space Rendering: Intel pioneered texture space rendering in a SIGGRAPH 2014 research paper [Clarberg et al. 2014]. We observed that the cost of pixel shading is tightly coupled to both the geometric complexity and the screen resolution, which has only increased over time. Instead of shading directly through traditional pixel shaders, the application uses sampler feedback to mark a mask which corresponds to the texels it requires. This mask can then be referenced in a subsequent compute shader pass, and the required texels can be shaded at a resolution and frequency determined by the developer in texture space, as opposed to a fixed screen-space resolution and frame frequency. Once shading to texels is completed in a separate lighter-weight pass, pixels in screen space are mapped into texture space, and the corresponding texels are sampled and filtered using standard-texture lookup operations. This feature allows pixel shading to be largely independent of the geometric complexity or screen resolution, providing a fine-grain control to trade quality for performance.
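The size comparison in item 1 can be checked with a little arithmetic. The sketch below assumes one byte per feedback region, as described above, and the standard BC7 cost of 16 bytes per 4x4 texel block. The function names are illustrative and are not part of the DirectX API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Bytes in a min-mip feedback map: one byte per feedback region.
inline std::uint64_t MinMipMapBytes(std::uint32_t texWidth, std::uint32_t texHeight,
                                    std::uint32_t regionWidth, std::uint32_t regionHeight) {
    assert(regionWidth > 0 && regionHeight > 0);
    const std::uint64_t regionsX = (texWidth + regionWidth - 1) / regionWidth;
    const std::uint64_t regionsY = (texHeight + regionHeight - 1) / regionHeight;
    return regionsX * regionsY;
}

// Bytes in a BC7 texture including its full mip chain
// (16 bytes per 4x4 block, every mip rounded up to whole blocks).
inline std::uint64_t Bc7BytesWithMips(std::uint32_t width, std::uint32_t height) {
    assert(width > 0 && height > 0);
    std::uint64_t total = 0;
    for (;;) {
        const std::uint64_t blocksX = (width + 3) / 4;
        const std::uint64_t blocksY = (height + 3) / 4;
        total += blocksX * blocksY * 16;
        if (width == 1 && height == 1) break;
        width = std::max<std::uint32_t>(1, width / 2);
        height = std::max<std::uint32_t>(1, height / 2);
    }
    return total;
}
```

For a 16384 x 16384 texture with 256 x 256 regions, this gives 4,096 bytes for the feedback map against roughly 358 million bytes (about 341 MiB) for the texture and its mips, in line with the figures quoted above.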

For more information on sampler feedback, please visit the GitHub* repository that has sample code and links to the Intel Game Developer Conference (GDC) summer 2021 presentation.

DirectStorage

All Intel® platforms and graphics products support Microsoft* DirectStorage for Windows*. Performance of some usages will be dependent on features of the platform and devices. One desirable configuration for game loading and streaming experience includes the following:

  • An Xe-HPG GPU that supports concurrent copy and render engines
  • A 12th generation Intel® Core™ processor (such as the Intel® Core™ i9-12900K processor) that supports 24 lanes of PCIe* v4 and high-performance DDR5 memory
  • A high-speed NVMe SSD.

High Dynamic Range Displays

Xe-HPG architecture features additional support and improvements for high dynamic range (HDR) images and displays. It integrates half-precision floating point (FP16) for faster rendering at deep color bit depths (10, 12, and 16 bits per channel), supports up to the scRGB color gamut, and provides HDR10 protocol output and input capable of Dolby Vision*.

Ultimately, Xe-HPG is fully compatible with the VESA Certified DisplayHDR* and Ultra High Definition (UHD) Premium* display certifications, and already supports Dolby Vision ahead of the release of future consumer monitors.

HDR is supported in DirectX and Vulkan. For Vulkan, support can be found via the extension VK_EXT_hdr_metadata.

For additional information on HDR, see: Using DirectX with High Dynamic Range Displays and Advanced Color.

Adaptive-Sync—Variable Refresh Rate

Adaptive-Sync is the VESA standard for variable refresh rate displays. This display feature and controller enables a better experience for the user by reducing tearing and stuttering. Adaptive-Sync may also reduce overall system power consumption. Basic requirements for Adaptive-Sync are:

  • Full-screen rendering by the game, or 3D application
  • Simple application swap-chain modification to ensure asynchronous buffer flips
  • A DisplayPort 1.4 VESA Adaptive-Sync capable display panel
  • Windows® 10 RS5 and beyond

The game, or 3D application, must ensure that its rendering swap-chain implements asynchronous buffer flips. On displays that support Adaptive-Sync, this results in smooth interactive rendering, with the display refresh dynamically synchronized with the asynchronous swap-chain flips. If application and platform conditions are met, the Xe-HPG driver enables Adaptive-Sync by default. There is also an option to disable it using the Intel® Graphics Control Panel.

On DirectX 12, set DXGI_SWAP_CHAIN_FLAG_ALLOW_TEARING when creating the swap chain, and pass DXGI_PRESENT_ALLOW_TEARING when presenting.

On Vulkan, use VK_PRESENT_MODE_IMMEDIATE_KHR or VK_PRESENT_MODE_FIFO_KHR.

For more information on enabling Adaptive-Sync, refer to the Enabling Intel® Adaptive Sync Guide.

Other API Feature Support

In addition to all these key features, Xe-HPG supports all the major APIs, including DirectX, OpenGL*, Vulkan, and OpenCL™ APIs. The table below shows the features discussed above, along with other Direct3D 12 features mapped to Xe-HPG, as well as the corresponding Vulkan support.

Table 1. Xe-HPG Feature Support by API

API Feature | DirectX 12 Support | Vulkan Support
Max Feature Level | 12_2 | N/A
Shader Model | 6_6 | N/A
Resource Binding | Tier 3 | Limits Based on Driver Query
Tiled Resources | Tier 3 | Limits Based on Driver Query
Typed UAV Loads | Yes | Yes (Core Spec)
Conservative Rasterization | Yes | Yes (VK_EXT_conservative_rasterization)
Rasterizer-Ordered Views | Yes | Yes (VK_EXT_fragment_shader_interlock)
Stencil Reference Output | Yes | Yes (VK_EXT_shader_stencil_export)
UAV Slots | Full Heap | Limits Based on Driver Query
Resource Heap | Tier 1 | N/A
Variable Rate Shading | Tier 2 | VK_KHR_fragment_shading_rate
View Instancing | Tier 2 | Yes (VK_KHR_multiview)
Asynchronous Compute | Yes | Yes (Core Spec)
Depth Bounds Test | Yes | Yes (Core Spec)
Sampler Feedback | Tier 0.9 | N/A
Ray Tracing | DXR 1.0/1.1 | See Hardware Ray-Tracing Support
Mesh Shading | Tier 1 | Work in Progress

Tools for Performance Analysis

Intel recommends using Intel® Graphics Performance Analyzers (Intel® GPA) and Intel® VTune™ Profiler as the primary optimization tools for performance analysis on Xe-HPG. The Intel GPA Cookbook is found here. Intel GPA is updated frequently with new features to help you debug and profile your application.

Performance Recommendations for Intel Xe-HPG Graphics Processors

Modern graphics APIs, such as DirectX 12 and Vulkan, give developers more control over lower-level choices that were once handled in driver implementations. Each API is different, however, and these are general recommendations for application developers that are not API-specific.

GPU Detection for Features

Xe-HPG supports various hardware features that are supported natively in DirectX 12 or Vulkan. When querying, enabling, or disabling hardware features, keep the following in mind:

  • Avoid using vendor IDs to disable features, use slower execution paths, or default to lower performance settings.
  • Query the hardware for support using defined APIs.
  • Favor vendor-agnostic features, when available, over hardware vendor-specific extensions.
  • Check for support on optional features in an API.

To help with detection of Intel® GPUs, Intel provides a GPU Detect sample.

Intel’s driver versioning has also been updated. For more information refer to Understanding the Intel® Graphics Driver Version Number.

Additional Notes

  • Xe-HPG does not support double precision floats in hardware. Use CheckFeatureSupport and query the DoublePrecisionFloatShaderOps member of the D3D12_FEATURE_DATA_D3D12_OPTIONS struct to check for support before using in shaders.
  • For compute shaders, Single Instruction Multiple Data (SIMD) lane count is not fixed but is variable depending on factors later highlighted in this guide in the Shader Optimizations section. Shaders should not assume wave size is fixed.
  • Barycentrics are not supported.
  • Divergent barriers in shaders can cause hangs.
  • Uninitialized descriptor heaps are undefined behavior and may cause issues.

For more information about GPU detect, and a description of driver version schemes, consult this article.

Configuring Graphics Pipeline State

When configuring pipeline states, consider the following:

  • Take advantage of all the available CPU threads on the system when creating Pipeline State Objects (PSOs). In previous APIs, the driver would create these threads for you, but now you must create the threads yourself.
  • Define optimized shaders for PSOs instead of using combinations of generic shaders mixed with specialized shaders.
  • Avoid defining depth-plus-stencil format, when stencil will not be used. Use depth-only formats, such as D32.

Resource Binding

The latest graphics APIs give you more control over resource binding, such as with DirectX root signatures, and Vulkan pipeline layout. Using these resources requires particular attention to maximize performance. When designing an application strategy for resource binding, employ the following guidance:

  • Minimize the number of root signature slots, or descriptor sets, to only what will be used by a shader.
  • Find a balance between root signature, or descriptor set, reuse across shaders.
  • Consider packing all constant buffer views into one descriptor table for multiple constant buffers that do not change between draws.
  • For multiple unordered access views (UAV) and shader resource views (SRV) that do not span a consecutive range of registers, and do not change between draws, it is best to pack them into a descriptor table.
  • Minimize descriptor heap changes. Changing descriptor heaps severely stalls the graphics pipeline. Ideally, all resources will have views appropriated out of one descriptor heap.
  • Avoid generic root signature definitions where unnecessary descriptors are defined, and not leveraged. Instead, optimize root-signature definitions to the minimal set of descriptor tables needed.
  • Favor root constants over root descriptors, and favor root descriptors over descriptor tables when working with constants.
    • Make use of root/push constants to enable fast access to constant buffer data (they are preloaded into registers).
    • Root/push constants are great to use on frequently changing constant buffer data.
  • Use root/push constants for cases where the constants are changing at a high frequency.
    • If certain root signature slots are less frequently used (i.e., not referenced by a PSO), put those at the end of the root signature to reduce general register file (GRF) usage.
  • Use hints that allow the driver to perform constant-based optimizations, such as D3D12_DESCRIPTOR_RANGE_FLAG_DATA_STATIC.
  • For placed resources, initialize using clear, copy, or discard, before rendering to the resource. This helps enable proper compression by putting the placed resource into a valid state.
  • When creating resource heaps, resources that need to be accessed by the GPU should be placed in heaps that are declared as resident in GPU memory—preferably exclusively. Default heaps are preferred over upload heaps for constant buffers. This has a significant impact on discrete GPU performance.
  • Use queries to identify scenarios when GPU local memory gets oversubscribed, and adjust resource location to accommodate this. On one device the memory footprint may be fine but may be oversubscribed on a different device.
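When weighing root constants, root descriptors, and descriptor tables as suggested above, it can help to track how much of the root signature each choice consumes. The sketch below uses the Direct3D 12 accounting rules (a 64-DWORD root signature limit, with 1 DWORD per 32-bit root constant, 2 DWORDs per root descriptor, and 1 DWORD per descriptor table). The RootLayout struct and function names are hypothetical helpers, not D3D12 types.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of a root signature layout (not a D3D12 type).
struct RootLayout {
    std::uint32_t rootConstants32Bit;  // number of 32-bit root constants
    std::uint32_t rootDescriptors;     // root CBV/SRV/UAV descriptors
    std::uint32_t descriptorTables;    // descriptor table slots
};

constexpr std::uint32_t kMaxRootSignatureDwords = 64;

// DWORD cost: 1 per 32-bit constant, 2 per root descriptor, 1 per table.
constexpr std::uint32_t RootSignatureDwords(const RootLayout& layout) {
    return layout.rootConstants32Bit + 2 * layout.rootDescriptors + layout.descriptorTables;
}

constexpr bool FitsRootSignature(const RootLayout& layout) {
    return RootSignatureDwords(layout) <= kMaxRootSignatureDwords;
}

// Debug-time guard an engine might call before serializing a layout.
inline void AssertFitsRootSignature(const RootLayout& layout) {
    assert(FitsRootSignature(layout));
    (void)layout;  // avoids an unused-parameter warning when asserts are disabled
}
```

For example, a layout with four root constants, two root descriptors, and three descriptor tables costs 11 DWORDs. Staying within the limit is only a validity check; the guidance above still favors the smallest layout a shader actually uses.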

Render Targets and Textures

Most developers optimize texture data for GPU or CPU access, based on a given texture’s storage mode, and its use in the app. However, you can explicitly optimize texture data for either processor or opt out of optimization altogether. When texture data is optimized, performance increases for one processor but decreases for the other. Before optimizing texture data, carefully consider the storage modes and usage options for your textures.

General Guidance

The following application guidelines ensure the efficient use of bandwidth with render targets:

  • To conserve memory bandwidth and optimize cache fetches, avoid defining unnecessary channels or data formats with higher precision than needed.
  • Create multiple resources from the same memory-object.
  • Create resources (when possible) in the state that they will first be used. For example, starting in COMMON state then transitioning to DEPTH_WRITE will cause a Hi-Z resolve. Starting in DEPTH_WRITE, for this example, is optimal.

Vulkan specific advice for optimal device access:

  • Always use VK_IMAGE_LAYOUT_{}_OPTIMAL for GPU access.
  • Only use VK_IMAGE_LAYOUT_GENERAL when really needed.
  • Only use VK_IMAGE_CREATE_MUTABLE_FORMAT_BIT when really needed.

UAVs and SSBOs

When dealing with resources that have both read- and write-access in a shader, such as UAVs and shader storage buffer objects (SSBOs), consider the following:

  • Access to read-only data is much more efficient than read/write data. Use read/write kinds of resources with caution, and when there are no better options.
  • Do not set a resource to use a SSBO in a Vulkan bind flag if the resource will never be bound as a UAV. This programming behavior may disable resource compression.

Anti-Aliasing

To get the best performance when performing multi-sample anti-aliasing, the following tips are recommended:

  • Minimize the use of stencil, or blend, when multi-sample anti-aliasing (MSAA) is enabled.
  • Avoid querying resource information from within a loop or branch where the result is immediately consumed or duplicated across loop iterations.
  • Minimize per-sample operations. When shading per sample, maximize the number of cases where any kill pixel operation is used to get the best surface compression.

We do recommend using optimized post-processing anti-aliasing, such as Temporal Anti-Aliasing, and/or Conservative Morphological Anti-Aliasing 2.0. Alternatively, advanced upscaling technologies such as Intel® Xe Super Sampling will produce high resolution, anti-aliased images.

Resource Barriers

Each resource barrier generally results in a cache flush, or GPU stall operation, that affects performance. Given that, the following guidelines are recommended:

  • Batch pipeline barriers. Use render passes to help properly batch barriers, and allow the driver to defer and hoist barriers to render pass edges.
  • If you need to use barrier-only command lists, submit them together with other work, preferably as trailing command lists at the end of an ExecuteCommandLists call. This avoids paying extra synchronization costs. Alternatively, submit them at the very start, as the first command list in an ExecuteCommandLists call.
  • Use implicit render pass barriers when possible.
  • Limit the number of resource transitions by batching them and avoid interleaving with dispatches/render passes.
  • Batch barriers with render target changes, and avoid states such as D3D12_RESOURCE_STATE_COMMON—unless necessary for presenting, context sharing, or CPU access.
  • Supply resources when possible for barriers; they allow more optimal GPU cache flushing, especially for aliasing buffers.
  • Employ split barriers when possible to allow for maximum predication of synchronizing events. Signaling should happen as early as possible and wait as late as possible.
  • When transitioning resource states, do not over-set them when not in use; it can lead to excessive cache flushes.
  • For Vulkan, only use transitions from UNDEFINED state when really needed.

Command Submissions

When working with command queues and buffers, the following are recommended:

  • Batch command list submission at ExecuteCommandLists in DirectX 12 and vkQueueSubmit in Vulkan when possible, but not to the point where the GPU is starved. This ensures efficient use of the CPU and GPU.
  • When filling command buffers or command queues, use multiple CPU cores when possible. This reduces single-core CPU bottlenecking of your application.
  • When using ExecuteIndirect
    • Use high max counts in each ExecuteIndirect call
    • Avoid ExecuteIndirect calls on buffers of size zero or one. For size zero use predication.
    • Generate arguments for many ExecuteIndirect calls together, then issue the ExecuteIndirect calls in batches. Generating a single argument followed by a single ExecuteIndirect in a loop is inefficient.

DirectX 12 recommendation:

  • Avoid the overuse of bundles, as they may incur additional CPU and GPU overhead.

Vulkan recommendations:

  • Use primary command buffers where possible, as these provide better performance due to internal batch buffer usage.
  • For primary command buffers, use VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT.
  • For primary command buffers, avoid VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT.
  • Minimize the use of secondary command buffers as they are less efficient than primary command buffers and not as efficient with depth-clears.

Optimizing Clear, Copy, and Update Operations

For the best performance on clear, copy, and update operations, follow these guidelines:

  • Use the API provided functions for clear, copy, and update operations, and refrain from implementing your own versions. Drivers have been optimized and tuned to ensure that these operations work with the best possible performance.
  • There is no “clear color” value in Vulkan. Clear the entire image (all layers) at the same time, rather than layer-by-layer. Enable hardware “fast clear” values as defined per API:
    • In DirectX 12, clear values are defined at resource creation as an argument with ID3D12Device::CreateCommittedResource. Use this defined clear-value for clear operations on that resource.
    • For Vulkan, use VK_ATTACHMENT_LOAD_OP_CLEAR and avoid using vkCmdClearColorImage() to clear all layers at the same time.
  • Copy depth and stencil surfaces only as needed, instead of copying both unconditionally; they are stored separately on Xe-HPG.
  • Batch the clear and copy operations and execute the barriers at the beginning and end of batches.

Geometry Transformation

To ensure that vertex and geometry shader functions operate optimally, consider the following guidelines:

  • Ensure all bound attributes are used. When a draw is bottlenecked on geometry work, reducing the number of attributes per vertex can improve performance.
  • Implement a level-of-detail system that allows flexibility in model accuracy by adjusting the number of vertices per model, per level-of-detail.
  • Implement efficient CPU occlusion culling to avoid submitting hidden geometry. This approach can save both CPU (draw submission) and GPU (render) time. We suggest using Intel’s highly optimized Masked Software Occlusion Culling for this. It can eventually be used in combination with finer-grain GPU culling.
  • Define input geometries as a structure of arrays for vertex buffers. Try to group position information vertex data in its own input slot to assist the tile binning engine for tile-based rendering.
  • The Xe-HPG vertex cache does not cache instanced attributes. For instanced calls, consider loading attributes explicitly in your vertex shader.
  • For full-screen post-processing passes, use a single triangle that covers the entire screen space, instead of two triangles to form a quad. The shared triangle edge in the two triangles use-case results in sub-optimal utilization of compute resources.
  • Optimize vertex attribute inputs to only include attributes that will be used in the vertex shader and downstream stages. This enables better use of bandwidth and space with the L2 cache.
  • Optimize transformation shaders (that is, vertex to geometry shader) to output only attributes that will be used by later stages in the pipeline. For example, avoid defining unnecessary outputs from a vertex shader that will not be consumed by a pixel shader. This enables better use of bandwidth and space with the L2 cache. This applies to vertex input as well by saving on memory bandwidth.
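The single full-screen triangle mentioned above is commonly generated from the vertex index alone, with no vertex buffer bound. The sketch below mirrors, in plain C++, the arithmetic a vertex shader might perform; the struct and function names are hypothetical, and the conventions (clip-space Y up, texture V down) follow Direct3D and are an assumption of this example.

```cpp
#include <cassert>

struct Float2 { float x, y; };
struct FullscreenVertex { Float2 clipPos; Float2 uv; };

// Returns the clip-space position and texture coordinate for vertex 0, 1, or 2
// of a triangle that covers the whole [-1, 1] x [-1, 1] viewport.
inline FullscreenVertex FullscreenTriangleVertex(unsigned vertexId) {
    assert(vertexId < 3);
    const float u = static_cast<float>((vertexId << 1) & 2u);  // 0, 2, 0
    const float v = static_cast<float>(vertexId & 2u);         // 0, 0, 2
    // Map uv in [0, 2] to clip space; the parts beyond [0, 1] are clipped away.
    return { { u * 2.0f - 1.0f, 1.0f - v * 2.0f }, { u, v } };
}
```

The three vertices land at (-1, 1), (3, 1), and (-1, -3), so the visible region receives texture coordinates from (0, 0) to (1, 1) and no interior edge crosses the screen.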

Shader Optimizations

The more computations and processing your shader code performs, the more important it is that you take steps to optimize the process for peak performance. For example, the frequency of performing complex computations has a big impact on the performance of your game.

General Shader Guidance

When writing shaders, look for these opportunities to optimize:

  • Xe-HPG supports double-rate FP16 math. Use lower precision when possible. Also, note that Xe-HPG removed FP64 support to improve power and performance. Make sure that you query hardware support for double precision and ensure a proper fallback.
  • Structure the shader to avoid unnecessary dependencies, especially high-latency operations, such as sampling or memory fetches.
  • For performance reasons try to avoid shader control flow based on results from sampling operations.
  • Aim for uniform execution of shaders by avoiding flow control based on non-uniform variables.
  • Implement early returns in shaders where the output of an algorithm can be predetermined or computed at a lower cost than that of the full algorithm.
  • Use shader semantics to flatten, branch, loop, and unroll wisely. It is often better to specify the desired unrolling behavior, rather than let the shader compiler make those decisions.
  • Branching is preferable if there are enough instruction cycles saved that outweigh the cost of branching.
  • Extended math and sampling operations have a higher weight, and may be worth branching (see Table 2 for issue rates).

Table 2. Xe-HPG Vector Engine (XVE) instruction issue rates

Instruction | Single Precision (ops/VE/clk) | Theoretical Cycle Count
FMAD | 8 | 1
FMUL | 8 | 1
FADD | 8 | 1
MIN/MAX | 8 | 1
CMP | 8 | 1
INV | 2 | 4
SQRT | 2 | 4
RSQRT | 2 | 4
LOG | 2 | 4
EXP | 2 | 4
POW | 1 | 8
IDIV | 1-6 | 1.33-8
TRIG | 2 | 4
FDIV | 1 | 8
  • Small branches of code may perform better when flattened.
  • Unroll conservatively. In most cases, unrolling short loops helps performance. However, unrolling loops does increase the shader instruction count. Unrolling long loops with high iteration counts can impact shader residency in instruction caches and therefore negatively impact performance.
  • Avoid extra sampler operations when it is possible that the sampler operation will later be multiplied by zero. For example, when interpolating between two samples, if there is a high probability of the interpolation factor being zero or one, a branch can be added to speed up the common case and perform the load only when needed.
  • Avoid querying resource information at runtime—for example, High-Level Shading Language (HLSL) GetDimensions calls to make decisions on control flow—or unnecessarily incorporating resource information into algorithms.
  • When passing attributes to the pixel shader, mark attributes that do not change per vertex within a primitive as constant.
  • For shaders where depth-test is disabled, use discard (or other kill operations) where output will not contribute to the final color in the render target. Blending can be skipped where the output of the algorithm has an alpha channel value of zero, or where shader inputs are zeros that negate the output.
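To make the branching trade-off above more concrete, the sketch below sums the theoretical cycle counts from Table 2 for a short instruction sequence and applies a first-order test for whether a branch is likely to pay off. All names are hypothetical, IDIV is left out because the table gives it as a range, and branchOverheadCycles is an assumed, workload-specific number that would have to be measured. The model ignores latency hiding, divergence, and sampler cost, so treat it as a starting point for profiling rather than a prediction.

```cpp
#include <cassert>
#include <initializer_list>

enum class Op { Fmad, Fmul, Fadd, MinMax, Cmp, Inv, Sqrt, Rsqrt, Log, Exp, Pow, Trig, Fdiv };

// Theoretical cycle counts copied from Table 2.
inline int TheoreticalCycles(Op op) {
    switch (op) {
        case Op::Fmad: case Op::Fmul: case Op::Fadd: case Op::MinMax: case Op::Cmp:
            return 1;
        case Op::Inv: case Op::Sqrt: case Op::Rsqrt: case Op::Log: case Op::Exp: case Op::Trig:
            return 4;
        case Op::Pow: case Op::Fdiv:
            return 8;
    }
    return 0;  // not reached for valid enumerators
}

// Sum of theoretical cycles for a straight-line instruction sequence.
inline int SequenceCycles(std::initializer_list<Op> ops) {
    int total = 0;
    for (Op op : ops) total += TheoreticalCycles(op);
    return total;
}

// Branch only if the expected cycles saved exceed the assumed branch cost.
inline bool BranchLikelyWorthIt(int skippableCycles, double skipProbability,
                                int branchOverheadCycles) {
    assert(skipProbability >= 0.0 && skipProbability <= 1.0);
    return skippableCycles * skipProbability > branchOverheadCycles;
}
```

For instance, a POW followed by two FMADs totals 10 theoretical cycles. With an assumed overhead of 4 cycles, skipping 24 cycles half of the time passes the test, while skipping 2 cycles half of the time does not, which matches the advice that small branches may perform better when flattened.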

Texture Sampling

To get the best performance out of textures and texture operations, please consider the following items:

  • When sampling from a render target, note that sampling across mip levels of the surface, with instructions such as sample_l/sample_b, is costly.
  • Use API defined and architecture-supported compression formats (that is, BC1-BC7) on larger textures to improve memory bandwidth utilization, and improve memory locality, when performing sampling operations.
  • Avoid dependent texture samples between sample instructions. For example, avoid making the UV coordinates of the next sample operation dependent upon the results of the previous sample operation. In this instance, the shader compiler may not be able to optimize or reorder the instructions, and it may result in a sampler bottleneck.
  • Avoid redundant and duplicate sampler states within shader code, and use static/immutable samplers, if possible.
  • Define appropriate resource types for sampling operation and filtering mode. Do not use volumetric surface as a 2D array.
  • For best performance when fetching from an array surface, ensure that the index is uniform across shader instances co-executing in SIMD lanes.
  • Avoid defining constant data in textures that could be procedurally computed in the shader, such as gradients.
  • Avoid anisotropic filtering on sRGB textures.
  • Sample_d provides gradient-per-pixels and throughput drops to one-fourth. Prefer sample_l unless anisotropic filtering is required.
  • When using VRS, anisotropic filtering may not be needed, as pixels in that draw will be coarser. Non-anisotropic filtering improves sampler throughput.

Constants

When defining shader constants, the following guidelines can help to achieve better performance:

  • Structure constant buffers to improve cache locality so that memory accesses all occur on the same cache line, which improves memory performance.
  • Favor constant access that uses direct access, since the offset is known at compile time, rather than indirect access, in which the offset must be computed at runtime. This benefits high-latency operations like flow-control and sampling.
  • Group the more frequently used constants for better cache utilization, and move them to the beginning of the buffer.
  • Organize constants by frequency of update, and only upload when the values change.
  • When loading data from buffers or structured buffers, organize the data access in such a way that all, or the majority, of the cache line is used. For example, if a structured buffer has ten attributes, and only one of those attributes is used for reading and/or writing, it would be better to split that one attribute into its own structured buffer.
  • Consider using ByteAddressBuffers when performing consecutive data loads, instead of loading data from a Typed Buffer. Those can be optimized by our shader compiler.
  • Developers will see the best performance in shaders that avoid sparsely referencing the constant data in constant buffers. Best performance will also be achieved if only up to two constant buffers are referenced in non-compute shaders.
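The structured-buffer example above (ten attributes, only one of them accessed) can be sketched as two CPU-side layouts. The struct names are hypothetical, and the percentage helper is only a rough proxy for how much of each fetched cache line is useful; it does not model the actual cache line size or hardware behavior.

```cpp
#include <cassert>
#include <cstddef>

struct Float4 { float x, y, z, w; };

// Original layout: ten attributes per element, only 'position' is read.
struct ParticleFull {
    Float4 position;
    Float4 other[9];  // nine attributes the hot shader never touches
};

// Split layout: the hot attribute lives in its own tightly packed buffer.
struct ParticlePosition {
    Float4 position;
};

static_assert(sizeof(ParticleFull) == 160, "ten float4 attributes expected");
static_assert(sizeof(ParticlePosition) == 16, "one float4 attribute expected");

// Percentage of each element's bytes that the shader actually consumes.
inline unsigned UsefulBytePercent(std::size_t bytesUsed, std::size_t elementStride) {
    assert(elementStride > 0 && bytesUsed <= elementStride);
    return static_cast<unsigned>(bytesUsed * 100 / elementStride);
}
```

With the combined layout only 10 percent of the bytes in each element are used by the shader; after the split, every byte fetched from the position buffer is used.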

Temporary Register Variable Usage

Each thread on a vector engine has its own set of registers to store values. The more work that can be done using these register operations, the more can be done to help reduce memory-penalties. However, if there are more temporary variables than registers, some of those variables will have to be stored in memory, where reading and writing have a latency cost. Avoiding this spillover can help to improve performance.

On Xe-HPG, reducing register pressure allows not only an increase in SIMD width but also significantly better code scheduling.

When writing shaders, there is only limited control of how registers get allocated. The following guidelines should be considered to help reduce spillover and improve performance:

  • Try to keep the number of temporaries to 16 or fewer per shader. This limits the number of register transfers to and from main memory, which has higher latency costs. Check the instruction set assembly code output and look for the spill count. Spills are a good opportunity for optimization, since eliminating them reduces the number of operations that depend on high-latency memory accesses. The spill count can be checked in Intel GPA by selecting a shader and viewing the machine code generated by the compiler.
  • If possible, move the declaration and assignment of a variable closer to where it will be referenced.
  • Weigh the options between full and partial precision on variables, as partial precision can store more values in the same space. Use caution when mixing partial precision with full precision in the same instruction, as it may cause redundant type conversions.
  • Move redundant code that is common between branches out of the branch. This can reduce redundant variable duplication.
  • Avoid non-uniform access to constant buffer/buffer data. Non-uniform access requires more temporary registers to store data per SIMD lane.
  • Control-flow decisions based on constant buffer data force the compiler to generate sub-optimal machine code. Instead, use specialization constants, or generate multiple specialized shader permutations when possible.
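
One way to organize specialized shader permutations on the host side is to treat each feature toggle as one bit of a permutation index, and compile one variant per index with matching preprocessor defines. The sketch below is a minimal illustration under that assumption. The macro names (`USE_FOG`, `USE_SHADOWS`, `USE_SKINNING`) and the helper `permutationDefines` are hypothetical, and the step that passes the defines to a shader compiler is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical feature toggles that would otherwise be branched on at runtime
// using constant buffer data.
const char* const kFeatureMacros[] = {"USE_FOG", "USE_SHADOWS", "USE_SKINNING"};
constexpr std::uint32_t kFeatureCount = 3;
constexpr std::uint32_t kPermutationCount = 1u << kFeatureCount;  // 8 variants

// Builds the preprocessor defines for one permutation; bit i enables feature i.
inline std::vector<std::string> permutationDefines(std::uint32_t permutation) {
    assert(permutation < kPermutationCount);
    std::vector<std::string> defines;
    for (std::uint32_t i = 0; i < kFeatureCount; ++i) {
        const bool enabled = ((permutation >> i) & 1u) != 0;
        defines.push_back(std::string(kFeatureMacros[i]) + (enabled ? "=1" : "=0"));
    }
    return defines;
}
```

At draw time the application would select the precompiled variant whose index matches the active features, so the shader itself contains no branch on those values. The number of variants doubles with each toggle, so this approach suits a small set of features.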

Compute Shader Considerations

When developing compute shaders, the following guidelines can help to achieve optimal performance when selecting thread-group sizes:

  • Pick thread-group sizes and dimensions that fit the nature of your workload’s memory access patterns. For instance, if your application accesses memory in a linear fashion, specify a linear dimension thread-group size, such as 64 x 1 x 1.
  • For two-dimensional thread groups, smaller thread-group sizes typically lead to better performance and achieve better vector engine thread occupancy.
  • Generally, a thread-group size of 8 x 8 performs well on Xe-HPG. In some cases, this may not be optimal due to memory access patterns and/or cache locality. In those cases, experiment with 16 x 16 or larger dimensions, and choose based on measured performance.
  • Thread-group sizes higher than or equal to 256 threads can cause thread occupancy issues.
  • Shared local memory variable-based atomics have fewer memory hierarchy implications, making them more efficient than atomics on UAVs.
  • For efficient use of Xe-cores, ensure thread-group sizes are a multiple of 32 threads total. This ensures fused Vector Engine pairs run efficiently.
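
The sizing guidelines above can be expressed as a few host-side helpers. This is a minimal sketch, not part of any API: `GroupSize`, `withinSuggestedGroupSize`, and `groupCount` are hypothetical names. The check only encodes the two numeric guidelines from this list (a total that is a multiple of 32 and stays below 256 threads).

```cpp
#include <cassert>
#include <cstdint>

// Thread-group dimensions as declared in the compute shader.
struct GroupSize {
    std::uint32_t x, y, z;
};

inline std::uint32_t threadsPerGroup(GroupSize g) {
    return g.x * g.y * g.z;
}

// True when the total thread count is a multiple of 32 and below 256.
inline bool withinSuggestedGroupSize(GroupSize g) {
    const std::uint32_t total = threadsPerGroup(g);
    return total > 0 && total % 32 == 0 && total < 256;
}

// Number of thread groups needed to cover `itemCount` items along one
// dimension (rounds up so a partial group covers the remainder).
inline std::uint32_t groupCount(std::uint32_t itemCount, std::uint32_t groupDim) {
    assert(groupDim > 0);
    return itemCount / groupDim + (itemCount % groupDim != 0 ? 1u : 0u);
}
```

For example, an 8 x 8 x 1 group over a 1920 x 1080 target needs `groupCount(1920, 8)` by `groupCount(1080, 8)` groups. A 16 x 16 x 1 group fails `withinSuggestedGroupSize` because it reaches 256 threads, which is a prompt to profile it rather than a rule against using it.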

When developing compute shaders that use Shared Local Memory (SLM), consider the following:

  • Load an array of float4 data in one bank of float4 types, rather than four banks of float arrays.
  • Try to keep variables in registers, rather than SLM, to save on memory access penalties.
  • Load and store data in such a manner that data elements that are consecutively accessed are located back-to-back. This allows read and write access to be coalesced, and to use memory bandwidth efficiently.
  • Use HLSL interlocked functions to perform min, max, OR, and other reductions, instead of moving data to and from SLM to perform the same reduction with a user-defined operation. The compiler can map HLSL functions to a hardware-implemented version.

Wave Intrinsics

Xe-HPG supports the use of wave intrinsics for both 3D and compute workloads. These can be used to write more efficient, register-based reductions, and to reduce reliance on global or local memory for communication across lanes. This allows threads within the thread-group to share information without the use of barriers, and to enable other cross-lane operations for threads in the same wave. While working with wave intrinsics, consider the following:

  • Do not write shaders that assume a specific machine-width. On Gen architecture, wave width can vary across shaders from SIMD8, SIMD16, and SIMD32, and is chosen by heuristics in the shader compiler. Because of this, use instructions such as WaveGetLaneCount() in algorithms that depend on wave size.
  • Wave operations can be used to reduce memory bandwidth by enabling access to data already stored in registers by other threads, instead of storing and re-loading results from memory. It is a great fit for optimizing operations such as texture mipmap generation.

Frame Presentation

For the best performance and compatibility across different Windows versions, it is recommended to use full-screen presentation modes, if possible. Other modes require an extra context switch and full-screen copy. On Windows 10 and Windows 11, it is possible to use full-screen, borderless windowed modes with no penalty, and no performance loss to the Desktop Window Manager. If in doubt, use tools such as PresentMon to determine which presentation mode is active, and GPUView (part of the Windows 10 SDK) to check for possible intervention by the Desktop Window Manager. Ensure that PresentMon reports “Hardware Composed: Independent Flip” (see Table 3), and that no activity is reported by DWM.exe in your ETW trace.

Table 3: Optimal Present Mode, as reported by PresentMon

Interval  PresentFlags  AllowsTearing  PresentMode
1         0             0              Hardware Composed: Independent Flip
1         0             0              Hardware Composed: Independent Flip
1         0             0              Hardware Composed: Independent Flip

  • For DirectX 11 and 12, it is recommended to use the flip model, if available.
  • For Windows 8, FLIP_SEQUENTIAL is recommended.
  • For Windows 10/11, FLIP_DISCARD is recommended.
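
The PresentMon check described in this section can be automated over a captured log. The following is a minimal sketch under simplifying assumptions: the log is plain comma-separated text with a header row, it contains a `PresentMode` column as shown in Table 3, and no field contains a quoted comma. The function name `allPresentsIndependentFlip` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

inline void stripTrailingCarriageReturn(std::string& s) {
    if (!s.empty() && s.back() == '\r') s.pop_back();
}

// Splits one CSV line on commas (no support for quoted fields).
inline std::vector<std::string> splitCsvLine(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, ',')) fields.push_back(field);
    return fields;
}

// Returns true only if the log has a PresentMode column, at least one data
// row, and every data row reports "Hardware Composed: Independent Flip".
inline bool allPresentsIndependentFlip(const std::string& csvText) {
    std::istringstream in(csvText);
    std::string line;
    if (!std::getline(in, line)) return false;
    stripTrailingCarriageReturn(line);
    const std::vector<std::string> header = splitCsvLine(line);
    std::size_t column = header.size();
    for (std::size_t i = 0; i < header.size(); ++i) {
        if (header[i] == "PresentMode") column = i;
    }
    if (column == header.size()) return false;  // column not found

    bool sawRow = false;
    while (std::getline(in, line)) {
        stripTrailingCarriageReturn(line);
        if (line.empty()) continue;
        const std::vector<std::string> row = splitCsvLine(line);
        if (row.size() <= column) return false;
        if (row[column] != "Hardware Composed: Independent Flip") return false;
        sawRow = true;
    }
    return sawRow;
}
```

A check like this could run as part of an automated performance test, so that a regression to a composed presentation mode is flagged without manual inspection of the log.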

Multi-Queue Support Recommendations

Xe-HPG supports the use of queues, which can concurrently have both 3D and compute workloads resident in the threads of each Xe-core. There is also a copy engine present for parallel copies. When sharing resources across queues, consider the following recommendations:

  • Always profile code to see if using dual-queue support for asynchronous compute will provide a benefit.
  • Use COMMON state transitions when possible.
  • Use a dedicated copy queue to fetch resources when it makes sense.
  • Vulkan: Use vkGetPhysicalDeviceQueueFamilyProperties to enumerate queues to see what kind of queues there are, and how many can be created per family.
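
Once the queue families have been enumerated, picking a dedicated copy or compute family is a matter of filtering on the capability flags. The sketch below shows that selection logic only. To stay compilable without the Vulkan SDK it uses local stand-ins: `QueueFamily` mirrors the `queueFlags` and `queueCount` fields of `VkQueueFamilyProperties`, and the three constants mirror the values of the corresponding `VK_QUEUE_*_BIT` flags. `findFamily` is a hypothetical helper. In real code the array would be filled by `vkGetPhysicalDeviceQueueFamilyProperties`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins for the Vulkan flag values (normally from <vulkan/vulkan.h>).
constexpr std::uint32_t kQueueGraphicsBit = 0x1;  // VK_QUEUE_GRAPHICS_BIT
constexpr std::uint32_t kQueueComputeBit  = 0x2;  // VK_QUEUE_COMPUTE_BIT
constexpr std::uint32_t kQueueTransferBit = 0x4;  // VK_QUEUE_TRANSFER_BIT

// Stand-in for the subset of VkQueueFamilyProperties used here.
struct QueueFamily {
    std::uint32_t queueFlags;
    std::uint32_t queueCount;
};

constexpr int kNoFamily = -1;

// Returns the index of the first family that exposes every `required` bit,
// none of the `excluded` bits, and at least one queue; kNoFamily otherwise.
inline int findFamily(const std::vector<QueueFamily>& families,
                      std::uint32_t required, std::uint32_t excluded) {
    for (std::size_t i = 0; i < families.size(); ++i) {
        const QueueFamily& f = families[i];
        const bool hasRequired = (f.queueFlags & required) == required;
        const bool hasExcluded = (f.queueFlags & excluded) != 0;
        if (hasRequired && !hasExcluded && f.queueCount > 0) {
            return static_cast<int>(i);
        }
    }
    return kNoFamily;
}
```

A dedicated copy family would be requested with `findFamily(families, kQueueTransferBit, kQueueGraphicsBit | kQueueComputeBit)`, and an asynchronous compute family with `findFamily(families, kQueueComputeBit, kQueueGraphicsBit)`. When either call returns `kNoFamily`, the application should fall back to the general-purpose family.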

CPU Performance Considerations

  • The Intel DirectX and Vulkan drivers utilize threading to perform work parallelization. Fully subscribing all available CPU hardware threads with application work can result in starvation of these helper threads and significantly reduce performance. Query the available hardware threads on the system your application runs on, and pay attention to thread oversubscription.
  • The DirectX 12 driver relies heavily on background shader optimizations. Do NOT disable this thread with DISABLE_BACKGROUND_WORK or DISABLE_PROFILING_BY_SYSTEM in ID3D12Device6::SetBackgroundProcessingMode(). Doing so significantly increases loading times and can significantly reduce GPU performance, because these optimizations can no longer be performed. Also, ensure you leave at least one CPU hardware thread available to perform this work; do not fully subscribe every available hardware thread on the CPU with application work.
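
The advice to leave headroom for driver threads can be applied when sizing an application's worker pool. The following is a minimal sketch: `workerThreadCount` and `defaultWorkerThreadCount` are hypothetical helpers, and reserving exactly one hardware thread is an assumption based on the "at least one" guidance above. The right number for a given title should come from profiling.

```cpp
#include <cassert>
#include <thread>

// Picks an application worker count that leaves `reservedForDriver` hardware
// threads free, while always returning at least one worker.
inline unsigned workerThreadCount(unsigned hardwareThreads, unsigned reservedForDriver) {
    assert(reservedForDriver >= 1);
    // std::thread::hardware_concurrency() may return 0 when the count is unknown.
    if (hardwareThreads == 0) return 1;
    if (hardwareThreads <= reservedForDriver) return 1;
    return hardwareThreads - reservedForDriver;
}

// Convenience wrapper: query the system and reserve one hardware thread.
inline unsigned defaultWorkerThreadCount() {
    return workerThreadCount(std::thread::hardware_concurrency(), 1);
}
```

On a system that reports 8 hardware threads, `defaultWorkerThreadCount()` yields 7 workers, which leaves one hardware thread for the driver's background work.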

Additional Resources

Software and Samples

GitHub Intel Repository

Intel® Game Dev

GPU Detect Sample

Fast ISPC Texture Compressor Sample

Tools

Intel GPA

Intel VTune Profiler

Graphics API

Direct3D* Website – DirectX 12 and other DirectX resources

Vulkan – Khronos* site with additional resources