
Dealing with GPU crashes

Aug 2, 2021. Knowledge

Sometimes when you get a crash, the callstack contains messages such as:
“GPUCrash - exiting due to D3D device being lost - D3D Hung”.
“DXGI_ERROR_DEVICE_REMOVED with Reason: DXGI_ERROR_DEVICE_HUNG”

These messages are difficult to act on because they indicate that the GPU crashed, and GPU crashes are generally harder to debug than CPU crashes.

What can I do?

You can send engineers callstacks (at least the Game Thread, Render Thread and RHI Thread) and log files via UDN, which will hopefully contain information that helps explain what is happening. Unfortunately, when you get a GPU crash, the CPU callstack does not point to the real cause of the crash; it only indicates what the CPU was doing when the GPU crash happened. On its own, it therefore does not provide actionable information.

The best way to proceed in this case is to run UE with the -gpucrashdebugging flag and see if the log contains useful information. After that, you can also run UE with the -d3ddebug flag, which could give you further clues. It is strongly recommended not to use -d3ddebug and -gpucrashdebugging together; pick one or the other for each run. Ideally you should send engineers both logs, running the engine separately with each of these flags. UE logs are saved in [MyProject]/Saved/Logs.
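The two separate runs described above can be sketched as a pair of command lines, one per flag. This is only an illustration: the executable and project paths below are hypothetical placeholders, and the only flags used are the two named in this article.

```python
from pathlib import PureWindowsPath

# Hypothetical paths: replace them with your own installation and project.
EDITOR_EXE = r"C:\UE\Engine\Binaries\Win64\UnrealEditor.exe"
PROJECT = r"C:\Projects\MyProject\MyProject.uproject"

# One flag per run, as recommended above; never both in the same session.
DEBUG_FLAGS = ["-gpucrashdebugging", "-d3ddebug"]


def build_runs(editor, project, flags=DEBUG_FLAGS):
    """Return one argument list per flag, so each run uses exactly one debug flag."""
    return [[editor, project, flag] for flag in flags]


def log_dir(project):
    """Logs are written to [MyProject]/Saved/Logs, next to the .uproject file."""
    return PureWindowsPath(project).parent / "Saved" / "Logs"
```

Each list returned by build_runs(EDITOR_EXE, PROJECT) can be passed to subprocess.run on the affected machine. Copy the contents of log_dir(PROJECT) after each run so the two logs are not mixed up.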

Windows generates dump files that can be helpful, so gathering them is also a good idea. Ask Epic engineers if you need to know more about how to get these dump files.

Usually GPU crashes can happen for any of the following reasons:

  • The GPU runs out of memory
  • The GPU times out while doing an expensive operation (TDR event)
  • A bug in engine code
  • A bug in the driver
  • A bug in the OS
  • A problem in the hardware (very unlikely)

There are a number of things you can do to help identify which of the above is the underlying cause:

The GPU runs out of memory (OOM)
If the GPU runs out of memory, it could crash. Whether it does depends on the RHI you are using: some are more resilient than others and, in the case of an OOM event, they just get very slow instead of dying. To find out how much memory your graphics card is using, open the Task Manager, go to the Performance tab, select the GPU and check the memory consumption before and during the crash.
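If you prefer to record memory usage over time instead of watching the Task Manager, a script can poll the GPU. The sketch below assumes an NVIDIA card with the nvidia-smi tool on the PATH, which this article does not require; other vendors need a different tool.

```python
import subprocess

# Assumption: an NVIDIA GPU with nvidia-smi available on the PATH.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line):
    """Parse one 'used, total' line (values in MiB) into two integers."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def usage_fraction(used_mib, total_mib):
    """Fraction of GPU memory in use, between 0.0 and 1.0."""
    return used_mib / total_mib


def query_gpu_memory():
    """Run nvidia-smi once and return a (used, total) pair for each GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_memory_line(line) for line in out.splitlines() if line.strip()]
```

Calling query_gpu_memory() every few seconds while reproducing the crash shows whether usage climbs toward the card's total just before the device is lost.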

If you are close to your memory limit, that is possibly the problem. In that case, try to reduce memory usage. To do so, you can do the following:

  • Simplify the scene (use lower resolution textures, lower resolution meshes, etc)
  • Render at lower resolution
  • If you are working in editor and have multiple viewports, close all but one.
  • Do not disable features that use extra memory, such as Niagara or ray tracing. If the crash goes away after doing so, you might attribute it to the memory reduction, but bypassing these components changes many other things and could lead you to invalid conclusions.

The GPU times out while doing an expensive operation (TDR event)

When the CPU sends a command to the GPU, a timer tracks how long the GPU needs to complete the operation. If the operation takes too long (2 seconds by default on Windows), the operating system resets the driver, which results in a GPU crash. This is called a TDR event (Timeout Detection and Recovery).
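On Windows, this timeout is stored in the registry as the TdrDelay value (in seconds) under the GraphicsDrivers key, alongside the related TdrDdiDelay value. As a hedged illustration, the sketch below builds the text of a .reg file that sets both. The function name and the 60-second figure are assumptions chosen for the example, not values prescribed by this article.

```python
# Registry key that holds the Windows TDR settings.
GRAPHICS_DRIVERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
)


def tdr_reg_file(tdr_delay_s, tdr_ddi_delay_s):
    """Build the text of a .reg file that sets both TDR timeouts, in seconds."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{GRAPHICS_DRIVERS_KEY}]",
        # DWORD values are written as 8 hexadecimal digits.
        f'"TdrDelay"=dword:{tdr_delay_s:08x}',
        f'"TdrDdiDelay"=dword:{tdr_ddi_delay_s:08x}',
    ]
    return "\r\n".join(lines) + "\r\n"


# Example only: 60 seconds is an illustrative value.
EXAMPLE_REG_TEXT = tdr_reg_file(60, 60)
```

Importing such a file requires administrator rights, and the new values only take effect after a reboot. A longer timeout hides the symptom rather than removing the expensive operation, so it is best treated as a diagnostic step.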

Ideally the engine should never send the GPU so much work that it triggers a TDR event; it should be able to split the task into smaller chunks so that TDR is avoided. However, real life is not as beautiful and TDR events happen. To avoid them, you can increase the TDR value in the Windows registry so that the GPU driver is not reset. You can find more information here:

TDR and Ray Tracing
Ray tracing is particularly costly so it is more likely to trigger TDR events when it is enabled.

Some expensive ray tracing passes (e.g. RTGI at very large resolution) could take a long time and therefore could trigger TDR events. The most expensive ray tracing passes (GI and reflections) provide a way to render the pass in tiles instead of in a single pass, through the following CVars:

r.RayTracing.GlobalIllumination.RenderTileSize
r.RayTracing.Reflections.RenderTileSize

When the tile size of a pass is greater than zero, that pass is rendered in NxN pixel tiles, with each tile submitted as a separate GPU command buffer. This allows high quality rendering without triggering timeout detection. The default is 0, which disables tiling.
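To get a feel for how a tile size splits the work, the sketch below counts the NxN tiles needed to cover a render target and formats the matching console commands. The tile size of 256 and the helper names are assumptions for illustration only; a suitable value depends on your scene and hardware.

```python
import math


def tile_count(width, height, tile_size):
    """Number of NxN tiles needed to cover a width x height render target."""
    if tile_size <= 0:
        return 1  # 0 disables tiling: the pass is rendered in one piece
    return math.ceil(width / tile_size) * math.ceil(height / tile_size)


def tile_cvars(tile_size):
    """Console commands that set both tile-size CVars named above."""
    return [
        f"r.RayTracing.GlobalIllumination.RenderTileSize {tile_size}",
        f"r.RayTracing.Reflections.RenderTileSize {tile_size}",
    ]


# A 3840x2160 target with 256-pixel tiles needs 15 x 9 = 135 tiles.
EXAMPLE_TILES = tile_count(3840, 2160, 256)
```

Smaller tiles mean more, shorter GPU submissions, which lowers the risk of any single one exceeding the timeout, at the cost of some submission overhead.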

Again, this is something the engine should handle internally and engineers will continue working to minimize TDR events as much as possible.

A bug in engine code
Bugs in the engine can cause GPU crashes. UE is quite large so some preliminary A/B testing helps a lot. These are some of the things you can do:

  • Run the engine with -gpucrashdebugging and -d3ddebug as described above (reminder: better use these flags separately).
  • Run with -onethread -forcerhibypass. This forces UE to run with one thread only and helps determine whether the underlying problem is a threading/timing issue.
  • Run with r.RDG.Debug=1, which might give you information about render passes that have not been set up properly.
  • Run with r.RDG.ImmediateMode=1, which forces the RenderGraph (RDG) to execute passes immediately after creation and can give you more meaningful callstacks. This changes other things under the hood and can be a red herring factory, but it is still worth doing.
  • Switch to a different RHI. If you are in DX12 you can switch to DX11 or vice versa. Check whether the crash only happens in one RHI; that could help engineers identify whether the problem is at a higher or lower level. Note that some features only work with specific RHIs (e.g. ray tracing does not work in DX11).
  • A/B test your scene
    • Turn rendering passes on/off and check if the crash still happens. Many times the problem is a faulty pass, and doing this can give good clues about what is going on.
    • Turn rendering features on/off: Lumen, Nanite, ray tracing, etc. (some of these require a restart)
    • Hide/Show specific objects. The problem could be a specific asset.
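When many objects or passes are candidates, the A/B testing above goes faster if you halve the candidate set at each step instead of toggling items one at a time. The sketch below illustrates that bisection. It assumes a single culprit and a crash that reproduces reliably, and the crashes callback is hypothetical: in practice it stands for you enabling only that subset and observing the result.

```python
def find_culprit(items, crashes):
    """Bisect items down to the single element that makes crashes() return True.

    crashes(subset) is a hypothetical callback: it should return True when
    the crash reproduces with only that subset enabled or visible.
    """
    candidates = list(items)
    if not candidates or not crashes(candidates):
        return None  # nothing to test, or the crash does not reproduce at all
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Keep whichever half still reproduces the crash.
        candidates = half if crashes(half) else candidates[len(half):]
    return candidates[0]


# Stand-in for manual testing: pretend the asset named "Water" is at fault.
assets = ["Rock", "Tree", "Water", "Fog", "Car"]
culprit = find_culprit(assets, lambda subset: "Water" in subset)
```

With N candidates this needs roughly log2(N) test runs. If the crash depends on a combination of items, or does not reproduce every time, the result is only a hint and should be confirmed by toggling that item on its own.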

A bug in the driver

It is worth investigating all previously mentioned possibilities before coming to this conclusion. Try to get drivers up to date and check with engineers and QA if you are using a driver that has known issues.

A bug in the OS

It is worth investigating all previously mentioned possibilities before coming to this conclusion. For the specific case of Windows, the strongly recommended version is 20H2. To find out which version you are running, press the Windows key and type “winver”.

