5.5 - Advice on Aftermath crash report gathering

Hello,

We have a released game with an integrated crash report system. GPU crashes are one of the largest crash categories we see, but we have poor visibility into them and generally cannot reproduce them internally. I am hoping to get some advice, particularly around Aftermath usage and getting data from end users.

I have read through a bunch of UDN threads, particularly this one [Content removed] but there are still a few things I would like to ask for clarification on.

We are on UE 5.5.4 but I have cherry-picked CL#39543979 and CL#39482595 as recommended here (https://dev.epicgames.com/community/learning/knowledge-base/j2yV/unreal-engine-ue-5-5-x-most-common-rendering-issues).

I can run with `-nvaftermathall`, force a GPU crash, and get both a `*.nv-gpudmp` and a `*.nvdbg`. With these I can successfully correlate a crash back to the source using symbols I generated, which is great.

  • Am I correct in thinking that I do need the `.nvdbg` to be able to correlate the crash back to the source shader, or is there a way to do that without it? (e.g. with the aftermath shaderhash). We have a bunch of already collected dumps with no `.nvdbg` and it would be nice to get anything useful from there if possible.
  • The aftermath documentation mentions that “GFSDK_Aftermath_FeatureFlags_GenerateShaderDebugInfo” is expensive, is that true even when it is deferred with “GFSDK_Aftermath_GpuCrashDumpFeatureFlags_DeferDebugInfoCallbacks” like in this implementation?
  • Would it be reasonable to turn this on in a shipped build so we could collect these from end users, or should we expect that it is prohibitively expensive?
  • Is there anything else obvious I might be missing with Aftermath that would be helpful to us?

Thanks,

Andy

Hello,

Sharing our experience with Aftermath as well, in case it helps.

We are on Unreal Engine 5.4 with CLs cherry-picked from 5.5 and 5.6 to get Aftermath working for us.

We always create Aftermath with the GFSDK_Aftermath_FeatureFlags_EnableResourceTracking and GFSDK_Aftermath_FeatureFlags_GenerateShaderDebugInfo flags.

Some other flags give extra information but impact performance significantly. The two mentioned above seem fine: we could not measure any meaningful performance impact, and we even have them enabled in shipping builds.

We run with GFSDK_Aftermath_GpuCrashDumpFeatureFlags_DeferDebugInfoCallbacks. Even though the docs mention it has a higher impact than the default behavior, we did not find any perf impact with it.

We are in conversations with Nvidia as well, and it looks like we need the .nvdbg file in order to map the shaders correctly. They suggested filling in all the callbacks for GFSDK_Aftermath_GpuCrashDump_GenerateJSON, but we are not seeing any difference at all when using them.

We also expected that with all the callbacks filled in we would not need the .nvdbg file, since it already has all the information needed, but sadly that is not the case.

Hi Andy, Enrique,

I also recommend integrating 39634316, and more recently 43919074, if you haven’t already done so.

“Am I correct in thinking that I do need the `.nvdbg` to be able to correlate the crash back to the source shader, or is there a way to do that without it? (e.g. with the aftermath shaderhash). We have a bunch of already collected dumps with no `.nvdbg` and it would be nice to get anything useful from there if possible.”

You do not need the .nvdbg files to correlate the source file in question; however, you do need them to correlate lines/columns of code. Please note the “Active Shaders” dumped to the log on crashes with valid Aftermath dumps. Depending on what is available, you’ll see the set of faulting shaders with the following metadata:

  • DebugName, should RHI_INCLUDE_SHADER_DEBUG_DATA be enabled.
  • Function, the entrypoint name of the shader that crashed (if in the DXIL metadata)
  • Shader PDB, if compiled with r.Shaders.Symbols=1
  • Shader Hash, always dumped, can be used for cross referencing

From the list of cooked shaders, assuming none of the other options above worked, you can generally associate the Shader Hash with your cooked data.

“The aftermath documentation mentions that “GFSDK_Aftermath_FeatureFlags_GenerateShaderDebugInfo” is expensive, is that true even when it is deferred with “GFSDK_Aftermath_GpuCrashDumpFeatureFlags_DeferDebugInfoCallbacks” like in this implementation?”

Yes, it’s quite expensive even with DeferDebugInfoCallbacks enabled. Whereas GenerateShaderDebugInfo on its own dumps the binaries immediately, adding DeferDebugInfoCallbacks keeps them in memory until a crash occurs. I have observed up to ~1 GB of additional memory usage.

Should this option be enabled, the relevant .nvdbg files will be included in the crash reports.

“Would it be reasonable to turn this on in a shipped build so we could collect these from end users, or should we expect that it is prohibitively expensive?”

While I do not recommend it, if you have measured no difference with your builds (particularly memory-wise), then it’s an option.

“Is there anything else obvious I might be missing with Aftermath that would be helpful to us?”

The DebugName / Function / Shader PDB / Hash mentioned above is a reasonable way to bucket crashes. We are planning to compile with debug information by default to greatly improve GPU crash handling, with some internal conversations on tooling changes needed to streamline things.
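As a rough illustration of bucketing on that metadata, here is a minimal C++ sketch. The `ActiveShaderInfo` struct, the `MakeBucketKey` function, and the preference order (DebugName, then Shader PDB, then Shader Hash with the entry point appended) are assumptions made for this example, not engine types or an engine policy.

```cpp
#include <optional>
#include <string>

// Hypothetical record for one "Active Shaders" log entry.
// Field names are illustrative only; they are not engine types.
struct ActiveShaderInfo {
    std::optional<std::string> DebugName;  // only with RHI_INCLUDE_SHADER_DEBUG_DATA
    std::optional<std::string> Function;   // entry point, if in the DXIL metadata
    std::optional<std::string> ShaderPdb;  // only with r.Shaders.Symbols=1
    std::string ShaderHash;                // always dumped
};

// Returns true when the optional holds a non-empty string.
static bool HasText(const std::optional<std::string>& Value) {
    return Value.has_value() && !Value->empty();
}

// Pick the most descriptive field available as the bucket key.
// Each key is prefixed with its kind so that a DebugName can never
// collide with a PDB name or a hash.
std::string MakeBucketKey(const ActiveShaderInfo& Info) {
    if (HasText(Info.DebugName)) {
        return "name:" + *Info.DebugName;
    }
    if (HasText(Info.ShaderPdb)) {
        return "pdb:" + *Info.ShaderPdb;
    }
    // The hash is always present. The entry point alone is too generic to
    // bucket on, so it is only appended as extra detail.
    std::string Key = "hash:" + Info.ShaderHash;
    if (HasText(Info.Function)) {
        Key += "@" + *Info.Function;
    }
    return Key;
}
```

Falling back to the hash keeps every crash in some bucket, even for dumps collected without any debug data.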

Hi Miguel,

Thanks for the response!

I have brought in those extra CLs, so thank you for the recommendation there. I will note I also had to bring in 39641811 for some missing foundational changes.

I have done some more experimentation with this myself. With the Aftermath debug info enabled we also see ~1 GB of extra memory usage, so for now we are enabling this only for users in the largest memory bucket, who should hopefully have plenty of spare memory, and will see how that goes.
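For anyone wanting to do something similar, the gating can be sketched roughly as below. This is only an illustration: the `Sketch_*` flag values are stand-ins (a real integration would use the GFSDK_Aftermath_FeatureFlags_* values from the Aftermath SDK header), and `ChooseFeatureFlags` and the threshold parameter are hypothetical names, not our actual code.

```cpp
#include <cstdint>

// Stand-in bit flags for this sketch only. A real integration would use the
// GFSDK_Aftermath_FeatureFlags_* values from the Aftermath SDK header.
enum SketchFeatureFlags : std::uint32_t {
    Sketch_None                    = 0,
    Sketch_EnableResourceTracking  = 1u << 0,
    Sketch_GenerateShaderDebugInfo = 1u << 1,
};

constexpr std::uint64_t kGiB = 1024ull * 1024ull * 1024ull;

// Add shader debug info generation only when the machine has at least
// MinBytesForDebugInfo of physical memory. The threshold is a tuning value
// chosen by the caller (hypothetical; pick it from your own memory buckets).
std::uint32_t ChooseFeatureFlags(std::uint32_t BaseFlags,
                                 std::uint64_t PhysicalMemoryBytes,
                                 std::uint64_t MinBytesForDebugInfo) {
    std::uint32_t Flags = BaseFlags;
    if (PhysicalMemoryBytes >= MinBytesForDebugInfo) {
        Flags |= Sketch_GenerateShaderDebugInfo;
    }
    return Flags;
}
```

For example, `ChooseFeatureFlags(Sketch_EnableResourceTracking, 16 * kGiB, 32 * kGiB)` leaves debug info off, while a 32 GiB machine with the same threshold gets it enabled.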

I was able to correlate a bunch of pre-existing Aftermath dumps (from before these cherry-picks) back to their source shaders using only the shader hash: I dumped a list of “Aftermath Hash -> Shader PDB hash”, then used "r.Shaders.SymbolInfo=1" to generate “ShaderSymbols.info” and map each PDB hash back to the source shader.

Not necessarily the most useful without line numbers etc. but I was at least able to do some bucketing, and hopefully we’ll have even more info going forwards.
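The two-step lookup can be sketched in C++ roughly as follows. It assumes both lists have first been flattened to plain `<key> <value>` text lines, which is a layout invented for this example; the real “ShaderSymbols.info” format may differ and would need its own parser. `ParseKeyValueLines` and `ResolveShader` are hypothetical helper names.

```cpp
#include <optional>
#include <sstream>
#include <string>
#include <unordered_map>

using HashMap = std::unordered_map<std::string, std::string>;

// Parse lines of "<key> <value>" into a map. The key is everything up to the
// first space or tab; the value is the rest of the line. Lines without both
// parts are skipped. This layout is an assumption for the sketch.
HashMap ParseKeyValueLines(const std::string& Text) {
    HashMap Out;
    std::istringstream Stream(Text);
    std::string Line;
    while (std::getline(Stream, Line)) {
        if (!Line.empty() && Line.back() == '\r') {
            Line.pop_back();  // tolerate Windows line endings
        }
        const auto Split = Line.find_first_of(" \t");
        if (Split == std::string::npos || Split == 0) {
            continue;
        }
        const auto ValueStart = Line.find_first_not_of(" \t", Split);
        if (ValueStart == std::string::npos) {
            continue;
        }
        Out[Line.substr(0, Split)] = Line.substr(ValueStart);
    }
    return Out;
}

// Chain the two maps: Aftermath hash -> shader PDB hash -> source shader.
// Returns std::nullopt if either step has no entry.
std::optional<std::string> ResolveShader(const HashMap& AftermathToPdb,
                                         const HashMap& PdbToShader,
                                         const std::string& AftermathHash) {
    const auto Pdb = AftermathToPdb.find(AftermathHash);
    if (Pdb == AftermathToPdb.end()) {
        return std::nullopt;
    }
    const auto Shader = PdbToShader.find(Pdb->second);
    if (Shader == PdbToShader.end()) {
        return std::nullopt;
    }
    return Shader->second;
}
```

Returning an empty optional on a miss makes it easy to count how many old dumps could not be resolved at either step.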

Thanks for the advice!

Andy