Registering interest in custom crash reporter backend that can handle GPU crashes

The majority of our crashes on PC are GPU crashes, and that has always been the case. We use, and have used, several of the “commonly used” crash reporting services: https://dev.epicgames.com/documentation/en-us/unreal-engine/crash-reporting-in-unreal-engine#crashreportserver

None of them do a good job of de-duplicating GPU crashes. Since we can’t reproduce the GPU crashes our users are experiencing, debugging them has been incredibly difficult and time-intensive. We don’t even know how common each crash is. As a result, we don’t even start on them, because we don’t know whether we would be investing time in the right crash.

With the recent focus on improving the Aftermath integration in the engine, we have been wondering whether we could upload shader symbols to the crash reporting service so that it can properly categorize GPU crashes.

It appears a custom crash reporting backend will be required to achieve this.

From past posts, it appears that Epic made the code for a custom backend available at some point: Game crash report - #2 by anonymous_user_76ae4db4

But it has since been deprecated and removed: [Content removed]

I’m writing to:

  1. Hear from the community if someone was able to solve the challenge of de-duplicating GPU crashes
  2. Hear from Epic if there are any plans to make at least parts of their crash reporting backend code available to the Unreal Engine community, ideally as a well-documented tool like UGS, Horde, etc., or at least raise awareness that interest in such a solution exists.

Hi,

Epic does not provide the crash report backend. As far as I’m aware, we don’t have plans to make it available either. You can check the alternatives at the end of this page: https://dev.epicgames.com/documentation/en-us/unreal-engine/crash-reporting-in-unreal-engine

Regards,

Patrick

Thanks, Patrick. I understand Epic isn’t providing the crash report backend. My apologies if my initial message wasn’t clear, but I’m really just trying to raise awareness internally at Epic that there’s a clear need within the Unreal Engine community for a better solution to de-duplicating GPU crashes, particularly one that could handle shader symbols.

I’m hoping that by highlighting this, it might prompt some internal discussion about the possibility of making parts of a solution available in the future.

Hi,

GPU crashes are a common cause of crashes internally too. I checked our crash database for the Editor and Fortnite, and we de-duplicate (or group) GPU crashes by the CPU callstack. So even internally, we don’t yet have a good way to precisely de-duplicate GPU crashes at large scale. I reached out to the rendering team to see if they are working on a possible solution and to learn which strategy they use to decide which GPU crashes to investigate.

Regards,

Patrick

Hi,

I have new information. In UE 5.5, the rendering team added GPU breadcrumbs to the crash report when generating the CrashContext.runtime-xml file; an example is shown below. Our crash backend can also group crashes using the GPU breadcrumbs, so we get another way to group similar crashes (de-duplicate them). You can search for SetGPUBreadcrumbs or GPUBreadcrumbs in the C++ code to find where the engine adds this info during a crash.

CrashContext.runtime-xml

`
526BDA74-7A81-44C3-B0FD-9DBF80973C25

RHI Graphics Queue 0 3E70EC82449375E359F774DE4A88AEED7A9EF29E 56A90E82A1DE40C016E305BAF31E608F77B0DADD {{Frame 10001},A,{{{SceneRender - ViewFamilies},A,{{{RenderGraphExecute - %s},A,{{{Scene},A,{{{VirtualTextureUpdate},A}}}}}}},{{RenderGraphExecute - Slate},N}}}
`

Regards,

Patrick
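For anyone who wants to group reports by these breadcrumbs outside of Epic's backend, here is a minimal Python sketch of one way to turn the breadcrumb string into a grouping key. It is not Epic's implementation. It assumes the `{name},X` layout seen in the sample above, assumes the state letter `A` marks a scope that was active at the time of the crash (the thread does not define the letters), and drops `Frame N` markers so that the frame counter does not split otherwise identical crashes. The function names are hypothetical.

```python
import hashlib
import re

# One "{name},S" marker, where S is a single state letter (format assumed
# from the sample in this thread, not from engine documentation).
MARKER = re.compile(r"\{([^{}]+)\},([A-Z])")
# "Frame 10001" style markers change every frame, so they are excluded.
FRAME = re.compile(r"^Frame \d+$")


def breadcrumb_markers(text):
    """Return (name, state) pairs in the order they appear in the string."""
    return [(m.group(1), m.group(2)) for m in MARKER.finditer(text)]


def breadcrumb_signature(text, active_state="A"):
    """Join the names of all markers in the assumed 'active' state."""
    names = [
        name
        for name, state in breadcrumb_markers(text)
        if state == active_state and not FRAME.match(name)
    ]
    return " > ".join(names)


def signature_key(text):
    """Short, stable hash of the signature, usable as a grouping key."""
    digest = hashlib.sha1(breadcrumb_signature(text).encode("utf-8"))
    return digest.hexdigest()[:12]


if __name__ == "__main__":
    sample = (
        "{{Frame 10001},A,{{{SceneRender - ViewFamilies},A,"
        "{{{RenderGraphExecute - %s},A,{{{Scene},A,"
        "{{{VirtualTextureUpdate},A}}}}}}},{{RenderGraphExecute - Slate},N}}}"
    )
    print(breadcrumb_signature(sample))
    print(signature_key(sample))
```

For the sample above, the signature is the chain from `SceneRender - ViewFamilies` down to `VirtualTextureUpdate`; the `RenderGraphExecute - Slate` marker is skipped because its state is `N`. The sketch flattens the nesting rather than rebuilding the tree, which is enough for grouping but would need a real parser if sibling scopes on the same level have to be told apart.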

This is great, Patrick, and with a custom crash reporting backend like yours it will be valuable.

The crash reporting services listed in your documentation don’t use this information, though. We’re seriously considering building our own for that reason, and anything that helps us get there would be much appreciated.

Hi,

I cannot provide any code, but I know that just maintaining a crash report service like ours is very expensive. About 3 years ago, we wanted to drop our custom system and use a commercial one to reduce operational costs, but we didn’t find anything that could do what the existing backend was doing at the scale we needed. So we kept developing our own solution. We have a few people dedicated to keeping it running, fixing live issues and handling new feature requests. There is also the cost of keeping those crash reports and their artifacts stored somewhere (cloud in our case) for some time, plus the bandwidth to move them around.

I assume you will not have to run at the same scale as our system in terms of platforms supported, applications supported and reports ingested per second, so you may be able to implement something that has a reasonable cost. However, I reviewed the proposed solutions (Sentry, BugSplat, Backtrace) and I’m not 100% sure they cannot handle GPUBreadcrumbs. BugSplat and Backtrace handle the attributes of the CrashContext.runtime-xml. Their docs don’t explicitly mention GPUBreadcrumbs, but this is recent, so the docs might be outdated. I’d definitely experiment a bit with those systems, spending a few days to set one up and check what is possible, or contact their support to ask how you can query the crash attributes. It might save you a lot of time, work and money.

Otherwise, you can probably implement something that would not be too costly. For example, you may be able to keep the last 500,000 PC crashes for your game on a company server, with a small SQLite database and a small C# backend that resolves the call stack and stores the crash data in the database. You may not need any front end at all, just a daily background job that runs some canned SQL queries and dumps the results into a .csv file you can load in Excel. If you need to do any advanced or ad-hoc search, you open a SQL prompt and query the database directly. That would be a cost-effective solution. It is probably a couple of weeks of work for a person who is used to dealing with database-like services. That said, this is exactly what BugSplat and Backtrace already do, and they also provide a front end where you can query, filter and group based on attributes, so double-check those services and choose your battles.

Regards,

Patrick
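The low-cost setup described above (a small SQLite database plus a canned query dumped to CSV) could look roughly like the Python sketch below. Python is used instead of the C# mentioned in the post only to keep the example short. The table layout, column names and function names are all hypothetical, and call stack resolution is left out; the `signature` column is assumed to hold whatever grouping key you compute for a crash.

```python
import csv
import io
import sqlite3

# Hypothetical schema: one row per crash report.
SCHEMA = """
CREATE TABLE IF NOT EXISTS crashes (
    crash_guid     TEXT PRIMARY KEY,
    received_utc   TEXT NOT NULL,
    driver_version TEXT,
    signature      TEXT NOT NULL
)
"""

# Canned query: how common is each signature, and when was it last seen?
TOP_SIGNATURES = """
SELECT signature, COUNT(*) AS crash_count, MAX(received_utc) AS last_seen
FROM crashes
GROUP BY signature
ORDER BY crash_count DESC, signature
"""


def open_db(path):
    """Open (or create) the crash database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store_crash(conn, crash_guid, received_utc, driver_version, signature):
    """Insert one crash; re-ingesting the same GUID is a no-op."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO crashes VALUES (?, ?, ?, ?)",
            (crash_guid, received_utc, driver_version, signature),
        )


def report_csv(conn):
    """Run the canned query and return the result as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["signature", "crash_count", "last_seen"])
    writer.writerows(conn.execute(TOP_SIGNATURES))
    return buf.getvalue()
```

A daily job would call `report_csv` and write the text to a file (opened with `newline=""` so the CSV line endings are not doubled on Windows). Ad-hoc questions can be answered by opening the same database file with the `sqlite3` command-line shell.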

Thanks for the reply, Patrick. For now, we have settled on automatically downloading the crash info from our production Sentry backend, running a script locally to categorize the GPU crashes, and looking at that output manually from time to time.

This should be good enough for our needs. It has already helped us track down a frequent new GPU crash in VirtualShadowMapBuildPerPageDrawCommands and attribute it to driver 580.88 on Blackwell GPUs. In the past, it would have been much more difficult to understand why crashes were increasing, so I’m pretty happy with this.
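The script itself is not shown in this thread, but the driver attribution step can be sketched in Python under a few assumptions: each downloaded crash has already been reduced to a dict with hypothetical `signature` and `driver` fields, and a crash group counts as driver-specific when one driver's share of that group is much larger than its share of all crashes.

```python
from collections import Counter


def driver_skew(crashes, signature):
    """For one crash signature, map each driver version to a pair:
    (share of this signature's crashes, share of all crashes).

    A driver whose first value is far above its second is
    over-represented in this crash group.
    """
    overall = Counter(c["driver"] for c in crashes)
    matching = Counter(
        c["driver"] for c in crashes if c["signature"] == signature
    )
    total = sum(overall.values())
    hits = sum(matching.values())
    if hits == 0:
        return {}
    return {
        driver: (count / hits, overall[driver] / total)
        for driver, count in matching.items()
    }
```

Comparing against the overall driver distribution matters because a popular driver will dominate every crash group simply by being installed on more machines.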

Hi,

I’m glad you were able to implement a low-cost solution to this problem. Even if it’s not an ideal one, it’s a win!

Regards,

Patrick