None of them do a good job of de-duplicating GPU crashes. Since we can’t repro the GPU crashes our users are experiencing, debugging them has been incredibly difficult and time-intensive. We don’t even know how common each crash is. As a result, we don’t even start investigating, because we can’t tell whether we’d be spending time on the right crash.
With the recent focus on improving the Aftermath integration in the engine, we have been wondering whether there is a way to upload shader symbols to the crash reporting service so that it can properly categorize GPU crashes.
It appears a custom crash reporting backend will be required to achieve this.
But it has since been deprecated and removed [Content removed]
I’m writing to:
Hear from the community whether anyone has been able to solve the challenge of de-duplicating GPU crashes.
Hear from Epic whether there are any plans to make at least parts of their crash reporting backend code available to the Unreal Engine community, ideally as a well-documented tool like UGS, Horde, etc., or at least to raise awareness that interest in such a solution exists.
Thanks, Patrick. I understand Epic isn’t providing the crash report backend. My apologies if my initial message wasn’t clear, but I’m really just trying to raise awareness internally at Epic that there’s a clear need within the Unreal Engine community for a better solution to de-duplicating GPU crashes, particularly one that could handle shader symbols.
I’m hoping that by highlighting this, it might prompt some internal discussion about the possibility of making parts of a solution available in the future.
GPU crashes are a common cause of crashes internally too. I checked our crash database for the Editor and Fortnite, and we de-duplicate (or group) GPU crashes by the CPU callstack… So even internally, we don’t yet have a good way to precisely de-duplicate GPU crashes at large scale. I reached out to the rendering team to see if they are working on a possible solution and to learn which strategy they use to decide which GPU crashes to investigate.
I have new information. In UE 5.5, the rendering team added GPU breadcrumbs to the crash report when generating the CrashContext.runtime-xml file. This looks like below. Our crash backend can also group crashes by their GPU breadcrumbs, so we get another way to group similar crashes (de-duplicate them). You can search for SetGPUBreadcrumbs or GPUBreadcrumbs in the C++ code to find where the engine adds this info during a crash.
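For anyone who wants to try grouping on breadcrumbs outside of a crash backend, here is a minimal Python sketch of the idea. The exact XML layout isn't shown in this thread, so the sketch assumes only that the breadcrumbs sit somewhere inside an element named GPUBreadcrumbs and that their content is element text rather than attributes. The function name is hypothetical, and this is not how Epic's backend does it.

```python
import hashlib
import xml.etree.ElementTree as ET
from typing import Optional


def gpu_breadcrumb_fingerprint(xml_text: str) -> Optional[str]:
    """Return a short, stable grouping key for the GPU breadcrumbs, or None.

    Assumption: the breadcrumbs are stored as text under an element whose
    tag is "GPUBreadcrumbs". The real layout may differ between engine versions.
    """
    root = ET.fromstring(xml_text)
    node = root.find(".//GPUBreadcrumbs")
    if node is None:
        return None
    # Gather all nested text and collapse whitespace so that formatting
    # differences between reports do not change the key.
    text = " ".join(" ".join(node.itertext()).split())
    if not text:
        return None
    # A short hash is easier to store and compare than the raw breadcrumb text.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]
```

Two reports with the same breadcrumb text get the same key, so counting reports per key gives a rough "how common is this crash" number. If the breadcrumbs include per-run values (timestamps, frame numbers), those would need to be stripped before hashing, otherwise every report ends up in its own group.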
This is great, Patrick, and with a custom crash reporting backend like yours it will be valuable.
The crash reporting services listed in your documentation don’t use this information, though. We’re seriously considering building our own for that reason, and anything that helps us get there would be much appreciated.
I cannot provide any code, but I know that just maintaining a crash report service like ours is very expensive. About 3 years ago, we wanted to drop our custom system and use a commercial one to reduce operational costs, but we didn’t find anything that could do what the existing backend was doing at the scale we needed. So we kept developing our own solution. We have a few people dedicated to keeping it running, fixing live issues, and handling new feature requests. There is also the cost of keeping those crash reports and their artifacts stored somewhere (cloud in our case) for some time, plus the bandwidth to move them around.
I assume you will not have to run at the same scale as our system in terms of platforms supported, applications supported, and reports per second to ingest, so you may be able to implement something at a reasonable cost. That said, I reviewed the proposed solutions (Sentry, BugSplat, Backtrace) and I’m not 100% sure they cannot handle GPUBreadcrumbs. BugSplat and Backtrace handle the attributes of the CrashContext.runtime-xml. Their docs don’t explicitly mention GPUBreadcrumbs… but this is recent, so the docs might be outdated. I’d definitely experiment with those systems, spending a few days to set one up and check what is possible, or contact their support to ask how you can query the crash attributes. It might save you a lot of time, work, and money.
Otherwise, you can probably implement something that would not be too costly. For example, you may be able to keep the last 500,000 PC crashes for your game on a company server with a small SQLite database and a small C# backend that resolves the call stack and stores the crash data in the database. You may not need any front end at all, just a daily background job that runs some canned SQL queries and dumps the results to a .csv file you can load in Excel… If you need to do any ‘advanced/ad-hoc’ search, you open a SQL prompt and query the database directly. That would be a cost-effective solution. It is probably a couple of weeks of work for someone who is used to dealing with database-like services. But that’s exactly what BugSplat/Backtrace already do, and they also provide a front end where you can query/filter/group based on attributes… so double-check those services and choose your battles…
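To make the "SQLite plus canned queries" idea a bit more concrete, here is a rough sketch in Python (the post suggests C#; Python is used here only to keep it short). The table layout, column names, and function names are all assumptions for illustration. It skips call stack resolution and ingestion entirely and only shows storing one row per crash and dumping a grouped report to .csv.

```python
import csv
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the crash database. The schema is a hypothetical minimum."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS crashes (
               id INTEGER PRIMARY KEY,
               received_at TEXT NOT NULL,   -- ISO 8601 timestamp
               build TEXT,
               gpu_driver TEXT,
               group_key TEXT NOT NULL      -- e.g. call stack or breadcrumb hash
           )"""
    )
    return conn


def add_crash(conn, received_at, build, gpu_driver, group_key):
    """Store one crash report's metadata."""
    conn.execute(
        "INSERT INTO crashes (received_at, build, gpu_driver, group_key) "
        "VALUES (?, ?, ?, ?)",
        (received_at, build, gpu_driver, group_key),
    )


# One example of a "canned" query: crash count per group, most frequent first.
TOP_GROUPS_SQL = """
SELECT group_key, COUNT(*) AS crash_count, MAX(received_at) AS last_seen
FROM crashes
GROUP BY group_key
ORDER BY crash_count DESC, group_key
"""


def dump_report(conn, csv_path):
    """Run the canned query and write the result, with a header row, to csv_path."""
    cur = conn.execute(TOP_GROUPS_SQL)
    rows = cur.fetchall()
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])
        writer.writerows(rows)
    return len(rows)
```

A daily job would call open_db with a file path, then dump_report once per canned query. Ad-hoc questions can go straight to the same file with the sqlite3 command line shell.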
Thanks for the reply, Patrick. For now we have settled on automatically downloading the crash info from our production Sentry backend, running a script locally to categorize the GPU crashes, and looking at that output manually from time to time.
This should be good enough for our needs. It already helped track down and attribute a frequent new GPU crash in VirtualShadowMapBuildPerPageDrawCommands to driver 580.88 on Blackwell GPUs. In the past it would have been much more difficult to understand why crashes were increasing, so I’m pretty happy with this.
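The actual script isn't shown here, but the categorization step could look something like the Python sketch below. It assumes each downloaded crash has already been reduced to a dict with two hypothetical keys, "breadcrumb" (the last GPU breadcrumb or pass name) and "driver" (the GPU driver version). Downloading from Sentry and extracting those fields are left out.

```python
from collections import Counter, defaultdict


def summarize_gpu_crashes(crashes):
    """Group crashes by breadcrumb and report the dominant driver for each group.

    `crashes` is an iterable of dicts with the hypothetical keys
    "breadcrumb" and "driver". Returns a list of tuples
    (breadcrumb, total_count, top_driver, top_driver_share),
    sorted by total_count descending, then by breadcrumb name.
    """
    by_breadcrumb = defaultdict(Counter)
    for crash in crashes:
        breadcrumb = crash.get("breadcrumb") or "<no breadcrumb>"
        driver = crash.get("driver") or "<unknown driver>"
        by_breadcrumb[breadcrumb][driver] += 1

    summary = []
    for breadcrumb, drivers in by_breadcrumb.items():
        total = sum(drivers.values())
        top_driver, top_count = drivers.most_common(1)[0]
        summary.append((breadcrumb, total, top_driver, top_count / total))

    # Most frequent crash groups first; the name breaks ties for stable output.
    summary.sort(key=lambda row: (-row[1], row[0]))
    return summary
```

A group where one driver version accounts for most of the reports is a hint that the crash may be tied to that driver, which is the kind of attribution described above. The share is only meaningful relative to how common that driver is among all players, so it is a starting point for investigation, not proof.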