We’re seeing a very frequent crash on 5.5 related to async Lumen diffuse indirect. It happens more often on high-end hardware, but it also seems to manifest as an idle GPU crash.
We have Aftermath dumps with shader symbols for the crash. It crashes in one of two ways in Aftermath: a GPU page fault in the shadow map code, or a GPU page fault in the Lumen SDF code. These two passes run around the same time on the GPU, Lumen on the async path and shadows on the graphics pipe.
I’ve attached both Aftermath dumps.
The other issue is that this isn’t the only way the problem shows up, and it isn’t consistent. For example, yesterday I was crashing for 9 straight hours with this GPU fault, but this morning I’m 95% fine. We also get these floating shimmers:
[Image Removed]
This seems to be related to
[Content removed]
[Content removed]
Turning off async Lumen would cost us a lot of perf, but we’re just seeing so many crashes related to it.
From looking at the .dxil and assembly, these crashes don’t look similar to the crash dumps I’ve seen for the InstanceCull/NodeAndClusterCull GPU crash linked [Content removed] and [Content removed]. To my knowledge, we haven’t seen a repro for that crash on a 50 series card yet, only on 30 and 40 series cards.
These are a couple of known potential GPU crashes and fixes related to Lumen:
Regarding the floating shimmers, it’s not something I’ve seen reported, but if you have a PIX capture of the issue that could help narrow down the possibilities. There was a splitscreen issue with artifacts that might be related:
Thanks for trying that. I haven’t found any other more recent fixes that may address this. The InstanceCull/NodeAndClusterCull GPU crash does appear related to running Lumen on async compute, though our investigations are pointing towards a driver issue. What kind of performance drop do you see with r.Lumen.DiffuseIndirect.AsyncCompute=0?
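For an A/B comparison, one way to force the setting off for a whole session is a config override instead of entering it in the console on each run. This is a minimal sketch, assuming the project keeps renderer overrides in Config/DefaultEngine.ini; the file and section are assumptions about the project setup, and the cvar is the one named above.

```ini
; Config/DefaultEngine.ini (assumed location for project-wide cvar overrides)
[SystemSettings]
; 0 = run Lumen diffuse indirect on the graphics pipe instead of async compute
r.Lumen.DiffuseIndirect.AsyncCompute=0
```

Comparing GPU frame time in the same scene with this line present and with it removed should give the performance number in question.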
Assuming that’s 1ms on a 50 series GPU and you’re willing to accept that tradeoff, you can target testing it on the subset of users that have that hardware using device profile matching rules. In the case of the InstanceCull crash, it typically occurred on the first frame of drawing lots of Nanite geo and several local shadow casting lights. Have you been able to narrow down from the logs where players are and what content (number and type of lights, amount of Nanite geo) is typically in the scene? Otherwise, I don’t have any further recommendations at this time.
Also, I’m still not seeing the breadcrumbs for these crashes in the attached files. Can you attach those? That will help me look for similar GPU crashes on our side.
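As a rough illustration of the device profile side of this, the fragment below sketches a child profile that only carries the cvar override. The profile name is hypothetical, and the matching rule that selects it for specific GPUs is deliberately not shown, since its exact syntax depends on the engine version and the device profile selector in use.

```ini
; Config/DefaultDeviceProfiles.ini
; "Windows_AsyncLumenOff" is a hypothetical profile name used only for illustration.
[Windows_AsyncLumenOff DeviceProfile]
DeviceType=Windows
BaseProfileName=Windows
; Applied only when this profile is the one selected for the device
+CVars=r.Lumen.DiffuseIndirect.AsyncCompute=0
```

Keeping the override in its own profile means everyone else stays on the base Windows profile with async Lumen enabled, so the perf cost is limited to the hardware under test.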
Apologies for the delay. I’ve been unable to merge the “broken fog 5090.7z” files together with 7-Zip and keep getting an “unexpected end of file” error. Are you able to open the multi-part 7z files locally? If it isn’t a problem with the files or something I’m doing wrong, I will spin up a separate file location without this limit where you can upload the capture.
I was able to find similar crash reports on our end with those breadcrumbs in 5.5 and some in 5.6, but none had known workarounds aside from disabling async compute, and none had additional information.