We are seeing a consistent GPU crash in NV Aftermath on line 112 of NaniteDice.ush:
Verts[ Corner ].PointClip = GroupPointPackedClip[ SourceIndex ];
where it accesses beyond the bounds of the GroupPointPackedClip array.
It seems like there is an assumption that WaveGetLaneCount() <= THREADGROUP_SIZE which I’m not sure is always true, although in practice it should be on NV since THREADGROUP_SIZE == 32 here. But possibly the indexing is incorrect somehow in this code? It’s complex to follow.
This looks fairly similar to the GPU hang discussed in [Crash in NodeAndClusterCull with tesselation] [Content removed]. Can you confirm? We don’t have a resolution yet, though in some cases updated drivers seemed to resolve the issue.
Sorry, that issue has a lot of posts in it. The TL;DR is that we don’t really have a fix for it yet: the driver update and the suggested CL don’t fix the GPU hang in all cases. Our current thinking is along the same lines as yours. Any extra threads in the last iteration of the loop will have a QueueIndex equal to NumQueues, which is just past the initialized range of lanes in WorkSource. A child task is then constructed unconditionally from an uninitialized SourceIndex, which might point to an invalid parent task. That can produce invalid PatchVertIndexes that go out of bounds on the GroupPointPackedClip LDS allocation.
We’re not certain yet how we’re going to tackle the fix (adding a bActive parameter to CreateChild, initializing WorkSource so that out-of-bounds threads always point to a valid source, etc.), but we hope to look at it next week.
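To make the two candidate fixes concrete, here is a rough CPU-side model in C++ (not the actual shader code). Everything in it is an assumption made for illustration: the ChildTask struct, kInvalidIndex, MakeWorkSource, DistributeIteration, and the signature given to CreateChild are hypothetical, and the real work distributor is certainly more involved. It only shows how a lane whose QueueIndex reaches NumQueues can be kept from consuming a stale source index.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sentinel for "no parent task"; not taken from the engine.
constexpr uint32_t kInvalidIndex = 0xFFFFFFFFu;

// Hypothetical stand-in for the child task built by each lane.
struct ChildTask {
    bool     bActive;      // false when the lane had no queue to process
    uint32_t SourceIndex;  // parent task index, meaningful only when bActive
};

// Candidate fix 1: an explicit bActive flag, so an inactive lane never
// turns its (possibly uninitialized) source index into a real task.
ChildTask CreateChild(uint32_t SourceIndex, bool bActive) {
    if (!bActive) {
        return {false, kInvalidIndex};
    }
    return {true, SourceIndex};
}

// Candidate fix 2: pre-fill every lane's slot with a known-valid parent,
// so even an unguarded read sees something safe.
std::vector<uint32_t> MakeWorkSource(uint32_t LaneCount, uint32_t ValidFallback) {
    return std::vector<uint32_t>(LaneCount, ValidFallback);
}

// One loop iteration: lane L handles QueueIndex = FirstQueue + L.
// In the last iteration some lanes land on QueueIndex >= NumQueues.
std::vector<ChildTask> DistributeIteration(const std::vector<uint32_t>& WorkSource,
                                           uint32_t FirstQueue,
                                           uint32_t NumQueues) {
    assert(FirstQueue <= NumQueues);
    std::vector<ChildTask> Children;
    for (uint32_t Lane = 0; Lane < WorkSource.size(); ++Lane) {
        const uint32_t QueueIndex = FirstQueue + Lane;
        const bool bActive = QueueIndex < NumQueues;  // extra lanes fail this test
        Children.push_back(CreateChild(WorkSource[Lane], bActive));
    }
    return Children;
}
```

With four lanes, FirstQueue = 4 and NumQueues = 6, only lanes 0 and 1 produce active children; lanes 2 and 3 come back inactive instead of carrying whatever happened to be in their WorkSource slot. The two fixes are independent, so either one alone would cover this case in the model.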
We did in fact use this to debug and test the fix for this AMD artifact:
CL#47241577 (c2b571) Fix a bug with the work distributor shader code that happens when the wave lane count is larger than the compute shader’s thread group size.
I’ll reach out to Rune about the other locations using unguarded WaveGetLaneIndex() - do you have a fairly consistent repro to test with?
Thanks. Yeah, that does seem to be the exact same issue. Unfortunately, we are on the newest drivers already. I can try merging that CL but I’m not hopeful since on NV hardware the case it fixes is likely not possible.
So just doing const uint LaneCount = min(WaveGetLaneCount(), THREADGROUP_SIZE); isn’t enough, since WaveGetLaneIndex() returns a value between 0 and WaveGetLaneCount()-1.
Thanks for the update. Yeah, it’s tricky, because also stuff like WaveActiveCountBits and WavePrefixCountBits can read past the THREADGROUP_SIZEth lane. It’s hard to mask everything.
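The masking problem from the last two posts can be modelled on the CPU. The C++ sketch below is only an illustration under stated assumptions: PrefixCountBits and ActiveCountBits are simplified stand-ins for WavePrefixCountBits and WaveActiveCountBits that treat every lane in the wave as contributing, and MaskToGroup is a hypothetical helper, not something from the engine. It shows why each predicate fed into a wave count would need its own lane guard, not just a clamped LaneCount.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified model of WavePrefixCountBits: number of lanes with a lower
// index whose predicate is true. Assumes all lanes in the wave contribute.
uint32_t PrefixCountBits(const std::vector<bool>& Predicate, uint32_t LaneIndex) {
    assert(LaneIndex < Predicate.size());
    uint32_t Count = 0;
    for (uint32_t Lane = 0; Lane < LaneIndex; ++Lane) {
        Count += Predicate[Lane] ? 1u : 0u;
    }
    return Count;
}

// Simplified model of WaveActiveCountBits: number of lanes whose
// predicate is true, again assuming all lanes contribute.
uint32_t ActiveCountBits(const std::vector<bool>& Predicate) {
    uint32_t Count = 0;
    for (bool bSet : Predicate) {
        Count += bSet ? 1u : 0u;
    }
    return Count;
}

// Hypothetical guard: force the predicate to false for every lane at or
// beyond ThreadGroupSize, so those lanes cannot inflate the counts.
std::vector<bool> MaskToGroup(std::vector<bool> Predicate, uint32_t ThreadGroupSize) {
    for (std::size_t Lane = ThreadGroupSize; Lane < Predicate.size(); ++Lane) {
        Predicate[Lane] = false;
    }
    return Predicate;
}
```

In this model, a 64-lane wave with an all-true predicate and a group size of 32 counts 64 lanes unmasked and 32 after MaskToGroup. The cost is that every predicate passed to a wave count has to go through the guard, which is the "hard to mask everything" point.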
But also, there must be something else going on here (unless I’m misunderstanding you), since on NV GPUs WaveGetLaneCount() always returns 32 as far as I know, and these CS are THREADGROUP_SIZE 32 or THREADGROUP_SIZE 64.
Thanks! Yeah, I followed the other thread also. I’ll look into grabbing that CL and see if we can still repro. We had a fairly reliable repro previously, but it wasn’t 100%; it took some playtime and luck, so fingers crossed.