GPU access violation accessing LDS in NaniteDice.ush

anonymous-edc · April 10, 2026, 12:03am

We are seeing a consistent GPU crash, reported by NV Aftermath, on line 112 of NaniteDice.ush:

	Verts[ Corner ].PointClip = GroupPointPackedClip[ SourceIndex ];

The access goes beyond the bounds of the GroupPointPackedClip array.

It seems like the code assumes WaveGetLaneCount() <= THREADGROUP_SIZE, which I’m not sure is always true, although in practice it should hold on NV, since THREADGROUP_SIZE == 32 here. Or is the indexing in this code incorrect in some other way? It’s complex to follow.
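To make that assumption concrete, here is a rough CPU-side model in C++ (not engine code). It assumes wave widths are powers of two from 4 to 128, the range the SM 6.6 WaveSize attribute accepts, and lists the widths for which a lane index could land outside an LDS array sized THREADGROUP_SIZE. WaveWidthsExceedingGroup is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Rough CPU-side model, not engine code. Assumes wave widths are powers of
// two in [4, 128], the range accepted by the SM 6.6 [WaveSize] attribute.
// Returns every wave width for which WaveGetLaneIndex() could produce an
// index >= threadGroupSize, i.e. past the end of an LDS array that has
// threadGroupSize entries.
std::vector<uint32_t> WaveWidthsExceedingGroup(uint32_t threadGroupSize)
{
    std::vector<uint32_t> widths;
    for (uint32_t width = 4; width <= 128; width *= 2)
    {
        // The highest lane index in a wave of this width is width - 1.
        if (width - 1 >= threadGroupSize)
        {
            widths.push_back(width);
        }
    }
    return widths;
}
```

With THREADGROUP_SIZE == 32 only 64- and 128-lane waves break the assumption, which is consistent with 32-lane NV hardware being expected to be safe.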

Is this something Epic has seen?

Thanks!

[Attachment Removed]

anonymous-edc · April 10, 2026, 12:03am

Steps to Reproduce
Run a build with -nvaftermathall -gpucrashdebugging on an RTX 4080 using driver 585.79.

Petrockets · April 10, 2026, 12:21am

Hello!

This looks fairly similar to the GPU hang in "Crash in NodeAndClusterCull with tesselation" [Content removed]. Can you confirm? We don’t have a resolution yet, though in some cases updated drivers seemed to resolve the issue.

Petrockets · April 10, 2026, 10:14pm

Sorry, that issue has a lot of posts in it. The TL;DR is that we don’t really have a fix for it: the driver update and the suggested CL don’t fix the GPU hang in all cases. Our current thinking is along the same lines. Any extra threads in the last iteration of the loop will have a QueueIndex equal to NumQueues, which is just past the initialized range of lanes in WorkSource. A child task is then constructed unconditionally using an uninitialized SourceIndex, which might point to an invalid parent task. This can produce invalid PatchVertIndexes that go out of bounds of the GroupPointPackedClip LDS allocation.
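The mechanism described above can be modelled on the CPU. This is a hypothetical sketch, not the NaniteDice work distributor: it assumes queue q owns Counts[q] items, that lane L in iteration I handles global item I * LaneCount + L, and that QueueIndex is found by counting the queues whose running item total ends at or before that item. Under those assumptions, lanes past the last item get QueueIndex == NumQueues, the one WorkSource slot that was never written (modelled as an empty std::optional).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical model (illustrative names). Returns the QueueIndex each lane
// computes in the given iteration of the distribution loop.
std::vector<uint32_t> QueueIndicesForIteration(const std::vector<uint32_t>& counts,
                                               uint32_t laneCount, uint32_t iteration)
{
    std::vector<uint32_t> queueIndices(laneCount);
    for (uint32_t lane = 0; lane < laneCount; ++lane)
    {
        const uint32_t item = iteration * laneCount + lane;
        // QueueIndex = number of queues whose items all come before `item`.
        // Items past the end of all queues therefore get QueueIndex == NumQueues.
        uint32_t runningEnd = 0;
        uint32_t queueIndex = 0;
        for (uint32_t count : counts)
        {
            runningEnd += count;
            if (runningEnd <= item) ++queueIndex;
        }
        queueIndices[lane] = queueIndex;
    }
    return queueIndices;
}

// Returns, per lane, whether it would read a WorkSource slot that was never written.
std::vector<bool> ReadsUninitializedSource(const std::vector<uint32_t>& counts,
                                           uint32_t laneCount, uint32_t iteration)
{
    // WorkSource has one slot per lane; only slots [0, NumQueues) get written.
    std::vector<std::optional<uint32_t>> workSource(laneCount);
    for (std::size_t q = 0; q < counts.size() && q < laneCount; ++q)
    {
        workSource[q] = static_cast<uint32_t>(q); // stand-in for a real source index
    }
    std::vector<bool> reads;
    for (uint32_t queueIndex : QueueIndicesForIteration(counts, laneCount, iteration))
    {
        reads.push_back(queueIndex >= laneCount || !workSource[queueIndex].has_value());
    }
    return reads;
}
```

For example, with queue sizes {3, 5, 2} and 4 lanes, the last iteration covers items 8 to 11; items 10 and 11 do not exist, so those two lanes compute QueueIndex 3 and read the unwritten slot.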

We’re not certain yet how we’re going to tackle the fix (adding a bActive parameter to CreateChild, initializing WorkSource so out-of-bounds threads always point to a valid source, etc.), but we hope to look at it next week.
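The two options could look roughly like this in the same kind of CPU model. Everything here is hypothetical (ChildTask, CreateChildGuarded and PadWorkSource are illustrative names, not the engine’s FDiceTask API): option one skips construction for inactive lanes, option two fills the unused WorkSource slots with the last valid entry so every lane reads a valid parent, and the results of inactive lanes are simply discarded.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a child task that records its parent's source.
struct ChildTask
{
    uint32_t parentSource;
};

// Option 1 (hypothetical): pass bActive and never read WorkSource for
// inactive lanes, so an out-of-range QueueIndex is never dereferenced.
std::optional<ChildTask> CreateChildGuarded(const std::vector<uint32_t>& workSource,
                                            uint32_t queueIndex, bool bActive)
{
    if (!bActive) return std::nullopt;
    return ChildTask{workSource[queueIndex]};
}

// Option 2 (hypothetical): copy the last valid source into the unused slots,
// so a lane whose QueueIndex is NumQueues still sees a valid parent.
// Assumes at least one valid source and laneCount >= validSources.size().
std::vector<uint32_t> PadWorkSource(const std::vector<uint32_t>& validSources,
                                    uint32_t laneCount)
{
    std::vector<uint32_t> padded(validSources);
    padded.resize(laneCount, validSources.back());
    return padded;
}
```

Option 2 matches the wording of the later CL ("guarantees that the parent is valid even for inactive threads"), although how the real shader achieves that is not shown in this thread.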

Petrockets · April 17, 2026, 1:15am

Hello!

By any chance, did you see that Rune submitted a potential fix?

CL#52723974 (74a0b4) Fix for tessellation GPU crash.

FDiceTask::CreateChild could crash when it didn’t reference a valid parent task.

Work distribution code now guarantees that the parent is valid even for inactive threads.

UE 5.8 CL#52724036 (c9c405)

I don’t know if you guys are able to use this (based on SDKs, target platforms, etc.), but as a debugging aid it might at least rule that out: https://microsoft.github.io/DirectX-Specs/d3d/HLSL_SM_6_6_WaveSize.html

We did in fact use this to debug and test the fix for this AMD artifact:

CL#47241577 (c2b571) Fix a bug with the work distributor shader code that happens when the wave lane count is larger than the compute shader’s thread group size.
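As a hypothetical illustration of why the lane count matters when it exceeds the thread group size (this is not the code that CL changed): suppose a 32-thread group runs in a single 64-lane wave, so only lanes 0 to 31 are active, and a loop walks the work in strides of the lane count. Striding by WaveGetLaneCount() skips items, while striding by min(WaveGetLaneCount(), THREADGROUP_SIZE) covers all of them. ItemsProcessed and ClampedLaneCount are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Mirrors the clamp min(WaveGetLaneCount(), THREADGROUP_SIZE).
uint32_t ClampedLaneCount(uint32_t waveLaneCount, uint32_t groupSize)
{
    return std::min(waveLaneCount, groupSize);
}

// Hypothetical model: the whole group runs in one wave, and only lanes
// [0, groupSize) are active. Each active lane handles item base + lane, and
// base advances by `stride` (assumed >= groupSize, so no item is visited twice).
// Returns how many items get processed.
uint32_t ItemsProcessed(uint32_t numItems, uint32_t groupSize, uint32_t stride)
{
    uint32_t processed = 0;
    for (uint32_t base = 0; base < numItems; base += stride)
    {
        for (uint32_t lane = 0; lane < groupSize; ++lane)
        {
            if (base + lane < numItems) ++processed;
        }
    }
    return processed;
}
```

With 100 items, striding by 64 processes only 64 of them (items 32 to 63 and 96 to 99 are never visited), while striding by the clamped count of 32 processes all 100.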

I’ll reach out to Rune about the other locations that use an unguarded WaveGetLaneIndex(). Do you have a fairly consistent repro to test with?

anonymous-edc · April 10, 2026, 5:32pm

Thanks. Yeah, that does seem to be exactly the same issue. Unfortunately, we are already on the newest drivers. I can try merging that CL, but I’m not hopeful, since the case it fixes is likely not possible on NV hardware.

anonymous-edc · April 10, 2026, 5:40pm

Also, I’m not sure that CL is a complete fix, since searching for WaveGetLaneIndex() turns up code like:

		const uint LaneIndex = WaveGetLaneIndex();
 
		GroupPointPackedClip[LaneIndex] = Vert.PointClip;
		GroupNormalPackedClip[LaneIndex] = Vert.NormalClip;

so just doing const uint LaneCount = min(WaveGetLaneCount(), THREADGROUP_SIZE); isn’t enough, since WaveGetLaneIndex() returns a value between 0 and WaveGetLaneCount() - 1.
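A sketch of the extra guard this implies, again as a CPU model rather than shader code. It assumes, hypothetically, that lanes with an index >= THREADGROUP_SIZE could be active and reach the store; in that case clamping LaneCount does nothing for the store, and only a per-lane LaneIndex < THREADGROUP_SIZE check keeps it in bounds. kThreadGroupSize and CountOutOfBoundsStores are illustrative names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr uint32_t kThreadGroupSize = 32; // stands in for THREADGROUP_SIZE

// Models GroupPointPackedClip[LaneIndex] = ... executed by `activeLanes` lanes.
// Note that no LaneCount clamp appears: the store is indexed by LaneIndex alone.
// Returns how many lanes would have stored past the end of the LDS array.
uint32_t CountOutOfBoundsStores(uint32_t activeLanes, bool bGuardLaneIndex)
{
    std::array<uint32_t, kThreadGroupSize> groupPointPackedClip{};
    uint32_t outOfBounds = 0;
    for (uint32_t laneIndex = 0; laneIndex < activeLanes; ++laneIndex)
    {
        if (bGuardLaneIndex && laneIndex >= kThreadGroupSize)
            continue; // guarded: this lane skips the store entirely
        if (laneIndex < groupPointPackedClip.size())
            groupPointPackedClip[laneIndex] = laneIndex; // in-bounds store
        else
            ++outOfBounds; // on the GPU this would write past the allocation
    }
    return outOfBounds;
}
```

With 64 active lanes the unguarded store overruns by 32 entries and the guarded one by none; with 32 lanes both are safe.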

anonymous-edc · April 10, 2026, 10:27pm

Thanks for the update. Yeah, it’s tricky, because intrinsics like WaveActiveCountBits and WavePrefixCountBits can also read past the THREADGROUP_SIZE-th lane. It’s hard to mask everything.
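The same concern applies to wave-wide counts. The CPU model below treats a ballot as a 64-bit mask and assumes, hypothetically, that all 64 lanes are active with the predicate true; the models of WaveActiveCountBits and WavePrefixCountBits then count lanes 32 to 63 as well, unless the predicate is also ANDed with LaneIndex < THREADGROUP_SIZE. The helper names are illustrative, not HLSL.

```cpp
#include <cassert>
#include <cstdint>

// Portable popcount: clears the lowest set bit until none remain.
uint32_t CountBits(uint64_t mask)
{
    uint32_t count = 0;
    while (mask != 0)
    {
        mask &= mask - 1;
        ++count;
    }
    return count;
}

// Ballot over `activeLanes` lanes (<= 64) whose predicate is true, optionally
// masked so lanes >= groupSize vote false.
uint64_t Ballot(uint32_t activeLanes, uint32_t groupSize, bool bMaskToGroup)
{
    uint64_t mask = 0;
    for (uint32_t lane = 0; lane < activeLanes; ++lane)
    {
        const bool predicate = !bMaskToGroup || lane < groupSize;
        if (predicate) mask |= uint64_t{1} << lane;
    }
    return mask;
}

// Model of WaveActiveCountBits: number of lanes whose predicate is true.
uint32_t ActiveCountBits(uint64_t ballot) { return CountBits(ballot); }

// Model of WavePrefixCountBits: true predicates in lanes strictly below `lane`.
uint32_t PrefixCountBits(uint64_t ballot, uint32_t lane)
{
    const uint64_t below = (lane >= 64) ? ~uint64_t{0} : ((uint64_t{1} << lane) - 1);
    return CountBits(ballot & below);
}
```

Unmasked, lane 40 sees a prefix count of 40, which would index past a 32-entry array; masked, the count stops at 32.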

But there must also be something else going on here (unless I’m misunderstanding you), since as far as I know WaveGetLaneCount() always returns 32 on NV GPUs, and these compute shaders use THREADGROUP_SIZE 32 or THREADGROUP_SIZE 64.

anonymous-edc · April 17, 2026, 11:47pm

Thanks! Yeah, I followed the other thread too. I’ll look into grabbing that CL and see if we can still repro. We had a fairly reliable repro previously, but it wasn’t 100%; it took some playtime and luck, so fingers crossed.
