Thread race crash in FenceOcclusionTests() / WaitOcclusionTests()

(This is something I’ve already added code locally to fix.)

Over the past two weeks, we have been getting a number of crash dumps with different callstacks and crash locations, but all consistently somewhere within FSceneRenderer::WaitOcclusionTests(). I spent some time investigating and found that the wait loop was never reaching ViewStateFenceCount <= FencesAllowedInQueue, so it never exited the do/while loop. FenceToWaitOn keeps getting decremented to negative values, and eventually we crash trying to access whatever happens to be right before that array in the heap (which explains why the actual crash is different each time).

I added additional tracking and discovered that between the first loop within that function, where ViewStateFenceCount is computed, and the second loop, where it waits on the fences, the OcclusionSubmittedFence array can be modified by another thread executing a FenceOcclusionTests RDG task:

[Image Removed]

[Image Removed]

(OcclusionSubmittedFencesBeforeWait is copied from OcclusionSubmittedFence at the beginning of the function, before we count up ViewStateFenceCount.)

In most situations, this wouldn’t lead to a crash, since the wait loop always starts from the end of the array anyway. Theoretically you could end up exiting the loop before waiting on the new fence 0 if it has the same ViewStateUniqueID as the wait loop. However, it doesn’t seem likely that we’d be adding a fence for a view after we’ve already started waiting on it. For a crash to occur, the timing would somehow have to be perfect, such that the fence for the correct view is pushed back from index [N] to [N+1] precisely as the wait loop iterates down from [N+1] to [N].
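To make that timing concrete, here is a small single-threaded model in standard C++, not the engine code. FenceEntry, WalkFences, and the AfterStep hook are hypothetical names; the hook stands in for the other thread inserting a new fence at index 0 at the worst possible moment. The bounds check exists only so the model can report the overrun instead of reading out of bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for one entry of the fence array.
struct FenceEntry {
    int ViewStateUniqueID;
};

// Simplified model of the wait loop. Returns true if it exits normally and
// false if the index runs past the front of the array, which is where
// unguarded code would start reading out of bounds.
// AfterStep simulates another thread touching the array between iterations.
bool WalkFences(std::vector<FenceEntry>& Fences, int ViewId,
                int FencesAllowedInQueue,
                const std::function<void(int)>& AfterStep)
{
    assert(FencesAllowedInQueue >= 0);

    // First pass: count the fences that belong to this view.
    int ViewStateFenceCount = 0;
    for (const FenceEntry& Fence : Fences) {
        if (Fence.ViewStateUniqueID == ViewId) ++ViewStateFenceCount;
    }

    // Second pass: walk down from the end, "waiting" on matching fences.
    int FenceToWaitOn = static_cast<int>(Fences.size()) - 1;
    while (ViewStateFenceCount > FencesAllowedInQueue) {
        if (FenceToWaitOn < 0) return false; // model-only guard
        if (Fences[static_cast<std::size_t>(FenceToWaitOn)].ViewStateUniqueID == ViewId) {
            --ViewStateFenceCount;
        }
        --FenceToWaitOn;
        AfterStep(FenceToWaitOn);
    }
    return true;
}

// No concurrent modification: the matching fence at index 0 is found.
bool RunWithoutRace()
{
    std::vector<FenceEntry> Fences = {{1}, {2}, {2}};
    return WalkFences(Fences, 1, 0, [](int) {});
}

// A new fence lands at index 0 just as the walk steps from 1 to 0, so the
// matching fence moves from 0 to 1 and is never seen.
bool RunWithRace()
{
    std::vector<FenceEntry> Fences = {{1}, {2}, {2}};
    return WalkFences(Fences, 1, 0, [&Fences](int NextIndex) {
        if (NextIndex == 0) {
            Fences.insert(Fences.begin(), FenceEntry{3});
            Fences.pop_back(); // keep the array the same size
        }
    });
}
```

In this model RunWithoutRace() finishes normally, while RunWithRace() walks off the front of the array because the count from the first pass can no longer be satisfied.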

My solution to this for now was to add a critical section with FScopeLocks within the FenceOcclusionTests lambda and WaitOcclusionTests to protect the array, and we have not had any new reports of related crashes since. However, I still do not know why this was occurring ONLY in Shipping configuration builds. Posting here for someone to double-check my work, basically, and ensure there is not something subtle going on that I’m missing.
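For anyone wanting to see the shape of that change outside the engine, below is a rough standard C++ analogue, with std::mutex and std::lock_guard standing in for FCriticalSection and FScopeLock. FenceQueue, Submit, Drain, and RunLockedDemo are hypothetical names, and the vector of view IDs is only a placeholder for the fence array. The point is that the count pass and the walk pass sit under the same lock, so the array cannot change between them.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the fence array plus the lock that guards it.
class FenceQueue {
public:
    // Writer side: a new fence goes in at the front, under the lock.
    void Submit(int ViewId) {
        std::lock_guard<std::mutex> Guard(Lock);
        ViewIds.insert(ViewIds.begin(), ViewId);
    }

    // Reader side: count and walk under ONE lock so both passes see the
    // same array contents. Returns how many fences were consumed.
    int Drain(int ViewId) {
        std::lock_guard<std::mutex> Guard(Lock);

        int Counted = 0;
        for (int Id : ViewIds) {
            if (Id == ViewId) ++Counted;
        }

        int Removed = 0;
        for (int Index = static_cast<int>(ViewIds.size()) - 1; Index >= 0; --Index) {
            if (ViewIds[static_cast<std::size_t>(Index)] == ViewId) {
                ViewIds.erase(ViewIds.begin() + Index);
                ++Removed;
            }
        }

        // Holds only because nothing can modify the array between passes.
        assert(Counted == Removed);
        return Removed;
    }

private:
    std::mutex Lock;          // analogue of FCriticalSection
    std::vector<int> ViewIds; // placeholder for the fence array
};

// Runs a writer thread against a draining reader and returns the total
// number of fences the reader consumed.
int RunLockedDemo(int SubmitCount)
{
    FenceQueue Queue;
    std::thread Writer([&Queue, SubmitCount] {
        for (int i = 0; i < SubmitCount; ++i) Queue.Submit(1);
    });

    int Drained = 0;
    for (int i = 0; i < 100; ++i) Drained += Queue.Drain(1);

    Writer.join();
    Drained += Queue.Drain(1); // pick up anything submitted late
    return Drained;
}
```

Note that this only guarantees the two passes see a consistent array. It says nothing about whether the writer runs before or after the reader.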

[Attachment Removed]

Steps to Reproduce
This would just occur at some arbitrary point after playing the game for a while in Shipping configuration only.

[Attachment Removed]

The scope lock will only fix simultaneous access. The appropriate fix here is to either remove the FRDGAsyncTask tag from the lambda so it can’t race with WaitOcclusionTests in the next render frame, or add FRDGBuilder::WaitForAsyncExecuteTask() in WaitOcclusionTests to fence RDG async execution tasks.

The latter is probably best: removing an async tag forces all passes in that execution batch to be awaited at the end of RDG execution, whereas fencing the tasks during scene visibility in the next frame is generally late enough that those tasks have completed anyway.
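The fencing idea can be sketched in standard C++ as well, with a std::future standing in for the RDG async execute task. This is only an analogy, not how FRDGBuilder::WaitForAsyncExecuteTask() is implemented; FrameState, KickAsyncSubmit, and WaitThenCount are hypothetical names. The reader blocks on the outstanding task before it touches the array, so ordering is guaranteed rather than just mutual exclusion.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical per-frame state: the fence array plus a handle to the async
// task that is still allowed to write to it.
struct FrameState {
    std::vector<int> SubmittedFences;
    std::future<void> PendingSubmit;
};

// Stand-in for the async-tagged lambda: it inserts a fence from another thread.
void KickAsyncSubmit(FrameState& State, int ViewId)
{
    State.PendingSubmit = std::async(std::launch::async, [&State, ViewId] {
        State.SubmittedFences.insert(State.SubmittedFences.begin(), ViewId);
    });
}

// Stand-in for the wait function: fence the async task first, then read.
std::size_t WaitThenCount(FrameState& State, int ViewId)
{
    if (State.PendingSubmit.valid()) {
        State.PendingSubmit.get(); // blocks until the writer has finished
    }

    std::size_t Count = 0;
    for (int Id : State.SubmittedFences) {
        if (Id == ViewId) ++Count;
    }
    return Count;
}
```

Because the wait happens before any read, no lock is needed around the array in this sketch: the writer is known to be done by the time the count starts.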

Either solution is technically correct; I’ll likely be submitting the former for 5.8.

Thanks for reporting this!

[Attachment Removed]