Note: we are reporting this issue on 5.4, but we have backported a couple of CLs to it that affect this area a bit (I don’t think they created the issue, but, for example, UninstallGPUPage does not exist in 5.4 even though its logic does). Specifically, we have backported CL# 35337406, 35367731, 35734956, and 36016323. Looking over Epic’s P4, I do not see a change I think would fix this crash, short of potentially large rewrites/refactors.
We have run into what seems to be a very unlucky and rare crash within the Nanite streaming manager (the stars need to align just right, and rather annoyingly we have this crashing an automated test 100% of the time by sheer bad luck). I cannot provide a repro easily as it’s highly specific, and someone familiar with the code will need to weigh in on this (and should be able to see the logic issues here). I also might be missing a couple of details, as I’m not super familiar with the code, but I believe I mostly know what is happening.
The crash comes from checks in either FStreamingManager::UninstallGPUPage or FStreamingManager::ApplyFixups firing, usually because we are removing more clusters than a resource actually has, or we are accessing a hierarchy index beyond the number of hierarchy nodes a resource has. I believe what is happening is the following (each step is accompanied by log output lines showing it happening, as that’s how I caught it):
- A resource is added to the streaming manager at a certain root page with a certain version (generating its runtime ID).
- “Added resource /Game/Art/CHAR/Pickups/X/Y/Z.Z to root page 6 with ID 4194310”
- This resource uploads a page, creating an FFixupChunk for a specific GPU page index.
- “Resource /Game/Art/CHAR/Pickups/X/Y/Z.Z has page index 60 fixup page 000002B52D704500 with 71 clusters”
- “Resource /Game/Art/CHAR/Pickups/X/Y/Z.Z installed page 60, resident clusters 87”
- This resource is removed from the streaming manager.
- “Removed resource /Game/Art/CHAR/Pickups/X/Y/Z.Z with root page 6 with ID 4194310”
- Many more resources allocate at root page index 6 (256 to be exact, because RootPageVersions is only a uint8 and wraps after that, creating repeating runtime IDs). During this time GPU page index 60 (the page uploaded to by the previous resource) is never reused/touched/freed/etc.
- A new resource allocates at root page 6 with the same exact ID.
- “Added resource /Game/Art/WEAP/A/B/C.C to root page 6 with ID 4194310”
- Page 60 becomes a potential page to use in ::InstallReadyPages
- UninstallGPUPage attempts to uninstall page 60. Page 60 belongs to a resource freed long ago (and normally this would just return out because GetResources returns null), but because a new resource has been added that is using the same RuntimeID, UninstallGPUPage attempts to uninstall the page from the new resource.
- “Uninstalling page 60 for resource /Game/Art/WEAP/A/B/C.C with fixup chunk 000002B52D704500 with 71 clusters”
- (Note, there is no log about this resource attempting to allocate page 60 (or any other resource prior to this for that matter)- hence why the old FFixupChunk is still there.)
- This causes problems. The most common crash we have seen is that the cluster count on the fixup chunk is larger than the resource’s resident cluster count, causing NumResidentClusters to wrap around and fail the “Resources->NumResidentClusters <= Resources->NumClusters” check (see the sketch just after this list).
- Even if the checks didn’t fire, this should cause chaos/issues because we are uninstalling pages that aren’t actually installed/aren’t used by that resource.
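To make that underflow concrete, here is a minimal standalone sketch (made-up numbers and simplified types rather than the actual engine code) of how subtracting the stale fixup chunk's cluster count wraps the new resource's resident cluster count and trips the check:

#include <cassert>
#include <cstdint>

// Minimal sketch of the underflow, not engine code: numbers and types are made
// up to mirror the log lines above (a 71-cluster stale fixup chunk vs a new
// resource with fewer clusters resident).
struct FakeResources
{
    uint32_t NumClusters = 100;        // total clusters the new resource has
    uint32_t NumResidentClusters = 20; // clusters it actually has resident
};

int main()
{
    FakeResources Resources;

    // Stale FFixupChunk left behind by the old resource that used GPU page 60:
    const uint32_t StaleFixupChunkClusters = 71;

    // UninstallGPUPage subtracts the fixup chunk's cluster count from the
    // resource it thinks owns the page; with the wrapped RuntimeResourceID that
    // is the new resource, so the unsigned count wraps around...
    Resources.NumResidentClusters -= StaleFixupChunkClusters; // 20 - 71 wraps to ~4.29 billion

    // ...and the equivalent of the engine check fires:
    assert(Resources.NumResidentClusters <= Resources.NumClusters);
    return 0;
}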
I’m not sure if this is still a problem in 5.6+, as this code looks like it has been heavily refactored, but:
- We cannot upgrade our engine before shipping, so a fix for 5.4 would be appreciated.
- Even though this code has changed, I’m not sure the underlying logic issues that allow this crash/bug are fixed in the newer code.
I currently don’t have a fix locally for this, but I am testing things out and will post a follow-up if I get something. In the meantime, any help from Epic would be appreciated.
Steps to Reproduce
Rapidly load and unload Nanite assets until the stars align.
See description.
Hi Brandon,
Sorry for the inconvenience. I have actually *just* been looking into what I believe is the same issue.
The root of the problem seems to be that if the same root page slot happens to be reused more than (1 << MAX_RUNTIME_RESOURCE_VERSIONS_BITS) == 256 times, then the actual RuntimeResourceID wraps around and gets reused. As page uninstalls happen when the streaming page is reused, not immediately when a resource is uninstalled, there can be a situation where an old unused streaming page sits around in the streaming pool while its associated root page gets reused 256 times. When it is then uninstalled, it will incorrectly think that it is still a live page and attempt to uninstall itself using the information from the new resource, which may crash.
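To illustrate, here is a standalone sketch of the wrap (the bit packing below is just an assumption for illustration, not the actual RuntimeResourceID layout):

#include <cassert>
#include <cstdint>

// Hypothetical packing: version in the upper bits, root page index in the lower
// bits. The shift amount is arbitrary; only the uint8 version wrap matters here.
static uint32_t MakeRuntimeResourceID(uint32_t RootPageIndex, uint8_t Version)
{
    return (uint32_t(Version) << 22) | RootPageIndex;
}

int main()
{
    uint8_t RootPageVersion = 1;
    const uint32_t OriginalID = MakeRuntimeResourceID(6, RootPageVersion);

    // The same root page slot gets reused (1 << MAX_RUNTIME_RESOURCE_VERSIONS_BITS) == 256
    // times, so the uint8 version counter wraps back to where it started...
    for (int Reuse = 0; Reuse < 256; ++Reuse)
    {
        ++RootPageVersion;
    }

    // ...and a completely different resource in slot 6 now gets the same runtime ID,
    // so a stale streaming page still tagged with that ID looks like a live page again.
    const uint32_t NewResourceID = MakeRuntimeResourceID(6, RootPageVersion);
    assert(NewResourceID == OriginalID);
    return 0;
}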
Definitely a nasty issue. In practice this seems to be extremely rare, probably because streaming happens in batches and it is rare for the same slot to be hit with both installs and uninstalls that many times in a short enough window for the streaming not to have reused the streaming page for something else in the meantime. This does not seem to show up in the Fortnite crash statistics, for instance.
I got a JIRA about it with nebulous repro steps, so I was actually just about to submit a fix for this to Main; it is just pending review. It will probably go in later today or tomorrow.
I have shelved it in UE5/Main in CL44777354 for now, so you can perhaps take a look at it already.
The fix is in UE5/Main now in CL44781127.
I also tried merging it over to 5.4. I have not done a lot of testing on that though, so it would be helpful if you can tell me if it works for you. Shelved in CL44789433 on UE5/Release-5.4.
Not sure about the specifics of the UninstallGPUPage implementation here, but yes, the intention is basically just to call UninstallGPUPage on all the virtual pages, using a version of it that doesn’t actually apply any fixups, which I assume is the false here.
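Roughly along these lines, sketched with stand-in types (FakePage/FakeResource and the bApplyFixup parameter are placeholders, not the actual 5.4 declarations):

#include <vector>

struct FakePage { bool bInstalled = true; };

struct FakeResource
{
    // Streaming pages this resource still holds in the streaming pool.
    std::vector<FakePage*> ResidentPages;
};

// Stand-in for UninstallGPUPage: tear the page down, optionally skipping fixups.
static void UninstallGPUPage_Sketch(FakePage& Page, bool bApplyFixup)
{
    // When bApplyFixup is true the real code would patch cluster/hierarchy
    // references; the removal path passes false because the resource is going
    // away entirely and there is nothing left worth fixing up.
    (void)bApplyFixup;
    Page.bInstalled = false;
}

// On resource removal, uninstall every page it still owns without applying
// fixups, since the whole resource is going away anyway.
static void RemoveResourcePages_Sketch(FakeResource& Resource)
{
    for (FakePage* Page : Resource.ResidentPages)
    {
        UninstallGPUPage_Sketch(*Page, /*bApplyFixup*/ false);
    }
    Resource.ResidentPages.clear();
}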
Regarding explicitly doing this:
Resources->NumResidentClusters = 0;
ModifiedResources.Add(Resources->RuntimeResourceID, Resources->NumResidentClusters);
I can see that this is different from what would happen in Main, in that in Main the root pages would not be subtracted off, so it would never hit zero.
I was assuming that sending a 0 residency request to the ray tracing manager would be fine, but it seems it isn’t.
It makes me think that maybe there is actually no need to update ModifiedResources when the resource is removed, as that must somehow be handled elsewhere in the ray tracing manager, but I will have to confirm with one of the ray tracing guys.
>I think the shelve could be fixed by using the ResidentPageIndex to index ResidentPageFixupChunks per resident page and subtracting the fixup chunk’s cluster count from the resource’s cluster count (and then removing the code that sets the resource’s cluster count to 0 at the end; the cluster count should be non-zero, but still add the modified resource entry), instead of forcing the count to 0 and adding an entry with that.
Yes, I think the same; that should be equivalent to what it was before. I’m just thinking that that last non-zero update can’t be important, as it must either get ignored by the ray tracing system, or it ends up happening but is never used since the resource gets removed from ray tracing right after as well.
The ray tracing guy told me that the last update to ModifiedResources on Remove shouldn’t matter.
Thanks for the quick response!
Grabbing this and testing it now; will let you know how it goes. We have a pretty much 100% repro on this at the moment, so I will be able to check whether it fixes that, and I can put it through some automated testing to see if we run into any new issues.
With the supplied shelve I appear to be crashing with the following stack:
X.exe!Nanite::FRayTracingManager::RequestUpdates(const TMap<unsigned int,unsigned int,FDefaultSetAllocator,TDefaultMapHashableKeyFuncs<unsigned int,unsigned int,0>> &) Line 349 C++
X.exe!FDeferredShadingSceneRenderer::Render(FRDGBuilder &) Line 2035 C++
X.exe!CaptureSceneToScratchCubemap(FRHICommandListImmediate &, FSceneRenderer *, const FReflectionCubemapTexture &, ECubeFace, int, bool, bool, const FLinearColor &, bool) Line 530 C++
X.exe!CaptureSceneIntoScratchCubemap::__l4::<lambda>(FRHICommandListImmediate &) Line 1407 C++
[Inline Frame] X.exe!UE::Core::Private::Function::TFunctionRefBase<UE::Core::Private::Function::TFunctionStorage<1>,void __cdecl(FRHICommandListImmediate &)>::operator()(FRHICommandListImmediate &) Line 555 C++
X.exe!FRenderThreadCommandPipe::EnqueueAndLaunch::__l5::<lambda>() Line 1852 C++
[Inline Frame] X.exe!UE::Core::Private::Function::TFunctionRefBase<UE::Core::Private::Function::TFunctionStorage<1>,void __cdecl(void)>::operator()() Line 555 C++
[Inline Frame] X.exe!TFunctionGraphTaskImpl<void __cdecl(void),1>::DoTaskImpl(TUniqueFunction<void __cdecl(void)> &, ENamedThreads::Type, const TRefCountPtr<FGraphEvent> &) Line 1731 C++
[Inline Frame] X.exe!TFunctionGraphTaskImpl<void __cdecl(void),1>::DoTask(ENamedThreads::Type, const TRefCountPtr<FGraphEvent> &) Line 1724 C++
X.exe!TGraphTask<TFunctionGraphTaskImpl<void __cdecl(void),1>>::ExecuteTask(TArray<FBaseGraphTask *,TSizedDefaultAllocator<32>> &, ENamedThreads::Type, bool) Line 1234 C++
[Inline Frame] X.exe!FBaseGraphTask::Execute(TArray<FBaseGraphTask *,TSizedDefaultAllocator<32>> &, ENamedThreads::Type, bool) Line 838 C++
X.exe!FNamedTaskThread::ProcessTasksNamedThread(int, bool) Line 779 C++
X.exe!FNamedTaskThread::ProcessTasksUntilQuit(int) Line 670 C++
X.exe!RenderingThreadMain(FEvent *) Line 458 C++
X.exe!FRenderingThread::Run() Line 610 C++
X.exe!FRunnableThreadWin::Run() Line 149 C++
X.exe!FRunnableThreadWin::GuardedRun() Line 71 C++
kernel32.dll!BaseThreadInitThunk() Unknown
ntdll.dll!RtlUserThreadStart() Unknown
The check “check(Data.NumResidentClustersUpdate > 0)” inside FRayTracingManager::RequestUpdates is failing. This seems pretty expected, because the shelve adds an update specifically with 0 clusters:
Resources->NumResidentClusters = 0;
ModifiedResources.Add(Resources->RuntimeResourceID, Resources->NumResidentClusters);
Looking at the change in mainline, it will add an entry to ModifiedResources as well, but the cluster number will be whatever it ends up being after the last resident page is uninstalled (which I assume is non-zero, as from a quick glance FRayTracingManager::RequestUpdates looks the same in mainline).
Okay, because of the changes we’ve backported we have UninstallGPUPage (which I suspect has evolved into UninstallResidentPage in Main). Thus I’m able to more closely replicate how Main is uninstalling pages (and avoid the crash) with the following code change:
[Image Removed]
This is what I’m testing now.
Note, if we didn’t have these other backports / didn’t have UninstallGPUPage (aka if we were on stock 5.4), I think the shelve could be fixed by using the ResidentPageIndex to index ResidentPageFixupChunks per resident page and subtracting the fixup chunk’s cluster count from the resource’s cluster count (and then removing the code that sets the resource’s cluster count to 0 at the end; the cluster count should be non-zero, but still add the modified resource entry), instead of forcing the count to 0 and adding an entry with that.
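Roughly what I mean, sketched with simplified stand-in types (the containers and field names here are assumptions based on the description above, not the exact 5.4 declarations):

#include <cstdint>
#include <map>
#include <vector>

struct FFixupChunkSketch { uint32_t NumClusters = 0; };

struct FResourcesSketch
{
    uint32_t RuntimeResourceID = 0;
    uint32_t NumResidentClusters = 0;
    std::vector<uint32_t> ResidentPageIndices; // streaming pages still resident for this resource
};

static void OnRemoveResource_Sketch(
    FResourcesSketch& Resources,
    const std::map<uint32_t, FFixupChunkSketch>& ResidentPageFixupChunks, // keyed by resident page index
    std::map<uint32_t, uint32_t>& ModifiedResources)
{
    // Instead of forcing NumResidentClusters to 0, subtract each resident page's
    // fixup chunk cluster count, mirroring what uninstalling the pages would do.
    for (uint32_t ResidentPageIndex : Resources.ResidentPageIndices)
    {
        const FFixupChunkSketch& FixupChunk = ResidentPageFixupChunks.at(ResidentPageIndex);
        Resources.NumResidentClusters -= FixupChunk.NumClusters;
    }

    // Still record the update, but with the resulting count (non-zero, since the
    // root pages were never subtracted) rather than 0, so the
    // check(Data.NumResidentClustersUpdate > 0) in RequestUpdates holds.
    ModifiedResources[Resources.RuntimeResourceID] = Resources.NumResidentClusters;
}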
Okay, so far in testing this has fixed the crash we had (I’ve gone 4 for 4 in what used to be roughly a 95% repro case). I’m going to start doing automated testing on that change on XSX, and if we don’t run into any new issues I think this is okay. I am going to keep the fix I have in 5.4 (which more closely matches Main: uninstalling each page specifically and sending the last ModifiedResources update), but I do agree the last update could be unnecessary (since it clearly wasn’t sent before anyway, when we just never uninstalled the pages).
Thanks for the help on this! I will report back on whether or not we encounter any stability issues in 5.4 with the change but I expect it will be okay.
Okay, we did some automated testing of this overnight (mostly on platforms without HWRT, FWIW) and didn’t encounter any new issues/crashes, so we are pretty confident this is stable.
Thanks for the help / fix!