Unrecoverable 200ms+ GPU frame times after nearing the (local) VRAM budget (PC, D3D12). Resources appear to page to shared memory and never page back, even when local memory is available. Possibly related to r.RDG.TransientAllocator

We recently completed a port of ARK: Survival Ascended from UE 5.2 to 5.5. Soon after releasing this we received many complaints of horrible performance from gamers, primarily on 8 GB cards. These users had not had issues on 5.2 with the same hardware and scalability settings.

The affected players would play the game for some time with normal performance, then reach a part of the map with a higher memory footprint and start getting 200ms+ GPU frame times. They would usually not recover once in this state.

After some investigation we found that we could work around this issue by turning off r.RDG.TransientAllocator. Our current hack workaround is to turn off the transient allocator when we detect less than 1 GB of available local VRAM. We would like a real fix, though, to get gamers back to where they were on the 5.2 version, as not using the transient allocator is a significant perf hit.
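Our current detection is roughly the following (a simplified sketch of the hack, not our exact shipping code; the 1 GB threshold is just what we settled on):

```cpp
// Poll the local VRAM budget via DXGI and disable the transient allocator
// when headroom drops below 1 GB. Assumes an IDXGIAdapter3* is available.
#include <dxgi1_4.h>
#include "HAL/IConsoleManager.h"

void UpdateTransientAllocatorWorkaround(IDXGIAdapter3* Adapter)
{
	DXGI_QUERY_VIDEO_MEMORY_INFO LocalInfo = {};
	if (FAILED(Adapter->QueryVideoMemoryInfo(0, DXGI_MEMORY_SEGMENT_GROUP_LOCAL, &LocalInfo)))
	{
		return;
	}

	const uint64 HeadroomBytes = (LocalInfo.Budget > LocalInfo.CurrentUsage)
		? LocalInfo.Budget - LocalInfo.CurrentUsage
		: 0;

	static IConsoleVariable* CVar =
		IConsoleManager::Get().FindConsoleVariable(TEXT("r.RDG.TransientAllocator"));
	if (CVar)
	{
		// Below 1 GB of headroom: fall back to non-transient allocations.
		CVar->Set(HeadroomBytes < (1ull << 30) ? 0 : 1, ECVF_SetByGameSetting);
	}
}
```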

We haven’t had much luck on our own with this. But generally, it appears more resources are being moved to shared memory than previously (in 5.2), even when there is available local memory. I’m not convinced it is the transient allocator itself; possibly there is more high-priority memory now, which is squeezing out other high-priority memory (such as render targets). I don’t know that residency itself is to blame. The residency manager seems to be operating under the overall local + non-local budget, which we are never approaching. This seems like it might be resident non-local VRAM as opposed to non-resident memory, but that is just a guess on my part.

Curious if other products moving to 5.5 have had similar issues. Any help would be appreciated.

Steps to Reproduce
This is easiest to reproduce with 8 GB GPUs. In our game you can reproduce the issue on “Low” settings on 8 GB GPUs, but it is easier to reproduce with settings that consume more memory: higher resolution, larger texture pool budget, etc. The same GPU and settings did not reproduce the issue on 5.2.

Hi there,

Looking at your memreport, I can see the following stats regarding your transient memory usage:

859.375MB - Texture Memory Requested - STAT_RHITransientTextureMemoryRequested - STATGROUP_RHITransientMemory - STATCAT_Advanced

1755.688MB - Buffer Memory Requested - STAT_RHITransientBufferMemoryRequested - STATGROUP_RHITransientMemory - STATCAT_Advanced

2615.062MB - Memory Requested - STAT_RHITransientMemoryRequested - STATGROUP_RHITransientMemory - STATCAT_Advanced

128.000MB - Memory Aliased - STAT_RHITransientMemoryAliased - STATGROUP_RHITransientMemory - STATCAT_Advanced

640.000MB - Memory Used - STAT_RHITransientMemoryUsed - STATGROUP_RHITransientMemory - STATCAT_Advanced
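For reference, these values can be captured directly in your build with the following console commands (memreport includes the transient memory stats as part of its output):

```
memreport -full
stat RHITransientMemory
```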

These values indicate that the transient heap cache is using 640MB, which is fairly significant. It is likely that the transient heap cache memory requirements have increased since version 5.2 of the engine due to more work being moved to the GPU’s async compute queue. Transient memory required by the async compute queue is harder to alias (reuse space that isn’t needed anymore) with memory required by the graphics queue. This is because the memory might be required by the async compute queue at almost any time (between a graphics fork and join event) during graphics queue execution. Transient memory required by the async compute queue is therefore more likely to need its own unique space, in one of the transient heaps, that doesn’t overlap with any of the space used by resources required by the graphics queue.

The best way to debug your transient heap allocations is through Unreal Insights. To give a bit of background, the transient allocator uses a cache of pre-allocated GPU (device local) memory heaps. By default, each heap is 128MB. A new heap is only allocated if a particular transient resource cannot be placed in an existing heap. When resources are placed in a heap, they can alias with other resources in the same heap (occupy the same memory) as long as their lifetimes are guaranteed not to overlap. As noted above, this is harder when async compute is involved.

Below you can see an Insights trace of a test scene with the RDG trace channel enabled and set to visualize the transient heaps. On the left you can see that this scene uses 3 memory heaps (Memory Ranges 0-2). You can see an example of the async compute restriction in the TSR Decimate history pass (buffers labeled in the second screenshot, associated render pass in the third screenshot), which requires a number of large resources on the async compute queue. The two turquoise passes highlighted in the third screenshot show where the graphics queue fork and join events occur. This means that the memory required for these TSR resources can’t occupy the same space as anything required by the graphics queue between these two points, which is why they occupy their own unique spaces in the heaps.
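If you want to capture a similar trace yourself, launching with the RDG trace channel enabled should be enough (MyGame.exe is just a placeholder for your executable), then open the resulting trace in Unreal Insights:

```
MyGame.exe -trace=default,rdg
```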

[Images removed: Insights trace showing the transient heap memory ranges, the labeled TSR Decimate history buffers, and the associated render passes]

Regarding your memory pressure and residency, from your memory report it looks like you are running very close to, and probably exceeding, your memory budget. Under the rhi.DumpResourceMemory section of your memreport, I can see that you are using 7317MB of non-transient resource memory. Combined with the transient memory usage of 640MB we got from stat RHITransientMemory, it appears you are using around 7957MB of RHI resource memory. I can also see that quite a few of your largest RHI resources are getting evicted (made non-resident) due to your high memory pressure (all the entries under rhi.DumpResourceMemory that are missing the Resident flag have been evicted by the residency manager).

As you get closer to the VRAM memory budget, the residency manager will start evicting memory more aggressively (to system memory). Eviction starts at 70% of your memory budget, where resources will be evicted if not used for more than 1 minute. Once memory pressure reaches 100%, evictions happen after 1 second of non-use. Once you go over your memory budget, any resources that are required for the frame, but are not already resident, will also need to evict currently resident resources to make room. This is most likely the source of your unrecoverable performance drop.
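To make the policy concrete, the thresholds described above behave roughly like this (a simplified sketch; the actual residency manager logic is more involved):

```cpp
#include <cstdint>

// Should a resource that has been unused for IdleSeconds be evicted,
// given current usage against the budget?
bool ShouldEvict(uint64_t UsageBytes, uint64_t BudgetBytes, double IdleSeconds)
{
	const double Pressure = double(UsageBytes) / double(BudgetBytes);
	if (Pressure < 0.70)
	{
		return false;               // Below 70% of budget: no eviction.
	}
	if (Pressure < 1.0)
	{
		return IdleSeconds > 60.0;  // 70-100% of budget: evict after 1 minute of non-use.
	}
	return IdleSeconds > 1.0;       // At/over budget: evict after 1 second of non-use.
}
```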

I am unsure why turning off r.RDG.TransientAllocator would give you more memory headroom. If anything, I would expect you to end up with less memory headroom due to having to allocate all memory for transient resources up front, and not being able to alias memory that isn’t used anymore within the frame. Looking at an insights trace with the RDG trace channel enabled, and set to visualize transient heaps, should give you some more insights into what might be happening here. If you still have a 5.2 build, it might be useful to compare the RDG insights between the two different versions.

For debugging what other resources are occupying your VRAM you can use the Render Resource Viewer in the editor, or run rhi.DumpResourceMemory in the console. The full list of options for rhi.DumpResourceMemory is as follows: rhi.DumpResourceMemory [<Number To Show>] [all] [summary] [Name=<Filter Text>] [Type=<RHI Resource Type>] [Transient=<no, yes, or all>] [csv]. Memreport will also run and report rhi.DumpResourceMemory, but without the `all` option, meaning it will only output the top 50 largest RHI resources.
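For example (the filter values here are just illustrative):

```
rhi.DumpResourceMemory 100 summary
rhi.DumpResourceMemory all csv
rhi.DumpResourceMemory Type=Texture2D
rhi.DumpResourceMemory Name=TSR Transient=yes
```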

[Image removed: Render Resource Viewer]

Regarding residency vs locality, the concept of residency only really applies to resources that are originally allocated in device local memory (VRAM). For a resident resource to be made non-local, it would have to be evicted and made non-resident. So the concept of a resource being resident, but non-local (in shared memory), doesn’t really make sense to me. The engine can, and does, allocate some resources directly in non-local memory (these are generally upload and feedback buffers). However, residency doesn’t really apply to these, since they are never moved to device local memory (except by copying into other buffers that are already resident). The engine does not move a resource to non-local storage (say to free VRAM) without evicting it and making it non-resident through the residency manager.

Actually, after reading the D3D12 residency library readme (available here) and looking a bit more into its relationship with the OS’s own video memory management system (VidMM), it appears that it IS possible to have resident non-local memory. The definition of residency, according to the D3D12 residency library, is whether or not a resource is accessible by the GPU. So when a resource is evicted by the residency manager, it becomes completely unmapped from GPU-accessible address space. According to this documentation, eviction happens to disk (it does not go into shared memory), so the concept can be applied to both local and non-local GPU-addressable memory.

VidMM also has the final say on what actually gets evicted. When Evict is called by the residency manager, the resource is only marked for eviction; VidMM will try not to actually evict these resources unless it needs to. On the other hand, if you are over your memory budget, the residency manager may not be sufficient to keep your memory within budget. In that case VidMM may start evicting GPU resources opaquely under the hood, without informing the residency manager. This is the case where resident non-local resources could occur: the resource is still technically accessible by the GPU, but the GPU will stall when accessing it, and Unreal has no knowledge of this. However, I think this scenario is very unlikely, as the residency manager should be robust enough to prevent it from happening (resources cannot be made resident until enough resources have been evicted to make space available).
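For reference, these are the underlying D3D12 calls the residency manager is driving; a minimal sketch (error handling mostly elided, names illustrative):

```cpp
#include <d3d12.h>

// Evict() only *marks* an allocation as evictable; VidMM decides when (and if)
// the memory actually moves. MakeResident() re-maps it into GPU-accessible
// address space, and can fail if there is no room.
void DemoteAndRestore(ID3D12Device* Device, ID3D12Pageable* Resource)
{
	ID3D12Pageable* Objects[] = { Resource };

	// After this, the resource must not be accessed by the GPU until it is
	// made resident again.
	Device->Evict(1, Objects);

	const HRESULT Result = Device->MakeResident(1, Objects);
	if (FAILED(Result)) // e.g. E_OUTOFMEMORY
	{
		// The caller must evict other resources to make space, then retry.
	}
}
```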

I hope this was clear, and helps you debug your issues further. Let me know if you have any more questions regarding this.

Regards,

Lance

Hi Lance, I think I figured out what the main issue is for us. We were compiling our game with D3D12_RHI_RAYTRACING=0, and if you look at FD3D12Adapter::SetResidencyPriority, without ray tracing that function is a no-op.

I just brought FD3D12Device::GetDevice5() out of the ifdef, so D3D12_RHI_RAYTRACING is not required to call SetResidencyPriority(), and everything now behaves as I would expect. We made this change to exclude ray tracing during our transition to 5.5, as we were getting crashes in the RT code and we never turn RT on anyway. That probably explains the difference from 5.2. It also explains why messing with priorities never had any effect when I tried that.
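For anyone else hitting this, the failure pattern is roughly the following (a paraphrased sketch, not the exact engine source; ID3D12Device5 inherits the ID3D12Device1 interface that SetResidencyPriorities lives on):

```cpp
#include <d3d12.h>

// Sketch: setting a residency priority requires an ID3D12Device1 (or later)
// interface. In our build that interface was only acquired inside the
// D3D12_RHI_RAYTRACING ifdef, so this whole body effectively compiled out.
void SetResidencyPrioritySketch(ID3D12Device* Device,
                                ID3D12Pageable* Resource,
                                D3D12_RESIDENCY_PRIORITY Priority)
{
	ID3D12Device1* Device1 = nullptr;
	if (SUCCEEDED(Device->QueryInterface(IID_PPV_ARGS(&Device1))))
	{
		ID3D12Pageable* Objects[] = { Resource };
		const D3D12_RESIDENCY_PRIORITY Priorities[] = { Priority };
		Device1->SetResidencyPriorities(1, Objects, Priorities);
		Device1->Release();
	}
}
```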

Anyways, I’m embarrassed I didn’t see that before, but I never looked at the FD3D12Adapter::SetResidencyPriority source code and just assumed it was a simple wrapper.

I’m not sure how many games ever actually compile without RT (possibly none besides us), but it doesn’t seem like FD3D12Adapter::SetResidencyPriority should be tied to RT for any reason, right?

Thank you for reporting this issue - it does look like this was overlooked in a past refactor and needs to be addressed. I’ve created the following public issue for tracking, which should be visible soon: Unreal Engine Issues and Bug Tracker (UE-305620)

Thank you Lance for all the information. I had been on vacation, but will begin investigating further based on your suggestions (above). One thing I would note is that it did appear to me that the budget passed into the residency manager included both local and non-local VRAM (which for the system I was testing on was 8 GB + 128 GB), so residency was never evicting anything when I was testing. I will have to run through and verify that again though, as my recollection may be wrong there. I also remember passing a strictly local budget into the residency manager and that not really affecting the issue at all.

Small update: although the documentation states that eviction happens to disk, I was pretty suspicious of this statement, so I tested it in Unreal to confirm the behavior. I made a small stress test that continuously allocates new textures until the GPU runs out of memory to see what happens. The result was that dedicated GPU memory usage increased until it reached a certain level (about 10 GB of the 16 GB on my laptop), at which point system memory usage started increasing. Once dedicated GPU memory usage maxed out, system memory usage continued to increase, with no extra disk writes or paging file activity that I could see. So this indicates to me that eviction actually goes to system memory, despite what the documentation says. My shared memory stayed low and didn’t change at all throughout the entire test run.
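The stress test was along these lines (a rough sketch; the texture size and names are illustrative):

```cpp
#include "Engine/Texture2D.h"

// Keep allocating transient textures (one per call, e.g. per tick) and watch
// dedicated vs. shared vs. system memory in Task Manager. Each 4096x4096
// BGRA8 texture is roughly 64 MB. Textures are rooted so the GC can't free them.
TArray<UTexture2D*> GLeakedTextures;

void AllocateOneMoreTexture()
{
	UTexture2D* Texture = UTexture2D::CreateTransient(4096, 4096, PF_B8G8R8A8);
	if (Texture)
	{
		Texture->AddToRoot();      // Prevent garbage collection.
		Texture->UpdateResource(); // Create the underlying RHI/GPU resource.
		GLeakedTextures.Add(Texture);
	}
}
```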

I do see what you mean, though, about both local and non-local VRAM being added together in the residency manager when calculating available space for resources to be paged back in. This indicates that it might be possible for the driver to page into non-local memory if the dedicated memory is completely full. The documentation isn’t really clear on whether this can actually happen though. It may be that paging back into local memory passes the available memory checks, but the calls to MakeResident just fail for device-local allocated resources when the dedicated VRAM is full, which would then force a sync wait and eviction to make room. It might be possible to create a test case that can better evaluate the behavior here if you would like me to investigate that further.
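For clarity, the combined-budget calculation I’m describing is roughly the following (paraphrased, not the residency library’s exact code):

```cpp
#include <dxgi1_4.h>
#include <cstdint>

// Available space is computed against the combined local + non-local budgets,
// so MakeResident may be attempted even when the dedicated (local) segment
// alone is already full.
uint64_t CombinedAvailableBytes(IDXGIAdapter3* Adapter)
{
	DXGI_QUERY_VIDEO_MEMORY_INFO Local = {};
	DXGI_QUERY_VIDEO_MEMORY_INFO NonLocal = {};
	Adapter->QueryVideoMemoryInfo(0, DXGI_MEMORY_SEGMENT_GROUP_LOCAL, &Local);
	Adapter->QueryVideoMemoryInfo(0, DXGI_MEMORY_SEGMENT_GROUP_NON_LOCAL, &NonLocal);

	const uint64_t Budget = Local.Budget + NonLocal.Budget;
	const uint64_t Usage  = Local.CurrentUsage + NonLocal.CurrentUsage;
	return Budget > Usage ? Budget - Usage : 0;
}
```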

Regards,

Lance

I doubt you are the only one compiling with ray tracing disabled. That macro is defined based on the value of

[/Script/WindowsTargetPlatform.WindowsTargetSettings]

bEnableRayTracing

in either BaseEngine.ini or DefaultEngine.ini

This seems like it will only affect one code path currently (when ray tracing is disabled), which is a call to SetResidencyPriority() in FD3D12Resource::CommitReservedResource. While the FD3D12TransientHeap::FD3D12TransientHeap constructor does call SetResidencyPriority(), it should currently be redundant, due to the subsequent call to Heap->DisallowTrackingResidency() instead of Heap->BeginTrackingResidency().

That’s not to say the code path in FD3D12Resource::CommitReservedResource is rarely used, however. It is used by VSM physical pages by default, and also by the Nanite Streaming Manager for its ClusterPageData.

I’m going to elevate this to a subject matter expert at Epic to get confirmation on whether the FD3D12Adapter::SetResidencyPriority function should actually be conditioned on ray tracing being enabled.

Regards,

Lance