Since upgrading to 5.5, a group of our players has been complaining about hitches/freezes. We have seen the issue once or twice internally as well, and through working with a very active community member we have traces indicating that the hitches stem from creating a new memory pool. The timing lines up with an engine upgrade from 5.3 to 5.5.4 that we shipped with one of our patches.
We have tried disabling Aftermath ResourceTracking, as that came up in another similar thread as a potential contention point when creating these pools. We have also tried swapping this player to Vulkan instead of DX12, with no luck. The data we have about their client performance confirms that hitches of at least 2 seconds started happening in 5-10% of this player's matches after the upgrade to 5.5.4. Our data also shows that in the matches where these hitches happen, the player's allocated VRAM is slightly lower on average than in the matches where no hitches occur. It is possible that some other application running at the same time could be impacting the behavior, but we have received a group of complaints that align with the upgrade of the client engine version.
If needed, I can provide traces or more logs, and I can instrument the code to get more context in the traces if that helps.
Steps to Reproduce
We have been unable to consistently reproduce, but we have a community member who typically encounters the issue once or twice a day across 8 hours of streaming.
I have not heard of any issues with those API calls in 5.5.4, but I will investigate this for you. Can you let me know if the person who is running into this crash is running on the latest GPU drivers? I see in the logs that they are using both an Nvidia and an AMD GPU. Does this crash occur on both GPUs? Do you have any custom modifications to the engine?
Hi everyone, thanks for the extra information. I have started reaching out to some folks to get to the bottom of this. I had seen some reports about a similar issue as far back as 5.3, but I will need to figure out if this is the same problem. In the meantime, can you take some captures with Nvidia Nsight Systems on the affected GPUs and share them with us? In case you haven’t used it, here is a link to the official documentation page: https://developer.nvidia.com/nsight-systems
Hello, I am sorry, I should have been clearer. Ideally, I would like an Nsight Systems capture at the time the hitch occurs. However, I see now that we have some data from a PIX capture, so we can work off that for now. I have contacted some people internally to see if we have encountered this issue recently, and I will get back to you as soon as I hear more. Thanks for your patience in the meantime, and if you manage to get an Nsight Systems capture, please do not hesitate to upload it.
Hello, I have a couple more questions: did you perform any memory profiling before and after the upgrade? If so, did you see any outstanding issues correlating with increased memory consumption?
Also, can you clarify the following sentence, which you had in your original question:
Our data also shows that in the matches where these hitches happen, the player's allocated VRAM is slightly lower on average than in the matches where no hitches occur
This means you saw hitches when the allocated amount of VRAM was lower than usual, yes?
I appreciate you providing us with more data to help diagnose the issue. Hitches like these are common when VRAM is low, but since you mentioned they do not only occur during low VRAM usage, we could be looking at any number of causes. Without detailed memory profiling, we cannot really give any guidance on how to solve your issue. Instead of Nvidia Nsight Systems, do you have a sampling profiler, such as Superluminal, that you could use to capture a trace during these hitches? That will likely be the most surefire way to determine what is happening during those frame spikes.
Dan, if you feel comfortable uploading the trace to our internal file sharing service, I can create a URL for the upload. Otherwise, you can choose any other file sharing service you would like, and I can work with that. Let me know what you think.
Jordan, that sounds good to me. We might only need the trace from Dan, so if it becomes too difficult to get it from your player, don’t worry.
That sounds good! Feel free to upload the SLP file with everything bundled together. I will let you know in case we need the PDBs as well. To upload your files, follow this link: Box | Login. I do not get notifications when a file has been uploaded, so please ping this thread once the upload is done.
As for the screenshot, I recently found out that we had an issue with “CreateCommittedResource” in the past where the operating system zeroed out the allocated memory for security reasons, so that might have something to do with it. It would be good to review the SLP trace to see what is going on.
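For background on that zeroing cost: D3D12 exposes D3D12_HEAP_FLAG_CREATE_NOT_ZEROED (on sufficiently recent Windows/SDK versions) to opt out of the clearing. As a rough, standalone illustration only (this is not the engine's actual call site, and whether the flag is appropriate here is a separate question), an upload buffer allocation passing it would look something like this:

#include <d3d12.h>
#include <wrl/client.h>

// Hypothetical helper: allocate an upload buffer while skipping the OS page zeroing
// mentioned above. Requires a Windows/SDK version that supports
// D3D12_HEAP_FLAG_CREATE_NOT_ZEROED.
Microsoft::WRL::ComPtr<ID3D12Resource> CreateUploadBuffer(ID3D12Device* Device, UINT64 SizeBytes)
{
    D3D12_HEAP_PROPERTIES HeapProps = {};
    HeapProps.Type = D3D12_HEAP_TYPE_UPLOAD;

    D3D12_RESOURCE_DESC Desc = {};
    Desc.Dimension        = D3D12_RESOURCE_DIMENSION_BUFFER;
    Desc.Width            = SizeBytes;
    Desc.Height           = 1;
    Desc.DepthOrArraySize = 1;
    Desc.MipLevels        = 1;
    Desc.Format           = DXGI_FORMAT_UNKNOWN;
    Desc.SampleDesc.Count = 1;
    Desc.Layout           = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;

    Microsoft::WRL::ComPtr<ID3D12Resource> Buffer;
    const HRESULT Result = Device->CreateCommittedResource(
        &HeapProps,
        D3D12_HEAP_FLAG_CREATE_NOT_ZEROED,   // skip the security zeroing of the new pages
        &Desc,
        D3D12_RESOURCE_STATE_GENERIC_READ,   // required initial state for upload heap buffers
        nullptr,
        IID_PPV_ARGS(&Buffer));

    return SUCCEEDED(Result) ? Buffer : nullptr;
}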
I got a chance to review the traces together with some devs. They confirmed that oftentimes these hitches happen due to the design of the Windows allocator and how UE currently handles resource/heap allocations on the critical path. As you pointed out, the affected allocations enter a critical section and take a lock, stalling other threads that need resources. What is curious, though, is that the allocations causing these hitches also seem larger than 64MB for some dynamic primitives, which is a bit concerning. Can you please let me know why you are making allocations this large? Fixing this will likely require a refactor of the engine code so that the allocation does not happen while the lock is held, but perhaps we can find an alternate solution.
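To make the failure mode concrete for anyone following along, here is a rough, standalone sketch of the pattern being described (the names are made up for illustration; this is not the actual engine code). The key point is that the expensive heap/resource creation happens while the pool's lock is held, so every other thread needing an allocation from the same pool waits out the entire driver call:

#include <algorithm>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Each pool stands in for a D3D12 heap / committed resource that gets sub-allocated.
struct FPool
{
    std::vector<char> Memory;
    size_t Offset = 0;
};

class FHypotheticalPoolAllocator
{
public:
    void* Allocate(size_t Size)
    {
        std::lock_guard<std::mutex> Lock(CS);   // all callers serialize here

        // Fast path: sub-allocate from an existing pool, no OS/driver call involved.
        for (FPool& Pool : Pools)
        {
            if (Pool.Offset + Size <= Pool.Memory.size())
            {
                void* Block = Pool.Memory.data() + Pool.Offset;
                Pool.Offset += Size;
                return Block;
            }
        }

        // Slow path: no pool has room, so a new one must be created. In the real
        // allocator this is where CreateCommittedResource/CreateHeap happens; under
        // memory pressure it can take tens to hundreds of milliseconds, and the lock
        // is held the whole time, so render/RHI/streaming threads pile up behind it.
        std::this_thread::sleep_for(std::chrono::milliseconds(25)); // stand-in for the driver call
        FPool NewPool;
        NewPool.Memory.resize(std::max<size_t>(Size, size_t(8) * 1024 * 1024)); // 8MB default pool
        NewPool.Offset = Size;
        Pools.push_back(std::move(NewPool));
        return Pools.back().Memory.data();
    }

private:
    std::mutex CS;
    std::vector<FPool> Pools;
};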
I’m sorry for not getting back to you sooner. Unfortunately, I do not have any new updates, as the devs I have been speaking with are currently out of the office and won’t return until the beginning of September. Is this something you need urgent help with? I can try to escalate the case then.
It looks like this is caused by contention on the pool allocator: one thread is allocating a new 8MB heap and the other threads are waiting on it. But I am pretty surprised that allocating a new 8MB upload buffer takes 23 ms. Are you running close to, or over, your RAM or VRAM budget when this happens? Keeping more pools of the default size around in the upload heap allocator should minimize this problem, and possibly preallocating a set of them during startup would help as well. Can you try setting InFrameLag to UINT64_MAX in FD3D12UploadHeapAllocator::CleanUpAllocations for the small and big block allocators to see if that helps? It can use a bit more memory, but I am wondering whether the issue then starts happening somewhere else for resource creation or if it is only related to the upload heaps somehow.
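In case it helps, here is a rough sketch of that experiment against the engine source, assuming FD3D12UploadHeapAllocator::CleanUpAllocations in your 5.5 branch simply forwards InFrameLag to the small-block and big-block allocators (member names, and any additional allocators it touches, may differ in your engine version, so please verify against your code):

// Sketch only -- verify against your engine's actual FD3D12UploadHeapAllocator.
void FD3D12UploadHeapAllocator::CleanUpAllocations(uint64 InFrameLag)
{
    // Experiment: treat pools as never stale so they stay resident instead of being
    // destroyed after a few frames and re-created later on the critical path.
    // This trades some extra memory for avoiding new-heap allocation stalls.
    const uint64 EffectiveFrameLag = UINT64_MAX;
    (void)InFrameLag;

    SmallBlockAllocator.CleanUpAllocations(EffectiveFrameLag);
    BigBlockAllocator.CleanUpAllocations(EffectiveFrameLag);
}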
Hi, we’re currently experiencing this exact issue. Our machine has 24GB of VRAM and 256GB of RAM, so it’s unlikely that we’re reaching any limits in our scene. We also clamp most of our textures to 2K, though we do allow some occasional 4K textures. Has there been any internal movement from Epic on this issue? We are on 5.4.
We are also seeing a similar issue. We are on 5.5, but I managed to repro it with our in-progress 5.6 integration. In our case it happens with Nanite streaming, and it usually starts when the system reports high VRAM usage. We reproduced it on an RTX 3080 with 10GB of VRAM, and we see stalls of 200 to 300ms.
[Image Removed]
When Nanite streams, depending on the number of pages to upload, it may need a large buffer, up to 16MB. For that it uses the BigBlockAllocator.
By default d3d12.UploadHeap.BigBlock.PoolSize is set to 8MB so it usually only fits one of those buffers.
These pools are freed pretty quickly (after 20 frames) in FD3D12DynamicRHI::RHIEndFrame.
A few seconds later, the next time Nanite needs to stream in a large amount of pages it will probably allocate a new buffer, potentially stalling again.
I suppose that anytime CreateCommittedResource is called from the RenderThread it may create frame stutter if the system has high VRAM usage.
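Until there is a proper engine-side fix, one experiment that follows from the analysis above (alongside the InFrameLag suggestion earlier in the thread) would be to raise d3d12.UploadHeap.BigBlock.PoolSize so a full 16MB Nanite upload fits in an existing pool instead of forcing a new heap. A sketch of how that might look in a startup config, assuming the cvar is read when the upload heap allocator is initialized and that its value is in bytes (please verify both, along with the default, in your engine version):

; ConsoleVariables.ini (project or engine), applied at startup
[Startup]
; Assumed to be in bytes; the analysis above says the default corresponds to 8MB.
; 33554432 = 32MB, enough to hold a couple of 16MB Nanite upload buffers.
d3d12.UploadHeap.BigBlock.PoolSize=33554432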
Thanks for the response! The player that we have been working with has confirmed they are on the latest drivers. In the LogInit section I see that they have an AMD CPU and an NVIDIA GPU. I would guess that any AMD GPU you’re seeing is probably the integrated one that comes with that CPU model, so I don’t believe they would be able to run the game on it. The game doesn’t usually crash when this happens, although it has happened. Typically their PC is unresponsive for that time, and they can continue to play after the deadlock resolves.
We do have some engine modifications, but we try to minimize those to facilitate taking occasional engine upgrades.
The current theory we’re investigating is that the issue seems to be more frequent on subsequent launches of the game without a PC restart in between.
I was trying to think of how best to coordinate with the player who has been having the problems, but realized you might be asking for something less involved. Are you asking for a specific capture that includes the moment of the hitch? It’s still an uncommon occurrence for us, but your post only asks for some captures from the impacted GPUs. Are you just asking for some baseline captures for analysis, or for a Systems trace that includes a hitch?
Given our lack of a consistent repro and the smaller general capture window for Nsight Systems, we likely won’t be able to work with the player to get a similar capture, but I’ll still see if we can get one internally in case it shows a different callstack than the one provided in the PIX image.
We did not specifically profile memory in a direct pre/post comparison, but we do have some summaries and trends that indicate it isn’t just memory pressure. I am attaching an image here that shows per-match averages for frame time, game thread, render thread, their standard deviations, and RAM/VRAM. I am also sharing an older image that shows more history from the 5.3 matches to demonstrate that the issue started as the 5.5 patches began to ship.
To clearly answer the most recent questions:
Did we perform a direct memory profile in this case? No, but we do have data that implies memory isn’t directly related.
Did we see hitches when VRAM was lower than usual? Yes, but if you look at the graph you’ll see it was probably just dumb luck, as the hitches aren’t specifically tied to lower or higher VRAM.
I managed to get a Superluminal trace. The SLP file is around 2GB. Is there any way to upload it?
[Image Removed]
About memory usage:
When VRAM is low, these hitches start to happen more often (for example, when running the Steam build in parallel with the editor open).
However, even in perfect conditions on my 24GB VRAM card, I can still see one or two in any Insights trace I gather locally, usually when the game triggers level loads in the background, but they also pop up in other places as I move around the world.