See the extract from Insights:
[Image Removed]
[Attachment Removed]
Steps to Reproduce
Hi,
We are using asynchronous level loading within a flight simulator. Our paging in / out logic behaves well when loading levels from pak data files that are external to the shipped application. We are now focusing on performance and reviewing Insights traces in order to resolve frame drops.
During the loading phase, we see frames being dropped when new content gets loaded.
As our target hardware has massive GPU memory available, we want to set up GPU object pools that could be created at startup and remain persistent for the lifetime of the application.
I noticed pooling parameters for textures but I could not pinpoint an equivalent for object buffers. Here is an example of the spike we want to resolve:
[Image Removed]
80k is not much, and there is no reason for it to block the CPU for that long.
Thanks for your input,
Basile
[Attachment Removed]
Hi Sam,
Thanks for the answer. I will give those variables a try. I remember the topic from the last Fest in Orlando, and I have also read the post. I am sure improvements can be derived from it.
Any hints for these: [Image Removed]
[Image Removed]
I will be on site next week and may be able to collect more precise data.
Best,
[Attachment Removed]
Hi Sam,
Any suggestions for us?
Reverse engineering the source code is quite time consuming for individual hits…
Same question for this one:
[Image Removed] It feels like we are missing some preallocation that sounds achievable.
Another mystery for the moment :
[Image Removed]
Thanks,
Basile
[Attachment Removed]
Hi Basile,
Sorry for the longer wait time. As Sam already alluded to, setting up memory pools is tricky, so we will likely either have to move these allocations off the render thread or batch them to distribute them across multiple frames. There are a few issues at play here, which we would usually split into multiple tickets, but we will make an exception in this case. With that said, I need more info to figure out what is going on. Would it be too much trouble to send over two captures of your packaged game: one with Unreal Insights and one Superluminal capture? Could you also run a memreport around the time you are level loading, so that we can take a look at the kinds of assets you have loading in for your level? Thanks for your patience. Feel free to let me know if you need any clarification.
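To make the "batch them across multiple frames" option concrete, here is a minimal standalone C++ sketch of a per-frame budget queue. It is only an illustration under stated assumptions: `FrameBudgetedQueue`, `Enqueue`, and `Tick` are hypothetical names, not Unreal Engine APIs, and the cost unit (bytes, milliseconds, or anything else) is left to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical helper, not an Unreal Engine API: queues expensive resource
// creation jobs and runs only a bounded amount of them per frame.
class FrameBudgetedQueue {
public:
    // Each job carries an estimated cost (for example, bytes to allocate).
    void Enqueue(std::size_t cost, std::function<void()> job) {
        pending_.push_back({cost, std::move(job)});
    }

    // Runs queued jobs in FIFO order until the per-frame budget is spent.
    // At least one job always runs, so a job larger than the budget
    // cannot block the queue forever. Returns the number of jobs executed.
    std::size_t Tick(std::size_t budgetPerFrame) {
        assert(budgetPerFrame > 0);
        std::size_t spent = 0;
        std::size_t executed = 0;
        while (!pending_.empty()) {
            const std::size_t cost = pending_.front().cost;
            if (executed > 0 && spent + cost > budgetPerFrame) {
                break;  // Defer the remaining jobs to the next frame.
            }
            std::function<void()> job = std::move(pending_.front().job);
            pending_.pop_front();
            job();
            spent += cost;
            ++executed;
        }
        return executed;
    }

    std::size_t Pending() const { return pending_.size(); }

private:
    struct Entry {
        std::size_t cost;
        std::function<void()> job;
    };
    std::deque<Entry> pending_;
};
```

Calling `Tick` once per frame spreads a burst of creations over several frames instead of paying for all of them in one. The trade-off is added latency before a resource becomes available, so the budget has to be tuned against how quickly streamed content must appear.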
[Attachment Removed]
Hi Tim,
Thanks for the answer. I totally understand, and if splitting the tickets is preferred, I have no problem with it. I was in a support position in a former life, and it is clear that focus and data are required to make progress.
Can you suggest some capture flags for the Insights trace you want? In my experience, trace files are very large, and selecting the proper channels will hopefully let me get one across the support portal. In the meantime, I already sent a small one to another support ticket ( [Content removed] )
Can you also confirm what you mean by Superluminal? I guess you are referring to this ( https://superluminal.eu/ ), but I have never used that tool, so I would prefer confirmation before I take a crash course in it.
I am out of office this afternoon (France) so I hope to be able to generate the traces tomorrow.
Thanks,
Basile
[Attachment Removed]
Hi Basile,
Yeah, that makes sense. Launching the game with -trace=cpu,memory,metadata,assetmetadata,callstack,gpu should give us all the relevant information in Unreal Insights to take a closer look. And yes, that is the Superluminal that I am talking about. It should be straightforward to set up and get a trace. One thing to watch out for is to trace a development build so we can get more call stack information. Please let me know if you still need some extra clarification.
Cheers,
Tim
[Attachment Removed]
Hi Tim,
I generated a capture using Superluminal, with the development build and the suggested trace parameters, and executed the memory report before exiting the application. The run “simply” starts the application at a default startup location and loads high-resolution levels around the eye point.
No travelling, no simulated moving models, nothing else.
Looking at the “holes” in the traces, we can see the frame drops I am trying to understand / resolve :
[Image Removed]
[Image Removed]
I am presently uploading all correlated data to our FTP. I believe there is an option for creating external download links. The traces are larger than 1GB, so they cannot conveniently be shared through the portal!
I will share a link as soon as I have one. In the meantime, here is the memory dump.
I hope you will be able to shed some light on our hiccups.
Thanks,
Basile
Is there any benefit to Superluminal compared to NVIDIA Nsight Systems?
[Attachment Removed]
Hi Tim,
Please try this link to get the files :
https://www.dropbox.com/t/KUPbb0vgJgvf9shy
Thanks,
Basile
[Attachment Removed]
Hi Basile,
I got a chance to look at your traces. Unfortunately, the Superluminal capture appears corrupted, so I could not open it, but the Insights traces have provided some useful information. There are two things at play here with the resources that are being created in the screenshot you shared:
[Image Removed] First, the allocation is not going through the async resource-creation path via RHIAsyncCreateTexture2D, but rather through the synchronous path from InitRHI -> SetTextureReference. Secondly, the RHI commonly allocates committed resources when a texture size exceeds 4MB, using the big block allocator path, which is heavy. I inspected the allocations in Memory Insights and noticed that some Texture2D arrays are allocated around the time of the screenshot and are rather large. However, I cannot tell whether they used the big block allocator, since I don’t have your build symbols. What you could do is open the Memory Insights capture with your debug symbols, and you should be able to see where the allocation was made. Can you then send that information back to me?
Once I know the texture sizes and types, we can check whether we can move that to the async allocation path.
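As a rough way to pre-screen which textures might cross that 4MB mark, here is a small standalone C++ sketch. It is an estimate under loud assumptions: the threshold is read as 4 MiB, only the top mip of an uncompressed format is counted, and mip chains, block compression, and alignment are ignored. The function names are hypothetical and the engine's actual rule may differ.

```cpp
#include <cassert>
#include <cstdint>

// Assumed threshold taken from the discussion above (4MB, read here as 4 MiB).
// The engine's real cutoff and its per-format size rules may differ.
constexpr std::uint64_t kLargeAllocationThresholdBytes = 4ull * 1024ull * 1024ull;

// Size of the top mip of an uncompressed texture array. Ignores mip chains,
// block compression, and alignment padding, so it is only a lower bound.
inline std::uint64_t TopMipBytes(std::uint32_t width, std::uint32_t height,
                                 std::uint32_t bytesPerPixel, std::uint32_t slices) {
    assert(width > 0 && height > 0 && bytesPerPixel > 0 && slices > 0);
    return static_cast<std::uint64_t>(width) * height * bytesPerPixel * slices;
}

// True when the estimated size is strictly above the assumed threshold.
inline bool ExceedsLargeAllocationThreshold(std::uint64_t sizeBytes) {
    return sizeBytes > kLargeAllocationThresholdBytes;
}
```

For example, a single 1024x1024 slice at 4 bytes per pixel sits exactly at the assumed threshold, while a two-slice array of the same dimensions already exceeds it.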
[Image Removed]
As for the interrupted process stalls, I did notice some allocation logic happening on the render thread that is stalling out the GPU. I assume that is a similar issue, but I will need to investigate that further. You might also want to get another trace with the context-switch channel added, just to see if the game process is being preempted by something on your system.
[Image Removed]
I hope that makes sense, but please let me know if you have any further questions. I will try to update you in the coming days, once I find out more.
Cheers,
Tim
[Attachment Removed]
Hi Tim,
I hope you are fine despite the announcements that were made yesterday…
I am re-uploading the Superluminal capture that goes along with the trace you already have, to see how it goes this time.
https://www.dropbox.com/t/bkvqWCEznHjKStXc
I believe this is the stack that you are inquiring about:
[Image Removed]
I am not sure how you got the package name for the texture to display, but it does look like the allocation / time we are checking at the moment. It may not be the worrying allocation, as this texture array is loaded once and only once during initial startup; it does not belong to a specific level / area. Still, I am very eager to understand the journey and the methodology for getting the spike resolved. Traces are overwhelming with information, and without a lead it is hard to extract the meaningful parts.
I am also attaching another trace with the context switch data in it. I expect that tomorrow I will also get you a trace from the target platform. It may help us focus on the right bottlenecks.
Thanks,
Basile
[Attachment Removed]
Hi Tim,
The trace in the previous link does not have context switches. The channel name is contextswitch, not context-switch. It also requires the process to be launched as admin. It took me a couple of attempts to get it right.
You can find the trace here:
https://www.dropbox.com/t/Ov8WShZhWdfGCCsk
If my colleague is on the reference platform tomorrow, we will generate a full report from a test run.
Thanks again,
Basile
[Attachment Removed]
Hi Tim,
My colleague is on customer site so live traces will have to wait for next week. In the meantime, I did the following :
Here is the loading profile :
[Image Removed]
Now looking at the individual spikes, in order of magnitude.
[Image Removed] Please correct me if I am wrong, but this is a false alarm, as the low level memory tracker is disabled in shipping builds, …
It is kind of frustrating that the profiler induces this, but I will not complain if it goes away on its own. Note that the extra trace is coming from me.
3. I have some like this one:
[Image Removed] While I would be curious to know what would be required to add such content at runtime, an action is already ongoing on the production side to get rid of this specific asset, which is overkill for our present needs.
4. Interesting things begin now:
[Image Removed] The frame takes longer because the render thread is delayed. It is delayed because it waits much longer than usual / expected.
Yet the cores are available.
[Image Removed] Looking in more detail, worker 0 (and background worker 2) is dealing with most tasks in a serialized way!
[Image Removed] Switching to compact view, this is happening because all threads are “busy” (in quotes, since we saw that the cores are not actually being used).
[Image Removed] All these threads are trying to execute tasks that look like this: [Image Removed]
Now the questions:
Side question:
Last but not least, one frame seems to accumulate bad luck, with repeated but different contention:
Some more diagnosis / traces will be needed, but it does feel like the async loading thread is jamming everything.
[Image Removed]
Thanks in advance,
Basile
[Attachment Removed]
Hi Basile,
I am just about to head home for the day, so I will get into the details more later, but I noticed that the tasks that are running in serial are RHI translation tasks. In 5.6, we added support for running the translation in parallel, and the switch should be straightforward by setting r.RHICmd.ParallelTranslate.Enable. That could already help you get some immediate gains in that area. I will get you some more updates in the coming day or two, but keep me in the loop with any further updates.
Cheers,
Tim
[Attachment Removed]
Hi Tim,
I will try it but I feel / fear this is not the main problem at stake here. See the traces with instrumented source code :
This is the faulty frame.
[Image Removed] The waiting part, in the second half, is the problem. I left the RHI stack in the view on purpose.
It is waiting on a lock.
The lock has been taken by the workers:
[Image Removed] This is pending on the call to the DirectX device for the actual allocation.
Thanks,
Basile
PS: Preallocating the memory is, in my opinion, the best solution.
[Attachment Removed]
Hi Basile,
You are right: running the RHI translation tasks in parallel won’t fix your thread contention issue with the committed resource allocations. It was just something I noticed in passing, but it could still be beneficial for your project. I am still discussing with some devs how we can get you unblocked on the large resource allocations, but you also had some other questions earlier, which I want to answer in the meantime:
How can we control the actual number of threads in the scheduler?
By “scheduler”, I assume you mean the Task Graph scheduler. Did you come across any of the cvars in TaskGraph.cpp? There are a few cvars there that might be useful for you:
Is there a convenient way to identify the activity for tasks without labels?
Unfortunately, there is not. Typically, I like to enable the -task channel in an Unreal Insights trace, which provides a nice visualization of the call site that spawned the task, but that requires in-code instrumentation. There is a tiny sliver of a label below ExecuteForegroundTask in the screenshots you provided. Can you share what that says? I can also take a look at the capture if you upload it to Box.
[Image Removed]
[Attachment Removed]
Hi Tim,
Taking note about enabling parallel translation.
Regarding the scheduler, yes, I am referring to the one you mentioned. I would be tempted to allow more worker threads than there are logical cores.
-> Unless I am missing something, I cannot do this without modifying the engine source a bit.
The idea is quite simple: when there is contention, some tasks stay in the queues and are not executed (but they could be). I noticed a couple of patterns leading to multiple tasks blocking each other: the graphics ones I mentioned, but also some IO work, most likely reading data from disk.
[Image Removed] Playing with the number of foreground workers would be a start, but it would imply reducing the number of background ones. It may help with starvation of foreground tasks, but I feel it would slow the application overall, as foreground threads do not handle background tasks.
As for the task without a label, this was an example. There are some without even a tiny bit of information: [Image Removed] I will consider using the task channel you mention if I feel those become blocking.
I received some traces from the reference platform I have to analyze.
I will make a pass at them and will probably get back with a couple more precise questions.
Best,
Basile
[Attachment Removed]
Ok, sounds good. In the meantime, I have also received a response from the dev team about the spiky allocations. We just recently reworked the code path for upload heap allocations across two different CLs. You could try and cherry-pick them, or wait until they are released in 5.8 to try out:
CL 50684724: This change introduces asynchronous allocation of N pools for buffers, which should avoid large spikes when the upload pools are exhausted, which are the spikes you had in CreateCommittedResource, but you might need to tweak how many extra pools you need. The default values of d3d12.UploadHeap.OverFlowPoolsCount or d3d12.DefaultBuffer.OverflowPoolsCount can be adjusted further if you want to smooth out the hitches. We have been mostly conservative and tailored the defaults for examples like CitySample / Lyra, but since you have a lot of memory available on your server, you can be a bit greedy here.
CL 51777483: This change is more intrusive and less tested so far, but basically ALL D3D12 resources (buffers and textures) get created by default as Evicted and made resident at submit time. This should significantly reduce CPU creation time, moving the cost to submit time when the resource is first used. The big drawback of this technique is that there may be GPU page faults if you hit a path where residency is not properly handled. You might need to take fixes 52179465 and 52217430 as well. Since this change is pretty recent, we recommend you wait before integrating, but you can use it for reference or even try it directly.
I was also made aware that we are currently talking to Nvidia about a thread contention issue on the driver side, which could be related to your issue. If you can still get a working Superluminal capture to us, it would help us to confirm that.
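To illustrate why keeping extra pools ready can smooth out the hitches described for the first CL, here is a deliberately simplified standalone C++ model. It is not the engine's implementation: `OverflowPoolModel` and its methods are hypothetical, it only tracks counters, and `PrepareSpares` stands in for whatever background work creates pools ahead of demand.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model, not the engine's implementation: tracks only counters
// so the effect of keeping spare ("overflow") pools ready can be observed.
class OverflowPoolModel {
public:
    OverflowPoolModel(std::size_t poolSize, std::size_t spareTarget)
        : poolSize_(poolSize), spareTarget_(spareTarget) {
        assert(poolSize_ > 0);
    }

    // Stands in for background work that tops up the set of ready pools.
    void PrepareSpares() { spares_ = spareTarget_; }

    // Reserves `size` bytes from the current pool. When the pool cannot fit
    // the request, switch to a ready spare if one exists; otherwise a new
    // pool is created on the spot, which is counted as a blocking creation.
    void Allocate(std::size_t size) {
        assert(size <= poolSize_);
        if (used_ + size > poolSize_) {
            if (spares_ > 0) {
                --spares_;
            } else {
                ++blockingCreations_;
            }
            used_ = 0;  // Start filling the new pool.
        }
        used_ += size;
    }

    std::size_t BlockingCreations() const { return blockingCreations_; }

private:
    std::size_t poolSize_;
    std::size_t spareTarget_;
    std::size_t spares_ = 0;
    std::size_t used_ = 0;  // The first pool is assumed to exist at startup.
    std::size_t blockingCreations_ = 0;
};
```

In this model, ten 50-byte allocations against 100-byte pools cause four blocking creations when no spares are kept, and none when two spares are topped up between allocations. A higher spare count trades memory for fewer stalls, which matches the "be a bit greedy" advice for machines with plenty of memory.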
Feel free to let me know if you have any new questions.
Cheers,
Tim
[Attachment Removed]
Hi Tim,
Taking note that GPU preallocation may not be coming any time soon for us.
While I do experiment with a customized version of the engine, unless we are hitting a wall, we will stick to a vanilla version.
I am still hoping for a trace from the site later today, but on my side I also tried travelling around the loaded scene, and another pattern appears quite often: [Image Removed] When the world partition system is streaming new data in, it just blows up the frame time budget.
What leverage do we have against this?
Side question: I have only had a brief look for now, but do you know which cvar, if any, is driving the ParallelFor at the end of the call? At this point, the CPU seems quite available and the task is taking way too long…
The trace is available here:
https://www.dropbox.com/t/4MinbKZQnrWLsQsc
Thanks,
Basile
PS: I believe I uploaded the superluminal capture there : https://www.dropbox.com/t/bkvqWCEznHjKStXc
Is this one corrupt as well ?
While we are discussing Superluminal: since I installed the trial version, the vanilla Unreal Editor fails at link time for my application. I suspect the cs file for Superluminal detected the install and is looking for some lib that does not come with the engine. Does this ring any bell? I am mostly using the custom build and will likely uninstall Superluminal, so I did not investigate further.
[Attachment Removed]
Hi Tim,
I uploaded here a trace, still from my laptop, but attaching the pawn to the simulation and using terrain LODs…
https://www.dropbox.com/t/5MW2k46ZsyMNqTEA
There are still optimizations that we must apply, especially when it comes to Niagara effects, but here are the questions that come along with it. I am leaving aside the GPU memory allocations, which will have to wait for Unreal Engine 5.8 and / or custom development on the engine.
Also, not focusing on garbage collection which will be an issue on its own…
[Image Removed] If the debug information is provided through Windows / standard APIs, then I would like to apply the same approach within our internal tools.
I found this: static CORE_API void SetThreadName( const TCHAR* ThreadName );
However, I wonder if you are adding anything else within thread local storage or any other information.
Thanks,
Basile
[Attachment Removed]