Render thread gathering deleted resources

The “Gather Deleted Resources” step is taking more than 5 ms on the render thread, every frame.

[Image Removed]

What can contribute to such an impact? How can we smooth this out?

Thanks for any hint.

Basile

[Attachment Removed]

Steps to Reproduce
Hi,

We are profiling at runtime with Unreal Insights. Our goal is a flight simulator running at a stable 60 Hz.

I spent this week on site studying traces, using the source code as reference. There is one major contributor to our performance problem that I cannot explain.

The render thread capture in the description highlights the trace in question.

Thanks,

[Attachment Removed]

Hello,

Thank you for reaching out.

I’ve been assigned this issue, and we will be looking into this performance issue for you.

[Attachment Removed]

Hi Sam,

Thanks for helping us with this. For the sake of completeness, I went ahead and added extra trace scopes for Insights.

    void FRHIResource::GatherResourcesToDelete(TArray<FRHIResource*>& OutResources, bool bIncludeExtendedLifetimeResources)
    {
        SCOPE_CYCLE_COUNTER(STAT_GatherDeletedResources);

        if (bIncludeExtendedLifetimeResources)
        {
            TRACE_CPUPROFILER_EVENT_SCOPE_STR("FRHIResource::GatherResourcesToDelete: Extended life time");
            PendingDeletesWithLifetimeExtension.ConsumeAllLifo([&OutResources](FRHIResource* Resource)
            {
                TRACE_CPUPROFILER_EVENT_SCOPE_STR("FRHIResource::GatherResourcesToDelete: Emplace");
                OutResources.Emplace(Resource);
            });
        }

        TRACE_CPUPROFILER_EVENT_SCOPE_STR("FRHIResource::GatherResourcesToDelete: Others");
        PendingDeletes.ConsumeAllLifo([&OutResources](FRHIResource* Resource)
        {
            TRACE_CPUPROFILER_EVENT_SCOPE_STR("FRHIResource::GatherResourcesToDelete: Emplace");
            OutResources.Emplace(Resource);
        });
    }

    inline EConsumeAllMpmcQueueResult ConsumeAll(const F& Consumer)
    {
        // pop the entire stack
        FNode* Node = Head.exchange(nullptr, std::memory_order_acq_rel);
        if (Node == nullptr)
        {
            return EConsumeAllMpmcQueueResult::WasEmpty;
        }

        if (bReverse) // reverse the links to FIFO order if requested
        {
            FNode* Prev = nullptr;
            while (Node)
            {
                FNode* Tmp = Node;
                Node = Node->Next.exchange(Prev, std::memory_order_relaxed);
                Prev = Tmp;
            }
            Node = Prev;
        }

        while (Node) // consume the nodes of the queue
        {
            FNode* Next = Node->Next.load(std::memory_order_relaxed);
            T* ValuePtr = Node->Item.GetTypedPtr();
            Consumer(MoveTemp(*ValuePtr));

            {
                TRACE_CPUPROFILER_EVENT_SCOPE_STR("EConsumeAllMpmcQueueResult ConsumeAll: DestructItem");
                DestructItem(ValuePtr);
            }

            {
                TRACE_CPUPROFILER_EVENT_SCOPE_STR("EConsumeAllMpmcQueueResult ConsumeAll: AllocatorType::Free");
                AllocatorType::Free(Node);
            }

            Node = Next;
        }

        return EConsumeAllMpmcQueueResult::HadItems;
    }

This is what it looks like:

[Image Removed]

It tells me:

  1. The impact comes from the first container (extended lifetime).
  2. The cost is split about equally between the container's memory management (AllocatorType::Free(Node)) and the required Emplace calls.

My questions, to make progress:

  • “Do better”: Can anything be done to improve the allocator's memory release? As far as I understand, the container uses fixed-size blocks for all nodes, and this could probably be specialized. The code is written in such a way that swapping in a different allocator for testing should be straightforward…
  • “Do less”: I will set up a fixed position and level to assess the impact of our attempted fixes, but can you describe which graphics resources end up in this container? As far as we understand it, our level is nothing fancy…
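
To make the “do better” idea concrete, here is a minimal standalone sketch of a consume-all LIFO stack whose node allocator is a template parameter, so a different allocator can be swapped in for testing. It is not the engine's class: TSketchConsumeAllStack, FMallocNodeAllocator, FCountingNodeAllocator and DrainWithCountingAllocator are hypothetical names, and plain malloc/free stands in for whatever allocator one would actually want to compare.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <utility>
#include <vector>

// Hypothetical allocator policy: plain malloc/free, no engine allocator involved.
struct FMallocNodeAllocator
{
    static void* Malloc(std::size_t Size) { return std::malloc(Size); }
    static void Free(void* Ptr) { std::free(Ptr); }
};

// Hypothetical allocator policy that also counts Free calls.
struct FCountingNodeAllocator
{
    static std::atomic<int>& NumFrees()
    {
        static std::atomic<int> Count{0};
        return Count;
    }
    static void* Malloc(std::size_t Size) { return std::malloc(Size); }
    static void Free(void* Ptr) { ++NumFrees(); std::free(Ptr); }
};

// Simplified stand-in for a consume-all stack (assumption: not the engine implementation).
template <typename T, typename AllocatorType = FMallocNodeAllocator>
class TSketchConsumeAllStack
{
    struct FNode
    {
        FNode* Next;
        T Item;
    };

    std::atomic<FNode*> Head{nullptr};

public:
    ~TSketchConsumeAllStack() { ConsumeAll([](T&&) {}); }

    void Push(T Value)
    {
        void* Memory = AllocatorType::Malloc(sizeof(FNode));
        if (!Memory) { throw std::bad_alloc(); }
        FNode* Node = new (Memory) FNode{nullptr, std::move(Value)};
        Node->Next = Head.load(std::memory_order_relaxed);
        // Retry until the new node is linked in as the head.
        while (!Head.compare_exchange_weak(Node->Next, Node,
                   std::memory_order_release, std::memory_order_relaxed)) {}
    }

    // Detaches the whole stack in one exchange, then visits nodes newest first.
    template <typename F>
    bool ConsumeAll(const F& Consumer)
    {
        FNode* Node = Head.exchange(nullptr, std::memory_order_acq_rel);
        if (!Node) { return false; }
        while (Node)
        {
            FNode* Next = Node->Next;
            Consumer(std::move(Node->Item));
            Node->~FNode();
            AllocatorType::Free(Node); // one release per node: the cost under discussion
            Node = Next;
        }
        return true;
    }
};

// Pushes Count integers, drains them, and returns how many nodes were freed.
inline int DrainWithCountingAllocator(int Count, std::vector<int>& OutValues)
{
    TSketchConsumeAllStack<int, FCountingNodeAllocator> Stack;
    for (int Index = 0; Index < Count; ++Index) { Stack.Push(Index); }
    const int FreesBefore = FCountingNodeAllocator::NumFrees().load();
    Stack.ConsumeAll([&OutValues](int&& Value) { OutValues.push_back(Value); });
    return FCountingNodeAllocator::NumFrees().load() - FreesBefore;
}
```

Calling DrainWithCountingAllocator from a small test program shows one Free per consumed node, which is why the per-node release cost scales directly with the number of resources gathered each frame.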

Thanks,

Basile

[Attachment Removed]

Hello,

Thank you for reaching out.

Can you provide an Unreal Insights trace file and more information about your level?

You can see what resources are being queued for deletion by examining FRHIResource::ResourceType when stepping through with a debugger.
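
As an alternative to stepping through by hand, the gathered resources could be tallied per type and the counts logged once per frame. The sketch below only shows the tallying logic on stand-in types: ESketchResourceType, FSketchResource and TallyByType are hypothetical and are not the engine's FRHIResource; in the engine, the type would have to be read through whatever accessor your version exposes.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-in for the engine's resource type enum.
enum class ESketchResourceType { Texture, Buffer, UniformBuffer, Other };

// Hypothetical stand-in for a resource queued for deletion.
struct FSketchResource
{
    ESketchResourceType Type;
};

// Counts how many gathered resources there are of each type; null entries are skipped.
inline std::map<ESketchResourceType, std::size_t> TallyByType(
    const std::vector<const FSketchResource*>& Gathered)
{
    std::map<ESketchResourceType, std::size_t> Counts;
    for (const FSketchResource* Resource : Gathered)
    {
        if (Resource)
        {
            ++Counts[Resource->Type];
        }
    }
    return Counts;
}
```

A type that dominates the tally frame after frame would be the first candidate to investigate.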

[Attachment Removed]

Hi Stern,

I am attaching a smaller trace (2 pieces) generated from my laptop. It is a lighter version but already demonstrates the problem. (The original traces are more than 15 GB and would require a better transfer tool.)

[Image Removed]

Since most of the time is spent in the allocator release, I suppose that the sequence of preceding events worsens the issue (fragmentation or similar?).

In that small run, the eye point stands still at a startup location. Once everything is loaded, we see the anomaly.

As for our levels, they are quite simple in my opinion. There is a terrain skin made of textured meshes. On top of the terrain, there are HISM / ISM components for culture (trees, …) and some static meshes for non-recurring assets (buildings).

Thanks,

Basile

[Attachment Removed]

It looks like your cleanup mode is set to serialize. You may want to consider one of the other options here.

    static int32 GSceneRenderCleanUpMode = 2;
    static FAutoConsoleVariableRef CVarSceneRenderCleanUpMode(
        TEXT("r.SceneRenderCleanUpMode"),
        GSceneRenderCleanUpMode,
        TEXT("Controls when to perform clean up of the scene renderer.\n")
        TEXT(" 0: clean up is performed immediately after render on the render thread.\n")
        TEXT(" 1: clean up deferred until the start of the next scene render on the render thread.\n")
        TEXT(" 2: clean up deferred until the start of the next scene render on the render thread, with some work distributed to an async task. (default)\n"),
        ECVF_RenderThreadSafe
    );

Worth noting that your next render thread frame will wait for this to complete if it takes too long, but it does allow for a bit more parallelization here.

As for “why it takes so long”, I would like to know this as well.

[Attachment Removed]

Hi Brenden,

I am not sure I understand. Note that I am running Unreal Engine 5.6.2 for now.

The CVar looks like this:

    static int32 GSceneRenderCleanUpMode = 1;
    static FAutoConsoleVariableRef CVarSceneRenderCleanUpMode(
        TEXT("r.SceneRender.CleanUpMode"),
        GSceneRenderCleanUpMode,
        TEXT("Controls when to perform clean up of the scene renderer.\n")
        TEXT(" 0: clean up is performed immediately after render on the render thread.\n")
        TEXT(" 1: clean up is performed asynchronously in a task. (default)\n"),
        ECVF_RenderThreadSafe
    );

Also, when I checked, it was already running asynchronously: it dispatches a task that deletes the scene renderers:

    ENQUEUE_RENDER_COMMAND(SceneRenderBuilder_End)([this](FRHICommandListImmediate& RHICmdList) mutable
    {
        const auto DeleteLambda = [this]
        {
            TRACE_CPUPROFILER_EVENT_SCOPE(FSceneRenderProcessor::DeleteSceneRenderers);
            for (FSceneRenderer* Renderer : Renderers)
            {
                delete Renderer;
            }
            delete this;
        };

        if (GetSceneRenderCleanUpMode() == ESceneRenderCleanUpMode::Async)
        {
            FGraphEventArray Prereqs;
            Prereqs.Add(AsyncTasks.Cleanup);
            Prereqs.Add(AsyncTasks.Delete);

            AsyncTasks.Delete = FFunctionGraphTask::CreateAndDispatchWhenReady(DeleteLambda, TStatId(), &Prereqs);
        }
        else
        {
            DeleteLambda();
        }
    });

Looking at the trace, it is indeed running in a task.

Do you mean that the behavior around this was refactored in more recent versions?

As far as I can see, the variable does not drive the behavior of the step in question.

This one is also on my radar, as it occasionally blows up the frame time completely:

[Image Removed]

Thanks,

Basile

[Attachment Removed]

Hello,

Please see my response to your other question for how ParallelDraw tasks could be affecting the performance of this task: [Content removed]

[Attachment Removed]

Hi Stern,

I doubt CPU starvation is the cause here.

Here is a snapshot:

[Image Removed]

It is hard to read, but this is the compact view: while we are using most cores, a few are still idle. Also, I expect / hope that the threads critical to frame execution have a higher priority than the background worker threads.

That being said, I might have a lead. I need to do more profiling on site to confirm it, but I wonder whether this could be a side effect of the profiling itself. As you can see in the second capture I sent, where I added traces inside the “lifo” structure, most of the time is actually spent in the allocator's free method. I wonder whether the delay is introduced by the memory profiler, which most likely monitors each individual allocation / release. I started suspecting this while trying to generate a lighter trace (below 1 GB) to share with you: removing trace elements strongly reduced the delays in that method.

Again, this must be validated on site, but it may be a good explanation.

Basile

[Attachment Removed]

Hi,

I created a custom build of Unreal Engine with the changes below:

[Image Removed]

The idea is to bypass the Unreal custom allocator for this specific container.

In my test case, this neutralizes the impact and changes the function's profile:

[Image Removed]

The Emplace calls are now the dominant ones and the free almost disappears!

For reference, when tracing everything, the method is costly.

[Image Removed]

This is much less the case with a smaller trace.

[Image Removed]

It is quite frustrating, as I conclude from this that using the profiler significantly alters the application's profile. I am not sure which element of the trace is the culprit, and I cannot guarantee that all is fine.

However, based on these results, I can hardly pursue this lead, as it is unlikely to unlock performance in the final application.

Thanks anyway for the answers.

[Attachment Removed]

Hello,

Instrumented profiling does add overhead for the additional tracking required for markers.

The overhead can be perceivably high when dealing with frequent small events since the overhead accumulates on the timing trace.
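
As a rough illustration of how per-event overhead accumulates, the standalone sketch below counts how many trace records a per-item scope produces compared with a single scope around the whole loop. FSketchTraceLog, FSketchScope, SumPerItemScopes and SumSingleScope are hypothetical stand-ins, not the Unreal Insights API, and the sketch counts records rather than measuring time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a trace backend: just counts emitted records.
struct FSketchTraceLog
{
    std::size_t NumRecords = 0;
};

// Hypothetical scoped marker: one record on entry, one on exit.
class FSketchScope
{
public:
    explicit FSketchScope(FSketchTraceLog& InLog) : Log(InLog) { ++Log.NumRecords; }
    ~FSketchScope() { ++Log.NumRecords; }
    FSketchScope(const FSketchScope&) = delete;
    FSketchScope& operator=(const FSketchScope&) = delete;

private:
    FSketchTraceLog& Log;
};

// One scope per item: the record count grows with the number of items.
inline long long SumPerItemScopes(const std::vector<int>& Items, FSketchTraceLog& Log)
{
    long long Sum = 0;
    for (int Item : Items)
    {
        FSketchScope Scope(Log);
        Sum += Item;
    }
    return Sum;
}

// One scope around the whole loop: two records regardless of the item count.
inline long long SumSingleScope(const std::vector<int>& Items, FSketchTraceLog& Log)
{
    FSketchScope Scope(Log);
    long long Sum = 0;
    for (int Item : Items)
    {
        Sum += Item;
    }
    return Sum;
}
```

With 1000 items, the per-item variant emits 2000 records while the single-scope variant emits 2, so when the work inside each scope is tiny, the instrumentation can easily cost more than the work it measures.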

Please let us know if this helps.

[Attachment Removed]