GPU Bubble from Ray Tracing translation?

Hello,

We are noticing a GPU bubble of 7-8 ms between our RayTracingDynamicGeometryUpdate and Base pass. From investigation, it seems this is due to the latency between calls to RHI_SubmitToGPU on the RHI thread, which spends a lot of time translating SetRayTracingBinding commands.

Is this a known issue or perhaps we’re doing something wrong at our end? Also, should the translation be handled by worker threads instead of the RHI thread? Thanks!

[Image Removed]

[Attachment Removed]

Upon further investigation, it looks like command submission is blocked until the buffer copy on the copy queue is submitted/executed.

[Image Removed]

This is the code in question. The last line inserts a sync point, but if I’m reading this correctly it should be a GPU-only sync point and shouldn’t block command submission on the graphics queue?

const FRHIBufferCreateDesc CreateDesc =
 FRHIBufferCreateDesc::Create(TEXT("ShaderBindingTable"), BufferSize, 0, BUF_Static)
  .SetInitialState(ERHIAccess::CopyDest)
  .SetGPUMask(FRHIGPUMask::FromIndex(Device->GetGPUIndex()));
ID3D12ResourceAllocator* ResourceAllocator = nullptr;
Buffer = Adapter->CreateRHIBuffer(
    BufferDesc,
    BufferDesc.Alignment,
    CreateDesc,
    ED3D12ResourceStateMode::MultiState,
    D3D12_RESOURCE_STATE_COPY_DEST,
    /*bHasInitialData*/ true
);
// Use copy queue for uploading the data
Context.BatchedSyncPoints.ToWait.Emplace(Buffer->UploadResourceDataViaCopyQueue(Context, &Data));

This is what it looks like in PIX. The graphics queue is flushed while there are two small copies in the copy queue.

[Image Removed]

[Attachment Removed]

Hey,

Yes, the bubble you are seeing is unfortunately real. Because building RT bindings can take a significant amount of time, we introduced the persistent SBT (r.RayTracing.PersistentSBT 1); are you using that? It should be enabled by default in 5.7. If you are already using it and it still takes this much time, you’d need to investigate why so many of the bindings are transient and updated every frame (i.e. you’d have a lot of dynamically spawned objects). I’d also suggest trying inline ray tracing mode if possible, or you could try moving the sync points for the SBT buffer from the copy queue to the graphics queue to unblock the GPU.
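For reference, a minimal ConsoleVariables.ini sketch of the persistent SBT suggestion above (the cvar name is the one quoted in this thread; whether the bubble disappears depends on how many bindings are transient in your project):

```ini
; Persistent SBT: avoid rebuilding all ray tracing bindings every frame.
; Enabled by default in 5.7; on earlier versions set it explicitly.
r.RayTracing.PersistentSBT=1
```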

[Attachment Removed]

Hi Aleksander, thank you for the response and apologies for the delay in replying. We are using the persistent SBT, and moving the SBT buffer copy to the graphics queue unfortunately did not help. I have a colleague investigating why there are so many updates every frame. We don’t have VFX in the ray tracing scene and there are not a ton of dynamically spawned objects. Is there any console command or similar to help get more info on this?

[Attachment Removed]

Hello Aleksander, I am a colleague of Sakib. I have been digging a bit more, trying to understand how all of this works, and I have some information I’d like to validate:

After what Sakib did with the graphics queue, I tried increasing the pool size via d3d12.UploadHeap.BigBlock.PoolSize to 32 MB, and also increased the BufferPoolDeletionFrameLag value inside FD3D12DynamicRHI::RHIEndFrame so the buffers linger a bit longer and get reused. Looking at some Unreal Insights traces, this helped the ShaderTableCommit graph overall, but it does not really help our overall perf.

Looking into it a bit more, I checked how many primitives we are processing. Using stat RayTracingGeometry I got the following:

[Image Removed]

I see that Nanite needs to rebuild constantly and that we have a big geometry count; with Nanite, almost everything behaves as dynamic. We knocked a few of them down using commands such as r.RayTracing.Geometry.Text=0 and r.RayTracing.Geometry.ProceduralMeshes=0, which reduced our memory footprint. We already had a reasonably good configuration before, so we don’t really see a perf win yet.
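For anyone following along, these are the kinds of toggles we used (the list above was not exhaustive; the exact set of geometry types to disable is project-specific):

```ini
; Disable ray tracing support for geometry types we don't need in the RT scene.
r.RayTracing.Geometry.Text=0
r.RayTracing.Geometry.ProceduralMeshes=0
```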

Finally, checking the stats with stat D3D12Raytracing, the shader binding table needs to record a bunch of information every time. Setting the hit groups takes the biggest share of the time, inside FD3D12CommandContext::RHISetBindingsOnShaderBindingTable.

Specifically, this code inside the lambda:

if (BindingType == ERayTracingBindingType::HitGroup)
{
    if (Binding.BindingType != ERayTracingLocalShaderBindingType::Clear)
    {
        //UE_LOG(LogD3D12RHI, Log, TEXT("Set hit record data for RecordIndex %d on SBT %#016llx with mode: %d"), Binding.RecordIndex, ShaderTableForDevice, Binding.BindingType);
        const FD3D12RayTracingGeometry* Geometry = FD3D12DynamicRHI::ResourceCast(Binding.Geometry);
        SetRayTracingHitGroup(Device,
            ShaderTableForDevice, Binding.RecordIndex,
            Pipeline, Binding.ShaderIndexInPipeline,
            Geometry, Binding.SegmentIndex,
            Binding.NumUniformBuffers,
            Binding.UniformBuffers,
            Binding.LooseParameterDataSize,
            Binding.LooseParameterData,
            Binding.UserData,
            Binding.BindingType,
            Context.WorkerIndex);
    }
}

I added a couple of extra cycle counters and ran the ParallelFor single-threaded to measure it. The thing is just heavy; it seems to me we have way too many individual ray tracing shaders and the cost scales linearly with them. What strategies can we take to help this process?

My hardware specs are the following:

Processor: AMD Ryzen Threadripper PRO 5965WX

Memory: 64 GB

Graphics: RTX A5500

[Attachment Removed]

That’s not that many geometries. What about instance count, and how many SBT entries do you have?

You mentioned that with Nanite almost everything behaves as dynamic. Could you confirm you are running on PC with a static binding layout and bindless enabled? This will cut down the number of entries that need updating. If all of this is true, I’d suggest checking a couple of things:

  1. What is actually in the array returned by FRayTracingShaderBindingTable::GetDirtyBindings()? How many entries are actually Transient?
  2. Do you have custom UBs that are not in the shader binding layout?
  3. Does the performance change when you make all entries dirty (r.RayTracing.PersistentSBT.ForceAlwaysDirty)?
  4. Is the persistent SBT being recreated all the time? You would see “Recreating Persistent SBTs due to initializer changes” in the log.
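As a quick sketch, point 3 can be checked from the console (the cvar and stat names are the ones already mentioned in this thread):

```ini
; Point 3: force every SBT entry dirty each frame and compare the cost,
; e.g. while watching `stat D3D12Raytracing`.
r.RayTracing.PersistentSBT.ForceAlwaysDirty=1
```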

Hopefully that will help narrow down the problem.

[Attachment Removed]

Hello Aleksander

I am sorry for the late response; I have been assigned to other tasks and only just found time to look at this again. So far we are not using bindless in our project. I checked FRayTracingShaderBindingTable::GetDirtyBindings() for the specific area we are testing in our game, and once loading is done and the game has stabilized I see a couple of hundred entries that are transient, which is not much.

For your other questions:

Do you have custom UBs that are not in the shader binding layout? No, we don’t.

Does the performance change when you make all entries dirty (r.RayTracing.PersistentSBT.ForceAlwaysDirty)? Yes, definitely: forcing always dirty adds another 2-3+ ms.

Is the persistent SBT being recreated all the time? No, we are good here.

But I have a bit of good news:

In our project settings we found we had incorrectly set r.Lumen.HardwareRaytracing.LightingMode to 2. We reverted the value to the default of 0 to use the surface cache for the hits, which drastically reduced the cost of setting the bindings on the SBT. This not only made the bubble we were looking at in Sakib’s captures disappear, but also helped the GPU: the copy queue was actually waiting for the ParallelFor that sets the bindings to finish. It reduced the time from 5+ ms to under 1.0 ms.
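For anyone hitting the same issue, the change boils down to this (the cvar name is as written above; 0 is the default, using the Lumen surface cache for hits):

```ini
; Use the Lumen surface cache for hardware ray tracing hits instead of
; full hit lighting; this is the default and is much cheaper on the SBT.
r.Lumen.HardwareRaytracing.LightingMode=0
```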

[Attachment Removed]

Glad you found a workaround! I think the shader binding layout requires bindless to work, so it’s understandable you still saw the cost. Having said that, if you can run with just inline ray tracing that’s much better, as all of this cost is skipped; that’s what we usually recommend doing.

[Attachment Removed]