GPU Crash in RTAS build for hair/grooms with Mutable

Setting the following cvars had no effect on the occurrence of this crash. It happens on

  • r.RayTracing.PersistentSBT=0
  • D3D12.ResidencyManagement=0
  • r.HairStrands.CompressedPosition=0
  • r.HairStrands.Cards.BulkData.AsyncLoading=-1

[Attachment Removed]

Steps to Reproduce
We’re having an issue similar to the one described in this forum thread: 5.6.1 D3D12 GPU crash in RayTracing AnyHit

The difference is that the hang/AddressTranslationError occurs in an AS Build or Refit rather than the AnyHit shader.

I added some more detailed event markers to get a better idea of what’s happening in the breadcrumbs; check the attached logfile. Surprisingly, the breadcrumbs seem to show that the HairStrands::CardsDeformation passes are being executed on both the Graphics queue and the AsyncCompute queue.

  GraphBuilder.AddPass(
    RDG_EVENT_NAME("HairStrands::CardsDeformation(%s)", bSupportDynamicMesh ? TEXT("Dynamic") : TEXT("Static")),
    Parameters,
    ERDGPassFlags::Compute,
    [Parameters, ComputeShader, DispatchCount, CardsRestPositionBuffer, CardsRestTangentBuffer, bManualFetch](FRDGAsyncTask, FRHIComputeCommandList& RHICmdList)
    {
      // On platforms not supporting manual vertex fetching, ensure the resources are in 'VertexOrIndexBuffer' state after position/normals update
      if (!bManualFetch)
      {
        RHICmdList.Transition(FRHITransitionInfo(CardsRestPositionBuffer, ERHIAccess::Unknown, ERHIAccess::SRVMask));
        RHICmdList.Transition(FRHITransitionInfo(CardsRestTangentBuffer, ERHIAccess::Unknown, ERHIAccess::SRVMask));
      }
      FComputeShaderUtils::Dispatch(RHICmdList, ComputeShader, *Parameters, DispatchCount);
      if (!bManualFetch)
      {
        RHICmdList.Transition(FRHITransitionInfo(CardsRestPositionBuffer, ERHIAccess::SRVMask, ERHIAccess::VertexOrIndexBuffer));
        RHICmdList.Transition(FRHITransitionInfo(CardsRestTangentBuffer, ERHIAccess::SRVMask, ERHIAccess::VertexOrIndexBuffer));
      }
    });



My understanding is that passing ERDGPassFlags::Compute means the RHICmdList supplied by GraphBuilder should execute the pass (and the dispatch call inside it) on the graphics/default queue. So my question is: how trustworthy are the breadcrumbs here, and is it intended for these interpolation passes to run on AsyncCompute?

If it helps, this is how I added the markers in GroomManager.cpp.

  // Cards only - Deform final cards geometry (using guides)
  for (uint32 InstanceIndex : CardInstances)
  {
    FInstanceData& InstanceData = InstanceDatas[InstanceIndex];
    FHairGroupInstance::FCards::FLOD& LOD = *InstanceData.CardInstance;

    if (InstanceData.bNeedDeformation)
    {
      // 1. Cards are deformed based on guides motion (simulation or RBF applied on guides)
      if (InstanceData.CardsSimulationType == EHairCardsSimulationType::Guide)
      {
        RDG_EVENT_SCOPE_STAT(GraphBuilder, HairCardsInterpolation, "CardsDeformationPass");
        AddHairCardsDeformationPass(
          GraphBuilder,
          ShaderMap,
          View->GetFeatureLevel(),
          ShaderPrintData,
          InstanceData.Instance,
          InstanceData.HairLODIndex,
          InstanceData.MeshLODIndex,
          CardPositionExternalAccessPipeline);
      }
    }
  }

[Attachment Removed]

Hi there,

We have a few rare crash reports with similar breadcrumbs in these internal ray tracing shaders. Unfortunately, I wasn’t able to find any known fixes in this area since 5.6, but here are some related changes:

CL#45964370 Enable groom RT geometry with r.hairstrands.raytracing is enabled.

CL#47955242 Fix groom crash on AMD when strands are stretched or invalid

CL#46715638 Do not force async compute pipeline when hair raytracing geometry is disabled.

I’m assigning this to my colleague who is more familiar with the changes in this area but is out of the office till later next week. Please escalate the issue if you need assistance before then.

[Attachment Removed]

Thank you Alex! I am having trouble finding those CL numbers, though I’ve already backported change 46142446, which has the same description as your change 45964370.

What files do the other two CLs touch?

[Attachment Removed]

Hi,

The CLs I posted above are from //UE5/Main. I’ve updated the links to point to the GitHub commits, if that helps. The other changes are in GroomManager.cpp, and the AMD crash fix is in NiagaraDataInterfaceHairStrandsTemplate.ush.

[Attachment Removed]

The GitHub links were exactly what I needed to try the CLs out, thank you. Unfortunately, I was immediately able to reproduce the same crash with those changes unshelved.

I don’t think it’s directly related, but I’ve observed that this intermittent crash tends to almost always repro after I hit the ensure in CheckMatrixPrecision:

  ensureMsgf(OriginX <= OriginMax && OriginY <= OriginMax && OriginZ <= OriginMax,
    TEXT("Found precision loss while converting matrix to GPU format, verify the input transforms. ")
    TEXT("This error usually indicates the view transform is invalid, or the PreViewTranslation/ViewOrigin was not set up correctly."));

That ensure is hit in a callstack related to generating distance fields from a cable generator. We do have r.RayTracing.Geometry.Cable=0 set in our project because of "Raytracing Error Crash in 5.6 - #2 by BRGEzedeRocco". That may be unrelated, but I thought I’d mention it, since this other question also mentioned disabling other geometry for ray tracing in order to work around a different crash (in their case Niagara meshes, which we incidentally also have disabled in our project).

[Attachment Removed]

Spent some more time looking at AddHairCardsDeformationPass.

It seems that the AddCopyBufferPass behind !bHasLODSwitch is sometimes being duplicated on the AsyncCompute queue. There aren’t explicit transitions around the copybuffer pass; is GraphBuilder.UseInternalAccessMode sufficient for RDG to properly set up transitions for CardsDeformedPositionBuffer_Curr? It’s used as a copy source before and after the CardsDeformation compute dispatch, which accesses it as a UAV.

[Attachment Removed]

It looks like for D3D12, the appropriate barriers are set up during RHICopyBufferRegion:

  FScopedResourceBarrier ScopeResourceBarrierSrc(*this, pSourceResource, &SourceBuffer->ResourceLocation, D3D12_RESOURCE_STATE_COPY_SOURCE, 0);
  FScopedResourceBarrier ScopeResourceBarrierDst(*this, pDestResource , &DestBuffer->ResourceLocation, D3D12_RESOURCE_STATE_COPY_DEST , 0);
  FlushResourceBarriers();

So the question is why some of these copies are ending up on the AsyncCompute queue and some on the Graphics queue. The RHI command looks like it ought to put it on the copy queue, but maybe the breadcrumbs don’t show the copy queue?

[Attachment Removed]

Ok, I think I was misinterpreting the RHI breadcrumbs. It seems it’s possible for a cmdlist containing both a build and an update for the same RTAS to be dispatched. Probably related to the updates not going through GRayTracingGeometryManager. I’ll try having the updates go through the manager.
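
For anyone wanting to confirm the same condition, here is a minimal standalone sketch of the check: scan one batch of acceleration structure requests and report any geometry that has both a build and an update queued. Everything in it (EASOp, FASRequest, FindBuildUpdateConflicts, the integer geometry ids) is a hypothetical stand-in written for illustration, not an engine type or API.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for illustration; none of these are engine types.
enum class EASOp : std::uint8_t { Build, Update };

struct FASRequest
{
    std::uint64_t GeometryId; // identifies one RTAS, e.g. a geometry pointer cast to an integer
    EASOp Op;
};

// Returns the ids of geometries that have both a Build and an Update
// in the same batch. Each conflicting geometry is reported once.
std::vector<std::uint64_t> FindBuildUpdateConflicts(const std::vector<FASRequest>& Batch)
{
    // Per-geometry bit mask: bit 0 = saw a Build, bit 1 = saw an Update.
    std::unordered_map<std::uint64_t, unsigned> Seen;
    std::vector<std::uint64_t> Conflicts;
    for (const FASRequest& Request : Batch)
    {
        unsigned& Mask = Seen[Request.GeometryId];
        const unsigned Before = Mask;
        Mask |= (Request.Op == EASOp::Build) ? 1u : 2u;
        if (Before != 3u && Mask == 3u)
        {
            Conflicts.push_back(Request.GeometryId); // first time both ops are seen
        }
    }
    return Conflicts;
}
```

In a real integration the batch would have to be gathered wherever the builds and updates are recorded, which is an assumption this sketch does not cover.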

[Attachment Removed]

Unfortunately, relying on the dynamic geometry manager to update the hair RT geo results in the hair RT geo lagging a frame behind, which I suppose is why it was done this way.

Instead I added an early-return to UpdateHairAccelerationStructure if the RT geo has pending build requests or requires a build. Still testing since it’s an intermittent crash but so far so good.
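
Roughly, the guard looks like the standalone sketch below. FMockRayTracingGeometry and UpdateHairAccelerationStructureSketch are hypothetical stand-ins so the snippet compiles on its own; the accessor names HasPendingBuildRequest and GetRequiresBuild are assumed to mirror what the engine's geometry class exposes and should be checked against your engine version.

```cpp
// Hypothetical stand-in for the engine's ray tracing geometry; not the real class.
struct FMockRayTracingGeometry
{
    bool bRequiresBuild = false;       // a full build has been requested but not yet executed
    int PendingBuildRequestIndex = -1; // -1 means no build is currently queued
    int UpdateCount = 0;               // counts issued updates, for illustration only

    bool HasPendingBuildRequest() const { return PendingBuildRequestIndex != -1; }
    bool GetRequiresBuild() const { return bRequiresBuild; }
};

// Sketch of the early-return: skip the update (refit) while a build is outstanding,
// so a build and an update for the same RTAS never share a command list.
// Returns true if an update was issued.
bool UpdateHairAccelerationStructureSketch(FMockRayTracingGeometry& Geometry)
{
    if (Geometry.HasPendingBuildRequest() || Geometry.GetRequiresBuild())
    {
        return false; // leave the geometry alone until the build has run
    }
    ++Geometry.UpdateCount; // stands in for recording the update on the command list
    return true;
}
```

The trade-off is that the hair RT geo skips its update on frames where a build is still pending.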

[Attachment Removed]

Hi, I tried to reproduce that issue on our side, but I couldn’t manage to get a crash. Did your workaround work?

/Charles.

[Attachment Removed]

Yes it did! Haven’t seen crashes with breadcrumbs in hair raytracing geo since I checked in that workaround. Thanks for checking in.

[Attachment Removed]