Pathtracer crashing GPUs - what's happening?

We have a scene that’s rendering through Deadline on a mix of 4090s (24GB) and A6000s (48GB). We are getting GPU crashes, and looking at the logs, the 4090s typically crash when the memory needed exceeds the budget, which is fully understandable. They typically lose their footing during a groom conversion to raytrace prims. My first question: is there any way to pre-cache that process so the only geo held in VRAM is the raytrace-ready geo?

The other issue is with the A6000s. They crash too, but they typically still have quite a bit of headroom in their VRAM budget, anywhere from 6-18GB remaining. The message they give when crashing is related to the ray depth samples getting deallocated (I imagine this is a pixel compositing step), and they crash partway through that process. This typically happens after multiple successful frames, so I’m wondering if it’s related to memory fragmentation in any way? Or, if you’ve seen it before, has anyone successfully troubleshot this issue? I’ll post logs below for the A6000.

Overall, has anyone encountered and solved these problems? Is there a way to pre-cache and use the raytrace-ready prims? Is there any command to clear or defragment VRAM on a per-frame basis?

We obviously prefer the performance level of the 4090s, but ultimately we need to get final frames out, so we want to land on a predictable methodology.

Thanks!!

A6000 logs - here you can see the ClearBuffer steps and the VRAM budget (plenty of room):

                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 17, BeginEvent [ClearBuffer(PathTracer.NumActivePaths Size=12bytes)]",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 18, EndEvent",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 19, BeginEvent [Path Tracer Compute (1920 x 1080) Tile=(0,0 - 1920x1080) Sample=49/512 NumLights=95 (Bounce=1)]",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 20, EndEvent",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 21, BeginEvent [ClearBuffer(PathTracer.ActivePaths1 Size=8294400bytes)] - LAST COMPLETED",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 22, EndEvent",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 23, BeginEvent [ClearBuffer(PathTracer.NumActivePaths Size=12bytes)]",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 24, EndEvent",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 25, BeginEvent [Path Tracer Compute (1920 x 1080) Tile=(0,0 - 1920x1080) Sample=49/512 NumLights=95 (Bounce=2)]",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 26, EndEvent",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 27, BeginEvent [ClearBuffer(PathTracer.ActivePaths0 Size=8294400bytes)]",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 28, EndEvent",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 29, BeginEvent [ClearBuffer(PathTracer.NumActivePaths Size=12bytes)]",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 30, EndEvent",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 31, BeginEvent [Path Tracer Compute (1920 x 1080) Tile=(0,0 - 1920x1080) Sample=49/512 NumLights=95 (Bounce=3)]",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 32, EndEvent",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 33, BeginEvent [ClearBuffer(PathTracer.ActivePaths1 Size=8294400bytes)]",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 34, EndEvent",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 35, BeginEvent [ClearBuffer(PathTracer.NumActivePaths Size=12bytes)]",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 36, EndEvent",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 37, BeginEvent [Path Tracer Compute (1920 x 1080) Tile=(0,0 - 1920x1080) Sample=49/512 NumLights=95 (Bounce=4)]",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 38, EndEvent",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 39, BeginEvent [ClearBuffer(PathTracer.ActivePaths0 Size=8294400bytes)]",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 40, EndEvent",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 41, BeginEvent [ClearBuffer(PathTracer.NumActivePaths Size=12bytes)]",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Op: 42, EndEvent",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error: DRED: No PageFault data.",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error: Memory Info from frame ID 88:",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Budget:    47803.00 MB",
                "2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Used:    28194.48 MB",
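
For reference, the headroom in that dump works out to 47803.00 - 28194.48 = 19608.52 MB. Here is a minimal sketch for pulling the same figure out of a Deadline log automatically. It assumes the `Budget:` and `Used:` lines always look like the ones above and are always reported in MB; the function name `vram_headroom_mb` is just illustrative, not part of Unreal or Deadline.

```python
import re

# Matches the "Budget:" / "Used:" lines in the D3D12 RHI memory dump.
# Assumption: values are always reported in MB, as in the log above.
MEM_RE = re.compile(r"LogD3D12RHI: Error:\s+(Budget|Used):\s+([\d.]+) MB")

def vram_headroom_mb(log_text):
    """Return (budget, used, headroom) in MB, or None if the dump is incomplete."""
    values = {}
    for kind, amount in MEM_RE.findall(log_text):
        values[kind] = float(amount)  # keep the last value seen for each kind
    if "Budget" not in values or "Used" not in values:
        return None
    budget, used = values["Budget"], values["Used"]
    return budget, used, budget - used

# Two lines copied from the A6000 log above.
sample = '''
"2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Budget:    47803.00 MB",
"2024-07-16 15:44:41:  0: STDOUT: [2024.07.16-22.44.41:134][ 12]LogD3D12RHI: Error:     Used:    28194.48 MB",
'''
budget, used, headroom = vram_headroom_mb(sample)
print(f"budget={budget:.2f} MB used={used:.2f} MB headroom={headroom:.2f} MB")
```

Running something like this over every failed task would show whether the A6000 crashes cluster at a particular usage level or happen regardless of how much VRAM is free.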


Here's another example that's less common but similar: it's doing the env captures and not making it through the MipGen process, again with plenty of memory:

            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 50, BeginEvent [DistantHeightFog]",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 51, EndEvent",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 52, BeginEvent [CloudView (CS) 2048x2048] - LAST COMPLETED",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 53, EndEvent",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 54, EndEvent",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 55, BeginEvent [Capture Face=5]",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 56, BeginEvent [Capture Sky Materials]",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 57, BeginEvent [CaptureSkyMeshReflection]",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 58, EndEvent",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 59, EndEvent",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 60, BeginEvent [DistantHeightFog]",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 61, EndEvent",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 62, BeginEvent [CloudView (CS) 2048x2048]",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 63, EndEvent",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 64, EndEvent",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 65, BeginEvent [MipGen]",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 66, EndEvent",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 67, BeginEvent [MipGen]",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 68, EndEvent",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 69, BeginEvent [MipGen]",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 70, EndEvent",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 71, BeginEvent [MipGen]",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 72, EndEvent",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 73, BeginEvent [MipGen]",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error: DRED: No PageFault data.",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error: Memory Info from frame ID 77:",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Budget:    47803.00 MB",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Used:    15981.05 MB",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.17:022][  6]LogD3D12RHI: Error: Aftermath: Writing Aftermath dump to: D:/Render/OLT_Previz_53/unreal/workarea/Saved/Logs/UEAftermathD3D12.nv-gpudmp",
            "2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.17:038][  6]LogD3D12RHI: Error: GPU Crashed or D3D Device Removed."
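
To compare crashes across many frames, a small script can pull out the op tagged `LAST COMPLETED` and the first `BeginEvent` that follows it. This is only a sketch: the regex is based purely on the line format in the logs above, the function name `summarize_breadcrumbs` is made up for illustration, and treating the next `BeginEvent` as the likely point of failure is my assumption about how to read the DRED breadcrumbs, not something the log states.

```python
import re

# One DRED breadcrumb line: "Op: <n>, <BeginEvent|EndEvent> [<label>]" with an
# optional " - LAST COMPLETED" suffix. Format assumed from the logs above.
OP_RE = re.compile(
    r"Op: (\d+), (BeginEvent|EndEvent)(?: \[(.*)\])?( - LAST COMPLETED)?"
)

def summarize_breadcrumbs(log_text):
    """Return (last_completed, next_begin), each an (op, label) tuple or None."""
    last_completed = None
    next_begin = None
    for line in log_text.splitlines():
        m = OP_RE.search(line)
        if not m:
            continue
        op, kind, label, done = int(m.group(1)), m.group(2), m.group(3), m.group(4)
        if done:
            # Remember the marked op and restart the search for what follows it.
            last_completed = (op, label)
            next_begin = None
        elif last_completed and next_begin is None and kind == "BeginEvent":
            next_begin = (op, label)
    return last_completed, next_begin

# Four lines copied from the env capture log above.
sample = '''
"2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 52, BeginEvent [CloudView (CS) 2048x2048] - LAST COMPLETED",
"2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 53, EndEvent",
"2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 54, EndEvent",
"2024-06-21 08:41:17:  0: STDOUT: [2024.06.21-15.41.16:980][  6]LogD3D12RHI: Error:     Op: 55, BeginEvent [Capture Face=5]",
'''
last_completed, next_begin = summarize_breadcrumbs(sample)
print("last completed:", last_completed)
print("next begin:    ", next_begin)
```

On the two dumps posted here, the marker sits on Op 21 (a `ClearBuffer` of `PathTracer.ActivePaths1`) in the path tracer log and on Op 52 (`CloudView`) in the env capture log.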