UE 5.8 Release - Instant Vulkan Crash (VK_ERROR_DEVICE_LOST) on Linux with RTX 3090 Ti / NVIDIA Driver

I spent the day with Claude trying to track down what I could with this. Here are my results:

UE 5.8 Source — Recurring VK_ERROR_DEVICE_LOST on Linux (AMD RDNA4 / RADV), Multiple Independent Render Systems

Summary

Editor crashes with vkQueueSubmit returning VK_ERROR_DEVICE_LOST (VkResult=-4), consistently and reproducibly, on UE 5.8 source builds, on Linux, on an AMD RX 9070 XT (RDNA4 / gfx1201). The crash has occurred during at least four structurally unrelated GPU compute passes across separate engine systems (Lumen, Nanite scene culling, Global Distance Field, Nanite hit-proxy culling), with the stuck point varying between occurrences. Extensive isolation testing (below) rules out scene content, a known prior UE 5.7 Vulkan pipeline-lifecycle bug, and several specific feature subsystems as the sole cause. Current working theory is a RADV/RDNA4 driver-level issue with compute dispatch scheduling under load, not a single engine-side bug, though this is not yet conclusively proven.

System Specification

  • Engine: Unreal Engine 5.8, built from source
  • OS: Nobara Linux (Fedora-based)
  • GPU: AMD Radeon RX 9070 XT (RDNA4, PCI device ID 0x7550)
  • Driver: Mesa 26.1.0, RADV (ACO compiler), LLVM 21.1.8
  • Vulkan: 1.4.341
  • CPU: AMD Ryzen 9 9950X3D
  • Note: RADV reports itself as non-conformant (“radv is not a conformant Vulkan implementation, testing use only”). This is the driver’s standard self-disclaimer, not a fault.

Crash Signature (constant across all occurrences)

LogVulkanRHI: Error: VulkanRHI::vkQueueSubmit(Queue, InSubmitInfos.Num(), InSubmitInfos.GetData(), FenceHandle) failed, VkResult=-4
LogVulkanRHI: Error: at Runtime/VulkanRHI/Private/VulkanQueue.cpp:[490 or 507]
LogVulkanRHI: Error: with error VK_ERROR_DEVICE_LOST
LogCore: FUnixPlatformMisc::RequestExit(1, FVulkanDynamicRHI.TerminateOnGPUCrash)

No shader diagnostic messages are ever reported on either the Graphics or the AsyncCompute queue (“No shader diagnostics found for this queue”), and the DEVICE FAULT REPORT block is always empty (no description, no address info, no vendor info, vendor binary size = 0).

Confirmed kernel-level evidence (dmesg)

amdgpu 0000:03:00.0: ring comp_1.0.1 timeout, signaled seq=X, emitted seq=X+2
amdgpu 0000:03:00.0: Process UnrealEditor pid [...] thread RHISubmission pid [...]
amdgpu 0000:03:00.0: Starting comp_1.0.1 ring reset
amdgpu 0000:03:00.0: Ring comp_1.0.1 reset succeeded
amdgpu 0000:03:00.0: [drm] device wedged, but recovered through reset
amdgpu 0000:03:00.0: ring gfx_0.0.0 timeout, signaled seq=X, emitted seq=X+2
amdgpu 0000:03:00.0: Starting gfx_0.0.0 ring reset
amdgpu 0000:03:00.0: Ring gfx_0.0.0 reset succeeded

Two patterns have been observed in separate occurrences: the compute ring timing out first, followed shortly after by the graphics ring (same UnrealEditor PID); and the graphics ring timing out alone, with no preceding compute timeout (different session/PID). This inconsistency in which ring fails first is itself a data point; see "Current working theory" below.
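
To compare which ring fails first across sessions, the ring-timeout lines can be pulled out of a saved dmesg capture mechanically. The following is a minimal sketch, assuming the line format shown in the excerpt above; the sequence numbers in the sample are made up for illustration (the excerpt only gives X and X+2), and the function names are hypothetical.

```python
import re

# Matches amdgpu ring-timeout lines such as:
#   amdgpu 0000:03:00.0: ring comp_1.0.1 timeout, signaled seq=100, emitted seq=102
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def ring_timeouts(dmesg_lines):
    """Return (ring, pending_jobs) tuples in the order the timeouts were logged."""
    events = []
    for line in dmesg_lines:
        m = RING_TIMEOUT.search(line)
        if m:
            pending = int(m.group("emitted")) - int(m.group("signaled"))
            events.append((m.group("ring"), pending))
    return events


def first_ring_kind(events):
    """Classify the first ring to time out as 'compute', 'graphics', or 'other'."""
    if not events:
        return None
    ring = events[0][0]
    if ring.startswith("comp_"):
        return "compute"
    if ring.startswith("gfx_"):
        return "graphics"
    return "other"


# Illustrative input only: sequence numbers are invented for the example.
sample = [
    "amdgpu 0000:03:00.0: ring comp_1.0.1 timeout, signaled seq=100, emitted seq=102",
    "amdgpu 0000:03:00.0: Starting comp_1.0.1 ring reset",
    "amdgpu 0000:03:00.0: ring gfx_0.0.0 timeout, signaled seq=500, emitted seq=502",
]
events = ring_timeouts(sample)
print(events, first_ring_kind(events))
```

Running this over one dmesg capture per crash gives a quick per-session record of whether the compute or the graphics ring timed out first.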

Breadcrumb stuck-points observed across separate crash occurrences

# | System | Exact stuck pass | Queue
1 | Lumen | LumenSceneLighting → BuildCardUpdateContext → DirectLighting → CullTiles 1 lights | AsyncCompute
2 | Nanite | SceneCulling_ComputeExplicitChunkBounds | Graphics
3 | Nanite (repro, no scene content) | SceneCulling_ComputeExplicitChunkBounds | Graphics
4 | Global Distance Field | UpdateGlobalDistanceField → Update MostlyStatic (Update Movable never started) | Graphics
5 | Global Distance Field (repeat) | Same as #4, identical stuck point | Graphics
6 | Nanite (HitProxies pass) | Nanite::DrawGeometry → NoOcclusionPass → NodeAndClusterCull | Graphics

Occurrence #6 happened on Frame 2, during the editor’s hit-proxy (click-selection) render pass — a structurally different render path from normal scene rendering, and notably occurred after disabling the Global Distance Field runtime (r.AOGlobalDistanceField=False), with no Global Distance Field passes present in that breadcrumb at all.
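
A quick tally of the six occurrences by queue and by system makes the spread easier to see. This is only a sketch over the rows listed above; the tuple layout and variable names are my own, and the system labels are shortened (the "repro", "repeat" and "HitProxies" qualifiers are dropped).

```python
from collections import Counter

# (occurrence number, system, queue), transcribed from the stuck-point list above.
occurrences = [
    (1, "Lumen", "AsyncCompute"),
    (2, "Nanite", "Graphics"),
    (3, "Nanite", "Graphics"),
    (4, "Global Distance Field", "Graphics"),
    (5, "Global Distance Field", "Graphics"),
    (6, "Nanite", "Graphics"),
]

by_queue = Counter(queue for _, _, queue in occurrences)
by_system = Counter(system for _, system, _ in occurrences)

for queue, count in by_queue.most_common():
    print(f"{queue}: {count}")
for system, count in by_system.most_common():
    print(f"{system}: {count}")
```

Five of the six stuck points are on the Graphics queue, so the AsyncCompute queue is not a common factor across these occurrences.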

Isolation steps performed, in order, with results

1. Ruled out: known UE 5.7 Vulkan pipeline-lifecycle bug

A previously-documented UE 5.7 Linux bug (use-after-free of VkPipeline handles in VulkanPipeline.cpp’s NotifyDeletedGraphicsPSO(), calling DeleteVkPipeline(true) instead of routing through the deferred deletion queue) was identified via forum research as a plausible match. Workarounds applied:

r.Vulkan.EnablePipelineLRUCache=1
r.Vulkan.WaitForIdleOnSubmit=1

Result: no effect. Crash persisted identically with both flags active.

2. Ruled out: Vulkan validation layer violations

Ran with VK_INSTANCE_LAYERS=VK_LAYER_KHRONOS_validation across multiple full crash reproductions. Result: zero VUID violations logged in any session, despite confirmed validation layer activation (both Instance Layer and Device Layer registration confirmed in log). This rules out CPU-side Vulkan API misuse (e.g. the use-after-free pattern from the 5.7 bug) as the cause — the GPU hang occurs with fully valid, spec-compliant API usage from the engine’s side.
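
For anyone repeating this check, a saved log can be scanned for validation message IDs rather than read by eye. The sketch below assumes that validation messages carry identifiers beginning with "VUID-", which is how the Khronos validation layer tags its messages; the sample log lines, their prefixes, and the function name are hypothetical and only illustrate the shape of the input.

```python
import re
from collections import Counter

# Validation message IDs look like "VUID-vkCmdDraw-None-02699".
VUID = re.compile(r"VUID-[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def count_vuids(log_lines):
    """Count occurrences of each VUID identifier found in the log lines."""
    counts = Counter()
    for line in log_lines:
        counts.update(VUID.findall(line))
    return counts


# Hypothetical log excerpt: the exact prefix the engine prints may differ.
sample_log = [
    "LogVulkanRHI: Error: Validation Error: [ VUID-vkCmdDraw-None-02699 ] descriptor set not bound",
    "LogVulkanRHI: Error: with error VK_ERROR_DEVICE_LOST",
    "LogVulkanRHI: Error: Validation Error: [ VUID-vkCmdDraw-None-02699 ] descriptor set not bound",
]
vuid_counts = count_vuids(sample_log)
print(dict(vuid_counts))
```

An empty result only means something if the layer was actually loaded, which is why the layer registration lines were checked in the log first.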

3. Ruled out: scene-content dependency

Repro tested against the stock /Engine/Maps/Templates/OpenWorld template map (loaded via EditorStartupMap override in DefaultEngine.ini), with zero custom project content (no Mass Entity actors, no custom Blueprints) loaded. Result: crash still occurred, with the identical SceneCulling_ComputeExplicitChunkBounds stuck point as on the full project scene. This rules out anything specific to the project’s own assets/entities/chunk data as the trigger.

4. Ruled out: AMDVLK as a comparison driver

Investigated installing AMDVLK (AMD’s alternate, LLPC-based Vulkan driver) as a way to isolate whether the bug is RADV-specific. Result: not viable. AMDVLK was formally discontinued by AMD in September 2025; its last release (v-2024.Q4.1) predates RDNA4 hardware entirely and has no real RDNA4 support. AMD has consolidated all Linux Vulkan driver development into RADV. No comparison test possible.

5. Inconclusive / partially effective: disabling specific render features

  • r.Lumen.AsyncCompute=0: applied alongside other flags. The crash recurred afterward in a different system (Nanite/Distance Field), so the flag was never tested in isolation, but it did not prevent crashes in combination with the others.
  • r.GenerateMeshDistanceFields=False (full distance field disable): the editor became stable with no crash, but rendering broke significantly (black screen with visual artifacts), which strongly suggests that Lumen and/or other systems have an undocumented hard dependency on distance field data even when it is not visibly enabled. Not a usable workaround.
  • r.GenerateMeshDistanceFields=True + r.AOGlobalDistanceField=False (disable GDF runtime only, keep mesh DF generation): delayed the crash by a few seconds but did not prevent it; the crash recurred in an unrelated Nanite HitProxies culling pass instead (see occurrence #6 above). This was the most informative negative result: it shows the crash is not specific to the Global Distance Field’s sparse page-table allocation, contrary to an earlier working hypothesis (see step 6 below).

6. Investigated and ruled out (with caveats): sparse/PRT resource binding theory

One occurrence’s call stack showed the failing vkQueueSubmit originating from the sparse-resource-commit branch of FVulkanQueue::SubmitPayloads (VulkanQueue.cpp:357, the Payload->ReservedResourcesToCommit.Num() path) rather than the standard per-frame RDG submission path (VulkanQueue.cpp:463). This pointed toward a known class of RADV bug: sparse/PRT (Partially Resident Texture) binding issues on RDNA4, which have documented precedent in other Vulkan applications (DXVK removed sparse buffer usage citing AMD driver hangs; VKD3D-Proton has pending/partial RDNA4 sparse-SMEM workarounds; a near-identical "The CS has been cancelled because the context is lost" / vkQueueSubmit failed -4 report exists from an unrelated application on the same GPU model). However, the negative result from step 5 (disabling the GDF runtime did not stop the crashes) means this is not confirmed as the root cause. It may be a contributing factor in some occurrences, but it is not a complete explanation, since crashes also occur in non-sparse compute dispatches (e.g. Nanite culling, which does not use sparse/PRT resources to our knowledge).

7. Stack trace instrumentation (CPU-side)

Added logging (UE_LOG) plus full stack capture (FPlatformStackWalk::CaptureStackBackTrace) at:

  • FVulkanQueue::Submit — every call resolved to the same generic submission-thread plumbing (FVulkanThread::Run → ProcessSubmissionQueue → ForEachQueue → SubmitQueuedPayloads → SubmitPayloads → Submit), regardless of which RDG pass produced the command buffer. Not useful for root cause — by the time execution reaches this point, the original RDG pass context is gone; this only shows generic Vulkan RHI submission machinery.
  • FRDGBuilder::ExecutePass (pass name + pipeline type logged on entry/exit) — confirmed that in the occurrence where this was active, the entire frame’s RDG pass list executed and “finished” cleanly on the CPU side, including the pass that later breadcrumbs would show as GPU-side stuck (Update Movable/Propagate Clipmap etc. all completed). The actual crash occurred in a later, separate submission with a different stack signature, on a different queue, with no corresponding ExecutePass log entry — consistent with the GPU having already silently hung on earlier work, with the failing submission simply being the next one to surface the already-dead device via vkQueueSubmit.

Current working theory (not fully proven)

The evidence is most consistent with a RADV/RDNA4 driver-level compute dispatch or scheduling bug, rather than a single Unreal Engine code defect:

  • Crashes occur across at least four structurally unrelated engine systems (Lumen direct lighting, Nanite scene culling x2, Global Distance Field update, Nanite hit-proxy culling), all of which are compute-shader-driven culling/binning passes operating on variable-sized GPU work.
  • Crashes are not scene-content-dependent (reproduces on the stock OpenWorld template with no project content loaded).
  • Crashes are not caused by CPU-side Vulkan API misuse (zero validation layer violations across multiple instrumented runs).
  • Crashes are not fully explained by the known UE 5.7 PSO-lifecycle bug (workarounds for that bug had no effect here).
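  • Disabling a single implicated feature (Lumen async compute, the Global Distance Field runtime) does not stop the crash; it relocates to a different system instead. The only stable configuration found, disabling mesh distance field generation entirely, also broke rendering.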
  • dmesg confirms genuine GPU-side ring timeouts (comp_1.0.1, gfx_0.0.0) with kernel-level reset/recovery, not merely an application or API-level error.
  • RDNA4 + RADV is independently documented elsewhere (Mesa release notes, DXVK, VKD3D-Proton changelogs, unrelated GitHub issues on the same GPU model) as having had, and in some cases still having, compute/sparse-resource/hardware-bug-related hang issues requiring driver-side workarounds (e.g. the documented buggy HiZ/HiS on GFX12 requiring a RADV workaround merged in Mesa 25.1).

This is not conclusively proven. It remains possible that a genuine engine-side bug is common to all affected systems (e.g. a shared utility function used by Nanite culling, Lumen tiling, and GDF page updates) rather than a driver issue, but no such common code path has been identified yet.

What has NOT yet been tried

  • Direct correlation of dmesg ring-timeout timestamps against ExecutePass logs for the same crash occurrence (so far, only submission stacks and breadcrumbs have been correlated).
  • GPU coredump analysis via umr (AMDGPU Userspace Register Debugger) — attempted, but the installed Nobara package version does not recognize either the RDNA4 (0x7550) or Raphael APU (0x13c0) ASIC IDs, and appears to lack a workflow for offline analysis of an already-captured coredump file (it is designed for live register inspection).
  • Testing on a different RDNA4 card or a different vendor’s GPU (NVIDIA/Intel) on the same engine build, to further isolate vendor-specific vs. engine-wide.
  • Testing against a Mesa version newer than 26.1.0, in case any of the relevant fixes (e.g. continued RDNA4 hang workarounds visible in recent Mesa release notes) have landed since.
  • Bisecting engine versions between 5.5 (last confirmed-stable per other forum reports) and 5.8 to narrow which specific UE change, if any, introduced or worsened this.
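
For the first item in the list above (correlating ring-timeout timestamps with ExecutePass logs), a rough sketch of what the correlation could look like is below. It rests on several assumptions that are not established anywhere in this post: that the custom ExecutePass logging prints lines of the hypothetical form "ExecutePass enter <name>" / "ExecutePass exit <name>" behind the usual UE log timestamp prefix, that the dmesg timestamp has already been converted to wall-clock time in the same timezone as the engine log, and that a fixed look-back window is a reasonable stand-in for the kernel's job timeout. The function names and the 10-second default are my own.

```python
import bisect
import re
from datetime import datetime

# UE log prefix "[YYYY.MM.DD-HH.MM.SS:mmm][frame]" followed by a hypothetical
# instrumentation message "ExecutePass enter|exit <pass name>".
UE_LINE = re.compile(
    r"^\[(?P<ts>\d{4}\.\d{2}\.\d{2}-\d{2}\.\d{2}\.\d{2}:\d{3})\]\[\s*\d+\]"
    r".*ExecutePass (?P<phase>enter|exit) (?P<name>.+)$"
)


def parse_pass_events(lines):
    """Return time-sorted (timestamp, phase, pass_name) tuples from log lines."""
    events = []
    for line in lines:
        m = UE_LINE.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y.%m.%d-%H.%M.%S:%f")
            events.append((ts, m.group("phase"), m.group("name")))
    return sorted(events)


def passes_before(events, timeout_ts, window_s=10.0):
    """Pass events logged within window_s seconds before the ring timeout."""
    times = [e[0] for e in events]
    hi = bisect.bisect_right(times, timeout_ts)
    return [e for e in events[:hi]
            if (timeout_ts - e[0]).total_seconds() <= window_s]


# Hypothetical sample data, for illustration only.
sample_lines = [
    "[2025.06.01-14.03.20:100][  2]LogTemp: ExecutePass enter UpdateGlobalDistanceField",
    "[2025.06.01-14.03.20:150][  2]LogTemp: ExecutePass exit UpdateGlobalDistanceField",
    "[2025.06.01-14.03.05:000][  1]LogTemp: ExecutePass enter NodeAndClusterCull",
]
ring_timeout_at = datetime(2025, 6, 1, 14, 3, 22)
recent = passes_before(parse_pass_events(sample_lines), ring_timeout_at)
for ts, phase, name in recent:
    print(ts.isoformat(), phase, name)
```

Because the kernel reports a timeout some time after the hung job was submitted, the passes of interest are the ones inside the window rather than the very last one logged, so the window length would need tuning against the actual timeout configured on the system.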

Request

Has anyone else seen VK_ERROR_DEVICE_LOST on UE 5.8 source builds specifically on AMD RDNA4 hardware (RX 9070 series or similar), with crashes occurring in varying compute-driven culling passes (Nanite scene culling, Lumen, Global Distance Field) rather than one consistent location? Particularly interested in:

  • Whether this reproduces on non-source/binary editor builds
  • Whether anyone has a confirmed fix or driver workaround
  • Whether Epic’s render hardware interface (RHI) team has visibility into RDNA4-specific compute scheduling issues independent of this thread