UE 5.7.x Linux – VK_ERROR_DEVICE_LOST in Vulkan RHI

Hi!

I am consistently experiencing Vulkan device loss on Linux in UE 5.7.x (tested on 5.7.2 and now 5.7.3).
In versions 5.5 and 5.6, using the same hardware and the same project, this issue did not occur.

My system:

  • Arch Linux

  • KDE Plasma 6.5.5

  • KDE Frameworks 6.22.0

  • Qt 6.10.2

  • Kernel 6.18.9-zen1-2-zen (64-bit)

  • Graphics Platform: X11

  • CPU: AMD Ryzen 9 5900X (24 threads)

  • RAM: 128 GB

  • GPU: NVIDIA GeForce RTX 3090

For testing purposes, I installed on a separate drive:

  • the recommended Ubuntu version

  • Rocky Linux

The behavior is identical — the error persists.
I also tested different NVIDIA driver versions (including rollbacks), but the issue remains.

Error

The logs consistently report:

LogVulkanRHI: Error: Result failed, VkResult=-4
with error VK_ERROR_DEVICE_LOST

The crash occurs in:

Runtime/VulkanRHI/Private/VulkanSynchronization.cpp

GPU breadcrumbs indicate execution during:

ShadowDepths
Nanite Shadows
Nanite::DrawGeometry
NodeAndClusterCull

Followed by:

FUnixPlatformMisc::RequestExit(1, FVulkanDynamicRHI.TerminateOnGPUCrash)

Observations

  • Occurs in the Editor during normal workflow

  • Does not require extreme load

  • UE 5.5 and 5.6 were stable on the same setup

  • In 5.7.x crashes are frequent enough that productive work is difficult

  • Reproduced across multiple Linux distributions

This appears to be a regression in Vulkan RHI.

I would greatly appreciate it if Epic could take a look at this issue.
If anyone has found a temporary workaround for the current version, I would be grateful for any suggestions.


I’m not on Arch but I have a similar hardware setup to yours. Rolling the NVIDIA driver back from 590 (the current latest for me) to 570 seemed to help.

The engine keeps crashing on me consistently. It might run for 2 hours, sometimes just 2 minutes, or 15 or 40, but it always crashes eventually. Since I lack the technical knowledge to properly diagnose this myself, I had Claude analyze and write the crash report.

To @tginick: Rolling back the NVIDIA driver is not a solution in my case. I have tested drivers 570, 575, 580, and 590 — across three kernel versions (mainline, LTS, and Zen). The crash is 100% reproducible on all combinations. The issue is definitively not driver- or kernel-related.

New crash scenario identified

In addition to the previously reported VK_ERROR_DEVICE_LOST during Nanite rendering, I have now captured a second, more specific crash scenario that happens during Material Editor workflow — specifically when closing a material preview window while another material is open.

Vulkan Validation Layer output (captured with VK_LAYER_KHRONOS_validation)

Running the editor with validation layers enabled immediately exposes the root cause before the crash:

VUID-vkQueueSubmit-pCommandBuffers-00070
vkQueueSubmit(): pSubmits[19].pCommandBuffers[0] — bound VkPipeline
0x83904300002a854b was destroyed.

A command buffer submitted to the GPU still holds a reference to a VkPipeline object that has already been destroyed. This is a use-after-free on the Vulkan object lifecycle side. The GPU cannot execute this command, which produces:

  • NVRM: Xid 69 — Class Error (invalid pipeline handle)
  • NVRM: Xid 32 — channel interrupt (GPU execution stall)
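
To illustrate what the validation layer is flagging, here is a minimal, self-contained C++ sketch of that kind of check. All names in it (PipelineHandle, CommandBuffer, LifetimeTracker) are hypothetical stand-ins, not Vulkan or Unreal Engine types, and the real layer tracks far more state than this.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins; none of these are Vulkan or Unreal Engine types.
using PipelineHandle = std::uint64_t;

struct CommandBuffer {
    std::vector<PipelineHandle> boundPipelines;  // pipelines recorded into this buffer
};

class LifetimeTracker {
public:
    void create(PipelineHandle h)  { live_.insert(h); }
    void destroy(PipelineHandle h) { live_.erase(h); }

    // Returns true only if every pipeline the buffer references is still alive.
    bool validateSubmit(const CommandBuffer& cb) const {
        for (PipelineHandle h : cb.boundPipelines) {
            if (live_.count(h) == 0) {
                return false;  // a bound pipeline was already destroyed
            }
        }
        return true;
    }

private:
    std::unordered_set<PipelineHandle> live_;
};

// Records a buffer, destroys its pipeline, then checks the submit:
// the stale reference must be rejected.
inline bool submitAfterDestroyIsRejected() {
    LifetimeTracker tracker;
    const PipelineHandle pipeline = 0x83904300002a854bULL;
    tracker.create(pipeline);

    CommandBuffer cb{{pipeline}};
    assert(tracker.validateSubmit(cb));  // valid while the pipeline exists

    tracker.destroy(pipeline);           // destroyed before the submit
    return !tracker.validateSubmit(cb);
}
```

The point of the sketch is the ordering: the command buffer is recorded while the pipeline is valid, and only the later destroy makes the submit invalid.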

Crash stack trace (SIGABRT)

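The crash is not a GPU hang per se. The stack below is a CPU-side deadlock: the thread that called FlushRenderingCommands() waits on a render command fence that never completes, because the rendering side is itself waiting for a GPU fence that will never signal (the GPU stalled on the invalid pipeline):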

Signal 6 caught (SIGABRT — abort() called)

FPThreadEvent::Wait()
FRenderCommandFence::Wait()
FFrameEndSync::Sync()
FlushRenderingCommands()
FLinuxWindow::ReshapeWindow()   ← triggered by window resize
SWindow::ResizeWindowSize()
FSlateApplication::DrawPrepass()
FSlateApplication::PrivateDrawWindows()
FSlateApplication::DrawWindows()
FSlateApplication::Tick()
FEngineLoop::Tick()

Sequence of events leading to the crash

  1. Material MLB_HightBlend is saved and compiled in the Material Editor.
  2. Material test is opened in a second editor window.
  3. The first preview window (M_Blend_Inst) is closed.
  4. Slate initiates DrawPrepass, which triggers ReshapeWindow on the remaining window.
  5. FlushRenderingCommands() is called to synchronize the render thread.
  6. The calling thread blocks indefinitely in pthread_cond_timedwait, waiting for a fence that will never signal, because a command buffer submitted earlier references an already-destroyed VkPipeline.
  7. The engine calls abort() → SIGABRT → crash.
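
Steps 5 to 7 come down to a wait on a fence that is never signaled. A minimal sketch of a bounded fence wait, assuming a hypothetical Fence class (not FRenderCommandFence or a VkFence), shows how a caller can notice the hang instead of blocking forever:

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical fence for illustration; not an engine or Vulkan type.
class Fence {
public:
    void signal() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            signaled_ = true;
        }
        cv_.notify_all();
    }

    // Waits up to `timeout`; returns false if the fence was never signaled.
    bool waitFor(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        return cv_.wait_for(lock, timeout, [this] { return signaled_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool signaled_ = false;
};

// Simulates a GPU that never signals, then one that does.
inline bool demoHangIsDetected() {
    Fence fence;
    // Nobody has signaled yet, so the bounded wait times out.
    const bool timedOut = !fence.waitFor(std::chrono::milliseconds(50));

    fence.signal();
    const bool afterSignal = fence.waitFor(std::chrono::milliseconds(50));
    assert(afterSignal);  // a signaled fence returns immediately
    return timedOut && afterSignal;
}
```

In this sketch the timeout is where a caller would log diagnostics and abort, which matches the abort() seen in the trace only in spirit; the engine's actual hang detection is not shown here.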

Why this is a UE 5.7 regression

UE 5.7 introduced changes to the Material Editor and its Vulkan pipeline lifecycle management. It appears there is a race condition where a VkPipeline object is destroyed (as part of closing a preview window / shader recompilation) while an in-flight command buffer in the render thread still holds a reference to it. In UE 5.5 and 5.6, the same workflow on the same hardware is completely stable.

System info

  • UE 5.7.3 (CL-50162420)
  • Arch Linux, GPU: NVIDIA RTX 3090
  • Tested with NVIDIA drivers: 570, 575, 580, 590
  • Tested kernels: mainline, LTS, Zen
  • Tested distros: Arch Linux, Ubuntu (recommended), Rocky Linux
  • Rendering backend: Vulkan

Temporary workaround request

I have tried the following without success:

  • All available NVIDIA driver versions
  • Multiple kernels
  • Multiple Linux distributions

Is there a way to force UE 5.7 to delay pipeline destruction until all in-flight command buffers have completed (e.g., a CVar or engine config flag)? Or any way to force the editor to use a safer pipeline eviction strategy?

The crash is consistently reproducible during normal Material Editor usage, making productive work in UE 5.7 on Linux essentially impossible.


I can reproduce this problem, also on Arch Linux, across UE 5.7.1, 5.7.2, and 5.7.3. The specific message I am getting under 5.7.3 is this:

[2026.02.20-22.03.42:760][814]LogVulkanRHI: Error: Result failed, VkResult=-4
at ./Runtime/VulkanRHI/Private/VulkanSynchronization.cpp:136
with error VK_ERROR_DEVICE_LOST
[2026.02.20-22.03.42:761][814]LogVulkanRHI: Error: Shader diagnostic messages and asserts:

Device: 0, Queue Graphics:
	No shader diagnostics found for this queue.

[2026.02.20-22.03.42:761][814]LogVulkanRHI: Error:
DEVICE FAULT REPORT:

Description:

Address Info:

Vendor Info:

Vendor Binary Size: 0

[2026.02.20-22.03.42:761][814]LogRHI: Error: Active GPU breadcrumbs:

Device 0, Pipeline Graphics: (In: 0x8018e768, Out: 0x8018e769)
	No breadcrumb nodes found for this queue.

When I first encountered this issue on 5.7.1, I found some sources online suggesting it was a power profile problem with the Nvidia driver. The recommended workaround was to change the PowerMizer profile to “prefer maximum performance” and to add a udev rule to make that change persistent across restarts (apparently it’s not persistent if changed in the Nvidia Settings applet).

That workaround seemed to work for me under 5.7.2, but I just installed 5.7.3 and opened a minimal level, and the crash happened about 3 or 4 minutes later. I don’t recall the previous error exactly, so this one may be identical or slightly different, but the overall behavior is the same.

The errors started occurring for me a couple of Nvidia driver versions ago. I’m currently at 590.48.01.

One data point I can contribute: I can rule out Optimus entirely. I have a Quadro RTX 5000 (mobile), and I have switchable graphics disabled in firmware/BIOS settings. The integrated GPU doesn’t even show up as a device to the kernel.

I may have a workaround for this; I am testing it today.

Several days ago I updated my local full source repo to 5.7.3 and then did some collaborative diagnostics with Claude Code in that repo. I combined observations from this thread (by @mirthost) with my own test results and Claude’s ability to rapidly cross-reference different parts of the code base. After all that, I believe we have a workaround.

Quick Workaround

Edit your project’s Config/DefaultEngine.ini to add or change the following:

[/Script/Engine.RendererSettings]
; r.Vulkan.WaitForIdleOnSubmit=1
r.Vulkan.EnablePipelineLRUCache=1

Having one line commented out is not a typo. For most people, enabling the LRU (least recently used) pipeline cache is sufficient to prevent the error with minimal performance impact. If that doesn’t work for you, also enable r.Vulkan.WaitForIdleOnSubmit to fully serialize CPU/GPU communication and eliminate the race condition — but that option imposes a much larger performance penalty and should only be used if necessary.

I had a project reproducing the error within about two minutes of starting the editor at idle, and crashing immediately if I tried to debug a PCG graph. With both INI flags enabled, the crashes stopped. I then commented out WaitForIdleOnSubmit and restarted — it remains stable. So the lighter-weight setting alone is sufficient for my case.

Technical Analysis

Credit for the root cause analysis goes to @mirthost (who ran Vulkan validation layers and identified the exact violation) and Claude Code (Anthropic’s AI coding assistant, which traced the bug to specific lines in the engine source). I contributed the observation that the crash correlated with editor cleanup/GC activity and the idea of looking for CVars as a workaround to avoid rebuilding the engine from source.

Running with VK_LAYER_KHRONOS_validation catches the specific violation:

VUID-vkQueueSubmit-pCommandBuffers-00070
vkQueueSubmit(): pSubmits[19].pCommandBuffers[0] — bound VkPipeline
0x83904300002a854b was destroyed.

This is a use-after-free of VkPipeline handles. The CPU destroys a pipeline object while the GPU still has in-flight command buffers referencing it. This triggers NVRM: Xid 69 (invalid pipeline handle) → VK_ERROR_DEVICE_LOST → render fence never signals → CPU deadlock → SIGABRT.

The bug is in Engine/Source/Runtime/VulkanRHI/Private/VulkanPipeline.cpp, in the function NotifyDeletedGraphicsPSO(). On PC/Linux, the LRU pipeline cache is disabled by default. When a PSO’s (pipeline state object’s) reference count hits zero — triggered by material recompilation, level unloading, garbage collection, editor window close, PCG shader invalidation, or similar events — this function calls:

(*Contained)->DeleteVkPipeline(true);   // line ~2502: immediate vkDestroyPipeline
VkPSO->DeleteVkPipeline(true);          // line ~2515: immediate vkDestroyPipeline

The true argument bypasses the engine’s deferred deletion queue and calls vkDestroyPipeline() immediately, regardless of whether the GPU is still executing commands that use that pipeline.

When the LRU cache is enabled (via the INI setting above), PSO deletion instead goes through LRURemove(), which checks whether the pipeline was used within the last 3 rendered frames. If it was recently used, it calls DeleteVkPipeline(false) — enqueuing the handle in FDeferredDeletionQueue2 for destruction only after the GPU has finished with it. That safety mechanism already exists in the codebase; the default non-LRU path simply doesn’t use it.

For developers who build from source, the proper two-line fix in VulkanPipeline.cpp is to change true to false at both call sites above, so all PSO destruction routes through the deferred deletion queue. The CPU-side handle is still cleared immediately; only the actual vkDestroyPipeline call is deferred until the GPU is done.
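
The deferred-deletion idea described above can be sketched in a few lines of standalone C++. This is an illustration under assumptions, not the engine's FDeferredDeletionQueue2: the class name, the frame counters, and the FIFO ordering are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical sketch; not the engine's FDeferredDeletionQueue2.
class DeferredDeletionQueue {
public:
    // Queue a destroy callback tagged with the last frame that used the resource.
    // Assumes callers enqueue in non-decreasing frame order.
    void enqueue(std::uint64_t lastUsedFrame, std::function<void()> destroy) {
        pending_.push_back({lastUsedFrame, std::move(destroy)});
    }

    // Run callbacks for resources last used at or before the newest frame
    // the GPU has finished. Returns how many resources were destroyed.
    std::size_t releaseCompleted(std::uint64_t gpuCompletedFrame) {
        std::size_t released = 0;
        while (!pending_.empty() && pending_.front().frame <= gpuCompletedFrame) {
            pending_.front().destroy();
            pending_.pop_front();
            ++released;
        }
        return released;
    }

    std::size_t size() const { return pending_.size(); }

private:
    struct Entry {
        std::uint64_t frame;
        std::function<void()> destroy;
    };
    std::deque<Entry> pending_;
};

// A pipeline last used in frame 10 survives until the GPU retires frame 10.
inline int demoDeferredDestroy() {
    DeferredDeletionQueue queue;
    int destroyed = 0;
    queue.enqueue(10, [&destroyed] { ++destroyed; });

    const std::size_t early = queue.releaseCompleted(9);   // GPU still behind
    const std::size_t late  = queue.releaseCompleted(10);  // frame 10 retired
    assert(early == 0 && late == 1);
    return destroyed;
}
```

The CPU side can drop its own reference at enqueue time; only the destroy callback waits for the GPU, which is the behaviour the true-to-false change is meant to restore.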

Forum Notes

For this forum post, I wrote the workaround procedure and test results, Claude Code wrote the technical analysis, and I did the final edit of the merged post.

I have examined the code change proposed by Claude Code and consider it sensible, but I have not personally tested it because the INI file changes are sufficient for me. Use at your own risk.
