[Android] GPU hang or crash in FVulkanOcclusionQueryPool

We’re dealing with a pair of relatively uncommon but consistent crashes related to FVulkanOcclusionQueryPool. The callstacks are similar enough that I think they are the render thread and RHI thread equivalents of the same issue.

The problem is mainly visible through crash reports, but there is also a third form of it where the GPU appears to just hang without ever finishing, and the RHI starts spamming the logs endlessly with the message “Timed out while waiting for GPU to catch up on occlusion results. (0.5 s)”. Because this form does not actually crash, it doesn’t get reported to our Backtrace, so I’m concerned the issue might be even more commonplace than anticipated.

As far as I can tell, it started happening only after we replaced the normal rendering in the background of our menus with a still image. We do this by capturing a screenshot of the viewport just before showing the menu, using that screenshot as a background image widget, and then enabling UGameViewportClient::bDisableWorldRendering to skip the costly background render.
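For reference, the freeze itself is nothing more exotic than flipping the stock viewport flag. A minimal sketch of what we do (the screenshot/widget plumbing is omitted, and the function name here is illustrative, not our actual code):

#include "Engine/Engine.h"
#include "Engine/GameViewportClient.h"

// Illustrative sketch only: freeze/unfreeze the world render while a menu is up.
// We first capture a viewport screenshot and show it in a full-screen Image widget,
// then skip the world render entirely via the stock UGameViewportClient flag.
void SetWorldRenderingFrozen(bool bFrozen)
{
    if (GEngine && GEngine->GameViewport)
    {
        // The hang seems correlated with flipping this back to false (menu closes).
        GEngine->GameViewport->bDisableWorldRendering = bFrozen;
    }
}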

Shortly after this change, we started seeing this error in Backtrace. If I create an automation test that just endlessly cycles through menus, I’ll eventually hit the hang-and-log manifestation of this problem. To confirm that this is due to freezing rendering, and not something in the menus themselves, I can also get it to happen by making the automation test simply cycle UGameViewportClient::bDisableWorldRendering on and off.
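The stripped-down version of that test is roughly the following (illustrative; our real test also drives the actual menu stack, and I believe newer engine versions spell the automation flag enums slightly differently):

#include "Misc/AutomationTest.h"
#include "Engine/Engine.h"
#include "Engine/GameViewportClient.h"

// Latent command that flips bDisableWorldRendering once per Update() until the
// requested number of toggles has been performed.
DEFINE_LATENT_AUTOMATION_COMMAND_ONE_PARAMETER(FCycleWorldRenderingCommand, int32, TogglesRemaining);

bool FCycleWorldRenderingCommand::Update()
{
    if (GEngine && GEngine->GameViewport)
    {
        GEngine->GameViewport->bDisableWorldRendering = !GEngine->GameViewport->bDisableWorldRendering;
    }
    return --TogglesRemaining <= 0; // the latent command completes when this returns true
}

IMPLEMENT_SIMPLE_AUTOMATION_TEST(FCycleWorldRenderingTest, "Project.Repro.CycleWorldRendering",
    EAutomationTestFlags::ApplicationContextMask | EAutomationTestFlags::EngineFilter)

bool FCycleWorldRenderingTest::RunTest(const FString& Parameters)
{
    // Keep toggling for a long time; the hang usually shows up well before this runs out.
    ADD_LATENT_AUTOMATION_COMMAND(FCycleWorldRenderingCommand(100000));
    return true;
}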

There may be something particular to our scene that makes this happen, so if you can’t seem to reproduce the issue, I’d welcome some ideas on where to narrow down the investigation.

Steps to Reproduce
Unreal 5.4 (was also present in 5.3)

Android Vulkan + Deferred renderer

Toggle UGameViewportClient::bDisableWorldRendering on/off until the problem manifests

(Callstacks as a reply to get around the silly 4500 character limit while keeping the callstacks text searchable:)

RHI thread callstack:

[ 00 ] FVulkanOcclusionQueryPool::FlushAllocatedQueries() ( VulkanQuery.cpp:259 )
[ 01 ] FVulkanDevice::AcquireOcclusionQueryPool(FVulkanCommandBufferManager*, unsigned int) ( VulkanQuery.cpp:292 )
[ 02 ] FVulkanCommandListContext::BeginOcclusionQueryBatch(FVulkanCmdBuffer*, unsigned int) ( VulkanQuery.cpp:237 )
[ 03 ] FVulkanCommandListContext::RHIBeginRenderPass(FRHIRenderPassInfo const&, char16_t const*) ( VulkanRenderTarget.cpp:601 )
[ 04 ] FRHICommand<FRHICommandBeginRenderPass, FRHICommandBeginRenderPassString1634>::ExecuteAndDestruct(FRHICommandListBase&, FRHICommandListDebugContext&) ( RHICommandList.h:1299 )
[ 05 ] FRHICommandListBase::Execute(TRHIPipelineArray<IRHIComputeContext*>&, FRHICommandListBase::FPersistentState::FGPUStats*) ( RHICommandList.cpp:477 )
[ 06 ] operator() ( RHICommandList.cpp:786 )
[ 07 ] decltype(Forward<FRHICommandListImmediate::ExecuteAndReset(bool)::$_13&>(fp)()) Invoke<FRHICommandListImmediate::ExecuteAndReset(bool)::$_13&>(FRHICommandListImmediate::ExecuteAndReset(bool)::$_13&) ( Invoke.h:47 )
[ 08 ] UE::Core::Private::Function::TFunctionRefCaller<FRHICommandListImmediate::ExecuteAndReset(bool)::$_13, void ()>::Call(void*) ( RHICommandList.cpp:405 )
[ 09 ] UE::Core::Private::Function::TFunctionRefBase<UE::Core::Private::Function::TFunctionStorage<true>, void ()>::operator()() const ( Function.h:555 )
[ 10 ] TFunctionGraphTaskImpl<void (), (ESubsequentsMode::Type)0>::DoTaskImpl(TUniqueFunction<void ()>&, ENamedThreads::Type, TRefCountPtr<FGraphEvent> const&) ( TaskGraphInterfaces.h:1733 )
[ 11 ] TFunctionGraphTaskImpl<void (), (ESubsequentsMode::Type)0>::DoTask(ENamedThreads::Type, TRefCountPtr<FGraphEvent> const&) ( TaskGraphInterfaces.h:1726 )
[ 12 ] TGraphTask<TFunctionGraphTaskImpl<void (), (ESubsequentsMode::Type)0>>::ExecuteTask(TArray<FBaseGraphTask*, TSizedDefaultAllocator<32>>&, ENamedThreads::Type, bool) ( Function.h:1235 )
[ 13 ] FBaseGraphTask::Execute(TArray<FBaseGraphTask*, TSizedDefaultAllocator<32>>&, ENamedThreads::Type, bool) ( TaskGraphInterfaces.h:840 )
[ 14 ] FNamedTaskThread::ProcessTasksNamedThread(int, bool) ( TaskGraphInterfaces.h:760 )
[ 15 ] FNamedTaskThread::ProcessTasksUntilQuit(int) ( TaskGraph.cpp:650 )
[ 16 ] FRHIThread::Run() ( RenderingThread.cpp:330 )
[ 17 ] FRunnableThreadPThread::Run() ( PThreadRunnableThread.cpp:25 )
[ 18 ] FRunnableThreadPThread::_ThreadProc(void*) ( PThreadRunnableThread.h:187 )

Render Thread callstack:

[ 00 ] TArray<unsigned long long, TSizedDefaultAllocator<32>>::RangeCheck(int) const ( Array.h:758 )
[ 01 ] TArray<unsigned long long, TSizedDefaultAllocator<32>>::operator[](int) const ( Array.h:843 )
[ 02 ] FVulkanQueryPool::GetResultValue(unsigned int) const ( VulkanResources.h:796 )
[ 03 ] FVulkanDynamicRHI::RHIGetRenderQueryResult(FRHIRenderQuery*, unsigned long long&, bool, unsigned int) ( Array.h:468 )
[ 04 ] RHIGetRenderQueryResult(FRHIRenderQuery*, unsigned long long&, bool, unsigned int) ( DynamicRHI.h:1310 )
[ 05 ] bool FGPUOcclusionPacket::OcclusionCullPrimitive<false, FGPUOcclusionPacket::FProcessVisitor>(FGPUOcclusionPacket::FProcessVisitor&, FOcclusionCullResult&, int) ( DynamicRHI.h:2537 )
[ 06 ] FGPUOcclusionSerial::AddPrimitives(FPrimitiveRange) ( SceneVisibility.cpp:3178 )
[ 07 ] FVisibilityTaskData::ProcessRenderThreadTasks() ( SceneVisibility.cpp:4503 )
[ 08 ] FMobileSceneRenderer::InitViews(FRDGBuilder&, FSceneTexturesConfig&, FInstanceCullingManager&, FVirtualTextureUpdater*, FMobileSceneRenderer::FInitViewTaskDatas&) ( MobileShadingRenderer.cpp:489 )
[ 09 ] FMobileSceneRenderer::Render(FRDGBuilder&) ( MobileShadingRenderer.cpp:1033 )
[ 10 ] RenderViewFamilies_RenderThread(FRHICommandListImmediate&, TArray<FSceneRenderer*, TSizedDefaultAllocator<32>> const&) ( SceneRendering.cpp:4829 )
[ 11 ] FRendererModule::BeginRenderingViewFamilies(FCanvas*, TArrayView<FSceneViewFamily*, int>)::$_36::operator()(FRHICommandListImmediate&) const ( SceneRendering.cpp:5119 )
[ 12 ] TEnqueueUniqueRenderCommandType<TRenderCommandTag<FRendererModule::BeginRenderingViewFamilies(FCanvas*, TArrayView<FSceneViewFamily*, int>)::TSTR_FDrawSceneCommand5113>, FRendererModule::BeginRenderingViewFamilies(FCanvas*, TArrayView<FSceneViewFamily*, int>)::$_36>::DoTask(ENamedThreads::Type, TRefCountPtr<FGraphEvent> const&) ( RenderingThread.h:263 )
[ 13 ] TGraphTask<TEnqueueUniqueRenderCommandType<TRenderCommandTag<FRendererModule::BeginRenderingViewFamilies(FCanvas*, TArrayView<FSceneViewFamily*, int>)::TSTR_FDrawSceneCommand5113>, FRendererModule::BeginRenderingViewFamilies(FCanvas*, TArrayView<FSceneViewFamily*, int>)::$_36>>::ExecuteTask(TArray<FBaseGraphTask*, TSizedDefaultAllocator<32>>&, ENamedThreads::Type, bool) ( RenderingThread.h:1235 )
[ 14 ] FBaseGraphTask::Execute(TArray<FBaseGraphTask*, TSizedDefaultAllocator<32>>&, ENamedThreads::Type, bool) ( TaskGraphInterfaces.h:840 )
[ 15 ] FNamedTaskThread::ProcessTasksNamedThread(int, bool) ( TaskGraphInterfaces.h:760 )
[ 16 ] FNamedTaskThread::ProcessTasksUntilQuit(int) ( TaskGraph.cpp:650 )
[ 17 ] RenderingThreadMain(FEvent*) ( RenderingThread.cpp:413 )
[ 18 ] FRenderingThread::Run() ( RenderingThread.cpp:564 )
[ 19 ] FRunnableThreadPThread::Run() ( PThreadRunnableThread.cpp:25 )
[ 20 ] FRunnableThreadPThread::_ThreadProc(void*) ( PThreadRunnableThread.h:187 )

Hi Camille,

Does this occur when bDisableWorldRendering is true or false? Or on transition frames when bDisableWorldRendering changes state?

Best regards.

Hi Camille,

Checking in to see if you have managed to isolate a more consistent repro case.

Best regards.

Thanks for the update. Please reach out if you do.

Hi Camille,

Are all of the devices on which this issue has been seen Adreno-based, or has it been observed on other GPU families? If only Adreno-based, is it generation-specific? Has this been seen on GLES, or does it appear to be Vulkan-specific?

Best regards.

Hi Camille,

Does disabling occlusion culling mask the issue (Project Settings > Rendering > Culling > “Occlusion Culling”)? Should that be the case, it would be interesting to see how it affects the performance of your scene.

Best regards.

Hi Camille,

Glad to hear that narrows it down and masks the GPU crash. We will continue to investigate. If you have a scene you can share that reproduces the issue and shows superior performance without occlusion queries enabled, please let us know.

Best regards.

Hi Camille,

Does running the app with -gpuvalidation enabled yield any additional information in the application log?

Best regards.

Hi Camille,

None of those validation errors appears to be linked to the initial problem. I do not believe UE 5.4 ran completely clean against -gpuvalidation. Can you see if -gpucrashdebugging yields any more information?

Best regards.

Hi Camille,

This may be difficult to do on your end, but does a test retargeting of the app against UE 5.6.1 still exhibit the issue?

Best regards.

Thanks for the information Camille,

Should the issue persist on current or newer versions of UE and you’re in a position to share a repro that we can diagnose, please reach out.

Best regards.

Hi Camille,

Indeed, these weren’t available in 5.5 stock. You can try enabling them with:

		// @todo - new gpu profiler. This is experimental.
		PublicDefinitions.Add("RHI_NEW_GPU_PROFILER=1");

in RHI.Build.cs; however, the new profiler for Vulkan wasn’t fully enabled until 5.6.

Best regards.

It definitely seems to be when changing state and, in particular, when it becomes false (i.e.: when menus are closed and we resume world rendering). The repro seems wildly inconsistent; maybe it needs the device in just the right throttled state to trigger a particular race condition.

All I know is that in true Heisenbug fashion, actively trying to reproduce it makes it go away. :melting_face:

Not yet; I was holding off to see if 5.5 fixes it. We just integrated it and are starting to test more proactively. Nothing in a couple of days (knock on wood), but it’s always been a very sporadic issue.

It does still occur in 5.5. One of our QA testers in particular is running a Pixel 4a on which it seems to occur much more frequently. They haven’t been able to determine repro steps, though I’ll work with them to confirm whether it also occurs when closing menus and unfreezing rendering, and to prep an experimental build where we don’t freeze rendering in the background (to see if that prevents the hang from ever manifesting).

Since this only manifests as occlusion query timeouts, but for all intents and purposes the app is still running, are there any diagnostics I could run to get some sense of what the GPU is doing at that point? I’m pretty sure occlusion queries are just a canary in the coal mine, and something is hanging or deadlocking the GPU (yet not triggering Android’s application-not-responding detection), so if I can dump some breadcrumb state, like whatever GPU queue was last or is currently executing, that should narrow it down a bit.

It seems to happen on both Adreno and Mali. I’ve attached the stats for the issue on our current alpha build, though we don’t have enough data to determine whether the top entries are higher because those GPUs crash more frequently or simply because those GPUs are more prevalent.

I did run an experiment where I disabled our “toggle UGameViewportClient::bDisableWorldRendering while menus are active” optimization to see if that was indeed the cause, but we still had a couple of instances of the issue internally. So perhaps my initial impression that it reproduces when restoring world rendering was incorrect…

r.AllowOcclusionQueries 0 does seem to prevent the hang on our QA tester’s repro device. (Still waiting on IT to send me one of them so I can try it myself.)
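For anyone following along, forcing the workaround is just a matter of pinning that CVar; something like this (stock UE config mechanisms, only meant for testing on our side):

; Option A: globally, in Config/DefaultEngine.ini
[SystemSettings]
r.AllowOcclusionQueries=0

; Option B: Android only, in Config/DefaultDeviceProfiles.ini
[Android DeviceProfile]
+CVars=r.AllowOcclusionQueries=0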

Perf-wise, I’m actually a bit surprised. I think the last time I tried turning it off was before we merged 5.5 and were able to enable GPUScene, but the impact on draw call count is less than I expected.

Considering the amount of draw calls that are spent on occlusion queries themselves, this probably warrants a second look…

Some scenes are better without occlusion, but there’s also enough that still need it, so turning queries off outright unfortunately won’t be a long term solution. Since it sounds like this is an issue particular to our project that I cannot get to happen with a stock UE5.5 project, I’m taking as much of a proactive approach as I can without having a 100% repro device on hand.

I did manage to get the occlusion query hang while having a debugger available, and poked at it a little bit. The one that I’m experiencing seems to exclusively occur when closing menus that freeze rendering, i.e.: the transition out of UGameViewportClient::bDisableWorldRendering that I was describing above. But our QA’s test device case seems to happen just from lingering in one of our levels in particular. I’ll try to get them to repro it with -onethread.

In any event, the callstack during the hang is:

>	libUnreal.so!FVulkanDynamicRHI::RHIGetRenderQueryResult(FVulkanDynamicRHI * this, FVulkanOcclusionQuery * QueryRHI, uint64 & OutNumPixels, bool bWait, uint32 GPUIndex) Line 480	c++14 
 	libUnreal.so!RHIGetRenderQueryResult(FRHIRenderQuery * RenderQuery, uint64 & OutResult, bool bWait, uint32 GPUIndex) Line 1327	c++14
 	libUnreal.so!bool FGPUOcclusionPacket::OcclusionCullPrimitive<false, FGPUOcclusionPacket::FProcessVisitor>(FGPUOcclusionPacket * this, FGPUOcclusionPacket::FProcessVisitor & Visitor, FOcclusionCullResult & Result, int32 Index) Line 2634	c++14
 	libUnreal.so!FGPUOcclusionSerial::AddPrimitives(FGPUOcclusionSerial * this, FPrimitiveRange PrimitiveRange) Line 3270	c++14
 	libUnreal.so!FVisibilityTaskData::ProcessRenderThreadTasks(FVisibilityTaskData * this) Line 4681	c++14
 	libUnreal.so!FMobileSceneRenderer::InitViews(FMobileSceneRenderer * this, FRDGBuilder & GraphBuilder, FSceneTexturesConfig & SceneTexturesConfig, FInstanceCullingManager & InstanceCullingManager, FVirtualTextureUpdater * VirtualTextureUpdater, FMobileSceneRenderer::FInitViewTaskDatas & TaskDatas) Line 473	c++14
 	libUnreal.so!FMobileSceneRenderer::Render(FMobileSceneRenderer * this, FRDGBuilder & GraphBuilder) Line 1057	c++14
 	libUnreal.so!RenderViewFamilies_RenderThread(FRHICommandListImmediate & RHICmdList, const TArray<FSceneRenderer *, TSizedDefaultAllocator<32> > & SceneRenderers) Line 5428	c++14
 	libUnreal.so!FRendererModule::BeginRenderingViewFamilies(FCanvas*, TArrayView<FSceneViewFamily*, int>)::$_41::operator()(FRHICommandListImmediate&) const(const (unnamed class) * this, FRHICommandListImmediate & RHICmdList) Line 5731	c++14
 	libUnreal.so!TEnqueueUniqueRenderCommandType<TRenderCommandTag<FRendererModule::BeginRenderingViewFamilies(FCanvas*, TArrayView<FSceneViewFamily*, int>)::TSTR_FDrawSceneCommand5726>, FRendererModule::BeginRenderingViewFamilies(FCanvas*, TArrayView<FSceneViewFamily*, int>)::$_41>::DoTask(TEnqueueUniqueRenderCommandType<TRenderCommandTag<TSTR_FDrawSceneCommand5726>, (unnamed class)> * this, ENamedThreads::Type CurrentThread, const FGraphEventRef & MyCompletionGraphEvent) Line 234	c++14
 	libUnreal.so!TGraphTask<TEnqueueUniqueRenderCommandType<TRenderCommandTag<FRendererModule::BeginRenderingViewFamilies(FCanvas*, TArrayView<FSceneViewFamily*, int>)::TSTR_FDrawSceneCommand5726>, FRendererModule::BeginRenderingViewFamilies(FCanvas*, TArrayView<FSceneViewFamily*, int>)::$_41>>::ExecuteTask(TGraphTask<TEnqueueUniqueRenderCommandType<TRenderCommandTag<TSTR_FDrawSceneCommand5726>, (unnamed class)> > * this) Line 633	c++14
 	libUnreal.so!UE::Tasks::Private::FTaskBase::TryExecuteTask(UE::Tasks::Private::FTaskBase * this) Line 503	c++14
 	libUnreal.so!FBaseGraphTask::Execute(FBaseGraphTask * this, TArray<FBaseGraphTask *, TSizedDefaultAllocator<32> > & NewTasks, ENamedThreads::Type CurrentThread, bool bDeleteOnCompletion) Line 481	c++14
 	libUnreal.so!FNamedTaskThread::ProcessTasksNamedThread(FNamedTaskThread * this, int32 QueueIndex, bool bAllowStall) Line 778	c++14
 	libUnreal.so!FNamedTaskThread::ProcessTasksUntilQuit(FNamedTaskThread * this, int32 QueueIndex) Line 666	c++14
 	libUnreal.so!RenderingThreadMain(FEvent * TaskGraphBoundSyncEvent) Line 316	c++14
 	libUnreal.so!FRenderingThread::Run(FRenderingThread * this) Line 467	c++14
 	libUnreal.so!FRunnableThreadPThread::Run(FRunnableThreadAndroid * this) Line 24	c++14
 	libUnreal.so!FRunnableThreadPThread::_ThreadProc(FRunnableThreadAndroid * pThis) Line 186	c++14

Other threads are all paused. FGPUOcclusionSerial::AddPrimitives is iterating over all primitives in the range, and every one of them calls RHIGetRenderQueryResult and waits for the result. For a pathological scene with 3k primitives, and a 0.5s timeout for each of them, that’s basically 20+ minutes worth of waiting to get out of that loop.
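To make the arithmetic explicit (my reading of the callstack, not actual engine code):

// Back-of-the-envelope for the stall: FGPUOcclusionSerial::AddPrimitives walks the range
// serially, and each primitive's RHIGetRenderQueryResult(..., bWait=true) only gives up
// after the 0.5 s timeout that produces the log spam above.
constexpr int   NumPrimitives    = 3000;                                    // our pathological scene
constexpr float SecondsPerWait   = 0.5f;                                    // per-query timeout
constexpr float WorstCaseMinutes = NumPrimitives * SecondsPerWait / 60.0f;  // = 25 minutes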

I noticed the full logcat/debugger output also has some driver spew while in that state:

09-24 17:41:29.495 23393 23778 W Adreno-GSL: <gsl_ldd_control:553>: ioctl fd 137 code 0xc040094a (IOCTL_KGSL_GPU_COMMAND) failed: errno 71 Protocol error
09-24 17:41:29.495 23393 23778 W Adreno-GSL: <log_gpu_snapshot:462>: panel.gpuSnapshotPath is not set.not generating user snapshot
09-24 17:41:29.542 23393 23778 W Adreno-GSL: <gsl_ldd_control:553>: ioctl fd 137 code 0xc040094a (IOCTL_KGSL_GPU_COMMAND) failed: errno 71 Protocol error
09-24 17:41:29.542 23393 23778 W Adreno-GSL: <log_gpu_snapshot:462>: panel.gpuSnapshotPath is not set.not generating user snapshot
09-24 17:41:29.584 23393 23778 W Adreno-GSL: <gsl_ldd_control:553>: ioctl fd 137 code 0xc040094a (IOCTL_KGSL_GPU_COMMAND) failed: errno 71 Protocol error
09-24 17:41:29.584 23393 23778 W Adreno-GSL: <log_gpu_snapshot:462>: panel.gpuSnapshotPath is not set.not generating user snapshot

I added an escape hatch to disable bWait on RHIGetRenderQueryResult after a couple of queries, and this promptly leads to a Vulkan device lost on queue submit:

LogVulkanRHI: Error: VulkanRHI::vkQueueSubmit(Queue, 1, &SubmitInfo, Fence->GetHandle()) failed, VkResult=-4

at ./Runtime/VulkanRHI/Private/VulkanQueue.cpp:74

with error VK_ERROR_DEVICE_LOST

This kind of makes sense, as a GPU loss/crash could explain queries never returning, but then I would also expect the checks inside FVulkanOcclusionQueryPool::InternalTryGetResults to catch the VK_ERROR_DEVICE_LOST and fail in that callstack instead. I’ll add more logging in there and see where that gets me, but I want to make sure I’m not causing the device loss with these changes. Is it possible that skipping the wait, or otherwise not retrieving the query pool results, is causing the GPU crash?
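For completeness, the escape hatch I mentioned is roughly this shape (the threshold and plumbing are illustrative, and given the VK_ERROR_DEVICE_LOST above I’m not suggesting it as a fix):

#include "DynamicRHI.h" // global RHIGetRenderQueryResult helper

// Illustrative sketch of my local hack: once a few occlusion queries have already timed
// out this frame, stop asking RHIGetRenderQueryResult to block, so the render thread can
// get out of FGPUOcclusionSerial::AddPrimitives instead of stalling for minutes.
static int32 GTimedOutQueriesThisFrame = 0; // reset at the start of each occlusion pass

static bool GetOcclusionResultWithEscapeHatch(FRHIRenderQuery* Query, uint64& OutNumPixels)
{
    const bool bWaitForResult = (GTimedOutQueriesThisFrame < 3);   // arbitrary threshold
    const bool bGotResult = RHIGetRenderQueryResult(Query, OutNumPixels, bWaitForResult, INDEX_NONE);
    if (!bGotResult && bWaitForResult)
    {
        ++GTimedOutQueriesThisFrame;   // the 0.5 s wait expired without delivering a result
    }
    return bGotResult;
}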

Also, in the event that we can’t get a repro with -onethread, are there functional GPU breadcrumbs on Android Vulkan that I could use to maybe determine what crashed the GPU before occlusion queries, if anything?