Some scenes do perform better without occlusion culling, but enough of them still need it that turning queries off outright unfortunately isn't a long-term solution. Since this appears to be an issue specific to our project that I cannot reproduce in a stock UE5.5 project, I'm taking as proactive an approach as I can without a 100% repro device on hand.
I did manage to hit the occlusion query hang with a debugger attached and poked at it a bit. The case I'm seeing occurs exclusively when closing menus that freeze rendering, i.e. the transition out of UGameViewportClient::bDisableWorldRendering that I described above. Our QA's test device, however, seems to hit it just by lingering in one particular level. I'll try to get them to repro it with -onethread.
In any event, the callstack during the hang is:
> libUnreal.so!FVulkanDynamicRHI::RHIGetRenderQueryResult(FVulkanDynamicRHI * this, FVulkanOcclusionQuery * QueryRHI, uint64 & OutNumPixels, bool bWait, uint32 GPUIndex) Line 480 c++14
libUnreal.so!RHIGetRenderQueryResult(FRHIRenderQuery * RenderQuery, uint64 & OutResult, bool bWait, uint32 GPUIndex) Line 1327 c++14
libUnreal.so!bool FGPUOcclusionPacket::OcclusionCullPrimitive<false, FGPUOcclusionPacket::FProcessVisitor>(FGPUOcclusionPacket * this, FGPUOcclusionPacket::FProcessVisitor & Visitor, FOcclusionCullResult & Result, int32 Index) Line 2634 c++14
libUnreal.so!FGPUOcclusionSerial::AddPrimitives(FGPUOcclusionSerial * this, FPrimitiveRange PrimitiveRange) Line 3270 c++14
libUnreal.so!FVisibilityTaskData::ProcessRenderThreadTasks(FVisibilityTaskData * this) Line 4681 c++14
libUnreal.so!FMobileSceneRenderer::InitViews(FMobileSceneRenderer * this, FRDGBuilder & GraphBuilder, FSceneTexturesConfig & SceneTexturesConfig, FInstanceCullingManager & InstanceCullingManager, FVirtualTextureUpdater * VirtualTextureUpdater, FMobileSceneRenderer::FInitViewTaskDatas & TaskDatas) Line 473 c++14
libUnreal.so!FMobileSceneRenderer::Render(FMobileSceneRenderer * this, FRDGBuilder & GraphBuilder) Line 1057 c++14
libUnreal.so!RenderViewFamilies_RenderThread(FRHICommandListImmediate & RHICmdList, const TArray<FSceneRenderer *, TSizedDefaultAllocator<32> > & SceneRenderers) Line 5428 c++14
libUnreal.so!FRendererModule::BeginRenderingViewFamilies(FCanvas*, TArrayView<FSceneViewFamily*, int>)::$_41::operator()(FRHICommandListImmediate&) const(const (unnamed class) * this, FRHICommandListImmediate & RHICmdList) Line 5731 c++14
libUnreal.so!TEnqueueUniqueRenderCommandType<TRenderCommandTag<FRendererModule::BeginRenderingViewFamilies(FCanvas*, TArrayView<FSceneViewFamily*, int>)::TSTR_FDrawSceneCommand5726>, FRendererModule::BeginRenderingViewFamilies(FCanvas*, TArrayView<FSceneViewFamily*, int>)::$_41>::DoTask(TEnqueueUniqueRenderCommandType<TRenderCommandTag<TSTR_FDrawSceneCommand5726>, (unnamed class)> * this, ENamedThreads::Type CurrentThread, const FGraphEventRef & MyCompletionGraphEvent) Line 234 c++14
libUnreal.so!TGraphTask<TEnqueueUniqueRenderCommandType<TRenderCommandTag<FRendererModule::BeginRenderingViewFamilies(FCanvas*, TArrayView<FSceneViewFamily*, int>)::TSTR_FDrawSceneCommand5726>, FRendererModule::BeginRenderingViewFamilies(FCanvas*, TArrayView<FSceneViewFamily*, int>)::$_41>>::ExecuteTask(TGraphTask<TEnqueueUniqueRenderCommandType<TRenderCommandTag<TSTR_FDrawSceneCommand5726>, (unnamed class)> > * this) Line 633 c++14
libUnreal.so!UE::Tasks::Private::FTaskBase::TryExecuteTask(UE::Tasks::Private::FTaskBase * this) Line 503 c++14
libUnreal.so!FBaseGraphTask::Execute(FBaseGraphTask * this, TArray<FBaseGraphTask *, TSizedDefaultAllocator<32> > & NewTasks, ENamedThreads::Type CurrentThread, bool bDeleteOnCompletion) Line 481 c++14
libUnreal.so!FNamedTaskThread::ProcessTasksNamedThread(FNamedTaskThread * this, int32 QueueIndex, bool bAllowStall) Line 778 c++14
libUnreal.so!FNamedTaskThread::ProcessTasksUntilQuit(FNamedTaskThread * this, int32 QueueIndex) Line 666 c++14
libUnreal.so!RenderingThreadMain(FEvent * TaskGraphBoundSyncEvent) Line 316 c++14
libUnreal.so!FRenderingThread::Run(FRenderingThread * this) Line 467 c++14
libUnreal.so!FRunnableThreadPThread::Run(FRunnableThreadAndroid * this) Line 24 c++14
libUnreal.so!FRunnableThreadPThread::_ThreadProc(FRunnableThreadAndroid * pThis) Line 186 c++14
Other threads are all paused. FGPUOcclusionSerial::AddPrimitives is iterating over every primitive in the range, and each one calls RHIGetRenderQueryResult and blocks waiting for the result. For a pathological scene with ~3,000 primitives and a 0.5 s timeout per query, that's 3,000 × 0.5 s = 1,500 s, i.e. roughly 25 minutes of waiting before the loop exits.
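To illustrate why the stall scales with primitive count, here's a rough paraphrase of the shape of that loop as I understand it from stepping through it. This is not the actual engine source; the range fields and helper functions below are approximations of what the callstack shows:

// Simplified paraphrase of FGPUOcclusionSerial::AddPrimitives driving
// FGPUOcclusionPacket::OcclusionCullPrimitive (names/structure approximated):
for (int32 Index = PrimitiveRange.StartIndex; Index < PrimitiveRange.EndIndex; ++Index)
{
    uint64 NumPixels = 0;
    FRHIRenderQuery* Query = GetQueryForPrimitive(Index); // hypothetical accessor

    // bWait == true: the call blocks the render thread until the result
    // arrives or the RHI's internal timeout (~0.5 s on this device) expires.
    const bool bResultAvailable = RHIGetRenderQueryResult(Query, NumPixels, /*bWait=*/true);

    // If the GPU never delivers any results, every primitive eats the full
    // timeout: ~3,000 primitives * 0.5 s = 1,500 s, i.e. ~25 minutes in this loop.
    ApplyOcclusionResult(Index, bResultAvailable, NumPixels); // hypothetical
}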
I noticed the full logcat/debugger output also has some driver spew while in that state:
09-24 17:41:29.495 23393 23778 W Adreno-GSL: <gsl_ldd_control:553>: ioctl fd 137 code 0xc040094a (IOCTL_KGSL_GPU_COMMAND) failed: errno 71 Protocol error
09-24 17:41:29.495 23393 23778 W Adreno-GSL: <log_gpu_snapshot:462>: panel.gpuSnapshotPath is not set.not generating user snapshot
09-24 17:41:29.542 23393 23778 W Adreno-GSL: <gsl_ldd_control:553>: ioctl fd 137 code 0xc040094a (IOCTL_KGSL_GPU_COMMAND) failed: errno 71 Protocol error
09-24 17:41:29.542 23393 23778 W Adreno-GSL: <log_gpu_snapshot:462>: panel.gpuSnapshotPath is not set.not generating user snapshot
09-24 17:41:29.584 23393 23778 W Adreno-GSL: <gsl_ldd_control:553>: ioctl fd 137 code 0xc040094a (IOCTL_KGSL_GPU_COMMAND) failed: errno 71 Protocol error
09-24 17:41:29.584 23393 23778 W Adreno-GSL: <log_gpu_snapshot:462>: panel.gpuSnapshotPath is not set.not generating user snapshot
I added an escape hatch that stops passing bWait to RHIGetRenderQueryResult after a couple of queries (roughly as sketched after the log below), and this promptly leads to a Vulkan device loss on queue submit:
LogVulkanRHI: Error: VulkanRHI::vkQueueSubmit(Queue, 1, &SubmitInfo, Fence->GetHandle()) failed, VkResult=-4
at ./Runtime/VulkanRHI/Private/VulkanQueue.cpp:74
with error VK_ERROR_DEVICE_LOST
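For reference, the escape hatch is roughly the following; the counter name and threshold are placeholders for what I actually have locally:

// Local hack at the top of FVulkanDynamicRHI::RHIGetRenderQueryResult (simplified):
static int32 GOcclusionWaitTimeouts = 0;   // placeholder; bumped wherever the wait path times out
const int32 MaxBlockingWaits = 2;          // placeholder threshold

if (bWait && GOcclusionWaitTimeouts >= MaxBlockingWaits)
{
    // After a couple of timed-out waits, stop blocking so AddPrimitives
    // falls through the remaining ~3k primitives quickly instead of stalling.
    bWait = false;
}
// ... rest of the function unchanged; the timeout path increments GOcclusionWaitTimeouts.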
This kind of makes sense, since a GPU crash/loss would explain queries never returning, but then I would also expect the checks inside FVulkanOcclusionQueryPool::InternalTryGetResults to catch the VK_ERROR_DEVICE_LOST and fail in that callstack instead. I'll add more logging in there and see where that gets me, but I want to make sure I'm not causing the device loss with these changes myself. Is it possible that skipping the wait, or otherwise not retrieving the query pool results, is what crashes the GPU?
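The logging I have in mind is roughly the following, inside FVulkanOcclusionQueryPool::InternalTryGetResults; this is only a sketch, and the device handle, query pool, and buffer names are stand-ins for whatever the function already has in scope:

// Sketch of extra logging around the query pool readback (names approximated):
VkResult Result = VulkanRHI::vkGetQueryPoolResults(
    DeviceHandle, QueryPool, 0, NumUsedQueries,
    NumUsedQueries * sizeof(uint64), QueryOutput.GetData(),
    sizeof(uint64), VK_QUERY_RESULT_64_BIT);

if (Result == VK_ERROR_DEVICE_LOST)
{
    // If the GPU is already gone when we poll, this should fire well before
    // the vkQueueSubmit failure shows up.
    UE_LOG(LogVulkanRHI, Error, TEXT("vkGetQueryPoolResults returned VK_ERROR_DEVICE_LOST"));
}
else if (Result != VK_SUCCESS && Result != VK_NOT_READY)
{
    UE_LOG(LogVulkanRHI, Warning, TEXT("vkGetQueryPoolResults returned %d"), (int32)Result);
}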
Also, in the event that we can't get a repro with -onethread, are there functional GPU breadcrumbs on Android Vulkan that I could use to determine what, if anything, crashed the GPU before the occlusion queries?