Crash with multithreaded PSOPrecache on Android Vulkan

We enabled PSOPrecache on Android Vulkan and set two CVars:

  1. r.pso.PrecompileThreadPoolPercentOfHardwareThreads=50
  2. r.PSOPrecache.GlobalShaders=1

On a range of Android devices, the following two crashes occur with fairly high probability.

Crash 1 stack:

#00 pc 000000000005bdc0 /apex/com.android.runtime/lib64/bionic/libc.so (abort+164) [arm64-v8a]

#01 pc 0000000009310228 libUnreal.so FAndroidErrorOutputDevice::Serialize(char16_t const*, ELogVerbosity::Type, FName const&) (.\Runtime/Core/Private/Android/AndroidErrorOutputDevice.cpp:52) [arm64-v8a]

#02 pc 000000000fd6a728 libUnreal.so FOutputDevice::LogfImpl(char16_t const*, …) (.\Runtime/Core/Private/Misc/OutputDevice.cpp:81) [arm64-v8a]

#03 pc 000000000f611f48 libUnreal.so FDebug::CheckVerifyFailedImpl2(char const*, char const*, int, char16_t const*, …) (Runtime\Core\Public\Misc/OutputDevice.h:246) [arm64-v8a]

#04 pc 0000000016f29618 libUnreal.so FVulkanPipelineStateCacheManager::RHICreateGraphicsPipelineState(FGraphicsPipelineStateInitializer const&) (.\Runtime/VulkanRHI/Private/VulkanPipeline.cpp:2123 [Inline: TRefCountPtr]) (Other infos:FRHIResource::AddRef() const Runtime\Core\Public\Templates/RefCounting.h:299FRHIResource::FAtomicFlags::AddRef(std::__ndk1::memory_order) Runtime\RHI\Public/RHIResources.h:68FRHIResource::FAtomicFlags::AddRef(std::__ndk1::memory_order) Runtime\RHI\Public/RHIResources.h:136) [arm64-v8a]

#05 pc 0000000016f2c204 libUnreal.so FVulkanDynamicRHI::RHICreateGraphicsPipelineState(FGraphicsPipelineStateInitializer const&) (.\Runtime/VulkanRHI/Private/VulkanPipeline.cpp:2242) [arm64-v8a]

#06 pc 0000000010004794 libUnreal.so FCompilePipelineStateTask::CompilePSO(FGraphicsPipelineStateInitializer::EPSOPrecacheCompileType const*) (Runtime\RHI\Public/DynamicRHI.h:1127) [arm64-v8a]

#07 pc 0000000010003e4c libUnreal.so TGraphTask<FCompilePipelineStateTask>::ExecuteTask() (Runtime\Core\Public\Async/TaskGraphInterfaces.h:639 [Inline: FCompilePipelineStateTask::DoTask(ENamedThreads::Type, TRefCountPtr<FBaseGraphTask> const&)]) (Other infos:FCompilePipelineStateTask::DoTask(ENamedThreads::Type, TRefCountPtr<FBaseGraphTask> const&) .\Runtime/RHI/Private/PipelineStateCache.cpp:3189) [arm64-v8a]

#08 pc 000000000f97c69c libUnreal.so UE::Tasks::Private::FTaskBase::TryExecuteTask() (Runtime\Core\Public\Tasks/TaskPrivate.h:509) [arm64-v8a]

#09 pc 000000000f97b8e8 libUnreal.so _ZN13LowLevelTasks13TTaskDelegateIFPNS_5FTaskEbELj48EE17TTaskDelegateImplIZNS1_4InitIZN2UE5Tasks7Private9FTaskBase4InitEPKDsNS_13ETaskPriorityENS8_21EExtendedTaskPriorityENS8_10ETaskFlagsEEUlvE_EEvSC_SD_OT_NS_10ETaskFlagsEEUlbE_Lb0EE11CallAndMoveERS4_Pvjb (Runtime\Core\Public\Async\Fundamental/Task.h:500 [Inline: operator()]) (Other infos:operator() Runtime\Core\Public\Tasks/TaskPrivate.h:188) [arm64-v8a]

#10 pc 0000000009334dd4 libUnreal.so LowLevelTasks::FScheduler::ExecuteTask(LowLevelTasks::FTask*) (Runtime\Core\Public\Async\Fundamental/TaskDelegate.h:308) [arm64-v8a]

#11 pc 0000000009336354 libUnreal.so _ZN13LowLevelTasks10FScheduler18TryExecuteTaskFromINS_7Private19TLocalQueueRegistryILj1024ELj1024EE11TLocalQueueEXadL [arm64-v8a]

#12 pc 0000000009335fa8 libUnreal.so LowLevelTasks::FScheduler::WorkerLoop(LowLevelTasks::Private::FWaitEvent*, LowLevelTasks::Private::TLocalQueueRegistry<(unsigned int)1024, (unsigned int)1024>::TLocalQueue*, unsigned int, bool) (.\Runtime/Core/Private/Async/Fundamental/Scheduler.cpp:513) [arm64-v8a]

#13 pc 0000000009336c64 libUnreal.so LowLevelTasks::FScheduler::WorkerMain(LowLevelTasks::Private::FWaitEvent*, LowLevelTasks::Private::TLocalQueueRegistry<(unsigned int)1024, (unsigned int)1024>::TLocalQueue*, unsigned int, bool) (.\Runtime/Core/Private/Async/Fundamental/Scheduler.cpp:571) [arm64-v8a]

#14 pc 00000000095ae814 libUnreal.so FThreadImpl::Run() (.\Runtime/Core/Private/HAL/Thread.cpp:66 [Inline: UE::Core::Private::Function::TFunctionRefBase<UE::Core::Private::Function::TFunctionStorage<true>, void()>::operator()() const]) (Other infos:UE::Core::Private::Function::TFunctionRefBase<UE::Core::Private::Function::TFunctionStorage<true>, void()>::operator()() const Runtime\Core\Public\Templates/Function.h:470) [arm64-v8a]

#15 pc 000000000954af48 libUnreal.so FRunnableThreadPThread::Run() (.\Runtime/Core/Private/HAL/PThreadRunnableThread.cpp:25) [arm64-v8a]

#16 pc 0000000009354da4 libUnreal.so FRunnableThreadPThread::_ThreadProc(void*) (.\Runtime/Core/Private/HAL/PThreadRunnableThread.h:187) [arm64-v8a]

#17 pc 00000000000c0b88 /apex/com.android.runtime/lib64/bionic/libc.so [arm64-v8a]

#18 pc 000000000005d5f8 /apex/com.android.runtime/lib64/bionic/libc.so [arm64-v8a]

java:

[Failed to get Java stack]

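From reading the engine source, the likely cause is that a PSO in GraphicsPSOLockedMap is not deleted immediately when its reference count reaches 0; it is only marked for deletion. A later RHICreateGraphicsPipelineState call can still find it in GraphicsPSOLockedMap, which leads to one of two cases:

  1. By the time the function returns, the object has already been deleted and marked with DeletingBit, so converting it to FGraphicsPipelineStateRHIRef triggers the crash;
  2. When the function returns, the object is in the middle of Deleting() and DeletingBit is set only after the return, so the returned object becomes a dangling pointer.

The expected behavior is that once a PSO is marked for deletion, it can no longer be found in GraphicsPSOLockedMap and does not interfere with the creation of a new PSO.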
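The expected behavior described above can be illustrated with a minimal, self-contained C++ sketch. It is not engine code: `FakePSO`, `FakePSOCache`, `TryAddRef`, and `FindAndAddRef` are hypothetical names, and the model assumes that "marked for deletion" corresponds to a reference count of zero. The idea is that a lookup takes its reference inside the lock, and only if the count is still positive, so an entry that is pending deletion is treated as absent and never revived.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in for a ref-counted PSO; this is not FRHIResource.
struct FakePSO {
    std::atomic<int32_t> RefCount{1};

    // Take a reference only while the object is still alive (count > 0).
    bool TryAddRef() {
        int32_t Current = RefCount.load(std::memory_order_relaxed);
        while (Current > 0) {
            // On failure, Current is reloaded and the loop re-checks it.
            if (RefCount.compare_exchange_weak(Current, Current + 1,
                                               std::memory_order_acquire,
                                               std::memory_order_relaxed)) {
                return true;
            }
        }
        return false; // Count already hit zero: the object is pending deletion.
    }

    // Returns true when the caller dropped the last reference.
    bool Release() {
        return RefCount.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};

// Hypothetical stand-in for the PSO map and its lock.
class FakePSOCache {
public:
    void Add(uint64_t Key, FakePSO* PSO) {
        std::lock_guard<std::mutex> Lock(Mutex);
        Map[Key] = PSO;
    }

    // Returns an already-referenced PSO, or nullptr if the key is missing
    // or the stored object is pending deletion.
    FakePSO* FindAndAddRef(uint64_t Key) {
        std::lock_guard<std::mutex> Lock(Mutex);
        auto It = Map.find(Key);
        if (It == Map.end()) {
            return nullptr;
        }
        return It->second->TryAddRef() ? It->second : nullptr;
    }

private:
    std::mutex Mutex;
    std::unordered_map<uint64_t, FakePSO*> Map;
};
```

In this model, a caller that gets `nullptr` back would go on to create a fresh PSO instead of reusing the dying one. Whether the engine's own reference-counting flags can support a conditional add-ref like this is something that would need to be checked against the actual source.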

Crash 2 stack:

Scudo ERROR: invalid chunk state when deallocating address 0x2000075f7980910

#19 pc 000000000063b230 /vendor/lib64/libllvm-qgl.so (CreateQGLCProgram(QGPUCompiler::CompileData*)+48) [arm64-v8a]

#20 pc 0000000000db9f00 /vendor/lib64/libllvm-qgl.so [arm64-v8a]

#21 pc 0000000000db9b78 /vendor/lib64/libllvm-qgl.so [arm64-v8a]

#22 pc 0000000000040d4c /vendor/lib64/libllvm-glnext.so [arm64-v8a]

#23 pc 00000000001aec9c /vendor/lib64/hw/vulkan.adreno.so [arm64-v8a]

#24 pc 00000000001aad8c /vendor/lib64/hw/vulkan.adreno.so [arm64-v8a]

#25 pc 00000000001a9190 /vendor/lib64/hw/vulkan.adreno.so (qglinternal::vkCreateGraphicsPipelines(VkDevice_T*, VkPipelineCache_T*, unsigned int, VkGraphicsPipelineCreateInfo const*, VkAllocationCallbacks const*, VkPipeline_T**)+7008) [arm64-v8a]

#26 pc 00000000198df1dc /data/app/~~9jDzNJpuYBYxa3L7XQLGjQ==/com.tencent.mf.nf-szbeKqDFipD2HrdT4Hgbvg==/lib/arm64/libUnreal.so (FastDecimalFormat::Internal::FDecimalNumberSignParser::~FDecimalNumberSignParser()+148) [arm64-v8a]

#27 pc 00000000197e82e0 /data/app/~~9jDzNJpuYBYxa3L7XQLGjQ==/com.tencent.mf.nf-szbeKqDFipD2HrdT4Hgbvg==/lib/arm64/libUnreal.so (VkResult FVulkanPipelineCacheChunk::CreatePSO<FVulkanRHIGraphicsPipelineState>(FVulkanRHIGraphicsPipelineState*, FVulkanPipelineCacheChunk::EPSOCacheFindResult, TUniqueFunction<VkResult(FVulkanChunkedPipelineCacheManager::FPSOCreateFuncParams<FVulkanRHIGraphicsPipelineState>&)>)+1196) [arm64-v8a]

#28 pc 00000000197d537c /data/app/~~9jDzNJpuYBYxa3L7XQLGjQ==/com.tencent.mf.nf-szbeKqDFipD2HrdT4Hgbvg==/lib/arm64/libUnreal.so (VkResult FVulkanChunkedPipelineCacheManagerImpl::CreatePSO<FVulkanRHIGraphicsPipelineState>(FVulkanRHIGraphicsPipelineState*, bool, TUniqueFunction<VkResult(FVulkanChunkedPipelineCacheManager::FPSOCreateFuncParams<FVulkanRHIGraphicsPipelineState>&)>)+316) [arm64-v8a]

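This reproduces reliably on first launch. Judging by the error type, the Scudo memory allocator is reporting memory that was freed more than once, most likely because a PSO object is released multiple times inside UE. Since PSO management in the Vulkan layer is multithreaded, we suspect a threading issue: shrinking the thread pool or setting r.Vulkan.ForcePSOSingleThreaded=1 both make the problem stop reproducing.

Is this a known issue? If the engine needs to be changed, how would you recommend changing it?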

[Attachment Removed]

Hi,

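For the second issue, try r.Vulkan.AllowSynchronization2=0 and see whether it still occurs. For the first issue, I am not sure what is going on (I don't quite understand what "the PSO is not deleted immediately when its reference count reaches 0" means). Could you describe the failing sequence in more detail?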

[Attachment Removed]

Hi,

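For the first issue, I don't yet see where the sequence is thread-unsafe. Lookup, insertion, and removal all take the GraphicsPSOLockedCS lock.

For the second issue, is it limited to specific Qualcomm devices? We have seen quite a few cases of PSO compilation crashing on Adreno 7xx devices. That is a driver issue that we have already reported to Qualcomm, but there is no fix yet. You may need to keep following up with your Qualcomm contacts.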

[Attachment Removed]

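For the first issue: when the reference count of an FVulkanRHIGraphicsPipelineState drops to zero, FRHIResource::MarkForDelete is called and the object is placed in the deferred-deletion queue. It is eventually destroyed in FRHIResource::DeleteResources, and its destructor removes it from GraphicsPSOLockedMap. The whole span from marking to removal is not thread-safe. We therefore suspect that a PSO was fetched from GraphicsPSOLockedMap at the same moment it was being marked for deletion, and the resulting race produced a dangling pointer.

For the second issue, we had already applied r.Vulkan.AllowSynchronization2=0 earlier to fix other PSO crashes.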
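The deferred lifecycle described here (mark, queue, destroy later, unregister in the destructor) leaves a window in which a replacement PSO may already have been registered under the same key. The sketch below models one way to keep the late destructor from disturbing that replacement: it erases the map entry only if the entry still points at the object being destroyed. This is a hypothetical illustration, not the engine's implementation; `FakeRegistry`, `FakePSO`, `Publish`, and `Unpublish` are invented names.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>

struct FakePSO;

// Hypothetical registry standing in for the PSO map and its lock.
struct FakeRegistry {
    std::mutex Mutex;
    std::unordered_map<uint64_t, FakePSO*> Map;

    // Insert or overwrite: a new PSO may take over a slot whose previous
    // owner is still waiting in the deferred-deletion queue.
    void Publish(uint64_t Key, FakePSO* PSO) {
        std::lock_guard<std::mutex> Lock(Mutex);
        Map[Key] = PSO;
    }

    // Erase only if the slot still belongs to the object being destroyed.
    void Unpublish(uint64_t Key, const FakePSO* PSO) {
        std::lock_guard<std::mutex> Lock(Mutex);
        auto It = Map.find(Key);
        if (It != Map.end() && It->second == PSO) {
            Map.erase(It);
        }
    }

    FakePSO* Find(uint64_t Key) {
        std::lock_guard<std::mutex> Lock(Mutex);
        auto It = Map.find(Key);
        return It == Map.end() ? nullptr : It->second;
    }
};

// Hypothetical PSO that registers itself on creation and unregisters on
// destruction; in this model the destructor stands for the deferred delete.
struct FakePSO {
    FakeRegistry& Registry;
    uint64_t Key;

    FakePSO(FakeRegistry& InRegistry, uint64_t InKey)
        : Registry(InRegistry), Key(InKey) {
        Registry.Publish(Key, this);
    }

    ~FakePSO() { Registry.Unpublish(Key, this); }
};
```

For example, if an old PSO with key 1 is pending deletion and a new PSO is created with the same key, destroying the old one afterwards leaves the new entry in place. This addresses only the unregister side; the lookup side still has to refuse entries that are already marked for deletion.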

[Attachment Removed]