When running > 505 tests: PHYSX: eINVALID_OPERATION : PxScene::unlockRead() called without matching call to PxScene::lockRead()

We’re writing a custom plug-in for our project. For testing the plug-in, we are writing Automation Specs. So far, they’re working out well except for one frustrating problem – if we try to run more than about 475 of our tests at a time, UE4 will hang/crash with the following error:

Error        LogPhysicsCore       PHYSX: (D:\Build\++Fortnite\Sync\Engine\Source\ThirdParty\PhysX3\PhysX_3.4\Source\PhysX\src\NpScene.cpp 3006) eINVALID_OPERATION : PxScene::unlockRead() called without matching call to PxScene::lockRead(), behaviour will be undefined.

The thing is… it doesn’t matter which combination of tests we select – all that matters is that if we select more than 475 tests, anywhere from 478 to 503 tests will run before the error appears, after which point the engine hangs indefinitely. If I select the same exact combination of tests, UE4 will freeze after executing exactly the same number of tests (usually 503 to 504 tests, but it varies based on which tests were selected).

I am at a loss to understand why this happens. If we run the tests in smaller batches, we can execute all of them without the error appearing. The strange thing is that none of our tests involve physics simulation – we’re just testing stat calculations with the GAS. But, to instantiate the stats on an ASC, we have to spin up a world and spin up a test pawn for every test case, so if there are 500+ tests, we’re spinning up and then destroying about 500+ worlds. That being the case, I am inclined to believe this is related to a race condition in the engine, possibly between PhysX handling and garbage collection.

On the off chance this was timing related, I tried adding some 50 millisecond sleeps via FPlatformProcess::Sleep() before, during, and after calls that initialize the world and destroy it, but that had no discernible impact other than making tests run slower.

The specs I am running can be found here:
https://github.com/OpenPF2/PF2Core/tree/feature/specs-and-unit-tests/Source/OpenPF2Core/Private/Tests

This is still an issue in 4.27.2.

Running ~500 or more tests results in this error that freezes the UI:
Error LogPhysicsCore PHYSX: (D:\Build\++Fortnite\Sync\Engine\Source\ThirdParty\PhysX3\PhysX_3.4\Source\PhysX\src\NpScene.cpp 3006) eINVALID_OPERATION : PxScene::unlockRead() called without matching call to PxScene::lockRead(), behaviour will be undefined.

I encountered something that seems to be the same error, although in a completely different situation. It may no longer be relevant to you at this point, but I would like to describe my findings for future reference.

If it really is the same thing:

The underlying issue

  • When a UWorld is created with physics enabled while PhysX is in use, an FPhysScene instance is created (as the UWorld::PhysicsScene member).
  • The FPhysScene creates, among other things, an instance of NpScene.
  • The NpScene::NpScene(...) constructor allocates a TLS (thread-local storage) slot at the line mThreadReadWriteDepth = Ps::TlsAlloc();.
  • The validity of mThreadReadWriteDepth is not checked, so nothing warns if the TLS allocation fails.
  • The NpScene::lockRead(...) and NpScene::unlockRead() methods use mThreadReadWriteDepth for locking and unlocking.
  • If mThreadReadWriteDepth is invalid, unlockRead() fails, because TlsGetValue(...) returns an invalid value when given an invalid TLS index.

So, if too many such UWorld instances are created, the system runs out of TLS slots. That breaks locking and unlocking of the PhysX scene, ultimately causing the described error and a deadlock.
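To make the failure mode easier to picture, here is a minimal, engine-free model of it. Nothing in it is PhysX or Unreal code: SlotTable, ModelScene, kSlotCapacity, and the two helper functions are hypothetical names, and the capacity of 8 is purely illustrative (the real limit depends on the OS). It only mimics the pattern described above: a slot is allocated without checking the result, and once the table is full, a lock/unlock round trip no longer balances.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for the OS TLS table. The capacity is illustrative only.
constexpr std::uint32_t kInvalidSlot = 0xFFFFFFFFu;
constexpr std::size_t kSlotCapacity = 8;

class SlotTable {
public:
    // Returns a free slot index, or kInvalidSlot once the table is exhausted.
    std::uint32_t alloc() {
        for (std::size_t i = 0; i < kSlotCapacity; ++i) {
            if (!used_[i]) {
                used_[i] = true;
                values_[i] = 0;
                return static_cast<std::uint32_t>(i);
            }
        }
        return kInvalidSlot;
    }
    void release(std::uint32_t slot) {
        assert(slot == kInvalidSlot || slot < kSlotCapacity);
        if (slot < kSlotCapacity) used_[slot] = false;
    }
    // Reads and writes through an invalid slot fail silently in this model.
    std::size_t get(std::uint32_t slot) const {
        return slot < kSlotCapacity ? values_[slot] : 0;
    }
    void set(std::uint32_t slot, std::size_t value) {
        if (slot < kSlotCapacity) values_[slot] = value;
    }
private:
    std::array<bool, kSlotCapacity> used_{};
    std::array<std::size_t, kSlotCapacity> values_{};
};

// Hypothetical scene that tracks its read-lock depth in a slot.
class ModelScene {
public:
    // The allocation result is deliberately NOT checked, as in the description above.
    explicit ModelScene(SlotTable& table) : table_(table), depthSlot_(table.alloc()) {}
    ~ModelScene() { table_.release(depthSlot_); }
    ModelScene(const ModelScene&) = delete;
    ModelScene& operator=(const ModelScene&) = delete;

    void lockRead() { table_.set(depthSlot_, table_.get(depthSlot_) + 1); }
    // Returns false when no matching lockRead() is recorded.
    bool unlockRead() {
        const std::size_t depth = table_.get(depthSlot_);
        if (depth == 0) return false;
        table_.set(depthSlot_, depth - 1);
        return true;
    }
private:
    SlotTable& table_;
    std::uint32_t depthSlot_;
};

// Creates sceneCount scenes that all stay alive; counts balanced lock/unlock round trips.
std::size_t countHealthyKeptAlive(SlotTable& table, std::size_t sceneCount) {
    std::vector<std::unique_ptr<ModelScene>> alive;
    std::size_t healthy = 0;
    for (std::size_t i = 0; i < sceneCount; ++i) {
        alive.push_back(std::make_unique<ModelScene>(table));
        alive.back()->lockRead();
        if (alive.back()->unlockRead()) ++healthy;
    }
    return healthy;
}

// Same loop, but each scene is destroyed (and its slot released) before the next one is created.
std::size_t countHealthyReleased(SlotTable& table, std::size_t sceneCount) {
    std::size_t healthy = 0;
    for (std::size_t i = 0; i < sceneCount; ++i) {
        ModelScene scene(table);
        scene.lockRead();
        if (scene.unlockRead()) ++healthy;
    }
    return healthy;
}
```

In this model, keeping ten scenes alive against a table of eight slots leaves the last two scenes unable to match their unlockRead() to a lockRead(), while destroying each scene before creating the next keeps all ten healthy.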

How to fix it

You need to make sure that not too many UWorld instances exist at the same time (i.e. far fewer than 500). In your case, maybe you just need to run the garbage collector between tests. That is suggested by your observation that “If we run the tests in smaller batches, we can execute all of them without the error appearing.” My guess is that GC runs when the tests finish and cleans up the pending worlds.

If running GC doesn’t work, it means Unreal is refusing to destroy these UWorld instances. In that case, as a hack fix, call the UWorld::ReleasePhysicsScene() method on UWorld instances that are not supposed to be used anymore. That releases the FPhysScene, which ultimately releases the TLS. But doing that on a world that is actively in use will most likely cause issues and maybe even a crash.

Notes

If you (the reader) are a Blueprint-only user of UE or generally don’t know C++, all that mumbo-jumbo above is probably completely cryptic to you. In that case: make sure not to load too many levels, and make sure to properly unload the ones that are no longer supposed to be used. If that doesn’t help, you need a C++ programmer who has access to your project and the engine source code, can use a debugger, and can investigate the issue in more depth.


@Aithoneku YOU ROCK! Thank you for finally solving this stubborn issue for me.

As you surmised, it was garbage collection. So, just adding the following to my world teardown code solved the issue:

GEngine->ForceGarbageCollection(true);