Intermittent cook worker crash during multiprocess cook

Since upgrading to 5.6, we’ve been seeing some intermittent cook failures. We tracked this down to being caused by one of the cook worker processes crashing, with what looks like memory corruption. We have not been able to repro with the stomp allocator to track this down further. We also have not managed to repro it with TBB yet either.

At this point we figured we should reach out to see if this is something that was seen internally at Epic, if there’s a fix available, or if not what you would recommend for next steps.

Related, it seems that even after the crash the director will reassign and cook the leftover packages. However, the disconnect gets logged as an error, so the cook exits with code 1 anyway. Is this the expected behavior, or is there an intended way for us to complete the cook successfully after a worker crash?

We haven’t seen this in 5.6 at Epic. During 5.4 we had cases of MPCook crashing on dangling pointers after garbage collection, caused by raw pointers that were not declared to the garbage collector. These crashes were more likely during MPCook than SPCook in 5.4 because MPCook was garbage collecting more frequently due to higher memory pressure, and was using Soft Garbage Collection, which keeps more packages in memory (in some cases including the ones holding the undeclared pointer). Tracking down that issue was simpler than tracking down a memory stomp because we could sometimes find the referencer of the dangling pointer and find out how it was supposed to guarantee the object would be in memory. In 5.6, crashes due to garbage collection are now almost as likely to occur in SPCook as in MPCook, because we have expanded Soft Garbage Collection to occur in SPCook, and we run it periodically every two minutes or so. But MPCook does encounter more memory pressure on the machine and therefore does more Full Garbage Collects than SPCook, so some MPCook-specific garbage collection bugs still exist.

Other than dangling pointers caused by extra memory pressure, we have never seen memory stomps in MPCook that do not occur in SPCook.

You did not mention SPCook. If you are using MPCook exclusively, the problem is not necessarily specific to MPCook and might also occur with SPCook. You can disable those more frequent Soft GCs by editing the SoftGC variables in DefaultEditor.ini:

`; Existing settings in 5.6:

[CookSettings]
SoftGCStartNumerator=0
SoftGCTimeFractionBudget=.05

; Settings to restore 5.5 behavior:
[CookSettings]
SoftGCStartNumerator=5
SoftGCTimeFractionBudget=0`

Try those 5.5 behavior settings to see if the problem reproduces less frequently.

If so, your problem is likely due to a dangling pointer during garbage collection and you can focus on finding those.

Here’s one method that might allow you to find it; we may be able to think of more.

  • Create a global TArray<FYourDataType> that is filled in from the garbage collection delegates described below and that you will read in the debugger
    • FYourDataType has
      • UPTRINT ObjectMemoryStart;
      • UPTRINT ObjectMemoryEnd;
      • FString ObjectPath;
  • In FCoreUObjectDelegates::GetPreGarbageCollectDelegate() you
    • add every UObject in memory to a TMap<UObject*, FYourDataType> using a TObjectIterator
    • Populate the FYourDataType fields using Object->GetPathName(), (UPTRINT)Object, and ((UPTRINT)Object) + Object->GetClass()->GetPropertiesSize().
  • In FCoreUObjectDelegates::GetPostGarbageCollect() you
    • iterate over every UObject in memory again, and remove all of them from the TMap, so the TMap only contains entries for deleted objects
    • Move the TMap’s values over to a TArray
    • Sort the TArray by ObjectMemoryStart
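
As a rough, engine-independent sketch of the bookkeeping in the list above: the snapshot taken before garbage collection is keyed by object address, survivors are erased afterwards, and what remains is sorted by start address. The names GcDeletionTracker, LiveObject and ObjectRecord are hypothetical stand-ins (ObjectRecord plays the role of FYourDataType); in the engine the two methods would be driven from the pre- and post-garbage-collect delegates with a TObjectIterator supplying the objects, which this sketch does not attempt to reproduce.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for FYourDataType from the list above.
struct ObjectRecord {
    std::uintptr_t ObjectMemoryStart = 0;
    std::uintptr_t ObjectMemoryEnd = 0;   // one past the last byte
    std::string ObjectPath;
};

// Hypothetical stand-in for what an object iterator would yield per object.
struct LiveObject {
    std::uintptr_t Address;
    std::size_t Size;
    std::string Path;
};

class GcDeletionTracker {
public:
    // Pre-GC step: record every object currently in memory, keyed by address.
    void OnPreGarbageCollect(const std::vector<LiveObject>& Live) {
        Snapshot.clear();
        for (const LiveObject& Obj : Live) {
            ObjectRecord Rec;
            Rec.ObjectMemoryStart = Obj.Address;
            Rec.ObjectMemoryEnd = Obj.Address + static_cast<std::uintptr_t>(Obj.Size);
            assert(Rec.ObjectMemoryEnd >= Rec.ObjectMemoryStart);  // no wraparound
            Rec.ObjectPath = Obj.Path;
            Snapshot[Obj.Address] = std::move(Rec);
        }
    }

    // Post-GC step: erase survivors so only deleted objects remain, then
    // move them into an array sorted by start address.
    // Design choice in this sketch: only the most recent GC is kept.
    void OnPostGarbageCollect(const std::vector<LiveObject>& Survivors) {
        for (const LiveObject& Obj : Survivors) {
            Snapshot.erase(Obj.Address);
        }
        Deleted.clear();
        for (auto& Entry : Snapshot) {
            Deleted.push_back(std::move(Entry.second));
        }
        Snapshot.clear();
        std::sort(Deleted.begin(), Deleted.end(),
                  [](const ObjectRecord& A, const ObjectRecord& B) {
                      return A.ObjectMemoryStart < B.ObjectMemoryStart;
                  });
    }

    const std::vector<ObjectRecord>& GetDeleted() const { return Deleted; }

private:
    std::unordered_map<std::uintptr_t, ObjectRecord> Snapshot;
    std::vector<ObjectRecord> Deleted;
};
```

Keeping only the latest collection keeps the address ranges from overlapping when memory is reused; accumulating across collections is possible too, but the ranges would then need a timestamp or GC index to disambiguate.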

You then reproduce the problem, and after getting the pointer value for the stomped memory, look at the array in the debugger and see whether the pointer value matches one of the elements in the array.
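
If scanning the array by eye gets tedious, the match can be done with a binary search, since the array is sorted by ObjectMemoryStart. This is only a sketch with plain standard-library types; FindDeletedObject and ObjectRecord are hypothetical names, and it assumes the recorded ranges do not overlap.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for FYourDataType.
struct ObjectRecord {
    std::uintptr_t ObjectMemoryStart;
    std::uintptr_t ObjectMemoryEnd;   // one past the last byte
    std::string ObjectPath;
};

// Returns the record whose [start, end) range contains Address, or nullptr.
// Precondition: Sorted is ordered by ObjectMemoryStart with no overlaps.
const ObjectRecord* FindDeletedObject(const std::vector<ObjectRecord>& Sorted,
                                      std::uintptr_t Address) {
    assert(std::is_sorted(Sorted.begin(), Sorted.end(),
                          [](const ObjectRecord& A, const ObjectRecord& B) {
                              return A.ObjectMemoryStart < B.ObjectMemoryStart;
                          }));
    // First record that starts strictly after Address.
    auto It = std::upper_bound(
        Sorted.begin(), Sorted.end(), Address,
        [](std::uintptr_t Value, const ObjectRecord& Rec) {
            return Value < Rec.ObjectMemoryStart;
        });
    if (It == Sorted.begin()) {
        return nullptr;  // Address lies below every recorded object
    }
    --It;  // last record that starts at or before Address
    return Address < It->ObjectMemoryEnd ? &*It : nullptr;
}
```

A hit means the stomped address falls inside an object that the last garbage collection deleted, and the ObjectPath tells you which one.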

If your problem still occurs even without frequent SoftGC, and StompAllocator and TBB are not finding it, I don’t have any great suggestions for tracking it down. Maybe there is a commonly encountered package, or a commonly encountered class used by all of the packages, reported by the cooker as the ActivePackage when the crash occurs. The cooker reports its ActivePackage to the CrashReporter context and to a file written next to the minidump saved by CrashReporter:

<ProjectRoot>\Saved\Crashes\UECC-Windows-<SomeGuid>\ActivePackage.txt.
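
If you have collected several crash folders, a small standalone tool can tally which package shows up most often. The sketch below makes two assumptions that are not confirmed above: that each crash folder holds at most one ActivePackage.txt, and that the whole text of the file identifies the package. TallyActivePackages is a hypothetical name.

```cpp
#include <cctype>
#include <filesystem>
#include <fstream>
#include <map>
#include <sstream>
#include <string>

// Counts how often each ActivePackage.txt content appears in the
// immediate subfolders of CrashesRoot (e.g. the Saved\Crashes folder).
// Assumption: the file's full text, minus trailing whitespace, names the
// package; adjust the parsing if the real file layout differs.
std::map<std::string, int> TallyActivePackages(const std::filesystem::path& CrashesRoot) {
    std::map<std::string, int> Counts;
    for (const auto& Entry : std::filesystem::directory_iterator(CrashesRoot)) {
        if (!Entry.is_directory()) {
            continue;
        }
        std::ifstream File(Entry.path() / "ActivePackage.txt");
        if (!File) {
            continue;  // this crash folder has no ActivePackage.txt
        }
        std::stringstream Buffer;
        Buffer << File.rdbuf();
        std::string Text = Buffer.str();
        // Drop trailing newlines and spaces so identical packages compare equal.
        while (!Text.empty() && std::isspace(static_cast<unsigned char>(Text.back()))) {
            Text.pop_back();
        }
        if (!Text.empty()) {
            ++Counts[Text];
        }
    }
    return Counts;
}
```

A package that dominates the tally is a reasonable first place to look; an even spread suggests the crash is not tied to one package.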

For the error-handling recovery of the director: it is intentional that the cook gives an error exit in that case. We’re not sure whether retracting the packages after a crash is completely robust, and we decided not to work on making it robust because CookWorker crashes are terrible for performance and it’s better to fix the issue as a high priority than to try to make it possible to work around it. Usually when you get a build after a CookWorker crash with no other errors logged, the build is completely valid despite the error exit code, but that’s not currently guaranteed and we weren’t planning on making it guaranteed.

Thanks for the great info! We’ll give that a try and see what we find.

Hey Matt

We were able to trace this to something in the Substance plugin. Unfortunately it seems to be in memory managed by precompiled libraries that come with the plugin, but it doesn’t seem people are really using the plugin, so we’ll maybe just remove it.

Thanks again for your insights and advice!