We haven’t seen this in 5.6 at Epic. During 5.4 we had cases of MPCook crashing due to dangling pointers after garbage collection, caused by raw pointers not declared to the garbage collector. These crashes were more likely during MPCook than SPCook in 5.4 because MPCook was garbage collecting more frequently due to higher memory pressure, and was using Soft Garbage Collection, which keeps more packages in memory (including, in some cases, the ones holding the undeclared pointer). Tracking down that issue was simpler than tracking down a memory stomp because we could sometimes find the referencer of the dangling pointer and find out how it was supposed to guarantee the object would be in memory. In 5.6, crashes due to garbage collection are now almost as likely to occur in SPCook as in MPCook, because we have expanded Soft Garbage Collection to occur in SPCook, and we run it periodically, every two minutes or so. But MPCook does encounter more memory pressure on the machine and therefore does more Full Garbage Collects than SPCook does, so some MPCook-specific garbage collection bugs still exist.
Other than dangling pointers caused by extra memory pressure, we have never seen memory stomps in MPCook that do not occur in SPCook.
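To make that failure mode concrete, here is a minimal illustration of the kind of undeclared pointer involved; the class and member names are hypothetical, not something from the engine or your project:

```cpp
#include "CoreMinimal.h"
#include "UObject/Object.h"
#include "MyExample.generated.h"

UCLASS()
class UMyExample : public UObject
{
	GENERATED_BODY()

public:
	// Not declared to the garbage collector: if nothing else references the
	// target object, GC can delete it and this pointer dangles.
	UObject* UndeclaredPointer = nullptr;

	// Declared to the garbage collector via UPROPERTY: the reference is visible
	// to GC, so the target is kept alive while this object holds it.
	UPROPERTY()
	TObjectPtr<UObject> DeclaredPointer;
};
```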
You did not mention SPCook; are you using MPCook exclusively? If so, the problem is not necessarily specific to MPCook and might also occur with SPCook. You can disable the more frequent Soft GCs by editing the SoftGC variables in DefaultEditor.ini:
```ini
; Existing settings in 5.6:
[CookSettings]
SoftGCStartNumerator=0
SoftGCTimeFractionBudget=.05

; Settings to restore 5.5 behavior:
[CookSettings]
SoftGCStartNumerator=5
SoftGCTimeFractionBudget=0
```
Try those 5.5 behavior settings to see if the problem reproduces less frequently.
If so, your problem is likely due to a dangling pointer during garbage collection and you can focus on finding those.
Here’s one method that might allow you to find it (a rough sketch follows the list below); we may be able to think of more.
- Create a global TArray<FYourDataType> that is written to from FCoreUObjectDelegates::GetPreGarbageCollectDelegate() and that you will read in the debugger
- FYourDataType has
- UPTRINT ObjectMemoryStart;
- UPTRINT ObjectMemoryEnd;
- FString ObjectPath;
- In FCoreUObjectDelegates::GetPreGarbageCollectDelegate() you
- add every UObject in memory to a TMap<UObject*, FYourDataType> using a TObjectIterator
- Populate FYourDataType fields using Object->GetPathName(), (UPTRINT)Object, and ((UPTRINT)Object) + Object->GetClass()->GetPropertiesSize().
- In FCoreUObjectDelegates::GetPostGarbageCollect() you
- iterate over every UObject in memory again, and remove all of them from the TMap, so the TMap only contains entries for deleted objects
- Move the TMap’s values over to a TArray
- Sort the TArray by ObjectMemoryStart
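Here is a rough, untested sketch of that approach. FYourDataType matches the list above; the global variable and function names are placeholders I’ve chosen, and you should double-check the delegate signatures against your engine version:

```cpp
#include "CoreMinimal.h"
#include "UObject/UObjectGlobals.h"
#include "UObject/UObjectIterator.h"

struct FYourDataType
{
	UPTRINT ObjectMemoryStart = 0;
	UPTRINT ObjectMemoryEnd = 0;
	FString ObjectPath;
};

// Scratch map filled before GC; read GDeletedObjects in the debugger after GC.
static TMap<UObject*, FYourDataType> GPreGCObjects;
static TArray<FYourDataType> GDeletedObjects;

static void OnPreGarbageCollect()
{
	GPreGCObjects.Reset();
	for (TObjectIterator<UObject> It; It; ++It)
	{
		UObject* Object = *It;
		FYourDataType& Data = GPreGCObjects.Add(Object);
		Data.ObjectMemoryStart = (UPTRINT)Object;
		Data.ObjectMemoryEnd = (UPTRINT)Object + Object->GetClass()->GetPropertiesSize();
		Data.ObjectPath = Object->GetPathName();
	}
}

static void OnPostGarbageCollect()
{
	// Remove every object that survived GC; whatever remains in the map was deleted.
	for (TObjectIterator<UObject> It; It; ++It)
	{
		GPreGCObjects.Remove(*It);
	}
	GPreGCObjects.GenerateValueArray(GDeletedObjects);
	GPreGCObjects.Reset();

	// Sort by start address so the stomped pointer is easy to look up in the debugger.
	GDeletedObjects.Sort([](const FYourDataType& A, const FYourDataType& B)
	{
		return A.ObjectMemoryStart < B.ObjectMemoryStart;
	});
}

// Register once, e.g. from your module's StartupModule:
//   FCoreUObjectDelegates::GetPreGarbageCollectDelegate().AddStatic(&OnPreGarbageCollect);
//   FCoreUObjectDelegates::GetPostGarbageCollect().AddStatic(&OnPostGarbageCollect);
```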
You then reproduce the problem, and after getting the pointer value for the stomped memory, look at the array in the debugger and see whether the pointer value falls within the [ObjectMemoryStart, ObjectMemoryEnd) range of one of the elements in the array.
If your problem still occurs even without frequent SoftGC, and StompAllocator and TBB are not finding it, I don’t have any great suggestions for tracking it down. Maybe there is a commonly encountered package, or a commonly encountered class used by all of the packages, reported by the cooker as the ActivePackage when the crash occurs; the cooker reports its ActivePackage to the CrashReporter context and to a file written next to the minidump saved by CrashReporter:
<ProjectRoot>\Saved\Crashes\UECC-Windows-<SomeGuid>\ActivePackage.txt.
For the error-handling recovery of the director: it is intentional that the cook gives an error exit in that case. We’re not sure whether retracting the packages after a crash is completely robust, and we decided not to work on making it robust because CookWorker crashes are terrible for performance, and it’s better to fix the underlying issue as high priority rather than try to make it possible to work around it. Usually when you get a build after a CookWorker crash with no other errors logged, the build is completely valid despite the error exit code, but that’s not currently guaranteed and we weren’t planning on making it guaranteed.