Multiprocess cooks resulting in sporadic crash

After upgrading to UE5.6, we tried turning on multi-process cooks by modifying DefaultEditor.ini to have

[CookSettings]
CookProcessCount=3

However, we started seeing an intermittent failure that began around the time we enabled multi-process cooks and stopped after we disabled them.

There don’t seem to be any other reports of this happening, so I’m wondering if this is a known issue, or if you have any advice on how to troubleshoot and resolve it.

I did notice that the recent 5.6.1 hotfix seems to address something related

UE-295085 	Crash in garbage collector during cook, due to accessing an FProperty that has been freed

Is this a fix for the issue we’re encountering? How do I map an issue to a specific commit so I can see if cherry-picking it resolves our issue?

Unreal Engine Issues and Bug Tracker (UE-295085) reports nothing for that number

Steps to Reproduce
Enable multiprocess cooks

Fire off many cooks

See this happen periodically

Hi Adam,

The GH commit would be https://github.com/EpicGames/UnrealEngine/commit/a605b0b8d44a66fb3e48879aa89d9ea68fd47067

Martin

I’ll transfer your case to the owner of the cooker.

Martin

For the assertion you reported:

LogWindows: Error: appError called: Assertion failed: CachedObjectsInOuterInfo.IsEmpty() [File:D:\HORDE\<project>-IncrementalCook\Sync\Engine\Source\Editor\UnrealEd\Private\Cooker\CookGenerationHelper.cpp] [Line: 2571]
Cooker bug: Expected FCookGenerationInfo::Uninitialize to not be called for a package in an active save state, but generated package /Game/<project>/Levels/<project>Main/_Generated_/36DNYZM01VKXA5627POVIVZ9P has non-empty CachedObjectsInOuterInfo. Package State == Idle, SaveSubState == StartSave

We haven’t seen that error occur yet. I reviewed the code but haven’t found any leads for how it might occur. The verbose logs might give a lead; if not, I’ll submit some more instrumentation and send you a link to the GitHub commit to cherry-pick.

Can you try to reproduce the error with verbose LogCookGenerationHelper logs on, and post the CookDirector’s log from when it reproduces?

To run with LogCookGenerationHelper, run the cookcommandlet with the argument

-logcmds="LogCookGenerationHelper Verbose"

or modify ini variables:

<ProjectRoot>\Config\DefaultEngine.ini

[Core.Log]
LogCookGenerationHelper=Verbose

If you’re launching the cook from the editor or from AutomationTool/RunUAT, then to pass extra commandline options you will need to add them via the AdditionalCookerOptions commandline argument. Passing the nested quotes through is difficult, so in that case it’s better to use the ini setting instead.
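For illustration, a RunUAT invocation might look something like the line below (the project path and platform are placeholders, and whether the inner-quote escaping survives your shell and UAT’s argument parsing varies, which is exactly why the ini route is easier):

RunUAT.bat BuildCookRun -project=<ProjectRoot>\<ProjectName>.uproject -platform=Win64 -cook -AdditionalCookerOptions="-logcmds=\"LogCookGenerationHelper Verbose\""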

If you’re cooking from the editor or from AutomationTool/RunUAT, the director’s cook log will be written to <UnrealEngineRoot>\Engine\Programs\AutomationTool\Saved\Logs\Cook*.txt.

If you’re running the cookcommandlet from commandline, the default log location is instead at <ProjectRoot>\Saved\Logs\<ProjectName>.log

I think converting to a private ticket is the only way to share confidential information, and you can’t convert back. Can you create a confidential ticket and notify me in it with @ and reference this ticket?

If that’s inconvenient I can ask our support experts whether there’s an alternative.

Thanks, I believe I see the problem. I’m surprised we never encountered this internally; that’s the only reason I doubt my analysis. Can you try out the fix? I am still testing it locally and it might cause knock-on errors, but if you’re getting frequent reproductions of the problem then I expect you will hit those errors quickly and can let me know. Otherwise, if neither your testing nor mine turns up any errors, I will submit the fix into UE5/Main in a few days, for inclusion in 5.6.2 if we make one. If your problem still occurs after the attempted fix, I will submit it anyway, but will add some more verbose logging and ask you for a new log to analyze.

What went wrong:

The log reports that the package with the problem started to save on the Director, was retracted from the Director and sent to a CookWorker, the CookWorker finished it, and then the Director noticed that CachedObjectsInOuterInfo was unexpectedly non-empty. That sequence of events means that the CookDirector sent the package to SaveStalledAssignedToWorker, possibly after populating CachedObjectsInOuterInfo, and then received the save results from the CookWorker in FCookWorkerServer::RecordResults. RecordResults does not trigger the cleanup function for the package, FCookGenerationHelper::ResetSaveState, which is the only function that clears CachedObjectsInOuterInfo.

The fix:

RecordResults should call UCookOnTheFlyServer::ReleaseCookedPlatformData on the package before setting its ParentGenerationHelper to nullptr, so that ReleaseCookedPlatformData can call ParentGenerationHelper->ResetSaveState.

Insert this code between BUGFIXSTART and BUGFIXEND into the location in RecordResults where we call SetParentGenerationHelper(nullptr, StateChangeReason):

Engine\Source\Editor\UnrealEd\Private\Cooker\CookWorkerServer.cpp

void FCookWorkerServer::RecordResults(FPackageResultsMessage& Message)
{
...
		// If generated or generator, and this is a save or other end-of-cook signal, defer events and mark saved
		if (PackageData->IsGenerated() && bTerminalStateChange)
		{
			TRefCountPtr<FGenerationHelper> ParentGenerationHelper = PackageData->GetOrFindParentGenerationHelper();
			if (!ParentGenerationHelper)
			{
...
			}
			else
			{
				DeferGenerationHelperEvents.Emplace(ParentGenerationHelper);
				for (int32 PlatformIndex = 0; PlatformIndex < NumPlatforms; ++PlatformIndex)
				{
					ITargetPlatform* TargetPlatform = OrderedSessionPlatforms[PlatformIndex];
					FPackageRemoteResult::FPlatformResult& PlatformResult = Result.GetPlatforms()[PlatformIndex];
					if (!bResultIsSaveResult || PlatformResult.WasCommitted())
					{
						ParentGenerationHelper->MarkPackageSavedRemotely(COTFS, *PackageData, TargetPlatform, GetWorkerId());
					}
				}
// BUGFIXSTART
				// ReleaseCookedPlatformData would ordinarily be called when we leave the SaveState during
				// PromoteToSaveComplete or SendToState below, but we need to call it now before we
				// SetParentGenerationHelper(nullptr), so that it can call ParentGenerationHelper->ResetSaveState.
				// It is safe to call twice, so call it now and it will be a noop when called again later.
				// Call it with EStateChangeReason::DoneForNow and EPackageState::AssignedToWorker so that it does
				// the minimum shutdown for leaving the SaveState rather than also doing extra work for packages
				// returning to idle (such as SetParentGeneration(nullptr)) that we need to do ourselves in a specific
				// order.
				COTFS.ReleaseCookedPlatformData(*PackageData, EStateChangeReason::DoneForNow,
					EPackageState::AssignedToWorker);
// BUGFIXEND
				PackageData->SetParentGenerationHelper(nullptr, StateChangeReason);
			}
		}
...

The warning “Cooker has been blocked from saving the current packages for 121 seconds” only happens when it takes more than the warning time (2 minutes) for a UObject to return true from IsCachedCookedPlatformDataLoaded because it’s waiting for async work on another thread. The warning is only given when the cooker has no other packages it can work on in the meantime.

The warning is issued (1) because it identifies the case when deadlocks happen and the package never returns true and (2) to communicate why the cooker is not making progress during a long wait on one of these objects. Other than those two cases it is not intended to identify performance issues.

The reduction in the number of hyperthreads you documented is a known issue with the current approach for MPCook: we split the number of hyperthreads assigned to each process naively, rather than allowing each process to use threads when it gets busy, out of the fear that they would all get busy at the same time, oversubscribe the hardware threads available on the machine, and cause a slower overall outcome due to the thrashing. We knew that the naive split would eventually be suboptimal as we push more work off to async threads, but we haven’t encountered that yet and don’t have a design plan for something more sophisticated. Possibly we could allow oversubscription without a problem; that would be the first thing to check, and would only require a few-lines code change in CookDirector.cpp.
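Conceptually the naive split looks something like the sketch below; this is an illustration of the idea only, not the actual CookDirector.cpp code, and the variable names are made up:

	// Illustrative sketch: divide the machine's cores evenly across the cook processes,
	// rather than letting each process claim the whole machine when it happens to be busy.
	const int32 NumCookProcesses = CookProcessCount; // e.g. 2 or 3 from [CookSettings]
	const int32 CoresPerProcess = FMath::Max(1, FPlatformMisc::NumberOfCores() / NumCookProcesses);
	const int32 HyperthreadsPerProcess =
		FMath::Max(1, FPlatformMisc::NumberOfCoresIncludingHyperthreads() / NumCookProcesses);
	// Allowing oversubscription would mean skipping (or relaxing) this division,
	// which is the few-lines change referred to above.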

Other than that assignment of hyperthreads, it is unexpected that MPCook would encounter stalls more frequently than SPCook.

Since the problem of naive threadcount assignment hasn’t hit anyone else yet (or they haven’t reported it), it’s possible that you’re not hitting it either, and are instead hitting an object type that is slow to build because it has inefficient async work. The cooker logs the list of the types it is waiting on; are there engine types in that list? Which ones? Are there types from your project in that list? Can you look at their IsCachedCookedPlatformDataLoaded, and the threads they launch, to see whether it’s reasonable for them to take 2 single-threaded minutes for some objects?
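For reference, the pattern the cooker relies on looks roughly like this; UMyBuildHeavyAsset, AsyncBuildTask, and StartAsyncBuild are hypothetical stand-ins for one of your project types, not engine code:

	// The cooker calls BeginCacheForCookedPlatformData once per object/platform,
	// then polls IsCachedCookedPlatformDataLoaded until it returns true.
	void UMyBuildHeavyAsset::BeginCacheForCookedPlatformData(const ITargetPlatform* TargetPlatform)
	{
		Super::BeginCacheForCookedPlatformData(TargetPlatform);
		AsyncBuildTask = StartAsyncBuild(TargetPlatform); // hypothetical: kick off work on another thread
	}

	bool UMyBuildHeavyAsset::IsCachedCookedPlatformDataLoaded(const ITargetPlatform* TargetPlatform)
	{
		// Returning false keeps the cooker waiting on this object; if the async build
		// routinely takes minutes, the cooker stalls (and warns) whenever it has
		// nothing else it can work on.
		return AsyncBuildTask == nullptr || AsyncBuildTask->IsDone();
	}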

You can also capture a profile to see where the time is going. It would be easiest to look at that in a SingleProcess cook first, and only profile a Multiprocess cook, with its more complicated capture of profile data from the multiple cook processes, if the SingleProcess data doesn’t give you any leads for performance improvements.

The recommended profilers are UnrealInsights (run the UnrealInsights program built in the Unreal solution, then run the cookcommandlet with the -trace argument), or an off-the-shelf profiler such as Event Tracing for Windows with a viewer like Superluminal.
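As a starting point, a commandlet invocation with tracing enabled might look something like the line below; the project path and platform are placeholders, and the exact set of trace channels is up to you (cpu, counters, and log are a reasonable default):

UnrealEditor-Cmd.exe <ProjectRoot>\<ProjectName>.uproject -run=Cook -TargetPlatform=Windows -trace=cpu,counters,log -tracehost=127.0.0.1

With -tracehost pointing at the machine running Unreal Insights the trace streams to it live; alternatively -tracefile=<path> writes the trace to disk for opening later.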

Possibly MPCook is slow for reasons unrelated to the stall.

MPCook is known to be slow sometimes due to insufficient memory, causing frequent Full Garbage Collects that drop every UObject possible (rather than the periodic Soft Garbage Collects that only free packages that will no longer be needed). In that case, the cooker wastes a lot of time running Garbage Collection code and then reloading packages.

If that is the case, the solution is to increase RAM on the machine. You can tell whether that is the case by searching for LogCook messages that mention

Operating system has signalled that memory pressure is high.
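For example, on a Windows build agent you could search the saved cook logs with something like:

findstr /c:"memory pressure is high" <UnrealEngineRoot>\Engine\Programs\AutomationTool\Saved\Logs\Cook*.txt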

Other than memory pressure causing that GC thrashing, it is unexpected for MPCooks to be slower than SPCooks. But they might not be much better than SPCooks either, if your content has a high dependency rate that stymies our attempt to cluster the packages into groups that can be loaded independently; in that case all of the CookWorkers have to load most packages, duplicating the loading effort that accounts for around 50% of the cooktime.

The Static Meshes explain why a lot of work might be going on on async threads, but do not explain why one of them would take more than 2 minutes.

The poor performance due to unsolicited packages is almost certainly not a problem: the performance cost is relatively minor and you likely will not notice it. Also, that occurrence of the warning, for a package referenced by a level, is probably a false report that another licensee reported in 5.6.1 last week and that we fixed in CL 44929591, aka github commit 94bd60a6db86d5f7a96582ade5393cc6b6e4f6db.

So no definite leads from those two. The most likely fruitful lead for the stall is the diagnostic information about which UObjects are being waited on during the stall.

I submitted the Bugfix that you locally copied above into UE5/Main: CL 45049574 aka github commit 4f03a49ba86c46bcf62550ffc7aeb043d750bb7b.

The messages you found about OOM are not necessarily a problem, they are just performance logs. We hit those from time to time in the Fortnite cook as well.

You should capture the performance data (the trace or third party profile) and see if you can find anything.

How bad is the relative performance of MPCook compared to SPCook (with each cookprocesscount that you have tested)? What is the walltime of SPCook and what is the walltime of MPCook?

Glad to help, thanks for the report and debugging.

Oh dang, we already have that change, so I guess it’s not a fix for our issue. I thought we hadn’t yet grabbed that change since we’re on CL 43944956 from July 1st and the hotfix just came out.

I got lucky and repro’d the issue on the first attempt

I have cook logs I can share but they contain sensitive information about the project, is there a way to share them privately or do we need to convert this post to a private one?

After we resolve the issue, is it possible to delete the logs and convert the post back to being public facing?

Done and tagged you in it, case 500QP00000tqsPdYAI

thank you!

Awesome, thank you! Trying it out now.

It was relatively low repro (though I did manage to get it on my first attempt last time), so I’ll hammer on it for the rest of the day, then try to roll it out wide tomorrow and update you on status Wednesday (and obviously follow up again if we do hit it after Wednesday)

Yeah we’ve been hammering on our builds for a bit now and the issue hasn’t occurred so I think the fix is looking promising

Hmm so from a stability perspective, it seems to be working well

However, from a performance perspective, it seems to be slowing things down

We’re getting lots of logs with stalls that seem to be slowing things down significantly. We’ve tried with a CookProcessCount of 2 and 3 with the same result

LogCook: Display: CookMultiprocess changed number of cores from 8 to 4.
LogCook: Display: CookMultiprocess changed number of hyperthreads from 16 to 8.
LogCook: Display: Cooked packages 0 Packages Remain 10453 Total 10453
LogCook: Display: Cook Diagnostics: OpenFileHandles=549, VirtualMemory=3343MiB, VirtualMemoryAvailable=110374MiB
...
LogCook: Warning: [CookWorker 0]: Cooker has been blocked from saving the current packages for 121 seconds.

Do you have any advice on stalls?

Hmm, no hits in the logs concerning memory pressure being high.

Tons of hits for things like the log excerpts shown below; guessing that could be the culprit?

LogStaticMesh: Display: Waiting for static meshes to be ready 984/1,588 (/Game/<project>/Levels/TestLevels/<foo1>/<foo2>/.../Concept/Cube_2032) ...
...
LogCook: Display: Unsolicited package /Game/<project>/Levels/TestLevels/<foo1>/<foo2>/.../Concept/Cube_2032 was loaded by package /Game/<project>/Levels/<Level>. This is a hidden dependency, causes poor cook performance, and it might be a bug to add it to the cook. Declare this package in the AssetRegistry dependencies of the loading package, or mark up its load with FCookLoadScope to specify whether it is runtime or editoronly.

Dang, I did find one instance of OOM being the culprit; this was on a build agent with 64 GB of RAM and a 46 GB page file.

LogCook: Display: [CookWorker 1]: Garbage collection triggers ignored: Out of memory condition has been detected, but is only 7s after the last GC. It will be prevented until 60 seconds have passed and we may run out of memory.
Garbage collection triggered by conditions:
  Operating system has signalled that memory pressure is high.
LogCook: Display: [CookWorker 1]: Cooked packages 8134 Packages Remain 583 Total 8717
LogCook: Display: [CookWorker 1]: Cook Diagnostics: OpenFileHandles=26667, VirtualMemory=22823MiB, VirtualMemoryAvailable=6093MiB
LogCook: Display: [CookWorker 1]: Cooked packages 8134 Packages Remain 583 Total 8717
LogCook: Warning: Cooker has been blocked from saving the current packages for 123 seconds.