Cook fails because the CookWorker crashed

Hi, FCookWorkerServer::SendCrashDiagnostics logs errors if a CookWorker crashes e.g. with the message :

“LogCook: Error: CookWorkerCrash: CookWorker 2 failed to read from socket with description: Connection terminated. we will shutdown the remote process. Assigned packages will be returned to the director.”

However we don’t want to really fail the cook since its “Assigned packages will be returned to the director.”, and we don’t want to ignore LogCook errors either.

What are we missing? Thanks in advance.

Steps to Reproduce

Hi,

I was able to find a very similar JIRA for this (FORT-859303 for internal ref) with the same errors logged. That issue was fixed at CL 39949365 (https://github.com/EpicGames/UnrealEngine/commit/e3d33961c6a282b45f112fca723ebb4477d78501\) and was released with 5.6. The change list for the fix is simple, so you should try it and let me know if that solve the issue.

Regards,

Patrick

LogCook: Display: LostConnection to CookWorker 2. Log messages written after communication loss:

17:09:38 LogCook: Display: [CookWorker 2]: Cooked packages 880 Packages Remain 651 Total 1531

17:09:38 LogCook: Display: [CookWorker 2]: Cooked packages 896 Packages Remain 636 Total 1532

17:09:38 LogCook: Display: [CookWorker 2]: Cooked packages 912 Packages Remain 628 Total 1540

17:09:38 LogCook: Display: [CookWorker 2]: Cooked packages 928 Packages Remain 612 Total 1540

17:09:38 LogCook: Display: [CookWorker 2]: Cooked packages 944 Packages Remain 596 Total 1540

17:09:38 LogCook: Display: [CookWorker 2]: Cook Diagnostics: OpenFileHandles=5033, VirtualMemory=11177MiB, VirtualMemoryAvailable=193902MiB

17:09:38 LogCook: Display: [CookWorker 2]: Cooked packages 960 Packages Remain 580 Total 1540

17:09:43 LogCook: Error: CookWorkerCrash: CookWorker 2 failed to read from socket with description: Connection terminated. we will shutdown the remote process. Assigned packages will be returned to the director.

What I don’t get is how the bLoggedErrorMessage get to true if all the “Log messages written after communication loss” don’t have any Error log category that would make TryParseLogCategoryVerbosityMessage set the Verbosity to Error.

> However we don’t want to really fail the cook

Failing the cook when a CookWorker crashes is by design, to avoid the case where something is wrong with the cooked build but we didn’t log an error so a bad build gets released to QA or endusers. It would take some significant extra code to guarantee that the cook is 100% valid after a CookWorker crash; it is possible that some partially completed messages leaked through to the director before the crash, and to make it guaranteed valid we would have to add some message handling to mark all of those messages and undo them after a crash, before reassigning the package save to a new worker. Since this is a failure case that we want to avoid anyway, rather than implementing that extra code we chose to flag it as an error, and we prioritize fixing crashes on CookWorkers with the same priority as a crash on the CookDirector.

I recommend that you keep that same strategy and diagnose and fix the crash rather than downgrading the severity of the error.

If you want to proceed with it anyway, you will need to change the code in two places, both of which are in SendCrashDiagnostics function that you were looking at.

The first block allows messages that were logged as Errors on the crashing client before it crashed to be logged as errors on the director as well. The code currently downgrades Fatal severity to Error severity; you will additionally need to downgrade Error severity to Warning or lower:

			// Downgrade Fatals in our local verbosity from Fatal to Error to avoid crashing the CookDirector
			if (Verbosity == ELogVerbosity::Fatal)
			{
				Verbosity = ELogVerbosity::Error;
			}
// NEW CODE
			// Do not log an error even if errors were reported by the CooKWorker when it crashed.
			if (Verbosity == ELogVerbosity::Error)
			{
				Verbosity = ELogVerbosity::Warning;
			}
// END NEW CODE
			bLoggedErrorMessage |= Verbosity == ELogVerbosity::Error;

The second block is at the bottom of FCookWorkerServer::SendCrashDiagnostics. The use of the variable bLoggedErrorMessage currently is so that we always ensure that we have log an error and forced the cook to fail due to the crash. If the error came already from an error logged on the CookWorker, then we are done and don’t need to log an additional error. But if the cookworker did not send us a logmessage with severity error, then we log our CrashDiagnostic message at severity Error to force the presence of an Error message. You do not want this behavior, so you can turn off the code in that block that reads bLoggedErrorMessage:

// NEW CODE
                // We do not want to force an error log message; always log the CrashDiagnosticsError at Warning severity.
               (void) bLoggedErrorMessage; // We no longer use this variable
// Commented-out Epic's original code:
//		if (!bLoggedErrorMessage)
//		{
//			UE_LOG(LogCook, Error, TEXT("%s"), *CrashDiagnosticsError);
//		}
//		else
// END NEW CODE
		{
			// When we already logged an error from the crashed worker, log the what-went-wrong as a warning rather
			// than an error, to avoid making it seem like a separate issue.
			UE_LOG(LogCook, Warning, TEXT("%s"), *CrashDiagnosticsError);
		}

Hi Matt, thanks for the detailed answer. We ended up doing the second change only, since we want to fail the cook if the CookWorker logged an error, and not fail the cook if no error where logged by the CookWorker (if (!bLoggedErrorMessage)) and so we log the CrashDiagnosticsError as a warning instead (as you new code suggests). Thanks.