At present, I’m only aware of hardware issues pertaining to 13th- and 14th-generation Intel Core CPUs. I haven’t heard of abnormal rates of issues on 11th- or 12th-generation CPUs. That said, those do appear to be similar crash sites to the aforementioned issues.
Confirming a hardware-level defect without access to the faulting hardware is non-trivial, and if it is a hardware issue, working around it is unlikely to be possible. If one cannot trust that the CPU is executing instructions in an expected and consistent manner, then one cannot trust any result from that CPU.
You mention that you only have reports of the issue. Are these reports from end-users, and if so, would it be possible to compute an incident rate against the population of similarly situated machines with regards to configuration? If the rate of failures among the population of similar hardware is relatively high, then it’s possible there is something here.
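For what it’s worth, a rough way to compute that kind of rate might look like the sketch below. It assumes you can export two things from your own telemetry: a mapping of machine ID to CPU model for the whole install base, and a list of machine IDs taken from the crash reports. The function name and the data shapes are hypothetical and not part of any Unreal or crash-reporter API.

```python
from collections import defaultdict

def incident_rate_by_cpu(fleet, crash_machine_ids):
    """Fraction of machines per CPU model that reported at least one crash.

    fleet: dict of machine_id -> CPU model string (the whole population).
    crash_machine_ids: iterable of machine_ids, one entry per crash report.
    Both inputs are assumed shapes for illustration only.
    """
    # Group the whole population by CPU model.
    machines = defaultdict(set)
    for machine_id, cpu in fleet.items():
        machines[cpu].add(machine_id)

    # Count each crashing machine once, however many reports it sent.
    affected = defaultdict(set)
    for machine_id in crash_machine_ids:
        cpu = fleet.get(machine_id)
        if cpu is not None:  # ignore reports from machines outside the fleet data
            affected[cpu].add(machine_id)

    return {cpu: len(affected[cpu]) / len(ids) for cpu, ids in machines.items()}
```

Counting affected machines rather than raw reports keeps a single unstable machine that crashes repeatedly from inflating the rate for its CPU model.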
Will chime in here and say I have one of the 14th gen CPU + MB combos that had this issue, and it was very easy to reproduce when doing any CPU-intensive operation, e.g., compiling UE or running a game that needed to do first-time PSO compilation on startup. It didn’t just crash the game, though; the whole computer would BSOD. Updating the BIOS or manually lowering the CPU voltage settings from the defaults in BIOS fixed the underlying issue.
I haven’t heard of any 11th/12th gen CPUs having this issue.
If you can provide an example log along with the relevant hardware specs of the machine it was running on, that might provide further clues.
If you can identify the hardware, you could potentially try lowering the amount of processing taken by PSO precaching on startup (relevant docs) for that specific CPU make/model, using CVar overrides with device profile rules (SRC_Chipset, SRC_MakeAndModel). Most likely the crash will still occur, just later in the game, and additional hitching may occur.
Thanks for the answer. I was asking this because of the following message in the log:
Fatal error: [File:J:\work\25b5fff9cfae834e\Engine\Source\Runtime\RenderCore\Private\ShaderCodeArchive.cpp] [Line: 405] DecompressShaderWithOodleAndExtraLogging(): Could not decompress shader group with Oodle. Group Index: 6692 Group IoStoreHash:d21e6f8344f6027953e95609 Group NumShaders: 19 Shader Index: 11070 Shader In-group Index: 6692 Shader Hash: 3E99C5A99AA6A20EAEF714620AFCAB9F1C46937A. The CPU (11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz) may be unstable; for details see http://www.radgametools.com/oodleintel.htm
I also have multiple crash reports from the same CPU that are very similar to what we had on 12th and 13th gen.
Great points, and I concur that I’ve never heard of 11th/12th gen failing at higher than expected rates. I personally have run high-end 12th-gen since their launch without issue. Admittedly, that’s a vanishingly small sample size.
I think either the result of the CPU degradation is fairly case-by-case until the degradation becomes sufficiently terminal or 14th-gen is just more prone to total instability. In one deployment, we’ve seen two 13th-gen CPUs in a build cluster crash the editor, write bad data during lightmap baking, and cause sporadic compile failures. The CPU damage wasn’t sufficient to cause BSOD with noticeable regularity, but the issues followed the affected CPUs when tasks were rotated between the other build machines.
I think only one of the two faulting CPUs regained stability after BIOS updates and reducing the CPU multiplier. The other CPUs that had yet to act abnormally are, to the best of my knowledge, still stable after the BIOS updates.
That error message occurs regardless of CPU in use. Looking at the code, it grabs the CPU identifier string and formats that message with it. It can trigger on AMD CPUs that are definitely unaffected by Intel’s instability issues, so I wouldn’t put too much stock in it unless you are dealing with 13th- or 14th-gen Intel CPUs.
It would be interesting to know whether the 11xxx and 12xxx chips make up a similarly sized portion of the reports you are seeing, compared to the 13th and 14th gens. That Oodle link calls out 13xxx and 14xxx chips as the affected SKUs, rather than 11xxx and 12xxx, and Intel confirms this in the following statement: Intel® Core™ 13th and 14th Gen Instability Customer Passthrough Statements.
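If the reports include the CPU brand string (as the fatal error message does), one rough way to bucket them by generation is sketched below. The regex is a heuristic I’m assuming works for desktop Core i3/i5/i7/i9 model numbers such as i9-11900K or i9-13900K; it will misread some mobile parts (for example four-digit models with a G suffix) and returns None for anything it does not recognise, including AMD. The function names are made up for illustration.

```python
import re
from collections import Counter

# Heuristic: "i9-11900K" -> model digits "11900".
_MODEL_RE = re.compile(r"\bi[3579]-(\d{4,5})")

def cpu_generation(brand_string):
    """Guess the Intel Core generation from a brand string, or None."""
    match = _MODEL_RE.search(brand_string)
    if match is None:
        return None
    digits = match.group(1)
    # Five-digit models: first two digits are the generation (10th gen onward).
    # Four-digit models: first digit is the generation (9th gen and earlier).
    return int(digits[:2]) if len(digits) == 5 else int(digits[0])

def report_share_by_generation(brand_strings):
    """Share of crash reports per detected generation (None = unrecognised)."""
    counts = Counter(cpu_generation(s) for s in brand_strings)
    total = sum(counts.values())
    return {gen: n / total for gen, n in counts.items()} if total else {}
```

The share of reports only means something next to the share of the install base on each generation, so it is best read alongside a per-population failure rate rather than on its own.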
How often, in broad strokes if you are unable to share specifics, are you seeing this issue reported? From the information I’ve seen during the 13th- and 14th-generation issues, 11th- and 12th-generation and earlier Intel CPUs, along with AMD CPUs, have considerably lower failure rates.
If you are worried about hardware stability and have access to these systems, running the stress test in Intel’s diagnostic tool (Intel® Processor Diagnostic Tool) and running decompression tests that don’t go through Oodle may help to rule out CPU issues. If those tests fail, or if decompression errors out outside of Unreal’s codebase, then it could be a hardware issue of some kind.
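As a minimal sketch of what a decompression test outside the engine could look like, the snippet below repeatedly decompresses a known buffer and checks the result against a reference hash. It uses Python’s standard zlib purely as a stand-in; it is not Oodle and does not reproduce Oodle’s code paths, and the function name and default parameters are my own assumptions. On a healthy machine it should always return 0.

```python
import hashlib
import random
import zlib

def decompress_stress(iterations=200, size=1 << 20, seed=1234):
    """Decompress the same data repeatedly; return how many runs were wrong."""
    rng = random.Random(seed)
    # Build a deterministic, compressible payload from a repeated random block.
    block = bytes(rng.getrandbits(8) for _ in range(4096))
    payload = block * max(1, size // len(block))
    expected = hashlib.sha256(payload).hexdigest()
    compressed = zlib.compress(payload, 6)

    failures = 0
    for _ in range(iterations):
        try:
            output = zlib.decompress(compressed)
        except zlib.error:
            failures += 1  # the decompressor rejected data that is known-good
            continue
        if hashlib.sha256(output).hexdigest() != expected:
            failures += 1  # silent corruption: output differs from the reference
    return failures
```

A single-threaded loop like this is a light load. To approximate the conditions under which these CPUs tend to fail, it would need to run on all cores for an extended period, and a clean result does not prove the hardware is fine.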
If these are being read from DDC, it’s possible there’s DDC corruption that may be fixed by clearing the DDC and letting it regenerate.