Shifting to Linux for UE5 Cinematics. Need a guide please

achiestdragon · January 4, 2022, 8:30am

not good news but …

if the cpu or gpu start running at temps above 94C , often , and are power cycled frequently , over time it causes the silicon in the chip to develop small cracks , due to repeated thermal expansion and cooling contraction ,
usually over 5 to 6 years but quicker the higher the temp change the bigger the cracks , and quicker if the temp goes over 99C , some point between that and 124C the chip will just burn out almost straight away
but the chip starts to fail / crash more often as the cracks grow till eventually it just fails totally ,

servers and systems left on 24/7 last longer (if they're 100% use rated * )
as the temp swing is often 60 to 90C (a max 30C cycle ) rather than ambient (20C average) to 98C (a 78C cycle ) , so less thermal stress on 24/7 use,
it's also safe to say that most gaming gpus are not 100% rated and do run hot under load , like the dead gtx780 i have sat here on the desk , which died of this issue , all i can say is do whatever you can to keep the full load temp from ever exceeding 90C in the first place ,
my advice would be to aim for a 100% rated system , even an ex lease 3 year old dell t7600 workstation is often a way better choice than a cheap new gaming* system (cheap as in £1000 to £2000) , better spec also (* gpu excluded , both would need a gpu upgrade ) (seen a twin xeon 8 core 3.07ghz (16 cores total , 32 with ht) dell t7600 for around £1000 refurbished on ebay this week )
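
The temperature swing comparison above is simple arithmetic. Here is a minimal sketch of it, using only the figures quoted in the post; the function name `thermal_swing` is made up for illustration, and this says nothing about how fast any real chip degrades.

```python
# Temperatures in degrees C, taken from the post above.
def thermal_swing(low_c, high_c):
    """Size of one heat-up/cool-down cycle in degrees C."""
    return high_c - low_c

always_on = thermal_swing(60, 90)      # 24/7 machine: idle to full load
power_cycled = thermal_swing(20, 98)   # switched off between uses: ambient to full load

print(f"always on:    {always_on}C per cycle")
print(f"power cycled: {power_cycled}C per cycle")
print(f"ratio:        {power_cycled / always_on:.1f}x")
```

On these numbers the power cycled machine sees a swing of 78C against 30C for the always-on one.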

100% rated is where the equipment is designed to be used 24/7 ie always on (ie servers workstations and some industrial pc’s , as to desktops only some dell and hp kit is afaik )
gaming pc’s and cards are usually 60 to 80% rated ( consumer grade equipment ) so are not designed to be run 24/7 thermal stress being the biggest cause of failure (burnout) of these cards/systems when driven hard for long lengths of time

gpus are a pita to get at the moment due to the global chip shortage and as such the prices are high

although if you're lucky then it's the cpu you're having the problem with not the gpu

if you have any thermal paste i would say remove the cpu heatsink and clean and re-paste the chip , the paste tends to dry out and become hard after a few years , if you have not done so , it may stop it crashing so often and make it usable for a while longer , but if the damage is done then it's going and it's on borrowed time

Famekrafts.com · January 4, 2022, 8:49am

That is not happening, unless I stop the CPU cooler pump.

I am using a Gigabyte server motherboard, it is not a gaming rig but an assembled workstation.

The GPU, though a gaming GPU, has been running well for the last 6 years. I still stress-tested it with the Heaven benchmark and even with OCCT for an hour. No reboots.

Even used Cinebench and no reboots.

Tested the PSU and GPU with AIDA64 Extreme and no reboots.

Yes, the CPU could be an issue as I have been using it for the last 11 years. For the last 6 years it has been under a Cooler Master liquid cooler and the temps are running fine, so I am not sure I should replace the thermal paste right now, because that paste came with the cooler, and removing the cooler pump from the CPU might make the pump unusable. Also if there are cracks in the CPU as you say, that would make the problem even worse.

I might have to call someone to look at my system which is again an issue due to covid.

Famekrafts.com · January 4, 2022, 10:02am

I am having reboots just like these, just not while benchmarking but randomly. My PSU is 6 yrs old, so it could be a capacitor bloating after such long use. The problem is I do not have a spare PSU to check.

Famekrafts.com · January 4, 2022, 10:16am

This is the video I was talking about: the CPU cooler being the cause, even though the temps it showed were fine.

The problem is I do not have all the other items to test each component by replacing them and testing with new components.

So sending them to a repair shop seems to be the best option, which is not possible right now due to covid restrictions. It is really frustrating.

edit: 4 reboots today, while just writing on this site using Google Chrome. So there was no stress on the GPU or CPU.

achiestdragon · January 4, 2022, 12:48pm

maybe i should explain the cpu crack issue and how it will manifest
take the cpu , case temp 45c , just idle running the os
if the cpu has 4 cores , only one active , that core is producing the heat to the package , the os load balances so switches cores from time to time , in an attempt to stop the chip overheating at one point on its surface (trying to keep the temp even across the silicon)
the change on a good non-cracked chip is evenly balanced , keeping most of the chip within a sort of equal temp range and random to stop the chance of any flexing causing the chip to creep away from its package bonding
but when the chip is cracked the temp change path has to go around the crack
the crack itself may be in just one core , but switching a thread from one to another , say between cores 1 and 3 , may cause a connection over the crack to break in core 2 while the “temp ripple” passes and connect again after it passes around the crack , a bit like fluid ripples would around an obstacle , so switching cores to try and stop “thermal ripples”
so since the hop from thread to thread is not always 1 2 3 4 1 2 3 4 the sequence changes , even though the package temp appears to remain constant , so if thread 2 is still active while the switch happens then the system will blue screen , added to that the connection may only affect part of the chips function , like only while the core with the crack is performing a certain instruction and the temp ripple is affecting that connection ,
so the crash appears to be random and without cause
if left , the issue will just get worse as the crack grows
although with a lot of work you may be able to pinpoint the exact part of the chip that's causing the issue , you won't just be able to disable that core , or it will upset the load balancing , making the disabled part of the chip colder than the others , that in turn will accelerate the cracking

the cooler is not the issue as you would see a temp difference or hear it change speed to cope

Famekrafts.com · January 5, 2022, 9:34am

My first test will be to change the GPU and see if that fixes the problem.

My ex-wife has taken the keys to the almirah where 2 of my GPUs are, so I will have to change the locks and get it open soon. Then I will test with an Nvidia Quadro and a nominal Nvidia GTX card just to see if there are no random restarts. I hope it is the GPU, which can be easily replaced.

If it is still happening, then I will test the PSU and CPU.

Famekrafts.com · January 5, 2022, 10:35am

I do hear the CPU fan speeding up when it is going to restart, and then a click sound and everything stops.
Anyhow thanks for all the info. Fingers crossed, hoping it is not the CPU.

edit - I am noticing one thing: the crashes are very frequent on Win 10 but very few in number on Linux. Had only one reboot in Linux today compared to 5 in Windows yesterday.
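
A hand-kept log makes a comparison like this easier to track over several days. The following is only a sketch: the log format and the name `reboots_per_os` are made up for illustration, and the entries are just the counts reported in this thread.

```python
from collections import Counter

# One (date, os) entry per unexpected reboot, written down by hand.
# The counts below are the ones reported in this thread.
reboot_log = [("2022-01-04", "windows")] * 5 + [("2022-01-05", "linux")] * 1

def reboots_per_os(log):
    """Count unexpected reboots for each operating system."""
    return Counter(os_name for _date, os_name in log)

counts = reboots_per_os(reboot_log)
for os_name, n in sorted(counts.items()):
    print(f"{os_name}: {n} reboot(s)")
```

Two days of data is far too little to draw a conclusion from, so the tally is only useful if the log is kept up for a while and the hours of use on each OS are roughly equal.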

Famekrafts.com · January 6, 2022, 11:35am

I have just installed a nominal GT 520 with 1 GB VRAM.

Now I am going to test and see whether I still get reboots for a day or two, especially while playing games and doing heavy GPU work.

The GPU does not need 8 pin connectors, so I do not think it is very power-hungry.

Famekrafts.com · January 6, 2022, 1:43pm

No reboots for the last 2 hrs.

It rebooted even with the Nvidia card installed, but since the moment I removed the AMD drivers using DDU, no reboots till now.

I have played League of Legends twice with Chrome open, and even ran Photoshop.

I will update if there is another reboot but for now, it looks like either the R9 390 GPU or the AMD drivers are the issue.

Famekrafts.com · January 6, 2022, 5:22pm

It has been 6 hrs with no reboot.

I think the problem is either the GPU, the AMD drivers, or the PSU.

Right now the PSU's 16 PCIe power pins are not connected to any GPU, so the system is drawing much less power from the PSU. I will try getting a high-end 16 pin GPU to check whether the GPU is the problem or the PSU is not able to take the load of the GPU.

Famekrafts.com · January 7, 2022, 2:51pm

Only one reboot today, and I realized something went wrong with the drivers as I was not able to play League of Legends. I re-installed the Nvidia drivers and no reboots, and I can play LoL again.

It looks like it is the drivers causing the issues or someone is purposely causing problems for me through the cable internet I am using, which does not seem to be secure at all.

I think if I reinstall the R9 390 as well, it will not reboot unless something goes wrong with the drivers again.

I am planning to rent another GPU to check this theory, just in case the problem is with the GPU.

Famekrafts.com · January 8, 2022, 10:16am

Installed the Quadro 4000 (2 GB) and immediately realized why I upgraded to the R9 390 in the first place.

At idle the temp was 70 degrees. Had to use MSI Afterburner to push the fans to 100% speed for a temp of 50 degrees.

It uses 6 pin PCIe, so I cannot use the Quadro and the R9 390 together with my 700W PSU.

So the Quadro goes back into the box.

Overall I believe it is a driver problem rather than the GPU. Let me run the R9 390 for a few days with newly installed drivers and check whether it reboots.

It is definitely not a CPU problem. It is the GPU or the drivers.

achiestdragon · January 9, 2022, 4:45am

700w psu , hmm , if it was not for the fact that you have used the system a long time ,
added to the fact that you do say it's crashing when not on full load ,
i would be tempted to say the psu is not man enough for the job , the psu in the system i am using is max rated at 1300w , it's also fair to say that with the gpu and cpus running at 100% the power consumption is around 1,000w
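
The headroom on that 1300w unit can be worked out directly. This is a rough sketch using only the two figures quoted above; `psu_headroom` is a made-up helper, and the thread does not give a measured full-load draw for the 700W system, so no number is filled in for it here.

```python
def psu_headroom(rated_watts, load_watts):
    """Return (spare watts, load as a fraction of the rating)."""
    return rated_watts - load_watts, load_watts / rated_watts

# Figures from the post: 1300w rating, roughly 1000w drawn at full load.
spare, fraction = psu_headroom(1300, 1000)
print(f"spare: {spare} W, load: {fraction:.0%} of rating")
```

To apply the same check to the 700W PSU you would need the system's draw at full load, either measured at the wall or estimated from the vendor figures for the CPU and GPU.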

being a hardware engineer by trade , myself i would strip down the psu and see if it's got any bloated / leaky caps , they do degrade psus over time , and 11 years old as you say , so around the time the bad ones were still in circulation (there's a story to that but too long to post here) , but it's fair to say if you're not skilled then it's not something i can recommend doing , high voltages and such that stay charged when unplugged , for weeks sometimes ,

the easiest way to solve issues like you're having is not something you can do ,
you would need a known working system of the same spec , and swap in the suspected part till the error starts to show on that ,
but that's only ideal when you have a few of the same systems kicking around

otherwise from a repair point of view
again replace parts till you find the faulty part , then refit the old parts , keeping only the replacement for the faulty part , and retest ,
that way the customer only gets charged in the end for the actual faulty part , rather than all the bits needed to find that out ,
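
The one-part-at-a-time swap described above can be sketched as a loop. This is only an illustration: `find_faulty_part` and `simulated_test` are hypothetical names, and the "test" here is a stand-in for what in practice means fitting a known-good part and running the machine long enough to see whether it still reboots.

```python
def find_faulty_part(parts, crashes_with_swapped):
    """Swap in one known-good part at a time and return the first part
    whose replacement stops the crashes, or None if no single swap helps."""
    for part in parts:
        # Only `part` is replaced; every other original part stays fitted.
        if not crashes_with_swapped(part):
            return part
    return None

# Hypothetical stand-in for the bench test: pretend the psu is the bad part,
# so the crashes stop only when the psu is the part swapped out.
def simulated_test(swapped_part):
    return swapped_part != "psu"

suspects = ["gpu", "psu", "ram", "cpu"]
print(find_faulty_part(suspects, simulated_test))
```

The sketch assumes a single faulty part and a reliable test. With intermittent faults each swap needs a long enough soak to trust a "no crash" result, and a `None` result points at something outside the list, such as cabling or a combination of parts.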

being without the bits to try then it's going to be hit and miss ,
so is it the gpu , or the psu when using that gpu , or is it still another problem , maybe you do need to look at sending it to a service center or repair shop that would have the parts on hand to do this (covid hassles and such but you could arrange shipping to them if completely stuck)

changing the gpu to a lower spec one may show it works with that
but that's not taking into account the psu load , the only way to find that out is to fit another gpu of the same spec and see then if the issue persists , borrow one to try if you must , they're not cheap atm

from experience it quite often turns out to be something you did not initially expect ,
on that note i would be tempted to investigate the power / reset button issue again , ensure no cable or connector issues , bad connections are intermittent and somewhat random , and top of the list of issues that are a real pita to find and solve , also ram , are you sure the ram is not getting too hot and the crash being the result of memory errors , changing the gpu etc may inadvertently give the ram better airflow so it does not crash as frequently , i would love to be able to pinpoint the issue directly , but such is the case that it could be any one of a wide number of possibilities that cause the issue , and it's only really found when you hit upon changing the actual faulty bit and the problem goes

given also you say your system is 11 years old (mostly but not all parts ) , rather than buying all the suspected parts (mobo , cpu , gpu , psu , case , ram) (11 year old parts are not new so may be faulty anyway) you may be better looking on ebay for a base system unit with the base spec (ram and disks aside ) , an ex lease refurbished workstation (3 to 5 years old) (should come with a guarantee) , that you can use the bits on hand to upgrade to the same spec (existing gpu included) , plus getting a new machine ( to you ) in the process , i know it will cost (£700 to £1200) , but the alternative is the cost of getting bits to test an old system , and the time , the gpus are the issue atm due to the global chip shortage , even if it turns out to be the gpu in the end , a replacement gpu can cost well over the price of such a base system alone atm , even secondhand can be well over £1500

personally i try to replace my system every 5 years even if it's a 5 year old machine to start with , not only for new spec things like faster sata/sas , faster pci/e and other such new features , and replacement part availability (usually new parts are still available) , but also because of driver support , often older systems get dropped from current driver support , not sure if that's part of the issue you're having atm

anyway i feel we digress from topic even more , as much as i am willing to help , maybe you would be better looking for answers on freenode irc chat chan ##hardware , the guys there could give you pointers regarding your hardware issues , and it is a more suitable place to discuss them than here