BlackEagle2k18, I too have worked this problem for days on end and have had techs from both Microsoft and Nvidia involved. Microsoft tells me my system checks out: all drivers are up to date, the Microsoft C++ Redistributables are solid, and the TDR behavior discussed above is a normal function, in place to protect the GPU in the event that other protective measures, such as Nvidia’s temp threshold, don’t kick in first. Nvidia’s techs tell me my GPU is behaving normally. MSI Kombustor shows stable performance even with 90° C temps on my laptop, which they say speaks well for my particular build, a GoBoxx, but they don’t want to see temps higher than that on this machine. When we tested my UE4 project, I used “r.screenpercentage 50” to cut the rendering workload in a project that had consistently crashed with the D3D device being lost error; temps ran to the high 80s, but it was stable. I bumped rendering quality to 75, and it was still stable at 90°, but there we stopped.
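For anyone repeating this test, here is a rough sketch of how much pixel work each setting removes. It assumes that r.ScreenPercentage scales both the width and the height of the rendered image, so the pixel count goes with the square of the percentage. The function name is made up for illustration, and pixel count is only a loose proxy for GPU load, since geometry and texture streaming costs do not shrink the same way.

```python
def pixel_fraction(screen_percentage: float) -> float:
    """Fraction of full-resolution pixels rendered at a given screen percentage.

    Assumption: the percentage applies to each axis, so area scales
    with the square of the per-axis scale.
    """
    scale = screen_percentage / 100.0
    return scale * scale


for pct in (50, 75, 100):
    print(f"r.ScreenPercentage {pct}: {pixel_fraction(pct):.0%} of full-resolution pixels")
```

Under that assumption, 50 renders about a quarter of the pixels rather than half, and 75 a little over half, which may be why temps climbed back to 90° at the 75 setting.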
Evidently, my project crashes at full rendering quality and thus higher temps, which leaves me wondering why this issue appears now and not during the two months before, when I could work for hours with the project open and fly the cameras as fast as I liked with no crashing. The crashes began around the time the air conditioner in my “server room” (A/V closet/bathroom) went down; I am still awaiting a part. So, with ambient temps here around 29° C, this could explain why the crashing began. Note that my project has some 40 million polys with well over one hundred 8K textures, all running on a laptop with a GTX 1080. If I was already just under the threshold of what my PC could take, then its ability to shed heat would most certainly be compromised by an ambient temp 5°+ higher than the usual 22.2° C I get with AC.
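To put a number on that hunch, here is a back-of-the-envelope sketch. It makes two loud assumptions: that the 90° C reading was taken at roughly 29° C ambient, and that the GPU runs a fixed number of degrees above ambient for a given workload. Real laptops have fan curves and throttling that break this simple model, and the function name is hypothetical.

```python
def estimate_gpu_temp(observed_temp_c: float,
                      observed_ambient_c: float,
                      new_ambient_c: float) -> float:
    """Naive estimate: GPU temp shifts one-for-one with ambient temp.

    Assumption: the rise above ambient is constant for the same workload.
    """
    rise_above_ambient = observed_temp_c - observed_ambient_c
    return new_ambient_c + rise_above_ambient


# Figures from this post: 90 C on the GPU, 29 C room now, 22.2 C with AC.
with_ac = estimate_gpu_temp(90.0, 29.0, 22.2)
print(f"Estimated GPU temp with AC restored: {with_ac:.1f} C")
```

If the model is anywhere near right, restoring the AC buys back about the same 6.8° on the GPU that was lost in the room, which is the sort of margin that could separate stable from crashing.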
Even if a broken AC figures into my particular case, that proximate cause does nothing to explain why hundreds of Unreal users have been reporting this issue since February of 2017, following the 4.14 release. We have two trouble tickets describing the chase and a hoped-for resolution with the next build. I have read about various workarounds: disabling live thumbnail previews, turning down rendering quality, and, as you point out, forcing a lower clock speed. Each of these workarounds achieves the same effect of lowering the rendering workload. So while everyone may have a different PC/GPU and stress those system resources differently depending on the project or cooked build, these workarounds may point to a simple fact: that the driver of the crashes is simply exceeding some high-temperature threshold. It may take a heavier workload from a given project on a stronger machine, but whatever it takes to exceed that threshold, the result is the same: the GPU hangs and Microsoft TDR says “party over”.
So why did this problem only begin with 4.14? Did 4.14 introduce appreciably higher GPU workloads than before? If it’s only about reaching a thermal threshold, then that theory doesn’t add up. UE4 kicks out warnings in the Editor such as “Exceeding texture pool limits”, as well as messages asking the user whether they’d like UE4 to lower the frame rate or the rendering quality, so I’m wondering why UE4 isn’t invoking the same type of warnings before things get too hot.
I get my AC fixed in less than a week. Then I can test my project and see if the crashes go away. It seems the D3D device being lost error is a bug no matter which way you cut it, but if my project(s) that were stable before continue to crash after I restore cool ambient temps, then that says something yet more troubling about this bug. After all, I have reset Windows, reinstalled Epic Games, and done a fresh install of 4.20.1 (the crashes began on 4.19.2 and 4.20), so if Nvidia is correct that my GPU is strong and healthy, then what’s left? It would have to be a bug that persists through each of these engine versions.