So, I should post an update on my experience with this D3D device lost issue. After a 6-month epic it finally makes more sense (note the qualifier). For starters, to relate this to the two more recent posts above: I checked with GoBoxx, who make my gaming laptop with a GTX 1080, and they disabled the auto-overclocking software that Clevo shipped these cards with precisely because they didn’t like what they were seeing. This company specializes in PCs for the entertainment industry and engineers and handles lots of overclocking on beefy PCs, but on the laptops they saw grounds to pull back. So the fact that my GoBoxx MXL VR laptop not only hit these errors with frequent crashes but then totally fried the card raises the question: what’s different here? I’ll tell you.
I run another app, Reality Capture, that pushes the GPU pretty hard during depth map creation in a photogrammetry-based workflow, then I work the card even harder in UE4, my work being all about the environment and very little about gameplay: 43 million polys with many dozens of 8K maps and dozens of movable lights. Looks killer, but you get the picture. When I first encountered the crashing it was only in UE4. Reality Capture remained solid, and a tech from Nvidia said the card seemed totally up to spec based on testing in MSI Kombustor (it ran an hour at 100% with the CPU stressed, at whatever settings this tech set up), so this seemed to point solely at UE4. Add to that the fact that hundreds, many hundreds, of users on Steam and this forum report this issue in connection with UE4, both in projects and in games like Fortnite. Add to that the fact that Epic devs have had, what is it, a JIRA open since 2016, with one version after the next promising a fix, and here we are today.
When I was totally out of ideas about what to do, I came upon a post from one user who wasn’t even encountering this problem but tried to help with a suggestion: “whenever I encounter problems like this in UE4, 90% of the time if I remove the GPU and DIMMs, clean them with alcohol and reseat them, the problem goes away”. So I tried that last July, and it worked. Yay. The beast reared its head again in September, and this time that trick didn’t work. That it worked at all is weird. Is that like resetting your PC, or going further and draining power by unplugging and holding the power button a few secs, and then going further still by physically disconnecting PCBs? WTH. In any event, during this relapse I encountered crashes in Reality Capture for the first time: GPU lost, error 30. Whoa. Then I tested in Kombustor and couldn’t even go 10 seconds at low settings without crashing.
$1100 later I’m here to say, woe to those who don’t take these errors seriously. It’s not just about software crashing, it can hurt your hardware. I didn’t know that was possible these days, but clearly I push my system hard, and I’m actually happy to learn where too hard lies so I can stay south of it. I’m now setting quality settings to Low while developing. I believe that helps quite a bit, and I don’t care what things look like during much of the work. I’m also moving my work to a maxed-out Mac Pro and only using the laptop to test and present. And sure, if you’re overclocking, I’d reconsider, especially on gaming laptops with their super constrained thermal regulation.
On a final note, dear Epic Games. I submitted a bug report and have been in dialogue with someone who has always returned my emails whenever I’ve written him with detailed updates, but who hasn’t once written proactively with the first bit of information. All he ever responds with is, “Thank you for the information, we’ll let you know if we learn anything.” To show some understanding, if an issue isn’t reproducible then it’s no big surprise that a dev team would largely have their hands tied. But I offered to ship my laptop to them and let them hang on to a PC with very reproducible crashing from this error, and got the same response: “don’t call us, we’ll call you”.
I don’t see that this is simply about GPU workload surpassing some threshold. The overclocking thing may well be related, setting up conditions in which some illegal instruction from UE4 then serves as the straw that breaks the camel’s back. I say that because my card wasn’t overclocked. Yes, it was running hard, but it performed fine in Kombustor at 100% for an hour. Also, I could be in UE4 doing nothing, just this project open, no key commands, no mousing or navigation, everything static, watching system resources in GPU-Z and seeing the clock speed climb on its own higher and higher, temps still totally fine in the 70s, and then crash. My expectation is that between Windows’ TDR (Timeout Detection and Recovery, which resets the graphics driver if the GPU stops responding), protective measures operating within the GPU like ramping fan speed in response to temps, and UE4 lowering the frame rate in response to strained performance, there are all kinds of things going on to protect this very expensive component in any system. To learn there’s an upper threshold where a user can damage a GTX 1080 is one thing, got it, but I see that as more of a proximate cause. The ultimate cause really seems to be some illegal instruction coming from UE4. Many hundreds of users with many thousands of variables in play have one thing in common: UE4. Alas, this issue still ranks as low priority at Epic Games. If only the tech handling my case showed the slightest sign of caring. I don’t agree with the logic of whining online; it gets you nowhere. I’m good now and can work within the lines. I’m venting, but I’m also sharing with others who are scratching their heads, and hoping my experience saves them from frying their cards.
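For anyone who wants a record of the clock speed and temperature leading up to a crash, rather than eyeballing GPU-Z, here is a rough sketch that polls `nvidia-smi` once a second and appends the readings to a CSV file. It assumes an Nvidia card with `nvidia-smi` on the PATH; the function names, the `gpu_log.csv` file name and the one-second interval are just placeholders I picked, not anything from Epic or Nvidia.

```python
import csv
import subprocess
import time

# Query fields listed by `nvidia-smi --help-query-gpu`.
QUERY_FIELDS = ["timestamp", "clocks.gr", "temperature.gpu", "utilization.gpu"]


def _to_int(text):
    """Return an int, or None when the driver reports something like [N/A]."""
    try:
        return int(text)
    except ValueError:
        return None


def parse_sample(line):
    """Turn one CSV line of nvidia-smi output into a dict of readings."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != len(QUERY_FIELDS):
        raise ValueError("unexpected nvidia-smi output: %r" % line)
    timestamp, clock, temp, util = parts
    return {
        "timestamp": timestamp,
        "clock_mhz": _to_int(clock),
        "temp_c": _to_int(temp),
        "util_pct": _to_int(util),
    }


def read_sample():
    """Ask nvidia-smi for the current readings of the first GPU."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_sample(result.stdout.splitlines()[0])


def log_forever(path="gpu_log.csv", interval_s=1.0):
    """Append one row per interval; stop with Ctrl+C."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            s = read_sample()
            writer.writerow([s["timestamp"], s["clock_mhz"],
                             s["temp_c"], s["util_pct"]])
            f.flush()  # push each row out so the last sample survives a crash
            time.sleep(interval_s)
```

Start it with `log_forever()` before opening the project, then look at the last rows of the file after a crash. It only records what the driver reports, so it won’t tell you why the device was lost, just what the clock and temperature were doing beforehand.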
Benjy