UE Forums Outage/Data Loss Incident Today

The database servers that support our forums suffered an outage today (Wednesday 6/18). While we were able to recover from the outage, unfortunately some data was lost.

What happened:

  • While removing an unused database cluster, we accidentally performed the operation on the live environment supporting the forums.
  • We immediately realized the mistake and began recovery from our backup. [Edited for clarity] However, we were not able to recover all of the data.

The Impact:

  • The forum outage affected access for all users from 2:58 PM EDT to 5:43 PM EDT.
  • Forum data was lost for any posts made between approximately 8:00 PM EDT on 6-17-2014 and 2:58 PM EDT on 6-18-2014.
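
For a sense of scale, the two windows above can be worked out directly from the stated times. This is only a small arithmetic sketch: the variable names and the `fmt` helper are made up for illustration, and the 8:00 PM figure is approximate, as noted above.

```python
from datetime import datetime

# Timestamps as stated in the post. All are EDT, so no timezone conversion is needed.
last_recovered = datetime(2014, 6, 17, 20, 0)   # approx. 8:00 PM EDT, 6-17-2014
outage_start = datetime(2014, 6, 18, 14, 58)    # 2:58 PM EDT, 6-18-2014
outage_end = datetime(2014, 6, 18, 17, 43)      # 5:43 PM EDT, 6-18-2014

data_loss_window = outage_start - last_recovered
downtime = outage_end - outage_start


def fmt(delta):
    """Format a timedelta as hours and minutes (hypothetical helper)."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60:02d}m"


print("Data loss window:", fmt(data_loss_window))  # 18h 58m
print("Downtime:", fmt(downtime))                  # 2h 45m
```

So roughly 19 hours of posts were lost, and the forums were unavailable for 2 hours 45 minutes.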

Future Prevention:

We completed an internal investigation of what occurred, and found the following:

Root Cause Analysis

There are several root causes for this incident.

  • Insufficient barriers or failsafes to keep someone from accidentally removing systems that are in use by our online production services.

  • Data loss caused by insufficient protection against human error. Database backups are hourly. However, snapshotting of the backup destination volume happens only nightly.

  • Lack of data protection policies and enforcement of existing best practices.

  • Lack of process associated with these types of actions.

Future Mitigation Steps

  • Add Termination Protection to all Production Database and specialty instances.

  • Increase the frequency of snapshots of Production instances to hourly, or replicate in real time.

  • Add a process requirement that Production systems be stopped for a sufficient period before being terminated.

  • Create a policy for the removal/destruction of Production systems requiring, at minimum, peer approval.

  • Create offsite backup location(s):
    a) Separated by rights from TechOps and DevOps.
    b) With a process for gaining access to these systems and the data, including approval steps and sufficient oversight.
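
As a rough illustration of how the first four mitigation steps could combine into a single pre-termination check, here is a minimal sketch. Everything in it is an assumption for illustration, not a description of the tooling actually used: the `System` class, the `termination_blockers` function, and the 7-day `MIN_STOPPED_PERIOD` are hypothetical (the list above says only "a sufficient period").

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Assumed value for illustration; the post only says "a sufficient period".
MIN_STOPPED_PERIOD = timedelta(days=7)


@dataclass
class System:
    """Hypothetical record describing one server or database instance."""
    name: str
    is_production: bool
    termination_protected: bool = True
    stopped_since: Optional[datetime] = None  # None means still running


def termination_blockers(system: System, requester: str,
                         approver: Optional[str], now: datetime) -> List[str]:
    """Return the reasons a termination must be refused (empty list = allowed)."""
    if not system.is_production:
        return []  # the policy sketched here applies to Production systems only
    blockers = []
    if system.termination_protected:
        blockers.append("termination protection is still enabled")
    if system.stopped_since is None:
        blockers.append("system has not been stopped")
    elif now - system.stopped_since < MIN_STOPPED_PERIOD:
        blockers.append("system has not been stopped long enough")
    if approver is None or approver == requester:
        blockers.append("no approval from a peer other than the requester")
    return blockers


# Usage: a live, protected Production database with no peer approval.
forums_db = System(name="forums-db", is_production=True)
for reason in termination_blockers(forums_db, "alice", None, datetime(2014, 6, 18)):
    print("BLOCKED:", reason)
```

The point of returning every blocker rather than stopping at the first is that the operator sees all the outstanding steps at once; any single one of them is enough to refuse the operation.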

We apologize for the outage and loss of data, and the inconvenience this caused. [added post investigation] We take this event extremely seriously and will implement these measures to ensure that issues of this sort do not recur.

Human error may have been the “cause” or trigger of this outage and data loss, but it was not the root cause. We will strive to mitigate the impact of mistakes to limit the damage and downtime caused when human error occurs. If any of these measures had been in place, the impact of the outage would have been limited and data loss could have been avoided. Thank you for your patience with us as we strive for excellence in all we do here at Epic.

- Chris Gerhardt

Yeah, it happens. I remember a guy at an internet backbone tripping over Jim Morris’s router one time during maintenance, after he went from Firetalk to E4 Chat providing voice chat services. I think they had cables everywhere, and it messed up services temporarily. I have a habit of encrypting data too much and creating custom processes, only to forget a portion of the process and end up locked out of my own stuff, in some cases permanently. I’m just glad your loss wasn’t too bad.

Oh, I did have around 50GB of project data downloaded before the Unreal Launcher update with Unreal Engine 4.2.1. I lost the bulk of that data after the launcher update, and I’m very close to my data cap now, so it won’t be fully recovered until later; I’ve just re-downloaded a few things for now. Nothing’s 100% perfect all the time and things like this do happen. The old saying could be "Too Error Is Human."

Sometimes it makes you feel like you’ve received that old error message: 'It was an ID ten T error.'

As time goes on, your Unreal Engine Forums and Answer Hub will be massive with regular participation. The only thing I’m worried about is how the future Unreal Tournament network and data management hosting might be handled for multiplayer.

The top tabs are different now, or is it just me? There used to be two additional tabs, “what’s new UT” and “what’s new UE4”, which were very useful.

Logo concepts, core gameplay discussions, movement, art direction… WHY GOD, WHY?
