UE Forums Outage/Data Loss Incident Today

The database servers that support our forums suffered an outage today (Wednesday 6/18). We were able to recover from the outage, but unfortunately some data was lost.

What happened:

  • While removing an unused database cluster, we apparently ran the operation against the live environment supporting the forums by mistake.
  • We realized the mistake immediately and began recovery from our backup. [Edited for clarity] However, we were not able to recover all of the data.

The Impact:

  • The forum outage affected all users' access to forums.unrealengine from 2:58 PM EDT to 5:43 PM EDT.
  • Forum data was lost for any posts made between approximately 8:00 PM EDT on 6-17-2014 and 2:58 PM EDT on 6-18-2014.

Future Prevention:

We completed an internal investigation of what occurred, and found the following:

Root Cause Analysis

There are several root causes for this incident.

  • Insufficient barriers or failsafes to keep someone from accidentally removing systems that are in use by our online production services.

  • Data loss caused by insufficient protection against human error. Database backups are hourly; however, snapshots of the backup destination volume are taken only nightly.

  • Lack of data protection policies and enforcement of existing best practices.

  • Lack of process associated with these types of actions.
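
To illustrate the second root cause above, here is a minimal Python sketch of how the snapshot interval bounds the amount of data that can be lost when only snapshots survive. The function name is hypothetical, and the assumption that the nightly snapshot ran at about 8:00 PM EDT is inferred from the reported loss window, not a confirmed detail of the actual setup.

```python
from datetime import datetime, timedelta


def last_recovery_point(first_snapshot: datetime,
                        interval: timedelta,
                        failure: datetime) -> datetime:
    """Return the latest snapshot time at or before `failure`,
    for snapshots taken every `interval` starting at `first_snapshot`."""
    if failure < first_snapshot:
        raise ValueError("failure occurred before the first snapshot")
    completed = (failure - first_snapshot) // interval  # whole intervals elapsed
    return first_snapshot + completed * interval


# Assumed times (EDT), taken from the impact summary in the post.
first_snapshot = datetime(2014, 6, 17, 20, 0)   # assumed nightly snapshot time
failure = datetime(2014, 6, 18, 14, 58)         # start of the outage

nightly_loss = failure - last_recovery_point(first_snapshot, timedelta(days=1), failure)
hourly_loss = failure - last_recovery_point(first_snapshot, timedelta(hours=1), failure)

print(nightly_loss)  # 18:58:00
print(hourly_loss)   # 0:58:00
```

Under these assumptions, a nightly snapshot leaves almost 19 hours of posts unprotected, while an hourly snapshot would have capped the loss at under an hour.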

Future Mitigation Steps

  • Add Termination Protection to all Production Database and specialty instances.

  • Increase the snapshot frequency of Production instances to hourly, or replicate in real time.

  • Add a process requirement that Production systems be stopped for a sufficient period before being terminated.

  • Create a policy for the removal/destruction of Production systems requiring, at minimum, peer approval.

  • Create offsite backup location(s):
    a) Separated by rights from TechOps and DevOps.
    b) Process for gaining access to these systems and the data with approval steps and sufficient oversight.
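
As a rough illustration of how the first, third, and fourth mitigation steps could combine into a single pre-termination check, here is a minimal Python sketch. Everything in it is hypothetical: the `Instance` class, the `termination_blockers` function, and the 7-day `MIN_STOPPED` value are assumptions for illustration (the post only says "a sufficient period"), and nothing here calls a real cloud API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Set

# Assumed value; the post does not define "a sufficient period".
MIN_STOPPED = timedelta(days=7)


@dataclass
class Instance:
    name: str
    production: bool
    termination_protection: bool = True      # on by default for safety
    stopped_at: Optional[datetime] = None    # None means still running
    approvals: Set[str] = field(default_factory=set)


def termination_blockers(inst: Instance, requester: str, now: datetime) -> List[str]:
    """Return the reasons termination must be refused (empty list = allowed)."""
    if not inst.production:
        return []
    blockers = []
    if inst.termination_protection:
        blockers.append("termination protection is enabled")
    if inst.stopped_at is None:
        blockers.append("instance has not been stopped")
    elif now - inst.stopped_at < MIN_STOPPED:
        blockers.append("instance has not been stopped long enough")
    # Peer approval: at least one approver other than the requester.
    if not (inst.approvals - {requester}):
        blockers.append("no peer approval")
    return blockers


now = datetime(2014, 6, 18, 14, 58)
live_db = Instance("forums-db", production=True)
print(termination_blockers(live_db, "alice", now))  # three blockers

retired_db = Instance("old-db", production=True, termination_protection=False,
                      stopped_at=now - timedelta(days=8), approvals={"bob"})
print(termination_blockers(retired_db, "alice", now))  # []
```

The point of returning every blocker rather than failing on the first is that an operator sees all the missing safeguards at once, which makes an accidental run against a live system harder to push through.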

We apologize for the outage and loss of data, and the inconvenience this caused. [added post investigation] We take this event extremely seriously and will implement these measures to ensure that issues of this sort do not recur.

Human error may have been the “cause” or trigger of this outage and data loss, but it was not the root cause. We will strive to mitigate the impact of mistakes to limit the damage and downtime caused when human error occurs. Had any of these measures been in place, the impact of the outage would have been limited and the data loss could have been avoided. Thank you for your patience with us as we strive for excellence in all we do here at Epic.

Sucks, but it happens. Keep up the good work and thanks for releasing the Black Jack content!!!

Oops! Every time data is lost, a kitten dies.

You monster :frowning:

Love the honesty :slight_smile:

****, I know people at Avnet (EMC partners) that could solve your issues; one of their top boys was my old housemate. Say the word and I'll get Epic a good deal :slight_smile:

I like it too. :slight_smile:

Luckily the forum is up again :smiley:

Using Firefox with private browsing, the forum says “maintenance mode”.

Hi SonKim,
It works fine for me, using both Firefox and Chrome in private/incognito mode.
Can you tell us which OS you are using?
It would be great if you could provide the detailed request to help us troubleshoot your issue.
You can do this by clicking “Developer” in the Firefox menu and selecting “Network”, then copying and pasting the request and response headers.

Thanks!

Firefox just auto-updated to version 30, and strangely the problem is gone now.

I love it too :wink:

:smiley:

://i.imgur/HRxyEVv.jpg

Me too! It’s rare nowadays for companies to be so straightforward :slight_smile:

Unfortunate to hear :/.

Sorry to hear that Chris, sucks when you realise the snapshot data has been fried too…:mad:

Lesson learned…!:rolleyes:

Yeah, it happens. I remember a guy at an internet backbone tripping over Jim Morris’s router one time during maintenance, after he went from **Firetalk** to E4 Chat providing voice chat services. I think they had cables everywhere, and it messed up services temporarily. I have a habit of encrypting data too much and creating custom processes, only to forget a portion of the process and end up locked out of my own stuff, in some cases permanently. I’m just glad your loss wasn’t too bad.

Oh, I did have around 50GB of project data downloaded before the Unreal Launcher update with Unreal Engine 4.2.1. I lost the bulk of that data after the launcher update, and I’m very close to my data cap now, so I won’t be fully recovered till later; I’ve just re-downloaded a few things for now. Nothing’s 100% perfect all the time and things like this do happen. As the old saying goes, “To Err Is Human.”

Sometimes it makes you feel like you’ve received that old error message: ‘It was an ID ten T error.’

As time goes on, your Unreal Engine Forums and Answer Hub will be massive with regular participation. The only thing I’m worried about is how the future Unreal Tournament network and data management hosting might be handled for multiplayer.

I’m sure you all learned something from this, and with that you can improve moving forward! :cool:

The top tabs are different now, or is it just me? There used to be two additional tabs, “what’s new UT” and “what’s new UE4”, which were very useful.

OH NOOOOOOOOOOOOOOOOO, SO MUCH DATA HAD BEEN LOST ON UT-branch as well as on others((
Logo concepts, core gameplay discussions, movement, art direction… WHY GOD, WHY?
*Not kidding, it’s a pity such things happen…

It’s all there https://forums.unrealtournament/

Oh sh**, I have to register for a forum again; my user ID and password from here are not valid for that forum.
Thanks! I thought all the data was 100% lost, but it seems it did not get lost, hahah.

Love the transparency. Good work Epic.