UE Forums Outage/Data Loss Incident Today

anonymous_user_85444fdc · June 18, 2014, 10:29pm

There was an outage of the database servers that support our forums today (Wednesday 6/18). While we were able to recover from the outage, unfortunately some data was lost.

What happened:

While removing an unused database cluster, we apparently performed the operation on the live environment supporting the forums by mistake.
We immediately realized the mistake and began recovering from our backup. [Edited for clarity] However, we were not able to recover all of the data.

The Impact:

The forum outage affected all users' access to forums.unrealengine from 2:58 PM EDT to 5:43 PM EDT.
Forum data was lost for any posts made between approximately 8:00 PM EDT on 6-17-2014 and 2:58 PM EDT on 6-18-2014.
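
As a quick check of the two windows above, a small Python sketch can compute their lengths. The outage date is taken from the post (Wednesday 6/18, 2014), and the times are treated as naive local EDT values; both are assumptions made for illustration.

```python
from datetime import datetime

# Timestamps from the post, treated as naive EDT local times (assumption).
outage_start = datetime(2014, 6, 18, 14, 58)
outage_end = datetime(2014, 6, 18, 17, 43)
loss_start = datetime(2014, 6, 17, 20, 0)  # "approximately 8:00 PM"
loss_end = outage_start                     # posts were lost up to the outage


def hours_minutes(delta):
    """Split a timedelta into whole hours and remaining minutes."""
    total_minutes = int(delta.total_seconds() // 60)
    return divmod(total_minutes, 60)


outage_length = hours_minutes(outage_end - outage_start)
loss_window = hours_minutes(loss_end - loss_start)

print("Outage: %dh %dm" % outage_length)
print("Approximate data loss window: %dh %dm" % loss_window)
```

That works out to roughly 2 hours 45 minutes of downtime and just under 19 hours of lost posts.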

Future Prevention:

We completed an internal investigation of what occurred, and found the following:

Root Cause Analysis

There are several root causes for this incident.

Insufficient barriers or failsafes to keep someone from accidentally removing systems that are in use by our online production services.
Data loss caused by insufficient protection against human error. Database backups run hourly; however, the backup destination volume is only snapshotted nightly.
Lack of data protection policies and enforcement of existing best practices.
Lack of process associated with these types of actions.

Future Mitigation Steps

Add Termination Protection to all Production database and specialty instances.
Increase the snapshot frequency of Production instances to hourly, or replicate in real time.
Add a process step requiring Production systems to be stopped for a sufficient period before being terminated.
Create a policy for the removal/destruction of Production systems requiring, at minimum, peer approval.
Create offsite backup location(s):
a) Separated by rights from TechOps and DevOps.
b) Process for gaining access to these systems and the data, with approval steps and sufficient oversight.
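
Several of the steps above (termination protection, a mandatory stopped period, and peer approval) can be pictured as a single pre-termination check. The following is only a minimal sketch under stated assumptions: the `Instance` record, the `can_terminate` function, and the seven-day `MIN_STOPPED` value are all hypothetical and do not reflect Epic's actual tooling or any real cloud provider's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional, Set

# Hypothetical policy value: how long a system must sit stopped first.
MIN_STOPPED = timedelta(days=7)


@dataclass
class Instance:
    """Hypothetical record of a server; not tied to any real cloud API."""
    name: str
    production: bool
    termination_protection: bool = True
    stopped_at: Optional[datetime] = None
    approvals: Set[str] = field(default_factory=set)


def can_terminate(inst: Instance, requester: str, now: datetime) -> tuple:
    """Return (allowed, reason) for a termination request."""
    if not inst.production:
        return True, "non-production instance"
    if inst.termination_protection:
        return False, "termination protection is enabled"
    if inst.stopped_at is None:
        return False, "instance must be stopped first"
    if now - inst.stopped_at < MIN_STOPPED:
        return False, "instance has not been stopped long enough"
    # Peer approval: at least one approver other than the requester.
    if not (inst.approvals - {requester}):
        return False, "peer approval is missing"
    return True, "all checks passed"


now = datetime(2014, 6, 18, 14, 58)
forum_db = Instance(name="forum-db", production=True)
print(can_terminate(forum_db, "alice", now))
```

With the defaults shown, a Production instance is refused at the first check, so an accidental request fails before anything is destroyed. Each later check has to be cleared deliberately, and the requester cannot count as their own approver.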

We apologize for the outage and loss of data, and the inconvenience this caused. [added post investigation] We take this event extremely seriously and will implement these measures to ensure that issues of this sort do not recur.

Human error may have been the “cause” or trigger of this outage and data loss, but it was not the root cause. We will strive to mitigate the impact of mistakes to limit the damage and downtime caused when human error occurs. Had any of these measures been in place, the impact of the outage would have been limited and the data loss could have been avoided. Thank you for your patience with us as we strive for excellence in all we do here at Epic.

sanford87 · June 19, 2014, 12:12am

Sucks, but it happens. Keep up the good work and thanks for releasing the Black Jack content!!!

DrLuke · June 19, 2014, 12:14am

Oops! Every time data is lost, a kitten dies.

You monster

Basingse · June 19, 2014, 12:41am

Love the honesty

KingBadger3D · June 19, 2014, 12:46am

****, I know people at Avnet (EMC partners) that could solve your issues, one of their top boys was my old house mate. Say the word and I’ll get Epic a good deal

anonymous_user_923c79df · June 19, 2014, 12:50am

I like it too.

Luckily the forum is up again

SonKim · June 19, 2014, 3:33am

Using Firefox with private browsing, the forum says “maintenance mode”.

anonymous_user_55f6c96a · June 19, 2014, 6:01am

Hi SonKim,
It works fine for me, using both Firefox and Chrome in private/incognito mode.
Can you tell us which OS you are using?
It would be great if you could provide us with the detailed request to help us troubleshoot your issue.
You can do this by clicking on “Developer” in the Firefox menu and selecting Network. Then copy and paste the request and response headers.

Thanks!

SonKim · June 19, 2014, 8:50am

Firefox just auto-updated to version 30, and strangely the problem is gone now.

Shinje · June 19, 2014, 12:17pm

I love it too

Andargor · June 19, 2014, 5:03pm

://i.imgur/HRxyEVv.jpg

TactileVisions · June 19, 2014, 6:38pm

Me too! It’s rare nowadays for companies to be so straightforward

anonymous_user_0128cfb8 · June 19, 2014, 11:49pm

Unfortunate to hear :/.

anonymous_user_a57c1d73 · June 20, 2014, 12:03am

Sorry to hear that Chris, sucks when you realise the snapshot data has been fried too…:mad:

Lesson learned…!:rolleyes:

KnightTechDev · June 20, 2014, 3:25am

Yeah it happens, I remember a guy at an internet backbone tripped over Jim Morris’s router one time during maintenance after he went from **Firetalk** to E4 Chat providing voice chat services, think they had cables everywhere, it messed up services temporarily. I have the habit of encrypting data too much and create custom processes, only to forget a portion of that process, and I end up locked out of my own stuff, for some things, it’s been permanently. I’m just glad your loss wasn’t too bad.

Oh, I did have around 50GB of project data downloaded before the Unreal Launcher Update with Unreal Engine 4.2.1, I did lose a bulk of that data after the launcher update and very close to data cap now, so won’t be fully recovered till later, just re-downloaded a few things for now, nothing’s 100% perfect all the time and things like this do happen, the old saying could be “Too Error Is Human.”

Sometimes it makes you feel like you’ve received that old error message: ‘It was an ID ten T error.’

As time goes on your Unreal Engine Forums and Answer Hub will be massive with regular participation, the only thing I’m worried about is how the future Unreal Tournament network and data management hosting might be handled for multiplayer.

I’m sure you all learned something from this, and with that you can improve moving forward!

Tom_Looman · June 20, 2014, 10:49am

The top tabs are different now, or is it just me? There used to be two additional tabs to show “whats new UT” “whats new UE4” which was very useful.

polyneutron · June 20, 2014, 11:12am

OH NOOOOOOOOOOOOOOOOO, SO MUCH DATA HAD BEEN LOST ON UT-branch as well as on others((
Logo concepts, core gameplay discussions, movement, art direction… WHY GOD, WHY?
*Not kidding, it’s a pity such things happen…

anonymous_user_518c3a4c · June 20, 2014, 11:38am

It’s all there https://forums.unrealtournament/

anonymous_user_aff479a5 · June 20, 2014, 1:31pm

Oh sh**, I have to register for that forum again; my user ID and password from here are not valid there.
Thanks! I thought all the data was 100% lost, but it seems it wasn’t, hahah

Dingo_aus · June 20, 2014, 11:34pm

Love the transparency. Good work Epic.