UE Forums Outage/Data Loss Incident Today


    The database servers that support our forums suffered an outage today (Wednesday 6/18). We were able to recover from the outage, but unfortunately some data was lost.

    What happened:
    - While removing an unused database cluster, it appears we accidentally performed the operation on the live environment supporting the forums.
    - We immediately realized the mistake and began recovery from our backup. [Edited for clarity] However, we were not able to recover all of the data.

    The Impact:
    - The forum outage affected all users' access to forums.unrealengine.com from 2:58 PM EDT to 5:43 PM EDT.
    - Forum data was lost for any posts made between approximately 8:00 PM EDT on 6-17-2014 and 2:58 PM EDT on 6-18-2014.
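
To make the two windows concrete, here is a minimal sketch that computes their lengths from the times stated above. The variable names are illustrative only, and the 8:00 PM start of the data loss window is approximate, as the announcement says.

```python
from datetime import datetime

# Times as stated in the announcement. Everything is in EDT, so naive
# datetimes are sufficient for computing differences.
outage_start = datetime(2014, 6, 18, 14, 58)  # 2:58 PM EDT
outage_end = datetime(2014, 6, 18, 17, 43)    # 5:43 PM EDT
loss_start = datetime(2014, 6, 17, 20, 0)     # approximately 8:00 PM EDT
loss_end = outage_start                       # posts up to the outage were lost

outage_duration = outage_end - outage_start
loss_window = loss_end - loss_start

print(f"Outage duration:  {outage_duration}")  # 2:45:00
print(f"Data loss window: {loss_window}")      # 18:58:00
```

Under these figures the forums were unavailable for 2 hours 45 minutes, while the lost posts span nearly 19 hours.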

    Future Prevention:

    We completed an internal investigation of what occurred, and found the following:

    Root Cause Analysis

    There are several root causes for this incident.

    - Insufficient barriers or failsafes to keep someone from accidentally removing systems that are in use by our online production services.

    - Data loss caused by insufficient protection against human error. Database backups are hourly. However, snapshots of the backup destination volume are taken only nightly.

    - Lack of data protection policies and enforcement of existing best practices.

    - Lack of process associated with these types of actions.
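
The gap between hourly backups and nightly snapshots can be illustrated with a small sketch. It rests on two assumptions that are not stated in the announcement: that the nightly snapshot runs at about 8:00 PM EDT (which would match the start of the data loss window), and that only snapshots remained usable after the incident. The function and variable names are hypothetical.

```python
from datetime import datetime, timedelta


def last_snapshot_before(incident: datetime,
                         first_snapshot: datetime,
                         interval: timedelta) -> datetime:
    """Return the most recent snapshot time at or before the incident,
    for snapshots taken every `interval` starting at `first_snapshot`."""
    if incident < first_snapshot:
        raise ValueError("no snapshot exists before the incident")
    completed_intervals = (incident - first_snapshot) // interval
    return first_snapshot + completed_intervals * interval


incident = datetime(2014, 6, 18, 14, 58)       # 2:58 PM EDT, from the post
first_snapshot = datetime(2014, 6, 17, 20, 0)  # ASSUMED snapshot time

# Data written after the last surviving snapshot cannot be recovered.
nightly_loss = incident - last_snapshot_before(
    incident, first_snapshot, timedelta(hours=24))
hourly_loss = incident - last_snapshot_before(
    incident, first_snapshot, timedelta(hours=1))

print(f"Loss with nightly snapshots: {nightly_loss}")  # 18:58:00
print(f"Loss with hourly snapshots:  {hourly_loss}")   # 0:58:00
```

Under these assumptions, hourly snapshots would have reduced the loss from almost 19 hours to under one hour.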


    Future Mitigation Steps
    - Add Termination Protection to all Production Database and specialty instances.

    - Increase the snapshot frequency for Production instances to hourly, or replicate in real time.

    - Add a process requirement that Production systems be stopped for a sufficient period before being terminated.

    - Create a policy for the removal/destruction of Production systems requiring, at minimum, peer approval.

    - Create offsite backup location(s):
      a) Separated by rights from TechOps and DevOps.
      b) With a process for gaining access to these systems and the data that includes approval steps and sufficient oversight.
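
As a rough illustration of how the first, third, and fourth steps could combine into a single pre-termination check, here is a minimal sketch. It is not Epic's tooling and does not use any cloud provider's API: the `Instance` record, the `termination_blockers` function, and the 7-day stop period are all hypothetical (the announcement only says "a sufficient period").

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# ASSUMED value; the announcement only says "a sufficient period".
MIN_STOPPED_PERIOD = timedelta(days=7)


@dataclass
class Instance:
    """Hypothetical record describing a system that might be terminated."""
    name: str
    is_production: bool
    termination_protection: bool = True
    stopped_at: Optional[datetime] = None
    approved_by: Optional[str] = None


def termination_blockers(instance: Instance, requested_by: str,
                         now: datetime) -> List[str]:
    """Return the reasons a termination must not proceed (empty if none)."""
    blockers: List[str] = []
    if not instance.is_production:
        return blockers  # the checks below apply to Production systems only
    if instance.termination_protection:
        blockers.append("termination protection is enabled")
    if instance.stopped_at is None:
        blockers.append("instance has not been stopped")
    elif now - instance.stopped_at < MIN_STOPPED_PERIOD:
        blockers.append("instance has not been stopped long enough")
    # Peer approval: someone other than the requester must sign off.
    if instance.approved_by is None or instance.approved_by == requested_by:
        blockers.append("peer approval is missing")
    return blockers


now = datetime(2014, 6, 18, 14, 58)
live = Instance(name="forums-db", is_production=True)
retired = Instance(name="old-cluster", is_production=True,
                   termination_protection=False,
                   stopped_at=datetime(2014, 6, 1), approved_by="bob")

print(termination_blockers(live, "alice", now))
print(termination_blockers(retired, "alice", now))
```

In this sketch a live system with default settings is blocked for three independent reasons, so a single mistake (such as selecting the wrong cluster) is not enough to destroy it.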


    We apologize for the outage and loss of data, and the inconvenience this caused. [added post investigation] We take this event extremely seriously and will implement these measures to ensure that issues of this sort do not recur.

    Human error may have been the "cause" or trigger of this outage and data loss, but it was not the root cause. We will strive to mitigate the impact of mistakes, limiting the damage and downtime caused when human error occurs. Had any of these measures been in place, the impact of the outage would have been limited and the data loss could have been avoided. Thank you for your patience with us as we strive for excellence in all we do here at Epic.

    - Chris Gerhardt
    Last edited by Stephen Ellis; 06-23-2014, 03:25 PM. Reason: Update of information after full investigation

    #2
    Sucks, but it happens. Keep up the good work and thanks for releasing the Black Jack content!!!
    Current Projects -Cat Interstellar
    Lead Developer - Ionized Games



      #3
      Oops! Every time data is lost, a kitten dies.

      You monster



        #4
        Love the honesty



          #5
          ****, I know people at Avnet (EMC partners) who could solve your issues; one of their top boys was my old housemate. Say the word and I'll get Epic a good deal.



            #6
            Originally posted by Basingse
            Love the honesty
            I like it too.

            Luckily the forum is up again
            Last edited by fighter5347; 06-19-2014, 08:26 AM.



              #7
              Using Firefox with private browsing, the forum says "maintenance mode".
              TOUR of DUTY



                #8
                Originally posted by SonKim
                Using Firefox with private browsing, the forum says "maintenance mode".
                Hi SonKim,
                It works fine for me, using both Firefox and Chrome in private/incognito mode.
                Can you tell us which OS you are using?
                It would be great if you could provide the detailed request to help us troubleshoot your issue.
                You can do this by clicking "Developer" in the Firefox menu and selecting "Network". Then copy and paste the request and response headers.

                Thanks!



                  #9
                  Originally posted by Junaili Lie
                  Hi SonKim,
                  It works fine for me, using both Firefox and Chrome in private/incognito mode.
                  Can you tell us which OS you are using?
                  It would be great if you could provide the detailed request to help us troubleshoot your issue.
                  You can do this by clicking "Developer" in the Firefox menu and selecting "Network". Then copy and paste the request and response headers.

                  Thanks!
                  Firefox just auto-updated to version 30, and strangely the problem is gone now.
                  TOUR of DUTY



                    #10
                    Originally posted by Basingse
                    Love the honesty
                    I love it too



                      #11


                      It is by will alone I set my code in motion.
                      It is by coding that thoughts acquire speed, the hands acquire shaking, the shaking becomes a warning.
                      It is by will alone I set my code in motion.



                        #12
                        Originally posted by Shinje
                        I love it too
                        Me too! It's rare nowadays for companies to be so straightforward



                          #13
                          Unfortunate to hear :/.



                            #14
                            Sorry to hear that Chris, sucks when you realise the snapshot data has been fried too...

                            Lesson learned...!



                              #15
                              Yeah, it happens. I remember a guy at an internet backbone tripped over Jim Morris's router one time during maintenance, after he went from Firetalk to E4 Chat providing voice chat services. I think they had cables everywhere, and it messed up services temporarily. I have a habit of encrypting data too much and creating custom processes, only to forget a portion of the process and end up locked out of my own stuff; for some things, it's been permanent. I'm just glad your loss wasn't too bad.

                              Oh, I did have around 50GB of project data downloaded before the Unreal Launcher update with Unreal Engine 4.2.1. I lost the bulk of that data after the launcher update, and I'm very close to my data cap now, so I won't be fully recovered until later; I've just re-downloaded a few things for now. Nothing's 100% perfect all the time and things like this do happen. The old saying could be "Too Error Is Human."

                              Sometimes it makes you feel like you've received that old error message: 'It was an ID ten T error.'

                              As time goes on, your Unreal Engine Forums and Answer Hub will be massive with regular participation. The only thing I'm worried about is how the future Unreal Tournament network and data management hosting might be handled for multiplayer.

                              I'm sure you all learned something from this, and with that you can improve moving forward!
                              Last edited by KnightTechDev; 06-19-2014, 11:43 PM.
                              KT Dev Blog:
                              https://sites.google.com/site/knight...evelopers-blog

