The database servers that support our forums suffered an outage today (Wednesday 6/18). While we were able to recover from the outage, unfortunately some data was lost.
- While removing an unused database cluster, we accidentally performed the operation on the live environment supporting the forums.
- We immediately realized the mistake and began recovery from our backup. [Edited for clarity] However, we were not able to recover all of the data.
- The forum outage affected all users' access to forums.unrealengine.com from 2:58 PM EDT to 5:43 PM EDT.
- Forum data was lost for any posts made between approximately 8:00 PM EDT 6-17-2014 and 2:58 PM EDT 6-18-2014.
We completed an internal investigation of what occurred, and found the following:
Root Cause Analysis
There are several root causes for this incident.
Insufficient barriers or failsafes to keep someone from accidentally removing systems that are in use by our online production services.
Data loss caused by insufficient protection against human error. Database backups are hourly. However, snapshotting of the backup destination volume happens only nightly.
Lack of data protection policies and enforcement of existing best practices.
Lack of process associated with these types of actions.
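To put the timeline and the snapshot schedule in numbers, here is a minimal sketch of the arithmetic. The timestamps come from the summary above. Treating 8:00 PM as the last usable restore point is an assumption based on the "approximately 8:00 PM" figure, and the variable names are illustrative only.

```python
from datetime import datetime, timedelta

# Times quoted in the post (all EDT).
# Assumption: the last recoverable state was at about 8:00 PM on 6/17.
last_restore_point = datetime(2014, 6, 17, 20, 0)
outage_start = datetime(2014, 6, 18, 14, 58)
outage_end = datetime(2014, 6, 18, 17, 43)

# How much forum data was lost, and how long the forums were down.
data_loss_window = outage_start - last_restore_point
outage_duration = outage_end - outage_start

# If only snapshots survive, up to one full snapshot interval can be lost.
nightly_worst_case = timedelta(hours=24)
hourly_worst_case = timedelta(hours=1)

print(data_loss_window)   # 18:58:00
print(outage_duration)    # 2:45:00
print(data_loss_window <= nightly_worst_case)  # True
```

Under these assumptions, the loss of nearly 19 hours falls within what a nightly schedule allows, while an hourly schedule would have capped the loss at roughly one hour.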
Future Mitigation Steps
Add Termination Protection to all Production Database and specialty instances.
Increase the frequency of snapshotting of Production instances to every hour, or replicate in real time.
Add a process requirement that Production systems be stopped for a sufficient period before being terminated.
Create a policy for the removal/destruction of Production systems requiring, at minimum, peer approval.
Create offsite backup location(s)
a) Separated by rights from TechOps and DevOps.
b) Process for gaining access to these systems and the data with approval steps and sufficient oversight.
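The first four steps above can be read as a pre-termination checklist. The following is a rough sketch of how such a check might look, not a description of our actual tooling. The `Instance` class, the `termination_blockers` function, the seven-day `MIN_STOPPED_PERIOD`, and the example names are all hypothetical; the steps above only say "a sufficient period".

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical value: the mitigation steps only say "a sufficient period".
MIN_STOPPED_PERIOD = timedelta(days=7)


@dataclass
class Instance:
    """Illustrative stand-in for a server record; not a real cloud API object."""
    name: str
    is_production: bool
    termination_protection: bool = True
    stopped_since: Optional[datetime] = None
    approvals: List[str] = field(default_factory=list)


def termination_blockers(instance: Instance, requested_by: str,
                         now: datetime) -> List[str]:
    """Return the reasons a termination request must be refused (empty if none)."""
    blockers: List[str] = []
    if not instance.is_production:
        return blockers  # these rules apply to Production systems only
    if instance.termination_protection:
        blockers.append("termination protection is enabled")
    if instance.stopped_since is None:
        blockers.append("instance is still running")
    elif now - instance.stopped_since < MIN_STOPPED_PERIOD:
        blockers.append("instance has not been stopped long enough")
    # Peer approval: at least one approver other than the requester.
    if not any(approver != requested_by for approver in instance.approvals):
        blockers.append("no peer approval")
    return blockers


now = datetime(2014, 6, 18, 14, 58)
live = Instance("forums-db", is_production=True)
for reason in termination_blockers(live, "alice", now):
    print("refused:", reason)
```

In this sketch a running, protected Production instance with no approvals is refused for three separate reasons, so a single mistake is not enough to remove it.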
We apologize for the outage and loss of data, and the inconvenience this caused. [added post investigation] We take this event extremely seriously and will implement these measures to ensure that issues of this sort do not recur.
Human error may have been the “cause” or trigger of this outage and data loss, but it was not the root cause. We will strive to mitigate the impact of mistakes to limit the damage and downtime caused when human error occurs. If any of these measures had been in place, the impact of the outage would have been limited and data loss could have been avoided. Thank you for your patience with us as we strive for excellence in all we do here at Epic.
- Chris Gerhardt