Unavailability of Unreal Engine Launcher Service: September 7, 2014

anonymous_user_85444fdc · September 12, 2014, 9:01pm

September 7, 2014 Unavailability of Unreal Engine Launcher Service

We would like to share more details with our customers about the service disruption that affected our Unreal Engine Launcher on Sunday, September 7th. The disruption affected all customers trying to run the launcher, specifically causing problems with signing in, installing, and updating the launcher or versions of the Unreal Engine. Those who had already installed the Unreal Engine were still able to use the product by selecting “go offline” at the sign-in screen.

Incident Report
All times are listed in US EDT (UTC-4)

The service disruption began at 2:51AM when the service began to stop on the hosts behind our load balancer. Epic staff determined the cause to be an Out of Memory (OOM) condition: the service used excessive memory, and the operating system killed the offending process to save itself from a complete crash.

At 9:54AM a concerned customer posted the issue to our Unreal Engine Forums (https://forums.unrealengine.com). Our moderator moved the incident to our Unreal Engine Answers site (https://answers.unrealengine.com), where our customers helped provide more details about the problem. The severity of the issue became clear after the nature of the problem was found at 10:58AM, and it was escalated to the site reliability team at 11:31AM. During this outage, customers may have been presented with “No Version Information Received” messages and 503 Service Unavailable response codes.

Our site reliability team worked the problem and confirmed its source at 11:57AM. Immediate action was taken and service was restored at 12:07PM.
From that point forward we monitored the services day and night, both to find the root cause and to identify other potential risks and mark them for improvement.

Forward Thinking
This was a rare case that took us by surprise and revealed areas of our monitoring, alerting, and escalation process that needed improvements.

First, though alerts were triggered, the alert recipients were not configured correctly and did not notify the appropriate team to engage the problem. We have implemented fixes for this and will continue to make improvements as we expand our services.
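As an illustration of the kind of check that can catch misrouted alerts before an incident, here is a minimal sketch in Python. It is not Epic's tooling: the rule format (`name`, `owner_team`, `recipients`) and the on-call roster are hypothetical, assumed only for the example. The idea is to verify that every alert rule has at least one recipient who is actually on the team expected to respond.

```python
from typing import Dict, List, Set


def find_misrouted_alerts(rules: List[dict], on_call: Dict[str, Set[str]]) -> List[str]:
    """Return a description of every alert rule that would not reach its owning team.

    rules:   hypothetical alert definitions with 'name', 'owner_team', 'recipients'.
    on_call: hypothetical roster mapping a team name to its members' addresses.
    """
    problems = []
    for rule in rules:
        team = rule.get("owner_team")
        recipients = set(rule.get("recipients", []))
        members = on_call.get(team, set())
        if not recipients:
            problems.append(f"{rule['name']}: no recipients configured")
        elif not recipients & members:
            # Alerts fire, but nobody on the responsible team receives them.
            problems.append(f"{rule['name']}: no recipient is on team '{team}'")
    return problems
```

Run on a schedule or as part of a configuration review, a check like this reports rules that fire without reaching anyone who can act on them.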

Second, we did not have effective periodic ‘synthetic customer interactive’ tests in place to test deeper into the portal and conduct transactions; development of these types of tests is ongoing.
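For readers unfamiliar with synthetic tests, the sketch below shows the general shape of one in Python. It is an assumption-laden illustration, not our actual test suite: the step names, URLs, and expected text are hypothetical placeholders. The check walks through a customer-like sequence of requests and stops at the first step that fails, so a 503 deep in the flow is caught even when the front page still loads.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool
    detail: str


def http_fetch(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Fetch a URL and return (status_code, body_text)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # Error responses such as 503 land here; report the status as data.
        return err.code, ""
    except OSError as err:
        # Connection failures and timeouts; 0 means "no HTTP response".
        return 0, str(err)


def run_synthetic_check(
    steps: List[Tuple[str, str, str]],
    fetch: Callable[[str], Tuple[int, str]] = http_fetch,
) -> List[StepResult]:
    """Run (name, url, expected_text) steps in order, stopping at the first failure."""
    results = []
    for name, url, expected_text in steps:
        status, body = fetch(url)
        ok = status == 200 and expected_text in body
        results.append(StepResult(name, ok, f"HTTP {status}"))
        if not ok:
            break  # Later steps depend on this one, as they would for a customer.
    return results
```

A scheduler would call `run_synthetic_check` every few minutes with steps such as sign-in, version lookup, and download, and raise an alert whenever the last result has `ok` set to `False`.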

Third, we made improvements to how our environment scales to meet demand.

Lastly, many other improvements have been made to help our site reliability team monitor and conduct analysis in order to provide both our developers and customers quick and accurate data. We continue investigating better ways to automate alert response.

As of this statement, we have rolled out an updated and stable version that has been tested under heavy load conditions.

At Epic we strive for the best in everything we do. This incident is very unfortunate, and we would like to apologize for the negative impact it may have had. We are committed to providing you with the best possible experience, and as you can see we are making every effort to prevent this from happening again.

Jacky · September 12, 2014, 11:36pm

You deserve every letter of your Officer title, Chris. It felt like i was reading a battle report. oO
Thanks a bunch for the detailed explanation!

BinarySword · September 14, 2014, 4:14pm

I dunno about you but I was reading the report with the voice of Patrick Stewart’s Captain Jean-Luc Picard (Star Trek) and/or Edward James Olmos’s Admiral William Adama (Battlestar Galactica) in my head :rolleyes:

With that aside,
Thank you for striving for a level of customer service that can only be described by the very name of the company you work for.