Unavailability of Engine Services - Post-Mortem Report: October 26, 2014

_Epic_JakeVargas · October 31, 2014, 10:32pm

This past Sunday, a number of engine services were offline for around eight hours. During that time developers could not download engine binaries or Marketplace content, and could not subscribe to the service.

OUTAGE ONSET
2014/10/26 4:45AM ET

OUTAGE RESOLVED
2014/10/26 12:26PM ET

AFFECTED SERVERS AND SERVICES
Unreal Engine Client Launcher
buildinfo-public-service-prod06 (Buildinfo services)

OUTAGE IMPACT
Live launcher unable to download or load properly.
UE4 state: ‘subscribe’
Fortnite and Unreal Tournament clients state: ‘Unavailable’
Marketplace items state: ‘Syncing’

DATA LOSS
None

CAUSE
The Java version running on the servers was u45. This version has a known memory leak, which caused the OOM killer to stop the Buildinfo service.

RESOLUTION
Update the Java version on the Buildinfo servers and restart the service.
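
As a rough illustration of how a version check like this could be automated, the sketch below parses a Java 7 version string and flags anything older than the fixed build named in this report (1.7.0_67). The function names and the use of 67 as a minimum are assumptions made for the example; this is not the tooling used during the incident, and it does not claim that every update below 67 carries the leak.

```python
import re

# Assumption: the fixed build named in this report (1.7.0_67) is used as the minimum.
MIN_UPDATE = 67


def parse_java_update(version: str) -> int:
    """Return the update number from a Java 7 version string such as '1.7.0_45'."""
    match = re.fullmatch(r"1\.7\.0_(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognised Java 7 version string: {version!r}")
    return int(match.group(1))


def needs_upgrade(version: str, minimum: int = MIN_UPDATE) -> bool:
    """True when the update number is below the chosen minimum."""
    return parse_java_update(version) < minimum


if __name__ == "__main__":
    for v in ("1.7.0_45", "1.7.0_67"):
        print(v, "needs upgrade" if needs_upgrade(v) else "ok")
```

A check of this kind could run on each Buildinfo instance at deploy time, so that an outdated runtime is caught before the service starts.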

FUTURE MITIGATION
1- Alerting on ELB (load balancer) issues to properly notify as P0 (Critical) when all instances are offline|unhealthy.
2- NOC to watch for instances that are online while their services remain unavailable. Standard Operating Procedure (SOP) is to restart services and determine root cause.
3- Fixed issue with invalid characters in VictorOps distribution list.
4- For Emergencies: Our internal support staff will make a phone call when in doubt! Texting isn’t the preferred form of communicating critical issues. We will get acknowledgement if handing off issues to ensure a proper chain of possession and accountability from start to finish.

DETAILS
At 4:45 AM EDT buildinfo services went offline. Between approximately 6:38 AM EDT and 7:00 AM EDT emails began to circulate regarding this issue. At 7:07 AM EDT TechOps confirmed receipt of the notification. Ticket LIVE-1104 was created to document the events of this issue. At 12:21 PM EDT the buildinfo service was started by DevOps to bring services back online. DevOps determined that a problematic Java version was running on the systems. At 12:29 PM EDT DevOps confirmed the new Java version, 1.7.0_67, was now in use for the buildinfo service. At 12:30 PM EDT confirmation was received by our Senior Programmer of our Engine Team that services were restored.

RESPONSE DELAY ANALYSIS
Our monitoring system, Zabbix, sent out an alert that a buildinfo service was offline on one instance prior to the secondary. However, due to problems with alerts occurring during auto-scaling events and production pushes, it is difficult to immediately distinguish between valid and invalid alerts. Problematic alerts are “ELB (Partial)” and “Port Down” alerts. An ELB (all instances down) alert has been created.
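
The difference between the noisy “ELB (Partial)” alert and the new all-instances-down alert can be sketched as below. This is only an illustration of the logic, not Zabbix configuration: the `Severity` enum, the `classify_elb` function, and the list-of-booleans input are all hypothetical names chosen for the example.

```python
from enum import Enum
from typing import Iterable


class Severity(Enum):
    OK = "OK"
    WARNING = "ELB (Partial)"
    P0 = "ELB (all instances down)"


def classify_elb(instance_healthy: Iterable[bool]) -> Severity:
    """Map per-instance health flags for one load balancer to an alert severity."""
    states = list(instance_healthy)
    # No healthy instance at all, including the case where none are registered.
    if not any(states):
        return Severity.P0
    # Some but not all instances are healthy, as happens during auto-scaling or a push.
    if not all(states):
        return Severity.WARNING
    return Severity.OK


if __name__ == "__main__":
    print(classify_elb([True, False]).value)
    print(classify_elb([False, False]).value)
```

Keeping the partial case at a lower severity leaves room for auto-scaling events and production pushes, while the case where nothing is serving traffic always pages as critical.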

A problem was found in the TechOps on-call distribution list. During the conversion from Exchange to GMail, the email address used to send an alert to the VictorOps integration was mangled. As a result the email failed to send and VictorOps was not notified. In addition, the Epic Games Gmail domain defaults to disallowing external domains being included in Gmail groups.

TechOps is currently down to an on-call rotation of two site reliability engineers. One was preparing for a flight; the other was unaware of the issue until a phone call was made. There was confusion over who had assumed responsibility once the first engineer was in transit. The TechOps list did not adequately notify DevOps personnel of the responsibility change, yet it did notify the Producer. Only after the call to the secondary TechOps engineer did it become clear that this miscommunication had caused a significant delay. DevOps promptly took action to resolve the issue. TechOps should not have assumed it to be a Producer’s role to pass responsibility during a technical issue unless mutually agreed to under the circumstances.

RESPONSE MITIGATION
Alerting logic was the primary reason for the delayed response. The secondary contributing factor was that TechOps relies on source-based alerting from VictorOps (which includes the TechOps on-call DL). Because preliminary notification was via email or text, there was a significant delay.

VictorOps uses multiple forms of alerting. 1- Android|iOS based app alerting, 2- SMS, 3- Phone Call.
The need for verbal communication during crisis situations has been reiterated.

TechOps on call DL should have sent an alert to the TechOps team. - This has been fixed
Zabbix should have sent a Critical alert when no servers were behind the ELB. - This has been fixed
Zabbix WebChecks should have alerted critical during a service outage. - A Zabbix upgrade has completed and we are adding webchecks
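
For readers unfamiliar with web checks, the idea is to test the service itself rather than the instance: request a URL and treat anything other than a successful response with the expected content as down. The sketch below shows that idea with the Python standard library only. It is not the Zabbix setup described above; the function names, the expected-text parameter, and the timeout are assumptions for illustration.

```python
import urllib.request


def is_healthy(status: int, body: str, expected_text: str) -> bool:
    """A response counts as healthy only with HTTP 200 and the expected text present."""
    return status == 200 and expected_text in body


def web_check(url: str, expected_text: str, timeout: float = 5.0) -> bool:
    """Fetch the URL and report whether the service behind it looks healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return is_healthy(resp.status, body, expected_text)
    except OSError:
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

A check like this would have failed as soon as the Buildinfo process was killed, even though the instances themselves were still online.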

TommyBear · November 1, 2014, 4:47am

Interesting!

_Epic_JakeVargas · November 1, 2014, 8:16am

We want to be as transparent to everyone as possible. That’s the only way to move forward and continue to improve ourselves!

Open source and an open mind.