Summary of 06-11 outage of launcher services

Our launcher / patcher backend services had an outage yesterday – our apologies for the interruption of service!

Below are the details of what happened.

From: DaveS
Sent: Wednesday, June 11, 2014 2:44 PM
Subject: RE: URGENT: BI down for Prod?

The first alert showed up in Zabbix at 11:54am. That is the one that detected “something wrong,” but its specifics, while correct, weren’t the red flag they probably should have been. It was followed within the next 10 minutes by scarier alerts.

The root cause was an oplog configuration that caused the startup sequence to fail. We were going through some services to apply an accounts addition, and this had been running cleanly on all the systems to this point. However, since the production systems have much larger oplog settings, their configs differ from those of the dev systems. After the account change was applied, the MongoDB instances were ordered to stop and start. The failure occurred on start because mongod rejected the oplog setting. I am not yet sure what was done to resolve this, so the other production systems will not receive this account change until I am certain the issue will not recur.
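For illustration only, here is the kind of pre-restart sanity check that could catch a dev-sized oplog config being pushed to a production member. The config path, the YAML-style mongod config format, and the connection details are assumptions for the sketch, not a description of the tooling actually in use:

```python
# Hypothetical pre-restart check: compare the oplog size requested by the
# config file that will take effect on restart against the oplog that is
# already on disk, and refuse to restart if the new setting looks wrong.
import yaml                                # assumes a YAML-style mongod.conf
from pymongo import MongoClient

CONFIG_PATH = "/etc/mongod.conf"           # assumed path
client = MongoClient("localhost", 27017)   # assumed local replica-set member

# Size of the oplog that already exists on disk, in MB.
stats = client["local"].command("collStats", "oplog.rs")
on_disk_mb = stats["maxSize"] / (1024 * 1024)

# Size the new config will ask for on the next start, if it sets one at all.
with open(CONFIG_PATH) as fh:
    conf = yaml.safe_load(fh) or {}
requested_mb = (conf.get("replication") or {}).get("oplogSizeMB")

# A config that requests a smaller oplog than the one on disk is the
# suspicious case here (e.g. a dev config applied to a production member).
if requested_mb is not None and requested_mb < on_disk_mb:
    raise SystemExit(
        f"refusing restart: config requests a {requested_mb} MB oplog but the "
        f"on-disk oplog is {on_disk_mb:.0f} MB -- possible dev config on a "
        "production member"
    )
print("oplog settings look consistent; restart should be safe")
```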

oplog: An operations log is a log of the updates and changes that have happened to the database; anything written to the DB also becomes a line item in the oplog. The oplog needs to be larger on the production systems because it is also how the MongoDB servers sync with one another.
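To make the sync point concrete, here is a small sketch of how one might measure how much replication history the oplog currently holds, i.e. the window a lagging secondary has to catch up in before it falls off the oplog. The connection details are assumptions; treat this as illustrative rather than as our monitoring code:

```python
# Illustrative only: report the configured oplog size and the time window it
# currently covers on a replica-set member (assumed reachable on localhost).
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
local = client["local"]

stats = local.command("collStats", "oplog.rs")
size_mb = stats["maxSize"] / (1024 * 1024)   # configured cap of the oplog
used_mb = stats["size"] / (1024 * 1024)      # portion currently filled

# Timestamps of the oldest and newest operations still held in the oplog;
# they bound the window a lagging secondary has to resync within.
first = local["oplog.rs"].find_one(sort=[("$natural", 1)])["ts"]
last = local["oplog.rs"].find_one(sort=[("$natural", -1)])["ts"]
window_hours = (last.time - first.time) / 3600.0

print(f"oplog: {used_mb:.0f} / {size_mb:.0f} MB used, "
      f"covering ~{window_hours:.1f} hours of writes")
```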

As far as I can tell, this error was not related to the account change beyond the fact that applying it required a stop and start of the MongoDB services. Any stop and start would have caused the same issue until Cove fixed it.

-dave

From: DaveS
Date: Wednesday, June 11, 2014 at 2:20 PM
Subject: RE: URGENT: BI down for Prod?

Here is the timeline of this incident (EDT):

• 12:01pm – Zachary recognizes a possible issue with patching and the BuildInfo admin page.
• 12:10pm – Netenrich sends out an alert email caused by a test run by the Zabbix Server. It is vague, but still a pointer to the main issue.
• 12:18pm – ChrisD expands the thread and asks for assistance looking into it.
• 12:23pm – Netenrich calls Dave with the Zabbix Server error.
• 12:26pm – Greg confirms that there were no recent changes to BuildInfo.
• 12:36pm – Jun expands the thread to the NOC.
• 12:37pm – Netenrich calls Dave with a more specific error. Dave hits the road to return to work.
• 12:44pm – Jun reaches out to Cove. Cove confirms and begins looking at the issue.
• 1:02pm – Cove sees a configuration issue. Dave, separately, finds the same issue.
• 1:14pm – Dave confirms that the databases are running. Zabbix alerts have gone away. Cove continues to verify…
• 1:39pm – Cove confirms that all is well.
• 1:40pm – Greg verifies this by patching the Vehicle Game on his client.
• 1:40pm – Incident closed.

-dave

I didn’t have any problems personally, but nevertheless thanks for posting this!

Thanks so much for this. This didn’t affect me either, but I find your continued transparency since the licensing changes nothing short of astounding. It’s as if we’re all million dollar licensees!

Cheers,
-D

The above is an account of the timeline and the technical issue. After investigating the organizational and process issues, we found the following:

  • This outage occurred after a planned change was applied to the system, but the outage was prolonged because we failed to validate that the environment was healthy after the change was applied.
  • Our monitoring system (NOC) identified the problem quickly, but we failed to act quickly when informed of the issue, partly because we did not understand its severity and partly because of our staffing coverage at the point of escalation.

What we are doing to prevent this in the future:

  • Enforce strict change procedures that require validation testing after each change is applied.
  • Provide better information on our application-to-infrastructure mapping so that we understand the severity of the issues reported by the NOC (see the sketch after this list).
  • Ensure effective staff coverage so that we act more quickly once an issue has been identified.
  • Investigate the root cause of the technical issue that caused the database to fail when the update was applied, and take the appropriate action to address it.
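As a rough illustration of the application-to-infrastructure mapping mentioned above, the sketch below shows the sort of lookup we want available when a NOC alert arrives. All host and service names here are made up for the example:

```python
# Hypothetical mapping from monitored hosts to the player-facing services
# that depend on them, so a Zabbix/NOC alert can be triaged by real impact
# rather than by how alarming the raw error text looks.
SERVICE_MAP = {
    # host:           (dependent services,                       severity)
    "prod-mongo-01": (["launcher patching", "BuildInfo admin"], "critical"),
    "prod-mongo-02": (["launcher patching", "BuildInfo admin"], "critical"),
    "dev-mongo-01":  (["internal test builds"],                 "low"),
}

def triage(alert_host: str) -> str:
    """Turn a bare host name from an alert into an impact statement."""
    services, severity = SERVICE_MAP.get(alert_host, ([], "unknown"))
    impact = ", ".join(services) or "no mapped services"
    if severity == "critical":
        return f"PAGE ON-CALL NOW: {alert_host} backs {impact}"
    return f"{alert_host} ({severity} severity): affects {impact}"

if __name__ == "__main__":
    print(triage("prod-mongo-01"))
    print(triage("dev-mongo-01"))
```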

- Epic CIO