Our launcher / patcher backend services had an outage yesterday – our apologies for the interruption of service!
Below are the details of what happened.
From: DaveS
Sent: Wednesday, June 11, 2014 2:44 PM
Subject: RE: URGENT: BI down for Prod?
The first alert showed up in Zabbix at 11:54pm. It detected that something was wrong, but its specifics, while correct, were not the red flag they probably should have been. Scarier alerts followed within the next 10 minutes.
The root cause was an oplog configuration that caused the MongoDB startup sequence to fail. We were going through some services to apply an account addition, which had run cleanly on every system up to this point. However, the production systems use much larger oplog settings, so their configs differ from those on the dev systems. After the account change was applied, the MongoDB instances were ordered to stop and start. The failure occurred on start, because MongoDB rejected the oplog setting. I am not yet sure what was done to resolve this, so the other production systems will not receive the account change until I am certain the issue will not recur.
oplog: The operations log is a record of the updates and changes made to the database. Anything written to the DB also becomes a line item in the oplog. The oplog needs to be larger on the production systems because it is also how the MongoDB servers sync with one another.
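To make the sizing point concrete, here is a rough back-of-the-envelope sketch in Python. The oplog has a fixed size, so once it fills up the oldest entries are overwritten; a server that falls further behind than the oplog covers can no longer catch up from it. The function name and all of the numbers below are hypothetical and are not taken from our systems.

```python
def oplog_window_hours(oplog_size_mb: float, write_rate_mb_per_hour: float) -> float:
    """Roughly how many hours of writes fit in the oplog before old entries are overwritten."""
    if oplog_size_mb <= 0 or write_rate_mb_per_hour <= 0:
        raise ValueError("oplog size and write rate must both be positive")
    return oplog_size_mb / write_rate_mb_per_hour


# Hypothetical figures: the same 1 GB oplog under a light dev load
# and under a much heavier production load.
dev_window = oplog_window_hours(1024, 8)      # 128.0 hours
prod_window = oplog_window_hours(1024, 512)   # 2.0 hours

print(f"dev window:  {dev_window:.1f} h")
print(f"prod window: {prod_window:.1f} h")
```

With the same oplog size, the busier system has a far shorter window in which another server can fall behind and still resync, which is why production needs the larger setting.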
As far as I can tell, this error was not related to the account change, beyond the fact that applying it required a stop and start of the MongoDB services. Any stop and start would have caused the same issue until Cove fixed it.
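Since any restart would have tripped over the same setting, one option is to sanity-check the config before ordering a stop and start. The sketch below is only an illustration: it assumes the older `key = value` style mongod config file, where `oplogSize` is given in megabytes, and the helper name `read_oplog_size_mb` is made up. It does not reproduce the actual fault we hit, which is not described here.

```python
from typing import Optional


def read_oplog_size_mb(config_text: str) -> Optional[int]:
    """Return the oplogSize value (in MB) from key = value config text.

    Returns None if oplogSize is not set, and raises ValueError if the
    value is not a positive whole number.
    """
    for raw in config_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key != "oplogSize":
            continue
        try:
            size = int(value)
        except ValueError:
            raise ValueError(f"oplogSize is not a whole number: {value!r}")
        if size <= 0:
            raise ValueError(f"oplogSize must be positive, got {size}")
        return size
    return None


# Hypothetical config text, not copied from a real system.
sample = "replSet = rs0\noplogSize = 2048  # MB\n"
print(read_oplog_size_mb(sample))
```

A check like this could run as part of the change script, so that a bad value stops the rollout before the service is taken down rather than after.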
-dave
From: DaveS
Date: Wednesday, June 11, 2014 at 2:20 PM
Subject: RE: URGENT: BI down for Prod?
Here is the timeline of this incident (EDT):
• 12:01pm – Zachary recognizes a possible issue with patching and the BuildInfo admin page.
• 12:10pm – Netenrich sends out an alert email triggered by a test that the Zabbix Server ran. It is vague, but still a pointer to the main issue.
• 12:18pm – ChrisD expands the thread and asks for assistance looking into it.
• 12:23pm – Netenrich calls Dave with the Zabbix Server error.
• 12:26pm – Greg confirms that there were no recent changes to BuildInfo.
• 12:36pm – Jun expands the thread to the NOC.
• 12:37pm – Netenrich calls Dave with a more specific error. Dave hits the road to return to work.
• 12:44pm – Jun reaches out to Cove. Cove confirms and begins looking at the issue.
• 1:02pm – Cove sees a configuration issue. Dave, separately, finds the same issue.
• 1:14pm – Dave confirms that the databases are running. Zabbix alerts have gone away. Cove continues to verify…
• 1:39pm – Cove confirms that all is well.
• 1:40pm – Greg verifies this by patching the Vehicle Game on his client.
• 1:40pm – Incident closed.
-dave