Google is sorry for outage, explains what happened

Google is contrite and is taking steps to ensure that the outage that began at about 11 am PST on 24 January (3 am on 25 January, Singapore time) does not happen again.

Users could not use logged-in services such as Gmail, Google+, Calendar and Documents, according to Google, as well as Hangouts and Play, according to other media reports, for periods ranging from 25 minutes to almost an hour. The longer outages affected roughly 10% of users, Google said.

“Whether the effect was brief or lasted the better part of an hour, please accept our apologies—we strive to make all of Google’s services available and fast for you, all the time, and we missed the mark today,” said Ben Treynor, VP Engineering, Google, in a blog post.

According to Treynor, an internal system that tells other systems how to behave hit a software bug at 10:55 am PST (2:55 am Singapore time), which caused it to tell the other systems to ignore user requests for data. This generated errors that became obvious by 11:02 am PST (3:02 am Singapore time), by which time Google’s monitoring software had alerted Google’s Site Reliability Team.

Meanwhile, the same system had automatically cleared the original error and corrected itself by 11:14 am PST (3:14 am Singapore time), and by 11:30 am PST, “almost all users’ service was restored.”

Here’s what Treynor says Google will be doing now:

  1. Correcting the mistake in the system to prevent recurrence, and checking all other similar systems to ensure they do not contain a similar mistake.
  2. Adding more checks, so that a system can’t cause similar service disruptions.
  3. Adding monitoring, so failures can be detected and diagnosed more quickly.

In the wider scheme of things, Google did about everything it could. The outage was fixed within an hour, and the faulty internal system even corrected itself automatically.

Yet people were upset during the outage. As Moulton Media Services, one of the 1,266 commenters on the blog post at the time of writing, said, “There was panic in the streets at our place. You would have thought the world had ended.”

The problem is that we all rely heavily on e-mail and associated apps to get things done every day. Yahoo’s e-mail outage in December 2013 lasted for several days, which probably added to the unease during the Google downtime.

While Google works on how it delivers its services, users need to remember that it is a free service; had they paid for it, they might expect a specific quality of service, compensation for downtime, or both. As insurance, though, users should make sure they have alternative apps and services to fall back on if their primary service goes down.