For a few hours on Monday, several Google services ground to a halt. Users all over the world were unable to send messages via Hangouts, engage in video chats, or check Google Voice. Some people trying to create spreadsheets with Sheets were met with 502 errors, and people using the multiplayer features of Google Play Games were also affected. All of this apparently resulted from an oops during a routine hardware maintenance event in which the company miscalculated available capacity.
During these maintenance events, Google redirects traffic away from certain backend servers to a new set while it performs the work. Due to this slip-up, the new servers lacked the capacity to handle the redirected traffic. Google engineers started running the maintenance procedure at 8:25 AM and realized something was up roughly twenty minutes later.
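To make the capacity problem concrete, here is a minimal sketch of the kind of pre-flight check that could run before traffic is redirected. It is purely illustrative: the function name, the parameters, and the 20 percent headroom figure are all assumptions, not anything from Google's report.

```python
def can_absorb(redirected_peak: float,
               pool_capacity: float,
               pool_current: float,
               headroom: float = 0.2) -> bool:
    """Return True if the target pool can take the redirected traffic.

    All names and the default headroom are hypothetical. Units only need
    to be consistent (requests per second, GB of memory, and so on).
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    # Keep a fraction of the pool in reserve for unexpected spikes.
    usable = pool_capacity * (1 - headroom)
    return pool_current + redirected_peak <= usable


# A pool of 1000 units already serving 400 can take 300 more, but not 500.
print(can_absorb(300, 1000, 400))
print(can_absorb(500, 1000, 400))
```

The point of the reserve is that a check against raw capacity passes right up to the moment the servers fall over.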
The team then brought in additional capacity, halted the maintenance process, and started bringing users back online in waves to avoid overwhelming the system.
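Bringing users back "in waves" can be pictured as splitting the affected users into fixed-size batches and admitting one batch at a time. The sketch below assumes a hypothetical `reconnect_in_waves` helper and a fixed wave size; the report does not say how Google actually sized or paced its waves.

```python
from typing import Iterator, List, Sequence


def reconnect_in_waves(users: Sequence[str], wave_size: int) -> Iterator[List[str]]:
    """Yield users in batches of at most wave_size (hypothetical helper)."""
    if wave_size <= 0:
        raise ValueError("wave_size must be positive")
    for start in range(0, len(users), wave_size):
        # Each slice is one wave; the last wave may be smaller.
        yield list(users[start:start + wave_size])


affected = ["ana", "ben", "chi", "dev", "eli"]
for number, wave in enumerate(reconnect_in_waves(affected, wave_size=2), start=1):
    # A real system would wait for the backend to stabilize between waves.
    print(f"wave {number}: {wave}")
```

Admitting everyone at once would recreate the very traffic spike that the new servers could not handle.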
These things happen, but if you need the reassurance that Google's learned its lesson, here is a dry list of bullet points the company's provided to show what it's taking away from the experience.
- Review memory requirements and increase the memory capacity for the affected backend servers to meet peak load needs.
- Implement better monitoring for memory utilization and usage tracking to ensure that servers have sufficient capacity available.
- Lower the alert threshold for errors with the Hangouts service to improve Engineering
- Review internal procedures for bringing up emergency capacity to speed mitigation efforts.
- Continue work in progress to improve the resilience of Hangouts service during high load
You can read the entire incident report for yourself at the link below.