Metric (2017) | Value |
---|---|
Uptime percentage (excluding scheduled maintenance) | 99.999% |
Uptime percentage (including scheduled maintenance) | 99.999% |
Scheduled downtime | 0 hrs |
Unscheduled downtime | 4 min |
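
The figures above can be cross-checked with a short calculation. This is only a sketch: it assumes the reporting period is the full 2017 calendar year and that "excluding scheduled maintenance" means the scheduled window is removed from the period before the percentage is taken. The function name `uptime_percentage` is illustrative.

```python
from datetime import datetime

def uptime_percentage(period_minutes, downtime_minutes):
    """Share of the period during which the service was up, as a percentage."""
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes

# Assumed reporting period: the whole of 2017 (not a leap year).
period = (datetime(2018, 1, 1) - datetime(2017, 1, 1)).total_seconds() / 60

scheduled = 0      # minutes, from "Scheduled downtime | 0 hrs"
unscheduled = 4    # minutes, from "Unscheduled downtime | 4 min"

# Excluding scheduled maintenance: drop the maintenance window from the period.
excluding = uptime_percentage(period - scheduled, unscheduled)
# Including scheduled maintenance: count it as downtime.
including = uptime_percentage(period, unscheduled + scheduled)

print(f"{excluding:.3f}%")  # 99.999%
print(f"{including:.3f}%")  # 99.999%
```

With no scheduled downtime in the year, the two percentages are identical, which matches the table.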
System Outages
2017-06-02 System Outage
Incident Description
On Friday 2nd June 2017, between 12:42 and 12:46, most Cartell Data Services (excluding the Website) were unreachable. The IT team was alerted to the issue two minutes into the outage, and service was restored about two minutes later. Apologies to all who were affected.
Analysis of Outage
Following extensive investigation, we have identified the root cause of this issue and confirmed it by reproducing the failure in QA. The downtime was caused by an unscheduled table-locking backup being performed on a new production database during a busy period. Because the table was relatively large, the lock held during the backup caused server threads to hang while waiting to insert new data. Once all threads were exhausted, each app server refused new connections. HA systems engaged automatically; however, due to a misconfiguration at a load balancer, the locked database remained a single point of failure.
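The failure mode described above can be illustrated with a small simulation. This is a toy model under stated assumptions, not the production setup: `POOL_SIZE`, `simulate`, and `insert_row` are hypothetical names, a `threading.Lock` stands in for the table lock held by the backup, and a semaphore stands in for the app server's fixed pool of worker threads.

```python
import threading

POOL_SIZE = 4  # hypothetical number of worker threads per app server

def simulate(num_requests=6):
    """Send requests while a 'backup' holds the table lock; report each outcome."""
    table_lock = threading.Lock()                  # stands in for the table lock
    free_workers = threading.Semaphore(POOL_SIZE)  # stands in for the thread pool
    workers = []
    outcomes = []

    def insert_row():
        with table_lock:        # hangs here for as long as the backup holds the lock
            pass                # the actual insert would happen here
        free_workers.release()  # worker returns to the pool

    table_lock.acquire()        # backup starts and locks the table
    for _ in range(num_requests):
        if free_workers.acquire(blocking=False):
            t = threading.Thread(target=insert_row)
            t.start()
            workers.append(t)
            outcomes.append("accepted")
        else:
            # Every worker is stuck waiting on the lock: refuse the connection.
            outcomes.append("refused")
    table_lock.release()        # backup finishes; blocked inserts can proceed
    for t in workers:
        t.join()
    return outcomes

print(simulate())
# ['accepted', 'accepted', 'accepted', 'accepted', 'refused', 'refused']
```

No worker can return to the pool until the lock is released, so the pool drains after `POOL_SIZE` requests and every later request is refused, even though the app server itself is healthy.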
Remedial Action
- Avoid running backups on production systems during busy periods.
- Avoid backup scripts that lock tables unnecessarily.
- The load balancer configuration has been corrected.
- All load balancer configurations have been rechecked for consistency.
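
One way the first action could be enforced is a guard at the top of the backup script that refuses to run inside a busy window. The following is a minimal sketch, not a description of the actual tooling: the window boundaries `BUSY_START` and `BUSY_END` and the function `backup_allowed` are hypothetical and would need to reflect real traffic patterns.

```python
from datetime import time

# Hypothetical busy window; the real boundaries are not stated in this report.
BUSY_START = time(8, 0)
BUSY_END = time(18, 0)

def backup_allowed(now, busy_start=BUSY_START, busy_end=BUSY_END):
    """Return True only when `now` falls outside [busy_start, busy_end)."""
    return not (busy_start <= now < busy_end)

print(backup_allowed(time(12, 42)))  # False: inside the assumed busy window
print(backup_allowed(time(2, 0)))    # True: outside the assumed busy window
```

A backup script would call such a check before starting and exit early when it returns `False`.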