Date: Fri, 15 Jan 2021 18:46:28 GMT
<p>On Jan 4th, 2021, Slack experienced a global outage that prevented customers from using the service for nearly 5 hours.</p>
<p>Slack has released a root cause analysis of the incident, which I'm going to summarize in the first part of this video. After that I'll provide a lengthy deep dive into the incident, so make sure to stick around for that.</p>
<p>If you are new here, I make backend engineering videos and also cover software news, so make sure to like, comment, and subscribe if you would like to see more; it really helps the channel. Let's jump into it.</p>
<p>So this is an approximation of Slack's architecture based on what was described in the report. Clients connect to load balancers, load balancers distribute requests to backend servers, and backend servers in turn make requests to database servers, which run MySQL sharded with Vitess. All of these tiers are connected by routers across network boundaries.</p>
<p>Around 6 AM on Jan 4, the cross-network-boundary routers sitting between the load balancers and the backends, and between the backends and the databases, started to drop packets.</p>
<p>This led to the load balancers slowly marking backends as unhealthy and removing them from the fleet, which concentrated more and more requests on the remaining backends (a rough sketch of that health-check behavior follows the chapter list below).</p>
<p>The growing number of failed requests eventually triggered the provisioning service to start spinning up an absurdly large number of new backend servers.</p>
<p>However, the provisioning service couldn't keep up with the huge demand: it soon started to time out for the same networking reasons and eventually hit its limit on open file handles.</p>
<p>Eventually Slack's cloud provider increased the networking capacity, and the backend servers went back to normal around 11 AM PST.</p>
<p>That was a summary of the Slack outage. Now sit back, grab your favorite beverage, and let's go through the detailed incident report!</p>
<p>0:00 Outage Summary</p>
<p>2:00 Detailed Analysis Starts</p>
<p>5:20 The Root Cause</p>
<p>30:00 Corrective Actions</p>
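<p>As a quick aside on the health-check behavior mentioned above: here is a minimal sketch, in Python, of how a load balancer might mark backends unhealthy after a few consecutive failed probes. This is not Slack's actual load balancer logic; the hosts, port, and thresholds are invented for illustration. It just shows why network-wide packet loss can pull perfectly healthy backends out of the fleet.</p>
<pre><code>
import socket
import time

# Hypothetical values, not from the incident report.
BACKENDS = ["10.0.1.10", "10.0.1.11"]   # example backend IPs
PORT = 443
FAILURE_THRESHOLD = 3                    # consecutive failed probes before removal
PROBE_TIMEOUT = 2.0                      # seconds; dropped packets surface as timeouts

failures = {host: 0 for host in BACKENDS}
healthy = set(BACKENDS)

def probe(host: str) -> bool:
    """Attempt a TCP connection to the backend; treat any error or timeout as a failure."""
    try:
        with socket.create_connection((host, PORT), timeout=PROBE_TIMEOUT):
            return True
    except OSError:
        return False

while True:  # run forever, like a health-check daemon
    for host in BACKENDS:
        if probe(host):
            failures[host] = 0
            healthy.add(host)
        else:
            failures[host] += 1
            if failures[host] >= FAILURE_THRESHOLD and host in healthy:
                # Under network-wide packet loss, this removes backends that are
                # actually fine, shrinking the fleet and concentrating load on
                # whatever remains.
                healthy.discard(host)
                print(f"marking {host} unhealthy after {failures[host]} failed probes")
    time.sleep(5)
</code></pre>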