DevLife #3: Responding to a page while on-call
Outages happen. Here's a simple sequence to follow when troubleshooting them.
In a previous post, I wrote about why software engineers should join a team with on-call responsibilities. I then wrote about what you can do to prep for your first on-call shift. In this post, I’ll briefly go over what a typical on-call shift might look like.
Most software engineering teams combine on-call responsibilities with support. Hopefully, you'll spend the vast majority of your shift working support tickets and assisting other teams with their questions.
However, responding to alarms and outages takes precedence over all support activities. If you have to choose between keeping your service running and helping Don from R&D figure out why his proof-of-concept project can't access your service, always choose to keep the service running. Don deserves your attention, but he can wait.
Keep your computer close to you. If you go out, make sure you understand how quickly you are expected to respond to a page. If it's within minutes, you might need to find another engineer to cover for you while you're away.
Transient issues will page you, potentially often. A synthetic test might unluckily fail a few times in a row, or latency might spike for a few minutes before resolving itself. These things happen on the Internet. In general, you should respond to the page and make sure the issue gets resolved. If it is 2:00 am and the issue resolves itself, go back to bed and figure out the cause in the morning. Otherwise, it is worth checking whether you can determine what happened. Common causes of transient issues include:
Hardware failure. If this happens, your service should automatically spin up new virtual machines or containers on new hardware. However, your service might be under-provisioned during this time, which can lead to increased fault rates and latency.
Eventual consistency. Many workloads make changes that are eventually consistent, which can cause synthetic tests to fail occasionally. A common pattern for a synthetic test is 1) do something, 2) verify that it was done. If the "do something" step is eventually consistent, the verification step usually polls until the change becomes visible. However, you don't want to poll forever, so eventually the test has to time out. All of this can lead to you getting paged in the middle of the night (see the sketch after this list).
Deployments/updates. Maybe your service or one of your dependencies does not handle a deployment gracefully. For instance, your deployment system might take too many nodes out of service at a time, which can lead to increased fault rates and latency because of constrained resources.
Service provider disruption. Your cloud provider (such as AWS or GCP) or hosting provider might have a disruption of its own. If you're using AWS, maybe one of the Availability Zones where your service is partially deployed is impaired. Your workload should shift to a different zone, but in the process you will likely get paged.
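To make the eventual-consistency case concrete, here is a minimal sketch of the "do something, then poll until it is visible or give up" pattern in Python. The URLs, response format, and timeout values are hypothetical placeholders rather than any real service's API; the point is only that replication lag plus a finite timeout equals a failed test and, eventually, a page.

```python
import time
import urllib.error
import urllib.request

# Hypothetical endpoints -- substitute your own service's URLs.
WRITE_URL = "https://example.com/api/items"
READ_URL = "https://example.com/api/items/{item_id}"

POLL_INTERVAL_SECONDS = 2
POLL_TIMEOUT_SECONDS = 30  # after this, the synthetic test fails and may page you


def create_item() -> str:
    """Step 1: do something (create an item) and return its id."""
    request = urllib.request.Request(WRITE_URL, data=b"{}", method="POST")
    with urllib.request.urlopen(request, timeout=10) as response:
        # Assume the service returns the new item's id in the response body.
        return response.read().decode().strip()


def item_is_visible(item_id: str) -> bool:
    """Step 2: check whether the write is visible yet."""
    try:
        with urllib.request.urlopen(READ_URL.format(item_id=item_id), timeout=10):
            return True
    except urllib.error.URLError:
        return False  # not visible yet (eventual consistency) or genuinely broken


def run_synthetic_test() -> None:
    item_id = create_item()
    deadline = time.monotonic() + POLL_TIMEOUT_SECONDS
    while time.monotonic() < deadline:
        if item_is_visible(item_id):
            return  # test passed
        time.sleep(POLL_INTERVAL_SECONDS)
    # Replication was slower than the timeout: the test fails even though
    # nothing may really be wrong, and a few failures in a row can page you.
    raise TimeoutError(f"item {item_id} not visible after {POLL_TIMEOUT_SECONDS}s")


if __name__ == "__main__":
    run_synthetic_test()
```

One common mitigation is to require several consecutive failures before the test alarms, which filters out most of these blips.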
If a severe service disruption occurs and it isn't apparent how to make the service recover within several minutes, then you should start an outage bridge or join the one someone else created. As the on-call engineer, your primary focus is not to figure out why the service is disrupted, though sometimes that will be necessary or will become apparent as the outage goes on. Your primary focus is to make the service recover from the disruption.
Your troubleshooting sequence should vary depending on what your service does and what the disruption is. However, this is the sequence I usually follow, and it generally works well:
Check if the application itself is working by signing in and testing the disrupted part of the service (the first sketch after this list shows a quick way to probe an endpoint).
Go to your monitoring dashboard. Does anything obvious jump out? Do different metrics spike or dip around the same time?
Note that a massive traffic spike can indicate a denial-of-service attack. Most engineers are not equipped to deal with this alone. If you suspect a denial-of-service attack, engage your network engineers and your cloud provider.
Check the metrics that caused the alarm. Do they show signs of recovery?
Did your team recently deploy a change that correlates with the time the issue started surfacing? If so, begin the rollback process immediately if it is safe to do so. If you are unsure, page the engineer who made the change and have them perform the rollback. Remember, recovering is more important than figuring out precisely why the disruption is occurring.
Query the logs for ERROR entries. Is the frequency of ERROR logs higher than usual? Most logging systems display a histogram that should make this apparent (the second sketch after this list shows a rough local substitute).
Do the ERROR logs indicate that a database is disrupted? If so, contact your DBA (if you have one) to help troubleshoot. If you don't have a DBA, then it is likely your team's job to troubleshoot the database.
Is a downstream service disrupted? If so, make sure that team is represented on the outage bridge. If they recently deployed a change that correlates with the outage, they should roll it back. Make sure the incident manager on the bridge understands the impact the downstream service's disruption is having on your service. One way to assist that team is to provide them with the RequestIDs of requests to them that are failing.
If the ERROR logs indicate that a third-party service (like a payment system) is disrupted, engage their support immediately. Create a support ticket with all the necessary details, and get them on the outage bridge as well if possible.
Check the basic host metrics such as CPU, memory, and disk usage (the third sketch after this list shows a quick local check). Additionally, check how many VMs/containers are active for your application. Was there a recent scale up or down?
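For the first step, testing the disrupted part of the service yourself, something as simple as the following is often enough. This is a minimal sketch: the health-check URL is a placeholder, and your service may need authentication or a more realistic request than a bare GET.

```python
import urllib.error
import urllib.request

# Hypothetical endpoint for the disrupted part of the service.
HEALTH_URL = "https://example.com/healthz"


def probe(url: str, attempts: int = 5) -> None:
    """Hit the endpoint a few times and print what actually comes back."""
    for i in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                print(f"attempt {i + 1}: HTTP {response.status}")
        except urllib.error.HTTPError as err:
            print(f"attempt {i + 1}: HTTP {err.code}")  # e.g. 500s from your service
        except urllib.error.URLError as err:
            print(f"attempt {i + 1}: no response ({err.reason})")  # DNS, TLS, timeouts


probe(HEALTH_URL)
```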
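For the log check, your logging system's histogram is the right tool. If all you have is raw log files on a box, a rough substitute is to count ERROR lines per minute yourself. This sketch assumes plain-text logs whose lines begin with an ISO-8601 timestamp; adjust the pattern to whatever your logs actually look like.

```python
import collections
import re
import sys

# Count ERROR lines per minute from a plain-text log file. Assumes each line
# starts with a timestamp like "2024-05-01T02:13:45" -- adjust as needed.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})")


def error_histogram(path: str) -> None:
    per_minute = collections.Counter()
    with open(path, errors="replace") as log:
        for line in log:
            if "ERROR" not in line:
                continue
            match = TIMESTAMP.match(line)
            if match:
                per_minute[match.group(1)] += 1
    for minute, count in sorted(per_minute.items()):
        print(f"{minute}  {'#' * min(count, 80)}  ({count})")


if __name__ == "__main__":
    error_histogram(sys.argv[1])
```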
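For the basic host metrics, your dashboards should already show CPU, memory, and disk, but if you are logged into a box and want a quick local snapshot, the third-party psutil package (an assumption on my part; install it with pip) makes it a few lines:

```python
# Requires the third-party psutil package (pip install psutil).
import psutil


def print_host_metrics() -> None:
    """Snapshot of the basic resource metrics on the box you are logged into."""
    cpu = psutil.cpu_percent(interval=1)      # percent CPU over a 1-second sample
    memory = psutil.virtual_memory().percent  # percent of RAM in use
    disk = psutil.disk_usage("/").percent     # percent of the root volume used
    print(f"cpu={cpu}%  memory={memory}%  disk={disk}%")


if __name__ == "__main__":
    print_host_metrics()
```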
At this point, if a path for recovering from the disruption is not apparent, it is time to bring in other engineers from your team, or even other teams, to help troubleshoot.
If your service is completely disrupted and there’s not an apparent fix, then it might be time to enter what I call the danger zone. These are actions that you generally should NOT do as they might disrupt your service, but your service is already disrupted so it is slightly “safer” to do them. You shouldn’t do them alone. Make sure another engineer or manager has eyes on the actions. Here are some ideas:
Replace all the existing VMs/containers with new ones (for stateless services). This can usually be done fairly safely: modern systems will drain existing connections and roll out the new nodes before the existing ones shut down (see the sketch after this list). This can fix your service because:
The underlying hardware might be disrupted in a somewhat invisible way. Replacing the nodes gives you a strong chance that most of the new ones land on different hardware.
Restart the machines (for stateful services). You probably can't easily just replace the nodes your database is running on. Instead, you can restart the machines. If you believe a full restart isn't safe, consider restarting just the services running on them.
Touch the cloud infrastructure. One of the worst outages I have ever been a part of was the result of a cloud-managed load balancer being partially deployed on bad hardware. We never figured this out during the incident, but it became apparent later on. We got lucky: I recommended we change a configuration on the load balancer (without knowing the LB was the problem), and that solved the issue. Why? Not because of the configuration change at all! It was because the LB was moved to new hardware during the process.
Emergency hot fix. Consider pushing out a code change that adds more logging or metrics that you believe might make the issue more apparent. Or, if a developer has a hunch about a fix, consider pushing that out.
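As an illustration of the first danger-zone idea, here is one way the node replacement might look if the stateless service happens to run in an EC2 Auto Scaling group. This is a sketch under those assumptions: boto3 must be installed and credentials configured, and the group name is a placeholder. An instance refresh replaces instances in batches while keeping a minimum healthy percentage in service, which matches the drain-and-replace behavior described above.

```python
import boto3

# Assumption: the stateless service runs in an EC2 Auto Scaling group named
# "my-service-asg". An instance refresh rolls the fleet onto new instances.
autoscaling = boto3.client("autoscaling")

response = autoscaling.start_instance_refresh(
    AutoScalingGroupName="my-service-asg",
    Strategy="Rolling",
    Preferences={
        "MinHealthyPercentage": 90,  # keep at least 90% of capacity serving traffic
        "InstanceWarmup": 300,       # seconds to let each new node settle before continuing
    },
)
print("instance refresh started:", response["InstanceRefreshId"])
```

On Kubernetes, a rollout restart of the deployment achieves a similar rolling replacement of pods, though it only lands on new hardware if the underlying nodes are replaced too.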
There’s no denying that outages are stressful. I hope some of these tips help you deal with them.