In a previous post, I discussed why software engineers should seek to be on a team with on-call responsibilities. In this post, I’ll help you prepare for your first on-call shift. In a later post, I’ll discuss “what to do” during your first on-call shift.
If you’re reading this post, you’re probably about to start your first on-call shift, or perhaps your last on-call shift was rough. Maybe you’ve done on-call before but have never felt comfortable working those shifts. You’ve come to the right place. If you follow these tips, you will be prepared for your next on-call shift!
Here’s what you need to do before your first shift:
Review previous incident/outage logs. Most companies have their on-call engineers and incident managers write incident logs or an incident report whenever a major service disruption happens. Read through several of these. How did the on-call engineers solve the problem? How long did it take the service to recover? What was determined to be the root cause? Did your service come into play? Who was most responsible for correcting the issue? Which services tend to be “problematic”?
Perhaps your company doesn’t officially collect these logs. If that is the case, see if you can find the chat thread on Slack where it was discussed.
Talk to engineers involved in solving outages. Oftentimes, the incident/outage logs don’t do a great job of spelling out how the engineer who solved the issue figured it out. So talk directly to the engineer who solved it! The vast majority of engineers will be happy to share this information with you because they will want others to be able to solve such problems. Consider asking these types of questions:
How did you solve the latest outage?
What are the most common types of outages at our company?
What are the most important tools you use for diagnosing the service during an outage?
What are the company’s expectations for escalating or starting a bridge?
Do you have any general tips/advice?
As with anything, make sure you thank them for their time.
Shadow an on-call engineer. Whenever an incident/outage strikes, ask the current on-call engineer to immediately bring you into a call so that you can watch how they diagnose and resolve the issue. Hopefully, they’ll be able to explain what they are doing as they do it, but keep in mind that outages can be stressful so they might not be able to do that. That’s OK! Take what you can get. Keep the following in mind as you observe:
How were they notified that an incident had happened? Was it an alarm or something else?
Which tools do they open first? Do you have access to those tools? You should.
What do their Web browser bookmarks for on-call look like? Would it be useful to copy their structure?
At what point do they open an outage bridge?
Join a few outage bridges. An outage bridge is a conference call where on-call engineers and incident managers meet to attempt to quickly recover from an outage (and other things, but those are beyond the scope of this post). Generally speaking, only people actively involved in correcting the outage should join an outage bridge. You don’t want too many cooks in the kitchen. However, joining for the purpose of learning how the process works is a valid exception. Join a bridge and just listen. Join another bridge and see if you can find the root cause before the other people on the bridge do.[1] Don’t feel the need to speak unless you have something useful to say.
Know your Tools
Your tools assist you in troubleshooting an issue. Know what you have available to you before you begin your first shift.
Logs
Logs are human-readable messages that your service emits whenever it does something. Logs are generally categorized by levels like INFO, WARN, and ERROR. INFO is something you expect to happen. WARN is something that might be a problem, but might not be. ERROR means that something went wrong.
Usually, ERROR logs are the most helpful logs to look at during an outage. Before you begin your on-call shift, ensure that you know how to search for ERROR logs within a specific time window. Make sure you are comfortable working with dates/times in UTC and a 24-hour time format.
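To make the levels concrete, here is a minimal Python sketch of emitting logs at each level with UTC timestamps in a 24-hour format. It is my own illustration and not tied to any particular logging stack:

```python
import logging
import time

# Format timestamps in UTC so they line up with your log management system.
logging.Formatter.converter = time.gmtime
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)sZ %(levelname)s %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",  # 24-hour clock
)

log = logging.getLogger("payments")
log.info("charge succeeded")               # expected behavior
log.warning("retrying charge once")        # maybe concerning, maybe not
log.error("charge failed after retries")   # something went wrong
```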
You should hopefully have a log management system to help you search the logs. Most log management systems use SQL or a SQL-like syntax to query logs. You should study up on this. Put together a “querying cookbook” with example snippets of how to search your logs. Amazon CloudWatch Logs Insights has a good example here. You don’t want your first time querying the log management system to be during an outage!
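If your logs happen to live in CloudWatch, a cookbook entry might look like the following sketch, which uses boto3 to run a Logs Insights query for the last hour of ERROR logs. The log group name is a placeholder; substitute your own:

```python
import time
import boto3  # assumes AWS credentials are already configured

logs = boto3.client("logs")

# Search the last hour of ERROR logs. "/my-service/application" is a placeholder.
query_id = logs.start_query(
    logGroupName="/my-service/application",
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString="""
        fields @timestamp, @message
        | filter @message like /ERROR/
        | sort @timestamp desc
        | limit 50
    """,
)["queryId"]

# Poll until the query finishes (fine for a cookbook, not production-grade).
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] == "Complete":
        break
    time.sleep(1)

for row in result["results"]:
    print({field["field"]: field["value"] for field in row})
```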
RequestID
Logs should always include a RequestID. After you’ve searched for ERROR logs, look at the RequestID for one of them and query the log system to find all the logs associated with that RequestID. This will show you what the system actually did before it failed.
Make sure you know how to do this before your on-call shift begins. This will often lead you to the cause.
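Continuing the hypothetical cookbook above, a follow-the-request entry only needs a different query string. The requestId field name is an assumption about your log schema; use whatever your service actually emits:

```python
# Reuses the start_query / get_query_results pattern from the previous sketch.
# "requestId" is an assumed field name; "0f3a9c2e-example" is a made-up ID.
query_string = """
    fields @timestamp, @logStream, @message
    | filter requestId = "0f3a9c2e-example"
    | sort @timestamp asc
"""
```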
The Source Code
If you’re a software engineer, then you obviously have the source code for your services available to you. Once you’ve looked at some logs, check in your code at the entry point and follow the call stack all the way down (or up, depending on how you visualize a stack in your mind). Are there any logs that should be emitted but aren’t? Do you perhaps see where you’re not catching an evil null pointer exception?
If you’re an operations, DevOps, or Site Reliability Engineer, then you might not have the source code readily available. I don’t agree with this practice, but many companies believe that, for compliance and security reasons, it is the proper approach.
Metrics
Whereas logs are human-readable messages, metrics are numbers that can be charted and monitored. In operations, we have what are known as the four golden signals, which are some of the most important metrics. I’ll cover three of them here:
Latency measures how long it takes requests to complete. If you see that your requests are taking a long time, then that is cause for concern!
Traffic measures how many requests your service receives. There are legitimate spikes in traffic (like all of your users waking up and using your service in the morning) and nefarious spikes (like a DDoS attack). However, even legitimate spikes in traffic can take down your service.
Fault rate measures how often requests fail for unexpected reasons. This is distinct from requests failing for acceptable reasons, like a caller passing incorrect parameters. Any fault is concerning. Faults are also known as errors or 500s.
Other metrics to collect are CPU, Memory, Disk, and VM/Instance usage. This is especially important if your application is running in a data center or on legacy hardware. However, in modern systems these are generally the last place to look when an issue occurs. This is because modern systems are designed for hardware failure and will automatically terminate VMs/Containers that are misbehaving.
Before beginning your on-call shift, make sure you know how to view Latency, Traffic, and Faults for all of your services. Ideally, this information should be on a Dashboard.
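As an illustration, here is one way a request handler could record all three signals. This sketch assumes CloudWatch and boto3; the MyService namespace and the handler wiring are my own inventions:

```python
import time
import boto3

cloudwatch = boto3.client("cloudwatch")

def handle_request(request, handler):
    """Wrap a handler so every request records Latency, Traffic, and Faults."""
    start = time.monotonic()
    fault = 0
    try:
        return handler(request)
    except Exception:
        fault = 1  # an unexpected failure, i.e. a fault / 500
        raise
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        cloudwatch.put_metric_data(
            Namespace="MyService",  # placeholder namespace
            MetricData=[
                {"MetricName": "Latency", "Value": elapsed_ms, "Unit": "Milliseconds"},
                {"MetricName": "Traffic", "Value": 1, "Unit": "Count"},
                {"MetricName": "Faults", "Value": fault, "Unit": "Count"},
            ],
        )
```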
Alarms
In the on-call world, the worst way to be notified of an incident or outage is by one of your customers telling you about it. This is why operations teams use alarms.
The metrics we covered in the previous section are very useful in and of themselves. However, they become much more useful if you hook up alarms to them when they cross a threshold. For instance, if you know that your service should only take 50 milliseconds to complete a request on average, then you should set an alarm to go off if the average latency stays above 60 milliseconds for a few minutes.
Make sure your alarms do not generate many false positives. In other words, don’t set them up to page your team for a blip. An exception to this could be if you have 24 hour coverage across many time zones, but even then you want the on-call to focus on actual problems and not transient issues.
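Here is a rough sketch of the 60-millisecond latency alarm described above, again assuming CloudWatch. The alarm name and SNS topic ARN are placeholders. Note how EvaluationPeriods keeps a single blip from paging anyone:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Page only if average latency stays above 60 ms for 5 consecutive minutes.
cloudwatch.put_metric_alarm(
    AlarmName="MyService-HighLatency",
    Namespace="MyService",
    MetricName="Latency",
    Statistic="Average",
    Period=60,             # evaluate one-minute buckets...
    EvaluationPeriods=5,   # ...and require 5 breaches in a row
    Threshold=60.0,        # milliseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:page-oncall"],  # placeholder ARN
)
```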
Synthetic Tests
These test your service the same way your consumers do, and they do so continuously, typically every minute. For a Web application, synthetic tests simulate user clicks to browse the Web application and perform actions like logging in or purchasing an item. For an API, synthetic tests simply call the API and verify the results.
Synthetic tests should be hooked up to alarms. If a synthetic test fails (or fails a certain number of times above a threshold), the alarm should page the on-call.
Note that many services/teams do not have synthetic tests that run continuously. If your team doesn’t have them, then I highly recommend implementing them.
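A synthetic test does not have to be fancy. Here is a minimal sketch of one for an API, using Python’s requests library; the endpoint URL and the response shape are assumptions:

```python
import time
import requests

HEALTH_URL = "https://api.example.com/v1/health"  # hypothetical endpoint

def synthetic_check() -> bool:
    """Call the API the way a real consumer would and verify the result."""
    try:
        response = requests.get(HEALTH_URL, timeout=5)
        return response.status_code == 200 and response.json().get("status") == "ok"
    except requests.RequestException:
        return False

# Run continuously, once a minute. In practice a scheduler or a managed
# canary service would own this loop and emit a metric on failure.
while True:
    if not synthetic_check():
        print("synthetic test failed")  # in production, emit a metric that alarms
    time.sleep(60)
```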
Paging System
Alarms should trigger a paging system that pages the on-call. Additionally, the paging system allows you to page your co-workers if you need assistance. Common paging systems include PagerDuty and Opsgenie.
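Most paging systems also expose an API, so alarms and scripts can trigger pages programmatically. As one example, here is a minimal sketch using PagerDuty’s Events API v2; the routing key is a placeholder you would get from your PagerDuty service integration:

```python
import requests

# Trigger a page through PagerDuty's Events API v2.
requests.post(
    "https://events.pagerduty.com/v2/enqueue",
    json={
        "routing_key": "YOUR_INTEGRATION_ROUTING_KEY",  # placeholder
        "event_action": "trigger",
        "payload": {
            "summary": "Latency alarm: MyService average above 60 ms",
            "source": "my-service-monitoring",
            "severity": "critical",
        },
    },
    timeout=5,
)
```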
Traces
Traces are a more advanced topic for on-call and will be covered more in-depth in a future post.
Your Application!
One of the first things I do when I get a page is log in to my application and see if I can reproduce the issue. For instance, if I am alarmed for high latency, I log in and check whether the application feels slow. If a certain feature is alarmed because it supposedly does not work, then I verify that myself.
The reason to do this is that sometimes the issue isn’t with your application. Instead, the issue is with the monitoring system! A non-working application that you have verified yourself is also a strong signal to start an outage bridge.
Dashboard
A dashboard displays your most useful metrics, logs, and alarms all in one place. If an alarm is triggered, that should be apparent on your dashboard (usually it will be red and marked with an X instead of a green checkmark). So, make sure you know how to access your dashboards before starting a bridge.
If your team does not have a dashboard, you can create what I would call a “makeshift dash” by having handy links to each of your important metrics and alarms organized using bookmarks. Here’s a good bookmark hierarchy:
Dashboard/
    ServiceA/
        ApplicationURL
        Latency
        Traffic
        Errors
        Logs
    ServiceB/
        …repeat
Chances are that other team members who do on-call already have this set up in their bookmarks. Ask them if they can export their bookmarks!
Dashboards are also useful for correlating data. For instance, let’s say your dashboard contains time series charts for both request latency and traffic (number of requests). You get alarmed for request latency. You check the dashboard and see that both latency and traffic spiked at the same time, which points to the traffic spike as the likely cause and immediately narrows your investigation.
Chat Software
If your company discusses outages in a chat product like Slack, go back a few months and read every message up until the current date. There might be an #incidents channel or something similar that you can check out. You’ll likely learn a few things from this.
Know your architecture and infrastructure
Is your application running in the Cloud or in a data center? If it is in the cloud, which public cloud are you using? If there is an issue with the underlying cloud services or data center, are you responsible for engaging the provider/vendor, or is there a dedicated team for that? Does your team manage your application on Virtual Machines? Is your application containerized? Is it running on a Serverless compute engine like AWS Lambda?
Database
Are you using a Relational Database or a NoSQL database or both? Do you have database administrators that can assist during on-call? If there is a database issue, do you have metrics or logs that would indicate that?
Dependencies
Do your services interact with other services owned by other teams? Those are your dependencies! Make sure you know how to contact/page the other teams in case an outage is caused by them. Additionally, if your service is called by other teams, make sure they know how to contact you.
Do your services interact with 3rd parties that are in your application’s critical path? A good example of a common, critical 3rd party dependency is using a service like Stripe to handle payments. If so, make sure you know how to contact their Support Team in case they bring your service down.
Good luck!
On-call can be rough. However, going into your shift prepared makes it less painful. I hope this post was helpful for you!
[1] An engineer on one of my teams joined an outage bridge within a few weeks of joining the company. He found the issue before anyone else did.