DevLife #1: Software Engineers should be on-call
The benefits of on-call outweigh the frustrations
My first four jobs as a software engineer had no on-call responsibilities. The whole concept of being an on-call engineer was foreign to me until I joined an IT/Cloud Operations organization that was responsible for the company’s primary SaaS offering.
Most companies do not have on-call responsibilities for their software engineers. Development teams and operations teams are often intentionally kept separate. However, top-performing software companies do tend to have on-call responsibilities for software engineers. This is often the case even at companies, like Google, that also have dedicated SRE teams.
If you’ve never been on-call and have no plans to join a team with on-call responsibilities, you might be tempted to think this post isn’t for you. However, this post is precisely for you. I hope to motivate you to join an on-call team or appreciate the one you are on and what it does for you.
Let’s distinguish between support shift and on-call. While most software engineering teams do not have on-call, many of them do have support shifts. When assigned to support, the engineer is responsible for working tickets issued to their team from other teams. The difference between a support shift and on-call is that support shifts do not page engineers in the middle of the night when the service is down.
You should consider joining an on-call/true DevOps team for these reasons.
On-call software engineers write more operationally efficient code. The working relationship between an operations org and a development org often ends up in a vicious cycle. Operations teams are frustrated because they are constantly up all night dealing with poorly planned upgrades and outages, and development teams are frustrated at being pestered with requests from ops.
A software engineering team that is responsible for all of their on-call responsibilities is a true DevOps team. The engineers on these teams write more operationally efficient code simply because they don’t want to be woken up in the middle of the night to respond to a page.
Tips for writing operationally efficient code will be covered in a later post.
On-call will improve your troubleshooting skills. This will be the topic of a future post, but the engineers who wrote the code are naturally the best equipped to troubleshoot it when things go wrong. Beyond that, on-call will expose you to metrics, alarms, monitoring, log diving, etc.
On-call gives you exposure throughout your organization via the outage bridge. If you’re an entry- or mid-level engineer, then the outage bridge could be the primary two-way interaction you have with your upper management. Furthermore, it is likely that people from adjacent teams will be on the bridge as well. This probably sounds scary (and it can be), but it is a huge opportunity. If you can be the engineer who people depend on during these bridges, then that could very well lead to advancement or improved opportunities.
It could even make you less likely to be laid off or fired. Specific layoff decisions are often made by management in a group setting. If your name comes up, management probably isn’t going to cut one of the people best at keeping the service running, even if you’re one of the weaker coders on the team. A different manager might even vouch for you solely because of that. [1]
Top software companies tend to have on-call rotations for software engineers. The reason is that putting engineers on-call (true DevOps) is considered a best practice, for many of the reasons in this post. Top software organizations and startups tend to pay well, so it is worth becoming comfortable with on-call responsibilities if you wish to join one.
On-call makes software engineers more familiar with IT concepts tangential to coding. I’m not going to lie: before joining on-call teams, I had very little exposure to many IT concepts and tools that I now consider vital to my career. Practical computer networking, deployment automation, load balancers, cloud, Linux, database optimization, configuration management, firewalls, etc., were things that I had only superficial knowledge of. Understanding the environment my software runs in greatly affects the way I write code.
On-call isn’t for everyone. I’m not sure I’ve ever met anyone who loves being on-call. Getting paged at 2:00 am is only the start of the pain. On-call makes our profession more similar to that of an ER nurse or doctor trying to quickly fix a problem without making it worse in the process (except that people’s lives are usually not on the line in our profession). Outage bridges are undeniably stressful, especially when your leadership joins and asks difficult questions. Fixing a live production service disruption that you caused is more humbling than receiving a bug ticket the following morning when your Ops counterpart runs into it. Maybe you have a strict sleep schedule that can’t be disrupted.
There are ways to alleviate some of these hardships that I’ll cover in a future post.
[1] I have seen this happen firsthand. A manager wanted to cut an employee, but other leaders didn’t allow it because “he was very helpful on bridges”.