Recipe for Reliability

By anders pearson 30 Jan 2023

I thought I’d share my “simple” recipe for building reliable applications. Of course “simple” doesn’t mean “easy”, but I think this is the core philosophy behind how I approach developing reliable software.

Here is the process:

  1. A developer writes some code to make one improvement to the application.
  2. That developer deploys that code to production.
  3. That developer observes that new code in production to verify that it did what they expected and didn’t cause other problems.
  4. That developer takes what they’ve learned in step 3 and goes back to step 1.

That’s it.

Anything that deviates from that feedback loop or slows it down will ultimately make a system less reliable and more prone to incidents and outages.

One obvious caveat on this process is that the developer involved has to actually care about the reliability of the system. I’m kind of taking that as a given, though. I guess it’s possible for a developer not to care about reliability, but every developer I’ve ever met has at least not actually wanted to cause an outage.

Outages and incidents usually come from more subtle deviations from that process.

One very common mistake that we, as a profession, often make is not having the same developer who wrote the code do the deployment and the observation. That often comes with a related mistake: developing or deploying more than one change at once.

A commit or PR should have a single purpose. The more different things you try to do at once, the harder the code/PR will be to understand, the more likely it is that errors will be introduced, the harder it will be to thoroughly observe the behavior of all the different changes once they’re in production to verify that they’re actually correct, and the harder it will be to connect something learned in that process back to the actual cause and use that to inform the next improvement. If you’ve spent any time debugging code, this should be intuitive. Writing a thousand lines of code implementing a dozen different features before running anything pretty much guarantees a painful debugging session and missed bugs.

Implementing a single feature, or even just a single branch or logical part of a feature, and testing it before moving on to the next makes it much faster to locate the source of a problem and gives you more confidence that each line of code does what you think it does. I think most experienced developers recognize that having a single clear purpose is important not just for the unit of work (the PR) but for the unit of code as well (function, class, module, service, etc.). The more responsibilities a single piece of code has, the more likely it is to have bugs and the harder it is to understand and to work on.
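To make that concrete, here’s a tiny Python sketch. The functions and the db/mailer/metrics objects are all invented for illustration; the point is just the contrast between one unit of code with several responsibilities and the same behavior split into single-purpose pieces that can each be changed, deployed, and verified on their own.

```python
# Hypothetical example: one function doing several unrelated things.
# Harder to change, test, and observe in isolation.
def register_user_and_notify_and_report(db, mailer, metrics, form):
    user = db.insert("users", form)          # persistence
    mailer.send(user["email"], "Welcome!")   # notification
    metrics.incr("signups")                  # reporting
    return user


# The same behavior split into single-purpose pieces. Each one can be
# written, deployed, and verified in production on its own.
def register_user(db, form):
    return db.insert("users", form)


def send_welcome_email(mailer, user):
    mailer.send(user["email"], "Welcome!")


def record_signup(metrics):
    metrics.incr("signups")
```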

Google’s CL author’s guide agrees on that point:

The CL makes a minimal change that addresses just one thing. This is usually just one part of a feature, rather than a whole feature at once. In general it’s better to err on the side of writing CLs that are too small vs. CLs that are too large.

The other part of that is probably a more common mistake, but it’s related: not having the same developer do all of the steps. If Developer A writes some code and then Developer B (or worse, Ops Person B) deploys it, B is usually not in nearly as good a position as A to properly check that it’s working as expected. It’s not always the case, but usually A, having spent the time writing the code, has a better idea of what it’s supposed to do, what it should look like if it’s working correctly, and which edge cases are most likely to cause problems.

These two are commonly connected. Multiple developers each working on separate improvements get their code bundled together and all deployed in a batch. That pretty much always means that there are multiple changes deployed at the same time, which again makes it harder to reason about problems and interactions, creates more surface area that needs to be checked, and when a problem is found, makes it harder to trace to one change and makes it more complicated to revert just that change.

I’ve occasionally mentioned that I’m not really a fan of “staging” environments as they are often used. There are advantages to having a staging site, but the downside is that it often becomes a chokepoint where multiple changes get bundled together and later deployed to production together, triggering both of the problems above. I’ve seen many production incidents that started with a bunch of different changes sitting on staging, each verified there to a different degree. Those all got merged together and deployed to production, and the developer doing the merge and the developer doing the deploy (probably different people) didn’t have a full understanding of all the changes or how to verify them after the deploy. Unfortunately, this is a very common failure mode. There are legitimate uses for a staging environment, but I think they are often overused, and their downsides need to be weighed, especially when they form this kind of chokepoint.

You may have noticed that “testing” isn’t one of the steps in the process before deploying to production. There are a couple reasons for that.

First, I consider automated tests to be part of both the “write the code” and “deploy the code” steps. The process of writing code should almost always involve running tests locally. Really, a good test-driven development workflow is just a mini version of the whole process above, minus the deploy step: you implement a single small piece of functionality at a time, verify its behavior as best you can, and repeat. Step 2 of the process, “That developer deploys that code to production,” doesn’t mean that the developer manually copies code out to the production servers; it means that they initiate an automated deployment pipeline, either by clicking a “deploy” button somewhere or by merging a PR that triggers the deploy. The deployment pipeline should run tests, possibly at multiple levels (unit tests, integration tests, post-deployment smoke tests), and fail as soon as any of them do.
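As one concrete example of that last level, here’s a rough sketch of what a post-deployment smoke test could look like in Python. The URL, the /healthz endpoint, and the expected version string are made up for illustration; the real check would depend entirely on your application.

```python
import json
import sys
import urllib.request

# Hypothetical values -- substitute your own service URL and release id.
HEALTH_URL = "https://example.com/healthz"
EXPECTED_VERSION = "2023-01-30.1"


def smoke_test() -> bool:
    """Hit the health endpoint and confirm the new release is serving traffic."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            if resp.status != 200:
                print(f"unexpected status: {resp.status}")
                return False
            body = json.load(resp)
    except Exception as exc:  # network errors, timeouts, bad JSON, ...
        print(f"smoke test failed: {exc}")
        return False

    # Assumes the health endpoint reports the deployed version.
    if body.get("version") != EXPECTED_VERSION:
        print(f"wrong version serving: {body.get('version')!r}")
        return False
    return True


if __name__ == "__main__":
    # A non-zero exit code is what lets the deploy pipeline fail fast.
    sys.exit(0 if smoke_test() else 1)
```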

A more controversial reason I didn’t explicitly include a testing step is that, while I love tests, I actually don’t think they’re as directly important to the reliability of a site as you might expect. Good automated tests allow a developer who cares about the reliability of the site to make changes to the code and verify those changes more rapidly. They allow the deploy step to run more safely (so less time is spent debugging or fixing broken deploys). My experience is that if a hypothetical Team A writes no tests at all but otherwise cares a lot about site reliability and follows the process above, the result will be a more reliable site than Team B’s, who write a lot of tests but implement large changes, don’t deploy those changes individually, deploy infrequently, and don’t focus on observing the changes in production. Team A might start off worse, but they’ll learn a lot more, have a deeper understanding of how their code actually runs in production, and be able to build something that’s more reliable in the long run.

Steps 3 and 4, where the developer who implemented a change closely observes that code in production and learns from it, are perhaps the key to the whole approach. This is why I tend to put so much emphasis on metrics, monitoring, logging, and observability tools. You usually can’t just see inside a running process in production, so you have to have tools in place to collect and process useful data. This is also why, while I can put a lot of those tools in place, at some point the developers writing the code need to pick them up and use them. They are the ones in the best position to know what data will help them verify that a feature is working like they expect, or to understand what’s happening when it behaves differently in production than they expected. The developers will have assumptions about how often a given function is called, how long a database call should take to execute, which calls might fail, what sorts of values will be passed as parameters to a function, etc. Production traffic and end-user behavior often prove our assumptions wrong in startling ways. Uncovering those wrong assumptions quickly and correcting them is key to making a reliable site. One of the best things a developer can do is to cultivate a deep curiosity about what their code is really doing in production.
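Here’s a small sketch of the kind of instrumentation I mean, using the prometheus_client Python library. The metric names and the checkout_total function are invented for the example; the assumptions baked into the comments (how often it’s called, how long the query takes) are exactly the sort of thing production traffic tends to contradict.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Invented metric and function names, purely for illustration.
CHECKOUT_CALLS = Counter(
    "checkout_total_calls", "How often checkout_total() is actually called"
)
DB_QUERY_SECONDS = Histogram(
    "checkout_db_query_seconds", "How long the pricing query really takes"
)


def checkout_total(cart_id: str) -> float:
    CHECKOUT_CALLS.inc()  # assumption to check: "this only runs a few times a minute"
    with DB_QUERY_SECONDS.time():  # assumption to check: "the query is always fast"
        return fetch_prices_from_db(cart_id)


def fetch_prices_from_db(cart_id: str) -> float:
    # Stand-in for a real database call.
    time.sleep(random.uniform(0.01, 0.2))
    return 42.0


if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    while True:
        checkout_total("demo-cart")
```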

It’s also important to keep in mind that success is greatly affected by the speed of the whole process. Each time you go through it, you should learn something. The more times you go through it, the more you learn and the more you can improve.

If the process, or steps in the process, are slow or difficult, that limits how many times you can go through the cycle. A large codebase that takes a long time to navigate and slow tests make the development step slower. Slow deployment pipelines obviously make the deploy step take longer, but so does not having zero-downtime deployments, which forces you to deploy only during certain infrequent windows (and again makes it much more likely that you’ll end up deploying multiple changes at the same time). Not having good observability tooling makes it slower to verify the change in production. Besides allowing fewer iterations, any slowness in the process also reduces reliability directly: the more time that passes between writing code and observing it in production, the harder it is for the developer to remember the full context of the change and properly verify it.

We often have additional steps in our workflow that serve purposes other than reliability and must be there, but we need to minimize their impact on the overall process. E.g., if you have compliance requirements (SOC 2, PCI, ISO, etc.), you probably have to have code reviews for security. Even without compliance requirements, code reviews are good (though I would argue that their importance has always been less about catching bugs or improving reliability and more about making sure other developers are aware of and understand the changes, and about maintaining common standards for the codebase). But it’s very important that turnaround time on code reviews is kept short to avoid slowing down the entire process (and, of course, it’s equally important to keep PRs small and single-purpose so reviewers can do meaningful reviews quickly). It’s also important to lean on automation as much as possible to keep that part of the process fast and efficient.

Finally, it’s also worth mentioning that the importance of this process isn’t limited to application code. When I’m working on infrastructure, my goal is to go through this whole cycle with Terraform config, Ansible roles, etc.

This post has contained a lot of my opinions, and I expect that not everyone will agree with me on many of them. I will note, though, that what I advocate for is pretty close to the recommendations you will see in, e.g., Charity Majors’ Twitter feed, Dave Farley’s Modern Software Engineering, or the DORA Report, which studies software development at a large scale. In particular, DORA is interested in what makes a “high functioning organization”, and the four key metrics they’ve found to be the most reliable predictors are:

  1. Lead time for a change: how long it takes to get from an idea to that change running in production and available to customers (shorter is better).
  2. Deploy frequency: how many times per day/month/etc. you deploy or release to end users (more often is better).
  3. Time to restore: if there’s an outage, how long it typically takes to fix (shorter is better).
  4. Change fail percentage: what percentage of your changes/releases resulted in an incident (lower is better).

The first two of those are obviously directly related to the approach I describe, and I think an argument could be made that it helps with the latter two as well.
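For what it’s worth, those four metrics are simple enough to compute from your own deploy history. Here’s a rough Python sketch; the Deploy record format is invented for the example, and it assumes you can get, for each deploy, when its change was committed, when it shipped, whether it caused an incident, and when that incident was resolved.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


# Invented record format, just to make the four definitions concrete.
@dataclass
class Deploy:
    change_committed: datetime           # when the change was first committed
    deployed: datetime                   # when it reached production
    caused_incident: bool = False
    restored: Optional[datetime] = None  # when the resulting incident was resolved


def dora_metrics(deploys: List[Deploy], window_days: int = 30) -> dict:
    if not deploys:
        raise ValueError("need at least one deploy record")

    lead_times = [d.deployed - d.change_committed for d in deploys]
    failures = [d for d in deploys if d.caused_incident]
    restore_times = [d.restored - d.deployed for d in failures if d.restored]

    return {
        # 1) lead time for a change (shorter is better) -- upper median here
        "median_lead_time": sorted(lead_times)[len(lead_times) // 2],
        # 2) deploy frequency (more often is better)
        "deploys_per_day": len(deploys) / window_days,
        # 3) time to restore (shorter is better)
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times)
            if restore_times else None
        ),
        # 4) change fail percentage (lower is better)
        "change_fail_pct": 100.0 * len(failures) / len(deploys),
    }
```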

See also: my post about Ratchets.

Tags: reliability software