By anders pearson 23 Jul 2015
When you’re building systems that combine other systems, one of the big lessons you learn is that failures can cause chain reactions. If you’re not careful, one small piece of the overall system failing can cause catastrophic global failure.
Even worse, one of the main mechanisms for these chain reactions is a well-intentioned attempt by one system to cope with the failure of another.
Imagine Systems A and B, where A relies on B. System A needs to handle 100 requests per second, which come in at random, but normal intervals. Each of those implies a request to System B. So an average of one request every 10ms for each of them.
System A may decide that if a request to System B fails, it will just repeatedly retry the request until it succeeds. This might work fine if System B has a very brief outage. But then it goes down for longer and needs to be completely restarted, or is just becomes completely unresponsive until someone steps in and manually fixes it. Let’s say that it’s down for a full second. When it finally comes back up, it now has to deal with 100 requests all coming in at the same time, since System A has been piling them up. That usually doesn’t end well, making System B run extremely slowly or even crash again, starting the whole cycle over and over.
Meanwhile, since System A is having to wait unusually long for a successful response from B, it’s chewing up resources as well and is more likely to crash or start swapping. And of course, anything that relies on A is then affected and the failure propagates.
Going all the way the other way, with A never retrying and instead immediately noting the failure and passing it back is a little better, but still not ideal. It’s still hitting B every time while B is having trouble, which probably isn’t helping B’s situation. Somewhere back up the chain, a component that calls system A might be retrying, and effectively hammers B the same as in the previous scenario, just using A as the proxy.
A common pattern for dealing effectively with this kind of problem is the “circuit breaker”. I first read about it in Michael Nygaard’s book Release It! and Martin Fowler has a nice in-depth description of it.
As the name of the pattern should make clear, a circuit breaker is designed to detect a fault and immediately halt all traffic to the faulty component for a period of time, to give it a little breathing room to recover and prevent cascading failures. Typically, you set a threshold of errors and, when that threshold is crossed, the circuit breaker “trips” and no more requests are sent for a short period of time. Any clients get an immediate error response from the circuit breaker, but the ailing component is left alone. After that short period of time, it tentatively allows requests again, but will trip again if it sees any failures, this time blocking requests for a longer period. Ideally, these time periods of blocked requests will follow an exponential backoff, to minimize downtime while still easing the load as much as possible on the failed component.
Implementing a circuit breaker around whatever external service you’re using usually isn’t terribly hard and the programming language you are using probably already has a library or two available to help. Eg, in Go, there’s this nice one from Scott Barron at Github and this one which is inspired by Hystrix from Netflix, which includes circuit breaker functionality. Recently, Heroku released a nice Ruby circuit breaker implementation.
That’s great if you’re writing everything yourself, but sometimes you have to deal with components that you either don’t have the source code to or just can’t really modify easily. To help with those situations, I wrote cbp, which is a basic TCP proxy with circuit breaker functionality built in.
You tell it what port to listen on, where to proxy to, and what threshold to trip at. Eg:
$ cbp -l localhost:8000 -r api.example.com:80 -t .05
sets up a proxy to port 80 of api.example.com on the local port 8000. You can then configure a component to make API requests to
http://localhost:8000/. If more than 5% of those requests fail, the circuit breaker trips and it stops proxying for 1 second, 2 seconds, 4 seconds, etc. until the remote service recovers.
This is a raw TCP proxy, so even HTTPS requests or any other kind of (TCP) network traffic can be sent through it. The downside is that that means it can only detect TCP level errors and remains ignorant of whatever protocol is on top of it. That means that eg, if you proxy to an HTTP/HTTPS service and that service responds with a 500 or 502 error response,
cbp doesn’t know that that is considered an error, sees a perfectly valid TCP request go by, and assumes everything is fine; it would only trip if the remote service failed to establish the connection at all. (I may make an HTTP specific version later or add optional support for HTTP and/or other common protocols, but for now it’s plain TCP).