How many times have you used an ATM? A hundred times? A thousand times? Depending on where you are from, you probably have an ATM on just about every corner, and you probably use them on a fairly regular basis. How many times has an ATM given you an incorrect sum of money? How many times have you heard of friends who have received an incorrect sum of money from an ATM? I’m willing to bet the answer to both is "never". Pretty amazing, right?
With the millions of bills that are fed out of ATMs every year, it is amazing how reliable the mechanisms inside them are, right? I mean, if ATMs so rarely give out incorrect amounts of money, then someone must have developed an amazingly reliable mechanism to divvy up the bills and then feed them out to the user, right? All those crisp new bills are so insanely hard to get apart, I just had to wonder what sort of engineering had gone into making absolutely sure that bills didn’t get stuck together!
Well, I was watching one of those "how it works" sort of shows the other day, and it turns out that the solution is devilishly simple. You see, most ATMs have a relatively simple mechanism of bands which drag bills one by one off a stack and count them out. While this system is fairly reliable, it still has some room for error. The reliability comes from a check that happens after the bills are grabbed off the stack. The machine passes each bill through a mechanism which measures its thickness. If the thickness doesn’t fall within a certain range (two bills stuck together are too thick, for example), the bill is rejected and put into a reject bin for inspection by a human.
So, what can our software learn from an ATM? Well, let’s break down the process:
- Perform task.
- Verify task.
- If verification failed, quit and retry.
Interesting. The mechanism wasn’t designed to be 100% reliable; it was designed to fail well. Instead of implementing a system which would detect that a bill was too thick and then try to process it further in order to make the task succeed, it simply ditches the bill and moves on. From a physical engineering perspective this is probably pretty obvious. More mechanisms mean more cost, and more things which can go wrong. It wouldn’t make sense to try to build a more complex mechanism.
Somehow many developers didn’t get this memo. In order to build reliable systems, you need to:
- Perform a single, hopefully verifiable, task.
- Verify said task.
- If verification failed, then undo what can be undone, and notify someone.
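
To make those three steps concrete, here is a minimal Python sketch. Everything in it (`Step`, `run_step`, `run_all`, and the `notify` callback) is a hypothetical name made up for illustration, not part of any library, and it assumes each task can be expressed as a small perform/verify pair with an optional undo.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

log = logging.getLogger("steps")


@dataclass
class Step:
    """One small task, bundled with the check that proves it worked."""
    name: str
    perform: Callable[[], None]
    verify: Callable[[], bool]
    undo: Optional[Callable[[], None]] = None  # not everything can be undone


def run_step(step: Step, notify: Callable[[str], None]) -> bool:
    """Perform one step, verify it, and on failure undo and notify."""
    try:
        step.perform()
        ok = step.verify()
    except Exception:
        log.exception("step %r raised", step.name)
        ok = False

    if ok:
        return True

    # Undo only what can be undone; do not attempt a clever repair.
    if step.undo is not None:
        try:
            step.undo()
        except Exception:
            log.exception("undo for step %r failed", step.name)

    notify(f"step {step.name!r} failed verification")
    return False


def run_all(steps: Iterable[Step], notify: Callable[[str], None]) -> bool:
    """Run steps in order and stop at the first one that fails."""
    return all(run_step(step, notify) for step in steps)
```

Because `run_all` stops at the first failure, a bad step never gets buried under later work, which is roughly what the ATM does when it drops a suspicious bill into the reject bin instead of pushing it through.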
What you don’t do:
- Perform multiple operations, then verify: If possible, keep things small and simple, and verify at each step. If you’re worried about failure, then keep things simple and isolated.
- Start deleting things to clean up: If you start deleting things, and not everything can be removed, then you might lose state that will help you resolve issues later. What if you delete the file that tells you where files were originally put, and then the rest of the file copy fails? It would be better to try and quarantine the files.
- Try to automatically fix the issue: Unless you can say, with certainty, that you know exactly what the problem is, writing code to handle an error situation will likely just cause another error. In fact, it may cause an error that is harder to detect or find. Errors that occur undetected are by far the worst kind of errors to have. If you do decide that fixing an issue automatically is the best course, then log exactly what actions you performed to fix the issue.
- Fail quickly without logging or isolating state: Log what happened and the state. Copy a file into a known error location. Put the message into a dead letter or poison queue. You get the idea: when something fails, try to gather things up and put them somewhere so that you can find them later!
- Perform operations in a shared location: If you know something can fail, perform the operation in a staging area. For example, if you need to bring in a file and perform some operations on it, then bring the file into a staging area, perform the operations and verify that they succeeded. Once you can verify they succeeded, then move the file to where it needs to be.
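
As a rough illustration of the staging and quarantine ideas from the list above, here is a small Python sketch using only the standard library. The function name `import_file`, the directory arguments, and the `verify` callback are all assumptions made up for this example; a real system would pick its own layout and its own checks.

```python
import logging
import shutil
from pathlib import Path
from typing import Callable

log = logging.getLogger("import")


def import_file(
    source: Path,
    staging: Path,
    final: Path,
    quarantine: Path,
    verify: Callable[[Path], bool],
) -> bool:
    """Copy `source` into staging, verify it there, then move it into place.

    On any failure the staged copy is moved to `quarantine` rather than
    deleted, so the evidence is still around for a human to inspect.
    """
    staging.mkdir(parents=True, exist_ok=True)
    staged = staging / source.name
    try:
        # Work on a private copy, never in the shared final location.
        shutil.copy2(source, staged)
        if not verify(staged):
            raise ValueError(f"verification failed for {staged}")
        # Only a verified file ever reaches the final location.
        final.mkdir(parents=True, exist_ok=True)
        shutil.move(str(staged), str(final / source.name))
        return True
    except Exception:
        # Log what happened, then isolate the state instead of deleting it.
        log.exception("import of %s failed", source)
        if staged.exists():
            quarantine.mkdir(parents=True, exist_ok=True)
            shutil.move(str(staged), str(quarantine / source.name))
        return False
```

A call might look like `import_file(Path("incoming/report.csv"), Path("staging"), Path("data"), Path("quarantine"), verify=lambda p: p.stat().st_size > 0)`, where the paths and the size check are placeholders. Note that the failure path makes no attempt to repair the file; it just logs, sets the file aside, and reports failure to the caller.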
If you looked at this list and realized that you do all these things all the time, well, you need to stop lying to yourself. If all of these ideas seem obvious to you, then pat yourself on the back. Either way, you should make every effort to consider failure in your designs, but more importantly, try not to let failure cases complicate your design. The more complexity you introduce, the more likely it is to fail. Always weigh the consequences of failure against the effort required to deal with it.