Thank you for joining us for today’s 5 Minute Tech Challenge! We’re so glad you’re a part of our community. Today we’re learning from Abhinav Ramakrishnan, a staff software engineer at Meta. Let’s get to learning, Abhinav!
Every system around you will fall over at some point – it’s just a matter of time. And then, an initiative to “make the system more resilient” gets green-lit, and all of a sudden there are war-rooms and virtual teams focused on improving resiliency. You’re in there trying to figure out what to do and what went wrong, and you’re not sure where to start. I’m hoping this post can be a launching-pad for that discussion.
Resiliency - 100s: Catch your failures
Resiliency 101: Test, Test, Test
No - I don’t mean you need to run each test three times, but there’s an idea! I just mean that to build reliable & resilient systems you need to be testing your system regularly. As you’re pushing changes out you want to build confidence that you’ve not broken anything, and you want this to happen at a regular cadence in an automated way.
My personal test-hierarchy is as follows:
Integration-tests are critical – they’re like unit-tests, but work on packages of code instead of a single class. This makes them easier to write than end-to-end (E2E) tests (which have external dependencies), and more valuable than unit-tests since they also test how things integrate. (There’s a small sketch of what I mean after this list.)
Second come E2E tests that test the entire flow of a feature. This gives you confidence that what the user sees isn’t broken. If these were easy to write, they’d be numero uno, but they’re rarely easy and often flaky, which makes them second in my ranking.
Last are unit-tests. They’re helpful when you’re trying to refactor something, but classes change too often for unit-tests to have durable value.
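To make the distinction concrete, here’s a minimal sketch of what I mean by an integration test, written in Python with pytest in mind. OrderService and InMemoryOrderRepository are made-up names for this example; the point is that the test wires a few real classes together in-process, without mocks and without standing up external dependencies.

```python
# A minimal integration-test sketch (pytest style). OrderService and
# InMemoryOrderRepository are hypothetical names used only for illustration:
# the test exercises a package of collaborating classes, not a single class,
# and not a full deployment.

class InMemoryOrderRepository:
    def __init__(self):
        self._orders = {}

    def save(self, order_id, order):
        self._orders[order_id] = order

    def get(self, order_id):
        return self._orders.get(order_id)


class OrderService:
    def __init__(self, repository):
        self._repository = repository

    def place_order(self, order_id, items):
        if not items:
            raise ValueError("an order needs at least one item")
        self._repository.save(order_id, {"items": items, "status": "PLACED"})
        return self._repository.get(order_id)


def test_place_order_persists_and_returns_the_order():
    # Wire real collaborators together (no mocks), but keep externals in-memory.
    service = OrderService(InMemoryOrderRepository())

    order = service.place_order("order-123", ["pizza oven"])

    assert order["status"] == "PLACED"
    assert order["items"] == ["pizza oven"]
```

Because everything runs in-process, this is nearly as fast as a unit-test, but it catches wiring mistakes between classes that a unit-test never would.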
Food for thought: paying someone to exclusively test your product is not something I understand. It frees up time for devs, but it also means the devs never fully understand the end-user experience. There’s nothing like breaking your own product to teach you where you went wrong.
Resiliency 102: Monitoring and Alerting
It’s hard to solve a problem if you don’t know what’s wrong. Logs are useful, but don’t paint the aggregate picture. What you need is near-real-time information about a system’s state & health – hence monitoring and alerting. By itself, monitoring and alerting isn’t going to magically make your system more resilient. But it can help with early detection of potential issues and can enable teams to respond in a timely manner. There are also secondary benefits – such as highlighting performance regressions – but that’s a bonus.
My suggestion here is to have tiers of dashboards:
Business-function dashboards that walk through the metrics you’re trying to uphold/drive (e.g. number of orders fulfilled, number of packages shipped, labor deficit in a warehouse, number of daily active customers, etc.)
High-level system health-metrics (# of errors, proof-of-work done, etc.) to tell you which system might be facing issues when something’s wrong (there’s a small instrumentation sketch after this list)
Detailed dashboards scoped to a system (latencies for individual calls, metrics on dependencies, etc.) to tell you precisely what’s going wrong. This allows you to quickly triage an issue when something inevitably goes wrong – and hey, if it’s not affecting the bottom-line, you might as well go to sleep and fix it in the morning.
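As a rough idea of the instrumentation that feeds these dashboards, here’s a sketch using the prometheus_client library. The metric names and the handle_request shape are made up for illustration; use whatever metrics stack you already have.

```python
# A minimal instrumentation sketch using the prometheus_client library.
# The metric names and handle_request() are made up for illustration:
# business metrics feed the top-tier dashboard, error counts and latencies
# feed the system-health and detailed dashboards.
import time

from prometheus_client import Counter, Histogram, start_http_server

ORDERS_FULFILLED = Counter(
    "orders_fulfilled_total", "Business metric: orders fulfilled"
)
REQUEST_ERRORS = Counter(
    "request_errors_total", "System health: failed requests"
)
REQUEST_LATENCY = Histogram(
    "request_latency_seconds", "Detailed view: per-request latency"
)


def handle_request(order):
    start = time.monotonic()
    try:
        # ... do the actual work for this order here ...
        ORDERS_FULFILLED.inc()
    except Exception:
        REQUEST_ERRORS.inc()
        raise
    finally:
        REQUEST_LATENCY.observe(time.monotonic() - start)


if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for your scraper to collect
    # ... run your actual service here ...
```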
Resiliency - 200s: Do good work
Resiliency 201: Procrastinate
If you own an e-commerce site and someone’s paid for an order, you don’t want to be spinning a loading screen until someone goes out and ships a box. Ain’t nobody got time for that. Instead, you just say: I’ve charged your credit card, and I’ll get around to sending you what you asked for. That’s asynchronous, deferred work – i.e. a queue. Queues allow you to defer work while smoothing out load to your system and decoupling it from other systems. Use them where you can.
Queues also introduce complexity that you need to manage. You can’t rely on message ordering, you need to implement more complicated error-handling, and you have to monitor and manage your queues so they don’t grow like crazy. Despite these drawbacks, queuing solutions are very robust these days, with libraries that make working with queues very easy. Err on the side of more queues.
A fun aside about queues is that you can sometimes avoid doing work by making it async. Around 30% of the time, I order a package on Amazon, get buyer’s remorse, and immediately cancel it. By the time the system gets around to my order in the queue, it realizes it doesn’t need to do anything, and it’s magically saved some CPU cycles. This beats completing the order and then having to go through an entire hoopla to undo it. Where possible, prefer asynchrony.
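Here’s a minimal sketch of both ideas together: defer the slow part behind a queue, and skip work that’s been cancelled by the time a worker gets to it. Python’s queue.Queue stands in for whatever broker you’d actually use (SQS, Kafka, RabbitMQ, etc.), and charge_card/ship are made-up placeholders.

```python
# A sketch of deferring work with a queue, and of skipping work that's no
# longer needed. queue.Queue stands in for a real broker; charge_card() and
# ship() are made-up placeholders for the real payment and fulfillment calls.
import queue

fulfillment_queue = queue.Queue()
cancelled_orders = set()


def charge_card(card):
    """Stand-in for the payment call the customer actually waits on."""


def ship(job):
    """Stand-in for the expensive fulfillment work."""


def place_order(order_id, card, items):
    charge_card(card)
    # Defer everything else (picking, packing, shipping) to a worker later.
    fulfillment_queue.put({"order_id": order_id, "items": items})
    return {"status": "ORDER_ACCEPTED"}  # respond to the customer immediately


def cancel_order(order_id):
    cancelled_orders.add(order_id)  # cheap: just remember the cancellation


def worker_loop():
    while True:
        job = fulfillment_queue.get()
        if job["order_id"] in cancelled_orders:
            continue  # buyer's remorse already struck: no work, no undo hoopla
        ship(job)  # only spend cycles on orders that still matter
```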
Resiliency 202: Do one thing well
As much as we all wish to practice the Tao of Haskell, the only way for you to have impact on the world is to enact a side-effect, sometimes more than one. For example, when The Rock posts about Kevin Hart being short, we store the post, and then let Kevin know so we can keep the beef going. As a dev you might choose to wrap this as an API-call: first save-to-DB, then notify-Kevin; all within the context of a single request. Isn’t that nice!
Don’t. You no longer have a single unit-of-work. If you hit an OOM half-way through, you could have saved to the database and never actually notified followers; or maybe you never even saved the post. Who knows – and who has time to dig through the logs? Instead, isolate responsibilities. Save the post in the DB and use change-data-capture for notifications. That way if your request fails, you know the post never made it to the DB, and if it’s the subsequent eventing that failed, then you know that Kevin never found out that The Rock threw shade. Doing one thing in each step makes it easier to reason about the system and to make each step more robust.
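Here’s a rough sketch of that shape: the request path does exactly one thing (save the post), and a separate consumer turns the change feed into notifications. The function names are made up, and in real life the feed would come from your database’s change-data-capture stream rather than these stubs.

```python
# A sketch of "do one thing per step". The request path only writes the post;
# a separate consumer reads a change feed and sends notifications. All of the
# helpers below are illustrative stand-ins, not a real CDC integration.

def save_post_to_db(author, text):
    """Stand-in for the single write the request path performs."""
    return {"author": author, "text": text, "post_id": "post-1"}


def post_change_feed():
    """Stand-in for the database's change-data-capture stream."""
    return []


def followers_of(author):
    """Stand-in for the follower lookup."""
    return []


def notify(followers, post_id):
    """Stand-in for sending the notifications."""


def handle_create_post(author, text):
    # Single unit of work: either the post is in the DB, or the request failed.
    return save_post_to_db(author, text)


def notification_consumer():
    # Separate step with its own retries and monitoring: if this fails, the
    # post is still safely stored, and you know exactly which half broke.
    for change in post_change_feed():
        if change["type"] == "POST_CREATED":
            notify(followers_of(change["author"]), change["post_id"])
```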
Resiliency - 300s: Recover fast
Resiliency 301: Hire more workers
If you ran a pizza shop and had a single pizza-chef, you’d be in for a tough time when they got sick. So often you have a “main” pizza-chef, and a second who can step-in if needed (unless you’re on Kitchen Nightmares, but let’s aspire to more than having Gordon shout at us).
Redundancy is one of the foundational ways to build resiliency. It applies at all levels of the stack. Instead of having a single server processing all requests, you have many servers so that a single server failing isn’t a problem. Do this as much as you can – hardware fails all the time, and you never want a single point-of-failure.
At the software level, you can build failovers. For example, if VISA is down and you’re processing orders, you could go to another system that tells you how likely a customer is to pay. If they’ve got good credit, you could just ship out the order and charge later when VISA’s back online.
Note: don’t do this unless you must. Failovers are tricky because they’re only executed when something goes wrong, which is hopefully rare. As your system evolves, failovers become gambles - you don’t know if they still work given all the changes that’ve been pushed through since last time. And if something goes wrong, God help whoever’s debugging. If you’ve got to do it, fail over regularly to exercise these paths and build confidence both in the resiliency of your system and in your operators’ ability to run it in its failover state.
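As a sketch of that payment failover (charge_visa, credit_risk, ProcessorDown, and the risk threshold are all made up for illustration):

```python
# A sketch of a software failover path: try the primary payment processor,
# and only if it's down, fall back to a credit-risk check and ship now /
# charge later. Everything here is an illustrative stand-in.

class ProcessorDown(Exception):
    """Raised when the primary payment processor is unavailable."""


def charge_visa(order):
    """Stand-in for the primary payment call."""


def credit_risk(customer):
    """Stand-in for the system that scores how likely a customer is to not pay."""
    return 0.0


def process_order(order, customer):
    try:
        charge_visa(order)
        return "CHARGED"
    except ProcessorDown:
        # Failover path: exercised rarely, so exercise it deliberately in drills.
        if credit_risk(customer) < 0.1:
            return "SHIPPED_CHARGE_LATER"  # ship now, settle when the processor is back
        raise  # risky customer: better to fail this order than eat the loss
```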
Resiliency 302: Starve some customers
When a system is overloaded, you only have two options: let all requests fail, or load-shed and let a few requests fail. While the former is more equitable, the latter leads to more $$$. And since we live in a capitalist society, we chase the $$$.
Not all traffic to your systems is equally important. For example, if you have more orders than you can pick, you want to pick the ones that make you the most profit. This pattern of isolating faults to a few customers to let others thrive is quite common and falls under a few thematic labels: throttling, load-shedding, bulkheading, etc. But as Aragorn, son of Arathorn says of athelas, so I say to you of throttling: “I care not whether you say now asëa aranion or kingsfoil, so long as you have some”.
The one thing to note is that this only works if throttling is less expensive than fulfilling the request. Otherwise throttling is going to cause your system to brown-out. This is actually the problem with retry-storms. System-A throttles multiple requests from System-B, and each time, System-B retries. Soon, you have 10X more requests than you started off with. Then, as Gandalf the Grey said to the Balrog in the mines of Moria: “You shall not… be having a good time with that on-call.”
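One standard way to keep System-B’s retries from turning into a storm is exponential backoff with jitter. Here’s a sketch; Throttled and call_system_a are made-up placeholders for whatever error and client you actually have.

```python
# A sketch of client-side exponential backoff with full jitter, so a herd of
# throttled clients doesn't retry in lockstep. Throttled and call_system_a()
# are illustrative stand-ins.
import random
import time


class Throttled(Exception):
    """Raised when the downstream system sheds our request."""


def call_system_a(request):
    """Stand-in for the real downstream call."""


def call_with_backoff(request, max_attempts=5, base_delay=0.1, max_delay=5.0):
    for attempt in range(max_attempts):
        try:
            return call_system_a(request)
        except Throttled:
            if attempt == max_attempts - 1:
                raise  # give up instead of hammering an already-overloaded system
            # Sleep a random amount up to an exponentially growing cap.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```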
A fun thought experiment to run is to consider using a last-in first-out queue. When a queue grows too big you can’t process it in time. This leads to longer latencies, which annoys people. With a FIFO queue, that latency is distributed across everyone. With a LIFO queue, some customers face really long latencies, while others enjoy quick response times. Which one you want is up to you; but I’d rather really irritate a few people than annoy everyone slightly.
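For the thought experiment, the difference is literally which end of the backlog you serve from. A tiny sketch, with collections.deque standing in for the real queue:

```python
# FIFO vs LIFO on the same backlog: FIFO spreads the latency across everyone,
# LIFO keeps the newest requests fast and lets the oldest ones go stale.
from collections import deque

backlog = deque()


def enqueue(request):
    backlog.append(request)


def next_request_fifo():
    return backlog.popleft()  # oldest first: everyone waits a little


def next_request_lifo():
    return backlog.pop()  # newest first: a few requests wait a lot
```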
Resiliency - 400s: Just give up
The final and most important lesson in resiliency is that sometimes the right answer is to do absolutely nothing. I call this justified laziness, because quite often investing in resilience may not provide a meaningful return on investment.
For example, an internal system that sends out birthday/anniversary notifications does not need to be as resilient as an external facing website. As much as it raises morale, it doesn’t actually generate revenue. So the time (and hence $$$) you’re putting in is netting very low returns. That time can be better spent elsewhere. You should always be asking yourself if what you’re getting is worth what you’re paying.
There are other situations where there might be significant revenue impact, but the system is temporary while a new system is getting built. Here you’ve got to consider the opportunity cost. It might make more sense to build out the new system quicker than to make the old system more resilient. It does beg the question of why you’re rebuilding the system as opposed to refactoring it, but I’m going to assume you know what you’re doing.
You also have to understand that humans are by nature resilient, even if your system isn’t. An occasional order failure will cause the user to click the “order” button again. If your Instagram feed doesn’t load 1% of the time, people will just refresh. Not everything needs to be solved through technology - sometimes it’s okay to leave some things unsolved and just let humans figure it out. We seem to have survived the past 5-7 million years, I’m sure we can live through a few dropped requests.
Happy building!
Coming up in two weeks, I’ll be back to share my learnings on Understanding Product-Market-Fit: An engineer’s guide to partnering with product. The TLDR: product work can sometimes feel like swimming against the tides! (I promise that was a funny pun that will make sense two weeks from now).
