Serverless Handbook

Architecture principles

What do you do when 3% of requests fail?

That was my reality one painful day in September. Deploy new feature, pat yourself on the back, go to lunch. A big feature – done. Nothing can touch you. 💪

Ding.

A Slack message? Who's working through lunch this time ...

Ding. Ding.

Weird, that's a lot of messages for lunchtime ...

BZZZ

Everything is on fire. High API error rate 🔥
~ an alarm SMS

Our system sent alarms to Slack. Serious alarms to SMS.

The high API error rate was the biggest alarm. A catch-all that triggers when you can't be sure the more specific alarms even work.

Back to my desk, my heart sank: 3% of all API requests were failing. Reasons unknown.

Every user action, every background process on the web, on iOS and on Android, every time you opened the site or accessed the app 👉 3% chance of failure. At our usual 10 requests per second, that's 18 errors every minute!

Wanna know the best part?

Nobody noticed. ✌️

I knew something was wrong because of that SMS. The system kept hobbling along. Slower, in pain, but getting the job done.

How can a system with 18 failures per minute keep working?

Everything fails

The design principle behind every backend architecture states:

  1. Everything can and will fail
  2. Your system should work anyway
  3. Make failures easy to fix

In 2011 Netflix forced engineers to think about this with the Chaos Monkey. They wanted to "move from a development model that assumed no breakdowns to a model where breakdowns were considered to be inevitable".

The Chaos Monkey makes that happen by introducing random failures to production environments. Servers down, databases unresponsive, message queues failing.

Reality is going to be your chaos monkey. Plan for it. 😉

How statistics play against you

At scale, you're playing against statistics. A one-in-a-million error rate, at 10 requests per second, means roughly an error every day.

Not a big deal, but if that error happens in your payment flow and you double charge a user ... they'll care.

What do statistics mean for you and your architecture?

AWS Lambda guarantees a 99.95% monthly uptime percentage. That's your base error rate.

Even if you write the most perfect code, you can't have a lower error rate than your platform.

What does 99.95% monthly uptime percentage mean in practice?

“Monthly Uptime Percentage” for a given AWS region is calculated as the average of the Availability for all 5-minute intervals in a monthly billing cycle. Monthly Uptime Percentage measurements exclude downtime resulting directly or indirectly from any Lambda SLA Exclusion.

“Availability” is calculated for each 5-minute interval as the percentage of Requests processed by Lambda that do not fail with Errors and relate solely to the provisioned Lambda functions. If you did not make any Requests in a given 5-minute interval, that interval is assumed to be 100% available.

Let's run the numbers

There are 24*60/5 = 288 5-minute intervals in a day. An average of 99.95% means that some intervals have lower availability, some higher. None go higher than 100%.

The hidden message is that you're going to see bursts of failures. A bad 5-minute period in a sea of 100% green. Outliers pull the average down.

Let's assume an even error rate to simplify the math.

At 1 request per 5 minutes, you get 288 * 0.9995 = 287.86 successful requests. Round up to 288. All good.

But at 1 request per minute (1440 requests per day), you're down to 1439 successful requests. An error per day.

It gets worse from there.
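Here's the same arithmetic as a quick back-of-the-envelope sketch, assuming an even 0.05% error rate (the flip side of 99.95% availability):

```typescript
// Back-of-the-envelope: expected errors per day at a steady 0.05% error
// rate. A sketch of the chapter's arithmetic, not a real availability model.

const AVAILABILITY = 0.9995

function errorsPerDay(requestsPerDay: number): number {
  return requestsPerDay * (1 - AVAILABILITY)
}

console.log(errorsPerDay(288)) // 1 request per 5 minutes → ~0.14, an error a week
console.log(errorsPerDay(1440)) // 1 request per minute → ~0.7, roughly an error a day
console.log(errorsPerDay(864_000)) // 10 requests per second → ~432 errors a day
```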

I once built a large system running at a 0.092% failure rate. Lower than the error rate of the platform because of the architecture.

Pretty good eh?

1069 errors per day. That works out to roughly 1.16 million requests per day.

PS: failure rate means the system never manages to process a request; error rate means individual, possibly recoverable, errors along the way

How to design a resilient architecture

Resilience is key. Designing your system to withstand shocks and errors. To work as a whole when individual pieces fail.

Your goal is to:

  • isolate errors 👉 when an error happens, confine it to the smallest area possible without damaging the rest of your system
  • make operations replayable 👉 triggering the same operation multiple times must produce the same result
  • retry until success 👉 errors are often transient and go away when you try again
  • make your system debuggable 👉 store enough information about your actions so that you can look back later
  • remove unprocessable requests 👉 sometimes requests can't be processed. Make sure these don't kill your system or get in the way of valid requests
  • alert the engineer when something is wrong 👉 errors can be part of normal operations. When there are too many, make your system cry out in pain and tell you where it hurts

You can achieve all that with a few steps.

Small, isolated, replayable operations

Think of your task as a pipeline.

A request comes in. You do X, then you do Y, then Z, then the result comes out. The request is your input, the result is your output.

Like Henry Ford's famous assembly line 👉 steel comes into the factory on one end, cars leave on the other.

For max resilience, make each operation follow this algorithm:

  • get request
  • check if request already processed
  • if processed, finish
  • if not, do your thing
  • trigger the next step
  • mark request as processed

Every request needs an identifier. Easiest if it's a unique id of a database row.

Having a way to answer "Did I process this yet?" gives you replayability. Call the same action with the same request twice and nothing happens.

The easiest way to achieve this is a lookup table. I like to put a processed flag in each database row. Lets you pass a reference to the row between actions. Each does its thing and changes the flag.

Each action does one thing and one thing only.

Because each action performs one step in the process, you can look at the processed flags to see what's up.

Debugging becomes easier. You can test steps in the process on their own. Like a function call.
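As a sketch, here's what one of these actions can look like as a Lambda-style function. The `db` client, `doYourThing`, and `triggerNextStep` are hypothetical stand-ins for your own database, business logic, and queue trigger:

```typescript
// Sketch of one replayable pipeline step. `db`, `doYourThing`, and
// `triggerNextStep` are hypothetical stand-ins, not a real library.

type Row = { id: string; processed: boolean }

declare const db: {
  findRow(id: string): Promise<Row>
  markProcessed(id: string): Promise<void>
}
declare function doYourThing(row: Row): Promise<void>
declare function triggerNextStep(message: { rowId: string }): Promise<void>

export async function handler(event: { rowId: string }) {
  // get the request and check if it was already processed
  const row = await db.findRow(event.rowId)

  // if processed, finish; replaying the same request becomes a no-op
  if (row.processed) return

  // if not, do your thing
  await doYourThing(row)

  // trigger the next step in the pipeline
  await triggerNextStep({ rowId: row.id })

  // mark the request as processed
  await db.markProcessed(row.id)
}
```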

Isolation comes from decoupling

Isolation ensures that an error in one action doesn't impact others. That action fails and the pipeline might get backed up, but other actions keep working fine.

And because you have replayability, you can retry until your action succeeds. ✌️

Retry until success

Fire an action, get success or failure. If it succeeds, remove the message from the queue and move on. If it fails, put the message back on the queue and try again.

Errors can be temporary. Engineers will fix the bug.

If it's a hardware problem, you can wait until your function runs on a different physical machine. Yay serverless.

You can keep retrying until success because actions are safely replayable.
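Here's a rough sketch of that pattern with a queue-triggered function. The event shape loosely mirrors an SQS batch; `processMessage` is a hypothetical stand-in for the pipeline step above:

```typescript
// Sketch of retry-until-success with a queue-triggered function.
// If processing throws, the function fails, the messages aren't deleted,
// and the queue delivers them again later. Replayable actions make
// those repeat deliveries harmless.

type QueueRecord = { messageId: string; body: string }
type QueueEvent = { Records: QueueRecord[] }

// hypothetical stand-in for the pipeline step above
declare function processMessage(body: string): Promise<void>

export async function handler(event: QueueEvent) {
  for (const record of event.Records) {
    // any error here bubbles up and puts the batch back on the queue
    await processMessage(record.body)
  }
}
```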

To be safe, I like to have a cleanup worker that goes through the database every few hours and checks for rows that failed to process and stopped retrying.

Catches errors in the retry system itself. ✌️
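A sketch of what that cleanup worker might look like as a scheduled function; `findStuckRows` and `requeue` are hypothetical helpers:

```typescript
// Sketch of a scheduled cleanup worker: find rows that never finished
// processing and push them back through the pipeline.
// `findStuckRows` and `requeue` are hypothetical helpers.

declare const db: {
  findStuckRows(olderThanMs: number): Promise<{ id: string }[]>
}
declare function requeue(rowId: string): Promise<void>

const THREE_HOURS = 3 * 60 * 60 * 1000

export async function cleanup() {
  // anything unprocessed for a few hours is considered stuck
  const stuck = await db.findStuckRows(THREE_HOURS)

  for (const row of stuck) {
    // safe to re-trigger because every action is replayable
    await requeue(row.id)
  }
}
```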

PS: more on queues and how this works in the chapter on serverless elements

A debuggable system is a good system

Small actions that store a "Yes I Did It" state for each piece of data are easier to debug. You can look at your database to see which step failed.

Replayability lets you trigger the failed action for that data and see what happens.

Since actions are functions and your messages are plain data, you can test locally. Unit tests are great, but running production code on production data in your terminal? That's wow.

And if that's not enough, add logging. A well-placed console.log can do wonders.
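For example, a log line that includes the row id and the step name gives you something to search for later. A sketch, reusing the hypothetical names from the pipeline step above:

```typescript
// Sketch: a console.log with enough context to find and replay the
// failing request later. `doYourThing` and `row` reuse the
// hypothetical names from the pipeline sketch above.

declare function doYourThing(row: { id: string }): Promise<void>
declare const row: { id: string }

export async function stepWithLogging() {
  try {
    await doYourThing(row)
  } catch (error) {
    // log the row id and step name, then rethrow so the queue retries
    console.log("step failed", { step: "doYourThing", rowId: row.id, error })
    throw error
  }
}
```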

Conclusion

Build your system out of small isolated pieces that talk to each other via queues.

Next chapter we dive into queues and lambdas, and talk about how to tie them together into a system.

Did you enjoy this chapter?

Thanks for supporting Serverless Handbook! Consider sharing it with friends ❤️


Cheers,
~Swizec
