AWS re:Invent: Performing chaos engineering in a serverless world

Thanks to my current employer, New10, I was lucky enough to attend the AWS re:Invent conference in Las Vegas this year. I would like to share my notes on the talks I'm attending while my memory is still fresh. My first talk was "Performing chaos engineering in a serverless world" by Gunnar Grosch.

So, what is `Chaos Engineering`?

According to Wikipedia:

“Chaos engineering is the discipline of experimenting on a software system in production to build confidence in the system's capability to withstand turbulent and unexpected conditions.”

Sounds great, but thanks to Gunnar I learned that this definition can be adjusted and extended. Here's what I took away:

Chaos engineering is not about breaking things, it's about learning from failures.
Resilience and Chaos engineering are two distinct matters, and they should not be mixed up or confused.
You don't have to run Chaos engineering experiments on production. Many companies stopped looking into Chaos engineering because they didn't want to put production data at risk. This should not be the case - if you're not confident, run your experiments in a test environment. Keep your "Chaos" test environment as close to production as possible - this will give you the best results. A lot of companies run their experiments outside production and are very happy with the results.
You don't have to be a big streaming company (like Netflix) to start using Chaos engineering. If you are a mid-size startup running everything serverless, it's a perfect time to start experimenting.
"Chaos Engineer" as a position is gaining traction on the market, and companies are hiring specialists to help them set up and run Chaos experiments.
Chaos engineering is about finding weaknesses in the system and fixing them before they break in production.
Chaos engineering is not only about gaining confidence in your system or application, but also about gaining confidence in your organization. Finding an error in your system is important, but it's just as important that your organization reacts to it effectively. Running Chaos experiments and drills lets you find weaknesses in your organization's processes and fix them before a real incident happens in production.

What is the motivation for Chaos engineering?

“Everything fails all the time” - Werner Vogels (CTO, Amazon)

I think it's very important for every developer to realize this and take it into account when they build a new microservice or application. A business only survives when its customers are happy. That's why I like the phrase "Nines don’t matter if users aren’t happy". You can have a very sophisticated monitoring system, but if your application breaks because of resiliency problems, customers will not tolerate it. Losing customers means losing money, and that's not sustainable.

Another thing to think about: are you confident that your monitoring and alerting systems work properly, and that at any moment you can easily spot a failure, downtime or error and react to it? At New10 we are constantly working on observability and monitoring, and I still see a lot of room for improvement, so I think Chaos engineering experiments could help us a lot here.

So, how do you run a Chaos experiment?

Step 1: Define steady state

Go back to your monitoring and pick a set of system and business metrics that define the normal, "expected" state of the system. This will help you track deviations and errors. Business metrics are usually more telling and more important, so don't depend only on CPU/memory usage metrics. Think from a user/business perspective.
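To make the idea of a steady state more concrete, here is a minimal Python sketch. It is my own illustration, not something shown in the talk: the metric names and thresholds are hypothetical, and in practice the observed values would come from your monitoring system rather than a hard-coded dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyStateMetric:
    """One metric plus the range that counts as normal (values are hypothetical)."""
    name: str
    lower: float
    upper: float

    def is_within(self, value: float) -> bool:
        return self.lower <= value <= self.upper


def find_deviations(steady_state, observed):
    """Return the names of metrics that are missing or outside their normal range."""
    deviations = []
    for metric in steady_state:
        value = observed.get(metric.name)
        if value is None or not metric.is_within(value):
            deviations.append(metric.name)
    return deviations


# Hypothetical steady state: business metrics first, system metrics second.
STEADY_STATE = [
    SteadyStateMetric("orders_per_minute", lower=5, upper=50),
    SteadyStateMetric("checkout_success_rate", lower=0.98, upper=1.0),
    SteadyStateMetric("p95_latency_ms", lower=0, upper=800),
]

# Hypothetical values observed while an experiment is running.
observed = {
    "orders_per_minute": 12,
    "checkout_success_rate": 0.91,
    "p95_latency_ms": 420,
}

print(find_deviations(STEADY_STATE, observed))  # ['checkout_success_rate']
```

A missing metric is treated as a deviation on purpose: if the experiment breaks your monitoring, that is worth knowing too.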

Step 2: Form your hypothesis

Define your "what ifs" (what happens if our DynamoDB table is dropped, or what if a 3rd party service is not available?).
Decide where you are going to inject your errors (introduce an error in your code, add some latency to your lambda or to a 3rd party service). What I found very useful from the talk is that you don't have to think only in terms of services and applications. You also need to think about the organization. Imagine that a key person or service owner is not available - this can be a very good use case that shows how your team handles a failure in a service when the person who knows the most about it is absent. I think it would be quite revealing and useful to the whole organization.

Step 3: Plan and run your experiment

During this phase it's important to gather the key people in a room and plan everything on a whiteboard. Discuss the plan, "play it out" on the whiteboard and talk through the potential impact on the whole system. Most probably you will discover things you didn't see at first. Besides, it gets everyone on the same page and helps the experiment run more smoothly. During planning, agree on the "blast radius" - how big you want your experiment to be (how many servers you want to include in the experiment, what percentage of users will be affected, etc.). It's usually better to start with a small radius, so that the experiment is easier to measure and handle. The more confidence you gain with your experiments, the more you can extend the radius. According to Gunnar, scaling up the experiment reveals new problems that were not visible on a smaller scale.
It's very important to notify the whole organization that a Chaos experiment will be taking place and that new alerts and incident reports might appear, so people should not panic.
Another important task during planning is to create a Stop button - a mechanism that allows you to quickly stop the experiment and bring the system back to normal.
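The blast radius and the Stop button can be sketched in a few lines of Python. This is an illustration under my own assumptions, not a recipe from the talk: the environment variable name CHAOS_ENABLED and the function names are hypothetical, and a real Stop button would more likely be a flag stored somewhere you can flip without redeploying anything.

```python
import os
import random


def chaos_enabled(environ=None):
    """Stop button: injection is allowed only while CHAOS_ENABLED is 'true'.

    CHAOS_ENABLED is a hypothetical flag name used for this sketch.
    """
    environ = os.environ if environ is None else environ
    return environ.get("CHAOS_ENABLED", "false").lower() == "true"


def in_blast_radius(blast_radius, rng=random):
    """Select roughly `blast_radius` (0.0 to 1.0) of all invocations."""
    if not 0.0 <= blast_radius <= 1.0:
        raise ValueError("blast_radius must be between 0.0 and 1.0")
    # rng.random() returns a float in [0.0, 1.0), so 0.0 never matches
    # and 1.0 always matches.
    return rng.random() < blast_radius


def should_inject(blast_radius, environ=None, rng=random):
    """Inject a failure only if the experiment is on and this call is selected."""
    return chaos_enabled(environ) and in_blast_radius(blast_radius, rng)


# Start small: affect about 5% of invocations while the experiment is enabled.
if should_inject(0.05):
    print("this invocation would get an injected failure")
```

Checking the Stop button first means that switching the flag off ends the experiment for every invocation, regardless of the configured radius.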

Step 4: Measure and learn

It's important to have good observability and metrics for the system under test. You need them to prove or disprove your experiment hypothesis. After the experiment, a summary should be written and shared within the organization. It should contain:

Hypothesis details
Actions that were made during the experiment
Metrics of how the system behaved during the experiment
Was there any unexpected behavior?
A summary that states whether the hypothesis about system behavior was confirmed or disproved.

Interestingly, success can mean different things for an experiment:

If your system was resilient to all the failures/breakdowns that you have introduced during the experiment - success
You managed to run the experiment and confirm your hypothesis - success
You found some new, unexpected behavior of the system that you hadn't thought of before - that's also a success, because you learned something new about your system, and not at your customers' expense!

Step 5: Scale up or abort and fix

As I said before, scaling the experiment can uncover new weaknesses or problems in the system. It's useful to increase the scale gradually and monitor the results. But if you already see definite problems, take a break, fix them, and then repeat your experiment with the fixed version of your service. It's better to keep the number of known issues to a minimum, so that the system metrics are easier to interpret and it's clearer what is causing failures.

Challenges with serverless

Chaos engineering is a technique that has been battle-tested on real hardware for quite a long time (since 2011, by Netflix). The problem with serverless is that you don't control the hardware your functions run on. You can't just shut down a VM or configure some latency on a machine. It's a black box controlled by AWS, and you have very little control over it.
Another problem is that each function has its own configuration and permissions, which increases complexity and "Chaos". Serverless allows us to build small, simple functions and services, but this in turn brings more complexity into the architecture of the whole system and the interconnections between its parts. Funnily enough, all those factors make serverless a perfect fit for Chaos engineering :)

Common serverless weaknesses:

Error handling - the more services we have, the more likely it is that we handle errors differently in each of them. Chaos engineering exposes the weak spots in our system when errors are introduced on purpose somewhere in the flow.
Timeout values - what happens if a 3rd party service that your function relies on replies with a delay? What is the timeout of your lambda function? Every function has its own configuration, and it's easy to miss the timeout setting for one of them.
Events - what happens if your function receives events in the "wrong" order? Will it handle them properly, or will it just throw a bunch of errors and die?
Fallbacks - what happens if some 3rd party service is down, or one of the resources that your service relies on is no longer available?
Failovers - this is quite a rare state, but we already know that "everything breaks" :)
Roles and Permissions - what will happen if your function doesn't have access to DynamoDB anymore?
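The timeout and fallback questions above are exactly what a chaos experiment is meant to probe. As a rough illustration of the kind of defensive code such an experiment would exercise, here is a Python sketch of my own (not from the talk). The helper call_with_fallback and the slow function are hypothetical, and the standard library thread pool is just one way to put a deadline on a call.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_fallback(func, timeout_s, fallback):
    """Run func(); return `fallback` if it takes too long or raises."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The dependency replied too slowly: degrade instead of hanging.
        return fallback
    except Exception:
        # The dependency failed outright: degrade as well.
        return fallback
    finally:
        # Do not block on the worker; note that the slow call itself keeps
        # running in the background until it finishes.
        pool.shutdown(wait=False)


def slow_third_party_call():
    """Hypothetical stand-in for a 3rd party service replying with a delay."""
    time.sleep(0.3)
    return {"rate": 4.2}


result = call_with_fallback(slow_third_party_call, timeout_s=0.05, fallback={"rate": None})
print(result)  # {'rate': None}
```

Whatever deadline you choose has to be shorter than the timeout configured for the function itself, otherwise the function is killed before the fallback ever gets a chance to run.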

If you want to start experimenting, you should check out the Chaos Lambda Layer. It offers simple Python decorators for delay, exception and statusCode injection, and a class to add delay to any 3rd party dependencies called from your function. This makes it possible to run small chaos engineering experiments on your serverless application in the AWS Cloud.
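To show the idea behind decorator-based injection, here is a hand-rolled Python sketch. To be clear, this is not the Chaos Lambda Layer's API: the decorator names, their parameters and the handler below are all hypothetical, written only to illustrate the technique.

```python
import functools
import time


def with_injected_delay(delay_s, enabled=lambda: True, sleep=time.sleep):
    """Hypothetical decorator: wait `delay_s` seconds before running the handler."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            if enabled():
                sleep(delay_s)
            return handler(event, context)
        return wrapper
    return decorator


def with_injected_exception(exc_factory, enabled=lambda: True):
    """Hypothetical decorator: raise an exception instead of running the handler."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            if enabled():
                raise exc_factory()
            return handler(event, context)
        return wrapper
    return decorator


@with_injected_delay(delay_s=0.01)
def handler(event, context):
    """Hypothetical Lambda-style handler used for the experiment."""
    return {"statusCode": 200, "body": "ok"}


print(handler({}, None))  # {'statusCode': 200, 'body': 'ok'} after a short delay
```

The `enabled` callable is where a Stop button or a blast radius check would plug in, so the injection can be switched off without touching the handler code.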

Thanks to this talk I started to think that we should definitely try Chaos Engineering at New10, and I will start preparing a pitch about it once I'm back in Amsterdam. Follow Gunnar (I followed him right after his talk) and experiment in your company too!
