What is Chaos Engineering?
Chaos engineering is the practice of performing experiments on your systems (e.g. applications) to identify whether or not your system is resilient, and LitmusChaos is a tool that makes it possible to perform chaos engineering within a Kubernetes cluster. To be specific it is a toolset developed to help Kubernetes developers and Site Reliability Engineers (SREs) practice chaos engineering with Kubernetes. Developers would run the chaos experiments during development as part of unit testing or integration testing (or as part of a CI pipeline), whilst SREs would run the chaos experiments against the application and/or the surrounding infrastructure. Both sets of users using chaos engineering with Litmuschaos will lead to resilient infrastructure and applications being developed.
Litmuschaos
Litmuschaos makes this possible by extending the Kubernetes API with additional Custom Resource Definitions (CRDs) that can be used to create, manage and monitor the chaos experiments. The three main CRDs are:
- ChaosExperiment
- Contains low-level execution details for running an experiment
- It is a template that can be reused
- ChaosEngine
- Hold information about how the chaos experiments are executed
- Connects an application instance with one or more chaos experiments
- Also contains status of the chaos experiments (after they are run)
- ChaosResult
- Holds the result of a chaos experiment
Along with these CRDs is a Litmuschaos Kubernetes Operator that will search for active ChaosEngines
and will invoke the ChaosExperiments
declared in the engine.
You can create your own custom chaos experiments to use, but the LitmusChaos team have created a list of generic/common experiments that can be used. The experiments can be found in the ChaosHub. Some examples of common experiments are:
- Injecting packet loss into an application’s pod
- Consuming CPU resources in an application’s container
- Causing a pod to fail
- Introducing IO Disk stress on a Kubernetes node
This demo will be using one of the provided experiments to fail an application’s pod as part of a Helm chart test suite.
Helm Chart Tests
When you package your Kubernetes application into a Helm chart you may also choose to include a test suite for the Helm chart to validate your deployed Helm charts.
A good use case for this is to perform a health check against your Kubernetes application (deployed via a Helm chart). However, imagine you have an application that is fully functional when it’s initially deployed but loses some (or all) of its functionality when a disaster strikes. For example, can we be confident that our application will still be fully functional if a pod crashes and restarts?
Ideally, we’d want a test suite that can also prove that an application is resilient and by packaging LitmusChaos experiments into a Helm chart test suite, it should be possible to create a test suite that can prove an application is resilient within a Kubernetes cluster.
Before going over the implementation details, I want to provide an overview of the workflows we’ll be combining, followed by the new workflow.
Workflows
Chaos Experiment Workflow
Assuming the litmuschaos operator is deployed, the workflow of a chaos experiment (via terminal only) is as follows:
- Deploy reusable
ChaosExperiments
- Deploy an active
ChaosEngine
which declares whichChaosExperiments
to use - The experiments are run automatically and
ChaosResults
objects are created - Manually examine the
ChaosEngine
andChaosResults
objects to discover the results of the experiments
Note: Deployments can be done with the kubectl apply -f
command
Packaging the chaos experiments into a Helm chart test suite will improve this workflow. Some benefits include:
- Bundling all necessary resources for the chaos experiments together and have them deployable with a single command
- Making it possible to output the results of the chaos experiments after they are completed
- Having access to the Helm template engine to render the chaos experiments for different variations of the Kubernetes application packaged (i.e. different releases of a Helm chart)
So including the experiments in a Helm chart test suite will not only improve the test suite it’ll also improve the workflow of the chaos experiments. A win-win for both tools.
Now, we’ll go over the workflow for a Helm chart test suite.
Helm Chart Test Workflow
Tests in a helm chart live in the templates/
directory. Therefore, when creating a helm chart we need to create a template for a Pod
and Job
with the helm.sh/hook: test
annotation. This annotation declares that the manifest file should be run/deploy when it’s time for Helm’s test
hook to be invoked, which is when the helm test
command is run.
The workflow for running a Helm test suite is:
- Deploy the helm chart (note: resources with the
helm.sh/hook: test
annotation will not be deployed) - Run the
helm test
command to invoke thetest
hook - All resources with the
helm.sh/hook: test
annotation will be deployed, this includes thePod
(s) and/orJob
(s) that will run the containers with the tests - Once all the containers are finished and have terminated the test results are outputted. If all the containers terminated successfully (i.e.
exit 0
) then all tests passed.
With chaos experiments, the result of the experiments can be used to determine how the container should terminate.
New Workflow (Helm + LitmusChaos)
When we package LitmusChaos experiments into a Helm chart test suite, we can achieve the following workflow:
- Deploy reusable
ChaosExperiments
- Run
helm test RELEASE_NAME --logs --debug
- A
Pod
orJob
object is deployed, which will set theChaosEngine
to active and then go through 3 phases- Perform a health check for the app
- Query the results of the chaos experiments
- Re-perform a health check for the app after the chaos experiments
- If a phase is unsuccessful (e.g. the chaos experiment fails) then the container will terminate, but not successfully. If all the chaos experiments pass and the two app health checks were successful then the test suite was successful.
Technical/Implementation Details
Now the question is how do we create this new workflow?
Prerequisites
First, we’ll need the prerequisites:
- The
ChaosEngine
and the necessary RBAC objects run the experiments should be deployed as part of the helm chart - The
ChaosEngine
’sengineState
should be stopped bystop
by default - The chaos reporter (i.e. the
Pod
/Job
) must:- use an image with the command line tools
bash
,curl
andjq
- contain the helm test hook annotation:
helm.sh/hook: test
- have permission to
GET
andPATCH
theChaosEngines
results via the Kubernetes API (i.e. the necessary RBAC objects should be deployed as part of the helm chart)
- use an image with the command line tools
The Chaos Reporter: Querying Chaos Experiments
With the prerequisites set up, we now need to define the container within the Pod
/Job
that will execute the tests. The tests will be in one bash script and will involve repeatedly running the curl
command against the kubernetes.default.svc
DNS name, which is the recommended approach to querying the API within a pod. With these curl
commands, we’ll be able to patch and query the ChaosEngine
.
At the start of the script, we’ll need to fetch the token of the service account within the pod. We will need this token to run our curl
commands with the correct permissions assigned to the service account.
KUBE_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -sSk -H "Authorization: Bearer $KUBE_TOKEN" $URL
The URL
follows the convention https://kubernetes.default.svc:443/apis/$GROUP_NAME/$VERSION/$NAMESPACE/$RESOURCE_TYPE/$RESOURCE_NAME
In this case:
- The group name is
litmuschaos.io
- The version is
v1alpha1
- The resource type is
chaosengine
- The namespace and resource name can be anything
The output of the curl
command would be a description of the deployed ChaosEngine
- which includes the engine status, the chaos experiments and the results of the experiments.
Starting & Stopping Chaos Experiments
To start the experiment by patching the ChaosEngine
we’ll need to modify the curl
command to:
- Set the
Content-Type
header toapplication/json-patch+json
- Set the request type to
PATCH
- Provide data for the patch request
An example is below:
curl -sSk -H "Authorization: Bearer $KUBE_TOKEN" \
-H "Content-Type: application/json-patch+json" \
--request PATCH \
--data '[{"op": "replace", "path": "/spec/engineState", "value": "active"}]' \
$URL
To stop the ChaosEngine
you would set the value to stop
instead of active
.
Test Algorithm
Now that we’re aware of how to get a pod to query and patch the ChaosEngine
we can define the following algorithm for our tests:
- Run app health check
- Patch the
ChaosEngine
and activate/trigger the experiments - Continuously poll the
ChaosEngine
until the experiments arecompleted
- Get the verdict/result of all the chaos experiments
- Patch the
ChaosEngine
and disable/stop the experiments - Re-run app health check
You can find example code for this (and the helm chart) here. The gif below is a demonstration of the workflow in action.
The experiment would take at least a minute to complete, so the gif was cut short. In the gif, you saw the logs for the tests were in the helm-test-chaos-reporter
pod. If you used the --logs
flag with helm test
the logs will be shown as the output (alongside other details) of the helm test
command once the test either passes or fails (The image below shows the tail end of the output).
DevOps Engineer Final Thoughts on LitmusChaos
This is as far as we’re going to go with LitmusChaos today, but by combining LitmusChaos with Helm we’re able to easily configure chaos experiments for specific Helm charts/releases (i.e. Kubernetes applications) and run them.
With this approach and the right chaos experiments Helm chart maintainers will be able to create a test suite to confirm the resilience of the Kubernete application deployed as part of a Helm chart and include this test suite within a CI pipeline that’ll be able to provide ample feedback to developers.
We’re looking forward to integrating this new workflow in our internal projects and exploring the other CNCF sandbox projects.