High Availability Requirement
You’ve finally containerised your cloud native web application, and its components are redundant and scalable. You’ve constructed your ReplicationControllers and Services. The only remaining single point of failure is your cluster master itself! In this post, we’re going to look at setting up a high availability Kubernetes master.
Master Node Building Blocks
The first step to building a reliable master node is to start with a monitoring system that restarts failed processes on a single host. Since we use CoreOS, we use systemd, although the official HA documentation uses monit. Either way, set up your process monitor to start the fundamental building blocks:
- etcd - holds the cluster state
- container runtime - Docker in this case, but rkt support is looming on the horizon
- container networking fabric - this can be cloud provider specific, but we use Flannel for now
- Kubelet process
Why use the Kubelet as another process manager when we’re already using systemd? Because Kubelet manifests are the same files that you could otherwise pass directly to a running cluster, you can add resource limits and update binaries by updating their container images, without rebuilding your unit files. It’s also convenient to just drop in any cluster-ready services you may have already constructed elsewhere.
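For example, here’s a minimal sketch of what a static manifest dropped into the Kubelet’s manifest directory might look like - the name, image and resource limits below are placeholders rather than anything from a real cluster:

```yaml
# /etc/kubernetes/manifests/example.yaml - anything placed in the directory
# the Kubelet watches is started automatically, no unit file required.
apiVersion: v1
kind: Pod
metadata:
  name: example-component
spec:
  hostNetwork: true
  containers:
  - name: example-component
    image: example.registry/example-component:1.0   # placeholder image
    resources:
      limits:
        cpu: 100m        # resource limits live in the manifest,
        memory: 256Mi    # not in a systemd unit
```

Updating the component is then just a matter of bumping the image tag in the manifest.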
In order to provide a high availability data store, we will need to set up some clustered etcd instances. You can do this manually, or by using a discovery service - both methods are well documented. It’s also recommended you keep your etcd storage directory on a distributed filesystem, for example Gluster, Ceph or a cloud provider’s block device.
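As a rough sketch, a clustered etcd member configured via the discovery-service method might look something like this - the discovery URL, node name, IPs, image and binary path are all placeholders, and the same flags apply whether you launch etcd from a unit file or hand it to the Kubelet as a manifest:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: etcd-server
spec:
  hostNetwork: true
  containers:
  - name: etcd
    image: quay.io/coreos/etcd:v2.2.1   # example image/version
    command:
    - /etcd                             # binary path depends on the image you use
    - --name=master-0
    - --discovery=https://discovery.etcd.io/<token>       # from the discovery service
    - --advertise-client-urls=http://10.0.0.10:2379
    - --listen-client-urls=http://0.0.0.0:2379
    - --initial-advertise-peer-urls=http://10.0.0.10:2380
    - --listen-peer-urls=http://0.0.0.0:2380
    - --data-dir=/var/etcd/data                           # ideally on Gluster/Ceph/etc.
    volumeMounts:
    - name: etcd-data
      mountPath: /var/etcd/data
  volumes:
  - name: etcd-data
    hostPath:
      path: /var/etcd/data
```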
Configuration and Kubelet Manifests
You now need to configure your master node appropriately. If you’re doing this as an experiment, you can simply copy the configuration from an existing cluster. The appropriate files are:
- basic_auth.csv - basic auth user and password
- ca.crt - Certificate Authority cert
- known_tokens.csv - tokens that entities (e.g. the kubelet) can use to talk to the apiserver
- kubecfg.crt - Client certificate, public key
- kubecfg.key - Client certificate, private key
- server.cert - Server certificate, public key
- server.key - Server certificate, private key
The Kubelet process scans a specified directory for Kubernetes manifests, and executes them. To start with, we want to run a copy of the APIserver on each node (it’s stateless, so running it on each node permanently is fine). Edit the manifest to your liking, then add it and check that the Kubelet has started it using docker ps.
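An APIserver manifest along these lines is a reasonable starting point - the image, service CIDR and file paths below are examples and should match the configuration files described above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
spec:
  hostNetwork: true
  containers:
  - name: kube-apiserver
    image: gcr.io/google_containers/hyperkube:v1.1.2   # example image/version
    command:
    - /hyperkube
    - apiserver
    - --etcd-servers=http://127.0.0.1:2379        # the local clustered etcd member
    - --service-cluster-ip-range=10.100.0.0/16    # example range
    - --client-ca-file=/srv/kubernetes/ca.crt
    - --tls-cert-file=/srv/kubernetes/server.cert
    - --tls-private-key-file=/srv/kubernetes/server.key
    - --token-auth-file=/srv/kubernetes/known_tokens.csv
    - --basic-auth-file=/srv/kubernetes/basic_auth.csv
    - --secure-port=443
    volumeMounts:
    - name: srvkube
      mountPath: /srv/kubernetes
      readOnly: true
  volumes:
  - name: srvkube
    hostPath:
      path: /srv/kubernetes
```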
If everything has started correctly, you have created a stateless master node which you can scale horizontally by spinning up identical copies. You should run them behind a load balancer of your choosing, which is again cloud provider specific. Once this is done, when configuring your worker nodes, set their APIserver configuration to point to the load balancer and they should carry on working as usual! Note that your certificates may have to be regenerated to contain the IP address of your load balancer rather than the master nodes themselves!
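One way to wire that up is to give each worker a kubeconfig whose server entry is the load balancer rather than an individual master - a rough sketch, with the address and credential paths as placeholders:

```yaml
apiVersion: v1
kind: Config
clusters:
- name: local
  cluster:
    server: https://kube-master-lb.example.internal   # the load balancer, not a single master
    certificate-authority: /srv/kubernetes/ca.crt     # cert must also cover this name/IP
users:
- name: kubelet
  user:
    client-certificate: /srv/kubernetes/kubecfg.crt
    client-key: /srv/kubernetes/kubecfg.key
contexts:
- name: kubelet-context
  context:
    cluster: local
    user: kubelet
current-context: kubelet-context
```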
Podmaster and Leader Election
While the Kubernetes APIserver is stateless, the Scheduler and Controller-manager are not. Now that we are running multiple masters, we will need an implementation of leader election. Enter podmaster!
Podmaster is a small utility written in Go that uses etcd’s atomic “CompareAndSwap” functionality to implement master election. The first master to reach the etcd cluster wins the race and becomes the master node, marking itself with an expiring key that it periodically extends. If another instance finds the key has expired, it attempts to take over using an atomic request. If an instance is the current master, it copies the scheduler and controller-manager manifests into the Kubelet manifest directory, and if it isn’t it removes them. As all it does is copy files, it could be used for anything that requires leader election, not just Kubernetes!
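For reference, a podmaster manifest might look roughly like the sketch below, with one elector container per stateful component - the image, IP and file paths are placeholders, and the authoritative flags are documented with podmaster itself:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-podmaster
spec:
  hostNetwork: true
  containers:
  - name: scheduler-elector
    image: gcr.io/google_containers/podmaster:1.1    # example image
    command:
    - /podmaster
    - --etcd-servers=http://127.0.0.1:2379
    - --key=scheduler                                # the lock key contested in etcd
    - --whoami=10.0.0.10                             # this master's own IP
    - --source-file=/srv/kubernetes/manifests/kube-scheduler.yaml
    - --dest-file=/etc/kubernetes/manifests/kube-scheduler.yaml
    volumeMounts:
    - name: manifest-src
      mountPath: /srv/kubernetes/manifests
      readOnly: true
    - name: manifest-dst
      mountPath: /etc/kubernetes/manifests
  # a second, near-identical elector container handles the controller-manager,
  # differing only in --key and the source/dest file names
  volumes:
  - name: manifest-src
    hostPath:
      path: /srv/kubernetes/manifests
  - name: manifest-dst
    hostPath:
      path: /etc/kubernetes/manifests
```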
Final Kubernetes Architecture
The overall cluster architecture looks something like this:
In time, high availability is supposed to be baked into Kubernetes itself, but for now, the podmaster (of the universe) is considered best practice!