Replace Failed Kubernetes Etcd Member

I had a pretty knotty problem in my homelab. I am running a Kubernetes cluster with 3 masters and an embedded Etcd cluster. That means the Etcd cluster runs on the same nodes as the K8s API server and scheduler pods. Like them, it runs as Pods controlled directly by Kubelet (magic! except it isn't). The data on one of those members (node3) got corrupted, so naturally it would no longer join the cluster.

What you need to do is remove that (etcd) node from the cluster and recreate it. This is pretty simple, but needs a bit of under-the-bonnet knowledge. So how is this Pod configured?

I hinted at a bit of magic earlier. These pods are running in K8s, and visible in the kube-system namespace, but are not actually managed by the Kubernetes scheduler. They are managed by the Kubelet itself. Kubelet on each master watches /etc/kubernetes/manifests and will action any valid manifest files you place in that folder. When I installed the cluster with kubeadm, it created the following:

$ ls /etc/kubernetes/manifests/
etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml
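
If you are not sure which folder your own Kubelet watches, you can read the staticPodPath setting from its config file. This is only a minimal sketch: the helper name static_pod_path is mine, and the default location /var/lib/kubelet/config.yaml is an assumption based on where kubeadm normally writes the Kubelet config, so pass a different path if yours lives elsewhere.

```shell
# Print the staticPodPath value from a Kubelet config file.
# Assumption: the setting is written unquoted on a single line, and the
# config lives at /var/lib/kubelet/config.yaml unless a path is given.
static_pod_path() {
  config="${1:-/var/lib/kubelet/config.yaml}"
  awk '$1 == "staticPodPath:" { print $2 }' "$config"
}
```

Calling static_pod_path with no arguments on a master should print the manifests folder, if the assumptions above hold for your install.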

The part which interests me is in the spec.volumes key of etcd.yaml:

spec:
  volumes:
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
  - hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data

This tells me 2 things:

  1. The actual cluster data is stored in /var/lib/etcd on my physical node
  2. The certificates for cluster comms are in /etc/kubernetes/pki/etcd
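
If you would rather not read the YAML by eye, the hostPath entries can be pulled out with a little awk. This is a rough sketch rather than a real YAML parser: the helper name host_paths is hypothetical, and it assumes each hostPath key is followed by its path key, as in the snippet above.

```shell
# Print the path of every hostPath volume in a manifest file.
# Only a "path:" that follows a "hostPath:" line is printed, so other
# path keys elsewhere in the manifest (for example in probes) are skipped.
host_paths() {
  awk '/hostPath:/ { in_hp = 1; next }
       in_hp && $1 == "path:" { print $2; in_hp = 0 }' "$1"
}
```

Running host_paths /etc/kubernetes/manifests/etcd.yaml on a master should list the certificate and data directories.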

So now I need an etcdctl that can reach both the kube masters and those certificates. I actually had it on another machine in the lab, so I copied the pki/etcd contents to that machine, but you could put etcdctl on the broken master; it is just a binary.

You will need the member ID of your failed node (the hex string in the first column of the output):

export ETCDCTL="etcdctl --endpoints=https://<node1>:2379,https://<node2>:2379,https://<node3>:2379 \
    --cert /etc/kubernetes/pki/etcd/server.crt \
    --key /etc/kubernetes/pki/etcd/server.key \
    --cacert /etc/kubernetes/pki/etcd/ca.crt"
${ETCDCTL} member list
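
To avoid copying the ID by hand, you can filter the member list by node name. This is a sketch that assumes the default comma-separated output of etcdctl v3 (ID, status, name, peer URLs, client URLs, learner flag); check the column order against your own output first. The helper name member_id_for is hypothetical.

```shell
# Read `etcdctl member list` output on stdin and print the ID of the
# member whose name (third comma-separated field) matches the argument.
member_id_for() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}
```

With the ETCDCTL variable from above, that would be used as: ${ETCDCTL} member list | member_id_for <node3>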

Remove the failed node from the Etcd cluster:

${ETCDCTL} member remove <id-of-failed-node>

Then, on the failed node, simply move etcd.yaml to one side:

mv /etc/kubernetes/manifests/etcd.yaml .

The Kubelet will then stop the Etcd pod and you can clean up its corrupted data dir:

rm -rf /var/lib/etcd/member
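
If deleting the data outright feels too final, one alternative is to move the directory aside under a timestamped name and remove it once the member is healthy again. This is my own cautious variation, not something the procedure requires, and the helper name set_aside is hypothetical.

```shell
# Move a directory aside under a timestamped name instead of deleting it.
# Prints the new location on success.
set_aside() {
  src="$1"
  dest="${src}.bak.$(date +%Y%m%d%H%M%S)"
  mv "$src" "$dest" && printf '%s\n' "$dest"
}
```

For this case that would be: set_aside /var/lib/etcd/member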

Restart the pod by moving the manifest back:

mv etcd.yaml /etc/kubernetes/manifests/

That will restart the pod, but you still need to add it to the cluster:

${ETCDCTL} member add --peer-urls=https://<node3>:2380 <node3>

It will probably take a couple of restarts before it is properly healthy, but Kubelet will take care of that.

Before long you can run ${ETCDCTL} endpoint health and every endpoint should report healthy.
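
Rather than re-running the health check by hand, you could wrap it in a small retry loop. This is a generic sketch, not anything etcd-specific: retry_until_ok is a hypothetical helper that reruns whatever command you give it until it exits successfully or the attempts run out.

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
# Usage: retry_until_ok <attempts> <delay-seconds> <command...>
retry_until_ok() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Only wait if another attempt is coming.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, retry_until_ok 30 10 ${ETCDCTL} endpoint health would check every 10 seconds for up to 30 attempts; those numbers are arbitrary, so pick whatever suits your lab.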

Conclusion

Nothing was actually that complex, but I needed to know a couple of things about how K8s does things:

  1. Where kubeadm put the certificates
  2. That Kubelet watches /etc/kubernetes/manifests for static Pods (the folder is set by staticPodPath in the Kubelet config).