How to gain access to an RKE2 cluster without Rancher when the CNI doesn't work

In my previous post, I outlined some challenges that I’ve encountered with Rancher. As part of the feedback to that post, I ended up having to rebuild one of my clusters, and I took that time to try out RKE2 and K3s for my home lab. In this home lab, I use a custom CNI based on the official Bridge and DHCP IPAM CNIs (Read more) to enable my smart home software (HomeAssistant) to communicate with other devices on the same Layer 2 domain.

However, it seems that if you try to spin up an RKE2 cluster on a host that already has a bridge interface set up (See here), it gets stuck during provisioning and you can’t download a kubeconfig from Rancher Server, because Rancher thinks the cluster is offline. I reported this issue initially here.

In this blog post, I explain more about the problem and how to connect directly to the cluster to install a working CNI, so that the cluster can finish provisioning in Rancher.

Problem Continued

In this cluster, I set up a single Ubuntu Server node that has a bridge interface configured exactly as I’ve done before (See here). I’ve configured the cluster’s CNI as cni: multus,calico.
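
For reference, this is roughly how that CNI selection looks when expressed as a standalone RKE2 server config file. I set it through Rancher rather than by hand, so treat this as a sketch of the equivalent configuration rather than what Rancher actually writes out:

# /etc/rancher/rke2/config.yaml (sketch; when provisioning through Rancher
# the same setting is passed down to the node for you)
cni:
- multus
- calico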

$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master cni0 state UP group default qlen 1000
    link/ether 00:15:5d:02:cb:08 brd ff:ff:ff:ff:ff:ff
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:06:25:9b:31 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
4: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:15:5d:02:cb:08 brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.241/24 brd 192.168.2.255 scope global dynamic cni0
    inet6 fe80::215:5dff:fe02:cb08/64 scope link
       valid_lft forever preferred_lft forever

$ ip route
default via 192.168.2.1 dev cni0 proto dhcp src 192.168.2.241 metric 1024
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.2.0/24 dev cni0 proto kernel scope link src 192.168.2.241
192.168.2.1 dev cni0 proto dhcp scope link src 192.168.2.241 metric 1024

There’s a valid route outwards; however, Calico can’t start. As far as I can tell, this is because Calico’s first-found IPv4 autodetection skips interfaces whose names match its default exclusion patterns (which appear to include the cni* prefix), so the only global IPv4 address on the host, the one on cni0, is never considered. The calico-node container reports:

$ sudo crictl --runtime-endpoint=unix:///run/k3s/containerd/containerd.sock ps -a
CONTAINER           IMAGE               CREATED             STATE               NAME                       ATTEMPT             POD ID
ac2bab78f970e       c59896fc7ca44       3 minutes ago       Exited              calico-node                8                   c9c4aa34f68a9

$ sudo crictl --runtime-endpoint=unix:///run/k3s/containerd/containerd.sock logs ac2bab78f970e
...
2022-04-10 18:12:09.146 [WARNING][10] startup/startup.go 710: Unable to auto-detect an IPv4 address: no valid IPv4 addresses found on the host interfaces
2022-04-10 18:12:09.146 [WARNING][10] startup/startup.go 477: Couldn't autodetect an IPv4 address. If auto-detecting, choose a different autodetection method. Otherwise provide an explicit address.
2022-04-10 18:12:09.146 [INFO][10] startup/startup.go 361: Clearing out-of-date IPv4 address from this node IP=""
2022-04-10 18:12:09.150 [WARNING][10] startup/utils.go 48: Terminating
Calico node failed to start

Rancher Server shows the following logs:

[INFO ] waiting for at least one bootstrap node
[INFO ] provisioning bootstrap node(s) custom-3bfcd9ce3995: waiting for agent to check in and apply initial plan
[INFO ] provisioning bootstrap node(s) custom-3bfcd9ce3995: waiting on probes: etcd, kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] provisioning bootstrap node(s) custom-3bfcd9ce3995: waiting on probes: etcd, kube-apiserver, kube-controller-manager, kube-scheduler
[INFO ] provisioning bootstrap node(s) custom-3bfcd9ce3995: waiting on probes: kube-apiserver, kube-controller-manager, kube-scheduler
[INFO ] non-ready bootstrap machine(s) custom-3bfcd9ce3995: waiting for cluster agent to be available and join url to be available on bootstrap node

The cluster will never progress, because Rancher needs to launch the cattle-cluster-agent, which itself needs a working CNI to start. However, we can’t fix the CNI, because Rancher won’t give us a kubeconfig that would let us connect to the cluster and deploy a working configuration.

RKE2 - Get a valid credential

Since I have full access to the host running the RKE2 cluster, I should be able to gain access to it somehow. Each Kubernetes pod deployed to a host gets a projected service-account volume mounted inside its containers, which it can use to authenticate to the Kubernetes API server. Most of these service accounts have very few privileges, but if we can find one with enough privileges to create the resources we need, we can get the cluster working.
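
You don’t have to guess which pods carry such a volume; the kubelet keeps them all under /var/lib/kubelet/pods. A quick way to enumerate them (a sketch, assuming the default RKE2 data paths):

# Each match is one pod's projected service-account volume, containing
# ca.crt, namespace and token for that pod's service account.
$ sudo find /var/lib/kubelet/pods -type d -name 'kube-api-access-*'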

In this cluster, I enabled the Kubernetes API endpoint in Rancher, which deployed a container called kube-api-auth. Luckily, this container’s service account grants all the privileges we need.

$ sudo ctr --address /run/k3s/containerd/containerd.sock --namespace k8s.io c ls | grep kube-api-auth
57d148cadbdceb998ab7be8e38f72dec1fa0fe8c6f313dcab19e09ba9245eb1f    docker.io/rancher/kube-api-auth:v0.1.6                                io.containerd.runc.v2

There may be two containers displayed. One of them is the pause container, which serves as a special init process for the pod. If you want to know why, here’s a good blog post.

Inspect the container and look for the volume mount kube-api-access:

$ sudo ctr --address /run/k3s/containerd/containerd.sock --namespace k8s.io c info 57d148cadbdceb998ab7be8e38f72dec1fa0fe8c6f313dcab19e09ba9245eb1f | grep kube-api-access
  "source": "/var/lib/kubelet/pods/972647f6-e514-40c6-a0a0-6891898a2dec/volumes/kubernetes.io~projected/kube-api-access-vj98z"
$ sudo ls /var/lib/kubelet/pods/972647f6-e514-40c6-a0a0-6891898a2dec/volumes/kubernetes.io~projected/kube-api-access-vj98z
ca.crt  namespace  token

The JWT can be extracted from the token file. With it, we’re effectively going to impersonate this container and use the privileges that it has:

$ sudo cat /var/lib/kubelet/pods/972647f6-e514-40c6-a0a0-6891898a2dec/volumes/kubernetes.io~projected/kube-api-access-vj98z/token

eyJhbGciOi...My
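
If you want to check which service account the token actually belongs to before using it, you can decode the JWT payload. A rough sketch, assuming jq is available on the host (the payload is unpadded base64url, hence the tr and padding fix-up):

# The 'sub' claim in the output names the service account behind the token.
$ sudo cat /var/lib/kubelet/pods/972647f6-e514-40c6-a0a0-6891898a2dec/volumes/kubernetes.io~projected/kube-api-access-vj98z/token \
    | cut -d. -f2 \
    | tr '_-' '/+' \
    | awk '{ pad=(4-length($0)%4)%4; printf "%s%s\n", $0, substr("==", 1, pad) }' \
    | base64 -d | jq .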

kubectl also needs the cluster’s CA certificate so it can validate the API server’s TLS certificate:

$ sudo cat "/var/lib/kubelet/pods/22d4bf53-2f87-4d58-9272-9c4d0bad47f2/volumes/kubernetes.io~projected/kube-api-access-tl4qz/ca.crt" | base64 -w 0 ; echo

LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJlVENDQVIrZ0F3SUJBZ0lCQURBS0JnZ3Foa2pPUFFRREF[...]RU5EIENFUlRJRklDQVRFLS0tLS0K

Edit ~/.kube/config and insert the content:

apiVersion: v1
kind: Config
clusters:
- name: "my-new-cluster"
  cluster:
    server: "https://**{myip}**:6443"
    certificate-authority-data: "**{base64d ca.crt}**"

users:
- name: "my-new-user"
  user:
    token: "**{contents of /token}**"

contexts:
- name: "new-cluster"
  context:
    user: "my-new-user"
    cluster: "my-new-cluster"

current-context: "new-cluster"
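
Before deploying anything, it’s worth checking that the API server accepts the token and seeing what it’s allowed to do. How much the kube-api-auth service account is permitted to do may vary by Rancher version, so treat this as a sanity check:

# Confirm the credentials work and the node is reachable.
$ kubectl get nodes

# List what this service account is permitted to do.
$ kubectl auth can-i --list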

After that, you should be able to use kubectl to deploy whatever resources you need; in my case, that meant fixing Calico’s IP autodetection (see the sketch below). Remember to download a fresh kubeconfig from Rancher once the cluster finishes provisioning, so you’re not left using these system-level credentials.
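
In my setup, the fix to deploy was a tweak to Calico’s IP autodetection so that it stops ignoring the cni0 bridge. One way to express that is a HelmChartConfig for the bundled rke2-calico chart. This is a sketch: it assumes your RKE2 version ships Calico via that chart and honours the operator’s nodeAddressAutodetectionV4 setting, so adjust the interface name (and the approach) to match your environment:

# calico-autodetect.yaml - apply with: kubectl apply -f calico-autodetect.yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-calico
  namespace: kube-system
spec:
  valuesContent: |-
    installation:
      calicoNetwork:
        nodeAddressAutodetectionV4:
          interface: cni0

Once calico-node can detect an address and comes up, the cattle-cluster-agent should start, and Rancher’s provisioning should continue.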

