Home Lab: Part 4 - A DHCP IPAM

This article is part of the Home Lab series.

    In the previous post, we ended up abusing subnets and routing to get Calico to exist on the correct subnet, but what if we could get rid of Calico’s duplicate IPAM system and just depend on our existing DHCP server to handle reservations? In this post, we’re going to prototype a cluster that uses DHCP + layer 2 Linux bridging to avoid the complications outlined in Part 3.

    The official CNI documentation describes two plugins that could be relevant.

    With dhcp plugin the containers can get an IP allocated by a DHCP server already running on your network.

    https://www.cni.dev/plugins/current/ipam/dhcp/

    This avoids the overlapping-IPAM problems of the previous solution and means that the DHCP server already running on my network would be responsible for handing out IP addresses directly to the containers.
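
    One wrinkle worth noting: per the plugin’s documentation, the dhcp IPAM plugin doesn’t speak DHCP from inside each container on its own; it relies on a long-running daemon on every host that acquires and renews leases on the containers’ behalf. A minimal sketch of starting it by hand (it runs in the foreground):

    # Clear any stale socket left over from a previous run, then start the
    # daemon that brokers DHCP leases on behalf of containers.
    sudo rm -f /run/cni/dhcp.sock
    sudo /opt/cni/bin/dhcp daemon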

    That handles IP address assignment; now we need to be able to switch packets to the correct container interface. The documentation references both macvlan and ipvlan as possible switching options. Comparing the two, ipvlan exposes only a single MAC address to the network, whereas macvlan assigns a separate MAC address per container and exposes them all to the rest of the network. Ipvlan is generally recommended only when you need a single MAC address, such as when binding to a Wi-Fi adapter that permits only one MAC address per station. The sketch below makes the distinction concrete.
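
    A quick sketch using iproute2 directly (interface names are arbitrary, and eth0 is assumed to be the uplink):

    # macvlan: each sub-interface gets its own freshly generated MAC
    sudo ip link add demo0 link eth0 type macvlan mode bridge
    ip link show demo0   # note the new MAC address
    sudo ip link del demo0
    # ipvlan (L2 mode): sub-interfaces reuse eth0's MAC
    sudo ip link add demo0 link eth0 type ipvlan mode l2
    ip link show demo0   # note the MAC matches eth0's
    sudo ip link del demo0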

    I created a new cluster in Rancher with a new VM following my previous blog posts; however, in Rancher 2.6.1+ it seems that I am unable to access the cluster through Rancher when there’s no CNI plugin installed, so I used kubectl to connect to the cluster instead. This is possibly a regression from 2.6.0, and I need to get around to reporting it.
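
    For reference, a sketch of connecting directly, assuming you’ve exported a kubeconfig for the new cluster (the path here is hypothetical):

    # Point kubectl at the exported kubeconfig; the node will report
    # NotReady until a CNI plugin is installed, but the API still answers.
    export KUBECONFIG=~/lab-cluster.yaml
    kubectl get nodes -o wide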

    I didn’t find a k8s installer that would deploy and configure the macvlan + DHCP CNI correctly, so we’re going to need to do this manually. In a future blog post, I will package this up into a polished file that can be deployed. First, download the latest release of the CNI plugins from their GitHub releases page and extract it to the host’s /opt/cni/bin folder, so that you have /opt/cni/bin/dhcp.
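
    Roughly, assuming an amd64 host and v1.0.1 as the current release at the time of writing (check the releases page for the latest):

    # Version and architecture are assumptions; adjust to your setup.
    CNI_VERSION=v1.0.1
    sudo mkdir -p /opt/cni/bin
    curl -L "https://github.com/containernetworking/plugins/releases/download/${CNI_VERSION}/cni-plugins-linux-amd64-${CNI_VERSION}.tgz" \
      | sudo tar -xz -C /opt/cni/bin
    ls -l /opt/cni/bin/dhcp   # confirm the binary landed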

    Then create /etc/cni/net.d/15-bridge.conflist and reboot.

    {
      "cniVersion": "0.3.1",
      "name": "default-cni-network",
      "plugins": [
        {
          "type": "macvlan",
          "name": "macvlan",
          "master": "eth0",
          "ipam": {
            "type": "dhcp"
          }
        }
      ]
    }
    

    After the host came up, the DHCP requests were not making it out to the network, but they were visible on the VM’s network interface:

    [rancher@rancher ~]$ sudo docker run --net=host --rm crccheck/tcpdump -i any -f 'udp port 67 or udp port 68'
    listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
    06:09:33.665302 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 42:84:46:b7:d5:e5 (oui Unknown), length 272
    06:09:33.737211 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from f2:53:1c:66:f2:51 (oui Unknown), length 272
    06:09:33.993417 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 56:df:48:0a:4d:92 (oui Unknown), length 272
    06:09:43.426731 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from fe:fa:30:18:23:6f (oui Unknown), length 272
    
    ### On the Router:
    ubnt:~$ sudo tcpdump -i eth0 -f 'udp port 67 or udp port 68'
    listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
    ^C
    0 packets captured
    

    Some digging revealed that this is because I’m using macvlan, which gives each container its own MAC address, and Hyper-V was configured to block that for security. To fix this, check the “Enable MAC address spoofing” option in VM Settings > Network Adapter > Advanced Features (PowerShell’s Set-VMNetworkAdapter cmdlet exposes the same setting via its -MacAddressSpoofing parameter). My understanding is that ipvlan may not require this option, since it reuses the VM’s MAC address.

    Enabling MAC address spoofing in Hyper-V lets us use macvlan, but it could reduce security.

    After that, I restarted the DHCP container and, poof, we had reservations:

    12d74a2fe[...]/default-cni-network: lease acquired, expiration is 2021-10-23 06:13:25.746364924 +0000 UTC
    432fd1694[...]/default-cni-network: lease acquired, expiration is 2021-10-23 06:13:25.827760638 +0000 UTC
    

    Containers were coming up with the right IP addresses, and I was able to ping them from other computers, but I was not able to ping them from the host VM. This was odd; if anything, I would have expected the reverse. Apparently this is expected behavior for macvlan:

    Irrespective of the mode used for the macvlan, there’s no connectivity from whatever uses the macvlan (eg a container) to the lower device. This is by design, and is due to the way macvlan interfaces “hook into” their physical interface.

    https://backreference.org/2014/03/20/some-notes-on-macvlanmacvtap/

    This was also preventing kubelet from initializing the cluster:

    I1022 19:07:40.843340    1274 prober.go:116] "Probe failed" 
    probeType="Readiness" pod="kube-system/coredns-685d6d555d-pss58" 
    podUID=cecc8eb0-56f2-4fa3-aad2-518bcd5aec55 containerName="coredns" 
    probeResult=failure output="Get \"http://192.168.2.125:8181/ready\": 
    context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    

    This is because macvlan, by default, does not pass traffic between the containers and the host interface they’re attached to. To fix this, we give the host its own macvlan interface in bridge mode, so containers have a path to the host:

    # Create a host-side macvlan on top of eth0 in bridge mode...
    sudo ip link add mac0 link eth0 type macvlan mode bridge
    # ...give it the host's address, and bring it up. Containers can now
    # reach the host via mac0 even though eth0 itself is unreachable.
    sudo ip addr add 192.168.2.125/24 dev mac0
    sudo ip link set mac0 up
    

    The next problem I encountered was particularly insidious, partly because I was already running a K8s cluster in a separate VM.

    In Kubernetes, networking is complicated. The CNI is responsible for creating the network interface that each container uses; however, Kubernetes also has something called kube-proxy, which is responsible for exposing certain services, such as kube-dns and the main Kubernetes HTTPS endpoint. Each container automatically gets a K8s service account token and several environment variables pointing it to the correct IP address:

    rancher@rancher$ docker inspect {anyk8scontainer}
    "Mounts": [
        {
            "Type": "bind",
            "Source": "/opt/rke/var/lib/kubelet/pods/f5f70c5b-7526-4873-aa21-57dedf551e3d/volumes/kubernetes.io~projected/kube-api-access-p8mqs",
            "Destination": "/var/run/secrets/kubernetes.io/serviceaccount",
    
        }
    ],
    "Config": {
    
        "Env": [
            "KUBERNETES_SERVICE_HOST=10.43.0.1",
    

    Note how it provides the 10.43.0.1 address for Kubernetes! This IP address doesn’t match anything that we’ve previously configured in any of the CNI configuration. Kube-proxy uses iptables to fake these IP addresses:

    rancher@rancher$ sudo iptables-save
    -A KUBE-SEP-ECW7X2JHZ5GHPAME -p tcp -m comment --comment "default/kubernetes:https" -m tcp -j DNAT --to-destination 192.168.2.125:6443
    -A KUBE-SERVICES -d 10.43.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
    -A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https" -j KUBE-SEP-ECW7X2JHZ5GHPAME
    

    However, macvlan is special because packets from containers don’t get processed by the host’s iptables rules. Thus, this iptables magic doesn’t apply, and the packet gets forwarded out to the physical network. In my case, I was already running a separate k8s cluster, and my router was forwarding the traffic to the old cluster’s API endpoint, which led to requests quietly hitting the wrong cluster.
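
    You can confirm the host-side half of this on the node itself; the point is that a macvlan container’s packets never traverse these chains (a sketch, assuming the 10.43.0.0/16 service CIDR above):

    # On the host, the service IP is claimed by kube-proxy's NAT chains:
    sudo iptables -t nat -S KUBE-SERVICES | grep 10.43.0.1
    # A macvlan container's packets skip these chains entirely and are
    # switched straight out of eth0 toward the physical default gateway.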

    To fix this, I used the route-override CNI plugin to add a route sending 10.43.0.0/16 via the host’s IP, where the host’s iptables rules will apply. I downloaded this CNI plugin and extracted it to /opt/cni/bin/route-override. We then add the following plugin to the CNI configuration in /etc/cni/net.d/15-bridge.conflist and reboot:

    {
      "cniVersion": "0.3.1",
      "name": "default-cni-network",
      "plugins": [
        {
          "type": "macvlan",
          "name": "macvlan",
          "master": "eth0",
          "ipam": {
            "type": "dhcp"
          }
        },
        {
          "type": "route-override",
          "addroutes": [
            { "dst": "10.43.0.0/16", "gw": "192.168.2.125" }
          ]
        }
      ]
    }
    

    Both of these IP addresses are hard-coded and depend on the cluster configuration and the host IP, so when we expand to multiple hosts we’ll need to generalize this.
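
    A rough, untested sketch of what that could look like, assuming a hypothetical per-host template file (15-bridge.conflist.tmpl) with __GW__ and __SVC__ placeholders:

    # Discover this host's primary IPv4 address on eth0.
    HOST_IP=$(ip -4 -o addr show dev eth0 | awk '{ sub(/\/.*/, "", $4); print $4; exit }')
    SERVICE_CIDR="10.43.0.0/16"   # assumption: your cluster's service CIDR
    # Render the real conflist from the template.
    sed -e "s|__GW__|${HOST_IP}|g" -e "s|__SVC__|${SERVICE_CIDR}|g" \
      /etc/cni/net.d/15-bridge.conflist.tmpl | sudo tee /etc/cni/net.d/15-bridge.conflist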

    After this, all of my pods came up with IPs and were able to communicate successfully. However, roughly 12 hours later, the mac0 interface’s routes were removed and all networking stopped working.

    [...]
    I1027 06:50:01.302034 containerName="coredns" probeResult=failure output="Get \"http://192.168.2.93:8181/ready\": dial tcp 192.168.2.93:8181: i/o timeout (Client.Timeout exceeded while awaiting headers)"
    I1027 06:50:01.682678 containerName="coredns" probeResult=failure output="Get \"http://192.168.2.93:8080/health\": dial tcp 192.168.2.93:8080: connect: no route to host"
    

    This seems to coincide with the host’s DHCP client renewing the IP address for eth0.

    rancher@rancher$ sudo system-docker logs -t network
    
    2021-10-26T06:49:40.689173492Z Failed to connect to non-global ctrl_ifname: eth0  error: No such file or directory
    2021-10-26T06:49:40.689195892Z Failed to connect to non-global ctrl_ifname: mac0  error: No such file or directory
    2021-10-26T06:49:40.790714733Z sending signal TERM to pid 584
    2021-10-26T06:49:40.790741234Z waiting for pid 584 to exit
    2021-10-26T06:49:48.760578200Z netconf:info: Apply Network Config
    2021-10-26T06:49:48.770687300Z netconf:info: Running DHCP on eth0: dhcpcd -MA4 -e force_hostname=true --timeout 10 -w --debug eth0
    2021-10-26T06:49:49.988228300Z netconf:info: Checking to see if DNS was set by DHCP
    2021-10-26T06:49:49.988241300Z netconf:info: dns testing eth0
    2021-10-26T06:49:50.021991000Z netconf:info: dns was dhcp set for eth0
    2021-10-26T06:49:50.022006300Z netconf:info: DNS set by DHCP
    2021-10-26T06:49:50.022008500Z netconf:info: Apply Network Config SyncHostname
    2021-10-26T06:49:50.022010300Z netconf:info: Restart syslog
    

    Apparently the DHCP system is clearing out the mac0 interface configuration. To fix this, we can run the following command:

    sudo ros config merge
    
    write_files:
      - container: network
        path: /var/lib/macvlan-init.sh
        permissions: "0755"
        owner: root:root
        content: |
          #!/bin/bash
          set -ex
          echo 'macvlan is up. Configuring'
          (
            MASTER_IFACE="eth0"
            LOCAL_HOST_CIDR=$(ip addr show dev "$MASTER_IFACE" | grep -E '^\s*inet' | grep -m1 global | awk '{ print $2 }')
            ip link add mac0 link "$MASTER_IFACE" type macvlan mode bridge && ip addr add "$LOCAL_HOST_CIDR" dev mac0 && ip link set mac0 up || true
          )
          # the last line of the file needs to be a blank line or a comment
    rancher:
      network:
        post_cmds:
        - /var/lib/macvlan-init.sh
    
    [Ctrl-C]
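
    To verify the hook survives a renewal, a quick check after the next dhcpcd restart (or a reboot):

    # mac0 should still hold the host's address after the network restarts:
    ip addr show dev mac0
    ip route show dev mac0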
    

    Unfortunately, I don’t know of a good way to do this from Kubernetes, or whether it’s even necessary on non-RancherOS host VMs. Leave a comment below if you have a better suggestion.

    In the next post, I revisit this setup and find that macvlan causes some problems with K8s service routing, so read on for that.
