Home Lab: Part 6 - Replacing MACvlan with a Bridge

This article is part of the Home Lab series.

    In previous posts, I leveraged the MACvlan CNI to forward packets between containers and the rest of my network. However, I ran into several issues rooted in the fact that MACvlan traffic bypasses several parts of the host’s IP stack, including conntrack and IPTables. This conflicted with how Kubernetes expects to handle routing and meant we had to bypass and modify IPTables chains to get it to work.

    While I got it to work, there was simply too much wire bending involved, and I wanted to investigate alternatives to see if anything fit my requirements better. Let’s consider the bridge CNI.

    To recap what we’re looking for in this CNI: we want to run pods on the same subnet as my home LAN, which ultimately requires some kind of layer 2 bridge combined with a DHCP IPAM. Nothing pre-existing fully supports this situation, so I ended up modifying and extending existing CNIs.

    Bridge CNI

    The bridge CNI’s IP stack

    The bridge stack is slightly different than the MACvlan stack. On the host side, we now have point-to-point adapters (prefixed with veth*). These are added to the bridge, and traffic can be forwarded between the adapters attached to the bridge.
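    To make this concrete, here is a minimal sketch (not the plugin’s actual code; the function name is mine) of roughly what the bridge plugin does for each pod using the netlink package: create a veth pair and attach the host end to the bridge. The peer end is then moved into the pod’s network namespace.

    import "github.com/vishvananda/netlink"

    // attachVethToBridge creates a veth pair and enslaves the host end to the bridge.
    // The peer end (podVethName) would then be moved into the pod's network namespace.
    func attachVethToBridge(bridgeName, hostVethName, podVethName string) error {
      // Look up the bridge created by the CNI (e.g. cni0)
      br, err := netlink.LinkByName(bridgeName)
      if err != nil {
        return err
      }

      // Create the veth pair: one end stays on the host, the peer goes into the pod
      veth := &netlink.Veth{
        LinkAttrs: netlink.LinkAttrs{Name: hostVethName},
        PeerName:  podVethName,
      }
      if err := netlink.LinkAdd(veth); err != nil {
        return err
      }

      // Attach the host end to the bridge so its traffic can be forwarded
      if err := netlink.LinkSetMaster(veth, br); err != nil {
        return err
      }
      return netlink.LinkSetUp(veth)
    }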

    Starting with the reference bridge CNI along with the following configuration:

    {
      "cniVersion": "0.3.1",
      "name": "dhcp-cni-network",
      "plugins": [
        {
          "type": "bridge",
          "name": "mybridge",
          "ipam": {
            "type": "dhcp"
          }
        }
      ]
    }
    

    Unfortunately, the reference bridge CNI gives us the following errors in the Kubelet log:

    "Error adding pod to network" err="error calling DHCP.Allocate: no more tries" pod="metallb/metallb-controller-7cb7dd579d-8zlgr"
    

    The DHCP daemon isn’t receiving any responses from the DHCP server. Although the daemon runs on the host network, it switches into the Pod’s network namespace to send the DHCP request packets. Taking a look at the pod’s network namespace, I see that the requests are being sent, but no responses are received:

    [rancher@rancher ~]$ sudo docker run -ti --rm --net=container:e6d4baa7820f crccheck/tcpdump -i any
    IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from aa:9c:3e:ae:d1:68 (oui Unknown), length 336
    IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from c2:be:1c:d3:d4:ba (oui Unknown), length 336
    

    Looking at the host’s network adapters, we can see that the requests are making it to the bridge, but they aren’t being sent out on eth0, so the rest of the network never hears them.

    [rancher@rancher ~]$ sudo docker run -ti --rm --net=host --cap-add NET_ADMIN crccheck/tcpdump -i any -f 'udp port 67 or udp port 68'
    IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 82:a0:15:d9:51:02 (oui Unknown), length 336
    IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 82:a0:15:d9:51:02 (oui Unknown), length 336
    

    Additionally, there are no routes at all in the Pod’s netns, so nothing will ever work because the Pod has no idea where to send packets. Luckily, broadcast packets don’t need to be routed, which is how they manage to reach the host’s bridge adapter.

    [rancher@rancher ~]$ sudo docker run -ti --rm --net=container:e6d4baa7820f igneoussystems/iproute2 ip route
    [rancher@rancher ~]$
    

    This should be an easy fix since the bridge plugin has two relevant configuration options: isGateway and isDefaultGateway. We should be able to set one of these to true and have it work. Unfortunately, the plugin uses the gateway returned by the IPAM plugin (see here). In the DHCP IPAM case, that is the IP of the network’s router (192.168.2.1), not the local host (192.168.2.125), which is where we want all traffic to be forwarded so it passes through IPTables.
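    For illustration, here is a rough sketch (not code from the plugin; the helper name is mine) of what the DHCP IPAM plugin hands back to a chained plugin. The Gateway field is whatever the DHCP server advertised as the router, which on my network is 192.168.2.1 rather than the host:

    import (
      "log"

      "github.com/containernetworking/cni/pkg/types"
      "github.com/containernetworking/cni/pkg/types/current"
    )

    // logIPAMResult prints the addresses and gateways returned by the IPAM plugin
    func logIPAMResult(r types.Result) error {
      ipamResult, err := current.NewResultFromResult(r)
      if err != nil {
        return err
      }
      for _, ip := range ipamResult.IPs {
        // With the DHCP IPAM, ip.Gateway is the network's router (192.168.2.1),
        // not the host (192.168.2.125) we actually want traffic to go through
        log.Printf("lease: addr=%s gw=%s", ip.Address.String(), ip.Gateway)
      }
      return nil
    }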

    The fix for this is the same as in the MACvlan CNI. As part of the CNI, I need to define routes that forward all traffic to the host’s IP. I modified the bridge CNI code here. Ultimately, it gets the host’s primary IPv4 address (IPv6 to come later) and creates a default route to send all traffic to the host. Note that we again clear out any routes already present in the Pod’s namespace before adding our own.
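    The modified code below picks the gateway out of uplinkAddrs. As a rough sketch of where that comes from (simplified; the function name is mine rather than the repo’s), the uplink interface and its IPv4 addresses can be looked up with netlink:

    import (
      "fmt"

      "github.com/vishvananda/netlink"
    )

    // getUplinkAddrs returns the IPv4 addresses assigned to the host's uplink
    // interface (e.g. eth0); the first one is used as the pods' gateway
    func getUplinkAddrs(uplinkName string) ([]netlink.Addr, error) {
      uplinkLink, err := netlink.LinkByName(uplinkName)
      if err != nil {
        return nil, fmt.Errorf("failed to find uplink %q: %v", uplinkName, err)
      }
      addrs, err := netlink.AddrList(uplinkLink, netlink.FAMILY_V4)
      if err != nil {
        return nil, err
      }
      if len(addrs) == 0 {
        return nil, fmt.Errorf("no IPv4 address configured on %q", uplinkName)
      }
      return addrs, nil
    }

    With that in place, the relevant part of the modified bridge CNI looks like this: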

    gwIp := uplinkAddrs[0].IP
    err = netns.Do(func(_ ns.NetNS) error {
      containerLink, err := netlink.LinkByName(args.IfName)
    
      routes, _ := netlink.RouteList(containerLink, netlink.FAMILY_ALL)
      for _, route := range routes {
        err = netlink.RouteDel(&route)
      }
    
      // This route tells the OS that 192.168.2.125/32 can be found on eth0
      // Before we can set a default route, Linux needs to know where to find the gateway
      err = netlink.RouteAdd(&netlink.Route{
        LinkIndex: containerLink.Attrs().Index,
        Scope:     netlink.SCOPE_LINK,
        Dst:       netlink.NewIPNet(gwIp),
      })
    
      // This route tells the OS to forward 0.0.0.0/0 (all traffic, even on the local LAN)
      // to 192.168.2.125. It knows that 192.168.2.125 is on the eth0 interface
      err = netlink.RouteAdd(&netlink.Route{
        LinkIndex: containerLink.Attrs().Index,
        Gw:        gwIp,
        Src:       ipamResult.IPs[0].Address.IP,
      })

      return err
    })
    

    Great, now the Pod has the correct routes:

    [rancher@rancher ~]$ sudo docker run -ti --rm --net=container:e6d4baa7820f igneoussystems/iproute2 ip route
    default via 192.168.2.125 dev eth0 src 192.168.2.167
    192.168.2.125 dev eth0 proto kernel scope link
    

    DHCP still doesn’t work though. Looking back at the IP stack diagram, there’s a missing link from the bridge to eth0.

    This is confirmed using the brctl command:

    [rancher@rancher ~]$ sudo docker run -ti --rm --net=host igneoussystems/iproute2 brctl show
    bridge name     bridge id               STP enabled     interfaces
    cni0            8000.00155d02cb02       no              veth0c90ef73
    

    That means we need to add the eth0 interface to the bridge. This is done here. The code (error handling removed) below shows how it works. First, we need to copy the IP address from eth0 to the bridge, because the bridge interface will effectively replace eth0 as the primary interface handling all traffic, even for this host itself. Then we call LinkSetMaster to add eth0 to the bridge.

    // Copy the IPv4 address from eth0 to the bridge
    addrs, err := netlink.AddrList(br, netlink.FAMILY_V4)
    gwIp := uplinkAddrs[0].IP
    foundAddr := false
    for _, addr := range addrs {
      if addr.IP.Equal(gwIp) {
        foundAddr = true
        break
      }
    }
    if !foundAddr {
      addr := &netlink.Addr{
        IPNet: netlink.NewIPNet(gwIp),
      }
      err = netlink.AddrAdd(br, addr)
    }
    
    // Add the uplink interface to the bridge if it isn't already there
    // If MasterIndex == 0, then the interface isn't part of a bridge
    // If MasterIndex != BridgeIndex, then the interface is part of a different bridge
    if uplinkLink.Attrs().MasterIndex != br.Attrs().Index && uplinkLink.Attrs().MasterIndex != 0 {
       // Fail
    }
    err = netlink.LinkSetMaster(uplinkLink, br)
    

    Unfortunately, this caused my SSH connection to drop after a minute and still didn’t let traffic through. To fix this, we need to move the routes over to the bridge interface, since it needs to effectively replace eth0 as the primary interface.

    In the code below, we get the routes defined on eth0 so we can add them to the bridge. This failed at first with Linux giving a syscall error.

    This was tricky to figure out, but in Linux you can’t define a route through a gateway that Linux doesn’t already know how to reach. For example, defining the route default via 192.168.2.125 means you also need a route that tells Linux where to find 192.168.2.125. In most networks you get a route for free, 192.168.2.0/24 dev eth0, that tells Linux that IP is reachable on the eth0 interface, but in our case we’re explicitly defining all routes, so that doesn’t apply. We need to define 192.168.2.125/32 dev eth0 first, then define default via 192.168.2.125.

    As a simple trick, I sort the routes based on their mask length so more specific routes appear first in the slice.

    After sorting, I remove each route from eth0 and add it to the bridge.

    routes, err := netlink.RouteList(uplinkLink, netlink.FAMILY_V4)
    if len(routes) > 0 {
      // Sort routes so that most specific routes appear first. This is to avoid an issue where we can't create a
      // default route until the subnet route is available
      sort.Slice(routes, func(i, j int) bool {
        // A nil Dst (or nil mask) means the route matches everything (e.g. the
        // default route), so it is the least specific and should sort last
        if routes[i].Dst == nil || routes[i].Dst.Mask == nil {
          return false
        }
        if routes[j].Dst == nil || routes[j].Dst.Mask == nil {
          return true
        }
        l, _ := routes[i].Dst.Mask.Size()
        r, _ := routes[j].Dst.Mask.Size()
        return l > r
      })
      for _, route := range routes {
        err = netlink.RouteDel(&route)
        route.LinkIndex = br.Index
        err = netlink.RouteAdd(&route)
      }
    }
    

    Now I have a route table that looks like this, and my pods work on RancherOS:

    rancher@rancher$ ip route
    [...]
    default via 192.168.2.1 dev cni0 src 192.168.2.125 metric 203
    192.168.2.0/24 dev cni0 proto kernel scope link src 192.168.2.125 metric 203
    192.168.2.92 dev veth3ec79a35 scope link
    [...]
    

    Of course, what would this blog series be without a new problem to solve? When I tried running this on an Ubuntu machine, I encountered more networking issues: DHCP requests were not making it out to the network.

    As it turns out, there’s a difference in the default IPTables rule set between Ubuntu Server and RancherOS.

    In RancherOS, the FORWARD chain has a default value of ACCEPT:

    [rancher@rancher ~]$ sudo iptables -L -v
    [...]
    Chain FORWARD (policy ACCEPT 15883 packets, 2853K bytes)
    [...]
    

    Whereas, Ubuntu Server has a default value of DROP:

    user@ubuntu:~$ sudo iptables -L -v
    [...]
    Chain FORWARD (policy DROP 2735 packets, 495K bytes)
    [...]
    

    This means we’re going to have to manage IPTables rules that permit each pod to communicate with the network. Stay tuned for the next post, where we extend the CNI to manage IPTables rules.
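    As a teaser for that post, here is a rough sketch of the kind of rule that will need to be managed, using the go-iptables library (hypothetical, not the eventual implementation): explicitly allow forwarding to and from each pod’s IP so a default DROP policy on the FORWARD chain doesn’t black-hole its traffic.

    import "github.com/coreos/go-iptables/iptables"

    // allowPodForwarding adds FORWARD rules permitting traffic to and from a pod's IP
    func allowPodForwarding(podIP string) error {
      ipt, err := iptables.New()
      if err != nil {
        return err
      }
      // Allow traffic originating from the pod to be forwarded
      if err := ipt.AppendUnique("filter", "FORWARD", "-s", podIP, "-j", "ACCEPT"); err != nil {
        return err
      }
      // Allow traffic destined for the pod to be forwarded
      return ipt.AppendUnique("filter", "FORWARD", "-d", podIP, "-j", "ACCEPT")
    }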
