Home Lab: Part 5 - Problems with asymmetrical routing

This article is part of the Home Lab series.

    In the previous post (DHCP IPAM), we got our containers running with macvlan + DHCP. I also installed MetalLB, and everything seemingly worked. However, when I tried to retroactively add this setup to my existing Kubernetes home lab cluster, which was already running Calico, I was not able to access the MetalLB service: all connections timed out.

    A quick Wireshark packet capture of the situation exposed this problem:

    The SYN packet from my computer made it to the container (LB IP 192.168.6.2), but the SYN/ACK that came back had a source address of 192.168.2.76 (the pod's network interface). My computer ignored it because it didn't belong to any active flow, so the connection never established.
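
    You can reproduce the same observation from the client without Wireshark by capturing on its network interface (eth0 here is a placeholder for whatever interface your client actually uses):

    # SYNs go out to the LB IP, but the SYN/ACKs come back from the pod IP
    sudo tcpdump -ni eth0 'tcp and (host 192.168.6.2 or host 192.168.2.76)'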

    On the far side, the load balancer IP that MetalLB announces is destination NATed (DNAT): the destination IP is rewritten from 192.168.6.2 to the pod's IP 192.168.2.76. The response packets are then supposed to be source NATed (SNAT) so that the client computer only ever sees the LB IP address.

    Looking at iptables -t nat -L -v, we can see a difference between the two services' chains: one has a KUBE-MARK-MASQ rule, the other doesn't.

    Chain KUBE-FW-4IWBTEYTRLLAGWWJ (1 references)
     pkts bytes target     prot opt in     out     source               destination
        0     0 KUBE-MARK-MASQ  all  -- any    any     anywhere             anywhere             /* prometheus/prometheus-pushgateway:http loadbalancer IP */
        0     0 KUBE-SVC-4IWBTEYTRLLAGWWJ  all  -- any    any     anywhere             anywhere             /* prometheus/prometheus-pushgateway:http loadbalancer IP */
        0     0 KUBE-MARK-DROP  all  -- any    any     anywhere             anywhere             /* prometheus/prometheus-pushgateway:http loadbalancer IP */
    
    Chain KUBE-FW-BI5FYS4DEZGC5QLO (1 references)
     pkts bytes target     prot opt in     out     source               destination
       17   884 KUBE-XLB-BI5FYS4DEZGC5QLO  all  -- any    any     anywhere             anywhere             /* smarthome/pihole-tcp:http loadbalancer IP */
        0     0 KUBE-MARK-DROP  all  -- any    any     anywhere             anywhere             /* smarthome/pihole-tcp:http loadbalancer IP */
    

    At first, I thought Calico was doing something different between these two services, because when I switched back to Calico, it worked correctly. But that was only partially correct: there were two main differences.

    The first difference was that the service that worked had externalTrafficPolicy: Cluster, while the service that wasn't working had externalTrafficPolicy: Local. In Cluster mode, the source IP address is rewritten (SNATed) to the host's IP address, whereas Local mode only DNATs the packet. (Here is a useful blog post on the topic.)
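
    A quick way to check which policy a service is using, or to flip it while debugging, is kubectl (the service names here are the ones visible in the iptables output above; adjust to your own):

    # Check the policy on the service that was not working
    kubectl -n smarthome get svc pihole-tcp \
      -o jsonpath='{.spec.externalTrafficPolicy}{"\n"}'

    # Temporarily switch it to Cluster to confirm the behavioral difference
    kubectl -n smarthome patch svc pihole-tcp \
      -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'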

    Usually the link doing NAT will remember how it mangled a packet, and when a reply packet passes through the other way, it will do the reverse mangling on that reply packet, so everything works.

    IPTables Docs - NAT-HOWTO

    IPTables automatically handles the packets flowing in the reverse direction, but are our packets being processed by the host’s IPTables?
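
    One way to answer that is to look at the host's conntrack table while a connection attempt is in flight. If the host's stack were handling the flow, there would be an entry whose original destination is the LB IP (this assumes the conntrack tool is installed on the node):

    # No output here means the host never tracked the connection at all
    sudo conntrack -L 2>/dev/null | grep 192.168.6.2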

    My new containers were using the following routing table:

    root@macvlan-test-6fcfb775c9-zmbpc:/# ip route
    default via 192.168.2.1 dev eth0
    10.43.0.0/16 via 192.168.2.225 dev eth0
    192.168.2.0/24 dev eth0  proto kernel  scope link  src 192.168.2.187
    

    In this route table, all packets (except those for the K8s cluster range, which go via the host) are sent straight out the pod's macvlan interface toward the switch, bypassing the host's iptables rule set. In the previous post, this already caused problems with the K8s service routing.

    In other words, the reply packets went straight from the pod to the switch and on to the client, never passing back through the host where the reverse NAT would have happened.

    However, Calico used a different routing table where all traffic was routed through 169.254.1.1. Calico’s FAQ mentions this here.

    root@pihole-6f776b89bc-9lbw6:/# ip route
    default via 169.254.1.1 dev eth0
    169.254.1.1 dev eth0  scope link
    

    With Calico's route table, every packet leaving the pod is handed to the host first, so it passes through the host's iptables rule set and the reverse NAT can happen.

    Thus, we need to change the route tables so that everything flows through the host.
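
    Concretely, the goal inside the pod's network namespace is a route table with only two entries: an on-link route for the host's address and a default route pointing at it. As a rough sketch, the equivalent ip commands would be (using the host's 192.168.2.225 here; later in the post this gateway becomes 169.254.1.1 on mac0):

    # Target state inside the pod's network namespace (sketch)
    ip route flush dev eth0                          # drop the DHCP-installed routes
    ip route add 192.168.2.225 dev eth0 scope link   # the gateway is reachable on-link
    ip route add default via 192.168.2.225 dev eth0  # send everything else to the host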

    This seems like it would be easy to do, but the following config:

    {
      "cniVersion": "0.3.1",
      "name": "dhcp-cni-network",
      "plugins": [
        {
          "type": "macvlan",
          "name": "macvlan",
          "master": "eth0",
          "ipam": {
            "type": "dhcp"
          }
        },
        {
          "type": "route-override",
          "flushroutes": true,
          "addroutes": [
            { "dst": "0.0.0.0/0", "gw": "192.168.2.125" }
          ]
        }
      ]
    }
    

    Results in the following route table:

    root@macvlan-test-6fcfb775c9-zmbpc:/# ip route
    default via 192.168.2.225 dev eth0
    192.168.2.0/24 dev eth0  proto kernel  scope link  src 192.168.2.187
    

    The second route still allows traffic destined for the local subnet to bypass the host's IP stack. Clients outside the local subnet do work correctly, since their traffic follows the default route through the host, but clients on the same subnet still hit the asymmetric path.

    I was not able to get the routing tables corrected using the route-override CNI plugin. Setting flushroutes: true wouldn't delete this route, because the subnet route is installed by the kernel with link scope (proto kernel scope link) and this check skips any route whose scope is SCOPE_LINK:

    for _, route := range routes {
       if route.Scope != netlink.SCOPE_LINK {
          ^^^^^
          if route.Dst != nil {
    

    If you know how this mystery route is created, let me know in the comments.

    Instead, I ended up forking the CNI reference plugins and writing a custom plugin that sets up my routes explicitly, resulting in the following route table:

    default via 169.254.1.1 dev eth0
    169.254.1.1 dev eth0  scope link
    

    Partial success.

    Surprise Issues with MACvlan

    Now I'm able to reach the service through the LB, but I can't ping the pod IP directly, even though the pod can ping outwards. After much investigation, I tracked this down to an IPTables rule that was dropping INVALID connections.

    Chain KUBE-FORWARD (1 references)
     pkts bytes target     prot opt in     out     source               destination
      984  102K DROP       all  -- any    any     anywhere             anywhere             ctstate INVALID
    

    Why are these packets considered INVALID by conntrack when outbound pings from the pod itself succeed? My guess is that the inbound packets are again bypassing the host's IP stack and going directly to the pod's network stack, as macvlan is designed to do, so the host's conntrack only ever sees the reply half of the flow and flags it INVALID.
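
    One way to confirm this rule is the culprit is to watch its packet counter while pinging the pod from another machine. If the DROP counter climbs in step with the pings, this is where the replies are dying:

    # The ctstate INVALID DROP counter should tick up with every ping attempt
    sudo watch -n1 'iptables -L KUBE-FORWARD -v -n'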

    TBD: Figure out how to fix this. My temporary workaround is to insert a rule ahead of the DROP and force IPTables to accept any traffic coming from the containers.

    sudo iptables -I FORWARD 1 -i mac0 -j ACCEPT
    

    These problems suggest that macvlan may be the wrong technology to use here. I previously ruled out using a bridge because it performs worse and requires the kernel to 'learn' STP and MAC addresses, neither of which we should need.

    Writing a custom CNI Plugin

    Let’s review how I wrote a new CNI plugin:

    The CNI plugins repository contains a sample CNI plugin that we can use as a starting point. For brevity, I'm going to exclude error handling.

    link, err := netlink.LinkByName("eth0")
    
    addrs, err := netlink.AddrList(link, netlink.FAMILY_V4)
    mainIP := addrs[0]
    

    The default route needs to point towards the host’s IP address, so we need to grab that. In the future, this should be configurable.
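
    For reference, the same lookup from a shell on the host would look like this (assuming eth0 is the uplink, as in the configs above):

    # First IPv4 address on eth0; this is what the plugin uses as the gateway
    ip -4 -o addr show dev eth0 | awk '{print $4}'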

    err = netns.Do(func(_ ns.NetNS) error {
      var containerIFName string
      for _, netif := range result.Interfaces {
        if netif.Sandbox != "" {
          containerIFName = netif.Name
          link, _ := netlink.LinkByName(netif.Name)
          routes, _ := netlink.RouteList(link, netlink.FAMILY_ALL)
          for _, route := range routes {
            err = netlink.RouteDel(&route)
    
          }
        }
      }
      dev, _ := netlink.LinkByName(containerIFName)
    

    Now, we switch into the container's network namespace, iterate over its interfaces, and purge all of their routes. Nothing survives.

    route := &netlink.Route{
      LinkIndex: dev.Attrs().Index,
      Scope:     netlink.SCOPE_LINK,
      Dst:       netlink.NewIPNet(mainIP.IP),
    }
    err = netlink.RouteAdd(route)
    

    Next, we tell Linux which interface the gateway IP lives on, via a link-scoped route.

    err = netlink.RouteAdd(&netlink.Route{
      LinkIndex: dev.Attrs().Index,
      Gw:        mainIP.IP,
      Src:       prevResult.IPs[0].Address.IP,
    })
    

    Finally, we add the default route telling Linux to forward everything to the host’s IP stack.

    Then our CNI config (/etc/cni/net.d/0-bridge.conflist) looks like:

    {
      "cniVersion": "0.3.1",
      "name": "dhcp-cni-network",
      "plugins": [
        {
          "type": "macvlan",
          "name": "macvlan",
          "master": "eth0",
          "ipam": {
            "type": "dhcp"
          }
        },
        {
          "type": "route-fix",
          "master": "eth0"
        }
      ]
    }
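
    With the plugin binary in /opt/cni/bin and this conflist in place, a quick sanity check on a freshly scheduled pod shows whether the routes came out right (using the macvlan-test deployment from the route listings above):

    # The pod should now only have the host-pointing routes
    kubectl exec deploy/macvlan-test -- ip route

    # And traffic should still make it out to the physical network
    kubectl exec deploy/macvlan-test -- ping -c1 192.168.2.1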
    

    To simplify everything, I've also changed the mac0 interface's IP address to be explicitly 169.254.1.1, following Calico's model. This is all handled by the custom CNI setup I describe in the sections below.

    rancher:
      network:
        post_cmds:
        - /var/lib/macvlan-init.sh
    
    write_files:
    - container: network
      content: |+
        #!/bin/bash
        set -ex
        echo 'macvlan is up. Configuring'
        (
          MASTER_IFACE="eth0"
          ip link add mac0 link "$MASTER_IFACE" type macvlan mode bridge && ip link set mac0 up || true
          ip addr add 169.254.1.1 dev mac0 || true
        )
        # the last line of the file needs to be a blank line or a comment    
      owner: root:root
      path: /var/lib/macvlan-init.sh
      permissions: "0755"
    

    If you’re not using RancherOS, the important part is:

    MASTER_IFACE="eth0"
    ip link add mac0 link "$MASTER_IFACE" type macvlan mode bridge && ip link set mac0 up || true
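
    Either way, you can confirm the host-side macvlan interface came up and (per the full script above) carries the 169.254.1.1 address:

    # mac0 should be UP, in bridge mode, and hold 169.254.1.1
    ip -d link show mac0
    ip -4 addr show dev mac0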
    

    Custom CNI

    This is deployable with the following K8s YAML:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      labels:
        app: dhcp
        name: kube-dhcp
        tier: node
      name: kube-dhcp-daemon
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          name: kube-dhcp
      template:
        metadata:
          labels:
            app: dhcp
            name: kube-dhcp
            tier: node
        spec:
          containers:
          - env:
            - name: PRIORITY
              value: "0"
            image: ghcr.io/ajacques/k8s-dhcp-cni-helper:dev
            imagePullPolicy: Always
            lifecycle:
              preStop:
                exec:
                  command:
                  - /bin/sh
                  - -c
                  - rm /host/cni_net/$PRIORITY-bridge.conflist
            name: kube-dhcp
            resources:
              limits:
                cpu: 100m
                memory: 50Mi
              requests:
                cpu: 10m
                memory: 50Mi
            securityContext:
              privileged: true
            volumeMounts:
            - mountPath: /run
              name: run
            - mountPath: /host/cni_bin/
              name: cnibin
            - mountPath: /host/cni_net
              name: cni
          hostNetwork: true
          hostPID: true
          tolerations:
          - key: CriticalAddonsOnly
            operator: Exists
          - effect: NoSchedule
            operator: Exists
          - effect: NoExecute
            operator: Exists
          volumes:
          - hostPath:
              path: /run
              type: ""
            name: run
          - hostPath:
              path: /etc/cni/net.d
              type: ""
            name: cni
          - hostPath:
              path: /opt/cni/bin
              type: ""
            name: cnibin
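
    Deploying is then just an apply, followed by a check that the helper landed on every node (the manifest filename is arbitrary):

    # Roll out the CNI helper and confirm it is running on each node
    kubectl apply -f kube-dhcp-daemonset.yaml
    kubectl -n kube-system get pods -l name=kube-dhcp -o wide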
    

    Stay tuned for more work on the cluster.
