Home Lab: Part 5 – Problems with asymmetrical routing

This entry is part 5 of 6 in the series Home Lab

In the previous post (DHCP IPAM), we got our containers running with macvlan + DHCP. I also installed MetalLB, and everything seemingly worked. However, when I tried to retrofit this onto my existing Kubernetes home lab cluster, which was already running Calico, I was not able to access the MetalLB service. All connections were timing out.

A quick Wireshark packet capture of the situation exposed this problem:

The SYN packet from my computer made it to the container (LB IP 192.168.6.2), but the SYN/ACK that came back had a source address of 192.168.2.76 (the pod’s network interface). My computer ignored it because it didn’t belong to an active flow.

On the far side, the service routing set up for MetalLB is responsible for destination NAT (DNAT), rewriting the destination IP from 192.168.6.2 to the pod’s IP 192.168.2.76. The response packets are then supposed to be source NATed (SNAT) so that the client computer only ever sees the LB IP address.
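In simplified (hypothetical) iptables terms, the inbound half of that NAT looks something like this; the real rules live in the KUBE-* chains shown below:

```shell
# rewrite the LB IP to the pod IP on the way in
iptables -t nat -A PREROUTING -d 192.168.6.2 -j DNAT --to-destination 192.168.2.76
# conntrack reverses this automatically on reply packets -- but only if the
# replies actually pass back through this host's netfilter hooks
```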

Checking iptables -t nat -L -v, we see the difference between two chains: one has a KUBE-MARK-MASQ rule, the other doesn’t.

Chain KUBE-FW-4IWBTEYTRLLAGWWJ (1 references)
 pkts bytes target                    prot opt in  out source   destination
    0     0 KUBE-MARK-MASQ            all  --  any any anywhere anywhere  /* prometheus/prometheus-pushgateway:http loadbalancer IP */
    0     0 KUBE-SVC-4IWBTEYTRLLAGWWJ all  --  any any anywhere anywhere  /* prometheus/prometheus-pushgateway:http loadbalancer IP */
    0     0 KUBE-MARK-DROP            all  --  any any anywhere anywhere  /* prometheus/prometheus-pushgateway:http loadbalancer IP */

Chain KUBE-FW-BI5FYS4DEZGC5QLO (1 references)
 pkts bytes target                    prot opt in  out source   destination
   17   884 KUBE-XLB-BI5FYS4DEZGC5QLO all  --  any any anywhere anywhere  /* smarthome/pihole-tcp:http loadbalancer IP */
    0     0 KUBE-MARK-DROP            all  --  any any anywhere anywhere  /* smarthome/pihole-tcp:http loadbalancer IP */

At first, I thought Calico was doing something different between these two services, because when I switched back to Calico, it worked correctly. But this was only partially right; there were two main differences.

The first difference was that the working service had externalTrafficPolicy: Cluster, while the broken one had externalTrafficPolicy: Local. In Cluster mode, the source IP address is rewritten to the host’s IP address, whereas Local mode only DNATs the packet. (Here is a useful blog post.)
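For reference, this policy is set on the Service itself; a minimal, illustrative manifest looks like:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: pihole-tcp              # illustrative name
  namespace: smarthome
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local  # 'Cluster' would restore the SNAT behavior
  ports:
    - name: http
      port: 80
      targetPort: 80
```

Switching a Service to Cluster mode sidesteps the problem, at the cost of hiding the client’s real source IP from the pod.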

Usually the link doing NAT will remember how it mangled a packet, and when a reply packet passes through the other way, it will do the reverse mangling on that reply packet, so everything works.

IPTables Docs – NAT-HOWTO

IPTables automatically handles the packets flowing in the reverse direction, but are our packets being processed by the host’s IPTables?

My new containers were using the following routing table:

root@macvlan-test-6fcfb775c9-zmbpc:/# ip route
default via 192.168.2.1 dev eth0
10.43.0.0/16 via 192.168.2.225 dev eth0
192.168.2.0/24 dev eth0 proto kernel scope link src 192.168.2.187

In this route table, all packets (except those for K8s cluster-local IPs) are forwarded directly out the switch port, bypassing the host’s iptables rule set. In the previous post, this already caused problems with K8s service routing.

The packets were following a path like this:

[Diagram: client → switch → pod’s macvlan interface, bypassing the host’s iptables]

However, Calico used a different routing table where all traffic was routed through 169.254.1.1. Calico’s FAQ mentions this here.

root@pihole-6f776b89bc-9lbw6:/# ip route
default via 169.254.1.1 dev eth0
169.254.1.1 dev eth0 scope link

Here, all packets were forwarded through the host’s iptables rule set.

Thus, we need to change the route tables so that everything flows through the host.
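In shell terms, the route table we’re aiming for inside each container would be built roughly like this (a sketch using my lab’s addresses; 192.168.2.225 is the host):

```shell
# flush the routes the kernel added automatically
ip route flush dev eth0
# the host itself is reachable on-link
ip route add 192.168.2.225 dev eth0 scope link
# everything else detours through the host's IP stack
ip route add default via 192.168.2.225 dev eth0
```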

This seems like it would be easy to do, but the following config:

{
  "cniVersion": "0.3.1",
  "name": "dhcp-cni-network",
  "plugins": [
    {
      "type": "macvlan",
      "name": "macvlan",
      "master": "eth0",
      "ipam": {
        "type": "dhcp"
      }
    },
    {
      "type": "route-override",
      "flushroutes": true,
      "addroutes": [
        {
          "dst": "0.0.0.0/0",
          "gw": "192.168.2.225"
        }
      ]
    }
  ]
}

Results in the following route table:

root@macvlan-test-6fcfb775c9-zmbpc:/# ip route
default via 192.168.2.225 dev eth0
192.168.2.0/24 dev eth0 proto kernel scope link src 192.168.2.187

The second route lets traffic destined for the local subnet continue to bypass the host’s IP stack. Clients outside the local subnet work correctly; clients on the same subnet still break.

I was not able to get the routing tables corrected using the route-override CNI plugin. Setting flushroutes: true wouldn’t delete this route because of this check:

for _, route := range routes {
    if route.Scope != netlink.SCOPE_LINK { // <-- scope-link routes are never flushed
        if route.Dst != nil {

If you know how this mystery route is created, let me know in the comments.

Instead, I ended up forking the CNI reference plugins and writing a custom plugin that sets up my routes explicitly, resulting in the following route table:

default via 169.254.1.1 dev eth0
169.254.1.1 dev eth0 scope link

Partial success.

Surprise Issues with MACvlan

Now I’m able to reach the service through the LB, but I can’t ping the pod IP directly, even though the pod can ping outwards. After much investigation, I tracked this down to an iptables rule that was dropping INVALID connections.

Chain KUBE-FORWARD (1 references)
 pkts bytes target prot opt in  out source   destination
  984  102K DROP   all  --  any any anywhere anywhere  ctstate INVALID

Why are these packets considered INVALID by conntrack when outbound pings from the pod itself succeed? My guess is that inbound packets are again bypassing the host’s IP stack and going directly to the pod’s network stack, as macvlan is designed to do.
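One way to test this theory (commands are illustrative and need root plus the conntrack tool) is to watch the DROP counter and the conntrack table while pinging the pod IP:

```shell
# the pkts counter on the DROP rule should climb with each ignored ping
sudo iptables -L KUBE-FORWARD -v -n
# if replies bypass the host, no conntrack entry ever forms for the pod IP
sudo conntrack -L -d 192.168.2.187
```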

TBD: Figure out how to fix this. My temporary workaround is to override the DROP rule and force iptables to accept any traffic coming from the containers.

sudo iptables -I FORWARD 1 -i mac0 -j ACCEPT

These problems suggest that maybe macvlan is the wrong technology to use here. I previously ruled out using a bridge because it has lower performance and requires the kernel to ‘learn’ STP and MAC addresses, which we shouldn’t need.

Writing a custom CNI Plugin

Let’s review how I wrote a new CNI plugin:

The CNI plugins repository contains a sample CNI that we can use as a starting point. For brevity, I’m going to exclude error handling.

link, err := netlink.LinkByName("eth0")
addrs, err := netlink.AddrList(link, netlink.FAMILY_V4)
mainIP := addrs[0]

The default route needs to point towards the host’s IP address, so we need to grab that. In the future, this should be configurable.

err = netns.Do(func(_ ns.NetNS) error {
    var containerIFName string
    for _, netif := range result.Interfaces {
        if netif.Sandbox != "" {
            containerIFName = netif.Name
            link, _ := netlink.LinkByName(netif.Name)
            routes, _ := netlink.RouteList(link, netlink.FAMILY_ALL)
            for _, route := range routes {
                err = netlink.RouteDel(&route)
            }
        }
    }
    dev, _ := netlink.LinkByName(containerIFName)

Now we switch into the container’s network namespace, iterate over its interfaces, and purge all routes. Nothing survives.

route := &netlink.Route{
    LinkIndex: dev.Attrs().Index,
    Scope:     netlink.SCOPE_LINK,
    Dst:       netlink.NewIPNet(mainIP.IP),
}
err = netlink.RouteAdd(route)

Next, we tell Linux which interface the gateway IP lives on.

err = netlink.RouteAdd(&netlink.Route{
    LinkIndex: dev.Attrs().Index,
    Gw:        mainIP.IP,
    Src:       prevResult.IPs[0].Address.IP,
})

Finally, we add the default route telling Linux to forward everything to the host’s IP stack.

Then our CNI config (/etc/cni/net.d/0-bridge.conflist) looks like:

{
  "cniVersion": "0.3.1",
  "name": "dhcp-cni-network",
  "plugins": [
    {
      "type": "macvlan",
      "name": "macvlan",
      "master": "eth0",
      "ipam": {
        "type": "dhcp"
      }
    },
    {
      "type": "route-fix",
      "master": "eth0"
    }
  ]
}

To simplify everything, I’ve also changed the mac0 interface IP address to exactly 169.254.1.1, following Calico’s model. This is all handled by the setup in the section below.

rancher:
  network:
    post_cmds:
      - /var/lib/macvlan-init.sh
write_files:
  - container: network
    content: |+
      #!/bin/bash
      set -ex
      echo 'macvlan is up. Configuring'
      (
        MASTER_IFACE="eth0"
        ip link add mac0 link eth0 type macvlan mode bridge && ip link set mac0 up || true
        ip addr add 169.254.1.1 dev mac0 || true
      )
      # the last line of the file needs to be a blank line or a comment
    owner: root:root
    path: /var/lib/macvlan-init.sh
    permissions: "0755"

If you’re not using RancherOS, the important part is:

MASTER_IFACE="eth0"
ip link add mac0 link eth0 type macvlan mode bridge && ip link set mac0 up || true

Custom CNI

This is deployable with the following K8s YAML:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: dhcp
    name: kube-dhcp
    tier: node
  name: kube-dhcp-daemon
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: kube-dhcp
  template:
    metadata:
      labels:
        app: dhcp
        name: kube-dhcp
        tier: node
    spec:
      containers:
        - env:
            - name: PRIORITY
              value: "0"
          image: ghcr.io/ajacques/k8s-dhcp-cni-helper:dev
          imagePullPolicy: Always
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - rm /host/cni_net/$PRIORITY-bridge.conflist
          name: kube-dhcp
          resources:
            limits:
              cpu: 100m
              memory: 50Mi
            requests:
              cpu: 10m
              memory: 50Mi
          securityContext:
            privileged: true
          volumeMounts:
            - mountPath: /run
              name: run
            - mountPath: /host/cni_bin/
              name: cnibin
            - mountPath: /host/cni_net
              name: cni
      hostNetwork: true
      hostPID: true
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - effect: NoSchedule
          operator: Exists
        - effect: NoExecute
          operator: Exists
      volumes:
        - hostPath:
            path: /run
            type: ""
          name: run
        - hostPath:
            path: /etc/cni/net.d
            type: ""
          name: cni
        - hostPath:
            path: /opt/cni/bin
            type: ""
          name: cnibin

Stay tuned for more work on the cluster.

Series Navigation: << Home Lab: Part 4 – A DHCP IPAM | Home Lab: Part 6 – Replacing MACvlan with a Bridge >>
