Home Lab: Part 5 - Problems with asymmetrical routing

This article is part of the Home Lab series.

    In the previous post (DHCP IPAM), we got our containers running with macvlan + DHCP. I also installed MetalLB, and everything seemingly worked. However, when I tried to retroactively add this setup to my existing Kubernetes home lab cluster, which was already running Calico, I was not able to access the MetalLB service: all connections timed out.

    A quick Wireshark packet capture of the situation exposed this problem:

    The SYN packet from my computer made it to the container (LB IP 192.168.6.2), but the SYN/ACK that came back had a source address of 192.168.2.76 (the pod's network interface). My computer ignored it because it didn't belong to any active flow, so the connection never established.
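
    You can reproduce the same observation from the client without Wireshark by capturing on its network interface (eth0 here is a placeholder for whatever interface your client actually uses):

    # SYNs go out to the LB IP, but the SYN/ACKs come back from the pod IP
    sudo tcpdump -ni eth0 'tcp and (host 192.168.6.2 or host 192.168.2.76)'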

    On the far side, the load balancer IP that MetalLB announces is destination NATed (DNAT): the destination IP is rewritten from 192.168.6.2 to the pod's IP 192.168.2.76. The response packets are then supposed to be source NATed (SNAT) so that the client computer only ever sees the LB IP address.

    Looking at iptables -t nat -L -v, we can see a difference between the two services' chains: one has a KUBE-MARK-MASQ rule, the other doesn't.

    Chain KUBE-FW-4IWBTEYTRLLAGWWJ (1 references)
     pkts bytes target     prot opt in     out     source               destination
        0     0 KUBE-MARK-MASQ  all  -- any    any     anywhere             anywhere             /* prometheus/prometheus-pushgateway:http loadbalancer IP */
        0     0 KUBE-SVC-4IWBTEYTRLLAGWWJ  all  -- any    any     anywhere             anywhere             /* prometheus/prometheus-pushgateway:http loadbalancer IP */
        0     0 KUBE-MARK-DROP  all  -- any    any     anywhere             anywhere             /* prometheus/prometheus-pushgateway:http loadbalancer IP */
    
    Chain KUBE-FW-BI5FYS4DEZGC5QLO (1 references)
     pkts bytes target     prot opt in     out     source               destination
       17   884 KUBE-XLB-BI5FYS4DEZGC5QLO  all  -- any    any     anywhere             anywhere             /* smarthome/pihole-tcp:http loadbalancer IP */
        0     0 KUBE-MARK-DROP  all  -- any    any     anywhere             anywhere             /* smarthome/pihole-tcp:http loadbalancer IP */
    

    At first, I thought Calico was doing something different between these two services, because when I switched back to Calico, it worked correctly. But that was only partially correct: there were two main differences.

    The first difference was that the service that worked had externalTrafficPolicy: Cluster, while the service that wasn't working had externalTrafficPolicy: Local. In Cluster mode, the source IP address is rewritten (SNATed) to the host's IP address, whereas Local mode only DNATs the packet. (Here is a useful blog post on the topic.)
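
    A quick way to check which policy a service is using, or to flip it while debugging, is kubectl (the service names here are the ones visible in the iptables output above; adjust to your own):

    # Check the policy on the service that was not working
    kubectl -n smarthome get svc pihole-tcp \
      -o jsonpath='{.spec.externalTrafficPolicy}{"\n"}'

    # Temporarily switch it to Cluster to confirm the behavioral difference
    kubectl -n smarthome patch svc pihole-tcp \
      -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'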

    Usually the link doing NAT will remember how it mangled a packet, and when a reply packet passes through the other way, it will do the reverse mangling on that reply packet, so everything works.

    IPTables Docs - NAT-HOWTO

    IPTables automatically handles the packets flowing in the reverse direction, but are our packets being processed by the host’s IPTables?
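
    One way to answer that is to look at the host's conntrack table while a connection attempt is in flight. If the host's stack were handling the flow, there would be an entry whose original destination is the LB IP (this assumes the conntrack tool is installed on the node):

    # No output here means the host never tracked the connection at all
    sudo conntrack -L 2>/dev/null | grep 192.168.6.2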

    My new containers were using the following routing table:

    root@macvlan-test-6fcfb775c9-zmbpc:/# ip route
    default via 192.168.2.1 dev eth0
    10.43.0.0/16 via 192.168.2.225 dev eth0
    192.168.2.0/24 dev eth0  proto kernel  scope link  src 192.168.2.187
    

    In this route table, all packets (except those for the K8s cluster range, which go via the host) are sent straight out the pod's macvlan interface toward the switch, bypassing the host's iptables rule set. In the previous post, this already caused problems with the K8s service routing.

    In other words, the reply packets went straight from the pod to the switch and on to the client, never passing back through the host where the reverse NAT would have happened.

    However, Calico used a different routing table where all traffic was routed through 169.254.1.1. Calico’s FAQ mentions this here.

    root@pihole-6f776b89bc-9lbw6:/# ip route
    default via 169.254.1.1 dev eth0
    169.254.1.1 dev eth0  scope link
    

    With Calico's route table, every packet leaving the pod is handed to the host first, so it passes through the host's iptables rule set and the reverse NAT can happen.

    Thus, we need to change the route tables so that everything flows through the host.
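
    Concretely, the goal inside the pod's network namespace is a route table with only two entries: an on-link route for the host's address and a default route pointing at it. As a rough sketch, the equivalent ip commands would be (using the host's 192.168.2.225 here; later in the post this gateway becomes 169.254.1.1 on mac0):

    # Target state inside the pod's network namespace (sketch)
    ip route flush dev eth0                          # drop the DHCP-installed routes
    ip route add 192.168.2.225 dev eth0 scope link   # the gateway is reachable on-link
    ip route add default via 192.168.2.225 dev eth0  # send everything else to the host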

    This seems like it would be easy to do, but the following config:

    {
      "cniVersion": "0.3.1",
      "name": "dhcp-cni-network",
      "plugins": [
        {
          "type": "macvlan",
          "name": "macvlan",
          "master": "eth0",
          "ipam": {
            "type": "dhcp"
          }
        },
        {
          "type": "route-override",
          "flushroutes": true,
          "addroutes": [
            { "dst": "0.0.0.0/0", "gw": "192.168.2.125" }
          ]
        }
      ]
    }
    

    Results in the following route table:

    root@macvlan-test-6fcfb775c9-zmbpc:/# ip route
    default via 192.168.2.225 dev eth0
    192.168.2.0/24 dev eth0  proto kernel  scope link  src 192.168.2.187
    

    The second route still allows traffic destined for the local subnet to bypass the host's IP stack. Clients outside the local subnet do work correctly, since their traffic follows the default route through the host, but clients on the same subnet still hit the asymmetric path.

    I was not able to get the routing tables corrected using the route-override CNI plugin. Setting flushroutes: true wouldn't delete this route, because the subnet route is installed by the kernel with link scope (proto kernel scope link) and this check skips any route whose scope is SCOPE_LINK:

    for _, route := range routes {
       if route.Scope != netlink.SCOPE_LINK {
          ^^^^^
          if route.Dst != nil {
    

    If you know how this mystery route is created, let me know in the comments.

    Instead, I ended up forking the CNI reference plugins and writing a custom plugin that sets up my routes explicitly, resulting in the following route table:

    default via 169.254.1.1 dev eth0
    169.254.1.1 dev eth0  scope link
    

    Partial success.

    Surprise Issues with MACvlan

    Now I'm able to reach the service through the LB, but I can't ping the pod IP directly, even though the pod can ping outwards. After much investigation, I tracked this down to an IPTables rule that was dropping INVALID connections.

    Chain KUBE-FORWARD (1 references)
     pkts bytes target     prot opt in     out     source               destination
      984  102K DROP       all  -- any    any     anywhere             anywhere             ctstate INVALID
    

    Why are these packets considered INVALID by conntrack when outbound pings from the pod itself succeed? My guess is that the inbound packets are again bypassing the host's IP stack and going directly to the pod's network stack, as macvlan is designed to do, so the host's conntrack only ever sees the reply half of the flow and flags it INVALID.
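
    One way to confirm this rule is the culprit is to watch its packet counter while pinging the pod from another machine. If the DROP counter climbs in step with the pings, this is where the replies are dying:

    # The ctstate INVALID DROP counter should tick up with every ping attempt
    sudo watch -n1 'iptables -L KUBE-FORWARD -v -n'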

    TBD: Figure out how to fix this. My temporary workaround is to insert a rule ahead of the DROP and force IPTables to accept any traffic coming from the containers.

    sudo iptables -I FORWARD 1 -i mac0 -j ACCEPT
    

    These problems suggest that macvlan may be the wrong technology to use here. I previously ruled out using a bridge because it performs worse and requires the kernel to 'learn' STP and MAC addresses, neither of which we should need.

    Writing a custom CNI Plugin

    Let’s review how I wrote a new CNI plugin:

    The CNI plugins repository contains a sample CNI plugin that we can use as a starting point. For brevity, I'm going to exclude error handling.

    link, err := netlink.LinkByName("eth0")
    
    addrs, err := netlink.AddrList(link, netlink.FAMILY_V4)
    mainIP := addrs[0]
    

    The default route needs to point towards the host’s IP address, so we need to grab that. In the future, this should be configurable.
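
    For reference, the same lookup from a shell on the host would look like this (assuming eth0 is the uplink, as in the configs above):

    # First IPv4 address on eth0; this is what the plugin uses as the gateway
    ip -4 -o addr show dev eth0 | awk '{print $4}'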

    err = netns.Do(func(_ ns.NetNS) error {
      var containerIFName string
      for _, netif := range result.Interfaces {
        if netif.Sandbox != "" {
          containerIFName = netif.Name
          link, _ := netlink.LinkByName(netif.Name)
          routes, _ := netlink.RouteList(link, netlink.FAMILY_ALL)
          for _, route := range routes {
            err = netlink.RouteDel(&route)
    
          }
        }
      }
      dev, _ := netlink.LinkByName(containerIFName)
    

    Now, we switch into the container's network namespace, iterate over its interfaces, and purge all of their routes. Nothing survives.

    route := &netlink.Route{
      LinkIndex: dev.Attrs().Index,
      Scope:     netlink.SCOPE_LINK,
      Dst:       netlink.NewIPNet(mainIP.IP),
    }
    err = netlink.RouteAdd(route)
    

    Next, we tell Linux which interface the gateway IP lives on, via a link-scoped route.

    err = netlink.RouteAdd(&netlink.Route{
      LinkIndex: dev.Attrs().Index,
      Gw:        mainIP.IP,
      Src:       prevResult.IPs[0].Address.IP,
    })
    

    Finally, we add the default route telling Linux to forward everything to the host’s IP stack.

    Then our CNI config (/etc/cni/net.d/0-bridge.conflist) looks like:

    {
      "cniVersion": "0.3.1",
      "name": "dhcp-cni-network",
      "plugins": [
        {
          "type": "macvlan",
          "name": "macvlan",
          "master": "eth0",
          "ipam": {
            "type": "dhcp"
          }
        },
        {
          "type": "route-fix",
          "master": "eth0"
        }
      ]
    }
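
    With the plugin binary in /opt/cni/bin and this conflist in place, a quick sanity check on a freshly scheduled pod shows whether the routes came out right (using the macvlan-test deployment from the route listings above):

    # The pod should now only have the host-pointing routes
    kubectl exec deploy/macvlan-test -- ip route

    # And traffic should still make it out to the physical network
    kubectl exec deploy/macvlan-test -- ping -c1 192.168.2.1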
    

    To simplify everything, I've also changed the mac0 interface's IP address to be explicitly 169.254.1.1, following Calico's model. This is all handled by the custom CNI setup I describe in the sections below.

    rancher:
      network:
        post_cmds:
        - /var/lib/macvlan-init.sh
    
    write_files:
    - container: network
      content: |+
        #!/bin/bash
        set -ex
        echo 'macvlan is up. Configuring'
        (
          MASTER_IFACE="eth0"
          ip link add mac0 link "$MASTER_IFACE" type macvlan mode bridge && ip link set mac0 up || true
          ip addr add 169.254.1.1 dev mac0 || true
        )
        # the last line of the file needs to be a blank line or a comment    
      owner: root:root
      path: /var/lib/macvlan-init.sh
      permissions: "0755"
    

    If you’re not using RancherOS, the important part is:

    MASTER_IFACE="eth0"
    ip link add mac0 link "$MASTER_IFACE" type macvlan mode bridge && ip link set mac0 up || true
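
    Either way, you can confirm the host-side macvlan interface came up and (per the full script above) carries the 169.254.1.1 address:

    # mac0 should be UP, in bridge mode, and hold 169.254.1.1
    ip -d link show mac0
    ip -4 addr show dev mac0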
    

    Custom CNI

    This is deployable with the following K8s YAML:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      labels:
        app: dhcp
        name: kube-dhcp
        tier: node
      name: kube-dhcp-daemon
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          name: kube-dhcp
      template:
        metadata:
          labels:
            app: dhcp
            name: kube-dhcp
            tier: node
        spec:
          containers:
          - env:
            - name: PRIORITY
              value: "0"
            image: ghcr.io/ajacques/k8s-dhcp-cni-helper:dev
            imagePullPolicy: Always
            lifecycle:
              preStop:
                exec:
                  command:
                  - /bin/sh
                  - -c
                  - rm /host/cni_net/$PRIORITY-bridge.conflist
            name: kube-dhcp
            resources:
              limits:
                cpu: 100m
                memory: 50Mi
              requests:
                cpu: 10m
                memory: 50Mi
            securityContext:
              privileged: true
            volumeMounts:
            - mountPath: /run
              name: run
            - mountPath: /host/cni_bin/
              name: cnibin
            - mountPath: /host/cni_net
              name: cni
          hostNetwork: true
          hostPID: true
          tolerations:
          - key: CriticalAddonsOnly
            operator: Exists
          - effect: NoSchedule
            operator: Exists
          - effect: NoExecute
            operator: Exists
          volumes:
          - hostPath:
              path: /run
              type: ""
            name: run
          - hostPath:
              path: /etc/cni/net.d
              type: ""
            name: cni
          - hostPath:
              path: /opt/cni/bin
              type: ""
            name: cnibin
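
    Deploying is then just an apply, followed by a check that the helper landed on every node (the manifest filename is arbitrary):

    # Roll out the CNI helper and confirm it is running on each node
    kubectl apply -f kube-dhcp-daemonset.yaml
    kubectl -n kube-system get pods -l name=kube-dhcp -o wide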
    

    Stay tuned for more work on the cluster.
