- Home Lab: Part 1 – Cluster Setup
- Home Lab: Part 2 – Networking Setup
- Home Lab: Part 3 – Networking Revisited
- Home Lab: Part 4 – A DHCP IPAM
- Home Lab: Part 5 – Problems with asymmetrical routing
- Home Lab: Part 6 – Replacing MACvlan with a Bridge
- Home Lab – Using the bridge CNI with Systemd
- Kubernetes: A hybrid Calico and Layer 2 Bridge+DHCP network using Multus
In the previous post (DHCP IPAM), we got our containers running with macvlan + DHCP. I additionally installed MetalLB and everything seemed to work. However, when I tried to retroactively add this to my existing Kubernetes home lab cluster, which was already running Calico, I was not able to access the MetalLB service: all connections timed out.
A quick Wireshark packet capture of the situation exposed this problem:

The SYN packet from my computer made it to the container (LB IP 192.168.6.2), but the SYN/ACK that came back had a source address of 192.168.2.76 (the pod's network interface). My computer ignored it because it didn't belong to an active flow.
On the far side, MetalLB is responsible for destination NAT (DNAT): it rewrites the destination IP from 192.168.6.2 to the pod's IP, 192.168.2.76. The response packets are then supposed to be source NATed (SNAT) so that the client computer only ever sees the LB IP address.
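These aren't the literal kube-proxy chains, just a minimal sketch of the two rewrites the LB path relies on (IPs taken from the capture above):
# Inbound: DNAT traffic addressed to the LB IP over to the pod
iptables -t nat -A PREROUTING -d 192.168.6.2 -j DNAT --to-destination 192.168.2.76
# Outbound: masquerade so the pod replies to the node, where conntrack reverses both rewrites
iptables -t nat -A POSTROUTING -d 192.168.2.76 -j MASQUERADE
Without the second half, the pod answers the client directly with its own source address, which is exactly the asymmetric flow in the capture above.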
Checking the output of iptables -t nat -L -v, we see a difference between two chains: one has a KUBE-MARK-MASQ rule, the other doesn't.
Chain KUBE-FW-4IWBTEYTRLLAGWWJ (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- any any anywhere anywhere /* prometheus/prometheus-pushgateway:http loadbalancer IP */
0 0 KUBE-SVC-4IWBTEYTRLLAGWWJ all -- any any anywhere anywhere /* prometheus/prometheus-pushgateway:http loadbalancer IP */
0 0 KUBE-MARK-DROP all -- any any anywhere anywhere /* prometheus/prometheus-pushgateway:http loadbalancer IP */
Chain KUBE-FW-BI5FYS4DEZGC5QLO (1 references)
pkts bytes target prot opt in out source destination
17 884 KUBE-XLB-BI5FYS4DEZGC5QLO all -- any any anywhere anywhere /* smarthome/pihole-tcp:http loadbalancer IP */
0 0 KUBE-MARK-DROP all -- any any anywhere anywhere /* smarthome/pihole-tcp:http loadbalancer IP */
At first, I thought Calico was doing something different between these two services, because when I switched back to Calico it worked correctly. But this was only partially correct. There were two main differences.
The first difference was that the service that worked had externalTrafficPolicy: Cluster, while the service that wasn't working had externalTrafficPolicy: Local. In Cluster mode, the source IP address is rewritten to use the host's IP address, whereas Local only DNATs the packet. (Here is a useful blog post.)
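If preserving the client source IP doesn't matter for a given service, the quickest workaround is to flip it back to Cluster mode; for example, for the pihole-tcp service from the chains above:
kubectl -n smarthome patch svc pihole-tcp \
  -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'
The trade-off is that Cluster mode masquerades the traffic, so the pod no longer sees the real client address.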
Usually the link doing NAT will remember how it mangled a packet, and when a reply packet passes through the other way, it will do the reverse mangling on that reply packet, so everything works.
IPTables Docs – NAT-HOWTO
IPTables automatically handles the packets flowing in the reverse direction, but are our packets being processed by the host’s IPTables?
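One way to answer that is to ask conntrack on the node whether it ever saw the flow; if the packets bypass the host's IP stack, the LB connection simply never appears (192.168.6.2 being the LB IP from the capture):
# On the node hosting the pod
sudo conntrack -L -d 192.168.6.2
# Or watch events live while reproducing the timeout
sudo conntrack -E -d 192.168.6.2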
My new containers were using the following routing table:
root@macvlan-test-6fcfb775c9-zmbpc:/# ip route
default via 192.168.2.1 dev eth0
10.43.0.0/16 via 192.168.2.225 dev eth0
192.168.2.0/24 dev eth0 proto kernel scope link src 192.168.2.187
In this route table, all packets (except for K8s cluster local IPs) would get forwarded to the destination switch port and bypass the host’s iptables rule set. In the previous post, this already caused problems with the K8s service routing.
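You can see the bypass from inside the pod with ip route get, where 192.168.1.10 is just a stand-in for the client's address:
ip route get 192.168.1.10
# Expect something like: 192.168.1.10 via 192.168.2.1 dev eth0 src 192.168.2.187
# i.e. replies go straight out the macvlan to the router and never touch the host's iptables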
The packets were following a path like this:

However, Calico used a different routing table where all traffic was routed through 169.254.1.1. Calico’s FAQ mentions this here.
root@pihole-6f776b89bc-9lbw6:/# ip route
default via 169.254.1.1 dev eth0
169.254.1.1 dev eth0 scope link
With this route table, the packets followed a path where everything was forwarded through the host's iptables rule set:

Thus, we need to change the route tables so that everything flows through the host.
This seems like it would be easy to do, but the following config:
{
  "cniVersion": "0.3.1",
  "name": "dhcp-cni-network",
  "plugins": [
    {
      "type": "macvlan",
      "name": "macvlan",
      "master": "eth0",
      "ipam": {
        "type": "dhcp"
      }
    },
    {
      "type": "route-override",
      "flushroutes": true,
      "addroutes": [
        { "dst": "0.0.0.0/0", "gw": "192.168.2.225" }
      ]
    }
  ]
}
Results in the following route table:
root@macvlan-test-6fcfb775c9-zmbpc:/# ip route
default via 192.168.2.225 dev eth0
192.168.2.0/24 dev eth0 proto kernel scope link src 192.168.2.187
The second route still lets traffic destined for the local subnet bypass the host's IP stack; for destinations outside the local subnet, things do work correctly.
I was not able to get the routing tables corrected using the route-override CNI plugin. Setting flushroutes: true wouldn’t delete this route because of this check:
for _, route := range routes {
    if route.Scope != netlink.SCOPE_LINK { // <-- link-scope routes are skipped, so the subnet route is never deleted
        if route.Dst != nil {
If you know how this mystery route is created, let me know in the comments.
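(For what it's worth, the route looks like the connected route the kernel installs automatically whenever an address with a prefix, 192.168.2.187/24 here, is configured on an interface, hence the proto kernel tag. You can list just those routes from inside the pod:)
ip route show dev eth0 proto kernel
# 192.168.2.0/24 scope link src 192.168.2.187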
Instead, I ended up forking the CNI reference plugins and writing a custom plugin that sets up my routes explicitly, resulting in the following route table:
default via 169.254.1.1 dev eth0
169.254.1.1 dev eth0 scope link
Partial success.
Surprise Issues with MACvlan
Now I'm able to reach the service through the LB, but I can't ping the pod IP, even though the pod can ping outwards. After much investigation, I tracked this down to an IPTables rule that was dropping INVALID connections.
Chain KUBE-FORWARD (1 references)
pkts bytes target prot opt in out source destination
984 102K DROP all -- any any anywhere anywhere ctstate INVALID
Why are these packets considered INVALID by conntrack when outbound pings from the pod itself succeed? My guess is that inbound packets are again bypassing the host's IP stack and going directly to the pod's network stack, exactly as macvlan is designed to do.
TBD: Figure out how to fix this. My temporary solution is to override the DROP rule and force IPTables to accept any traffic coming from the containers.
sudo iptables -I FORWARD 1 -i mac0 -j ACCEPT
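To confirm the workaround took, check that the new rule sits at the top of FORWARD and that the INVALID DROP counter stops climbing while you ping the pod (192.168.2.187 is the test pod from earlier):
sudo iptables -L FORWARD -v -n --line-numbers | head -n 5   # the ACCEPT should be rule 1
sudo iptables -L KUBE-FORWARD -v -n                         # the "ctstate INVALID" DROP counter should stay flat
ping -c 3 192.168.2.187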
These problems suggest that maybe MACvlan is the wrong technology to use here. I previously ruled out using a bridge because it performed worse and required the kernel to run STP and 'learn' MAC addresses, neither of which we should need.
Writing a custom CNI Plugin
Let’s review how I wrote a new CNI plugin:
The CNI plugins repository contains a sample plugin that we can use as a starting point. For brevity, I'm going to exclude error handling.
// Look up the first IPv4 address on the master interface; the pod's default route will point at it
link, err := netlink.LinkByName("eth0")
addrs, err := netlink.AddrList(link, netlink.FAMILY_V4)
mainIP := addrs[0]
The default route needs to point towards the host’s IP address, so we need to grab that. In the future, this should be configurable.
err = netns.Do(func(_ ns.NetNS) error {
    var containerIFName string
    for _, netif := range result.Interfaces {
        if netif.Sandbox != "" {
            containerIFName = netif.Name
            link, _ := netlink.LinkByName(netif.Name)
            routes, _ := netlink.RouteList(link, netlink.FAMILY_ALL)
            for _, route := range routes {
                err = netlink.RouteDel(&route)
            }
        }
    }
    dev, _ := netlink.LinkByName(containerIFName)
Now we switch into the container's network namespace, iterate over its interfaces, and purge every route. Nothing survives.
route := &netlink.Route{
    LinkIndex: dev.Attrs().Index,
    Scope:     netlink.SCOPE_LINK,
    Dst:       netlink.NewIPNet(mainIP.IP),
}
err = netlink.RouteAdd(route)
Next, we tell Linux which interface the gateway IP lives on.
err = netlink.RouteAdd(&netlink.Route{
    LinkIndex: dev.Attrs().Index,
    Gw:        mainIP.IP,
    Src:       prevResult.IPs[0].Address.IP,
})
Finally, we add the default route telling Linux to forward everything to the host’s IP stack.
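Put together, the plugin is doing the moral equivalent of running the following inside the pod's network namespace, where <host-ip> and <pod-ip> stand in for the addresses the plugin reads at runtime:
ip route flush dev eth0                                   # purge everything, including the kernel's connected route
ip route add <host-ip>/32 dev eth0 scope link             # the gateway is reachable on-link
ip route add default via <host-ip> dev eth0 src <pod-ip>  # everything else goes through the host's IP stack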
Then our CNI config (/etc/cni/net.d/0-bridge.conflist) looks like:
{
  "cniVersion": "0.3.1",
  "name": "dhcp-cni-network",
  "plugins": [
    {
      "type": "macvlan",
      "name": "macvlan",
      "master": "eth0",
      "ipam": {
        "type": "dhcp"
      }
    },
    {
      "type": "route-fix",
      "master": "eth0"
    }
  ]
}
To simplify everything, I've also explicitly set the mac0 interface's IP address to 169.254.1.1, following Calico's model. This is all handled by the custom CNI I deploy in the section below; on the host side, mac0 itself is created at boot by the following RancherOS cloud-config:
rancher:
  network:
    post_cmds:
    - /var/lib/macvlan-init.sh
write_files:
- container: network
  content: |+
    #!/bin/bash
    set -ex
    echo 'macvlan is up. Configuring'
    (
      MASTER_IFACE="eth0"
      ip link add mac0 link eth0 type macvlan mode bridge && ip link set mac0 up || true
      ip addr add 169.254.1.1 dev mac0 || true
    )
    # the last line of the file needs to be a blank line or a comment
  owner: root:root
  path: /var/lib/macvlan-init.sh
  permissions: "0755"
If you’re not using RancherOS, the important part is:
MASTER_IFACE="eth0"
ip link add mac0 link eth0 type macvlan mode bridge && ip link set mac0 up || true
ip addr add 169.254.1.1 dev mac0 || true
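A quick sanity check that the host-side interface came up the way we expect:
ip -d link show mac0    # should report "macvlan mode bridge"
ip addr show dev mac0   # should list 169.254.1.1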
Custom CNI
This is deployable with the following K8s YAML:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: dhcp
    name: kube-dhcp
    tier: node
  name: kube-dhcp-daemon
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: kube-dhcp
  template:
    metadata:
      labels:
        app: dhcp
        name: kube-dhcp
        tier: node
    spec:
      containers:
      - env:
        - name: PRIORITY
          value: "0"
        image: ghcr.io/ajacques/k8s-dhcp-cni-helper:dev
        imagePullPolicy: Always
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - rm /host/cni_net/$PRIORITY-bridge.conflist
        name: kube-dhcp
        resources:
          limits:
            cpu: 100m
            memory: 50Mi
          requests:
            cpu: 10m
            memory: 50Mi
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /run
          name: run
        - mountPath: /host/cni_bin/
          name: cnibin
        - mountPath: /host/cni_net
          name: cni
      hostNetwork: true
      hostPID: true
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        operator: Exists
      - effect: NoExecute
        operator: Exists
      volumes:
      - hostPath:
          path: /run
          type: ""
        name: run
      - hostPath:
          path: /etc/cni/net.d
          type: ""
        name: cni
      - hostPath:
          path: /opt/cni/bin
          type: ""
        name: cnibin
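After applying the manifest (kube-dhcp-daemonset.yaml is just a placeholder file name), the helper should be running on every node and the generated conflist should show up on each host:
kubectl apply -f kube-dhcp-daemonset.yaml
kubectl -n kube-system get pods -l name=kube-dhcp -o wide
# On any node:
ls -l /etc/cni/net.d/0-bridge.conflist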
Stay tuned for more work on the cluster.