My external cluster runs on 3 different dedicated servers (most from SoYouStart.com). I have 3 machines because the Kubernetes control plane needs 3 or more to maintain a quorum and tolerate any one machine going down. If one machine goes down, the other two still hold a majority and can agree on the state of the cluster.
I randomly encountered issues where the Kubernetes control plane behind the Rancher UI would crash and restart. While this cluster didn’t really matter, it still annoyed me and I wanted to figure it out.
I narrowed it down to a single host and documented the steps I took to resolve the issue, which seems to have been caused by that one machine using HDDs while all the other hosts used SSDs.
There were a few services I ran that I wanted to access from both inside and outside my home network. If I was inside my home network, I wanted to route directly to the service, but if I was outside, I needed to route traffic through a proxy that would then forward into my home lab. Additionally, I wanted to secure all my services with SSL using cert-manager.
Since my IPv4 addresses differ inside my network vs outside, I need to use split-horizon DNS to answer each DNS query with the correct address. Split-horizon DNS means that clients on one horizon (inside the network) see different results than clients outside the network.
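To make that concrete, here's a tiny sketch of my own (with made-up subnets and addresses, not my actual resolver configuration) of the decision a split-horizon resolver makes before answering: check which side of the network the client is on, then return the matching address.

```go
package main

import (
	"fmt"
	"net"
)

// Hypothetical addresses for illustration only: the internal answer points at
// the service's LAN address, the external answer points at the public proxy.
var (
	internalNet    = mustCIDR("192.168.1.0/24") // clients inside the home network
	internalAnswer = net.ParseIP("192.168.1.50")
	externalAnswer = net.ParseIP("203.0.113.10")
)

func mustCIDR(s string) *net.IPNet {
	_, n, err := net.ParseCIDR(s)
	if err != nil {
		panic(err)
	}
	return n
}

// answerFor picks which A record to return based on where the query came from.
// A split-horizon DNS server does exactly this before building its response.
func answerFor(clientIP net.IP) net.IP {
	if internalNet.Contains(clientIP) {
		return internalAnswer
	}
	return externalAnswer
}

func main() {
	fmt.Println(answerFor(net.ParseIP("192.168.1.23"))) // inside  -> 192.168.1.50
	fmt.Println(answerFor(net.ParseIP("198.51.100.7"))) // outside -> 203.0.113.10
}
```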
DNS is the protocol that converts domain names like “technowizardry.net” into the IP address of the server that will respond, like “220.127.116.11”. In DNS, domain names are actually supposed to end with a period. For example, the URL of this website is not “www.technowizardry.net”, but actually “www.technowizardry.net.” Notice the period at the end.
Where does this come from? If you look at a DNS packet in a packet capture, you’ll see that each query looks something like this:
The queried domain starts right where I’ve highlighted in the above picture. The domain name is split on each period into labels. In this example, there are 3 separate labels: [“www”, “technowizardry”, “net”]. The byte sequence looks like:
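To make the encoding concrete without the packet capture, here's a small sketch of my own (not code from the original capture) that builds that byte sequence: each label becomes a single length byte followed by its characters, and the trailing period becomes the terminating zero-length label.

```go
package main

import "fmt"

// encodeName converts a domain name's labels into DNS wire format:
// each label becomes <length byte><label bytes>, and the name is
// terminated by a zero-length label (the trailing "." in the FQDN).
func encodeName(labels []string) []byte {
	var out []byte
	for _, label := range labels {
		out = append(out, byte(len(label)))
		out = append(out, label...)
	}
	return append(out, 0x00) // root label terminates the name
}

func main() {
	name := encodeName([]string{"www", "technowizardry", "net"})
	// Prints: 03 77 77 77 0e 74 65 63 68 6e 6f 77 69 7a 61 72 64 72 79 03 6e 65 74 00
	// i.e. [3]www [14]technowizardry [3]net [0]
	fmt.Printf("% x\n", name)
}
```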
Previously in my Home Lab series, I described how my home lab Kubernetes cluster runs with a DHCP CNI: all pods sit on the same layer 2 network as the rest of my home and get an IP address from DHCP. This enabled me to run software that needed it, like Home Assistant, which wants to use mDNS and send broadcast packets to discover devices.
However, not all pods actually needed to be on the same layer 2 network, and this led to a few situations where I ran out of IP addresses on the DHCP server and couldn’t connect any new devices until reservations expired:
I also had a circular dependency: the main VLAN told clients to use a DNS server that was running in Kubernetes. If I had to reboot, the Kubernetes cluster could get stuck starting because it tried to query a DNS server that hadn’t started yet (for simplicity, I use DHCP for everything instead of static config).
In this post, I explain how I built a new home lab cluster with K3s and used Multus to run both Calico and my custom Bridge+DHCP CNI so that only pods that need layer 2 access get access.
In my previous post, I outlined challenges that I’ve encountered with Rancher. As part of the fallout from that, I ended up having to rebuild one of my clusters, and I took that time to try out RKE2 and K3s for my home lab. In this home lab, I use a custom CNI based on the official Bridge and DHCP IPAM CNIs (Read more) to enable my smart home software (HomeAssistant) to communicate with other devices on the same Layer 2 domain.
However, it seems that if you try to spin up an RKE2 cluster on a host with a Bridge interface already set up (See here), it gets stuck during provisioning and you won’t be able to download a Kube Config from Rancher Server because Rancher thinks the cluster is offline. I reported this issue initially here.
In this blog post, I explain more about the problem and how to directly connect to the cluster to install a working CNI such that Rancher will correctly start.
You were supposed to bring balance to Kubernetes, Rancher, not destroy it
Et tu, Rancher?
I’ve been maintaining my own dedicated servers for around 7 years now as a way to learn and improve skills and to have a place to run my different web sites, mail servers, and even this blog. Over the years the hardware has changed, and I’ve moved from hosting Rails applications directly on the OS to Docker and finally Kubernetes. I’ve learned a lot of skills that eventually helped me in my professional career, so it’s definitely been worth it, but maintaining these servers has had massive pain points where I’ve just had to walk away and leave things broken for days until I finally fixed the issues.
I selected Rancher several years ago (at least 3 or 4 years ago, I’d estimate) when I finally moved to Kubernetes. I liked how it automatically provisioned my clusters, managed networking, and provided a nice UI. It was also widely recommended around the internet. Things worked reasonably well, but after adopting Rancher and Kubernetes, every 6-12 months something would break massively and I’d have to painstakingly rebuild the entire Kubernetes cluster. Many times I told myself that if it broke again, I’d swear off Rancher entirely, but that never happened because I eventually got everything working.
After upgrading to Rancher v2.6.3, which launched just yesterday, and finding that all my clusters had been removed from Rancher, I hit my breaking point.
After running my home lab for a while, I’ve started switching to a more up-to-date Linux distribution (instead of RancherOS). I’m currently testing Ubuntu Server, which uses systemd. systemd-networkd is responsible for managing the network interface configuration, and its behavior differs from NetworkManager enough that we need to update the Home Lab Bridge CNI to handle it.
Previously, the CNI created a bridge network adapter when the first container started up. This caused problems with systemd: resolved (the DNS resolver component) would eventually fail to make DNS queries, and networkd ended up with duplicate IP addresses on both eth0 (the actual uplink adapter) and cni0, because we copied the uplink’s address over.
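As a rough sketch of what that old behavior looked like (heavily simplified, using the vishvananda/netlink library and example interface names rather than the real CNI code), the plugin created the bridge on demand, enslaved the uplink, and copied the uplink's IPv4 address onto cni0:

```go
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

// ensureBridge is a simplified sketch of what the old CNI did when the first
// container started: create cni0, enslave the uplink, and copy the uplink's
// IPv4 addresses onto the bridge. Interface names here are only examples.
func ensureBridge(uplinkName, bridgeName string) error {
	uplink, err := netlink.LinkByName(uplinkName)
	if err != nil {
		return err
	}

	bridge := &netlink.Bridge{LinkAttrs: netlink.LinkAttrs{Name: bridgeName}}
	if err := netlink.LinkAdd(bridge); err != nil {
		return err // a real CNI would tolerate "file exists" here
	}
	if err := netlink.LinkSetUp(bridge); err != nil {
		return err
	}
	// Attach the physical uplink to the bridge.
	if err := netlink.LinkSetMaster(uplink, bridge); err != nil {
		return err
	}

	// Copy the uplink's IPv4 addresses onto the bridge. Doing this without
	// also removing them from eth0 is what left systemd-networkd seeing the
	// same address on two interfaces.
	addrs, err := netlink.AddrList(uplink, netlink.FAMILY_V4)
	if err != nil {
		return err
	}
	for _, addr := range addrs {
		if err := netlink.AddrAdd(bridge, &addr); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if err := ensureBridge("eth0", "cni0"); err != nil {
		log.Fatal(err)
	}
}
```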
Long ago, I installed Longhorn onto my Kubernetes cluster using Helm 2. Then eventually Helm 3 was released and helm 2to3 was made available. However, I was not able to use helm 2to3 because Rancher didn’t deploy Tiller in the way that the CLI expected. Additionally, Rancher did not provide an upgrade mechanism to handle this. Eventually Rancher 2.6 was released, which dropped Helm 2 support entirely, and I was stuck with a cluster where Longhorn was deployed but not managed by a working Helm installation.
This blog post outlines how you can recover Longhorn and upgrade it to Helm 3 without deleting all your volumes. This guide isn’t specific to Rancher 2.6.
Kubernetes automatically exposes certain services as a port on your host and may unintentionally expose private services. Here’s how to fix that.
I recently responded to the Log4j vulnerability. If you’re not aware, Log4j is a very popular Java logging library used in many Java applications. There was a vulnerability where malicious actors could remotely take control of your computer by submitting a specially crafted request parameter that gets directly logged to log4j.
This situation was not ideal since I was running several Java applications on my servers, so I decided to use Nmap to port scan my dedicated server and see what ports were open. I ended up finding a number of ports I didn’t expect because several of my Kubernetes Service instances were being mapped as node ports.
In this post, I outline the problem with Kubernetes’ default strategy for services and how to avoid exposing ports that you don’t need.
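As a complement to the Nmap scan, you can also ask the cluster itself which Services allocate node ports. Here's a small client-go sketch of my own (assuming a kubeconfig at the default path; not part of the original post) that lists them:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Load the kubeconfig from the default location (~/.kube/config).
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// List every Service in every namespace and print the ones that expose a
	// port on the node (NodePort and LoadBalancer both allocate node ports by default).
	svcs, err := clientset.CoreV1().Services("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, svc := range svcs.Items {
		if svc.Spec.Type != corev1.ServiceTypeNodePort && svc.Spec.Type != corev1.ServiceTypeLoadBalancer {
			continue
		}
		for _, port := range svc.Spec.Ports {
			if port.NodePort != 0 {
				fmt.Printf("%s/%s exposes node port %d -> %s\n",
					svc.Namespace, svc.Name, port.NodePort, port.TargetPort.String())
			}
		}
	}
}
```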
In the previous post, we ended up abusing subnets and routing to get Calico to exist on the correct subnet, but what if we could get rid of Calico’s duplicate IPAM system and just depend on our existing DHCP server to handle reservations? In this post, we’re going to prototype a cluster that uses DHCP plus layer 2 Linux bridging to avoid the complications outlined in Part 3.
This avoids the overlapping IPAM problems of the previous solution and means that the DHCP server already running on my network is responsible for handing out IP addresses directly to the containers.