A COE on why technowizardry.net went down

COE = Correction of Error

My previous employer, Amazon, was a big proponent of blameless analysis of outages and of figuring out what could be done to fix them. I recently had an outage on my servers and wanted to share what went wrong and how I fixed it.

Summary

From Thursday until Friday, all TLS requests to *.technowizardry.net domains failed due to an expired TLS certificate. Then on Friday, all DNS queries to the *.technowizardry.net zone failed, which broke mail delivery as well. The certificate expired because cert-manager had created the acme-challenge TXT record, but the record was not visible to the Internet: Hurricane Electric's (HE) DNS servers were failing to perform an AXFR zone transfer from my authoritative DNS server, because PowerDNS was unable to bind to port 53 while systemd-resolved was already listening on it.

While trying to fix the issue, I temporarily deleted the entire zone in Hurricane Electric, which then triggered an issue in how PowerDNS handles ALIAS records and blocked further AXFR transfers from succeeding. This required manual effort to unblock the transfers, get the zone working again, and allow the TLS certificate to renew.

Additionally, Dovecot, which handles mail storage, failed to come back after a restart because a Kubernetes sidecar container was crash looping with a “too many open files” error while setting up inotify (used to listen for file changes), caused by a low inotify instance limit.

Background

DNS

On each of my three servers, I run the PowerDNS authoritative server, storing zone records in MySQL. I use cert-manager to create and renew TLS certificates from Let’s Encrypt in my Kubernetes cluster. Since I use wildcard certificates, Let’s Encrypt requires the DNS verification method, which means cert-manager needs permission to create and modify TXT records. It creates those records in PowerDNS using the API.

I have three dedicated servers running my Kubernetes cluster. Three is chosen to ensure that if one machine is offline, I can continue to serve critical services, like this blog. However, they are all located in the northeastern US and Canada. To improve DNS performance, I wanted the authoritative servers geographically distributed, so I adopted Hurricane Electric’s DNS service as the public-facing authoritative DNS servers. These servers automatically perform a DNS zone transfer (AXFR) from my servers to keep in sync.
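
For context, the PowerDNS side of that setup boils down to a handful of pdns.conf settings. This is only a rough sketch (the addresses and key are placeholders, not my actual values, and the TSIG key configuration is omitted):

# pdns.conf (sketch; placeholder values)
# Send NOTIFYs and serve outgoing zone transfers as the primary
primary=yes
# Allow Hurricane Electric's transfer servers to AXFR the zone
allow-axfr-ips=<he-transfer-ip>
# Notify HE whenever the zone serial changes
also-notify=<he-transfer-ip>
# The HTTP API that cert-manager uses to create TXT records
api=yes
api-key=<redacted>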

When cert-manager updates a record using the API, PowerDNS automatically bumps the SOA serial to tell secondary servers that there’s an update. As of now, my SOA record looks like:

$ dig +short SOA technowizardry.net
ns1.he.net. contact.technowizardry.net. 2025082101 3600 600 259200 60

ns1.he.net - Primary NS server
contact.technowizardry.net - The contact email (contact@technowizardry.net)
2025082101 - Serial number. Translates to date: "2025-08-21" version: "01"
3600 - Refresh time: how often secondary servers should try to refresh, in seconds (1 hour)
600 - Retry time: how long to wait between retries after a failed refresh (10 minutes)
259200 - Expiration time: after 3 days without a successful refresh, secondaries stop serving the zone
60 - Minimum (negative caching) TTL

HE DNS checks for any changes to the zone serial number on my server, and if mine is newer, it pulls the newest version.

Mail

I use Dovecot to store email and implement IMAP/POP3. It runs in Kubernetes and is exposed as a Kubernetes Service. It’s not configured in a highly available mode and only has one instance running because I never spent the time to figure out Dovecot HA. I then expose the ports to the Internet using ingress-nginx’s mechanism for TCP services.
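
For reference, ingress-nginx handles raw TCP with a ConfigMap that maps an external port to a namespace/service:port pair, and the controller has to be started with --tcp-services-configmap pointing at it. A minimal sketch, assuming the Service is named dovecot in the mail namespace and using illustrative ports:

apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx
data:
  # "<external port>": "<namespace>/<service name>:<service port>"
  "993": "mail/dovecot:993" # IMAPS
  "143": "mail/dovecot:143" # IMAP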

Running as a sidecar, I have a config-reloader container that quietly listens for changes to the ConfigMap configuration files or the SSL certificate. If a change is detected, it sends a SIGHUP to Dovecot to reload the configuration. This lets me avoid fully deleting and restarting the service to make config changes, and it automatically handles SSL certificate renewals. The deployment looks like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dovecot
  namespace: mail
spec:
  template:
    spec:
      containers:
        - args:
            - '-c'
            - /etc/dovecot/dovecot.conf
            - '-F'
          command:
            - /usr/sbin/dovecot
          image: ajacques/kube-mail:latest
          name: dovecot
          resources:
            requests:
              memory: 64Mi
          volumeMounts:
            - mountPath: /etc/dovecot/conf.d
              name: config
              readOnly: true
        - env:
            - name: CONFIG_DIR
              value: /etc/dovecot/conf.d,/etc/dovecot/ssl/tls-combined.pem
            - name: PROCESS_NAME
              value: dovecot
          image: ajacques/config-reloader-sidecar:latest
          imagePullPolicy: IfNotPresent
          name: config-reloader-sidecar-config
          resources:
            limits:
              memory: 8Mi
            requests:
              cpu: 5m
              memory: 8Mi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              add:
                - KILL
              drop:
                - ALL
            privileged: false
            readOnlyRootFilesystem: false
            runAsNonRoot: false
          volumeMounts:
            - mountPath: /etc/dovecot/conf.d
              name: config
              readOnly: true
            - mountPath: /etc/dovecot/ssl
              name: mailcert
              readOnly: true
      volumes:
        - configMap:
            defaultMode: 292
            name: dovecot
            optional: false
          name: config
        - name: mailcert
          secret:
            defaultMode: 292
            optional: false
            secretName: technowizardry-wildcard
      # ...

Investigation

DNS

I was on vacation with limited access to the Internet, so it took a lot longer than it should have to figure out and fix the problem.

The first thing to check is the server logs. cert-manager showed:

E0815 11:23:44.790373       1 sync.go:190] "propagation check failed" err="DNS record for \"technowizardry.net\" not yet propagated" logger="cert-manager.challenges" resource_name="technowizardry-prod-18-1227467484-534542642" resource_namespace="technowizardry" resource_kind="Challenge" resource_version="v1" dnsName="technowizardry.net" type="DNS-01"

The PowerDNS pods showed:

Aug 14 18:06:18 Guardian is launching an instance
Aug 14 18:06:18 Unable to bind UDP socket to '0.0.0.0:53': Address already in use
Aug 14 18:06:18 Fatal error: Unable to bind to UDP socket
Aug 14 18:06:19 Our pdns instance exited with code 1, respawning
Aug 14 18:06:20 Guardian is launching an instance
Aug 14 18:06:20 Unable to bind UDP socket to '0.0.0.0:53': Address already in use
Aug 14 18:06:20 Fatal error: Unable to bind to UDP socket

Comparing SOA records shows a difference. HE is not pulling updates.

dig +short SOA @srv5.technowizardry.net technowizardry.net | awk '{ print $3 }'
2025081401

dig +short SOA @ns1.he.net technowizardry.net | awk '{ print $3 }'
2025062601

Without SSH access and on an eSIM with a very limited data cap, I couldn’t investigate what was listening, but it was definitely systemd-resolved. In the meantime, I used Hurricane Electric’s website to change the zone from a secondary to a primary zone, which let me edit the records on their website and skip AXFR entirely. A day later I had full Internet access and could get to work. I wanted to switch back to the zone transfer mechanism and have Hurricane Electric pull from my PowerDNS, but their transfers from my server kept failing. I couldn’t figure out why, because the server was responding to other queries.

Manually trying to pull the zone via AXFR didn’t succeed and gave no error messages back. I use TSIG to authenticate transfers, and I double- and triple-checked that the key was correct.

$ dig -t AXFR -y hmac-sha512:[tsigkeyname]:[tsigkey] @144.217.181.222 technowizardry.net

The server logs gave no error messages as to why. In pdns.conf, I changed the log level from 3 to 5 to increase verbosity.

# pdns.conf
# loglevel Amount of logging. Higher is more. Do not set below 3  

loglevel=5

Then retried and got this:

Aug 15 11:02:02 AXFR-out zone 'technowizardry.net', client '217.138.75.171', error resolving for ALIAS ingress-nginx.technowizardry.net., aborting AXFR

Ah ha. I’m using an ALIAS record type on the apex record, technowizardry.net, because I can’t use a CNAME there (a CNAME tells resolvers to go look at another record for the value, and CNAMEs can’t be used on the root of a domain for reasons that are boring, legacy, and unfortunate). I use ALIAS records because I point everything at a common record, ingress-nginx.technowizardry.net, which includes all three servers. As I do maintenance, I take servers out of and put them back into that single record.

Unfortunately, it seems that even though technowizardry.net is an ALIAS to ingress-nginx.technowizardry.net, a record this server already knows about, PowerDNS ends up querying a resolver on the Internet for the value. In this case, no server (namely my HE DNS) could answer that query, because HE was failing to transfer the zone in the first place. Thus we have a circular dependency.

I consider this a bug in PowerDNS given that it can absolutely answer this query locally. However, I suspect it was implemented this way to avoid issues where somebody creates an ALIAS pointing at a name hosted on another NS server.
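
The knobs that control this behavior live in pdns.conf. To the best of my understanding of the PowerDNS docs (worth double-checking), ALIAS expansion always goes through the configured recursive resolver rather than local zone data, and a failed lookup aborts the outgoing transfer:

# pdns.conf (sketch; the resolver address is a placeholder)
# Resolver used to expand ALIAS records into A/AAAA answers
resolver=203.0.113.53
expand-alias=yes
# Expand ALIAS records in outgoing AXFRs; if the lookup fails,
# the whole transfer is aborted (the behavior hit here)
outgoing-axfr-expand-alias=yes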

I quickly disable all ALIAS records to get back in business.

update records
set disabled = 1
where domain_id = 1 and type = 'ALIAS'

After that, the DNS zone transfer succeeds and I can re-enable the ALIAS records. Eventually, the TLS certificate renewal succeeds and we’re mostly golden.
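
For completeness, re-enabling them afterwards is just the inverse of the statement above:

update records
set disabled = 0
where domain_id = 1 and type = 'ALIAS'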

Mail

Except mail is not coming back. The pod is marked as unhealthy, which means ingress-nginx won’t forward traffic to it. However, Dovecot itself is fine; it’s the config-reloader sidecar that’s restarting with a “too many open files” error. It only opens about five files, so how could that be? Well, it’s the only container that uses inotify, a Linux feature that sends notifications when files are updated. This one was interesting: some GitHub issues mention the problem, but they’re inconclusive. My intuition suggested a ulimit or sysctl issue preventing it from starting. Coincidentally, the pod is running on the one node that runs NixOS; the other two haven’t been changed over yet.

sysctl -a | grep inotify
fs.inotify.max_user_instances = 128
user.max_inotify_instances = 128

The limit was pretty low. That’s the culprit.
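
One way to sanity-check that it’s the instance limit (rather than the watch limit) is to count how many inotify instances are already open on the node; each one shows up as a file descriptor pointing at anon_inode:inotify:

# Count open inotify instances across all processes on the node
find /proc/*/fd -lname anon_inode:inotify 2>/dev/null | wc -l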

Why

The next section asks the Five Whys to get to the root cause. I’ll break it down by problem.

Why didn’t cert-manager renew the certificate?

  • cert-manager was unable to renew certificates. Why?
  • cert-manager was waiting for the acme-challenge TXT record to become visible. Why wasn’t it visible?
  • cert-manager had created the TXT record in the authoritative PowerDNS instance, but the NS servers listed on the technowizardry.net zone (run by Hurricane Electric) did not have the record. Why didn’t they have it?
  • Hurricane Electric failed to perform a DNS zone transfer. Why couldn’t they transfer?
  • The transfer request was not making it to PowerDNS, which could have handled it. Why?
  • PowerDNS wasn’t able to bind to port 53 because systemd-resolved was already listening on it. Why wasn’t systemd-resolved disabled?
  • Not sure. Maybe an OS upgrade reverted my change, or maybe I had only temporarily disabled it.

Why didn’t mail services start to work?

  • The dovecot service wasn’t starting up. Why?
  • Because the config-reloader sidecar was crash looping. Why?
  • Because it was unable to set up the inotify instance it needed. Why?
  • The sysctl limit was set too low. Why?
  • It had never been changed from the default. Why?
  • This was the first time I’d hit the problem.

Action Items

Disable systemd-resolved resolver

The first action is to disable the systemd-resolved local stub resolver, which listens on the same port. My servers then forward DNS queries off-host instead of using a local cache.

On NixOS, this can be done using:

{ config, lib, pkgs, ... }:
{
    services.resolved.extraConfig = ''
    DNSStubListener=no
    '';
}

And on other machines, the equivalent drop-in looks like this:

# /etc/systemd/resolved.conf.d/disable-local-resolver.conf 

[Resolve]  
DNSStubListener=no
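
After restarting systemd-resolved, it’s worth confirming that nothing is left listening on port 53 before PowerDNS starts; something like:

systemctl restart systemd-resolved
# Nothing should be listening on :53 anymore
ss -tulpn | grep ':53 '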

Increase inotify limits

Next, we need to increase the sysctl limits. The Nix team discussed increasing this limit by default, but the related PR was abandoned.

{ config, lib, pkgs, ... }:
{
  boot.kernel.sysctl = {
    "fs.inotify.max_user_instances" = "8192";
    "user.max_inotify_instances" = "8192";
  };
}
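
On the machines that haven’t been migrated to NixOS yet, the equivalent is a sysctl drop-in; roughly (the file name here is my choice):

# Persist the new limit across reboots
echo 'fs.inotify.max_user_instances = 8192' > /etc/sysctl.d/99-inotify.conf
# Apply it immediately
sysctl --system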

Add more monitoring

I’m using kube-prometheus-stack to monitor my cluster, but I didn’t know my certificate had been failing to renew for over a week. Had I known, I could have fixed it ahead of the expiration. Let’s add an alert so I get an email next time.

Here I create a PrometheusRule object, which tells Prometheus to watch this metric expression and fire an alert through Alertmanager when it triggers.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-manager-rules
  namespace: cert-manager
spec:
  groups:
    - name: certificates
      rules:
        - alert: failing-to-renew
          annotations:
            summary: >-
              Certificate {{ $labels.exported_namespace }}/{{ $labels.name }} is
              failing to renew
          expr: >-
            sum(certmanager_certificate_renewal_timestamp_seconds - time()) BY
            (name,exported_namespace) < 0
          for: 0s
          labels:
            severity: critical

I use Grafana dashboards to visualize metrics and send alerts. I configured a contact point that emails me and a notification policy that routes the above alert to it.
