Improving bad on-call with the Snowball Effect

I’ve worked on several different teams during my 8 years at Amazon. Each of them had an on-call rotation in which the engineers took turns keeping the system running 24/7 for a week at a time. If something broke at 2am, they’d get paged to fix it.

Now, Amazon’s a big company, and on-call varied quite a bit. Some teams had a heavy ops load; others had barely any. I had my fair share of weeks with lots of tickets, but usually I sought out teams where it was more manageable. Engineers on the busier teams, however, frequently struggled to make headway, playing a bit of on-call hot potato with the next on-call. Sadly, Amazon largely did not use SREs or dedicated support groups except for the most critical systems. I wish it had.

Here’s the strategy I employed whenever I joined a new team.

Continue reading “Improving bad on-call with the Snowball Effect”

Technical Diagrams – Stop using cloud logos

Quick, what is this diagram trying to show?

An architecture diagram using AWS service icons to describe services

I hope you know your AWS icons. There are over 200 services, and I frequently have to guess when playing the AWS Logo Quiz. While this diagram could easily add some descriptive labels to help, the icons alone assume developers remember what each one means. Some color-blind people may even struggle to see the differences in the colors AWS uses for different types of services. These icons add visual clutter and distract the viewer from what matters: your system.

Continue reading “Technical Diagrams – Stop using cloud logos”

Local Energy Monitoring using the Emporia Vue 2

This entry is part 5 of 5 in the series Home Energy Monitoring

I’ve previously explored the world of home energy monitoring systems and settled on the Brultech GreenEye Monitor for a project in a friend’s house. It had the advantage of being local out-of-the-box and offered a wide range of compact CTs (current transformers) that made fitting the electronics in the breaker box a lot easier, but it had one flaw that made it unsuitable for my condo: it had to be mounted outside the breaker box with wires running into the box. I had no space for that in my condo, so I explored other options.

I came across the Emporia Vue 2 and found that it was built around a standard ESP32 and was easy to reflash with custom ESPHome firmware. ESPHome is an open-source framework for creating firmware that collects data from a variety of sensors and publishes it to MQTT/Home Assistant. This sounded perfect, so I ordered a Vue 2, and here’s how I made it work.

Continue reading “Local Energy Monitoring using the Emporia Vue 2”

Zeppelin v0.10 not showing matplotlib graphs

I upgraded from Apache Zeppelin v0.9.x to v0.10.x, and my Python Matplotlib scripts suddenly stopped rendering images. Anything that called the plot method would just print the string representation of the function’s return value, like below:

import matplotlib.pyplot as plt
plt.plot([1, 2, 3])

[<matplotlib.lines.Line2D at 0x7ff547624210>]

Continue reading “Zeppelin v0.10 not showing matplotlib graphs”

Rewriting Home Assistant Long-term statistics from InfluxDB

In an earlier post, I made an error in aggregating the energy data, which resulted in hugely inflated energy usage totals. All the un-aggregated data was accurate, but the sums were wrong. Luckily, I had all the raw data stored in InfluxDB and could rebuild the totals from it.

In this post, I walk through how to rewrite the Home Assistant long-term statistics database to fix this mistake.

Continue reading “Rewriting Home Assistant Long-term statistics from InfluxDB”

From a random Kubernetes control plane crash to a new RAID array

My external cluster runs on 3 different dedicated servers (most from …). I have 3 machines because the Kubernetes control plane needs 3 or more to maintain a quorum while tolerating the loss of any one machine. If one machine goes down, the other two still form a majority and can agree on the state of the cluster.
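The majority arithmetic behind that claim can be sketched in a few lines. This is only an illustration, assuming a simple majority quorum of the kind an etcd-style datastore uses; the function names are hypothetical and not taken from any Kubernetes tooling.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum_size(members)


# Compare a few cluster sizes (assumed simple-majority quorum).
for n in range(1, 6):
    print(f"{n} members: quorum={quorum_size(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Under this assumption, 2 machines tolerate no failures at all, and 4 machines tolerate no more failures than 3, which is why odd cluster sizes starting at 3 are the usual choice.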

I kept running into seemingly random issues where the Kubernetes control plane or the Rancher UI would crash and restart. While this cluster didn’t really matter, it still annoyed me, and I wanted to figure it out.

I narrowed it down to a single host and documented the steps I took to resolve the issue, which seems to have been caused by that one machine using HDDs while all the other hosts used SSDs.

Continue reading “From a random Kubernetes control plane crash to a new RAID array”

Visualizing Home Energy Usage in InfluxDB and Home Assistant

This entry is part 4 of 5 in the series Home Energy Monitoring

In previous posts in this series, I walked through how to get data flowing into Home Assistant.

In this post, we’ll get it flowing into InfluxDB for long-term retention.

Continue reading “Visualizing Home Energy Usage in InfluxDB and Home Assistant”

Over-engineering a home air quality dashboard

Illustration by Audrey Lee

Air is invisible, yet I feel its effects in so many ways: temperature, humidity, gas composition. Until now, though, I lacked the sensors to measure any of it. In this post, I walk through some different air quality sensors that I found and how I wired them up into a dashboard.

Continue reading “Over-engineering a home air quality dashboard”

Accurate, Local Home Energy Monitoring: Part 3 – Software Config

This entry is part 3 of 5 in the series Home Energy Monitoring

In the previous post in this series, I selected an energy monitoring system that is purely local (no cloud) and integrates into the breaker box, and I showed how to connect it to the network and configure the size of each circuit. In this post, I’ll show how to connect the Brultech GreenEye Energy Monitor to Home Assistant and create some useful monitoring dashboards.

Continue reading “Accurate, Local Home Energy Monitoring: Part 3 – Software Config”

Split Horizon DNS with external-dns and cert-manager for Kubernetes

I ran a few services that I wanted to be able to access from both inside and outside my home network. If I was inside my home network, I wanted to route directly to the service, but if I was outside, I needed to route traffic through a proxy that would then route into my home lab. Additionally, I wanted to support SSL on all my services for security, using cert-manager.

Since my IPv4 addresses differ inside my network vs outside, I need to use split-horizon DNS to answer each query with the correct address. Split-horizon DNS means that DNS on one horizon (inside the network) returns different results than DNS on the other (outside the network).
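As a rough illustration of the idea, the sketch below picks an answer based on where the client sits. The network range, hostname, and addresses are all hypothetical placeholders, and the `resolve` function is not part of external-dns or cert-manager; real setups usually achieve the split with separate DNS servers or views rather than a single lookup function.

```python
import ipaddress

# Hypothetical values for illustration only; not taken from the post.
INTERNAL_NETWORK = ipaddress.ip_network("192.168.1.0/24")
RECORDS = {
    "app.example.com": {
        "internal": "192.168.1.50",  # direct address of the service on the LAN
        "external": "203.0.113.10",  # public address of the proxy
    },
}


def resolve(hostname: str, client_ip: str) -> str:
    """Return the address a split-horizon resolver would hand to this client."""
    views = RECORDS[hostname]
    inside = ipaddress.ip_address(client_ip) in INTERNAL_NETWORK
    return views["internal"] if inside else views["external"]


print(resolve("app.example.com", "192.168.1.23"))  # client on the home network
print(resolve("app.example.com", "198.51.100.7"))  # client on the internet
```

The same hostname resolves to the LAN address for the first client and to the proxy's public address for the second, which is the behaviour the two horizons are meant to provide.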

Continue reading “Split Horizon DNS with external-dns and cert-manager for Kubernetes”