I’ve worked on several different teams during my 8 years at Amazon. Each one had an on-call rotation in which an engineer was responsible for keeping the system running 24/7 for a week. If something broke at 2am, they’d get paged to fix it.
Now, Amazon’s a big company, and on-call varied quite a bit. Some teams had a heavy ops load, others had barely any. I had my fair share of weeks with lots of tickets, but I usually sought out teams where the load was more manageable. Even so, on-call engineers frequently struggled to make progress on issues, playing a bit of hot potato with the next on-call. Sadly, Amazon largely did not use SREs or dedicated support groups except for the most critical systems. I wish it had.
Here’s the strategy I employed whenever I joined a new team.
I hope you know your AWS icons. There are over 200 services, and I frequently have to guess when playing the AWS Logo Quiz. While this diagram could easily add some descriptive labels to help, the icons assume developers can remember what each one means. Some color-blind people may even struggle to see the differences in the colors that AWS uses for different types of services. These icons become visually cluttered and distract the viewer from what matters: your system.
JUnit is a popular testing library for Java applications, and I used it extensively at Amazon across numerous Java applications and services. Along the way, I came across a number of anti-patterns and opportunities to improve the quality of test code. This post introduces many of the tricks and patterns that I’ve learned and shared with my coworkers, and now want to share
Another library to know is Mockito, which I use extensively in JUnit test cases and will also reference below.
These are all real things that I’ve seen developers do.
If you’ve got a service that lets clients make changes to its entities, then you probably want an audit log that tracks who made which changes.
I decided to write this post because I frequently saw teams at Amazon not thinking through these considerations. Some of the guidance does focus on AWS IAM, but a lot of it is practical for any type of audit log.
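As a minimal sketch of what "who made which changes" could look like in code, here is an in-memory, append-only log in Java. The class and field names (`AuditEntry`, `AuditLog`, `actor`, `action`, and so on) are hypothetical and chosen for illustration only; a real audit log would be written to durable storage rather than a list.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** One immutable record of a change. All names here are illustrative assumptions. */
final class AuditEntry {
    final String actor;       // who made the change (for example, a user or role identifier)
    final String action;      // what kind of change: CREATE, UPDATE, DELETE
    final String entityId;    // which entity was changed
    final String oldValue;    // value before the change (null for a create)
    final String newValue;    // value after the change (null for a delete)
    final Instant occurredAt; // when the change happened

    AuditEntry(String actor, String action, String entityId,
               String oldValue, String newValue, Instant occurredAt) {
        this.actor = actor;
        this.action = action;
        this.entityId = entityId;
        this.oldValue = oldValue;
        this.newValue = newValue;
        this.occurredAt = occurredAt;
    }
}

/** Append-only log: entries can be added and read, but never edited or removed. */
final class AuditLog {
    private final List<AuditEntry> entries = new ArrayList<>();

    void record(String actor, String action, String entityId,
                String oldValue, String newValue) {
        entries.add(new AuditEntry(actor, action, entityId, oldValue, newValue, Instant.now()));
    }

    /** Read-only view so callers cannot rewrite history. */
    List<AuditEntry> entries() {
        return Collections.unmodifiableList(entries);
    }
}
```

Capturing both the old and new value is one design choice among several; it makes it possible to answer "what did this look like before?" without replaying every earlier change.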
Google Guice is a dependency injection library for Java, and I used it frequently on a number of Java services. Compared to Spring, I liked how simple it was and how narrowly it focused on just dependency injection. However, I often saw developers using it in incorrect or non-ideal ways that increased boilerplate or were just wrong.
These are all recommendations that I’ve accumulated over several years of working at Amazon, watching engineers (and sometimes myself) misuse Google Guice.
How to make your system robust against your worst nightmare: your future self
In this post, I talk about some strategies that I’ve learned for simplifying class structures in Java services that load and persist data in data stores like DynamoDB or RDS, while at the same time making the codebase safer.
As always, my opinions are my own.
At Amazon, I ended up joining two teams that were suffering under technical debt. Each time, I was asked to spend some time understanding why the products were unstable and why users were encountering frequent bugs. One system, responsible for managing critical metadata about products in the catalog, was experiencing problems where users reported that they’d randomly lose data.
A service that loses client data is a terrible service, and this one caused users to lose trust in the system. Note that some details of this story have been modified for confidentiality reasons. Let’s dive in.
For new fields, Elasticsearch can automatically identify what type to use, but it can be wrong or do unexpected things. For example, I’ve seen Elasticsearch identify a decimal field as a long because the first value to go into the index did not have any decimal places. After that, all other documents failed to be indexed because they did not match. This is especially important if you have fields with a wide range of values (for example, user-controlled ones), because you can’t predict whether the first value is going to look like a number or a date when it should always be treated as a string.
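To make the "first value wins" problem concrete, here is a toy model in Java. It is not Elasticsearch's implementation or client API. `ToyDynamicIndex`, `FieldType`, `declare`, and `index` are hypothetical names, and the inference rules are deliberately simplified. Real behavior depends on your Elasticsearch version and index settings.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.HashMap;
import java.util.Map;

/** Toy model of first-value-wins type inference. Not Elasticsearch's real logic. */
class ToyDynamicIndex {
    enum FieldType { LONG, DATE, TEXT }

    private final Map<String, FieldType> mapping = new HashMap<>();

    /** Declares a field's type up front, the analogue of an explicit mapping. */
    void declare(String field, FieldType type) {
        mapping.put(field, type);
    }

    /** Guesses a type from one value; anything unrecognized is treated as TEXT here. */
    static FieldType infer(Object value) {
        if (value instanceof Long || value instanceof Integer) {
            return FieldType.LONG;
        }
        if (value instanceof String) {
            try {
                LocalDate.parse((String) value); // ISO format, e.g. 2024-01-15
                return FieldType.DATE;
            } catch (DateTimeParseException e) {
                // Not a date; fall through to TEXT.
            }
        }
        return FieldType.TEXT;
    }

    /** Returns true if the value is accepted, false if it conflicts with the field's type. */
    boolean index(String field, Object value) {
        FieldType declared = mapping.get(field);
        if (declared == null) {
            mapping.put(field, infer(value)); // the first value decides the type
            return true;
        }
        return declared == FieldType.TEXT || declared == infer(value);
    }

    FieldType typeOf(String field) {
        return mapping.get(field);
    }
}

class DynamicMappingDemo {
    public static void main(String[] args) {
        ToyDynamicIndex dynamic = new ToyDynamicIndex();
        System.out.println(dynamic.index("nickname", "2024-01-15")); // true, inferred as DATE
        System.out.println(dynamic.index("nickname", "alice"));      // false, not a date

        ToyDynamicIndex explicit = new ToyDynamicIndex();
        explicit.declare("nickname", ToyDynamicIndex.FieldType.TEXT);
        System.out.println(explicit.index("nickname", "2024-01-15")); // true
        System.out.println(explicit.index("nickname", "alice"));      // true
    }
}
```

In the toy model, declaring the field as text before any document arrives avoids the conflict. This mirrors the point above: define the mapping for user-controlled fields yourself rather than letting the first value decide.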