Improving bad on-call with the Snowball Effect

I’ve worked on several different teams during my 8 years at Amazon. Each of them had an on-call rotation in which an engineer was responsible for keeping the system running 24/7 for a week. If something broke at 2am, they’d get paged to fix it.

Now, Amazon’s a big company, and on-call varied quite a bit. Some teams had a heavy ops load, others had barely any. I had my fair share of weeks with lots of tickets, but usually I sought out teams where it was more manageable. However, on-call engineers frequently struggled to get anywhere, playing a bit of on-call hot potato with the next on-call. Sadly, Amazon largely did not leverage SREs or dedicated support groups except for the most critical systems. I wish it had.

Here’s the strategy that I employed whenever I joined a new team.


Technical Diagrams – Stop using cloud logos

Quick, what is this diagram trying to show?

An architecture diagram using AWS service icons to describe services

I hope you know your AWS icons. There are over 200 services, and I frequently have to guess when playing the AWS Logo Quiz. While this diagram could easily add some descriptive labels to help, the icons assume developers can remember what each icon means. Some color-blind people may even struggle to see the difference in the coloring that AWS uses for different types of services. These icons become visually cluttered and distract the viewer from what matters: your system.


Best Practices for Java testing with JUnit

JUnit is a popular testing library for Java applications, and I used it extensively at Amazon across numerous Java applications and services. However, I came across a number of anti-patterns and opportunities to improve the quality of the test code. This post introduces many of the tricks and patterns that I’ve learned and shared with my coworkers, and now want to share more widely.

Another library to know is Mockito, which I use extensively in JUnit test cases and will also reference below.

These are all real things that I’ve seen developers do.


How to build a useful service data change audit log

If you’ve got a service that lets clients make changes to its entities, then you probably want an audit log that tracks who made what changes.

I decided to write this post because I frequently saw teams at Amazon not thinking through these considerations. Some of the guidance does focus on AWS IAM, but a lot of it is practical for any type of audit log.

Important aspects of an audit log:

  • Who made the change?
  • When did they make the change?
  • Where did they make the change?
  • What did they do?
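
As a rough illustration of the four questions above, here is a minimal sketch of a single audit record in Python. The class name `AuditLogEntry`, its field names, and the sample values are all hypothetical and chosen only for this example; a real service would likely carry more detail in each field.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditLogEntry:
    """One immutable audit record; all field names are illustrative."""
    actor: str           # who made the change (e.g. a user or role identifier)
    timestamp: datetime  # when the change was made, kept in UTC
    source: str          # where the change came from (e.g. a tool or client name)
    action: str          # what was done, readable by a human later


entry = AuditLogEntry(
    actor="example-user",
    timestamp=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    source="admin-console",
    action="UpdateProduct: price 10 -> 12",
)

# Convert to a plain dict, e.g. before serializing it to a log store.
record = asdict(entry)
```

Making the record frozen is one way to signal that audit entries are append-only and should never be edited after they are written.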

Best Practices for working with Google Guice

Google Guice is a dependency injection library for Java, and I used it frequently on a number of Java services. Compared to Spring, I liked how simple and narrowly focused on dependency injection it was. However, I often saw developers using it in incorrect or non-ideal patterns that increased boilerplate or were just wrong.

These are all recommendations that I’ve accumulated over several years of working at Amazon, watching engineers (and sometimes myself) improperly leverage Google Guice.


Defensive Coding: Stop using your storage models everywhere

How to make your system robust against your worst nightmare–your future self

In this post, I talk about some strategies that I’ve learned to simplify class structures in Java services that load and persist data in data stores like DynamoDB or RDS, while at the same time making the codebase safer.

As always, my opinions are my own.

At Amazon, I ended up joining two teams that were suffering under technical debt. Each time, I was asked to spend some time understanding why the products were unstable and why users were encountering frequent bugs. One system, responsible for managing critical metadata about products in the catalog, was experiencing problems where users reported that they’d randomly lose data.

A service that loses client data is a terrible service, and this caused users to lose trust in the system. Note that some details of this story have been modified for confidentiality reasons. Let’s dive in.


Best Practices for Elasticsearch mappings

At first, Elasticsearch may appear to be schemaless since you can add new fields any time you want, but every field in a document must match the mapping.

Dynamic Templates reduce boilerplate

How many times have you opened up a mapping file to something like this where the same type definition is repeated over and over again?

{
  "properties": {
    "foo": {
      "type": "keyword"
    },
    "bar": {
      "type": "keyword"
    },
    "qux": {
      "type": "keyword"
    },
    "baz": {
      "type": "keyword"
    },
    "other": {
      "type": "text"
    },
    ...
  }
}

It’s super easy to refactor this into an alternative where by default all string values are mapped as keyword, except for the specific field listed as “text”.

{
  "dynamic_templates": [
    {
      "example_name": {
        "match_mapping_type": "string",
        "mapping": {
          "type": "keyword"
        }
      }
    }
  ],
  "properties": {
    "other": {
      "type": "text"
    }
  }
}
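
If you generate mappings from code, the same idea can be expressed as a small helper. This is only a sketch: the function name `build_mapping` and the template name `strings_as_keywords` are made up for illustration, and it assumes the only exceptions you need are full-text fields.

```python
import json


def build_mapping(text_fields):
    """Return a mapping body where string values default to keyword.

    text_fields: names of the fields that should be full-text ("text").
    """
    return {
        # In the Elasticsearch mapping format, dynamic templates sit
        # alongside "properties" rather than inside it.
        "dynamic_templates": [
            {
                "strings_as_keywords": {  # hypothetical template name
                    "match_mapping_type": "string",
                    "mapping": {"type": "keyword"},
                }
            }
        ],
        # Only the exceptions to the default need an explicit entry.
        "properties": {name: {"type": "text"} for name in text_fields},
    }


mapping = build_mapping(["other"])
print(json.dumps(mapping, indent=2))
```

Adding another keyword field then requires no mapping change at all; only new full-text fields need to be listed.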

Disable type detection

For new fields, Elasticsearch can automatically identify what type to use, but it can be wrong or do unexpected things. For example, I’ve seen Elasticsearch accidentally identify a decimal value as a long because the first value to go into the index did not have any decimal points. Then all other documents failed to be indexed because they did not match. Disabling detection is especially important if you have fields that have a wide range of values (for example, user-controlled ones) because you can’t predict if the first value is going to look like a number or a date, when it should always be considered to be a string.

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-field-mapping.html

{
  "mappings": {
    "date_detection": false,
    "numeric_detection": false
  }
}
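
To make the "first value wins" problem concrete, here is a toy model in Python. It is not Elasticsearch's actual detection logic: it only recognizes `YYYY-MM-DD` strings as dates, and the function names `looks_like_date`, `detect_type`, and `index_all` are invented for this sketch. It shows how a user-controlled string field gets locked to a date when detection is on, and stays a plain string when it is off.

```python
from datetime import datetime


def looks_like_date(value):
    """Toy check: only recognizes YYYY-MM-DD strings."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def detect_type(value, date_detection):
    """Guess a field type from one string value (toy model only)."""
    if date_detection and looks_like_date(value):
        return "date"
    return "text"


def index_all(values, date_detection):
    """Lock the type on the first value, then report which values fit it."""
    field_type = detect_type(values[0], date_detection)
    accepted, rejected = [], []
    for value in values:
        fits = field_type == "text" or looks_like_date(value)
        (accepted if fits else rejected).append(value)
    return field_type, accepted, rejected


user_values = ["2024-01-01", "see ticket 42", "n/a"]
with_detection = index_all(user_values, date_detection=True)
without_detection = index_all(user_values, date_detection=False)
print(with_detection)
print(without_detection)
```

In this model, the first call locks the field as a date and rejects the two free-text values, while the second call accepts all three as strings.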