Understanding Etsy’s 411 alerting framework

Five questions for Ken Lee and Kai Zhong: Insights on building Etsy's alerting framework and best practices for monitoring and alerting.

By Courtney Nash, Kenneth Lee and Kai Zhong

September 20, 2016

Red traffic light at night (source: Hans via Pixabay)

I recently sat down with Kenneth Lee and Kai Zhong, security engineers at Etsy, to discuss their alerting framework, 411, and best practices for monitoring and alerting. Here are some highlights from our talk.

Etsy has created its own open source, real-time alerting framework. Can you describe it briefly?

Kai: 411 is alert management in a box. It provides the framework for querying data sources and managing the alerts it generates. We primarily use it for Elasticsearch-based alerts at Etsy, but it supports other alert types as well.

Kenneth: We’ve also generalized much of the code to make it as painless as possible for developers to extend the functionality we’ve provided, so they can create alerts from other search sources.

What prompted your move to Elasticsearch?

Kenneth: This was primarily a decision driven by the operations team and one that the security team had very little say over. The creation of 411 happened during this transition process because ELK (Elasticsearch-Logstash-Kibana) at the time lacked functionality that the security team needed when we first started the transition from Splunk.

How does 411 differ from other alerting and anomaly detection tools?

Kai: 411 focuses on providing a framework for alerting. You can use 411 with Elastic Stack (ES), or you can go an entirely different direction. The important takeaway is that you can easily add additional data sources to 411 to alert on the data you care about.
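To make the idea of adding data sources concrete, here is a minimal Python sketch of a pluggable source interface. It is not 411's actual code or API: the names `SearchSource`, `InMemoryLogSource`, and `Alert` are hypothetical, and the toy source searches a list of strings rather than Elasticsearch. It only illustrates the pattern of one shared alerting step with a small per-source query method.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Alert:
    """Hypothetical alert record: where it came from and what matched."""
    source: str
    message: str


class SearchSource(ABC):
    """Hypothetical plug-in interface: one subclass per data source."""

    name = "base"

    @abstractmethod
    def query(self, expression: str) -> Iterable[str]:
        """Return the raw records that match the expression."""

    def run(self, expression: str) -> List[Alert]:
        # Shared step: every matching record becomes an alert.
        return [Alert(self.name, record) for record in self.query(expression)]


class InMemoryLogSource(SearchSource):
    """Toy source that searches a list of log lines by substring."""

    name = "memory"

    def __init__(self, lines):
        self.lines = list(lines)

    def query(self, expression):
        return [line for line in self.lines if expression in line]


source = InMemoryLogSource([
    "login ok user=a",
    "ssh session opened user=marketing1",
])
alerts = source.run("ssh")
```

Under this sketch, supporting a new data source means writing one subclass with its own `query` method, while the code that turns results into alerts stays shared.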

How should people decide what to log when designing their own alerting?

Kenneth: Log everything! Provided your ELK cluster can handle the volume, a great place to begin is adding logging to base classes, or to certain sensitive classes such as login or password changing. For people starting out in alerting, a good first pass is to add logging for events you want to know about and that should rarely happen (non-technical users SSHing into production boxes, the number of successful site logins dipping to zero, etc.). The alerting functionality of 411 can definitely be put to good use for more nuanced (but actionable) alerts, like attackers who are attempting to scrape your website. For developers, a standard logger class that records a set of fields by default is easy to incorporate into their code, and it has the secondary benefit of letting you index those logs with a single grok pattern.

Kai: It’s better to have too many logs than too few, especially when you’re trying to do incident response.
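As a rough illustration of the standard-logger idea, the Python sketch below wraps the standard library's `logging` module so that every line shares one layout. The helper names `get_logger` and `log_event`, the field layout, and the grok pattern in the comment are assumptions made for this example, not Etsy's implementation.

```python
import logging

# One fixed layout for every log line, so a single grok pattern can parse
# them all. A pattern along these lines should match this layout (an
# assumption; check it against your own Logstash setup):
#   %{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{NOTSPACE:logger} %{GREEDYDATA:message}
LOG_FORMAT = "%(asctime)s %(levelname)s %(name)s %(message)s"


def get_logger(name, stream=None):
    """Return a logger that always writes lines in LOG_FORMAT."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler(stream)  # defaults to stderr
        handler.setFormatter(logging.Formatter(LOG_FORMAT))
        logger.addHandler(handler)
    return logger


def log_event(logger, event, **fields):
    """Log an event name plus key=value fields in a stable, sorted order."""
    pairs = " ".join(f"{key}={value}" for key, value in sorted(fields.items()))
    logger.info("event=%s %s", event, pairs)


auth_log = get_logger("app.auth")
log_event(auth_log, "login", user="alice", success=True)
```

Because every caller goes through the same helper, sensitive code paths such as login or password changes emit the same fields in the same order, which keeps the indexing side simple.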

You’re speaking at the Security Conference in New York this November. What presentations are you looking forward to attending while there?

Kai: The “Future UX of security software” talk looks interesting (and relevant to us as developers of security software). I’m also looking forward to “AppSec programs for the rest of us” as it fits in well with the security culture at Etsy!

Kenneth: There are a bunch of great presentations lined up that I’m looking forward to. Among others, I’m planning on checking out “Classifiers under attack,” “Hacker quantified security,” and Jessica Frazelle’s talk, “Benefits of isolation provided by containers.”
