The Art of Monitoring

by James Turnbull

June 2016

Intermediate to advanced

524 pages

9h 54m

English

Turnbull Press

0.1 Who is this book for?
0.2 Credits and Acknowledgments
0.3 Technical Reviewers
0.3.1 Caitie McCaffrey
0.3.2 Paul Stack
0.3.3 Jamie Wilkinson
0.4 Editor
0.5 Author
0.6 Conventions in the book
0.7 Code and Examples
0.8 Colophon
0.9 Errata
0.10 Disclaimer
0.11 Copyright
0.12 Version
1.1 Welcome to the Art of Monitoring
1.2 What is monitoring?
1.2.1 The business as a customer
1.2.2 Information Technology as a customer
1.3 What does monitoring actually look like?
1.3.1 Manual, user-initiated, or no monitoring
1.3.2 Reactive
1.3.3 Proactive
1.4 Model distribution
1.5 Becoming Proactive
1.6 What’s in the book?
1.7 Tool choices
2.1 Pull versus Push
2.2 Blackbox and Whitebox
2.3 Event, log, and metric-centered
2.3.1 More about metrics
2.3.2 So what’s a metric?
2.3.3 Types of metrics
2.3.4 Metric summaries
2.3.5 Metric aggregation
2.4 Contextual and useful notifications
2.5 Visualization
2.6 So why this architecture? What’s wrong with traditional monitoring?
2.6.1 Static configuration
2.6.2 Inflexible logic and thresholds
2.6.3 Object-centric
2.6.4 An interlude into pets and cattle
2.6.5 So what do we do differently?
2.6.6 Smarter threshold inputs
2.7 Collecting data for our monitoring framework
2.7.1 Overhead and the observer effect
2.8 Summary
3.1 Introducing Riemann
3.1.1 Riemann architecture and implementation
3.1.2 Installing Riemann
3.2 Configuring Riemann
3.2.1 Learning some Clojure
3.2.2 Riemann’s base configuration
3.2.3 Events, streams, and the index
3.2.4 Configuring events, streams, and the index
3.2.5 Sending our first event to Riemann
3.2.6 Creating our first Riemann monitoring check
3.2.7 An interlude into Riemann filtering
3.3 Connecting Riemann servers
3.3.1 Configuring the upstream Riemann servers
3.3.2 Configuring the downstream Riemann server
3.3.3 Enabling the send of our Riemann events downstream
3.4 Alerting on the upstream Riemann servers
3.4.1 Throttling Riemann events
3.4.2 Rolling up Riemann events
3.4.3 Alternatives to email notifications
3.5 Testing your Riemann configuration
3.6 Validating Riemann configuration
3.7 Performance, scaling, and making Riemann highly available
3.8 Alternatives to Riemann
3.9 Summary
4.1 Introducing Graphite
4.1.1 Carbon
4.1.2 Whisper
4.1.3 Graphite Web, Graphite-API, and Grafana
4.2 Graphite architecture
4.3 Installing Graphite
4.3.1 Installing Graphite on Ubuntu
4.3.2 Installing Graphite on Red Hat
4.3.3 Installing Graphite-API
4.3.4 Installing Grafana
4.3.5 Installing Graphite and Grafana via configuration management
4.4 Configuring Graphite and Carbon
4.4.1 Configuring Carbon’s metric retention
4.4.2 Estimating Graphite storage
4.4.3 Carbon and Graphite service management
4.5 Configuring Graphite-API
4.5.1 Service management for Graphite-API
4.5.2 Testing the Graphite-API
4.6 Configuring Grafana
4.7 Configuring Riemann for Graphite
4.8 A brief introduction to Grafana
4.9 Graphite and Carbon Redundancy
4.10 Time and time zones
4.10.1 Managing time manually
4.10.2 Managing Time via configuration management
4.10.3 Checking the time status
4.11 Alternatives to Graphite and Grafana
4.11.1 Commercial tools
4.11.2 Open-source tools
4.12 Whisper alternatives
4.12.1 InfluxDB
4.12.2 Cyanite
4.13 Summary
5.1 Introducing collectd
5.2 What host components should we monitor?
5.3 Installing collectd
5.3.1 Installing collectd on Ubuntu
5.3.2 Installing collectd on Red Hat
5.3.3 Installing collectd via configuration management
5.4 Configuring collectd
5.4.1 Loading and configuring collectd plugins for monitoring
5.4.2 Finishing up
5.4.3 Enabling and running collectd
5.5 The collectd events
5.6 Sending our collectd events to Graphite
5.7 Refactoring the collectd metric names
5.8 Summary
6.1 Checking processes are running
6.2 Other actions and enhancements
6.3 Replicating some classic monitoring
6.4 Better monitoring through smarter data
6.4.1 Building a median-based check
6.4.2 Using percentiles for host-based checks
6.4.3 Creating check abstractions
6.4.4 Organizing our checks
6.5 Graphing collectd metrics with Grafana
6.5.1 Creating the Hosts dashboard
6.5.2 Creating our first host graph
6.5.3 Creating a memory graph
6.5.4 Single host graphs
6.5.5 Additional graphs
6.6 Network, device, and Microsoft Windows monitoring
6.7 Alternatives to collectd
6.7.1 Commercial tools
6.7.2 Open source
6.8 Summary
7.1 Challenges with container monitoring
7.2 Monitoring Docker containers
7.2.1 Docker collectd plugin
7.2.2 Installing the Docker collectd plugin
7.2.3 Configuring the Docker collectd plugin
7.3 Processing Docker collectd statistics with Riemann
7.3.1 Adding metadata to our Docker events
7.4 Specifying different resolution for Docker metrics
7.5 Cleaning up old Graphite Docker metrics
7.6 Using Docker metrics for monitoring
7.7 Other container monitoring tools
7.8 Summary
8.1 Introducing Elasticsearch, Logstash, and Kibana
8.2 Logstash architecture
8.3 Installing Logstash
8.3.1 On Debian & Ubuntu
8.3.2 On Red Hat
8.3.3 Testing Java is installed
8.3.4 Installing the Logstash package
8.3.5 Testing Logstash is installed
8.4 Configuring Logstash
8.5 Installing Elasticsearch
8.5.1 On Debian and Ubuntu
8.5.2 On Red Hat
8.5.3 Installing Elasticsearch via configuration management
8.5.4 Testing Elasticsearch is installed
8.5.5 Determining Elasticsearch is running
8.6 Configuring our Elasticsearch cluster and nodes
8.6.1 Adding a cluster management plugin
8.7 Time and time zone
8.8 Integrating Logstash and Elasticsearch
8.8.1 What happens inside Logstash?
8.8.2 What happens inside Elasticsearch?
8.9 Installing Kibana
8.10 Configuring Kibana
8.11 Running Kibana
8.11.1 Using Kibana
8.12 Connecting our hosts to Logstash via Syslog
8.12.1 Configuring Logstash
8.12.2 A quick introduction to Syslog
8.12.3 Configuring Syslog
8.13 Logging from Docker
8.13.1 Configuring the Docker Daemon for logging
8.14 Sending data from Logstash to Riemann
8.15 Sending data from Riemann to Logstash
8.16 Scaling Elasticsearch and Logstash
8.16.1 Scaling Logstash
8.16.2 Scaling Elasticsearch
8.17 Monitoring our components
8.17.1 Monitoring RSyslog
8.17.2 Monitoring Logstash
8.17.3 Monitoring Elasticsearch
8.18 Alternatives to Logstash
8.18.1 Splunk
8.18.2 Heka
8.18.3 Graylog
8.18.4 mtail
8.19 Summary
9.1 An application monitoring primer
9.1.1 Where should I instrument?
9.1.2 Instrument schemas
9.1.3 Time and the observer effect
9.2 Metrics
9.2.1 Application metrics
9.2.2 Business metrics
9.2.3 Monitoring patterns, or where to put your metrics
9.2.4 The utility pattern
9.2.5 The external pattern
9.2.6 Building metrics into a sample application
9.3 Logging
9.3.1 Adding our own structured log entries
9.3.2 Adding structured logging to our sample application
9.3.3 Working with your existing logs
9.4 Health checks, endpoints, and external monitoring
9.4.1 Checking an internal endpoint
9.5 Deployments
9.5.1 Adding deployment notifications to our sample application
9.5.2 Working with our deployment events
9.6 Tracing
9.7 Summary

10.1 Our current notifications
10.2 Updating expired event configuration
10.3 Upgrading our email notifications
10.3.1 Formatting the email subject
10.3.2 Formatting the email body
10.4 Adding graphs to notifications
10.4.1 Defining our data source
10.4.2 Defining our query parameters
10.4.3 Defining our graph panels and rows
10.4.4 Rendering the dashboard
10.4.5 Adding our dashboard to the Riemann notification
10.4.6 Some sample scripted dashboards
10.4.7 Other context
10.5 Adding Slack as a destination
10.6 Adding PagerDuty as a destination
10.7 Maintenance and downtime
10.8 Learning from your notifications
10.9 Other alerting tools
10.10 Summary
11.1 The Tornado application
11.1.1 Application architecture
11.2 Monitoring strategy
11.3 Tagging our Tornado events
11.4 Monitoring Tornado: Web tier
11.4.1 Monitoring HAProxy
11.4.2 Monitoring Nginx
11.4.3 Addressing the Web tier monitoring concerns
11.4.4 Setting up the Tornado checks in Riemann
11.4.5 The webtier function
11.5 Adding Tornado checks to Riemann
11.6 Summary
12.1 Monitoring the Application tier JVM
12.1.1 Configuring collectd for JMX
12.2 Collecting our Application tier JVM logs
12.3 Monitoring the Tornado API application
12.4 Addressing the Tornado Application tier monitoring concerns
12.5 Summary
13.1 Monitoring the Data tier MySQL server
13.1.1 Using MySQL data for metrics
13.1.2 Query timing
13.2 Monitoring the Data tier’s Redis server
13.3 Addressing the Tornado Data tier monitoring concerns
13.4 The Tornado dashboard
13.5 Expanding monitoring beyond Tornado
13.6 Summary
14.1 A brief introduction to Clojure
14.2 Installing Leiningen
14.3 Clojure syntax and types
14.3.1 Clojure functions
14.3.2 Lists
14.3.3 Vectors
14.3.4 Sets
14.3.5 Maps
14.3.6 Strings
14.3.7 Creating our own functions
14.3.8 Creating variables
14.3.9 Creating named functions
14.4 Learning more Clojure

Content preview from The Art of Monitoring

1 Introduction

Let’s begin with an origin story for a company called Example.com. Once upon a time(-series), Example.com had a sysadmin. She managed infrastructure that lived in data centers. Every time a new host was added to that environment she installed a monitoring agent and set up some monitoring checks. Every now and again one of those hosts would break and a check would trigger. A notification would be sent, and she would wake up and run rm -fr /var/log/*.log to fix it.

For many years this approach worked just fine. Of course, there was some drama. Occasionally something would go wrong for which there wasn’t a check, or there just wasn’t time to act on a notification, or some applications and services on top of the hosts weren’t monitored. ...