The Art of Monitoring

Book Description

A hands-on and introductory guide to the art of modern application and infrastructure monitoring and metrics. We start small and then build on what you learn to scale out to multi-site, multi-tier applications. The book is written for both developers and sysadmins. We focus on building monitored and measurable applications. We also use tools that are designed to handle the challenges of managing Cloud, containerized and distributed applications and infrastructure.

In the book we'll deliver:

* An introduction to monitoring, metrics and measurement.
* A scalable framework for monitoring hosts (including Docker and containers), services and applications built on top of the Riemann event stream processor.
* Graphing and metric storage using Graphite and Grafana.
* Logging with Logstash.
* A framework for high-quality and useful notifications.
* Techniques for developing and building monitorable applications.
* A capstone that puts all the pieces together to monitor a multi-tier application.

Table of Contents

  The Art of Monitoring
    0.1 Who is this book for?
    0.2 Credits and Acknowledgments
    0.3 Technical Reviewers
      0.3.1 Caitie McCaffrey
      0.3.2 Paul Stack
      0.3.3 Jamie Wilkinson
    0.4 Editor
    0.5 Author
    0.6 Conventions in the book
    0.7 Code and Examples
    0.8 Colophon
    0.9 Errata
    0.10 Disclaimer
    0.11 Copyright
    0.12 Version
  1 Introduction
    1.1 Welcome to the Art of Monitoring
    1.2 What is monitoring?
      1.2.1 The business as a customer
      1.2.2 Information Technology as a customer
    1.3 What does monitoring actually look like?
      1.3.1 Manual, user-initiated, or no monitoring
      1.3.2 Reactive
      1.3.3 Proactive
    1.4 Model distribution
    1.5 Becoming Proactive
    1.6 What's in the book?
    1.7 Tool choices
  2 A Monitoring and Measurement Framework
    2.1 Blackbox versus Whitebox, or Pull versus Push
    2.2 Event, log, and metric-centered
      2.2.1 More about metrics
      2.2.2 So what's a metric?
      2.2.3 Types of metrics
      2.2.4 Metric summaries
      2.2.5 Metric aggregation
    2.3 Contextual and useful notifications
    2.4 Visualization
    2.5 So why this architecture? What's wrong with traditional monitoring?
      2.5.1 Static configuration
      2.5.2 Inflexible logic and thresholds
      2.5.3 Object-centric
      2.5.4 An interlude into pets and cattle
      2.5.5 So what do we do differently?
      2.5.6 Smarter threshold inputs
    2.6 Collecting data for our monitoring framework
      2.6.1 Overhead and the observer effect
    2.7 Summary
  3 Managing events and metrics with Riemann
    3.1 Introducing Riemann
      3.1.1 Riemann architecture and implementation
      3.1.2 Installing Riemann
    3.2 Configuring Riemann
      3.2.1 Learning some Clojure
      3.2.2 Riemann's base configuration
      3.2.3 Events, streams, and the index
      3.2.4 Configuring events, streams, and the index
      3.2.5 Sending our first event to Riemann
      3.2.6 Creating our first Riemann monitoring check
      3.2.7 An interlude into Riemann filtering
    3.3 Connecting Riemann servers
      3.3.1 Configuring the upstream Riemann servers
      3.3.2 Configuring the downstream Riemann server
      3.3.3 Enabling the send of our Riemann events downstream
    3.4 Alerting on the upstream Riemann servers
      3.4.1 Throttling Riemann events
      3.4.2 Rolling up Riemann events
      3.4.3 Alternatives to email notifications
    3.5 Testing your Riemann configuration
    3.6 Validating Riemann configuration
    3.7 Performance, scaling, and making Riemann highly available
    3.8 Alternatives to Riemann
    3.9 Summary
  4 Introducing Graphite and Grafana
    4.1 Introducing Graphite
      4.1.1 Carbon
      4.1.2 Whisper
      4.1.3 Graphite Web, Graphite-API, and Grafana
    4.2 Graphite architecture
    4.3 Installing Graphite
      4.3.1 Installing Graphite on Ubuntu
      4.3.2 Installing Graphite on Red Hat
      4.3.3 Installing Graphite-API
      4.3.4 Installing Grafana
      4.3.5 Installing Graphite and Grafana via configuration management
    4.4 Configuring Graphite and Carbon
      4.4.1 Configuring Carbon's metric retention
      4.4.2 Estimating Graphite storage
      4.4.3 Carbon and Graphite service management
    4.5 Configuring Graphite-API
      4.5.1 Service management for Graphite-API
      4.5.2 Testing the Graphite-API
    4.6 Configuring Grafana
    4.7 Configuring Riemann for Graphite
    4.8 A brief introduction to Grafana
    4.9 Graphite and Carbon Redundancy
    4.10 Time and time zones
      4.10.1 Managing time manually
      4.10.2 Managing Time via configuration management
      4.10.3 Checking the time status
    4.11 Alternatives to Graphite and Grafana
      4.11.1 Commercial tools
      4.11.2 Open-source tools
    4.12 Whisper alternatives
      4.12.1 InfluxDB
      4.12.2 Cyanite
    4.13 Summary
  5 Host monitoring
    5.1 Introducing collectd
    5.2 What host components should we monitor?
    5.3 Installing collectd
      5.3.1 Installing collectd on Ubuntu
      5.3.2 Installing collectd on Red Hat
      5.3.3 Installing collectd via configuration management
    5.4 Configuring collectd
      5.4.1 Loading and configuring collectd plugins for monitoring
      5.4.2 Finishing up
      5.4.3 Enabling and running collectd
    5.5 The collectd events
    5.6 Sending our collectd events to Graphite
    5.7 Refactoring the collectd metric names
    5.8 Summary
  6 Using collectd events in Riemann
    6.1 Checking processes are running
    6.2 Other actions and enhancements
    6.3 Replicating some classic monitoring
    6.4 Better monitoring through smarter data
      6.4.1 Building a median-based check
      6.4.2 Using percentiles for host-based checks
      6.4.3 Creating check abstractions
      6.4.4 Organizing our checks
    6.5 Graphing collectd metrics with Grafana
      6.5.1 Creating the Hosts dashboard
      6.5.2 Creating our first host graph
      6.5.3 Creating a memory graph
      6.5.4 Single host graphs
      6.5.5 Additional graphs
    6.6 Network, device, and Microsoft Windows monitoring
    6.7 Alternatives to collectd
      6.7.1 Commercial tools
      6.7.2 Open source
    6.8 Summary
  7 Containers: another kind of host
    7.1 Challenges with container monitoring
    7.2 Monitoring Docker containers
      7.2.1 Docker collectd plugin
      7.2.2 Installing the Docker collectd plugin
      7.2.3 Configuring the Docker collectd plugin
    7.3 Processing Docker collectd statistics with Riemann
      7.3.1 Adding metadata to our Docker events
    7.4 Specifying different resolution for Docker metrics
    7.5 Cleaning up old Graphite Docker metrics
    7.6 Using Docker metrics for monitoring
    7.7 Other container monitoring tools
    7.8 Summary
  8 Logs and logging
    8.1 Introducing Elasticsearch, Logstash, and Kibana
    8.2 Logstash architecture
    8.3 Installing Logstash
      8.3.1 On Debian & Ubuntu
      8.3.2 On Red Hat
      8.3.3 Testing Java is installed
      8.3.4 Installing the Logstash package
      8.3.5 Testing Logstash is installed
    8.4 Configuring Logstash
    8.5 Installing Elasticsearch
      8.5.1 On Debian and Ubuntu
      8.5.2 On Red Hat
      8.5.3 Installing Elasticsearch via configuration management
      8.5.4 Testing Elasticsearch is installed
      8.5.5 Determining Elasticsearch is running
    8.6 Configuring our Elasticsearch cluster and nodes
      8.6.1 Adding a cluster management plugin
    8.7 Time and time zone
    8.8 Integrating Logstash and Elasticsearch
      8.8.1 What happens inside Logstash?
      8.8.2 What happens inside Elasticsearch?
    8.9 Installing Kibana
    8.10 Configuring Kibana
    8.11 Running Kibana
      8.11.1 Using Kibana
    8.12 Connecting our hosts to Logstash via Syslog
      8.12.1 Configuring Logstash
      8.12.2 A quick introduction to Syslog
      8.12.3 Configuring Syslog
    8.13 Logging from Docker
      8.13.1 Configuring the Docker Daemon for logging
    8.14 Sending data from Logstash to Riemann
    8.15 Sending data from Riemann to Logstash
    8.16 Scaling Elasticsearch and Logstash
      8.16.1 Scaling Logstash
      8.16.2 Scaling Elasticsearch
    8.17 Monitoring our components
      8.17.1 Monitoring RSyslog
      8.17.2 Monitoring Logstash
      8.17.3 Monitoring Elasticsearch
    8.18 Alternatives to Logstash
      8.18.1 Splunk
      8.18.2 Heka
      8.18.3 Graylog
      8.18.4 mtail
    8.19 Summary
  9 Building Monitored Applications
    9.1 An application monitoring primer
      9.1.1 Where should I instrument?
      9.1.2 Instrument schemas
      9.1.3 Time and the observer effect
    9.2 Metrics
      9.2.1 Application metrics
      9.2.2 Business metrics
      9.2.3 Monitoring patterns, or where to put your metrics
      9.2.4 The utility pattern
      9.2.5 The external pattern
      9.2.6 Building metrics into a sample application
    9.3 Logging
      9.3.1 Adding our own structured log entries
      9.3.2 Adding structured logging to our sample application
      9.3.3 Working with your existing logs
    9.4 Health checks, endpoints, and external monitoring
      9.4.1 Checking an internal endpoint
    9.5 Deployments
      9.5.1 Adding deployment notifications to our sample application
      9.5.2 Working with our deployment events
    9.6 Tracing
    9.7 Summary
  10 Notifications
    10.1 Our current notifications
    10.2 Updating expired event configuration
    10.3 Upgrading our email notifications
      10.3.1 Formatting the email subject
      10.3.2 Formatting the email body
    10.4 Adding graphs to notifications
      10.4.1 Defining our data source
      10.4.2 Defining our query parameters
      10.4.3 Defining our graph panels and rows
      10.4.4 Rendering the dashboard
      10.4.5 Adding our dashboard to the Riemann notification
      10.4.6 Some sample scripted dashboards
      10.4.7 Other context
    10.5 Adding Slack as a destination
    10.6 Adding PagerDuty as a destination
    10.7 Maintenance and downtime
    10.8 Learning from your notifications
    10.9 Other alerting tools
    10.10 Summary
  11 Monitoring Tornado: a capstone
    11.1 The Tornado application
      11.1.1 Application architecture
    11.2 Monitoring strategy
    11.3 Tagging our Tornado events
    11.4 Monitoring Tornado — Web tier
      11.4.1 Monitoring HAProxy
      11.4.2 Monitoring Nginx
      11.4.3 Addressing the Web tier monitoring concerns
      11.4.4 Setting up the Tornado checks in Riemann
      11.4.5 The webtier function
    11.5 Adding Tornado checks to Riemann
    11.6 Summary
  12 Monitoring Tornado: Application Tier
    12.1 Monitoring the Application tier JVM
      12.1.1 Configuring collectd for JMX
    12.2 Collecting our Application tier JVM logs
    12.3 Monitoring the Tornado API application
    12.4 Addressing the Tornado Application tier monitoring concerns
    12.5 Summary
  13 Monitoring Tornado: Data tier
    13.1 Monitoring the Data tier MySQL server
      13.1.1 Using MySQL data for metrics
      13.1.2 Query timing
    13.2 Monitoring the Data tier's Redis server
    13.3 Addressing the Tornado Data tier monitoring concerns
    13.4 The Tornado dashboard
    13.5 Expanding monitoring beyond Tornado
    13.6 Summary
  14 An Introduction to Clojure and Functional Programming
    14.1 A brief introduction to Clojure
    14.2 Installing Leiningen
    14.3 Clojure syntax and types
      14.3.1 Clojure functions
      14.3.2 Lists
      14.3.3 Vectors
      14.3.4 Sets
      14.3.5 Maps
      14.3.6 Strings
      14.3.7 Creating our own functions
      14.3.8 Creating variables
      14.3.9 Creating named functions
    14.4 Learning more Clojure