Practical Monitoring

Book Description

Do you have a nagging feeling that your monitoring needs improvement, but you just aren’t sure where to start or how to do it? Are you plagued by constant, meaningless alerts? Does your monitoring system routinely miss real problems? This is the book for you.

Mike Julian lays out a practical approach to designing and implementing effective monitoring—from your enterprise application down to the hardware in a datacenter, and everything between. Practical Monitoring provides you with straightforward strategies and tactics for designing and implementing a strong monitoring foundation for your company.

This book takes a unique vendor-neutral approach to monitoring. Rather than discuss how to implement specific tools, Mike teaches the principles and underlying mechanics behind monitoring so you can implement the lessons in any tool.

Practical Monitoring covers essential topics including:

  • Monitoring antipatterns
  • Principles of monitoring design
  • How to build an effective on-call rotation
  • Getting metrics and logs out of your application

Table of Contents

  1. Preface
    1. Who Should Read This Book
    2. Why I Wrote This Book
    3. A Word on Monitoring Today
    4. Navigating This Book
    5. Online Resources
    6. Conventions Used in This Book
    7. Using Code Examples
    8. O’Reilly Safari
    9. How to Contact Us
    10. Acknowledgments
  2. I. Monitoring Principles
  3. 1. Monitoring Anti-Patterns
    1. Anti-Pattern #1: Tool Obsession
      1. Monitoring Is Multiple Complex Problems Under One Name
      2. Avoid Cargo-Culting Tools
      3. Sometimes, You Really Do Have to Build It
      4. The Single Pane of Glass Is a Myth
    2. Anti-Pattern #2: Monitoring-as-a-Job
    3. Anti-Pattern #3: Checkbox Monitoring
      1. What Does “Working” Actually Mean? Monitor That.
      2. OS Metrics Aren’t Very Useful—for Alerting
      3. Collect Your Metrics More Often
    4. Anti-Pattern #4: Using Monitoring as a Crutch
    5. Anti-Pattern #5: Manual Configuration
    6. Wrap-Up
  4. 2. Monitoring Design Patterns
    1. Pattern #1: Composable Monitoring
      1. The Components of a Monitoring Service
    2. Pattern #2: Monitor from the User Perspective
    3. Pattern #3: Buy, Not Build
      1. It’s Cheaper
      2. You’re (Probably) Not an Expert at Architecting These Tools
      3. SaaS Allows You to Focus on the Company’s Product
      4. No, Really, SaaS Is Actually Better
    4. Pattern #4: Continual Improvement
    5. Wrap-Up
  5. 3. Alerts, On-Call, and Incident Management
    1. What Makes a Good Alert?
      1. Stop Using Email for Alerts
      2. Write Runbooks
      3. Arbitrary Static Thresholds Aren’t the Only Way
      4. Delete and Tune Alerts
      5. Use Maintenance Periods
      6. Attempt Automated Self-Healing First
    2. On-Call
      1. Fixing False Alarms
      2. Cutting Down on Needless Firefighting
      3. Building a Better On-Call Rotation
    3. Incident Management
    4. Postmortems
    5. Wrap-Up
  6. 4. Statistics Primer
    1. Before Statistics in Systems Operations
    2. Math to the Rescue!
    3. Statistics Isn’t Magic
    4. Mean and Average
    5. Median
    6. Seasonality
    7. Quantiles
    8. Standard Deviation
    9. Wrap-Up
  7. II. Monitoring Tactics
  8. 5. Monitoring the Business
    1. Business KPIs
    2. Two Real-World Examples
      1. Yelp
      2. Reddit
    3. Tying Business KPIs to Technical Metrics
    4. My App Doesn’t Have Those Metrics!
    5. Finding Your Company’s Business KPIs
    6. Wrap-Up
  9. 6. Frontend Monitoring
    1. The Cost of a Slow App
    2. Two Approaches to Frontend Monitoring
    3. Document Object Model (DOM)
      1. Frontend Performance Metrics
      2. OK, That’s Great, but How Do I Use This?
    4. Logging
    5. Synthetic Monitoring
    6. Wrap-Up
  10. 7. Application Monitoring
    1. Instrumenting Your Apps with Metrics
      1. How It Works Under the Hood
    2. Monitoring Build and Release Pipelines
    3. Health Endpoint Pattern
    4. Application Logging
      1. Wait a Minute…Should I Have a Metric or a Log Entry?
      2. What Should I Be Logging?
      3. Write to Disk or Write to Network?
    5. Serverless / Function-as-a-Service
    6. Monitoring Microservice Architectures
    7. Wrap-Up
  11. 8. Server Monitoring
    1. Standard OS Metrics
      1. CPU
      2. Memory
      3. Network
      4. Disk
      5. Load
    2. SSL Certificates
    3. SNMP
    4. Web Servers
    5. Database Servers
    6. Load Balancers
    7. Message Queues
    8. Caching
    9. DNS
    10. NTP
    11. Miscellaneous Corporate Infrastructure
      1. DHCP
      2. SMTP
    12. Monitoring Scheduled Jobs
    13. Logging
      1. Collection
      2. Storage
      3. Analysis
    14. Wrap-Up
  12. 9. Network Monitoring
    1. The Pains of SNMP
      1. What Is SNMP?
      2. How Does It Work?
      3. A Word on Security
      4. How Do I Use SNMP?
      5. Interface Metrics
      6. Interface and Logging
      7. Recap
    2. Configuration Tracking
    3. Voice and Video
    4. Routing
    5. Spanning Tree Protocol (STP)
    6. Chassis
      1. CPU and Memory
      2. Hardware
    7. Flow Monitoring
    8. Capacity Planning
      1. Working Backward
      2. Forecasting
    9. Wrap-Up
  13. 10. Security Monitoring
    1. Monitoring and Compliance
    2. User, Command, and Filesystem Auditing
      1. Setting Up auditd
      2. auditd and Remote Logs
    3. Host Intrusion Detection System (HIDS)
    4. rkhunter
    5. Network Intrusion Detection System (NIDS)
    6. Wrap-Up
  14. 11. Conducting a Monitoring Assessment
    1. Business KPIs
    2. Frontend Monitoring
    3. Application and Server Monitoring
    4. Security Monitoring
    5. Alerting
    6. Wrap-Up
  15. A. An Example Runbook: Demo App
    1. Demo App
    2. Metadata
    3. Escalation Procedure
    4. External Dependencies
    5. Internal Dependencies
    6. Tech Stack
    7. Metrics and Logs
    8. Alerts
  16. B. Availability Chart
  17. Index