Practical Monitoring

by Mike Julian

October 2017

Beginner to intermediate

170 pages

3h 58m

English

O'Reilly Media, Inc.

Chapter 4. Statistics Primer

Statistics is an undervalued topic in the world of software engineering and systems administration. It’s also misunderstood: many people I’ve spoken to over the years are operating on the misapprehension that “rubbing a little stats on it” will result in magic coming out the other end. Unfortunately, that isn’t quite the case.

However, I am happy to say that a basic lesson in statistics is both straightforward and incredibly useful to your work in monitoring.

Before Statistics in Systems Operations

Before we get into the statistics lesson, it’s helpful to understand a bit of the background story.

I fear that the prevalence and influence of Nagios has stifled the improvement of monitoring for many teams. Setting up an alert with Nagios is so simple, yet so often ineffective.¹

If you want an alert on some metric with Nagios, you’re effectively comparing the current value against another value you’ve already set as a warning or critical threshold. For example, let’s say the returned value is 5 for the 15-minute load average. The check script is going to compare that value against the warning value and the critical value, which might be 4 and 10, respectively. In this situation, Nagios would fire an alert because the check breached the warning value, which is exactly what it was configured to do. Unfortunately, it isn’t very helpful.
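
To make that comparison concrete, here is a minimal sketch of a static-threshold check in Python. It is not taken from Nagios itself: the function name `check_threshold` is hypothetical, and treating a value equal to a threshold as a breach is an assumption. The exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL) follow the usual Nagios plugin convention.

```python
# Exit codes used by Nagios-style check plugins.
OK, WARNING, CRITICAL = 0, 1, 2

STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def check_threshold(value: float, warning: float, critical: float) -> int:
    """Compare a single sample against static thresholds.

    Hypothetical helper for illustration. Assumes a value equal to a
    threshold counts as a breach.
    """
    if value >= critical:
        return CRITICAL
    if value >= warning:
        return WARNING
    return OK


if __name__ == "__main__":
    # The example from the text: 15-minute load average of 5,
    # warning threshold 4, critical threshold 10.
    status = check_threshold(5.0, warning=4.0, critical=10.0)
    print(STATUS_NAMES[status])
```

Note that the check looks at exactly one sample and knows nothing about earlier values, which is why a single brief excursion over the threshold is enough to trigger an alert.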

As so often happens, systems can behave in unexpected (but totally fine) ways. For example, what if the value crossed the threshold for only one occurrence? What ...