Effective Monitoring and Alerting

Name: Effective Monitoring and Alerting
Author: Slawek Ligus
ISBN: 9781449333522

by Slawek Ligus

November 2012

Intermediate to advanced

166 pages

4h 38m

English

O'Reilly Media, Inc.

Effective Monitoring and Alerting
Preface
Who Should Read This Book; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgements
1. Introduction
Monitoring, Alerting, and What They Can Do for You; Early Problem Detection; Availability; Performance; Decision Making; Baselining; Predictions; Automation; Admission Control; Autonomic Computing; Monitoring and Alerting in a Nutshell; Metrics and Timeseries; Alarms, Alerts, and Monitors; Monitoring System; The Process of Alerting; Issue Tracking; Tickets and queues; The Challenges; Important Terms
2. Monitoring
The Building Blocks; Data Collection; Coverage; Resources; Network; Computational resources; Solution stack; Operating system; Middleware; Application; User experience; Metrics; Summary statistics; Frequency distribution and percentiles; Rate of change; Time granularity; Metric aggregation; Example: Inputs, Metrics, and Timeseries; Understanding Metrics; Type of unit; Data Collection Mode; Data Source; Number of Inputs per Data Point; Type of Quantity; Timeseries Patterns; Drawing Conclusions from Timeseries Plots; Interpretation of Anomalies; Flow; Stock; Availability; Throughput; Applications of quantities; Frequently Encountered Anomalies; Flattening Effect; Warm-Up Effect; Regular Anomalies; Spikes During Troughs; Determining Causality; Capturing the Daily Cycle, Trends, and Seasonal Changes
3. Alerting
The Challenge; Prerequisites; Monitoring and Alerting Platform; Audit Trail; Issue Tracking; Understanding Failure and Its Impact; Establishing Significance; Identifying Causes; Anatomy of an Alarm; Boolean Function; Metric Monitor; Upper Limit; Lower Limit; Outside Range; Data Points Not Recorded; Time Evaluation; Another Alarm as Input Source; Suppression; Aggregation; Case Study: A Data Pipeline; Types of Alerts; Setting Up Alarms; Identifying Impact; Establishing Severity; Picking the Right Timeseries; Configuring Monitors; Coming Up with a Threshold; Static thresholds; Data-driven thresholds; Breach and Clear Delay; Setting Up Alarms; Testing Alerting Configurations; Alerting Suggestions
4. At Scale
Implications of Scale; Composition of Large-Scale Systems; Commonalities of Large-Scale Alerting Configurations; Monitoring Coverage; Reflecting Dimensions in Metrics; Managing Large Alerting Configurations; Addressing the Problems; Organize alarms and monitors in a namespace; Calculate threshold values from metric data; Periodically refresh and clean up the configuration; Suggested Solution; Refresh intervals; Running the engine; Naming; Alarm creation and threshold calculation; Cleanup procedures; Writing Modules; Suppression; Extra Features; Result
5. Monitoring in System Automation
Choosing Appropriate Maintenance Times Automatically; Controlling the Rate of Upgrade; Recovery-Oriented Admission Control; Automated Deployment and Rollback
6. The Work Environment
Keeping an Audit Trail; Working with Tickets; Root Cause Analysis; The Five Whys; Extracting Categories; Dealing with Anomalies; Learning from Outages; Using Checklists; Creating Dashboards; Service-Level Agreements; Preventing the Ironies of Automation; Culture
7. Measuring Success
The Feedback Loop; Root Cause Classification; A Short Story of a Long Classifier List; Timing; Ticket Reporting; Frequency of Incidence; Incidence Times; Time to Respond and Time to Resolution; Measuring Detectability; False Positives and False Negatives; Precision and Recall; The F-Measure; Transition to Automated Alarms; Maintenance Overhead; How (Not) to Measure
8. The Principles
Get in the Habit of Measuring; Draw Conclusions Reliably; Monitor Extensively; Alarm Selectively; Work Smart, Not Hard; Learn from the Experience of Others; Have a Tactic; Run a Bank of Cases; Enjoy the Process

A. Setting Up OpenTSDB
The Software; Architecture; Getting OpenTSDB; First Steps; Starting TSD; Pushing Data; Input Tagging; Tag Wildcards; Temporal Aggregation; Summary Statistics; Rate of Change; Gathering Data System-Wide; Running tcollector; Writing a Custom Collector; Timeseries Plots; Plotting Tips; Get Involved
About the Author
Copyright

Overview

The book describes a data-driven approach to optimal monitoring and alerting in distributed computer systems. It interprets monitoring as a continuous process aimed at extracting meaning from a system's data. The resulting wisdom drives effective maintenance and fast recovery, the bread and butter of web operations.

The content of the book gives a scalable perspective on the following topics:

anatomy of monitoring and alerting
conclusive interpretation of time series
data-driven approach to setting up monitors
addressing system failures by their impact
applications of monitoring in automation
reporting on quality with quantitative means
and more!

Publisher Resources

ISBN: 9781449333515