Effective Monitoring and Alerting

Name: Effective Monitoring and Alerting
Author: Slawek Ligus
ISBN: 9781449333522

by Slawek Ligus

November 2012

Intermediate to advanced

166 pages

4h 38m

English

O'Reilly Media, Inc.

Effective Monitoring and Alerting
Preface
Who Should Read This Book; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgements
1. Introduction
Monitoring, Alerting, and What They Can Do for You; Early Problem Detection; Availability; Performance; Decision Making; Baselining; Predictions; Automation; Admission Control; Autonomic Computing; Monitoring and Alerting in a Nutshell; Metrics and Timeseries; Alarms, Alerts, and Monitors; Monitoring System; The Process of Alerting; Issue Tracking; Tickets and queues; The Challenges; Important Terms
2. Monitoring
The Building Blocks; Data Collection; Coverage; Resources; Network; Computational resources; Solution stack; Operating system; Middleware; Application; User experience; Metrics; Summary statistics; Frequency distribution and percentiles; Rate of change; Time granularity; Metric aggregation; Example: Inputs, Metrics, and Timeseries; Understanding Metrics; Type of unit; Data Collection Mode; Data Source; Number of Inputs per Data Point; Type of Quantity; Timeseries Patterns; Drawing Conclusions from Timeseries Plots; Interpretation of Anomalies; Flow; Stock; Availability; Throughput; Applications of quantities; Frequently Encountered Anomalies; Flattening Effect; Warm-Up Effect; Regular Anomalies; Spikes During Troughs; Determining Causality; Capturing the Daily Cycle, Trends, and Seasonal Changes
3. Alerting
The Challenge; Prerequisites; Monitoring and Alerting Platform; Audit Trail; Issue Tracking; Understanding Failure and Its Impact; Establishing Significance; Identifying Causes; Anatomy of an Alarm; Boolean Function; Metric Monitor; Upper Limit; Lower Limit; Outside Range; Data Points Not Recorded; Time Evaluation; Another Alarm as Input Source; Suppression; Aggregation; Case Study: A Data Pipeline; Types of Alerts; Setting Up Alarms; Identifying Impact; Establishing Severity; Picking the Right Timeseries; Configuring Monitors; Coming Up with a Threshold; Static thresholds; Data-driven thresholds; Breach and Clear Delay; Setting Up Alarms; Testing Alerting Configurations; Alerting Suggestions
4. At Scale
Implications of Scale; Composition of Large-Scale Systems; Commonalities of Large-Scale Alerting Configurations; Monitoring Coverage; Reflecting Dimensions in Metrics; Managing Large Alerting Configurations; Addressing the Problems; Organize alarms and monitors in a namespace; Calculate threshold values from metric data; Periodically refresh and clean up the configuration; Suggested Solution; Refresh intervals; Running the engine; Naming; Alarm creation and threshold calculation; Cleanup procedures; Writing Modules; Suppression; Extra Features; Result
5. Monitoring in System Automation
Choosing Appropriate Maintenance Times Automatically; Controlling the Rate of Upgrade; Recovery-Oriented Admission Control; Automated Deployment and Rollback
6. The Work Environment
Keeping an Audit Trail; Working with Tickets; Root Cause Analysis; The Five Whys; Extracting Categories; Dealing with Anomalies; Learning from Outages; Using Checklists; Creating Dashboards; Service-Level Agreements; Preventing the Ironies of Automation; Culture
7. Measuring Success
The Feedback Loop; Root Cause Classification; A Short Story of a Long Classifier List; Timing; Ticket Reporting; Frequency of Incidence; Incidence Times; Time to Respond and Time to Resolution; Measuring Detectability; False Positives and False Negatives; Precision and Recall; The F-Measure; Transition to Automated Alarms; Maintenance Overhead; How (Not) to Measure
8. The Principles
Get in the Habit of Measuring; Draw Conclusions Reliably; Monitor Extensively; Alarm Selectively; Work Smart, Not Hard; Learn from the Experience of Others; Have a Tactic; Run a Bank of Cases; Enjoy the Process

A. Setting Up OpenTSDB
The Software; Architecture; Getting OpenTSDB; First Steps; Starting TSD; Pushing Data; Input Tagging; Tag Wildcards; Temporal Aggregation; Summary Statistics; Rate of Change; Gathering Data System-Wide; Running tcollector; Writing a Custom Collector; Timeseries Plots; Plotting Tips; Get Involved
About the Author
Copyright

Content preview from Effective Monitoring and Alerting

Chapter 1. Introduction

Present-day information systems have become so complex that troubleshooting them effectively necessitates real-time performance data presented at fine granularity, a thorough understanding of data interpretation, and a pinch of skill. The time when you could trace failure to a few possible causes is long gone. Availability standards in the industry remain high and are pushed ever further. The systems must be equipped with powerful instrumentation; otherwise, a lack of information will lead to loss of time and, in some cases, loss of revenue.

Monitoring empowers operators to catch complications before they develop into problems, and helps you preserve high availability and deliver high quality of service. It also assists you in making informed decisions about the present and the future, serves as input to automation of infrastructures and, most importantly, is an indispensable learning tool.

Monitoring, Alerting, and What They Can Do for You

Monitoring has become an umbrella term whose meaning strongly depends on the context. Most broadly, it refers to the process of becoming aware of the state of a system. This is done in two ways: proactive and reactive. The former involves watching visual indicators, such as timeseries and dashboards, and is sometimes what administrators mean by monitoring. The latter involves automated ways to deliver notifications to operators in order to bring to their attention a grave change in the system’s state; this is usually referred to ...


Publisher Resources

ISBN: 9781449333515
