The incident had started about twenty minutes before Daniel called me. The operations center had escalated to the on-site team. David, the operations manager, had decided to bring me in as well.
Too much was on the line for our client to worry about interrupting a vacation day. Besides, I had told them not to hesitate to call me if I was needed.
We knew a few things at this point, twenty minutes into the incident:
Session counts were very high, higher than the day before.
Network bandwidth usage was high but not hitting a limit.
Application server page latency (response time) was high.
Web, application, and database CPU usage were low—really low.
Search servers, our usual culprit, were responding well. System stats looked ...