Step 3: Be Prepared and Learn

I remember the following very well: I was on call one night and sleeping. All of a sudden the phone rang. It wasn't my cell phone; it was my landline, and nobody would call me at 4:00 in the morning unless something bad had happened. I raced down the stairs to take the call. I was greeted by my boss: "Why haven't you answered my calls? The site has been down for half an hour now, and you didn't respond to the alert. Can you please log in ASAP?" I muttered something like, "Yes boss, right away." I hadn't received any notification; had I slept through it? I went back upstairs to get my token and found that my cell phone battery had died during the night. So much for being alerted....

I logged in and tried the website. I was greeted by a beautiful stack trace indicating a problem with the application; it looked like gibberish to me. Nothing appeared to have changed since the previous day, so why did it fail? After looking around, I didn't find any problem with the disk space or memory, so I tried to restart the application server process. I still had the error. I knew it didn't have to do with the connection to the database, because monitoring showed that queries were still possible. I restarted the database anyway, but again, no luck.

In the meantime, another half hour had passed, and I still had no solution. So, I decided to reboot all the machines. Everything came back online, but the error wouldn't go away.

I was puzzled, and I began to suspect a problem with ...
