The Smoking Gun

At this point, the postmortem analysis agreed with the symptoms from the outage itself: CF appeared to have caused both the IVR and kiosk check-in to hang. The biggest remaining question was still, “What happened to CF?”

The picture got clearer as I investigated the thread dumps from CF. CF’s application server used separate pools of threads to handle EJB calls and HTTP requests. That’s why CF was always able to respond to the monitoring application, even in the middle of the outage. The HTTP threads were almost entirely idle, which makes sense for an EJB server. The EJB threads, on the other hand, were all busy processing calls to FlightSearch.lookupByCity. In fact, every single thread on every application ...
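The effect of those separate pools can be sketched in a few lines, using Python's `concurrent.futures.ThreadPoolExecutor` to stand in for the application server's two pools. This is a minimal illustration under stated assumptions, not CF's actual configuration. The pool sizes, `lookup_by_city`, and `health_check` are hypothetical, and a never-set `threading.Event` simulates a call that blocks indefinitely. Once every EJB-side worker is stuck, further EJB requests simply wait in the queue, while the HTTP-side pool keeps answering the monitor.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Hypothetical pool sizes; the real server's settings are not known here.
EJB_POOL_SIZE = 4
HTTP_POOL_SIZE = 2

# Two independent pools, modeling the server's separate EJB and HTTP threads.
ejb_pool = ThreadPoolExecutor(max_workers=EJB_POOL_SIZE, thread_name_prefix="ejb")
http_pool = ThreadPoolExecutor(max_workers=HTTP_POOL_SIZE, thread_name_prefix="http")

# Assumption: an Event that is never set simulates a call that never returns.
call_released = threading.Event()

def lookup_by_city(city):
    # Hypothetical stand-in for FlightSearch.lookupByCity: blocks until released.
    call_released.wait()
    return []

def health_check():
    # Hypothetical stand-in for the monitoring application's HTTP probe.
    return "OK"

# Occupy every EJB worker with a hung call, plus one extra that must queue.
ejb_calls = [ejb_pool.submit(lookup_by_city, "ORD") for _ in range(EJB_POOL_SIZE + 1)]

# The HTTP pool shares no threads with the EJB pool, so the probe still succeeds.
status = http_pool.submit(health_check).result(timeout=1)

# The extra EJB call never starts: all EJB workers are blocked.
try:
    ejb_calls[-1].result(timeout=0.5)
    ejb_starved = False
except FutureTimeout:
    ejb_starved = True

print(status, ejb_starved)  # OK True

# Release the blocked workers so both pools can shut down cleanly.
call_released.set()
ejb_pool.shutdown(wait=True)
http_pool.shutdown(wait=True)
```

The split is what let the monitor report a healthy server: its requests never competed with the stuck lookups for a thread, so "responds to HTTP" said nothing about whether EJB calls could complete.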
