The Smoking Gun

At this point, the postmortem analysis agreed with the symptoms from the outage itself: CF appeared to have caused both the IVR and kiosk check-in to hang. The biggest remaining question was still, “What happened to CF?”

The picture got clearer as I investigated the thread dumps from CF. CF’s application server used separate pools of threads to handle EJB calls and HTTP requests. That’s why CF was always able to respond to the monitoring application, even during the middle of the outage. The HTTP threads were almost entirely idle, which makes sense for an EJB server. The EJB threads, on the other hand, were all completely in use processing calls to FlightSearch.lookupByCity. In fact, every single thread on every application ...
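To make the separate-pool behavior concrete, here is a minimal sketch in Python rather than the Java/EJB stack CF actually ran on. The pool sizes, the `health_check` function, and the `release` event that stands in for whatever the EJB calls were waiting on are all assumptions made for illustration; only the name of the lookup call comes from the thread dumps described above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pool sizes; the real CF configuration is not given in the text.
ejb_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="ejb")
http_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="http")

# Stand-in for whatever the blocked calls were waiting on (an assumption).
release = threading.Event()

def lookup_by_city(city):
    # Simulates a FlightSearch.lookupByCity call that blocks indefinitely.
    release.wait()
    return f"flights for {city}"

def health_check():
    # Simulates the request sent by the monitoring application.
    return "OK"

# Three calls arrive, but only two EJB threads exist: both block,
# and the third call waits in the queue behind them.
ejb_calls = [ejb_pool.submit(lookup_by_city, c) for c in ("SFO", "LHR", "NRT")]

# The HTTP pool has its own threads, so the health check is still answered.
status = http_pool.submit(health_check).result(timeout=5)

# No EJB call can finish while the pool is saturated.
queued_done = ejb_calls[2].done()

# Unblock the simulated calls so the sketch exits cleanly.
release.set()
results = [f.result(timeout=5) for f in ejb_calls]
ejb_pool.shutdown()
http_pool.shutdown()
```

In this sketch the health check succeeds while every lookup is stuck, which mirrors why a monitor that only exercises the HTTP side can report a healthy server in the middle of an outage.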
