I once worked with a team designing a large ecommerce website infrastructure. When I say large, I mean eight Cisco 6509 switches serving more than 200 physical servers (most with multiple virtual machines), providing upward of a gigabit per second of content. Timelines were tight, and everyone was stressed. As the schedule was compressed over the life of the project, one of the key phases that was eliminated was failure testing.
After the site went live, a device failed. The site was designed to have no single point of failure, yet it stopped functioning properly. It turned out the failover device had been misconfigured in a way that caused a problem only when the active device failed. Because the failure cut off connectivity to the site, we had no way of reaching the failed equipment except to drive to the colocation facility. This failure, which should not have been possible, resulted in a two-hour outage while someone drove to the facility with a console cable.
Had failover testing been done, the problem would have been found before the site went live, and the outage would have been avoided. The design was correct, but its implementation was not. Always insist on failure testing in high-availability environments. Failure testing should be done regularly, as part of normal maintenance at scheduled intervals. Believing that your network is redundant is not the same as proving it.
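One way to make failure testing routine is to script the drill so it can be repeated at every maintenance window. What follows is only a minimal Python sketch of that idea, not anything used on the project described above. The names `run_failover_drill`, `DrillResult`, and `SimulatedPair` are hypothetical, and the `probe`, `fail`, and `restore` callables are placeholders for whatever health check and failure-injection steps your environment actually uses (for example, an HTTP check and a manual interface shutdown).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DrillResult:
    device: str
    healthy_before: bool    # service answered before anything was touched
    survived_failure: bool  # service answered while the active device was down
    recovered: bool         # service answered after the device was restored

    @property
    def passed(self) -> bool:
        return self.healthy_before and self.survived_failure and self.recovered


def run_failover_drill(
    device: str,
    probe: Callable[[], bool],
    fail: Callable[[str], None],
    restore: Callable[[str], None],
) -> DrillResult:
    """Fail one device on purpose and record whether service stayed up."""
    if not probe():
        # Never induce a failure when the service is already unhealthy.
        return DrillResult(device, False, False, False)
    fail(device)
    try:
        survived = probe()
    finally:
        # Always put the device back, even if the probe itself raises.
        restore(device)
    return DrillResult(device, True, survived, probe())


class SimulatedPair:
    """Hypothetical active/standby pair, used only to exercise the drill."""

    def __init__(self, standby_ok: bool) -> None:
        self.standby_ok = standby_ok  # False models a misconfigured standby
        self.active_up = True

    def probe(self) -> bool:
        # The service is reachable if either device can carry traffic.
        return self.active_up or self.standby_ok

    def fail(self, device: str) -> None:
        self.active_up = False

    def restore(self, device: str) -> None:
        self.active_up = True


if __name__ == "__main__":
    for standby_ok in (True, False):
        pair = SimulatedPair(standby_ok)
        result = run_failover_drill("fw-active", pair.probe, pair.fail, pair.restore)
        print(f"standby_ok={standby_ok} passed={result.passed}")
```

The simulated pair with a broken standby looks perfectly healthy until the active device is taken down, which is the same kind of hidden fault the story describes. In a real drill, make sure you have out-of-band access to the equipment before inducing the failure.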