Chapter 1. Executive Action
“This can’t happen again,” Roger said, looking directly at Bill, his gaze piercing, almost brutal. They were sitting across from each other at a massive wooden table polished to perfection. Despite his restrained demeanor, Roger, the son of a British diplomat, was furious. Even sitting down, he towered over everyone in the room. Roger had a thin face and a full head of white hair. In his fitted, dark-blue suit, pink shirt, and gold cufflinks, he was the only formally dressed person in the room. Roger’s other direct reports—heads of all the I.T. groups at the firm where Roger was chief information officer—were seated around the table.
“Just last week,” he said, “I had to explain to the Executive Committee the reasons for the previous outage.”
“I take full responsibility for this one,” Bill said, trying not to look away.
As if he hadn’t heard, Roger continued. “I had to stand there like an amateur and assure them that we have the expertise to keep the firm’s infrastructure running smoothly. And now the network is down on the day we have the highest volume of trades on record?”
Roger finally acknowledged Bill. “Do you have an explanation?”
Bill sighed. He was short, slightly overweight, with what little remained of his blond hair closely cropped. He wore rimless glasses and his blue polo shirt untucked. His beige khakis were held up by a brown leather belt, sagging under the weight of two smartphones in black holsters.
He was the head of the firm’s networking group, which operated one of the largest telecommunications networks in the world—a fact not widely known outside the insular world of financial institutions. The network tied together all the firm’s offices, and connected it to all its partners and markets around the globe.
“We run a pretty large network,” Bill said. “And it’s been engineered to have no single points of failure. Everything is redundant, and we can route around most network failures without anyone noticing.”
“You sure made people take notice today,” Raj said, without looking up from his tablet—brand new, not yet released to the public. With a Ph.D. in mathematics from M.I.T., Raj was one of the most brilliant technologists at the firm. Despite being the manager of the firm’s entire Application Development Group, he still routinely wrote code himself, usually late at night, and often sparred with software developers on the finer details of their craft. He wore black dress pants and a crisply pressed white shirt, sleeves rolled up to reveal several personal tracking devices on his wrists. His thinning black hair was carefully combed to one side. His long fingers noodled his new gadget.
“You can’t help but notice when you lose millions of dollars,” he said. “No one can trade. Like the time we couldn’t trade last week. And yesterday. And, thanks to you, we’ll all certainly notice in March, when we get our bonuses.”
“Get to the point,” Roger said, looking at Bill.
“Yes,” said Bill, glaring at Raj, who continued to tap away on his tablet without looking up. “As I was saying, we have full redundancy in our network, including our core routers.”
“Explain,” Roger said.
“They’re the central routers that connect all parts of our network. We’re careful with them. Only half of them are active at any given time, and they have enough capacity to handle all the network traffic, even on super busy days like yesterday. We keep the other half of the routers on standby; if something goes wrong, they take over automatically, like they did yesterday.”
“So why the outage?”
“The problem is that four weeks ago, the vendor released an emergency security patch for the core routers. They found a vulnerability that could allow someone to take full control of the network.”
“That’s not good.”
“It was all over the news. The router vendor found that this vulnerability had been exploited at Castle-Mart for the past six months; the hackers were using it to steal credit card numbers. So in response, we got a security patch from the vendor and installed it in our lab, to test before we did anything to the production network. We ran some load tests and everything looked fine. Then that Saturday night we installed the software on our standby routers, and cut over all the network traffic there.
“The upgrade seemed to go well. But just to be on the safe side, we waited two weeks before patching the rest of the routers.”
“Which brings us to last Saturday.”
“Right. So last Saturday night, we patched the rest of the routers. Again, everything looked good. Now all of our routers had the latest security patch installed.”
“So this patch is the root cause of the outage on Monday?” asked Ollie. He was the head of the Distributed Systems Group, which managed thousands of servers the firm needed to conduct its business. Thin, with closely cropped hair and a permanent five o’clock shadow, he wore faded jeans and a rumpled T-shirt with a logo of a tech company that no longer existed.
“You of all people should know it’s not so simple,” Bill said. “We had this patch running in production for two weeks without any issues.”
“But never with as much traffic as we had on Monday?” Ollie said.
“That’s right. As I said, we did some load tests in the lab, but we never generated as much load as we had on Monday.”
“Why not?” Roger said. “For months now, we’ve had an explicit mandate—and budget—from the E.C. to prepare for increasing trading volumes.”
“We have,” said Bill. “Our production routers can handle at least twice the volume we had on Monday.”
“I’m confused,” Roger said.
“Just let me finish, Roger,” Bill said. “Our lab has a smaller version of our production routers. The real things cost millions of dollars, and it makes no sense to have them in the lab.” He paused. “Well, maybe it does now. But, in any case, the load-testing equipment we have generates a whole bunch of traffic, but not as much as we had on Monday. So we have to extrapolate from lab results. Generally speaking, if things work well in the lab, they tend to work well in production.”
“Well, they didn’t work so well this time,” Raj said, smirking.
“No, they didn’t,” Bill said. “And I’m guessing the way you build your apps didn’t help.”
Raj stopped playing with his tablet and looked up at Bill. “Don’t blame me for your shitty routers.”
“I’m not. I’m just saying that we’ve got a system here with many moving parts: networks and servers and, yes, the code your guys write. When the volume of trades started to cross into record territory around 2 p.m. on Monday, we started to see the routers slow down. Nothing bad, but we noticed alerts, so we began investigating. We logged into one of the core routers, and issued a routine status command to find out what was going on. What happened next was really strange: instead of showing the status of the router, that command apparently crashed it.”
Ollie sat up.
“Turns out,” Bill said, “the security patch introduced a nasty bug into the status command, which gets activated only when there’s a certain amount of traffic. We didn’t know it at the time, but we hit that very bug, which caused our primary core routers to crash. That caused a momentary blip on the network, and here’s where your stuff comes in, Raj.”
“This oughta be interesting,” Raj said under his breath.
“So after a brief network pause while the standby router took over automatically, we started to see an exponential increase in network traffic. We couldn’t figure out why this was happening because the volume of trades didn’t increase. We think maybe your apps were not designed to handle network outages.”
“Oh, OK, so you expect me to rewrite our entire application portfolio just because you can’t keep the network up? Everywhere I’ve ever worked, the network functions fine no matter what—even on 9/11—except this place.”
Bill was shaken. In addition to his day job, he was a longtime volunteer EMT, and had lost friends at the World Trade Center on 9/11. Reminders of that day always hurt, but in this context, they hurt more. “The root cause...” He worked to regain his composure. “OK...Now, after the primary router crashed and the standby router took over, it started to get overloaded with all the traffic. So we again logged in, and started to troubleshoot.”
“Who’s ‘we’?” Ollie asked.
“It was Mike, actually. He logged in and started to troubleshoot, and in the process accidentally took down the standby router using that same buggy command. That took down the entire network.”
Roger closed his eyes and tapped his fingers on his forehead. “Hang on, Mike used the same command again? The one that took down the first router?”
“That’s just careless,” Raj said.
Bill looked at Raj angrily. “Actually, we literally use this command hundreds of times every day. We had no idea that using it would take down the network.”
“It sounds like we have our root cause,” Ollie said.
“What’s that?” Bill said.
“If Mike hadn’t used that buggy command, the network wouldn’t have crashed, right? I’m sure he didn’t mean it, but his mistake is what caused the outage.”
“I see where you’re going with that, Ollie,” Bill said. “Yes, it was an operator error, but”—Bill hesitated, knowing that Mike’s job was on the line—“Mike is one of our most experienced engineers.”
“Well, any experienced engineer knows to be extra careful when troubleshooting in production,” said Ollie. “Especially on a day with record trading volumes.”
“Didn’t Mike also cause the previous network outage?” Raj asked. “We almost lost our ability to trade that day. I guess we got lucky.”
“Bill, didn’t you talk to Mike after that first outage?” said Linda, who wasn’t sitting at the table, but near the wall. She was the head of Client Services, the team that bore the brunt of calls from within the firm, and from its clients, during any outage. She wore beige slacks and a tailored white shirt, her blonde hair cut at the shoulders. She spoke with a slight, unidentifiable accent. “Didn’t you have a sit-down with Mike about being more careful when working in production?”
“If you’re going to throw anyone under the bus, it should be me,” Bill said.
“No one is going to throw you under the bus, Bill,” said Linda. “Mike is the one with a history of taking down production. He’s a bit of a cowboy.”
“Mike’s been with the firm for six years,” said Bill. “He’s done amazing work. He did take down the network this time, but he also got it working again. Without him, we would have had a much longer outage. It’s hard to find engineers as good as Mike, and we almost lost him earlier this year when we docked his bonus for the last outage.”
The room fell silent. Finally, Roger said, “One of the core values of our firm is accountability. We have to hold people accountable for their actions. I can’t see us tolerating someone who continues to make mistakes. Mike’s careless mistakes are costing the firm money, and also making the entire I.T. team look incompetent.” Roger, speaking to Bill, said, “I agree with Ollie. The root cause of this kerfuffle is operator error. You need to deal with it. We can’t afford to have this happen again. Let me be absolutely clear: all our jobs are on the line. Given how much we’re paying for talent, we should be able to find engineers who take more care. Make it happen, Bill.”
Roger stood up and looked around the room. “Are there any questions?” No one said anything, so Roger left the room, followed by Raj. Linda got up next, saying, “Sorry, Bill.” Ollie, who was sitting next to Bill, sighed and shook his head.