Distributed systems performance solutions require real-time intelligence

The O’Reilly Podcast: Pepperdata CEO Sean Suchter discusses why today's best practices fall short.

By Nicole Tache
February 2, 2016
Selenite cluster, Naica, Chihuahua, Mexico. (source: Wikimedia Commons)

In this O’Reilly Podcast episode, I talk with Sean Suchter, co-founder and CEO of Pepperdata. Suchter has been working with Hadoop and distributed systems for more than 15 years. He was vice president of Yahoo’s Web Search Technology, general manager of Microsoft’s Search Technology Center Silicon Valley, and is now at the helm of Pepperdata, where he is focused on providing solutions to the business-critical issue of real-time cluster optimization. We discussed the inherent performance challenges in powerful distributed systems, the need for increased control and performance guarantees on existing Hadoop clusters, and the mission that’s driving the work he’s doing at Pepperdata.

The power of distributed systems like Hadoop is profound: they enable multiple users within an organization to process and store an immense amount of diverse data at incredible speed. But, as Suchter expressed in our interview, the fundamental limitations of these complex distributed systems are also increasingly evident. In Suchter’s experience, the problem is a lack of oversight and control of workloads. The implications can be severe: jobs are late; SLAs are difficult to meet; the performance of critical applications, like HBase, is inconsistent at best; clusters are overbuilt and underutilized; and people spend a lot of time trying to identify and debug issues. It can turn into a management nightmare. The workarounds currently regarded as “best practice” appear ineffective at addressing the root causes of these performance issues. According to Suchter, this is not a problem that can be fully solved by human intervention; it’s a problem only software can truly address. And it’s a problem that more and more companies are facing.

Here are a few highlights from our conversation:

“One of my memories of an early use of Hadoop at Yahoo was a time when the production search engine was happily serving away in the middle of the day and somebody did a very innocuous job—an entirely reasonable thing for them to do—and the power of Hadoop and the distributed fabric let them run that job very, very fast—and it pretty much saturated the entire network and took the production search engine down. It was an early example of the danger of these very powerful systems.”

“Distributed computing makes performance a really hard problem: when you’re sitting here on hundreds or thousands of nodes and each one of those has dozens of things running and each one of those dozen things is independently using up CPU and RAM and disk and network, and all of those different usages are constantly changing. So, it looks like an extremely chaotic system when you look under the hood. … The problem is that the chaos is changing, second by second, so you need to react automatically to the changing dynamic in a second-by-second timeframe.”

“Many people are used to this very static way that you have to isolate clusters, and overprovision, and pre-plan, and tweak and tune, and drive your cluster by looking in the rear-view of ‘what happened yesterday.’ … We realize this problem is going to get so acute that it needs a programmatic solution.”

“Hadoop is the highway, and it decides which cars to let on and how fast, but then once the cars are on the highway, it’s a free for all. They all have an independent driver and they’re all speeding up and slowing down. What we’re doing is watching every car, watching exactly where it is, we’re watching how fast it’s going. We’re controlling that. We’re saying ‘this needs to speed up a little bit,’ ‘this needs to move a little to the left,’ ‘you need to change exactly this aspect of your behavior right now, in this second.’ And the result is that you can get much more stuff through the highway; you can get things to happen in a very predictable time. And, when there is a traffic accident or when there is a collision, you know exactly what happened.”
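
To make the highway analogy concrete, here is a minimal sketch of what a single second-by-second control step could look like. It is not Pepperdata's implementation, which the interview does not describe; the `Task` class, the `rebalance` function, the priority convention, and the network-only focus are all assumptions made for illustration. The idea is simply that when a shared link is oversubscribed, the lowest-priority work is slowed first so that critical work keeps its bandwidth.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    """Hypothetical view of one running task on a shared network link."""
    name: str
    priority: int                      # assumed convention: higher = more important
    net_mbps: float                    # usage observed in the latest one-second sample
    cap_mbps: Optional[float] = None   # limit set by the controller; None = unthrottled


def rebalance(tasks: List[Task], link_capacity_mbps: float) -> List[Task]:
    """One control tick: if demand exceeds capacity, cap low-priority tasks first."""
    # Start each tick fresh so caps do not linger after contention ends.
    for task in tasks:
        task.cap_mbps = None

    excess = sum(task.net_mbps for task in tasks) - link_capacity_mbps

    # Walk tasks from least to most important, trimming until the link fits.
    for task in sorted(tasks, key=lambda t: t.priority):
        if excess <= 0:
            break
        cut = min(task.net_mbps, excess)
        task.cap_mbps = task.net_mbps - cut
        excess -= cut
    return tasks


if __name__ == "__main__":
    # A serving workload and an ad hoc batch job together oversubscribe the link.
    running = [
        Task("search-serving", priority=10, net_mbps=600.0),
        Task("adhoc-job", priority=1, net_mbps=900.0),
    ]
    for task in rebalance(running, link_capacity_mbps=1000.0):
        print(task.name, task.cap_mbps)
```

In a real cluster this step would run continuously against fresh measurements and would have to cover CPU, RAM, and disk as well as network; the sketch shows only the decision made within one tick.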

This post is a collaboration between O’Reilly and Pepperdata. See our statement of editorial independence.

Post topics: Data science