
Big data today is, well, big. The 2016 Worldwide Semiannual Big Data and Analytics Spending Guide from IDC predicts that worldwide revenues for big data will grow from $130 billion in 2016 to more than $203 billion in 2020, representing a compound annual growth rate of 11.7%.

Three things are driving this, according to IDC: the increasing availability of ever-larger volumes of data, a rich portfolio of emerging open source big data technologies, and the cultural shift of businesses toward data-driven decision-making.

Sounds good, right?

Wrong.

Reports of the failures of numerous big data initiatives have been published far and wide. In this article, we’ll take a look at why many big data deployments are failures, why today’s solutions to these failures are mere stopgap measures, and why autonomous data platforms are a viable long-term solution.

Big data in the “Trough of Disillusionment”

A 2016 PricewaterhouseCoopers (PwC) study found that “Information Value Index” scores, which ranked both large and small businesses on their ability to get value from big data, averaged just 50.1 out of 100 possible points.

These decidedly failing rankings put big data firmly in the so-called “Trough of Disillusionment,” which generally occurs after interest in a much-hyped technology peaks.

In fact, just 4% of the businesses analyzed scored more than 90 in PwC’s index. To have obtained such a high score, businesses had to have established robust data governance frameworks, built strong data-driven cultures, and enabled secure access to enterprise data for the individuals in the organization who need it.

That so few businesses operate at this level speaks volumes about the lack of maturity of most enterprise big data initiatives. To mature in this realm, enterprises must come to terms with the complexity of big data.

The No. 1 cause for big data failures: Complexity

Big data technology is complex. Very complex. There are a number of reasons.

First, you’re trying to solve a difficult problem using massive—almost inconceivable—amounts of data. On top of that, the open source technologies used for big data implementations were designed for flexibility. They aren’t commercial products, so they don’t have the level of support and services that commercial products do. They're not designed as turnkey solutions for specific problems. Because of that flexibility, they can be difficult and complex to deploy.

You also have to deal with numerous moving parts when you use open source big data tools. Six years ago, you only had Hadoop to worry about. But today, you have Spark, Presto, Hive, Pig, and Kafka, to name just a few. These open source technologies also change frequently. Spark has had six significant releases since the beginning of 2015. Hive has also undergone several iterations of major changes in recent years.

Keeping your big data infrastructure current with the latest versions and bug fixes of all these open source products requires a lot of human overhead, and worker productivity suffers as a result.

Finally, you also don’t have “one throat to choke” when any problems arise. When it comes to support, education, and troubleshooting, you have multiple sources to go to, and have to depend on the open source community for support. Sometimes you are on your own when using big data open source tools.

Today’s solutions to challenges for big data initiatives

The industry is attempting to address these challenges. But today’s solutions tend to be stopgap measures that prove ineffectual in the long run. Let’s explore some common solutions to the issue of big data complexity.

Repackaging open source technologies

First, some vendors repackage big data open source technologies to reduce the complexity for businesses. For example, they might package their solutions around specific use cases. This reduces complexity because any given solution is supposed to solve a specific issue rather than be used for general big data purposes. But you may still need several of these narrower solutions to get your project up and running. For example, if you expect to perform data science as well as data transformations and data analysis, you might have to buy three smaller packaged solutions. And the aggregate complexity is probably no lower than before.

Quick-start programs

Then there are so-called quick-start programs, which are designed to get you up and running faster. But these address only the initial deployment. Alternatively, some vendors offer deployment templates. These are like cookbooks for specific big data use cases, providing step-by-step approaches for following proven best practices.

But the problem is, again, that these may help you get started, but they don’t address the ongoing complexity of big data. And they don’t help with the difficulties of keeping your infrastructure up to date, as the technologies evolve over time.

Clusterless solutions

Then there are vendors that try to hide complexity by decoupling the underlying infrastructure from the workloads—eliminating the notion of clusters.

When you run your big data infrastructure on premise, you tend to utilize a small number of large clusters (for instance, one for dev/test and one for production). But scheduling multiple jobs in a cluster ("concurrency"), in order to maximize overall utilization and therefore ROI, is very tricky. On the other hand, if you’re in the cloud, you can—for example—take advantage of the cloud’s elasticity, spin up a 20-machine cluster, use it for four hours, then discard it. This gives you control over price and performance.
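
To make that ephemeral-cluster pattern concrete, here is a minimal, hedged sketch using boto3 and Amazon EMR. The cluster name, instance types and counts, release label, IAM role names, and S3 paths are illustrative assumptions, not recommendations.

```python
# Sketch of the ephemeral-cluster pattern: spin up a transient cluster, run the
# work, and let it terminate. All names, sizes, and paths are hypothetical.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="nightly-etl",                          # hypothetical cluster name
    ReleaseLabel="emr-5.4.0",                    # EMR release bundling Spark/Hive
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m4.xlarge",
        "SlaveInstanceType": "m4.xlarge",
        "InstanceCount": 20,                     # the 20-machine example above
        "KeepJobFlowAliveWhenNoSteps": False,    # discard the cluster when done
    },
    Steps=[{
        "Name": "nightly-transform",             # hypothetical Spark job
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/transform.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Launched transient cluster:", response["JobFlowId"])
```

Because the cluster is not kept alive once its steps finish, you pay only for the hours the job actually runs.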

But this kind of architecture requires expertise. You need to know the optimal cluster size for a particular workload. You also need to know the machine type, and install and test the appropriate software.

To solve this particular issue, a number of vendors now offer “clusterless” solutions. Amazon’s Athena, the cloud data warehouse Snowflake, and Cloudera’s Impala (currently an Apache incubator project) all eliminate the concept of clusters. You simply point the product at your data and begin running queries.
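
With Amazon Athena, for example, "pointing the product at your data" can be as simple as submitting SQL against files already sitting in S3. Here is a hedged sketch using boto3; the database, table, and S3 bucket names are hypothetical.

```python
# Sketch of the clusterless model: no cluster to size or provision, just submit
# a query and poll for the result. Database, table, and bucket are hypothetical.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = athena.start_query_execution(
    QueryString="SELECT event_type, count(*) FROM clickstream GROUP BY event_type",
    QueryExecutionContext={"Database": "weblogs"},              # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
execution_id = query["QueryExecutionId"]

# Poll until the shared, multitenant service finishes the work.
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)
    status = state["QueryExecution"]["Status"]["State"]
    if status in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

print("Query finished with status:", status)
```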

The good news is that these are easy to manage. However, these are all multitenant in structure, which means a shared infrastructure serves all the users. So, even though they have masked the idea of a cluster, there's still a set of machines running software that is being shared across a number of users—most of whom are not your colleagues. Because of this, clusterless solution vendors put up “guardrails,” or limitations on what you can do so you don’t impact your multitenant neighborhood. This reduces your flexibility since you have to limit your queries so they don’t consume too many resources. In essence, you are hiding complexity from users but reducing control for IT. This affects both cost and service-level agreements (SLAs).

The people solution

The only viable long-term solution to addressing complexity without sacrificing flexibility is to employ more people as your big data initiatives grow—and that’s a problem in its own right, for two reasons.

First, the industry is currently suffering from a skills shortage. There are not enough big data engineers and DBAs to go around. They're hard to find. They're hard to attract. And they're hard to keep once you have them.

A recent survey found that in 2015, 79% of enterprises said there was a shortage of big data professionals in the field. In 2016, it actually got worse. A full 83% of respondents said they had trouble finding the big data talent they needed.

Even if you're up and running with a team that has some degree of expertise, growing your deployment (adding more users and applications, serving more departments, expanding your workloads and use cases) requires proportionately more people to tend to your infrastructure.

Big data by itself is a scale-out technology: you get commensurate benefit from any compute, storage, or other resources you add to your infrastructure. Because of that, companies can now think in terms of handling virtually unlimited amounts of data at a roughly linear cost.

But because of the complexity of big data, you need more humans every time you add a new type of workload or a new use case. And people are not a scale-out resource. If you add one person to your big data team, you don't actually get one full person's worth of output. You get less because of the overhead of communications, collaboration with others, processes, management, administration, and HR.

Complexity is the silent killer of big data, and the reason why solving it with people resources is not a viable long-term solution.

The rise of intelligent, autonomous data platforms

One long-term solution may be to intelligently automate the data platform.

An autonomous data platform is a big data infrastructure that both manages and optimizes itself. It also learns how to do both of those things better by watching how you use your data platform. It watches what kind of queries you run, which tables you use, how many clusters you're building, and how efficiently you are using those clusters. In short, an autonomous data platform examines everything related to how your big data infrastructure is used to solve business challenges, and it analyzes that information to help you make improvements.

It's actually quite analogous to what's going on with self-driving cars. Building a truly self-driving car is quite challenging because the car has to handle any condition (traffic, snow, rain, wind) and compensate the way a human driver would. But companies are chipping away at the problem by picking off specific aspects of the driving experience they can automate. The simplest version of that is the cruise control you have in your car today. But now you've got companies like Tesla offering so-called traffic-aware cruise control that adjusts based on whether there's a car in front of you or not, and how fast that car is going.

So, a completely autonomous data platform is still a vision. But vendors are slowly chipping away at individual problems or tasks that can be automated.

Autonomous data platforms will emerge in three layers. First will come alerts, insights, and recommendations. Then will come policy-based automation. And, finally, we will get to a fully autonomous data platform. Again, at this point, the fully autonomous data platform is just a vision. But we will get there.

Alerts, insights, and recommendations

  • Alerts are a passive form of intelligence. They give you information about your infrastructure, but they don't do anything. An alert might be, “Did you realize this cluster's running, even though no data is being processed?” You could respond either, "Yes, I should probably kill that cluster," or, "No. I'm about to use it—that’s why I left it running." An alert simply tells you about a situation that seems out of the ordinary. You still have to decide whether you want to take action or not.
  • An insight, on the other hand, gives you a nugget of analysis. That means taking out-of-the-ordinary information and analyzing it to come to a conclusion. An insight might be: “John runs the most expensive queries out of all your users.” As with alerts, insights are passive: they offer data, but take no action independently.
  • A recommendation is the next level up from an insight. It takes an analysis and uses it to suggest an action. A recommendation could be: “This field is often used as a filter or sort field in queries, and indexing it would improve performance.” Again, this is just a suggestion; no action is taken automatically by the platform. A minimal sketch of all three levels follows this list.
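
To make the three levels concrete, here is a small, hypothetical sketch in Python. The cluster and query metadata, the metrics, and the thresholds are all invented for illustration; they stand in for whatever telemetry a real platform would collect.

```python
# Hypothetical sketch of alerts, insights, and recommendations over invented
# cluster/query metadata. None of this reflects a specific vendor's product.

clusters = [
    {"id": "c-42", "running": True, "jobs_in_flight": 0, "idle_minutes": 95},
]
queries = [
    {"user": "john", "cost_usd": 41.0, "filter_columns": ["customer_id"]},
    {"user": "ana",  "cost_usd": 3.5,  "filter_columns": ["customer_id"]},
]

# Alert: a passive notification about something out of the ordinary.
def idle_cluster_alerts(clusters, idle_threshold=60):
    return [f"Cluster {c['id']} has been idle for {c['idle_minutes']} minutes "
            "but is still running."
            for c in clusters
            if c["running"] and c["jobs_in_flight"] == 0
            and c["idle_minutes"] > idle_threshold]

# Insight: analysis of that information, still with no action taken.
def most_expensive_user(queries):
    totals = {}
    for q in queries:
        totals[q["user"]] = totals.get(q["user"], 0) + q["cost_usd"]
    user = max(totals, key=totals.get)
    return f"{user} runs the most expensive queries (${totals[user]:.2f} total)."

# Recommendation: the analysis becomes a suggested (but not executed) action.
def index_recommendations(queries, min_uses=2):
    counts = {}
    for q in queries:
        for col in q["filter_columns"]:
            counts[col] = counts.get(col, 0) + 1
    return [f"Column '{col}' is filtered on in {n} queries; consider indexing it."
            for col, n in counts.items() if n >= min_uses]

print(idle_cluster_alerts(clusters))
print(most_expensive_user(queries))
print(index_recommendations(queries))
```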

Policy-based automation

Policy-based automation is when you give your infrastructure a set of guidelines or configuration parameters that result in some sort of action. You set a policy, and then the system takes over and acts on that policy for you.

For example, you could set policies when "shopping" for compute nodes that will deliver the best price-performance. In most cases, clusters are configured using a single, homogeneous machine type (that is, a fixed configuration of CPU, memory, and disk). This is because the underlying technology was originally designed for on-premise deployment, where IT would procure identical machines for the data center.

In the cloud, providers like AWS offer many different machine configurations, each of which has its own price point. AWS auctions unused capacity on a "spot market" where you can bid for compute nodes. It's not unusual when overall demand on AWS is low (like in the middle of the night) to bid 10% of the usual price and win those nodes. Sounds great, right? But there are two “gotchas.” First, if demand is high, you may not get the nodes you want on the spot market, as supply can fluctuate greatly. Second, AWS can seize spot nodes from you if someone is willing to pay the on-demand (regular) price. This makes using the spot market challenging because you cannot guarantee you will own the nodes for the duration you need.

In another example of policy-based automation, Qubole has a spot agent which, based on a policy you set, "shops" for the optimal combination of price and performance. It mixes machine types if necessary and automatically makes bids for you. Policies can be extraordinarily complex, and because they must be evaluated around the clock and in split-second machine time, it's better to let machines do the work than humans. This is not unlike the intelligent stock trading systems that Wall Street firms use.
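
To make the idea concrete, here is a generic, hypothetical sketch of such a spot-shopping policy. This is not Qubole's agent or API: it simply checks recent spot prices with boto3 and decides between spot and on-demand capacity. The instance types, the bid ceiling, and the on-demand prices are assumptions for illustration.

```python
# Generic sketch of a policy-based spot "shopping" decision (not any vendor's
# actual implementation). A human sets the policy once; the machine applies it.
from datetime import datetime, timedelta
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

POLICY = {
    "instance_types": ["m4.xlarge", "r4.xlarge"],   # machine types we may mix
    "max_fraction_of_on_demand": 0.4,               # never bid above 40% of on-demand
    "fallback_to_on_demand": True,
}
ON_DEMAND_PRICES = {"m4.xlarge": 0.20, "r4.xlarge": 0.266}  # illustrative $/hour

def choose_node(policy, on_demand_prices):
    """Return (instance_type, market, price) for the cheapest acceptable option."""
    best = None
    for itype in policy["instance_types"]:
        history = ec2.describe_spot_price_history(
            InstanceTypes=[itype],
            ProductDescriptions=["Linux/UNIX"],
            StartTime=datetime.utcnow() - timedelta(hours=1),
        )["SpotPriceHistory"]
        if not history:
            continue
        latest = max(history, key=lambda h: h["Timestamp"])   # most recent price
        current = float(latest["SpotPrice"])
        ceiling = on_demand_prices[itype] * policy["max_fraction_of_on_demand"]
        if current <= ceiling and (best is None or current < best[2]):
            best = (itype, "spot", current)
    if best is None and policy["fallback_to_on_demand"]:
        itype = min(on_demand_prices, key=on_demand_prices.get)
        best = (itype, "on-demand", on_demand_prices[itype])
    return best

print(choose_node(POLICY, ON_DEMAND_PRICES))
```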

Fully autonomous automation

Finally, we get to fully autonomous automation, where the platform collects information, analyzes it, makes a decision, and acts on it without asking your permission or for any policy guidance.

For example, the platform might observe there are a number of queries using a specific field and determine that an index on that field would improve performance. It might likewise determine that partitioning a table in a certain way would also increase performance. It could detect that a cluster is too small to meet an SLA and resize it dynamically. Because it's a "closed loop," it would be able to see if the change had the desired impact.
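
As a thought experiment, that closed loop might look something like the following hypothetical sketch. The query statistics, the thresholds, and the tuning action are invented; a real platform would wire in its own query engine, metadata, and DDL.

```python
# Hypothetical sketch of one closed-loop, self-optimizing step: observe the
# workload, decide on a change, apply it, and check whether it actually helped.

def observe(query_log):
    """Count how often each column appears as a filter across recent queries."""
    counts = {}
    for q in query_log:
        for col in q["filter_columns"]:
            counts[col] = counts.get(col, 0) + 1
    return counts

def decide(counts, min_uses=100):
    """Decide, without asking permission, which columns are worth indexing."""
    return [col for col, n in counts.items() if n >= min_uses]

def act(column, create_index, measure_latency):
    """Apply the change, then close the loop by checking its impact."""
    before = measure_latency(column)
    create_index(column)
    after = measure_latency(column)
    kept = after < before          # no improvement means the change gets rolled back
    return {"column": column, "kept": kept, "before": before, "after": after}

# Example run with stubbed-in index creation and latency measurements (fake data):
query_log = [{"filter_columns": ["customer_id"]}] * 150
latencies = {"customer_id": iter([12.0, 4.5])}          # seconds, before/after
results = [
    act(col,
        create_index=lambda c: None,                    # stand-in for the real DDL
        measure_latency=lambda c: next(latencies[c]))
    for col in decide(observe(query_log))
]
print(results)
```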

Today, data teams must proactively observe these possibilities and then manually make changes and manually see if the desired outcome is achieved. In reality, most data teams don't have the time/bandwidth to do this, except for the most important or worst performing cases. So, a self-optimizing platform would allow data teams to focus more on business outcomes as opposed to troubleshooting and performance tuning, as well as perform some tasks that data teams rarely can find time to do, such as making many small adjustments that, in aggregate, make a big difference for users.

Complexity is stifling big data, but the future is bright

The end game for any company is to take advantage of all the data it has at its disposal, make that data accessible to anybody in the company who needs it, derive insights from it, and use those insights to make the business run better.

With increasing automation, data teams can work on higher order, more valuable problems. They can focus on the data, not the data platform, to achieve big data success.

This post is a collaboration between O'Reilly and Qubole. See our statement of editorial independence.

Article image: Complex fractal pattern. (source: Pixabay).