Chapter 1. Introduction
We are in the age of data. Recorded data is doubling in size every two years, and by 2020 we will have captured as many digital bits as there are stars in the universe, reaching a staggering 44 zettabytes, or 44 trillion gigabytes. Included in these figures is the business data generated by enterprise applications as well as the human data generated by social media sites like Facebook, LinkedIn, Twitter, and YouTube.
Big Data: A Brief Primer
Gartner’s description of big data—which focuses on the “three Vs”: volume, velocity, and variety—has become commonplace. Big data has all of these characteristics. There’s a lot of it, it moves swiftly, and it comes from a diverse range of sources.
A more pragmatic definition is this: you know you have big data when you possess diverse datasets from multiple sources that are too large to cost-effectively manage and analyze within a reasonable timeframe when using your traditional IT infrastructures. This data can include structured data as found in relational databases as well as unstructured data such as documents, audio, and video.
IDC predicts that big data will drive the transformation of IT through 2025. Key decision-makers at enterprises understand this. Eighty percent of enterprises have initiated big data–driven projects as top strategic priorities. And these projects are happening across virtually all industries. Table 1-1 lists just a few examples.
| Industry | Big data use cases |
| --- | --- |
| Automotive | Auto sensors reporting vehicle location problems |
| Financial services | Risk, fraud detection, portfolio analysis, new product development |
| Manufacturing | Quality assurance, warranty analyses |
| Healthcare | Patient sensors, monitoring, electronic health records, quality of care |
| Oil and gas | Drilling exploration sensor analyses |
| Retail | Consumer sentiment analyses, optimized marketing, personalized targeting, market basket analysis, intelligent forecasting, inventory management |
| Utilities | Smart meter analyses for network capacity, smart grid |
| Law enforcement | Threat analysis, social media monitoring, photo analysis, traffic optimization |
| Advertising | Customer targeting, location-based advertising, personalized retargeting, churn detection/prevention |
A Crowded Marketplace for Big Data Analytical Databases
Given all of the interest in big data, it’s no surprise that many technology vendors have jumped into the market, each with a solution that purportedly will help you reap value from your big data. Most of these products solve a piece of the big data puzzle. But—it’s very important to note—no one has the whole picture. It’s essential to have the right tool for the job. Gartner calls this “best-fit engineering.”
This is especially true when it comes to databases. Databases form the heart of big data. They’ve been around for a half century, but they have evolved almost beyond recognition during that time. Today’s databases for big data analytics are entirely different animals from the mainframe databases of the 1960s and 1970s, although SQL has remained a constant for the past 20 to 30 years.
There have been four primary waves in this database evolution.
- Mainframe databases
- The first databases were fairly simple and used by government, financial services, and telecommunications organizations to process what (at the time) they thought were large volumes of transactions. But there was no attempt to optimize either loading data into these databases or getting it back out. And they were expensive: not every business could afford one.
- Online transactional processing (OLTP) databases
- The birth of the relational database using the client/server model finally brought affordable computing to all businesses. These databases became even more widely accessible through the Internet in the form of dynamic web applications and customer relationship management (CRM), enterprise resource planning (ERP), and ecommerce systems.
- Data warehouses
- The next wave enabled businesses to combine transactional data (for example, from human resources, sales, and finance) with data from operational software to gain analytical insight into their customers, employees, and operations. Several database vendors seized leadership roles during this time. Some were new and some were extensions of traditional OLTP databases. In addition, an entire industry that brought forth business intelligence (BI) as well as extract, transform, and load (ETL) tools was born.
- Big data analytics platforms
- During the fourth wave, leading businesses began recognizing that data is their most important asset. But handling the volume, variety, and velocity of big data far outstripped the capabilities of traditional data warehouses. In particular, previous waves of databases had focused on optimizing how to get data into the databases. These new databases were centered on getting actionable insight out of them. The result: today’s analytical databases can analyze massive volumes of data, both structured and unstructured, at unprecedented speeds. Users can easily query the data, extract reports, and otherwise access the data to make better business decisions much faster than was possible previously. (Think hours instead of days and seconds/minutes instead of hours.)
One example of an analytical database—the one we’ll explore in this document—is Vertica from Hewlett Packard Enterprise (HPE). Vertica is a massively parallel processing (MPP) database, which means it spreads the data across a cluster of servers, making it possible for systems to share the query-processing workload. Created by legendary database guru and Turing Award winner Michael Stonebraker, and then acquired by HP, the Vertica Analytics Platform was purpose-built from its very first line of code to optimize big-data analytics.
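The core MPP idea, partitioning the data across nodes so each node scans only its own share and the cluster combines the partial results, can be sketched in a few lines of Python. This is an illustrative toy, not Vertica's implementation; the column names, node count, and hash partitioning scheme are assumptions for the example:

```python
from collections import defaultdict

def partition(rows, num_nodes):
    """Distribute rows across nodes by hashing a key column,
    so each node holds roughly 1/num_nodes of the data."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[hash(row["customer"]) % num_nodes].append(row)
    return nodes

def local_sum(rows):
    """Each node computes a partial aggregate over only its own rows."""
    return sum(row["amount"] for row in rows)

# A toy cluster-wide query: total sales, computed as the sum of
# per-node partial sums -- the pattern an MPP database parallelizes.
sales = [{"customer": c, "amount": a}
         for c, a in [("acme", 10), ("globex", 5), ("acme", 7), ("initech", 3)]]
nodes = partition(sales, num_nodes=3)
total = sum(local_sum(rows) for rows in nodes.values())
print(total)  # 25: same answer as a single-node scan, with the work split up
```

The point of the sketch is that the final answer is identical to a single-node scan; what changes is that each node touches only a fraction of the rows, which is what lets an MPP cluster scale query throughput by adding servers.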
Three things in particular set Vertica apart, according to Colin Mahony, senior vice president and general manager for HPE Software Big Data:

- Its creators saw how rapidly the volume of data was growing, and designed a system capable of scaling to handle it from the ground up.
- They also understood all the different analytical workloads that businesses would want to run against their data.
- They realized that getting superb performance from the database in a cost-effective way was a top priority for businesses.
Yes, You Need Another Database: Finding the Right Tool for the Job
According to Gartner, data volumes are growing 30 percent to 40 percent annually, whereas IT budgets are increasing by only 4 percent. Businesses have more data to deal with than they have money. They probably have a traditional data warehouse, but the sheer size of the incoming data is overwhelming it. They can go the data-lake route and set one up on Hadoop, which saves money while capturing all the incoming data, but it won’t help much with the analytics that kicked off the entire cycle. This is why these businesses are turning to analytical databases.
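That budget squeeze compounds quickly. Taking Gartner's figures at the midpoint (35 percent annual data growth versus 4 percent budget growth, both illustrative assumptions), a quick calculation shows the gap after five years:

```python
def grow(value, rate, years):
    """Compound an initial value at a fixed annual growth rate."""
    return value * (1 + rate) ** years

# Index both data volume and IT budget to 100 today.
data = grow(100, 0.35, 5)    # ~448: data is roughly 4.5x today's volume
budget = grow(100, 0.04, 5)  # ~122: budget grows barely a fifth
print(round(data / budget, 1))  # 3.7: each budget dollar must cover ~3.7x more data
```

In other words, even at the midpoint of Gartner's range, every dollar of IT budget has to manage several times more data within five years, which is the economic pressure pushing businesses toward purpose-built analytical databases.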
Analytical databases typically sit next to the system of record—whether that’s Hadoop, Oracle, or Microsoft—to perform speedy analytics of big data.
In short: people assume a database is a database, but that’s not true. Here’s a metaphor created by Steve Sarsfield, a product-marketing manager at HPE, to articulate the situation (illustrated in Figure 1-1):
If you say “I need a hammer,” the right hammer depends on what you’re going to do with it.
The same is true for databases. Depending on what you want to do, you would choose a different database: an MPP analytical database like Vertica, an XML database, or a NoSQL database. You must choose the right tool for the job at hand.
You should choose based upon three factors: structure, size, and analytics. Let’s look a little more closely at each:
- Structure
- Does your data fit into a nice, clean data model? Or will the schema lack clarity or be dynamic? In other words, do you need a database capable of handling both structured and unstructured data?
- Size
- Is your data “big data,” or does it have the potential to grow into big data? If your answer is “yes,” you need an analytical database that can scale appropriately.
- Analytics
- What questions do you want to ask of the data? Short-running queries or deeper, longer-running or predictive queries?
Of course, you have other considerations, such as the total cost of ownership (TCO) based upon the cost per terabyte, your staff’s familiarity with the database technology, and the openness and community of the database in question.
Still, though, the three main considerations remain structure, size, and analytics. Vertica’s sweet spot, for example, is performing long, deep queries of structured data at rest with fixed schemas. But even then, there are ways to extend the range of what Vertica can do by using technologies such as Kafka and Flex Tables, as demonstrated in Figure 1-2.
In the end, the factors that drive your database decision are the same forces that drive IT decisions in general. You want to:
- Increase revenues
- You do this by investing in big-data analytics solutions that allow you to reach more customers, develop new product offerings, focus on customer satisfaction, and understand your customers’ buying patterns.
- Enhance efficiency
- You need to choose big data analytics solutions that reduce software-licensing costs, enable you to perform processes more efficiently, take advantage of new data sources effectively, and accelerate the speed at which that information is turned into knowledge.
- Improve compliance
- Finally, your analytics database must help you to comply with local, state, federal, and industry regulations and ensure that your reporting passes the robust tests that regulatory mandates place on it. Plus, your database must be secure to protect the privacy of the information it contains, so that it’s not stolen or exposed to the world.
Sorting Through the Hype
There’s so much hype about big data that it can be difficult to know what to believe. We maintain that one size doesn’t fit all when it comes to big-data analytical databases. The top-performing organizations are those that have figured out how to optimize each part of their data pipelines and workloads with the right technologies.
The job of vendors in this market: to keep up with standards so that businesses don’t need to rip and replace their data schemas, queries, or frontend tools as their needs evolve.
In this document, we show the real-world ways that leading businesses are using Vertica in combination with other best-in-class big-data solutions to solve real business challenges.