Analytical databases are an increasingly critical part of businesses’ big data infrastructure. Specifically designed to offer performance and scalability advantages over conventional relational databases, analytical databases enable business users as well as data analysts and data scientists to easily extract meaning from large and complex data stores.
But to wring the most knowledge and meaning from the data your business collects every minute, if not every second, it’s important to keep some best practices in mind when you deploy your big data analytical database. Leading businesses that have deployed such analytical databases point to five pitfalls you should avoid as your big data initiatives mature.
1. Don’t ignore your users when choosing your analytical database tools
Business users, analysts, and data scientists are very different people, says Chris “CB” Bohn, a senior database engineer with Etsy, a marketplace where millions of people around the world connect, both online and offline, to make, sell, and buy unique goods. For the most part, data scientists are going to be comfortable working with Hadoop, MapReduce, Scalding, and Spark, whereas data analysts live in an SQL world. “If you put tools in place that your users don’t have experience with, they won’t use those tools. It’s that simple,” says Bohn.
Etsy made sure to consider its end users before choosing an analytical database, and those end users, it turned out, were mainly analysts. So Etsy picked a database based on the same SQL as PostgreSQL, which offered familiarity for end users and increased their productivity.
2. Don’t think too big when starting your big data initiative
Big data has generated a lot of interest lately. CEOs are reading about it in the business press and expressing their desire to leverage enterprise data to do everything from customizing product offerings, to improving worker productivity, to ensuring better product quality. But too many companies begin their big data journeys with big budgets and even bigger expectations. They attempt to tackle too much. Then, 18 months down the road, they have very little to show.
It’s more realistic to think small. Focus on one particular business problem—preferably one with high visibility—that could be solved by leveraging data more effectively. Address that problem with basic data analytics tools—even Excel can work. Create a hypothesis and perform an exercise that analyzes the data to test that hypothesis. Even if you get a different result than you expected, you’ve learned something. Rinse and repeat. Do more and more projects using that methodology “and you’ll find you’ll never stop—the use cases will keep coming,” affirms HPE’s Colin Mahony, senior vice president and general manager for HPE Software Big Data.
Larry Lancaster, the former chief data scientist at a company offering hardware and software solutions for data storage and backup, agrees. “Just find a problem your business is having,” advises Lancaster. “Look for a hot button. Instead of hiring a new executive to solve that problem, hire a data scientist.”
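As a rough illustration of the "create a hypothesis and test it" loop described above, here is a minimal sketch of a permutation test using only the Python standard library. The function name, the sample figures, and the scenario (comparing order values before and after a change) are hypothetical and not drawn from any company mentioned in this article; the same exercise could be done in a spreadsheet.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, trials=10_000, seed=0):
    """Estimate how often a difference in means at least as large as the
    observed one would appear if group labels were assigned at random."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / trials


# Hypothetical data: order values before and after a change.
before = [1, 2, 3, 4, 5, 6]
after = [10, 11, 12, 13, 14, 15]
p = permutation_p_value(after, before)
print(f"estimated p-value: {p:.4f}")
```

A small p-value suggests the difference is unlikely to be chance alone; a large one is still a result, in the "you’ve learned something" sense, and a prompt to form the next hypothesis.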
3. Don’t underestimate data volume growth
Virtually all big data veterans warn about unanticipated data volumes. Cerner, a company working at the intersection of health care and information technology, was no exception. Based in Kansas City, Cerner offers health information technology (HIT) solutions that connect people and systems at more than 20,000 facilities worldwide.
Even though Cerner estimated quite substantial data volume growth at the time of the proof of concept in 2012, the growth has accelerated beyond Cerner’s wildest expectations.
“At the time we predicted a great deal of growth, and it certainly wasn’t linear,” says Dan Woicke, director of enterprise system management at Cerner. “Even so, we never would have predicted how fast we would grow. We’re probably at double or triple the data we expected.”
The moral: choose a database that can scale to meet unanticipated data volumes.
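To make the point about non-linear growth concrete, the sketch below projects data volume under compound growth and then applies a headroom multiplier. All figures (100 TB starting volume, 50 percent annual growth, a 3x headroom factor echoing the "double or triple" remark) are illustrative assumptions, not Cerner's actual numbers, and the function names are hypothetical.

```python
def projected_volume_tb(initial_tb: float, annual_growth: float, years: int) -> float:
    """Volume after `years` of compound growth at a fixed annual rate."""
    return initial_tb * (1 + annual_growth) ** years


def capacity_with_headroom(forecast_tb: float, headroom_factor: float) -> float:
    """Capacity to plan for if real volume overshoots the forecast."""
    return forecast_tb * headroom_factor


# Assumed inputs for illustration only.
forecast = projected_volume_tb(initial_tb=100.0, annual_growth=0.5, years=3)
planned = capacity_with_headroom(forecast, headroom_factor=3.0)

print(f"forecast after 3 years: {forecast} TB")  # 337.5 TB
print(f"capacity with headroom: {planned} TB")   # 1012.5 TB
```

Even a forecast that already assumes compounding can land at a fraction of the real figure, which is why the ability to scale out matters more than the precision of the initial estimate.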
4. Don’t throw away any of your data
One mistake that many businesses make is not saving all of their data. They think once data gets old, it is stale and irrelevant. Or they can’t think of a specific use for a data point, and so they discard it. This is a serious error. Further down the road, that data might turn out to be essential for a key business decision.
“You never know what might come in handy,” says Etsy’s Bohn.
Today’s storage and database technologies make it quite inexpensive to store data for the long term. Why not save it all? Look for analytical databases that can scale to accommodate as much data as you generate. “As long as you have a secure way to lock data down, you should keep it,” says Bohn. “You may later find there’s gold in it.”
5. Don’t lock yourself into rigid, engineered-system data warehouses
According to Bohn, one of the lessons he’s learned in his big data journey is this: your data is your star, and this drives your database purchasing decisions.
“Do you use the cloud or bare iron in a colocation facility?” asks Bohn. “This will matter, because to get data into the cloud you have to send it over the Internet—which will be not as fast as if your big data analytical system is located right next to your production system.”
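A back-of-the-envelope calculation shows why data location matters. The sketch below estimates how long a bulk load takes over a network link; the data size, link speeds, and the 80 percent efficiency figure are assumptions chosen for illustration, and the function name is hypothetical.

```python
def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Hours needed to move `data_tb` terabytes over a link of `link_gbps`
    gigabits per second, where `efficiency` is the usable share of bandwidth."""
    bits = data_tb * 8e12                      # 1 TB = 8e12 bits (decimal units)
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits / usable_bits_per_second / 3600


# Assumed scenario: a 10 TB load over an Internet uplink vs. a local network.
internet = transfer_hours(data_tb=10, link_gbps=1)
local = transfer_hours(data_tb=10, link_gbps=10)

print(f"1 Gbps Internet link: {internet:.1f} hours")   # about 27.8 hours
print(f"10 Gbps local network: {local:.1f} hours")     # about 2.8 hours
```

Under these assumptions, the same load that takes more than a day over the Internet finishes in under three hours when the analytical system sits next to production.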
Bohn adds, “It’s also important that you don’t go down any proprietary technological dead ends.” Bohn says to be careful of some of the newer technologies that may not stand the test of time. “It’s better to be on the leading than the bleeding edge,” says Bohn. For example, message queuing has become an important part of infrastructure for distributing data. Many such systems have been brought to market in the past decade, with a lot of hype and promises. More than a few companies made investments in those technologies, only to find that they didn’t perform as advertised.
“Companies that made those investments then found they had to extricate themselves—at considerable cost,” Bohn notes. Etsy is currently using Kafka as an event and data pipeline, and may soon use it for HPE Vertica data ingestion. “Kafka has been gaining a lot of traction, and we think it will also be around for a while. We like its model, and so far it has proven to be robust. Vertica has developed a good Kafka connector, and it may well become a major path for data to get into Vertica.”
This post is a collaboration between O’Reilly and HPE Vertica. See our statement of editorial independence.