Chapter 4. Is Hadoop a Panacea for All Things Big Data? YPSM Says No

You can’t talk about big data without hearing about Hadoop. But it’s not necessarily for everyone. Businesses need to ensure that it fits their needs—or can be supplemented with other technologies—before committing to it.

Just in case you’ve missed the hype—and there’s been a lot of it—Hadoop is a free, Java-based programming framework that supports the processing of large datasets in a distributed computing environment. It is a top-level project of the Apache Software Foundation. For many people, Hadoop is synonymous with big data. But it’s not for every big-data project.

Consider its strengths and limits. Hadoop is an extremely cost-effective way to store and process large volumes of structured or unstructured data, and it is designed to optimize batch jobs. But fast it is not. Some industry observers have compared it to sending a letter through the United States Postal Service—more affectionately known as “snail mail”—and waiting for a reply, as opposed to texting someone in real time. When time isn’t a constraint, Hadoop can be a boon. But for more urgent tasks, it’s not a big-data panacea.

It’s definitely not a replacement for your legacy data warehouse, despite the tempting low cost. Most relational databases are optimized to ingest and process data as it arrives over time—say, transactions from an order-entry system—whereas Hadoop was specifically engineered to ingest and process huge volumes of data in batch mode.

Then there’s Hadoop’s complexity. You need specialized data scientists and programmers to make Hadoop an integral part of your business. Not only are these skills difficult to find in today’s market, they’re expensive, too—so much so that the cost of running Hadoop could add up to a lot more than you would think at first glance.

However, Hadoop is excellent as an extract, transform, and load (ETL) platform. Using it as a staging area and data-integration vehicle, and then feeding selected data into an analytical database like Vertica, makes perfect sense.
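The ETL pattern described here can be sketched in miniature: raw, semi-structured event logs land in the staging area, a batch job parses and cleans them, and the structured result is rendered for bulk loading into the analytical database. The sketch below is a plain-Python illustration, not YP’s pipeline; the log format and field names are hypothetical, and the final CSV stands in for what a database’s bulk COPY command would ingest.

```python
import csv
import io

# Hypothetical raw click-log lines as they might land in a staging area:
# timestamp \t user_id \t ad_id \t action
RAW_LOGS = [
    "2016-03-01T12:00:00\tu42\tad-7\tclick",
    "2016-03-01T12:00:05\tu42\tad-7\timpression",
    "corrupt line with no tabs",  # malformed records are common in raw feeds
    "2016-03-01T12:01:00\tu99\tad-3\tclick",
]

def transform(lines):
    """The 'transform' step: parse raw lines, drop malformed records,
    and emit structured rows."""
    for line in lines:
        parts = line.split("\t")
        if len(parts) != 4:
            continue  # cleanse: skip records that don't match the layout
        ts, user_id, ad_id, action = parts
        yield {"ts": ts, "user_id": user_id, "ad_id": ad_id, "action": action}

def to_copy_csv(rows):
    """The 'load' step, stubbed out: render rows as the CSV a
    COPY-style bulk load would ingest."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["ts", "user_id", "ad_id", "action"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

structured = list(transform(RAW_LOGS))
print(len(structured))  # 3 clean rows; the corrupt line was dropped
```

The point of the pattern is the division of labor: the batch system absorbs messy input at low cost, and only clean, structured rows reach the query engine.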

Businesses need to ignore the hype, look at their needs, and figure out for themselves if and where Hadoop fits into their big data initiatives. It’s an important and powerful technology that can make a difference between big data success and failure. But keep in mind that it’s still a work in progress, according to Bill Theisinger, vice president of engineering for platform data services at YPSM, formerly known as YellowPages.com.

YP focuses on helping small and medium-sized businesses (SMBs) understand their customers better so that they can optimize marketing and ad campaigns. To achieve this, YP has developed a massive enterprise data lake using Hadoop with near-real-time reporting capabilities that pulls oceans of data and information from across new and legacy sources. Using powerful reporting and precise metrics from its data warehouse, YP helps its nearly half a million paying SMB advertisers deliver the best ad campaigns and continue to optimize their marketing.1

YP’s solutions can reach nearly 95% of U.S. Internet users, based on the use of YP distribution channels and the YP Local Ad Network (according to comScore Media Metrix Audience Duplication Report, November 2015).

Hadoop is necessary to do this because of the sheer volume of data, according to Theisinger. “We need to be able to capture how consumers interact with our customers, and that includes wherever they interact and whatever they interact with—whether it’s a mobile device or desktop device,” he says.

YP Transforms Itself Through Big Data

YP saw the writing on the wall years ago. Its traditional print business was in decline, so it began moving local business information online and transforming itself into a digital marketing business. YP began investigating what the system requirements would be to provide value to advertisers. The company realized it needed to understand where consumers were looking online, what ads they were viewing when they searched, what they clicked on, and even which businesses they ended up calling or visiting—whether online or in person.

Not having the infrastructure in place to do all this, YP had to reinvent its IT environment. It needed to capture billions of clicks and impressions and searches every day. The environment also had to be scalable. “If we added a new partner, if we expanded the YP network, if we added hundreds, thousands, or tens of thousands of new advertisers and consumers, we needed the infrastructure to be able to help us do that,” said Theisinger.

When Theisinger joined YP, Hadoop was at the very height of its hype cycle. But although it had been proven to help businesses with large amounts of unstructured data, that wasn’t necessarily helpful to YP. The firm needed that data to be structured at some point in the data pipeline so that it could be reported on—to advertisers, to partners, and internally.

YP did what a lot of companies do: it combined Hadoop with an analytical database—in its case, HPE Vertica—so that it could move large volumes of unstructured data from Hadoop into a structured environment and run queries and reports rapidly.

Today, YP runs approximately 10,000 jobs daily, both to process data and to run analytics. “That data represents about five to six petabytes of data that we’ve been able to capture about consumers, their behaviors, and activities,” says Theisinger. That data is first ingested into Hadoop. It is then passed along to Vertica and structured in a way that analysts, product owners, and even other systems can retrieve it, pull and analyze the metrics, and report on them to advertisers.

YP also uses the Hadoop-Vertica combination to optimize internal operations. “We’ve been able to provide various teams internally—sales, marketing, and finance, for example—with insights into who’s clicking on various business listings, what types of users are viewing various businesses, who’s calling businesses, what their segmentation is, and what their demographics look like,” said Theisinger. “This gives us a lot of insight.” Most of that work is done with Vertica.

YP’s customers want to see data in as near to real time as possible. “Small businesses rely on contact from customers. When a potential customer calls a small business and that small business isn’t able to actually get to the call or respond to that customer—perhaps they’re busy with another customer—it’s important for them to know that that call happened and to reach back out to the consumer,” says Theisinger. “To be able to do that as quickly as possible is a hard-and-fast requirement.”

Which brings us back to the original question asked at the beginning of the chapter: Is Hadoop a panacea for big data? Theisinger says no.

“Hadoop is definitely central to our data processing environment. At one point, Hadoop was sufficient in terms of speed, but not today,” said Theisinger. “It’s becoming antiquated. And we haven’t seen tremendous advancements in the core technologies for analyzing data outside of the new tools that can extend its capabilities—for example, Spark—which are making alternative architectures like Spark leveraging Kafka real alternatives.”

Additionally, YP has many users who are familiar with SQL as the standard retrieval language but don’t have the background to write their own scripts or interact directly with technologies like Hive or Spark.

And it was absolutely necessary to pair Hadoop with the Vertica MPP analytics database, Theisinger says.

“Depending on the volume of the data, we can get results 10 times faster by pushing the data into Vertica,” Theisinger says. “We also saw significant improvements when looking at SQL on Hadoop—[Vertica’s] product that runs on HDFS. It was an order of magnitude faster than Hive.”

Another reason for the Vertica solution: YP had to analyze an extremely high volume of transactions over a short period of time. The data was not batch-oriented, and analyzing it in Hive would have taken 10, 20, or 30 minutes—or perhaps even hours.

“We can do it in a much shorter time in Vertica,” says Theisinger, who calls Vertica “magnitudes faster.”

Hadoop solves many problems, but for analytics it is primarily an ETL tool suited to batch modes, agrees Justin Coffey, a senior staff development lead at Criteo, a performance marketing technology company based in Paris, which also uses Hadoop and Vertica.

“Hadoop is a complicated technology,” he says. “It requires expertise. If you have that expertise, it makes your life a lot easier for dealing with the velocity, variety, and volume of data.”

However, Hadoop is not a panacea for big data. “Hadoop is structured for schema on read. To get the intelligence out of Hadoop, you need an MPP database like Vertica,” points out Coffey.
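Coffey’s “schema on read” point can be illustrated with a toy example (the record layout and names below are hypothetical, not drawn from either company’s systems): a schema-on-read store keeps raw records as-is and applies structure at query time, while an MPP analytical database validates and shapes records once, at load time, which is what makes later queries fast.

```python
import json

# Raw events stored as-is, Hadoop-style: structure is applied only when read.
raw_store = [
    '{"user": "u1", "ad": "ad-7", "clicks": "3"}',
    '{"user": "u2", "ad": "ad-7"}',  # a missing field is tolerated at write time
]

def read_with_schema(raw):
    """Schema on read: parse, coerce types, and fill defaults at query time."""
    for line in raw:
        rec = json.loads(line)
        yield {"user": rec["user"], "ad": rec["ad"],
               "clicks": int(rec.get("clicks", 0))}

# Schema on write, analytical-database-style: the expensive parsing and
# validation happen once, during loading, producing uniform typed rows.
structured_store = []

def load_row(rec):
    structured_store.append((rec["user"], rec["ad"], rec["clicks"]))

for r in read_with_schema(raw_store):
    load_row(r)

# Query-time work on the structured store is now trivial:
total_clicks = sum(row[2] for row in structured_store)
print(total_clicks)  # 3
```

This is the division of labor the chapter describes: Hadoop tolerates messy writes cheaply, and the MPP database pays the structuring cost once so that every subsequent query is fast.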

Larry Lancaster, whose take on kicking off a big-data project we explored in Chapter 2, takes this attitude even further. “I can’t think of any problems where you would prefer to use Hadoop versus Vertica aside from raw file storage,” he says. “With Vertica, you get answers much faster, you take up much less space on your hardware, and it’s incredibly cost effective. And for performance, you’re talking four to five orders of magnitude improvement.”

1 YP follows industry-standard privacy practices in its use of targeted advertising by taking responsible measures to secure any information collected through its sites about YP consumers, while still providing them with products, services and communications relevant to their interests. YP’s privacy policy and practices are TRUSTe certified, and YP consumers are able to opt out of mobile location data collection at the device level and manage the use of their information by opting out of retargeted advertising.
