Innovation, Security, and Compliance in a World of Big Data

Solutions for making big data safe and keeping it private.

By Mike Barlow

May 4, 2015

The Menin Gate at Ypres (source: BiblioArchives)

Can Data Security and Rapid Business Innovation Coexist?

Finding a Balance

During the final decade of the 20th century and the first decade of the 21st century, many companies learned the hard way that launching an enterprise resource planning (ERP) system was more than a matter of acquiring new technology. Successful ERP deployments, it turned out, also required hiring new people and developing new processes.

After a series of multimillion-dollar misadventures at major corporations, it became apparent that ERP was not something you simply bought, took home, and plugged in. “People, process, and technology” became the official mantra of ERP implementations. CIOs became “change management leaders” and stepped gingerly into the unfamiliar zone of business process transformation. They also began hiring people with business backgrounds to serve alongside the hardcore techies in their IT organizations.

As quickly as the lessons of ERP were learned, they were forgotten. In an eerie rewinding of history, companies are now learning painfully similar lessons about big data. The peculiar feeling of déjà vu is especially palpable at the junction where big data meets data security.

There is a significant difference, however, between what happened in the past and what’s happening now. When a company’s ERP transformation went south, the CIO was fired and another CIO was hired to finish the job. When the contents of a data warehouse are compromised, the impact is considerably more widespread, and the potential for something genuinely nasty occurring is much higher. If ERP was like dynamite, big data is like plutonium.

“Security is tricky. Any small weakness can become a major problem once the hackers find a way to leverage it,” said Edouard Servan-Schreiber, director for solution architecture at MongoDB, a popular NoSQL database management system. “You can come up with a mathematically elegant security infrastructure, but the main challenge is adherence to a very strict security process. That’s the issue. More and more, a single mistake is a fatal mistake.”

The velocity of change is part of the problem. It’s fair to say that relatively few people anticipated the short amount of time it would take for big data to go mainstream. As a result, the technology part of big data is far ahead of the people and process parts.

“We’ve all seen hype roll through our industry,” said Jon M. Deutsch, president of The Data Warehousing Institute (TDWI) for New York, Connecticut, and New Jersey. “Usually it takes years for the hype to become reality. Big data is an exception to that rule.”

Many TDWI members “have the technology ingredients of big data in place,” said Deutsch, despite the lack of standard methods and protocols for implementing big data projects.

In tightly regulated industries such as financial services and pharmaceuticals, the lack of clear standards has slowed the adoption of big data systems. Concerns about security and privacy, said Deutsch, “limit the scope of big data projects, inject uncertainty, and restrict deployment.”

A general perception that big data frameworks such as Hadoop are less secure than “old-fashioned” relational database technology also contributes to the sense of hesitancy. In a very real sense, Hadoop and NoSQL are playing catch-up with traditional SQL database products.

“We’re bringing the security of the Apache Hadoop stack up to the levels of the traditional database,” said Charles Zedlewski, vice president of products at Cloudera, a pioneer in Hadoop data management systems. “We’re adding key enterprise security elements such as RBAC and encryption in a consistent way across the platform.” For example, the Cloudera Enterprise Data Hub “includes Apache Sentry, an open source project we cofounded, to provide unified role-based authorization for the platform. We’ve also developed Cloudera Navigator to provide audit and lineage capabilities.”

Unscrambling the Eggs

Clearly, many businesses see a competitive advantage in ramping up their big data capabilities. At the same time, they are hesitant about diving into the deep end of the big data pool without assurances they won’t see their names in headlines about breached security. It’s no secret that when Hadoop and other non-traditional data management frameworks were invented, data security was not high on the list of operational priorities. Perhaps, as Jon Deutsch suggested earlier, no one seriously expected big data to become such a big deal in such a short span of time.

Suddenly, we’re in the same predicament as Aladdin. The genie is out of the bottle. He’s powerful and dangerous. We want our three wishes, but we have to wish carefully or something very bad could happen…

“Big data analytics software is about crunching data and returning the answers to queries very quickly,” said Terence Craig, founder and CTO of PatternBuilders, a streaming analytics vendor. He is also coauthor of Privacy and Big Data (O’Reilly, 2011). “As long as we want those primary capabilities, it will be difficult to put restrictions on the technology.”

Is it possible to achieve a fair balance between the need for data security and the need for rapid business innovation? Can the desire for privacy coexist with the desire for an ever-widening array of choices for consumers? Is there a way to protect information while distributing insights gleaned from that information?

“Data security and innovation are not at loggerheads,” said Tony Baer, principal analyst at Ovum, a global technology research and advisory firm. “In fact, I would suggest they are in alignment.” Baer, a veteran observer of the tech industry, said the real challenges are knowing where the data came from and keeping track of who’s using it.

“Previously, you were dealing with data that was from your internal systems. You probably knew the lineage of that data—who collected it, how it was collected, under what conditions, with what restrictions, and what you can do with it,” he said. “The difference with big data is that in many cases you’re harvesting data from external sources over which you have no control. Your awareness of the provenance of that data is going to be highly variable and limited.”

Some of the big data you vacuum up might have been “collected under conditions that do not necessarily reflect your own internal policies,” said Baer. Then you will be faced with a difficult choice, something akin to the prisoner’s dilemma: using the data might violate your company’s governance policies or break the rules of a regulatory body that oversees your industry. On the other hand, not using the data might create a business advantage for your competitors. It’s a slippery slope, replete with ambiguity and uncertainty.

At minimum, you need processes for protecting the data and ensuring its integrity. Even the simplest database can be protected with a three-step process of authentication, authorization, and access control, as outlined in the “Oracle Fusion Middleware Administrator’s Guide for Oracle HTTP”:

Authentication verifies that a user is who they say they are.
Authorization determines if a user is permitted to use a particular kind of data resource.
Access control determines when, where, and how users can access the data resource.
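The three steps can be sketched in a few lines of Python. This is a minimal illustration, not a description of any vendor's product: the user store, role names, resource names, and the business-hours rule are all hypothetical, and a real deployment would rely on a directory service and the database's own security features rather than in-memory dictionaries.

```python
import hashlib
import hmac
import os
from datetime import time

# Hypothetical in-memory stores, for illustration only.
SALT = os.urandom(16)

def hash_password(password: str, salt: bytes) -> bytes:
    # Derive a salted hash so plaintext passwords are never stored.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

USERS = {"alice": {"pw": hash_password("s3cret", SALT), "role": "analyst"}}
ROLE_PERMISSIONS = {"analyst": {"sales_data"}, "admin": {"sales_data", "hr_data"}}

def authenticate(user: str, password: str) -> bool:
    """Step 1: verify that the user is who they say they are."""
    record = USERS.get(user)
    if record is None:
        return False
    return hmac.compare_digest(record["pw"], hash_password(password, SALT))

def authorize(user: str, resource: str) -> bool:
    """Step 2: check whether the user's role may use this kind of resource."""
    record = USERS.get(user)
    if record is None:
        return False
    return resource in ROLE_PERMISSIONS.get(record["role"], set())

def access_allowed(at: time, network: str, mode: str) -> bool:
    """Step 3: constrain when, where, and how the resource is accessed.
    Assumed policy: business hours, internal network, read-only."""
    return time(8, 0) <= at <= time(18, 0) and network == "internal" and mode == "read"

def can_access(user, password, resource, at, network, mode) -> bool:
    # All three checks must pass; evaluation stops at the first failure.
    return (authenticate(user, password)
            and authorize(user, resource)
            and access_allowed(at, network, mode))
```

A call such as `can_access("alice", "s3cret", "sales_data", time(9, 0), "internal", "read")` passes all three checks, while a wrong password, an unpermitted resource, or an after-hours request each fails at a different step.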

Ensuring the integrity of your data requires keeping track of who’s using it, where it’s being used, and what it’s being used for. Software for automating the various steps of data security is readily available. The key to maintaining data security, however, isn’t software—it’s a relentless focus on discipline and accountability.
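As a rough illustration of what "keeping track" means in practice, the sketch below records who accessed a resource, from where, and for what purpose. The field names and the in-memory list are assumptions made for the example; production systems would write to an append-only, access-controlled audit store.

```python
from datetime import datetime, timezone

# Hypothetical audit trail; a real one would be durable and append-only.
AUDIT_LOG = []

def record_access(user: str, resource: str, purpose: str, location: str) -> dict:
    """Append one audit entry: who, what, why, where, and when."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "resource": resource,
        "purpose": purpose,
        "location": location,
    }
    AUDIT_LOG.append(entry)
    return entry

def accesses_by_user(user: str) -> list:
    """Return every recorded access for one user, oldest first."""
    return [entry for entry in AUDIT_LOG if entry["user"] == user]
```

The log only supports accountability if someone reviews it, which is the discipline the paragraph above is describing.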

“It boils down to having the right policies and processes in place to manage and control access to the data. For instance, organizations need to understand exactly what big data is contained within the enterprise and where, and assess any legal or regulatory need to safeguard the data. This could range from interactions with customers over social networks, to transaction data from online purchases,” said Joanna Belbey, a compliance expert at Actiance, a firm that helps companies use various communications channels (e.g., email, unified communications, instant messages, collaboration tools, social media) while meeting regulatory, legal, and corporate compliance requirements.

Depending on the situation, approaches to data security can vary. “The tradeoffs you make when you’re going after a market or you’re doing something new might be different from the tradeoffs you make for security when you’re a major bank, for example. You have to negotiate those tradeoffs through an exercise in good, solid risk management,” said Gary McGraw, CTO at software security firm Cigital and author of Software Security (Addison-Wesley, 2006).

“I don’t think that a startup has to follow the same risk-management regimen as a bank. A startup can approach the problem of security as a risk-management exercise, and most startups that I advise do exactly that,” said McGraw. “They make tradeoffs between speed, agility, and engineering, which is okay because they are startups.”

Avoiding the “NoSQL, No Security” Cop-out

The knock against non-traditional data management technologies such as Hadoop and NoSQL is their relative lack of built-in data security features. As a result, companies that opt for newer database technologies are forced to deal with data security at the application level, which places an unreasonable burden on the shoulders of developers who are paid to deliver innovation, not security. Traditional database vendors have used the immaturity of non-traditional data management frameworks and systems to spread FUD—fear, uncertainty, and doubt—about products based on Hadoop and NoSQL.

Not surprisingly, vendors of products and services based on the newer database technologies disagree strenuously with arguments that Hadoop and NoSQL pose unmanageable security risks for competitive business organizations.

“Business is going to change and the regulations on business are going to change. NoSQL databases have gained traction because they offer flexibility and fast development of applications without sacrificing reliability and security,” said Alicia C. Saia, director, solutions marketing at MarkLogic, an enterprise-level NoSQL database based on proprietary code.

Saia flat-out rejected the notion that security and rapid innovation are mutually exclusive conditions in a modern data management environment. “When you’re running a business, you want to innovate as quickly as possible. It can take 18 months to model a relational database, which is an unacceptably long timeframe in today’s fast-paced economy,” she said.

Providers of traditional database technology “want to frame this as a binary choice between innovation and security,” said Saia. “One of the great advantages of an enterprise NoSQL database is that it’s flexible, which means you can respond to the inevitable external shocks without spending millions of dollars breaking apart and reassembling a traditional database to accommodate new kinds of data.”

MarkLogic leverages the combination of security and innovation as an element of its marketing strategy, noting that it offers “higher security certifications than any NoSQL database—providing certified, fine-grained, government-grade security at the database level.”

“You don’t want to be forced to choose between security and innovation,” said Saia. “You want a foundational database that has a layer of stringent security built into it so you’re not in situations where every new application needs its own security. Ideally, you should be able to develop as many applications as you need without stressing over data security.”

Saia and her team came up with a seven-point “checklist” of reasonable expectations for database security in modern data management environments:

You should not have to choose between data security and innovation.
Your database should never be a weak point for data security, data integrity, or data governance.
Your database should support your application security needs, not the other way around.
A flexible, schema-agnostic database will make it faster and cheaper to respond to regulatory changes and inquiries.
Your enterprise data will expand and change over time, so pick a database that makes integration easier—and that lets you scale up and down as needed.
Your database should manage data seamlessly across storage tiers, in real time.
NoSQL does not have to mean “No ACID” (Atomicity, Consistency, Isolation, and Durability), “No Security,” “No HA/DR” (High Availability/Disaster Recovery), or “No Auditing.”

Anonymize This!

For some companies, security depends on anonymity—the companies aren’t anonymous, but they make sure the data they use has been scrubbed of PII (personally identifiable information).

“How do we bake security into our approach? Our fundamental conception is that it’s not about the data, it’s about the signals,” said Laks Srinivasan, co-chief operating officer at Opera Solutions, an analytics-as-a-service provider that works with major financial institutions, airlines, and communications companies. “We look for patterns in the data. We extract those patterns, which we call signals, and use them to drive the data science and BI. That mitigates the risk in a big way because people aren’t carrying raw customer data around in their laptops.”

Most users don’t need or even want to deal with raw data, he said. “We extract the juice from terabytes of data. We detach the PII from the behavior patterns and we make the signals available to data scientists. That’s what they’re really interested in.”

Focusing on signals instead of data “doesn’t solve all the issues, but it reduces the proliferation of data and lowers the likelihood of incidents in which personal data is accidentally released,” he said.

Decoupling data from PII provides a measure of safety for all parties involved: consumers who generate data, companies that collect data, and firms that analyze data to harvest usable insights. DataSong, for example, is a San Francisco-based startup that onboards data from its customers (multi- and omni-channel retailers) and measures the incremental effectiveness of their marketing activities. “Our customers give us mountains of data, such as ad impressions, click streams, emails, e-commerce transactions, and in-store orders. It’s a lot of data, and keeping it secure is very important,” said John Wallace, the company’s founder and CEO.

DataSong deals with the security issue by only analyzing data that has been stripped of PII. “We bake data security into the engagement rather than into the technology,” said Wallace.
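Neither company describes its exact method here, so the following is only a generic sketch of one common way to detach PII from behavioral data: drop the direct identifiers and replace the customer key with a keyed hash. The field names and the secret key are hypothetical. Strictly speaking this is pseudonymization rather than full anonymization, because anyone holding the key can re-link the tokens.

```python
import hashlib
import hmac

# Assumed names of fields that directly identify a person.
PII_FIELDS = {"name", "email", "phone"}
# Placeholder key; in practice it would come from a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(record: dict, id_field: str = "customer_id") -> dict:
    """Return a copy of the record with PII removed and the ID tokenized."""
    cleaned = {k: v for k, v in record.items() if k not in PII_FIELDS}
    # A keyed hash gives the same token for the same customer, so behavior
    # patterns can still be linked without exposing the original identifier.
    token = hmac.new(SECRET_KEY, str(record[id_field]).encode(), hashlib.sha256)
    cleaned[id_field] = token.hexdigest()
    return cleaned
```

Because the token is stable, analysts can still group purchases or clicks by customer, which is the "signal," without ever handling names or email addresses.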

Data science providers like Opera Solutions and DataSong operate on the principle that anonymized data can be more valuable than personally identifiable data. If that’s true, then why all the fuss over data security? Part of the discomfort arises from the “creepiness factor” we experience when a marketer crosses the invisible line between knowing enough and knowing too much about our interests.

Here’s a typical example: you search for a topic such as “back pain,” and the next time you launch your web browser, whatever page you open is strewn with ads for painkillers. Here’s another scenario: you’re looking for a present, let’s say jewelry, for a special someone. You walk away from your computer and that special someone sits down to check her email—and she sees page after page of ads for jewelry. The possibilities for embarrassment are virtually unlimited.

Both of those examples are fairly benign. In Who Owns the Future? (Simon & Schuster, 2013), computer scientist and composer Jaron Lanier wrote that “a surveillance economy is neither sustainable nor democratic” and that we gradually become less free as we “share” our personal information with a virtual cartel of “private spying” services that feeds on the data we generate every time we log onto a computer or use a mobile device. “This triumph of consumer passivity over empowerment is heartbreaking,” he wrote.

“We as individuals who want to live in a fully digital world need to come to grips with the fact that we are no longer going to be able to have privacy in any sense of the way we had it before,” said Terence Craig. “Even if the corporations behave, even if all the government actors behave, there will still be external actors or extra-legal actors who will penetrate systems and use information to generate revenue or power in some way. That’s the nature of the beast.”

“We’re creating a society that requires everyone to have a digital persona,” said Craig. “In the Internet age, privacy has been thrown away for efficiency, and not even deliberately, in most cases. The accelerating adoption of the Internet of Things and streaming analytics solutions like PatternBuilders will make it possible to breach privacy in unexpected and unintentional ways. But both IoT and streaming analytics are so relatively new that it is hard to predict either the costs or the benefits of having real-time access to IoT devices beyond your cell phone: glucose monitors, brain wave monitors, etc. This is where things will get really interesting.”

As a society, Craig said, we should begin looking seriously at regulations that would limit or curtail data retention. “Almost all of the worst-case scenarios involve data retention,” he said. “If you need real-time data to catch a terrorist, then great, go ahead and save the data you need to do that.”

If you’re not actively involved in rooting out terrorists or averting threats to public safety, however, you should be required at regular intervals to expunge any data you collect. “I could care less if Google knows that I like Crest toothpaste and my wife likes Tom’s of Maine natural toothpaste. The big issue is the collation of data, keeping it for an extended period of time, and building up individual profiles of a large percentage of the population,” said Craig.
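A retention rule of the kind Craig describes could be enforced by a scheduled job along these lines. This is a sketch under assumptions: the 90-day window and the `collected_at` field are invented for the example, and no specific regulation prescribes them.

```python
from datetime import datetime, timedelta, timezone

# Assumed retention window; the right value is a policy decision.
RETENTION = timedelta(days=90)

def purge_expired(records: list, now: datetime = None,
                  retention: timedelta = RETENTION) -> list:
    """Keep only records collected within the retention window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - retention
    return [r for r in records if r["collected_at"] >= cutoff]
```

Run at regular intervals, a job like this keeps recent data available while preventing the long-lived individual profiles that Craig identifies as the real risk.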

Specifically, Craig is concerned about the capability of governments to collect and analyze data. When governments fall, either through democratic or non-democratic processes, their records become the property of new governments. “Hopefully, the people who get the records will be responsible people,” he said. “But history has shown that good leadership doesn’t last forever. Sooner or later, a bad leader turns up. Do we really want to hand over an NSA-level data infrastructure to the next Pol Pot?”

Replacing Guidance With Rules

Comprehensive regulations around data management would help, according to Dale Meyerrose, a retired US Air Force major general and former CIO for the US Intelligence Community. “If the government can create comprehensive rules and standards for work safety such as OSHA (Occupational Safety and Health Administration), it can certainly create rules and standards for data security,” said Meyerrose.

Too many of the guidelines around data security are just that: guidelines, not laws or regulations. “How seriously will anyone take a voluntary set of standards? The role of government is creating policies and laws. If you give companies a choice, they’re not going to choose spending more money than their competitors on something they aren’t legally required to do,” he said.

Like most of the sources interviewed for this paper, Meyerrose sees no inherent conflict between security and innovation. “In the past, you put your ideas on a piece of paper and locked it in a safe behind your desk. Today, it’s in a database. The only thing that’s changed is the medium,” he said. “So it’s not really a matter of cyber-security or network security or computer security. It’s just security, and security is something you can control.”

From Meyerrose’s perspective, cyber-security is “an ecosystem of multiple supply chains—a human resources supply chain, an operational processes supply chain, and a technology supply chain.” Each of those supply chains must be carefully scrutinized and vetted for trust.

“I find it amazing that we can get the technical part right and get the human part wrong. In the case of Edward Snowden, there was no technical malfunction. But the process wasn’t designed to handle a complicit insider,” said Meyerrose.

Jeffrey Carr is the author of Inside Cyber Warfare: Mapping the Cyber Underworld (O’Reilly, 2011) and is an adjunct professor at George Washington University. He is the founder of the cyber security consultancy Taia Global, Inc., as well as the Suits and Spooks security conference.

In a 2014 paper, “The Classification of Valuable Data in an Assumption of Breach Paradigm,” Carr wrote that since adversaries eventually figure out ways of breaching even the best security systems, responsible organizations “must identify which data is worth protecting and which is not.”

Rather than fretting over the possibility of something bad happening, organizations should prepare for the worst. “Executives need to realize that if they’re in an industry that involves high tech, finance, energy, or anything related to weapons or the military, they’re in a state of perpetual breach,” said Carr. “That’s the first thing you need to come to grips with. You will never be secure. Once you’ve reached that realization, you should identify your most valuable digital assets (your ‘crown jewels’) and do your best to protect them.”

Carr recommends that companies take stock of their digital assets and objectively rank their value to hackers. “Remember, it doesn’t matter what you think is valuable. What matters is what a potential adversary thinks is valuable,” said Carr. For example, if your company is developing cutting-edge software for a new kind of industrial robot, it would be reasonable to expect attacks from organizations—and even countries—that are working on similar software.

“Lots of executives are still looking for a silver bullet that will protect their networks, but that’s not realistic,” said Carr, who predicted that more companies would begin taking security challenges seriously “when the SEC (Securities and Exchange Commission) makes it a rule instead of a guidance.”

Like Meyerrose, he said that process is a critical part of the solution. “You can make it harder for an adversary to gain access to your crown jewels. Part of making it harder is training your employees to spot spear phishing attacks, meaning train them to look at their email and say, ‘There’s something about this email that doesn’t look right, I’m not going to click on the link, open the attachment. I’ll pick up the phone and call the person that sent it to me to confirm that it’s legitimate.’ Training is a positive thing that makes it harder for potential bad guys to harm you. It won’t keep a dedicated adversary off your network. They’ll just find a way in eventually, if they have enough time and money to do that.”

Training is a key piece of “cyber hygiene,” Carr said. “It’s like putting chlorine in a swimming pool. It will keep you from catching some low-grade infection, but it won’t protect you from sharks.”

Not to Pass the Buck, But…

Although it won’t eradicate the problem, clarifying the regulations around data security would definitely help. “There is no one central set of regulations covering data security and privacy within the US. It’s pretty much a patchwork quilt at this point,” Joanna Belbey wrote in an email. “And while privacy concerns are being addressed through regulation in some sectors—for example, the Federal Communications Commission (FCC) works with telecommunications companies, the Health Insurance Portability and Accountability Act (HIPAA) addresses healthcare data, Public Utility Commissions (PUC) in several states restrict the use of smart grid data, and the Federal Trade Commission (FTC) is developing guidelines for web activity—all this activity has been broad in system coverage and open to interpretation in most cases.”

That sounds like a call for legislative action at the national level. A unified national data security policy would undoubtedly remove some of the uncertainty and create a set of common standards.

At the same time, it seems likely that many of the security issues associated with Hadoop and NoSQL will be resolved within a reasonably short period of time by good old-fashioned market forces. Heartbleed, the OpenSSL bug, cast a spotlight on the kind of problems that can arise when the software industry relies on the volunteer open source community to perform major miracles on minuscule or nonexistent budgets. Vendors that want to compete in the big data space will figure out how to bring their products up to snuff, and they’ll pass the development costs along to their customers. Eventually, consumers will foot the bill, but the costs will be spread so thinly that few of us will notice.

“The answer is that you’ve got to pay for security,” said Gary McGraw, adding that it is unfair and unrealistic to expect the open source community to do the job for free. “The demand for talent is too high and everybody with experience in this field is already incredibly busy.”
