The big data ecosystem can be confusing. The popularity of “big data” as an industry buzzword has created a broad category. As Hadoop steamrolls through the industry, solutions from the business intelligence and data warehousing fields are also attracting the big data label. To confuse matters, Hadoop-based solutions such as Hive are at the same time evolving toward being competitive data warehousing solutions.
Understanding the nature of your big data problem is a helpful first step in evaluating potential solutions. Let’s remind ourselves of the definition of big data:
“Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.”
Big data problems vary in how heavily they bear on the axes of volume, velocity and variety. Predominantly structured yet large data, for example, may be most suited to an analytical database approach.
This survey makes the assumption that a data warehousing solution alone is not the answer to your problems, and concentrates on analyzing the commercial Hadoop ecosystem. We’ll focus on the solutions that incorporate storage and data processing, excluding those products which only sit above those layers, such as the visualization or analytical workbench software.
Getting started with Hadoop doesn’t require a large investment as the software is open source, and is also available instantly through the Amazon Web Services cloud. But for production environments, support, professional services and training are often required.
Apache Hadoop is unquestionably the center of the latest iteration of big data solutions. At its heart, Hadoop is a system for distributing computation among commodity servers. It is often used with the Hadoop Hive project, which layers data warehouse technology on top of Hadoop, enabling ad-hoc analytical queries.
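The MapReduce model at Hadoop’s core can be sketched in miniature. The pair of functions below mirrors the shape of a Hadoop Streaming word-count job; the names and the local `run_job` driver that simulates the shuffle/sort phase are illustrative stand-ins, not part of Hadoop’s API:

```python
# Minimal sketch of Hadoop's MapReduce model. In a real Hadoop Streaming
# job, mapper and reducer would run as separate processes on cluster
# nodes, reading stdin and writing stdout; here we simulate locally.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit a (word, 1) pair for every word in a line of input."""
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    """Sum the counts emitted for a single word."""
    return word, sum(counts)

def run_job(lines):
    """Stand-in for the shuffle/sort phase: sort mapper output by key,
    group it, and hand each group to the reducer."""
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(reducer(word, (count for _, count in group))
                for word, group in groupby(pairs, key=itemgetter(0)))

print(run_job(["big data big hadoop", "hadoop big"]))
# → {'big': 3, 'data': 1, 'hadoop': 2}
```

Hive builds on this same execution model, compiling SQL-like queries down to MapReduce jobs so analysts need not write mappers and reducers by hand.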
Big data platforms divide along the lines of their approach to Hadoop. The big data offerings from familiar enterprise vendors incorporate a Hadoop distribution, while other platforms offer Hadoop connectors to their existing analytical database systems. This latter category tends to comprise massively parallel processing (MPP) databases that made their name in big data before Hadoop matured: Vertica and Aster Data. Hadoop’s strength in these cases is in processing unstructured data in tandem with the analytical capabilities of the existing database on structured or semi-structured data.
Practical big data implementations don’t in general fall neatly into either structured or unstructured data categories. You will invariably find Hadoop working as part of a system with a relational or MPP database.
Much as with Linux before it, Hadoop solutions are rarely built on the raw Apache Hadoop code alone. Instead, it’s packaged into distributions. At a minimum, these distributions have been through a testing process, and often include additional components such as management and monitoring tools. The best-known distributions now come from Cloudera, Hortonworks and MapR. Not every distribution will be commercial, however: the BigTop project aims to create a Hadoop distribution under the Apache umbrella.
The leading Hadoop enterprise software vendors have aligned their Hadoop products with the rest of their database and analytical offerings. These vendors don’t require you to source Hadoop from another party, and offer it as a core part of their big data solutions. Their offerings integrate Hadoop into a broader enterprise setting, augmented by analytical and workflow tools.
Acquired by EMC, and rapidly taken to the heart of the company’s strategy, Greenplum is a relative newcomer to the enterprise, compared to other companies in this section. They have turned that to their advantage in creating an analytic platform, positioned as taking analytics “beyond BI” with agile data science teams.
Greenplum’s Unified Analytics Platform (UAP) comprises three elements: the Greenplum MPP database, for structured data; a Hadoop distribution, Greenplum HD; and Chorus, a productivity and groupware layer for data science teams.
The HD Hadoop layer builds on MapR’s Hadoop compatible distribution, which replaces the file system with a faster implementation and provides other features for robustness. Interoperability between HD and Greenplum Database means that a single query can access both database and Hadoop data.
Chorus is a unique feature, indicative of Greenplum’s commitment to the idea of data science and the importance of agile teams in effectively exploiting big data. It supports organizational roles from analysts, data scientists and DBAs through to executive business stakeholders.
As befits EMC’s role in the data center market, Greenplum’s UAP is available in a modular appliance configuration.
IBM’s InfoSphere BigInsights is their Hadoop distribution, and part of a suite of products offered under the “InfoSphere” information management brand. Everything big data at IBM is helpfully labeled Big, appropriately enough for a company affectionately known as “Big Blue.”
BigInsights augments Hadoop with a variety of features, including management and administration tools. It also offers textual analysis tools that aid with entity resolution — identifying people, addresses, phone numbers and so on.
IBM’s Jaql query language provides a point of integration between Hadoop and other IBM products, such as relational databases or Netezza data warehouses.
InfoSphere BigInsights is interoperable with IBM’s other database and warehouse products, including DB2, Netezza and its InfoSphere warehouse and analytics lines. To aid analytical exploration, BigInsights ships with BigSheets, a spreadsheet interface onto big data.
IBM addresses streaming big data separately through its InfoSphere Streams product. BigInsights is not currently offered in an appliance form, but can be used in the cloud via RightScale, Amazon, Rackspace, and IBM Smart Enterprise Cloud.
Microsoft have adopted Hadoop as the center of their big data offering, and are pursuing an integrated approach aimed at making big data available through their analytical tool suite, including the familiar Excel and PowerPivot.
Microsoft’s Big Data Solution brings Hadoop to the Windows Server platform, and in elastic form to their cloud platform Windows Azure. Microsoft have packaged their own distribution of Hadoop, integrated with Windows Systems Center and Active Directory. They intend to contribute back changes to Apache Hadoop to ensure that an open source version of Hadoop will run on Windows.
On the server side, Microsoft offer integrations to their SQL Server database and their data warehouse product. Using their warehouse solutions isn’t mandated, however. The Hadoop Hive data warehouse is part of the Big Data Solution, including connectors from Hive to ODBC and Excel.
Microsoft’s focus on the developer is evident in their creation of a JavaScript API for Hadoop. Using JavaScript, developers can create Hadoop jobs for MapReduce, Pig or Hive, even from a browser-based environment. Visual Studio and .NET integration with Hadoop is also provided.
Deployment is possible either on the server or in the cloud, or as a hybrid combination. Jobs written against the Apache Hadoop distribution should migrate with minimal changes to Microsoft’s environment.
Announcing their entry into the big data market at the end of 2011, Oracle is taking an appliance-based approach. Their Big Data Appliance integrates Hadoop, R for analytics, a new Oracle NoSQL database, and connectors to Oracle’s database and Exadata data warehousing product line.
Oracle’s approach caters to the high-end enterprise market, and particularly leans to the rapid-deployment, high-performance end of the spectrum. Oracle is the only vendor to include the popular R analytical language integrated with Hadoop, and to ship a NoSQL database of its own design as opposed to Hadoop HBase.
Rather than developing their own Hadoop distribution, Oracle have partnered with Cloudera for Hadoop support, which brings them a mature and established Hadoop solution. Database connectors again promote the integration of structured Oracle data with the unstructured data stored in Hadoop HDFS.
Oracle’s NoSQL Database is a scalable key-value database, built on the Berkeley DB technology. In that, Oracle owes double gratitude to Cloudera CEO Mike Olson, as he was previously the CEO of Sleepycat, the creators of Berkeley DB. Oracle are positioning their NoSQL database as a means of acquiring big data prior to analysis.
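The key-value access pattern such a store exposes can be illustrated in miniature. The toy class below is an in-memory stand-in that mimics how a distributed key-value store partitions keys across nodes by hashing; it is not Oracle’s actual API, and all names here are invented for illustration:

```python
# Toy in-memory key-value store. A real distributed store (Oracle NoSQL
# Database, or Berkeley DB underneath it) adds persistence, replication
# and network transport; only the get/put interface shape is shown here.
import hashlib

class KeyValueStore:
    """Partitions keys across N shards, mimicking how a distributed
    store spreads data over nodes."""
    def __init__(self, num_shards=4):
        self.shards = [{} for _ in range(num_shards)]

    def _shard(self, key):
        # Hash the key to pick a shard deterministically.
        digest = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return self.shards[digest % len(self.shards)]

    def put(self, key, value):
        self._shard(key)[key] = value

    def get(self, key, default=None):
        return self._shard(key).get(key, default)

store = KeyValueStore()
store.put("user:42", {"name": "Ada"})
print(store.get("user:42"))  # → {'name': 'Ada'}
```

Because each key maps deterministically to one shard, reads and writes for different keys can be served by different nodes in parallel, which is what makes the model scale for big data acquisition.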
The Oracle R Enterprise product offers direct integration into the Oracle database, as well as Hadoop, enabling R scripts to run on data without having to round-trip it out of the data stores.
MPP (massively parallel processing) databases are specialized for processing structured big data, as distinct from the unstructured data that is Hadoop’s specialty. Along with Greenplum, Aster Data and Vertica were early pioneers of big data products before the mainstream emergence of Hadoop.
These MPP solutions are databases specialized for analytical workloads and data integration, and provide connectors to Hadoop and data warehouses. A recent spate of acquisitions has seen these products become the analytical play of data warehouse and storage vendors: Teradata acquired Aster Data, EMC acquired Greenplum, and HP acquired Vertica.
| | Aster Data | ParAccel | Vertica |
|---|---|---|---|
| Database | | | |
| Deployment options | | | |
| Hadoop | | | |
| Links | | | |
Directly employing Hadoop is another route to creating a big data solution, especially where your infrastructure doesn’t fall neatly into the product line of major vendors. Practically every database now features Hadoop connectivity, and there are multiple Hadoop distributions to choose from.
Reflecting the developer-driven ethos of the big data world, Hadoop distributions are frequently offered in a community edition. Such editions lack enterprise management features, but contain all the functionality needed for evaluation and development.
The first iterations of Hadoop distributions, from Cloudera and IBM, focused on usability and administration. We are now seeing the addition of performance-oriented improvements to Hadoop, such as those from MapR and Platform Computing. While maintaining API compatibility, these vendors replace slow or fragile parts of the Apache distribution with better performing or more robust components.
The longest-established provider of Hadoop distributions, Cloudera offers an enterprise Hadoop solution, alongside services, training and support options. Along with Yahoo, Cloudera have made deep open source contributions to Hadoop, and through hosting industry conferences have done much to establish Hadoop in its current position.
Though a recent entrant to the market, Hortonworks have a long history with Hadoop. Spun off from Yahoo, where Hadoop originated, Hortonworks aims to stick close to and promote the core Apache Hadoop technology. Hortonworks also have a partnership with Microsoft to assist and accelerate their Hadoop integration.
Hortonworks Data Platform is currently in a limited preview phase, with a public preview expected in early 2012. The company also provides support and training.
| | Cloudera | EMC Greenplum | Hortonworks | IBM |
|---|---|---|---|---|
| Product Name | CDH | Greenplum HD | Hortonworks Data Platform | InfoSphere BigInsights |
| Free Edition | Integrated, tested distribution of Apache Hadoop | Community Edition: 100% open source certified and supported version of the Apache Hadoop stack | | An integrated Hadoop distribution |
| Enterprise Edition | Adds management software layer over CDH | Enterprise Edition: integrates MapR’s M5 Hadoop-compatible distribution, replacing HDFS with MapR’s C++-based file system; includes MapR management tools | | Hadoop distribution, plus BigSheets spreadsheet interface, scheduler, text analytics, indexer, JDBC connector, security support |
| Hadoop Components | Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Sqoop, Mahout, Whirr | Hive, Pig, Zookeeper, HBase | Hive, Pig, Zookeeper, HBase, Ambari | Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Lucene |
| Security | Cloudera Manager: Kerberos, role-based administration and audit trails | LDAP authentication, role-based authorization, reverse proxy | | |
| Admin Interface | Cloudera Manager: centralized management and alerting | MapR Heatmap cluster administrative tools | Apache Ambari: monitoring, administration and lifecycle management for Hadoop clusters | Hadoop HDFS and MapReduce administration, cluster and server management, viewing HDFS file content |
| Job Management | Cloudera Manager: job analytics, monitoring and log search | JobTracker HA and Distributed NameNode HA prevent lost jobs, restarts and failover incidents | Apache Ambari: monitoring, administration and lifecycle management for Hadoop clusters | Job creation, submission, cancellation, status and logging |
| Database connectors | | Greenplum Database | | DB2, Netezza, InfoSphere Warehouse |
| HDFS Access | Fuse-DFS: mount HDFS as a traditional filesystem | NFS: access HDFS as a conventional network file system | WebHDFS: REST API to HDFS | |
| Installation | Cloudera Manager: wizard-based deployment | | | Quick installation: GUI-driven installation tool |
| Additional APIs | | | | Jaql: a functional, declarative query language designed to process large data sets |
| | MapR | Microsoft | Platform Computing |
|---|---|---|---|
| Product Name | MapR M3 / M5 | Big Data Solution | Platform MapReduce |
| Free Edition | M3 Edition: free community edition incorporating MapR’s performance increases | | Developer Edition: evaluation edition, excluding the resource management features of the regular edition |
| Enterprise Edition | M5 Edition: augments M3 Edition with high availability and data protection features | Windows Hadoop distribution, integrated with Microsoft’s database and analytical products | Enhanced runtime for Hadoop MapReduce, API-compatible with Apache Hadoop |
| Hadoop Components | Hive, Pig, Flume, HBase, Sqoop, Mahout, Oozie | Hive, Pig | |
| Security | | Active Directory integration | |
| Admin Interface | MapR Heatmap cluster administrative tools | System Center integration | Platform MapReduce Workload Manager |
| Job Management | JobTracker HA and Distributed NameNode HA prevent lost jobs, restarts and failover incidents | | |
| Database connectors | | SQL Server, SQL Server Parallel Data Warehouse | |
| Interop features | | Hive ODBC Driver, Excel Hive Add-in | |
| HDFS Access | NFS: access HDFS as a conventional network file system | | |
| Additional APIs | REST API | JavaScript API: JavaScript Map/Reduce jobs, Pig-Latin, and Hive queries | R, C/C++, C#, Java, Python |
| Volume Management | Mirroring, snapshots | | |
Pure cloud solutions: Both Amazon Web Services and Google offer cloud-based big data solutions. These will be reviewed separately.
HPCC: Though dominant, Hadoop is not the only big data solution. LexisNexis’ HPCC offers an alternative approach.
Hadapt: not yet featured in this survey. Taking a different approach from both Hadoop-centered and MPP solutions, Hadapt integrates unstructured and structured data into one product: wrapping rather than exposing Hadoop. It is currently in “early access” stage.
NoSQL: Solutions built on databases such as Cassandra, MongoDB and Couchbase are not in the scope of this survey, though these databases do offer Hadoop integration.
Errors and omissions: given the fast-evolving nature of the market and variable quality of public information, any feedback about errors and omissions from this survey is most welcome. Please send it to edd+bigdata@oreilly.com.