Chapter 5. Microsoft’s Plan for Big Data

By Edd Dumbill

Microsoft has placed Apache Hadoop at the core of its big data strategy. It’s a move that might seem surprising to the casual observer, being a somewhat enthusiastic adoption of a significant open source product.

The reason for this move is that Hadoop, by its sheer popularity, has become the de facto standard for distributed data crunching. By embracing Hadoop, Microsoft allows its customers to access the rapidly growing Hadoop ecosystem and take advantage of an expanding talent pool of Hadoop-savvy developers.

Microsoft’s goals go beyond integrating Hadoop into Windows. It intends to contribute the adaptations it makes back to the Apache Hadoop project, so that anybody can run a purely open source Hadoop on Windows.

Microsoft’s Hadoop Distribution

The Microsoft distribution of Hadoop is currently in the “Customer Technology Preview” phase, meaning that groups of customers are evaluating it in the field. Release is expected toward the middle of 2012, though the timing will be influenced by the results of the technology preview program.

Microsoft’s Hadoop distribution is usable either on-premises with Windows Server, or in Microsoft’s cloud platform, Windows Azure. The core of the product consists of the MapReduce, HDFS, Pig and Hive components of Hadoop. These are certain to ship in the 1.0 release.

As Microsoft’s aim is for 100% Hadoop compatibility, it is likely that additional components of the Hadoop ecosystem such as ZooKeeper, HBase, ...
