O'Reilly logo

Pentaho for Big Data Analytics by Feris Thia, Manoj R Patil

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

The Hadoop architecture

Hadoop is a large scale, large-batch data processing system, which uses MapReduce for computation and HDFS for data storage. HDFS is the most reliable distributed filesystem with a configurable replication mechanism designed to be deployed on low-cost commodity hardware.

HDFS breaks files into chunks of a minimum of 64 MB blocks, where each block is replicated three times. The replication factor can be configured, and it has to be perfectly balanced depending upon the data. The following diagram depicts a typical two node Hadoop cluster set up on two bare metal machines, although you can use a virtual machine as well.

The Hadoop architecture

One of ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required