Pentaho for Big Data Analytics by Feris Thia, Manoj R Patil

Appendix A. Big Data Sets

To see the real power of Big Data solutions built on the Hadoop Distributed File System (HDFS), you have to choose the right data set. Analyzing files of only a few kilobytes on this platform takes much longer than on a conventional database system. As the data grows into gigabytes and terabytes, and the cluster has enough nodes, the real benefit of HDFS-based solutions starts to show.
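The size trade-off described above can be illustrated with a toy cost model. This is only a sketch under stated assumptions: the function names, the 100 MB/s scan rate, and the 30-second job startup overhead are hypothetical values chosen for illustration, not measurements of Hadoop or of any database.

```python
def single_node_seconds(size_mb, scan_mb_per_s=100.0):
    """Time for one conventional machine to scan the data sequentially.

    scan_mb_per_s is an assumed, illustrative throughput.
    """
    return size_mb / scan_mb_per_s


def cluster_seconds(size_mb, nodes, scan_mb_per_s=100.0, job_overhead_s=30.0):
    """Assumed fixed job startup cost plus a scan split evenly across nodes.

    job_overhead_s is a hypothetical constant standing in for scheduling
    and task startup; real clusters vary widely.
    """
    return job_overhead_s + size_mb / (scan_mb_per_s * nodes)


def faster_on_cluster(size_mb, nodes):
    """True when the modeled cluster time beats the single machine."""
    return cluster_seconds(size_mb, nodes) < single_node_seconds(size_mb)


if __name__ == "__main__":
    sizes = [("10 KB", 0.01), ("1 GB", 1024.0), ("1 TB", 1024.0 * 1024.0)]
    for label, size_mb in sizes:
        verdict = "cluster" if faster_on_cluster(size_mb, nodes=10) else "single machine"
        print(f"{label}: {verdict} is faster in this model")
```

In this model the fixed overhead dominates for small inputs, and the parallel scan only pays off once the data is large relative to that overhead.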

Data preparation is an important step in a Big Data solution: you harmonize the various data sources by integrating them with an appropriate ETL methodology, so that the integrated data can be analyzed easily by the Big Data solution. If you are well aware of the data, ...
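As a rough illustration of what harmonizing data sources can mean in practice, the sketch below maps two differently shaped inputs onto one common schema. Everything in it is hypothetical: the two sample sources, their field names, the target schema, and the helper functions are invented for illustration, and it uses plain Python rather than the ETL tooling the book describes.

```python
import csv
import io
from datetime import datetime

# Two hypothetical sources that describe the same kind of record
# with different delimiters, field names, and date formats.
CRM_CSV = "customer_id,full_name,signup\n1,Ann Lee,2013-01-15\n"
SHOP_CSV = "id;name;created\n2;Bob Ray;15/02/2013\n"

# Source field -> field in the assumed common target schema.
CRM_MAPPING = {"customer_id": "customer_id", "full_name": "name", "signup": "signup_date"}
SHOP_MAPPING = {"id": "customer_id", "name": "name", "created": "signup_date"}


def read_rows(text, delimiter):
    """Parse delimited text into a list of dictionaries keyed by header."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def harmonize(row, mapping, date_format):
    """Rename fields, then normalize the id to int and the date to ISO 8601."""
    record = {target: row[source].strip() for source, target in mapping.items()}
    record["customer_id"] = int(record["customer_id"])
    parsed = datetime.strptime(record["signup_date"], date_format)
    record["signup_date"] = parsed.date().isoformat()
    return record


def integrate():
    """Combine both sources into one list that shares a single schema."""
    records = [harmonize(r, CRM_MAPPING, "%Y-%m-%d") for r in read_rows(CRM_CSV, ",")]
    records += [harmonize(r, SHOP_MAPPING, "%d/%m/%Y") for r in read_rows(SHOP_CSV, ";")]
    return records


if __name__ == "__main__":
    for record in integrate():
        print(record)
```

Keeping the field mappings as data rather than code makes it easier to add another source without changing the transformation logic.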
