50 | Big Data Simplied
• Stop the Hadoop cluster as follows.
After the Hadoop cluster is shut down, all the Hadoop Java daemons are stopped. In the
screen above, the ‘jps’ command returns only ‘jps’ itself, because it is also a Java process.
3.4 STORING DATA WITH HDFS
3.4.1 The NameNode and DataNodes
As discussed in the preceding sections of this chapter, HDFS is spread out across multiple
machines, each machine in the cluster being a rather simple one built from commodity
hardware. The value of HDFS lies in the fact that the cluster as a whole is highly fault tolerant.
In addition, it can store huge volumes of data and support data processing tasks at high speed
and scale. Individual machines might have their disks corrupted or might go down, but this
does not affect the availability of data in the cluster. An important point to note is that HDFS is
suited for batch processing, that is, for very large jobs that run for a long time. HDFS is not a
low-latency system and should not be used for quick retrieval of data. For example, real-time
queries should not be executed against HDFS.
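The fault-tolerance claim above can be illustrated with a toy model. The Python sketch below is not how HDFS is implemented. It only assumes what HDFS is known to do in outline: split data into fixed-size blocks and keep several copies of each block on different machines (three copies by default). The function names, the tiny block size, the machine names and the round-robin placement rule are all hypothetical and chosen for illustration.

```python
BLOCK_SIZE = 4    # bytes per block; toy value, real HDFS blocks are far larger
REPLICATION = 3   # copies kept of each block; 3 is the HDFS default


def place_blocks(data, machines):
    """Split data into blocks and assign each block to REPLICATION machines."""
    placement = {}
    for index, start in enumerate(range(0, len(data), BLOCK_SIZE)):
        # Hypothetical round-robin placement, much simpler than the real policy.
        holders = [machines[(index + r) % len(machines)] for r in range(REPLICATION)]
        placement[index] = {"bytes": data[start:start + BLOCK_SIZE], "holders": holders}
    return placement


def read_all(placement, live_machines):
    """Reassemble the data using any surviving copy of each block."""
    out = bytearray()
    for index in sorted(placement):
        block = placement[index]
        if not any(m in live_machines for m in block["holders"]):
            raise IOError("block %d has no live replica" % index)
        out += block["bytes"]
    return bytes(out)


machines = ["node1", "node2", "node3", "node4"]   # hypothetical machine names
placement = place_blocks(b"hello hdfs world", machines)

# Two of the four machines are down, yet every block can still be read.
print(read_all(placement, {"node1", "node3"}))
```

In this model, data is lost only when every machine holding a copy of some block is down at the same time, which is why the failure of an individual machine does not affect availability.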
The data stored in HDFS tends to be very large, for example, in the order of petabytes. Also,
the majority of this data is semi-structured or unstructured. This means that the data does not
have a well-defined structure like that of a relational database, with strict definitions of rows and
columns, referential integrity and indexes. Any data that is stored in HDFS is actually split across
multiple storage disks, where each disk is present on a different machine in the cluster. It is the
responsibility of