Big Data Simplified

50 | Big Data Simplied

• Stop the Hadoop cluster as follows.

$ ./stop-yarn.sh

$ ./stop-dfs.sh

After the Hadoop cluster is shut down, all the Hadoop Java daemons are stopped. In the screen above, the ‘jps’ command returned only ‘jps’ itself, as it is also a Java process.
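The check described above can also be scripted. The following is a minimal sketch, not part of the original text: the function name `remaining_hadoop_daemons` is hypothetical, and the list of daemon names assumes a standard HDFS and YARN setup. It relies only on the fact that ‘jps’ prints one process per line as a process ID followed by a class name.

```shell
#!/bin/sh
# Hypothetical helper (not from the book): reads `jps` output on stdin and
# prints only the lines that belong to the usual HDFS/YARN daemons.
# The daemon list is an assumption; extend it if your cluster runs others.
remaining_hadoop_daemons() {
  awk '$2 ~ /^(NameNode|DataNode|SecondaryNameNode|ResourceManager|NodeManager)$/'
}

# Usage, once stop-yarn.sh and stop-dfs.sh have finished:
#   jps | remaining_hadoop_daemons
# No output means that none of the listed daemons is still running.
```

If the function prints nothing, only non-Hadoop Java processes (such as ‘jps’ itself) remain, which matches the result described above.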

3.4 STORING DATA WITH HDFS

3.4.1 The NameNode and DataNodes

As discussed in the preceding sections of this chapter, HDFS is spread out across multiple machines, each machine in the cluster being a rather simple one with commodity hardware. The value of HDFS lies in the fact that the cluster as a whole is highly fault tolerant. In addition, it can store huge volumes of data and execute data processing tasks at high speed and scale. Individual machines might have their disks corrupted or might go down, but this does not affect the availability of data in the cluster. An important point to note is that HDFS is suited for batch processing: it is typically used for very large jobs that run for a long time. HDFS is not a low-latency system and should not be used for quick retrieval of data. For example, real-time queries should not be executed against HDFS.

The data stored in HDFS tends to be very large, for example, in petabytes. Also, the majority of this data is semi-structured or unstructured. This means that the data does not have a well-defined structure like that of a relational database, with strict definitions of rows and columns, referential integrity and indexes. Any data that is stored in HDFS is actually split across multiple storage disks, where each disk is present on a different machine in the cluster. It is the responsibility of


Big Data Simplified by Sayan Goswami, Amit Kumar Das, Sourabh Mukherjee