Modern Linux Administration by Sam R. Alapati

Chapter 14. Big Data, Data Science, Apache Hadoop, and Apache Mesos

Relational databases, whether transactional databases or larger data warehouses and repositories, have long been the mainstay of organizations that deal with data, which is to say every organization. In the past decade or so, two major innovations have transformed the data landscape and the way you store and mine the increasingly large volumes of data flowing into organizations.

The two data-related innovations are distributed data storage and processing, and NoSQL databases. This chapter provides a basic introduction to both of these key areas of modern data storage. Hadoop is the leading framework for storing and mining large amounts of data in a distributed fashion, and NoSQL databases such as Cassandra are increasingly used to manage the huge quantities of data emanating from the internet.

Ranking web pages and searching social-networking sites are two examples where you deal with data that has a fairly regular structure, which makes it easy to mine through data parallelism. To process these types of large data sets efficiently, a new programming paradigm built on a different type of software stack has evolved. Whereas traditional data processing focused on ever-larger computers, leading to the birth of supercomputers, this newer approach employs clusters of inexpensive, interconnected computers.
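To make the idea of data parallelism concrete, here is a minimal single-machine sketch in Python. It is not Hadoop code: the link records, the function names, and the use of a thread pool to stand in for cluster nodes are all assumptions made for illustration. The point is only that the same function runs independently on each partition of regular data, and the partial results are merged afterward.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_links(partition):
    """Count inbound links per target page within one partition."""
    return Counter(target for _source, target in partition)


def split(records, parts):
    """Divide the records into `parts` independent partitions."""
    return [records[i::parts] for i in range(parts)]


def parallel_link_counts(records, parts=3):
    """Run count_links on every partition, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        # Each worker stands in for one node of a cluster (an assumption
        # of this sketch; a real cluster spreads partitions across machines).
        partials = list(pool.map(count_links, split(records, parts)))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical (source_page, target_page) link records.
LINKS = [
    ("a.html", "b.html"),
    ("a.html", "c.html"),
    ("b.html", "c.html"),
    ("c.html", "a.html"),
    ("d.html", "c.html"),
]

counts = parallel_link_counts(LINKS)
```

Because every record has the same shape and no partition depends on another, the work can be divided across as many workers as there are partitions. That independence is what lets a cluster of inexpensive machines take the place of one very large computer.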

The new software stack, as represented by Apache Hadoop, has a distributed ...
