Relational databases, in the form of both transactional databases and larger data warehouses and repositories, have been the mainstay of organizations that deal with data, which is to say every organization. In the past decade or so, two major innovations have transformed the data landscape and the way you store and mine the increasingly large amounts of data flowing into organizations.
The two data-related innovations are distributed data storage and processing, and NoSQL databases. This chapter provides a basic introduction to both of these key areas of modern data storage. Hadoop is the leading framework for storing and mining large amounts of data in a distributed fashion, and NoSQL databases such as Cassandra are increasingly used to manage the huge quantities of data emanating from the internet.
Ranking web pages and searching social-networking sites are two examples where you deal with fairly regular data flows, which makes it easy to mine the data through data parallelism: the data is split into pieces that are processed independently and at the same time. To process these types of large data sets efficiently, a new programming paradigm using a different type of software stack has evolved. Whereas traditional data processing has focused on larger and larger computers, leading to the birth of supercomputers, this newer approach employs clusters of interconnected, inexpensive computers.
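To make the idea of data parallelism concrete, here is a minimal single-machine sketch in Python. It is not Hadoop code and does not use any Hadoop API. It assumes a small, made-up list of (source page, linked page) pairs, and it uses worker threads from the standard library as stand-ins for the machines in a cluster. The function names (`split_into_chunks`, `count_links`, `parallel_link_counts`) and the sample data are hypothetical and chosen only for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, num_chunks):
    """Partition the records into independent chunks of similar size."""
    return [records[i::num_chunks] for i in range(num_chunks)]


def count_links(chunk):
    """Per-chunk work: count how often each page is linked to."""
    return Counter(target for _source, target in chunk)


def parallel_link_counts(records, num_workers=3):
    """Process every chunk independently, then merge the partial results."""
    chunks = split_into_chunks(records, num_workers)
    # Threads stand in for cluster nodes here (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(count_links, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Hypothetical (source page, linked page) pairs, invented for this example.
links = [
    ("a.html", "b.html"),
    ("a.html", "c.html"),
    ("b.html", "c.html"),
    ("c.html", "a.html"),
    ("d.html", "c.html"),
]

inbound = parallel_link_counts(links)
```

Because each record has the same shape and no chunk depends on any other, the per-chunk work can run anywhere and in any order. The only coordination needed is the final merge of the partial counts.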
The new software stack, as represented by Apache Hadoop, has a distributed ...