CHAPTER 3
Sources of Data

As mentioned in Chapter 1, the wide variety of data coming from diverse sources is one of the key properties of the big data phenomenon. It is therefore useful to understand how data is generated in various environments and scenarios before looking at what should be done with this data and how to design the best possible architecture for doing it.

The evolution of IT architectures, described in Chapter 2, means that data is no longer processed by a few large monolithic systems, but rather by a group of services. In parallel with the processing layer, the underlying data storage has also changed and become more distributed. This in turn required a significant paradigm shift, as the traditional approach to transactions (ACID) could no longer be supported. On top of this, cloud computing is becoming a major approach, offering reduced costs and on-demand scalability while at the same time raising concerns about privacy, data ownership, and so on.

In the meantime, the Internet continues its exponential growth. Every day, both structured and unstructured data is published and made available for processing. To achieve a competitive advantage, companies have to relate their corporate resources to external services such as financial markets, weather forecasts, and social media. While several of these sites provide some sort of API to access the data in a more orderly fashion, countless sources require advanced web mining and Natural Language Processing ...
