Collecting additional data

Many data processing systems don't have a single data ingest source; often, one primary source is enriched by other secondary sources. We will now look at how to incorporate the retrieval of such reference data into our data warehouse.

At a high level, the problem isn't very different from our retrieval of the raw tweet data: we wish to pull data from an external source, possibly do some processing on it, and store it where it can be used later. But this does highlight an aspect we need to consider: do we really want to retrieve this data every time we ingest new tweets? The answer is certainly no. The reference data changes very rarely, so we can fetch it much less frequently than new tweet data. ...
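
To make the idea of different fetch frequencies concrete, here is a minimal sketch in Python. It is not the mechanism this book goes on to use; it only illustrates the principle that tweets are pulled on every run while reference data is pulled only when the stored copy has grown stale. The names `needs_refresh`, `ingest`, and `REFERENCE_MAX_AGE_SECONDS`, the one-week interval, and the local file used as the store are all assumptions made for illustration.

```python
import time
from pathlib import Path

# Hypothetical refresh interval: we assume reference data may be a week old.
REFERENCE_MAX_AGE_SECONDS = 7 * 24 * 60 * 60


def needs_refresh(path, max_age_seconds=REFERENCE_MAX_AGE_SECONDS, now=None):
    """Return True if the stored reference file is missing or too old."""
    now = time.time() if now is None else now
    stored = Path(path)
    if not stored.exists():
        return True
    # Compare the file's last modification time against the allowed age.
    return (now - stored.stat().st_mtime) > max_age_seconds


def ingest(tweet_fetcher, reference_fetcher, reference_path):
    """Fetch tweets on every call; fetch reference data only when stale.

    Both fetchers are hypothetical callables supplied by the caller.
    Returns the fetched tweets and a flag saying whether the reference
    data was refreshed during this call.
    """
    refreshed = False
    if needs_refresh(reference_path):
        Path(reference_path).write_text(reference_fetcher())
        refreshed = True
    tweets = tweet_fetcher()
    return tweets, refreshed
```

Calling `ingest` twice in a row with the same `reference_path` invokes the reference fetcher only on the first call; the second call reuses the stored copy and fetches tweets alone.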
