Chapter 5

Big Data Sources

One of the biggest challenges for most organizations is finding data sources to use as part of their analytics processes. As the name implies, Big Data is large, but size is not the only concern. There are several other considerations when deciding how to locate and parse Big Data sets.

The first step is to identify usable data. While that may be obvious, it is anything but simple. Locating the appropriate data to push through an analytics platform can be complex and frustrating. Each source must be evaluated to determine whether its data set is appropriate for use, and that evaluation often amounts to detective work or investigative reporting.

Considerations should include the following:

  • Structure of the data (structured, unstructured, semistructured, table based, proprietary)
  • Source of the data (internal, external, private, public)
  • Value of the data (generic, unique, specialized)
  • Quality of the data (verified, static, streaming)
  • Storage of the data (remotely accessed, shared, dedicated platforms, portable)
  • Relationship of the data (superset, subset, correlated)

All of those elements and many others can affect the selection process and can have a dramatic effect on how the raw data are prepared (“scrubbed”) before the analytics process takes place.
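One way to make those considerations concrete is to record them for each candidate source before any data are moved. The following Python sketch is illustrative only: the `DataSourceProfile` class, its fields, and the scrubbing rules in `scrubbing_steps` are hypothetical names and assumptions chosen for this example, not a prescribed method, and they cover only a few of the considerations listed above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Structure(Enum):
    STRUCTURED = "structured"
    UNSTRUCTURED = "unstructured"
    SEMISTRUCTURED = "semistructured"


class Origin(Enum):
    INTERNAL = "internal"
    EXTERNAL = "external"


@dataclass
class DataSourceProfile:
    """Hypothetical record describing one candidate data source."""
    name: str
    structure: Structure
    origin: Origin
    is_public: bool
    verified: bool
    streaming: bool

    def scrubbing_steps(self) -> List[str]:
        """Return preparation steps suggested by the profile.

        The rules below are assumptions for illustration; a real
        project would define its own.
        """
        steps = []
        if self.structure is not Structure.STRUCTURED:
            steps.append("parse into a tabular form")
        if not self.verified:
            steps.append("validate and deduplicate records")
        if self.origin is Origin.EXTERNAL:
            steps.append("map external fields to internal schema")
        if self.streaming:
            steps.append("window the stream into batches")
        return steps


# Example: an unverified, semistructured, external streaming source.
clickstream = DataSourceProfile(
    name="clickstream",
    structure=Structure.SEMISTRUCTURED,
    origin=Origin.EXTERNAL,
    is_public=True,
    verified=False,
    streaming=True,
)
for step in clickstream.scrubbing_steps():
    print(step)
```

Profiling sources this way keeps the selection criteria explicit and shows how each characteristic can add to the scrubbing work required before analysis.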

In the IT realm, once a data source is located, the next step is to import the data into an appropriate platform. That process may be as simple as copying data onto a Hadoop cluster or as complicated as scrubbing, indexing, ...
