5. Data Science, Statistics, and Streams

This chapter starts with a brief overview of some areas of data science to set the context for describing the streaming analytics capabilities of Streams. The Streams toolkits covered, in varying levels of detail, include Geospatial, Timeseries, Mining, SPSS, R-project, and SparkMLlib. The chapter ends with a short reminder of how Streams can be extended to include any additional algorithms we may require.

Data Science is No Cure-all

It should be said at the outset that data science is not the Brand New Thing that is sure to solve all your problems. You still have to put the work in.

The term “data scientist” was coined only in 2008, which is still very recent. Its definition is still evolving, but at a minimum it describes a mix of statistical knowledge, programming, and domain expertise. (Again, you still have to do the work of preparing the data and then exploring it to decide how you can extract information “scientifically” from it.)

You can take advantage of a lot of tools to help you navigate the data and achieve your goals. Streams includes a lot of pre-built algorithms and access to other tools to increase your analysis capabilities.

Some Data Science Terms

Before we go any further, it is worth defining a few terms that help in understanding probability, statistics, machine learning, and so on.

Population and sample

Population refers to the entire dataset we are working with, where ...
