Designing a workflow

A data scientist has many options in selecting and implementing a classification or clustering algorithm.

Firstly, a mathematical or statistical model is to be selected to extract knowledge from the raw input data or the output of a data upstream transformation. The selection of the model is constrained by the following parameters:

  • Business requirements such as accuracy of results
  • Availability of training data and algorithms
  • Access to a domain or subject-matter expert

Secondly, the engineer has to select a computational and deployment framework suitable for the amount of data to be processed. The computational context is to be defined by the following parameters:

  • Available resources such as machines, CPU, memory, or I/O bandwidth ...

Get Scala: Guide for Data Science Professionals now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.