Data analysis: Just one component of the data science workflow

Specialized tools run the risk of being replaced by others that have more coverage.

By Ben Lorica September 7, 2013 • 4 minute read

Judging from articles in the popular press, the term data scientist has increasingly come to refer to someone who specializes in data analysis (statistics, machine learning, etc.). This is unfortunate, since the term originally described someone who could cut across disciplines. Far from being confined to data analysis, a typical data science workflow¹ means jumping back and forth between a series of interdependent tasks. Data scientists tend to use a variety of tools, often across different programming languages. Workflows that involve many different tools require a lot of context switching, which hurts productivity and impedes reproducibility.

Tools and Training

People who build tools appreciate the value of having their solutions span the data science workflow. If a tool only addresses a limited section of the workflow, it runs the risk of being replaced by others that have more coverage. Platfora is as proud of its data store (the fractal cache) and data wrangling² tools as of its interactive visualization capabilities. The Berkeley Data Analytics Stack (BDAS) and the Hadoop community are expanding to include analytic engines that increase their coverage: over the next few months, BDAS components for machine learning (MLbase) and graph analytics (GraphX) are slated for their initial release. In an earlier post, I highlighted a number of tools that simplify the application of advanced analytics and the interpretation of results. Analytic tools are maturing to the point that I expect many routine data analysis tasks will soon be performed by business analysts and other non-experts.

The people who train future data scientists also seem aware of the need to teach more than just data analysis skills. A quick glance at the syllabi and curricula of a few³ data science courses and programs reveals that – at least in some training programs – students get to learn other components of the data science workflow. One course that caught my eye: CS 109 at Harvard seems like a nice introduction to the many facets of practical data science – plus it uses IPython notebooks, Pandas, and scikit-learn!

The Analytic Lifecycle and Data Engineers

As I noted in a recent post, model building is only one aspect of the analytic lifecycle. Organizations are starting to pay more attention to the equally important tasks of model deployment, monitoring, and maintenance. One telling example comes from a recent paper on sponsored search advertising at Google: a simple model was chosen (logistic regression) and most of the effort (and paper) was devoted to devising ways to efficiently train, deploy, and maintain it in production.

In order to deploy their models into production, data scientists learn to work closely with the people responsible for building scalable data infrastructure: data engineers. If you talk with enough startups in Silicon Valley, you quickly realize that data engineers are in even higher⁴ demand than data scientists. Fortunately, some forward-thinking consulting services are stepping in to help companies address both their data science and data engineering needs.

(1) For a humorous view, see Data Science skills as a subway map!

(2) Here’s a funny take on the rule-of-thumb that data wrangling accounts for 80% of time spent on data projects: “In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data.”

(3) Here is a short list: UW Intro to Data Science and Certificate in Data Science, CS 109 at Harvard, Berkeley’s Master of Information and Data Science program, Columbia’s Certification of Professional Achievement in Data Sciences, MS in Data Science at NYU, and the Certificate of Advanced Study In Data Science at Syracuse.

(4) I’m not sure why the popular press hasn’t picked up on this distinction. Maybe it’s a testament to the buzz surrounding data science.

Post topics: Data
