Enterprise Data Workflows with Cascading

Date: This event took place live on September 17, 2013

Presented by: Paco Nathan

Duration: Approximately 60 minutes.

Cost: Free

Questions? Please send email to

Description:

In this hands-on webcast, Paco Nathan, author of Enterprise Data Workflows with Cascading, will discuss what defines a "workflow", in contrast to notions of "dataflow", and the impact that distinction has on the tools required.

Overall, we're talking about middleware for Big Data — how to integrate Hadoop along with other data frameworks to build applications at scale.

Paco will compare and contrast some workflow platforms such as:

Actian's ParAccel (based on open source Knime and Eclipse)
Continuum Analytics (Anaconda platform for enterprise-grade Python, which is gaining traction based on IPython Notebook, Pandas, and Scikit-Learn)

We will also discuss some popular tools that do not fit in this category (they are not workflow tools) but are commonly mistaken for such, Apache Pig and Apache Hive in particular. Understanding where those do or don't fit is helpful. Within the context of Cascading, there are also the Scala community (Scalding) and the Clojure community (Cascalog), which account for most of the new production deployments. Paco will compare and contrast both of these as well.

About Paco Nathan

Paco Nathan is a Data Scientist at Concurrent, Inc., and heads up the developer outreach program there. He has a dual background from Stanford in math/stats and distributed computing, with 25+ years of experience in the tech industry. An expert in Hadoop, R, predictive analytics, machine learning, and natural language processing, Paco has built and led several expert Data Science teams, with data infrastructure based on large-scale cloud deployments. He has presented twice on the AWS Start-Up Tour, and often gives talks about Hadoop, Data Science, and Cloud Computing.
