What is a resilient distributed dataset?

Alex Robbins guides you through an in-depth look at the Python API for Apache Spark. In this segment, he explores resilient distributed datasets (RDDs), the central abstraction in Spark and essential knowledge for anyone working with it.

Dive into scikit-learn

With scikit-learn, you can deploy machine learning models in just a few lines of code. Andreas Mueller summarizes the classification, regression, and clustering algorithms in this powerful machine learning library.

Building data science teams: Preparing your organization

How should you prepare when assembling and integrating a data science team into your organization? In this video training segment, Paco Nathan offers tips to consider in the early stages, including designating the right executive sponsor and encouraging basic hands-on data science training for management.

Architecting Hadoop Applications

In this O'Reilly training video, the "Hadoop Application Architectures" authors present an end-to-end case study of a clickstream analytics engine to provide a concrete example of how to architect and implement a complete solution with Hadoop. In this segment, they provide an overview of the complete architecture. Presenters: Mark Grover, Gwen Shapira, Jonathan Seidman, Ted Malaska

A/B Testing: a checklist

Lisa Qian lays out the process for a successful A/B test, from defining a goal and hypothesis to knowing when to end the test. Done right, A/B tests are the most rigorous form of data gathering, and they can't be run by guesswork or gut instinct.
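As a rough illustration of why the final step can't be left to gut instinct, here is a minimal sketch of one common way to compare two variants: a two-sided two-proportion z-test. The blurb does not say which statistical test the checklist recommends, so the choice of test, the function name `two_proportion_z_test`, and the conversion counts are all assumptions made for this example.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, p_value) for a two-sided two-proportion z-test.

    Hypothetical helper for illustration; not taken from the checklist.
    conv_* are conversion counts, n_* are visitors per variant.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: 100 of 1000 convert on A, 150 of 1000 convert on B.
z, p = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference is unlikely to be noise, but the sample size and stopping rule should still be fixed before the test starts.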

Pivot Tables in Python

Pivot tables are an incredibly handy tool for exploring tabular data. This excerpt from the Python Data Science Handbook (Early Release) shows how to use the elegant pivot table features in Pandas to slice and dice your data.
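For a sense of what this looks like in practice, here is a minimal sketch using the pandas `DataFrame.pivot_table` method. It is not taken from the excerpt; the `sales` data and its column names are invented for illustration.

```python
import pandas as pd

# Hypothetical sales records, invented for this example.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "units": [10, 4, 7, 3, 8],
})

# One row per region, one column per product; each cell sums the units sold.
table = sales.pivot_table(
    index="region", columns="product", values="units", aggfunc="sum"
)
print(table)
```

Swapping `aggfunc` for another aggregation such as `"mean"` changes how each cell is summarized without changing the layout of the table.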

Visualizing Data with Seaborn

Seaborn provides an API on top of matplotlib that offers sane plot and color defaults and simple functions for common statistical plot types.

Data Science at the Command Line

Whether you're entirely new to the command line or already dreaming in shell scripts, by the end of this webcast you will have a solid understanding of how to leverage the power of the command line.

Analyzing Data with Python

In this webcast led by Sarah Guido, you'll get a bird's-eye view of some of the best tools for data analysis and how you can apply them to your workflow.

Thinking with Data

This webcast examines a framework for incorporating ideas from other fields (such as design, argument studies, and consulting) into data science.