Learning | Data Tools

Ideas and resources related to data tools.

Runnable code

Hadoop: What you need to know

Learn the basics of how Hadoop works, why it's such an important technology, and how you should be using it, all without getting mired in the details.

Video

Easy, reproducible reports with R

Garrett Grolemund demonstrates how to use R Markdown to combine code and text into a single .Rmd file to generate polished reports automatically in a variety of formats.

Video

Best practices for streaming applications

Mark Grover and Ted Malaska offer an overview of projects for streaming applications, including Kafka, Flume, and Spark Streaming, and discuss the available architectural patterns, such as Lambda and Kappa.

Video

Running Spark on Alluxio with S3

Calvin Jia presents an in-depth overview of Alluxio and its role in the big data ecosystem. In this segment, he reviews examples that show how Alluxio complements Spark and S3 to enable fast data access.

Runnable code

Making Sense of Stream Processing

Stream processing is finally coming of age. This report shows you how stream processing can make data storage and processing systems more flexible and less complex.

Video

What is a resilient distributed dataset?

Alex Robbins guides you through an in-depth look at the Python API for Apache Spark. In this segment, he explores resilient distributed datasets (RDDs), the central abstraction in Spark and essential knowledge for anyone working with it.
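
To give a feel for the RDD idea before watching, here is a minimal sketch in plain Python. It is a toy stand-in, not the Spark API: the `ToyRDD` class and its internals are hypothetical and written only for illustration. It mimics two properties of RDDs: data is split into partitions, and transformations such as `map` and `filter` are recorded lazily and only run when an action such as `collect` is called.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, NOT the PySpark class).

    Holds data in partitions and records transformations lazily.
    """

    def __init__(self, partitions, ops=()):
        self._partitions = [list(p) for p in partitions]
        self._ops = tuple(ops)  # recorded transformations, in order

    def map(self, fn):
        # Transformation: returns a new ToyRDD, computes nothing yet.
        return ToyRDD(self._partitions, self._ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: returns a new ToyRDD, computes nothing yet.
        return ToyRDD(self._partitions, self._ops + (("filter", pred),))

    def _compute(self, partition):
        # Replay the recorded transformations over one partition.
        items = iter(partition)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def collect(self):
        # Action: triggers the computation, partition by partition.
        return [x for p in self._partitions for x in self._compute(p)]


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])  # two partitions
even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(even_squares.collect())  # [4, 16, 36]
```

In real Spark the partitions live on different machines and the recorded chain of transformations is what lets a lost partition be recomputed; the toy keeps everything in one process.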

Video

Dive into scikit-learn

With scikit-learn, you can deploy machine learning models in just a few lines of code. Andreas Mueller summarizes the classification, regression, and clustering algorithms in this powerful machine learning library.

Video

Architecting Hadoop Applications

In this O'Reilly training video, the "Hadoop Application Architectures" authors present an end-to-end case study of a clickstream analytics engine to provide a concrete example of how to architect and implement a complete solution with Hadoop. In this segment, they provide an overview of the complete architecture. Presenters: Mark Grover, Gwen Shapira, Jonathan Seidman, Ted Malaska

Video

A/B Testing: a checklist

Lisa Qian lays out the process for a successful A/B test, from defining a goal and hypothesis, to knowing when to end the test. The most rigorous form of data-gathering when done right, A/B tests can't be run by guesswork or gut instinct.
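
As an illustration of why guesswork falls short, the sketch below evaluates a finished test with a two-proportion z-test using only the Python standard library. This is one common way to compare conversion rates, not necessarily the method the checklist prescribes; the function name and the visitor and conversion counts are made up for the example.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: conversions and visitors for variant A (control)
    conv_b / n_b: conversions and visitors for variant B (treatment)
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 200 of 2000 visitors convert on A, 260 of 2000 on B.
z, p = two_proportion_z_test(200, 2000, 260, 2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The sample size and significance threshold should be fixed before the test starts; checking the p-value repeatedly and stopping at the first small value inflates the false-positive rate.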

Runnable code

Data visualization with Seaborn

Seaborn provides an API on top of matplotlib that uses sane plot and color defaults and offers simple functions for common statistical plot types.

Video

Analyzing Data with Python

In this webcast led by Sarah Guido, you'll get a bird's-eye view of some of the best tools for data analysis and how you can apply them to your workflow.