How to use the wordcount example as a starting point (and you thought you’d escape the wordcount example).
Introducing the solar correlation map, and how to easily create your own.
Learn how to ship, parse, store, and analyze logs.
Word embedding in natural language processing.
Early methods to integrate machine learning using Naive Bayes and custom sinks.
In two online sessions on October 18-19, 2016, Jesse Anderson will show you how to recognize opportunities, avoid problems, and get the most value from your data.
October 4-5, 2016, join Thomas Nield for a hands-on course for beginners on core database and SQL fundamentals.
Learn about the basics of how Hadoop works, why it's such an important technology, and how you should be using it without getting mired in the details.
Garrett Grolemund demonstrates how to use R Markdown to combine code and text into a single .Rmd file to generate polished reports automatically in a variety of formats.
Mark Grover and Ted Malaska offer an overview of projects for streaming applications, including Kafka, Flume, and Spark Streaming, and discuss the architectural schemas available, such as Lambda and Kappa.
Calvin Jia presents an in-depth overview of Alluxio and its role in the big data ecosystem. In this segment, he reviews examples that show how Alluxio complements Spark and S3 to enable fast data access.
Transparent matching of Spark portability with GPU performance.
Alexander Ulanov offers an overview of tools and frameworks that have been proposed for performing deep learning on Spark.
Evan Sparks describes the principles behind KeystoneML and introduces its programming model by way of example pipelines in NLP and image classification.
Stream processing is finally coming of age. This report shows you how stream processing can make data storage and processing systems more flexible and less complex.
Crunching CERN’s colossal data with scalable analytics
Join Thomas Nield for a hands-on introduction to core database and SQL fundamentals.
Bill Loconzolo reveals the lessons learned from building the Intuit Analytics Cloud.
Michael Armbrust and Tathagata Das explain updates to Spark version 2.0, demonstrating how stream processing is now more accessible with Spark SQL and DataFrame APIs.
Natalino Busa presents the Coral system, a solution for streaming anomaly detection.
Alex Robbins guides you through an in-depth look at the Python API for Apache Spark. In this segment, he explores RDDs, the central abstraction in Spark and essential knowledge for anyone working in the system.
Jonathan Whitmore demonstrates how to install pivot tables and showcases the features of this extension by examining a dataset of restaurant scores.
Sean Owen and Yann Delacourt cover Spark's architecture, deployment strategies, and use cases, as well as Spark's impact on data science, analytics, and machine learning.
With scikit-learn, you can deploy machine learning models in just a few lines of code. Andreas Mueller summarizes the classification, regression, and clustering algorithms in this powerful machine learning library.
Pete Warden walks through popular open source tools from the academic world and shows you step-by-step how to process images with them.
Dive into creating your own databases and learn how to design them efficiently.
Guaranteeing data availability in distributed systems.
Jesse Anderson walks viewers through the path data can take from publishers through a Kafka cluster and on to consumers.
In this O'Reilly training video, the "Hadoop Application Architectures" authors present an end-to-end case study of a clickstream analytics engine to provide a concrete example of how to architect and implement a complete solution with Hadoop. In this segment, they provide an overview of the complete architecture. Presenters: Mark Grover, Gwen Shapira, Jonathan Seidman, Ted Malaska
Lisa Qian lays out the process for a successful A/B test, from defining a goal and hypothesis, to knowing when to end the test. The most rigorous form of data-gathering when done right, A/B tests can't be run by guesswork or gut instinct.
O'Reilly Online Training: Geo-Located Data—Extracting Patterns from Mobile Data Using Scikit-Learn and Cassandra
November 1-2, 2016, join Natalino Busa for an introduction to extracting patterns from geo-located data and building geo-located microservices.
Patrick Wendell from Databricks discusses Spark's new 1.3 release, which brings extensions to all of Spark's major components along with a new cross-cutting DataFrame API.
Seaborn provides an API on top of matplotlib that uses sane plot and color defaults and offers simple functions for common statistical plot types.
This tutorial introduces Support Vector Machines (SVMs), a powerful class of supervised learning algorithms used to draw a boundary between clusters of data.
How to create and display a wide variety of 3D objects and patterns in matplotlib.
Confused by D3? Interested in coding data visualizations on the web, but don't know where to start? This online tutorial will have you transforming data into visual images in no time at all.
The scatterplot is a common type of visualization that represents two sets of corresponding values on two different axes.
In this webcast led by Sarah Guido, you'll get a bird's-eye overview of some of the best tools for data analysis and how you can apply them to your workflow.
Can data and connectivity improve criminal justice and mitigate conflicts?