Video description
Get a practical introduction to Hadoop, the framework that made big data and large-scale analytics possible by combining distributed computing techniques with distributed storage. In this video tutorial, hosts Benjamin Bengfort and Jenny Kim discuss the core concepts behind distributed computing and big data, and then show you how to work with a Hadoop cluster and program analytical jobs. You’ll also learn how to use higher-level tools such as Hive and Spark.
Hadoop is a cluster computing technology that has many moving parts, including distributed systems administration, data engineering and warehousing methodologies, software engineering for distributed computing, and large-scale analytics. With this video, you’ll learn how to operationalize analytics over large datasets and rapidly deploy analytical jobs with a variety of toolsets.
Once you’ve completed this video, you’ll understand how different parts of Hadoop combine to form an entire data pipeline managed by teams of data engineers, data programmers, data researchers, and data business people.
- Understand the Hadoop architecture and set up a pseudo-distributed development environment
- Learn how to develop distributed computations with MapReduce and the Hadoop Distributed File System (HDFS)
- Work with Hadoop via the command-line interface
- Use the Hadoop Streaming utility to execute MapReduce jobs in Python (a minimal word-count sketch follows this list)
- Explore data warehousing, higher-order data flows, and other projects in the Hadoop ecosystem
- Learn how to use Hive to query and analyze relational data using Hadoop
- Use summarization, filtering, and aggregation to move big data toward last-mile computation
- Understand how analytical workflows, including iterative machine learning, feature analysis, and data modeling, work in a big data context
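To give a feel for the Hadoop Streaming approach taught in the course, here is a minimal word-count sketch written as separate mapper and reducer scripts. The file names, HDFS paths, and streaming jar location are illustrative assumptions; the exact jar path varies by Hadoop version and distribution.

    #!/usr/bin/env python
    # mapper.py -- reads lines from stdin and emits tab-separated (word, 1) pairs.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py -- Hadoop delivers the mapper output grouped and sorted by key,
    # so a running total per word is enough to produce the final counts.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

    # Submit to the cluster (the jar path below is an assumption; check your install):
    # hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    #     -input /data/corpus -output /data/wordcounts \
    #     -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py

Because Streaming pipes records through the scripts over standard input and output, the same mapper and reducer can be tested locally with cat corpus.txt | ./mapper.py | sort | ./reducer.py before submitting the job.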
Benjamin Bengfort is a data scientist and programmer in Washington, DC, who prefers technology to politics but sees the value of data in every domain. Alongside his work teaching, writing, and developing large-scale analytics with a focus on statistical machine learning, he is finishing his PhD at the University of Maryland, where he studies machine learning and artificial intelligence.
Jenny Kim, a software engineer in the San Francisco Bay Area, develops, teaches, and writes about big data analytics applications and specializes in large-scale, distributed computing infrastructures and machine-learning algorithms to support recommendation systems.
Table of contents
- Overview of the Video Course
- A Distributed Computing Environment
- Computing with Hadoop
- How a MapReduce Job Works
- Mappers and Reducers in Detail
- Working with Hadoop via the Command Line: Starting HDFS and YARN
- Working with Hadoop via the Command Line: Loading Data into HDFS
- Working with Hadoop via the Command Line: Running a MapReduce Job
- How to Use Our GitHub Goodies
- Working in Python with Hadoop Streaming
- Common MapReduce Tasks
- Spark on Hadoop 2
- Creating a Spark Application with Python
- The Hadoop Ecosystem
- Working with Data on Hive
- Towards Last Mile Computing
Product information
- Title: Hadoop Fundamentals for Data Scientists
- Author(s): Benjamin Bengfort, Jenny Kim
- Release date: January 2015
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781491913161