Live Online Training

Intermediate Linux Command Line for Data Engineers and Analysts: Effective Data Pipelines

Navigate Linux systems and perform essential tasks on analytics clusters

Topic: System Administration
Douglas Eadline

As part of the Effective Data Pipelines series, this course continues Beginning Linux Command Line for Data Engineers and Analysts (Live Online Training).

Most analytics systems run on Linux with tools such as Apache Hadoop, Spark, and Kafka. As with any Linux-based platform, all essential operations can be performed from the command line. Indeed, in many situations there are operations that can only be performed from the command line interface.

What you'll learn and how you can apply it

  • Understand the advanced capabilities of the Linux command line
  • Learn how to access a Linux server using the command line from Windows and Mac computers
  • Learn ways to move data to and from Linux and into HDFS for use by Hadoop/Spark
  • Learn how to use basic Linux "analytics tools" such as grep, sed, and gawk
  • Understand how to run Hadoop and Spark applications from the command line
  • Learn how to create simple scripts to automate many processes

This training course is for you because...

  • You want to continue building skills beyond the first course, Beginning Linux Command Line for Data Engineers and Analysts
  • You will learn how to use more advanced capabilities of the Linux command line, including scripting and additional tools
  • A special emphasis is placed on using Hadoop/Spark clusters and the needs of the data engineer
  • All examples are provided in convenient notes files that serve as a reference guide after the course is complete
  • A Linux Hadoop virtual machine is provided so you can try all the commands and examples during and after the course (includes a single-server instance of Hadoop/Spark/Kafka and other tools)

Prerequisites

Setup Instructions:

  • To run the class examples, a Linux Hadoop Minimal Virtual Machine (VM) is available. The VM is a full Linux installation that can run on your laptop/desktop using VirtualBox (freely available).
  • Further information on the class, access to the class notes, and the Linux Hadoop Minimal VM can be found at https://www.clustermonkey.net/scalable-analytics.
  • If you wish to follow along, install and test the sandbox at least one day before the class.

About your instructor

  • Douglas Eadline, PhD, began his career as an analytical chemist with an interest in computer methods. Starting with the first Beowulf how-to document, Doug has written instructional documents covering many aspects of Linux HPC (High Performance Computing) and Hadoop computing. Currently, Doug serves as editor of the ClusterMonkey.net website; he was previously editor of ClusterWorld Magazine and senior HPC editor for Linux Magazine. He is also an active writer and consultant to the HPC/analytics industry. His recent video tutorials and books include the Hadoop and Spark Fundamentals LiveLessons video (Addison-Wesley), Hadoop 2 Quick-Start Guide (Addison-Wesley), High Performance Computing for Dummies (Wiley), and Practical Data Science with Hadoop and Spark (co-author, Addison-Wesley).

Schedule

The timeframes are estimates only and may vary according to how the class is progressing.

Segment 1: Introduction and Course Goals (15 mins)

  • How to get the most out of this course
  • Required prerequisite skills (basic commands, file system operations, I/O redirection, the vi editor)
  • Working with the command line in Windows, Mac, and Linux
  • Safe communication using Secure Shell (SSH)
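
For reference, the same ssh command works from the macOS Terminal, Windows PowerShell, or a Linux shell; the user and host names below are placeholders:

    # Log in to a remote Linux server (user and host are placeholders)
    ssh analyst@cluster.example.com

    # Copy a local file to the server over the same secure channel
    scp data.csv analyst@cluster.example.com:/home/analyst/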

Segment 2: The Linux Hadoop Minimal Virtual Machine Text Terminal (10 mins)

  • Using Oracle VirtualBox
  • Starting the Virtual Machine
  • Connecting to the VM using SSH (see the example below)
  • Questions (10 min)
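
A typical connection to the VM looks like the following; the forwarded port and user name are assumptions based on a common VirtualBox NAT setup, not necessarily this VM's exact configuration:

    # Connect to the VM through a VirtualBox NAT port forward
    # (host port 2222 forwarded to guest port 22; user name is a placeholder)
    ssh -p 2222 hands-on@localhost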

Segment 3: Linux Analytics Tools (30 mins)

  • Searching text using grep
  • Stream editing using sed
  • Using gawk with CSV files
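
As a preview, here are the three tools applied to a hypothetical sales.csv file (the file name and column layout are assumptions for illustration):

    # grep: print only the lines that contain the string "2024"
    grep "2024" sales.csv

    # sed: stream-edit the file, replacing commas with tab characters
    sed 's/,/\t/g' sales.csv

    # gawk: print column 2 and sum column 3 of a comma-separated file
    gawk -F',' '{print $2; total += $3} END {print "Total:", total}' sales.csv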

Segment 4: Moving Data into Hadoop HDFS (20 mins)

  • What Hadoop HDFS is and why it is different
  • Your local file system is not Hadoop HDFS
  • Using HDFS wrapper commands (examples after this list)
  • Questions (10 min)
  • Break (10 mins)
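
As a preview of the HDFS wrapper commands, note that hdfs dfs operates on the separate HDFS namespace, not on your local files; the paths below are placeholders:

    # List your HDFS home directory (this is not the local file system)
    hdfs dfs -ls

    # Copy a local file into HDFS
    hdfs dfs -put sales.csv /user/hands-on/

    # Copy a file from HDFS back to the local file system
    hdfs dfs -get /user/hands-on/sales.csv sales-copy.csv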

Segment 5: Running Command Line Analytics Tools (20 mins)

  • Running/Observing a Hive job
  • Running/Observing a PySpark job
  • Running/Observing a Kafka job
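
For illustration, each of these job types has a command-line launcher; the file, topic, and host names below are placeholders, and exact options vary with the installed versions:

    # Run a Hive script from the command line
    hive -f my-query.hql

    # Submit a PySpark application (clusters often add --master yarn)
    spark-submit my-analysis.py

    # Read messages from a Kafka topic with the console consumer
    kafka-console-consumer.sh --bootstrap-server localhost:9092 \
        --topic my-topic --from-beginning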

Segment 6: Bash Scripting Basics (30 mins)

  • Creating a bash script using the following (sketch after this list):
  • Bash variables
  • If-then tests
  • Control structures
  • Input and output
  • Questions (10 min)
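
A minimal sketch that combines the pieces above (variables, an if-then test, a control structure, and input/output); it reports the line count of each file named on the command line:

    #!/bin/bash
    # Report the line count of each file given as an argument
    for FILE in "$@"; do
        if [ -f "$FILE" ]; then
            COUNT=$(wc -l < "$FILE")
            echo "$FILE: $COUNT lines"
        else
            echo "Skipping $FILE: not a regular file" >&2
        fi
    done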

Segment 7: Creating Bash Scripts (20 mins)

  • Downloading and moving data
  • Combining tools
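
A sketch of the kind of pipeline this segment builds, assuming a hypothetical download URL and HDFS path:

    #!/bin/bash
    # Download a CSV file, filter it, and stage the result in HDFS
    # (the URL and HDFS path are placeholders)
    URL="https://example.com/data/sales.csv"
    wget -q "$URL" -O sales.csv
    grep "2024" sales.csv > sales-2024.csv
    hdfs dfs -put sales-2024.csv /user/hands-on/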

Segment 8: Course Wrap-up and Additional Resources (5 mins)

  • Remaining questions