Live Online Training

Beginning Linux Command Line for Data Engineers and Analysts: Effective Data Pipelines

Learn to navigate Linux systems and perform basic tasks on analytics clusters

Douglas Eadline

In many situations there are operations that can only be performed from the Linux command line. Although "pointing and clicking" in a GUI is often preferred, these interfaces can be restrictive and limit functionality. A good working knowledge of the Linux command line allows many key operations to be streamlined and easily executed, and many of its commands and features will help improve the throughput of today's data analyst.

What you'll learn and how you can apply it

  • Understand why the command line is still important
  • Learn how to access a Linux server from the command line on Windows and Mac computers
  • Understand the basic Linux filesystem layout and navigate its contents
  • Learn the essential commands and tools used in a modern scalable analytics environment
  • Understand the basic vi text editor commands so you can view and edit files
  • Learn about ways to move data to/from Linux

This training course is for you because...

  • Only the essential and useful aspects of the Linux command line are presented
  • You will learn how to connect to and perform useful tasks on almost any Linux server
  • All examples are provided in convenient notes files that serve as a reference guide after the course is complete
  • A Linux Hadoop virtual machine is provided so you can try all the commands and examples during and after the course (includes a single-server instance of Hadoop, Spark, Kafka, and other tools)

Prerequisites

  • A basic understanding of computer/server operation (processors, memory, disks, networking)

Course Set-up

To run the class examples, a Linux Hadoop Minimal Virtual Machine (VM) is available. The VM is a full Linux installation that runs on your laptop or desktop using the freely available VirtualBox. Further information on the class, access to the class notes, and the Linux Hadoop Minimal VM can be found at https://www.clustermonkey.net/scalable-analytics

If you wish to follow along, install and test the VM at least one day before the class.

About your instructor

  • Douglas Eadline, PhD, began his career as an analytical chemist with an interest in computer methods. Starting with the first Beowulf how-to document, Doug has written instructional documents covering many aspects of Linux HPC (High Performance Computing) and Hadoop computing. Currently, Doug serves as editor of the ClusterMonkey.net website; he was previously editor of ClusterWorld Magazine and senior HPC editor for Linux Magazine. He is also an active writer and consultant to the HPC/analytics industry. His recent video tutorials and books include the Hadoop and Spark Fundamentals LiveLessons video (Addison Wesley), Hadoop 2 Quick Start Guide (Addison Wesley), High Performance Computing for Dummies (Wiley), and Practical Data Science with Hadoop and Spark (co-author, Addison Wesley).

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Segment 1: Introduction and Course Goals (15 mins)

  • How to get the most out of this course
  • It's 2019: why do we still need the Linux/Unix command line?
  • Advantages and disadvantages of the command line
  • Working with the command line in Windows, Mac, and Linux
  • Safe communication using Secure Shell (SSH), as shown in the example after this list
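
As a quick preview, an SSH login has the same basic shape on all of these systems; the username and hostname below are placeholders, not the actual class server:

  # Log in to a remote Linux server (placeholder user and host)
  $ ssh analyst@linux-server.example.com

  # Run a single command remotely without an interactive session
  $ ssh analyst@linux-server.example.com "hostname && uptime"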

Segment 2: The Linux Hadoop Minimal Virtual Machine and the Text Terminal (20 mins)

  • Using Oracle VirtualBox
  • Starting the Virtual Machine
  • Connecting to the VM using SSH (see the sketch after this list)
  • The Linux filesystem layout
  • Questions
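
As an illustration, a VirtualBox guest is often reached through a forwarded local port; the port number and username below are assumptions for this sketch, and the actual values are given in the class notes:

  # Connect to the VM via a forwarded local port (2222 and "student"
  # are placeholder values; see the class notes for the real ones)
  $ ssh -p 2222 student@localhost

  # Once logged in, inspect the top of the Linux filesystem layout
  $ ls /
  bin  boot  dev  etc  home  lib  opt  root  tmp  usr  var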

Segment 3: Basic Linux Commands (50 mins)

  • What is a *nix shell?
  • Basic Linux commands (a short sampler follows this list)
  • Basic shell commands
  • Input/Output and pipes
  • File permissions
  • Process management
  • Commands to access system information
  • Questions
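
A short sampler of the kinds of commands this segment covers; the file and directory names are illustrative only:

  # Navigation and file listing
  $ pwd                 # print the current working directory
  $ ls -l /etc          # long listing of a directory
  $ cd ~/data           # change to a directory (example path)

  # Input/output and pipes: count the lines mentioning "error"
  $ grep -i error results.log | wc -l

  # File permissions: let the owner read and execute a script
  $ chmod u+rx run-job.sh

  # Process management and system information
  $ ps aux | head -5    # show the first few running processes
  $ df -h               # disk usage in human-readable units
  $ free -h             # memory usage in human-readable units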

Break (10 mins)

Segment 4: Editing/Viewing Text Files: vi (Visual Editor) (25 mins)

  • Basic modes and navigation (a keystroke quick reference follows this list)
  • Insert/delete, copy/paste
  • Search/Replace
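
The quick reference below shows a handful of the standard vi keystrokes this segment covers (vi is modal; these apply in command mode unless they begin with a colon or slash):

  i                # enter insert mode (press Esc to return to command mode)
  dd               # delete (cut) the current line
  yy               # yank (copy) the current line
  p                # paste the yanked or deleted line below the cursor
  /pattern         # search forward for "pattern" (n repeats the search)
  :%s/old/new/g    # replace every "old" with "new" in the file
  :wq              # write the file and quit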

Segment 5: Moving Data to/from Your Local File System (20 mins)

  • Compressing and archiving using tar and zip (examples follow this list)
  • Secure copy (scp)
  • Web get (wget)
  • Data integrity
  • Questions
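
A sketch of the segment's core commands; the filenames, hostname, and URL are placeholders:

  # Compress and archive a directory, then extract it
  $ tar czvf results.tar.gz results/
  $ tar xzvf results.tar.gz

  # Securely copy a file to a remote server (placeholder user/host)
  $ scp results.tar.gz analyst@linux-server.example.com:/tmp

  # Download a file from the web (placeholder URL)
  $ wget https://example.com/data/sample.csv

  # Check data integrity with a checksum
  $ md5sum results.tar.gz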

Segment 6: Course Wrap-up and Additional Resources (10 mins)

  • Final Q&A