Getting Started with Impala

Book description

Learn how to write, tune, and port SQL queries and other statements for a Big Data environment, using Impala, the massively parallel processing SQL query engine for Apache Hadoop. The best practices in this practical guide help you design database schemas that interoperate with other Hadoop components, are convenient for administrators to manage and monitor, and accommodate future expansion in data size and evolution of software capabilities.

Written by John Russell, documentation lead for the Cloudera Impala project, this book gets you working with the most recent Impala releases quickly. Ideal for database developers and business analysts, the latest revision covers analytic functions, complex types, incremental statistics, subqueries, and Impala's submission to the Apache Incubator.

Getting Started with Impala includes advice from Cloudera’s development team, as well as insights from its consulting engagements with customers.

  • Learn how Impala integrates with a wide range of Hadoop components
  • Attain high performance and scalability for huge data sets on production clusters
  • Explore common developer tasks, such as porting code to Impala and optimizing performance
  • Use tutorials for working with billion-row tables, date- and time-based values, and other techniques
  • Learn how to transition from rigid schemas to a flexible model that evolves as needs change
  • Take a deep dive into joins and the role of statistics

Table of contents

  1. Introduction
    1. Who Is This Book For?
    2. Conventions Used in This Book
    3. Using Code Examples
    4. Safari® Books Online
    5. How to Contact Us
    6. Content Updates
      1. March 30, 2016
    7. Acknowledgments
  2. 1. Why Impala?
    1. Impala’s Place in the Big Data Ecosystem
    2. Flexibility for Your Big Data Workflow
    3. High-Performance Analytics
    4. Exploratory Business Intelligence
  3. 2. Getting Up and Running with Impala
    1. Installation
    2. Connecting to Impala
    3. Your First Impala Queries
  4. 3. Impala for the Database Developer
    1. The SQL Language
      1. Standard SQL for Queries
      2. Limited DML
      3. No Transactions
      4. Numbers
      5. Recent Additions
    2. Big Data Considerations
      1. Billions and Billions of Rows
      2. HDFS Block Size
      3. Parquet Files: The Biggest Blocks of All
    3. How Impala Is Like a Data Warehouse
    4. Physical and Logical Data Layouts
      1. The HDFS Storage Model
    5. Distributed Queries
    6. Normalized and Denormalized Data
    7. File Formats
      1. Text File Format
      2. Parquet File Format
      3. Getting File Format Information
      4. Switching File Formats
    8. Aggregation
  5. 4. Common Developer Tasks for Impala
    1. Getting Data into an Impala Table
      1. INSERT Statement
      2. LOAD DATA Statement
      3. External Tables
      4. Figuring Out Where Impala Data Resides
      5. Manually Loading Data Files into HDFS
      6. Hive
      7. Sqoop
      8. Kite
    2. Porting SQL Code to Impala
    3. Using Impala from a JDBC or ODBC Application
      1. JDBC
      2. ODBC
    4. Using Impala with a Scripting Language
      1. Running Impala SQL Statements from Scripts
      2. Variable Substitution
      3. Saving Query Results
      4. The impyla Package for Python Scripting
    5. Optimizing Impala Performance
      1. Optimizing Query Performance
      2. Optimizing Memory Usage
      3. Working with Partitioned Tables
      4. Finding the Ideal Granularity
      5. Inserting into Partitioned Tables
      6. Adding and Loading New Partitions
      7. Keeping Statistics Up to Date for Partitioned Tables
    6. Writing User-Defined Functions
    7. Collaborating with Your Administrators
      1. Designing for Security
      2. Anticipate Memory Usage
      3. Understanding Resource Management
      4. Helping to Plan for Performance (Stats, HDFS Caching)
      5. Understanding Cluster Topology
      6. Always Close Your Queries
  6. 5. Tutorials and Deep Dives
    1. Tutorial: From Unix Data File to Impala Table
    2. Tutorial: Queries Without a Table
    3. Tutorial: The Journey of a Billion Rows
      1. Generating a Billion Rows of CSV Data
      2. Normalizing the Original Data
      3. Converting to Parquet Format
      4. Making a Partitioned Table
      5. Next Steps
    4. Deep Dive: Joins and the Role of Statistics
      1. Creating a Million-Row Table to Join With
      2. Loading Data and Computing Stats
      3. Reviewing the EXPLAIN Plan
      4. Trying a Real Query
      5. The Story So Far
      6. Final Join Query with 1B x 1M Rows
    5. Anti-Pattern: A Million Little Pieces
    6. Tutorial: Across the Fourth Dimension
      1. TIMESTAMP Data Type
      2. Format Strings for Dates and Times
      3. Working with Individual Date and Time Fields
      4. Date and Time Arithmetic
      5. Let’s Solve the Y2K Problem
      6. More Fun with Dates
    7. Tutorial: Verbose and Quiet impala-shell Output
    8. Tutorial: When Schemas Evolve
      1. Numbers Versus Strings
      2. Dealing with Out-of-Range Integers
    9. Tutorial: Levels of Abstraction
      1. String Formatting
      2. Temperature Conversion
    10. Tutorial: Subqueries
      1. Subqueries in the FROM Clause
      2. Subqueries in the FROM Clause for Join Queries
      3. Subqueries in the WHERE Clause
      4. Uncorrelated and Correlated Subqueries
      5. Common Table Expressions in the WITH Clause
    11. Tutorial: Analytic Functions
      1. Analyzing the Numbers 1 Through 10
      2. Running Totals and Moving Averages
      3. Breaking Ties
    12. Tutorial: Complex Types
      1. ARRAY: A List of Items with Identical Types
      2. MAP: A Hash Table or Dictionary with Key-Value Pairs
      3. STRUCT: A Row-Like Object for Flexible Typing and Naming
      4. Nesting Complex Types to Represent Arbitrary Data Structures
      5. Querying Tables with Nested Complex Types
      6. Constructing Data for Complex Types

Product information

  • Title: Getting Started with Impala
  • Author(s): John Russell
  • Release date: September 2014
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491905722