Getting Started with Impala

Name: Getting Started with Impala
Author: John Russell
ISBN: 9781491905722

by John Russell

September 2014

Intermediate to advanced

152 pages

4h 3m

English

O'Reilly Media, Inc.

Introduction
  Who Is This Book For?
  Conventions Used in This Book
  Using Code Examples
  Safari® Books Online
  How to Contact Us
  Content Updates
  March 30, 2016
  Acknowledgments
1. Why Impala?
  Impala’s Place in the Big Data Ecosystem
  Flexibility for Your Big Data Workflow
  High-Performance Analytics
  Exploratory Business Intelligence
2. Getting Up and Running with Impala
  Installation
  Connecting to Impala
  Your First Impala Queries
3. Impala for the Database Developer
  The SQL Language
  Standard SQL for Queries
  Limited DML
  No Transactions
  Numbers
  Recent Additions
  Big Data Considerations
  Billions and Billions of Rows
  HDFS Block Size
  Parquet Files: The Biggest Blocks of All
  How Impala Is Like a Data Warehouse
  Physical and Logical Data Layouts
  The HDFS Storage Model
  Distributed Queries
  Normalized and Denormalized Data
  File Formats
  Text File Format
  Parquet File Format
  Getting File Format Information
  Switching File Formats
  Aggregation
4. Common Developer Tasks for Impala
  Getting Data into an Impala Table
  INSERT Statement
  LOAD DATA Statement
  External Tables
  Figuring Out Where Impala Data Resides
  Manually Loading Data Files into HDFS
  Hive
  Sqoop
  Kite
  Porting SQL Code to Impala
  Using Impala from a JDBC or ODBC Application
  JDBC
  ODBC
  Using Impala with a Scripting Language
  Running Impala SQL Statements from Scripts
  Variable Substitution
  Saving Query Results
  The impyla Package for Python Scripting
  Optimizing Impala Performance
  Optimizing Query Performance
  Optimizing Memory Usage
  Working with Partitioned Tables
  Finding the Ideal Granularity
  Inserting into Partitioned Tables
  Adding and Loading New Partitions
  Keeping Statistics Up to Date for Partitioned Tables
  Writing User-Defined Functions
  Collaborating with Your Administrators
  Designing for Security
  Anticipate Memory Usage
  Understanding Resource Management
  Helping to Plan for Performance (Stats, HDFS Caching)
  Understanding Cluster Topology
  Always Close Your Queries
5. Tutorials and Deep Dives
  Tutorial: From Unix Data File to Impala Table
  Tutorial: Queries Without a Table
  Tutorial: The Journey of a Billion Rows
  Generating a Billion Rows of CSV Data
  Normalizing the Original Data
  Converting to Parquet Format
  Making a Partitioned Table
  Next Steps
  Deep Dive: Joins and the Role of Statistics
  Creating a Million-Row Table to Join With
  Loading Data and Computing Stats
  Reviewing the EXPLAIN Plan
  Trying a Real Query
  The Story So Far
  Final Join Query with 1B x 1M Rows
  Anti-Pattern: A Million Little Pieces
  Tutorial: Across the Fourth Dimension
  TIMESTAMP Data Type
  Format Strings for Dates and Times
  Working with Individual Date and Time Fields
  Date and Time Arithmetic
  Let’s Solve the Y2K Problem
  More Fun with Dates
  Tutorial: Verbose and Quiet impala-shell Output
  Tutorial: When Schemas Evolve
  Numbers Versus Strings
  Dealing with Out-of-Range Integers
  Tutorial: Levels of Abstraction
  String Formatting
  Temperature Conversion
  Tutorial: Subqueries
  Subqueries in the FROM Clause
  Subqueries in the FROM Clause for Join Queries
  Subqueries in the WHERE Clause
  Uncorrelated and Correlated Subqueries
  Common Table Expressions in the WITH Clause
  Tutorial: Analytic Functions
  Analyzing the Numbers 1 Through 10
  Running Totals and Moving Averages
  Breaking Ties
  Tutorial: Complex Types
  ARRAY: A List of Items with Identical Types
  MAP: A Hash Table or Dictionary with Key-Value Pairs
  STRUCT: A Row-Like Object for Flexible Typing and Naming
  Nesting Complex Types to Represent Arbitrary Data Structures
  Querying Tables with Nested Complex Types
  Constructing Data for Complex Types

Overview

Learn how to write, tune, and port SQL queries and other statements for a Big Data environment using Impala, the massively parallel processing SQL query engine for Apache Hadoop. The best practices in this practical guide help you design database schemas that not only interoperate with other Hadoop components and are convenient for administrators to manage and monitor, but also accommodate future growth in data size and evolution of software capabilities.

Written by John Russell, documentation lead for the Cloudera Impala project, this book gets you working with the most recent Impala releases quickly. Ideal for database developers and business analysts, the latest revision covers analytic functions, complex types, incremental statistics, subqueries, and Impala’s submission to the Apache Incubator.

Getting Started with Impala includes advice from Cloudera’s development team, as well as insights from its consulting engagements with customers.

Learn how Impala integrates with a wide range of Hadoop components
Attain high performance and scalability for huge data sets on production clusters
Explore common developer tasks, such as porting code to Impala and optimizing performance
Use tutorials for working with billion-row tables, date- and time-based values, and other techniques
Learn how to transition from rigid schemas to a flexible model that evolves as needs change
Take a deep dive into joins and the role of statistics

Publisher Resources

ISBN: 9781491905760
Errata Page
