Chapter 5. Tutorials and Deep Dives

The following sections cover aspects of Impala that deserve a closer look. Brief examples illustrate interesting features for new users. More complex topics are covered by tutorials or deep dives into the inner workings.

Tutorial: From Unix Data File to Impala Table

Here is what your first Unix command-line session might look like when you’re using Impala. This Bash session creates a couple of text files (which could be named anything), copies those files into the HDFS filesystem, and points an Impala table at the data so that it can be queried through SQL. The exact HDFS paths might differ depending on your HDFS configuration and which Linux user you are logged in as.

$ cat >csv.txt
1,red,apple,4
2,orange,orange,4
3,yellow,banana,3
4,green,apple,4
^D
$ cat >more_csv.txt
5,blue,bubblegum,0.5
6,indigo,blackberry,0.2
7,violet,edible flower,0.01
8,white,scoop of vanilla ice cream,3
9,black,licorice stick,0.2
^D
$ hadoop fs -mkdir /user/hive/staging
$ hadoop fs -put csv.txt /user/hive/staging
$ hadoop fs -put more_csv.txt /user/hive/staging
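To confirm that both files landed where you expect, you can list the staging directory:

$ hadoop fs -ls /user/hive/staging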
Note

Sometimes the user you are logged in as does not have permission to manipulate HDFS files. In that case, issue the commands with the permissions of the hdfs user, using the form:

sudo -u hdfs hadoop fs arguments
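For example, the mkdir command above would become:

$ sudo -u hdfs hadoop fs -mkdir /user/hive/staging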

Now that the data files are in the HDFS filesystem, let’s go into the Impala shell and start working with them. (Some of the prompts and output are abbreviated here for easier reading by first-time users.)
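As a preview, here is a minimal sketch of how that impala-shell session might continue. The table name csv_sample and the column names are hypothetical choices for illustration; the key idea is that the LOCATION clause points the table at the HDFS directory holding the files you just uploaded:

-- Define a table over the existing comma-separated files;
-- EXTERNAL means dropping the table leaves the files in place.
CREATE EXTERNAL TABLE csv_sample
  (id INT, color STRING, item STRING, quantity FLOAT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/user/hive/staging';

-- Query the data immediately, with no separate load step.
SELECT color, item FROM csv_sample WHERE quantity > 1;

Because the table is external and the data is plain text, Impala reads the files as-is; adding more rows is as simple as putting another CSV file into the same HDFS directory.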
