Getting Started with Impala

Name: Getting Started with Impala
Author: John Russell
ISBN: 9781491905722

September 2014

Intermediate to advanced

152 pages

4h 3m

English

O'Reilly Media, Inc.

Introduction
  Who Is This Book For?
  Conventions Used in This Book
  Using Code Examples
  Safari® Books Online
  How to Contact Us
  Content Updates
  March 30, 2016
  Acknowledgments
1. Why Impala?
  Impala’s Place in the Big Data Ecosystem
  Flexibility for Your Big Data Workflow
  High-Performance Analytics
  Exploratory Business Intelligence
2. Getting Up and Running with Impala
  Installation
  Connecting to Impala
  Your First Impala Queries
3. Impala for the Database Developer
  The SQL Language
  Standard SQL for Queries
  Limited DML
  No Transactions
  Numbers
  Recent Additions
  Big Data Considerations
  Billions and Billions of Rows
  HDFS Block Size
  Parquet Files: The Biggest Blocks of All
  How Impala Is Like a Data Warehouse
  Physical and Logical Data Layouts
  The HDFS Storage Model
  Distributed Queries
  Normalized and Denormalized Data
  File Formats
  Text File Format
  Parquet File Format
  Getting File Format Information
  Switching File Formats
  Aggregation
4. Common Developer Tasks for Impala
  Getting Data into an Impala Table
  INSERT Statement
  LOAD DATA Statement
  External Tables
  Figuring Out Where Impala Data Resides
  Manually Loading Data Files into HDFS
  Hive
  Sqoop
  Kite
  Porting SQL Code to Impala
  Using Impala from a JDBC or ODBC Application
  JDBC
  ODBC
  Using Impala with a Scripting Language
  Running Impala SQL Statements from Scripts
  Variable Substitution
  Saving Query Results
  The impyla Package for Python Scripting
  Optimizing Impala Performance
  Optimizing Query Performance
  Optimizing Memory Usage
  Working with Partitioned Tables
  Finding the Ideal Granularity
  Inserting into Partitioned Tables
  Adding and Loading New Partitions
  Keeping Statistics Up to Date for Partitioned Tables
  Writing User-Defined Functions
  Collaborating with Your Administrators
  Designing for Security
  Anticipate Memory Usage
  Understanding Resource Management
  Helping to Plan for Performance (Stats, HDFS Caching)
  Understanding Cluster Topology
  Always Close Your Queries
5. Tutorials and Deep Dives
  Tutorial: From Unix Data File to Impala Table
  Tutorial: Queries Without a Table
  Tutorial: The Journey of a Billion Rows
  Generating a Billion Rows of CSV Data
  Normalizing the Original Data
  Converting to Parquet Format
  Making a Partitioned Table
  Next Steps
  Deep Dive: Joins and the Role of Statistics
  Creating a Million-Row Table to Join With
  Loading Data and Computing Stats
  Reviewing the EXPLAIN Plan
  Trying a Real Query
  The Story So Far
  Final Join Query with 1B x 1M Rows
  Anti-Pattern: A Million Little Pieces
  Tutorial: Across the Fourth Dimension
  TIMESTAMP Data Type
  Format Strings for Dates and Times
  Working with Individual Date and Time Fields
  Date and Time Arithmetic
  Let’s Solve the Y2K Problem
  More Fun with Dates
  Tutorial: Verbose and Quiet impala-shell Output
  Tutorial: When Schemas Evolve
  Numbers Versus Strings
  Dealing with Out-of-Range Integers
  Tutorial: Levels of Abstraction
  String Formatting
  Temperature Conversion
  Tutorial: Subqueries
  Subqueries in the FROM Clause
  Subqueries in the FROM Clause for Join Queries
  Subqueries in the WHERE Clause
  Uncorrelated and Correlated Subqueries
  Common Table Expressions in the WITH Clause
  Tutorial: Analytic Functions
  Analyzing the Numbers 1 Through 10
  Running Totals and Moving Averages
  Breaking Ties
  Tutorial: Complex Types
  ARRAY: A List of Items with Identical Types
  MAP: A Hash Table or Dictionary with Key-Value Pairs
  STRUCT: A Row-Like Object for Flexible Typing and Naming
  Nesting Complex Types to Represent Arbitrary Data Structures
  Querying Tables with Nested Complex Types
  Constructing Data for Complex Types

Content preview from Getting Started with Impala

Chapter 1. Why Impala?

The Apache Hadoop ecosystem is very data-centric, making it a natural fit for database developers with SQL experience. Much application development work for Hadoop consists of writing programs to copy, convert or reorganize, and analyze data files. A lot of effort goes into finding ways to do these things reliably, on a large scale, and in parallel across clusters of networked machines. Impala focuses on making these activities fast and easy, without requiring you to have a PhD in distributed computing, learn a lot of new APIs, or write a complete program when your intent can be conveyed with a single SQL statement.

Impala’s Place in the Big Data Ecosystem

The Cloudera Impala project arrives in the Big Data world at just the right moment. Data volume is growing fast, outstripping what can be realistically stored or processed on a single server. The Hadoop software stack is opening up large-scale data processing to a larger audience of users and developers.

Impala brings a high degree of flexibility to the familiar database ETL process. You can query data that you already have in various standard Hadoop file formats (see “File Formats”). You can access the same data with a combination of Impala and other Hadoop components such as Apache Hive, Apache Pig, and Cloudera Search without duplicating or converting the data. When query speed is critical, the Parquet columnar file format makes it simple to reorganize data for maximum performance of data warehouse-style queries.

Publisher Resources

ISBN: 9781491905760