Programming Hive

by Edward Capriolo, Dean Wampler, Jason Rutherglen

September 2012

Intermediate to advanced

350 pages

9h 46m

English

O'Reilly Media, Inc.

Programming Hive
Preface
Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; What Brought Us to Hive?; Edward Capriolo; Dean Wampler; Jason Rutherglen; Acknowledgments
1. Introduction
An Overview of Hadoop and MapReduce; MapReduce; Hive in the Hadoop Ecosystem; Pig; HBase; Cascading, Crunch, and Others; Java Versus Hive: The Word Count Algorithm; What’s Next
2. Getting Started
Installing a Preconfigured Virtual Machine; Detailed Installation; Installing Java; Linux-specific Java steps; Mac OS X-specific Java steps; Installing Hadoop; Local Mode, Pseudodistributed Mode, and Distributed Mode; Testing Hadoop; Installing Hive; What Is Inside Hive?; Starting Hive; Configuring Your Hadoop Environment; Local Mode Configuration; Distributed and Pseudodistributed Mode Configuration; Metastore Using JDBC; The Hive Command; Command Options; The Command-Line Interface; CLI Options; Variables and Properties; Hive “One Shot” Commands; Executing Hive Queries from Files; The .hiverc File; More on Using the Hive CLI; Autocomplete; Command History; Shell Execution; Hadoop dfs Commands from Inside Hive; Comments in Hive Scripts; Query Column Headers
3. Data Types and File Formats
Primitive Data Types; Collection Data Types; Text File Encoding of Data Values; Schema on Read
4. HiveQL: Data Definition
Databases in Hive; Alter Database; Creating Tables; Managed Tables; External Tables; Partitioned, Managed Tables; External Partitioned Tables; Customizing Table Storage Formats; Dropping Tables; Alter Table; Renaming a Table; Adding, Modifying, and Dropping a Table Partition; Changing Columns; Adding Columns; Deleting or Replacing Columns; Alter Table Properties; Alter Storage Properties; Miscellaneous Alter Table Statements
5. HiveQL: Data Manipulation
Loading Data into Managed Tables; Inserting Data into Tables from Queries; Dynamic Partition Inserts; Creating Tables and Loading Them in One Query; Exporting Data
6. HiveQL: Queries
SELECT … FROM Clauses; Specify Columns with Regular Expressions; Computing with Column Values; Arithmetic Operators; Using Functions; Mathematical functions; Aggregate functions; Table generating functions; Other built-in functions; LIMIT Clause; Column Aliases; Nested SELECT Statements; CASE … WHEN … THEN Statements; When Hive Can Avoid MapReduce; WHERE Clauses; Predicate Operators; Gotchas with Floating-Point Comparisons; LIKE and RLIKE; GROUP BY Clauses; HAVING Clauses; JOIN Statements; Inner JOIN; Join Optimizations; LEFT OUTER JOIN; OUTER JOIN Gotcha; RIGHT OUTER JOIN; FULL OUTER JOIN; LEFT SEMI-JOIN; Cartesian Product JOINs; Map-side Joins; ORDER BY and SORT BY; DISTRIBUTE BY with SORT BY; CLUSTER BY; Casting; Casting BINARY Values; Queries that Sample Data; Block Sampling; Input Pruning for Bucket Tables; UNION ALL
7. HiveQL: Views
Views to Reduce Query Complexity; Views that Restrict Data Based on Conditions; Views and Map Type for Dynamic Tables; View Odds and Ends
8. HiveQL: Indexes
Creating an Index; Bitmap Indexes; Rebuilding the Index; Showing an Index; Dropping an Index; Implementing a Custom Index Handler

9. Schema Design
Table-by-Day; Over Partitioning; Unique Keys and Normalization; Making Multiple Passes over the Same Data; The Case for Partitioning Every Table; Bucketing Table Data Storage; Adding Columns to a Table; Using Columnar Tables; Repeated Data; Many Columns; (Almost) Always Use Compression!
10. Tuning
Using EXPLAIN; EXPLAIN EXTENDED; Limit Tuning; Optimized Joins; Local Mode; Parallel Execution; Strict Mode; Tuning the Number of Mappers and Reducers; JVM Reuse; Indexes; Dynamic Partition Tuning; Speculative Execution; Single MapReduce MultiGROUP BY; Virtual Columns
11. Other File Formats and Compression
Determining Installed Codecs; Choosing a Compression Codec; Enabling Intermediate Compression; Final Output Compression; Sequence Files; Compression in Action; Archive Partition; Compression: Wrapping Up
12. Developing
Changing Log4J Properties; Connecting a Java Debugger to Hive; Building Hive from Source; Running Hive Test Cases; Execution Hooks; Setting Up Hive and Eclipse; Hive in a Maven Project; Unit Testing in Hive with hive_test; The New Plugin Developer Kit
13. Functions
Discovering and Describing Functions; Calling Functions; Standard Functions; Aggregate Functions; Table Generating Functions; A UDF for Finding a Zodiac Sign from a Day; UDF Versus GenericUDF; Permanent Functions; User-Defined Aggregate Functions; Creating a COLLECT UDAF to Emulate GROUP_CONCAT; User-Defined Table Generating Functions; UDTFs that Produce Multiple Rows; UDTFs that Produce a Single Row with Multiple Columns; UDTFs that Simulate Complex Types; Accessing the Distributed Cache from a UDF; Annotations for Use with Functions; Deterministic; Stateful; DistinctLike; Macros
14. Streaming
Identity Transformation; Changing Types; Projecting Transformation; Manipulative Transformations; Using the Distributed Cache; Producing Multiple Rows from a Single Row; Calculating Aggregates with Streaming; CLUSTER BY, DISTRIBUTE BY, SORT BY; GenericMR Tools for Streaming to Java; Calculating Cogroups
15. Customizing Hive File and Record Formats
File Versus Record Formats; Demystifying CREATE TABLE Statements; File Formats; SequenceFile; RCFile; Example of a Custom Input Format: DualInputFormat; Record Formats: SerDes; CSV and TSV SerDes; ObjectInspector; Think Big Hive Reflection ObjectInspector; XML UDF; XPath-Related Functions; JSON SerDe; Avro Hive SerDe; Defining Avro Schema Using Table Properties; Defining a Schema from a URI; Evolving Schema; Binary Output
16. Hive Thrift Service
Starting the Thrift Server; Setting Up Groovy to Connect to HiveService; Connecting to HiveServer; Getting Cluster Status; Result Set Schema; Fetching Results; Retrieving Query Plan; Metastore Methods; Example Table Checker; Finding tables not marked as external; Administrating HiveServer; Productionizing HiveService; Cleanup; Hive ThriftMetastore; ThriftMetastore Configuration; Client Configuration
17. Storage Handlers and NoSQL
Storage Handler Background; HiveStorageHandler; HBase; Cassandra; Static Column Mapping; Transposed Column Mapping for Dynamic Columns; Cassandra SerDe Properties; DynamoDB
18. Security
Integration with Hadoop Security; Authentication with Hive; Authorization in Hive; Users, Groups, and Roles; Privileges to Grant and Revoke; Partition-Level Privileges; Automatic Grants
19. Locking
Locking Support in Hive with Zookeeper; Explicit, Exclusive Locks
20. Hive Integration with Oozie
Oozie Actions; Hive Thrift Service Action; A Two-Query Workflow; Oozie Web Console; Variables in Workflows; Capturing Output; Capturing Output to Variables
21. Hive and Amazon Web Services (AWS)
Why Elastic MapReduce?; Instances; Before You Start; Managing Your EMR Hive Cluster; Thrift Server on EMR Hive; Instance Groups on EMR; Configuring Your EMR Cluster; Deploying hive-site.xml; Deploying a .hiverc Script; Deploying .hiverc using a config step; Deploying a .hiverc using a bootstrap action; Setting Up a Memory-Intensive Configuration; Persistence and the Metastore on EMR; HDFS and S3 on EMR Cluster; Putting Resources, Configs, and Bootstrap Scripts on S3; Logs on S3; Spot Instances; Security Groups; EMR Versus EC2 and Apache Hive; Wrapping Up
22. HCatalog
Introduction; MapReduce; Reading Data; Writing Data; Command Line; Security Model; Architecture
23. Case Studies
m6d.com (Media6Degrees); Data Science at M6D Using Hive and R; M6D UDF Pseudorank; M6D Managing Hive Data Across Multiple MapReduce Clusters; Cross deployment queries with Hive; Replicating Hive data between deployments; Outbrain; In-Site Referrer Identification; Cleaning up the URLs; Determining referrer type; Multiple URLs; Counting Uniques; Why this is a problem; Load a temp table; Querying the temp table; Sessionization; Setting it up; Finding origin pageviews; Bucketing PVs to origins; Aggregating on origins; Aggregating on origin type; Measure engagement; NASA’s Jet Propulsion Laboratory; The Regional Climate Model Evaluation System; Our Experience: Why Hive?; Some Challenges and How We Overcame Them; Conclusion; Photobucket; Big Data at Photobucket; What Hardware Do We Use for Hive?; What’s in Hive?; Who Does It Support?; SimpleReach; Experiences and Needs from the Customer Trenches; A Karmasphere Perspective; Introduction; Use Case Examples from the Customer Trenches; Customer trenches #1: Optimal data formatting for Hive; Customer trenches #2: Partitions and performance; Customer trenches #3: Text analytics with Regex, Lateral View Explode, Ngram, and other UDFs; Apache Hive in production: Incremental needs and capabilities; About Karmasphere
Glossary
A. References
Index
About the Authors
Colophon
Copyright

Content preview from Programming Hive

Chapter 13. Functions

User-Defined Functions (UDFs) are a powerful feature that allows users to extend HiveQL. As we’ll see, you implement them in Java, and once you add them to your session (interactive or driven by a script), they work just like built-in functions, right down to the online help. Hive has several types of user-defined functions, each of which performs a particular “class” of transformations on input data.
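
As a rough sketch of what adding a UDF to a session can look like, the HiveQL below registers a compiled Java class and then calls it like a built-in. The JAR path, the class name `com.example.hive.udf.MyUpper`, the function name `my_upper`, and the `employees` table with its `name` column are all hypothetical, chosen only for illustration.

```sql
-- Hypothetical JAR path and class name; substitute your own.
ADD JAR /path/to/my-udfs.jar;

-- Bind a session-scoped function name to the Java class in the JAR.
CREATE TEMPORARY FUNCTION my_upper AS 'com.example.hive.udf.MyUpper';

-- Once registered, the function is invoked like any built-in function.
-- The employees table and name column are assumed to exist.
SELECT my_upper(name) FROM employees LIMIT 5;

-- The online help covers the new function too; the text shown depends
-- on the documentation the UDF class itself supplies.
DESCRIBE FUNCTION my_upper;
```

A temporary function lasts only for the current session, so scripts that rely on it would need to repeat the `ADD JAR` and `CREATE TEMPORARY FUNCTION` steps.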

In an ETL workload, a process might have several processing steps. The Hive language has multiple ways to pipeline the output from one step to the next and produce multiple outputs during a single query. Users also have the ability to create their own functions for custom processing. Without this feature, a process might have to include a custom MapReduce step or move the data into another system to apply the changes. Interconnecting systems adds complexity and increases the chance of misconfigurations or other errors. Moving data between systems is time-consuming when dealing with gigabyte- or terabyte-sized data sets. In contrast, UDFs run in the same processes as the tasks for your Hive queries, so they work efficiently and eliminate the complexity of integration with other systems. This chapter covers best practices associated with creating and using UDFs.

Discovering and Describing Functions

Before writing custom UDFs, let’s familiarize ourselves with the ones that are already part of Hive. Note that it’s common in the Hive community to use “UDF” to refer to any function, ...
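
For orientation, here is a minimal sketch of how the functions already in a Hive session can be listed and described from the CLI. `concat` is used only as a familiar example of a built-in; the output varies by Hive version, so none is shown.

```sql
-- List every function currently available in the session.
SHOW FUNCTIONS;

-- Show the short help text for one function.
DESCRIBE FUNCTION concat;

-- Show the longer help text, which typically includes usage examples.
DESCRIBE FUNCTION EXTENDED concat;
```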

Publisher Resources

ISBN: 9781449326944
