Apache Hive Essentials

Book Description

This book takes you on a fantastic journey to discover the attributes of big data using Apache Hive.

About This Book
  • Grasp the skills needed to write efficient Hive queries to analyze Big Data
  • Discover how Hive can coexist and work with other tools within the Hadoop ecosystem
  • Use practical, example-oriented scenarios to cover all the newly released features of Apache Hive 2.3.3
Who This Book Is For

If you are a data analyst, developer, or simply someone who wants to quickly get started with Hive to explore and analyze Big Data in Hadoop, this is the book for you. Since Hive provides an SQL-like query language, some previous experience with SQL will help you get the most out of this book.

What You Will Learn
  • Create and set up the Hive environment
  • Discover how to use Hive's definition language to describe data
  • Discover interesting data by joining and filtering datasets in Hive
  • Transform data by using Hive sorting, ordering, and functions
  • Aggregate and sample data in different ways
  • Boost Hive query performance and enhance data security in Hive
  • Customize Hive to your needs by using user-defined functions and integrate it with other tools
In Detail

In this book, we prepare you for your journey into big data by first introducing the background of the big data domain, along with the process of setting up and getting familiar with your Hive working environment.

Next, the book guides you through discovering and transforming the values of big data with the help of examples. It also hones your skills in using the Hive language in an efficient manner. Toward the end, the book focuses on advanced topics, such as performance, security, and extensions in Hive, which will guide you on exciting adventures on this worthwhile big data journey.

By the end of the book, you will be familiar with Hive and able to work efficiently to find solutions to big data problems.

Style and approach

This book takes a practical approach that familiarizes you with Apache Hive and shows you how to use it efficiently to find solutions to your big data problems. It covers crucial topics such as performance and data security to help you make the most of the Hive working environment.

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Title Page
  2. Copyright and Credits
    1. Apache Hive Essentials Second Edition
  3. Dedication
  4. Packt Upsell
    1. Why subscribe?
    2. PacktPub.com
  5. Contributors
    1. About the author
    2. About the reviewers
    3. Packt is searching for authors like you
  6. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Download the color images
      3. Conventions used
    4. Get in touch
      1. Reviews
  7. Overview of Big Data and Hive
    1. A short history
    2. Introducing big data
    3. The relational and NoSQL databases versus Hadoop
    4. Batch, real-time, and stream processing
    5. Overview of the Hadoop ecosystem
    6. Hive overview
    7. Summary
  8. Setting Up the Hive Environment
    1. Installing Hive from Apache
    2. Installing Hive from vendors
    3. Using Hive in the cloud
    4. Using the Hive command
    5. Using the Hive IDE
    6. Summary
  9. Data Definition and Description
    1. Understanding data types
    2. Data type conversions
    3. Data Definition Language
    4. Database
    5. Tables
      1. Table creation
      2. Table description
      3. Table cleaning
      4. Table alteration
    6. Partitions
    7. Buckets
    8. Views
    9. Summary
  10. Data Correlation and Scope
    1. Project data with SELECT
    2. Filtering data with conditions
    3. Linking data with JOIN
      1. INNER JOIN
      2. OUTER JOIN
      3. Special joins
    4. Combining data with UNION
    5. Summary
  11. Data Manipulation
    1. Data exchanging with LOAD
    2. Data exchange with INSERT
    3. Data exchange with [EX|IM]PORT
    4. Data sorting
    5. Functions
      1. Function tips for collections
      2. Function tips for date and string
      3. Virtual column functions
    6. Transactions and locks
      1. Transactions
        1. UPDATE statement
        2. DELETE statement
        3. MERGE statement
      2. Locks
    7. Summary
  12. Data Aggregation and Sampling
    1. Basic aggregation
    2. Enhanced aggregation
      1. Grouping sets
      2. Rollup and Cube
    3. Aggregation condition
    4. Window functions
      1. Window aggregate functions
      2. Window sort functions
      3. Window analytics functions
      4. Window expression
    5. Sampling
      1. Random sampling
      2. Bucket table sampling
      3. Block sampling
    6. Summary
  13. Performance Considerations
    1. Performance utilities
      1. EXPLAIN statement
      2. ANALYZE statement
      3. Logs
    2. Design optimization
      1. Partition table design
      2. Bucket table design
      3. Index design
      4. Use skewed/temporary tables
    3. Data optimization
      1. File format
      2. Compression
      3. Storage optimization
    4. Job optimization
      1. Local mode
      2. JVM reuse
      3. Parallel execution
      4. Join optimization
        1. Common join
        2. Map join
        3. Bucket map join
        4. Sort merge bucket (SMB) join
        5. Sort merge bucket map (SMBM) join
        6. Skew join
      5. Job engine
      6. Optimizer
        1. Vectorization optimization
        2. Cost-based optimization
    5. Summary
  14. Extensibility Considerations
    1. User-defined functions
      1. UDF code template
      2. UDAF code template
      3. UDTF code template
      4. Development and deployment
    2. HPL/SQL
    3. Streaming
    4. SerDe
    5. Summary
  15. Security Considerations
    1. Authentication
      1. Metastore authentication
      2. Hiveserver2 authentication
    2. Authorization
      1. Legacy mode
      2. Storage-based mode
      3. SQL standard-based mode
    3. Mask and encryption
      1. The data-hashing function
      2. The data-masking function
      3. The data-encryption function
      4. Other methods
    4. Summary
  16. Working with Other Tools
    1. The JDBC/ODBC connector
    2. NoSQL
    3. The Hue/Ambari Hive view
    4. HCatalog
    5. Oozie
    6. Spark
    7. Hivemall
    8. Summary
  17. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think