Apache Hive Essentials

Book Description

Immerse yourself in a fantastic journey to discover the attributes of big data using Hive

In Detail

This book prepares you for your journey into big data by first introducing the background of the big data domain and then walking you through setting up and getting familiar with your Hive working environment. Next, it guides you through discovering and transforming the value of big data with the help of examples, and hones your skill in using the Hive language efficiently. Toward the end, the book focuses on advanced topics such as performance, security, and extensions in Hive, carrying you further along this big data journey.

By the end of the book, you will be familiar with Hive and able to work efficiently to find solutions to big data problems.

What You Will Learn

  • Create and set up the Hive environment
  • Discover how to use Hive's definition language to describe data
  • Discover interesting data by joining and filtering datasets in Hive
  • Transform data by using Hive sorting, ordering, and functions
  • Aggregate and sample data in different ways
  • Boost Hive query performance and enhance data security in Hive
  • Customize Hive to your needs by using user-defined functions and integrate it with other tools

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Apache Hive Essentials
    1. Table of Contents
    2. Apache Hive Essentials
    3. Credits
    4. About the Author
    5. About the Reviewers
    6. www.PacktPub.com
      1. Support files, eBooks, discount offers, and more
        1. Why subscribe?
        2. Free access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    8. 1. Overview of Big Data and Hive
      1. A short history
      2. Introducing big data
      3. Relational and NoSQL database versus Hadoop
      4. Batch, real-time, and stream processing
      5. Overview of the Hadoop ecosystem
      6. Hive overview
      7. Summary
    9. 2. Setting Up the Hive Environment
      1. Installing Hive from Apache
      2. Installing Hive from vendor packages
      3. Starting Hive in the cloud
      4. Using the Hive command line and Beeline
      5. The Hive-integrated development environment
      6. Summary
    10. 3. Data Definition and Description
      1. Understanding Hive data types
      2. Data type conversions
      3. Hive Data Definition Language
      4. Hive database
      5. Hive internal and external tables
      6. Hive partitions
      7. Hive buckets
      8. Hive views
      9. Summary
    11. 4. Data Selection and Scope
      1. The SELECT statement
      2. The INNER JOIN statement
      3. The OUTER JOIN and CROSS JOIN statements
      4. Special JOIN – MAPJOIN
      5. Set operation – UNION ALL
      6. Summary
    12. 5. Data Manipulation
      1. Data exchange – LOAD
      2. Data exchange – INSERT
      3. Data exchange – EXPORT and IMPORT
      4. ORDER and SORT
      5. Operators and functions
      6. Transactions
      7. Summary
    13. 6. Data Aggregation and Sampling
      1. Basic aggregation – GROUP BY
      2. Advanced aggregation – GROUPING SETS
      3. Advanced aggregation – ROLLUP and CUBE
      4. Aggregation condition – HAVING
      5. Analytic functions
      6. Sampling
      7. Summary
    14. 7. Performance Considerations
      1. Performance utilities
        1. The EXPLAIN statement
        2. The ANALYZE statement
      2. Design optimization
        1. Partition tables
        2. Bucket tables
        3. Index
      3. Data file optimization
        1. File format
        2. Compression
        3. Storage optimization
      4. Job and query optimization
        1. Local mode
        2. JVM reuse
        3. Parallel execution
        4. Join optimization
          1. Common join
          2. Map join
          3. Bucket map join
          4. Sort merge bucket (SMB) join
          5. Sort merge bucket map (SMBM) join
          6. Skew join
      5. Summary
    15. 8. Extensibility Considerations
      1. User-defined functions
        1. The UDF code template
        2. The UDAF code template
        3. The UDTF code template
        4. Development and deployment
      2. Streaming
      3. SerDe
      4. Summary
    16. 9. Security Considerations
      1. Authentication
        1. Metastore server authentication
        2. HiveServer2 authentication
      2. Authorization
        1. Legacy mode
        2. Storage-based mode
        3. SQL standard-based mode
      3. Encryption
      4. Summary
    17. 10. Working with Other Tools
      1. JDBC / ODBC connector
      2. HBase
      3. Hue
      4. HCatalog
      5. ZooKeeper
      6. Oozie
      7. Hive roadmap
      8. Summary
    18. Index