HBase: The Definitive Guide

Book Description

If you’re looking for a scalable storage solution to accommodate a virtually endless amount of data, this book shows you how Apache HBase can fulfill your needs. As the open source implementation of Google’s Bigtable architecture, HBase scales to billions of rows and millions of columns, while ensuring that write and read performance remain constant. Many IT executives are asking pointed questions about HBase. This book provides meaningful answers, whether you’re evaluating this non-relational database or planning to put it into practice right away.

  • Discover how tight integration with Hadoop makes scalability with HBase easier
  • Distribute large datasets across an inexpensive cluster of commodity servers
  • Access HBase with native Java clients, or with gateway servers providing REST, Avro, or Thrift APIs
  • Get details on HBase’s architecture, including the storage format, write-ahead log, background processes, and more
  • Integrate HBase with Hadoop’s MapReduce framework for massively parallelized data processing jobs
  • Learn how to tune clusters, design schemas, copy tables, import bulk data, decommission nodes, and perform many other tasks

Table of Contents

  1. Dedication
  2. Foreword
  3. Preface
    1. General Information
      1. HBase Version
      2. Building the Examples
      3. Hush: The HBase URL Shortener
      4. Running Hush
    2. Conventions Used in This Book
    3. Using Code Examples
    4. Safari® Books Online
    5. How to Contact Us
    6. Acknowledgments
  4. 1. Introduction
    1. The Dawn of Big Data
    2. The Problem with Relational Database Systems
    3. Nonrelational Database Systems, Not-Only SQL or NoSQL?
      1. Dimensions
      2. Scalability
      3. Database (De-)Normalization
    4. Building Blocks
      1. Backdrop
      2. Tables, Rows, Columns, and Cells
      3. Auto-Sharding
      4. Storage API
      5. Implementation
      6. Summary
    5. HBase: The Hadoop Database
      1. History
      2. Nomenclature
      3. Summary
  5. 2. Installation
    1. Quick-Start Guide
    2. Requirements
      1. Hardware
        1. Servers
        2. Networking
      2. Software
        1. Operating system
        2. Filesystem
        3. Java
        4. Hadoop
        5. SSH
        6. Domain Name Service
        7. Synchronized time
        8. File handles and process limits
        9. Datanode handlers
        10. Swappiness
        11. Windows
    3. Filesystems for HBase
      1. Local
      2. HDFS
      3. S3
      4. Other Filesystems
    4. Installation Choices
      1. Apache Binary Release
      2. Building from Source
    5. Run Modes
      1. Standalone Mode
      2. Distributed Mode
        1. Pseudodistributed mode
        2. Fully distributed mode
          1. Specifying region servers
          2. ZooKeeper setup
          3. Using the existing ZooKeeper ensemble
    6. Configuration
      1. hbase-site.xml and hbase-default.xml
      2. hbase-env.sh
      3. regionserver
      4. log4j.properties
      5. Example Configuration
        1. hbase-site.xml
        2. regionservers
        3. hbase-env.sh
      6. Client Configuration
    7. Deployment
      1. Script-Based
      2. Apache Whirr
      3. Puppet and Chef
    8. Operating a Cluster
      1. Running and Confirming Your Installation
      2. Web-based UI Introduction
      3. Shell Introduction
      4. Stopping the Cluster
  6. 3. Client API: The Basics
    1. General Notes
    2. CRUD Operations
      1. Put Method
        1. Single Puts
        2. The KeyValue class
        3. Client-side write buffer
        4. List of Puts
        5. Atomic compare-and-set
      2. Get Method
        1. Single Gets
        2. The Result class
        3. List of Gets
        4. Related retrieval methods
      3. Delete Method
        1. Single Deletes
        2. List of Deletes
        3. Atomic compare-and-delete
    3. Batch Operations
    4. Row Locks
    5. Scans
      1. Introduction
      2. The ResultScanner Class
      3. Caching Versus Batching
    6. Miscellaneous Features
      1. The HTable Utility Methods
      2. The Bytes Class
  7. 4. Client API: Advanced Features
    1. Filters
      1. Introduction to Filters
        1. The filter hierarchy
        2. Comparison operators
        3. Comparators
      2. Comparison Filters
        1. RowFilter
        2. FamilyFilter
        3. QualifierFilter
        4. ValueFilter
        5. DependentColumnFilter
      3. Dedicated Filters
        1. SingleColumnValueFilter
        2. SingleColumnValueExcludeFilter
        3. PrefixFilter
        4. PageFilter
        5. KeyOnlyFilter
        6. FirstKeyOnlyFilter
        7. InclusiveStopFilter
        8. TimestampsFilter
        9. ColumnCountGetFilter
        10. ColumnPaginationFilter
        11. ColumnPrefixFilter
        12. RandomRowFilter
      4. Decorating Filters
        1. SkipFilter
        2. WhileMatchFilter
      5. FilterList
      6. Custom Filters
      7. Filters Summary
    2. Counters
      1. Introduction to Counters
      2. Single Counters
      3. Multiple Counters
    3. Coprocessors
      1. Introduction to Coprocessors
      2. The Coprocessor Class
      3. Coprocessor Loading
        1. Loading from the configuration
        2. Loading from the table descriptor
      4. The RegionObserver Class
        1. Handling region life-cycle events
          1. State: pending open
          2. State: open
          3. State: pending close
        2. Handling client API events
        3. The RegionCoprocessorEnvironment class
        4. The ObserverContext class
        5. The BaseRegionObserver class
      5. The MasterObserver Class
        1. The MasterCoprocessorEnvironment class
        2. The BaseMasterObserver class
      6. Endpoints
        1. The CoprocessorProtocol interface
        2. The BaseEndpointCoprocessor class
    4. HTablePool
    5. Connection Handling
  8. 5. Client API: Administrative Features
    1. Schema Definition
      1. Tables
      2. Table Properties
      3. Column Families
    2. HBaseAdmin
      1. Basic Operations
      2. Table Operations
      3. Schema Operations
      4. Cluster Operations
      5. Cluster Status Information
  9. 6. Available Clients
    1. Introduction to REST, Thrift, and Avro
    2. Interactive Clients
      1. Native Java
      2. REST
        1. Operation
        2. Supported formats
          1. Plain (text/plain)
          2. XML (text/xml)
          3. JSON (application/json)
          4. Protocol Buffer (application/x-protobuf)
          5. Raw binary (application/octet-stream)
        3. REST Java client
      3. Thrift
        1. Installation
        2. Operation
        3. Example: PHP
      4. Avro
        1. Installation
        2. Operation
      5. Other Clients
    3. Batch Clients
      1. MapReduce
        1. Native Java
        2. Clojure
      2. Hive
      3. Pig
      4. Cascading
    4. Shell
      1. Basics
      2. Commands
        1. General
        2. Data definition
        3. Data manipulation
        4. Tools
        5. Replication
      3. Scripting
    5. Web-based UI
      1. Master UI
        1. Main page
        2. User Table page
        3. ZooKeeper page
      2. Region Server UI
        1. Main page
      3. Shared Pages
  10. 7. MapReduce Integration
    1. Framework
      1. MapReduce Introduction
      2. Classes
        1. InputFormat
        2. Mapper
        3. Reducer
        4. OutputFormat
      3. Supporting Classes
      4. MapReduce Locality
      5. Table Splits
    2. MapReduce over HBase
      1. Preparation
        1. Static Provisioning
        2. Dynamic Provisioning
      2. Data Sink
      3. Data Source
      4. Data Source and Sink
      5. Custom Processing
  11. 8. Architecture
    1. Seek Versus Transfer
      1. B+ Trees
      2. Log-Structured Merge-Trees
    2. Storage
      1. Overview
      2. Write Path
      3. Files
        1. Root-level files
        2. Table-level files
        3. Region-level files
        4. Region splits
        5. Compactions
      4. HFile Format
      5. KeyValue Format
    3. Write-Ahead Log
      1. Overview
      2. HLog Class
      3. HLogKey Class
      4. WALEdit Class
      5. LogSyncer Class
      6. LogRoller Class
      7. Replay
        1. Single log
        2. Log splitting
        3. Edits recovery
      8. Durability
    4. Read Path
    5. Region Lookups
    6. The Region Life Cycle
    7. ZooKeeper
    8. Replication
      1. Life of a Log Edit
        1. Normal processing
        2. Non-Responding slave clusters
      2. Internals
        1. Choosing region servers to replicate to
        2. Keeping track of logs
        3. Reading, filtering, and sending edits
        4. Cleaning logs
        5. Region server failover
  12. 9. Advanced Usage
    1. Key Design
      1. Concepts
      2. Tall-Narrow Versus Flat-Wide Tables
      3. Partial Key Scans
      4. Pagination
      5. Time Series Data
      6. Time-Ordered Relations
    2. Advanced Schemas
    3. Secondary Indexes
    4. Search Integration
    5. Transactions
    6. Bloom Filters
    7. Versioning
      1. Implicit Versioning
      2. Custom Versioning
  13. 10. Cluster Monitoring
    1. Introduction
    2. The Metrics Framework
      1. Contexts, Records, and Metrics
      2. Master Metrics
      3. Region Server Metrics
      4. RPC Metrics
      5. JVM Metrics
      6. Info Metrics
    3. Ganglia
      1. Installation
        1. Ganglia-related steps
          1. Ganglia monitoring daemon
          2. Ganglia meta daemon
          3. Ganglia web frontend
        2. HBase-related steps
      2. Usage
    4. JMX
      1. JConsole
      2. JMX Remote API
    5. Nagios
  14. 11. Performance Tuning
    1. Garbage Collection Tuning
    2. Memstore-Local Allocation Buffer
    3. Compression
      1. Available Codecs
        1. Snappy
        2. LZO
        3. GZIP
      2. Verifying Installation
        1. Compression test tool
        2. Startup check
      3. Enabling Compression
    4. Optimizing Splits and Compactions
      1. Managed Splitting
      2. Region Hotspotting
      3. Presplitting Regions
    5. Load Balancing
    6. Merging Regions
    7. Client API: Best Practices
    8. Configuration
    9. Load Tests
      1. Performance Evaluation
      2. YCSB
  15. 12. Cluster Administration
    1. Operational Tasks
      1. Node Decommissioning
      2. Rolling Restarts
      3. Adding Servers
        1. Pseudodistributed mode
          1. Adding a local backup master
          2. Adding a local region server
        2. Fully distributed cluster
          1. Adding a backup master
          2. Adding a region server
    2. Data Tasks
      1. Import and Export Tools
      2. CopyTable Tool
      3. Bulk Import
        1. Bulk load procedure
        2. Using the importtsv tool
        3. Using the completebulkload Tool
        4. Advanced usage
      4. Replication
    3. Additional Tasks
      1. Coexisting Clusters
      2. Required Ports
    4. Changing Logging Levels
    5. Troubleshooting
      1. HBase Fsck
      2. Analyzing the Logs
      3. Common Issues
        1. Basic setup checklist
          1. File handles
          2. DataNode connections
          3. Compression
          4. Garbage collection/memory tuning
        2. Stability issues
          1. ZooKeeper problems
          2. “Could not obtain block” errors
  16. A. HBase Configuration Properties
  17. B. Road Map
    1. HBase 0.92.0
    2. HBase 0.94.0
  18. C. Upgrade from Previous Releases
    1. Upgrading to HBase 0.90.x
      1. From 0.20.x or 0.89.x
      2. Within 0.90.x
    2. Upgrading to HBase 0.92.0
  19. D. Distributions
    1. Cloudera’s Distribution Including Apache Hadoop
  20. E. Hush SQL Schema
  21. F. HBase Versus Bigtable
  22. Index
  23. About the Author
  24. Colophon
  25. Copyright