HBase: The Definitive Guide, 2nd Edition

Book Description

If you’re looking for a scalable storage solution to accommodate a virtually endless amount of data, this updated edition shows you how Apache HBase can meet your needs. Modeled after Google’s Bigtable architecture, HBase scales to billions of rows and millions of columns, while ensuring that write and read performance remain constant.

Fully revised for HBase 1.0, this second edition brings you up to speed on the new HBase client API, as well as security features and new case studies that demonstrate HBase use in the real world. Whether you have just started to evaluate this non-relational database or plan to put it into practice right away, this book has your back.

  • Launch into basic, advanced, and administrative features of HBase’s new client-facing API
  • Use new classes to integrate HBase with Hadoop’s MapReduce framework
  • Explore HBase’s architecture, including the storage format, write-ahead log, and background processes
  • Dive into advanced usage, such as extended client and server options
  • Learn cluster sizing, tuning, and monitoring best practices
  • Design schemas, copy tables, import bulk data, decommission nodes, and perform other tasks
  • Go deeper into HBase security, including Kerberos and encryption at rest

Table of Contents

  1. 1. Introduction
    1. The Dawn of Big Data
    2. The Problem with Relational Database Systems
    3. Nonrelational Database Systems, Not-Only SQL or NoSQL?
      1. Dimensions
      2. Scalability
      3. Database (De-)Normalization
    4. Building Blocks
      1. Backdrop
      2. Namespaces, Tables, Rows, Columns, and Cells
      3. Auto-Sharding
      4. Storage API
      5. Implementation
      6. Summary
    5. HBase: The Hadoop Database
      1. History
      2. Nomenclature
      3. Summary
  2. 2. Installation
    1. Quick-Start Guide
    2. Requirements
      1. Hardware
      2. Software
    3. Filesystems for HBase
      1. Local
      2. HDFS
      3. S3
      4. Other Filesystems
    4. Installation Choices
      1. Apache Binary Release
      2. Building from Source
    5. Run Modes
      1. Standalone Mode
      2. Distributed Mode
    6. Configuration
      1. hbase-site.xml and hbase-default.xml
      2. hbase-env.sh and hbase-env.cmd
      3. regionserver
      4. log4j.properties
      5. Example Configuration
      6. Client Configuration
    7. Deployment
      1. Script-Based
      2. Apache Whirr
      3. Puppet and Chef
    8. Operating a Cluster
      1. Running and Confirming Your Installation
      2. Web-based UI Introduction
      3. Shell Introduction
      4. Stopping the Cluster
  3. 3. Client API: The Basics
    1. General Notes
    2. Data Types and Hierarchy
      1. Generic Attributes
      2. Operations: Fingerprint and ID
      3. Query versus Mutation
      4. Durability, Consistency, and Isolation
      5. The Cell
      6. API Building Blocks
    3. CRUD Operations
      1. Put Method
      2. Get Method
      3. Delete Method
      4. Append Method
      5. Mutate Method
    4. Batch Operations
    5. Scans
      1. Introduction
      2. The ResultScanner Class
      3. Scanner Caching
      4. Scanner Batching
      5. Slicing Rows
      6. Load Column Families on Demand
      7. Scanner Metrics
    6. Miscellaneous Features
      1. The Table Utility Methods
      2. The Bytes Class
  4. 4. Client API: Advanced Features
    1. Filters
      1. Introduction to Filters
      2. Comparison Filters
      3. Dedicated Filters
      4. Decorating Filters
      5. FilterList
      6. Custom Filters
      7. Filter Parser Utility
      8. Filters Summary
    2. Counters
      1. Introduction to Counters
      2. Single Counters
      3. Multiple Counters
    3. Coprocessors
      1. Introduction to Coprocessors
      2. The Coprocessor Class Trinity
      3. Coprocessor Loading
      4. Endpoints
      5. Observers
      6. The ObserverContext Class
      7. The RegionObserver Class
      8. The MasterObserver Class
      9. The RegionServerObserver Class
      10. The WALObserver Class
      11. The BulkLoadObserver Class
      12. The EndPointObserver Class
  5. 5. Client API: Administrative Features
    1. Schema Definition
      1. Namespaces
      2. Tables
      3. Table Properties
      4. Column Families
    2. Cluster Administration
      1. Basic Operations
      2. Namespace Operations
      3. Table Operations
      4. Schema Operations
      5. Cluster Operations
      6. Cluster Status Information
    3. ReplicationAdmin
  6. 6. Available Clients
    1. Introduction
      1. Gateways
      2. Frameworks
    2. Gateway Clients
      1. Native Java
      2. REST
      3. Thrift
      4. Thrift2
      5. SQL over NoSQL
    3. Framework Clients
      1. MapReduce
      2. Hive
      3. Pig
      4. Cascading
      5. Other Clients
    4. Shell
      1. Basics
      2. Commands
      3. Scripting
    5. Web-based UI
      1. Master UI Status Page
      2. Master UI Related Pages
      3. Region Server UI Status Page
      4. Shared Pages
  7. 7. Hadoop Integration
    1. Framework
      1. MapReduce Introduction
      2. Processing Classes
      3. Supporting Classes
      4. MapReduce Locality
      5. Table Splits
    2. MapReduce over Tables
      1. Preparation
      2. Table as a Data Sink
      3. Table as a Data Source
      4. Table as both Data Source and Sink
      5. Custom Processing
    3. MapReduce over Snapshots
    4. Bulk Loading Data
  8. 8. Advanced Usage
    1. Key Design
      1. Concepts
      2. Tall-Narrow Versus Flat-Wide Tables
      3. Partial Key Scans
      4. Pagination
      5. Time Series Data
      6. Time-Ordered Relations
      7. Aging-out Regions
      8. Application-driven Replicas
    2. Advanced Schemas
    3. Secondary Indexes
    4. Search Integration
    5. Transactions
      1. Region-local Transactions
    6. Versioning
      1. Implicit Versioning
      2. Custom Versioning
  9. 9. Cluster Monitoring
    1. Introduction
    2. The Metrics Framework
      1. Metrics Building Blocks
      2. Configuration
      3. Metrics UI
      4. Master Metrics
      5. Region Server Metrics
      6. RPC Metrics
      7. UserGroupInformation Metrics
      8. JVM Metrics
    3. Ganglia
      1. Installation
      2. Usage
    4. JMX
      1. JConsole
      2. JMX Remote API
    5. Nagios
    6. OpenTSDB
  10. 10. Performance Tuning
    1. Heap Tuning
      1. Java Heap Sizing
      2. Tuning Heap Shares
    2. Garbage Collection Tuning
      1. Introduction
      2. Concurrent Mark Sweep (CMS)
      3. Garbage First (G1)
      4. Garbage Collection Information
    3. Memstore-Local Allocation Buffer
    4. HDFS Read Tuning
      1. Short-Circuit Reads
      2. Hedged Reads
    5. Block Cache Tuning
      1. Introduction
      2. Cache Types
      3. Single vs. Multi-level Caching
      4. Basic Cache Configuration
      5. Advanced Cache Configuration
      6. Cache Selection
    6. Compression
      1. Available Codecs
      2. Verifying Installation
      3. Enabling Compression
    7. Key Encoding
      1. Available Codecs
      2. Enabling Key Encoding
    8. Bloom Filters
    9. Region Split Handling
      1. Number of Regions
      2. Managed Splitting
      3. Region Hotspotting
      4. Presplitting Regions
    10. Merging Regions
      1. Online: Merge with API and Shell
      2. Offline: Merge Tool
    11. Region Ergonomics
    12. Compaction Tuning
      1. Compaction Settings
      2. Compaction Throttling
    13. Region Flush Tuning
    14. RPC Tuning
      1. RPC Scheduling
      2. Slow Query Logging
    15. Load Balancing
    16. Client API: Best Practices
    17. Configuration
    18. Load Tests
      1. Performance Evaluation
      2. Load Test Tool
      3. YCSB
  11. 11. Cluster Administration
    1. Operational Tasks
      1. Cluster Sizing
      2. Resource Management
      3. Bulk Moving Regions
      4. Node Decommissioning
      5. Draining Servers
      6. Rolling Restarts
      7. Adding Servers
      8. Reloading Configuration
      9. Canary & Health Checks
      10. Region Server Memory Pinning
      11. Cleaning an Installation
    2. Data Tasks
      1. Renaming a Table
      2. Import and Export Tools
      3. CopyTable Tool
      4. Export Snapshots
      5. Bulk Import
      6. Replication
    3. Additional Tasks
      1. Coexisting Clusters
      2. Required Ports
      3. Changing Logging Levels
      4. Region Replicas
    4. Troubleshooting
      1. HBase Fsck
      2. Analyzing the Logs
      3. Common Issues
      4. Tracing Requests
  12. A. Upgrade from Previous Releases
    1. Upgrading to HBase 0.90.x
      1. From 0.20.x or 0.89.x
      2. Within 0.90.x
    2. Upgrading to HBase 0.92.0
    3. Upgrading to HBase 0.98.x
    4. Migrate API to HBase 1.0.x
      1. Migrate Coprocessors to post HBase 0.96
      2. Migrate Custom Filters to post HBase 0.96