Learning Apache Cassandra - Second Edition

Book Description

Build a scalable, fault-tolerant and highly available data layer for your applications using Apache Cassandra

About This Book

  • Install Cassandra and set up multi-node clusters
  • Design rich schemas that capture the relationships between different data types
  • Master the advanced features available in Cassandra 3.x through a step-by-step tutorial and build a scalable, high performance database layer

Who This Book Is For

If you are a NoSQL developer who is new to Apache Cassandra and wants to learn its common as well as not-so-common features, this book is for you. Alternatively, a developer wanting to enter the world of NoSQL will find this book useful.

It does not assume any prior experience with coding or with any framework.

What You Will Learn

  • Install Cassandra
  • Create keyspaces and tables with multiple clustering columns to organize related data
  • Use secondary indexes and materialized views to avoid denormalization of data
  • Effortlessly handle concurrent updates with collection columns
  • Ensure data integrity with lightweight transactions and logged batches
  • Understand eventual consistency and use the right consistency level for your situation
  • Understand data distribution with Cassandra
  • Develop a simple application using the Java driver and implement application-level optimizations

In Detail

Cassandra is a distributed database that stands out thanks to its robust feature set and intuitive interface, while providing the high availability and scalability of a distributed data store. This book will introduce you to the rich feature set offered by Cassandra, and empower you to create and manage a highly scalable, performant and fault-tolerant database layer.

The book starts by explaining the new features implemented in Cassandra 3.x and gets you set up with Cassandra. Then you'll walk through data modeling in Cassandra and the rich feature set available to design a flexible schema. Next you'll learn to create tables with composite partition keys, collections and user-defined types, and get to know different methods to avoid denormalization of data. You will then proceed to create user-defined functions and aggregates in Cassandra. Then, you will set up a multi-node cluster and see how the dynamics of Cassandra change with it. Finally, you will implement some application-level optimizations using a Java client.

By the end of this book, you'll be fully equipped to build powerful, scalable Cassandra database layers for your applications.

Style and approach

This book takes a step-by-step approach to give you basic to intermediate knowledge of Apache Cassandra. Every concept is explained in depth and is supplemented with practical examples when required.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

  1. Preface
    1. What this book covers
    2. What you need for this book
    3. Who this book is for
    4. Conventions
    5. Reader feedback
    6. Customer support
      1. Downloading the example code
      2. Downloading the color images of this book
      3. Errata
      4. Piracy
      5. Questions
  2. Getting Up and Running with Cassandra
    1. What is big data?
    2. Challenges of modern applications
    3. Why not relational databases?
    4. How to handle big data
    5. What is Cassandra and why Cassandra?
      1. Horizontal scalability
      2. High availability
      3. Write optimization
      4. Structured records
      5. Secondary indexes
      6. Materialized views
      7. Efficient result ordering
      8. Immediate consistency
      9. Discretely writable collections
      10. Relational joins
    6. MapReduce and Spark
      1. Rich and flexible data model
      2. Lightweight transactions
      3. Multidata center replication
      4. Comparing Cassandra to the alternatives
    7. Installing Cassandra
    8. Installing the JDK
    9. Installing on Debian-based systems (Ubuntu)
    10. Installing on RHEL-based systems
    11. Installing on Windows
    12. Installing on Mac OS X
    13. Installing the binary tarball
    14. Bootstrapping the project
    15. CQL—the Cassandra Query Language
    16. Interacting with Cassandra
    17. Getting started with CQL
      1. Creating a keyspace
      2. Selecting a keyspace
      3. Creating a table
      4. Inserting and reading data
    18. New features in Cassandra 2.2, 3.0, and 3.X
    19. Summary
  3. The First Table
    1. How to configure keyspaces
    2. Creating the users table
      1. Structuring of tables
      2. Table and column options
      3. The type system
        1. Strings
        2. Integers
        3. Floating point and decimal numbers
        4. Timestamp
        5. UUIDs
        6. Booleans
        7. Blobs
        8. Collections
        9. Other data types
        10. The purpose of types
    3. Inserting data
      1. Writing data does not yield feedback
      2. Partial inserts
    4. Selecting data
      1. Missing rows
      2. Selecting more than one row
      3. Retrieving all the rows
        1. Paginating through results
      4. Inserts are always upserts
    5. Developing a mental model for Cassandra
    6. Summary
  4. Organizing Related Data
    1. A table for status updates
      1. Creating a table with a compound primary key
      2. The structure of the status updates table
        1. UUIDs and timestamps
    2. Working with status updates
      1. Extracting timestamps
      2. Looking up a specific status update
      3. Automatically generating UUIDs
    3. Anatomy of a compound primary key
      1. Anatomy of a single-column primary key
    4. Beyond two columns
      1. Multiple clustering columns
      2. Composite partition keys
        1. Composite partition key table
        2. Structure of composite partition key tables
        3. Composite partition key with multiple clustering columns
    5. Compound keys represent parent-child relationships
    6. Coupling parents and children using static columns
      1. Defining static columns
        1. Working with static columns
      2. Interacting only with the static columns
        1. Static-only inserts
        2. Static columns act like predefined joins
        3. When to use static columns
    7. Refining our mental model
    8. Summary
  5. Beyond Key-Value Lookup
    1. Looking up rows by partition
    2. The limits of the WHERE keyword
      1. Restricting by clustering column
      2. Restricting by part of a partition key
    3. Retrieving status updates for a specific time range
      1. Creating time UUID ranges
      2. Selecting a slice of a partition
    4. Paginating over rows in a partition
      1. Counting rows
    5. Reversing the order of rows
      1. Reversing clustering order at query time
      2. Reversing clustering order in the schema
      3. Limitations of ORDER BY
        1. ORDER BY summary
    6. Paginating over multiple partitions
    7. JSON support
      1. INSERT JSON
      2. SELECT JSON
    8. Building an autocomplete function
    9. Summary
  6. Establishing Relationships
    1. Modeling follow relationships
      1. Outbound follows
      2. Inbound follows
    2. Storing follow relationships
    3. Cassandra data modelling
      1. Conceptual data model (entity relationship model)
      2. Logical data model (query-driven design)
      3. Physical data model
    4. Denormalization
    5. Looking up follow relationships
    6. Unfollowing users
    7. Using secondary indexes to avoid denormalization
      1. The form of the single table
      2. Adding a secondary index
      3. Other uses of secondary indexes
      4. Limitations of secondary indexes
        1. Secondary indexes can only have one column
        2. Secondary indexes can only be tested for equality
        3. Secondary index lookup is not as efficient as primary key lookup
    8. Materialized views
      1. Adding a view
    9. Summary
  7. Denormalizing Data for Maximum Performance
    1. A normalized approach
      1. Generating the timeline
      2. Ordering and pagination
      3. Multiple partitions and read efficiency
    2. Partial denormalization
      1. Displaying the home timeline
      2. Read performance and write complexity
    3. Fully denormalizing the home timeline
      1. Creating a status update
      2. Displaying the home timeline
    4. Write complexity and data integrity
    5. Batching in Cassandra
      1. Logged batches
      2. Unlogged batches
      3. When to use unlogged batches
      4. Misuse of BATCH statements
    6. Summary
  8. Expanding Your Data Model
    1. Viewing a keyspace schema
    2. Viewing a table schema in cqlsh
    3. Adding columns to tables
    4. Deleting columns
    5. Updating the existing rows
      1. Updating multiple columns
      2. Updating multiple rows
    6. Removing a value from a column
      1. Missing columns in Cassandra
      2. Deleting specific columns
      3. Syntactic sugar for deletion
      4. Deleting table data (TRUNCATE)
      5. Deleting table/keyspace with schema (DROP)
    7. Inserts, updates, and upserts
      1. Inserts can overwrite existing data
      2. Checking before inserting isn't enough
      3. Another advantage of UUIDs
      4. Conditional inserts and lightweight transactions
      5. Updates can create new rows
      6. Optimistic locking with conditional updates
        1. Optimistic locking in action
        2. Optimistic locking and accidental updates
    8. Lightweight transactions and their cost
      1. When lightweight transactions aren't necessary
    9. Summary
  9. Collections, Tuples, and User-Defined Types
    1. The problem with concurrent updates
      1. Serializing the collection
      2. Introducing concurrency
    2. Collection columns and concurrent updates
      1. Defining collection columns
      2. Reading and writing sets
        1. Advanced set manipulation
        2. Removing values from a set
        3. Sets and uniqueness
        4. Collections and upserts
    3. Using lists for ordered, non-unique values
      1. Defining a list column
      2. Writing a list
      3. Discrete list manipulation
        1. Writing data at a specific index
        2. Removing elements from the list
    4. Using maps to store key-value pairs
      1. Writing a map
      2. Updating discrete values in a map
        1. Removing values from maps
    5. Collections in inserts
    6. Collections and secondary indexes
      1. Secondary indexes on map columns
    7. The limitations of collections
      1. Reading discrete values from collections
        1. Collection size limit
      2. Reading a collection column from multiple rows
      3. Unable to reuse collection names
      4. Performance of collection operations
    8. Working with tuples
      1. Creating a tuple column
      2. Writing to tuples
      3. Indexing tuples
    9. User-defined types
      1. Creating a user-defined type
      2. Assigning a user-defined type to a column
      3. Adding data to a user-defined column
      4. Indexing and querying user-defined types
      5. Partial selection of user-defined types
    10. Choosing between tuples and user-defined types
    11. Nested collections
    12. Nested tuples/UDTs
    13. Comparing data structures
    14. Summary
  10. Aggregating Time-Series Data
    1. Recording discrete analytics observations
      1. Using discrete analytics observations
      2. Slicing and dicing our data
    2. Recording aggregate analytics observations
      1. Answering the right question
      2. Precomputation versus read-time aggregation
      3. The many possibilities for aggregation
        1. The role of discrete observations
    3. Recording analytics observations
      1. Updating a counter column
      2. Counters and upserts
      3. Setting and resetting counter columns
      4. Counter columns and deletion
      5. Counter columns need their own table
    4. Cassandra configuration
      1. Configuration location
      2. Modifying configuration
      3. Restarting Cassandra
    5. User-defined functions
    6. User-defined aggregate functions
      1. Standard aggregate functions
    7. Summary
  11. How Cassandra Distributes Data
    1. Data distribution in Cassandra
      1. Cassandra's partitioning strategy - partition key tokens
        1. Distributing partition tokens
        2. Partitioners
        3. Partition keys group data on the same node
        4. Virtual nodes
        5. Virtual nodes facilitate redistribution
    2. Data replication in Cassandra
      1. Masterless replication
        1. Replication without a master
    3. Gossip protocol
    4. Multidata center cluster
      1. Snitch
      2. Replication strategy
      3. Durable writes
    5. Consistency
      1. Immediate and eventual consistency
      2. Consistency in Cassandra
        1. The anatomy of a successful request
      3. Tuning consistency
        1. Eventual consistency with ONE
        2. Immediate consistency with ALL
        3. Fault-tolerant immediate consistency with QUORUM
        4. Local consistency levels
      4. Comparing consistency levels
        1. Choosing the right consistency level
      5. The CAP theorem
    6. Handling conflicting data
      1. Last-write-wins conflict resolution
      2. Introspecting write timestamps
      3. Overriding write timestamps
    7. Distributed deletion
      1. Stumbling on tombstones
      2. Expiring columns with TTL
    8. Table configuration options
    9. Summary
  12. Cassandra Multi-Node Cluster
    1. 3-node cluster
      1. Prerequisites
      2. Tuning configuration options setting up a 3-node cluster
      3. Tuning configuration
        1. Cassandra.yaml
        2. Cassandra-env.sh
      4. Starting the 3-node cluster
    2. Consistency in action
      1. Write consistency
        1. Consistency QUORUM
        2. Consistency ANY
    3. Cassandra internals
      1. The write path
        1. Compaction
      2. The read path
    4. Cassandra repair mechanisms
      1. Hinted handoff
      2. Read repair
      3. Anti-entropy repair
    5. Summary
  13. Application Development Using the Java Driver
    1. A simple query
      1. Cluster API
      2. Getting metadata
      3. Querying
    2. Prepared statements
    3. QueryBuilder API
      1. Building an INSERT statement
      2. Building an UPDATE statement
      3. Building a SELECT statement
    4. Asynchronous querying
      1. Execute asynchronously
      2. Processing future results
    5. Driver policies
      1. Load-balancing policy
        1. RoundRobinPolicy
        2. DCAwareRoundRobinPolicy
        3. TokenAwarePolicy
      2. Retry Policy
    6. Summary
  14. Peeking under the Hood
    1. Using cassandra-cli
    2. The structure of a simple primary key table
      1. Exploring cells
      2. A model of column families: RowKey and cells
    3. Compound primary keys in column families
      1. A complete mapping
      2. The wide row data structure
      3. The empty cell
    4. Collection columns in column families
      1. Set columns in column families
      2. Map columns in column families
      3. List columns in column families
        1. Appending and prepending values to lists
      4. Other list operations
    5. Summary
  15. Authentication and Authorization
    1. Enabling authentication and authorization
      1. Authentication, authorization, and fault-tolerance
      2. Authentication with cqlsh
      3. Authentication in your application
    2. Setting up a user
      1. Changing a user's password
      2. Viewing user accounts
    3. Controlling access
      1. Viewing permissions
      2. Revoking access
    4. Authorization in action
      1. Authorization as a hedge against mistakes
    5. Security beyond authentication and authorization
      1. Security protects against vulnerabilities
    6. Summary
    7. Wrapping up