Cassandra: The Definitive Guide

by Eben Hewitt

November 2010

Intermediate to advanced

328 pages

9h 38m

English

O'Reilly Media, Inc.

Content preview from Cassandra: The Definitive Guide

Chapter 12. Integrating Hadoop

Jeremy Hanna

As companies and organizations adopt technologies like Cassandra, they look for tools to run analytics and queries against their data. Cassandra's built-in query mechanisms, together with custom layers built on top of them, can do a great deal. However, there are also distributed tools in the community that can be adapted to work with Cassandra.

Hadoop seems to be the elephant in the room when it comes to open source big data frameworks. Its ecosystem includes an open source MapReduce implementation and higher-level analytics engines built on top of it, such as Pig and Hive. Thanks to members of both the Cassandra and Hadoop communities, Cassandra has gained some significant integration points with Hadoop and its analytics tools.

In this chapter, we explore how Cassandra and Hadoop fit together. First, we give a brief history of the Apache Hadoop project and go into how one can write MapReduce programs against data in Cassandra. From there, we cover integration with higher-level tools built on top of Hadoop: Pig and Hive. Once we have an understanding of these tools, we cover how a Cassandra cluster can be configured to run these analytics in a distributed way. Finally, we share a couple of use cases where Cassandra is being used alongside Hadoop to solve real-world problems.

What Is Hadoop?

If you’re already familiar with Hadoop, you can safely skip this section. If you haven’t had the pleasure, Hadoop (http://hadoop.apache.org) is a set of open ...