Chapter 5. Common Developer Tasks for Kudu

At its core, Apache Kudu is a resilient, distributed, fault-tolerant storage engine for structured data. Its APIs are designed to be simple to understand, so that moving data into Kudu and getting it back out is easy and efficient.

As a developer, you have several options for interacting with the data you store in Kudu. Client-side APIs are provided for the following programming languages:

  • C++

  • Java

  • Python
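As a rough illustration of what working with one of these clients looks like, here is a minimal sketch using the Python client (the kudu-python package). It assumes a Kudu master is reachable; the host name `kudu.master`, the table name `python-example`, and the column names are placeholders chosen for this example, not values from this chapter. The client library is imported inside the function so that the file can be loaded on a machine without kudu-python installed.

```python
from datetime import datetime


def write_and_read(master_host="kudu.master", master_port=7051,
                   table_name="python-example"):
    """Create a table, insert one row, and scan it back.

    All names here are hypothetical; running this requires a live Kudu
    cluster and the kudu-python package.
    """
    # Imported lazily so this module loads even without kudu-python.
    import kudu
    from kudu.client import Partitioning

    # Connect to the Kudu master.
    client = kudu.connect(host=master_host, port=master_port)

    # Define a two-column schema with a single-column primary key.
    builder = kudu.schema_builder()
    builder.add_column("key").type(kudu.int64).nullable(False).primary_key()
    builder.add_column("ts_val", type_=kudu.unixtime_micros, nullable=False)
    schema = builder.build()

    # Hash-partition the table on the primary key.
    partitioning = Partitioning().add_hash_partitions(
        column_names=["key"], num_buckets=3)
    client.create_table(table_name, schema, partitioning)

    # Writes go through a session and are sent on flush().
    table = client.table(table_name)
    session = client.new_session()
    session.apply(table.new_insert({"key": 1, "ts_val": datetime.utcnow()}))
    session.flush()

    # Read every row back as a list of tuples.
    scanner = table.scanner()
    return scanner.open().read_all_tuples()
```

Calling `write_and_read()` against a test cluster should return the single row that was inserted. The C++ and Java clients follow the same overall pattern: connect, define a schema, write through a session, and read through a scanner.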

Compute frameworks such as MapReduce and Spark can also interact with Kudu. MapReduce, through the Java client, has a native Kudu input format, whereas the Spark API provides a specialized Kudu Context along with deep integration with Spark SQL.

Providing SQL access to Kudu is a natural fit, given that Kudu stores data in a structured, strongly typed fashion. You can therefore access and manipulate your data not only through Spark SQL but also through Apache Impala. Impala is an open source, native analytic database for Hadoop that ships with multiple Hadoop distributions. It, too, provides a clean abstraction over tables, which can reside in Kudu, the Hadoop Distributed File System (HDFS), HBase, or cloud-based object stores such as Amazon Simple Storage Service (Amazon S3).
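To make the SQL angle concrete, the following sketch builds the kind of `CREATE TABLE ... STORED AS KUDU` statement that Impala accepts for a Kudu-backed table. The helper `kudu_create_table_sql`, the table name `metrics`, and its columns are hypothetical names introduced for illustration. The sketch only produces the statement text; executing it would require an Impala connection, which is not shown here.

```python
def kudu_create_table_sql(table, columns, primary_key, hash_partitions):
    """Build an Impala CREATE TABLE statement for a Kudu-backed table.

    columns is a list of (name, sql_type) pairs; primary_key is a list of
    column names. This helper is illustrative and does no SQL escaping.
    """
    # Kudu expects the primary key columns to come first in the schema.
    leading = [name for name, _ in columns[:len(primary_key)]]
    if leading != list(primary_key):
        raise ValueError("primary key columns must be listed first")

    col_defs = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    keys = ", ".join(primary_key)
    return (
        f"CREATE TABLE {table} (\n"
        f"  {col_defs},\n"
        f"  PRIMARY KEY ({keys})\n"
        f")\n"
        f"PARTITION BY HASH ({keys}) PARTITIONS {hash_partitions}\n"
        f"STORED AS KUDU"
    )


# Hypothetical table: strongly typed columns with a compound primary key.
ddl = kudu_create_table_sql(
    "metrics",
    [("host", "STRING"), ("ts", "BIGINT"), ("value", "DOUBLE")],
    ["host", "ts"],
    4,
)
print(ddl)
```

The explicit column types and the required primary key reflect the structured, strongly typed storage model described above, which is what makes a SQL table abstraction map so directly onto Kudu.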

In this chapter, we dive into the various client-side APIs, including Spark, and then round out the chapter by discussing how Impala’s integration with Kudu supports many types of use cases. ...
