How can I bulk-load data from HDFS to Kudu using Apache Spark?

Learn how to pair two top-tier open source technologies to create scalable data engineering pipelines.

By Ryan Bosshart April 10, 2017 • 1 minute read

Screen from "How can I bulk-load data from HDFS to Kudu using Apache Spark?" (source: O'Reilly)

Apache Spark dominates the big data landscape with its ability to process data at scale and handle machine learning workloads. In this video, Ryan Bosshart explains how to pair Spark with Kudu, a storage layer in the Hadoop ecosystem, for easy, scalable data storage. All you need to follow along is IntelliJ IDEA and access to the Kudu Quickstart VM. Data architects and developers will be able to:

Use the Kudu-Spark module to move data between HDFS and Kudu.
Create a new Kudu table from a Spark SQL DataFrame.
Create data processing pipelines that transform data from a raw format into a processed, queryable one.
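
As a rough illustration of the first and third points, the following is a minimal PySpark sketch, not the code from the video. It assumes the target Kudu table already exists (table creation is not shown), that the kudu-spark module matching your Spark version is on the classpath, and that the input is a CSV file with a header row. The helper names `kudu_options` and `bulk_load` are hypothetical, as are any paths, master addresses, and table names you pass in.

```python
"""Sketch: bulk-load a CSV file from HDFS into an existing Kudu table with PySpark."""

# Data source name registered by the kudu-spark module
KUDU_FORMAT = "org.apache.kudu.spark.kudu"


def kudu_options(master: str, table: str) -> dict:
    """Build the two options the kudu-spark data source needs (hypothetical helper)."""
    return {"kudu.master": master, "kudu.table": table}


def bulk_load(spark, hdfs_path: str, master: str, table: str) -> None:
    """Read raw CSV from HDFS, apply a crude cleanup, and append the rows to Kudu.

    `spark` is an already-running SparkSession; the Kudu table is assumed to exist
    with column names and types that match the CSV.
    """
    raw = (
        spark.read
        .option("header", "true")       # first line holds column names
        .option("inferSchema", "true")  # let Spark guess column types
        .csv(hdfs_path)
    )

    # Drop every row that contains a null. This is deliberately blunt:
    # Kudu requires primary-key columns to be non-null, and a real pipeline
    # would apply more targeted cleaning here.
    cleaned = raw.dropna()

    (
        cleaned.write
        .format(KUDU_FORMAT)
        .options(**kudu_options(master, table))
        .mode("append")  # add rows to the existing table
        .save()
    )
```

With a SparkSession named `spark`, a call might look like `bulk_load(spark, "hdfs:///data/raw/events.csv", "quickstart.cloudera:7051", "events")`, where the path, master address, and table name are placeholders for your own environment. Keeping the read, transform, and write steps in one function makes it easy to swap in a more realistic transformation between the raw and processed stages.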

Continue learning Kudu with our Using Kudu with Apache Spark and Apache Flume course.

Ryan Bosshart is a Principal Systems Engineer at Cloudera, where he leads a specialized team focused on Hadoop ecosystem storage technologies such as HDFS, HBase, and Kudu. An architect and builder of large-scale distributed systems since 2006, Ryan is co-chair of the Twin Cities Spark and Hadoop User Group. He speaks about Hadoop technologies at conferences throughout North America and holds a degree in computer science from Augsburg College.

Post topics: Data
