© Ed Elliott 2021
E. Elliott, Introducing .NET for Apache Spark, https://doi.org/10.1007/978-1-4842-6992-3_8

8. Batch Mode Processing

Ed Elliott (Sussex, UK)

In this chapter, we will look at how to write a typical batch processing data pipeline using .NET for Apache Spark. We will show how a typical data processing job reads the source data, parses it (including dealing with any oddities the source files may have), and then writes the data out in a common format that other consumers of the data can use.
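As a rough illustration of the read, parse, and write shape described above, the following is a minimal sketch in plain Python using only the standard library. It does not use the .NET for Apache Spark API, and it is not the pipeline built in this chapter. The sample data, the oddities it contains (a preamble line, a short row, and a trailing total line), the function names, and the choice of JSON Lines as the common output format are all assumptions made for the example.

```python
import csv
import io
import json


def clean_rows(raw_text, expected_columns):
    """Parse CSV text and separate well-formed rows from rejected ones.

    Lines before the expected header are treated as preamble and ignored.
    Rows after the header with the wrong number of cells are rejected.
    """
    reader = csv.reader(io.StringIO(raw_text))
    header = None
    good, rejected = [], []
    for row in reader:
        # Skip blank lines (csv.reader yields an empty list for them).
        if not row or all(cell.strip() == "" for cell in row):
            continue
        cells = [cell.strip() for cell in row]
        if header is None:
            # Keep scanning until the expected header row is found.
            if cells == expected_columns:
                header = cells
            continue
        if len(cells) != len(header):
            rejected.append(cells)
            continue
        good.append(dict(zip(header, cells)))
    return good, rejected


def to_json_lines(records):
    """Serialise records to JSON Lines, one object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


# Hypothetical source file contents, invented for this sketch.
SAMPLE = """Report generated by a hypothetical export tool

id,name,amount
1, Alice ,10.50
2,Bob
3,Carol,7.25
Total rows: 3
"""

records, rejected = clean_rows(SAMPLE, ["id", "name", "amount"])
output = to_json_lines(records)
print(output)
print("rejected:", rejected)
```

Keeping the rejected rows rather than silently dropping them is a design choice in this sketch: it makes it possible to report or inspect what the tidy-up step discarded.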

Imperfect Source Data

The files we get from data sources are rarely in perfect condition for processing; we often have to do some work to tidy the data, and in the example we will use in this chapter, this is as ...
