© Ed Elliott 2021
E. Elliott, Introducing .NET for Apache Spark, https://doi.org/10.1007/978-1-4842-6992-3_8

8. Batch Mode Processing

Ed Elliott, Sussex, UK

In this chapter, we will look at how to write a typical batch processing data pipeline using .NET for Apache Spark. We will show how a typical data processing job reads the source data, parses it (including dealing with any oddities the source files may have), and then writes the data out in a common format that other consumers of the data can use.
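The read, parse, and write steps described above can be sketched in a few lines. This is a minimal illustration only: it is plain Python using the standard library, not .NET for Apache Spark and not the chapter's own code. The sample data, the column names (`id`, `name`, `amount`), the single junk preamble line, and the choice of JSON Lines as the "common format" are all assumptions made up for the example.

```python
import csv
import io
import json

# Hypothetical raw input: one junk preamble line, padded values, and an
# unparseable amount. None of this comes from the chapter's data set.
RAW = """exported by some system
id, name , amount
1, Alice , 10.50
2, Bob , n/a
3, Carol , 7
"""


def read_rows(text, skip_lines=1):
    """Read CSV text, skipping preamble lines that precede the header."""
    lines = text.splitlines()[skip_lines:]
    reader = csv.reader(lines)
    header = [h.strip() for h in next(reader)]
    for fields in reader:
        # Strip the stray whitespace around each value.
        yield dict(zip(header, (f.strip() for f in fields)))


def clean(rows):
    """Convert columns to proper types; drop rows that fail to parse."""
    for row in rows:
        try:
            yield {
                "id": int(row["id"]),
                "name": row["name"],
                "amount": float(row["amount"]),
            }
        except ValueError:
            continue  # an oddity in the source data, e.g. "n/a"


def write_json_lines(rows, out):
    """Write one JSON object per line and return how many were written."""
    count = 0
    for row in rows:
        out.write(json.dumps(row) + "\n")
        count += 1
    return count


def run_pipeline(text):
    """Run read -> clean -> write and return (row count, output text)."""
    out = io.StringIO()
    count = write_json_lines(clean(read_rows(text)), out)
    return count, out.getvalue()
```

Calling `run_pipeline(RAW)` writes the two rows that parse cleanly and silently drops the one with the bad amount. A real job would usually log or quarantine rejected rows rather than discard them, and would write to a columnar format rather than an in-memory string.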

Imperfect Source Data

When we work with data sources, the files are rarely in perfect condition for processing; we often have to do some work to tidy the data, and in the example we will use in this chapter, this is as ...
