June 2017
Beginner to intermediate
576 pages
15h 22m
English
Our strategy in this chapter is to start from a small, publicly available dataset (Pima Indians diabetes). We will perform some basic exploratory analysis on it, compute its key statistical properties, and then use those properties to simulate a much larger dataset to load into Spark. The key characteristics that we will use to generate this 'big data' will be: