To define the characteristics of the Spark dataframe, we will first take a small dataset, determine its basic statistical properties, and then build a Spark dataframe based on those properties.
The Pima Indians diabetes dataset contains the following attributes:
- Number of times pregnant
- Plasma glucose concentration at two hours in an oral glucose tolerance test
- Diastolic blood pressure (mm Hg)
- Triceps skin fold thickness (mm)
- 2-hour serum insulin (mu U/ml)
- Body mass index (weight in kg/(height in m)^2)
- Diabetes pedigree function
- Age (years)
- Developed diabetes (yes or no)
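As a rough illustration of the approach described above, the following sketch computes the mean and standard deviation of each numeric attribute from a few rows, then generates synthetic rows from those statistics. The column names, the helper functions, and the sample rows are all hypothetical; the rows are made-up placeholders, not records from the actual dataset. Drawing each column independently from a normal distribution is a simplifying assumption, not necessarily the method used later in this text.

```python
import random
import statistics

# Hypothetical short names for the nine attributes listed above.
COLUMNS = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
           "insulin", "bmi", "pedigree", "age", "outcome"]

# Made-up placeholder rows for illustration only, NOT real dataset records.
SAMPLE = [
    (1, 100.0, 70.0, 20.0, 80.0, 25.0, 0.30, 30, 0),
    (3, 140.0, 80.0, 30.0, 120.0, 32.0, 0.50, 45, 1),
    (0, 90.0, 60.0, 15.0, 60.0, 22.0, 0.20, 25, 0),
    (5, 160.0, 90.0, 35.0, 150.0, 35.0, 0.70, 50, 1),
]


def column_stats(rows):
    """Return {column: (mean, sample std dev)} for every column except the label."""
    stats = {}
    for i, name in enumerate(COLUMNS[:-1]):
        values = [row[i] for row in rows]
        stats[name] = (statistics.mean(values), statistics.stdev(values))
    return stats


def synthesize(stats, positive_rate, n, seed=0):
    """Generate n synthetic rows, assuming independent normal columns."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Clamp at zero, since none of these measurements can be negative.
        row = [max(0.0, rng.gauss(mean, std)) for mean, std in stats.values()]
        # Draw the yes/no label with the requested share of positive cases.
        row.append(1 if rng.random() < positive_rate else 0)
        rows.append(tuple(row))
    return rows


def to_spark_dataframe(rows):
    """Load the rows into a Spark dataframe (requires pyspark to be installed)."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pima-sketch").getOrCreate()
    return spark.createDataFrame(rows, schema=COLUMNS)
```

Calling `synthesize(column_stats(SAMPLE), positive_rate=0.5, n=1000)` and passing the result to `to_spark_dataframe` would yield a dataframe of any desired size. Because the columns are drawn independently, any correlation between attributes in the source data is lost.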
The dataset is publicly available, and several versions of it exist. We will ...