Mastering Apache Spark 2.x - Second Edition by Romeo Kienzler

Naive Bayes in practice

The first step is to choose some data that will be used for classification. We have chosen some data from the UK Government data website at http://data.gov.uk/dataset/road-accidents-safety-data.

The dataset is called Road Safety - Digital Breath Test Data 2013. It downloads as a zipped text file called DigitalBreathTestData2013.txt, which contains around half a million rows. The data looks as follows:

Reason,Month,Year,WeekType,TimeBand,BreathAlcohol,AgeBand,Gender
Suspicion of Alcohol,Jan,2013,Weekday,12am-4am,75,30-39,Male
Moving Traffic Violation,Jan,2013,Weekday,12am-4am,0,20-24,Male
Road Traffic Collision,Jan,2013,Weekend,12pm-4pm,0,20-24,Female
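
To make the column layout concrete, here is a minimal sketch that parses the sample rows shown above. It uses plain Python and the standard csv module rather than the book's Spark code, and the helper name read_rows is hypothetical. It also assumes that Year and BreathAlcohol are always whole numbers, which holds for the sample but has not been checked against the full file.

```python
import csv
import io

# The header line and the three sample rows quoted in the text.
SAMPLE = """\
Reason,Month,Year,WeekType,TimeBand,BreathAlcohol,AgeBand,Gender
Suspicion of Alcohol,Jan,2013,Weekday,12am-4am,75,30-39,Male
Moving Traffic Violation,Jan,2013,Weekday,12am-4am,0,20-24,Male
Road Traffic Collision,Jan,2013,Weekend,12pm-4pm,0,20-24,Female
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header names.

    Hypothetical helper for illustration only. Assumes the Year and
    BreathAlcohol columns hold integers, as they do in the sample.
    """
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["Year"] = int(row["Year"])
        row["BreathAlcohol"] = int(row["BreathAlcohol"])
    return rows


rows = read_rows(SAMPLE)
for row in rows:
    # Show the fields most relevant to a gender classification task.
    print(row["Gender"], row["Reason"], row["BreathAlcohol"])
```

To read the downloaded file instead of the inline sample, the same helper could be passed the contents of DigitalBreathTestData2013.txt once it has been unzipped.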

In order to classify the data, we have modified both the column ...
