Exploration and preparation of the OCR dataset

According to the dataset description, glyphs are scanned into the computer with an OCR reader and then automatically converted into pixels. The 16 statistical attributes (in figure 2) are recorded on the computer as well. The concentration of black pixels across various areas of the box should provide a way to differentiate the 26 letters, using either OCR or a trained machine learning algorithm.

Recall that support vector machines (SVM), logistic regression, naive Bayes classifiers, and any other classification algorithms (along with their associated learners) require all the features to be numeric. LIBSVM allows you to use a sparse training dataset in an unconventional format. ...
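To make the format mentioned above concrete, here is a minimal sketch in plain Scala (no Spark dependency) of how one comma-separated record might be turned into a LIBSVM line. In LIBSVM format each line is a numeric label followed by `index:value` pairs with 1-based indices, and zero-valued features are left out. The object and function names, the `A` to `0` label encoding, and the input row are all hypothetical choices made for illustration; they are not taken from the dataset or from this chapter's own code.

```scala
object LibSvmSketch {

  // Assumed label encoding: 'A' -> 0, 'B' -> 1, ..., 'Z' -> 25.
  def letterToLabel(letter: Char): Int = {
    require(letter >= 'A' && letter <= 'Z', s"expected A-Z, got '$letter'")
    letter - 'A'
  }

  // Build one LIBSVM line: "<label> <index>:<value> ...".
  // Indices are 1-based and zero-valued features are omitted (sparse format).
  def toLibSvmLine(label: Int, features: Seq[Double]): String = {
    val pairs = features.zipWithIndex.collect {
      case (value, i) if value != 0.0 => s"${i + 1}:$value"
    }
    (label.toString +: pairs).mkString(" ")
  }

  // Convert a row such as "C,2,0,3" (letter first, numeric attributes after).
  def csvRowToLibSvm(row: String): String = {
    val fields = row.split(",").map(_.trim)
    val label = letterToLabel(fields.head.head)
    val features = fields.tail.map(_.toDouble).toSeq
    toLibSvmLine(label, features)
  }
}

object LibSvmSketchDemo extends App {
  // Made-up three-attribute row; a real record would carry all 16 attributes.
  println(LibSvmSketch.csvRowToLibSvm("C,2,0,3")) // prints "2 1:2.0 3:3.0"
}
```

Note that the second attribute, being zero, does not appear in the output at all. That omission is what makes the representation sparse.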
