Chapter 16. Optical Character Recognition

So far, we’ve dealt with writing stored as text data. However, a large portion of written data is stored as images, and to use this data we need to convert it to text. This problem is different from our other NLP problems, because our knowledge of linguistics won’t be as useful here. Recognizing characters isn’t the same as reading; it’s merely character recognition. It is a much less intentional activity than speaking or listening to speech. Fortunately, writing systems tend to use easily distinguishable characters, especially in print. This means that image recognition techniques should work well on images of printed text.

Optical character recognition (OCR) is the task of taking an image of written language (with characters) and converting it into text data. Modern solutions are neural-network based and essentially classify sections of an image as containing a particular character. These classifications are then mapped to a character or string of characters in the text data.
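To make the classify-then-map idea concrete, here is a minimal sketch in plain Python. It is not how a production OCR system works: a nearest-template matcher stands in for the neural network, the image is assumed to be already binarized and split into fixed-width cells, and the 3x5 glyph bitmaps and all function names are invented for illustration.

```python
# Toy 3x5 glyph templates; '#' is ink, '.' is background.
# These bitmaps are hypothetical and exist only for this illustration.
TEMPLATES = {
    "H": ["#.#",
          "#.#",
          "###",
          "#.#",
          "#.#"],
    "I": ["###",
          ".#.",
          ".#.",
          ".#.",
          "###"],
    "L": ["#..",
          "#..",
          "#..",
          "#..",
          "###"],
    "O": ["###",
          "#.#",
          "#.#",
          "#.#",
          "###"],
}

GLYPH_WIDTH = 3


def split_into_cells(image, width=GLYPH_WIDTH, gap=1):
    """Cut the image into fixed-width character cells separated by `gap` columns."""
    step = width + gap
    n_cols = len(image[0])
    return [[row[start:start + width] for row in image]
            for start in range(0, n_cols - width + 1, step)]


def distance(cell, template):
    """Count the pixels on which a cell and a template disagree (Hamming distance)."""
    return sum(a != b
               for cell_row, tmpl_row in zip(cell, template)
               for a, b in zip(cell_row, tmpl_row))


def classify(cell):
    """Return the label of the template closest to the cell."""
    return min(TEMPLATES, key=lambda label: distance(cell, TEMPLATES[label]))


def recognize(image):
    """Classify each section of the image, then map the labels to a string."""
    return "".join(classify(cell) for cell in split_into_cells(image))


# An image of "HI" in which one pixel of the H is noisy
# (the top-middle pixel has been flipped to ink).
image = [
    "###.###",
    "#.#..#.",
    "###..#.",
    "#.#..#.",
    "#.#.###",
]
print(recognize(image))
```

The noisy H still lands closest to the H template, which is why classification tolerates small imperfections in print. A real system would replace `classify` with a trained model and `split_into_cells` with a learned or heuristic segmentation step.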

Let’s talk about some of the possible inputs.

Kinds of OCR Tasks

There are several kinds of OCR tasks. The tasks differ in what kind of image is the input, what kind of writing is in the image, and what the target of the model is.

Images of Printed Text and PDFs to Text

Unfortunately, there are many systems that export their documents as images. Some will export as PDFs, but since there is such a wide variety of ways in which a document can be encoded into a PDF, PDFs may not be any easier to work with than images. ...

