Unit 16: Processing Texts in Natural Languages

As a rule of thumb, roughly 80% of all potentially usable data is unstructured. This includes audio, video, and images (all beyond the scope of this book), as well as texts written in natural languages.[12] A text in a natural language has no tags, no delimiters, and no data types, but it may still be a rich source of information. We may want to know whether (and how often) certain words are used in the text (word and sentence tokenization), what kind of text it is (text classification), whether it conveys a positive or negative message (sentiment analysis), who or what is mentioned in the text (entity extraction), and so on. We can read and process a text or two with our own eyes, but massive ...
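To make the first of these tasks concrete, here is a minimal sketch of word and sentence tokenization followed by a word count. It uses only the Python standard library and is not necessarily the approach this unit goes on to take. The function names `sentence_tokenize` and `word_tokenize` and the `sample` text are made up for illustration, and the regular expressions are deliberately naive: they will mishandle abbreviations, numbers, and non-English letters.

```python
import re
from collections import Counter


def sentence_tokenize(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_tokenize(text):
    """Lowercase the text and extract runs of ASCII letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


# Hypothetical sample text, used only for illustration.
sample = "The cat sat on the mat. The dog did not sit! Did the cat mind?"

sentences = sentence_tokenize(sample)
word_counts = Counter(word_tokenize(sample))

print(len(sentences))       # number of sentences found
print(word_counts["the"])   # how often "the" occurs
print(word_counts["cat"])   # how often "cat" occurs
```

Lowercasing before counting treats "The" and "the" as the same word, which is usually what a frequency count needs. A dedicated natural language toolkit would normally replace both regular expressions for real-world text.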
