CHAPTER 3 Text Data Source Capture

TEXT MINING DATA SOURCE ASSEMBLY

Text document assembly is almost always embedded in a larger data assembly task. As we identify and tap into various text and data sources in a multilingual world, we typically need to verify and validate that we are reading the text data according to how it has been encoded. When we are dealing with English-language text, basic Latin script, and embedded content, we are dealing with compact representations that can be captured in standard ASCII. Increasingly, in an era of global formats, it is important to ensure that all types of textual data are read correctly, including formal-language logograms and more informal emoticons. Usually, this means setting the input encoding to something more robust than standard American ASCII, such as UTF-8.
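As a minimal sketch of this verification step, the Python fragment below tries a list of candidate encodings in order and reports which one succeeded. The function name `decode_checked`, the candidate list, and the sample string are illustrative assumptions, not part of the use case described in this chapter.

```python
def decode_checked(raw, encodings=("ascii", "utf-8")):
    """Decode bytes with the first candidate encoding that succeeds.

    Returns a (text, encoding_name) pair. The candidate list is an
    assumption for illustration; adjust it to the sources being read.
    """
    for enc in encodings:
        try:
            # Strict decoding raises an error instead of silently
            # replacing bytes that are invalid in this encoding.
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


# Plain English text fits in ASCII.
plain_text, plain_enc = decode_checked(b"plain text")

# Logograms and emoticons do not fit in ASCII, so the sketch falls
# through to UTF-8.
sample = "Text mining 文本挖掘 🙂".encode("utf-8")
text, used = decode_checked(sample)
print(plain_enc, used)
```

Decoding strictly, rather than ignoring or replacing bad bytes, makes an encoding mismatch visible at capture time instead of letting corrupted characters pass into later analysis.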

The inclusion of written text symbols in standard alphabetic and often pictographic form presents an additional layer of complexity beyond the collection of metric or numerical data.

Use Case: Accessing Text from SAS Conference Proceedings

There were two successive SAS conference presentations, in 2012 and 2013.[i] The 2013 presentation incorporates and extends the earlier results, and so serves as an instructive use case for applying text analytics approaches to capture and analyze text data. The 2013 use case deals with capturing and summarizing conference presentations from the entire history of SAS up to that time. Here we take a look at how the text data was captured ...
