CHAPTER 4 Document Content and Characterization
AUTHORSHIP ANALYTICS: EARLY TEXT INDICATORS AND MEASURES
Text analytics has long been used to resolve questions of author attribution, an area that predates modern higher-dimensional computational methods. One of the earliest examples is the work of Mosteller and Wallace [i] to identify the authorship of 12 disputed essays in The Federalist Papers. Between 1787 and 1788, Alexander Hamilton, John Jay, and James Madison wrote 85 expositions, or essays, designed to help get the US Constitution ratified, and published them anonymously under the pseudonym “Publius.” The authorship of some of The Federalist Papers was ambiguous because both Hamilton and Madison produced lists that claimed some of the same papers.
Function Words as Indicators
Initially, the authors used sentence lengths (number of words per sentence) to distinguish authorship. Later researchers, for example Fung [ii], used function words (FW) (Table 4.1) to form a hyperplane separating two sets of documents. This latter technique is also used by one of the psychological researchers [iii] taken up later in this chapter.
Fung's approach is based on computing the relative frequencies (number of occurrences per 1000 words of text) of the 70 function words listed in Table 4.1. Each record in the training data set consists of the author as the target field followed by 70 predictor fields, one for the relative frequency of each function word. The Fung analysis resulted in a hyperplane ...
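The relative-frequency computation described above can be sketched in a few lines of Python. This is a minimal illustration, not Fung's implementation: the word list is a short hypothetical stand-in for the 70 function words of Table 4.1, the tokenizer is a simple regular expression, and the names relative_frequencies and training_row are assumptions introduced here.

```python
import re
from collections import Counter

# Hypothetical stand-in: a short illustrative list, NOT the 70 words of Table 4.1.
FUNCTION_WORDS = ["a", "by", "of", "the", "to", "upon"]


def relative_frequencies(text, function_words=FUNCTION_WORDS, per=1000):
    """Return the number of occurrences of each function word per `per` words."""
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {word: 0.0 for word in function_words}
    counts = Counter(tokens)
    return {word: per * counts[word] / total for word in function_words}


def training_row(author, text, function_words=FUNCTION_WORDS):
    """Build one record: the target field (author) followed by one
    predictor field per function word."""
    freqs = relative_frequencies(text, function_words)
    return [author] + [freqs[word] for word in function_words]


# Usage with a made-up label and sentence, for illustration only.
row = training_row("author_A", "the dog and the cat")
print(row)
```

With the full word list from Table 4.1, each record would hold 70 predictor fields, and the rows for the undisputed papers would form the training set from which a separating hyperplane is fitted.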