Mining the Social Web, 2nd Edition

Name: Mining the Social Web, 2nd Edition
Author: Matthew A. Russell
ISBN: 9781449367619

October 2013

Beginner to intermediate

444 pages

12h 45m

English

O'Reilly Media, Inc.

Dedication
Preface
    README.1st
    Managing Your Expectations
    Python-Centric Technology
    Improvements Specific to the Second Edition
    Conventions Used in This Book
    Using Code Examples
    Safari® Books Online
    How to Contact Us
    Acknowledgments for the Second Edition
    Acknowledgments from the First Edition
I. A Guided Tour of the Social Web
Prelude
1. Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking About, and More
    Overview
    Why Is Twitter All the Rage?
    Exploring Twitter’s API
    Fundamental Twitter Terminology
    Creating a Twitter API Connection
    Exploring Trending Topics
    Searching for Tweets
    Analyzing the 140 Characters
    Extracting Tweet Entities
    Analyzing Tweets and Tweet Entities with Frequency Analysis
    Computing the Lexical Diversity of Tweets
    Examining Patterns in Retweets
    Visualizing Frequency Data with Histograms
    Closing Remarks
    Recommended Exercises
    Online Resources
2. Mining Facebook: Analyzing Fan Pages, Examining Friendships, and More
    Overview
    Exploring Facebook’s Social Graph API
    Understanding the Social Graph API
    Understanding the Open Graph Protocol
    Analyzing Social Graph Connections
    Analyzing Facebook Pages
    Analyzing this book’s Facebook page
    Analyzing Coke vs Pepsi Facebook pages
    Examining Friendships
    Analyzing things your friends “like”
    Analyzing mutual friendships with directed graphs
    Visualizing directed graphs of mutual friendships
    Closing Remarks
    Recommended Exercises
    Online Resources
3. Mining LinkedIn: Faceting Job Titles, Clustering Colleagues, and More
    Overview
    Exploring the LinkedIn API
    Making LinkedIn API Requests
    Downloading LinkedIn Connections as a CSV File
    Crash Course on Clustering Data
    Clustering Enhances User Experiences
    Normalizing Data to Enable Analysis
    Normalizing and counting companies
    Normalizing and counting job titles
    Normalizing and counting locations
    Visualizing locations with cartograms
    Measuring Similarity
    Clustering Algorithms
    Greedy clustering
    Runtime analysis
    Hierarchical clustering
    k-means clustering
    Visualizing geographic clusters with Google Earth
    Closing Remarks
    Recommended Exercises
    Online Resources
4. Mining Google+: Computing Document Similarity, Extracting Collocations, and More
    Overview
    Exploring the Google+ API
    Making Google+ API Requests
    A Whiz-Bang Introduction to TF-IDF
    Term Frequency
    Inverse Document Frequency
    TF-IDF
    Querying Human Language Data with TF-IDF
    Introducing the Natural Language Toolkit
    Applying TF-IDF to Human Language
    Finding Similar Documents
    The theory behind vector space models and cosine similarity
    Clustering posts with cosine similarity
    Visualizing document similarity with a matrix diagram
    Analyzing Bigrams in Human Language
    Contingency tables and scoring functions
    Reflections on Analyzing Human Language Data
    Closing Remarks
    Recommended Exercises
    Online Resources
5. Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More
    Overview
    Scraping, Parsing, and Crawling the Web
    Breadth-First Search in Web Crawling
    Discovering Semantics by Decoding Syntax
    Natural Language Processing Illustrated Step-by-Step
    Sentence Detection in Human Language Data
    Document Summarization
    Analysis of Luhn’s summarization algorithm
    Entity-Centric Analysis: A Paradigm Shift
    Gisting Human Language Data
    Quality of Analytics for Processing Human Language Data
    Closing Remarks
    Recommended Exercises
    Online Resources
6. Mining Mailboxes: Analyzing Who’s Talking to Whom About What, How Often, and More
    Overview
    Obtaining and Processing a Mail Corpus
    A Primer on Unix Mailboxes
    Getting the Enron Data
    Converting a Mail Corpus to a Unix Mailbox
    Converting Unix Mailboxes to JSON
    Importing a JSONified Mail Corpus into MongoDB
    The MongoDB shell
    Programmatically Accessing MongoDB with Python
    Analyzing the Enron Corpus
    Querying by Date/Time Range
    Analyzing Patterns in Sender/Recipient Communications
    Writing Advanced Queries
    Searching Emails by Keywords
    Discovering and Visualizing Time-Series Trends
    Analyzing Your Own Mail Data
    Accessing Your Gmail with OAuth
    Fetching and Parsing Email Messages with IMAP
    Visualizing Patterns in GMail with the “Graph Your Inbox” Chrome Extension
    Closing Remarks
    Recommended Exercises
    Online Resources

7. Mining GitHub: Inspecting Software Collaboration Habits, Building Interest Graphs, and More
    Overview
    Exploring GitHub’s API
    Creating a GitHub API Connection
    Making GitHub API Requests
    Modeling Data with Property Graphs
    Analyzing GitHub Interest Graphs
    Seeding an Interest Graph
    Computing Graph Centrality Measures
    Extending the Interest Graph with “Follows” Edges for Users
    Application of centrality measures
    Adding more repositories to the interest graph
    Computational Considerations
    Using Nodes as Pivots for More Efficient Queries
    Visualizing Interest Graphs
    Closing Remarks
    Recommended Exercises
    Online Resources
8. Mining the Semantically Marked-Up Web: Extracting Microformats, Inferencing over RDF, and More
    Overview
    Microformats: Easy-to-Implement Metadata
    Geocoordinates: A Common Thread for Just About Anything
    Using Recipe Data to Improve Online Matchmaking
    Retrieving recipe reviews
    Accessing LinkedIn’s 200 Million Online Résumés
    From Semantic Markup to Semantic Web: A Brief Interlude
    The Semantic Web: An Evolutionary Revolution
    Man Cannot Live on Facts Alone
    Open-world versus closed-world assumptions
    Inferencing About an Open World
    Closing Remarks
    Recommended Exercises
    Online Resources
II. Twitter Cookbook
9. Twitter Cookbook
    Accessing Twitter’s API for Development Purposes (Problem, Solution, Discussion)
    Doing the OAuth Dance to Access Twitter’s API for Production Purposes (Problem, Solution, Discussion)
    Discovering the Trending Topics (Problem, Solution, Discussion)
    Searching for Tweets (Problem, Solution, Discussion)
    Constructing Convenient Function Calls (Problem, Solution, Discussion)
    Saving and Restoring JSON Data with Text Files (Problem, Solution, Discussion)
    Saving and Accessing JSON Data with MongoDB (Problem, Solution, Discussion)
    Sampling the Twitter Firehose with the Streaming API (Problem, Solution, Discussion)
    Collecting Time-Series Data (Problem, Solution, Discussion)
    Extracting Tweet Entities (Problem, Solution, Discussion)
    Finding the Most Popular Tweets in a Collection of Tweets (Problem, Solution, Discussion)
    Finding the Most Popular Tweet Entities in a Collection of Tweets (Problem, Solution, Discussion)
    Tabulating Frequency Analysis (Problem, Solution, Discussion)
    Finding Users Who Have Retweeted a Status (Problem, Solution, Discussion)
    Extracting a Retweet’s Attribution (Problem, Solution, Discussion)
    Making Robust Twitter Requests (Problem, Solution, Discussion)
    Resolving User Profile Information (Problem, Solution, Discussion)
    Extracting Tweet Entities from Arbitrary Text (Problem, Solution, Discussion)
    Getting All Friends or Followers for a User (Problem, Solution, Discussion)
    Analyzing a User’s Friends and Followers (Problem, Solution, Discussion)
    Harvesting a User’s Tweets (Problem, Solution, Discussion)
    Crawling a Friendship Graph (Problem, Solution, Discussion)
    Analyzing Tweet Content (Problem, Solution, Discussion)
    Summarizing Link Targets (Problem, Solution, Discussion)
    Analyzing a User’s Favorite Tweets (Problem, Solution, Discussion)
    Closing Remarks
    Recommended Exercises
    Online Resources
III. Appendixes
A. Information About This Book’s Virtual Machine Experience
B. OAuth Primer
    Overview
    OAuth 1.0A
    OAuth 2.0
C. Python and IPython Notebook Tips & Tricks
Index
About the Author
Colophon
Copyright

Content preview from Mining the Social Web, 2nd Edition

Chapter 6. Mining Mailboxes: Analyzing Who’s Talking to Whom About What, How Often, and More

Mail archives are arguably the ultimate kind of social web data and the basis of the earliest online social networks. Mail data is ubiquitous, and each message is inherently social, involving conversations and interactions among two or more people. Furthermore, each message consists of human language data that’s inherently expressive, and is laced with structured metadata fields that anchor the human language data in particular timespans and unambiguous identities. Mining mailboxes certainly provides an opportunity to synthesize all of the concepts you’ve learned in previous chapters and opens up incredible opportunities for discovering valuable insights.

Whether you are the CIO of a corporation and want to analyze corporate communications for trends and patterns, have a keen interest in mining online mailing lists for insights, or would simply like to explore your own mailbox for patterns as part of quantifying yourself, the following discussion provides a primer to help you get started. This chapter introduces some fundamental tools and techniques for exploring mailboxes to answer questions such as:

Who sends mail to whom (and how much/often)?
Is there a particular time of the day (or day of the week) when the most mail chatter happens?
Which people send the most messages to one another?
What are the subjects of the liveliest discussion threads?
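
The first two questions above can be approximated with nothing more than message headers. The following is a minimal sketch, assuming Python 3 and only the standard library; it is not the approach the chapter itself develops. The sample messages, addresses, and the function names `sender_recipient_counts` and `messages_by_hour` are hypothetical and exist only for illustration.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical sample messages, invented purely for illustration.
RAW_MESSAGES = [
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "Date: Mon, 01 Oct 2001 09:15:00 -0700\n"
    "Subject: Budget\n\nDraft attached.",
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Date: Tue, 02 Oct 2001 09:40:00 -0700\n"
    "Subject: Re: Budget\n\nAny feedback?",
    "From: bob@example.com\n"
    "To: alice@example.com\n"
    "Date: Tue, 02 Oct 2001 14:05:00 -0700\n"
    "Subject: Re: Budget\n\nLooks fine.",
]


def sender_recipient_counts(raw_messages):
    """Count how many messages each sender addressed to each recipient."""
    counts = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = [addr.lower() for _, addr in
                   getaddresses(msg.get_all("From", []))]
        # Treat both To and Cc addresses as recipients.
        recipients = [addr.lower() for _, addr in
                      getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))]
        for sender in senders:
            for recipient in recipients:
                counts[(sender, recipient)] += 1
    return counts


def messages_by_hour(raw_messages):
    """Count messages per hour of day, using the hour stated in the Date header."""
    hours = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        if msg["Date"]:  # skip messages without a Date header
            hours[parsedate_to_datetime(msg["Date"]).hour] += 1
    return hours


if __name__ == "__main__":
    for (sender, recipient), n in sender_recipient_counts(RAW_MESSAGES).most_common():
        print(sender, "->", recipient, n)
    print(dict(messages_by_hour(RAW_MESSAGES)))
```

Counting on (sender, recipient) pairs rather than on senders alone keeps the direction of each exchange, which is what the "who sends mail to whom" question asks about. The hour is taken as written in each message's own time zone offset; normalizing to a single zone would be a separate design decision.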

Although social media sites are racking up petabytes ...


Publisher Resources

ISBN: 9781449368180
