Activities such as building your own natural language processor from scratch, venturing far beyond the typical usage of visualization libraries, and constructing just about anything state-of-the-art are not within the scope of this book. You’ll be really disappointed if you purchase this book because you want to do one of those things. However, even though it’s neither realistic nor our goal to capture the holy grail of text analytics or record matching in a mere few hundred pages, this book will still enable you to attain reasonable solutions to hard problems, apply those solutions to the social web as a domain, and have a lot of fun in the process. Taking a very active interest in these fascinating research areas could also be a great idea for you to consider. A short book like this one can’t do much beyond whetting your appetite and giving you enough insight to go out and start making a difference somewhere with your newfound passion for data hacking.
Maybe it’s obvious in this day and age, but another important item of note is that this book generally assumes that you’re connected to the Internet. This wouldn’t be a great book to take on vacation with you to a remote location, because it contains many hyperlinked references, and all of the code examples are hyperlinked directly to GitHub, a very social Git repository hosting service where the most up-to-date example code will always be available. The hope is that social coding will enhance collaboration between like-minded folks such as ourselves who want to work together to extend the examples and hack away at interesting problems. Hopefully, you’ll fork, extend, and improve the source, and maybe even make some new friends along the way. Readily accessible sources of online information such as API docs are also liberally hyperlinked, and it is assumed that you’d rather look them up online than rely on inevitably stale copies in this printed book.
The official GitHub repository that maintains the latest and greatest bug-fixed source code for this book is http://github.com/ptwobrussell/Mining-the-Social-Web. The official Twitter account for this book is @SocialWebMining.
This book is also not recommended if you need a reference that gets you up to speed on distributed computing platforms such as sharded MySQL clusters and Hadoop, or on NoSQL technologies such as Cassandra. We do use some less-than-conventional storage technologies such as CouchDB and Redis, but always within the context of running on a single machine, and because they work well for the problem at hand. However, it really isn’t that much of a stretch to port the examples into distributed technologies if you possess sufficient motivation and need the horizontal scalability. A strong recommendation is that you master the fundamentals and prove out your thesis in a slightly less complex environment before migrating to an inherently more complex distributed system, and then be ready to make major adjustments to your algorithms to make them performant once data access is no longer local. A good option to investigate if you want to go this route is Dumbo, a Python project for writing and running Hadoop programs. Stay tuned to this book’s Twitter account (@SocialWebMining) for extended examples that involve Dumbo.
This book provides no advice whatsoever about the legal ramifications of what you may decide to do with the data that’s made available to you from social networking sites, although it does sincerely attempt to comply with the letter and spirit of the terms governing the particular sites that are mentioned. It may seem unfortunate that many of the most popular social networking sites have licensing terms that prohibit the use of their data outside of their platforms, but at the moment, it’s par for the course. Most social networking sites are like walled gardens, but from their standpoint (and the standpoint of their investors) a lot of the value these companies offer currently relies on controlling the platforms and protecting the privacy of their users; it’s a tough balance to maintain and probably won’t be all sorted out anytime soon.
A final and much lesser caveat is that this book slightly favors a *nix environment, in that there are a select few visualizations that may give Windows users trouble. Whenever this is known to be a problem, however, advice is given on reasonable alternatives or workarounds, such as firing up a VirtualBox virtual machine to run the example in a Linux environment. Fortunately, this doesn’t come up often, and the few times it does you can safely skip those sections and move on without any substantive loss of reading enjoyment.
*nix is a term used to refer to a Linux/Unix environment, which is basically synonymous with non-Windows at this point in time.