I wrote my first Python program in 1996, and my most recent a few weeks ago, so I can appreciate how Python has advanced to cover a very broad range of computing tasks. I don't program much anymore, but in my work over the years (and yours too, if you do much coding), data manipulation has always played an important role.
You can't build and apply analytical models, manage transactions, craft a web experience, or carry out any other significant task without investing time and attention in data acquisition, cleansing, and structuring. Python is ideal for those tasks, as well as for model building and data analysis, and it is especially strong for natural language processing (NLP), a special interest of mine. Chances are, Python is a good fit for just about any data work that interests you.
This interview with Pythonista Katharine Jarmul focuses on data work. A couple of events provide context. Katharine is presenting a talk titled "How Machine Learning Changed Sentiment Analysis, or I Hate You, Computer 😉" at this year's Sentiment Analysis Symposium, July 12, 2016 in New York, following which she's offering a class, Learn Big Data Wrangling with Python, July 13-14, also in New York.
We'll get into Katharine's background in the course of the interview. I'll add now only that she's co-author of the O'Reilly book "Data Wrangling with Python," published earlier this year, and the video "Data Wrangling and Analysis with Python," just published this month.
Seth Grimes: Make some converts. Why should folks use Python for data work, and in particular for natural language processing and sentiment analysis?
Katharine Jarmul: It's actually pretty hard to argue against using Python for these tasks. With Google using Python (primarily) for TensorFlow, Parsey McParseface [SyntaxNet], and word2vec, as well as hundreds of startups and open source tools advancing machine learning, sentiment analysis, and NLP in Python, I'd love to hear a good argument against it as the language du jour. I love Python because it's easy to read, it has great math and science libraries, it's proven to be quite scalable, and the community is unbeatable.
SG: Your consulting work centers on market analysis, which involves data of varied types (text and numeric, and perhaps geospatial and time-based) from disparate sources. Do you have any special guidance on cleaning and mashing it all up in ways that make sense and produce justified insights?
KJ: I actually just gave a talk about this at PyData Berlin, which was an amazing conference. Data wrangling and data cleaning are the un-sexy bits of our daily work, and I wish more people were talking about them, since I think there's a lot of work we can do to make them less painful. For me, I generally use Python and Pandas to perform some of these tasks, but there are so many tools and techniques available. In preparation for my talk, I also read a lot of the latest research and academic papers on automating the data cleaning process via machine learning. To help move the technical side along, I'll be putting together a literature review on the topic, and hopefully we can start building some great open source tools to help us make line-by-line data cleaning a thing of the past.
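To make the line-by-line cleaning Katharine mentions concrete, here is a minimal sketch using Pandas. The column names and messy values are invented for illustration; this is the flavor of work, not anyone's actual pipeline:

```python
import pandas as pd

# Hypothetical raw data with typical problems: inconsistent
# capitalization, stray whitespace, and "n/a" strings standing in
# for missing numbers.
df = pd.DataFrame({
    "city": [" Berlin", "berlin ", "NEW YORK", None],
    "price": ["12.50", "n/a", "8", "19.99"],
})

# Normalize the text column: strip whitespace, unify capitalization.
df["city"] = df["city"].str.strip().str.title()

# Coerce numeric strings to floats; sentinels like "n/a" become NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

print(df)
print("mean price:", df["price"].mean())
```

With the bad values turned into proper NaN, aggregate functions like `mean()` skip them automatically instead of failing or silently counting garbage.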
SG: You're not like most of the technologists I encounter: you made a career change from journalism and public policy to data and analytics. What motivated the switch?
KJ: In my opinion, the distance between data for journalism and data for startups is actually quite small. When I was working at the Washington Post and USA TODAY, I was in charge of quite a few projects involving data wrangling and data munging, so those skills were shared between the two. At a startup, however, I usually had more autonomy to make technical decisions and to grow and learn more technical things, so for me it was a natural progression of my interest in the field.
SG: I assume your journalism and policy skills and experience have informed your approach to your current work. If that's the case, in what ways? Or are the disciplines really different for you?
KJ: I think my background in journalism helps when it comes to communication and reporting. Many times my clients aren't statisticians or data scientists; they want to know what the numbers mean. I had a few great professors in journalism school who taught me how to turn mathematical knowledge into clear, comprehensible writing. I now use those skills to work with my clients and make sure they understand the competitive landscape for their technology or startup.
SG: Your Sentiment Analysis Symposium presentation is titled "How Machine Learning Changed Sentiment Analysis, or I Hate You, Computer 😉." What species of machine learning do you see as applicable to sentiment problems, and which toolkits?
KJ: I myself am not a machine learning expert, nor do I use it often in my work. I am, of course, interested in the topic. As a Python developer, it's very easy to write 10 lines of code that "just work" using the amazing tools available, such as TensorFlow, scikit-learn, and Theano. There are even more I haven't had the chance to play with, so it's a great time to be in machine learning. Regarding sentiment analysis, I recommend taking a look at Spacy.io, run and primarily written by Matthew Honnibal. They already have some interesting training sets with informal text, and have some great resources on how to get started.
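Katharine's "10 lines of code that 'just work'" claim is easy to demonstrate with scikit-learn, one of the tools she names. This hedged sketch trains a toy sentiment classifier; the sentences and labels are invented, and a real model would need far more (and far messier) training data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set, purely for illustration.
texts = ["I love this", "great product, very happy",
         "I hate this", "terrible, very disappointed"]
labels = ["pos", "pos", "neg", "neg"]

# Bag-of-words features feeding a naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["what a great experience"]))
```

The pipeline handles tokenization, counting, and classification in one object, which is much of why these libraries feel like they "just work."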
SG: Are you also applying established techniques, stuff like the lexical analysis and parsing you get with NLTK?
KJ: Most of the toolkits I've used have these as a part of the library, yes.
SG: Certain emoji, including your winking emoji, most often negate rather than emphasize. You're communicating that "I Hate You, Computer" is ironic. Have you worked in emoji analytics yourself? In techniques aimed at understanding irony and sarcasm and the like? If not, is that stuff on your to-do list, or is it not important in the market analyses and other work you take on?
KJ: Again, I am more of an NLP user than a library creator. For my upcoming talk, I'll be interviewing Matthew Honnibal about what they have done with sense2vec, and it's pretty amazing. Emoji are just Unicode and, for that reason, can be parsed just like anything else. At a PyData user group talk in Berlin, Spacy.io was shown to recognize that a smiley face emoji is similar to other positive emoji faces. At the end of the day, text is parseable and emoji are just special code points, so I don't see why we aren't in an age where this is a (nearly) solved problem.
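Katharine's point that emoji are ordinary Unicode code points is easy to verify with Python's standard library alone. This small sketch inspects the winking emoji from the talk title; the beyond-BMP check is a rough heuristic for this example (some emoji, like ☺, live inside the Basic Multilingual Plane):

```python
import unicodedata

text = "I Hate You, Computer 😉"

# Each emoji is a regular character with a code point and an official
# name, so standard string tooling handles it like any other text.
# ord(ch) > 0xFFFF catches characters outside the Basic Multilingual
# Plane, where most (not all) emoji live.
emoji = [ch for ch in text if ord(ch) > 0xFFFF]
for ch in emoji:
    print(f"U+{ord(ch):X} {unicodedata.name(ch)}")  # → U+1F609 WINKING FACE
```

Because the winking face is just U+1F609, a tokenizer or word-vector model can treat it as one more token, which is exactly what makes emoji-aware sentiment analysis tractable.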
SG: I've learned to be wary when a coder uses the word "just." And I'll add that one of the most interesting talks at last year's sentiment symposium was Emojineering at Instagram, given by Instagram software engineer Thomas Dimson and covering semantic analysis of emoji use. What (else) is on your to-do list, to learn and to apply, when it comes to Python, data wrangling, machine learning, and NLP and sentiment analysis?
KJ: I'll be focusing on the intersection of data cleaning and machine learning. It's of interest to me, and I'm already chatting with some folks about what happens next in terms of open source libraries to use and possibly build. If you are also interested in these problems, feel free to reach out! I'm @kjam on Twitter and freenode, or reachable by email at katharine (at) kjamistan.com.
SG: Very cool. Thanks Katharine.