In this episode of the Data Show, I spoke with Grace Huang, data science lead at Pinterest. With its combination of a large social graph, enthusiastic users, and multimedia data, I’ve long regarded Pinterest as a fascinating lab for data science. Huang described the challenge of building a sustainable content ecosystem and shared lessons from the front lines of machine learning product launches. We also discussed recommenders, the emergence of deep learning as a technique used within Pinterest, and the role of data science within the company.
Here are some highlights from our conversation:
Using machine learning to strengthen content ecosystems
Pinterest content is a giant, complicated corpus that has very rich metadata associated with it. If you build a recommendation system with a lot of bias in it, over time you can start showing just a particular corner of that corpus to the world, because you think your users might find that corner of content particularly engaging. This is an issue when you're basing your algorithms only on your existing users.
When Pinterest first started out, we had a very strong user base around particular user demographics. That part of the content corpus becomes very well curated, which makes those content pieces rank really high in our machine learning products. Then we had to start consciously thinking about how to combat that problem because otherwise, over time, you're just going to build a product that only appeals to that segment of users.
From the user perspective, you want to make sure you're creating a corpus that covers enough in terms of topics and interests, in terms of different languages people speak, in terms of different cultural backgrounds. Then, I think on the content side, we have the same problem where fresher, newer content may have trouble competing with older content that's been around for a long time and has really good historical performance.
Maintaining this healthy ecosystem involves creating mechanisms to jump-start new content so we can show it enough times to quickly learn whether or not it's high quality, and whether or not it might be relevant for certain segments of users. We then want to be able to use that information very efficiently to drive our downstream products.
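One generic way to picture a jump-start mechanism is an exploration bonus in the ranking score. The sketch below is an illustration only, not Pinterest's method: it swaps in the standard UCB1 bandit formula, and the function name, the `bonus_weight` parameter, and the item counts are all hypothetical. Items with few impressions get a large bonus, so they are shown enough to estimate their quality, and the bonus shrinks as evidence accumulates.

```python
import math

def exploration_score(clicks, impressions, total_impressions, bonus_weight=1.0):
    """Observed engagement rate plus a UCB1-style bonus that shrinks with impressions."""
    if impressions == 0:
        # Never-shown items rank first so they receive at least one impression.
        return float("inf")
    rate = clicks / impressions
    bonus = bonus_weight * math.sqrt(2 * math.log(total_impressions) / impressions)
    return rate + bonus

# Hypothetical items mapped to (clicks, impressions); the numbers are made up.
items = {
    "old_popular_pin": (900, 10_000),
    "new_pin": (2, 10),
    "unseen_pin": (0, 0),
}
total = sum(impressions for _, impressions in items.values())

# Rank by score: fresh content gets surfaced ahead of the long-established item.
ranked = sorted(
    items,
    key=lambda name: exploration_score(*items[name], total),
    reverse=True,
)
print(ranked)
```

With these made-up counts, the unseen and new items rank ahead of the established one. As the new item accumulates impressions its bonus decays, and it keeps its position only if its observed engagement rate holds up.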
Building data products: Three anti-patterns
The first one is, do not build a model only for your users today. You have to think about your users tomorrow as well. Second, it's really easy to build a system where the rich get richer. There are a lot of techniques out there to prevent that from happening; it's often not by design. It's very subtle, and it takes a long time for this rich-get-richer effect to build up and become observable. You have to be very vigilant about it. ... The third anti-pattern is that you might find yourself optimizing not quite the right thing. You can get exactly what you wish for with a machine learning system. It's very good at optimizing a goal that you specify. But that goal may not necessarily correlate with the ultimate goal. Keeping your ultimate goal in mind and evaluating your products against the ultimate goal, instead of your intermediate goal, is really important. For example, I think short-term metrics are easier to optimize toward. But they may or may not correlate with a long-term goal like retention.
Related resources:
Peeking into the black box: Lessons from the front lines of machine-learning product launches: A 2017 Strata Data Conference keynote by Grace Huang
Recommending 1+ billion items to 100+ million users in real time—Harnessing the structure of the user-to-object graph to extract ranking signals at scale: A 2017 Strata Data Conference presentation by Pinterest’s chief scientist, Jure Leskovec
When is data science a house of cards? Replicating data science conclusions: A 2017 Strata Data Conference presentation by Frances Haugen and June Andrews of Pinterest
Data preparation in the age of deep learning: Featuring Crowdflower co-founder Lukas Biewald