Fang Yu on using data analytics to catch constantly evolving fraudsters

The O'Reilly Radar Podcast: Big data for security, challenges in fraud detection, and the growing complexity of fraudster behavior.

By Jenn Webb

December 1, 2016

Nonconformist silhouette. (source: Geralt on Pixabay)

This week, I sit down with Fang Yu, cofounder and CTO of DataVisor, where she focuses on big data for security. We talk about the current state of the fraud landscape, how fraudsters are evolving, and how data analytics and behavior analysis can help defend against—and prevent—attacks.

Here are some highlights from our chat:

Challenges in using supervised machine learning for fraud detection

In the past few years, machine learning has taken on a big role in fraud detection. There have been a number of supervised machine learning techniques and breakthroughs, especially for voice recognition, image recognition, etc. Machine learning can also be applied to detecting fraud, but it’s a little challenging because supervised machine learning needs labels. It needs to know what good users and bad users look like, and what good behavior and bad behavior are. The problem in many fraud cases is that attackers constantly evolve. Their patterns change very quickly, so in order to detect an attack, you need to know what they will do next.

That is ultimately hard, and in some cases, such as financial transactions, it is too late. For supervised machine learning, you will have a chargeback label from the bank because someone saw that their credit card was abused and called the bank. That’s how you get the label. But that happens well after the actual transaction takes place, sometimes even months later, and the damage is already done. And moving forward, by the time you have trained a model to prevent it from happening again, the attacker has already changed his or her behavior. Supervised machine learning is great, but when applied to security, you need a quicker and more customized solution.
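The label delay described above can be made concrete with a small sketch. It is only an illustration: the transaction records, field names, and dates are all hypothetical, and the helper `labels_known_at` is not from any library. It shows how few chargeback labels a model could actually see on a given training day.

```python
from datetime import date

def labels_known_at(transactions, as_of):
    """Return IDs of transactions whose chargeback label has arrived by `as_of`."""
    return [
        t["id"]
        for t in transactions
        if t["chargeback_date"] is not None and t["chargeback_date"] <= as_of
    ]

# Hypothetical fraudulent transactions; `chargeback_date` is when the bank
# reported the fraud (None means it has not been reported at all yet).
transactions = [
    {"id": "t1", "date": date(2016, 9, 1), "chargeback_date": date(2016, 10, 15)},
    {"id": "t2", "date": date(2016, 9, 3), "chargeback_date": date(2016, 11, 20)},
    {"id": "t3", "date": date(2016, 9, 5), "chargeback_date": None},
]

training_day = date(2016, 10, 31)
known = labels_known_at(transactions, training_day)

# Days between each transaction and the arrival of its label.
delays = [
    (t["chargeback_date"] - t["date"]).days
    for t in transactions
    if t["chargeback_date"] is not None
]

print(known)   # only the labels that arrived before the training day
print(delays)  # label latency in days
```

In this made-up data, all three transactions are fraudulent, yet only one of them carries a label on the training day, which is the gap the passage above describes.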

An unsupervised machine learning approach to identify sleeper cells

At DataVisor, we do things differently from the traditional rule-based or supervised machine learning-based approaches. We do unsupervised detection, which does not need labels. At a high level, today’s attackers do not use a single account to conduct fraud. With a single account, the fraud they can conduct is very limited. What they usually do is build an army of fraudulent accounts, through either mass registration or account takeovers, and then have each account commit a little fraud. They can do spamming, they can do phishing, they can do all types of different bad activities. But together, because they have many accounts, they conduct attacks at a massive scale.

At DataVisor, the approach we take is an unsupervised one. We do not look at individual users anymore. We look at all the users in a holistic view and uncover their correlations and linkages. We use graph analysis, clustering techniques, etc., to identify these fraud rings. We can identify them even before they have done anything, while they are still sleeping, so we call them “sleeper cells.”
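As a rough illustration of the idea of linking accounts rather than judging them one at a time, here is a minimal sketch that groups accounts sharing any attribute value and reports the connected groups. It is not DataVisor's method, which the interview does not detail. The account data, the attribute names (`ip`, `device`), the `min_size` threshold, and the function `find_account_clusters` are all assumptions made for the example.

```python
from collections import defaultdict

def find_account_clusters(accounts, min_size=3):
    """Group accounts that share any attribute value (connected components)."""
    parent = {account_id: account_id for account_id in accounts}

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Index accounts by each (attribute, value) pair they exhibit.
    by_value = defaultdict(list)
    for account_id, attrs in accounts.items():
        for key, value in attrs.items():
            by_value[(key, value)].append(account_id)

    # Accounts sharing a value are linked into the same component.
    for members in by_value.values():
        for other in members[1:]:
            union(members[0], other)

    clusters = defaultdict(set)
    for account_id in accounts:
        clusters[find(account_id)].add(account_id)

    # Keep only groups large enough to be worth a look, largest first.
    return sorted(
        (sorted(c) for c in clusters.values() if len(c) >= min_size),
        key=lambda c: (-len(c), c),
    )

# Hypothetical accounts: a1/a2 share an IP, a2/a3 share a device.
accounts = {
    "a1": {"ip": "10.0.0.1", "device": "d-1"},
    "a2": {"ip": "10.0.0.1", "device": "d-2"},
    "a3": {"ip": "10.0.0.9", "device": "d-2"},
    "a4": {"ip": "10.0.0.7", "device": "d-4"},
    "a5": {"ip": "10.0.0.8", "device": "d-5"},
}

suspicious = find_account_clusters(accounts)
print(suspicious)
```

Note that no account in the linked group has to have done anything bad yet; the group is surfaced purely from shared attributes. A real system would need far richer signals and some way to avoid flagging legitimate sharing, such as a household or office behind one IP address.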

The big payoff of fraudulent faking

Nowadays, we see fraud becoming pretty complex and even more lucrative. For example, e-commerce platforms often let users rate, like, and write reviews about products. All of these can be leveraged by fraudsters: they can write fake reviews and embed bad links in them to promote their own products, and they generate a lot of fake likes for the same purpose.

Now, we also see a new trend, going from the old days of fake impressions and fake clicks to actual fraudulent installs. For example, in the old days, when a gaming company had a new game coming out, it would purchase users to play the game; it would pay people something like $50 to play an Xbox game. Now, many of the games are free, but the companies need to drive installs to improve their rank in app stores. These gaming providers rely on app marketing, purchasing users from different media sources, which can be pretty expensive: a few dollars per install. So, the fraudsters have started to emulate users and download these games. They pretend to be media sources and cash in by just downloading and playing the games. That payoff is 400 times more than that of a fake click or impression.

The future of fraudsters and fraud detection

Fraudsters are evolving to look more like real users, and it’s becoming more difficult to detect them. We see them incubate for a long time. We see them using cloud services to circumvent IP blacklists. We see them skirting two-factor authentication. We see them opening apps, making purchases, and doing everything a real, normal user does. They are committing fraud at a huge scale across all industries, from banking and money laundering to social, and the payoff for them is equally massive. If they are evolving, we need to evolve, too. That’s why new methods, such as unsupervised machine learning, are so critical to staying ahead of the game.
