In this episode of the Data Show, I spoke with Geoffrey Bradway, VP of engineering at Numerai, a new hedge fund that relies on contributions from external data scientists. The company hosts regular competitions where data scientists submit machine learning models for classification tasks. The most promising submissions are then added to an ensemble of models that the company uses to trade in real-world financial markets.
To minimize model redundancy, Numerai filters out entries whose signals are already well covered by existing models in its ensemble. The company also plans to use (Ethereum) blockchain technology to develop an incentive system that rewards models that do well on live data, rather than models that overfit and only do well on historical data.
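The episode does not describe how Numerai measures redundancy, so the following is only a minimal sketch of one common approach: reject a submission whose predictions are highly correlated with a model already in the ensemble. The function names, the toy prediction vectors, and the 0.95 threshold are all hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length prediction vectors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def is_redundant(candidate, ensemble, threshold=0.95):
    """Assumed rule: a candidate is redundant if it is almost perfectly
    (anti-)correlated with any model already in the ensemble."""
    return any(abs(pearson(candidate, member)) >= threshold for member in ensemble)

# Toy predictions on the same five examples (made-up numbers).
ensemble = [
    [0.10, 0.40, 0.35, 0.80, 0.60],
    [0.90, 0.20, 0.50, 0.30, 0.70],
]
near_copy = [0.12, 0.41, 0.33, 0.79, 0.62]  # nearly identical to the first model
novel = [0.50, 0.90, 0.10, 0.40, 0.20]      # a different signal

print(is_redundant(near_copy, ensemble))
print(is_redundant(novel, ensemble))
```

A production filter would more likely score a submission by how much it improves the ensemble as a whole; pairwise correlation is just the simplest stand-in for "already well covered."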
Here are some highlights from our conversation:
Coordinating data science and AI in finance
At Numerai, we believe there are other people in the world who are better data scientists than we are, but we have the financial backgrounds. So, we can take very good financial data and actually encrypt it in such a way that the structure is preserved enough to do machine learning on it, but you can't tell what it is. Because we can do that, we can actually release our data set in data science tournaments, and then users can download our data, which just looks like a giant CSV with a bunch of features and targets. Then they can train their own models, try to predict what will happen in the future, and upload that to our website.
We're trying to set it up in such a way where our users don't have to know a lot about finance, and they can just be very, very good data scientists. They can leave much of the financial data munging up to us.
... A big problem we were running into is that we were sort of paying off users based on how well they did on a backtest. The problem with that is, you can overfit to your backtests. You do a model; it scores you; it tells you how well you do. Then you slightly tweak your model, it scores you better, and you can keep doing that until you get really, really good on a backtest. But that just destroys your ability to generalize into the future.
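The tweak-and-rescore loop described here can be simulated in a few lines. This is a toy sketch, not Numerai's setup: the features and targets are pure coin flips, so no model can have real predictive power, and each "tweak" is simply switching to a different feature. The sizes and the seed are arbitrary assumptions.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
N_FEATURES, N_BACKTEST, N_LIVE = 200, 100, 1000

def make_rows(n):
    """Rows of (random binary features, random binary target): no real signal."""
    return [
        ([rng.randint(0, 1) for _ in range(N_FEATURES)], rng.randint(0, 1))
        for _ in range(n)
    ]

def accuracy(feature_idx, rows):
    """Score the 'model' that predicts the target equals one chosen feature."""
    hits = sum(features[feature_idx] == target for features, target in rows)
    return hits / len(rows)

backtest = make_rows(N_BACKTEST)
live = make_rows(N_LIVE)

# Tweak the model repeatedly, keeping a tweak only if the backtest score improves.
best_idx = 0
best_score = accuracy(best_idx, backtest)
for idx in range(1, N_FEATURES):
    score = accuracy(idx, backtest)
    if score > best_score:
        best_idx, best_score = idx, score

backtest_acc = best_score
live_acc = accuracy(best_idx, live)
print(f"backtest accuracy: {backtest_acc:.2f}")
print(f"live accuracy:     {live_acc:.2f}")
```

The selected model looks clearly better than chance on the backtest even though there is nothing to learn, and its accuracy on the fresh "live" rows falls back toward 50%.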
... What we wanted to do is create a mechanism that actually makes it irrational for users to want to overfit. That took the form of a cryptocurrency that we call the Numeraire. The idea is, you get this token that has some value, and you can use that to essentially stake your predictions. So, you can say: ‘Hey, I think these predictions are really good.’ If your predictions do turn out to be good, in the sense that they perform better than random on live data, we'll give you your tokens back, and we'll give you some additional payout.
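A rough sketch of why staking changes the incentive. The conversation only says that a stake is returned with an additional payout when predictions beat random on live data; the forfeited stake on failure, the 10% payout rate, and the function names below are assumptions for illustration.

```python
def settle_stake(stake, live_score, random_baseline=0.5, payout_rate=0.1):
    """Tokens returned to the user once live results are known.

    Assumptions: a stake that fails to beat the baseline is forfeited,
    and the payout rate is a made-up 10%.
    """
    if live_score > random_baseline:
        return stake + stake * payout_rate
    return 0.0

def expected_profit(stake, p_beats_random, payout_rate=0.1):
    """Expected gain or loss in tokens under the same assumptions."""
    win = p_beats_random * stake * payout_rate
    loss = (1 - p_beats_random) * stake
    return win - loss

# An overfit model is roughly a coin flip on live data; a model that
# generalizes beats random far more often.
print(expected_profit(100, p_beats_random=0.50))
print(expected_profit(100, p_beats_random=0.95))
```

Under these assumptions, staking on a model that only looks good on a backtest has a negative expected return, so a rational user stakes only on predictions they expect to hold up on live data.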
Applications of differential privacy and adaptive data analysis
This business model has many potential applications, basically, in any setting where the data is very, very sensitive. This could be finance data, this could be health care data, but this could also just be internal corporate data. For example, Amazon wouldn't necessarily want to open source their logistics data, because that's very, very valuable for them. That's something that gives them an edge. So, it applies to anything where it would be fantastic to have models from many sources, but sharing the data is hard.
There is a related subfield of computer science called differential privacy, which deals with how you release a data set so that nobody can mine it for sensitive information. And then there's a related field that has to do with adaptive data analysis, where you run a model, you get feedback, you run another model based off of that feedback, and you keep doing that. How can you make those statistically sound?
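For readers new to differential privacy, the textbook starting point is the Laplace mechanism: answer a counting query with noise of scale 1/epsilon added, so that adding or removing any one record changes the output distribution only slightly. The sketch below illustrates that standard mechanism on made-up records; it is not a description of how Numerai obscures its data, and the record fields and epsilon value are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(records, predicate, epsilon, rng):
    """Epsilon-differentially-private count.

    A counting query has sensitivity 1 (one record changes the count by
    at most 1), so the Laplace noise scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1 / epsilon, rng)

rng = random.Random(42)
# Hypothetical sensitive records: (patient_id, has_condition)
records = [(i, i % 4 == 0) for i in range(1000)]

noisy = private_count(records, lambda r: r[1], epsilon=0.5, rng=rng)
print(f"noisy count: {noisy:.1f} (true count is 250)")
```

A smaller epsilon means more noise and stronger privacy. Repeated queries consume privacy budget, which is one reason the adaptive setting described above is hard to make statistically sound.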
What Kaggle has learned from almost a million data scientists: Strata Data conference keynote by Kaggle co-founder, Anthony Goldbloom
Data preparation in the age of deep learning: featuring Crowdflower co-founder Lukas Biewald