Let us finally come back to our running example--building a search engine. In Chapter 7, Extreme Gradient Boosting, we created a ranking model, which we can use to reorder search results so that the most relevant pages rank higher.
In the previous chapter, Chapter 9, Scaling Data Science, we extracted a lot of text data from Common Crawl. Now we can put it all together--use Apache Lucene to index the data from Common Crawl, search its content, and rerank the best candidates with the XGBoost ranking model.
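Before diving into the details, here is a minimal sketch of the two Lucene steps involved--indexing documents and querying the index. The field names (`url`, `content`), the sample texts, and the index path are illustrative assumptions, not part of our pipeline yet:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneSketch {

    public static void main(String[] args) throws Exception {
        // index location -- an illustrative temporary path
        Path indexPath = Files.createTempDirectory("lucene-index");

        try (Directory dir = FSDirectory.open(indexPath)) {
            // Step 1: index a couple of toy "pages"
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(dir, config)) {
                writer.addDocument(page("http://example.com/ml",
                        "machine learning with gradient boosting"));
                writer.addDocument(page("http://example.com/cooking",
                        "a recipe for tomato soup"));
            }

            // Step 2: search the index with a free-text query
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                QueryParser parser = new QueryParser("content", new StandardAnalyzer());
                Query query = parser.parse("gradient boosting");

                for (ScoreDoc sd : searcher.search(query, 10).scoreDocs) {
                    Document hit = searcher.doc(sd.doc);
                    System.out.println(hit.get("url") + "\t" + sd.score);
                }
            }
        }
    }

    // helper: wrap a URL and its text into a Lucene Document
    private static Document page(String url, String text) {
        Document doc = new Document();
        // StringField: stored as-is, not tokenized -- good for identifiers
        doc.add(new StringField("url", url, Field.Store.YES));
        // TextField: analyzed and searchable full text
        doc.add(new TextField("content", text, Field.Store.YES));
        return doc;
    }
}
```

The ranking model will then take the candidates Lucene returns and reorder them; Lucene's own scores become just one of the model's features.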
We already know how to use Hadoop MapReduce to extract text information from Common Crawl. However, if you remember, our ranking model needs more than just text--apart from ...