The problem with the information gain model is that, for each term in the index, we have to evaluate its co-occurrence with every other term. For a vocabulary of `n` terms, the algorithm therefore runs in quadratic time, `O(n^2)`. Computing this on a single machine is impractical. What is recommended instead is that we create a map-reduce job and use a distributed Hadoop cluster to compute the information gain for each term in the index.

Our distributed Hadoop cluster would do the following:

- Count all occurrences of each term in the index
- Count all occurrences of each co-occurring term in the index
- Construct a hash table or a map of co-occurring terms
- Calculate the information gain for each term and store ...
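The steps above can be sketched locally in a single process before porting them to an actual map-reduce job. In this sketch, the "map" phase emits per-document term and term-pair counts, and the "reduce" phase is the summation performed by the counters. The scoring function uses a pointwise mutual information style formula as an illustrative stand-in for the information gain score; the corpus and all names here are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math

# Toy corpus; each document is a list of terms. In a real deployment these
# would be streamed out of the index by the Hadoop job's input format.
docs = [
    ["apache", "solr", "search"],
    ["apache", "hadoop", "cluster"],
    ["solr", "search", "index"],
    ["hadoop", "mapreduce", "cluster"],
]

# "Map" phase: for each document, emit (term, 1) for every distinct term
# and ((term, co_term), 1) for every distinct co-occurring pair.
term_counts = Counter()   # occurrences of each term (step 1)
pair_counts = Counter()   # occurrences of each co-occurring pair (steps 2-3)
for doc in docs:
    unique = sorted(set(doc))
    term_counts.update(unique)
    pair_counts.update(combinations(unique, 2))

# "Reduce" phase: the Counters above have already summed the emitted pairs;
# in Hadoop this summation happens in the reducers, keyed by term or pair.
n_docs = len(docs)

def information_gain(t1, t2):
    """Score the association between two terms (step 4). This uses a
    pointwise-mutual-information formula as an illustrative assumption,
    not necessarily the exact information-gain definition in the text."""
    p1 = term_counts[t1] / n_docs
    p2 = term_counts[t2] / n_docs
    p12 = pair_counts[tuple(sorted((t1, t2)))] / n_docs
    if p12 == 0:
        return 0.0
    return math.log2(p12 / (p1 * p2))

# Terms that always co-occur (here "hadoop" and "cluster") score highest.
print(information_gain("hadoop", "cluster"))
```

The quadratic cost shows up in the pair table: with `n` distinct terms there are up to `n * (n - 1) / 2` pairs, which is exactly why the text recommends distributing the counting across a Hadoop cluster rather than holding one giant hash table on a single machine.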
