Customizing Apache Lucene
Apache Lucene is a mature and very powerful search library. First written in 1999, it has since been widely adopted, and many extensions have been built on top of it.
Still, sometimes the built-in NLP capabilities of Lucene are not enough, and a specialized NLP library is needed.
For example, if we want to include part-of-speech (POS) tags along with tokens, or to find named entities, we need something such as Stanford CoreNLP. Including such an external, specialized NLP library in the Lucene workflow is not very difficult, and here we will see how to do it.
Let's use the StanfordNLP library and the tokenizer we have implemented in the previous section. We can call it StanfordNlpTokenizer ...
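As a rough illustration of what such a tokenizer might look like, here is a minimal sketch of a Lucene `Tokenizer` that delegates tokenization to a Stanford CoreNLP pipeline. The class name `StanfordNlpTokenizer` follows the text above; the exact implementation in the previous section may differ, so treat this as an assumption-laden sketch rather than the book's code:

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.Properties;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Sketch: a Lucene Tokenizer backed by Stanford CoreNLP tokenization.
// The class name is taken from the text; details are illustrative.
public final class StanfordNlpTokenizer extends Tokenizer {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final StanfordCoreNLP pipeline;
    private Iterator<CoreLabel> tokens;

    public StanfordNlpTokenizer() {
        Properties props = new Properties();
        // Only tokenization here; add "pos" or "ner" to the annotator
        // list to expose POS tags or named entities downstream.
        props.setProperty("annotators", "tokenize");
        this.pipeline = new StanfordCoreNLP(props);
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        // Read the whole input and run CoreNLP over it up front.
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = input.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        CoreDocument doc = new CoreDocument(sb.toString());
        pipeline.annotate(doc);
        tokens = doc.tokens().iterator();
    }

    @Override
    public boolean incrementToken() {
        if (tokens == null || !tokens.hasNext()) {
            return false;
        }
        clearAttributes();
        CoreLabel token = tokens.next();
        termAtt.setEmpty().append(token.word());
        offsetAtt.setOffset(correctOffset(token.beginPosition()),
                            correctOffset(token.endPosition()));
        return true;
    }
}
```

Note the design trade-off: the input is fully buffered and annotated in `reset()`, because CoreNLP works on whole documents while Lucene consumes tokens one at a time via `incrementToken()`. This is the simplest bridge between the two models, at the cost of holding the document in memory.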