Natural Language Processing with Java - Second Edition

by Richard M. Reese, AshishSingh Bhatia
July 2018
Content level: Beginner to intermediate
318 pages
7h 49m
English
Packt Publishing

Using POI to extract text from Word documents

The Apache POI project (http://poi.apache.org/index.html) provides an API for extracting information from Microsoft Office documents. It is an extensive library that supports Word documents as well as other Office formats, such as Excel and Outlook. POI relies on XMLBeans (http://xmlbeans.apache.org/), so you will need that library in addition to the POI download. The binaries for XMLBeans can be downloaded from http://www.java2s.com/Code/Jar/x/Downloadxmlbeans524jar.htm. Our interest here is in demonstrating how to use POI to extract text from Word documents.
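
As a rough illustration of the kind of extraction described above, the following minimal sketch reads a .docx file and prints its plain text. It assumes that the POI OOXML JARs and their dependencies (including XMLBeans) are on the classpath. The class name WordTextExtractor and the helper method extractText are hypothetical names chosen for this example, and the XWPFDocument/XWPFWordExtractor route shown here is only one of several ways POI can be used; it is not necessarily the approach the book takes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

// Hypothetical example class; the name is not taken from the book.
public class WordTextExtractor {

    // Reads a .docx stream and returns the document's plain text.
    static String extractText(InputStream in) throws IOException {
        // Closing the document releases the underlying package.
        try (XWPFDocument document = new XWPFDocument(in)) {
            XWPFWordExtractor extractor = new XWPFWordExtractor(document);
            return extractor.getText();
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the file name used in the text; pass another path as an argument.
        String path = args.length > 0 ? args[0] : "TestDocument.docx";
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            System.out.println(extractText(in));
        }
    }
}
```

Taking an InputStream rather than a file name keeps the helper usable with files, network responses, or in-memory data alike.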

To demonstrate this, we will use a file called TestDocument.docx, which contains some text, tables, and other elements, as shown in the ...

ISBN: 9781788993494