Book description
Tika in Action is a hands-on guide to content mining with Apache Tika. The book's many examples and case studies draw on real-world experience in domains ranging from search engines to digital asset management and scientific data processing.
About the Technology
Tika is an Apache toolkit with built-in knowledge of file formats. Using Tika, your applications can discover and extract content from digital documents in almost any format, including exotic ones.
About the Book
Tika in Action is the ultimate guide to content mining using Apache Tika. You'll learn how to pull usable information from otherwise inaccessible sources, including internet media and file archives. This example-rich book teaches you to build and extend applications based on real-world experience with search engines, digital asset management, and scientific data processing. In addition to architectural overviews, you'll find detailed chapters on features like metadata extraction, automatic language detection, and custom parser development.
What's Inside
- Crack MS Word, PDF, HTML, and ZIP
- Integrate with search engines, CMS, and other data sources
- Learn through experimentation
- Many examples
About the Reader
This book requires no previous knowledge of Tika or text mining techniques. It assumes a working knowledge of Java.
About the Authors
Chris Mattmann is an information architect experienced in the construction of large data-intensive systems. Jukka Zitting is a core Tika developer, a member of the JCR expert group, and chairman of the Apache Jackrabbit project.
Quotes
By Tika's two main creators and maintainers.
- Jérôme Charron, WebPulse
Easily the most definitive guide to this great new text analysis toolkit.
- John Guthrie, SAP
An easy-to-read guide--plenty of technical content.
- Rick Wagner, Red Hat
There's not a single page of 'inaction' in the entire book!
- Sean Kelly, Technologist, NASA
Complete, practical, accurate
- Julien Nioche, DigitalPebble Ltd
Table of contents
- Copyright
- Dedication
- Brief Table of Contents
- Table of Contents
- Foreword
- Preface
- Acknowledgments
- About this Book
- About the Authors
- About the Cover Illustration
- Part 1. Getting started
- Chapter 1. The case for the digital Babel fish
- Chapter 2. Getting started with Tika
- Chapter 3. The information landscape
- Part 2. Tika in detail
- Chapter 4. Document type detection
- Chapter 5. Content extraction
- Chapter 6. Understanding metadata
- Chapter 7. Language detection
- Chapter 8. What's in a file?
- Part 3. Integration and advanced use
- Chapter 9. The big picture
- Chapter 10. Tika and the Lucene search stack
- Chapter 11. Extending Tika
- Part 4. Case studies
- Chapter 12. Powering NASA science data systems
- Chapter 13. Content management with Apache Jackrabbit
- Chapter 14. Curating cancer research data with Tika
- Chapter 15. The classic search engine example
- Appendix A. Tika quick reference
- Appendix B. Supported metadata keys
- Index
- List of Figures
- List of Tables
- List of Listings
Product information
- Title: Tika in Action
- Author(s): Chris Mattmann, Jukka Zitting
- Release date: November 2011
- Publisher(s): Manning Publications
- ISBN: 9781935182856