January 2020
Intermediate to advanced
640 pages
16h 56m
English
The content extractor attempts to identify and extract all text from a document downloaded from a remote server. For instance, if a link points to a plaintext document, the extractor emits the document content as is. If, on the other hand, the link points to an HTML document, the extractor strips all HTML elements and emits only the text portion of the document.
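The behavior described above can be sketched in a few lines of Go. This is only a minimal illustration that relies on the standard library and on regular expressions; the `ExtractText` function, its parameters, and the regular expression names are hypothetical and are not taken from the Links 'R' Us code base. A production extractor would more likely use a proper HTML parser or sanitizer, as regular expressions cannot handle every malformed document.

```go
package main

import (
	"html"
	"regexp"
	"strings"
)

var (
	// Matches <script> and <style> elements together with their bodies,
	// since their contents are not human-readable text.
	nonTextRegex = regexp.MustCompile(`(?is)<(script|style)\b[^>]*>.*?</(script|style)\s*>`)
	// Matches any remaining HTML tag.
	tagRegex = regexp.MustCompile(`(?s)<[^>]*>`)
	// Matches runs of whitespace so they can be collapsed.
	spaceRegex = regexp.MustCompile(`\s+`)
)

// ExtractText is a hypothetical helper: it returns plaintext bodies as is
// and strips markup from bodies whose content type is text/html.
func ExtractText(contentType, body string) string {
	if !strings.Contains(strings.ToLower(contentType), "text/html") {
		// Assumption: anything that is not HTML is treated as plaintext.
		return body
	}

	text := nonTextRegex.ReplaceAllString(body, " ")
	text = tagRegex.ReplaceAllString(text, " ")
	// Decode entities such as &amp; only after the tags are gone.
	text = html.UnescapeString(text)
	text = spaceRegex.ReplaceAllString(text, " ")
	return strings.TrimSpace(text)
}
```

With this sketch, calling `ExtractText("text/html", "<p>Hello &amp; <b>world</b></p>")` yields `Hello & world`, whereas the same body passed with a `text/plain` content type is returned unchanged.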
The emitted content is then sent to the content indexer component, which tokenizes it and updates the Links 'R' Us full-text search index.
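To make the tokenize-and-index step more concrete, the following sketch builds a toy in-memory inverted index. It is an assumption-laden illustration rather than the indexer used by Links 'R' Us: the `Tokenize` function, the `Index` type, and its `Add` and `Lookup` methods are hypothetical names, and a real full-text index would also handle concerns such as stemming, ranking, and persistence.

```go
package main

import (
	"sort"
	"strings"
	"unicode"
)

// Tokenize lowercases the text and splits it on every rune that is
// neither a letter nor a digit.
func Tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// Index is a toy inverted index that maps each token to the set of
// document IDs containing it (hypothetical type, for illustration only).
type Index map[string]map[string]struct{}

// Add tokenizes the text and records docID under every token.
func (idx Index) Add(docID, text string) {
	for _, tok := range Tokenize(text) {
		if idx[tok] == nil {
			idx[tok] = make(map[string]struct{})
		}
		idx[tok][docID] = struct{}{}
	}
}

// Lookup returns the sorted IDs of all documents containing the term.
func (idx Index) Lookup(term string) []string {
	docs := idx[strings.ToLower(term)]
	ids := make([]string, 0, len(docs))
	for id := range docs {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}
```

Each document emitted by the extractor would be passed to `Add` together with an identifier (for example, its link), after which `Lookup` can answer single-term queries in a case-insensitive manner.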