Getting text data from a document
To extract meaningful, aggregated blocks of text, we'll need to add two features to our class definition: a set of layout rules, and a text aggregator that applies those rules to the raw page to build the blocks.
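To make the idea of rules plus an aggregator concrete, here is a minimal, library-free sketch. It is not how pdfminer works internally; the Fragment and LayoutRules classes, the aggregate() function, and the two thresholds are all hypothetical names invented for this illustration. It assumes each raw fragment carries a position, and it groups fragments into lines and then lines into blocks.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A raw piece of page text with a position (hypothetical structure)."""
    x: float
    y: float
    text: str


@dataclass
class LayoutRules:
    """Toy layout rules; illustrative names, not pdfminer parameters."""
    line_tolerance: float = 2.0  # fragments this close in y share a line
    block_gap: float = 14.0      # a larger vertical gap starts a new block


def aggregate(fragments, rules):
    """Group raw fragments into lines, then lines into blocks of text."""
    # PDF y coordinates grow upward, so sort top to bottom, then left to right.
    ordered = sorted(fragments, key=lambda f: (-f.y, f.x))

    lines = []  # each entry is [y of first fragment, list of fragments]
    for frag in ordered:
        if lines and abs(lines[-1][0] - frag.y) <= rules.line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])

    blocks = []
    previous_y = None
    for y, frags in lines:
        text = " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        if previous_y is not None and previous_y - y <= rules.block_gap:
            blocks[-1].append(text)  # close enough: same block
        else:
            blocks.append([text])    # large gap: start a new block
        previous_y = y
    return ["\n".join(block) for block in blocks]


# Made-up sample data: two nearby lines, then a distant third line.
sample = [
    Fragment(72, 700, "Getting"),
    Fragment(120, 700.5, "text"),
    Fragment(72, 688, "from a PDF"),
    Fragment(72, 640, "Second block"),
]
blocks = aggregate(sample, LayoutRules())
```

Changing the rules changes the result without touching the raw fragments, which is why the rules are kept separate from the aggregator: a smaller block_gap, for example, splits the first two lines into separate blocks.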
We'll override the init_device() method to create a more sophisticated device. Here's the next subclass, built on the foundation of the Miner_Page and Miner classes:
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams

class Miner_Layout(Miner_Page):
    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)
    def init_device(self, resource_manager, **params):
        """Return a PDFPageAggregator as a device."""
        self.layout_params = ...
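The override technique itself can be sketched without pdfminer. The example below assumes the base class calls init_device() while it sets itself up, so a subclass only has to replace that one method. BaseReader and LayoutReader are hypothetical stand-ins, not the Miner classes, and the dictionaries stand in for real device objects.

```python
class BaseReader:
    """Hypothetical stand-in for a base class with an init_device() hook."""

    def __init__(self, **params):
        # The base class calls the hook; a subclass decides what it returns.
        self.device = self.init_device(
            resource_manager="shared-resources", **params
        )

    def init_device(self, resource_manager, **params):
        """Default device: a plain record with no layout rules."""
        return {"kind": "basic", "resources": resource_manager}


class LayoutReader(BaseReader):
    """Overrides only the hook to supply a richer device."""

    def init_device(self, resource_manager, **params):
        # Keep the rules on the instance so they can be inspected later.
        # The "line_margin" key is an illustrative setting for this sketch.
        self.layout_params = {"line_margin": params.get("line_margin", 0.5)}
        return {
            "kind": "aggregator",
            "resources": resource_manager,
            "rules": self.layout_params,
        }


basic = BaseReader()
layout = LayoutReader(line_margin=0.3)
```

Because the constructor is inherited unchanged, everything else in the base class keeps working; only the kind of device it receives is different.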