To extract meaningful, aggregated blocks of text, we'll need to add two features to our class definition: a set of layout rules, and a text aggregator that applies those rules to the raw page.
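To make the idea of "layout rules plus an aggregator" concrete, here's a toy sketch in plain Python. It is not pdfminer's algorithm or API: the names `Fragment`, `LayoutRules`, `line_tolerance`, `block_gap`, and `aggregate()` are all hypothetical, invented for illustration. It assumes each raw fragment carries an x and y position, with larger y values higher on the page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    """Hypothetical raw text fragment: a position on the page plus its text."""
    x: float
    y: float  # assumed: larger y is higher on the page
    text: str


@dataclass
class LayoutRules:
    """Hypothetical layout rules; these are NOT pdfminer's LAParams."""
    line_tolerance: float = 2.0  # fragments this close vertically share a line
    block_gap: float = 18.0      # a larger gap between lines starts a new block


def aggregate(fragments: List[Fragment], rules: LayoutRules) -> List[str]:
    """Group raw fragments into lines, then group lines into blocks of text."""
    # Visit fragments from the top of the page down, left to right.
    ordered = sorted(fragments, key=lambda f: (-f.y, f.x))

    # Step 1: collect fragments with nearly equal y values into lines.
    lines = []  # each entry is (y of the line's first fragment, fragments)
    for frag in ordered:
        if lines and abs(lines[-1][0] - frag.y) <= rules.line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))

    # Step 2: collect lines separated by a small vertical gap into blocks.
    blocks = []
    previous_y = None
    for y, frags in lines:
        text = " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        if previous_y is not None and previous_y - y <= rules.block_gap:
            blocks[-1].append(text)
        else:
            blocks.append([text])
        previous_y = y
    return ["\n".join(block) for block in blocks]
```

Changing the rules changes the result without touching the raw fragments: a smaller `block_gap` splits the same page into more, shorter blocks. That separation of rules from aggregation is the design this section builds toward.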
We'll override the init_device() method to create a more sophisticated device. Here's the next subclass, built on the foundation of the Miner_Page class:
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams

class Miner_Layout(Miner_Page):
    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)
    def init_device(self, resource_manager, **params):
        """Return a PDFPageAggregator as a device."""
        self.layout_params = ...