Learning Apache Apex
by Ananth Gundabattula, Thomas Weise, Munagala V. Ramanath, David Yan, Kenneth Knowles
File splitter and block reader
The input operator is flexible to customize and simple to use, but it has one disadvantage that matters when input files can be very large or vary widely in size: because the smallest unit of input is an entire file, the reading of a single very large file cannot be parallelized.
The combination of a file splitter and a block reader solves this. The splitter creates metadata describing the blocks of a file; these become the work items for downstream block readers, which can read and parse the (non-overlapping) blocks independently of other partitions. Essentially, the work of the previous input operator is divided into two steps.
The first operator discovers the files and emits the block instructions, ...
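To make the idea of block metadata concrete, the following is a minimal plain-Java sketch of how a file might be cut into non-overlapping block descriptors. It does not use the Apex or Malhar API: the class `BlockSplitSketch`, the `Block` descriptor, the `split` method, and the example path and sizes are all hypothetical names chosen for illustration. The sketch also splits purely by byte offset and ignores records that straddle a block boundary, which a real block reader would need to handle.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockSplitSketch {

    /** Hypothetical descriptor for one block: the byte range [offset, offset + length) of a file. */
    static final class Block {
        final String filePath;
        final long offset;
        final long length;
        final boolean lastBlock;

        Block(String filePath, long offset, long length, boolean lastBlock) {
            this.filePath = filePath;
            this.offset = offset;
            this.length = length;
            this.lastBlock = lastBlock;
        }

        @Override
        public String toString() {
            return filePath + "[offset=" + offset + ", length=" + length
                + ", last=" + lastBlock + "]";
        }
    }

    /**
     * Splits a file of the given length into consecutive, non-overlapping blocks.
     * Only metadata is produced; no file content is read here.
     */
    static List<Block> split(String filePath, long fileLength, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        if (fileLength < 0) {
            throw new IllegalArgumentException("fileLength must not be negative");
        }
        List<Block> blocks = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            // The final block may be shorter than blockSize.
            long length = Math.min(blockSize, fileLength - offset);
            boolean last = offset + length == fileLength;
            blocks.add(new Block(filePath, offset, length, last));
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Example values only: a 300-byte file cut into 128-byte blocks.
        for (Block block : split("/data/input.log", 300L, 128L)) {
            System.out.println(block);
        }
    }
}
```

Each descriptor carries everything a reader needs to process its range without consulting any other block, which is what allows the blocks of one large file to be distributed across several reader partitions.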