November 2019
Intermediate to advanced
304 pages
8h 40m
English
Data can be spread across multiple files, subdirectories, or clusters. Constraints such as size mean that we need a mechanism to locate and load that data in different ways. In distributed environments, a large amount of data can be stored as chunks across multiple clusters. DataVec uses InputSplit for this purpose.
In step 1, we looked at FileSplit, an InputSplit implementation that splits the root directory into files. FileSplit recursively looks for files inside the specified directory location. You can also pass an array of strings as a parameter to denote the allowed file extensions, so that only files with those extensions are included in the split.
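The behavior described above can be sketched in plain Java. This is not DataVec's implementation. It is a minimal illustration, using only the standard library, of what recursively collecting files under a root directory and keeping those with an allowed extension looks like. The class name `ExtensionFilteredSplit`, the method name `locations`, and the case-insensitive extension match are assumptions made for this sketch. In DataVec itself, the equivalent would be along the lines of `new FileSplit(rootDir, allowedExtensions)`; check the Javadoc of your DataVec version for the exact constructors available.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper that mimics the idea behind FileSplit with an
// extension filter. It does not use or reproduce DataVec code.
public class ExtensionFilteredSplit {

    // Walks rootDir recursively and returns the URIs of all regular files
    // whose names end with one of the allowed extensions.
    // Matching is case-insensitive here (an assumption of this sketch).
    public static List<URI> locations(Path rootDir, String[] allowedExtensions) {
        try (Stream<Path> paths = Files.walk(rootDir)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> hasAllowedExtension(p, allowedExtensions))
                    .sorted()
                    .map(Path::toUri)
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if the file name ends with any of the given extensions.
    private static boolean hasAllowedExtension(Path file, String[] allowedExtensions) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        for (String ext : allowedExtensions) {
            if (name.endsWith(ext.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Root directory comes from the first argument, defaulting to ".".
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        String[] allowedExtensions = {".jpeg", ".jpg"};
        locations(root, allowedExtensions).forEach(System.out::println);
    }
}
```

Running the class against a directory prints one URI per matching file, including files in nested subdirectories, which is the kind of listing an extension-filtered split is meant to yield.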
In the sample output, ...