This chapter provides an efficient solution to the “small files” problem. What is a small file in a MapReduce/Hadoop environment? In the Hadoop world, a small file is one whose size is much smaller than the HDFS block size. The default HDFS block size is 64 MB (67,108,864 bytes) in Hadoop 1.x and 128 MB in Hadoop 2.x and later, so, for example, a 2 MB, 5 MB, or 7 MB file is considered a small file. The block size is configurable, however: it is defined by the dfs.block.size parameter (renamed dfs.blocksize in Hadoop 2.x, where the old name is deprecated). If your application deals with huge files (such as DNA sequencing data), you might set the block size higher, for example to 256 MB.
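To make the size comparison concrete, here is a minimal sketch in plain Java (no Hadoop dependency). The class BlockMath and its methods are hypothetical helpers introduced only for illustration; they are not part of the Hadoop API. The sketch assumes a 64 MB block size and shows how little of a block a 2 MB, 5 MB, or 7 MB file fills.

```java
// Hypothetical helper for illustration only; not part of the Hadoop API.
class BlockMath {
    static final long MB = 1024L * 1024L;

    // Number of HDFS blocks needed for a file of the given size (ceiling division).
    static long blocksFor(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes == 0) {
            return 0;
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Fraction of a single block that the file fills, capped at 1.0.
    static double blockFillRatio(long fileSizeBytes, long blockSizeBytes) {
        return Math.min(1.0, (double) fileSizeBytes / blockSizeBytes);
    }

    public static void main(String[] args) {
        long blockSize = 64 * MB; // assumed block size: 67,108,864 bytes
        for (long sizeMb : new long[] {2, 5, 7, 256}) {
            long size = sizeMb * MB;
            System.out.printf("%d MB file: %d block(s), fills %.1f%% of one block%n",
                    sizeMb, blocksFor(size, blockSize),
                    100.0 * blockFillRatio(size, blockSize));
        }
    }
}
```

A 2 MB file fills only about 3% of a 64 MB block, which is why such files count as small.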
In general, Hadoop handles big files very well, but when the files are small, each file becomes its own input split and is passed to a separate mapper's map() function, which is inefficient because it creates a large number of mappers. Typically, if you are storing and processing small files, you have lots of them. For example, the file that represents a bioset for a gene expression data type can be 2 to 3 MB. So, to process 1,000 biosets, you need 1,000 mappers, one per file. Having too many small files is therefore problematic in Hadoop. To solve this problem, we merge many of these small files into larger files and then process the merged files. In the case of biosets, we might merge every 20 to 25 files into one file (whose size will then be close to 64 MB). After merging, the same 1,000 biosets need only 40 to 50 mappers instead of 1,000. ...
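The merging arithmetic above can be sketched in plain Java. This is an illustration under stated assumptions, not the chapter's implementation: SmallFilePacker and pack() are hypothetical names, the target size is assumed to be 64 MB, and the input is a made-up list of 1,000 file sizes alternating between 2 MB and 3 MB. The sketch only groups sizes into buckets; it does not read or write HDFS.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Hadoop API.
class SmallFilePacker {
    static final long MB = 1024L * 1024L;

    // Groups file sizes, in the order given, into buckets whose total size
    // does not exceed targetBytes. A file larger than targetBytes ends up
    // alone in its own bucket.
    static List<List<Long>> pack(List<Long> fileSizes, long targetBytes) {
        List<List<Long>> buckets = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current bucket if adding this file would overflow it.
            if (!current.isEmpty() && currentBytes + size > targetBytes) {
                buckets.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            buckets.add(current);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Assumed input: 1,000 biosets alternating between 2 MB and 3 MB.
        List<Long> sizes = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            sizes.add((i % 2 == 0 ? 2 : 3) * MB);
        }
        List<List<Long>> buckets = pack(sizes, 64 * MB);
        System.out.println("mappers without merging: " + sizes.size());
        System.out.println("mappers after merging:   " + buckets.size());
    }
}
```

With this particular input, every bucket holds 25 files, so the 1,000 files collapse into 40 buckets, and one mapper per bucket gives the low end of the 40 to 50 range quoted above. A greedy in-order grouping is the simplest choice; a size-sorted packing could fill buckets more tightly at the cost of reordering files.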