November 2018
In the previous recipes, we used HDF5 as a storage format for genomic data. In this recipe, we will consider another format: Parquet (https://parquet.apache.org/), an Apache project. As far as I know, Parquet has not seen much use in bioinformatics, but there are several reasons to consider it. For one, it can be used natively with Apache Spark (see the next recipe). It can also be far smarter than HDF5 about how data is stored; think, for example, of faster indexing of data.
In this recipe, we will convert a subset of the HDF5 file that we used in the previous two recipes.