Bioinformatics with Python Cookbook - Second Edition

by Tiago Antao

November 2018

Intermediate to advanced

360 pages

9h 36m

English

Packt Publishing

Using high-performance data formats – Parquet

In the previous recipes, we used HDF5 to store genomic data. In this recipe, we will consider another format: Apache Parquet (https://parquet.apache.org/), a columnar storage format. As far as I know, there are not many bioinformatics use cases for Parquet yet, but there are several reasons to consider it. For one, it can be used natively with Apache Spark (see the next recipe). It can also be far smarter than HDF5 about how data is stored, which allows, for example, faster indexing of data.

In this recipe, we will convert a subset of the HDF5 file that we used in the previous two recipes into Parquet.