How to do it...

Take a look at the following steps:

  1. We start with the necessary imports. Dask will perform the conversion:
from math import ceil
import numpy as np
import h5py
import dask.array as da
import dask.dataframe as dd
  2. We then open the HDF5 datasets that we want to convert. For the sake of our example, we will use four of them: the variant positions, the flag indicating whether each position is an SNP, and the QUAL and MQ0 annotations:
h5_3L = h5py.File('ag1000g.phase1.ar3.pass.3L.h5', 'r')
positions = h5_3L['/3L/variants/POS']
is_snp = h5_3L['/3L/variants/is_snp']
qual = h5_3L['/3L/variants/QUAL']
mq0 = h5_3L['/3L/variants/MQ0']
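Indexing an open h5py file by path returns a dataset handle, not the data itself; values are read from disk only when the handle is sliced. The following is a minimal sketch of that behavior. It writes a tiny stand-in file that mimics the group layout used above. The file name `toy_variants.h5` and every value in it are made up for illustration and are not Ag1000G data.

```python
import os
import tempfile

import h5py
import numpy as np

# Write a tiny stand-in file with the same group layout as the recipe.
# The file name and all values are hypothetical, not Ag1000G data.
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_variants.h5')
with h5py.File(toy_path, 'w') as out:
    out.create_dataset('/3L/variants/POS',
                       data=np.array([101, 205, 310, 412], dtype='int32'))
    out.create_dataset('/3L/variants/is_snp',
                       data=np.array([True, False, True, True]))

with h5py.File(toy_path, 'r') as h5_toy:
    # Path lookup returns h5py dataset handles
    positions = h5_toy['/3L/variants/POS']
    is_snp = h5_toy['/3L/variants/is_snp']
    # Shape and dtype are available without reading the values
    pos_shape = positions.shape
    pos_dtype = positions.dtype
    # Slicing reads only the requested range into a NumPy array
    first_two = positions[:2]
    snp_flags = is_snp[:]

print(pos_shape, pos_dtype)
print(first_two, snp_flags)
```

The sketch copies out what it needs before the `with` block closes the file, because dataset handles are only usable while their file is open. This is presumably why the recipe keeps `h5_3L` open while the conversion runs.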
  3. We will now create a Dask DataFrame:
all_ddf = dd.from_array(positions, columns=['POS'])
is_snp_dseries = dd.from_array(is_snp)
...
