Abstract

hdf5plugin is a Python package that (1) provides a set of HDF5 compression filters (namely: blosc, bitshuffle, lz4, FCIDECOMP, ZFP, Zstandard) and (2) enables their use from the Python programming language with h5py, a thin, pythonic wrapper around libHDF5.

This presentation illustrates how to use hdf5plugin for reading and writing compressed datasets from Python and gives an overview of the different HDF5 compression filters it provides.

It also illustrates how the provided compression filters can be enabled to read compressed datasets from other (non-Python) applications.

Finally, it discusses how hdf5plugin manages to distribute the HDF5 plugins for reuse with different versions of libHDF5.

License: CC-BY 4.0

hdf5plugin

hdf5plugin packages a set of HDF5 compression filters (namely: blosc, bitshuffle, lz4, FCIDECOMP, ZFP, Zstandard) and makes them usable from the Python programming language through h5py.

h5py is a thin, pythonic wrapper around HDF5.

Presenter: Thomas VINCENT

European HDF Users Group Summer 2021, July 7-8, 2021

[2]:
from h5glance import H5Glance  # Browsing HDF5 files
H5Glance("data.h5")
[2]:
    • compressed_data_bitshuffle_lz4: 1969 × 2961 entries, dtype: uint8
    • copyright: scalar entries, dtype: UTF-8 string
    • data: 1969 × 2961 entries, dtype: uint8
[3]:
import h5py                      # Pythonic HDF5 wrapper: https://docs.h5py.org/
import matplotlib.pyplot as plt  # For displaying images

h5file = h5py.File("data.h5", mode="r")  # Open HDF5 file in read mode
data = h5file["/data"][()]               # Access HDF5 dataset "/data"
plt.imshow(data); plt.colorbar()         # Display data
[3]:
<matplotlib.colorbar.Colorbar at 0x119479358>
[image: hdf5plugin_EuropeanHUG2021_7_1.png]
[4]:
data = h5file["/compressed_data_bitshuffle_lz4"][()]  # Access compressed dataset
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-4-4bb532391a0f> in <module>
----> 1 data = h5file["/compressed_data_bitshuffle_lz4"][()]  # Access compressed dataset

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

~/venv/py37env/lib/python3.7/site-packages/h5py/_hl/dataset.py in __getitem__(self, args, new_dtype)
    760         mspace = h5s.create_simple(selection.mshape)
    761         fspace = selection.id
--> 762         self.id.read(mspace, fspace, arr, mtype, dxpl=self._dxpl)
    763
    764         # Patch up the output for NumPy

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5d.pyx in h5py.h5d.DatasetID.read()

h5py/_proxy.pyx in h5py._proxy.dset_rw()

OSError: Can't read data (can't open directory: /usr/local/hdf5/lib/plugin)

hdf5plugin usage

Reading compressed datasets

To enable reading compressed datasets that are not supported out of the box by libHDF5 and h5py, install hdf5plugin and import it:

[ ]:
%%bash
pip3 install hdf5plugin

Or: conda install -c conda-forge hdf5plugin

[5]:
import hdf5plugin
[6]:
data = h5file["/compressed_data_bitshuffle_lz4"][()]  # Access dataset
plt.imshow(data); plt.colorbar()                      # Display data
[6]:
<matplotlib.colorbar.Colorbar at 0x11bc666d8>
[image: hdf5plugin_EuropeanHUG2021_14_1.png]
[7]:
h5file.close()  # Close the HDF5 file

Writing compressed datasets

When writing datasets with h5py, compression can be specified through the compression and compression_opts arguments of h5py.Group.create_dataset:

[8]:
# Create a dataset with h5py without compression
h5file = h5py.File("new_file_uncompressed.h5", mode="w")
h5file.create_dataset("/data", data=data)
h5file.close()
[9]:
# Create a compressed dataset
h5file = h5py.File("new_file_bitshuffle_lz4.h5", mode="w")
h5file.create_dataset(
    "/compressed_data_bitshuffle_lz4",
    data=data,
    compression=32008,  # bitshuffle/lz4 HDF5 filter identifier
    compression_opts=(0, 2)  # options: default number of elements/block, enable LZ4
)
h5file.close()

hdf5plugin provides some helpers to ease dealing with compression filters and their options:

[10]:
h5file = h5py.File("new_file_bitshuffle_lz4.h5", mode="w")
h5file.create_dataset(
    "/compressed_data_bitshuffle_lz4",
    data=data,
    **hdf5plugin.Bitshuffle()  # Same as: **hdf5plugin.Bitshuffle(nelems=0, lz4=True)
)
h5file.close()
[ ]:
hdf5plugin.Bitshuffle?
[12]:
H5Glance("new_file_bitshuffle_lz4.h5")
[12]:
    • compressed_data_bitshuffle_lz4: 1969 × 2961 entries, dtype: uint8
[13]:
h5file = h5py.File("new_file_bitshuffle_lz4.h5", mode="r")
plt.imshow(h5file["/compressed_data_bitshuffle_lz4"][()]); plt.colorbar()
h5file.close()
[image: hdf5plugin_EuropeanHUG2021_23_0.png]
[14]:
!ls -l new_file*.h5
-rw-r--r--  1 tvincent  staff  4278852 Jul  8 14:25 new_file_bitshuffle_lz4.h5
-rw-r--r--  1 tvincent  staff  5832257 Jul  8 14:24 new_file_uncompressed.h5

HDF5 compression filters

Available through h5py

Compression filters provided by h5py (see the sketch after this list):

  • Provided by libHDF5: “gzip” and optionally “szip”

  • Bundled with h5py: “lzf”
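
For instance, a minimal sketch using both of these h5py-provided filters (file and dataset names are illustrative):

[ ]:
h5file = h5py.File("new_file_gzip_lzf.h5", mode="w")
h5file.create_dataset(
    "/data_gzip", data=data, compression="gzip", compression_opts=6)  # gzip at level 6
h5file.create_dataset(
    "/data_lzf", data=data, compression="lzf")  # lzf takes no options
h5file.close()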

Pre-compression filter: Byte-Shuffle

[ ]:
h5file = h5py.File("new_file_shuffle_gzip.h5", mode="w")
h5file.create_dataset(
    "/compressed_data_shuffle_gzip", data=data, shuffle=True, compression="gzip")
h5file.close()

Provided by hdf5plugin

Additional compression filters provided by hdf5plugin: Bitshuffle, Blosc, FciDecomp, LZ4, ZFP, Zstandard.

These are 6 of the 25 HDF5 filter plugins registered as of June 2021.
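
As a quick check, the mapping of provided filter names to their registered HDF5 filter IDs is exposed as hdf5plugin.FILTERS:

[ ]:
print(hdf5plugin.FILTERS)  # e.g., 'bshuf' -> 32008, 'lz4' -> 32004, ...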

[ ]:
h5file = h5py.File("new_file_blosc.h5", mode="w")
h5file.create_dataset(
    "/compressed_data_blosc",
    data=data,
    **hdf5plugin.Blosc(cname='zlib', clevel=5, shuffle=hdf5plugin.Blosc.SHUFFLE)
)
h5file.close()

General purpose lossless compression
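
As an illustration, a minimal sketch using the Zstandard filter with its default options (file and dataset names are illustrative):

[ ]:
h5file = h5py.File("new_file_zstd.h5", mode="w")
h5file.create_dataset(
    "/compressed_data_zstd",
    data=data,
    **hdf5plugin.Zstd()  # Zstandard with default options
)
h5file.close()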

Specific compression
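
ZFP is one such specific filter: it targets floating-point data and offers lossy modes. A minimal sketch, assuming ZFP's fixed-rate mode (the rate value and the random float data are illustrative):

[ ]:
import numpy

float_data = numpy.random.random((1969, 2961))  # ZFP operates on floating-point data

h5file = h5py.File("new_file_zfp.h5", mode="w")
h5file.create_dataset(
    "/compressed_data_zfp",
    data=float_data,
    **hdf5plugin.Zfp(rate=8.0)  # Fixed-rate (lossy) mode: ~8 bits per value
)
h5file.close()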

Benchmark

Equivalent filters

Blosc includes pre-compression filters and compression algorithms provided by other HDF5 compression filters (see the sketch after this list):

  • LZ4() => Blosc("lz4", 9)

  • Zstd() => Blosc("zstd", 2)

  • HDF5 shuffle => Blosc with shuffle=hdf5plugin.Blosc.SHUFFLE

  • Bitshuffle() => Blosc("lz4", 5, hdf5plugin.Blosc.BITSHUFFLE)

    Except for OpenMP support with Bitshuffle!
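
A minimal sketch of the Bitshuffle equivalence, writing the same data both ways (file name is illustrative):

[ ]:
h5file = h5py.File("new_file_equivalences.h5", mode="w")
h5file.create_dataset(
    "/bitshuffle_lz4", data=data, **hdf5plugin.Bitshuffle())
h5file.create_dataset(
    "/blosc_bitshuffle_lz4",
    data=data,
    **hdf5plugin.Blosc("lz4", 5, hdf5plugin.Blosc.BITSHUFFLE))
h5file.close()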

Summary

Having different pre-compression filters and compression algorithms at hand offers different trade-offs between read/write speed and compression ratio (and possibly error rate).

Availability and compatibility are also worth keeping in mind: "gzip", as included in libHDF5, is the most compatible filter (as is "lzf", which is bundled with h5py).

Using hdf5plugin filters with other applications

Note: In a notebook, prefixing a command with ! runs it in a shell.

[15]:
!h5dump -d /compressed_data_bitshuffle_lz4 -s "0,0" -c "5,10" data.h5
HDF5 "data.h5" {
DATASET "/compressed_data_bitshuffle_lz4" {
   DATATYPE  H5T_STD_U8LE
   DATASPACE  SIMPLE { ( 1969, 2961 ) / ( 1969, 2961 ) }
   SUBSET {
      START ( 0, 0 );
      STRIDE ( 1, 1 );
      COUNT ( 5, 10 );
      BLOCK ( 1, 1 );
      DATA {
      }
   }
}
}

The DATA section above is empty because h5dump cannot load the required compression filter. A solution: set the HDF5_PLUGIN_PATH environment variable to hdf5plugin.PLUGINS_PATH.

[ ]:
# Directory where HDF5 compression filters are stored
hdf5plugin.PLUGINS_PATH
[ ]:
# Retrieve hdf5plugin.PLUGINS_PATH from the command line
!python3 -c "import hdf5plugin; print(hdf5plugin.PLUGINS_PATH)"
[19]:
!ls `python3 -c "import hdf5plugin; print(hdf5plugin.PLUGINS_PATH)"`
libh5blosc.dylib     libh5fcidecomp.dylib libh5zfp.dylib
libh5bshuf.dylib     libh5lz4.dylib       libh5zstd.dylib
[20]:
# Set HDF5_PLUGIN_PATH environment variable to hdf5plugin.PLUGINS_PATH
!HDF5_PLUGIN_PATH=`python3 -c "import hdf5plugin; print(hdf5plugin.PLUGINS_PATH)"` h5dump -d /compressed_data_bitshuffle_lz4 -s "0,0" -c "5,10" data.h5
HDF5 "data.h5" {
DATASET "/compressed_data_bitshuffle_lz4" {
   DATATYPE  H5T_STD_U8LE
   DATASPACE  SIMPLE { ( 1969, 2961 ) / ( 1969, 2961 ) }
   SUBSET {
      START ( 0, 0 );
      STRIDE ( 1, 1 );
      COUNT ( 5, 10 );
      BLOCK ( 1, 1 );
      DATA {
      (0,0): 53, 52, 53, 54, 54, 55, 55, 56, 56, 57,
      (1,0): 49, 50, 54, 55, 53, 54, 55, 56, 56, 58,
      (2,0): 50, 51, 54, 54, 53, 55, 56, 57, 58, 57,
      (3,0): 51, 54, 55, 54, 54, 55, 56, 57, 58, 59,
      (4,0): 53, 55, 54, 54, 56, 56, 58, 57, 57, 58
      }
   }
}
}

Note: This only works for reading compressed datasets, not for writing them!

Insights

The Problem

For reading compressed datasets, compression filters do NOT need information from libHDF5: they operate on the compressed stream.

For writing compressed datasets, some information about the dataset (e.g., the data type size) can be needed by the filter (e.g., to shuffle the data). This information is retrieved through the libHDF5 C-API (e.g., H5Tget_size).

Access to the libHDF5 C-API is therefore needed, but linking compression filters with libHDF5 is cumbersome in a dynamic environment like Python.

On Windows

Symbols from dynamically loaded Python modules and libraries are accessible to others.

Compression filters are registered at the C level with H5Zregister (see src/register_win32.c).

On Linux and macOS

In Python, symbols from dynamically loaded modules and libraries are NOT visible to others.

  • Do not link filters with libHDF5.

  • Instead, provide function wrappers that replace the libHDF5 C-API, and link the compression filters against those.

    • Those wrappers call the corresponding libHDF5 functions, which are dynamically loaded at runtime.

  • At runtime, the compression filter must be initialized to load the symbols from the libHDF5 used by h5py, so that the function wrappers can call them.

src/hdf5_dl.c (simplified excerpt):

#include <dlfcn.h>  /* dlopen, dlsym */
#include "hdf5.h"   /* hid_t and the H5Tget_size declaration */

typedef size_t (* DL_func_H5Tget_size)(hid_t type_id);

/* Structure storing HDF5 function pointers */
static struct {
    DL_func_H5Tget_size H5Tget_size;
} DL_H5Functions = {NULL};

/* Init wrappers by loading symbols from the libHDF5 in use */
int init_filter(const char* libname)
{
    void *handle = dlopen(libname, RTLD_LAZY | RTLD_LOCAL);  /* Load libHDF5 */
    if (handle == NULL) {
        return -1;
    }
    DL_H5Functions.H5Tget_size =
        (DL_func_H5Tget_size)dlsym(handle, "H5Tget_size");
    return 0;
}

/* H5Tget_size libHDF5 C-API wrapper */
size_t H5Tget_size(hid_t type_id)
{
    if (DL_H5Functions.H5Tget_size != NULL) {
        return DL_H5Functions.H5Tget_size(type_id);
    }
    return 0;  /* Symbols not loaded */
}

Concluding remark

Should the HDF5 compression filter API evolve, it would be great to take this into account to ease the distribution of compression filters.

A word about hdf5plugin license

The source code of hdf5plugin itself is licensed under the MIT license…

It also embeds the source code of the provided compression filters and libraries, which come under various open-source licenses (Apache, BSD-2, BSD-3, MIT, Zlib…) and copyrights.

Conclusion

hdf5plugin provides additional HDF5 compression filters (namely: Bitshuffle, Blosc, FciDecomp, LZ4, ZFP, Zstandard), mainly, but not only, for use with h5py.

Credits to the contributors: Thomas Vincent, Armando Sole, @Florian-toll, @fpwg, Jerome Kieffer, @Anthchirp, @mobiusklein, @junyuewang

Partially funded by the PaNOSC EU-project.

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 823852.