mz5: Space- and time-efficient storage of mass spectrometry data sets

Across a host of mass spectrometry (MS)-driven -omics fields, researchers are acquiring ever-increasing amounts of high-throughput MS data, and the need for compact yet efficiently accessible storage has become clear.

The HUPO Proteomics Standards Initiative (PSI) has defined an ontology and associated controlled vocabulary that specifies the contents of MS data files in terms of an open data format. Current implementations are the mzXML and mzML formats (mzML specification), both of which are based on an XML representation of the data. As a consequence, these formats are not particularly efficient with respect to their storage space requirements or I/O performance.
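One source of the storage overhead in XML-based formats is that numeric arrays have to be embedded as text, typically base64-encoded. The following minimal sketch illustrates that effect in isolation, using only the Python standard library. The m/z values are made up for illustration, and the sketch ignores compression and XML markup, so it does not reproduce the benchmark figures reported for mz5.

```python
import base64
import struct


def pack_float64(values):
    """Pack values as little-endian 64-bit floats (raw binary layout)."""
    return struct.pack(f"<{len(values)}d", *values)


def raw_size(values):
    """Bytes needed to store the packed array as raw binary."""
    return len(pack_float64(values))


def base64_size(values):
    """Bytes needed once the same packed array is base64-encoded as text."""
    return len(base64.b64encode(pack_float64(values)))


# Hypothetical m/z values, for illustration only.
mz_values = [400.0 + 0.01 * i for i in range(3000)]

raw = raw_size(mz_values)
encoded = base64_size(mz_values)
print(f"raw binary: {raw} bytes")
print(f"base64 text: {encoded} bytes")
print(f"overhead factor: {encoded / raw:.3f}")
```

Base64 maps every 3 bytes of binary data to 4 text characters, so the encoded array is about one third larger than the raw one before any markup is added. A binary container can store the packed array directly.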

This contribution introduces mz5, an implementation of the PSI mzML ontology that is based on HDF5, an efficient, industrial-strength storage backend.

Compared to the current mzXML and mzML standards, this strategy reduces file size by a factor of ~2 on average and increases I/O performance ~3-4 fold.

The format is implemented as part of the ProteoWizard project.

Additional information about mz5 structure:

Code example:
  • mz5_example: converts any format that is supported by ProteoWizard to mz5.

mz5 Download and Development

The mz5 code is considered stable; it is currently maintained as part of the ProteoWizard main development line (trunk). Because mz5 is part of the main ProteoWizard release, mz5 support is available in the standard ProteoWizard package, which can be downloaded from the ProteoWizard download site.

mz5 development has been carried out in the Steen & Steen lab. If you run into trouble, please do not hesitate to contact Marc or Mathias.

The mz5 license: Apache License, V2.0

mz5 uses the same license as the ProteoWizard framework; for the sake of completeness, the full text of the Apache license used for mz5 is available here.

Please consider contributing your improvements and bug fixes.