MPI-MP Chlorobox

GBSON

A new annotation file format based on JSON. It contains all information stored in the GenBank format but is easier to parse and has a more clearly defined information structure.

About GenBank

The GenBank Flat File Format (.gb or .gbk) is a widely used file format that stores nucleic acid or protein sequences together with their annotation. It should not be confused with the NIH genetic sequence database called GenBank (https://www.ncbi.nlm.nih.gov/genbank/). Many applications, such as GeSeq, can read and write GenBank files. A detailed format description is available at http://www.insdc.org/documents/feature_table.html.

Motivation for GBSON

The GenBank file format is in principle a human-readable text format, but it is not based on a meta-format such as XML or CSV. Consequently, no standard way to parse GenBank files exists. The GenBank-specific use of tabs, spaces, and slashes is error-prone, and most text editors do not support syntax highlighting or checking of GenBank files. Developers often have to write their own GenBank export functions or even parsers. In addition, nothing prevents custom GenBank files from using unsupported identifiers or mixing upper- and lower-case expressions, which can cause compatibility issues.

The JSON format

JavaScript Object Notation (JSON) is an open-standard file format that uses human-readable text to transmit data objects consisting of attribute–value pairs and array data types (or any other serializable value). It is a very common data format, with a diverse range of applications, such as serving as a replacement for XML in AJAX systems. JSON is a language-independent data format. It was derived from JavaScript, but many modern programming languages include code to generate and parse JSON-format data. The official Internet media type for JSON is application/json. JSON filenames use the extension .json. (from Wikipedia)

The new GBSON format

We chose JSON as a meta-format because it is a modern, standardized, and very powerful way to store structured data in a form that is both machine- and human-readable. JSON parsers exist for virtually every programming language and are often built in (as in JavaScript/TypeScript); even browsers can parse JSON and present it as an interactive, collapsible data structure.
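As a minimal sketch of what "built-in parsing" means in practice, the snippet below reads a small annotation-like JSON text with the standard JSON.parse function and writes it back with JSON.stringify. The record layout and all field names (name, length, features, qualifiers) are hypothetical and chosen only for illustration; they are not the actual GBSON structure.

```typescript
// Hypothetical, simplified annotation record for illustration only.
// The field names below are assumptions, NOT the actual GBSON schema.
const gbsonLikeText = `{
  "name": "example_sequence",
  "length": 12,
  "features": [
    { "type": "gene", "start": 1, "end": 9, "qualifiers": { "gene": "psbA" } }
  ]
}`;

// JSON.parse is built into JavaScript/TypeScript; no custom parser is needed.
const parsedRecord = JSON.parse(gbsonLikeText);

// Nested values are reached by ordinary property access.
const firstGene: string = parsedRecord.features[0].qualifiers.gene;
console.log(firstGene);

// JSON.stringify serializes the structure again, here with 2-space indentation.
const roundTrip: string = JSON.stringify(parsedRecord, null, 2);
```

No tab, space, or slash conventions have to be handled by hand here; the structure is carried entirely by the JSON syntax.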

Existing approaches

Several JSON-based annotation formats and converters already exist.

However, our approach provides a strict type definition (see below), which allows validation, enables IDE support, and avoids ambiguities.

Type definition

Using TypeScript type definitions, we can define the format completely and in a readable way. This definition can be used to validate whether a given JSON file is also a valid GBSON file. The current type definition can be found here: https://github.com/lehwark/GBSON/blob/master/GBSON.d.ts.
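To illustrate how a type definition can drive validation, here is a minimal sketch using hand-written TypeScript type guards. TypeScript types are erased at compile time, so a runtime check of parsed JSON needs some explicit code like this (or a schema tool). The interfaces ExampleRecord and ExampleFeature and all their fields are hypothetical stand-ins; the real definitions are in GBSON.d.ts and may look quite different.

```typescript
// Hypothetical, simplified types for illustration only. The real definition
// lives in GBSON.d.ts; every name and field here is an assumption.
interface ExampleFeature {
  type: string;
  start: number;
  end: number;
  qualifiers: Record<string, string>;
}

interface ExampleRecord {
  name: string;
  sequence: string;
  features: ExampleFeature[];
}

// True for non-null, non-array objects.
function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Runtime check that mirrors the ExampleFeature interface field by field.
function isExampleFeature(value: unknown): value is ExampleFeature {
  if (!isPlainObject(value)) return false;
  const q = value.qualifiers;
  return (
    typeof value.type === "string" &&
    typeof value.start === "number" &&
    typeof value.end === "number" &&
    isPlainObject(q) &&
    Object.keys(q).every((key) => typeof q[key] === "string")
  );
}

// Runtime check that mirrors the ExampleRecord interface.
function isExampleRecord(value: unknown): value is ExampleRecord {
  return (
    isPlainObject(value) &&
    typeof value.name === "string" &&
    typeof value.sequence === "string" &&
    Array.isArray(value.features) &&
    value.features.every((f) => isExampleFeature(f))
  );
}

// One well-formed input and one whose "start" is a string instead of a number.
const validInput: unknown = JSON.parse(
  '{"name":"x","sequence":"ATGC","features":[{"type":"gene","start":1,"end":4,"qualifiers":{"gene":"psbA"}}]}'
);
const invalidInput: unknown = JSON.parse(
  '{"name":"x","sequence":"ATGC","features":[{"type":"gene","start":"1","end":4,"qualifiers":{}}]}'
);

console.log(isExampleRecord(validInput), isExampleRecord(invalidInput));
```

Once a guard such as isExampleRecord has passed, the compiler treats the value as the declared type, which is also what gives editors their autocompletion and type checking on the parsed data.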

Examples

GeSeq already creates GBSON output alongside the GenBank output – an example output can be found here: https://chlorobox.mpimp-golm.mpg.de/GBSON-Example.json.