GBSON is a new annotation file format based on JSON. It contains all information stored in the GenBank format, but it is easier to parse and has a more clearly defined information structure.
The GenBank Flat File Format (.gb or .gbk) is a widely used file format for storing nucleic acid or protein sequences together with their annotation. It should not be confused with the NIH genetic sequence database called GenBank (https://www.ncbi.nlm.nih.gov/genbank/). Many applications, such as GeSeq, can read and write GenBank files. A detailed format description is available at http://www.insdc.org/documents/feature_table.html.
Motivation for GBSON
The GenBank file format is in principle a human-readable text file, but it is not based on a meta-format such as XML or CSV. Consequently, there is no standard way to parse GenBank files. The GenBank-specific use of tabs, spaces, and slashes is error-prone, and most text editors do not support syntax highlighting or checking of GenBank files. Developers often have to write their own GenBank export functions or even parsers. In addition, nothing prevents custom GenBank files from using unsupported identifiers or mixing upper- and lower-case expressions, which can cause compatibility issues.
The JSON format
The new GBSON format
Several JSON-based annotation formats and converters already exist, such as:
However, our approach provides a strict type definition (see below), which allows validation, enables IDE support, and avoids ambiguities.
Using TypeScript type definitions, we can define the format completely and in a readable way. This definition can be used to check whether a given JSON file is also a valid GBSON file. The current type definition can be found here: https://github.com/lehwark/GBSON/blob/master/GBSON.d.ts.
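To illustrate how a type definition can drive validation, here is a minimal sketch. The interfaces ExampleRecord and ExampleFeature and all of their field names are hypothetical and chosen for illustration only; they are not taken from GBSON.d.ts, which remains the authoritative definition. The sketch pairs each type with a hand-written runtime check, so that parsed JSON of unknown shape is only treated as typed data after it has passed the check.

```typescript
// Hypothetical, simplified types for illustration only.
// They are NOT the real GBSON types; see GBSON.d.ts for those.
interface ExampleFeature {
  type: string; // e.g. "gene" or "CDS"
  start: number; // start position (integer)
  end: number; // end position (integer)
  strand: "+" | "-";
  qualifiers: Record<string, string>;
}

interface ExampleRecord {
  name: string;
  sequence: string;
  features: ExampleFeature[];
}

// True for non-null, non-array objects.
function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// True if every property value of the object is a string.
function isStringMap(value: unknown): value is Record<string, string> {
  if (!isPlainObject(value)) return false;
  for (const key of Object.keys(value)) {
    if (typeof value[key] !== "string") return false;
  }
  return true;
}

// Runtime check mirroring the ExampleFeature interface.
function isExampleFeature(value: unknown): value is ExampleFeature {
  return (
    isPlainObject(value) &&
    typeof value.type === "string" &&
    Number.isInteger(value.start) &&
    Number.isInteger(value.end) &&
    (value.strand === "+" || value.strand === "-") &&
    isStringMap(value.qualifiers)
  );
}

// Runtime check mirroring the ExampleRecord interface.
function isExampleRecord(value: unknown): value is ExampleRecord {
  return (
    isPlainObject(value) &&
    typeof value.name === "string" &&
    typeof value.sequence === "string" &&
    Array.isArray(value.features) &&
    value.features.every(isExampleFeature)
  );
}

// Usage: parse JSON text, then narrow it to the typed structure.
const exampleJson =
  '{"name":"demo","sequence":"ATGC","features":[' +
  '{"type":"gene","start":1,"end":4,"strand":"+","qualifiers":{"gene":"demoA"}}]}';

const parsed: unknown = JSON.parse(exampleJson);
if (isExampleRecord(parsed)) {
  // Inside this branch the compiler knows the full structure.
  console.log(parsed.features.length);
} else {
  console.error("Not a valid example record");
}
```

Writing the checks by hand keeps the sketch free of dependencies. In practice, a validator could instead be generated from the type definition, so that the types and the runtime checks cannot drift apart.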
GeSeq already creates GBSON output alongside the GenBank output. An example output can be found here: https://chlorobox.mpimp-golm.mpg.de/GBSON-Example.json.