GBSON is a new annotation file format based on JSON. It contains all information stored in the GenBank format but is easier to parse and structures the information more clearly.
The GenBank Flat File Format (.gb or .gbk) is a widely used file format that allows storage of nucleic acid or protein sequences together with their annotation. It should not be confused with the NIH genetic sequence database called GenBank (https://www.ncbi.nlm.nih.gov/genbank/). Many applications, such as GeSeq, are able to read and write GenBank files. A detailed format description is available: http://www.insdc.org/documents/feature_table.html.
Motivation for GBSON
The GenBank file format is, in principle, a human-readable text format that is not based on a meta-format such as XML or CSV. Consequently, there is no standard way to parse GenBank files. The GenBank-specific use of tabs, spaces, and slashes is error-prone, and most text editors do not support syntax highlighting or checking for GenBank files. Developers often have to write their own GenBank export functions or even parsers. In addition, nothing prevents custom GenBank files from using unsupported identifiers or mixing upper- and lower-case expressions, which can cause compatibility issues.
Another disadvantage of the GenBank format is that it cannot express hierarchical or nested structures. If an annotation in GenBank consists of a gene, a CDS, two exons, and an intron, there is no standardized method to express the tree-structured relationship of these elements (e.g., that the exons and the intron are part of the CDS, which is itself part of the gene). Another common file format, GFF3, can express these relationships using the parent identifier. The resulting file, however, does not show the elements as nested (as, for example, the more verbose XML format does). This, in turn, makes it difficult for humans to recognize the data structure. In addition, GFF3 suffers from the same problem of uncontrolled vocabulary as GenBank, mentioned above.
The JSON format
The new GBSON format
Several JSON-based annotation formats/converters already exist, such as:
However, our approach provides a strict type definition (see below), which enables validation and IDE support and avoids ambiguity.
Using TypeScript type definitions, we can completely define the format in a readable way. This definition can be used to validate whether a given JSON file is also a valid GBSON file. The current type definition can be found here: https://github.com/lehwark/GBSON/blob/master/GBSON.d.ts.
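As a rough illustration of how a type definition can drive validation, the sketch below checks parsed JSON against a type at runtime with a hand-written type guard. The SketchFeature interface and the isSketchFeature function are hypothetical and heavily simplified. They are assumptions made for this example and do not reproduce the actual definitions in GBSON.d.ts.

```typescript
// Hypothetical, heavily simplified stand-in for a GBSON-like type.
// The real definition lives in GBSON.d.ts and will differ.
interface SketchFeature {
  type: string;
  start: number;
  end: number;
  children?: SketchFeature[];
}

// Runtime type guard: TypeScript types are erased at compile time,
// so JSON read from a file has to be checked explicitly.
function isSketchFeature(value: unknown): value is SketchFeature {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  if (typeof v["type"] !== "string") return false;
  if (typeof v["start"] !== "number" || typeof v["end"] !== "number") return false;
  const children = v["children"];
  if (children === undefined) return true;
  // Every nested element must itself be a valid feature.
  return Array.isArray(children) && children.every(isSketchFeature);
}

const good: unknown = JSON.parse(
  '{"type":"gene","start":1,"end":1000,"children":[{"type":"CDS","start":1,"end":1000}]}'
);
const bad: unknown = JSON.parse('{"type":"gene","start":"1","end":1000}');

console.log(isSketchFeature(good)); // true
console.log(isSketchFeature(bad)); // false: "start" is a string, not a number
```

For a JSON literal written directly in a TypeScript source file, annotating it with the type lets the compiler and the IDE report mismatches without any runtime check.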
GeSeq already creates GBSON output alongside the GenBank output – an example output can be found here: https://chlorobox.mpimp-golm.mpg.de/GBSON-Example.json.
Expressing nested features
GBSON allows us to solve the aforementioned problems and express nested structures directly, for example:
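The following is a minimal sketch of such a nested structure, using the gene, CDS, exon, and intron example from the motivation section. The NestedFeature interface, its field names, and the coordinates are hypothetical. They are assumptions chosen for illustration and not the actual GBSON schema, which is defined in GBSON.d.ts.

```typescript
// Hypothetical nested feature shape, NOT the actual GBSON schema.
interface NestedFeature {
  type: string;
  start: number;
  end: number;
  children?: NestedFeature[];
}

// The exons and the intron are part of the CDS, which is part of the gene.
// Coordinates are made up for illustration.
const gene: NestedFeature = {
  type: "gene",
  start: 1,
  end: 1000,
  children: [
    {
      type: "CDS",
      start: 1,
      end: 1000,
      children: [
        { type: "exon", start: 1, end: 400 },
        { type: "intron", start: 401, end: 600 },
        { type: "exon", start: 601, end: 1000 },
      ],
    },
  ],
};

// Render the tree as an indented outline, one line per feature.
function outline(feature: NestedFeature, indent: string = ""): string[] {
  const lines = [`${indent}${feature.type} ${feature.start}..${feature.end}`];
  for (const child of feature.children ?? []) {
    lines.push(...outline(child, indent + "  "));
  }
  return lines;
}

console.log(outline(gene).join("\n"));
```

Because the children are stored inside their parent, the hierarchy is visible in the file itself and can be traversed with a simple recursive function, without resolving parent identifiers first.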