GBSON


A new annotation file format based on JSON, containing all information stored in the GenBank format but with advantageous parsing and information structure properties.


About GenBank

 

The GenBank Flat File Format (.gb or .gbk) is a widely used file format for storing nucleic acid or protein sequences together with their annotation. It should not be confused with the NIH genetic sequence database called GenBank (https://www.ncbi.nlm.nih.gov/genbank/). Many applications, such as GeSeq, can read and write GenBank files. A detailed format description is available at http://www.insdc.org/documents/feature_table.html.

 

Motivation for GBSON

 

The GenBank file format is, in principle, a human-readable text file that is not based on a meta-format such as XML or CSV. Consequently, no standard way of parsing GenBank files exists. The GenBank-specific use of tabs, spaces, and slashes is error-prone, and most text editors do not support syntax highlighting or checking of GenBank files. Developers often have to write their own GenBank export functions or even parsers. In addition, nothing prevents custom GenBank files from using unsupported identifiers or mixing upper- and lower-case expressions, which can cause compatibility issues.

 

Another disadvantage of the GenBank format is its inability to express hierarchical or nested structures. If an annotation in GenBank consists of a gene, a CDS, two exons, and an intron, there is no standardized method to express the tree-structured relationship of these elements (e.g., that the exons and the intron are part of the CDS, which is itself part of the gene). Another common file format, GFF3, can express these relationships using the Parent attribute. The resulting file, however, remains a flat list and does not visibly nest the elements (as, for example, the more verbose XML format does), which makes it difficult for humans to recognize the data structure. In addition, GFF3 suffers from the same problem of uncontrolled vocabulary as GenBank, mentioned above.
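
To illustrate the point about parent references, the following is a minimal sketch of the work a reader (human or program) has to do to recover the tree from a flat, GFF3-style list. The `FlatFeature` and `FeatureNode` types and the `buildTree` function are hypothetical names introduced for this illustration only; they are not part of GFF3, GenBank, or GBSON, and the feature hierarchy simply mirrors the gene/CDS/exon/intron example above.

```typescript
// Hypothetical minimal record: only the fields needed to show the idea.
interface FlatFeature {
  id: string;      // corresponds to the GFF3 "ID" attribute
  type: string;    // e.g. "gene", "CDS", "exon"
  parent?: string; // corresponds to the GFF3 "Parent" attribute; absent for top-level features
}

// Hypothetical nested representation of the same information.
interface FeatureNode {
  id: string;
  type: string;
  children: FeatureNode[];
}

// Rebuild the tree that the flat list only implies.
function buildTree(flat: FlatFeature[]): FeatureNode[] {
  const nodes = new Map<string, FeatureNode>();
  for (const f of flat) {
    nodes.set(f.id, { id: f.id, type: f.type, children: [] });
  }
  const roots: FeatureNode[] = [];
  for (const f of flat) {
    const node = nodes.get(f.id)!;
    const parent = f.parent !== undefined ? nodes.get(f.parent) : undefined;
    if (parent) {
      parent.children.push(node);
    } else {
      roots.push(node); // no (known) parent: treat as top-level feature
    }
  }
  return roots;
}

// Flat list: the hierarchy is only visible by following the parent references.
const flat: FlatFeature[] = [
  { id: "gene1", type: "gene" },
  { id: "cds1", type: "CDS", parent: "gene1" },
  { id: "exon1", type: "exon", parent: "cds1" },
  { id: "intron1", type: "intron", parent: "cds1" },
  { id: "exon2", type: "exon", parent: "cds1" },
];

const tree = buildTree(flat);
console.log(JSON.stringify(tree, null, 2));
```

In a nested format this reconstruction step is unnecessary, because the serialized file already has the shape of the tree.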

 

The JSON format

 

JavaScript Object Notation (JSON) is an open-standard file format that uses human-readable text to transmit data objects consisting of attribute–value pairs and array data types (or any other serializable value). It is a very common data format, with a diverse range of applications, such as serving as a replacement for XML in AJAX systems. JSON is a language-independent data format. It was derived from JavaScript, but many modern programming languages include code to generate and parse JSON-format data. The official Internet media type for JSON is application/json. JSON filenames use the extension .json. (from Wikipedia)

 

The new GBSON format

 

We chose JSON as the meta-format because it is a modern, standardized, and powerful way to store structured data in a form that is both machine- and human-readable. JSON parsers exist for virtually every programming language and are often built in (as in JavaScript/TypeScript). Even browsers can parse JSON and present it as an interactive, foldable data structure.
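
As a small illustration of the built-in support mentioned above, the sketch below parses and re-serializes a JSON string using only the standard `JSON.parse` and `JSON.stringify` functions. The record and its field names (`name`, `features`, `type`, `start`, `end`) are made up for this example and are not taken from the GBSON definition.

```typescript
// Hypothetical record; the field names are illustrative, not GBSON's.
const text =
  '{"name":"example","features":[{"type":"gene","start":1,"end":90}]}';

// No custom parser is needed: JSON.parse is part of the language runtime.
const parsed = JSON.parse(text);
const featureType: string = parsed.features[0].type;
console.log(featureType);

// Writing the data back out is equally direct; the third argument
// requests two-space indentation for human readability.
const roundTrip = JSON.stringify(parsed, null, 2);
console.log(roundTrip);
```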

 

Existing approaches

Several JSON-based annotation formats/converters already exist, such as:

However, our approach provides a strict type definition (see below), which allows validation, enables IDE support, and avoids ambiguity.

 

Type definition

Using TypeScript type definitions, we can define the format completely and readably. This definition can be used to validate whether a given JSON file is also a valid GBSON file. The current type definition can be found here: https://github.com/lehwark/GBSON/blob/master/GBSON.d.ts.
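
The general idea of validating parsed JSON against a TypeScript type can be sketched as follows. This is only an illustration under stated assumptions: `SketchFeature` is a hypothetical, heavily simplified shape invented for this example and is not the real GBSON definition (see GBSON.d.ts for that), and `isSketchFeature` is a hand-written check, not a tool shipped with GBSON.

```typescript
// Hypothetical, simplified shape for illustration only; NOT the real GBSON definition.
interface SketchFeature {
  type: string;
  start: number;
  end: number;
  children?: SketchFeature[];
}

// Runtime type guard: TypeScript types are erased at compile time,
// so a value coming out of JSON.parse has to be checked explicitly.
function isSketchFeature(value: unknown): value is SketchFeature {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  if (typeof v.type !== "string") return false;
  if (typeof v.start !== "number" || typeof v.end !== "number") return false;
  if (v.children === undefined) return true;
  // Nested features are validated recursively.
  return Array.isArray(v.children) && v.children.every(isSketchFeature);
}

const good: unknown = JSON.parse(
  '{"type":"gene","start":1,"end":90,"children":[{"type":"CDS","start":1,"end":90}]}'
);
const bad: unknown = JSON.parse('{"type":"gene","start":"1","end":90}');

console.log(isSketchFeature(good)); // true
console.log(isSketchFeature(bad));  // false: "start" is a string, not a number
```

Once a value has passed such a check, the compiler and the IDE treat it as the declared type, which is where the editor support and early error detection come from.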

 

Examples

GeSeq already creates GBSON output alongside the GenBank output – an example output can be found here: https://chlorobox.mpimp-golm.mpg.de/GBSON-Example.json.

 

Expressing nested features

GBSON allows us to solve the aforementioned problems and express nested structures directly, for example: