GB2sequin - Documentation
The GB2sequin program aims to provide the regular wet lab user with an easy-to-use web tool to convert custom annotations in the GenBank or ENA format into the NCBI submission format Sequin. Additionally, it provides five-column feature tables and FASTA files for BankIt or for updating existing GenBank entries.
GB2sequin parses the GenBank/ENA file and converts the annotation into a five-column, tab-delimited feature table (the annotation table). It further extracts the nucleic acid sequence from the GenBank/ENA file and writes it, together with the mandatory source and sequence information of an NCBI record (see below), into a FASTA file. These two files can already be used for submission through BankIt or to update an existing GenBank record. To create Sequin files for direct submission, GB2sequin invokes tbl2asn. For this, it combines the annotation table, the FASTA file, and additional files that contain sequence source or author submission information (see below). As an optional feature, GB2sequin can add or edit gene product names of coding sequences (CDS), tRNAs, and/or rRNAs in the annotation. This might be helpful for the revision of larger genomes. Finally, GB2sequin produces several output files for quality control (see below).
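The conversion of parsed features into an annotation table can be sketched as follows. This is a minimal illustration, not GB2sequin's actual code: the function name `feature_table` and the dictionary layout of a feature are assumptions made for this example, and only single-interval features are handled. The layout follows the NCBI five-column convention, in which a feature line carries start, stop, and feature key, qualifier lines start with three empty columns, and minus-strand features are written with start greater than stop.

```python
def feature_table(seq_id, features):
    """Render features as a five-column, tab-delimited feature table.

    `features` is a list of dicts with the hypothetical keys
    'start', 'end', 'strand', 'key', and 'qualifiers'.
    """
    lines = [f">Feature {seq_id}"]
    for feat in features:
        start, stop = feat["start"], feat["end"]
        # Minus-strand features are written with start greater than stop.
        if feat.get("strand") == -1:
            start, stop = stop, start
        # Columns 1-3: start, stop, feature key.
        lines.append(f"{start}\t{stop}\t{feat['key']}")
        # Columns 4-5: qualifier name and value, after three empty columns.
        for name, value in feat.get("qualifiers", []):
            lines.append(f"\t\t\t{name}\t{value}")
    return "\n".join(lines) + "\n"


example = [
    {"start": 1, "end": 2253, "strand": 1, "key": "gene",
     "qualifiers": [("gene", "psaA")]},
    {"start": 100, "end": 300, "strand": -1, "key": "tRNA",
     "qualifiers": [("product", "tRNA-Met")]},
]
table_text = feature_table("seq1", example)
```

Features with several intervals (for example, CDS with introns) need additional start/stop lines and are deliberately left out of this sketch.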
Several file types can be uploaded to GB2sequin:
GenBank/ENA file: This file is mandatory and must contain the LOCUS information (either an accession number or a user-defined identifier), the sequence FEATURES according to the standards of the International Nucleotide Sequence Database Collaboration (INSDC), and the ORIGIN, i.e. the nucleic acid sequence in the GenBank format. Currently, GB2sequin does not accept multi-record GenBank or ENA files. Please note that FEATURES, or qualifiers included therein, that do not conform to the INSDC syntax might be modified or removed by tbl2asn (also see below). All other entries in the GenBank file, such as submitter’s information, literature references, definition line, or source information, are ignored and must be provided separately by the following files or input options:
Author Submission Template: This file contains the submitter’s information and literature references, which will later be displayed in the final database entry. The template can be created at NCBI. If no Author Submission Template is provided, GB2sequin will use "Unknown Author" as the default. Submitter’s information and literature references can also be modified later in the Sequin file.
Source Modifier Table: This optional upload can contain non-mandatory information for the sequence source description, such as the collection site of an organism, voucher information, or a note. Please note that source modifiers are subject to a controlled vocabulary. The data can be provided either in the source table format *.src or as two-column, tab-delimited text. Again, sequence source modifiers can also be added manually to the Sequin file prior to submission.
Gene Product Specification Table: This table is optional as well and might be useful to revise and update larger genomes. With the help of a two-column, tab-delimited text file, GB2sequin will either add or change gene product names of the annotation features CDS, tRNA, or rRNA. For instance, if the Gene Product Specification Table contains the line "psaA [tab character] photosystem I P700 apoprotein A1", GB2sequin will search for "psaA" in the annotation. If no gene product name for "psaA" was provided in the original GenBank file (which is the case, e.g., for GeSeq output), GB2sequin will add "photosystem I P700 apoprotein A1" as the gene product name. If the gene product of "psaA" was specified differently in the original GenBank file, for example as "PSI-A core protein of photosystem I", this description will be replaced by "photosystem I P700 apoprotein A1". If "psaA" is missing from either the annotation or the Gene Product Specification Table, no action is taken.
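The add-or-replace behaviour of the Gene Product Specification Table can be illustrated with a short sketch. It is not GB2sequin's implementation: the function names, the dictionary layout of a feature, and the exact-match lookup on the gene name are assumptions made for this example.

```python
PRODUCT_FEATURES = {"CDS", "tRNA", "rRNA"}  # feature keys that carry products


def parse_product_table(text):
    """Read two-column, tab-delimited text into {gene name: product name}."""
    table = {}
    for line in text.splitlines():
        if "\t" not in line:
            continue  # skip blank or malformed lines
        gene, product = line.split("\t", 1)
        table[gene.strip()] = product.strip()
    return table


def apply_product_names(features, table):
    """Add or replace the product qualifier of CDS, tRNA, and rRNA features."""
    for feat in features:
        if feat["key"] not in PRODUCT_FEATURES:
            continue
        gene = feat["qualifiers"].get("gene")
        if gene in table:
            # Sets a missing product name or overwrites an existing one.
            feat["qualifiers"]["product"] = table[gene]
        # Genes absent from the table are left untouched.
    return features


table = parse_product_table("psaA\tphotosystem I P700 apoprotein A1\n")
features = [
    {"key": "CDS", "qualifiers": {"gene": "psaA",
                                  "product": "PSI-A core protein of photosystem I"}},
    {"key": "CDS", "qualifiers": {"gene": "psaB"}},
]
apply_product_names(features, table)
```

After the call, the "psaA" feature carries the product name from the table, while "psaB", which is not listed in the table, is unchanged.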
In this window, the user can add or select mandatory source and sequence information, such as source organism, molecule type, location, and genetic code, and indicate whether the sequence is complete and/or circular. In addition, the user can specify the definition line of the GenBank/ENA record.
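How such source and sequence information might end up in the FASTA header can be sketched as follows. The example assumes NCBI's convention of bracketed `[name=value]` modifiers after the sequence identifier. The function name `fasta_record` and the specific modifier names used here (`organism`, `location`, `topology`, `gcode`) are assumptions made for illustration and are not taken from GB2sequin's output.

```python
def fasta_record(seq_id, sequence, modifiers, definition=""):
    """Build a FASTA record whose header carries bracketed source modifiers."""
    parts = [f">{seq_id}"]
    # One [name=value] tag per modifier, in the order given.
    parts += [f"[{name}={value}]" for name, value in modifiers.items()]
    if definition:
        parts.append(definition)
    header = " ".join(parts)
    # Wrap the sequence at 60 characters per line.
    chunks = [sequence[i:i + 60] for i in range(0, len(sequence), 60)]
    return header + "\n" + "\n".join(chunks) + "\n"


record = fasta_record(
    "seq1",
    "ACGT" * 20,
    {
        "organism": "Arabidopsis thaliana",  # assumed modifier names
        "location": "chloroplast",
        "topology": "circular",
        "gcode": "11",
    },
    definition="Arabidopsis thaliana chloroplast, complete genome",
)
```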
GB2sequin provides an “Error and Validation Summary”, which lists syntax errors in the original annotation as identified by tbl2asn. These can include unknown qualifiers or feature names not allowed by NCBI. The tbl2asn program corrects them by removing unknown qualifiers and/or changing unknown features into misc_features. In addition, the summary reports the validation of the annotation by tbl2asn. Downloadable output files of GB2sequin are: (i) the nucleic acid sequence in FASTA format with mandatory sequence information in the FASTA header, (ii) the annotation table, and (iii) the final Sequin file for direct submission. The first two files can be used for submission through the BankIt web interface or for the update of an existing GenBank entry. The remaining files are again for quality control: (iv) The tbl2asn log file reports any syntax errors in the original annotation (see above). (v) The annotation, as it will later be displayed in NCBI, is recorded in the GenBank output. Changes in the annotation due to potential conversion errors and/or modifications can be easily identified by comparing the user’s original GenBank file with GB2sequin’s GenBank output, using the comparison function found in many common text editors and word processors such as Microsoft Word. (vi) Finally, the validation of the annotation is provided in downloadable form in the files “Validation” and “Validation Summary”. Prior to submission, the annotation errors listed therein should be corrected and the warnings checked. The most suitable program to correct these errors is Sequin, which also allows revalidation. Corrected files can be submitted directly.
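As an alternative to a text editor's comparison function, the original GenBank file and the GenBank output can be compared with a few lines of Python using the standard `difflib` module. This is a minimal sketch. The function name and the file labels are placeholders, and the two GenBank snippets are made-up examples, with `/foo` standing in for a qualifier that tbl2asn does not recognise.

```python
import difflib


def annotation_diff(original_text, output_text):
    """Return unified-diff lines between two GenBank flat-file strings."""
    return list(difflib.unified_diff(
        original_text.splitlines(),
        output_text.splitlines(),
        fromfile="original.gb",          # placeholder labels
        tofile="gb2sequin_output.gb",
        lineterm="",
        n=0,                             # show changed lines only
    ))


original = (
    "     CDS             1..90\n"
    '                     /gene="psaA"\n'
    '                     /foo="bar"\n'
)
converted = (
    "     CDS             1..90\n"
    '                     /gene="psaA"\n'
)
for line in annotation_diff(original, converted):
    print(line)
```

Lines starting with "-" are present only in the original file, and lines starting with "+" only in the GB2sequin output. Differences in the header sections of the two files are expected, since GB2sequin ignores the original header entries.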