GeSeq - Documentation

Quickstart for the annotation of chloroplast sequences

Recommended steps are in bold; for more details, please see below.

1) "FASTA file(s) to annotate"

Upload your nucleic acid FASTA sequence(s)
Select "circular sequence(s)" if applicable

2) Select "Options"

"Generate multi-FASTAs" will generate multi-FASTA files for the gene classes "CDS" (protein-coding regions), "rRNA" and "tRNA"
"Generate codon-based alignments" will generate codon-based alignments for the annotated CDSs. If desired, you can "Include references" in the alignments

3) Select "Annotation"

If desired, select "Annotate plastid IR"
Select a third-party software for de novo tRNA annotation. For chloroplast genomes we recommend ARAGORN.
If desired, select "Additional HMMER profile search": "Chloroplast (CDS & rRNA)"

4) Select "BLAT Reference Sequences"

Please note that the annotation of the three different gene classes "CDS" (protein coding regions), "tRNA" and "rRNA" depends on different references:

Reference	CDS	tRNA	rRNA
MPI-MP chloroplast reference set	yes	-	yes
Server References (NCBI RefSeq)	yes	yes	yes
Custom GenBank/ENA	yes	yes	yes
Custom FASTA Nucleotide (CDS)	yes	-	-
Custom FASTA Nucleotide (tRNA, rRNA, primer, other DNA or RNA)	-	yes	yes

Thus, for a typical GeSeq job, you would:
- Activate "MPI-MP chloroplast reference set" and
- Select at least one tRNA annotation software
- Optional: Select one or more GenBank files from the "Server Reference" menu
  - Press the "Select References" button; the selection pop-up opens
  - You can use the "Search" box to search for species or taxa; usually you would select species closely related to your query
  - Chloroplast genomes are labelled in green, mitochondrial genomes in red
  - To select a genome or taxon, press the little "ADD" button
  - When finished, press the 'Ok' button at the bottom of the pop-up

5) Press the "Submit" button

6) Retrieve results

You can download all results as a ZIP file by pressing the small floppy disk symbol
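
The reference table in step 4 can also be read as a lookup from reference type to the gene classes it covers. The following is a minimal sketch, not part of GeSeq itself: the dictionary is transcribed from the table, while the function name `missing_gene_classes` and its `trna_predictor_selected` parameter are hypothetical and only serve to check a planned job for gene classes that would remain unannotated.

```python
# Gene classes each reference type can annotate, transcribed from the table in step 4.
REFERENCE_COVERAGE = {
    "MPI-MP chloroplast reference set": {"CDS", "rRNA"},
    "Server References (NCBI RefSeq)": {"CDS", "tRNA", "rRNA"},
    "Custom GenBank/ENA": {"CDS", "tRNA", "rRNA"},
    "Custom FASTA Nucleotide (CDS)": {"CDS"},
    "Custom FASTA Nucleotide (tRNA, rRNA, primer, other DNA or RNA)": {"tRNA", "rRNA"},
}

GENE_CLASSES = {"CDS", "tRNA", "rRNA"}


def missing_gene_classes(selected_references, trna_predictor_selected=False):
    """Return the gene classes that none of the selected sources would annotate.

    Hypothetical helper: a de novo tRNA predictor (e.g. ARAGORN) is treated
    as covering the tRNA class.
    """
    covered = set()
    for name in selected_references:
        covered |= REFERENCE_COVERAGE[name]
    if trna_predictor_selected:
        covered.add("tRNA")
    return GENE_CLASSES - covered
```

For example, selecting only the "MPI-MP chloroplast reference set" leaves tRNAs uncovered, which is why the typical job above also activates a tRNA annotation software.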

Overview

GeSeq was primarily designed for the rapid and accurate annotation of plant organelle genome sequences, plastid genomes in particular. However, GeSeq is highly customizable and also capable of annotating any other sequence. For instance, you could use GeSeq to annotate your plasmid collection with your own reference sequence database.

GeSeq analyses the input sequence(s) by comparing them against a fully customizable reference database using BLAT. We chose BLAT because it deals very well with exon-intron boundaries. Protein-coding genes are annotated by translated BLAT searches, RNA-coding genes and DNA features by standard BLAT searches. In order to provide a high-quality annotation of chloroplast protein-coding genes, we additionally provide a manually curated reference database that covers a wide taxonomic range. The output of GeSeq is a standard GenBank or GFF3 file. Additionally, GeSeq creates a JSON output containing the same information as the GenBank file. For an example, see here.

Please take the time to manually confirm and, if necessary, correct GeSeq's output. Chloroplast genes that are typically difficult to annotate are highly divergent (like ycf1 or ycf2), possess tiny exons (like petB or petD) or are trans-spliced (like rps12).

In addition to the above described basic annotation function, several additional options can be invoked:

The prediction of tRNAs by the third-party predictors tRNAscan-SE, ARAGORN and ARWEN
The generation of multi-FASTA files for the GenBank classes gene, CDS, rRNA and tRNA
The generation of codon-based multiple alignments of the GenBank class CDS by the third-party tools TranslatorX & MUSCLE
The control of CDS and rRNA annotations by an HMMER search using profiles built from manually curated MSAs (currently only for chloroplasts)

Sequence submission

You can upload nucleic acid FASTA files to GeSeq by pressing the "Add Files" button. You can submit single and multi-sequence FASTA files. Please note that GeSeq will process each FASTA sequence as an independent job, no matter whether it is provided in several single FASTA files or in a single multi-FASTA file.
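
To illustrate what "each FASTA sequence" means for a multi-FASTA file, here is a minimal sketch of splitting FASTA text into individual records. It is not GeSeq's own parser; the function name `read_fasta` is hypothetical, and text before the first header line is simply ignored.

```python
def read_fasta(text):
    """Yield (header, sequence) tuples from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Each tuple produced this way corresponds to one sequence that would be annotated on its own.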

Note: All sequence letters in your nucleic acid FASTA file that do not comply with the IUPAC code, including gaps ("-"), will be removed. An exception is "U" in RNA sequences, which is converted to "T".
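
The clean-up described in this note can be sketched as follows. This is an illustration only, under the assumptions that the IUPAC nucleotide letters are matched case-insensitively and that the cleaned sequence is returned in upper case; the name `clean_sequence` is hypothetical and GeSeq's actual handling may differ in these details.

```python
# IUPAC nucleotide codes for DNA (gap characters are deliberately not included).
IUPAC_DNA = set("ACGTRYSWKMBDHVN")


def clean_sequence(seq):
    """Convert U to T and drop every character that is not an IUPAC DNA letter."""
    cleaned = []
    for ch in seq:
        upper = ch.upper()
        if upper == "U":
            upper = "T"  # RNA input: uracil becomes thymine
        if upper in IUPAC_DNA:
            cleaned.append(upper)
    return "".join(cleaned)
```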

Note: If you want to annotate multiple contigs that represent a single chloroplast genome or an organellar transcriptome, e.g. derived from DNA-seq or RNA-seq data, we recommend submitting the contigs from one sequencing project as a single job and activating "Generate multi-FASTAs". In this way, the job results will include all annotated contigs as well as global multi-FASTA files for genes, CDSs, tRNAs and rRNAs. The whole job can be downloaded as a single ZIP file.

Annotation

GeSeq uses different annotation pipelines for different gene classes.

Protein coding genes

For protein-coding genes, GeSeq uses CDS sequences that it extracts from your selected reference GenBank files or that are provided in FASTA files. Both the query and the reference sequences are translated in all six frames and compared to each other by BLAT. You can select the similarity cut-off for translated BLAT searches; the default value is 0.25. This value is likely too low when your reference sequences are very similar to the FASTA sequence submitted for annotation, which can result in the annotation of gene fragments ("GENE-fragment") from distant species. If you observe this effect, simply raise the cut-off for translated BLAT searches until the spurious hits disappear.
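
For readers unfamiliar with six-frame translation, the following minimal sketch shows the idea: three reading frames on the forward strand and three on the reverse complement. It assumes the standard codon assignments and translates unknown or ambiguous codons as "X"; translated BLAT performs this step internally, so the function names here are purely illustrative.

```python
from itertools import product

# Standard codon assignments, in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown or ambiguous codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames
```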

tRNA and rRNA genes

For tRNA and rRNA genes, GeSeq uses the corresponding nucleotide sequences provided by GenBank or FASTA files and compares them to the query by BLAT. You can select the similarity cut-off for BLAT searches; the default value is 0.85.

In addition, tRNA genes can be predicted by ARAGORN, ARWEN and tRNAscan-SE (see below).

You can also use custom FASTA files for the annotation of other sequence elements, such as promoters (please see options).

Additional HMMER profile search

In addition to GeSeq's standard BLAT-based best-match approach, you can run an "Additional HMMER profile search" by selecting "Chloroplast (CDS + rRNA)" in the "Annotation" field. If selected, GeSeq will run nhmmer using HMMER profiles for each CDS and rRNA of GeSeq's reference set and show the hits (profile envelope coordinates) in the resulting GenBank file as "misc_features". This is intended as a control for the BLAT-based best-match annotation, and "misc_features" can conveniently be shown or hidden in most GenBank viewers. However, if you run the HMMER search alone (without selecting any reference in the "BLAT Reference Sequences" field), then HMMER hits will be written as genes, i.e. annotated as CDS and rRNA. Please note that introns are currently not written as a result of an HMMER search.

tRNA annotation by tRNAscan-SE, ARAGORN or ARWEN

When activated, GeSeq will additionally call tRNAscan-SE, ARAGORN and/or ARWEN for the annotation of tRNA genes. The default parameters of tRNAscan-SE and ARAGORN are set for plant organelles. ARWEN, in turn, should be used for tRNA prediction in metazoan mitochondrial genomes. tRNA genes annotated by tRNAscan-SE, ARAGORN and/or ARWEN will appear as additional entries in the final GenBank or GFF3 file.

Circular sequence(s)

When activated, GeSeq will simulate a circular sequence by appending a copy of the first 10,000 bp to the end of the submitted sequence. This enables GeSeq to annotate genes that span the ends of the submitted linear sequence. Since tRNAscan-SE lacks a genuine sequence circularization feature, the simulated circularization might lead to tRNA predictions outside the range of a submitted sequence. These hits are displayed in the unfiltered tRNAscan-SE output table but are removed by GeSeq in the following processing steps.
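
The simulated circularization can be sketched in a few lines. The 10,000 bp overlap is taken from the description above; everything else is an assumption made for illustration, in particular the rule used to map hits back (hits starting inside the appended copy are dropped, hits running past the original end are wrapped around). The function names are hypothetical and GeSeq's actual filtering may work differently.

```python
OVERLAP = 10_000  # length of the copy appended to the end, as described above


def simulate_circular(seq, overlap=OVERLAP):
    """Append a copy of the first `overlap` bases to the end of the sequence."""
    return seq + seq[:overlap]


def map_hit_to_circle(start, end, original_length):
    """Map a 1-based, inclusive hit on the extended sequence back to the circle.

    Assumed rule: a hit starting beyond the original sequence lies entirely in
    the appended copy and is dropped (returns None). A hit that runs past the
    original end is wrapped, so end < start marks a feature spanning the origin.
    """
    if start > original_length:
        return None
    if end > original_length:
        end -= original_length
    return start, end
```

With an 8 bp toy sequence, a hit at positions 7 to 10 of the extended sequence would be reported as 7 to 2 on the circle under this rule.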

Annotate plastid IR

When activated, GeSeq will annotate the largest pair of identical inverted repeats found, provided that the repeats are longer than 200 bp.

Reference sequences

In general, GeSeq uses four types of references: manually curated reference sets, GenBank files, FASTA files with protein-coding nucleotide sequences (CDS) and FASTA files containing non-protein-coding nucleotide sequences, such as tRNAs, rRNAs, or primer binding sites. GeSeq does not use protein sequences. Never upload them.

From all reference types, protein-coding and non-protein-coding sequences will be collected and assembled into separate databases for translated and standard BLAT searches, respectively.

MPI-MP chloroplast reference set

GeSeq is equipped with a manually curated reference set for chloroplast genomes that spans a wide taxonomic range. The set includes all chloroplast protein-coding sequences (CDS) and rRNAs.

Please note that the set currently does not include tRNAs. Thus, if you want to do a de novo annotation of your chloroplast sequences, you need to additionally select a third-party tRNA annotation tool (ARAGORN, ARWEN or tRNAscan-SE), an NCBI reference that includes tRNAs (see below), or upload an appropriate reference file (GenBank or non-protein-coding FASTA).

NCBI references

GeSeq allows you to select GenBank files from the NCBI Organelle Genome Database (RefSeq). Our phylogenetic tree is searchable by free text. Please note that several database entries do not follow an up-to-date nomenclature for chloroplast genes. If one of those is used, the same gene may be annotated multiple times by different references.

Custom GenBank/ENA references

Custom GenBank/ENA files, or NCBI or ENA entries in the GenBank or ENA format that are not present in the RefSeq database, can be uploaded as references by the user. GeSeq accepts multi-GenBank and -ENA files.

Custom FASTA references

In addition, you may use nucleotide (multi-)FASTA files. Please note that non-protein-coding and protein-coding (CDS) FASTA files are handled separately by GeSeq since they are subject to BLATN or BLATX, respectively. Hence, they must be uploaded in separate files.

If you have multiple references for the same gene or feature in the FASTA format, i.e. the same gene but from different organisms, you should use the following syntax for the header:

>gene/feature_source

"gene/feature" is the name of the gene or feature that will be displayed in the final annotation. "source" can be an accession number, a species name or voucher information that describes the origin of the reference in the GenBank output. So, typical headers look like:

>psaA_AJ271079 or >psaA_Oenothera

The use of this syntax is necessary in order to prevent multiple annotations of the same gene/feature in your input sequence by multiple references.

Please note that NCBI FASTA Nucleotide headers of CDS sequences will be automatically converted into the above described format by GeSeq.
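
To make the "gene/feature_source" convention concrete, here is a minimal sketch of how such a header could be split. It assumes that the first underscore separates the gene or feature name from the source, so that sources which themselves contain underscores stay intact; this splitting rule and the function name `parse_reference_header` are assumptions for illustration, not a description of GeSeq's internal parser.

```python
def parse_reference_header(header):
    """Split a 'gene_source' FASTA header into (gene, source).

    Assumption: the first underscore is the separator. Headers without an
    underscore yield a source of None.
    """
    name = header.lstrip(">").strip()
    gene, separator, source = name.partition("_")
    return gene, (source if separator else None)
```

Under this assumption, ">psaA_AJ271079" and ">psaA_Oenothera" both resolve to the gene name "psaA", which is what allows references from different organisms to be recognized as the same gene.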

An exception to this header format exists for primer binding sites. Here, the FASTA header should simply be the name of the primer, for example

>primer1

and the name of the FASTA file should start with "primer_", e.g. "primer_file.fas". Then GeSeq annotates hits as primer binding sites.

Moreover, if your FASTA file name starts with "tRNA_" or "rRNA_" (e.g. tRNA_file.fas or rRNA_file.fas), hits from those files will be annotated as tRNAs or rRNAs, respectively. If no prefix is provided (e.g. file.fas), hits will be annotated as "misc_feature".
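
The file name rules above amount to a simple prefix lookup, sketched below. The prefixes and the "misc_feature" fallback come from the text; the function name is hypothetical, the match is assumed to be case-sensitive, and "primer_bind" is used here as the GenBank feature key for primer binding sites, which is an assumption about how GeSeq labels such hits.

```python
import os

# File name prefix -> feature type assigned to hits from that file.
# "primer_bind" is an assumed label for primer binding sites.
PREFIX_TO_FEATURE = (
    ("primer_", "primer_bind"),
    ("tRNA_", "tRNA"),
    ("rRNA_", "rRNA"),
)


def feature_type_for_file(path):
    """Return the feature type implied by the prefix of a reference FASTA file name."""
    name = os.path.basename(path)
    for prefix, feature in PREFIX_TO_FEATURE:
        if name.startswith(prefix):
            return feature
    return "misc_feature"  # no recognized prefix
```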