Prokaryotic Annotation Pipeline
Overview
Since 1995, when the former TIGR published the genome of Haemophilus influenzae, more than 400 prokaryotic genomes have been completely sequenced, producing a flood of genomic data. A vital component of the data produced by any genome sequencing project is the annotation attached to the features of the genome. In general, annotation includes the assignment of attributes such as functional name, gene symbol, and biological role category to features such as proteins and RNAs. Assigning putative functions to newly predicted proteins is complex and requires integrating diverse sources of information. JCVI has developed highly refined systems for the annotation and analysis of genomic sequence data and has made more than 70 complete prokaryotic genomes publicly available.
The automated Prokaryotic Annotation Pipeline was developed to generate on-demand ORF prediction and functional annotation for prokaryotic genomes. Built on existing technologies at JCVI, the pipeline is run through a modular web interface (known as Ergatis) and an XML-based workflow system for automatic, high-throughput, parallel computation. A round of non-coding RNA prediction is done using tools such as tRNAscan-SE and BLAST. Gene finding is done using a self-training, iterated Glimmer3 analysis. The predicted genes are then analyzed for overlaps, and homology-based evidence is gathered using a combination of hidden Markov model searches and BLAST. Roles and gene symbols are assigned to the predictions based on the above analyses, common names, GO terms, and EC numbers. The genes are also translated and run through transcript-level analyses, including a COG analysis, motif finding, and signal peptide identification. The pipeline produces a functionally annotated genome, including RNAs and various genome characteristics, which serves as a stepping stone for further analysis.
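To make the overlap analysis step more concrete, the following is a minimal Python sketch of one way overlapping gene predictions could be detected. It is not the pipeline's actual implementation: the `Gene` record, the `find_overlaps` function, the `min_overlap` threshold, and the example coordinates are all hypothetical and chosen only for illustration. The sketch also ignores strand when comparing intervals, which a real overlap-resolution step would likely take into account.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """A predicted gene as a closed interval (hypothetical data model)."""
    gene_id: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"; stored but not used by this sketch


def overlap_length(a: Gene, b: Gene) -> int:
    """Number of bases shared by two closed intervals (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def find_overlaps(genes: list[Gene], min_overlap: int = 1) -> list[tuple[str, str, int]]:
    """Return (id_a, id_b, shared_bases) for each pair sharing >= min_overlap bases."""
    ordered = sorted(genes, key=lambda g: (g.start, g.end))
    hits = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start > a.end:
                break  # genes are sorted by start, so no later gene can overlap a
            shared = overlap_length(a, b)
            if shared >= min_overlap:
                hits.append((a.gene_id, b.gene_id, shared))
    return hits


# Made-up coordinates, for illustration only.
predictions = [
    Gene("orf1", 100, 400, "+"),
    Gene("orf2", 380, 900, "-"),
    Gene("orf3", 950, 1200, "+"),
]
overlaps = find_overlaps(predictions)
```

Here `overlaps` contains a single entry reporting that `orf1` and `orf2` share 21 bases. Raising `min_overlap` would let short overlaps pass unreported, which is one possible policy for a compact prokaryotic genome; the threshold the real pipeline uses, if any, is not stated here.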