Variant Classifier
Generating the Annotation File (the .coding_info file)
The annotation file (file extension ".coding_info") provides the Variant Classifier with the set of features against which the query variants are collocated. The full manual, Manuals/The_CodingInfo_FileFormat.doc, describes the annotation file format in detail. If you are working with an organism that is currently annotated in Ensembl, you can ignore the Coding Info file format and simply use the script Extract_Coding_Info.pl, which uses the Ensembl API to download the necessary information automatically. If you are working with an organism without Ensembl support, you will need to understand the file format by reading the documentation and studying the examples provided for the coronavirus or those generated with the Extract_Coding_Info.pl script. If you need help generating this file by hand, please contact kli at jcvi.org. The rest of this section describes the two tools you can use to extract annotation from Ensembl.
Extracting Annotation from Ensembl
If you plan to extract annotation from Ensembl, you must first install the Ensembl APIs; see Downloading and Installing Variant Classifier. Make sure that you not only download the APIs but also install them and configure your environment variables. See our Frequently Asked Questions if you run into installation problems.
Annotation at Ensembl is versioned; see the Ensembl Release Cycle web page for more information. For annotation extraction, this means that you need to identify the gene "Build" and the genome "Assembly" version you want. Unfortunately, the Perl APIs must be updated to match each new Build/Assembly, and new APIs are not backward compatible. Furthermore, only a limited number of previous Builds/Assemblies are kept online.
Latest Databases
To display the current databases that are still kept online, you can use the script:
Show_Latest_Databases.pl
This script logs on to the Ensembl database server and requests a list of all current databases. The -o option formats the output as option strings for the Extract_Coding_Info.pl script, which you will run next.
Example output:
"./Show_Latest_Databases.pl"
Copyright (c) 2005 J. Craig Venter Institute. All rights reserved.
3358
You can use the -o option to print out a preformatted option string.
aedes aegypti: 48 1b
aedes aegypti: 49 1b
aedes aegypti: 50 1c
aedes aegypti: 51 1c
aedes aegypti: 52 1d
aedes aegypti: 53 1d
aedes aegypti: 54 1d
aedes aegypti: 55 1d
anolis carolinensis: 53 1
anolis carolinensis: 54 1
...
Output with the -o option:
"./Show_Latest_Databases.pl"
Copyright (c) 2005 J. Craig Venter Institute. All rights reserved.
3358
-O "aedes aegypti" -B 48 -A 1b
-O "aedes aegypti" -B 49 -A 1b
-O "aedes aegypti" -B 50 -A 1c
-O "aedes aegypti" -B 51 -A 1c
-O "aedes aegypti" -B 52 -A 1d
-O "aedes aegypti" -B 53 -A 1d
-O "aedes aegypti" -B 54 -A 1d
-O "aedes aegypti" -B 55 -A 1d
-O "anolis carolinensis" -B 53 -A 1
-O "anolis carolinensis" -B 54 -A 1
...
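If you want to pick a build/assembly programmatically, the listing above can be parsed with a few lines of Python. This is only a sketch: the function names (parse_database_line, latest_build, option_string) are hypothetical helpers and not part of the Variant Classifier distribution, and the parser assumes the plain "organism: build assembly" line layout shown in the example output.

```python
import re

# Matches lines such as "aedes aegypti: 48 1b" (organism: build assembly).
# Assumption: this layout is taken from the example output above.
LINE_RE = re.compile(r"^(?P<organism>[^:]+):\s+(?P<build>\d+)\s+(?P<assembly>\S+)$")


def parse_database_line(line):
    """Return (organism, build, assembly), or None for banner and other lines."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match.group("organism").strip(), int(match.group("build")), match.group("assembly")


def latest_build(lines, organism):
    """Return the (build, assembly) pair with the highest build number, or None."""
    entries = [p for p in map(parse_database_line, lines) if p and p[0] == organism]
    if not entries:
        return None
    _, build, assembly = max(entries, key=lambda entry: entry[1])
    return build, assembly


def option_string(organism, build, assembly):
    """Format the options in the same form that the -o output uses."""
    return f'-O "{organism}" -B {build} -A {assembly}'
```

For example, feeding the lines of the listing to latest_build(lines, "aedes aegypti") and passing the result to option_string would reproduce the last "aedes aegypti" line of the -o output.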
Annotation Extraction
An example of how to extract annotation is available in Examples/human/annotation_extraction/example.csh.
Use the script:
Extract_Coding_Info.pl
This script will extract both the reference FASTA file and the Annotation in the correct format. To run, first cd into the Examples/human/annotation_extraction directory.
../../../Extract_Coding_Info.pl \
-c 11 -b 64030000 -e 64050000 \
-O "homo sapiens" -B 55 -A 37 \
-f bcl2 \
-x
As you can see, the first line of options specifies the chromosome and the begin/end coordinates of the region to extract. The begin/end coordinates are in Ensembl's 1-based residue coordinates; in other words, the first base in the chromosome has an offset of 1. The second line specifies the organism, build, and assembly; use the output of Show_Latest_Databases.pl to determine the build/assembly that you would like to use. The -f option is the output file name. The -x option is important if any genes are only partially covered by the begin/end coordinates you have specified: it automatically expands the extraction range so that any gene overlapping your original begin/end range is fully extracted. For example, if you ask for an extraction of 100-200 and a gene spans 50-150, then the -x option will automatically extract the range 50-200.
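The range expansion performed by the -x option can be illustrated with a short Python sketch. It is not the code used by Extract_Coding_Info.pl; expand_range is a hypothetical function that only mirrors the behavior described above, assuming 1-based inclusive coordinates and that only genes overlapping the original range trigger an expansion.

```python
def expand_range(begin, end, genes):
    """Widen [begin, end] so it fully covers every gene that overlaps it.

    genes is an iterable of (gene_begin, gene_end) pairs in the same
    1-based, inclusive coordinates as begin and end.
    """
    new_begin, new_end = begin, end
    for gene_begin, gene_end in genes:
        # Overlap is tested against the ORIGINAL range, not the widened one.
        overlaps = gene_begin <= end and gene_end >= begin
        if overlaps:
            new_begin = min(new_begin, gene_begin)
            new_end = max(new_end, gene_end)
    return new_begin, new_end


# The example from the text: request 100-200, gene spans 50-150.
print(expand_range(100, 200, [(50, 150)]))  # (50, 200)
```

A gene that lies entirely outside the requested range, such as 300-400 here, leaves the range unchanged.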
The result from the extraction will be the annotation (.coding_info file) and sequence (.fasta file). At this point you are ready to run the VariantClassifier.
Creating your own Annotation file
Creating your own annotation file can be trivial for small genomes, such as viruses, but is significantly more difficult for larger genomes. If you have created annotation for an organism manually and would like to share it with others, please send us an e-mail, and we will credit you and include it on SourceForge.
Full documentation on creating your own annotation can be found in the Manuals/The_CodingInfo_FileFormat.doc file. The following are selected parts of the extracted annotation from Examples/human/annotation_extraction/bcl2.coding_info.
EXTRACTION 11 64019041 64052176
HUGO_ID ENSG00000149782 PLCB3
PROT_DESC ENSG00000149782 1-phosphatidylinositol-4,5-bisphosphate phosphod ...
EXON ENSG00000149782 ENST00000325234 ENSE00001228515 0 179 1
EXON ENSG00000149782 ENST00000325234 ENSE00001228317 3382 3469 1
EXON ENSG00000149782 ENST00000325234 ENSE00001195828 3664 3744 1
EXON ENSG00000149782 ENST00000325234 ENSE00001195819 3826 3880 1
...
CDS ENSG00000149782 ENST00000325234 11 15987
...
SNP rs2855398 2434 2435 1 G/C
SNP rs2532590 2437 2438 1 G/C
SNP rs2515727 2490 2491 1 G/C
SNP rs12421615 2563 2564 1 G/A
SNP rs2510068 2577 2578 1 G/C
...
PROTEIN ENST00000313074 ENSP00000321698 Low_complexity 354 378
PROTEIN ENST00000313074 ENSP00000321698 ?:PRO_rich 356 405
REPEAT dust 1237 1255
REPEAT dust 4043 4091
REPEAT dust 5428 5452
REPEAT dust 6050 6080
Generally, the first column is the keyword identifier, which tells us what kind of feature is described on the rest of the line. The second column is the feature key, a unique identifier for that annotation, whether it is an exon, CDS, SNP, protein, or repeat. For the EXON keyword, the next columns describe the transcript/exon structure. The VariantClassifier uses the exon/transcript information to determine upstream/downstream gene regions, as well as intronic regions. If your gene does not have exons, you can fill the exon ID column with the transcript ID. If your gene does not have multiple transcripts, you can fill the transcript ID column with the gene identifier. See the coronavirus example, Examples/corona_virus/SARS-WT-annotated.coding_info.
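To make the column layout concrete, here is a minimal Python sketch that splits an annotation line into its keyword, feature key, and remaining columns, and then interprets an EXON line. The function names are hypothetical, and the sketch assumes whitespace-separated columns as they appear in the excerpt above; the authoritative definition of the delimiter and columns is in Manuals/The_CodingInfo_FileFormat.doc.

```python
def split_record(line):
    """Split one annotation line into (keyword, feature_key, remaining_columns)."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"expected a keyword and a feature key: {line!r}")
    return fields[0], fields[1], fields[2:]


def parse_exon(line):
    """Interpret an EXON line: gene, transcript, exon, begin, end, orientation."""
    keyword, gene_id, rest = split_record(line)
    if keyword != "EXON" or len(rest) != 5:
        raise ValueError(f"not a well-formed EXON line: {line!r}")
    transcript_id, exon_id, begin, end, orientation = rest
    return {
        "gene": gene_id,
        "transcript": transcript_id,
        "exon": exon_id,
        "begin": int(begin),        # local coordinates
        "end": int(end),
        "orientation": int(orientation),
    }


exon = parse_exon("EXON ENSG00000149782 ENST00000325234 ENSE00001228515 0 179 1")
print(exon["transcript"], exon["begin"], exon["end"])
```

Lines whose free-text column contains spaces, such as PROT_DESC, would need the description rejoined from the remaining columns.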
Following the feature key, the structural columns give the begin, end, and orientation of the feature in local coordinates. For example, if you extract a chromosome from base 100 to base 200, then the local coordinate of the first base of your extraction is 0. Remember that all local coordinates used in the VariantClassifier are 0-space-based. See Coordinate Systems, and refer to the manual for additional details.
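The conversion between Ensembl's 1-based coordinates and the local 0-space-based coordinates can be sketched in Python as follows. The function names are hypothetical. The begin offset follows directly from the example above; the end convention (end = begin + length) is an assumption inferred from the SNP lines in the excerpt, where a single-base SNP spans, for instance, 2434 to 2435. Check it against Coordinate Systems in the manual before relying on it.

```python
def to_local(extract_begin, base_begin, base_end):
    """Convert a 1-based inclusive base range to 0-space-based local coordinates.

    extract_begin is the 1-based chromosome position of the first extracted base.
    """
    local_begin = base_begin - extract_begin
    local_end = base_end - extract_begin + 1  # assumed: end marks the space after the last base
    return local_begin, local_end


def to_chromosome(extract_begin, local_begin, local_end):
    """Inverse of to_local: recover the 1-based inclusive base range."""
    return local_begin + extract_begin, local_end + extract_begin - 1


# Extraction starting at base 100: the first base sits between spaces 0 and 1.
print(to_local(100, 100, 100))  # (0, 1)
```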