SIFT Output
SIFT Predictions for Substitutions
Output | Description |
SIFT Score | Ranges from 0 to 1. The amino acid substitution is predicted damaging if the score is <= 0.05, and tolerated if the score is > 0.05. |
Median Info | Ranges from 0 to 4.32; ideally the number would be between 2.75 and 3.5. This is used to measure the diversity of the sequences used for prediction. A warning will occur if this is greater than 3.25 because this indicates that the prediction was based on closely related sequences. |
Seqs at Position | This is the number of sequences that have an amino acid at the position of prediction. SIFT automatically chooses the sequences for you, but if the substitution is located at the beginning or end of the protein, there may be only a few sequences represented at that position, and this column indicates this. |
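The cutoffs in the table above can be sketched in a few lines of Python. The function name `classify_sift` and its return shape are illustrative assumptions and not part of SIFT; only the 0.05 score cutoff and the 3.25 median-info warning level come from the text.

```python
def classify_sift(score, median_info):
    """Return (prediction, low_confidence) using the cutoffs described above.

    Hypothetical helper for illustration; SIFT itself reports these values.
    """
    # A score <= 0.05 is predicted damaging; anything higher is tolerated.
    prediction = "DAMAGING" if score <= 0.05 else "TOLERATED"
    # Median info above 3.25 means the aligned sequences were closely related,
    # so the prediction should be treated with caution.
    low_confidence = median_info > 3.25
    return prediction, low_confidence


# Score and median info pairs of the kind SIFT reports for a substitution.
print(classify_sift(0.59, 3.06))
print(classify_sift(0.00, 4.32))
```

A prediction flagged as low confidence corresponds to the "*Warning! Low confidence." note that SIFT attaches to its output.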
Below is example output returned when genomic variants are submitted.
Coordinates | Codons | Transcript ID | Protein ID | Substitution | Region | dbSNP ID | SNP Type | Prediction | Score | Median Info | # Seqs at position | User Comment |
1,100624830,1,T/A | ATA-tTA | ENST00000342895 | ENSP00000344470 | I121L | EXON CDS | rs34920283:A | Nonsynonymous | TOLERATED | 0.59 | 3.06 | 28 |
22,30163533,1,A/C | GAG-GcG | ENST00000330029 | ENSP00000332887 | E49A | EXON CDS | rs11554363:C | Nonsynonymous | DAMAGING | 0.03 | 3.04 | 50 |
X,10085674,1,T/C | GAT-GAc | ENST00000380861 | ENSP00000370242 | D525D | EXON CDS | rs6530368:C | Synonymous | N/A | N/A | N/A | N/A |
21,19638426,1,T/G | TTG-gTG | ENST00000338326 | ENSP00000339975 | L223V | EXON CDS | novel | Nonsynonymous | DAMAGING *Warning! Low confidence. | 0 | 4.32 | 2 |
2,230633386,1,G/A | CAG-tAG | ENST00000283943 | ENSP00000283943 | Q1910* | EXON CDS | rs1803846:A | Nonsynonymous | N/A | N/A | N/A | N/A |
2,230312220,1,G/A | CCC-CtC | ENST00000341772 | ENSP00000345229 | P433L | EXON CDS | rs17853365:A | Nonsynonymous | TOLERATED | 0.11 | 3.02 | 160 |
- The first column indicates the variant submitted. If alleles are submitted with respect to the - strand, they will be automatically converted to the + strand. Please note that if you do not submit the variant correctly, it will default to a synonymous change. One way to check: if the reference and non-reference alleles in the coordinates column now match, you most likely did not submit your variant correctly.
- The second column denotes the codon that has been changed; the bases are with respect to + mRNA orientation.
- If dbSNP has a variant overlapping at the same position, the rs ID is displayed. However, the alleles may not be the same.
- SIFT predictions are as described above.
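The allele check described in the first bullet can be automated. The sketch below assumes the coordinates field has the form chromosome,position,orientation,ref/alt, which matches the example rows; the meaning of the third field and both function names are assumptions made for illustration.

```python
def parse_coordinates(field):
    """Split a value such as '1,100624830,1,T/A' into its parts.

    The third field is assumed to be the strand orientation.
    """
    chrom, pos, orientation, alleles = field.split(",")
    ref, alt = alleles.split("/")
    return {
        "chrom": chrom,
        "pos": int(pos),
        "orientation": int(orientation),
        "ref": ref,
        "alt": alt,
    }


def looks_missubmitted(field):
    """True when the reference and non-reference alleles are identical."""
    parsed = parse_coordinates(field)
    return parsed["ref"] == parsed["alt"]


print(looks_missubmitted("1,100624830,1,T/A"))
```

Running this over the first column of a result table gives a quick list of variants worth resubmitting.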
Gene Annotation Error : We check that the Ensembl gene annotation codes for the expected Ensembl protein. For example, if an Ensembl gene is from positions 3-56 in the NCBI reference genome, we extract those DNA bases from the NCBI reference genome, translate them, and then check that the result matches the corresponding Ensembl protein sequence.
If the Ensembl gene annotation does not match the expected protein sequence, we do not annotate the coding variant. 16% of the proteins from NCBI36 had this error, and 7% from NCBI37 have this error. Therefore, if you receive this error, we recommend you resubmit with NCBI37 coordinates (after converting NCBI36 to NCBI37), and if you still receive this error, we recommend that you annotate by hand.
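The idea behind this check can be illustrated with a minimal Python sketch. It assumes the coding DNA has already been extracted as one contiguous + strand string (no introns, no reverse complement) and uses the standard genetic code; `translate` and `annotation_matches` are hypothetical names, and this is not SIFT's actual implementation.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA string; trailing partial codons are ignored.

    Raises KeyError on bases other than A, C, G, T.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def annotation_matches(dna, expected_protein):
    """Compare the translation to the expected protein, ignoring a final stop."""
    return translate(dna).rstrip("*") == expected_protein.rstrip("*")


# Codon pairs of the form shown in the Codons column (lowercase = changed base).
print(translate("ATA"), translate("tTA"))
```

The same `translate` helper reproduces the amino acid changes in the example table, for instance ATA to tTA gives I to L.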
Single Protein Output
For single protein submissions, the following output is also returned:
- A table of probabilities [Procedure]
Here is an example of one of the rows in the table.
pos | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y |
9I 0.75 | 0.71 | 0.12 | 0.39 | 0.68 | 0.35 | 0.36 | 0.30 | 0.81 | 1.00 | 0.87 | 0.24 | 0.42 | 0.28 | 0.54 | 0.76 | 0.58 | 0.58 | 0.94 | 0.02 | 0.39 |
This lists normalized probabilities for position 9I of the query sequence.
The value shown with 9I (0.75) is the fraction of sequences that are represented
at this position. In this case, 75% of the sequences had one of the 20 standard
amino acids at this position; 25% had either gaps or Xs.
The normalized probability for an I->W substitution is < 0.05,
so it is predicted deleterious and highlighted in red.
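As a small illustration, the row for position 9I can be scanned for substitutions whose normalized probability falls below the cutoff. The values are copied from the example row; the function name `deleterious_residues` is an assumption for this sketch.

```python
# Column order of the probability table.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Normalized probabilities for position 9I, copied from the example row.
ROW_9I = [0.71, 0.12, 0.39, 0.68, 0.35, 0.36, 0.30, 0.81, 1.00, 0.87,
          0.24, 0.42, 0.28, 0.54, 0.76, 0.58, 0.58, 0.94, 0.02, 0.39]


def deleterious_residues(probabilities, cutoff=0.05):
    """Return the amino acids whose normalized probability is below cutoff."""
    return [aa for aa, p in zip(AMINO_ACIDS, probabilities) if p < cutoff]


print(deleterious_residues(ROW_9I))  # ['W']
```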
- Predictions for each position
Here is an example of the output.
Predict Not Tolerated | Position | Seq Rep | Predict Tolerated |
c w d f m i y v g p s h n a l t e | 7Q | 0.95 | K Q R |
At position 7Q in the query sequence, 95% of the sequences have an amino acid
appearing at this position. K, Q, R are predicted as tolerated and are observed in
the alignment (capitalized). C, W, D, F, M, I, Y, V, G, P, S, H, N, A, L, T, E are
predicted to be deleterious because they have normalized probabilities < 0.05 and
none of these appear in the alignment (small letters). Amino acids are color coded
by class: nonpolar, uncharged polar, basic, and acidic.
- If you submit substitutions, in addition to returning the prediction, SIFT will return:
(1) The number of sequences used at the position of substitution for prediction (not counting sequences with gaps at that position)
(2) The median sequence information used to measure the diversity of the sequences used for prediction.
The median sequence information is calculated by first calculating the information at each position and then obtaining the median over all positions.
We recommend that a substitution predicted to be deleterious with median sequence information greater than 3.25, and especially greater than 3.5, be taken with caution because this prediction was based on closely related sequences. The substitution may be at a position that has not had time to evolve and is conserved by chance (and hence predicted to be deleterious) when the position can actually tolerate substitutions and, given more evolutionary time, would eventually mutate to different amino acids.
Exception: If your protein belongs to a protein family that is especially
conserved (such as histones) then using a higher median sequence information
may be fine.
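The calculation described above can be sketched as follows. This assumes the information at a position is log2(20) minus the Shannon entropy of the observed residues, which is consistent with the stated 0 to 4.32 range but is an assumption here; SIFT's own calculation may also involve sequence weighting or pseudocounts that this sketch omits. Function names are hypothetical.

```python
import math
from collections import Counter
from statistics import median

MAX_INFO = math.log2(20)  # about 4.32, the upper end of the stated range


def column_information(column):
    """Information (in bits) of one alignment column, ignoring gaps ('-')."""
    residues = [c for c in column if c != "-"]
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return MAX_INFO - entropy


def median_information(alignment):
    """Median of per-column information over equal-length aligned sequences."""
    columns = zip(*alignment)
    return median(column_information(col) for col in columns)


# Two nearly identical sequences give a high value (low diversity).
print(round(median_information(["MKPV", "MKPV", "MKPA"]), 2))
```

A fully conserved column scores log2(20), and a column with all 20 amino acids equally represented scores 0, so values near 4.32 indicate closely related sequences.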
Sample output:
Substitution at pos 2 from S to F is predicted to be DELETERIOUS with a score of 0.01.
Median sequence information: 3.44
Sequences represented at this position:3
WARNING!! This substitution may have been predicted as deleterious just because
the prediction was based on sequences too closely related. We recommend a median
sequence information <= 3.25 for reasonable accuracy and for which sequence
diversity is adequate.
Substitution at pos 60 from E to L is predicted to be DELETERIOUS with a score of 0.00.
Median sequence information: 2.72
Sequences represented at this position:116
Explanation:
The substitution S2F was predicted to be deleterious. However, a warning was given because the sequences used for prediction were not very diverse from each other.
E60L was predicted to be deleterious and had a low median sequence information, indicating that the sequences used for prediction were diverse enough.
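If many substitutions are submitted, the prediction lines in the sample output can be collected with a regular expression. This is a sketch that assumes the line format is exactly as in the sample above; the pattern and the function name `parse_prediction` are not part of SIFT.

```python
import re

# Matches lines such as:
# "Substitution at pos 2 from S to F is predicted to be DELETERIOUS with a score of 0.01."
PATTERN = re.compile(
    r"Substitution at pos (\d+) from (\w) to (\w) is predicted to be "
    r"(\w+) with a score of (\d+(?:\.\d+)?)\."
)


def parse_prediction(line):
    """Return a dict for a prediction line, or None for any other line."""
    match = PATTERN.search(line)
    if match is None:
        return None
    pos, ref, alt, call, score = match.groups()
    return {
        "substitution": f"{ref}{pos}{alt}",
        "prediction": call,
        "score": float(score),
    }


line = ("Substitution at pos 60 from E to L is predicted to be "
        "DELETERIOUS with a score of 0.00.")
print(parse_prediction(line))
```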
Changing Parameters
If SIFT is timing out:
- Check the length of your protein.
SIFT's running time is proportional to the length of the protein.
If your protein is very long (more than 500 amino acids), truncate your protein
to the region you're interested in (a particular domain or region carrying the substitution) and rerun SIFT. The optimal length is 300-500 amino acids. Don't forget to shift the location of your substitution.
- The server is busy. In this case, change the default parameters so that your
query takes less time to run and will be completed before the time limit.
You can:
- change median conservation of sequences to 3.00 (from default 2.75).
-OR-
- change the database to SWISS-PROT.
If changing these parameters will affect the quality of prediction, a
warning will be given with the predictions of the substitutions.
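For the truncation advice above, the position shift is easy to get wrong by one. A minimal sketch, assuming 1-based inclusive positions as used in substitution names such as S2F; the function name is hypothetical.

```python
def truncate_with_substitution(sequence, start, end, sub_pos):
    """Cut sequence to the 1-based inclusive region [start, end].

    Returns the fragment and the substitution position shifted to the
    fragment's own numbering.
    """
    if not (start <= sub_pos <= end):
        raise ValueError("substitution lies outside the truncated region")
    fragment = sequence[start - 1:end]
    return fragment, sub_pos - start + 1


# First 20 residues of the LacI example sequence, truncated to residues 5-12.
fragment, new_pos = truncate_with_substitution("MKPVTLYDVAEYAGVSYQTV", 5, 12, 8)
print(fragment, new_pos)
```

The residue at the shifted position in the fragment should be the same as the residue at the original position in the full sequence, which is a useful sanity check before resubmitting.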
If you are getting the error "not enough sequence information for prediction" (i.e. SIFT returns error or median sequence information > 3.25) :
- Increase the database size.
Database options
SWISS-PROT = small but high quality. Less likely to time out but may not have enough sequences for prediction.
SWISS-PROT/TrEMBL = larger than SWISS-PROT, good quality <- what I like to use
nr (nonredundant database) = largest database, but not that good quality. Increases possibility of timing out but can also give more sequence information.
If there are not enough sequences to make your prediction (getting a warning that the median sequence information > 3.25 for your substitution), increase the size of the database.
Change the database option from SWISS-PROT/TrEMBL (default) to the nonredundant database.
Examples
LacI
Submission
Paste in Sequence
>gi|2506562|sp|P03023|LACI_ECOLI LACTOSE OPERON REPRESSOR
MKPVTLYDVAEYAGVSYQTVSRVVNQASHVSAKTREKVEAAMAELNYIPNRVAQQLAGKQSLLIGVATSS
LALHAPSQIVAAIKSRADQLGASVVVSMVERSGVEACKAAVHNLLAQRVSGLIINYPLDDQDAIAVEAAC
TNVPALFLDVSDQTPINSIIFSHEDGTRLGVEHLVALGHQQIALLAGPLSSVSARLRLAGWHKYLTRNQI
QPIAEREGDWSAMSGFQQTMQMLNEGIVPTAMLVANDQMALGAMRAITESGLRVGADISVVGYDDTEDSS
CYIPPLTTIKQDFRLLGQTSVDRLLQLSQGQAVKGNQLLPVSLVKRKTTLAPNTQTASPRALADSLMQLA
RQVSRLESGQ
With Parameters:
- Database: SWISS-PROT/TREMBL
- Account for gaps?: YES
SIFT Results
A PSI-BLAST alignment with 27 sequences was returned along with normalized probabilities.
When results from tblastn on the microbial genomes were added, 54 sequences were chosen. This alignment gave similar predictions and probabilities. These probabilities were used to test SIFT performance.
HIV-1 Protease
Submission
Paste in Sequence
>HIV-1 protease
PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMSLPGRWKPKMIGGI
GGFIKVRQYDQILIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNF
With Parameters:
- Database: SWISS-PROT
- SWISS-PROT was chosen rather than SWISS-PROT/TREMBL or the nonredundant database because there may be nonfunctional proteins in the latter databases that would corrupt our predictions.
- Account for gaps?: YES
SIFT Results
A PSI-BLAST alignment with 39 sequences was returned along with predictions and probabilities.
Bacteriophage T4 lysozyme
Submission
Paste in Sequence
>LYCV_BPT4|P00720
MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNCNGVITKDEAEKLFNQDVDAAVRGI
LRNAKLKPVYDSLDAVRRCALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTW
DAYKNL
With Parameters:
- Database: SWISS-PROT/TR
- Account for gaps?: YES
SIFT Results
An ERROR was returned with the following message:
The query sequence did not share any conserved regions with the sequences
found in PSI-BLAST. This may be due to divergence of PSI-BLAST results. We
suggest you run a PSI-BLAST search on your own and hand-pick sequences that
may be related to your sequence.
From the PSI-BLAST search, the alignment that results is severely gapped.
Only one other sequence, VG05_BPT4, aligns well with the query sequence.
When the pairwise alignment is submitted to SIFT, predictions and normalized probabilities are returned.
Remarkably, with only two sequences, SIFT performs well in predicting deleterious substitutions.
Performance
To test SIFT against experimental data, we chose unbiased data sets in which mutagenesis was performed throughout the entire protein and both positive and negative phenotypes were assayed.
Phenotypes with weakened activity were grouped with the loss-of-function phenotype, and SIFT was tested for its ability to predict these substitutions as deleterious.
As a control, the BLOSUM62 substitution scoring matrix was used, where nonnegative scores are predicted as tolerated and negative scores as deleterious.
Measures we used:
- Tolerant Prediction Accuracy
- (# of tolerated substitutions correctly predicted) / (# of tolerated substitutions)
- Deleterious Prediction Accuracy
- (# of deleterious substitutions correctly predicted) / (# of deleterious substitutions)
- Total Prediction Accuracy
- (total # of correctly predicted substitutions) / (total # of substitutions)
- Experimental Prediction Accuracy
- (# predicted deleterious and phenotypically deleterious) / (# predicted deleterious)
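The four measures can be computed from two pairs of counts, since the number of substitutions predicted deleterious equals the correctly predicted deleterious ones plus the tolerated ones that were mispredicted. The sketch below uses a hypothetical function name and, as a check, the LacI counts that this page reports for SIFT.

```python
def accuracy_measures(tol_correct, tol_total, del_correct, del_total):
    """Compute the four accuracy measures defined above as fractions."""
    # Predicted deleterious = true deleterious calls + tolerated ones missed.
    predicted_deleterious = del_correct + (tol_total - tol_correct)
    return {
        "tolerant": tol_correct / tol_total,
        "deleterious": del_correct / del_total,
        "total": (tol_correct + del_correct) / (tol_total + del_total),
        "experimental": del_correct / predicted_deleterious,
    }


# LacI, SIFT: 1747/2254 tolerated and 989/1750 deleterious correctly predicted.
for name, value in accuracy_measures(1747, 2254, 989, 1750).items():
    print(f"{name}: {value:.0%}")
```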
SIFT performance compared to BLOSUM62
Test Case | Method | Tolerant Prediction Accuracy | Deleterious Prediction Accuracy | Total Prediction Accuracy | Experimental Prediction Accuracy |
LacI | SIFT | 78% (1747/2254) | 57% (989/1750) | 68% (2736/4004) | 66% (989/1496) |
LacI | BLOSUM62 | 31% (696/2254) | 84% (1475/1750) | 54% (2171/4004) | 49% (1475/3033) |
HIV-1 protease | SIFT | 70% (78/111) | 82% (184/225) | 78% (262/336) | 85% (184/215) |
HIV-1 protease | BLOSUM62 | 63% (70/111) | 73% (165/225) | 70% (235/336) | 80% (165/206) |
Bacteriophage T4 lysozyme | SIFT | 59% (817/1377) | 72% (460/638) | 63% (1277/2015) | 45% (460/1020) |
Bacteriophage T4 lysozyme | BLOSUM62 | 30% (406/1377) | 85% (542/638) | 47% (948/2015) | 36% (542/1513) |
In all test cases, SIFT predicts better than BLOSUM62. SIFT always has a higher
number of correctly predicted substitutions than BLOSUM62 (positive gain). Also,
for all test sets, the number of substitutions predicted by SIFT to alter
protein function is reduced. For a biologist investigating substitutions
predicted to have a deleterious effect, fewer functional assays need to be
performed and a higher proportion of the assays will yield affected phenotypes.
Performances of other matrices.
Results were similar to BLOSUM62.
Test Case | Method | Tolerant Prediction Accuracy | Deleterious Prediction Accuracy | Total Prediction Accuracy | Experimental Prediction Accuracy |
LacI | BLOSUM45 | 32% (729/2254) | 81% (1421/1750) | 54% (2150/4004) | 48% (1421/2946) |
LacI | BLOSUM80 | 23% (526/2254) | 88% (1547/1750) | 52% (2073/4004) | 47% (1547/3275) |
HIV-1 protease | BLOSUM45 | 64% (71/111) | 72% (161/225) | 69% (232/336) | 80% (161/201) |
HIV-1 protease | BLOSUM80 | 59% (65/111) | 76% (170/225) | 70% (235/336) | 78% (170/216) |
Bacteriophage T4 lysozyme | BLOSUM45 | 31% (430/1377) | 82% (522/638) | 47% (952/2015) | 36% (522/1469) |
Bacteriophage T4 lysozyme | BLOSUM80 | 22% (301/1377) | 91% (577/638) | 44% (878/2015) | 35% (577/1653) |
Page last modified August 2001