SIFT Help

What does SIFT do?
What can SIFT do for me?
How does SIFT work?
Input for SIFT
Changing SIFT parameters
SIFT Output
Examples
Performance
- Compared to BLOSUM62
- Compared to other BLOSUM matrices

What does SIFT do?

SIFT is a sequence homology-based tool that sorts intolerant from tolerant amino acid substitutions and predicts whether an amino acid substitution in a protein will have a phenotypic effect. SIFT is based on the premise that protein evolution is correlated with protein function. Positions important for function should be conserved in an alignment of the protein family, whereas unimportant positions should appear diverse in an alignment.

What can SIFT do for me?

If you have a protein of interest that you would like to introduce mutations in, submit your sequence to SIFT. Regions that do not tolerate many substitutions will be highlighted in red in the Score output file, and you can target these regions for mutation.

If you have mutant proteins with single amino acid substitutions, SIFT will predict which mutants may have a phenotypic effect before you carry out your functional assays.

How does SIFT work?

Brief Summary

SIFT takes a query sequence and uses multiple alignment information to predict tolerated and deleterious substitutions for every position of the query sequence. SIFT is a multistep procedure that (1) searches for similar sequences, (2) chooses closely related sequences that may share similar function to the query sequence, (3) obtains the alignment of these chosen sequences, and (4) calculates normalized probabilities for all possible substitutions from the alignment. Positions with normalized probabilities less than 0.05 are predicted to be deleterious; those greater than or equal to 0.05 are predicted to be tolerated.

Procedure (the details):

Get related sequences. A PSI-BLAST search against a database is executed on the query sequence.

Parameters: 4 iterations, expectation value .0001, e-value threshold for inclusion in multipass model 0.002
Update 05/15/01 Number of PSI-BLAST iterations reduced to 2 to save time and prevent the search from diverging.
Choose closely related sequences.

As described in Genome Research 11:963-87: We desire to have sequences that are similar in function as well as structure to the query sequence. To do so, we select only a subset of sequences from the PSI-BLAST results.
1. Group sequences found from the PSI-BLAST search that are more than 90% identical together and make a consensus sequence for each group by choosing the amino acid that occurs most frequently at each position.
2. MOTIF finds conserved regions among the query sequence and the consensus sequences from step 1 that were derived from at least two sequences.
3. After the conserved regions in the query sequence have been identified by MOTIF, these regions are extracted from the sequences aligned by PSI-BLAST.
4. The conserved regions of the query sequence and those consensus sequences more than 90% identical are converted to a PSI-BLAST checkpoint file.
5. The checkpoint file is given to PSI-BLAST to search among the remaining conserved regions of the consensus sequences not included in the seed checkpoint file. The top hit is added to the alignment corresponding to the seed checkpoint file, and the conservation over the entire alignment of conserved regions is calculated. If conservation does not decrease, the consensus sequence is added to the alignment and the checkpoint file is rebuilt. Step 5 repeats until conservation decreases.
OR
SIFT by conservation: In the original version of SIFT, an arbitrary number of sequences is added. In this version, sequences are continually added until they reach a sequence conservation cutoff, set by the user.
If the sequences on which the prediction is based are very diverse (low conservation cutoff), only substitutions at the strongly conserved positions will be predicted as deleterious. If the sequences chosen for prediction are very similar to each other (high conservation cutoff), then most substitutions will be predicted as deleterious.
Users can choose the degree of sequence conservation: they can opt to detect most of the deleterious substitutions (use a high sequence conservation), or to predict fewer deleterious substitutions but with a high level of certainty (use a low sequence conservation).
1. Group sequences found from the PSI-BLAST search (step 1) that are more than 90% identical together and make a consensus sequence for each group by choosing the amino acid that occurs most frequently at each position.
2. The query sequence and its checkpoint file are given to PSI-BLAST to search among the consensus sequences. The top hit is added and aligned to the query sequence. Information is calculated for each position in the alignment, and the median of these values is obtained. If the median conservation over all positions does not fall below a given cutoff, the hit is retained in the alignment and the checkpoint file is rebuilt. The process repeats as long as the median information does not fall below the cutoff.
The sequences picked from this iterative procedure are chosen as closely related sequences. You can also submit your own sequences.
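
The grouping and consensus step shared by both selection methods can be sketched as follows. This is a minimal illustration rather than SIFT's own code: it assumes the sequences are already aligned to equal length, uses a simple greedy grouping against the first member of each group, and all function names are hypothetical.

```python
from collections import Counter

def percent_identity(a, b):
    """Fraction of columns, ignoring gaps, where two aligned sequences agree."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def group_sequences(seqs, cutoff=0.90):
    """Greedy grouping: a sequence joins the first group whose first member
    it matches at more than `cutoff` identity, else it starts a new group."""
    groups = []
    for s in seqs:
        for g in groups:
            if percent_identity(s, g[0]) > cutoff:
                g.append(s)
                break
        else:
            groups.append([s])
    return groups

def consensus(group):
    """Most frequent character in each column (ties go to the first seen)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*group))
```

How ties and gap characters should be handled in the consensus is not specified in the text above, so the choices here are assumptions.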
Obtain alignment. Since PSI-BLAST alignments are fairly accurate and long (Sauder & Dunbrack, 2000), we obtain the alignment of the sequences chosen in (2) from the initial PSI-BLAST search results (1).
You can also submit your own alignment of your query sequence with other sequences.
Calculate probabilities. At each position of the alignment, each amino acid i appears at a frequency n_i. Using the n_i's, the probabilities of amino acids are estimated according to Dirichlet mixtures (the d_i's). The final probability of an amino acid appearing at a position, p_i, is a weighted average of the observed frequencies and the Dirichlet estimation. The weight of the observed frequencies is the number of sequences used to construct the alignment. The weight of the Dirichlet estimated probabilities is an exponential function of a diversity measure (Div) calculated by

Div = SUM ( rank_i * n_i)
where rank_i is the rank amino acid i has in reference to the original amino acid when BLOSUM62 substitution scores for the original amino acid are ranked from highest to lowest.
Probabilities are normalized by dividing by max{Pr(amino acid)}.
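
As a rough sketch of the calculation described above, the code below computes Div, blends observed frequencies with prior estimates, and normalizes by the maximum. The exact exponential function of Div and the Dirichlet mixture itself are not given here, so the prior estimates and `prior_weight` are taken as inputs; all names are hypothetical.

```python
def diversity(rank, freq):
    """Div = SUM(rank_i * n_i); `rank` maps amino acid -> BLOSUM62 rank
    relative to the original amino acid, `freq` maps amino acid -> n_i."""
    return sum(rank[aa] * freq[aa] for aa in freq)

def normalized_probabilities(observed, prior, n_seqs, prior_weight):
    """Weighted average of observed frequencies and prior estimates,
    scaled so that the most probable amino acid gets 1.0.

    observed, prior: dicts mapping amino acid -> probability
    n_seqs: weight of the observed frequencies (number of sequences)
    prior_weight: weight of the prior estimate (assumed to be supplied)
    """
    total = n_seqs + prior_weight
    p = {
        aa: (n_seqs * observed.get(aa, 0.0) + prior_weight * prior[aa]) / total
        for aa in prior
    }
    top = max(p.values())
    return {aa: value / top for aa, value in p.items()}
```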

Update 08/08/01: Prior to calculating the probabilities, sequences > 90% identical to the query sequence are removed. This eliminates the possibility that the sequence containing your substitution of interest is already represented in the database and would therefore trivially be predicted as tolerated.

We have found by comparison to experimental data that substitutions with normalized probabilities less than 0.05 are deleterious. We use this as the cutoff for prediction. We strongly suggest that users examine the normalized probabilities manually. If your substitution scores only slightly above the 0.05 cutoff, you might want to consider it a deleterious substitution.
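
The cutoff rule can be written as a small helper. This sketch uses the strict less-than rule stated in the Brief Summary. The `margin` parameter is a hypothetical addition for flagging scores that sit just above the cutoff and is not part of SIFT.

```python
CUTOFF = 0.05

def classify(score, cutoff=CUTOFF, margin=0.0):
    """Label a normalized probability.

    Scores below `cutoff` are deleterious. Scores in
    [cutoff, cutoff + margin) are flagged for manual review.
    """
    if score < cutoff:
        return "DELETERIOUS"
    if score < cutoff + margin:
        return "BORDERLINE"
    return "TOLERATED"
```

For example, `classify(0.06, margin=0.02)` flags a score slightly above the cutoff for a closer look instead of silently calling it tolerated.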

Input for SIFT

You can submit a protein sequence (slow), or your query sequence along with related sequences (fast) or your query sequence aligned with related sequences (even faster).

Submitting an NCBI GI #

You can submit an NCBI GI # to obtain SIFT predictions. Predictions are based on pre-computed BLAST searches and are returned within a minute. This is the preferred method of submission.

To find an NCBI GI # for a particular protein sequence, go to the NCBI protein database and type in the gene name. If you get back too many results, you can narrow the list by specifying the organism. For example, if looking for the human MLH1 gene, type "MLH1"[GENE] AND "homo sapiens"[ORGANISM] into the NCBI text box and a shorter list of genes restricted to human will be returned.

Submitting a sequence

You can submit a protein sequence in FASTA format. The entire SIFT procedure will be executed and results will be returned to you. This procedure is slow; if you have additional information about the protein, you can get your results much faster.

Submitting a group of related sequences

If you know of proteins related to your query protein, you can get results much faster by submitting your sequence and related sequences. Steps (1) & (2) of the SIFT procedure are skipped. Submit in FASTA format with your protein of interest as the first sequence in the file.

Submitting a multiple alignment

If you have a multiple alignment containing your protein of interest, you can submit the alignment in CLUSTAL, MSF, or FASTA format. Your protein should be first in the alignment. The length of the alignment should correspond to the query protein and there should be no gaps in the query protein sequence. Since steps (1) through (3) are skipped in the SIFT procedure, you will get your results SUPER-DUPER FAST and we encourage you to use this submission form instead of the others.
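
Before submitting an alignment, it may help to check the two requirements above (query first with no gaps, and every row the same length as the query). The following is a minimal sketch for an alignment in FASTA format; the function names are hypothetical, and treating both "-" and "." as gap characters is an assumption.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]

def check_alignment(records):
    """Return a list of problems; an empty list means the checks passed."""
    if not records:
        return ["no sequences found"]
    problems = []
    query = records[0][1]
    if "-" in query or "." in query:
        problems.append("query (first) sequence contains gaps")
    for name, seq in records[1:]:
        if len(seq) != len(query):
            problems.append(
                f"{name}: length {len(seq)} differs from query length {len(query)}"
            )
    return problems
```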

Submitting Substitutions

SIFT will return predictions on whether your substitutions are tolerant or intolerant based on the scores.

The format for a substitution is to have X#Y where X is the original amino acid, # is the position of the substitution and Y is the new amino acid. One substitution per line is allowed.

Example:
M1Y
K3S
T4P
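
A list of substitutions in this format can be checked before submission. The sketch below parses one X#Y substitution per line and optionally verifies the original amino acid against a query sequence; the function name and the optional `query` check are illustrative assumptions.

```python
import re

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")

def parse_substitutions(text, query=None):
    """Return a list of (original, position, new) tuples, one per line."""
    subs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = PATTERN.match(line)
        if not m:
            raise ValueError(f"not in X#Y format: {line!r}")
        ref, pos, alt = m.group(1), int(m.group(2)), m.group(3)
        if ref not in AMINO_ACIDS or alt not in AMINO_ACIDS:
            raise ValueError(f"unknown amino acid in {line!r}")
        if pos < 1:
            raise ValueError(f"positions start at 1: {line!r}")
        # Positions are 1-based, so position 1 is query[0].
        if query is not None and (pos > len(query) or query[pos - 1] != ref):
            raise ValueError(f"{line!r} does not match the query sequence")
        subs.append((ref, pos, alt))
    return subs
```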

SIFT Output

SIFT Predictions for Substitutions

Output	Description
SIFT Score	Ranges from 0 to 1. The amino acid substitution is predicted damaging if the score is <= 0.05, and tolerated if the score is > 0.05.
Median Info	Ranges from 0 to 4.32; ideally the number would be between 2.75 and 3.5. This is used to measure the diversity of the sequences used for prediction. A warning will occur if this is greater than 3.25 because this indicates that the prediction was based on closely related sequences.
Seqs at Position	This is the number of sequences that have an amino acid at the position of prediction. SIFT automatically chooses the sequences for you, but if the substitution is located at the beginning or end of the protein, there may be only a few sequences represented at that position, and this column indicates this.

Genome Tool Output

Below is example output returned when genomic variants are submitted.

Coordinates	Codons	Transcript ID	Protein ID	Substitution	Region	dbSNP ID	SNP Type	Prediction	Score	Median Info	# Seqs at position
1,100624830,1,T/A	ATA-tTA	ENST00000342895	ENSP00000344470	I121L	EXON CDS	rs34920283:A	Nonsynonymous	TOLERATED	0.59	3.06	28
22,30163533,1,A/C	GAG-GcG	ENST00000330029	ENSP00000332887	E49A	EXON CDS	rs11554363:C	Nonsynonymous	DAMAGING	0.03	3.04	50
X,10085674,1,T/C	GAT-GAc	ENST00000380861	ENSP00000370242	D525D	EXON CDS	rs6530368:C	Synonymous	N/A	N/A	N/A	N/A
21,19638426,1,T/G	TTG-gTG	ENST00000338326	ENSP00000339975	L223V	EXON CDS	novel	Nonsynonymous	DAMAGING *Warning! Low confidence.	0	4.32	2
2,230633386,1,G/A	CAG-tAG	ENST00000283943	ENSP00000283943	Q1910*	EXON CDS	rs1803846:A	Nonsynonymous	N/A	N/A	N/A	N/A
2,230312220,1,G/A	CCC-CtC	ENST00000341772	ENSP00000345229	P433L	EXON CDS	rs17853365:A	Nonsynonymous	TOLERATED	0.11	3.02	160

The first column indicates the variant submitted. If alleles are submitted with respect to the - strand, they will be automatically converted to the + strand. Please note that if you do not submit the variant correctly, it will default to a synonymous change. One way to check: if the reference and non-reference alleles in the coordinates column now match, you most likely did not submit your variant correctly.
The second column denotes the codon that has been changed, the bases are with respect to + mRNA orientation.
If dbSNP has a variant overlapping at the same position, the rs ID is displayed. However, the alleles may not be the same.
SIFT predictions are as described above.
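
The check described for the first column can be automated. This sketch splits a Coordinates value such as 1,100624830,1,T/A into its parts and flags rows where the two alleles match. The field names are assumptions based on the example rows. In particular, the meaning of the third field is not documented here, so it is kept as an opaque `flag`.

```python
def parse_coordinates(field):
    """Split 'chrom,position,flag,allele1/allele2' into a dict."""
    chrom, pos, flag, alleles = field.split(",")
    ref, alt = alleles.split("/")
    return {"chrom": chrom, "pos": int(pos), "flag": flag, "ref": ref, "alt": alt}

def looks_misreported(field):
    """True when both alleles are identical, which suggests the variant
    was not submitted correctly."""
    coords = parse_coordinates(field)
    return coords["ref"] == coords["alt"]
```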

Gene Annotation Error : We check that the Ensembl gene annotation codes for the expected Ensembl protein. For example, if an Ensembl gene is from positions 3-56 in the NCBI reference genome, we extract those DNA bases from the NCBI reference genome, translate it, and then check that it matches the corresponding Ensembl protein sequence.
If the Ensembl gene annotation does not match the expected protein sequence, we do not annotate the coding variant. 16% of the proteins from NCBI36 had this error, and 7% from NCBI37 have this error. Therefore, if you receive this error, we recommend you resubmit with NCBI37 coordinates (after converting NCBI36 to NCBI37), and if you still receive this error, we recommend that you annotate by hand.
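
If you annotate by hand, the same kind of check can be reproduced with a few lines of code. The sketch below translates a coding DNA sequence with the standard genetic code and compares it to the expected protein. It assumes the DNA is already spliced, in frame, and on the coding strand; the function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate in frame up to the first stop codon.
    Raises KeyError for codons with ambiguous bases such as N."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

def annotation_matches(dna, expected_protein):
    """True when the translated DNA equals the expected protein sequence."""
    return translate(dna) == expected_protein.upper().rstrip("*")
```

The same table reproduces the Codons column in the example output: ATA codes for I and TTA for L (I121L), and CAG codes for Q while TAG is a stop (Q1910*).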

Single Protein Output

For single protein submissions, the following output is also returned:

A table of probabilities [Procedure]
Here is an example of one of the rows in the table.

pos A C D E F G H I K L M N P Q R S T V W Y

9I 0.75 0.71 0.12 0.39 0.68 0.35 0.36 0.30 0.81 1.00 0.87 0.24 0.42 0.28 0.54 0.76 0.58 0.58 0.94 0.02 0.39

This lists normalized probabilities for position 9I of the query sequence. The first value after 9I (0.75) is the fraction of sequences that are represented at this position. In this case, 75% of the sequences had a basic amino acid appearing in the sequence; 25% had either gaps or Xes. The normalized probability for an I->W substitution is < 0.05, so it is predicted deleterious and highlighted in red.
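
A row of this table can be read programmatically. The sketch below assumes the layout of the example row: the position label, then the fraction of sequences represented, then 20 normalized probabilities in the column order shown in the header. The function names are hypothetical.

```python
AMINO_ORDER = "A C D E F G H I K L M N P Q R S T V W Y".split()

def parse_probability_row(row):
    """Return (position, fraction_represented, {amino acid: probability})."""
    fields = row.split()
    if len(fields) != 2 + len(AMINO_ORDER):
        raise ValueError(f"expected 22 fields, got {len(fields)}")
    position = fields[0]
    seq_rep = float(fields[1])
    scores = dict(zip(AMINO_ORDER, map(float, fields[2:])))
    return position, seq_rep, scores

def below_cutoff(scores, cutoff=0.05):
    """Amino acids whose normalized probability is below the cutoff."""
    return [aa for aa, s in scores.items() if s < cutoff]
```

Applied to the example row, `below_cutoff` returns only W, matching the I->W prediction described above.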
Predictions for each position
Here is an example of the output.

Predict Not Tolerated	Position	Seq Rep	Predict Tolerated
c w d f m i y v g p s h n a l t e	7Q	0.95	K Q R

At position 7Q in the query sequence, 95% of the sequences have an amino acid appearing at this position. K, Q, R are predicted as tolerated and are observed in the alignment (capitalized). C, W, D, F, M, I, Y, V, G, P, S, H, N, A, L, T, E are predicted to be deleterious because they have normalized probabilities < 0.05 and none of these appear in the alignment (small letters). Amino acids are color coded: nonpolar, uncharged polar, basic, acidic.
If you submit substitutions, in addition to returning the prediction, SIFT will return:
(1) The number of sequences used at the position of substitution for prediction (not counting sequences with gaps at that position)
(2) The median sequence information used to measure the diversity of the sequences used for prediction.
The median sequence information is calculated by first calculating the information
at each position and then obtaining the median over all positions.
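
A minimal sketch of this calculation follows. The formula for the information at a position is not given here, so the code assumes the usual definition, log2(20) minus the Shannon entropy of the amino acid probabilities, which is consistent with the stated range of 0 to 4.32. The function names are hypothetical.

```python
import math
from statistics import median

MAX_INFO = math.log2(20)  # about 4.32 bits for 20 amino acids

def column_information(probabilities):
    """Information (bits) at one position, given amino acid probabilities
    that sum to 1. Assumed formula: log2(20) - Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return MAX_INFO - entropy

def median_information(columns):
    """Median of the per-position information over all positions."""
    return median(column_information(col) for col in columns)
```

Under this assumption, a fully conserved position scores about 4.32 and a position where all 20 amino acids are equally likely scores 0.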

We recommend that a substitution predicted to be deleterious with median sequence information greater than 3.25 and especially those greater than 3.5 be taken with caution because this prediction was based on closely related sequences. The substitution may be at a position that has not had time to evolve and is conserved by chance (and hence predicted to be deleterious) when the position can actually tolerate substitutions and given more evolutionary time, will eventually mutate to different amino acids.
Exception: If your protein belongs to a protein family that is especially conserved (such as histones) then using a higher median sequence information may be fine.

Sample output:

Substitution at pos 2 from S to F is predicted to be DELETERIOUS with a score of 0.01.
Median sequence information: 3.44
Sequences represented at this position:3
WARNING!! This substitution may have been predicted as deleterious just because
the prediction was based on sequences too closely related. We recommend a median
sequence information <= 3.25 for reasonable accuracy and for which sequence diversity is adequate.

Substitution at pos 60 from E to L is predicted to be DELETERIOUS with a score of 0.00.
Median sequence information: 2.72
Sequences represented at this position:116

Explanation:
The substitution S2F was predicted to be deleterious. However, a warning was given because the sequences used for prediction were not very diverse from each other.

E60L was predicted to be deleterious and had a low median sequence information, indicating that the sequences used for prediction were diverse enough.


Changing Parameters

If SIFT is timing out:

Check the length of your protein.
SIFT's running time is proportional to the length of the protein. If your protein is very long (more than 500 amino acids), truncate it to the region you're interested in (a particular domain or the region carrying the substitution) and rerun SIFT. The optimal length is 300-500 amino acids. Don't forget to shift the location of your substitution.
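
Shifting the substitution position after truncation is easy to get wrong by one, so a small helper can do both steps together. This is an illustrative sketch with a hypothetical function name; positions are 1-based and the region bounds are inclusive.

```python
def truncate_with_substitution(sequence, start, end, position):
    """Cut out residues start..end (1-based, inclusive) and return
    (region, new_position), where new_position is the substitution's
    position within the truncated region."""
    if not (1 <= start <= position <= end <= len(sequence)):
        raise ValueError("substitution must lie inside the chosen region")
    region = sequence[start - 1:end]
    return region, position - start + 1
```

For instance, if residues 401 to 800 are kept, a substitution originally at position 525 is submitted as position 125.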

The server is busy. In this case, change the default parameters so that your query takes less time to run and will be completed before the time limit. You can:
1. change median conservation of sequences to 3.00 (from default 2.75).
  -OR-
2. change the database to SWISS-PROT.
If changing these parameters will affect the quality of prediction, a warning will be given with the predictions of the substitutions.

If you are getting the error "not enough sequence information for prediction" (i.e., SIFT returns an error or median sequence information > 3.25):

Increase the database size.

Database options

SWISS-PROT = small but high quality. Less likely to time out but may not have enough sequences for prediction.
SWISS-PROT/TrEMBL = larger than SWISS-PROT, good quality <- what I like to use
nr (nonredundant database) = largest database, but not that good quality. Increases possibility of timing out but can also give more sequence information.

If there are not enough sequences to make your prediction (you get a warning that the median sequence information is > 3.25 for your substitution), increase the size of the database. Change the database option from SWISS-PROT/TrEMBL (default) to the nonredundant database.

Examples

LacI

Submission

Paste in Sequence

>gi|2506562|sp|P03023|LACI_ECOLI LACTOSE OPERON REPRESSOR
MKPVTLYDVAEYAGVSYQTVSRVVNQASHVSAKTREKVEAAMAELNYIPNRVAQQLAGKQSLLIGVATSS
LALHAPSQIVAAIKSRADQLGASVVVSMVERSGVEACKAAVHNLLAQRVSGLIINYPLDDQDAIAVEAAC
TNVPALFLDVSDQTPINSIIFSHEDGTRLGVEHLVALGHQQIALLAGPLSSVSARLRLAGWHKYLTRNQI
QPIAEREGDWSAMSGFQQTMQMLNEGIVPTAMLVANDQMALGAMRAITESGLRVGADISVVGYDDTEDSS
CYIPPLTTIKQDFRLLGQTSVDRLLQLSQGQAVKGNQLLPVSLVKRKTTLAPNTQTASPRALADSLMQLA
RQVSRLESGQ

With Parameters:

Database: SWISS-PROT/TREMBL
Account for gaps?: YES

SIFT Results

A PSI-BLAST alignment with 27 sequences was returned along with normalized probabilities.
When results from tblastn on the microbial genomes were added, 54 sequences were chosen. This alignment gave similar predictions and probabilities. These probabilities were used to test SIFT performance.

HIV-1 Protease

Submission

Paste in Sequence

>HIV-1 protease
PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMSLPGRWKPKMIGGI
GGFIKVRQYDQILIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNF

With Parameters:

Database: SWISS-PROT: SWISS-PROT was chosen rather than SWISS-PROT/TREMBL or the nonredundant database because there may be nonfunctional proteins in the latter databases that would corrupt our predictions.
Account for gaps?: YES

SIFT Results

A PSI-BLAST alignment with 39 sequences was returned along with predictions and probabilities.

Bacteriophage T4 lysozyme

Submission

Paste in Sequence

>LYCV_BPT4|P00720
MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNCNGVITKDEAEKLFNQDVDAAVRGI
LRNAKLKPVYDSLDAVRRCALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTW
DAYKNL

With Parameters:

Database: SWISS-PROT/TR
Account for gaps?: YES

SIFT Results

An ERROR was returned with the following message:

The query sequence did not share any conserved regions with the sequences found in PSI-BLAST. This may be due to divergence of PSI-BLAST results. We suggest you run a PSI-BLAST search on your own and hand-pick sequences that may be related to your sequence.

The alignment that results from the PSI-BLAST search is severely gapped. Only one other sequence, VG05_BPT4, aligns well with the query sequence. When the pairwise alignment is submitted to SIFT, predictions and normalized probabilities are returned.
Remarkably, with only two sequences, SIFT performs well in predicting deleterious substitutions.

Performance

To test SIFT against experimental data, we chose unbiased data sets in which mutagenesis was performed throughout the entire protein and both positive and negative phenotypes were assayed. Phenotypes with weakened activity were grouped with loss of function phenotype and SIFT was tested for its ability to predict these substitutions as deleterious. As a control, the BLOSUM62 substitution scoring matrix is used where nonnegative scores are predicted as tolerated; negative scores as deleterious.

Measures we used:

Tolerant Prediction Accuracy: (# of tolerant substitutions correctly predicted) / (# of tolerant substitutions)
Deleterious Prediction Accuracy: (# of deleterious substitutions correctly predicted) / (# of deleterious substitutions)
Total Prediction Accuracy: (total # of correctly predicted substitutions) / (total # of substitutions)
Experimental Prediction Accuracy: (# predicted deleterious and phenotypically deleterious) / (# predicted deleterious)
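
The four measures can be computed from four counts, as in the sketch below (the function name is hypothetical). The number of substitutions predicted deleterious is the correctly predicted deleterious ones plus the tolerant ones that were predicted incorrectly.

```python
def prediction_accuracies(tol_correct, tol_total, del_correct, del_total):
    """Return the four accuracy measures as fractions between 0 and 1."""
    predicted_deleterious = del_correct + (tol_total - tol_correct)
    return {
        "tolerant": tol_correct / tol_total,
        "deleterious": del_correct / del_total,
        "total": (tol_correct + del_correct) / (tol_total + del_total),
        "experimental": del_correct / predicted_deleterious,
    }
```

With the LacI counts reported for SIFT (1747 of 2254 tolerant and 989 of 1750 deleterious substitutions correctly predicted), this gives 78%, 57%, 68% and 66% after rounding.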

SIFT performance compared to BLOSUM62

Test Case	Method	Tolerant Prediction Accuracy	Deleterious Prediction Accuracy	Total Prediction Accuracy	Experimental Prediction Accuracy
LacI	SIFT	78% (1747/2254)	57% (989/1750)	68% (2736/4004)	66% (989/1496)
LacI	BLOSUM62	31% (696/2254)	84% (1475/1750)	54% (2171/4004)	49% (1475/3033)
HIV-1 protease	SIFT	70% (78/111)	82% (184/225)	78% (262/336)	85% (184/215)
HIV-1 protease	BLOSUM62	63% (70/111)	73% (165/225)	70% (235/336)	80% (165/206)
Bacteriophage T4 lysozyme	SIFT	59% (817/1377)	72% (460/638)	63% (1277/2015)	45% (460/1020)
Bacteriophage T4 lysozyme	BLOSUM62	30% (406/1377)	85% (542/638)	47% (948/2015)	36% (542/1513)

In all test cases, SIFT predicts better than BLOSUM62. SIFT always has a higher number of correctly predicted substitutions than BLOSUM62 (positive gain). Also, for all test sets, the number of substitutions predicted by SIFT to alter protein function is reduced. For a biologist investigating substitutions predicted to have a deleterious effect, fewer functional assays need to be performed and a higher proportion of the assays will yield affected phenotypes.

Performances of other matrices. Results were similar to BLOSUM62.

Test Case	Method	Tolerant Prediction Accuracy	Deleterious Prediction Accuracy	Total Prediction Accuracy	Experimental Prediction Accuracy
LacI	BLOSUM45	32% (729/2254)	81% (1421/1750)	54% (2150/4004)	48% (1421/2946)
LacI	BLOSUM80	23% (526/2254)	88% (1547/1750)	52% (2073/4004)	47% (1547/3275)
HIV-1 protease	BLOSUM45	64% (71/111)	72% (161/225)	69% (232/336)	80% (161/201)
HIV-1 protease	BLOSUM80	59% (65/111)	76% (170/225)	70% (235/336)	78% (170/216)
Bacteriophage T4 lysozyme	BLOSUM45	31% (430/1377)	82% (522/638)	47% (952/2015)	36% (522/1469)
Bacteriophage T4 lysozyme	BLOSUM80	22% (301/1377)	91% (577/638)	44% (878/2015)	35% (577/1653)

Page last modified August 2001
