Genome-wide prediction and analysis of coding variants
Overview
The identification of genetic variants implicated in disease is an important step in linking sequence data with new approaches to improve human health. Among the sequence variants currently known to be directly linked with human Mendelian disease, 57% are due to nonsynonymous mutations that encode a single amino acid substitution in the corresponding protein. An additional 23% of disease variants are due to small insertions and deletions (indels) in genes. A key problem is therefore the identification of coding variants that affect protein function and might be involved in disease. The SIFT algorithm was developed to predict whether a single nucleotide variation (SNV) leading to an amino acid substitution affects protein function.
Recent efforts in personalized genomics to sequence individual genomes have generated large numbers of genome-wide variants, including not only SNVs but also indels, which require analysis and prioritization. To this end, we have developed a novel prediction algorithm with expanded functions, PROVEAN, which supports functional predictions for SNVs as well as insertions, deletions, and replacements of amino acids at the protein level. We have shown that the performance of PROVEAN is comparable to that of other popular tools, including SIFT.
A PROVEAN web server is currently available at JCVI to support large-scale, genome-wide functional analysis of coding variants, including both SNVs and indels. Executables and source code are freely available to the research community.
Funding
This work is funded by the National Institutes of Health (NIH), grant number 5R01HG004701-03.