Algorithmically-Tuned Protein Families, Rule Base and Characterized Proteins
Overview
The massive numbers of microbial species living in and on the human body vary greatly from person to person and transform our metabolism enough to affect our health. We are using an informatics-based approach to read patterns of DNA differences from microbe to microbe as a means of figuring out which species do what in and on the human body. This work is part of the NIH Human Microbiome Project (HMP), where there is a great need to improve existing tools and develop computational methods to address the complexity of the metagenomic data generated by human microbiome projects. This project takes a three-pronged approach to dramatically improve methods for extracting meaning from HMP sequence data.
The first approach is to develop algorithms that build protein families, each family just inclusive enough that checking a genome for some cohort of families tells whether or not a pathway is present. These algorithms resemble Phylogenetic Profiling, a data mining technique, but go through optimization steps that guide the building of each family. Pre-built families are not required. The result is new descriptive power that can discover and describe new systems and pathways. Thousands of new families will be created.
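As a rough illustration of the idea of tuning a family against a presence/absence profile, the sketch below picks the similarity cutoff at which family membership best agrees with the set of genomes known to carry a pathway. It is a minimal sketch under stated assumptions, not the project's algorithm: the function names, the single-seed scoring scheme, the simple agreement measure, and the toy genomes and scores are all hypothetical.

```python
from typing import Dict, Set, Tuple


def profile_agreement(family_genomes: Set[str],
                      pathway_genomes: Set[str],
                      all_genomes: Set[str]) -> float:
    """Fraction of genomes where family presence matches pathway presence."""
    matches = sum((g in family_genomes) == (g in pathway_genomes)
                  for g in all_genomes)
    return matches / len(all_genomes)


def tune_cutoff(best_hit_scores: Dict[str, float],
                pathway_genomes: Set[str],
                all_genomes: Set[str]) -> Tuple[float, float]:
    """Try each observed score as a cutoff; keep the one whose resulting
    family best matches the pathway profile (hypothetical scheme)."""
    best_cutoff, best_agreement = float("inf"), -1.0
    # Walk from the strictest cutoff to the most inclusive one.
    for cutoff in sorted(set(best_hit_scores.values()), reverse=True):
        family = {g for g, s in best_hit_scores.items() if s >= cutoff}
        agreement = profile_agreement(family, pathway_genomes, all_genomes)
        if agreement > best_agreement:
            best_cutoff, best_agreement = cutoff, agreement
    return best_cutoff, best_agreement


# Toy data (invented): best similarity score to a seed protein, per genome.
all_genomes = {"A", "B", "C", "D", "E"}
best_hit_scores = {"A": 310.0, "B": 280.0, "C": 120.0, "D": 95.0}
pathway_genomes = {"A", "B"}  # genomes assumed to carry the pathway

cutoff, agreement = tune_cutoff(best_hit_scores, pathway_genomes, all_genomes)
```

In this toy case a cutoff of 280.0 yields a family found in exactly the genomes that carry the pathway; a stricter cutoff misses genome B, and a looser one pulls in genomes C and D.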
The second is a new way to apply annotation rules. Large numbers of automatically created rules, each of which works on a fairly small number of proteins, can apply very exacting tests to determine whether one protein should be expected to have the same function as another that is already characterized. By deriving support from comparisons of gene regions or metabolic backgrounds, in ways made possible only by having large numbers of complete genomes, these rules can achieve much greater confidence than simpler annotation techniques.
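One way to picture such a rule is as a narrow test that combines sequence similarity to a characterized protein with supporting context from the gene region and the genome's metabolic background. The sketch below is only an illustration under assumptions: the `Rule` and `Candidate` classes, their fields, the thresholds, and the family and pathway names are hypothetical and do not describe the project's actual rule format.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Candidate:
    """Evidence gathered for one uncharacterized protein (hypothetical)."""
    identity_to_anchor: float          # fractional identity to the characterized protein
    neighbor_families: FrozenSet[str]  # families found in the surrounding gene region
    genome_pathways: FrozenSet[str]    # pathways judged present in the genome


@dataclass(frozen=True)
class Rule:
    """A narrow rule tied to one characterized protein (hypothetical)."""
    function: str
    min_identity: float
    required_neighbors: FrozenSet[str] = frozenset()
    required_pathways: FrozenSet[str] = frozenset()

    def apply(self, c: Candidate) -> Optional[str]:
        # Every test must pass; otherwise the rule makes no call at all.
        if c.identity_to_anchor < self.min_identity:
            return None
        if not self.required_neighbors <= c.neighbor_families:
            return None
        if not self.required_pathways <= c.genome_pathways:
            return None
        return self.function


# Invented example values, for illustration only.
rule = Rule(
    function="example function of the characterized protein",
    min_identity=0.60,
    required_neighbors=frozenset({"FAM_X"}),
    required_pathways=frozenset({"pathway_Y"}),
)

supported = Candidate(0.72, frozenset({"FAM_X", "FAM_Z"}), frozenset({"pathway_Y"}))
unsupported = Candidate(0.72, frozenset({"FAM_Z"}), frozenset({"pathway_Y"}))

supported_call = rule.apply(supported)
unsupported_call = rule.apply(unsupported)
```

Here the second candidate is just as similar to the characterized protein as the first, but the rule declines to annotate it because the expected neighboring family is missing from its gene region.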
The third is a systematic compilation of the right starting points for annotation. Annotation methods today are built to achieve maximum leverage from those few proteins whose functions are known for sure, but searching for those good anchors is surprisingly difficult, and searching repeatedly is wasteful. The CHAR database will collect experimentally characterized proteins and make them "rule-ready" and universally available. All of the resources developed through this project will be made publicly available. Together, these approaches let us read metabolic properties from microbial genome sequences more accurately and figure out better ways to fight disease.
Funding
NIH / National Human Genome Research Institute