Bioinformatics Software for Improved Protein/Peptide Analysis

Bioinformatics and structure-aided drug design are part of the same continuum. Bioinformatics offers a means to get to a structure through sequence, while structure-aided drug design offers a means to get to a drug through structure. We plan to combine innovative computational techniques with biochemical and structural expertise to bring bioinformatics and structure-aided drug design even closer together. In particular, we intend to blend computational chemistry with computational biology to create software that will aid protein chemists in understanding, evaluating and predicting the structure, function and activity of medically and industrially important proteins. My laboratory is currently involved in three bioinformatics projects: (1) the development of novel methods to identify remote sequence/structure relationships; (2) the creation of a compact, relational database with advanced bioinformatics functionality; and (3) the development of novel methods for predicting and evaluating protein secondary and tertiary structure. Specific details for each of the three projects are provided below.

Development of new methods for identifying remote sequence and structure relationships. Many protein chemists are interested in identifying sequence and structural relationships between remotely related proteins. Most "fast" sequence comparison methods used today do not allow these remote relationships to be detected. In many cases, the results depend strongly on the type of algorithm chosen, the scoring matrices used, the numerical cut-offs selected and the quality of the data in the database (especially dbEST). Even very slow but thorough dynamic programming methods can sometimes fail to identify remote sequence relationships. Furthermore, standard sequence comparisons are only capable of identifying evolutionary "divergent" relationships, not "convergent" relationships. Currently, the only way to identify convergent relationships is to incorporate information on the 3-D structure of the proteins of interest. To address these issues we are currently integrating, reorganizing and reducing the redundancy of several key protein sequence/structure databases (PIR, SWISS-PROT, PDB, dbEST). In particular, we have succeeded in classifying the 200,000 sequences in the OWL protein database into 20,000 representative families. This classification will accelerate search speeds, reduce errors arising from database redundancy and permit both convergent and divergent relationships to be extracted. Working from these "cleansed" databases, we intend to develop at least two different algorithms to facilitate remote sequence/structure identification. One approach will involve the use of multiple passes of a FASTALIGN algorithm, where low-scoring homologues identified in the first pass will serve as sequence queries for a second pass, any new homologues identified in this second pass will serve as queries for a third pass, and so on. We call this approach alignment "leap-frogging" or phylogenetic extension.
This kind of multiple-pass comparison can be done 10 to 100 times faster than standard Needleman-Wunsch algorithms. The second approach will involve using threading to perform sequence/structure alignments. The utility of threading is well established for identifying remote (divergent and convergent) relationships; however, it is limited by the quality of the PDB database and the quality of the threading potential, both of which we intend to enhance.
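
The multi-pass "leap-frogging" idea described above can be illustrated with a minimal Python sketch. This is not the FASTALIGN algorithm: the `similarity` function below is a crude stand-in (Jaccard overlap of 3-residue words) chosen only to keep the example self-contained, and the names `leapfrog_search`, `threshold` and `max_passes`, as well as the toy sequences, are assumptions made for illustration.

```python
def kmer_set(seq, k=3):
    """Return the set of overlapping k-residue words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard overlap of k-mer sets; a placeholder for a real alignment score."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def leapfrog_search(query_id, database, threshold=0.25, max_passes=3):
    """Multi-pass search: hits from one pass become the queries for the next.

    `database` maps sequence ids to sequences. Returns a dict mapping each
    hit id to the pass in which it was first found.
    """
    found = {query_id: 0}
    frontier = [query_id]
    for pass_no in range(1, max_passes + 1):
        next_frontier = []
        for qid in frontier:
            for sid, seq in database.items():
                if sid in found:
                    continue  # already reported in this or an earlier pass
                if similarity(database[qid], seq) >= threshold:
                    found[sid] = pass_no
                    next_frontier.append(sid)
        if not next_frontier:
            break  # no new homologues, so further passes cannot add anything
        frontier = next_frontier
    del found[query_id]
    return found


if __name__ == "__main__":
    # Toy sequences (made up): B overlaps both A and C, but A and C share nothing.
    toy_db = {
        "A": "MKTAYIAKQRQ",
        "B": "MKTAYIALDEGWH",
        "C": "PSLDEGWH",
        "D": "WWFFHHCCNN",
    }
    print(leapfrog_search("A", toy_db))
```

In this toy database a single pass from A reaches only B, while the second pass uses B as a query and reaches C, which has no direct similarity to A. A practical version would also need to guard against drift, where each pass admits progressively less related sequences.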

Creation of a compact relational database with advanced bioinformatics functionality (Biopedia). There are now more than 2000 electronic molecular/structural biology databases located around the world. The total volume of this information now exceeds 10 gigabytes and is growing rapidly. Most of these databases are stand-alone, text-only repositories containing highly specialized medical, mutational, sequence or coordinate data. Unfortunately, the number, size, redundancy and limited query capabilities of molecular/structural biology databases are preventing many researchers from making full use of the information contained within them. The limitations of present-day bioinformatics databases could largely be overcome if many of them could be combined, reorganized and integrated. In this project we propose to create a compact (<600 Mbytes), fully integrated, visually oriented bioinformatics database that combines sequence, structure, locational, functional and physiological data into a single entity. In creating this resource, we will develop software tools to mine selected databases (MedLine, OMIM, PDB, dbEST, PIR) and extract the appropriate sequence, structure, locational, functional and physiological data. From the mined data we will create a relational/object-oriented database using keywords and deduced rules extracted during the data mining process. This will be further integrated with standard bioinformatics capabilities (alignment, pattern searching, structural superposition, graphing, structure prediction, etc.) to yield a database that can support visual queries, SQL queries and standard bioinformatics queries. We expect this kind of database to be a very useful tool for extracting new information and gaining new insights into molecular and structural biology.
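
To make the idea of combining sequence, structure and functional data under SQL queries more concrete, here is a minimal sketch using Python's standard `sqlite3` module. It is not the Biopedia schema: the table names, column names, the helper `solved_proteins_with_keyword` and all rows are hypothetical placeholders invented for this example.

```python
import sqlite3

# Hypothetical three-table layout: proteins, their structures, and keywords.
SCHEMA = """
CREATE TABLE protein (
    protein_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    sequence   TEXT NOT NULL
);
CREATE TABLE structure (
    structure_id INTEGER PRIMARY KEY,
    protein_id   INTEGER NOT NULL REFERENCES protein(protein_id),
    method       TEXT
);
CREATE TABLE annotation (
    protein_id INTEGER NOT NULL REFERENCES protein(protein_id),
    category   TEXT NOT NULL,  -- e.g. 'function', 'location', 'physiology'
    keyword    TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database filled with made-up placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?, ?)",
        [(1, "example_protein_1", "MKTAYIAKQRQ"),
         (2, "example_protein_2", "PSLDEGWH")],
    )
    conn.executemany("INSERT INTO structure VALUES (?, ?, ?)", [(1, 1, "NMR")])
    conn.executemany(
        "INSERT INTO annotation VALUES (?, ?, ?)",
        [(1, "function", "kinase"),
         (1, "location", "cytoplasm"),
         (2, "function", "kinase")],
    )
    return conn


def solved_proteins_with_keyword(conn, category, keyword):
    """Names of proteins that carry the keyword AND have a structure entry."""
    rows = conn.execute(
        """
        SELECT DISTINCT p.name
        FROM protein AS p
        JOIN annotation AS a ON a.protein_id = p.protein_id
        JOIN structure  AS s ON s.protein_id = p.protein_id
        WHERE a.category = ? AND a.keyword = ?
        ORDER BY p.name
        """,
        (category, keyword),
    ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    print(solved_proteins_with_keyword(conn, "function", "kinase"))
```

The point of the sketch is that a single join answers a question ("which proteins with this function have a known structure?") that would otherwise require consulting separate sequence, structure and annotation repositories.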

Development of improved methods for protein structure prediction and evaluation. Structure prediction and structure evaluation are key to many areas of protein engineering and structural biology. Both areas are teaching us much about the forces that stabilize proteins and protein-ligand interactions. This project is divided into three main subprojects: (1) structure evaluation; (2) secondary structure prediction; and (3) tertiary structure prediction/generation. In the area of structure evaluation we will combine several methods (coordinate analysis [VADAR], contact potential analysis [MC-SYM], artificial intelligence, statistical analysis and neural networks) to create a program suite that will "expertly" assess the quality and correctness of a protein fold, generated either from experimental data or by computational folding protocols. In the area of secondary structure prediction we are working on a predictive method called ESP (Evolutionary-based Structure Prediction) to improve the quality of secondary structure predictions for proteins. Secondary structure will be predicted by performing full-scale database searches to identify homologues of the query protein. These homologues will then be multiply aligned and the secondary structure of each sequence will be predicted using conventional statistical methods. These predictions will then be weighted according to each homologue's level of sequence similarity to the query sequence and combined to produce a consensus secondary structure. The consensus predictions will then be filtered further through adaptive learning networks to reduce logically inconsistent assignments. We expect this kind of algorithm to attain an accuracy of >75%.
In the area of tertiary structure prediction and computational protein folding, improved contact potentials (particularly aimed at describing paired beta-strands), enhanced loop modeling, and implementation of previously identified sub-structure biases (obtained through data mining) will be used to improve the quality of the structures generated through Genetic Algorithm/Monte Carlo sampling procedures. These algorithmic improvements will also be applied to the generation of three-dimensional protein structures using minimal experimental information (NMR chemical shifts, predicted beta-strand pairs and sparse NOE data).
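
The similarity-weighted consensus step of the secondary structure method described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ESP program: it assumes three states (H = helix, E = strand, C = coil), per-homologue predictions already aligned to the query, gaps written as "-", and a weight per homologue such as its fractional identity to the query. The function name and the tie-breaking rule (ties fall back to coil) are also assumptions, and the subsequent filtering by adaptive learning networks is not shown.

```python
STATES = "HEC"  # helix, strand, coil (assumed three-state alphabet)


def consensus_secondary_structure(predictions):
    """Combine aligned per-homologue predictions by weighted voting.

    `predictions` is a list of (aligned_prediction, weight) pairs, where each
    prediction is a string over H/E/C with '-' marking alignment gaps.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    length = len(predictions[0][0])
    if any(len(pred) != length for pred, _ in predictions):
        raise ValueError("all predictions must have the same aligned length")

    consensus = []
    for i in range(length):
        votes = dict.fromkeys(STATES, 0.0)
        for pred, weight in predictions:
            state = pred[i]
            if state in votes:  # gaps and unknown symbols cast no vote
                votes[state] += weight
        # Highest weighted vote wins; coil is preferred when it ties for first.
        best = max(STATES, key=lambda s: (votes[s], s == "C"))
        consensus.append(best)
    return "".join(consensus)


if __name__ == "__main__":
    # Made-up predictions: the query itself plus two homologues.
    aligned = [
        ("HHHHCCEE", 1.0),
        ("HHCCCCEE", 0.6),
        ("HHHH-EEE", 0.5),
    ]
    print(consensus_secondary_structure(aligned))
```

Weighting by similarity means a close homologue can outvote several distant ones at a disputed position, which is the behaviour the text describes; how the weights are actually derived is left open here.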
