Statistical Analysis of Biological Interactions of Homologous Proteins

Dissertation Defense, Nov 17, 2008, 10:00AM – 11:30AM, Wachman 447

Qifang Xu


Dr. Zoran Obradovic, Advisor (Computer & Information Sciences department)
Dr. Roland Dunbrack (Fox Chase Cancer Center)
Dr. Richard Coico (Microbiology and Immunology department)
Dr. Slobodan Vucetic (Computer & Information Sciences department)
Dr. Longin Jan Latecki (Computer & Information Sciences department)
Information fusion aims to develop intelligent approaches for integrating information from complementary sources, so that a more comprehensive basis is obtained for data analysis and knowledge discovery. Our Protein Biological Unit (ProtBuD) database is the first database to integrate biological unit information from the Protein Data Bank (PDB), the Protein Quaternary Structure (PQS) server, and the Protein Interfaces, Surfaces and Assemblies (PISA) server, and to compare the three biological units side by side. The database is fast and modular, so it can be updated easily. Statistical analyses show that the inconsistency within and between these databases is significant: PDB and PQS disagree in 21% of cases, and PISA differs from PDB and PQS in 18%.

To resolve these inconsistencies, we studied interfaces across different PDB entries in a protein family, on the assumption that interfaces shared by different crystal forms are likely to be biologically relevant. A novel computational method is proposed to achieve this goal. First, redundant data were removed by clustering similar crystal structures and using one representative entry for each cluster. Then a modified k-d tree algorithm was applied to speed up the identification of interfaces in crystals. The interface similarity functions were derived from Gaussian distributions fit to the data. Hierarchical clustering was used to group interfaces, and the likely biological interfaces were defined by the number of crystal forms in a cluster. Benchmark data sets were used to determine whether the presence or absence of an interface across multiple crystal forms can be used to predict whether a protein is an oligomer. The probability that a common interface is biological when two or more crystal forms are available is given. Evolutionary information was used in evaluating interfaces found in more than one crystal form.
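As a rough illustration of the interface-identification step, the sketch below builds a plain 3-D k-d tree over the atoms of one chain and queries it with the atoms of another chain to find pairs within a distance cutoff. It is a textbook k-d tree, not the modified algorithm used in the dissertation; the function names, the toy coordinates, and the 5.0 Å cutoff are all assumptions made for the example.

```python
from math import dist

def build_kdtree(points, depth=0):
    """Build a 3-D k-d tree as nested tuples: (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % 3                      # cycle through x, y, z
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                # median point becomes the node
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def within_radius(node, query, radius, found=None):
    """Collect every point in the tree no farther than `radius` from `query`."""
    if found is None:
        found = []
    if node is None:
        return found
    point, axis, left, right = node
    if dist(point, query) <= radius:
        found.append(point)
    delta = query[axis] - point[axis]
    near, far = (left, right) if delta < 0 else (right, left)
    within_radius(near, query, radius, found)
    # The far subtree can only hold hits if the splitting plane is in range.
    if abs(delta) <= radius:
        within_radius(far, query, radius, found)
    return found

def contact_pairs(chain_a, chain_b, cutoff=5.0):
    """Return (atom_a, atom_b) pairs closer than `cutoff` (hypothetical value)."""
    tree = build_kdtree(list(chain_b))
    return [(atom, partner)
            for atom in chain_a
            for partner in within_radius(tree, atom, cutoff)]

# Toy coordinates (invented for illustration, not real structure data).
chain_a = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
chain_b = [(3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (20.0, 20.0, 20.0)]
pairs = contact_pairs(chain_a, chain_b, cutoff=5.0)
print(len(pairs), "atom pairs within the cutoff")
```

The tree lets each query skip subtrees whose splitting plane lies farther away than the cutoff, which is what makes contact detection cheaper than comparing every atom pair when chains and their crystal neighbors are large.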
An interface shared in two different crystal forms by divergent proteins is very likely to be biologically important. Finally, the interface data not only provide new interaction templates for computational modeling but also provide more accurate training and test sets for data-mining research on predicting protein-protein interactions. In summary, this dissertation shows how to apply computational methods effectively to biological problems. Specifically, we describe a framework based on databases in which different sources of biological unit information are integrated and new interface data are stored. So that users from the biology community can use all the data, a stand-alone software program, a web site with a user-friendly graphical interface, and a web service are provided.
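The idea of clustering interfaces and counting the crystal forms in each cluster can be sketched as follows. This is a minimal example under stated assumptions: it uses average-linkage agglomerative clustering with a fixed similarity threshold, and the interface names, crystal-form labels, similarity scores, and the 0.5 threshold are all invented for illustration. The dissertation's similarity functions, derived from Gaussian fits, are not reproduced here.

```python
from itertools import combinations

def cluster_interfaces(names, similarity, threshold):
    """Average-linkage agglomerative clustering on a pairwise similarity table.

    `similarity` maps frozenset({a, b}) to a score; merging stops once the
    best remaining pair of clusters scores below `threshold`.
    """
    clusters = [[n] for n in names]

    def linkage(c1, c2):
        scores = [similarity[frozenset((a, b))] for a in c1 for b in c2]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1])
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]   # i < j, so index i stays valid
        del clusters[j]
    return clusters

def crystal_forms_per_cluster(clusters, crystal_form_of):
    """Count the distinct crystal forms represented in each cluster."""
    return [len({crystal_form_of[n] for n in c}) for c in clusters]

# Hypothetical interfaces, each observed in one crystal form.
crystal_form_of = {"A1": "form1", "A2": "form2", "A3": "form3", "B1": "form1"}
names = list(crystal_form_of)

# Invented similarity scores: A1, A2, A3 resemble each other; B1 does not.
similarity = {frozenset(pair): 0.1 for pair in combinations(names, 2)}
similarity[frozenset(("A1", "A2"))] = 0.9
similarity[frozenset(("A1", "A3"))] = 0.8
similarity[frozenset(("A2", "A3"))] = 0.85

clusters = cluster_interfaces(names, similarity, threshold=0.5)
for members, n_forms in zip(clusters,
                            crystal_forms_per_cluster(clusters, crystal_form_of)):
    print(members, "seen in", n_forms, "crystal form(s)")
```

Under the assumption stated in the abstract, a cluster whose members come from several crystal forms would be the stronger candidate for a biological interface, while a cluster confined to a single form would be more likely to reflect crystal packing.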