Dartmouth
Computer Science Department
- Dartmouth Chemistry Department
- Dartmouth Biological Sciences
Department
- Dartmouth
Center for Structural Biology and Computational Chemistry
Algorithmic Challenges in Structural Molecular Biology and
Proteomics
This paper reviews our research in computational biology and chemistry. Some of the most challenging and influential opportunities for Physical Geometric Algorithms (PGA) arise in developing and applying information technology to understand the molecular machinery of the cell. Our recent work (e.g., [1-20]) shows that many PGA techniques may be fruitfully applied to the challenges of computational molecular biology. PGA research may lead to computer systems and algorithms that are useful in structural molecular biology, proteomics, and rational drug design.
Concomitantly, a wealth of interesting computational problems arise in proposed methods for discovering new pharmaceuticals. I'll briefly discuss some recent results from my lab, including new algorithms for interpreting X-ray crystallography [14,17,16] and NMR (nuclear magnetic resonance) data [3,9,6,19,10,5,7,18,4], disease classification using mass spectrometry of human serum [12], and protein redesign [13]. Our algorithms have recently been used, respectively, to reveal the enzymatic architecture of organisms high on the CDC bioterrorism watch-list [17,16], for probabilistic cancer classification from human peripheral blood [12], and to redesign an antibiotic-producing enzyme to bind a novel substrate [13]. I'll overview these projects, and highlight some of the algorithmic and computational challenges.
In the post-genomic era, key problems in molecular biology center on the determination and exploitation of three-dimensional protein structure and function. For example, modern drug design techniques use protein structure to understand how a drug can bind to an enzyme and inhibit its function. Structural proteomics will require high-throughput experimental techniques, coupled with sophisticated computer algorithms for data analysis and experiment planning.
My laboratory develops novel computational methods to enable high-throughput structural and functional studies of proteins. A key focus is structural genomics, whose goal is (in the broadest terms) to determine the three-dimensional structures of all proteins in nature, through a combination of direct experiments and theoretical analysis. Proteins are the worker molecules in every living thing. By determining the structures of proteins, we are better able to understand how each protein functions normally and how faulty protein structures can cause disease. Scientists can use the structures of disease-related proteins to help develop new medicines and diagnostic techniques.
PGA research in computational structural biology and proteomics can assist in our long-range goal of understanding biopolymer interactions in systems of significant biochemical as well as pharmacological interest. At the molecular level, many genes provide the blueprint for proteins, and it remains very expensive and time-consuming to determine what these proteins do, and how they do it. This paper reviews some novel algorithms to build three-dimensional models of proteins to better understand protein mechanism and function.
Our work spans a number of projects in computational structural biology, including algorithms for automated assignment and structure determination in NMR structural biology [3,9,6,19,10,5,7,18,4], protein redesign [13], computer-aided drug design [1,15,13], computational molecular replacement in X-ray crystallography [14,17,16], and other related projects [2,11,8]. To pursue research in the field of structural genomics is to study the geometric structures of proteins. Structural genomics is a field born of the the marriage between computer science and biology. The goal is to develop new technology - specifically, computer algorithms - to enable the determination of 10,000 new protein structures in 10 years. This would have an enormous impact on our understanding of disease mechanisms, and our ability to target drugs to specific protein targets. If successful, the impact would be comparable to that of the human genome project.
Modern automated techniques are revolutionizing many aspects of biology, for example, supporting extremely fast gene sequencing and massively-parallel gene expression testing. Protein structure determination, however, remains a long, hard, and expensive task. High-throughput, automated, algorithmic methods are required in order to apply modern techniques such as computer-aided drug design on a much larger scale. For example, to analyze non-crystallographic symmetry in X-ray diffraction data of biopolymers, one must ``recognize" a finite subgroup of (the Lie group of 3D rotations) out of a large set of molecular orientations. The problem may be reduced to clustering in modulo a finite group, and solved efficiently by ``factoring" into a clustering on the unit circle followed by clustering on the 2-sphere , plus some group-theoretic calculations [14]. This yields a polynomial-time algorithm that is efficient in practice, and which enabled us to collaborate with biological crystallographers to reveal the architecture of a parasite's enzyme, which could help researchers reduce the threat of certain diseases among those with weak immune systems [17,16]. As another example, we recently employed geometric techniques from statistical estimation and machine learning to develop an algorithm for cancer proteomics, in which we use data from a mass spectrometer to distinguish between healthy and diseased blood in humans [12]. Geometry pervades our work: [12] can be viewed as an investigation of the geometry of the oncoproteome (the space of cancer proteins) as projected onto the mass-to-charge ratios of their proteolytic digest. Nuclear vector replacement for automated NMR resonance assignments [6,10] is essentially a matching problem on a quotient space of orientations, induced by a quadratic form on . is parameterized by , and 3D structural homology detection from unassigned NMR data (enabling rapid fold determination) can be performed by combinatorial optimization, searching over to minimize a functional that compares distributions generated by 's image of the bond vectors from putative database protein models [9,10,5,7].
In each of our research projects, computational techniques are central, and the applications present intriguing problems to computer scientists who design algorithms and implement systems. For example, the techniques we introduced for automated NMR resonance assignment [6,10,3,9,7] and protein structure determination [19,18,5,7] are an instance a general approach to combinatorial problem solving in which constraints on the solution are enforced in an order determined by the strength of the evidence for them. This approach, which has analogies to the Celera whole-genome shotgun sequencing algorithm, also presents a flock of fascinating questions from the point of view of theoretical computer science (cf. Richard Karp's Keynote address, Computational Systems Bioinformatics Conference, 2003).
In this section (2), I attempt to illustrate the general themes introduced above, by way of specific examples and results. The NMR biophysics requires some technical description; some readers may prefer to skip to the higher-level discussion on Protein Fold Determination (p. ) or even to Section 3 (p. ).
In contrast to (e.g.) simulated annealing approaches, our algorithm is combinatorially-precise [18], and is built upon the exact solutions for computing backbone angles from RDC data and systematic search. [19] is the first NMR structure determination algorithm that simultaneously uses exact solutions, systematic search and only 2 RDCs per residue. (A systematic search is a search over all possible conformations (solutions) that employs a provable pruning strategy that guarantees pruned conformations need not be considered further). Our first contribution is the derivation of low-degree polynomial equations for computing, exactly and in constant time, dihedral angles from RDCs measured on a single internuclear vector in two different aligning media. The easily computable exact solutions eliminate the need for one-dimensional grid-search previously employed to compute the directions of or two-dimensional grid-search to compute angles. Furthermore, these equations are very general and can easily be extended to compute both the backbone and sidechain dihedral angles from RDC data measured on any single vector in two aligning media. And, our method can be applied mutatis mutandis to derive similar equations for computing dihedral angles from RDCs in nucleic acids. Compared with other algorithms for computing backbone structures using RDCs, our algorithm achieves similar accuracies but requires less data, relies less on statistics from the PDB and does not depend on molecular dynamics. Since RDCs can be acquired and assigned much more quickly than NOEs in general, our results show it is possible to compute structures very rapidly and inexpensively using mainly RDC restraints.
Most of the work described above concentrates on algorithms and system for analyzing biological data and biological problems. However, the techniques we develop can also be applied to synthetic problems such as a protein engineering. For example, in collaboration with Prof. Amy Anderson, we recently developed a novel ensemble-based scoring and search algorithm for protein redesign, and applied it to modify the substrate specificity of an antibiotic-producing enzyme in the non-ribosomal peptide synthetase (NRPS) pathway [13]. Realization of novel molecular function requires the ability to alter molecular complex formation. Enzymatic function can be altered by changing enzyme-substrate interactions via modification of an enzyme's active site. A redesigned enzyme may either perform a novel reaction on its native substrates or its native reaction on novel substrates. A number of computational approaches have been developed to address the combinatorial nature of the protein redesign problem. These approaches typically search for the global minimum energy conformation among an exponential number of protein conformations. We developed a novel algorithm for protein redesign, which combines a statistical mechanics-derived ensemble-based approach to computing the binding constant with the speed and completeness of a branch-and-bound pruning algorithm. In addition, we developed an efficient deterministic approximation algorithm, capable of approximating our scoring function to arbitrary precision. Our algorithm is the first provable -approximation algorithm for estimating partition functions for protein flexibility.
In practice, our approximation algorithm decreases the execution time of the mutation search by a factor of ten. To test our method, we examined the Phe-specific adenylation domain of the NRPS gramicidin synthetase A (GrsA-PheA). We used ensemble scoring, via a rotameric approximation to the partition functions of the bound and unbound states for GrsA-PheA, to predict binding of the wildtype protein and a previously described mutant (selective for leucine), and second, to switch the enzyme specificity toward leucine, using two novel active site sequences computationally predicted by searching through the space of possible active site mutations. The top scoring in silico mutants were created in the wetlab and dissociation/binding constants were determined by fluorescence quenching. These tested mutations exhibit the desired change in specificity from Phe to Leu. Our ensemble-based algorithm which flexibly models both protein and ligand using rotamer-based partition functions, has application in enzyme redesign, the prediction of protein-ligand binding, and computer-aided drug design.
This result represents a computational approach to reprogramming enzyme specificity, with the ultimate goal of combinatorial biosynthesis for small-molecule diversity. We are studying a family of enzymes responsible for the biosynthesis of hundreds of pharmaceutically-active peptide-like products. Understanding these enzyme functions will elucidate how natural biological products (e.g., antibiotics) are synthesized in vivo. To modify enzyme function, we are developing computational techniques to plan structurally-based site-directed mutations. By reëngineering the active site(s) to operate on different substrates, we hope to modify these different enzymatic steps with an eye to potential reprogramming of those steps for combinatorial biosynthesis. This opens the door to the possibility of using our redesigned enzymes for in vivo combinatorial chemistry, to create candidate drug leads for new antibiotics and other drugs. We are applying our algorithms to NRPS modules, whose products include natural antibiotics, antifungals, antivirals, immunosuppressants, and siderophores. NRPS have multiple domains with individual functions acting in an assembly-line fashion. We are modifying the active sites to switch the specificity of the amino acid-accepting domains from their natural substrates, to different amino acids. The modifications are planned and analyzed in silico, by developing new algorithms based on techniques from geometric algorithms, robotics, machine vision, and scientific computation. Our ``enzyme reprogramming" could allow the modified NRPS to synthesize different modified peptides. Exploration of the combinatorial space of new NRPS ``programs" will generate a large number of new compounds, which could then be screened for pharmaceutical activity.
There is much to be done. A primary focus will be to explore novel computational methods in structural biology, specifically, new algorithms for NMR resonance assignments and protein structure determination, with applications to structural genomics. I am particularly interested in new algorithms for structural biology using only a minimal number of inexpensive, fast experiments. I'm also interested in collaborating with structural biologists to develop novel algorithms and apply them to biological systems of significant biochemical and pharmacological importance. A model for this kind of work is our collaboration on the structure of dihydrofolate reductase-thymidylate synthase (DHFR-TS) from Cryptosporidium hominis [14,17,16]. Cryptosporidium is an organism high on the bioterrorism list of the Center for Disease Control (CDC), a Category B bioterrorist threat. Agents/diseases that fall under this category are given the second-highest priority, because they are moderately easy to disseminate; result in moderate morbidity rates and low mortality rates; and require specific enhancements of CDC's diagnostic and treatment capacities. There is currently no drug therapy for cryptosporidosis. The enzyme DHFR-TS is in the sole de novo biosynthetic pathway for the pyrimidine deoxyribonucleotide dTMP, and therefore an attractive drug target. Solving the structure of DHFR-TS from C. hominis opens the door to species-specific drug design, exploiting both structural and biophysical differences between the human enzyme and Cryptosporidium DHFR-TS.
We collaborated with Dr. Amy Anderson's lab to determine the structure of DHFR-TS from C. hominis, revealing a unique linker domain containing an 11-residue alpha helix that has extensive interactions with the opposite DHFR-TS monomer of the homodimeric enzyme [17,16]. Analysis of the structure of DHFR-TS from C. hominis and of previously solved structures of DHFR-TS from Plasmodium falciparum (a.k.a. malaria) and Leishmania major reveals that the linker domain primarily controls the relative orientation of the DHFR and TS domains. Using the tertiary structure of the linker domains, we have been able to place a number of protozoa in two distinct and dissimilar structural families corresponding to two evolutionary families and provide the first structural evidence validating the use of DHFR-TS as a tool of phylogenetic classification.
I am also interested more broadly in proteomics and functional genomics, both in developing novel algorithms for proteomic problems and bringing to bear structural, geometric, and biophysical insights (and algorithms) for proteomics. For example, in collaboration with the Norris-Cotton Cancer Center at Dartmouth, we are exploring oncoproteomic target selection using mass spectrometry [12]. We developed an algorithm called Q5 for probabilistic classification of healthy vs. disease whole serum samples using mass spectrometry. Q5 is the first closed-form, exact solution to the problem of classification of complete mass spectra of a complex protein mixture. It employs a discriminant back-projection algorithm to compute clues as to the molecular identities of differentially-expressed proteins and peptides. Q5 analyzes whole spectrum Surface-Enhanced Laser Desorption/Ionization Time of Flight (SELDI-TOF) Mass Spectrometry (MS) data, and was demonstrated on four real datasets from complete, complex SELDI spectra of human blood serum. We achieved sensitivity, specificity, and positive predictive values above 97% on three ovarian cancer datasets and one prostate cancer dataset. The Q5 method outperforms previous full-spectrum complex sample spectral classification techniques, and represents the first attempt to compute the molecular identities of the differentially-expressed proteins in two important MS data sets for ovarian and prostate cancer. Further investigation of our lead proteins and peptide fragments may enhance our understanding of the molecular basis of oncogenisis and could potentially lead to new therapeutic targets.
As discussed above (Sec. 3), we are designing and implementing planning algorithms to propose site-directed mutations for protein redesign. We are developing a general planner that can reprogram the specificity of many NRPS domains, from many biological systems. The results of our algorithms are being compared to known crystal structures, to biochemical activity assays, and to crystal structures of the modified domains bound to the proposed substrates. We are developing a predictive model for when and how well our planner will work, by characterizing the complexity, correctness, and completeness of our algorithms. We believe these algorithms may be generally useful to the structural biology community, for studies of protein-ligand binding and protein redesign. Our provable approximation algorithm represents a new technique for computer-assisted drug design, and a novel approach for docking flexible ligands to flexible active sites [13]. In the future, we will extend and apply our algorithms to the redesign of other enzymes, including polyketide synthase (PKS) systems, which synthesize polyketide products such as erythromycin, rapamycin and tetracycline. I believe there are broad potential applications of our techniques for modeling protein flexibility, redesigning enzymes, and evaluating the biophysical processes of binding and catalysis in protein biochemistry, and that these goals can (hopefully) be realized after a lot more hard work in developing geometric algorithms, provably-good approximation algorithms, statistical methods, and an array of algorithmic techniques for handling noise and uncertainty in combinatorial geometry and computational biophysics.
This work is supported by grants from the National Institutes of Health (R01 GM-65982 and R01 GM-67542), and the National Science Foundation (IIS-9906790, EIA-0102710, EIA-0102712, EIA-9818299, EIA-9802068, and EIA-0305444).
Bruce Randall Donald 2004-06-22