[Go
to SCIENCE Online with fewer graphics]
|
|
PubMed Citation || Related Articles in PubMed || Download to Citation Manager |
The comparison of the three-dimensional shapes of protein molecules poses a complex algorithmic problem. Its solution provides biologists with computational tools to organize the rapidly growing set of thousands of known protein shapes, to identify new types of protein architecture, and to discover unexpected evolutionary relations, reaching back billions of years, between protein molecules. Protein shape comparison also improves tools for identifying gene functions in genome databases by defining the essential sequence-structure features of a protein family. Finally, an exhaustive all-on-all shape comparison provides a map of physical attractor regions in the abstract shape space of proteins, with implications for the processes of protein folding and evolution.
If a living cell is viewed as a biochemical factory,
then its main workers are protein molecules, acting as catalysts, transporters,
and messengers, among other roles. A human genome encodes about 100,000 of
these biological macromolecules. Their functional diversity is made possible
by the diversity of three-dimensional (3D) protein shapes, also called
"structures," which are capable of highly specific molecular
recognition. Understanding or simulating the molecular processes involved
in the formation of protein structures and in their biological function
is a major challenge of molecular biology. However, in spite of many years
of focused research, we still lack a comprehensive and accurate theory
of protein structure based on physical and chemical principles. Fortunately,
and perhaps unexpectedly, a practical solution to the problem of predicting
protein shape and function from amino acid sequence (and thus, ultimately,
from nucleotide sequence) is provided by nature itself. Molecular evolution
has resulted in a dense network of kinship relations between proteins.
By inferring characteristics of one protein based on the function or structure
of its relatives, biologists exploit these evolutionary relations to predict
protein shape or functional properties.
This exploitation of evolutionary connectivity has become possible because of a wealth of molecular data about proteins from many different species. To date, biologists have read the complete nucleotide (and thus amino acid) sequences of well over 100,000 protein genes (1), and x-ray crystallographers and nuclear magnetic resonance spectroscopists have determined the 3D shapes of several thousand protein molecules (2). The molecular paleontology based on these data reveals a remarkable continuity of molecular evolution. The biochemical function of many proteins has persisted over large evolutionary time scales (even when cellular context has changed, such as in the transition from cells without a nucleus to those with a nucleus). As a protein molecule with an essential functional role evolves in the context of a living cell, a small number of amino acid residues crucial for its function tend to be strongly conserved (for example, residues that form a catalytically active site), while the rest of the protein sequence eventually undergoes considerable changes. The overall 3D structure also tends to remain essentially unaltered, even when all sequence memory appears to have been lost. This evolutionary resilience of protein 3D structure is the fundamental reason for the importance of protein shape comparison as a computational method in molecular biology.
Recent advances in molecular and structural biology have led to the determination of many 3D protein structures. This article reviews how solving the geometrical shape comparison problem leads to interesting evolutionary observations, to the prediction of function and structure in particular cases, and, on the basis of an all-on-all comparison, to an understanding of the distribution of known structures in shape space.
Exploiting the observation of evolutionary connections between proteins in order to predict some aspects of structure or function is simple in principle. If a protein is found to be evolutionarily related to another, then information about the function (or shape or enzymatic mechanism, among other attributes) of the one protein can be inferred from that of the other, with varying degrees of accuracy, depending on the evolutionary distance between them. The question then arises as to how evolutionary connections are best detected: by amino acid sequence comparison in 1D or by shape comparison in 3D?
The answer depends on the time interval that has elapsed since the presence of a common ancestor presumed to be similar in function and structure to the two extant descendants. At close evolutionary distances, string comparison between two protein sequences often suffices to establish evolutionary kinship. At larger evolutionary distances, more sophisticated methods must be used to identify subtle similarities in sequence patterns. For example, a method that uses sequence profiles compares probability values for each of 20 amino acids at matching positions in the two proteins under comparison. The most distant relations, however, are no longer detectable by current sequence analysis methods, however sophisticated, and require comparison of the 3D shapes of proteins.
Technically, protein sequence comparison is simpler than shape comparison and is routinely used in studies of protein evolution. Shape comparison can be used only if 3D structures are available (currently in a few percent of all cases), but it is more sophisticated and more powerful than sequence comparison, because similarity of shape remains detectable even though the sequence may have changed beyond recognition in the course of evolution. Comparing protein shapes rather than protein sequences is like using a bigger telescope that looks farther into the universe, and thus farther back in time, opening the door to detecting the most remote and most fascinating evolutionary relations.
An example of what can be done with protein shape comparison is the discovery of a common structural core (a common set of structural elements similarly arranged in space) in two apparently unrelated enzymes from different species with apparently different amino acid sequences. One is mammalian glycogen phosphorylase, a central control point in energy metabolism; the other is a DNA glucosyltransferase that protects the DNA of phage T4 against its own nucleases as it degrades the host's genome (3). Their shape similarity reflects a common chemical mechanism of diphosphate- and sugar-based chemistry, but their substrate specificities and cellular functions, and even their sizes, are very different. Methods for 3D shape comparison were instrumental in this discovery.
In geometrical and algorithmic terms, what is involved in shape comparison of two proteins? First of all, a typical 300-residue protein has about 3000 atoms distributed in space according to the convoluted ("folded") trajectory of the polymer chain. Recognizing common substructures between two such structures is in general a very complex combinatorial problem (which points in A are equivalent to which points in B?). The human visual system is very good at recognizing shapes; indeed, classical abstractions of protein architecture (Fig. 1) were established by structural biologists using visual inspection of structures (4, 5). However, as the number of known structures rapidly increased, visual inspection as a general method became inadequate because a human brain cannot easily store the shapes of thousands of complicated macromolecules and cannot easily process the large set of possible substructures. Computers have the advantage of tremendous storage capacity and processing speed but need adequate software. Software development for shape comparison requires (i) a suitable representation of the objects of study, (ii) an objective function to be optimized by (iii) a comparison algorithm, and (iv) appropriate decision rules concerning the significance of the result. Let us look at one way of approaching protein shape comparison.
A suitable representation. To make the problem computationally
tractable, one first simplifies the representation of protein shapes, keeping
essential features. For example, the complicated atomic structure can be
represented as a chain trace, that is, the ordered succession of residue
centers (C
atoms) described by their x, y, z coordinates (which accounts for
about 1 atom out of every 10 in the protein). In these terms,
the objective of a comparison of two protein shapes is an assignment of
one-to-one equivalence between the C
atoms, where nonmatching residue centers can be skipped in either chain.
In most applications, one also requires that the linear order of equivalent
pairs along the sequence is maintained, that is, that the continuity of
the polymer chain is considered a key aspect of shape.
An objective function to be optimized. Geometrical objective
functions can be formulated in terms of inter- or intramolecular distances
yielding, respectively, 3D and 2D comparison problems (Fig. 2).
In 3D comparison, one explicitly rotates and translates one molecule relative
to the other and measures intermolecular distances between equivalent points
in the two chains (Fig. 2A). The objective is to accommodate
the largest possible number of equivalent points within small deviations
in position, typically less than 2 to 3 Å. In 2D comparison,
3D shape is described with a matrix of all intramolecular distances between
the C
atoms.
Such a distance matrix is independent of coordinate frame but contains
more than enough information to reconstruct the 3D coordinates, except
for overall chirality, by distance geometry methods.
A comparison algorithm. How can the 2D matrix comparison be performed? Imagine sliding a (transparent) distance matrix on top of another one. Depending on the register of the two matrices, similar substructures will stand out as submatrices with similar patterns. This view leads to a combinatorial optimization problem of merging matching submatrices to larger consistent blocks of agreement by the removal of intervening rows and columns (Fig. 2B). Algorithmically, this can be treated by a trial-and-error (Monte Carlo) method. In the process of optimization, structurally equivalent regions can be filtered out with a fixed cutoff on acceptable differences of intramolecular distances or, as we prefer, with a continuous function defined in terms of relative distance deviations (6).
Appropriate decision rules. At the end of the optimization process, statistical significance of the comparison score for two proteins can be assessed with empirical criteria (calibrated on a large number of known examples). The results of the shape comparison of two proteins are typically reported in the form of equivalent sets of residues (alignments) (Fig. 2B) or as a 3D view of the matched parts of the two proteins (superimpositions) (Fig. 2A).
Many algorithms have been adapted to the problem of geometrical shape comparison of proteins, including branch-and-bound algorithms, brute force systematic searches, subgraph isomorphism algorithms, stochastic optimization by Monte Carlo or simulated annealing protocols, genetic algorithms, look-up or hashing methods, dynamic programming, and clustering (7). For most practical purposes, the algorithmic problem of 3D shape comparison of proteins (excepting the problem of comparing protein surface properties independent of the polymer trace) can be considered solved.
Beyond comparing two proteins, researchers also want to place new protein structures relative to the universe of all protein shapes, or at least relative to all known protein structures. This task is similar to that of finding a match to a fingerprint in a database, but more complicated in that similarities, and not just identities, are of interest. In particular, for a protein structure used as a query, researchers want to see all matches that score above some similarity threshold (for example, such as a threshold defined in terms of statistical significance). Our strategy for efficient searches in the database of 3D structures (2) is to first scan for obvious similarities using fast (but, in general, less accurate) procedures and then to rescan for more subtle similarities using more sophisticated (but slower) algorithms. We turn below to a brief description of these algorithms.
The fast search algorithm achieves simplification and speedup by representing
certain repetitive substructures (secondary structure elements, such as
helices
and
strands)
that consist of perhaps 50 to 150 atoms as 3D vectors anchored
at well-defined spatial positions (Fig. 3A). However,
this simplification can be misleading when subtle irregularities in the
coordinates lead to spurious differences in these vectors for proteins
that are actually similar in shape. The algorithm works by storing, in
a way convenient for geometrical lookup, a list of spatial relations between
such vectors taken from database proteins (8). Here,
lookup (or "hashing") is conceptually similar to looking up names
in a telephone book. The lookup procedure matches the vector relations
taken from the query protein with those in the stored list and proceeds
to sample a limited set of spatial superimpositions whenever enough matches
are found between the query protein and a database protein. Finally, a
dynamic programming step refines these superimpositions and generates detailed
residue-level alignments. The search of one structure against the structure
database of several thousand structures typically takes only about 5 min
on a computer workstation. Other simplified methods achieve similar speed
(7). In this way, a large portion (about 90%) of all
significant protein-protein shape similarities can be found (Fig. 3A).
The slower, more sophisticated algorithm is designed to deal with the
full combinatorial complexity of comparing two shapes in terms of the spatial
trace of the location of residue centers (C
atoms). As the general problem of finding the global best alignment of
two protein traces has the complexity of an NP-hard problem (9),
algorithmic solutions must either settle for an approximate solution or
risk sifting through an exponentially large search space (approximately
NM possible alignments of a sequence of N residues
onto a structure consisting of M segments of protein trace). To
solve this problem to a reasonable approximation, we have adapted the elegant
branch-and-bound algorithm by Lathrop and Smith (10)
that was originally developed for sequence-structure alignment (to optimally
fit the sequence of protein A into the structure of protein B), a problem
algorithmically similar to that of distance matrix comparison. The algorithm
iteratively splits the search space of many sequence-segment pairings into
subsets, calculates an upper bound of the objective function for each subset,
and focuses on further processing (splitting) the subset with the largest
upper bound. The chosen series of subsets eventually leads to a subset
that contains only a single alignment of protein A with protein B, which
corresponds to the exact global optimum of the objective function (Fig.
3B). Continuing the procedure past the global optimum
yields suboptimal solutions in monotonically decreasing order. Our adaptation
of this branch-and-bound procedure replaces the sequence of protein A by
the trace of residue centers of protein A and thus tests all residue-segment
pairings
that
is, all ways of placing residue centers of protein A at strategically chosen
positions in the structure of protein B (atthe beginning of all secondary
structure segments, for example).
For reasons of efficiency, we couple this branch-and-bound algorithm
to the hierarchical decomposition of a full structure into smaller compact
units [similar to "folding unit" decomposition or "domain"
decomposition (11)]; that is, we perform the comparison
in terms of well-defined substructures. Substructure decomposition is a
useful trick (heuristic) because a significant match between two proteins
is very likely to contain significant matches between well-chosen substructures.
As a result, most placements of residues in protein A onto segments in
protein B are pruned before they are examined explicitly. For example,
comparing the structures of transducin-
[Protein Data Bank code 1tag, 16 segments (12)]
with that of Ras p21 [5p21, 166 residues (13)]
leads to a nominal search of 1035 spatial arrangements, although
the best solution is found after only
11
s on a fast computer workstation.
The database search methodology containing these two algorithms, plus other tools, is made available over the Internet to users with a coordinate data set describing a 3D protein structure in hand (14). The searches aim to address questions such as which known proteins are related to the query protein in evolution, which parts of a query structure are most conserved, which pairs of proteins have similar internal architecture, and does the query protein represent a new shape (or new fold) not observed to date.
Protein scientists are interested not only in the evolutionary place of particular proteins, but also in a grand view of the architecture of all proteins. Although 10 years ago hand-assembled detailed catalogs of all protein structures would have fit into a review paper or book (5, 15), efficient algorithms of shape comparison and their implementation in computer programs are crucial for coping with the currently more than 4000 structures in the Protein Data Bank (2). Currently, Internet servers rather than printed publications are the preferred medium of dissemination (16). We have recently used shape comparison algorithms to perform an exhaustive all-on-all comparison in order to obtain a quantitative and objective overview of the currently known parts of the protein universe and, if possible, to arrive at a classification of architectural types. In processing the current database, two problems arise, one technical and the other conceptual in nature.
The technical problem is one of redundancy
that
is, unequal representation of protein families. For example, there are
230 crystal structures of engineered mutants of phage T4 lysozyme.
We can remove the family redundancy by equalizing all proteins with mutual
sequence identity greater than 25% (over most of their length, after optimal
sequence alignment) because these have essentially complete structural
overlap and in most cases similar function (17). Removing
such sequence redundancy from the April 1996 release of the Protein
Data Bank leaves a set of 740 representative proteins of known structure.
Many pairs in this set are still structurally similar to each other, in
spite of strong dissimilarity at the sequence level.
Next, in attempting to group structurally similar proteins within the set of 740 representative proteins, there is a conceptual problem, known as the problem of domains. Structural similarities within the set of proteins with unique sequences are typically restricted to only parts of the protein structure. Similar substructures, with relatively sharp boundaries, may recur between several proteins, and conversely, many proteins can be economically described as combinations of recurrent substructures (domains). The notion of such economical description is related to that of minimal encoding in information theory and, in this context, refers to the intuitive goal of defining a small set of large substructures in terms of which most protein structures can be described. In one attempt to achieve this goal, we have combined the notions of compactness and recurrence of domains. A compact domain has minimal surface and maximal interior residue-residue contacts. A recurrent domain is one that appears several times as a recognizably similar substructure in different proteins. This leads to an operational definition of substructures that makes use (i) of the property that normalized distance matrix similarity scores are strongest for complete overlap of large units and (ii) of a physical decomposition of protein structure into a tree of putative folding units at all size levels (18). Given a database of protein shapes, pairwise structural similarities, and alternative decompositions into substructures, the notion of maximal recurrence is implemented by selection of a set of substructures for which the sum of similarities is maximized across the database. As a result, the 740 proteins with unique sequences are split into 1048 domains.
Given this set of domains, one can now group structurally similar domains in a way that was not possible for the set of entire protein structures. There are several options for clustering domains into equivalence groups, none of them, in our opinion, ideal. We chose to group domains similar in shape into "domain fold" classes or simply "fold" classes by a process of average linkage clustering (19). Disregarding small, irregular domains and terminating clustering at an empirically chosen cutoff in similarity, the result is a set of fold classes whose members generally match over all secondary structure elements. This reduces the 1048 domains into 287 structurally unique folds that describe reasonably well the structures of the 740 sequence-unique proteins out of the approximately 4000 known protein structures. The list of currently known fold classes is a good starting point for attempts to better understand the genesis and diversity of protein shapes.
As more and more protein structures are determined experimentally (Fig. 4), automation of the comparison and classification process becomes indispensable. We now continuously monitor the rise in structural knowledge in terms of the appearance of new entries, new protein families, and new fold classes in the Protein Data Bank (2). Simple extrapolation leads us to expect 10,000 database entries, 1600 sequence-unique representative structures (sequence families), and 400 fold classes by the end of 1997. If current trends continue exponentially and without saturation, the 3D coordinates of at least 1 representative of up to 5000 protein sequence families will be known by the year 2000.
Conceptually, each protein structure may be imagined as a point in an abstract, high-dimensional fold space. At close range in this fold space, clusters represent protein families related through strong functional constraints (for example, hemoglobin and myoglobin). At intermediate range, clusters are related by shape similarity that does not necessarily reflect similarity of biological function [for example, globins and colicin A (20)]. At long range, the overall distribution of folds is dominated by five densely populated regions, which we call attractors (Fig. 5). Although the current distribution of folds is the result of several effects, including database bias, we put forward the hypothesis that these attractors represent both dominant folding pathways and evolutionary sinks that are the result of physical constraints.
Which basis set represents fold space? We have adopted a multivariate scaling method that discerns the presence or absence of similar features and mathematically amounts to solving an eigenvalue problem (21). The method is related to (but different in detail from) principal component analysis and has been used, for example, in archaeology, to arrange sites ("individuals") in a time series based on trends in the composition of excavated artifacts ("attributes") and, in molecular biology, to analyze the correlation between codon usage and level of gene expression (22). In our application here, the features are the similarity of an "individual" structure to each other "attribute" structure. We plotted the points in fold space in the 2D plane of the two dominant eigenvectors (Fig. 5A).
What is the evidence for the attractor hypothesis? The five dominant peaks in the distribution of domains in the 2D projection of shape space (Fig. 5A) contain domains with similar secondary structure composition and characteristic topological motifs (secondary structure elements plus loop connections). In the folded structures, the shared motifs are not exposed to solvent, so they are likely to form early on in the folding process and may represent nucleation sites. If this is true, the diversity of different folds that contain these core motifs would represent alternative evolutionary extension paths to add on more elements (23). Selective pressure in evolution from random or partially random sequences would be more likely to result in specifically folded stable structures in one of these regions. Economy of description, which underlies our quantitative derivation, may reflect economy of construction.
It would be tempting to speculate that five attractors exhaust the particularly
simple pathways of collapsing
helix
and
strand
elements into globular proteins. However, other solutions to folding up
proteins do exist and recur between unrelated families. One such example
is the so-called
trefoil
fold, which has internal threefold symmetry; it is described as a cone-shaped
barrel covered by three
hairpins
(24) and is not in any of the attractor regions. In
addition, about 10% of the known fold classes map to small clusters that
lack similarity to others. How many more attractor regions are there? Extrapolating
from folds that are known to exist, to folds that can exist, is a challenging
problem (25). We do anticipate the emergence of some
new basic folding patterns from membrane proteins, few of which are known
in structural detail. We would be surprised, however, if the number of
attractors more than doubled in the next 5 years.
As more protein structures are determined, the placement of each new
protein in shape space makes a contribution to the completion of the map
and can, in special cases, lead to a considerable gain in biological knowledge.
As an example, let us examine the steps used in unraveling the evolutionary
origins of DNA polymerase
,
a DNA repair enzyme. When the structure of DNA polymerase
was
solved (26), it turned out to be a structural outlier
compared to three other DNA and RNA polymerases of known structure. This
outlier role seemed to match its peculiarities in function, being smaller
and simpler (its polymerase action is stepwise rather than continuous)
than the other polymerases. Both features were put in context by the discovery
(27) of a close structural resemblance to kanamycin
nucleotidyltransferase, an enzyme conferring antibiotic resistance to bacteria.
The catalytic domains of DNA polymerase
and
kanamycin nucleotidyltransferase share not only a common substructure,
but also a sequence signature pattern that maps to the nucleoside triphosphate
binding site in the conserved domains (Fig. 5). Pattern
searches in sequence databases led to the identification of five additional
families of nucleotidyltransferases that are predicted to contain the same
substructure responsible for the nucleotide transfer reaction, which in
turn led to the definition of an extended enzyme family. In spite of their
relation in structure and basic biochemical function, the biological functions
of member enzymes are diverse, ranging from nucleic acid synthesis to the
regulation of biosynthetic pathways by nucleotidylation (Fig. 6).
Most evolutionary links are identified on the basis of sequence similarity,
but the most interesting new discoveries are the result of explorations
in the "twilight zone" of sequence similarity. Shape comparison
contributes, as it did for DNA polymerase
,
by helping to identify subtle but characteristic sequence patterns. The
procedure has these steps: structural alignment in 3D of two or more known
structures, definition of the pattern of conserved residues in 3D, sequence
database searches using that pattern to identify additional candidates,
multiple sequence alignment in each candidate family to check consistency
of conservation of the search pattern, building explicit 3D models by homology,
and verification that the models are physically plausible in terms of sequence-structure
fitness (determining how well can the amino acid sequence be accommodated
in the 3D structure). This process has already led to the unification of
several large sets of functionally related protein families into extended
families, with further simplifications expected.
The growth of sequence and function data from genome projects and 3D structures from experimental structural biology should yield a complete catalog of all proteins soon. Orphan sequences, with no known relatives detectable by sequence alignment, are already diminishing in number (28), and observations of the recurrence of similar substructures in remotely related proteins are more frequent. As more experimentally determined proteins structures become available and computational tools improve, model building by homology will yield a rapidly increasing fraction of all possible 3D models of natural proteins. As a result, the protein folding problem that has traditionally been the focus of computational molecular biology will fade in importance and be replaced by other challenges. In time, computational biologists will move beyond the mere description of evolutionary relations to a quantitative and predictive model of the evolution of proteins. The increasingly complete knowledge of protein structure will be used as a basis for detailed modeling of protein function, protein-protein interactions, and metabolic or signaling pathways. Mapping the protein universe by surveying and classifying protein shapes (29) is a key contribution to these endeavors.
|
where the summation is over all residues i,j of the common core,
dij* denotes the arithmetic mean of the C
-C
distances dijA and dijB
in proteins A and B, a relative deviation of 0.2 (20%) is the threshold
of similarity, and the exponential factor downweights contributions from
pairs at longer distances. The score is elastic, which means that close
contacts (for example, adjacent strands in a sheet) may vary 5 ± 1 Å but
helix-helix contacts may shift 10 ± 2 Å. The
optimal structural alignment is that set of equivalences (iA,
iB) that maximizes S.
PubMed Citation || Related Articles in PubMed || Download to Citation Manager |
Copyright © 1998 by the American Association for the Advancement of Science.