Nucleic Acids Research, 2004, Issue 22
Looking into DNA recognition: zinc finger binding specificity
     Laboratoire de Biochimie Théorique, CNRS UPR 9080, Institut de Biologie Physico-Chimique, 13 rue Pierre et Marie Curie, Paris 75005, France

    * To whom correspondence should be addressed. Tel: +33 1 5841 5016; Fax: +33 1 5841 5026; Email: rlavery@ibpc.fr

    ABSTRACT

    We present a quantitative, theoretical analysis of the recognition mechanisms used by two zinc finger proteins: Zif268, which selectively binds to GC-rich sequences, and a Zif268 mutant, which binds to a TATA box site. This analysis is based on a recently developed method (ADAPT), which allows binding specificity to be analyzed via the calculation of complexation energies for all possible DNA target sequences. The results obtained with the zinc finger proteins show that, although both mainly select their targets using direct, pairwise protein–DNA interactions, they also use sequence-dependent DNA deformation to enhance their selectivity. A new extension of our methodology enables us to determine the quantitative contribution of these two components and also to measure the contributions of individual residues to overall specificity. The results show that indirect recognition is particularly important in the case of the TATA box binding mutant, accounting for 30% of the total selectivity. The residue-by-residue analysis of the protein–DNA interaction energy indicates that the existence of amino acid–base contacts does not necessarily imply sequence selectivity, and that side chains without contacts can nevertheless contribute to defining the protein's target sequence.

    INTRODUCTION

    Some DNA-binding proteins can target DNA sequences with remarkable specificity. In the case of the human genome, this implies locating one or a handful of sites among billions of base pairs. Understanding the mechanism of such recognition is important not only for identifying binding sites within genomic sequences, but also for understanding how point mutations within a protein, whether they occur naturally or are deliberately induced, will influence binding specificity. Such knowledge will be the key to future protein design efforts.

    In practice, despite early optimism, protein–DNA recognition has turned out to be much more complicated than expected. The first models of recognition assumed that a specific DNA target was selected as the result of a finite number of hydrogen bonds or steric interactions between amino acid side chains and bases (1). Although the analysis (2,3) and also the modeling (4–7) of complexes in these terms has been undertaken, it has been recognized that, at least in a number of cases, there are simply not enough direct interactions to explain specificity. This led to the idea that the sequence dependence of DNA deformation can also play a role in recognition. This so-called indirect component of recognition is naturally more difficult to quantify since its importance cannot be judged by simply looking at the conformation of a complex. Although indirect effects have been confirmed experimentally in cases of extreme deformation, such as that induced by the TATA box binding protein (8), molecular modeling has been the principal source of data on sequence-dependent DNA mechanics (9–11) and dynamics (12). Modeling studies are, however, generally too expensive to be used for a comprehensive study of selectivity, which, for typical protein targets, implies comparing millions of potential binding sequences.

    To overcome this problem, we have recently developed an approach termed ADAPT. This method is based on an all-atom representation of a protein–DNA complex, derived from crystallographic data, coupled with the calculation of the principal sequence-dependent terms of the complexation energy, using a molecular mechanics force field (13–15). By using a multi-copy approach for the DNA base pairs, we are able to study the complexation energy for all possible DNA sequences of a given complex (4^N for a target sequence of N base pairs). ADAPT has already been used to successfully reproduce experimental consensus sequences for a variety of DNA-binding proteins, as well as the ordering of free energy changes for a number of distinct binding sequences (13). Since ADAPT calculates both protein–DNA interaction energy (Eint) and DNA deformation energy with respect to an unbound reference DNA (Edef), it is also possible to use the results to analyze recognition in terms of its ‘direct’ component (involving hydrogen bonding and steric compatibility at the protein–DNA interface, contained within Eint) and its ‘indirect’ component (sequence-dependent DNA deformation, contained within Edef).

    In the present study, we apply ADAPT to the analysis of the binding specificity of zinc finger proteins. This important family of proteins (the largest family within eukaryotic organisms) is generally taken to be a textbook example of the classical type of recognition, based on an additive set of protein–DNA interactions. This idea is supported by the behavior of hybrid proteins constructed in so-called finger-swapping experiments (16,17). Here, we hope to delve deeper into the binding mechanisms by breaking recognition down into its components, looking at how individual amino acids contribute to specificity, whether indirect effects contribute and, if so, what the relative importance of the different structural components of DNA deformation is. For this analysis, we have also developed indices based on information theory which enable us to make a fully quantitative analysis of terms that are generally only discussed qualitatively.

    The zinc finger proteins that we have chosen belong to the Cys2-His2 family. Zif268, a well-studied member of this family, containing three zinc ‘fingers’, binds to a GC-rich target sequence. Each finger is constituted by a 24 residue ββα motif, which chelates a zinc ion. Recognition appears to involve four amino acids within the α-helix of this motif (see Figure 1A). The apparent modularity of recognition within a single finger led to the hope that mutations might again be used in a simple manner to modify specificity. Such studies have actually revealed a more complex picture of Zif268 recognition (18). As with other proteins, amino acid–base interactions do not turn out to be additive, and a single amino acid may be involved in the recognition of several DNA bases. Moreover, mutating the key residues within a finger may also lead to repositioning of the finger with respect to DNA and thus to a completely different recognition pattern. This effect is particularly clear in the work carried out by the group of Pabo (19), who designed mutants of Zif268 that showed selectivity for a TATA box sequence, GCTATAAAA, very different from their wild-type, GC-rich consensus sequence, GCGTGGGCG (see Figure 1B).

    Figure 1. Sequence logo and protein–DNA interaction pattern for Zif268 and TATA–Zif DNA complex. (A) Key amino acid–base contacts for the wild-type Zif268 complex. Solid arrows indicate hydrogen bonds and dashed arrows indicate van der Waals contacts (28). (B) Key amino acid–base contacts for the TATA–Zif complex, using the same representation. (C) Comparison of the computed and experimental (29) consensus sequence logos for the Zif268 complex. (D) Computed consensus sequence logo for the TATA–Zif complex.

    Here, we use our methodological tools to analyze the recognition of both Zif268 and the TATA binding mutant (hereafter termed TATA–Zif), with the specific aim of testing the two assumptions generally thought to apply to zinc finger binding, namely: (i) specificity relies entirely on amino acid–base interactions (with no so-called indirect contributions from sequence-dependent DNA deformation) and (ii) individual contributions of amino acid–base interactions can be identified visually from the structures of the corresponding protein–DNA complexes.

    The results suggest that although zinc finger recognition is indeed dominated by direct interactions, DNA deformation can play a significant role. It is also found that structural data alone need not be a reliable guide to the role of individual amino acids in determining sequence specificity. Although these conclusions do not help to simplify an already complex story, they illustrate a new way to analyze recognition mechanisms, which can also be applied to understanding the impact of point mutations within chosen families of protein–DNA complexes.

    MATERIALS AND METHODS

    Before discussing the information theory indices used in this article to make a quantitative analysis of protein–DNA recognition, we present an overview of the ADAPT methodology. For complete details, the reader is referred to our earlier publication (13).

    Energy calculations

    All energy calculations are performed with a modified version of the JUMNA molecular mechanics program. JUMNA uses an all-atom representation of DNA combined with a mixture of helicoidal and internal variables for positioning individual nucleotides, and for modifying their conformations in terms of torsion angles and certain valence angles (20). Energy calculations were carried out with the AMBER PARM98 force field (21). Solvent and counter ion electrostatic damping effects, respectively, are included by using a sigmoidal distance-dependent dielectric function (with a slope of 0.356 and a plateau value of 80) and reduced net charges on the DNA phosphate groups (–0.5e) (20). For the present calculations, we have used a soft-core version of the standard Lennard–Jones potential for all protein–DNA interactions (13). This modification introduces a cubic potential at short range to smoothly damp repulsion and to limit its maximum value (chosen here to be 10 kcal mol–1). The energy minimizations discussed below are carried out in the JUMNA coordinate system (mixing helicoidal and internal variables), using a quasi-Newton algorithm with analytically calculated energy derivatives and a convergence criterion of 10^-4 on the predicted energy change at the last cycle of minimization.
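    The functional form of the sigmoidal dielectric is not written out in the text. As an illustration only, the sketch below assumes the form eps(r) = D - (D - D0)/2 * [(S r)^2 + 2 S r + 2] * exp(-S r), a sigmoidal expression often used with distance-dependent dielectrics of this kind. The slope S = 0.356 and plateau D = 80 are taken from the text; the short-range value D0 = 1, the assumption that distances are in angstroms, and the function names are hypothetical choices made for this example.

```python
import math

def sigmoidal_dielectric(r, slope=0.356, plateau=80.0, d0=1.0):
    """Distance-dependent dielectric (assumed functional form, see lead-in).

    r       : interatomic distance (assumed to be in angstroms)
    slope   : sigmoid slope, 0.356 in the text
    plateau : long-range value, 80 in the text
    d0      : short-range value; 1.0 is an assumption, not given in the text
    """
    sr = slope * r
    return plateau - 0.5 * (plateau - d0) * (sr * sr + 2.0 * sr + 2.0) * math.exp(-sr)

def damped_coulomb(q1, q2, r):
    """Coulomb energy (kcal/mol) for charges in units of e, screened by eps(r)."""
    # 332.06 converts e^2/angstrom to kcal/mol.
    return 332.06 * q1 * q2 / (sigmoidal_dielectric(r) * r)

# Two reduced-charge phosphates (-0.5e each) separated by 7 angstroms.
print(damped_coulomb(-0.5, -0.5, 7.0))
```

    With this form the dielectric rises smoothly from D0 at contact to the plateau at long range, so short-range interactions are only weakly screened while distant charges are strongly damped.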

    Multi-copy bases

    In order to be able to study all possible DNA sequences bound to a given protein, we have introduced the concept of nucleotides carrying multi-copy bases, termed ‘lexides’. Such nucleotides contain all four standard bases, adenine (A), cytosine (C), guanine (G) and thymine (T), superposed in space, and linked to the same sugar C1' atom. All bases thus share a common C1'-(N1/N9) glycosidic torsion angle. In terms of energy, the individual bases within a given lexide do not interact with one another. The contributions of each of these bases to the total energy of the system are controlled by variable coefficients (normalized to 1.0 for each lexide). By setting all weights to 0.25, it is possible to carry out calculations on a DNA with an ‘average’ sequence. This option has been used for creating unbound reference DNA structures, and for removing unnecessary sequence-dependent deformations from the crystallographic complex conformations (see below). Energy calculations involving lexides can be stored in a matrix, which groups together terms involving the interactions of each possible base at each nucleotide position with the DNA backbones and any bound protein (along the diagonal), and also the interactions between such bases (off-diagonal entries). This matrix can then be used to rapidly calculate the energy for any chosen sequence by simply summing the matrix elements corresponding to the appropriate bases. Note that since we are only interested in conventional Watson–Crick base pairs in the present study, base-paired lexides can be grouped together into multi-copy lexide pairs, thus reducing the size of the energy matrix and the number of additions to be carried out.
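    The matrix lookup described above can be sketched as follows. This is a minimal illustration, not the ADAPT implementation: the array layout (`single` for the diagonal terms, `pair` for the off-diagonal terms), the function name and the random toy numbers are all assumptions made for the example.

```python
import numpy as np

BASES = "ACGT"

def sequence_energy(seq, single, pair):
    """Sum the matrix elements selected by `seq`.

    single[i, a]     : energy of base a at position i with the backbones and
                       protein (the 'diagonal' terms)
    pair[i, a, j, b] : interaction between base a at i and base b at j, i < j
                       (the 'off-diagonal' terms)
    """
    idx = [BASES.index(b) for b in seq]
    n = len(idx)
    energy = sum(single[i, idx[i]] for i in range(n))
    energy += sum(pair[i, idx[i], j, idx[j]]
                  for i in range(n) for j in range(i + 1, n))
    return float(energy)

# Toy matrices filled with made-up numbers, for illustration only.
rng = np.random.default_rng(0)
n_pos = 4
single = rng.normal(size=(n_pos, 4))
pair = rng.normal(scale=0.1, size=(n_pos, 4, n_pos, 4))

print(sequence_energy("GCGT", single, pair))
```

    Because the matrices are built once, scoring each additional sequence costs only a handful of additions, which is what makes an exhaustive scan of sequence space affordable.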

    Constructing a protein–DNA complex and a free DNA oligomer

    The starting point for our calculations is the crystallographic structure of an appropriate protein–DNA complex taken from the Protein Data Bank (PDB) (22). Before binding energy calculations are carried out, these data are processed in several ways. First, any unpaired, terminal DNA bases are removed. Next, the DNA fragment is rebuilt using lexides from the JUMNA library. Since crystallographic DNA fragments are often quite short, six flanking nucleotide pairs (in a canonical B-DNA conformation) are added to each end of the fragment to avoid possible end-effects. We also add missing hydrogen atoms to the protein residues.

    We next carry out energy minimization under the following restraints: (i) the conformation of the protein backbone is fixed, (ii) an ‘average’ sequence is imposed along the whole DNA fragment (all lexide coefficients set to 0.25) and (iii) the geometry of the protein–DNA interface is maintained by requiring that all DNA atoms belonging to successive nucleotide pairs contacted by the protein remain in the same positions relative to one another. In this context, ‘contacts’ refer to all protein–DNA atom pairs separated by <4 Å. Relative atom positions are maintained using flat-well quadratic constraints allowing 0.1 Å of freedom of movement; neither hydrogen atoms nor phosphate anionic oxygens are constrained. This procedure allows us to generate a protein–DNA conformation that respects the specific interactions characterizing the complex, but removes any fine sequence-dependent structural details related to the specific DNA sequence used for crystallization.
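    A flat-well quadratic restraint of the kind mentioned above can be sketched as a penalty that is zero inside a tolerance window and grows quadratically outside it. The sketch below assumes that the restraint acts on an interatomic distance, that the 0.1 Å freedom is a half-width on either side of the reference value, and that the force constant `k` is a free parameter; none of these details is specified in the text.

```python
def flat_well_penalty(d, d_ref, half_width=0.1, k=1.0):
    """Flat-well quadratic penalty on a distance d (angstroms).

    Zero while |d - d_ref| <= half_width, quadratic beyond that.
    half_width and k are illustrative assumptions.
    """
    excess = abs(d - d_ref) - half_width
    return k * excess * excess if excess > 0.0 else 0.0

print(flat_well_penalty(3.05, 3.0))  # inside the flat well
print(flat_well_penalty(3.30, 3.0))  # outside the well, penalized
```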

    Having obtained a protein–DNA complex conformation as described above, we also generate a conformation of the corresponding free DNA oligomer. This oligomer has the same length as that in the complex, but is again constructed using an ‘average’ base sequence, where each base pair is present with a coefficient of 0.25. This oligomer is energy minimized without any structural constraints and is taken to represent an optimal ‘sequence-averaged’ B-DNA conformation.

    Complexation energy and its decomposition

    Once conformations have been obtained for the protein–DNA complex and the free DNA oligomer, we can obtain the complexation energy for a given base sequence. This requires inserting the appropriate sequence in both conformations and calculating the interaction energy between the protein and the DNA (Eint) and the deformation energy necessary to pass from the free DNA oligomer to its complexed conformation (Edef).

    As explained above, using multi-copy lexide pairs enables us to make these calculations very fast by using a pre-constructed energy matrix. This matrix groups together all components of the energy, which depend on the given lexide coefficients. Thus, the protein–DNA interaction energy, when it concerns DNA bases, will be grouped base by base along the diagonal of the matrix. Similarly, the internal energy of the complexed DNA (minus the energy of the free oligomer), when it concerns the base interactions, will be placed in off-diagonal matrix elements, while base–backbone interactions will again lie along the diagonal. Obtaining the complexation energy of a given base sequence then only requires adding up the elements of the energy matrix, which correspond to the presence of the desired base pair at each position along the DNA oligomer. Note that the internal energy of the protein is not required since its conformation and thus, its energy, is taken to be unchanged by modifications to the DNA sequence. Note also that the protein–DNA interaction energy can, if desired, be broken down into contributions from individual amino acids. This option is used in the present study to investigate how given residues contribute to sequence specificity.

    Sequence selectivity in terms of information content

    Once we have calculated the energies for all possible sequences, we consider that binding sites are those which fall below an energy cutoff with respect to the complexation energy of the optimal sequence. On the basis of our earlier studies (13), this cutoff is taken to be 5 kcal mol–1. The group of M sequences selected in this way (where M = 677 for Zif268 and M = 776 for the mutant protein) can be described by a weight matrix, where each element fij is the frequency with which a base j (A, C, G or T) occurs at position i. In terms of information theory, this group of sequences exhibits an entropy Ri at each position i:

    $$R_i = -\sum_{j \in \{A,C,G,T\}} f_{ij} \log f_{ij}$$

    Since early work in information theory dealt with binary data (23), entropy is typically calculated in terms of bits (a choice between two possible states). This implies using log2 in the equation above. Although this choice has generally been conserved (24,25) when dealing with biological sequence information, each position in a binding site requires a choice between four states (that is, the four possible base pairs). This implies that Ri will range between 0 (when one base pair appears with a frequency f = 1) and 2 (when all four base pairs appear with equal frequencies of f = 0.25). In fact, choosing log4 is more natural in this case, since the resulting entropy unit then corresponds to a choice between four possible states and the range of Ri becomes 0 to 1. This choice is used here; however, to return to traditional units of bits it is enough to multiply the results by two.
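    As a small illustration of the choice of logarithm base, the sketch below builds a weight matrix from a set of equal-length sequences and evaluates the per-position entropy in base 4 and in bits. The function names and the toy sequences are assumptions made for this example.

```python
import math
from collections import Counter

BASES = "ACGT"

def weight_matrix(sequences):
    """Return one dict per position giving the frequency of each base."""
    m = len(sequences)
    matrix = []
    for column in zip(*sequences):
        counts = Counter(column)
        matrix.append({b: counts[b] / m for b in BASES})
    return matrix

def entropy(freqs, base=4):
    """Shannon entropy of one weight-matrix column; 0*log(0) is taken as 0."""
    h = -sum(f * math.log(f, base) for f in freqs.values() if f > 0)
    return max(0.0, h)

# Toy set: position 0 is always A, position 1 is uniformly random.
wm = weight_matrix(["AA", "AC", "AG", "AT"])
for column in wm:
    print(entropy(column, base=4), entropy(column, base=2))
```

    The base-2 value is always twice the base-4 value, which is the conversion to bits mentioned above.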

    Note that entropy is the complement of information content, since Ri = 1 means that protein binding provides no sequence information at position i, while Ri = 0 means complete sequence information has been determined. The change of information content at position i in passing between two states with entropies $R_i^{(1)}$ and $R_i^{(2)}$ is consequently:

    $$I_i = R_i^{(1)} - R_i^{(2)}$$

    If the initial state is the unbound DNA, the entropy is maximal, since each of the 4^N possible sequences occurs with equal probability and the information content gained in selecting a set of M binding sequences becomes:

    $$I_i = 1 - R_i = 1 + \sum_{j \in \{A,C,G,T\}} f_{ij} \log_4 f_{ij}$$

    Note that the information content per site (like Ri) can range from 0 (no sequence preference) to 1 (an absolute requirement for a single base pair). Ii therefore characterizes the exact mixture of base pairs selected at a position i within the set of acceptable binding sites, and can be considered as being in units of base pairs.

    To characterize the detailed base preferences at any position along the binding site, we can construct a weight matrix or represent the same information graphically with sequence logos (26), where the height of each base is determined by its frequency of occurrence in the set of M binding sequences and the sum of the heights of the letters corresponds to the total information content.

    Note that the local measure of information content can be generalized to the full binding site by a simple summation, if we assume, as we will do here, that there is no correlation between neighboring positions along the site:

    $$I = \sum_{i=1}^{N} I_i$$

    I now refers to the full site and, since it is measured in units of base pairs, it can also be thought of as a binding site length. Successful comparisons of these lengths to experimental data for a variety of proteins provided an a posteriori justification of the energy cutoff used here (13). In the case of Zif268, the 5 kcal mol–1 cutoff leads to a binding site length of 7.8 bp. The sensitivity of this value with respect to the energy cutoff can be judged by noting that, if we had used a more tolerant value of 7 kcal mol–1, the calculated binding site length would have been reduced by 0.5 bp, whereas a more stringent value of 3 kcal mol–1 would have increased the length by 0.7 bp. Note that the length sensitivity may vary from complex to complex as a function of the energy distribution in sequence space.
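    The path from a table of complexation energies to a binding site length can be sketched as below. This is a minimal illustration under stated assumptions: the energies are toy values, the function names are hypothetical, and the cutoff is applied inclusively relative to the lowest energy.

```python
import math
from collections import Counter

def select_binding_sites(energies, cutoff=5.0):
    """Keep sequences within `cutoff` (kcal/mol) of the lowest energy."""
    e_min = min(energies.values())
    return [s for s, e in energies.items() if e - e_min <= cutoff]

def information_per_position(sequences):
    """I_i = 1 - R_i with logarithms in base 4, one value per position."""
    m = len(sequences)
    info = []
    for column in zip(*sequences):
        r_i = -sum((c / m) * math.log(c / m, 4)
                   for c in Counter(column).values())
        info.append(1.0 - r_i)
    return info

def binding_site_length(sequences):
    """Total information content I, assuming uncorrelated positions."""
    return sum(information_per_position(sequences))

# Toy energies (kcal/mol) for a 2 bp site; made-up numbers.
energies = {"GC": -20.0, "GT": -17.0, "AC": -12.0, "TT": -10.0}
selected = select_binding_sites(energies)
print(selected, binding_site_length(selected))
```

    In this toy case the first position is fully determined and the second is an even choice between two bases, so the site carries one full base pair of information plus half of another.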

    RESULTS AND DISCUSSION

    Validating ADAPT for Zif268

    The wild-type Zif268 complex was built using the high-resolution crystallographic structure 1AAY (27). These coordinates were treated as described in Materials and Methods. Fragments of six nucleotide pairs were added to either end of the 9 bp DNA fragment within the complex, leading to a total of 21 bp with Zif268 bound at positions 7–16. Using the possibilities provided by the multi-copy base pairs in ADAPT, the 4 bp at the 5'- and 3'-termini were fixed at an average base composition (25% contributions from each standard base pair: AT, TA, CG and GC). The protein binding energies were then calculated for the full combinatorial set of sequences corresponding to the remaining 13 bp (which consequently covered a total of 4^13, i.e. 6.7 × 10^7, sequences). The bound proteins were found to induce selectivity at only 10 of these base pairs (positions 7–16), and further analysis was limited to these positions. In the following discussion, we will adopt a local numbering for the Zif268 DNA target site, where positions 1–9 constitute the canonically recognized bases and position 10 is the supplementary 3'-base pair discussed above (see Figure 1A). Note that the interaction patterns of Zif268 and its mutant with DNA displayed in Figure 1A and B were determined using HBplus (28). (For consistency, five amino acid residues are indicated for each finger in Figure 1A and B, although not all of these residues necessarily play a role in selective binding.) The geometrical criteria used by this program result in some minor differences with other descriptions of these complexes (19), for example, the absence of a hydrogen bond involving Glu 3 of finger 3 of the wild-type Zif268, but these differences have no impact on the conclusions drawn in the present study.

    An experimental consensus weight matrix for wild-type Zif268 has been determined using a PCR-mediated random site selection protocol (29) and is shown in the lower part of Figure 1C. This consensus confirms that Zif268 selectivity indeed extends over 10 bp and not 9 bp as was initially reported (30). The strongest selectivity is for guanine at positions 1, 3, 6, 7 and 9, and for cytosine at position 2. In terms of information content (for definitions see Materials and Methods), the overall weight matrix represents a selectivity for 7.9 bp.

    We can compare these experimental results with the Zif268 binding preference obtained with ADAPT. Using a 5 kcal mol–1 cutoff with respect to the complexation energy of the optimal sequence, we obtain a group of 677 strongly binding sequences. The highest energy in this group is termed $E_{tot}^{max}$. The number of sequences in the group corresponds to an information content of 7.8 bp, which is very close to the experimental result. The corresponding theoretical sequence logo, shown in the upper part of Figure 1C, confirms that this good overall agreement also applies to each position within the binding site.

    Since the consensus view of binding selectivity does not take into account the relative energy of binding sequences, we have also checked the ordering of calculated complexation energies for sequences where experimental results were available (31). The results, which concern nine variants of a GCGxxxGCGT target sequence, are shown in Figure 2. Despite the simplifications involved in the ADAPT approach, the two sets of data show a good correlation (with a correlation coefficient of 0.84). However, it should be noted that, assuming that the correlation between these two sets of values is linear, the theoretical consensus in Figure 1C shows that T is selected more weakly in the first position of the second finger than it is experimentally (implying that sequences of the type Txx should be shifted to the left in Figure 2). Note also that we are using a single conformation of the complex for all these comparisons, which is unlikely to be optimal for each sequence of the set. We have tested this problem by energy minimizing the complex and free DNA oligomer conformations for the sequences given in Figure 2, and recalculating the complexation energies. This leads to a slightly better alignment, notably for the sequences Txx, and to an overall correlation coefficient of 0.86.

    Figure 2. Experimental binding free energies for wild-type Zif268 compared with theoretical energies calculated with ADAPT. Results are shown for a variety of sequences involving the central triplet of the Zif268 binding site GCGxxxGCGT (31). All values are in kcal mol–1.

    Analyzing recognition mechanisms

    As a first step to understanding how wild-type Zif268 recognizes its target sequence, we look at the contributions coming from the direct (Eint) and indirect (Edef) terms of the calculated complexation energies. This is done by finding the maximum value of Eint for the set of binding sequences selected on the basis of $E_{tot}^{max}$. This value is termed $E_{int}^{max}$. We then look at the full set of sequences, ordered in terms of Eint, and count how many fall in the energy interval up to $E_{int}^{max}$.

    If Zif268 recognition relied entirely on direct (protein–DNA) interactions, one would expect the ordering of sequences in terms of Eint or Etot to be the same, since Edef would not show any selectivity. Therefore, we would expect to count the same number of sequences up to the respective energy cutoffs. In fact, there are more than four times as many sequences in the Eint list, 3201 compared with 677 selected on the basis of Etot. This implies that Eint has an information content of 6.9 bp, which is 89% of the total selectivity (7.8 bp) found for Zif268. Although direct interactions therefore dominate Zif268 recognition, a non-negligible part must be attributed to indirect contributions.
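    The counting procedure used to separate direct from indirect recognition can be sketched as follows. The function name, the dictionary-based data layout and the toy energies are assumptions made for this example; the total energy is taken as the sum of the interaction and deformation terms, as in the text.

```python
def direct_and_total_sets(e_int, e_def, cutoff=5.0):
    """Return (sequences selected on Etot, sequences selected on Eint alone).

    e_int, e_def : dicts mapping each sequence to its interaction and
                   deformation energy (kcal/mol)
    """
    e_tot = {s: e_int[s] + e_def[s] for s in e_int}
    e_tot_max = min(e_tot.values()) + cutoff
    total_set = {s for s, e in e_tot.items() if e <= e_tot_max}
    # Highest interaction energy found among the Etot-selected sequences.
    e_int_max = max(e_int[s] for s in total_set)
    direct_set = {s for s, e in e_int.items() if e <= e_int_max}
    return total_set, direct_set

# Toy single-position example with made-up energies.
e_int = {"A": -10.0, "C": -9.0, "G": -8.0, "T": 0.0}
e_def = {"A": 0.0, "C": 6.0, "G": 0.5, "T": 0.0}
total_set, direct_set = direct_and_total_sets(e_int, e_def)
print(sorted(total_set), sorted(direct_set))
```

    By construction the set selected on Eint contains the set selected on Etot. Any extra sequences it holds (here C, which interacts well but is costly to deform) are those that only the deformation energy can eliminate, and the information content of each set can then be compared as described in Materials and Methods.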

    If we now build a weight matrix on the basis of the sequences selected using Eint (Figure 3A), we can see that the only significant loss of selectivity involves positions 2 and 8. This change can be traced to the fact that steric repulsions contained within Eint effectively eliminate thymine at either of these positions, but Eint alone cannot select between the remaining bases (A, C and G). This selection requires a contribution from local DNA deformation. In passing, it is remarked that the steric repulsions at positions 2 and 8 involve Glu 3 of fingers 1 and 3, and it is known that mutating these residues to Ala indeed decreases the binding selectivity of Zif268 (18).

    Figure 3. Analysis of recognition within the wild-type Zif268 complex, including the contributions of individual amino acids. (A) Direct (protein–DNA interaction) contributions to recognition. Sequences with Eint ≤ $E_{int}^{max}$ are selected from the full set of 4^13 sequences to define the direct recognition logo. Choosing the subset of these sequences for which Etot ≤ $E_{tot}^{max}$ defines the overall consensus logo, the corresponding, additional recognition logo being shown on the far left of the figure. (B) Consensus sequence logos corresponding to the total binding energy, the protein–DNA interaction energy and the contributions from key amino acid residues. Gray shading indicates amino acid–base contacts (see Figure 1A).

    Although only a relatively small amount of selectivity remains to be accounted for, this must be due to indirect effects involving induced DNA deformation. This can be understood by noting that adding Edef to Eint for the 3201 sequences selected above and then reordering the energies will effectively bring us back to the set of 677 sequences within the 5 kcal mol–1 cutoff on Etot. This can be quantified by calculating the increase in information content in passing between these two situations. As expected, the overall result is equivalent to 0.9 bp, that is to say 11% of the total selectivity. Once again this selectivity can be represented as a sequence logo (Figure 3A), which confirms that it is indeed DNA deformation that is responsible for the selection of cytosine at positions 2 and 8. We can therefore conclude that, although the DNA deformation induced by binding wild-type Zif268 is small, it still plays a significant role in selecting two positions within the target sequence of this protein.

    Decomposing recognition by residues

    In the same way that the complexation energy can be split into interaction and deformation components, we can also subdivide the interaction energy into individual residue contributions. Eint is simply the sum of the interaction energies, Eint(i), between DNA and each amino acid i composing the protein. Therefore, we can again calculate an information content associated with a chosen residue i by finding the number of sequences falling below $E_{int}^{max}(i)$, where this is the highest residue interaction energy for the set of sequences initially selected from Etot.

    Only the amino acids at positions –1, 2, 3 and 6 of the recognition helix of each finger of Zif268 potentially contact DNA (see Figure 1A), so we will limit our detailed analysis to these residues, the remaining amino acids being considered as a single group. (It was confirmed that no single amino acid within this group makes a significant contribution to selectivity.)

    For each of the four key amino acids, in each of the three fingers, the sequences selected on the basis of $E_{int}^{max}(i)$ give access to both the total information content for the binding site and to the information position by position along this site. The results are presented in Figure 3B. In the majority of cases, these results verify that individual amino acids lead to base selectivity at the positions they directly contact. Note that the residues Asp 2 of each finger show significant selectivity for the base pair adjacent to the 3' end of the base triplet nominally recognized by each finger (0.3, 1.0 and 0.6 bp for fingers 1–3, compared with 0.4, 1.0 and 0.5 bp for the entire protein). This confirms that each finger actually influences selectivity for a total of 4 bp (32).

    The results for Arg 6 of finger 1 are particularly interesting since, as shown clearly in Figure 3B, this residue is unique in contributing significantly to the selectivity of two successive base pairs (in positions 6 and 7). This result confirms that even the direct component of recognition cannot be decomposed into binary amino acid–base contributions. The complex nature of these interactions can also be seen by noting that the sum of the information content related to the key amino acids at a given position within the target site is often significantly lower than the total information coming from the full protein–DNA interaction term. For example, at position 10, only Asp 2 of finger 1 shows a significant selectivity (equivalent to 0.23 bp), whereas total protein interaction has an information content equivalent to 0.44 bp, the remaining selectivity being the cumulative result of small contributions from many amino acids.

    Finally, residue Glu 3 of fingers 1 and 3 has been described as being involved in selective contacts with DNA (33). However, as noted in the previous section, our analysis implies that it actually influences selectivity mainly by sterically hindering the presence of thymine in positions 2 and 8. Selecting cytosine at these positions additionally requires contributions from DNA deformation. Hence, some residue contacts lead to only partial selectivity, which needs to be complemented by other effects.

    TATA–Zif recognition mechanism

    We now turn to the mutant of Zif268 created by the group of Pabo to recognize a TATA box binding sequence, GCTATAAAA (19). We term this mutant TATA–Zif. We have constructed the corresponding complex using the PDB structure 1G2D. (We also carried out calculations for a related structure, 1G2F, but the results were very similar and will not be discussed here.) Overall, the TATA–Zif complex is very similar to that of Zif268: the Cα RMSD difference for the two proteins is only 1 Å. The bound DNA remains close to a canonical B-DNA conformation, with the exception of an enlarged major groove at the positions where the zinc fingers are bound. The DNA fragments corresponding to the canonical binding site (positions 1–9, reconstructed with an identical sequence, GCGTGGGCG) differ by an RMSD of only 1.8 Å.

    Binding specificity for the TATA box was obtained by mutating Zif268 at positions –1, 1, 2, 3, 4, 5 and 6 of each finger (19). This leads to a protein–DNA interaction pattern that is very different from that of wild-type Zif268 (see Figure 1A and B). Residues that did not contact DNA in the wild-type protein now come into play, and several residues contact more than a single DNA base, leading to significant overlap between the binding sites of successive fingers.

    No experimental consensus is available for TATA–Zif, but the consensus calculated using ADAPT (Figure 1D) confirms a strong dominance of thymine and adenine, in contrast to the GC-rich selectivity of wild-type Zif268. Again, using a 5 kcal mol–1 cutoff on Etot, we obtain a total information content of 7.9 bp for TATA–Zif binding. We can analyze the origins of this recognition in the same way as for the wild-type protein, starting with the calculation of the information contained in the direct interaction term Eint. The results in Figure 4A show that TATA–Zif has a much weaker consensus based on direct interactions, with an information content of only 5.5 bp, corresponding to 70% of total information. Positions 2, 3, 5, 6 and 7 are all less well-defined than in the full consensus. This is particularly striking for positions 3, 5 and 6 (T/C, T/C and G/A, respectively), where the information content per base pair falls from 1 bp in the full consensus, to 0.48, 0.37 and 0.52 bp, respectively, on the basis of the protein–DNA interaction energy.
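
    The per-position values quoted above can be illustrated with a minimal sketch. It assumes that information is expressed in base-pair equivalents, so that a fully defined position scores 1 bp and an equal two-base ambiguity such as T/C scores 0.5 bp, which is consistent with the figures in the text. The function name `position_information_bp` and the two toy sequences are hypothetical, and no small-sample correction is applied.

```python
import math
from collections import Counter

def position_information_bp(sequences):
    """Per-position information content in base-pair equivalents (0 to 1).

    Assumption: information = 1 - H, where H is the Shannon entropy of the
    base frequencies at that position computed with logarithm base 4.
    """
    length = len(sequences[0])
    info = []
    for i in range(length):
        counts = Counter(seq[i] for seq in sequences)
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log(n / total, 4)
                       for n in counts.values())
        info.append(1.0 - entropy)
    return info

# Toy set of selected sequences: position 3 is T or C, all others are fixed.
selected = ["GCTATAAAA", "GCCATAAAA"]
info = position_information_bp(selected)
print([round(x, 2) for x in info])
```

    In this toy case position 3 carries 0.5 bp and every other position carries 1 bp, which is the kind of drop reported for the T/C and G/A ambiguities when only the interaction energy is used.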

    Figure 4. Analysis of recognition within the TATA–Zif complex including individual amino acid contributions. (A) Direct (protein–DNA interaction) contributions to recognition. Sequences with a protein–DNA interaction energy (Eint) below the selection threshold are selected from the full set of 4¹³ sequences to define the direct recognition logo. Choosing the subset of these sequences whose total energy (Etot) lies within the 5 kcal mol–1 cutoff defines the overall consensus logo, the corresponding, additional recognition logo being shown on the far left of the figure. (B) Consensus sequence logos corresponding to the total binding energy, the protein–DNA interaction energy and the contributions from key amino acid residues. Gray shading indicates amino acid–base contacts (see Figure 1B).

    We can again decompose the direct protein–DNA interactions into contributions from individual residues. Five amino acids at positions –1, 1, 2, 3 and 6 have been analyzed in detail, and we again checked that the remaining residues made no significant contribution to selectivity. The results are shown in Figure 4B. Note that, in contrast to wild-type Zif268, residue 1 now plays a role in recognition for fingers 2 and 3, in line with the analysis of binding made on the basis of the crystallographic structure (19). In contrast, other nominally important residues show little selectivity, although they do make contacts with the DNA bases (and may be important for the stability of the complex). This is notably the case for Thr 2 in finger 2 and Thr 3 in finger 3. We also find that some residues that do not make contacts can nevertheless influence binding selectivity, as in the case of Thr –1 in finger 3. Finally, Figure 4B shows that many residues in TATA–Zif influence the selectivity of 2 or even 3 bp. This is strikingly different from the pattern of selectivity seen for the wild-type protein in Figure 3B.

    We now turn to indirect contributions to recognition in the case of TATA–Zif. These effects amount to an information content of 2.4 bp, that is, 30% of the total selectivity (7.9 bp). This is surprising, given that zinc finger proteins induce only very small deformations upon binding. Figure 5A shows that deformation mainly contributes to selectivity at positions 3, 4 and 8, with smaller contributions at positions 1, 6 and 9.

    In the same way that we analyzed the selectivity due to direct protein–DNA interactions in terms of individual residue contributions, we can also analyze the indirect DNA deformation contribution in terms of the various structural changes composing the overall deformation. Here, we distinguish base–base and base–backbone deformations (note that since we use a fixed average DNA conformation for all calculations, there are no intra-backbone terms to contribute to selectivity). Base–base deformations are further divided into pairing and stacking deformations. We can again calculate an information content associated with a given type of deformation by ordering all sequences in terms of the associated deformation energy, determining the highest energy within the set of sequences initially selected from Etot using the 5 kcal mol–1 cutoff, and counting the number of sequences now lying below this energy. The results, shown in Figure 5B, indicate that base–base interactions are more important than base–backbone interactions and represent 57% of the total indirect selectivity. Within the base–base term, base pairing deformations play a significantly larger role in recognition than base stacking. Although it is naturally more difficult to trace the structural origins of indirect selectivity, it can be noted that DNA bound to TATA–Zif shows significant base pair buckling (5°–12°) at positions 3–5 and significant base opening (7°–11°) at positions 4–8. The base pairs in the fragment of DNA bound to wild-type Zif268 are less distorted, with buckling between 1° and 4° and opening between –3° and 4° for the same groups of positions.
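
    The counting procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the ADAPT implementation: the function name `component_information_bp` is hypothetical, the cutoff is assumed to be measured relative to the lowest total energy, and the information content is assumed to be the base-4 logarithm of the ratio between the total number of sequences and the number retained. The energies are invented toy values for a 2 bp site.

```python
import math
from itertools import product

def component_information_bp(e_tot, e_comp, cutoff=5.0):
    """Information content (bp equivalents) carried by one energy component.

    e_tot, e_comp: dicts mapping each sequence to its total energy and to the
    energy of the component of interest (same keys, all sequences enumerated).
    """
    best = min(e_tot.values())
    # 1. Sequences selected on the total energy (assumed relative cutoff).
    selected = [s for s, e in e_tot.items() if e - best <= cutoff]
    # 2. Highest component energy found within that selected set.
    threshold = max(e_comp[s] for s in selected)
    # 3. Number of sequences lying at or below that component energy.
    n_below = sum(1 for e in e_comp.values() if e <= threshold)
    # 4. Assumed conversion to base-pair equivalents.
    return math.log(len(e_comp) / n_below, 4)

# Toy data: 16 sequences of length 2 with made-up energies (kcal/mol).
seqs = ["".join(p) for p in product("ACGT", repeat=2)]
e_tot = {s: 10.0 for s in seqs}
e_tot["AA"], e_tot["AC"] = 0.0, 3.0
e_comp = {s: float(i % 4) for i, s in enumerate(seqs)}

total_info = component_information_bp(e_tot, e_tot)
comp_info = component_information_bp(e_tot, e_comp)
print(total_info, comp_info)
```

    In this toy case the total energy retains 2 of the 16 sequences (1.5 bp), whereas the component alone retains 8 (0.5 bp), mirroring the way a single energy term accounts for only part of the overall selectivity.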

    Figure 5. Analysis of recognition within the TATA–Zif complex including contributions from sequence-dependent DNA deformation. (A) Indirect (DNA deformation) contributions to recognition. Sequences with a DNA deformation energy below the selection threshold are selected from the full set of 4¹³ sequences to define the indirect recognition logo. Choosing the subset of these sequences whose total energy (Etot) lies within the 5 kcal mol–1 cutoff defines the overall consensus logo, the corresponding, additional recognition logo being shown on the far left of the figure. (B) Consensus sequence logos corresponding to the total binding energy, the DNA deformation energy and its components.

    CONCLUSIONS

    Zinc finger proteins are of major interest as one of the largest families of eukaryotic DNA-binding proteins. This family of proteins is assumed to represent a canonical example of DNA recognition, where selectivity is achieved using a set of direct interactions between amino acid side chains and DNA bases. It is also generally assumed that these interactions can be identified visually from an experimental structure of the protein–DNA complex.

    We have investigated the validity of these assumptions by carrying out a theoretical analysis of recognition in the case of the three-finger wild-type Zif268 protein and of a multiple mutant generated to recognize a TATA box binding site. The results, which are based on calculating protein–DNA interaction energies and DNA deformation energies for a full combinatorial set of sequences, enable both direct and indirect contributions to be quantified in terms of their information content.

    Our findings agree with deductions made from mutation studies and from an analysis of the crystallographic data. Notably, recognition is dominated by protein–DNA contacts involving a limited number of key amino acids, although certain residues can influence selectivity at more than one site in the target DNA sequence, and the selectivity related to each finger actually overlaps, involving 4, rather than the canonical 3, base pairs. However, two results do not support the assumptions cited above. First, direct interactions alone cannot account for the observed binding specificity. Although zinc finger proteins do not cause major DNA deformation upon binding, these deformations still account for >10% of recognition in the case of wild-type Zif268 and for 30% of recognition in the case of the mutant protein. Second, a residue-by-residue analysis shows that the presence of direct amino acid–base contacts does not necessarily imply significant contributions to selectivity. Indeed, our study brings to light both cases where contacts exist without an impact on selectivity and cases where selectivity is found without direct molecular contacts.

    These results are not very encouraging from the point of view of protein engineering. Although zinc finger proteins have a modular design, which allows individual fingers to be swapped between proteins, the complexity of the recognition process carried out by individual fingers makes structure-based re-engineering a daunting task. Part of this complexity was already visible from the modified pattern of contacts seen in the crystallographic complex of the mutated Zif268 protein (19). Our analysis suggests that this complexity is still greater.

    From a methodological point of view, the present study extends our ADAPT approach with indices based on information theory, which describe overall and residue-by-residue contributions to protein–DNA selectivity in a quantitative manner. We hope these indices will be useful in studying, and in attempting to modify, other protein–DNA complexes.

    ACKNOWLEDGEMENTS

    The authors wish to thank the CNRS and the inter-organism Bioinformatics Program for funding this research.

    REFERENCES

    1. Seeman,N.C., Rosenberg,J.M. and Rich,A. (1976) Sequence-specific recognition of double helical nucleic acids by proteins. Proc. Natl Acad. Sci. USA, 73, 804–808.

    2. Suzuki,M., Brenner,S.E., Gerstein,M. and Yagi,N. (1995) DNA recognition code of transcription factors. Protein Eng., 8, 319–328.

    3. Mandel-Gutfreund,Y., Schueler,O. and Margalit,H. (1995) Comprehensive analysis of hydrogen bonds in regulatory protein–DNA complexes: in search of common principles. J. Mol. Biol., 253, 370–382.

    4. Kono,H. and Sarai,A. (1999) Structure-based prediction of DNA target sites by regulatory proteins. Proteins, 35, 114–131.

    5. Selvaraj,S., Kono,H. and Sarai,A. (2002) Specificity of protein–DNA recognition revealed by structure-based potentials: symmetric/asymmetric and cognate/non-cognate binding. J. Mol. Biol., 322, 907–915.

    6. Yoshida,T., Nishimura,T., Aida,M., Pichierri,F., Gromiha,M.M. and Sarai,A. (2001) Evaluation of free energy landscape for base-amino acid interactions using ab initio force field and extensive sampling. Biopolymers, 61, 84–95.

    7. Mandel-Gutfreund,Y. and Margalit,H. (1998) Quantitative parameters for amino acid–base interaction: implications for prediction of protein–DNA binding sites. Nucleic Acids Res., 26, 2306–2312.

    8. Parvin,J.D., McCormick,R.J., Sharp,P.A. and Fisher,D.E. (1995) Pre-bending of a promoter sequence enhances affinity for the TATA-binding factor. Nature, 373, 724–727.

    9. Sarai,A., Mazur,J., Nussinov,R. and Jernigan,R.L. (1989) Sequence dependence of DNA conformational flexibility. Biochemistry, 28, 7842–7849.

    10. Olson,W.K., Gorin,A.A., Lu,X.J., Hock,L.M. and Zhurkin,V.B. (1998) DNA sequence-dependent deformability deduced from protein–DNA crystal complexes. Proc. Natl Acad. Sci. USA, 95, 11163–11168.

    11. Steffen,N.R., Murphy,S.D., Tolleri,L., Hatfield,G.W. and Lathrop,R.H. (2002) DNA sequence and structure: direct and indirect recognition in protein–DNA binding. Bioinformatics, 18 (Suppl. 1), S22–S30.

    12. Thayer,K.M. and Beveridge,D.L. (2002) Hidden Markov models from molecular dynamics simulations on DNA. Proc. Natl Acad. Sci. USA, 99, 8642–8647.

    13. Paillard,G. and Lavery,R. (2004) Analyzing protein–DNA recognition mechanisms. Structure (Cambridge), 12, 113–122.

    14. Lafontaine,I. and Lavery,R. (2000) ADAPT: a molecular mechanics approach for studying the structural properties of long DNA sequences. Biopolymers, 56, 292–310.

    15. Lafontaine,I. and Lavery,R. (2000) Optimization of nucleic acid sequences. Biophys. J., 79, 680–685.

    16. Beerli,R.R. and Barbas,C.F.,III (2002) Engineering polydactyl zinc-finger transcription factors. Nat. Biotechnol., 20, 135–141.

    17. Beerli,R.R., Segal,D.J., Dreier,B. and Barbas,C.F.,III (1998) Toward controlling gene expression at will: specific regulation of the erbB-2/HER-2 promoter by using polydactyl zinc finger proteins constructed from modular building blocks. Proc. Natl Acad. Sci. USA, 95, 14628–14633.

    18. Miller,J.C. and Pabo,C.O. (2001) Rearrangement of side-chains in a Zif268 mutant highlights the complexities of zinc finger-DNA recognition. J. Mol. Biol., 313, 309–315.

    19. Wolfe,S.A., Grant,R.A., Elrod-Erickson,M. and Pabo,C.O. (2001) Beyond the ‘recognition code’: structures of two Cys2His2 zinc finger/TATA box complexes. Structure (Cambridge), 9, 717–723.

    20. Lavery,R., Zakrzewska,K. and Sklenar,H. (1995) JUMNA (junction minimization of nucleic acids). Comput. Phys. Commun., 91, 135–158.

    21. Cheatham,T.E.,III, Cieplak,P. and Kollman,P.A. (1999) A modified version of the Cornell et al. force field with improved sugar pucker phases and helical repeat. J. Biomol. Struct. Dyn., 16, 845–862.

    22. Berman,H.M., Battistuz,T., Bhat,T.N., Bluhm,W.F., Bourne,P.E., Burkhardt,K., Feng,Z., Gilliland,G.L., Iype,L., Jain,S. et al. (2002) The Protein Data Bank. Acta Crystallogr. D Biol. Crystallogr., 58, 899–907.

    23. Shannon,C.E. (1948) A mathematical theory of communication. Bell Syst. Tech. J., 27, 379–423, 623–656.

    24. Schneider,T.D., Stormo,G.D., Gold,L. and Ehrenfeucht,A. (1986) Information content of binding sites on nucleotide sequences. J. Mol. Biol., 188, 415–431.

    25. Stephens,R.M. and Schneider,T.D. (1992) Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites. J. Mol. Biol., 228, 1124–1136.

    26. Schneider,T.D. and Stephens,R.M. (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res., 18, 6097–6100.

    27. Elrod-Erickson,M., Rould,M.A., Nekludova,L. and Pabo,C.O. (1996) Zif268 protein–DNA complex refined at 1.6 Å: a model system for understanding zinc finger–DNA interactions. Structure, 4, 1171–1180.

    28. McDonald,I.K. and Thornton,J.M. (1994) Satisfying hydrogen bonding potential in proteins. J. Mol. Biol., 238, 777–793.

    29. Swirnoff,A.H. and Milbrandt,J. (1995) DNA-binding specificity of NGFI-A and related zinc finger transcription factors. Mol. Cell. Biol., 15, 2275–2287.

    30. Christy,B. and Nathans,D. (1989) DNA binding site of the growth factor-inducible protein Zif268. Proc. Natl Acad. Sci. USA, 86, 8737–8741.

    31. Bulyk,M.L., Huang,X., Choo,Y. and Church,G.M. (2001) Exploring the DNA-binding specificities of zinc fingers with DNA microarrays. Proc. Natl Acad. Sci. USA, 98, 7158–7163.

    32. Isalan,M., Choo,Y. and Klug,A. (1997) Synergy between adjacent zinc fingers in sequence-specific DNA recognition. Proc. Natl Acad. Sci. USA, 94, 5617–5621.

    33. Elrod-Erickson,M. and Pabo,C.O. (1999) Binding studies with mutants of Zif268. Contribution of individual side chains to binding affinity and specificity in the Zif268 zinc finger–DNA complex. J. Biol. Chem., 274, 19281–19285.

    (Guillaume Paillard, Cyril Deremble and R)