Source: Journal of Immunology, 2005, Issue 11
In Silico Identification of Supertypes for Class II MHCs
     Abstract

    The development of epitope-based vaccines, which have wide population coverage, is greatly complicated by MHC polymorphism. The grouping of alleles into supertypes, on the basis of common structural and functional features, addresses this problem directly. In the present study we applied a combined bioinformatics approach, based on analysis of both protein sequence and structure, to identify similarities in the peptide binding sites of 2225 human class II MHC molecules, and thus define supertypes and supertype fingerprints. Two chemometric techniques were used: hierarchical clustering using three-dimensional Comparative Similarity Indices Analysis fields and nonhierarchical k-means clustering using sequence-based z-descriptors. An average consensus of 84% was achieved, i.e., 1872 of 2225 class II molecules were classified in the same supertype by both techniques. Twelve class II supertypes were defined: five DRs, three DQs, and four DPs. The HLA class II supertypes and their fingerprints given in parentheses are DR1 (Trp9), DR3 (Glu9, Gln70, and Gln/Arg74), DR4 (Glu9, Gln/Arg70, and Glu/Ala74), DR5 (Glu9, Asp70), and DR9 (Lys/Gln9); DQ1 (Ala/Gly86), DQ2 (Glu86, Lys71), and DQ3 (Glu86, Thr/Asp71); DPw1 (Asp84 and Lys69), DPw2 (Gly/Val84 and Glu69), DPw4 (Gly/Val84 and Lys69), and DPw6 (Asp84 and Glu69). Apart from the good agreement between known binding motifs and our classification, several new supertypes, and corresponding thematic binding motifs, were also defined.

    Introduction

    Major histocompatibility complex proteins are glycoproteins that bind, within the cell, small peptide fragments, or epitopes, derived, through proteolysis, from both host and pathogen proteins, and present them at the cell surface for interaction with T cells. Recognition by the immune system of peptide-bound MHCs is fundamental to the mechanism by which the host identifies and responds to foreign Ags. MHC class I molecules, available on most cell types, present peptides from proteins synthesized within the cell (endogenous processing pathway), and only a subset of macrophages are able to present peptides derived from phagocytosed material via class I molecules (1). MHC class II molecules, expressed on a restricted number of cell types, such as B cells and macrophages, can present peptides derived from endocytosed extracellular proteins (exogenous processing pathway) (2).

    A principal feature of MHC molecules is their allelic polymorphism: the July 2004 ImMunoGeneTics/HLA database release lists 1114 class I and 707 class II molecules (3). Such polymorphism presumably enhances the probability of mounting an immune response by at least a subset of individuals within a population, ultimately increasing the chance of group survival against infection (4). Unlike many proteins, MHC alleles have arisen under a specific and discernible evolutionary pressure, adapting to a fitness landscape mediated by geographically constrained infectious disease. Moreover, any poly-epitope vaccine targeting the whole population would, on the same basis, need to bind a range of HLA molecules. Gulukota and DeLisi (5) found that three to six class I HLA alleles, depending on the ethnic group, would cover 90% of the population. Indeed, because of linkage disequilibrium (the joint probability of a given allelic pair is not generally equal to the product of their individual probabilities, Pij ≠ PiPj), it is not necessarily optimal to choose the alleles with the highest individual frequencies.
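
    For reference, the departure from independence mentioned above is conventionally summarized by a linkage disequilibrium coefficient. The definition below is the standard population-genetics one, not notation taken from the study itself; the symbol D is an assumption introduced here for illustration.

```latex
D_{ij} = P_{ij} - P_i P_j , \qquad D_{ij} \neq 0 \iff P_{ij} \neq P_i P_j
```

    When D_ij is positive, the two alleles tend to occur together, so adding the second allele to a panel covers fewer additional individuals than its individual frequency alone would suggest.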

    The peptide binding site of MHC molecules is composed of a single protein chain for class I and two separate chains in class II. X-ray data reveal that the walls of the cleft are formed by two antiparallel helices and the floor is formed by an eight-stranded β-sheet (6, 7, 8, 9, 10). In MHC class I molecules, the ends of the cleft are closed off, generally allowing only short peptides of 8–11 aa to bind. In contrast, the cleft in class II is open-ended, allowing much longer peptides to bind, even though only 9 aa occupy the site itself. Both clefts have binding pockets, corresponding to primary and secondary anchor positions on the binding peptide. The combination of two or more anchors is called a motif. It has been found that certain class I alleles can recognize similar motifs (11, 12, 13, 14) and thus be grouped into HLA "supertypes", binding common "supermotifs". The classification of MHC molecules into supertypes, based on structural features and/or peptide specificity, is of prime importance in the development of epitope-based vaccines (15, 16). The experimental determination of motifs for every allele is prohibitively expensive in terms of labor, time, and resources. The only comprehensive, yet practical, alternative is a bioinformatic approach.

    Chemometric methods are widely used and extensively validated in computational chemistry for structural classifications (17, 18). Recently, we proposed a "three-dimensional (3D) supertype fingerprint" approach which classifies alleles on the basis of information from the structure of the binding sites using two chemometric techniques: principal component analysis and hierarchical clustering (18). We applied this approach to class I MHC molecules belonging to the HLA-A, HLA-B, and HLA-C loci and showed that only 1–3 aa are sufficient for an allele to be classified within a particular supertype.

    In the present study, a combined two-dimensional-3D approach was applied to class II HLA molecules belonging to the DR, DQ, and DP loci, identifying a consensus supertype classification. In contrast to class I, supertypes and supermotifs for class II MHC molecules have not been widely studied. There are only a few classifications for class II molecules: three for HLA-DR molecules (3, 19, 20) and one each for HLA-DQ (21) and HLA-DP (22). Here, we use two clustering techniques, hierarchical and nonhierarchical, applied to both the sequences of HLA class II proteins and their 3D structures. Clustering is a data analysis technique that, when applied to a set of heterogeneous items, identifies homogeneous subgroups as defined by a given model or measure of similarity. For a detailed review, see Ref. 23.

    In hierarchical clustering, the data set is analyzed iteratively: at each step either a pair of clusters is merged (agglomerative) or a single cluster is divided (divisive). Determining the number of "natural" clusters is among the most difficult problems in clustering and, to date, no general solution has been found. Agglomerative hierarchical clustering was applied to HLA class II molecules using similarity fields generated by Comparative Similarity Indices Analysis (CoMSIA) (24, 25). CoMSIA is widely used in 3D molecular design to model the interactions between small molecules and proteins (26, 27, 28, 29, 30, 31, 32).

    Nonhierarchical methods generate a specific number of disjoint, flat, unconnected clusters. K-means clustering is a nonhierarchical method in which the dataset is partitioned into k clusters by choosing an initial set of k seed compounds to act as initial cluster centers. Each compound is assigned to its nearest cluster and cluster membership is iteratively refined by shifting compounds between clusters until stability is achieved, i.e., no compounds are moved from one cluster to another. The k-means method was applied to a set of z-scales, as defined by Hellberg et al. (33) and extended by Sandberg et al. (34), which describe the most important properties of each amino acid within the HLA class II binding site. In the field of quantitative structure–activity relationships, z-descriptors are used to model the interactions between peptides and proteins (35, 36, 37, 38, 39).
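
    The k-means procedure described above can be sketched in a few lines. This is a minimal, generic version (nearest-center assignment followed by recomputing each center as the mean of its members); the exact variant implemented in the software used in the study is not stated in the text, and the function name, toy points, and seeds are illustrative assumptions.

```python
import math

def kmeans(points, seeds, max_iter=100):
    """Partition points into len(seeds) clusters; returns one label per point."""
    centers = [list(s) for s in seeds]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assign each item to its nearest cluster center.
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # stability: no item moved to another cluster
            break
        labels = new_labels
        # Move each center to the mean of its current members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy 2D descriptors (hypothetical values); two seeds give k = 2.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels = kmeans(points, seeds=[points[0], points[3]])
```

    Because the result depends on the initial seeds and on k, the choice of k matters; the study takes it from the hierarchical clustering.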

    Based on the consensus of the classifications, made by the hierarchical and nonhierarchical methods, twelve class II supertypes were defined: five DR, three DQ, and four DP. Fingerprints for each supertype were also identified. In a similar way to our existing analysis of class I supertypes (40), it was found that 1–3 aa are sufficient to distinguish between class II supertypes.

    Materials and Methods

    Hierarchical clustering on CoMSIA fields

    CoMSIA was used as implemented in Sybyl version 6.9 (Tripos, 2004). The 3D structures of proteins belonging to the same locus were aligned: the x-ray structure 1PYW was used as a template for DR (42), the x-ray structure 1JK8 (43) was used for DQ, and the modeled DP structure was used for DP molecules. The amino acids outside the binding site were excluded. The grid had a resolution of 2.0 Å and extended beyond the molecular dimensions by 4.0 Å in all directions. At each grid point, a similarity index between the probe and the target molecule is calculated using a Gaussian-type distance-dependent function. Similarity index fields were generated with an attenuation factor of 0.3. The attenuation factor determines the steepness of the Gaussian-type function. The probe had a 1 Å radius, charge +1, hydrophobicity +1, hydrogen-bond donor +1, and hydrogen-bond acceptor +1 properties. The agglomerative hierarchical clustering (23) option of Sybyl version 6.9 was applied to CoMSIA fields. According to this technique the clusters are built from the bottom up, first by merging individual items into clusters, and then by merging clusters into superclusters, until the final merge brings all items into a single cluster. The distance between the clusters was calculated using the complete-linkage method, i.e., using the distance between the most distant pair of data points, one from each cluster. The last four levels of the hierarchy were considered for supertype definition.
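
    As an illustration of the bottom-up, complete-linkage procedure described above, the following sketch records the partition after every merge. It is a plain reimplementation on toy data, not the Sybyl routine; the Euclidean distance, the function name, and the reading of "the last four levels" as the final four partitions are assumptions.

```python
import math

def complete_linkage(points):
    """Agglomerative clustering; returns the partition after each merge level."""
    clusters = [[i] for i in range(len(points))]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the most distant pair.
                d = max(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        levels.append([sorted(c) for c in clusters])
    return levels

# Toy one-dimensional "field" vectors (hypothetical, not real CoMSIA fields).
points = [(0.0,), (0.5,), (4.0,), (4.4,), (10.0,)]
levels = complete_linkage(points)
# The final four partitions: from four clusters down to a single cluster.
last_four = levels[-4:]
```

    Each entry of `levels` is one horizontal cut through the dendrogram, so choosing a level is equivalent to choosing the number of clusters.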

    Nonhierarchical clustering on z-scores

    The protein sequences for each class II locus were aligned. As in the CoMSIA study, amino acids outside the binding site were excluded. Each amino acid was described by five z-descriptors: z1 (hydrophobicity), z2 (steric bulk), z3 (polarity), z4, and z5 (electronic effects) (34). An X-matrix was formed for each locus. The number of rows equaled the number of proteins, and the number of columns equaled five times the number of polymorphic amino acids in the binding site. The X-matrices were imported into MDL QSAR version 2.2. K-means clustering was applied, with the number of initial seeds, k, equal to the number of clusters generated by the hierarchical clustering. The members of the clusters generated by the hierarchical and nonhierarchical clustering were compared, and the commonly clustered members were calculated as a percentage of all alleles for every locus.
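
    The layout of the X-matrix can be sketched as follows. The numeric z-values in the table are placeholders chosen for illustration and are NOT the published z-scales; the sequences are likewise hypothetical, and `build_x_matrix` is an assumed helper name.

```python
# Placeholder values for illustration only; NOT the published z-scales.
Z_PLACEHOLDER = {
    "W": (-4.0, 3.0, 1.0, 0.0, 0.5),
    "E": (3.0, 0.5, -0.1, -1.0, 0.1),
    "K": (2.8, 1.4, -3.1, -0.5, 0.2),
    "Q": (1.7, 0.3, -0.9, -0.9, 0.1),
    "D": (3.9, -0.2, 1.0, 0.6, -0.2),
}

def build_x_matrix(site_residues, z_table):
    """One row per protein; five z-descriptors per polymorphic site position."""
    return [[z for aa in residues for z in z_table[aa]]
            for residues in site_residues]

# Hypothetical proteins, each reduced to three polymorphic positions.
site_residues = ["WQK", "EDK", "KQE"]
X = build_x_matrix(site_residues, Z_PLACEHOLDER)
n_rows, n_cols = len(X), len(X[0])
```

    With three proteins and three polymorphic positions, the matrix has 3 rows and 5 × 3 = 15 columns, matching the row and column rule stated above.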

    Results

    Systematic models of combinatorially generated class II human MHC dimers were analyzed using a 3D technique, which applies hierarchical clustering to CoMSIA fields, and a two-dimensional technique, which uses z-descriptors and k-means clustering. Together, these clustering methods generated a robust grouping of MHCs based on the similarities of their binding sites. Structural models were built using three templates: two taken from x-ray data (42, 43) and one generated using homology modeling (44, 45). The DP-modeled structure had a root mean squared deviation from the DRB1*0101 structure (1PYW) of 0.28 Å for all invariant atoms. The three templates were then used to generate, in a combinatorial fashion, all modeled dimers through side chain placement using rotamers (41). Invariant side chains were held fixed. As a result, the models created are best viewed as a conservative estimate: an attempt to reduce error by overlaying as much of the generated models as possible. This approach is well suited to the statistical nature of the method, which looks for overall correlations in CoMSIA fields, and is tolerant of small misplacements of individual side chains (17, 18). This is confirmed by the high agreement seen with the k-means clustering, which makes no use of structural information and is based on sequence information alone.

    Making use of similarity fields generated by CoMSIA, agglomerative hierarchical clustering was applied to amino acids forming the binding sites on HLA class II molecules (Table I). CoMSIA is a 3D grid method, in which a probe is placed at all lattice points in a regular 3D grid in and around the target molecule (24, 25). At each point, a similarity index between probe and target is calculated using a Gaussian-type distance-dependent function. Five similarity fields were calculated: steric bulk, electrostatic potential, local hydrophobicity, and hydrogen-bond donor and acceptor abilities. In hierarchical clustering, each level defines a partition of the data set into clusters. However, in general it is not clear which level is best in terms of splitting the data set into a "natural" number of clusters, so that each cluster contains the most appropriate compounds (23). In the present study, the number of clusters was selected to be in good agreement with previous classifications and known binding motifs, where these are available. Usually, the last four levels were considered for the supertype definition. The number of clusters defined in the hierarchical clustering was used as the input k cluster number in the nonhierarchical k-means clustering.
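
    A rough sketch of how one such field could be computed is given below. It assumes the commonly cited CoMSIA form, in which the index at a grid point is the negative sum over atoms of (probe property × atom property × exp(−α r²)); the sign convention, the helper names, and the two toy atoms are assumptions, and the grid spacing, margin, and attenuation factor are taken from the Materials and Methods. It is not the Sybyl implementation.

```python
import math

def similarity_index(grid_point, atoms, alpha=0.3, w_probe=1.0):
    """Gaussian-type similarity index of a probe at one grid point.

    atoms: list of ((x, y, z), w) pairs, where w is the atom's property value.
    """
    total = 0.0
    for coords, w in atoms:
        r2 = sum((g - c) ** 2 for g, c in zip(grid_point, coords))
        total += w_probe * w * math.exp(-alpha * r2)
    return -total

def make_grid(atoms, spacing=2.0, margin=4.0):
    """Regular lattice extending `margin` beyond the atoms in every direction."""
    axes = []
    for d in range(3):
        lo = min(c[d] for c, _ in atoms) - margin
        hi = max(c[d] for c, _ in atoms) + margin
        n = int(math.floor((hi - lo) / spacing)) + 1
        axes.append([lo + k * spacing for k in range(n)])
    return [(x, y, z) for x in axes[0] for y in axes[1] for z in axes[2]]

# Two hypothetical atoms with made-up property values (e.g. partial charges).
atoms = [((0.0, 0.0, 0.0), 1.0), ((2.0, 0.0, 0.0), -0.5)]
grid = make_grid(atoms)
field = [similarity_index(q, atoms) for q in grid]
```

    One such vector per property (steric, electrostatic, hydrophobic, donor, acceptor) would be computed for each aligned binding site, and the concatenated vectors are what a clustering step would then compare.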

    Nonhierarchical k-means clustering was applied to a set of z-properties describing each amino acid of the HLA class II binding site. These z-scales, as defined by Hellberg et al. (33), reflect the most important properties of amino acids and are referred to as "principal properties". These scales were derived by principal component analysis from a data matrix consisting of a large number of physicochemical variables, such as m.w., pKa’s, 13C NMR-shifts, etc. The first principal component (PC) reflects amino acid hydrophobicity, the second PC reflects their size, and the third, their polarity. The three PCs are labeled: z1-, z2- and z3-scales, respectively. More recently, Sandberg et al. (34) extended the three z-scales to five, adding z4 and z5, which account for electronic effects of the amino acids. With these z-scales, it is possible to quantify numerically the structural variations within a series of related peptides, by arranging the z-scales according to the amino acid sequence. In the present study the five z-scales were used to describe the polymorphic amino acid sequences of the binding site of HLA class II molecules (Table I).

    The common members of the clusters derived by both methods were expressed as a percentage of all alleles for every locus. Supertype fingerprints were defined on the basis of common amino acids found within the multiple sequence alignment of alleles belonging to one supertype.
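
    One simple way to derive a fingerprint from aligned supertype members is sketched below: keep the positions at which the members share at most a small number of residues. The threshold `max_variants`, the function name, and the example residues are illustrative assumptions; the study does not specify its selection rule beyond "common amino acids".

```python
def fingerprint(members, max_variants=2):
    """Return {position: residues} for positions shared by all members.

    members: list of {position: one-letter residue} dicts from an alignment.
    """
    result = {}
    for pos in sorted(members[0]):
        seen = sorted({m[pos] for m in members})
        if len(seen) <= max_variants:
            result[pos] = "/".join(seen)
    return result

# Hypothetical aligned binding-site residues for three members of one cluster.
members = [
    {9: "E", 13: "S", 70: "Q", 71: "K", 74: "R"},
    {9: "E", 13: "G", 70: "Q", 71: "K", 74: "Q"},
    {9: "E", 13: "H", 70: "Q", 71: "K", 74: "R"},
]
fp = fingerprint(members)
```

    In this toy case position 13 is dropped because it shows three different residues, while position 74 is kept as the two-residue alternative Q/R.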

    HLA-DR supertypes

    Hierarchical clustering. The hierarchical clustering, using CoMSIA fields, of HLA-DRB is plotted in Fig. 1, and the detailed content of each cluster is shown in Table II. At the second level of the hierarchy there are three clusters: two small clusters flanking one huge central cluster. At this level, the clustering is associated with polymorphism around pocket 9. The leftmost cluster is composed of structures with Trp9, the middle cluster has Glu9, and the rightmost has Lys/Gln9. As the structures in the middle cluster were still quite diverse, mainly around binding pocket 4 (residues 70, 71, and 74), this group was further subdivided. At the fourth level there are five roughly equally sized clusters, which correspond well with known DR binding motifs. The clusters were defined as supertypes and named after the lowest serotype included.

    The first cluster, which we called the DR1 supertype, includes DR1 (DRB1*0101–11), DR2 (DRB1*1501–11, DRB1*1601–08), and DR7 (DRB1*0701–07). The main structural feature for this supertype is the presence of Trp9. Residues 9, 30, 38, 57, and 61 are part of pocket 9, which accommodates peptide residue 9. Trp9 makes the pocket shallow and allows the binding of small nonpolar residues: Leu, Ala, Gly, and Pro. The binding motif for HLA-DRB1*0101 favors Leu or Ala at position 9 (Table II), and that for HLA-DRB1*1501 favors Gly, Ser, Pro, or Thr (Table II). Trp9 is the fingerprint residue for the DR1 supertype.

    The second cluster, called DR3 supertype, comprises DR3 (DRB1*0301–25), DR52 (DRB3*0101–10, DRB3*0201–18, DRB3*0301–03), DRB1*0422, and DRB1*1107. The common features for this cluster are Gln70, Lys71, and Gln/Arg74. Residues 70, 71, and 74 are the polymorphic residues forming pocket 4. This pocket binds one of the main anchor residues for MHC class II molecules (9, 10, 19). Amino acid side chain charges are important for interaction between TCRs and peptide-DR complexes (48). Due to Lys71 the total charge in pocket 4, for this supertype, is positive, which corresponds well to the preference for negatively charged Asp and Glu at position 4 in the DRB1*0301 binding motif (55, 56, 57). DR3 supertype fingerprint residues are Gln70, Lys71, and Gln/Arg74.

    The third cluster consists of DR4 (DRB1*0402, 12, 15, 25, 36, 37, 47), DR5 (DRB1*1101–47, DRB1*1201–09), DR6 (DRB1*1301–62, DRB1*1403, 16, 22, 25, 27, 40), and DR8 (DRB1*0801–25). As most DR4 alleles go into another cluster, this supertype was named after the next numerically lowest serotype DR5. The common feature for all MHC molecules belonging to this supertype is Asp70. To distinguish certain DR51 alleles with Asp70 which belong to another supertype, Glu9 is added to the DR5 supertype fingerprint. It is probable that Asp70 is negatively charged in the pocket. When Glu71 is next to Asp70, as in DRB1*0402, negatively charged residues—Asp and Glu—at peptide position 4 are detrimental to binding (61). If Arg or Lys are available at position 71, the pocket becomes both positively and negatively charged and can accommodate neutral amino acids like Leu, Val, and Ile, as in the DRB1*1101 binding motif (59, 64).

    The fourth cluster includes DR4 (DRB1*0401, 03–48 without alleles from the DR5 supertype), DR5 (DRB1*1113, 17, 26, 34, 42), DR6 (DRB1*1309, DRB1*1401–48), DR10 (DRB1*1001), and DR53 (DRB4*0101–06). This cluster was called the DR4 supertype. The DR4 supertype fingerprint is Gln/Arg70, Arg/Lys71, and Glu/Ala74. Arg or Lys at position 71 makes pocket 4 positively charged, and residues like Arg and Lys at peptide position 4 become detrimental to MHC binding, as is evident from binding motifs for DRB1*0401 and DRB1*0404 (58, 59, 60, 61).

    The last cluster, which we call the DR9 supertype, is composed of DR9 (DRB1*0901 and DRB1*0902) and DR51 (DRB5*0101–12, DRB5*0202–05), and has the fingerprint Lys/Gln9. Lys/Gln9 coexists with Asp11. Both residues take part in the formation of binding pocket 9. Asp11 makes the pocket negatively charged and peptides with Arg and Lys at position 9 are preferred as is evident from the DRB5*0101 binding motif (53, 54).

    Nonhierarchical clustering. The contents of clusters derived by nonhierarchical clustering are given in Table II. The major discrepancies concern alleles DR3 (DRB1*0301–25), DR7 (DRB1*0701–07), and DR51 (DRB5*0101–12, DRB5*0202–05). The first was classified as DR3 by hierarchical clustering and as DR4 by nonhierarchical. The second was considered part of the DR1 supertype by hierarchical clustering and as part of DR9 by nonhierarchical. The third belongs to the DR9 supertype according to hierarchical clustering and to DR5 according to nonhierarchical. Despite these minor differences, 82% (285 of 347) of the DR alleles were classified in the same supertype by both clustering methods.

    HLA-DQ supertypes

    Hierarchical clustering. The hierarchical clustering of HLA-DQ molecules is shown in Fig. 2 and the cluster contents are listed in Table III. Two clusters exist at the first level. The structural differences here concern the polymorphic region 84–87 of the β-chain. Alleles from the left cluster contain Gln84, Leu85, Glu86, and Leu87 (DQB1*02, 03, and 04), whereas the right cluster members have Glu84, Val85, Ala86 or Gly86, and Tyr87 or Phe87 (DQB1*05 and 06). Position 86 is part of pocket 1, together with the amino acids at positions 24, 31, and 52 from the α-chain. X-ray data for DQ8 (DQA1*0301/DQB1*0302) indicate that pocket 1 is lined by two positively charged side chains (His24 and Arg52) inside the entrance and two negatively charged residues (Glu31 and Glu86) deeper in the pocket (43). Together, they form a hydrogen-bonding network. The replacement of Glu86 with Ala or Gly destroys this network and the side chain of Arg52 might reorient closer to the pocket entrance. This might explain the pronounced intolerance of positively charged amino acids at position 1 according to the DQA1*0102/DQB1*0602 binding motif (65). The cluster with a fingerprint Ala/Gly86 was called the DQ1 supertype. It includes the serotypes DQ1 (DQB1*0611, 12), DQ5 (DQB1*0501, 02, 03), and DQ6 (DQB1*0601–05, 09) and the remaining molecules containing DQB1*05 or 06.

    At the third level of the DQ dendrogram (Fig. 2), the left cluster is divided into three smaller subclusters. The first cluster contains alleles beginning DQB1*02, the second contains alleles starting DQB1*03 and 04, and the third comprises alleles commencing DQA1*03 (Table III). DQA1*03 chains contain an additional Arg after Arg52, which is the main difference compared with other DQ α-chains. As these alleles were not classified as a separate cluster by the nonhierarchical clustering, we decided to define them as outliers rather than as a separate supertype. The alleles from the other two clusters differ in several positions at the binding site and correspond well to known motifs.

    The cluster containing the serotype DQ2 (DQB1*0201, 02, and 03) was labeled the DQ2 supertype. Its fingerprint is Glu86 and Lys71. The amino acid at position 71 takes part in the formation of pockets 4 and 7. DQ2 is the only serotype with Lys at position 71. This gives the pockets a strongly basic character, a factor accounting for the almost absolute requirement for acidic residues, principally at peptide position 7, but also at position 4 (67). A computer model of DQ2 with the HLA class I 46–60 peptide in the binding cleft indicates that Lys71 makes a direct salt-bridged hydrogen bond to the glutamic acid or aspartic acid at position 7 of the peptide (67).

    The cluster, named DQ3, includes several serotypes: DQ3 (DQB1*0306), DQ4 (DQB1*0401, 02), DQ7 (DQB1*0301, 04), DQ8 (DQB1*0302, 05), DQ9 (DQB1*0303), and the rest of alleles carrying DQB1*03 or 04. Its fingerprint is Glu86 and Thr71 (DQB1*03) or Asp71 (DQB1*04). Peptide positions 4 and 7 must be aliphatic amino acids (Table III), although Godkin et al. (72) found that Arg is also tolerated.

    Nonhierarchical clustering. The nonhierarchical clustering did not classify DQA1*03 alleles in a separate cluster. For the rest of the DQ molecules, we found an 83% (615 of 738) agreement with the hierarchical classification (Table III).

    HLA-DP supertypes

    Hierarchical clustering. The DP dendrogram is plotted in Fig. 3 and the cluster members are listed in Table IV. Like the DR locus, DP clustering depends on the polymorphism of the β-chain only. Two large clusters are apparent at the first level of the hierarchy. The left cluster comprises alleles with Asp84, Glu85, Ala86, and Val87. The right one includes alleles with Gly84 or Val84, Gly85, Pro86, and Met87. All four residues play a major role in forming the contact area between the α- and β-chains, and only position 84 and, in part, position 85 are involved in forming the surface of pocket 1 (46). Further division appears at the second level of the hierarchy, and it is connected with position 69, which is involved in forming pockets 4 and 6 (47). Clustering at the third level is contingent upon positions 55 and 56. Position 55 is part of pocket 9. At this level of the hierarchy, DPB1*0401 and 0402 split into separate clusters. As recent experimental data indicate that they probably belong to the same supertype (22), the second level of the dendrogram was chosen for definition of the DP supertypes.

    The first cluster includes DPw1 (DPB1*0101), DPw3 (DPB1*0301), DPw5 (DPB1*0501), DPB1*14, 20, 25, 26, 27, 31, 35, 36, 38, 45, 50, 52, 56, 57, 63, 65, 67, 68, 70, 76, 78, 79, 84, 85, 87, 89, 90, 91, 92, 97, and 98. It was called DPw1 and its fingerprint is Asp84 and Lys69. The MHC molecules forming this supertype have a negatively charged pocket 1 and a positively charged pocket 4. To the best of our knowledge, no binding motif is currently available for any of the alleles in this supertype. One may imagine that peptides with complementary charges—positively charged amino acids at position 1 and negatively charged residues at position 4—should, in general, bind well to members of this supertype.

    The second cluster involves DPw6 (DPB1*0601), DPB1*08, 09, 10, 11, 13, 16, 17, 19, 21, 22, 29, 30, 37, 44, 54, 55, 58, 69, 88, and 93. This supertype was called DPw6. All alleles have Asp84 and Glu69, except DPB1*1101 and DPB1*6901 which have Arg69. Unfortunately, no binding motif was available for any alleles from this supertype. Again, peptides with complementary charges—positively charged amino acids at positions 1 and 4—may be supposed to bind well to this supertype.

    The third cluster, called the DPw2 supertype, consisted of DPw2 (DPB1*0201 and 0202), DPB1*32, 33, 41, 46, 47, 48, 71, 81, 86, and 95. Its fingerprint is Gly84 or Val84 and Glu69. Alleles of the supertype have a deep, nonpolar pocket 1 capable of accepting bulky amino acids, as is evident from the available binding motif (Table IV) (74). Due to a negatively charged Glu69, pocket 4 of HLA-DP2 showed high affinity for peptides with positively charged residues at this position (47).

    DPw4 (DPB1*0401 and 0402), DPB1*15, 18, 23, 24, 28, 34, 39, 40, 49, 51, 53, 59, 60, 62, 66, 72, 73, 74, 75, 77, 80, 83, 94, 96, and 99 form the last cluster, which we call the DPw4 supertype. Its fingerprint is Gly84 or Val84 and Lys69; only DPB1*1501 and 74 have Arg69. Again, consistent with our results, known motifs for DPB1*0401 and 0402 indicate preferences for bulky aromatic amino acids at positions 1 and 4 (Table IV) (22).

    Nonhierarchical clustering. The contents of clusters derived by nonhierarchical clustering are listed in Table IV. Among several minor differences, DPB1*11 and 69 (Arg69) are clustered into the DPw1 supertype (Lys69), as opposed to DPw6 (Glu69), which was identified by hierarchical clustering. 85% (972 of 1140) of the DP molecules were clustered into the same supertype by both methods.
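
    The agreement figures reported for the three loci can be checked, and combined into an overall value, with a few lines of arithmetic. The counts are those given in the text; the `consensus` helper and the toy allele names are illustrative assumptions about how the comparison could be carried out.

```python
def consensus(hier, kmeans):
    """Count alleles given the same supertype name by both methods."""
    same = sum(1 for allele, st in hier.items() if kmeans.get(allele) == st)
    return same, len(hier)

def percent(same, total):
    return round(100 * same / total)

# Counts reported in the text: (commonly clustered, total) per locus.
reported = {"DR": (285, 347), "DQ": (615, 738), "DP": (972, 1140)}
per_locus = {locus: percent(s, t) for locus, (s, t) in reported.items()}
same_all = sum(s for s, _ in reported.values())
total_all = sum(t for _, t in reported.values())
overall = percent(same_all, total_all)

# Toy assignments with hypothetical allele names, for the helper above.
toy_h = {"a1": "DR1", "a2": "DR3", "a3": "DR9"}
toy_k = {"a1": "DR1", "a2": "DR4", "a3": "DR9"}
toy = consensus(toy_h, toy_k)
```

    The per-locus counts sum to 1872 of 2225 molecules, which rounds to the 84% overall consensus quoted in the abstract.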

    Discussion

    The extreme polymorphism, apparent within higher vertebrates, confounds the study of epitope binding by MHCs, particularly from an experimental perspective: no existing technique is fast or reliable enough to determine peptide specificities on an appropriate scale. MHC polymorphism greatly complicates epitope-based vaccine development, particularly in regard to population coverage. One initial approach to the problem has been to characterize the binding specificity of five to nine of the most common HLA alleles and to develop a mixture of several epitopes to cover the general population (75). Later, the logical framework of this approach was inverted. Instead of developing a single epitope for each of the common HLA alleles, attempts were made to identify epitopes capable of binding multiple HLA types (12, 13, 14, 16). The grouping of alleles into supertypes, based on common structural and functional features, is useful in addressing such attempts. Sette et al. (16) found that by focusing only on the HLA class I A1, A2, A3, A24, and B7 supertypes, 100% population coverage is achieved (76). The strategy of epitope selection based on HLA supertypes has been validated in different disease settings worldwide (77, 78, 79, 80, 81, 82). However, while HLA class I supertypes have been widely explored, class II supertypes are still in the relatively early stages of investigation.

    In this study, we have applied a combined bioinformatics approach, using both protein sequence and structural data, to 2225 HLA class II molecules, to detect similarities in their peptide binding sites and to define supertype fingerprints. Two chemometric techniques were used: hierarchical clustering on 3D CoMSIA fields and nonhierarchical k-means clustering on sequence-based z-descriptors. The former method classifies the molecules on the basis of binding site similarities, in terms of steric bulk, electrostatic potential, local hydrophobicity, and hydrogen-bond-donor and acceptor abilities. The latter method uses five principal properties (z-scales) of the amino acids and classifies the proteins according to their sequence-based binding site similarities.

    An average consensus of 84% was achieved, i.e., 1872 of 2225 class II molecules were classified in the same supertype by both techniques. Twelve class II supertypes were defined: five DRs, three DQs, and four DPs. The DR supertypes are DR1 (fingerprint Trp9), DR3 (Glu9, Gln70, and Gln/Arg74), DR4 (Glu9, Gln/Arg70, and Glu/Ala74), DR5 (Glu9, Asp70), and DR9 (Lys/Gln9). The DQ supertypes are DQ1 (Ala/Gly86), DQ2 (Glu86, Lys71), and DQ3 (Glu86, Thr/Asp71) and the DP supertypes are DPw1 (Asp84 and Lys69), DPw2 (Gly/Val84 and Glu69), DPw4 (Gly/Val84 and Lys69), and DPw6 (Asp84 and Glu69). Apart from the good agreement between known binding motifs and our classification, several new supertypes have been defined and thematic binding motifs have been outlined for them. In the following, we discuss the congruence of our systematic structural analysis of binding with extant data on the biology of class II human MHCs, rather than making unsupported speculations.
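
    The fingerprints listed above can be read as a simple lookup rule. The sketch below encodes them with one-letter residue codes and assigns a supertype from the residues at the fingerprint positions; the data layout, the function name, and the behavior of returning None for unmatched patterns are assumptions made for illustration.

```python
# Fingerprints from the text, as {position: allowed one-letter residues}.
FINGERPRINTS = {
    "DR": [
        ("DR1", {9: "W"}),
        ("DR9", {9: "KQ"}),
        ("DR5", {9: "E", 70: "D"}),
        ("DR3", {9: "E", 70: "Q", 74: "QR"}),
        ("DR4", {9: "E", 70: "QR", 74: "EA"}),
    ],
    "DQ": [
        ("DQ1", {86: "AG"}),
        ("DQ2", {86: "E", 71: "K"}),
        ("DQ3", {86: "E", 71: "TD"}),
    ],
    "DP": [
        ("DPw1", {84: "D", 69: "K"}),
        ("DPw2", {84: "GV", 69: "E"}),
        ("DPw4", {84: "GV", 69: "K"}),
        ("DPw6", {84: "D", 69: "E"}),
    ],
}

def classify(locus, residues):
    """Return the first supertype whose fingerprint matches, else None."""
    for name, pattern in FINGERPRINTS[locus]:
        if all(residues.get(pos) in set(allowed)
               for pos, allowed in pattern.items()):
            return name
    return None

# Hypothetical residue sets at the fingerprint positions.
dr_example = classify("DR", {9: "E", 70: "Q", 74: "R"})
dq_example = classify("DQ", {86: "E", 71: "K"})
dp_example = classify("DP", {84: "G", 69: "K"})
unmatched = classify("DP", {84: "D", 69: "R"})
```

    Alleles that deviate from the listed residues, such as the Arg69 variants noted in the Results, fall through to None here and would need the full clustering to be placed.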

    HLA-DR molecules account for >90% of the HLA class II isotypes expressed on APCs (83). Although the HLA-DRA locus is monomorphic, >300 alleles have been described for the HLA-DRB1 locus (3). X-ray data indicate that 12 hydrogen bonds exist between conserved DR atoms and main-chain atoms of the bound peptide (9). As they do not involve the side chains of the peptide, these hydrogen bonds are likely to play a common role in peptide binding to HLA-DR.

    Five binding pockets, pockets 1, 4, 6, 7, and 9 (named after the corresponding positions on the binding peptide), were found to be common for most DR proteins (9, 10). Specificity of pocket 1 is modulated by a Gly/Val86 dimorphism. DR proteins with Gly86 show strong preferences for large hydrophobic side chains (Trp, Tyr, Phe) at peptide position 1, whereas Val86 restricts the pocket size and alters the preferences to small hydrophobic side chains (Val and Ala) at this position. The main difference in the preferences concerns bulky aromatic residues (Trp, Tyr, and Phe), which are not accepted at pocket 1 when it contains Val86. However, the medium sized hydrophobic amino acids Leu and Ile are well accepted in all DR molecules and peptide position 1 could not be considered as an anchor able to distinguish between different DR alleles.

    Pocket 4 is formed by polymorphic amino acids at positions 13, 26, 28, 70, 71, 74, and 78. Residues at positions 70, 71, and 74 play a significant role both in protein binding and T cell recognition (4, 9, 10, 19). Residues 71 and 74 also take part in the formation of pockets 6 and 7 (9, 84). Ou et al. (19) made a functional categorization of DR alleles on the basis of pocket 4 polymorphism, associating each group with certain autoimmune diseases. Good agreement was found between this categorization and our classification. The DR3 supertype corresponds to the functional DR restrictive supertype pattern (RSP) "R". It contains the pattern Gln70, Lys71, and Arg/Gln74 and the overall charge within pocket 4 is positive, which requires negatively charged amino acids Asp and Glu at position 4 of the binding peptide (Table II, motif DRB1*0301). This supertype is associated with two autoimmune diseases: systemic lupus erythematosus and Hashimoto’s thyroiditis (19, 83). The DR4 supertype corresponds to DR RSP "A" (19). Its pattern, Gln/Arg70, Arg/Lys71 and Glu/Ala74, is close to that of DR RSP "R", differing only in position 74. When Ala appears at 74, pocket 4 increases in size and can accommodate larger amino acids such as Phe, Trp, and Ile (Table II, motifs DRB1*0401, 04, 05). Unfortunately, no binding motif is available for any allele bearing Glu74, but one could suppose that small polar residues, like Ser and Thr, will be accepted. This supertype is associated with a susceptibility to rheumatoid arthritis (19, 83). The DR5 supertype corresponds to DR RSP "D" with pattern Asp70, Glu/Arg71, and Leu/Ala74 (19). The main feature here is the negatively charged Asp at position 70, which restricts the accommodation of negatively charged amino acids at peptide position 4 (Table II, motif DRB1*0402). Juvenile rheumatoid arthritis (JRA), pemphigus vulgaris, and allergic bronchopulmonary aspergillosis are autoimmune diseases associated with this supertype (19).
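
    The charge argument used for pocket 4 can be made concrete with a crude tally of formal side-chain charges at positions 70, 71, and 74. This is a deliberate simplification for illustration (Lys and Arg counted as +1, Asp and Glu as −1, everything else including His as 0); it ignores local pKa shifts and the other pocket residues, and the function name is an assumption.

```python
# Formal side-chain charges at neutral pH (simplified; His treated as neutral).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}

def pocket4_charge(res70, res71, res74):
    """Sum of formal charges of the residues at positions 70, 71, and 74."""
    return sum(CHARGE.get(aa, 0) for aa in (res70, res71, res74))

# Patterns taken from the text, written with one-letter codes.
dr3_like = pocket4_charge("Q", "K", "Q")      # Gln70, Lys71, Gln74
dr5_acidic = pocket4_charge("D", "E", "A")    # Asp70, Glu71, Ala74
dr5_mixed = pocket4_charge("D", "R", "A")     # Asp70, Arg71, Ala74
```

    The positive tally for the DR3-like pattern is consistent with the stated preference for Asp and Glu at peptide position 4, the negative tally for Asp70 with Glu71 is consistent with acidic residues being disfavored there, and the zero tally for Asp70 with Arg71 corresponds to the mixed-charge pocket described for the DR5 supertype.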

    Residues 9, 30, 37, 38, 57, and 61 are involved in the formation of pocket 9 (9, 84). The polymorphism at position 9 determines the pocket size and hence binding motif preferences at this position. The clustering at the first and second level of the DR dendrogram (Fig. 1) is associated with the position 9 polymorphism. Trp9 is the fingerprint for the DR1 supertype, Lys/Gln9 for DR9, and Glu9 for DR3, DR4, and DR5. Small amino acids (Ala, Val, Gly, Ser, Thr, Pro) are accepted in pocket 9 of the DR1 supertype (Table II, motifs DRB1*0101, 1501). Glu9, in combination with Asp57, makes this pocket negatively charged, facilitating the accommodation of positively charged amino acids, such as Lys (motifs DRB1*0401, 0404) and His (motif DRB1*0402). In most MHC class II alleles, Asp57 makes a salt-bridged hydrogen bond with Arg76, allowing the pocket to also accommodate aliphatic and polar amino acids (43). In cases where Asp57 is replaced by Ser (DRB1*0405) or Ala (DQ8), the hydrogen bonding network is destroyed and Arg76 can strongly attract negatively charged amino acids (Asp, Glu) at position 9 of the binding peptide (motif DRB1*0405). Lys/Gln9 always coexists with Asp11 and Asp/Gly30. Vogt et al. (53) suggested that the positively charged anchor residues Arg and Lys (motif DRB5*0101) may form a salt bridge with Asp at position 11 and/or position 30 of the DRB5*0101 molecule.
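
    Taken together, the DR fingerprints (position 9 first, then positions 70 and 74) amount to a short decision rule. The following is a minimal sketch of that rule, assuming the residues at the three positions have already been extracted from an aligned sequence and are given as three-letter codes. The function name `classify_dr` and the "unclassified" fallback are illustrative choices, not part of the published method, which relied on clustering rather than hand-written rules.

```python
def classify_dr(res9, res70=None, res74=None):
    """Assign a DR supertype from the fingerprint residues at positions 9, 70, and 74.

    Rules follow the fingerprints stated in the text:
    DR1 (Trp9), DR9 (Lys/Gln9), DR5 (Glu9, Asp70),
    DR3 (Glu9, Gln70, Gln/Arg74), DR4 (Glu9, Gln/Arg70, Glu/Ala74).
    """
    if res9 == "Trp":
        return "DR1"
    if res9 in ("Lys", "Gln"):
        return "DR9"
    if res9 == "Glu":
        if res70 == "Asp":
            return "DR5"
        if res70 == "Gln" and res74 in ("Gln", "Arg"):
            return "DR3"
        if res70 in ("Gln", "Arg") and res74 in ("Glu", "Ala"):
            return "DR4"
    # Any combination not covered by a fingerprint is left unassigned.
    return "unclassified"


if __name__ == "__main__":
    print(classify_dr("Trp"))                # DR1
    print(classify_dr("Glu", "Gln", "Arg"))  # DR3
    print(classify_dr("Glu", "Arg", "Ala"))  # DR4
```

    Position 74 is what separates DR3 from DR4 when position 70 is Gln, which is why it is checked before the broader Gln/Arg70 rule.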

    During the last 10 years, interest in HLA-DQ proteins has increased because certain DQ alleles are associated with susceptibility to type 1 diabetes and celiac disease (85, 86). The x-ray structure of DQ8 (DQA1*0301/DQB1*0302) complexed with an immunodominant peptide from insulin was solved (43). Several DQ binding motifs have been defined (65, 66, 67, 68, 69, 70, 71, 72, 73). The initial hypothesis was that class II molecules with non-Asp57 (i.e., DQ2, DQ8, I-Ag7) preferentially bind peptides with a negatively charged anchor residue at peptide position 9, such as peptides from insulin β-chain, gliadin, and glutenin, and present them to islet-infiltrating T cells or mucosal T cells (87, 88, 89, 90). As discussed above, the molecular explanation for this phenomenon is that Asp57 forms a salt bridge with Arg76, whereas in non-Asp57 molecules Arg76 is free to interact with the negatively charged peptide anchor at position 9 (43). However, recent data do not support this hypothesis: not all non-Asp57 class II molecules prefer negatively charged anchor residues at peptide position 9, nor are all of them associated with susceptibility to type 1 diabetes and celiac disease (69). For example, in the Japanese population the class II molecule DQA1*0301/DQB1*0401, which has the same α-chain as DQ8 but has a β-chain containing Asp57, is associated with increased susceptibility to type 1 diabetes (43). Other exceptions include molecules DQA1*0201/DQB1*0201 and DQ9 (DQA1*0301/DQB1*0303). The former does not contain Asp57 but is neutral-protective to type 1 diabetes (43), while the latter does contain Asp57 yet is associated with susceptibility to celiac disease (73).

    The DQ classification defined in the present study is based on two important amino acids from the β-chain: positions 71 and 86. Residue 71 participates in the formation of pockets 4 and 7, while residue 86 is part of pocket 1. Pocket 1 is a deep, very polar pocket in HLA-DQ molecules, formed by two positively and two negatively charged amino acids, which form a hydrogen bonding network. Replacement of Glu86 with Ala or Gly will destroy this network and leave Arg52 free to contact the side chain of peptide position 1 (43). This is consistent with strong intolerance for positively charged amino acids at position 1 for the DQ1 supertype (Table III, motif for DQA1*0102/DQB1*0602). Ala/Gly86 coexists with Phe/Tyr87. The latter residue is also part of pocket 1, and the Phe/Tyr→Leu replacement increases the pocket size. Large hydrophobic amino acids (Trp, Tyr, Phe) at position 1 are well accepted by alleles bearing Glu86/Leu87, which belong to supertypes DQ2 and DQ3 (Table III, motifs DQA1*0501/DQB1*0201, DQA1*0301/DQB1*0301), whereas alleles with Ala/Gly86 and Phe/Tyr87 (supertype DQ1) prefer medium sized hydrophobic or polar amino acids (Leu, Ile, Thr, Ser) (Table III, motif DQA1*0102/DQB1*0602).

    DQ pocket 4 is significantly deeper than the corresponding pocket 4 in DR molecules (43). Lys71 accounts for the strong basic character of this pocket in DQ2 supertype molecules. Lys71 makes a salt bridge with acidic residues at position 7 of the binding peptide (67). Asp and Glu are preferred amino acids at positions 4 and 7 of the DQ2 binding motif (Table III). In the DQ3 supertype, Lys71 is replaced by Thr71, which coexists with Glu74. Glu74 makes the pocket negatively charged, and acidic residues (Asp and Glu) are not observed at this peptide position (motif DQA1*0301/DQB1*0301).

    DQ alleles beginning DQA1*03 differ from other DQ alleles in having an additional Arg residue after Arg52. This affects the architecture of pocket 1 (21) and determines a preference for small to medium sized amino acids at peptide position 1, including aliphatic or negatively charged side chains (Table III, motif DQA1*0301/DQB1*0301). DQA1*03 alleles were classified as outliers and not as a separate supertype.
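
    The DQ fingerprints reduce to a test on position 86 followed, for Glu86 alleles, by a test on position 71. The sketch below illustrates this under the assumption that both residues are supplied as three-letter codes; `classify_dq` is a hypothetical helper, and it considers only the two fingerprint positions, so it does not reproduce the outlier status of DQA1*03 alleles described above.

```python
def classify_dq(res71, res86):
    """Assign a DQ supertype from the residues at fingerprint positions 71 and 86.

    Fingerprints as stated in the text:
    DQ1 (Ala/Gly86), DQ2 (Glu86, Lys71), DQ3 (Glu86, Thr/Asp71).
    """
    if res86 in ("Ala", "Gly"):
        # Position 71 is not part of the DQ1 fingerprint.
        return "DQ1"
    if res86 == "Glu":
        if res71 == "Lys":
            return "DQ2"
        if res71 in ("Thr", "Asp"):
            return "DQ3"
    return "unclassified"


if __name__ == "__main__":
    for residues in [("Thr", "Ala"), ("Lys", "Glu"), ("Thr", "Glu")]:
        print(residues, "->", classify_dq(*residues))
```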

    Apart from type 1 diabetes and celiac disease, HLA-DQ alleles are strongly associated with either protection against or susceptibility to other autoimmune diseases. Susceptibility to multiple sclerosis has been suggested for individuals with DQA1*0102/DQB1*0602 (91, 92); pemphigus vulgaris is associated with DQB1*0503 (93); rheumatoid arthritis with DQ3 (DQA1*03/DQB1*03 and DQA1*03/DQB1*04) and DQ5 (DQA1*0101/DQB1*0501) (94); systemic sclerosis with DQA1*0501 (95); and protection against type 1 diabetes with DQA1*0102/DQB1*0602 (96). Although these associations concern single HLA-DQ alleles, one could draw a more general conclusion, connecting susceptibility to multiple sclerosis, pemphigus vulgaris, or rheumatoid arthritis, as well as protection against type 1 diabetes, with alleles from the DQ1 supertype.

    In contrast to HLA-DR and DQ, HLA-DP molecules have not been studied extensively, as they have been viewed as less important in immune responses than DRs and DQs. Moreover, currently, no x-ray data exist for any peptide/HLA-DP complex. However, it is now known that HLA-DP proteins contribute to the risk of graft-vs-host disease (97, 98), and that some DP alleles are associated with chronic beryllium disease (99), sarcoidosis (100), and JRA (101). Both the α- and β-chains of HLA-DP are polymorphic, allowing multiple combinations, but only a few DP molecules are abundant globally. For example, DPA1*0103/DPB1*0401 and 0402 are overrepresented, carried by 76% of individuals in the Caucasian population (22).

    The HLA-DP classification made in this study is based on two key amino acids of the DP β-chain: positions 69 and 84. These positions correspond to DR/DQ 71 and 86. Both are important for DQ classification, while only 71 takes part in the DR classification. Positions 84 and, to a lesser extent, 85 take part in the formation of pocket 1. Somewhat under half (40 of 95) of the β-chains have Gly/Val84 and Gly85; the remainder (55 of 95) have Asp84 and Glu85. The chemical nature of the two pairs is very different, and this determines the strong differences in the pockets formed by them. Pocket 1 with Gly/Val84 is deep and nonpolar and could accept large hydrophobic amino acids like Phe, Tyr, and Leu (Table IV). Pocket 1 with Asp84 is shallower and negatively charged. Because no binding motif is available for alleles with Asp84, one might suppose that positively charged amino acids, such as Arg and Lys, are favored here. Position 84 was also found to be a key amino acid in the HLA-DP classification of Castelli and colleagues (22), who defined three supertypes, based on positions 11 and 84, in contrast to the four identified by our analysis.

    A Glu/Lys dimorphism exists at DP position 69. Additionally, there are four alleles (DPB1*11, 15, 69, and 74) with Arg69. Because Lys and Arg are similar, these alleles were grouped into the Lys69 clusters. Position 69 affects the shape and charge distribution of pockets 4 and 6 (47). Pockets 4 and 6 with Glu69 show high affinity for positively charged or polar residues like Arg, Lys, Gln, and Asn or nonpolar aromatic residues (Phe, Trp, Tyr, and His), but reduced affinity for large nonpolar aliphatic residues (Table IV, motif DPA1*0103/DPB1*0201). Because Glu69 is associated with sarcoidosis, one could suppose a connection between susceptibility to this disease and alleles from the DPw2 and DPw6 supertypes (100). Susceptibility to JRA is strongly associated with the DPB1*0201 allele (101). By analogy, a relation between JRA and the DPw2 supertype could be supposed. Pockets 4 and 6 with Lys/Arg69 have reduced amino acid selectivity, with aromatic residues most preferred (motifs DPA1*0103/DPB1*0401 and 0402). Additionally, Lys69 favors the binding of large residues endowed with the capacity to form hydrogen bonds (such as Arg) with residue Gln60 (47).
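
    Because the DP supertypes are defined by two dimorphic positions, they form a simple two-by-two grid. The sketch below expresses that grid as a lookup table, assuming residues are given as three-letter codes. The names `POCKET1_CLASS`, `RES69_CLASS`, `DP_SUPERTYPES`, and `classify_dp` are hypothetical, and Arg69 is mapped to the same class as Lys69, following the grouping described above.

```python
# Residue classes at the two fingerprint positions (position 84 and position 69).
POCKET1_CLASS = {"Gly": "nonpolar", "Val": "nonpolar", "Asp": "acidic"}
RES69_CLASS = {"Lys": "basic", "Arg": "basic", "Glu": "acidic"}  # Arg grouped with Lys

# Supertype for each (position 84 class, position 69 class) combination.
DP_SUPERTYPES = {
    ("acidic", "basic"): "DPw1",    # Asp84 and Lys69
    ("nonpolar", "acidic"): "DPw2",  # Gly/Val84 and Glu69
    ("nonpolar", "basic"): "DPw4",   # Gly/Val84 and Lys69
    ("acidic", "acidic"): "DPw6",    # Asp84 and Glu69
}


def classify_dp(res84, res69):
    """Assign a DP supertype from the residues at positions 84 and 69."""
    key = (POCKET1_CLASS.get(res84), RES69_CLASS.get(res69))
    return DP_SUPERTYPES.get(key, "unclassified")


if __name__ == "__main__":
    print(classify_dp("Gly", "Glu"))  # DPw2
    print(classify_dp("Val", "Arg"))  # DPw4
```

    A table is used here rather than nested conditionals because every combination of the two residue classes corresponds to exactly one supertype.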

    Analysis of our classification of HLA class II proteins into supertypes reveals several general trends. First, β-chain polymorphism within the peptide binding site plays the leading role in the overall polymorphism of human MHC. The key polymorphic positions found to be important for our and other supertype definitions (22) all belong to β-chains. Second, despite the extraordinary diversity of HLA proteins, common structural features and similarities could be detected and used as fingerprints for their identification and classification into supertypes. The number of amino acids involved in the supertype fingerprints is strikingly small, i.e., one to three. Finally, the classifications proposed here are based on key amino acids with very different, even opposite, properties. For example, position Glu/Lys69 for HLA-DP alleles could be considered a key position, because of the opposite properties of Glu and Lys. However, position Gly/Val86 could not be a key position for DR classification, because of the similar properties of Gly and Val.

    The MHC is among the most polymorphic of human proteins, and this has greatly complicated the discovery of epitope vaccines. Supertype analysis is one approach taken to address this confounding problem. We have previously identified class I supertypes using computational methods (40), which we now complement with our present analysis of human class II supertypes. The veracity of this analysis is confirmed, as far as possible, by reference to known peptide binding motifs. Although such motifs are an imperfect, or at least incomplete, representation of binding (102, 103), they have clear utility as an approximation to peptide specificity. All supertypes are theoretically derived. Supertypes based on "binding motifs" may possess a certain verisimilitude, but are, at best, only a partial definition of supertypic membership, limited by the lack of available data for most MHC molecules. Indeed, all work based on the analysis of experimental work, including our own (104, 105, 106, 107), is necessarily limited by the paucity and haphazard nature of extant experimental binding studies. The approach presented here is complementary to such analysis and to existing supertype analyses (3, 19, 20, 21, 22). However, our approach is fundamentally different, at a conceptual and technical level, from other, earlier attempts to cluster alleles into supertypes using a structural approach.

    We have discussed such data as exist which support and verify our analysis, rather than speculating in a specious and uncorroborated manner. In the context of human class II MHC, these data are, unfortunately, only partial. Further demonstration of the accuracy of our classification will come in either of two ways: through the accumulation of further motifs in the literature or by the exploration of the peptide specificity repertoire of MHC molecules through systematic study. The utility of the method, though obvious to us, will again require independent, external validation for a sufficiently large number of peptides and alleles that its accuracy can be shown with statistical significance. We see supertype definition as a grand challenge with significant scientific and utilitarian merit: it is difficult, and thus exciting, and is also truly valuable, as a pivotal tool in the drive to develop new and better vaccines.

    Footnotes

    The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

    1 This work was supported by GlaxoSmithKline, Medical Research Council, Biotechnology and Biological Sciences Research Council, and United Kingdom Department of Health.

    2 Address correspondence and reprint requests to Dr. Darren R. Flower, Edward Jenner Institute for Vaccine Research, Compton, Berkshire, U.K., RG20 7NN. E-mail address: darren.flower{at}jenner.ac.uk

    3 Abbreviations used in this paper: 3D, three dimensional; CoMSIA, Comparative Similarity Indices Analysis; RSP, restrictive supertype pattern; JRA, juvenile rheumatoid arthritis.

    Received for publication September 8, 2004. Accepted for publication February 28, 2005.

    References

    Pfeifer, J. D., M. J. Wick, R. L. Roberts, K. Findlay, S. J. Normark, C. V. Harding. 1993. Phagocytic processing of bacterial antigens for class I MHC presentation to T cells. Nature 361: 359-362.

    Rötzschke, O., K. Falk. 1994. Origin, structure and motifs of naturally processed MHC class II ligands. Curr. Opin. Immunol. 6: 45-51.

    Robinson, J., M. J. Waller, P. Parham, N. de Groot, R. Bontrop, L. J. Kennedy, P. Stoehr, S. G. E. Marsh. 2003. IMGT/HLA and IMGT/MHC: sequence databases for the study of the major histocompatibility complex. Nucleic Acids Res. 31: 311-314.

    Chelvanayagam, G. 1997. A roadmap for HLA-DR peptide binding specificities. Hum. Immunol. 58: 61-69.

    Gulukota, K., C. DeLisi. 1996. HLA allele selection for designing peptide vaccines. Genet. Anal. Biomol. Eng. 13: 81-86.

    Bjorkman, P. J., M. A. Saper, B. Samraoui, W. S. Bennett, J. L. Strominger, D. C. Wiley. 1987. Structure of the human class I histocompatibility antigen, HLA-A2. Nature 329: 506-512.

    Bjorkman, P. J., M. A. Saper, B. Samraoui, W. S. Bennett, J. L. Strominger, D. C. Wiley. 1987. The foreign antigen binding site and T cell recognition regions of class I histocompatibility antigens. Nature 329: 512-518.

    Saper, M. A., P. J. Bjorkman, D. C. Wiley. 1991. Refined structure of the human histocompatibility antigen HLA-A2 at 2.6 Å resolution. J. Mol. Biol. 219: 277-319. (Irini A. Doytchinova and )