Kahraman Thornton 2007


    Shape Variation in Protein Binding Pockets and their Ligands

    Abdullah Kahraman1, Richard J. Morris2, Roman A. Laskowski1 and Janet M. Thornton1

    1 European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK

    2 John Innes Centre, Norwich Research Park, Colney Lane, Norwich, NR7 7UH, UK

    A common assumption about the shape of protein binding pockets is that they are related to the shape of the small ligand molecules that can bind there. But to what extent is that assumption true? Here we use a recently developed shape matching method to compare the shapes of protein binding pockets to the shapes of their ligands. We find that pockets binding the same ligand show greater variation in their shapes than can be accounted for by the conformational variability of the ligand. This suggests that geometrical complementarity in general is not sufficient to drive molecular recognition. Nevertheless, we show when considering only shape and size that a significant proportion of the recognition power of a binding pocket for its ligand resides in its shape. Additionally, we observe a buffer zone or a region of free space between the ligand and protein, which results in binding pockets being on average three times larger than the ligand that they bind.

    © 2007 Elsevier Ltd. All rights reserved.

    *Corresponding author

    Keywords: shape; ligand; binding pocket; molecular recognition; conformational diversity

    Introduction

    Molecular recognition is a central theme in molecular biology and arguably the primary driving force behind most processes in and between cells. The recognition procedure is based mainly on geometric and electrostatic complementarity. Enzymes are thought to have optimised their astonishing catalytic power and specificity by evolving their surfaces to complement substrate transition states. One would expect the co-evolution of substrate and enzyme to result in a fairly exclusive partnership that must somehow be reflected in both the ligand and the binding site.

    Therefore it is reasonable to assume that proteins binding similar ligands have binding sites of similar physical or biochemical properties. However, we are not aware of any comprehensive analyses considering multiple ligands that adequately investigate this assumption. Here we have used a global shape descriptor from Morris et al.1 to test this assumption from a purely shape perspective. We address the following questions: (1) to what extent are binding pockets from non-homologous protein domains that bind the same ligand similar in shape? (2) To what extent are binding pockets similar in shape to the ligands they bind? (3) Is shape or size more important when comparing binding pockets with ligands? (4) How useful is a global shape descriptor for binding sites in molecular recognition analysis and especially as a ligand predictor?

    Previous studies

    The importance of binding sites in proteins was recognised early in structural biology and led to many studies to identify and compare binding sites. As a detailed comparison of all these techniques and closely related methods such as docking and QSAR is beyond the scope of this article, only a few methods conceptually similar to ours will be described here. For reviews relating to binding site determination and comparison, see the literature.2–8

    Abbreviations used: RMSD, root-mean-squared deviation; ROC, receiver operating characteristics; AUC, area under the curve; PQS, protein quaternary structure.

    E-mail address of the corresponding author: [email protected]

    In this manuscript a binding site is defined as the cluster of protein atoms on the protein surface, which interact with the binding partner via hydrogen and other non-covalent bonds. In contrast a binding pocket is the negative picture of the binding site, i.e. the voluminous imprint of the binding site in space, which bears the ligand.

    doi:10.1016/j.jmb.2007.01.086 J. Mol. Biol. (2007) 368, 283–301

    0022-2836/$ - see front matter © 2007 Elsevier Ltd. All rights reserved.

    Current approaches analysing binding sites can be roughly divided into three classes. Firstly, methods that detect cavities and geometrically match them to each other; secondly, methods that identify and compare specific geometrical patterns of amino acids in binding sites; and thirdly methods that use evolutionary information to predict the location of binding sites.

    Among the methods in the first category is Cavbase,7 which uses pseudospheres to represent the locations and physicochemical properties of the atoms involved in molecular recognition. The spatial distribution of the pseudospheres is represented by a graph, and a clique detection algorithm is used to identify similar binding sites in other protein structures. The eF-site (electrostatic surface of Functional-site) database9 also uses clique detection. Here the graphs represent hydrophobicity, electrostatic potential and curvature of surface patches.

    Methods from the second category are based on the fact that functionally important residues tend to maintain the same relative spatial disposition even in distantly related proteins. This is particularly true for the catalytic residues in enzymes. The best-known example is the Ser-His-Asp catalytic triad of serine proteases. In this specific case the relative positioning of these three residues is strongly conserved even in totally different structural folds.10 The CSA (Catalytic Site Atlas) database11 contains a catalogue of structural templates of two to six residues each derived from the catalytic residues of enzymes. 3D search programs like SPASM, RIGOR12 and Jess13 or algorithms proposed by Besl and McKay14 and Nussinov and Wolfson15 allow one to scan such templates against any query protein structure.

    Finally, an example of a method from the third class, which uses evolutionary information to predict the location of a protein's binding site, is pvSOAR16 (pocket and void surfaces of amino acid residues). pvSOAR uses the CASTp database17 of protein clefts and voids and searches for similar sequence and spatial arrangement of the cavity residues for a query structure.

    All the methods above involve comparison of atomic coordinates in one form or another. In the present work a shape comparison technique is used. This avoids the problems of superposing binding sites, particularly where they are composed of different numbers of atoms and atom types. Furthermore, the consideration of shape alone allows a direct comparison of the degree of complementarity between binding pockets and the ligands that they bind.

    Figure 1. The cleft volume reduction procedure illustrated on the example of flavohemoprotein 1cqx. SURFNET clefts are reduced using one of three procedures (atoms used for reduction are in red colour, reduced clefts are in green colour): (a) Conserved Cleft Model: keep SURFNET spheres next to conserved cleft region (top inset: conservation mapped on protein structure). (b) Interact Cleft Model: keep SURFNET spheres next to protein atoms that interact with ligand (top inset: LigPlot/HBPLUS diagram). (c) Ligand Cleft Model: keep SURFNET spheres that are in contact with ligand molecule (top inset: FAD molecule).


    Many sophisticated shape description and matching methods exist, all with their strengths and weaknesses. As we focus on 3D shape in this analysis, we employ the method originally pioneered by Ritchie & Kemp in a series of articles18–22 in which the idea of using spherical harmonics for protein–protein interactions and docking was developed. The idea was taken further by Cai and colleagues and applied directly to binding pockets23 and was further improved by Morris et al.1

    Three cleft approximations were implemented in our algorithm (Figure 1) in order to investigate the shape descriptor as a molecular recognition and function prediction tool. Each model was obtained by extracting a different type of information from the protein (see section Methods, Cleft Reduction). These models were: the Conserved Cleft Model, the Interact Cleft Model and the Ligand Cleft Model.

    Figure 2 provides some examples of the models.

    Data set

    The analysis requires multiple examples of binding sites and ligands that are found in unrelated proteins for which structural data are available. In fact rather few ligands of this type are available. Applying the criteria from Methods, The data set, 100 protein binding sites were obtained that bind one of nine ligand types (Table 1). The ligands were all of different size and flexibility, ranging from phosphate, the smallest and most rigid molecule, through ATP, a flexible and medium-sized molecule, to FAD, the largest and most flexible molecule in the data set.

    Results and Discussion

    Before applying the spherical harmonic functions for the approximation of molecular shapes we compare the reconstructed shapes to the molecules that were used to define them. Thus we first present a number of quality checks of the shape description and comparison method in this study. We then discuss the biological implications of our results in more detail.

    Shape reproduction quality and comparison metric

    Reconstruction error

    Any function on the unit sphere can be reconstructed to any arbitrary error threshold by a linear combination of spherical harmonic functions to different orders. In Figure 3 such reconstructions are shown for two ligand molecules. Depending on the application, the spherical harmonics expansion can be terminated at an appropriate order, e.g. to roughly capture the overall shape of a small molecule an expansion up to lmax=6 is usually sufficient; for highly non-central distributions an expansion order of several hundred may be necessary. The effects of series termination on the error of the binding pocket shape reconstruction are visualised in the two examples of Figure 3. Additionally, reconstruction errors are provided in the Figure reflecting the root-mean-square-deviation (RMSD) between 240 sample points and their reconstructed values. The 240 points were spherical design points that are a special set of points uniformly spread over a unit sphere and used in our application as the integration layout for the spherical harmonic functions (see Methods). In Figure 4, the reconstruction error is shown as a function of the expansion order. Mathematically one would expect the error to decrease smoothly with increasing expansion order, which is indeed the case for integration methods that are accurate to higher orders. The spherical design method proposed by Morris,24 however, has a limited region of applicability for integration in expansion space. The limitation leads to increasing numerical errors for spherical harmonic orders higher than lmax=14 for the employed spherical 21-design. As we wanted to obtain the most accurate reconstruction of the shapes, whilst keeping a fast integration, all the cleft models and ligand shapes in the data set were expanded to order 14, which according to the plot has an average error of 0.188 Å.
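    The effect of series termination described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: it assumes a toy ellipsoid as the shape, 240 Fibonacci-lattice points as a stand-in for the spherical 21-design, an equal-weight sum as the integration rule, and real spherical harmonics only up to lmax=2. All function names are hypothetical.

```python
import math

def fibonacci_sphere(n):
    """n roughly uniform unit vectors; an assumed stand-in for spherical design points."""
    golden = math.pi * (1.0 + math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        phi = golden * (i + 0.5)
        points.append((r * math.cos(phi), r * math.sin(phi), z))
    return points

def real_harmonics(p, lmax):
    """Orthonormal real spherical harmonics at unit vector p, supported for lmax <= 2 only."""
    x, y, z = p
    values = [0.5 / math.sqrt(math.pi)]                      # l = 0
    if lmax >= 1:
        k = math.sqrt(3.0 / (4.0 * math.pi))
        values += [k * y, k * z, k * x]                      # l = 1
    if lmax >= 2:
        k = 0.5 * math.sqrt(15.0 / math.pi)
        values += [k * x * y, k * y * z,
                   0.25 * math.sqrt(5.0 / math.pi) * (3.0 * z * z - 1.0),
                   k * x * z, 0.5 * k * (x * x - y * y)]     # l = 2
    return values

def expand(points, radii, lmax):
    """Approximate each coefficient (integral of r * Y over the sphere) by an equal-weight sum."""
    weight = 4.0 * math.pi / len(points)
    coeffs = [0.0] * (lmax + 1) ** 2
    for p, r in zip(points, radii):
        for j, y in enumerate(real_harmonics(p, lmax)):
            coeffs[j] += weight * r * y
    return coeffs

def reconstruction_rmsd(points, radii, lmax):
    """RMSD between sampled radii and their truncated spherical harmonic reconstruction."""
    coeffs = expand(points, radii, lmax)
    total = 0.0
    for p, r in zip(points, radii):
        approx = sum(c * y for c, y in zip(coeffs, real_harmonics(p, lmax)))
        total += (approx - r) ** 2
    return math.sqrt(total / len(points))

points = fibonacci_sphere(240)
a, b, c = 3.0, 2.0, 1.5  # semi-axes of the toy ellipsoid (illustrative values)
radii = [1.0 / math.sqrt((x / a) ** 2 + (y / b) ** 2 + (z / c) ** 2) for x, y, z in points]
errors = {lmax: reconstruction_rmsd(points, radii, lmax) for lmax in (0, 1, 2)}
```

    An expansion to order lmax has (lmax+1)^2 coefficients, so the order 14 used here corresponds to 225 coefficients per shape. In the sketch the error at lmax=2 is lower than at lmax=0, where the reconstruction is just a sphere of the mean radius.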

    Comparison to surface RMSD

    The difference between two shapes is calculated using the standard Euclidean metric in coefficient space (see equation (2)). To assess whether the resulting coefficient distances indicate similarity or dissimilarity, they were plotted against surface RMSD (Figure 5). The surface RMSD follows the common standard RMSD calculation in structural biology but instead of using atomic coordinates the 240 spherical 21-design sample points were used. The plot in Figure 5 shows a high correlation of R² = 0.99 between the coefficient distance and the surface RMSD and thus allows the translation of any required RMSD into a coefficient distance with the ratio of 1:3.5. Experience shows that a coefficient distance of under 3 gives visually almost identical shapes and a distance below 5 corresponds to similar shapes.

    Furthermore the strong correlation between the coefficient distances and the surface RMSD values confirms that a weighting of the expansion coefficients is not required. The standard coefficients are already able to sufficiently distinguish dissimilar from similar shapes.
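    The metric and the rule of thumb above can be written down as a short Python sketch. The function names are hypothetical, and the coefficient vectors are assumed to be plain sequences of real numbers of equal length; only the Euclidean metric, the 1:3.5 ratio and the thresholds of 3 and 5 come from the text.

```python
import math

# Ratio between surface RMSD and coefficient distance reported in the text (1:3.5).
COEFF_DISTANCE_PER_RMSD = 3.5

def coefficient_distance(coeffs_a, coeffs_b):
    """Standard Euclidean distance between two coefficient vectors of equal length."""
    return math.dist(coeffs_a, coeffs_b)

def approximate_surface_rmsd(distance):
    """Translate a coefficient distance into an approximate surface RMSD."""
    return distance / COEFF_DISTANCE_PER_RMSD

def similarity_label(distance):
    """Rule of thumb from the text; no label is stated there for distances of 5 or more."""
    if distance < 3.0:
        return "almost identical"
    if distance < 5.0:
        return "similar"
    return "beyond the stated thresholds"
```

    For example, a coefficient distance of 3.6 translates to a surface RMSD of roughly 1.0, and falls in the "similar" band.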

    Shape variation in the data set

    In order to investigate the questions from the Introduction, shape coefficients to the order lmax=14 were calculated for all 100 ligands and cleft models from the data set and compared to each other using the standard Euclidean metric.

    Ligand conformations

    It should be kept in mind that any recognition process of binding pockets or ligand shapes has to deal with conformational variance of both the protein (and therefore also the binding pocket) and the ligand. In a non-homologous data set containing unrelated proteins that may have evolved different strategies for binding the same ligand, one can expect different conformations and therefore different shapes for every flexible ligand. A robust shape-based classification of such a data set is therefore likely to be difficult. However, a working shape descriptor should be able to pick up conformational similarities for the same ligands and differences between different ligands.

    The average shape similarity of all identical ligands in our data set is 3.6 coefficient distance, which corresponds to a surface RMSD of about 1 Å (see Table 2A). As such the shape variation for individual ligands is low but mainly related to the flexibility of the ligand molecules. Four of the nine ligand sets (glucose, phosphate, steroid, AMP) can be considered as rigid molecules with an average distance of less than 3 (surface RMSD

    Figure 2. Reconstructed shape of the order lmax=14 for cleft models from ATP, NAD, heme and FAD. Associated ligands are shown as well and PQS-Ids are provided in parentheses. The reconstructed shapes are visualised as a mesh and coloured according to the cleft models: CM, Conserved cleft region Model; IM, protein–ligand Interacting region cleft Model; LM, Ligand region cleft Model.


    Table 1. Data set of 100 binding pockets spread over nine ligand sets

    No | Ligand set | PQS Id | Chain Id | Protein | EC code | CATH code | Ligand | Ligand chain Id | Ligand residue number | Ligand altern loc
    1 | AMP | 12as | A | Asparagine synthetase | 6.3.1.1 | 3.30.930.10 | AMP | X | 2 | _
    2 | AMP | 1amu_1 | A | Gramicidin synthetase | 5.1.1.11 | 2.30.38.10, 3.40.50.980 | AMP | A | 551 | _
    3 | AMP | 1c0a | A | Aspartyl t-RNA synthetase | 6.1.1.12 | 3.30.1360.30 | AMP | E | 800 | _
    4 | AMP | 1ct9_1 | A | Asparagine synthetase | 6.3.5.4 | 3.40.50.620 | AMP | A | 1100 | _
    5 | AMP | 1jp4 | A | Bisphosphate nucleotidase | 3.1.3.7 | 3.40.190.80 | AMP | B | 601 | _
    6 | AMP | 1kht | B | Adenylate kinase | 2.7.4.3 | 3.40.50.300 | AMP | D | 2193 | _
    7 | AMP | 1qb8 | A | Adenine phosphoribosyltransferase | 2.4.2.7 | 3.40.50.2020 | AMP | C | 300 | _
    8 | AMP | 1tb7 | B | cAMP-specific-cyclic phosphodiesterase | 3.1.4.17 | 1.10.1300.10 | AMP | C | 401 | _
    9 | AMP | 8gpb | A | Glycogen phosphorylase | 2.4.1.1 | 3.40.50.2000 | AMP | B | 930 | _
    10 | ATP | 1a0i | _ | ATP-dependent DNA ligase | 6.5.1.1 | 3.30.470.30, 3.30.1490.70 | ATP | _ | 1 | _
    11 | ATP | 1a49_1 | A | Pyruvate kinase | 2.7.1.40 | 3.20.20.60 | ATP | A | 535 | _
    12 | ATP | 1ayl | A | Phosphoenolpyruvate carboxykinase | 4.1.1.49 | 2.170.8.10, 3.90.228.20 | ATP | A | 541 | _
    13 | ATP | 1b8a | A | Aspartyl-tRNA synthetase | 6.1.1.12 | 3.30.930.10 | ATP | C | 500 | _
    14 | ATP | 1dv2 | A | Biotin carboxylase | 6.3.4.14 | 3.30.470.20, 3.30.1490.20 | ATP | C | 1000 | _
    15 | ATP | 1dy3 | A | Pyrophosphokinase | 2.7.6.3 | 3.30.70.560 | ATP | A | 200 | _
    16 | ATP | 1e2q | A | Thymidylate kinase | 2.7.4.9 | 3.40.50.300 | ATP | A | 302 | _
    17 | ATP | 1e8x | A | Phosphatidylinositol kinase | 2.7.1.153 | 1.10.1070.11, 3.30.1010.10 | ATP | A | 2000 | _
    18 | ATP | 1esq | A | Hydroxyethylthiazole kinase | 2.7.1.50 | 3.40.1190.20 | ATP | D | 300 | _
    19 | ATP | 1gn8 | B | Phosphopantetheine adenylyltransferase | 2.7.7.3 | 3.40.50.620 | ATP | B | 600 | _
    20 | ATP | 1kvk | A | Mevalonate kinase | 2.7.1.36 | 3.30.230.10 | ATP | C | 535 | _
    21 | ATP | 1o9t | A | Adenosylmethionine synthetase | 2.5.1.6 | 3.30.300.10 | ATP | B | 1397 | _
    22 | ATP | 1rdq | E | cAMP-dependent protein kinase | 2.7.1.37 | 1.10.510.10, 3.30.200.20 | ATP | A | 600 | B
    23 | ATP | 1tid | A | Anti-sigma F factor | 2.7.1.37 | 3.30.565.10 | ATP | E | 200 | _
    24 | FAD | 1cqx | A | Flavohemoprotein | 1.14.12.17 | 2.40.30.10, 3.40.50.80 | FAD | A | 405 | _
    25 | FAD | 1e8g | B | Vanillyl-alcohol oxidase | 1.1.3.38 | 3.30.43.10, 3.30.465.20 | FAD | B | 600 | _
    26 | FAD | 1evi | B | D-amino acid oxidase | 1.4.3.3 | 3.30.9.10, 3.40.50.720 | FAD | C | 353 | _
    27 | FAD | 1h69_1 | A | NAD(P)H dehydrogenase | 1.6.99.2 | 3.40.50.360 | FAD | A | 1274 | _
    28 | FAD | 1hsk | A | Acetylenolpyruvoylglucosamine reductase | 1.1.1.158 | 3.30.43.10, 3.30.465.10 | FAD | D | 401 | _
    29 | FAD | 1jqi | A | Short chain acyl-CoA dehydrogenase | 1.3.99.2 | 1.20.140.10, 2.40.110.10 | FAD | E | 399 | _
    30 | FAD | 1jr8 | B | Oxidoreductase | 1.8.3.? | 1.20.120.310 | FAD | C | 334 | _
    31 | FAD | 1k87 | A | Proline dehydrogenase | 1.5.99.8 | 3.20.20.220 | FAD | C | 2001 | _
    32 | FAD | 1pox | A | Pyruvate oxidase mutant | 1.2.3.3 | 3.40.50.1220, 3.40.50.970 | FAD | A | 612 | _
    33 | FAD | 3grs | A | Glutathione reductase | 1.8.1.7 | 3.50.50.60 | FAD | A | 479 | _
    34 | FMN | 1dnl | A | Pyridoxine-phosphate oxidase | 1.4.3.5 | 2.30.110.10 | FMN | C | 250 | _
    35 | FMN | 1f5v | A | Oxidoreductase | 1.?.?.? | 3.40.109.10 | FMN | C | 360 | _
    36 | FMN | 1ja1_1 | A | NADPH-cytochrome reductase | 1.6.2.4 | 3.40.50.360 | FMN | A | 1751 | _
    37 | FMN | 1mvl | A | Lyase | 4.1.1.36 | 3.40.50.1950 | FMN | D | 1001 | _
    38 | FMN | 1p4c | A | Mandelate dehydrogenase | 1.1.3.15 | 3.20.20.70 | FMN | E | 490 | _
    39 | FMN | 1p4m | A | Transferase | 2.7.1.26 | 2.40.30.30 | FMN | B | 401 | _
    40 | Glucose | 1bdg | A | Hexokinase | 2.7.1.1 | 3.30.420.40, 3.40.367.20 | GLC | A | 501 | _
    41 | Glucose | 1cq1 | A | Quinoprotein glucose dehydrogenase | 1.1.5.2 | 2.120.10.30 | GLC | C | 3 | _
    42 | Glucose | 1k1w | A | Transferase | 2.4.1.25 | 3.20.20.?, 1.20.?.?, 2.70.98.? | GLC | C | 653 | _
    43 | Glucose | 1nf5_2 | C | Transferase | ?.?.?.? | 1.10.530.10, 3.90.550.10 | GLC | D | 527 | _
    44 | Glucose | 2gbp | _ | Periplasmic binding protein | ?.?.?.? | 3.40.50.2300 | GLC | _ | 310 | _
    45 | Heme | 1d0c | A | Endothelial nitric oxide synthase | 1.14.13.39 | 3.90.340.10 | HEM | A | 500 | _
    46 | Heme | 1d7c | A | Cellobiose dehydrogenase | 1.1.99.18 | 2.60.40.1210 | HEM | A | 401 | _
    47 | Heme | 1dk0 | A | Heme-binding protein | ?.?.?.? | 3.30.1500.10 | HEM | A | 200 | _
    48 | Heme | 1eqg | A | Prostaglandin synthase | 1.14.99.1 | 1.10.640.10 | HEM | A | 601 | _
    49 | Heme | 1ew0 | A | Transferase | 2.7.3.? | 3.30.450.20 | HEM | A | 501 | _
    50 | Heme | 1gwe | A | Catalase | 1.11.1.6 | 2.40.180.10 | HEM | A | 504 | _


    8, apart from the flexible FAD, NAD and partially heme and ATP ligands where the distributions overlap in Figure 8.

    Additionally, for every ligand in the data set, one can order all the other ligands by their coefficient distance from it. The rank of the first ligand, belonging to the same ligand set in each case, is plotted in the histogram in Figure 9. The green bars show that in 89% of the cases the closest ligand belongs to the same ligand set. Repeating the same procedure for the Interact Cleft Model (red bars) reveals that the percentage at the first rank drops to 44%. When each Interact Cleft

    Table 1 (continued)

    No | Ligand set | PQS Id | Chain Id | Protein | EC code | CATH code | Ligand | Ligand chain Id | Ligand residue number | Ligand altern loc
    51 | Heme | 1iqc_1 | A | Heme peroxidase | 1.11.1.5 | 1.10.760.10 | HEM | A | 401 | _
    52 | Heme | 1naz | E | Oxygen transport | ?.?.?.? | 1.10.490.10 | HEM | E | 200 | _
    53 | Heme | 1np4 | B | Nitrophorin | ?.?.?.? | 2.40.128.20 | HEM | B | 185 | _
    54 | Heme | 1po5 | A | Cytochrome | 1.14.14.1 | 1.10.630.10 | HEM | A | 500 | _
    55 | Heme | 1pp9 | C | Oxidoreductase | ?.?.?.? | 1.20.810.10 | HEM | C | 501 | _
    56 | Heme | 1qhu | A | Binding protein hemopexin | ?.?.?.? | 2.110.10.10 | HEM | A | 500 | _
    57 | Heme | 1qla | C | Oxidoreductase | ?.?.?.? | 1.20.950.10 | HEM | G | 1 | _
    58 | Heme | 1qpa | B | Lignin peroxidase | 1.11.1.14 | 1.10.420.10, 1.10.520.10 | HEM | B | 350 | _
    59 | Heme | 1sox | A | Sulfite oxidase | 1.8.3.1 | 3.10.120.10 | HEM | A | 502 | _
    60 | Heme | 2cpo | _ | Oxidoreductase | 1.11.1.10 | 1.10.489.10 | HEM | _ | 396 | _
    61 | NAD | 1ej2 | B | Nicotinamide adenylyltransferase | 2.7.7.1 | 3.40.50.620 | NAD | H | 1339 | _
    62 | NAD | 1hex | A | Isopropylmalate dehydrogenase | 1.1.1.85 | 3.40.718.10 | NAD | A | 400 | A
    63 | NAD | 1ib0 | A | NADH-cytochrome reductase | 1.6.2.2 | 3.40.50.80 | NAD | B | 1994 | _
    64 | NAD | 1jq5 | A | Glycerol dehydrogenase | 1.1.1.6 | 1.20.1090.10, 3.40.50.1970 | NAD | I | 401 | _
    65 | NAD | 1mew | A | Monophosphate dehydrogenase | 1.1.1.205 | 3.20.20.70 | NAD | E | 987 | _
    66 | NAD | 1mi3_1 | A | Oxidoreductase | 1.1.1.21 | 3.20.20.100 | NAD | A | 1350 | _
    67 | NAD | 1o04_1 | A | Aldehyde dehydrogenase | 1.2.1.3 | 3.40.309.10, 3.40.605.10 | NAD | A | 6501 | _
    68 | NAD | 1og3 | A | T-cell ADP-ribosyltransferase | 2.4.2.31 | 2.30.100.10 | NAD | A | 1227 | _
    69 | NAD | 1qax | A | Methylglutaryl-coenzyme reductase | 1.1.1.88 | 3.30.70.420, 3.90.770.10 | NAD | G | 1001 | _
    70 | NAD | 1rlz | A | Deoxyhypusine synthase | 2.5.1.46 | 3.40.910.10 | NAD | H | 700 | _
    71 | NAD | 1s7g | B | NAD-dependent deacetylase | 3.5.1.? | 3.40.50.1220 | NAD | F | 701 | _
    72 | NAD | 1t2d | A | Lactate dehydrogenase | 1.1.1.27 | 3.40.50.720, 3.90.110.10 | NAD | E | 316 | _
    73 | NAD | 1tox_1 | A | Diphtheria toxin | 2.4.2.36 | 3.90.175.10 | NAD | A | 536 | _
    74 | NAD | 2a5f | B | Protein transport | 2.4.2.36 | 3.90.210.10 | NAD | C | 1536 | _
    75 | NAD | 2npx | A | NADH peroxidase | 1.11.1.1 | 3.50.50.60 | NAD | A | 818 | _
    76 | Phosphate | 1a6q | _ | Phosphatase | 3.1.3.16 | 3.60.40.10 | PO4 | _ | 701 | _
    77 | Phosphate | 1b8o | C | Purine nucleoside phosphorylase | 2.4.2.1 | 3.40.50.1580 | PO4 | F | 599 | _
    78 | Phosphate | 1brw | A | Pyrimidine nucleoside phosphorylase | 2.4.2.2 | 3.40.1030.10 | PO4 | C | 2001 | _
    79 | Phosphate | 1cqj_1 | B | Succinyl-CoA synthetase | 6.2.1.5 | 3.30.1490.20 | PO4 | B | 904 | _
    80 | Phosphate | 1d1q | B | Tyrosine phosphatase | 3.1.3.48 | 3.40.50.270 | PO4 | C | 402 | _
    81 | Phosphate | 1dak | A | Dethiobiotin synthetase | 6.3.3.3 | 3.40.50.300 | PO4 | C | 803 | _
    82 | Phosphate | 1e9g | A | Inorganic pyrophosphatase | 3.6.1.1 | 3.90.80.10 | PO4 | A | 3001 | A
    83 | Phosphate | 1ejd | C | Enolpyruvyltransferase | 2.5.1.7 | 3.65.10.10 | PO4 | F | 2431 | _
    84 | Phosphate | 1euc | A | Succinyl-CoA synthetase | 6.2.1.4 | 3.40.50.261 | PO4 | C | 224 | _
    85 | Phosphate | 1ew2 | A | Phosphatase | 3.1.3.1 | 3.40.720.10 | PO4 | C | 1005 | _
    86 | Phosphate | 1fbt | B | Bisphosphatase | 3.1.3.46 | 3.40.50.1240 | PO4 | C | 100 | _
    87 | Phosphate | 1gyp | A | Glyceraldehyde-phosphate dehydrogenase | 1.2.1.12 | 3.30.360.10 | PO4 | A | 359 | _
    88 | Phosphate | 1h6l | A | Phytase | 3.1.3.8 | 2.120.10.20 | PO4 | A | 501 | _
    89 | Phosphate | 1ho5_1 | B | Nucleotidase | 3.1.3.5 | 3.60.21.20 | PO4 | B | 2603 | _
    90 | Phosphate | 1l5w | A | Maltodextrin phosphorylase | 2.4.1.1 | 3.40.50.2000 | PO4 | D | 998 | _
    91 | Phosphate | 1l7m_1 | A | Phosphoserine phosphatase | 3.1.3.3 | 3.40.50.1000 | PO4 | A | 720 | _
    92 | Phosphate | 1lby | A | Bisphosphatase | 3.1.3.25 | 3.30.540.10, 3.40.190.80 | PO4 | C | 293 | _
    93 | Phosphate | 1lyv | A | Protein-tyrosine phosphatase | 3.1.3.48 | 3.90.190.10 | PO4 | B | 1000 | _
    94 | Phosphate | 1qf5 | A | Adenylosuccinate synthetase | 6.3.4.4 | 3.40.440.10 | PO4 | C | 2 | _
    95 | Phosphate | 1tco | A | Serine-threonine phosphatase | 3.1.3.16 | 3.60.21.10 | PO4 | D | 507 | _
    96 | Steroid | 1e3r | B | Isomerase | 5.3.3.1 | 3.10.450.50 | AND | B | 801 | _
    97 | Steroid | 1fds | A | Hydroxysteroid-dehydrogenase | 1.1.1.62 | 3.40.50.720 | EST | A | 350 | _
    98 | Steroid | 1j99 | A | Alcohol sulfotransferase | 2.8.2.2 | 3.40.50.300 | AND | B | 401 | A
    99 | Steroid | 1lhu | A | Sex hormone-binding globulin | ?.?.?.? | 2.60.120.200 | EST | G | 301 | _
    100 | Steroid | 1qkt | A | Estradiol receptor | ?.?.?.? | 1.10.565.10 | EST | C | 600 | _

    _ is a placeholder for unlabelled chains and alternative locations. ? is a placeholder for unavailable information.


    Model is compared to all ligand molecules (orange bars) the percentage at the first rank drops to 27%, with more than half of the first true hits being beyond the rank order of 10.
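    The ranking procedure behind this histogram can be sketched in a few lines of Python. This is an illustration only: the function name, the nested-list distance matrix and the toy data are assumptions, not the code or data used in the study.

```python
def first_true_hit_ranks(dist, labels):
    """For each item, rank all other items by distance and return the rank
    (starting at 1) of the first one that carries the same ligand-set label.
    None is returned when no other item shares the label."""
    n = len(labels)
    ranks = []
    for i in range(n):
        # All other items, nearest first.
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        rank = next((r for r, j in enumerate(others, start=1)
                     if labels[j] == labels[i]), None)
        ranks.append(rank)
    return ranks

# Toy symmetric distance matrix for four shapes (illustrative values only).
toy_labels = ["ATP", "ATP", "NAD", "NAD"]
toy_dist = [
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 0.0, 2.5, 3.5],
    [2.0, 2.5, 0.0, 4.0],
    [3.0, 3.5, 4.0, 0.0],
]
toy_ranks = first_true_hit_ranks(toy_dist, toy_labels)
fraction_at_rank_one = sum(1 for r in toy_ranks if r == 1) / len(toy_ranks)
```

    In the toy data both ATP shapes find each other at rank 1, whereas each NAD shape finds its partner only at rank 3, so half of the first true hits sit at the first rank.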

    Detailed examination of the crystallographic structures of the proteins shows that a perfect fit of the ligand into its binding site is never achieved. Not every ligand atom makes contact with the protein.

    Figure 3. Various approximations of the shapes (black coloured mesh) for NAD and FMN with different degrees oftermination in the spherical harmonics series expansion. Reconstruction errors are provided corresponding to RMSDvalues between the ligand shape and the reconstructed shape. (NAD was extracted from PQS structure 1t2d; FMN wasextracted from PQS structure 1f5v).


    shape incorporated) and zeroth order coefficients (only size incorporated), respectively. From the AUC values it can be observed that shape plays the main role in the cleft versus ligand comparison and vice versa. For the clefts versus clefts and ligands versus ligands comparison size seems to outweigh the performance of shape alone. This is not remarkable, as the ligands in the data set are almost all distinguishable by size. However, except for the cleft versus cleft comparison of the Interact Cleft Models, it is remarkable how little the performance differs when using only shape for the classification.
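    The separation of size from shape can be illustrated with a minimal Python sketch. Because the expansion is linear, scaling a shape uniformly scales every coefficient by the same factor, so dividing a coefficient vector by its Euclidean norm removes the overall size. This particular normalisation, the ordering of the vector with the zeroth order coefficient first, and the function names are assumptions made for illustration; the normalisation actually used is the one defined in Methods.

```python
import math

def size_only(coeffs):
    """Zeroth order coefficient, assumed to be stored first in the vector."""
    return coeffs[0]

def shape_only(coeffs):
    """Coefficient vector scaled to unit Euclidean norm (an assumed normalisation)."""
    norm = math.sqrt(sum(c * c for c in coeffs))
    if norm == 0.0:
        raise ValueError("cannot normalise an all-zero coefficient vector")
    return [c / norm for c in coeffs]

# Two toy coefficient vectors describing the same shape at different sizes.
small = [2.0, 0.5, -0.3, 0.1]
large = [3.0 * c for c in small]
shape_difference = math.dist(shape_only(small), shape_only(large))
```

    The two toy vectors differ in their zeroth order coefficient but become indistinguishable after normalisation, which is the behaviour a shape-only comparison relies on.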

    In fact, the size difference between binding pockets and ligands accounts for the failure of the shape comparison method to match binding pockets to their ligands as described in the previous subsection. With the normalisation, the size is excluded and a successful matching solely on shape becomes possible. As a result, the AUC value for the ligand versus cleft comparison rises to a maximum of 0.83 (Table 3B). Interestingly, the cleft versus ligand comparison still gives relatively low AUC values, which is caused mainly by the FAD and NAD ligand sets. The average coefficient distances using normalised coefficients for FAD and NAD binding pockets are smaller than for their ligands, due to imperfect complementarity.
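    For readers less familiar with AUC values, the following Python sketch shows one standard way to compute the area under a ROC curve from distances: it equals the fraction of (match, non-match) pairs in which the match has the smaller distance, with ties counted as one half. How the comparisons in this study were grouped and averaged is described in Methods; the function name and the toy inputs here are assumptions.

```python
def auc_from_distances(distances, is_match):
    """Area under the ROC curve when small distances should indicate a match.
    Computed as the fraction of (match, non-match) pairs ranked correctly."""
    matches = [d for d, m in zip(distances, is_match) if m]
    others = [d for d, m in zip(distances, is_match) if not m]
    if not matches or not others:
        raise ValueError("need at least one match and one non-match")
    wins = 0.0
    for d_match in matches:
        for d_other in others:
            if d_match < d_other:
                wins += 1.0
            elif d_match == d_other:
                wins += 0.5
    return wins / (len(matches) * len(others))

# Toy example: one of the two true matches is further away than a non-match.
toy_auc = auc_from_distances([1.0, 3.0, 2.0, 4.0], [True, True, False, False])
```

    An AUC of 1.0 means every match is closer than every non-match, and 0.5 corresponds to a random ordering.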

    Performance of cleft models

    The poor performance of the Conserved Cleft Model is mainly caused by enzymes in our data set that have at least two binding pockets next to each other (one for the cofactor and one for the substrate). As both binding pockets are important for the function both will be highly conserved. Thus, reducing the SURFNET27 spheres via conservation still results in a larger merged cleft model, consisting of the cofactor and substrate binding pocket. This is a common problem and at least 27 ligands in our data set are known to be cofactors for which a combined binding pocket was obtained. Another issue is the divergence of substrates in some large protein families like the SDR protein family,28 where the binding site is not more conserved than the rest of the protein. In these and similar cases the Conserved Cleft Model contains only a portion of the binding pocket (see Conserved Cleft Model of NAD binding pocket in Figure 2). It is also important to note the number of protein homologues used to calculate the sequence conservation and their sequence similarity. Few sequence homologues will result in an unreliable conservation score and therefore in an unreliable binding pocket prediction.

    Limitations and problems

    Binding pocket prediction

    The main obstacle for our approach is the binding pocket prediction step and the related accuracy of the cleft model. A number of approaches exist (see Introduction, Previous studies), but generally an accurate solution remains unavailable.

    Other problems involve some general characteristics of protein structures. For example loop regions are often missing in crystallographic structures due to their flexibility, making it difficult to predict the binding pocket for those built up partially by loops. Nine protein structures in our data set feature missing loops close to binding sites. Furthermore, many protein structures are solved as part of functional assessment experiments, where functionally relevant amino acids are mutated to study their effects on the protein structure and function. Mutations are often performed on ligand interacting residues, resulting in a slightly different binding pocket shape. Such mutations are found in 26 protein structures in our data set. Other more technical problems involve the accuracy of X-ray structure coordinates. The median value of the estimated standard deviation for atoms in crystallographic structures is about 0.28 Å.29 Neither SURFNET nor HBPLUS account for this uncertainty in their algorithms, which leads to missing SURFNET spheres in some of the cleft models.

    Partially bound ligands

    Some ligands are bound only partially inside a binding pocket with their other end protruding into the solvent, such as the NAD and the heme group in Figure 11. As the spherical harmonic functions work globally on the whole shape they are not well suited for local shape matching. Finding the correct ligand in such cases will not succeed. However, if the partially bound state is a common picture for the entire protein family, a cleft versus cleft comparison could help to find a homologous family member.

    Table 2. Statistics on the coefficient distances ordered by their average

    Ligand set (set size) | Avg. coeff. dist. | Std dev. coeff. dist. | Min. coeff. dist. | Max. coeff. dist.

    A. Statistics for ligand molecules
    GLC (5) | 1.2 | 0.2 | 0.6 | 1.5
    PO4 (20) | 1.2 | 0.2 | 0.4 | 1.9
    Steroids (5) | 1.5 | 1.0 | 0.2 | 2.4
    AMP (9) | 2.4 | 0.5 | 1.1 | 3.9
    Heme (16) | 3.3 | 0.6 | 1.6 | 5.6
    FMN (6) | 3.8 | 0.6 | 2.3 | 4.6
    ATP (14) | 4.3 | 0.7 | 1.4 | 6.2
    NAD (15) | 6.8 | 0.9 | 4.5 | 9.8
    FAD (10) | 7.1 | 1.0 | 3.8 | 9.4
    Total (100) | 3.6 | 1.9 | 0.2 | 9.8

    B. Statistics for protein–ligand interacting reduced cleft models
    PO4 (20) | 4.6 | 0.9 | 2.2 | 8.2
    Steroid (5) | 5.4 | 0.8 | 4.1 | 6.3
    GLC (5) | 5.6 | 1.4 | 3.6 | 7.6
    AMP (9) | 6.1 | 0.8 | 4.5 | 8.3
    Heme (16) | 6.5 | 1.0 | 3.8 | 8.9
    FMN (6) | 7.1 | 0.6 | 5.5 | 8.2
    ATP (14) | 7.4 | 0.7 | 5.2 | 10.1
    FAD (10) | 8.8 | 0.3 | 6.6 | 11.9
    NAD (15) | 9.0 | 1.8 | 6.2 | 13.7
    Total (100) | 6.6 | 1.7 | 2.2 | 13.7

    Surface RMSD values can be obtained by dividing the coefficient distances by 3.5 (see correlation in Figure 5).


    Star-like shapes and rotational variance

    There are some minor problems related to properties of the spherical harmonics. These functions are suitable for describing the global surface of star-like shapes. But binding pockets and ligands are not always star-like in shape. In cases where the ray from the centre of gravity to the surface penetrates the surface more than once, the outermost surface point was used to approximate the global shape. This can bring some loss of shape information but should not change the matching results significantly.
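    The outermost-point rule can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the surface is given as a discrete list of points, each point is assigned to the sampling direction it is best aligned with, and the function name and toy data are hypothetical.

```python
import math

def outermost_radii(surface_points, directions, centre=(0.0, 0.0, 0.0)):
    """For each unit sampling direction, keep the largest distance from the centre
    among the surface points assigned to it. Directions without points stay at 0."""
    radii = [0.0] * len(directions)
    for p in surface_points:
        v = [p[k] - centre[k] for k in range(3)]
        r = math.sqrt(sum(c * c for c in v))
        if r == 0.0:
            continue  # a point at the centre has no direction
        # Assign the point to the direction with the largest dot product.
        best = max(range(len(directions)),
                   key=lambda j: sum(v[k] * directions[j][k] for k in range(3)))
        radii[best] = max(radii[best], r)
    return radii

# Toy non-star-like case: the ray along +x meets the surface at r = 1 and r = 3.
toy_points = [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
toy_directions = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
toy_radii = outermost_radii(toy_points, toy_directions)
```

    Along +x only the outer crossing at r = 3 survives, which is the loss of shape information referred to above.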

    Furthermore, the coefficient vectors of our approach are not rotationally invariant. Although obtaining coefficient vectors for all four axis-flip combinations solved the flipping problem, it is still possible that a rotationally invariant shape descriptor might improve the results.
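    One way to read "all four axis-flip combinations" is as the four sign changes of the coordinate axes that keep the frame right-handed. The Python sketch below enumerates them and takes the smallest descriptor distance over the four orientations. This reading, the function names and the toy centroid descriptor are assumptions for illustration, not the procedure from Methods.

```python
import math
from itertools import product

def proper_axis_flips():
    """Sign patterns for (x, y, z) whose product is +1, i.e. right-handed flips."""
    return [s for s in product((1, -1), repeat=3) if s[0] * s[1] * s[2] == 1]

def flip(coords, signs):
    """Apply a sign pattern to every coordinate triple."""
    return [tuple(c * s for c, s in zip(p, signs)) for p in coords]

def min_distance_over_flips(coords_a, coords_b, descriptor):
    """Smallest Euclidean descriptor distance over the four flips of coords_b."""
    target = descriptor(coords_a)
    return min(math.dist(target, descriptor(flip(coords_b, s)))
               for s in proper_axis_flips())

def centroid(coords):
    """Toy descriptor standing in for a coefficient vector (illustration only)."""
    n = len(coords)
    return tuple(sum(p[k] for p in coords) / n for k in range(3))

shape_a = [(1.0, 2.0, 3.0), (2.0, 0.0, 1.0)]
shape_b = flip(shape_a, (1, -1, -1))  # same shape, two axes flipped
best = min_distance_over_flips(shape_a, shape_b, centroid)
```

    Without trying the flips the two toy shapes would appear different, while the minimum over the four orientations recovers a distance of zero.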

    Single property descriptor

    The molecular recognition of a ligand is induced by physicochemical properties in addition to the shape, such as electrostatic potential and hydrophobicity. Including such features in the cleft models and the ligands might improve our results. As the electrostatic potentials are a solution to Laplace's equation in the absence of electric charges, the implementation of the electrostatic potential algorithm in our method should be straightforward (spherical harmonic functions are a solution to the angular part of Laplace's equation).

    Table 3. Average area under receiver operator curves for different comparisons with different cleft models

    Cleft model | Cleft versus cleft | Cleft versus ligand mol. | Ligand mol. versus cleft | Ligand mol. versus ligand mol.

    A. Comparison with standard shape coefficients incorporating size and shape
    Conserved | 0.53 | 0.54 | 0.52 | 0.92
    Interact | 0.77 | 0.63 | 0.56 |
    Ligand | 0.85 | 0.69 | 0.59 |

    B. Comparison with normalised shape coefficients corresponding to shape only
    Conserved | 0.52 | 0.52 | 0.55 | 0.87
    Interact | 0.64 | 0.64 | 0.73 |
    Ligand | 0.74 | 0.68 | 0.83 |

    C. Comparison with the size of the shapes, which corresponds to the zeroth order in the spherical harmonics expansion
    Conserved | 0.53 | 0.51 | 0.51 | 0.94
    Interact | 0.73 | 0.51 | 0.51 |
    Ligand | 0.76 | 0.52 | 0.51 |

    Different cleft models in the rows are related to comparison combinations between cleft model and ligand molecules in the columns.

    Figure 6. Matrices of all-against-all coefficient distances visualising the shape (dis)similarity between (a) ligand molecules and (b) protein–ligand interacting region cleft model shapes. The coefficient distances are coloured from green to orange to yellow reflecting low, intermediate and high coefficient distances. Coefficient distances higher than 10 are left out (white). The ligand sets are separated by a grid and labelled on the left and bottom of each matrix.

    Figure 7. Diversity of binding pocket shapes shown for five examples of AMP, ATP, and NAD. The binding pockets at the top are manually chosen. The other binding pockets are the most different ones to the manually chosen top binding pockets. Binding pockets correspond to protein–ligand interacting region cleft model and are oriented according to the adenine ring of their bound ligand and represented by a spherical harmonic reconstruction of the order lmax=14. PQS-Ids of associated protein structures are provided below each binding pocket.

    Conclusions

    Here, a fast and efficient spherical harmonics shape descriptor was employed to compare binding pocket and ligand shapes. It was shown that the shape descriptor is able to reflect the conformational state of the ligands allowing correct classification of rigid ligands, but poor classification of highly flexible ligands.

    In addition it was shown that the assumption about proteins binding similar ligands having similar geometrical properties is only partially true. As expected the similarity is closely related to the flexibility of the ligand molecules. The binding pockets are observed to be more variable in their shapes than their bound ligand molecules with a difference in their average coefficient distances of 3.0, which corresponds to 0.9 Å surface RMSD. This difference in shape variation between the cleft models and ligand molecules shows that shape complementarity in general is not sufficient to drive molecular recognition alone and requires additional physicochemical properties.

    Furthermore we observed a buffer zone between ligand and ligand interacting protein atoms, which is partially occupied by water molecules so that on average binding pockets tend to be three times larger than their bound ligand molecule.

    The normalisation procedure of the standard spherical harmonic coefficients enabled the investigation of the contribution of shape and size to the classification performance. Shape alone outperforms the contribution of size alone in the classification, but size does surprisingly well when comparing clefts to clefts and ligands to ligands. However the molecular sizes of the ligand sets in this study were almost all distinguishable, which would not be the case if all metabolites were considered.

    The relationship between classification performance and accuracy of the cleft models points towards the need for a good binding pocket model. The random classification of the conserved cleft regions proved that residue conservation does not provide sufficiently accurate binding pocket models and cannot be used for function prediction. However the global shape descriptor combined with the Interact Cleft Model is an elegant descriptive method for comparing binding pocket shapes in protein families.

    In this context a detailed analysis of the changes of binding pocket shapes in protein families during evolution is in progress. Additionally, improvements to the method will be implemented that will allow the analysis of binding pockets not just on shape alone but also including electrostatic potential and hydrophobicity.

    Methods

    Our approach and implementation of the spherical harmonics expansion offers a number of advantages over existing methods. A detailed description of these methods can be found in Morris et al.1 and Morris24 and will not be repeated here. Instead we explain the various approaches for the determination of binding pockets and emphasize their relevance.

    Figure 8. Distribution of the coefficient distances for each ligand set. Green and red bars show the relative occurrence of the coefficient distances for ligand molecules and protein-ligand interacting region cleft models, respectively.

    The procedure for identifying, describing and comparing binding pocket shapes can be divided into five general parts. The main steps of the algorithm are: (1) identification of a binding site cleft; (2) reduction of cleft volume to where binding occurs; (3) transformation to a standard frame of reference; (4) spherical harmonic expansion of shape; (5) coefficient comparison between two shapes to quantify similarity.

    Ligand shapes are modelled using only steps (3)-(5) above. The clefts in a protein's surface are computed using SURFNET,27 which detects protein cavities by inserting spheres of a certain range of sizes between protein atoms. The clefts are identified as distinct clusters of overlapping spheres and reduced in size (see Cleft Reduction, below). For comparison of cleft and ligand shapes, it is necessary for the modelled shapes to be in the same orientation and coordinate frame of reference. Previous approaches used the rotational properties of the spherical harmonic functions to rotate the shapes in all orientations until the optimal superimposition was found.21,22 The rotation is achieved by using a Wigner rotation matrix on the coefficients and calculating the smallest distance between the respective coefficient vectors, e.g. using a genetic algorithm.23 However this is computer-intensive and unsuitable for database scanning. To speed up the scan we implemented a pre-orientation with three transformation operations on the cleft model as described by Morris et al.1 The first translates the cleft model so that its centre of gravity is placed at the origin of the coordinate system. The next step involves a rotation of the cleft in terms of its moments of inertia as a gross shape characteristic. To this end, the cleft model is rotated so that its moment of inertia tensor becomes diagonal with maximal values in x, followed by y, followed by z. However the symmetry of the tensor cannot distinguish between objects at 0° and 180° rotation on the x-, y- and z-axes. To tackle this axis-flip problem, shape coefficients were calculated for four non-redundant combinations of flips, resulting in four coefficient vectors for each cleft model.
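    The pre-orientation described above can be sketched as follows. This is a minimal illustration with NumPy, not the authors' implementation: the function names are hypothetical, the shape is treated as a set of unit-mass points, and the ordering "maximal values in x, followed by y, followed by z" is assumed to mean that the largest principal moment of inertia ends up on the x-axis.

```python
import numpy as np

def inertia_tensor(points):
    """Inertia tensor of unit-mass points about the origin."""
    r2 = np.sum(points ** 2)
    return np.eye(3) * r2 - points.T @ points

def pre_orient(points):
    """Centre the points and rotate them into their principal-axis frame."""
    centred = points - points.mean(axis=0)        # centre of gravity -> origin
    moments, axes = np.linalg.eigh(inertia_tensor(centred))
    axes = axes[:, np.argsort(moments)[::-1]]     # assumed order: largest moment on x
    if np.linalg.det(axes) < 0:                   # keep a proper rotation, no mirror
        axes[:, 2] *= -1.0
    return centred @ axes

# Identity plus 180-degree rotations about x, y and z: four non-redundant flips.
FLIPS = [np.diag(signs) for signs in
         ([1.0, 1.0, 1.0], [1.0, -1.0, -1.0], [-1.0, 1.0, -1.0], [-1.0, -1.0, 1.0])]

def flip_variants(oriented):
    """Return the four axis-flip copies of an already oriented point set."""
    return [oriented @ flip for flip in FLIPS]
```

    Each of the four variants leaves the inertia tensor diagonal, which is why the tensor alone cannot select one of them and a coefficient vector is kept for every variant.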

    The spherical harmonics expansion approach was then applied to describe the shape of the transformed cleft. To this end, the surface of the cleft, which was built up by the outer SURFNET spheres, was considered as a single valued (star-like) surface. In cases of a non-star-like shape only the outermost surface points were taken into account. Furthermore a sphere of radius 1.6 Å was rolled over the surface closing up any gaps between molecular atoms. The resulting star-like shape was considered as a spherical function on a unit sphere, with angle pairs (θ, φ) reflecting the domain values of the function extracted from spherical t-designs. Spherical t-designs are sample points uniformly spread over a unit sphere, which provide an optimal integration layout for spherical harmonic functions up to a certain order.24 Using the sample points of the spherical 21-design, the surface function was approximated by an expansion with real spherical harmonic functions:

    $$ f(\theta,\phi) \approx \sum_{l=0}^{l_{\max}} \sum_{m=-l}^{l} c_{lm}\,\mathrm{Re}[Y_{lm}(\theta,\phi)] \qquad (1) $$

    where f(θ, φ) is the surface function, lmax is 14, Re[Ylm(θ, φ)] are the real parts of the spherical harmonic functions of indices l and m, and clm are the associated coefficients.

    The coefficients are computed from the functional scalar product between the function and the spherical harmonics for each combination of l and m. See Morris et al.1 and Morris24 for further details.
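    To illustrate how such coefficients arise from a scalar product, the following sketch expands a star-like surface r(θ, φ) in orthonormal real spherical harmonics up to lmax = 14. Several points are assumptions rather than the published method: the integration grid is a Gauss-Legendre product grid instead of the spherical 21-design, the normalisation and sign conventions of the basis functions are one common choice, and all function names are hypothetical.

```python
import math
import numpy as np

def assoc_legendre(l, m, x):
    """Associated Legendre function P_l^m(x), 0 <= m <= l, by upward recurrence."""
    pmm = np.ones_like(x)
    if m > 0:
        somx2 = np.sqrt((1.0 - x) * (1.0 + x))
        fact = 1.0
        for _ in range(m):
            pmm = -pmm * fact * somx2
            fact += 2.0
    if l == m:
        return pmm
    pmmp1 = x * (2 * m + 1) * pmm
    for ll in range(m + 2, l + 1):
        pmm, pmmp1 = pmmp1, (x * (2 * ll - 1) * pmmp1 - (ll + m - 1) * pmm) / (ll - m)
    return pmmp1

def real_sph_harm(l, m, theta, phi):
    """Orthonormal real spherical harmonic (cosine part for m > 0, sine for m < 0)."""
    am = abs(m)
    norm = math.sqrt((2 * l + 1) / (4 * math.pi)
                     * math.factorial(l - am) / math.factorial(l + am))
    plm = assoc_legendre(l, am, np.cos(theta))
    if m == 0:
        return norm * plm
    trig = np.cos(am * phi) if m > 0 else np.sin(am * phi)
    return math.sqrt(2.0) * norm * plm * trig

def expand(surface, lmax=14, n_theta=32, n_phi=64):
    """Coefficients c_lm as scalar products <surface, Y_lm> on a quadrature grid."""
    x, w = np.polynomial.legendre.leggauss(n_theta)   # nodes and weights in cos(theta)
    theta = np.arccos(x)
    phi = 2.0 * np.pi * np.arange(n_phi) / n_phi
    T, P = np.meshgrid(theta, phi, indexing="ij")
    weights = w[:, None] * (2.0 * np.pi / n_phi)
    r = surface(T, P)
    return np.array([np.sum(weights * r * real_sph_harm(l, m, T, P))
                     for l in range(lmax + 1) for m in range(-l, l + 1)])
```

    With lmax = 14 the expansion yields (14 + 1)² = 225 coefficients. For a sphere of radius R only the zeroth-order coefficient is non-zero, which is consistent with the zeroth order carrying the size information.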

    Figure 9. Histogram of the relative occurrences of the positions that hold the most similar ligand set member. The positions are determined by ordering each coefficient distance list and recording the position of the first hit that belongs to the same ligand set when the list is walked down from best to worst. Green coloured bars illustrate the histogram for ligand molecules; red coloured bars show the histogram for protein-ligand interacting region cleft models and orange coloured bars display the histogram for the Interact Cleft Model versus ligand molecule comparison.


    The orthonormal property of the spherical harmonic polynomials guarantees a unique breakdown of the surface function into spherical harmonic functions in the expansion process and provides unique coefficients for any shape. The uniqueness enables the usage of the coefficients directly for comparison against other binding pocket or ligand coefficients. The standard Euclidean distance metric was used for the comparison:

    $$ d(\mathbf{a},\mathbf{b}) = \sqrt{\sum_{i=1}^{n} (b_i - a_i)^2} \qquad (2) $$

    with a and b as two coefficient vectors having n = (lmax + 1)² coefficients. The rapid similarity calculation is the main strength of our method and enables fast retrieval of related binding pockets or ligands from a coefficients database.
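    A minimal sketch of this comparison step is given below. The function names and the dictionary-based "database" are hypothetical and serve only to show how the Euclidean distance between two coefficient vectors can be used to rank stored shapes against a query.

```python
import numpy as np

L_MAX = 14
N_COEFFS = (L_MAX + 1) ** 2   # 225 coefficients per shape

def coefficient_distance(a, b):
    """Euclidean distance between two coefficient vectors of equal length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((b - a) ** 2)))

def rank_database(query, database):
    """Return (distance, name) pairs sorted from most to least similar.

    `database` is assumed to map a shape name to its coefficient vector.
    """
    return sorted((coefficient_distance(query, vec), name)
                  for name, vec in database.items())
```

    Because each comparison is a single vector operation over a few hundred numbers, scanning a database of precomputed coefficients stays cheap.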

    Cleft reduction

    The following sub-sections are devoted to the three cleft models we employ in our analysis. Cleft models from SURFNET are often large and reach out beyond the region of the ligand location. Such clefts are neither convenient for binding pocket comparison nor for ligand docking. Therefore the SURFNET clefts need to be reduced in size. We employed three procedures to reduce these initial clefts to more accurately approximate the actual shape of the ligand.

    All three cleft-reduction procedures provide a valid series of approximations to the real binding pocket depending on the available information. See Glaser et al.30 for a recent discussion on this topic of binding pocket localisation methods using SURFNET. An overview of the reconstructed cleft shapes of all three cleft models together with their ligands is given for four binding pockets in Figure 2. Note the decreasing accuracy of the cleft models compared to their ligand shapes when walked down from top to bottom.

    Figure 10. Not every ligand atom contacts a protein atom and thus leaves space between parts of the ligand and the protein. The space is partially occupied by crystallographic observable water molecules. An example is shown on the AMP binding pocket of PQS entry 1qb8, with the reconstructed pocket shape shown as a black coloured mesh, the ligand shown in varicolour and the oxygen atoms of the water molecules shown as green coloured spheres.

    All distances in the reduction steps are calculated between the surfaces of van der Waals atoms and SURFNET spheres.

    Conserved cleft model

    This approximation can be applied without any prior information about the protein-ligand interactions. It uses the approach described in Glaser et al.30 to map phylogenetic residue conservation scores from the ConSurf-HSSP database31 onto the protein structure (Figure 1(a) top). The reduction of the cleft is performed by picking out only SURFNET spheres within 0.3 Å of a highly conserved residue atom (ConSurf scores higher than eight) (Figure 1(a) bottom). Evolutionarily conserved residues are often functionally important and highlight potential ligand-binding residues when they are found within clefts.32 This approach is most suitable for structures solved by structural genomics groups, where the function of the protein is unknown and no biologically relevant ligand is bound to the protein in the solved structure.

    Interact cleft model

    Another approximation of the binding pocket is obtained by keeping all SURFNET spheres within 0.3 Å of protein atoms interacting with the bound ligand (Figure 1(b)). The residues were identified using HBPLUS.33 HBPLUS calculates hydrogen bonds between a protein and a ligand by looking at the distances and angles between potential hydrogen bond donors and acceptors. It also lists pairs of atoms that are in non-bonded contact. The Interact Cleft Model is of practical importance, since methods already exist for predicting ligand-interacting residues34-36 and pharmaceutical companies as well as academia usually have high quality binding site information. Thus this approach can be used when there is no ligand bound in the available structure but the user has information about the ligand-binding protein residues.

    Ligand cleft model

    This somewhat artificial case represents the scenario of well-characterised binding pockets. Only SURFNET spheres that make contact to any ligand atom are retained (Figure 1(c)). This results in a very accurate, although not perfect, approximation of the ligand shape and produces a binding pocket that is obviously well suited for matching to its bound ligand. Any predictive approach will perform worse than this in getting the right shape, so this procedure corresponds to the best case scenario and provides an estimate of the upper bounds on what performance can be expected for binding pockets with the current method.
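    All three reduction procedures share the same geometric filter: a SURFNET sphere is kept if its surface lies close enough to the surface of at least one atom from a chosen atom set. The sketch below shows that filter under stated assumptions: the array layout and function names are hypothetical, the 0.3 Å cutoff is the one used for the conserved and interact models, and "making contact" in the ligand model is assumed here to mean a surface-to-surface distance of zero or less.

```python
import numpy as np

def surface_distances(sphere_xyz, sphere_r, atom_xyz, atom_r):
    """Surface-to-surface distances between every sphere and every atom (in Å)."""
    centre_d = np.linalg.norm(sphere_xyz[:, None, :] - atom_xyz[None, :, :], axis=2)
    return centre_d - sphere_r[:, None] - atom_r[None, :]

def reduce_cleft(sphere_xyz, sphere_r, atom_xyz, atom_r, cutoff=0.3):
    """Boolean mask of spheres within `cutoff` Å of at least one selected atom.

    atom_xyz / atom_r hold the selected atoms and their van der Waals radii:
    highly conserved residue atoms (conserved model), ligand-interacting protein
    atoms (interact model) or ligand atoms with cutoff=0.0 (ligand model, assumed).
    """
    gaps = surface_distances(sphere_xyz, sphere_r, atom_xyz, atom_r)
    return (gaps <= cutoff).any(axis=1)
```

    Only the choice of atom set and cutoff differs between the models, so the accuracy of each model depends entirely on how well the selected atoms outline the true binding region.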

    The data set

    The following criteria were applied to derive the data set for this manuscript: (1) structural domains should be taken only from X-ray protein quaternary structures (PQS37) that are thought to represent protein structures in their true biological unit. (2) The binding sites in a ligand set should not be evolutionarily related but descend from different CATH H-levels (homologous superfamily). In cases of homology only the binding pocket with the highest X-ray resolution should be retained. (3) Partial, modified or incorrectly labelled ligands should be discarded, by comparing each ligand against the reference compound for that ligand's three-letter residue identifier in MSDchem38 (MSD-ligand-chemistry database). (4) Binding sites of only cognate ligands should be considered. For enzymes a biologically relevant ligand was defined as one involved in the protein's enzymatic reaction as given by the protein's EC number.39 For non-enzymes the protein's Uniprot entry40 was checked for any information about its cognate ligand(s). (5) Each ligand set should have at least five members. (The number 5 was chosen arbitrarily but was deemed sufficient for assessing the success rate of assigning binding pockets to their ligand sets).
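    Criteria (2) and (5) amount to a simple redundancy filter, which can be sketched as below. The record fields ('ligand', 'cath_h', 'resolution') and the function name are hypothetical, and "highest X-ray resolution" is taken to mean the numerically smallest resolution value in Å. Criteria (1), (3) and (4) require database lookups and manual checks and are not modelled here.

```python
from collections import defaultdict

def select_binding_sites(sites, min_set_size=5):
    """Keep one binding site per (ligand, CATH homologous superfamily) pair.

    `sites` is an iterable of dicts with the assumed keys 'ligand', 'cath_h'
    and 'resolution' (in Å, smaller is better). Ligand sets with fewer than
    `min_set_size` remaining members are dropped.
    """
    best = {}
    for site in sites:
        key = (site["ligand"], site["cath_h"])
        if key not in best or site["resolution"] < best[key]["resolution"]:
            best[key] = site

    by_ligand = defaultdict(list)
    for (ligand, _), site in best.items():
        by_ligand[ligand].append(site)

    return {ligand: members for ligand, members in by_ligand.items()
            if len(members) >= min_set_size}
```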

    The intersection between two data sets in the literature, first Stockwell and Thornton25 and second Nobeli et al.41 assisted the derivation of our final data set. The first data set ensured the achievement of the first three rules, whereas the second data set verified the fourth rule. In addition to both data sets, manual searches were carried out to overcome two deficiencies of both data sets; namely that the first data set was missing all binding sites having no CATH domain assignments, while the second data set was missing all non-enzyme structures.

    Binding sites without a CATH assignment were tackled by querying the Cathedral server42 with the protein structure holding the binding site of interest and assigning to it the CATH code of the closest fold. The second deficiency was approached by scanning the appropriate three-letter residue identifier (e.g. FMN) and the ligand name (e.g. flavin) in the protein's Uniprot entries. All hits were manually checked to avoid false positives.

    The final data set comprises 100 binding pockets distributed over nine ligand sets (Table 1).

    Classification and data analysis

    The following approaches were used to visualise and analyse our results.

    Distance matrices

    These matrices contain all-against-all pairwise coefficient distances and give a good visual overview of the achieved classification power. A perfect classification in these plots is indicated by green squares for each ligand set in the diagonal from bottom left to top right. In the remaining rows and columns the coefficient distances should range from low to high as indicated by orange to yellow to white colouring, depending on the similarity level to the ligand set of interest. By rule of thumb coefficient distances smaller than 3 are considered as identical shapes and coloured in dark green. Coefficient distances between 3 and 5 are treated as similar, distances between 5 and 8 are regarded as dissimilar and distances between 8 and 10 are considered as highly dissimilar shapes. Coefficient distances above 10 are not coloured at all and left in white. A grid on the matrices separates different ligand sets.

    Table 4. Statistics on the volume of protein-ligand interacting region cleft models for all ligand sets ordered by their volumes

    Ligand set     Avg. lig.   Avg.    Std dev.   Min.    Max.
    (set size)     vol.        vol.    vol.       vol.    vol.

    PO4 (20)       73          445     118        168     797
    GLC (5)        156         590     203        416     912
    Steroids (5)   280         903     171        607     1144
    AMP (9)        290         1097    156        774     1579
    ATP (14)       400         1416    186        822     1723
    FMN (6)        402         1443    265        1196    1879
    Heme (16)      610         1507    209        1031    2030
    NAD (15)       562         1809    305        486     2340
    FAD (10)       688         2099    224        1580    2507
    Total (100)    395         1279    515        168     2507

    Second column provides the average volume of the respective ligand. Volumes are given in Å³.

    Figure 11. Two examples for a partially bound ligand to its protein. The protein is represented as a transparent surface coloured in grey, the reconstructed binding pocket shape (protein-ligand interacting region cleft model) is shown as a red coloured mesh and the ligands are varicoloured. The top example shows an NAD (PQS-Id: 1hex) from which only the front part is surrounded by amino acids. The bottom example displays a heme group (PQS-Id: 1sox), which protrudes to the solvent with its two carboxyl-groups.
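    The all-against-all distance matrix and the rule-of-thumb similarity classes used for the colouring can be sketched as follows. The function names are hypothetical, and the handling of values that fall exactly on a threshold (3, 5, 8 or 10) is an assumption, since the text does not specify it.

```python
import numpy as np

def distance_matrix(coeffs):
    """All-against-all Euclidean distances between the rows of `coeffs`."""
    coeffs = np.asarray(coeffs, dtype=float)
    diff = coeffs[:, None, :] - coeffs[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))

def similarity_class(distance):
    """Map a coefficient distance to its rule-of-thumb similarity class."""
    if distance < 3:
        return "identical"
    if distance < 5:
        return "similar"
    if distance < 8:
        return "dissimilar"
    if distance <= 10:
        return "highly dissimilar"
    return "uncoloured"        # distances above 10 are left white
```

    Sorting the rows and columns by ligand set before plotting gives the block structure along the diagonal that indicates a good classification.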

    Area under receiver operating characteristics curves

    ROC curves and especially the AUC are well suited for the numerical comparison of classification approaches. ROC curves are used to measure the ranking quality of classifiers, by plotting the fraction of recovered true hits against the fraction of false hits when the ordered list of classifications (in this work coefficient distances) is walked down from best to worst. A diagonal ROC curve leading from the bottom left to the top right indicates a random classification where for each true hit a false hit is recovered (i.e. equal to flipping a coin). Such a curve corresponds to an AUC of 0.5. Conversely, the best case is a horizontal line at the top of the plot, where all true hits are recovered before a false hit is obtained. Such a curve corresponds to an AUC of 1.0. Hence, AUC values closer to 1.0 indicate classifiers that are more able to distinguish true from false positives.
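    The AUC for one ranked list can be computed directly by walking the list from best to worst, as in the following sketch. The function name is hypothetical, a "true hit" is assumed to be an entry from the same ligand set as the query, and tied distances are simply taken in their sorted order rather than treated specially.

```python
def roc_auc(distances, is_true_hit):
    """Area under the ROC curve for one list ranked by increasing distance."""
    ranked = sorted(zip(distances, is_true_hit), key=lambda pair: pair[0])
    n_true = sum(1 for _, hit in ranked if hit)
    n_false = len(ranked) - n_true
    if n_true == 0 or n_false == 0:
        raise ValueError("need at least one true hit and one false hit")

    true_seen = 0
    area = 0.0
    for _, hit in ranked:
        if hit:
            true_seen += 1                 # curve moves up
        else:
            area += true_seen / n_true     # curve moves right at current height
    return area / n_false
```

    For example, a list ranked as true, false, true, false gives an AUC of 0.75, between the random value of 0.5 and the perfect value of 1.0.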

    Acknowledgements

    This work was supported by the BioSapiens Network of Excellence, through the European Commission within its FP6 Programme, under the thematic area 'Life Sciences, Genomics and Biotechnology for Health,' contract number LHSG-CT-2003-503265. All figures containing molecules were rendered using PyMOL (W.L. DeLano, http://pymol.sourceforge.net/).

    References

    1. Morris, R. J., Najmanovich, R. J., Kahraman, A. & Thornton, J. M. (2005). Real spherical harmonic expansion coefficients as 3D shape descriptors for protein binding pocket and ligand comparisons. Bioinformatics, 21, 2347-2355.

    2. Laskowski, R. A., Luscombe, N. M., Swindells, M. B. & Thornton, J. M. (1996). Protein clefts in molecular recognition and function. Protein Sci. 5, 2438-2452.

    3. Bergner, A. & Günther, J. (2004). Structural aspects of binding site similarity: a 3D upgrade for chemogenomics. Chemogenomics Drug Discov. 22, 97-135.

    4. Campbell, S. J., Gold, N. D., Jackson, R. M. & Westhead, D. R. (2003). Ligand binding: functional site location, similarity and docking. Curr. Opin. Struct. Biol. 13, 389-395.

    5. Gold, N. D. & Jackson, R. M. (2006). Fold independent structural comparisons of protein-ligand binding sites for exploring functional relationships. J. Mol. Biol. 355, 1112-1124.

    6. Rosen, M., Lin, S. L., Wolfson, H. & Nussinov, R. (1998). Molecular shape comparisons in searches for active sites and functional similarity. Protein Eng. Des. Select. 11, 263-277.

    7. Schmitt, S., Kuhn, D. & Klebe, G. (2002). A new method to detect related function among proteins independent of sequence and fold homology. J. Mol. Biol. 323, 387-406.

    8. Whisstock, J. C. & Lesk, A. M. (2003). Prediction of protein function from protein sequence and structure. Quart. Rev. Biophys. 36, 307-340.

    9. Kinoshita, K. & Nakamura, H. (2003). Identification of protein biochemical functions by similarity search using the molecular surface database eF-site. Protein Sci. 12, 1589-1595.

    10. Wallace, A. C., Laskowski, R. A. & Thornton, J. M. (1996). Derivation of 3D coordinate templates for searching structural databases: application to Ser-His-Asp catalytic triads in the serine proteinases and lipases. Protein Sci. 5, 1001-1013.

    11. Porter, C. T., Bartlett, G. J. & Thornton, J. M. (2004). The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucl. Acids Res. 32, 129.

    12. Kleywegt, G. J. (1999). Recognition of spatial motifs in protein structures. J. Mol. Biol. 285, 1887-1897.

    13. Barker, J. A. & Thornton, J. M. (2003). An algorithm for constraint-based structural template matching: application to 3D templates with statistical analysis. Bioinformatics, 19, 1644-1649.

    14. Besl, P. J. & McKay, N. D. (1992). A method for registration of 3-D shapes. IEEE Trans. PAMI, 14, 239-256.

    15. Nussinov, R. & Wolfson, H. J. (1991). Efficient detection of three-dimensional motifs in biological macromolecules by computer vision techniques. Proc. Natl Acad. Sci. USA, 88, 10495-10499.

    16. Binkowski, T. A., Adamian, L. & Liang, J. (2003). Inferring functional relationships of proteins from local sequence and spatial surface patterns. J. Mol. Biol. 332, 505-526.

    17. Binkowski, T. A., Naghibzadeg, S. & Liang, J. (2003). CASTp: computed atlas of surface topography of proteins. Nucl. Acids Res. 31, 3352-3355.

    18. Ritchie, D. W. (1998). Parametric protein shape recognition. PhD thesis, University of Aberdeen, UK.

    19. Ritchie, D. W. (2005). High order analytic translation matrix elements for real space six-dimensional polar Fourier correlations. J. Appl. Crystallog. 38, 808-818.

    20. Ritchie, D. W. (2003). Evaluation of protein docking predictions using Hex 3.1 in CAPRI rounds 1 and 2. Proteins: Struct. Funct. Genet. 52, 98-106.

    21. Ritchie, D. W. & Kemp, G. J. L. (2000). Protein docking using spherical polar Fourier correlations. Proteins: Struct. Funct. Genet. 39, 178-194.

    22. Ritchie, D. W. & Kemp, G. J. L. (1999). Fast computation, rotation, and comparison of low resolution spherical harmonic molecular surfaces. J. Comp. Chem. 20, 383-395.

    23. Cai, W., Shao, X. & Maigret, B. (2002). Protein-ligand recognition using spherical harmonic molecular surfaces: towards a fast efficient filter for large virtual throughput screening. J. Mol. Graph. Model. 20, 313-328.

    24. Morris, R. J. (2006). An evaluation of spherical designs for molecular-like surfaces. J. Mol. Graph. Model. 24, 356-361.

    25. Stockwell, G. R. & Thornton, J. M. (2006). Conformational diversity of ligands bound to proteins. J. Mol. Biol. 356, 928-944.

    26. Kraut, D. A., Sigala, P. A., Pybus, B., Liu, C. W., Ringe, D., Petsko, G. A. & Herschlag, D. (2006). Testing electrostatic complementarity in enzyme catalysis: hydrogen bonding in the ketosteroid isomerase oxyanion hole. PLoS Biol. 4, 501-519.

    27. Laskowski, R. A. (1995). SURFNET: a program for visualizing molecular surfaces, cavities and intermolecular interactions. J. Mol. Graph. 13, 323-330.

    28. Oppermann, U., Filling, C., Hult, M., Shafqat, N., Wu, X., Lindh, M. et al. (2003). Short-chain dehydrogenases/reductases (SDR): the 2002 update. Chemico-Biol. Interact. 143, 247-253.

    29. Laskowski, R. A. (2003). Structural quality assurance. Methods Biochem. Anal. 44, 273-303.

    30. Glaser, F., Morris, R. J., Najmanovich, R. J., Laskowski, R. A. & Thornton, J. M. (2006). A method for localizing ligand binding pockets in protein structures. Proteins: Struct. Funct. Genet. 62, 479-488.

    31. Glaser, F., Pupko, T., Paz, I., Bell, R. E., Bechor-Shental, D., Martz, E. & Ben-Tal, N. (2003). ConSurf: identification of functional regions in proteins by surface-mapping of phylogenetic information. Bioinformatics, 19, 163-164.

    32. Lichtarge, O., Bourne, H. R. & Cohen, F. E. (1996). An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol. 257, 342-358.

    33. McDonald, I. K. & Thornton, J. M. (1994). Satisfying hydrogen bonding potential in proteins. J. Mol. Biol. 238, 777-793.

    34. Bate, P. & Warwicker, J. (2004). Enzyme/non-enzyme discrimination and prediction of enzyme active site location using charge-based methods. J. Mol. Biol. 340, 263-276.

    35. Laurie, A. T. R. & Jackson, R. M. (2005). Q-SiteFinder: an energy-based method for the prediction of protein-ligand binding sites. Bioinformatics, 21, 1908-1916.

    36. Ondrechen, M. J., Clifton, J. G. & Ringe, D. (2001). THEMATICS: a simple computational predictor of enzyme function from structure. Proc. Natl Acad. Sci. USA, 98, 12473-12478.

    37. Henrick, K. & Thornton, J. M. (1998). PQS: a protein quaternary structure file server. Trends Biochem. Sci. 23, 358-361.

    38. Golovin, A., Oldfield, T. J., Tate, J. G., Velankar, S., Barton, G. J., Boutselakis, H. et al. (2004). E-MSD: an integrated data resource for bioinformatics. Nucl. Acids Res. 32(Database issue), 211-216.

    39. Bairoch, A. (2000). The ENZYME database in 2000. Nucl. Acids Res. 28, 304-305.

    40. Apweiler, R., Bairoch, A., Wu, C. H., Barker, W. C., Boeckmann, B., Ferro, S. et al. (2004). UniProt: the Universal Protein Knowledgebase. Nucl. Acids Res. 32, 115.

    41. Nobeli, I., Ponstingl, H., Krissinel, E. B. & Thornton, J. M. (2003). A structure-based anatomy of the E. coli metabolome. J. Mol. Biol. 334, 697-719.

    42. Pearl, F. M. G., Bennett, C. F., Bray, J. E., Harrison, A. P., Martin, N., Shepherd, A. et al. (2003). The CATH database: an extended protein family resource for structural and functional genomics. Nucl. Acids Res. 31, 452-455.

    Edited by F. E. Cohen

    (Received 21 November 2006; received in revised form 15 January 2007; accepted 31 January 2007)

    Available online 7 February 2007
