Tutorial

Using the search bar to retrieve binding sites

Basic usage

Any text entered in the search bar is treated as a free-text search. For example, the query kinase cancer retrieves the binding site entries that contain both (sub-)terms kinase and cancer anywhere in their descriptions. The white space character between the terms is interpreted as the logical AND operator.
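
The AND semantics of a free-text query can be illustrated with a minimal sketch. The function name free_text_match is hypothetical, and case-insensitive substring matching is an assumption; the server's actual matching rules may differ.

```python
def free_text_match(query, description):
    """Return True if every whitespace-separated term occurs in the description.

    Hypothetical helper: case-insensitive substring matching is assumed.
    """
    terms = query.lower().split()
    text = description.lower()
    # White space acts as logical AND: all terms must be present.
    return all(term in text for term in terms)


print(free_text_match("kinase cancer", "Tyrosine kinase associated with breast cancer"))
print(free_text_match("kinase cancer", "Serine protease"))
```

In this sketch an empty query matches every entry, which mirrors leaving the search bar empty to list all binding sites.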

Advanced usage

Alternatively, a query can be made specific by providing the column name in which to perform the search, and the value for this column. For example, pdb_id=1got retrieves exactly one protein from the database.

Valid column names are:

  • alphafold_id - AlphaFold DB identifier, e.g., alphafold_id=AF-O75616-F1-model_v2
  • min_beta_factor - AlphaFold's model confidence score from 0 to 100 (highest confidence), e.g., min_beta_factor>90 (to get very high confidence binding sites whose surface is lined with atoms having at least this confidence value)
  • compound_score - Druggability score (Sbsite score); applies only to substrate- and cofactor-competitive binding sites (other binding sites have this value set to -1). To filter binding sites by a minimum Sbsite threshold, use, e.g., compound_score>90 (highly druggable), compound_score>50 (druggable), compound_score>30 (likely druggable), compound_score>15 (possibly druggable), or compound_score>0 (unlikely druggable)
  • uniprot_id - UniProt identifier, e.g., O75616
  • pdb_id - Four-character PDB identifier, e.g., 1all; use null to get models with no equivalent structure in the PDB
  • chain_id - Protein chain identifier, e.g., A, B, ...
  • bsite_rest - Binding site type [small.substrate.competitive or small.cofactor.competitive]
  • bs_id - Binding site number according to our druggability ranking
  • ec - Enzyme Commission number, e.g., 3.4.25.1
  • molecule - Name of the protein, e.g., PROTEASOME SUBUNIT BETA TYPE-3
  • protein_class - Protein class [kinase]
  • disease - Disease relatedness [cancer]
  • cons - Conservation [0, 1] defined as: num. of similar binding sites with this ligand / num. of all sim. binding sites found, e.g., cons>0.9 gets only highly conserved ligands
  • num_occ - Number of occurrences of a ligand in similar binding sites detected, e.g., num_occ>10
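
The compound_score thresholds and the cons ratio listed above can be summarized in a short sketch. The function names are hypothetical, and the handling of scores not covered by the listed thresholds (for example, exactly 0) is an assumption.

```python
def druggability_label(compound_score):
    """Map an Sbsite score to the druggability bands named in the column list."""
    if compound_score == -1:
        # -1 marks binding sites that are not substrate- or cofactor-competitive.
        return "not applicable"
    bands = [
        (90, "highly druggable"),
        (50, "druggable"),
        (30, "likely druggable"),
        (15, "possibly druggable"),
        (0, "unlikely druggable"),
    ]
    for threshold, label in bands:
        if compound_score > threshold:
            return label
    return "unclassified"  # assumption: the source does not define this case


def conservation(num_sites_with_ligand, num_similar_sites):
    """cons = similar binding sites with this ligand / all similar binding sites."""
    return num_sites_with_ligand / num_similar_sites


print(druggability_label(75))
print(conservation(19, 20))
```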

The following examples show how to properly use the query expression language in the search bar to retrieve whatever binding sites you wish from the database:

  • All binding sites: leave empty
  • Primary binding sites (mostly orthosteric): bs_id=1
  • Secondary binding sites (allosteric/orthosteric/other): bs_id>1 (note: our method does not reliably determine if a site is allosteric)
  • Compound binding sites (for substrates, agonists,...): bsite_rest=small.substrate.competitive
  • Cofactor binding sites: bsite_rest=small.cofactor.competitive
  • Binding sites ranked No. 1 and 2: bs_id=[1,2]
  • Binding sites with very high AlphaFold model confidence (all residues of the filtered binding sites are guaranteed to have a confidence score above 90): min_beta_factor>90
  • Highly druggable binding sites: compound_score>90
  • Druggable binding sites - those with Sbsite scores between 51 and 90: compound_score>50 compound_score<91 (note: space character between the terms is interpreted as logical AND)
  • Highly conserved water that was found to bind to a similar residue motif (to the one in query protein) in at least 10 other binding sites in PDB: bsite_rest=water cons>0.9 num_occ>10 (note: these waters can be important for protein function or structural stability)
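
To make the query syntax used in the examples above concrete, the following sketch parses a query into conditions and evaluates them against a single record. It is an illustration under stated assumptions, not the server's implementation: the names parse_query and matches, the record layout (a plain dictionary keyed by column name), and the handling of free-text tokens are all hypothetical, and special values such as null are not handled.

```python
import re

# column, operator (=, > or <), value; tokens are separated by white space
TERM = re.compile(r"^(\w+)(=|>|<)(.+)$")


def parse_query(query):
    """Split a query on white space into (column, operator, value) conditions."""
    conditions = []
    for token in query.split():
        m = TERM.match(token)
        if not m:
            # No operator: treat the token as a free-text term.
            conditions.append(("*", "contains", token))
            continue
        column, op, raw = m.groups()
        if op == "=" and raw.startswith("[") and raw.endswith("]"):
            # List syntax such as bs_id=[1,2]
            conditions.append((column, "in", [v for v in raw[1:-1].split(",") if v]))
        else:
            conditions.append((column, op, raw))
    return conditions


def matches(record, conditions):
    """Return True if the record satisfies all conditions (logical AND)."""
    for column, op, value in conditions:
        if column == "*":
            text = " ".join(str(v) for v in record.values()).lower()
            ok = value.lower() in text
        elif column not in record:
            ok = False
        elif op == "in":
            ok = str(record[column]) in value
        elif op == "=":
            ok = str(record[column]) == value
        else:
            number, bound = float(record[column]), float(value)
            ok = number > bound if op == ">" else number < bound
        if not ok:
            return False
    return True


site = {"bs_id": 2, "compound_score": 75, "bsite_rest": "small.substrate.competitive"}
print(matches(site, parse_query("bs_id=[1,2] compound_score>50 compound_score<91")))
```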

Predicted binding sites

Binding sites were predicted using the ProBiS-Fold method for the whole AlphaFold DB. Here, we showcase a few examples of potential usage, advantages and properties of the predicted binding sites.

Binding site centroids

A binding site grid is generated for each ligand cluster by sampling hexagonal close-packed points spaced 0.2 Å apart that fall within the radius of any of the ligands’ atoms but do not overlap with any of the protein atoms. The grid thus follows the contours of the molecular surface of the biological assembly and the space occupied by the predicted ligands up to 8 Å from the protein surface. Binding site centroids are then calculated as grid points sampled at approximately 3 Å intervals. Each centroid is assigned a radius of about 4 Å, so each binding site is represented by a set of overlapping centroids whose radii closely follow its contours.
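
The grid construction can be sketched as follows. This is a simplified illustration, not the ProBiS-Fold code: atoms are assumed to be given as (x, y, z, radius) tuples, the function names are hypothetical, the 8 Å limit from the protein surface is not modeled, and the greedy centroid selection is one possible way to sample grid points at roughly 3 Å intervals.

```python
import math


def hcp_points(spacing, lo, hi):
    """Yield hexagonal close-packed lattice points covering the box [lo, hi]."""
    r = spacing / 2.0                      # sphere radius; neighbors are 2r apart
    dy = r * math.sqrt(3.0)                # row offset within a layer
    dz = r * 2.0 * math.sqrt(6.0) / 3.0    # offset between layers
    nx = int((hi[0] - lo[0]) / spacing) + 1
    ny = int((hi[1] - lo[1]) / dy) + 1
    nz = int((hi[2] - lo[2]) / dz) + 1
    for k in range(nz + 1):
        for j in range(ny + 1):
            for i in range(nx + 1):
                x = lo[0] + r * (2 * i + (j + k) % 2)
                y = lo[1] + dy * (j + (k % 2) / 3.0)
                z = lo[2] + dz * k
                yield (x, y, z)


def within_any(point, atoms):
    """True if the point lies inside the radius of at least one atom."""
    return any(math.dist(point, a[:3]) <= a[3] for a in atoms)


def binding_site_grid(ligand_atoms, protein_atoms, spacing=0.2):
    """Keep lattice points inside ligand atoms that do not overlap protein atoms."""
    pad = max(a[3] for a in ligand_atoms)
    lo = [min(a[d] for a in ligand_atoms) - pad for d in range(3)]
    hi = [max(a[d] for a in ligand_atoms) + pad for d in range(3)]
    return [p for p in hcp_points(spacing, lo, hi)
            if within_any(p, ligand_atoms) and not within_any(p, protein_atoms)]


def pick_centroids(grid, interval=3.0):
    """Greedily select grid points that are at least `interval` apart (assumption)."""
    centroids = []
    for p in grid:
        if all(math.dist(p, c) >= interval for c in centroids):
            centroids.append(p)
    return centroids


ligand = [(0.0, 0.0, 0.0, 2.0), (3.0, 0.0, 0.0, 2.0)]
protein = [(1.5, 3.0, 0.0, 1.8)]
grid = binding_site_grid(ligand, protein, spacing=0.5)
print(len(grid), len(pick_centroids(grid)))
```

Each selected centroid would then be assigned a radius of about 4 Å, as described above.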

Primary and secondary binding sites

Primary binding sites are those with rank equal to 1 and typically correspond to the main binding site in a protein. Secondary binding sites are those with rank>1 according to our binding site prioritization score. Ligand binding to a secondary site that is also an allosteric site can lead to a conformational change within the orthosteric binding site, thus modulating the protein’s activity.[5,6] As such, secondary sites are important in proteins as they often serve as natural control loops, such as feedback from downstream products of enzymes, while also being crucial in cell signaling. The secondary binding sites in our database can readily be used to identify previously unknown binding locations and subsequently to design drugs acting on previously untargeted binding sites, potentially resulting in drugs with novel and unique scaffolds that still act on the same target protein as the existing drugs that target orthosteric sites.

Predicted ligands

The criterion for assuming that a ligand of a protein can be transposed into a binding site on a query protein is the similarity between the binding site of the originating protein and the binding site of the query protein. Ligands are transposed from similar proteins if they have binding sites that are sufficiently similar to the binding site(s) on the query protein. Sufficiency for transposition is determined separately for each ligand type using a Z-score metric.

Cofactor ligands

Cofactor ligands and cofactor binding sites are identified based on a list of known cofactors, extended with a few more. The coordinate files for the cofactors are obtained from the PubChem database, and all PDB ligands that are very similar to the cofactors in this list are considered cofactors themselves. The cofactors that we consider are the following:

Glycan ligands

This is the list of known monosaccharides that are used to determine if a PDB ligand is a part of a glycan.

Metal ions and conserved waters

Biologically relevant metal ions and conserved water molecules are determined by counting the number of times an ion or water molecule is found in similar binding sites at the same or a similar position, according to the methodology described in the ProBiS H2O approach. Biologically relevant ions are identified based on the candidate ions (see Figure 3 in the accompanying paper) and an additional filter which requires that they belong to clusters of at least 10 members. Ions that do not meet both criteria are considered artifacts and classified as buffer. Similarly, water is labeled as conserved water if it belongs to clusters with >10 members; otherwise it is considered to bind nonspecifically.
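
The cluster-size rules above can be written as a small sketch. The function names and return labels are hypothetical; the thresholds follow the text as written (at least 10 members for ions, more than 10 for water).

```python
def classify_ion(is_candidate_ion, cluster_size):
    """An ion is biologically relevant only if it meets both criteria."""
    if is_candidate_ion and cluster_size >= 10:
        return "biologically relevant"
    return "buffer"  # considered an artifact


def classify_water(cluster_size):
    """Water is conserved if its cluster has more than 10 members."""
    return "conserved" if cluster_size > 10 else "nonspecific"


print(classify_ion(True, 12))
print(classify_water(4))
```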

Prediction of ligands by transposition from similar binding sites

Z-scores are assigned by the ProBiS-ligands approach to each pairwise protein superimposition and measure the local structural similarity of the superimposed protein patches, where higher Z-scores indicate higher structural similarity of the compared binding sites.[4] For compounds, cofactors, glycans, and water molecules the superimpositions with Z-score ≥ 2.5 are used, while for metal ions this threshold is set to ≥ 2.0. Further, three different cases are distinguished for transposition: if a ligand originates from a non-representative protein within the same sequence cluster (Step 1, see our paper) as the query protein chain, then the rotational-translational matrix obtained in Step 2 is applied to the ligand’s coordinates to transpose them into the coordinate frame of the query protein chain; if the ligand originates from a representative protein of another cluster, then the rotational-translational matrix obtained in Step 3 is used; finally, if the ligand is from a non-representative protein from another sequence cluster, then both the corresponding matrices from Step 2 and Step 3 are applied to the ligand’s coordinates to transpose the ligand into the binding site of the query protein.
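
The threshold check and the three transposition cases can be sketched as follows. This is an illustration under stated assumptions rather than the ProBiS-ligands code: the function names are hypothetical, a transform is assumed to be a (rotation, translation) pair applied as x' = R x + t, and the order in which the Step 2 and Step 3 matrices are combined (Step 2 first, then Step 3) is an assumption not stated in the text.

```python
# Z-score thresholds per ligand type, as given in the text
Z_THRESHOLD = {
    "compound": 2.5,
    "cofactor": 2.5,
    "glycan": 2.5,
    "water": 2.5,
    "metal ion": 2.0,
}


def passes_threshold(ligand_type, z_score):
    """True if the superimposition is similar enough to transpose this ligand type."""
    return z_score >= Z_THRESHOLD[ligand_type]


def apply_transform(rotation, translation, coords):
    """Apply x' = R x + t to each (x, y, z) coordinate (assumed convention)."""
    return [
        tuple(sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
              for r in range(3))
        for p in coords
    ]


def transpose_ligand(coords, step2=None, step3=None):
    """Transpose ligand coordinates into the query protein's frame.

    Same cluster, non-representative protein:      pass step2 only.
    Another cluster, representative protein:       pass step3 only.
    Another cluster, non-representative protein:   pass both (order assumed).
    """
    for transform in (step2, step3):
        if transform is not None:
            coords = apply_transform(transform[0], transform[1], coords)
    return coords


rot_z90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]   # 90 degree rotation about z
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
ligand = [(1, 0, 0)]
if passes_threshold("metal ion", 2.1):
    print(transpose_ligand(ligand, step2=(rot_z90, [0, 0, 0]), step3=(identity, [5, 0, 0])))
```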

Nonspecific binders

This is an updated and extended list of non-specific binders given as PDB Chemical IDs based on the list of non-specific binders available here.

12P, 144, 15P, 16D, 16P, 1BO, 1PE, 1PG, 1PS, ACA, ACE, ACN, ACT, ACY, AE3, AE4, AGC, AZI, B3P, B7G, BCN, BE7, BEN, BEQ, BEZ, BGC, BMA, BNG, BOG, BTB, BTC, BU1, BU2, BU3, C10, C15, C8E, CAC, CBM, CBX, CCN, CE1, CIT, CM, CM5, CN, CPS, CRY, CXE, CYN, CYS, D10, DDQ, DHD, DIA, DIO, DMF, DMS, DMU, DMX, DOD, DOX, DPR, DR6, DTT, DXE, DXG, EDO, EEE, EGL, EOH, EPE, ETE, ETF, FCL, FCY, FMT, FRU, GBL, GCD, GLC, GLO, GLY, GOL, GPX, HEZ, HTG, HTO, ICI, ICT, IDT, IOH, IPA, IPH, JEF, LAK, LAT, LBT, LDA, LMT, M2M, MA4, MAN, ME2, MES, MG8, MHA, MLI, MOH, MPD, MPO, MRD, MRY, MTL, MXE, N8E, NDG, NH4, NHE, NO3, O4B, OTE, P15, P33, P3G, P4C, P4G, P6G, PDO, PE3, PE4, PE5, PE7, PE8, PEG, PEU, PG0, PG4, PG5, PG6, PGE, PGF, PGO, PGQ, PGR, PIG, PIN, PO4, POL, SAL, SBT, SCN, SDS, SO4, SOR, SPD, SPK, SPM, SUC, SUL, SYL, TAR, TAU, TBU, TEP, TLA, TMA, TOE, TRE, TRS, TRT, UMQ, UNK, URE, VO4, XPE, XYP, AL, CS, BR, CL, F, IOD, PB, LI, HG, K, RB, AG, NA, SR, YT3, Y1, XE

References

  1. Trott, O.; Olson, A. J. AutoDock Vina: Improving the Speed and Accuracy of Docking with a New Scoring Function, Efficient Optimization, and Multithreading. J. Comput. Chem. 2010, 31 (2), 455–461.
  2. Feinstein, W. P.; Brylinski, M. Calculating an Optimal Box Size for Ligand Docking and Virtual Screening against Experimental and Predicted Binding Pockets. J. Cheminf. 2015, 7 (1), 18.
  3. Huang, Z.; Zhu, L.; Cao, Y.; Wu, G.; Liu, X.; Chen, Y.; Wang, Q.; Shi, T.; Zhao, Y.; Wang, Y.; Li, W.; Li, Y.; Chen, H.; Chen, G.; Zhang, J. ASD: A Comprehensive Database of Allosteric Proteins and Modulators. Nucleic Acids Res. 2011, 39 (suppl_1), D663–D669.
  4. Konc, J.; Česnik, T.; Konc, J. T.; Penca, M.; Janežič, D. ProBiS-Database: Precalculated Binding Site Similarities and Local Pairwise Alignments of PDB Structures. J. Chem. Inf. Model. 2012, 52 (2), 604–612.
  5. Bu, Z.; Callaway, D. J. E. Chapter 5 - Proteins MOVE! Protein Dynamics and Long-Range Allostery in Cell Signaling. In Advances in Protein Chemistry and Structural Biology; Donev, R., Ed.; Protein Structure and Diseases; Academic Press, 2011; Vol. 83, pp 163–221.
  6. Kern, D.; Zuiderweg, E. R. The Role of Dynamics in Allosteric Regulation. Curr. Opin. Struct. Biol. 2003, 13 (6), 748–757.