antibacTR

Alejandro Panjkovich, Isidre Gibert and Xavier Daura
Universitat Autònoma de Barcelona


antibacTR is a computational pipeline designed to aid researchers in the selection of potential drug targets, one of the initial steps in antibacterial drug discovery. The method is based on sequence comparisons and queries to multiple databases (e.g. gene essentiality, virulence factors) to rank proteins according to their potential as antibacterial targets. It covers a priority list of Gram-negative pathogens: Acinetobacter baumannii, Escherichia coli, Helicobacter pylori, Pseudomonas aeruginosa and Stenotrophomonas maltophilia. The dynamic ranking of potential drug targets can easily be executed, customized and accessed by the user through a web interface, which also integrates computational analyses performed in-house that can be visualized on-site. These include three-dimensional modeling of protein structures and prediction of active sites, among other functionally relevant ligand-binding sites. Versatility and ease of use have been emphasized so that this tool may effectively assist microbiologists, medicinal chemists and other researchers working in the field of antibacterial drug discovery. The public web interface for antibacTR is available at `http://bioinf.uab.cat/antibactr'.

Figure 1: antibacTR pipeline


Computational target-ranking pipeline

Typically, one of the first steps in a target-discovery project is to readily select, among thousands of proteins composing the pathogens' proteomes, those with the highest chance of becoming useful therapeutic targets. Following the lines defined by previous studies [White and Kell, 2004], we developed an algorithm to score and rank potential drug targets in pathogenic organisms by evaluating a modular set of criteria that are commonplace in antimicrobial-development efforts [Payne et al., 2007]:

  1. the presence of the protein in different pathogens
  2. evolutionary conservation
  3. essentiality
  4. presence of isoforms and paralogs in the proteome
  5. similarity to human proteins.

We implemented a set of five weighted scores that cover these criteria and defined a scoring function combining them.

The first two concepts were incorporated as two independent scores, measuring the conservation of the protein among Gram-negative organisms and among different strains of the same species, respectively. Conservation among strains is a basic requirement for target consideration. Conservation among Gram-negative species is highly desirable as it enables the development of broad-spectrum solutions and increases economic viability. In addition, well conserved targets will presumably have low tolerance to mutations, decreasing the chance of resistance to emerge by this type of mechanism.

Essential proteins, whose inhibition compromises bacterial viability, are potential antibacterial targets by definition. We implemented a binary score by marking genes known to be essential from previous experimental work [Zhang and Zhang, 2008].

The remaining two scores are given negative weights. If the protein under consideration has isoforms and/or paralogs, the pathogen may readily develop resistance by functional substitution, and the effect of the antibacterial may also be reduced by competitive binding to non-essential forms. We considered similarity to human proteins negative as well, since close human homologs of the target may interact with the drug, giving rise to unwanted side effects.

The scoring and ranking scheme, which partially follows the work of White and Kell [White and Kell, 2004], provides an advantage over static selection or filtering approaches [Sakharkar et al., 2004, Chanumolu et al., 2012]. In our case, if further experimental analysis reveals that a given protein is not suitable as a drug target, work can continue with the next protein in the ranking. Moreover, it would be straightforward to incorporate new criteria into the ranking scheme if needed.

The pipeline to which each proteome of interest was subjected is summarized in Figure 1.

Sequence-based analysis

Currently, the database covers 74 Gram-negative pathogens, including 224 distinct strains. For pathogens distinguished by their prevalence in community and/or nosocomial infections and the incidence of drug-resistant isolates [Boucher et al., 2009, Bereket et al., 2012] (Acinetobacter baumannii, Escherichia coli, Helicobacter pylori, Pseudomonas aeruginosa and Stenotrophomonas maltophilia), we included all available fully sequenced strains (82 strains, which make up a `priority set'). For the rest of the species, we included all strains that were marked as `human pathogens' in HAMAP (142 strains) [Pedruzzi et al., 2013]. This query data set was compared against the human proteome and a reference set of 770 Gram-negative proteomes (494 distinct species), by means of the BLAST program [Altschul et al., 1997] using default parameters. BLAST searches are very fast; however, the resulting E-values depend on the alignment itself and on other parameters such as the size of the database scanned. We needed unbiased similarity scores between proteins matched during the sequence-based search so that results remain valid if the data sets grow further. To this end, we realigned BLAST matches (E-value <= 0.0001) using the Smith-Waterman algorithm and calculated `normalized sequence similarity scores' (NS). NS values were then used to pre-compute toxicity, presence of isoforms or paralogs and the two conservation scores for each protein in the query data set, as described in further detail in the Methods section.
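The selection of BLAST matches for realignment can be sketched as follows. This is a minimal illustration, not the pipeline's own code: it assumes BLAST hits are available in the common 12-column tab-delimited layout, with the E-value in the eleventh column, and the function name `significant_pairs' and the example identifiers are hypothetical.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-4  # threshold stated in the text (E-value <= 0.0001)


def significant_pairs(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return unique (query_id, subject_id) pairs with E-value <= cutoff.

    Assumes 12-column tab-delimited BLAST output where column 11
    (index 10) is the E-value; this layout is an assumption.
    """
    pairs = []
    seen = set()
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        query_id, subject_id, e_value = row[0], row[1], float(row[10])
        if e_value <= cutoff and (query_id, subject_id) not in seen:
            seen.add((query_id, subject_id))
            pairs.append((query_id, subject_id))
    return pairs


# Made-up hits for illustration only.
example = (
    "P1\tH1\t45.0\t200\t110\t2\t1\t200\t5\t204\t1e-30\t120\n"
    "P1\tH2\t25.0\t80\t60\t1\t10\t90\t3\t82\t0.5\t30\n"
    "P2\tH1\t38.0\t150\t93\t0\t1\t150\t1\t150\t1e-4\t75\n"
)
print(significant_pairs(example))
```

Only the pairs returned here would then be passed on to the Smith-Waterman step; the weak P1/H2 match (E-value 0.5) is dropped.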

Interface and access to results

Interactive access to results is available through the web-interface at `http://bioinf.uab.cat/antibactr'.

This interface allows the user to select the organisms and strains of interest, set custom weights for the different scores and then calculate the corresponding ranking. If the user wishes to ignore a specific ranking parameter, a weight of 0 (zero) can be applied. The system has been built in such a way that normalization of scores is performed only among the selected set of strains and parameters. To further facilitate the analysis of results, the user may also limit the number of top-ranked entries that are displayed. Once the ranking procedure is finished (it takes a few seconds), the ranking is printed to the browser. An option is available for downloading the ranking to the local computer in tab-delimited text format, useful for researchers interested in further processing the data. Targets are displayed in ranked order and individual scores are shown for each protein after normalization but prior to weighting. A brief description of the biological function is displayed for each protein and, to facilitate immediate access to full annotation and other relevant data, a link to the related UniProt entry is provided as well [Consortium, 2009]. In cases where the target shows sequence similarity to an already known drug target or virulence factor, the corresponding links are also provided. In addition, specific links with details on predicted active sites and homology models are given. If a homology model is supplied, the user may download model coordinates in PDB format and target-template alignments generated during the modeling process, along with sequence identity, DOPE score and other relevant modeling data [Eswar et al., 2008]. Furthermore, available protein structures can be visualized using Jmol (http://www.jmol.org) along with the results of the pocket analysis previously described [Panjkovich and Daura, 2012].

User query sequences

Besides the ranking of complete proteomes, researchers may want to look at the ranking of a few selected proteins of particular interest. To this end, users can submit their own query sequences in an optional field. These sequences are then compared by means of the BLAST program against our query data set (224 strains). Scoring and ranking proceed as usual, but results are then displayed only for significant hits within our data set. Details of this BLAST search are also available to the user.

Technical descriptions

Normalized sequence-similarity score (NS)

We used the BLAST program [Altschul et al., 1997] with default parameters to scan complete proteomes. Since BLAST E-values may vary depending on the size of the queried database, we aligned all matched pairs and calculated their Smith-Waterman similarity score [Smith and Waterman, 1981]. We ignored alignments with scores lower than 100, as previously described [Aoki and Kanehisa, 2005].

Given that the Smith-Waterman similarity score is related to the size of the alignment, we divided the score by the length of the alignment to obtain a normalized sequence-similarity score (NS). The Smith-Waterman algorithm computes an optimal local alignment, meaning that the NS measure of similarity between two proteins is equivalent to the similarity between their most closely related pair of domains or regions.
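As a minimal sketch of the NS calculation described above, assuming the Smith-Waterman score and the alignment length have already been obtained from an alignment program (the function and constant names are illustrative, not part of antibacTR):

```python
MIN_SW_SCORE = 100  # alignments scoring below this are ignored, per the text


def normalized_similarity(sw_score, alignment_length):
    """Return NS = Smith-Waterman score / alignment length.

    Returns None when the raw score is below MIN_SW_SCORE, mirroring the
    rule that such alignments are ignored.
    """
    if alignment_length <= 0:
        raise ValueError("alignment_length must be positive")
    if sw_score < MIN_SW_SCORE:
        return None
    return sw_score / alignment_length


print(normalized_similarity(450, 150))  # score 450 over 150 aligned positions
print(normalized_similarity(90, 60))    # below the score threshold, ignored
```

Because the score is divided by the alignment length, a short but strongly conserved domain and a long alignment of similar per-position quality receive comparable NS values.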

Essentiality

Experimental information regarding gene essentiality is available for a few organisms at the database of essential genes (DEG) [Zhang and Zhang, 2008]. If a particular strain was not available at DEG, we mapped query proteins to essential genes by using BLAST. For each annotated essential gene in a related strain, we scanned the proteome of interest and marked the best hit as an essential gene. Only E-values of 1e-10 or better were considered acceptable for this task. At the time of this writing, no large-scale essential gene information is available for Stenotrophomonas maltophilia (STRM5 & STRMK).
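The best-hit mapping of essential genes can be sketched as below. The input format (a list of tuples holding an essential-gene identifier, a protein identifier from the proteome of interest and a BLAST E-value) and all names are assumptions made for illustration.

```python
ESSENTIALITY_E_VALUE = 1e-10  # "1e-10 or better", per the text


def map_essential_genes(hits, cutoff=ESSENTIALITY_E_VALUE):
    """Return the set of proteins marked as essential.

    hits: iterable of (essential_gene_id, protein_id, e_value) tuples.
    For each essential gene, only its best hit (lowest E-value) is kept,
    and only if that E-value is at or below the cutoff.
    """
    best = {}
    for gene, protein, e_value in hits:
        if e_value > cutoff:
            continue
        if gene not in best or e_value < best[gene][1]:
            best[gene] = (protein, e_value)
    return {protein for protein, _ in best.values()}


# Made-up identifiers and E-values for illustration only.
hits = [
    ("degA", "p1", 1e-50),
    ("degA", "p2", 1e-20),  # weaker hit for the same essential gene
    ("degB", "p3", 1e-5),   # fails the E-value requirement
    ("degC", "p2", 1e-12),
]
print(sorted(map_essential_genes(hits)))
```

The resulting set corresponds to the binary essentiality score: a protein is either in the set or not.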

Toxicity

An antibacterial drug acting on protein targets that are similar to human proteins may also bind the latter, causing adverse effects and/or toxicity. We estimated the potential toxicity of each putative target as proportional to the largest NS value obtained after pairwise alignment against the whole human proteome.

Isoforms and paralogs

If a given drug target presents multiple isoforms or paralogs (`variants'), the pathogen may readily develop resistance by functional substitution mechanisms. It is also possible that the drug may bind both the target and its variants, thus decreasing the antibiotic effect. To assess this parameter for each potential drug target, we counted the number of variants present in the same proteome. We considered as variants of a protein all similar proteins with an NS value equal to or larger than 2.
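The two negatively weighted raw scores (toxicity and variants) can be illustrated with the short sketch below. It assumes that NS values against human proteins and against the other proteins of the same proteome have already been computed; returning 0.0 when there is no human match is an assumption of this sketch, and the function names are hypothetical.

```python
VARIANT_NS_THRESHOLD = 2.0  # NS cutoff for isoforms/paralogs, per the text


def toxicity_score(ns_vs_human):
    """Largest NS value against the human proteome (0.0 if no match)."""
    return max(ns_vs_human, default=0.0)


def variant_count(ns_vs_same_proteome, threshold=VARIANT_NS_THRESHOLD):
    """Number of proteins in the same proteome with NS >= threshold."""
    return sum(1 for ns in ns_vs_same_proteome if ns >= threshold)


print(toxicity_score([1.2, 3.4, 0.8]))
print(variant_count([2.0, 1.9, 4.5]))
```

Both values are raw scores; they would still be normalized and weighted before entering the final ranking.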

Evolutionary conservation among Gram-negative organisms

We defined a score to estimate the evolutionary conservation of potential targets across Gram-negative (GN) organisms as shown in Equation 1.

$\displaystyle GNC_p = \sum_{i=1}^{i=n-1} \max(NS_p)$ (1)

where $GNC_p$ is the Gram-negative conservation score for protein $p$, computed by adding the highest NS value ($\max(NS_p)$) obtained against each of the different GN species ($i$) in the data set, with $n$ being the total number of GN species.
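Equation 1 can be read as a sum of per-species best matches. The sketch below takes that reading, assuming the input is a mapping from each of the other GN species (the n-1 species besides the protein's own, which is our interpretation of the summation limits) to the NS values obtained for protein p against that species; species without any match contribute nothing. All names and values are illustrative.

```python
def gram_negative_conservation(ns_by_species):
    """Sum, over the other GN species, of the highest NS value for protein p.

    ns_by_species: dict mapping species name -> list of NS values obtained
    for protein p against proteins of that species (may be empty).
    """
    return sum(max(values) for values in ns_by_species.values() if values)


# Made-up NS values for illustration only.
ns_by_species = {
    "species_a": [3.0, 1.5],
    "species_b": [2.5],
    "species_c": [],  # no significant match in this species
}
print(gram_negative_conservation(ns_by_species))
```

Since the sum is not divided by the number of species, the raw score grows with the size of the reference set; it only becomes comparable across runs after the normalization step of the scoring function.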

Conservation among strains

We estimated the evolutionary conservation of proteins among different strains of the same species using the following score:

$\displaystyle SC_p = \frac{\sum_{j=1}^{j=m-1} \max(NS_p)}{m}$ (2)

where $SC_p$ is the strain conservation score for protein $p$, computed by adding the highest NS value ($\max(NS_p)$) obtained against each other strain ($j$) of the selected species in the data set, with $m$ being the total number of distinct strains of the particular species.
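A minimal sketch of Equation 2 follows, under the assumption that the input maps each of the other m-1 strains to the NS values obtained for protein p against that strain. The division by m (rather than m-1) is kept exactly as the equation is written; function and variable names are hypothetical.

```python
def strain_conservation(ns_by_other_strain, total_strains):
    """Sum of per-strain best NS values divided by the number of strains m.

    ns_by_other_strain: dict mapping strain name -> list of NS values for
    protein p against that strain (at most m - 1 entries, may be empty).
    total_strains: m, the number of distinct strains of the species.
    """
    if total_strains <= 0:
        raise ValueError("total_strains must be positive")
    if len(ns_by_other_strain) > total_strains - 1:
        raise ValueError("expected at most m - 1 other strains")
    total = sum(max(v) for v in ns_by_other_strain.values() if v)
    return total / total_strains


# Made-up NS values for a species with m = 4 strains.
others = {"strain_2": [4.0, 2.0], "strain_3": [3.0], "strain_4": []}
print(strain_conservation(others, total_strains=4))
```

A protein missing from one strain (here strain_4) is penalized because that strain adds nothing to the numerator while still counting in the denominator.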

Scoring function and ranking of potential drug targets

Each of the different scores is normalized by the largest value obtained across the selected organisms. Normalized values are then multiplied by 100 to obtain percentages, i.e. final scores range between 0 and 100.

Each independent score has an associated weight, which can be negative or positive. These weights can be set by the user, but default values are provided. A priori negative features of a putative target (i.e. Toxicity and Paralogs) are given a default weight of -1, while positive features (e.g. Evolutionary conservation, Essentiality) have a default weight of 1.

For each protein in the selected data set, normalized scores are multiplied by their respective weights. The final score for each protein is obtained by summing up all weighted scores. Finally, all proteins in the selected data set are ranked according to their final score in terms of drug-target potential.
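The normalization, weighting and ranking steps can be sketched as follows. The score names used as dictionary keys are hypothetical labels for the five scores, the default weights follow the values given above, and treating a score whose maximum is zero as contributing zero is an assumption of this sketch.

```python
DEFAULT_WEIGHTS = {
    "gn_conservation": 1.0,
    "strain_conservation": 1.0,
    "essentiality": 1.0,
    "paralogs": -1.0,
    "toxicity": -1.0,
}


def rank_targets(raw_scores, weights=DEFAULT_WEIGHTS):
    """Return (protein, final_score) pairs sorted from best to worst.

    raw_scores: dict mapping protein id -> dict of raw score values,
    with one entry per key in `weights`.
    """
    # Largest raw value of each score across the selected proteins.
    maxima = {
        name: max(scores[name] for scores in raw_scores.values())
        for name in weights
    }
    final = {}
    for protein, scores in raw_scores.items():
        total = 0.0
        for name, weight in weights.items():
            # Normalize to a 0-100 scale, then apply the weight.
            if maxima[name] > 0:
                normalized = 100.0 * scores[name] / maxima[name]
            else:
                normalized = 0.0
            total += weight * normalized
        final[protein] = total
    return sorted(final.items(), key=lambda item: item[1], reverse=True)


# Made-up raw scores for two proteins.
raw = {
    "protA": {"gn_conservation": 40, "strain_conservation": 3.0,
              "essentiality": 1, "paralogs": 0, "toxicity": 0.5},
    "protB": {"gn_conservation": 20, "strain_conservation": 3.0,
              "essentiality": 0, "paralogs": 2, "toxicity": 2.0},
}
print(rank_targets(raw))
```

Setting a weight to 0 removes the corresponding criterion from the final score without changing the rest of the calculation, which matches the behaviour described for the web interface.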

Comparative-genomics reference data set

Sequence data on Gram-negative (GN) organisms were gathered for a total of 749 fully sequenced GN proteomes covering 472 distinct species. We identified GN bacterial species at `http://bacterialphylogeny.info/bacteria.html' and listed fully sequenced bacterial proteomes from `http://www.uniprot.org/taxonomy' using the query string: `bacteria AND complete:yes'. A total of 749 bacterial strains were common to both listings. We downloaded sequence data from `ftp://ftp.expasy.org/databases/complete_proteomes/fasta/bacteria/'. We also included the human proteome, as downloaded from KEGG [Kanehisa et al., 2008].


Known drug targets and virulence factors

Each proteome of interest was compared by means of the BLAST program [Altschul et al., 1997], with default parameters, against known drug targets available at DrugBank [Wishart et al., 2008] and virulence factors available at VFDB [Yang et al., 2008]. Proteins showing a match with a BLAST E-value <= 1e-2 display a link to the related hits in the output table.


Three-dimensional homology modeling

Researchers evaluating prospective drug targets may benefit from the availability of protein structural data. For the organisms in the priority set, we performed large-scale homology modeling of all protein sequences for which we found valid structural templates, as explained in the Methods section. In total, we generated three-dimensional homology models for 136,141 proteins (covering 47% of the priority set). This number was obtained after discarding models with less than 30% target-template sequence identity or with G-factors below -1.00 [Laskowski et al., 1996]. All models were generated by means of the MODELLER program [Eswar et al., 2008] using default parameters.
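The model-quality filter stated above amounts to two thresholds. A minimal sketch, assuming that the sequence identity (in percent) and the G-factor of each model are already available from the modeling and validation programs (names are illustrative):

```python
MIN_SEQUENCE_IDENTITY = 30.0  # percent, target-template
MIN_G_FACTOR = -1.00


def keep_model(sequence_identity, g_factor):
    """Return True if a homology model passes both quality thresholds."""
    return (sequence_identity >= MIN_SEQUENCE_IDENTITY
            and g_factor >= MIN_G_FACTOR)


# Made-up (identity, G-factor) pairs for illustration only.
models = {"m1": (30.0, -1.0), "m2": (29.9, -0.5), "m3": (55.0, -1.2)}
kept = [name for name, (ident, g) in models.items() if keep_model(ident, g)]
print(kept)
```

Models exactly at a threshold are kept here, since the text discards only those with values below 30% identity or below a G-factor of -1.00.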

To save computational power, proteins belonging to other strains were not modeled automatically. However, if the user is interested in obtaining one such homology model, we have implemented an option in the web interface for automatic submission of the selected modeling task.

Active-site prediction

To add further relevant information on putative targets, we applied a sequence-based approach [Mistry et al., 2007] to predict the location of active-site residues. The method is based on comparing query sequences to homologs for which the position of the active site has been annotated. After analyzing the whole query set (777,585 proteins), this procedure predicted the location of active-site residues for 90,482 proteins (11.6%). Proteins with a predicted active site display a link to the details of the prediction in the web interface.

Pocket analysis

For proteins for which we could build a three-dimensional homology model, we predicted the location of ligand-binding sites on the structure by means of the LIGSITEcs program [Huang and Schroeder, 2006]. We further analyzed the ligand-binding sites using two previously developed methodologies that estimate the regulatory potential of particular ligand-binding pockets. When possible, the structural conservation of predicted pockets was measured considering the evolutionary record of the protein family, given that conserved pockets may have a relevant biological role [Panjkovich and Daura, 2010]. Furthermore, using Normal Mode Analysis we estimated the effect of ligand binding upon overall protein flexibility, a measure which has been used in combination with structural conservation to predict the location of allosteric sites [Panjkovich and Daura, 2012]. The user can visualize the protein structure and predictions online through the web interface.

Bibliography

Altschul et al., 1997
Altschul,S.F., Madden,T.L., Schaeffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res, 25 (17), 3389-3402.

Aoki and Kanehisa, 2005
Aoki,K.F. and Kanehisa,M. (2005) Using the KEGG database resource.
Curr Protoc Bioinformatics, Chapter 1, Unit 1.12.

Bereket et al., 2012
Bereket,W., Hemalatha,K., Getenet,B., Wondwossen,T., Solomon,A., Zeynudin,A. and Kannan,S. (2012) Update on bacterial nosocomial infections.
Eur Rev Med Pharmacol Sci, 16 (8), 1039-1044.

Boucher et al., 2009
Boucher,H.W., Talbot,G.H., Bradley,J.S., Edwards,J.E., Gilbert,D., Rice,L.B., Scheld,M., Spellberg,B. and Bartlett,J. (2009) Bad bugs, no drugs: no ESKAPE! An update from the Infectious Diseases Society of America.
Clin Infect Dis, 48 (1), 1-12.

Chanumolu et al., 2012
Chanumolu,S.K., Rout,C. and Chauhan,R.S. (2012) UniDrug-Target: a computational tool to identify unique drug targets in pathogenic bacteria.
PLoS One, 7 (3), e32833.

Consortium, 2009
Consortium,U. (2009) The Universal Protein Resource (UniProt) 2009.
Nucleic Acids Res, 37 (Database issue), D169-D174.

Eswar et al., 2008
Eswar,N., Eramian,D., Webb,B., Shen,M.Y. and Sali,A. (2008) Protein structure modeling with MODELLER.
Methods Mol Biol, 426, 145-159.

Huang and Schroeder, 2006
Huang,B. and Schroeder,M. (2006) LIGSITEcsc: predicting ligand binding sites using the Connolly surface and degree of conservation.
BMC Struct Biol, 6, 19.

Kanehisa et al., 2008
Kanehisa,M., Araki,M., Goto,S., Hattori,M., Hirakawa,M., Itoh,M., Katayama,T., Kawashima,S., Okuda,S., Tokimatsu,T. and Yamanishi,Y. (2008) KEGG for linking genomes to life and the environment.
Nucleic Acids Res, 36 (Database issue), D480-D484.

Laskowski et al., 1996
Laskowski,R.A., Rullmann,J.A., MacArthur,M.W., Kaptein,R. and Thornton,J.M. (1996) AQUA and PROCHECK-NMR: programs for checking the quality of protein structures solved by NMR.
J Biomol NMR, 8 (4), 477-486.

Mistry et al., 2007
Mistry,J., Bateman,A. and Finn,R.D. (2007) Predicting active site residue annotations in the Pfam database.
BMC Bioinformatics, 8, 298.

Panjkovich and Daura, 2010
Panjkovich,A. and Daura,X. (2010) Assessing the structural conservation of protein pockets to study functional and allosteric sites: implications for drug discovery.
BMC Struct Biol, 10, 9.

Panjkovich and Daura, 2012
Panjkovich,A. and Daura,X. (2012) Exploiting protein flexibility to predict the location of allosteric sites.
BMC Bioinformatics, 13 (1), 273.

Payne et al., 2007
Payne,D.J., Gwynn,M.N., Holmes,D.J. and Pompliano,D.L. (2007) Drugs for bad bugs: confronting the challenges of antibacterial discovery.
Nat Rev Drug Discov, 6 (1), 29-40.

Pedruzzi et al., 2013
Pedruzzi,I., Rivoire,C., Auchincloss,A.H., Coudert,E., Keller,G., de Castro,E., Baratin,D., Cuche,B.A., Bougueleret,L., Poux,S., Redaschi,N., Xenarios,I., Bridge,A. and Consortium,U. (2013) HAMAP in 2013, new developments in the protein family classification and annotation system.
Nucleic Acids Res, 41 (Database issue), D584-D589.

Sakharkar et al., 2004
Sakharkar,K.R., Sakharkar,M.K. and Chow,V.T.K. (2004) A novel genomics approach for the identification of drug targets in pathogens, with special reference to Pseudomonas aeruginosa.
In Silico Biol, 4 (3), 355-360.

Smith and Waterman, 1981
Smith,T.F. and Waterman,M.S. (1981) Identification of common molecular subsequences.
J Mol Biol, 147 (1), 195-197.

White and Kell, 2004
White,T.A. and Kell,D.B. (2004) Comparative genomic assessment of novel broad-spectrum targets for antibacterial drugs.
Comp Funct Genomics, 5 (4), 304-327.

Wishart et al., 2008
Wishart,D.S., Knox,C., Guo,A.C., Cheng,D., Shrivastava,S., Tzur,D., Gautam,B. and Hassanali,M. (2008) DrugBank: a knowledgebase for drugs, drug actions and drug targets.
Nucleic Acids Res, 36 (Database issue), D901-D906.

Yang et al., 2008
Yang,J., Chen,L., Sun,L., Yu,J. and Jin,Q. (2008) VFDB 2008 release: an enhanced web-based resource for comparative pathogenomics.
Nucleic Acids Res, 36 (Database issue), D539-D542.

Zhang and Zhang, 2008
Zhang,C.T. and Zhang,R. (2008) Gene essentiality analysis based on DEG, a database of essential genes.
Methods Mol Biol, 416, 391-400.



2013-10-08