QSAR-22-185-190

download QSAR-22-185-190

of 6

Transcript of QSAR-22-185-190

  • 8/3/2019 QSAR-22-185-190

    1/6

    Neural networks for effect prediction in environmental and healthissues using large datasets

    Klaus L. E. Kaiser*

    National Water Research Institute, 867 Lakeshore Road, Burlington, Ontario L7R 4A6, Canada

    Abstract

    Neural network methodologies allow the modeling of non-linear relationships. This makes them useful tools for theanalysis of larger data sets of non-congeneric compoundswith unknown or varying modes of action. This briefreview describes recent advances and their applications to

    sets of several hundred to over 1 000 compounds, modelingacute toxicity data for several aquatic species, includingfish, ciliate, bacteria, and non-acute toxicity data for amammalian species endpoint, i.e. estrogen receptor bind-ing assay data.

    1 Introduction

    The Domestic Substances List (DSL) tabulates approxi-mately 25000 substances which are in current use in Canada.An additional 50000 substances are on the Non-DomesticSubstances List, comprising those compounds which are inuse either below the threshold limit of 100 kg/year or onlyused elsewhere. Worldwide, the estimate of substances inuse is in excess of 100000. Based on the recognition ofbioaccumulation, toxicity, and persistence in the environ-ment as the three most important properties determiningthe pathways and effects of chemicals in the environment,countries belonging to the Organization for EconomicCooperation and Development (OECD) have begun toundertake research and regulatory initiatives to collateinformation on and to assess the potential harmful effects ofthese substances on the environment and human health.

    One critical issue in this endeavor is the lack of physico-chemical and toxicological data for many, in fact most ofthese substances. Measurements of these properties areboth time consuming and costly and avenues to reliably

    estimate such properties are required. Since the mid-20th

    century, the field of quantitative structure-activity relation-ships (QSARs) has developed into a highly useful science,particularly for the development of new and potentpharmaceuticals and the optimization of desirable effectsby QSAR-driven variations of the basic chemical structures.In contrast, in the environmental arena, the application ofQSAR was adopted much later in the 1980s, when RachelCarsons Silent Spring lead to a widespread recognition forthe need to research and regulate this field.

    Numerous mathematical models have been developed toaid in the analysis of effect data in relation to the structuresof the chemical agents involved and many models are in useto predict hundreds of different types of effects, from acutelethality to chronic and sub-lethal effects, to effect thresholdconcentrations. In addition, a variety of physico-chemicalproperties, such as hydrophilicity and hydrophobicityparameters, ionization constants, solubility, melting andboiling points, and other molecular properties are beingestimated by an increasing number of computer programsand with increasing accuracy. Until recently, successfulQSARs, using linear modeling techniques, were more or less

    predicated on small data sets of chemicals with an uniformmode of action, and congeneric chemical frameworks.The development of non-linear technologies, artificial

    intelligence-based algorithms opened the field to the con-current analysis of a wider variety of structures with(potentially) varying modes of action and non-congenericchemicals. This possibility is especially useful in the effectmodeling of the wide variety of substances entering theenvironment from human activity and enterprise. This briefreview will look at some of the issues, results and recentattempts in this field.

    Quant. Struct.-Act. Relat., 22 (2003) 2003 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim 0931-8771/03/010&-0001 $ 17.50+.50/0 1

    * To receive all correspondence. Dr. K. L. E. Kaiser, TerraBaseInc., 1063 King St. West, Suite 130, Hamilton, Ontario L8S 4S3,Canada, phone: 905-802-0154, fax: 905-527-0263, email: Klaus

    Key words: neural networks, toxicity, fish, Daphnia, Tetrahymena,Vibrio, steroid receptor binding

    Abbreviations: ANN, Artificial Neural Network; BNN, Back-propagation Neural Network; BRANN, Bayesian RegularizedArtificial Neural Network; PNN, Probabilistic Neural Network;DSL, Domestic Substances List; FHM, fathead minnow; RBA,relative binding affinity

    Neural networks for effect prediction in environmental and health issues using large datasets

  • 8/3/2019 QSAR-22-185-190

    2/6

    2 Acute toxicity of chemicals to aquatic organisms

    Overthe lasttwo decades, three substantial toxicity datasetswere developedin this field; they are(i) thefathead minnowlethal concentration (96-hr LC50) data for approximately900 substances, measured largely by the U.S. Environmental

    Protection Agencys Research Laboratory at Duluth, MN,and available through AQUIRE [1]; (ii) the 40 to 72-hrgrowth-inhibitory (ICG50) values for the ciliate Tetrahy-mena pyriformis, largely determined at the University ofTennessee, Knoxville, TN, published values in the literatureencompass approximately 1 200 substances, available fromT.W. Schultz; and (iii) the 5 to 30-min effective concen-tration (EC50) values for the luminescent marine bacteriumVibrio fischeri (formerly known as Photobacterium phos-

    phoreum), covering approximately 2200 substances, avail-able in substructure-searchable electronic format fromTerraBase Inc. [2]. In addition, smaller data sets, coveringfrom one hundred to several hundred substances are

    available for the acute toxicities of individual chemicals toseveral fish species, including the rainbow trout (Onco-rhynchus mykiss), thezebrafish (Brachydanio rerio),theredkillifish (Oryzias latipes),theguppy(Poecilia reticulata), thegoldorfe (Leuciscus idus melanotus), the goldfish (Carassiusauratus), the channel catfish (Ictalurus punctatus), and thebluegill (Lepomis macrochirus), as well as the to waterflea(Daphnia magna) and the algae Chlorella sp. These data arealso available from both commercial and non-commercialsources, e.g. [1, 2].

    2a QSARs of fathead minnow LC50 data

    Numerous reports can be found evaluating various subsetsof the fathead minnow (FHM) 96-hr LC50 data. In additionto the models proposed in the literature, the computerprogram ECOSAR[3], available from U.S. government websites and others, which derives 96-hr LC50 FHM estimatesfrom measured or calculated octanol/water (log Kow, orlogP) partition coefficients by applying these to over onehundred individual linear regressions. However, many ofthe underlying equations in ECOSAR lack any degree ofstatistical significance (as many are based on one or twosubstances only) and are further compromised by the needto know the use of the substance in question, as pointed out

    in [4]. Due to the variety of structures of the chemicals in theFHM data set, a number of different modes of action havebeen identified. Some modes of action, apparent from thebehavior of the fish upon exposure, are also known as fishacute toxicity syndromes and have been predicted on thebasis of the chemicals structures [5].

    With the increasing availability of neural network meth-ods, several studies have been published on the entire FHMdata set or large sub sets, using these methods. Withoutgoing into details of these reports, the following should bementioned. One of the first investigations on the use ofneural network methodology with aquatic toxicity data wasthe application of the feed forward back-propagation net-

    work (BPN) to approximately 400 FHMdata by Kaiser et al.[6]. Jurs and coworkers [7] studied a subset of that,comprising 375 substances with an artificial neural network(ANN) method. Kaiser and Niculescu [8] used a probabil-isticneural network (PNN) to analyze essentiallythe full setof 865 FHM values for organic substances. These models

    were sufficiently successful to be used for the quantitativeestimation of FHM values for a large number of DSLsubstances. Furthermore, a very detailed study employing avariety of statistical measures by Moore et al. [9], and usingan external test set of 130 substances derived from othersources, showed that the PNN results were superior inalmost all aspects when compared to the other modelspredictions, including those of ECOSAR [3], ASTER [10],and TOPKAT [11]. These results clearly demonstrate boththe general ability of neural network methods to be capableof modeling such diverse data sets and the specific ability ofthe PNN in doing so. Moreover, only the ECOSAR andPNN methods were able to provide estimates for all

    compounds in the testing data set.Quite recently, TerraBaseInc. [2] releaseda commercially

    available stand-alone fathead minnow 96-hr LC50 toxicityestimation program under the name TerraQSARTM-FHM,which is also based on the PNN methodology. It has not yetbeen part of any comparative study, such as the one byMoore et al. [9], however a graph showingthe estimated andmeasured values for the training data set is provided in themanual of that program and indicates a high degree ofcorrelation.

    2b QSARs of Daphnia magna LC50 data

    The waterflea, Daphnia magna, is used worldwide as anaquatic test organism. Tests are usually performed in staticsystems of several liters size and a recommended testprotocol has been developed (OECD Test Guideline 202,Part I, under revision) by the Organization for EconomicCooperation and Development (OECD). This test is also arequirement for the assessment of environmental fate andeffects of high-production-volume chemicals in OECDcountries. Kaiser and Niculescu [12] published the firstlarge-data-set analysis of the acute (48-hr LC50) toxicity ofover 700 compounds of Daphnia magna using the PNNapproach. Daphnia magna acute toxicity values are also

    being estimated by several computer programs, includingECOSAR, ASTER and TOPKAT.The PNN model [12] of the acute toxicity of

    700chemicals to Daphnia m. was fully cross-validatedusing a 20%-leave-out procedure, i.e. using fiverandomsub-sets of 80% each of the training set data. The resulting fivemodels showed very similar statistics, indicating the validityof the main model. This was further tested by applying themain model to an external test set of 76 compounds. Thevalues predicted for the external test set were in goodagreement with the experimental data. In comparison to theestimates produced by ECOSAR for the same external testdata, the PNN model estimates were found to be much

    2 Quant. Struct.-Act. Relat., 22 (2003)

    Klaus L. E. Kaiser

  • 8/3/2019 QSAR-22-185-190

    3/6

  • 8/3/2019 QSAR-22-185-190

    4/6

    compounds for their light-diminishing effect on the bacte-rial light emission (EC50). Since the tests appearance, wellover two thousand chemicals have so been tested and thedata are available in monograph [18], as well as electronic,chemical structure-fragment-searchable formats [2]. Vari-ous subsets of the available EC50 values for approximately

    2200 chemicals have been studied with a variety of QSARmethods, including neural network algorithms. Among thelarger sets studied are those dealing with 604 chemicals byDevillers and coworkers [19], and another set of 1 068compounds, including values for three different exposuretimes using the BNNmethodology [20]. Anotherstudy using1 200 compounds is using the PNN algorithm and structuralfragment descriptors as independent parameters [21].

    Without going into details of these studies, it can begeneralized that the neural network methodologies appliedto these large, non-congeneric data sets have provensuccessful in modeling the available data, typically spanningup to ten orders of magnitude in the compounds range of

    toxic effects observed. Devillers [22] reviewed the use ofneural networks for large heterogeneous sets of molecules,including the studies mentioned above, and concluded that... these models can often compete favorably with classicalregression models ....

    3 Sub-acute effects of chemicals to aquatic and non-aquatic organisms

    There is a rapidly rising volume of literature on theapplication of neural network methods to specific non-lethality endpoints for many types of effects, enzymes and

    organisms. Except for physico-chemical properties, such asthe octanol/water partition coefficient and aqueous solubil-ity and certain non-specific effects, such as carcinogenicityand mutagenicity, the data sets in these fields rarely surpass100 compounds at a time. Moreover, such data sets usuallycomprise compounds with limited structural variation of abase molecule. Many of these groups of data are generallywell suited for analysis by more traditional QSAR methods,including classical QSAR, principal component analysisandrelated methods. Therefore, neural network methods haveonly begun to be employed with such data when the othermethods failed to provide satisfactory results. The followingwill give just a few examples of such cases.

    3a Steroid receptor binding affinities

    There is a considerable interestin the assessmentof theDSLsubstances vis-a-vis their potential endocrine disrupting

    4 Quant. Struct.-Act. Relat., 22 (2003)

    Figure 2. Measured vs. predicted estrogen receptor binding affinity (RBA) data, values in log(RBA), where RBA is the binding affinityrelative to 17beta-estradiol (100%). Reproduced with permission from Water Quality Research J. Canada 36, 619 630 (2001), copyrightCAWQ, Burlington, Ontario, Canada.

    Klaus L. E. Kaiser

  • 8/3/2019 QSAR-22-185-190

    5/6

    effects on organisms. The binding affinity of compounds tothe hormone-receptor proteins is one possible endpointrelated to that. Consequently, efficient models of thereceptor binding of substances, and their agonistic andantagonistic effects are highly desirable. A comparativelysmall, but representative data set was developed and

    analyzed by Van Helden et al. [23] for binding to theprogesterone receptor. The same group [24] also used acombination of genetic algorithms and neural networks.Later, Niculescu and Kaiser [25] used the same data withdifferent neural network methodologies. Independent ofthe particular method applied, neural network models werefound to perform better than any of the traditional QSARmethods.

    Kaiser and Niculescu [26] also analyzed a data set of 1 000compounds known to bind to the estrogen receptor com-plex. Specifically, the estrogen receptor binding affinity(RBA) values of compounds relative to 17beta-estradiolwere used with a 20%-leave-out cross-validation and

    external test set of another 118 substances. Both setscomprise steroidal and non-steroidal structures with a largevariety of substitutents and molecular sizes and coverapproximately eight orders of magnitude in activity al-though, because of the underlying goal to create compoundswith high binding affinity, the data available are somewhatskewed towards the log(RBA) range of 0 to 3, withrelatively few values in the log(RBA) 5 to 0 range. Whilethe overall model clearly shows promise, it is not yet refinedenough to be of practical use. However, several sub-sets,defined by certainstructural conditions, such as the presenceof halogens, or carboxylic acid groups, or carboxylic acidester groups, gave good correlations of predicted versus

    measured results, indicating that such sub-models maybeuseful at this time in predicting estrogen receptor bindingaffinity for such chemicals. Figure 2 shows the model resultsfiltered for the carboxylic ester moiety-containing com-pounds in this data set; this set comprises 75 and 9compounds in the training and test sets, respectively, andthis group has standard deviation errors of 0.28 (training)and0.20 (test) andPearsons r2 of>0.94, respectively. Whilethere are several studiesin the literature on the prediction ofsteroid RBA values, they all use much smaller training setsand different assay selections than that used in [26].Therefore, any comparison of the results between these

    studies is not possible.

    3b Various effects and properties

    There are numerous studies on the application of artificialintelligence methods, including neural networks to themodeling of a host of biological effects and physico-chemical properties of compounds. They will not bereviewed here; suffice it to mention two important exam-ples. Mammalian carcinogenicity and mutagenicity datasetsof several hundred compounds and different routes ofexposure and types of effect, have been analyzed by Bristol

    and coworkers [27]. Devillers et al. [28] used a BNN toinvestigate the biodegradability of organic chemicals.

    4 Conclusions

    Most studies with neural networks find that the versatilityand modeling capabilities of neural networks lead to resultsthat are at least at par, but frequently superior to thoseobtained with traditional QSAR methods. Particularly fordata sets involving non-congeneric structures, such as manyof the substances tested with acute toxicity bioassays in thestudies mentioned here, non-linear neural networks usuallyperform much better by recognizing and adapting to non-linear relationships. Further advancements and, ultimately,reliable neural-network-based effect prediction programscan be expected to result from this research.

    5 References

    [1] U.S.Environmental Protection Agency, AQUIRE, http://www.epa.gov/ecotox, (2002).

    [2] TerraBase Inc., http://www.terrabase-inc.com, (2002).[3] U.S.Environmental Protection Agency, ECOSAR, http://

    www.epa.gov/oppt/newchems/21ecosar.htm, (2000).[4] Kaiser, K. L. E., Dearden, J. C., Klein, W., and Schultz, T. W.,

    A note of caution to users of ECOSAR, Wat. Qual. Res. J.Canada 34, 179 182 (1999).

    [5] Russom, C. L., Bradbury, S. P., Broderius, S. J., and Hammer-meister, D. E., Predicting modes of toxic action from chem-ical structure: acute toxicity in the fathead minnow (Pime-

    phales promelas), Environ. Toxicol. Chem. 16, 948967(1997).

    [6] Kaiser, K. L. E., Niculescu, S. P., and McKinnon, M. B., Onsimple linear regression, multiple linear regression, andelementary probabilistic neural network with Gaussian ker-nels performance in modeling toxicity values to fatheadminnow, in: Chen, F. and Schrmann, G. (Ed.), QSAR inEnvironmental Sciences -VII, SETAC Press, Pensacola, FL.1997, pp.285 297.

    [7] Eldred, D. V., Weikel, C. L., Jurs, P. C., and Kaiser, K. L. E.,Prediction of fathead minnow acute toxicity of organiccompounds from molecular structure, Chem. Res. Toxicol12, 670 678 (1999).

    [8] Kaiser, K. L. E., and Niculescu, S. P., Using probabilistic

    neural networks to model the toxicity of chemicals to thefathead minnow (Pimephales promelas): a study based on 865compounds, Chemosphere 38, 3237 3245 (1999).

    [9] Moore, D. R. J., Breton, R. L., and MacDonald, D. B., Acomparison of model performance for six QSAR packagesthat predict acute toxicity to fish, Environ. Toxicol. Chem., inpress.

    [10] U.S.Environmental Protection Agency, ASTER, AssessmentTools for the Evaluation of Risk, http://www.epa.gov/med/databases/aster.htm (2002).

    [11] Accelrys Inc., TOPKAT, http://www.accelrys.com (2002).[12] Kaiser, K. L. E., and Niculescu, S. P., Modeling acute toxicity

    of chemicals to Daphnia magna: a probabilistic neuralnetwork approach, Environ. Toxicol. Chem. 20, 420431(2001).

    Quant. Struct.-Act. Relat., 22 (2003) 5

    Neural networks for effect prediction in environmental and health issues using large datasets

  • 8/3/2019 QSAR-22-185-190

    6/6

    [13] Schultz, T. W., and Cronin, M. T. D., Response-surface anal-yses for toxicity to Tetrahymena pyriformis. Reactive carbon-yl-containing aliphatic chemicals, J. Chem. Inf. Comput. Sci.39, 304 309 (1999).

    [14] Serra, J. R., Jurs, P. C., and Kaiser, K. L. E., Linear regressionand computational neural network prediction ofTetrahymenaacute toxicity for aromatic compounds from molecular

    structure, Chem. Res. Toxicol. 14, 1535 1545 (2001).[15] Niculescu, S. P., Kaiser, K. L. E., and Schultz, T. W., Modelingthe toxicity of chemicals to Tetrahymena pyriformis usingmolecular fragment descriptors and probabilistic neural net-works, Arch. Environ. Contam. Toxicol. 39, 289 298 (2000).

    [16] Kaiser, K. L. E., Niculescu, S. P., and Schultz, T. W., Proba-bilistic neural network modeling of the toxicity of chemicalsto Tetrahymena pyriformis with molecular fragment descrip-tors, SAR QSAR Environ. Res. 13, 57 67 (2002).

    [17] Burden, F. R., Winkler, D. A., A QSAR model for the acutetoxicity of substituted benzenes to Tetrahymena pyriformisusing Bayesian-regularized neural networks, Chem. Res.Toxicol. 13, 436 440 (2000).

    [18] Kaiser, K. L. E., and Devillers, J., 1994. Ecotoxicity ofChemicals to Photobacterium phosphoreum. Gordon andBreach Science Publishers, Reading, MA.

    [19] Devillers, J., Bintein, S., Domine, D., and Karcher, W., Ageneral QSAR model for predicting the toxicity of organicchemicals to luminescent bacteria (MicrotoxR test), SARQSAR Environ. Res. 4, 29 38 (1995).

    [20] Devillers, J., and Domine, D., A noncongeneric model forpredicting toxicity of organic molecules to Vibrio fischeri,SAR QSAR Environ. Res. 10, 61 70 (1999).

    [21] Kaiser, K. L. E., and Niculescu, S. P., Neural network model-ing of Vibrio fischeri toxicity data with structural physico-chemical parameters and molecular indicator variables, in:

    Walker, J. D. (Ed.), Proc. 8th Intl. Workshop on QSAR inEnvironmental Sciences, SETAC, Pensacola, FL. 2002,pp.1620.

    [22] Devillers, J., QSAR modeling of large heterogenous sets ofmolecules, SAR QSAR Environ. Res. 12, 515 528 (2001).

    [23] Van Helden, S. P., Hamersma, H., and van Geerestein, V. J.,Prediction of the progesterone receptor binding of steroids

    using a combination of genetic algorithms and neural net-works, in: Devillers, J. (Ed.), Genetic Algorithms in MolecularModeling, Academic Press, London. 1996, pp.159 192.

    [24] So, S.-S., van Helden, S. P., van Geerestein, V. J., and Karplus,M., Quantitative structure-activity relationship studies ofprogesterone receptor binding steroids, J. Chem. Inf. Comput.Sci. 40, 762 772 (2000).

    [25] Niculescu, S. P., and Kaiser, K. L. E., Modeling the relativebinding affinity of steroids to the progesterone receptor withprobabilistic neural networks, Quant. Struct.-Act. Relat. 20,223 226 (2001).

    [26] Kaiser, K. L. E., and Niculescu, S. P., On the PNN modellingof estrogen receptor binding data for carboxylic acid estersand organochlorine compounds, Wat. Qual. Res. J. Canada 36,619 630 (2001).

    [27] Bahler, D., Stone, B., Wellington, C., and Bristol, D. W.,Symbolic, neural, and Bayesian machine learning models forpredicting carcinogenicity of chemical compounds, J. Chem.Inf. Comput. Sci. 40, 906 914 (2000).

    [28] Devillers, J., Domine, D., and Boethling, R. S., Use of abackpropagation neural network and autocorrelation descrip-tors for predicting the biodegradation of organic chemicals,in: Devillers, J. (Ed.), Neural Networks in QSAR and DrugDesign, Academic Press, London. 1996, pp. 65 82.

    Received on ; Accepted on November 7, 2002

    6 Quant. Struct.-Act. Relat., 22 (2003)

    Klaus L. E. Kaiser