OPSIN: Taming the jungle of IUPAC chemical nomenclature
Transcript of OPSIN: Taming the jungle of IUPAC chemical nomenclature
1
OPSIN: Taming the jungle of IUPAC chemical nomenclature
Daniel Lowe, Peter Murray-Rust, Robert C Glen
8th September 2013, ACS, Indianapolis
4-[(19S,21R,26R,27S)-19,21-dihydroxy-27-methoxy-26-methylnonacosyl]phenyl 3,6-di-O-methyl-α-D-glucopyranosyl-(1→4)-2,3-di-O-methyl-α-L-rhamnopyranosyl-(1→2)-3-O-methyl-α-L-rhamnopyranoside
2
What is chemical name to structure?
(2S)-2-aminobutan-1-ol
(2S)-   stereochemistry
2-      locant
amino   substituent
but     alk
an      unsaturation
-1-     locant
ol      suffix
[Figure: structure bearing an NH2 group, with atoms numbered 1 to 4]
3
Uses of chemical name to structure
• Identify documents by their chemical structures
• Assist with structure viewing
• Identify incorrect chemical names
• Extract reagent structures, hence allowing reactions to be reconstructed from text
4
5
Parsing
• Over 4000 discrete morphemes form the program’s vocabulary
(a morpheme is the smallest section of a word with meaning)
• These are grouped into 140 classes e.g.
• unsaturator (‘ene’)
• aminoAcidEndsInIne (‘tyros’)
• simpleSubstituent (‘amino’)
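The idea of splitting a name into classified morphemes can be illustrated with a toy greedy longest-match tokenizer. This is only a sketch under stated assumptions, not OPSIN's actual parser: the four-entry vocabulary stands in for the 4000+ real morphemes, the class name `alkaneStem` is hypothetical, and the input is assumed to be lower case.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MorphemeSketch {
    // Toy vocabulary: morpheme -> class. The first three pairs come from the slide;
    // "eth" -> "alkaneStem" is a hypothetical entry added for illustration.
    static final Map<String, String> VOCAB = new LinkedHashMap<>();
    static {
        VOCAB.put("ene", "unsaturator");
        VOCAB.put("tyros", "aminoAcidEndsInIne");
        VOCAB.put("amino", "simpleSubstituent");
        VOCAB.put("eth", "alkaneStem");
    }

    // Splits a name into "morpheme:class" tokens, always taking the longest
    // vocabulary entry that matches at the current position.
    static List<String> tokenize(String name) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < name.length()) {
            String best = null;
            for (String morpheme : VOCAB.keySet()) {
                if (name.startsWith(morpheme, pos)
                        && (best == null || morpheme.length() > best.length())) {
                    best = morpheme;
                }
            }
            if (best == null) {
                throw new IllegalArgumentException(
                        "No morpheme matches at position " + pos + " in " + name);
            }
            tokens.add(best + ":" + VOCAB.get(best));
            pos += best.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [amino:simpleSubstituent, eth:alkaneStem, ene:unsaturator]
        System.out.println(tokenize("aminoethene"));
    }
}
```

Greedy matching is chosen here only for brevity; a real vocabulary contains morphemes that are prefixes of one another, so a production parser needs a grammar over the morpheme classes rather than a single left-to-right pass.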
6
Word Rule                   Example
acetal                      Propanal dimethyl acetal
additionCompound            Carbon tetrachloride
acidHalideOrPseudoHalide    Cyanic chloride
amide                       Nitrous amide
anhydride                   Acetic anhydride
biochemicalEster            Adenosine 5'-triphosphate
carbonylDerivative          Propanone oxime
divalentFunctionalGroup     Diethyl ether
ester                       Ethyl ethanoate
functionalClassEster        Acetic acid ethyl ester
functionGroupAsGroup        Cyanide
glycol                      Ethylene glycol
glycolEther                 Ethylene glycol monomethyl ether
hydrazide                   Phosphoric hydrazide
monovalentFunctionalGroup   Ethyl alcohol
multiEster                  Ethyl propyl methylphosphonate
oxide                       Thiophene 1,1-dioxide
polymer                     Poly(ethylene)
simple                      Ethylbenzene
substituent                 Chloro
7
Supported chain nomenclature
Alkanes
dodectetractkiliane
Heteroatom hydrides
pentaphosphane
Heterogeneous heteroatom hydrides
disilazane
Trivial acids
butyric acid
8
Supported ring nomenclature
Monocyclic spiro
dispiro[4.2.4.2]tetradecane
Hantzsch-Widman
1,3,5-triazine
Fused ring
furo[3,2-b]thieno[2,3-e]pyridine
Ring assembly
2,2':6',2''-terpyridyl
Von Baeyer
tricyclo[2.2.1.1²,⁵]octane
Polycyclic spiro
spiro[piperidine-4,9'-xanthene]
9
Structural assembly nomenclature
Conjunctive nomenclature
benzeneethanol
Substitutive nomenclature
2,4,6-trinitrotoluene
Additive nomenclature
methylsulfonyl
Multiplicative nomenclature
4,4'-methylenedioxydibenzoic acid
Functional class nomenclature
ethyl alcohol
10
Structural modifications
Heteroatom replacement
1-thia-4-aza-2,6-disilacyclohexane
Unsaturation
hexa-1,3-dien-5-yne
Hydro, dehydro, indicated hydrogen and added hydrogen
2,7-dihydro-1H-azepine
Functional replacement; suffixes including infixed suffixes
methanedithioic acid; 1-chloro-2,4-diimidotricarbonic acid
Lambda convention
2λ⁶-trisulfane
11
Bridges and stereochemistry
Bridges
4a,8a-propanoquinoline
E/Z stereochemistry
(Z)-2-chloro-but-2-ene
Relative cis/trans stereochemistry
trans-2,6-dimethyl-2,6-dihydronaphthalene
R/S stereochemistry
(1R,3S)-3-amino-3-methylcyclohexanol
12
Miscellaneous nomenclature
Groups with indeterminately positioned structural features
1,3-xylene
Charge and oxidation numbers
methylmercury(1+) or methylmercury(II)
Subtractive nomenclature
2-deoxy-ᴅ-ribose
“per-nomenclature”
perhydroanthracene
perchlorobenzene
13
Polymer nomenclature
Structure-based polymer nomenclature
poly[(benzo[1,2-d:4,5-d']bis[1,3]thiazole-2,6-diyl)-1,4-phenyleneoxy-1,3-phenylene(1,3,5,7-tetraoxo-1,2,3,5,6,7-hexahydrobenzo[1,2-c:4,5-c']dipyrrole-2,6-diyl)-1,3-phenyleneoxy-1,4-phenylene]
14
Domain specific nomenclature
Steroid nomenclature
17β-Hydroxy-8α,9β,10α-androst-4-en-3-one
Amino acid
ʟ-leucinamide
Oligopeptide
ʟ-arginyl-O-phosphono-ʟ-seryl-ʟ-alanyl-ʟ-proline
Cyclic peptide
cyclo(ᴅ-alanyl-ʟ-phenylalanyl)
Nucleotide nomenclature
guanylyl(3'-5')uridine 3'-monophosphate
15
Carbohydrate nomenclature (acyclic)
ᴅ-gluco-hexose or ᴅ-glucose (preferred)
ʟ-ribo-ᴅ-manno-nonose
• Carbohydrates are defined using configurational prefixes that each specify the stereochemistry for between 1 and 4 stereocentres
16
Carbohydrate derivatives
• These carbohydrate chains can then be algorithmically modified by suffixes
ᴅ-glucose
ᴅ-glucitol
ᴅ-glucaric acid
ᴅ-gluconic acid
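The four names above differ only in the groups at the two ends of the carbon chain. A minimal sketch of that suffix-driven modification follows; it is not OPSIN's implementation, it assumes an open-chain aldose, and the class and method names are hypothetical. It reports the group at C1 and at the last carbon (Cn) implied by each suffix.

```java
import java.util.Locale;

class SugarSuffixSketch {
    // Returns the terminal groups implied by the name's suffix, assuming an
    // open-chain aldose parent (C1 = CHO, last carbon = CH2OH).
    static String endGroups(String name) {
        String n = name.toLowerCase(Locale.ROOT);
        if (n.endsWith("uronic acid")) {
            // Would otherwise be mistaken for "-onic acid"; outside this sketch.
            throw new IllegalArgumentException("Unsupported suffix in: " + name);
        }
        if (n.endsWith("aric acid")) {
            return describe("COOH", "COOH");   // both ends oxidised
        }
        if (n.endsWith("onic acid")) {
            return describe("COOH", "CH2OH");  // C1 oxidised
        }
        if (n.endsWith("itol")) {
            return describe("CH2OH", "CH2OH"); // C1 reduced
        }
        if (n.endsWith("ose")) {
            return describe("CHO", "CH2OH");   // unmodified aldose
        }
        throw new IllegalArgumentException("Unsupported suffix in: " + name);
    }

    private static String describe(String c1, String cn) {
        return "C1=" + c1 + ", Cn=" + cn;
    }

    public static void main(String[] args) {
        for (String name : new String[] {
                "D-glucose", "D-glucitol", "D-glucaric acid", "D-gluconic acid"}) {
            System.out.println(name + " -> " + endGroups(name));
        }
    }
}
```

The stereocentres along the chain are untouched by these suffixes, which is why the configurational prefix (here gluco) can be resolved once and reused for every derivative.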
17
Carbohydrate nomenclature (cyclic)
α-ᴅ-glucopyranose
2,7-anhydro-D-glycero-β-D-galacto-oct-2-ulopyranosonic acid
ᴅ-glucose
18
Carbohydrate nomenclature (oligosaccharides)
β-ᴅ-Fructofuranosyl α-ᴅ-glucopyranoside
β-ᴅ-glucopyranosyl-(1→3)-β-ᴅ-glucopyranosyl-(1→3)-ᴅ-glucopyranose
19
Fused ring nomenclature
• All fused ring nomenclature is processed algorithmically: even benzofuran is constructed from benzene and furan rather than being treated as a trivial name
• For example:
benzo[b]cycloocta[jk]fluorene
[Figure: the ring system, with rings labelled by size (8, 6, 6, 6, 5)]
20
Fused ring nomenclature (numbering)
• Transform to an idealised grid aligned along the longest row of rings
• Apply quadrant rules e.g. favour most rings in upper right quadrant
[Figure: candidate grid layouts of the ring system, with rings labelled by size (5, 6, 8)]
21
Fused ring nomenclature (numbering)
• Atoms numbered in ascending order from upper rightmost ring
• Peripheral numbering rules are used to choose the grid layout that gives the best numbering
[Figure: chosen grid layout, with rings labelled by size (6, 6, 6, 5, 8)]
22
Beyond IUPAC: CAS index name un-inversion
CAS Index Name                                IUPAC name
benzene, ethyl-                               ethylbenzene
Disulfide, bis(2-chloroethyl)                 Bis(2-chloroethyl) disulfide
Benzoic acid, 4,4'-methylenebis[2-chloro-     4,4'-Methylenebis[2-chlorobenzoic acid]
Phosphoric acid, ethyl dimethyl ester         ethyl dimethyl phosphate
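The two simplest rows of this table can be reproduced with a few lines of string handling. The sketch below is an illustration under stated assumptions, not OPSIN's un-inversion logic: it assumes a single ", " separating the parent from the inverted part, treats a trailing hyphen as "attach directly to the parent", and deliberately refuses the bracketed and ester cases, which need real nomenclature knowledge. The class and method names are hypothetical.

```java
import java.util.Locale;

class CasUninversionSketch {
    // Un-inverts "parent, modifier" style CAS index names in the simplest cases.
    static String uninvert(String casName) {
        int comma = casName.indexOf(", ");
        if (comma < 0) {
            return casName; // nothing to un-invert
        }
        String parent = casName.substring(0, comma).toLowerCase(Locale.ROOT);
        String modifier = casName.substring(comma + 2).trim();
        if (modifier.contains("[") || modifier.endsWith(" ester")) {
            // e.g. "Benzoic acid, 4,4'-methylenebis[2-chloro-" or
            // "Phosphoric acid, ethyl dimethyl ester": out of scope here.
            throw new UnsupportedOperationException("Not handled by this sketch: " + casName);
        }
        if (modifier.endsWith("-")) {
            // Substituent prefix: drop the hyphen and attach to the parent.
            return modifier.substring(0, modifier.length() - 1) + parent;
        }
        // Otherwise the parent is a separate functional class word.
        return modifier + " " + parent;
    }

    public static void main(String[] args) {
        System.out.println(uninvert("benzene, ethyl-"));               // ethylbenzene
        System.out.println(uninvert("Disulfide, bis(2-chloroethyl)")); // bis(2-chloroethyl) disulfide
    }
}
```

Note that the sketch lower-cases the parent, so its output differs in capitalisation from the table's "Bis(2-chloroethyl) disulfide".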
23
Beyond IUPAC: Correcting missing spaces
tert-butylacetate → tert-butyl acetate
• tert-butyl-4-vinylperbenzoate: no locant, and perbenzoate has more than one non-degenerate hydrogen
• diethylcarbonate: has no substitutable hydrogen
• Ethylacetate: the non-ester would be butanoate or butyrate!
24
Performance on machine-generated names
[Chart: percentage of names (0% to 100%) in each outcome category, for names generated by ACD/Name 12.02, ChemBioDraw 13, Lexichem 2.1 and Marvin 6.0.2. Categories: No Result, Constitutional Discrepancy, Stereochemical Discrepancy, Correctly Interpreted]
30,000 structures randomly selected from PubChem were used as input to machine-generate names
25
Performance on unique names from US patent headings
26
What’s not supported
• Parsing of generic chemical names
• E.g. 2- or 3-alkyl-substituted benzofurans
• Advanced inorganic nomenclature e.g. coordinate bonding
• Some natural product nomenclature
• Advanced stereochemistry e.g. pseudoasymmetric stereocentres, axial stereochemistry etc.
27
Usage
Java API
import uk.ac.cam.ch.wwmm.opsin.NameToStructure;

NameToStructure nts = NameToStructure.getInstance();
String chemicalName = "acetonitrile";
String smiles = nts.parseToSmiles(chemicalName); // null if the name cannot be interpreted
Batch conversion on the command line
java -jar opsin-1.5.0-jar-with-dependencies.jar -osmi input.txt output.smi
RESTful web service (opsin.ch.cam.ac.uk)
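For the web service listed on this slide, a request URL can be built with the standard library alone. The host comes from the slide; the `https` scheme, the `/opsin/` path and the `.smi` extension for SMILES output are assumptions about the service layout and should be checked against its documentation. The sketch only constructs the URL and performs no network access.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class OpsinUrlSketch {
    // Assumed base path of the service; only the host appears in the talk.
    static final String BASE = "https://opsin.ch.cam.ac.uk/opsin/";

    // Builds the (assumed) URL that returns SMILES for a chemical name.
    static URI smilesUri(String chemicalName) {
        // URLEncoder targets form data and writes spaces as '+';
        // inside a URL path they must be "%20" instead.
        String encoded = URLEncoder.encode(chemicalName, StandardCharsets.UTF_8)
                .replace("+", "%20");
        return URI.create(BASE + encoded + ".smi");
    }

    public static void main(String[] args) {
        System.out.println(smilesUri("acetonitrile"));
        System.out.println(smilesUri("ethyl acetate"));
    }
}
```

Percent-encoding matters here because chemical names routinely contain spaces, brackets, commas and primes.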
28
Who is using OPSIN?
• Commercial software
• Cinfony (interface to Python)
• Many text mining efforts
• Workflows
• Web services
29
Conclusions
• OPSIN combines high recall, precision and speed of execution
• Recent work has significantly improved coverage of biochemical nomenclature
Visit opsin.ch.cam.ac.uk to try it out and download!
30
OPSIN: Taming the jungle of IUPAC chemical nomenclature
For more information see:
Chemical Name to Structure: OPSIN, an Open Source Solution
J. Chem. Inf. Model., 2011, 51 (3), pp 739–753
Extraction of chemical structures and reactions from the literature (https://www.repository.cam.ac.uk/handle/1810/244727)
Acknowledgements
Albina Asadulina
Rich Apodaca
Peter Corbett
Roger Sayle
Funding