Noel M O'Boyle’s research while affiliated with University College Cork and other places


Publications (45)


Practical Applications of Matched Series Analysis: SAR Transfer, Binding Mode Suggestion and Data Point Validation
  • Article

January 2017 · 44 Reads · 5 Citations · Noel O'Boyle

Aim: The assumption in scaffold hopping is that changing the scaffold does not change the binding mode and the same structure-activity relationships (SARs) are seen for substituents decorating each scaffold. Results/methodology: We present the use of matched series analysis, an extension of matched molecular pair analysis, to automate the analysis of a project's data and detect the presence or absence of comparable SAR between chemical series. Conclusion: The presence of SAR transfer can confirm the perceived binding mode overlay of different chemotypes or suggest new arrangements between scaffolds that may have gone unnoticed. The absence of series correlation can highlight the presence of inconsistent data points where assay values should be reconfirmed, or provide challenge to any project dogma.


Figure 1: Graphical abstract. An example series from one of the benchmark datasets. Each fingerprint is assessed on its ability to reproduce a specific series order.
Figure 2: Composition of a series in the multi-assay benchmark. The diagram shows a series consisting of five molecules M1, M3, M5, M7 and M9 (in that order) taken from four assays in four different papers, where each assay has a compound in common
Figure 3: Histogram showing the effect of successive filters on the pairwise similarity of structures in the same assay. Pairwise similarity was measured using the LECFP4 fingerprint for pairs of structures from each assay in the dataset and a histogram generated using a bin width of 0.05. The initial data (green) was for assays containing up to 25 structures. Successive filters were then applied to restrict the data to those assays of size 8 or greater, to remove promiscuous molecules, and to remove molecules found in Wikipedia or with INNs. For comparison, the pairwise similarity of randomly chosen molecules from the entire dataset is shown as the dashed line. Histograms were normalised to 100 % over all bins, except for the histogram for the random data which was scaled to 30 %
Figure 4: Examples of series from (a) the single-assay benchmark, (b) the multi-assay benchmark
Figure 5: Histogram showing the structural similarity of structures in the single-assay benchmark with respect to their corresponding reference molecules. Similarity was measured with the LECFP6 fingerprint and a histogram created using bins of width 0.05. Histograms were normalised to 100 % over all bins. The data used here is taken from all 1000 repetitions of the benchmark


Comparing structural fingerprints using a literature-based similarity benchmark
  • Article
  • Full-text available

July 2016 · 473 Reads · 213 Citations

Journal of Cheminformatics

Background: The concept of molecular similarity is one of the central ideas in cheminformatics, despite the fact that it is ill-defined and rather difficult to assess objectively. Here we propose a practical definition of molecular similarity in the context of drug discovery: molecules A and B are similar if a medicinal chemist would be likely to synthesise and test them around the same time as part of the same medicinal chemistry program. The attraction of such a definition is that it matches one of the key uses of similarity measures in early-stage drug discovery. If we make the assumption that molecules in the same compound activity table in a medicinal chemistry paper were considered similar by the authors of the paper, we can create a dataset of similar molecules from the medicinal chemistry literature. Furthermore, molecules with decreasing levels of similarity to a reference can be found by either ordering molecules in an activity table by their activity, or by considering activity tables in different papers which have at least one molecule in common.
Results: Using this procedure with activity data from ChEMBL, we have created two benchmark datasets for structural similarity that can be used to guide the development of improved measures. Compared to similar results from a virtual screen, these benchmarks are an order of magnitude more sensitive to differences between fingerprints both because of their size and because they avoid loss of statistical power due to the use of mean scores or ranks. We measure the performance of 28 different fingerprints on the benchmark sets and compare the results to those from the Riniker and Landrum (J Cheminf 5:26, 2013. doi:10.1186/1758-2946-5-26) ligand-based virtual screening benchmark.
Conclusions: Extended-connectivity fingerprints of diameter 4 and 6 are among the best performing fingerprints when ranking diverse structures by similarity, as is the topological torsion fingerprint. However, when ranking very close analogues, the atom pair fingerprint outperforms the others tested. When ranking diverse structures or carrying out a virtual screen, we find that the performance of the ECFP fingerprints significantly improves if the bit-vector length is increased from 1024 to 16,384.
Electronic supplementary material: The online version of this article (doi:10.1186/s13321-016-0148-0) contains supplementary material, which is available to authorized users.
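The ranking idea in this abstract can be illustrated with a minimal sketch. It assumes fingerprints are available as sets of hashed feature identifiers and uses the Tanimoto coefficient as the similarity measure; the function names, the feature identifiers and the molecule labels are all hypothetical, and this is not the benchmark code used in the paper. The `fold` helper shows why a short bit vector can blur distinctions: distinct features may collapse onto the same bit.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def fold(bits: frozenset, length: int) -> frozenset:
    """Fold hashed feature identifiers into a bit vector of the given length."""
    return frozenset(bit % length for bit in bits)


def rank_by_similarity(reference: frozenset, candidates: dict) -> list:
    """Return candidate names ordered from most to least similar to the reference."""
    return sorted(
        candidates,
        key=lambda name: tanimoto(reference, candidates[name]),
        reverse=True,
    )


# Hypothetical hashed feature identifiers, not real ECFP output.
reference = frozenset({3, 17, 1041, 2065, 5000})
candidates = {
    "M1": frozenset({3, 17, 1041, 2065, 6000}),
    "M2": frozenset({3, 17, 9000, 9001, 9002}),
    "M3": frozenset({17, 1042, 8000, 8001, 8002}),
}

unfolded_order = rank_by_similarity(reference, candidates)

# Folding to 1024 bits merges features 17, 1041 and 2065 into a single bit.
folded_reference = fold(reference, 1024)
folded_candidates = {name: fold(bits, 1024) for name, bits in candidates.items()}
folded_order = rank_by_similarity(folded_reference, folded_candidates)
```

In this toy data the ranking survives folding, but the similarity of M2 to the reference rises from 0.25 to about 0.33 purely because of collisions, which is the kind of distortion a longer bit vector reduces.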


Figure 1. Example of a Wikipedia disease page, demonstrating the term relationships that were extracted in bulk from a dump of Wikipedia.
Figure 2. Workflow for chemical-disease relationship extraction. Dashed boxes are optional steps.
Table 3. Precision of patterns where the chemical term follows the disease term.
Table 4. Performance of the system on the test set for the DNER and CID tasks.
Effect of the choice of lexicon on performance of the system on the development set.
Efficient chemical-disease identification and relationship extraction using Wikipedia to improve recall

April 2016 · 139 Reads · 29 Citations

Database

Awareness of the adverse effects of chemicals is important in biomedical research and healthcare. Text mining can allow timely and low-cost extraction of this knowledge from the biomedical literature. We extended our text mining solution, LeadMine, to identify diseases and chemical-induced disease relationships (CIDs). LeadMine is a dictionary/grammar-based entity recognizer and was used to recognize and normalize both chemicals and diseases to Medical Subject Headings (MeSH) IDs. The disease lexicon was obtained from three sources: MeSH, the Disease Ontology and Wikipedia. The Wikipedia dictionary was derived from pages with a disease/symptom box, or those where the page title appeared in the lexicon. Composite entities (e.g. heart and lung disease) were detected and mapped to their composite MeSH IDs. For CIDs, we developed a simple pattern-based system to find relationships within the same sentence. Our system was evaluated in the BioCreative V Chemical–Disease Relation task and achieved very good results for both disease concept ID recognition (F1-score: 86.12%) and CIDs (F1-score: 52.20%) on the test set. As our system was over an order of magnitude faster than other solutions evaluated on the task, we were able to apply the same system to the entirety of MEDLINE allowing us to extract a collection of over 250 000 distinct CIDs.
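A same-sentence, pattern-based relation finder of the kind the abstract describes might look roughly like the sketch below. The two-entry lexicons, the two patterns and the sentence splitter are illustrative assumptions only; they are not LeadMine's dictionaries, grammars or patterns, and no MeSH normalisation is attempted.

```python
import re

# Hypothetical mini-lexicons standing in for the real dictionaries.
CHEMICALS = ["cisplatin", "lithium"]
DISEASES = ["nephrotoxicity", "tremor"]

chem = "(?P<chemical>" + "|".join(map(re.escape, CHEMICALS)) + ")"
dis = "(?P<disease>" + "|".join(map(re.escape, DISEASES)) + ")"

# Illustrative patterns: one with the chemical first, one with the disease first.
PATTERNS = [
    re.compile(chem + r"-induced\s+" + dis, re.IGNORECASE),
    re.compile(dis + r"\s+(?:induced|caused)\s+by\s+" + chem, re.IGNORECASE),
]


def extract_cids(text: str) -> set:
    """Return (chemical, disease) pairs matched by a pattern within one sentence."""
    relations = set()
    # Naive sentence splitting on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for pattern in PATTERNS:
            for match in pattern.finditer(sentence):
                relations.add(
                    (match.group("chemical").lower(), match.group("disease").lower())
                )
    return relations


text = (
    "Cisplatin-induced nephrotoxicity was observed in two patients. "
    "Tremor caused by lithium resolved after dose reduction. "
    "Lithium was well tolerated. No nephrotoxicity occurred."
)
relations = extract_cids(text)
```

Restricting matches to a single sentence means the co-occurrence of "Lithium" and "nephrotoxicity" in adjacent sentences is not reported as a relationship.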




When two are not enough: Lead optimisation beyond matched pairs

January 2015 · 8 Reads

Drug Discovery World

Lead optimisation projects progress by making successive enhancements to one or more starting structures. This is a classic multi-objective optimisation procedure where the goal is not only to improve potency but also to improve physicochemical and absorption, distribution, metabolism and elimination (ADME) properties. For physicochemical and ADME properties, the popular matched molecular pair analysis method has been a successful strategy; however, it notably fails in the goal of improving potency. Here we discuss a lead optimisation approach involving matched series, the extension of matched pairs to more than two R-groups, which can successfully be used to guide molecular design towards improved potency. Furthermore, this approach retains the attractive features of matched pair analysis in that it is entirely driven by experimental data and is a natural fit to the medicinal chemistry approach of designing analogues by successive small changes to an existing molecule.


Using Matched Molecular Series as a Predictive Tool To Optimize Biological Activity

March 2014 · 145 Reads · 56 Citations

Journal of Medicinal Chemistry

A Matched Molecular Series is the general form of a Matched Molecular Pair, and refers to a set of two or more molecules with the same scaffold but different R groups at the same position. We describe Matsy, a knowledge-based method that uses matched series to predict R groups likely to improve activity given an observed activity order for some R groups. We compare the Matsy predictions based on activity data from ChEMBLdb to the recommendations of the Topliss Tree and carry out a large scale retrospective test to measure performance. We show that the basis for predictive success is preferred orders in matched series and that this preference is stronger for longer series. The Matsy algorithm allows medicinal chemists to integrate activity trends from diverse medicinal chemistry programmes and apply them to problems of interest as a Topliss-like recommendation or as a hypothesis generator to aid compound design.
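The idea of predicting from preferred orders can be sketched as follows. This is a simplified illustration under stated assumptions, not the published Matsy algorithm: each matched series is represented as a hypothetical mapping from R group to activity, a database series "matches" when it contains every query R group in the same activity order, and unseen R groups are scored by how often they beat the best query R group in the matching series. All names and values are made up.

```python
from collections import Counter


def matches_order(series: dict, query: list) -> bool:
    """True if the series contains every query R group in the same activity order."""
    if not all(r_group in series for r_group in query):
        return False
    values = [series[r_group] for r_group in query]
    return all(lower < higher for lower, higher in zip(values, values[1:]))


def suggest_r_groups(database: list, query: list) -> list:
    """Rank unseen R groups by the fraction of matching series in which they improve activity."""
    seen, better = Counter(), Counter()
    for series in database:
        if not matches_order(series, query):
            continue
        best = series[query[-1]]  # most active R group in the query order
        for r_group, activity in series.items():
            if r_group in query:
                continue
            seen[r_group] += 1
            if activity > best:
                better[r_group] += 1
    scored = [(r_group, better[r_group] / seen[r_group], seen[r_group]) for r_group in seen]
    # Highest success fraction first, then most observations, then name.
    return sorted(scored, key=lambda item: (-item[1], -item[2], item[0]))


# Hypothetical pIC50 values for four matched series.
database = [
    {"H": 5.0, "F": 5.6, "Cl": 6.1, "Br": 6.5, "OMe": 5.2},
    {"H": 6.0, "F": 6.2, "Cl": 6.9, "Br": 7.3},
    {"H": 7.0, "F": 6.5, "Cl": 6.8, "Br": 7.5},  # order differs from the query
    {"H": 4.8, "F": 5.1, "Cl": 5.9, "OMe": 6.4},
]
query = ["H", "F", "Cl"]  # observed order, least to most active
suggestions = suggest_r_groups(database, query)
```

Each suggestion is a tuple of R group, fraction of matching series in which it was more active than the best query R group, and the number of matching series in which it was observed.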


What compound should I make next? Using Matched Molecular Series for prospective medicinal chemistry

March 2014 · 823 Reads · 1 Citation

Journal of Cheminformatics

A matched molecular series is the general form of a matched molecular pair and refers to a set of two or more molecules with the same scaffold but different R groups at the same position. We describe Matsy, a knowledge-based method that uses matched series to predict R groups likely to improve activity given an observed activity order for some R groups. We compare the Matsy predictions based on activity data from ChEMBLdb to the recommendations of the Topliss tree and carry out a large scale retrospective test to measure performance. We show that the basis for predictive success is preferred orders in matched series and that this preference is stronger for longer series. The Matsy algorithm allows medicinal chemists to integrate activity trends from diverse medicinal chemistry programs and apply them to problems of interest as a Topliss-like recommendation or as a hypothesis generator to aid compound design.


Cheminformatics

November 2012 · 2,047 Reads · 29 Citations

Communications of the ACM

Novel technologies in the life sciences produce information at an accelerating rate, with public data stores (such as the one managed by the European Bioinformatics Institute, http://www.ebi.ac.uk) containing on the order of 10 PB of biological information. For nearly 40 years, the same was not so for chemical information, but in 2004 a large public small-molecule structure repository (PubChem, http://pubchem.ncbi.nlm.nih.gov) was made freely available by the National Library of Medicine (part of the U.S. National Institutes of Health) and soon followed by other databases. Likewise, while many of the foundational algorithms of cheminformatics have been described since the 1950s, open-source software implementing many of them has become accessible only since the mid-1990s.



Citations (31)


... With the aid of structural data and models such as those listed below, the reason for discontinuities can be investigated and evaluated. Finding and explaining activity cliffs with the inclusion of structural data into quantitative structure-activity relationship (QSAR) models, [8,10,11,18-21] SAR transfer between molecular series, [8,12,22] and visualization of activity patterns as well as high-impact sites [8,10] are some of the main applications of the 3D MMP concept. Depending on how 3D similarity is defined, scaffold hopping can be realized as well. ...

Reference:

Exploring Structure-Activity Relationships with Three-Dimensional Matched Molecular Pairs-A Review
Practical Applications of Matched Series Analysis: SAR Transfer, Binding Mode Suggestion and Data Point Validation
  • Citing Article
  • January 2017

... Traditionally, these descriptors were crafted manually, relying on expert knowledge to encode molecular properties into a computer-interpretable vector (6). The most employed vector of molecular representations is Morgan fingerprints, also known as extended-connectivity fingerprints (ECFPs), as these often outperform other types of fingerprints in molecular bioinformatics and virtual screening tasks (7,8). However, ECFPs are not only very high dimensional and sparse but may also suffer from bit collisions introduced by the hashing step (9). On the other hand, most of the ML models developed in the domain of cheminformatics and drug design commonly employ pre-extracted traditional molecular descriptors as input, which contrasts with the core principle of representation learning (10). ...

Comparing structural fingerprints using a literature-based similarity benchmark

Journal of Cheminformatics

... A variety of methods such as intrasentence, machine learning, pattern recognition, rule- and knowledge-based approaches have been employed for relation extraction over the years [6-12]. Protein-protein interaction [12,13], as well as some other related studies, have aimed at extracting relations or identifying relevant concepts [7,13-15]. ...

Efficient chemical-disease identification and relationship extraction using Wikipedia to improve recall

Database

... This is not an utopia in our contemporary society and especially in the social system of chemical knowledge. The disciplinary advantages of openness and data sharing are well documented [405,406], and chemistry is actually moving in the direction of open data, open source, and open standards [407]. While freely available data are at hand, academic agreements can be signed between the industry and research institutions, where the commercial interests of one part are guaranteed while providing access for the purposes of this research programme. ...

Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on

Journal of Cheminformatics

... (9) Å], the S1-O1 bond length is 1.4890 (14) Å, indicating a significant degree of double-bond character, and the C5-S1-C8 bond angle is 97.10 (7)°. These data compare with the corresponding S atom displacement, S-O separation and C-S-C angle of 0.684 (2) Å, 1.489 (6) Å and 96.9 (3)°, respectively, in the centrosymmetric co-crystal of thianthrene 5-oxide with 1,4-di-iodotetrafluorobenzene [47]. The dihedral angle between the C1-C6 ring and the pendant C15-C20 ring in 9 is 85.96 (6)°, and the C2-N1-C13-C14 torsion angle is 154.86 (16)°. ...

Utilizing Sulfoxide···Iodine Halogen Bonding for Cocrystallization
  • Citing Article
  • May 2012

Crystal Growth & Design

... According to the literature, several compounds with 5-methoxyindole moieties possess anti-trypanosomal activity. [30] Examination of the established matched molecular series [31] led to the conclusion that the 5-bromo substituent is indispensable for activity as none of the four prepared derivatives (7 a-d) reduced parasite viability. One could speculate that the bromo substituent at this position forms a halogen bond to the as yet unidentified target or fills a suitable binding pocket. ...

Using Matched Molecular Series as a Predictive Tool To Optimize Biological Activity
  • Citing Article
  • March 2014

Journal of Medicinal Chemistry

... The amount of chemical information is massive and can be in various formats, such as text, diagrams, numbers, chemical symbols, line notations, molecule files, photographs, videos, 2D representations, and 3D models. It is often delivered in a multimodal format [54], which can be communicated at three different levels of chemical information (see Figure 4). The macro level communicates concepts that can be seen. ...

Cheminformatics

Communications of the ACM

... e-Drug3D (Douguet, 2018), and the BindingDB subset of FDA approved drugs (Liu et al., 2007). Beforehand, all drugs present in each database were prepared as follows: removal of duplicate/redundant molecules with the software OpenBabel (O'Boyle et al., 2011) according to their InChI (IUPAC InchI identifier); addition of hydrogens; assignment of bond order and partial charges; and saving final molecules in mol2 extension. Drugs were also processed with FILTER (OpenEye. ...

De novo design of molecular wires with optimal properties for solar energy conversion

Journal of Cheminformatics

... [8-19] Naturally, the accuracy of these calculations needs to be thoroughly evaluated by comparing the computed excitation energies against experimental data. [20-23] For example, Jacquemin and coworkers investigated the performance of various density functional theory methods in calculating the excitation energies of over 100 organic dyes from major chromophore classes. [24] With an integrated design process involving theoretical calculations, machine learning, and experimental validation, highly efficient organic light-emitting diode molecules can be discovered from 1.6 million molecules with external quantum efficiencies up to 22%. [2] In another recent example, a dataset of 48,182 organic semiconductors containing benchmarks of organic molecules was presented, providing relevant electronic properties and demonstrating its potential for repurposing known molecules. ...

Computational Design and Selection of Optimal Organic Photovoltaic Materials
  • Citing Article
  • July 2011

The Journal of Physical Chemistry C

... Data Cleaning and Preprocessing: For the collected carcinogenic molecules, we first removed duplicate molecules to ensure data uniqueness. We then used PubChemPy 33 to retrieve typical SMILES strings 34 and InChI strings 35 for the collected molecular names. Molecules that could not be converted to SMILES were discarded. ...

Towards a Universal SMILES representation - A standard method to generate canonical SMILES based on the InChI

Journal of Cheminformatics