Simon Dobson’s research while affiliated with University of St Andrews and other places


Publications (236)


Revisiting the Application of Machine Learning Approaches in Predicting Aqueous Solubility
  • Article

July 2024 · 14 Reads · 1 Citation

ACS Omega

·

·

Simon Dobson

The solubility of chemical substances in water is a critical parameter in pharmaceutical development, environmental chemistry, agrochemistry, and other fields; however, accurately predicting it remains a challenge. This study aims to evaluate and compare the effectiveness of some of the most popular machine learning modeling methods and molecular featurization techniques in predicting aqueous solubility. Although these methods were not implemented in a competitive environment, some of their performance surpassed previous benchmarks, offering gradual but significant improvements. Our results show that methods based on graph convolution and graph attention mechanisms demonstrated exceptional predictive abilities with high-quality data sets, albeit with a sensitivity to data noise and errors. In contrast, models leveraging molecular descriptors not only provided better interpretability but also showed more resilience when dealing with inherent noise and errors in data. Our analysis of over 4000 molecular descriptors used in various models identified that approximately 800 of these descriptors make a significant contribution to solubility prediction. These insights offer guidance and direction for future developments in solubility prediction.


Fig. 1 (a) Overview of aqueous solubility prediction. (b) log S distribution of training data prepared for regression problems described in Table 1. (c) Class distribution and (d) classification criteria of the imbalanced training data from the 1st EUOS/SLAS Joint Challenge: Compound Solubility.
Fig. 5 Scatter plots of measured versus predicted aqueous solubility for the GCN and GAT models.
Training data prepared for regression problems
Performance of ML algorithms for predicting aqueous solubility.
Machine Learning for Solubility Prediction
  • Preprint
  • File available

November 2023 · 330 Reads · 1 Citation

The solubility of a chemical in water is a critical parameter in drug development and other fields such as environmental chemistry and agrochemistry, but its in silico prediction presents a formidable challenge. Here, we apply a suite of graph-based machine learning algorithms to the benchmark problems posed over several years in international "solubility challenges", and also to our own newly compiled dataset of over 11,000 compounds. We find that graph convolutional networks (GCNs) and graph attention networks (GATs) both show excellent predictive power against these datasets. Although not executed under competition conditions, these approaches achieve better scores in several instances than the best models available at the time. They offer an incremental, but still significant, improvement when compared against a range of existing cheminformatics approaches.



Belief propagation on networks with cliques and chordless cycles

May 2023 · 3 Reads · 3 Citations

PHYSICAL REVIEW E

It is well known that tree-based theories can describe the properties of undirected clustered networks with extremely accurate results [S. Melnik et al., Phys. Rev. E 83, 036112 (2011)]. It is reasonable to suggest that a motif-based theory would be superior to a tree one, since additional neighbor correlations are encapsulated in the motif structure. In this paper, we examine bond percolation on random and real world networks using belief propagation in conjunction with edge-disjoint motif covers. We derive exact message passing expressions for cliques and chordless cycles of finite size. Our theoretical model gives good agreement with Monte Carlo simulation and offers a simple, yet substantial improvement on traditional message passing, showing that this approach is suitable to study the properties of random and empirical networks.


FIG. 1. (From top left) A primary disease has spread over a network of susceptible nodes (grey) to create a giant component of infected hosts (red) at its equilibrium. A seed node (dark green) of degree k = 4 has degree 2 in the RG (uninfected neighbours) and k − l = 2 strain 1 infected neighbours. In the general case, the second strain then spreads on both the RG with transmissibility T2 and over the GCC of strain 1 (light green) with modulated transmissibility T 2 . The light green nodes are coinfected with both strain 1 and strain 2, while the red (dark green) nodes only have strain 1 (2).
FIG. 3. The outbreak fractions for several (T2, T 2 ) combinations of the model all in the absence of clustering. From left to right: (A) complete cross-immunity (T2, T 2 ) = (1, 0) , (B) partial cross-immunity (T2, T 2 ) = (0.6, 0.39) , (C) partial coinfection (T2, T 2 ) = (0.4, 0.7) and (D) perfect coinfection (T2, T 2 ) = (0, 0.6) . Markers are the average of 50 repeats of bond percolation over CCM networks of size N = 65000, α = 2.0 and θ = 0; square markers are strain 1 whilst circles are strain 2. Solid lines are the theoretical results of Eq (21) for strain 2.
FIG. 4. The outbreak fractions for both strains for clustered (Eq 23) and unclustered networks as T1 is varied under two disease couplings: (A) partial cross-immunity (T2, T 2 ) = (0.6, 0.4) and (B) partial coinfection (T2, T 2 ) = (0.4, 0.7) . Simulations are the average of 50 repeats of bond percolation on networks with N = 35000 and θ = 0.0, α = 2.0 and 0.5 for the unclustered and clustered networks, respectively. Solid lines are the theoretical results of Eqs 11 and 21. In general, clustering reduces the extent of plural infections in the network; however, degree assortativity within the contact topology causes a reversal of this at high (low) values of T1 in A (B).
Symbiotic and antagonistic disease dynamics on networks using bond percolation

January 2023 · 28 Reads

In this paper we introduce a novel description of the equilibrium state of a bond percolation process on random graphs using the exact method of generating functions. This allows us to find the expected size of the giant connected component (GCC) of two sequential bond percolation processes in which the bond occupancy probability of the second process is modulated (increased or decreased) by a node being inside or outside of the GCC created by the first process. In the context of epidemic spreading this amounts to either an antagonistic partial immunity or a synergistic partial coinfection interaction between the two sequential diseases. We examine configuration model networks with tunable clustering. We find that the emergent evolutionary behaviour of the second strain is highly dependent on the details of the coupling between the strains. Contact clustering generally reduces the outbreak size of the second strain relative to unclustered topologies; however, positive assortativity induced by clustered contacts inverts this conclusion for highly transmissible disease dynamics.


FIG. 1. Top: A network is covered with edge-disjoint chordless cycles and cliques. Bottom: the factor graph of the network. Figure inspired by Figure 1 of [18]. Such networks can have a high local density of loops that encapsulate the neighbour correlation, becoming increasingly sparse and treelike at long ranges.
FIG. 4. A bond occupancy configuration of a 6-clique in which two vertices do not belong to the giant component (unfilled) whilst the remaining 4 vertices do. Solid edges are occupied whilst dashed edges are unoccupied. The occupation state of the edge linking the two unfilled vertices is inconsequential to the percolation properties of the 4 vertices in the giant component. There are 0.5(3+1)3 = 6 occupied edges among the filled vertices, of which 3 can be set to unoccupied and connectivity retained. There are ω(2) = 8 edges that must be unoccupied if the two unfilled vertices are to remain outside of the giant component.
FIG. 5. A random GCM graph with 2-, 3- and 4-cliques. Solid lines are the results of the message passing calculations from Eq 11 with Eq 49 and Eq 13 with Eq 51. The scatter points are the average of Monte Carlo simulations of bond percolation on the same network. Circles are the average size of the largest cluster, whilst squares are the average finite component sizes.
FIG. 6. a) The neighbourhood of a focal vertex i covered with ordinary edges (green) and triangles (shaded). Only half of the connections between the neighbours are included in the neighbourhood of i due to the edge-disjoint property of the cover. b) When we calculate the cavity equations H i←τ (z) for i, the ordinary edges are included in the product over the motifs that vertex j ∈ τ i belongs to ∏ ν Hj←ν (z); introducing bias and therefore statistical error. c) The factor graph contains short range loops.
FIG. 7. The size of the largest cluster of: (a) a coauthorship network of 13,861 scientists [42] and (b) a network of 10,680 users of the PGP encryption software [43]. The scatter points are the average of Monte Carlo simulation of bond percolation. The dashed (green) line is the result of the message passing equations with the 2-clique cover, whilst the solid (black) line is the result for the MPCC cover and Eq 49.
Belief propagation on networks with cliques and chordless cycles

January 2023 · 41 Reads

It is well known that tree-based theories can describe the properties of undirected clustered networks with extremely accurate results [S. Melnik et al., Phys. Rev. E 83, 036112 (2011)]. It is reasonable to suggest that a motif-based theory would be superior to a tree-based one, since additional neighbour correlations are encapsulated in the motif structure. In this paper we examine bond percolation on random and real world networks using belief propagation in conjunction with edge-disjoint motif covers. We derive exact message passing expressions for cliques and chordless cycles of finite size. Our theoretical model gives good agreement with Monte Carlo simulation and offers a simple, yet substantial improvement on traditional message passing, showing that this approach is suitable to study the properties of random and empirical networks.



FIG. 1. A snapshot of a 2- and 3-clique clustered graph with vertices labelled with their joint degree tuples (s, t).
FIG. 5. The expected size of the GCC for neutral and correlated random networks in k-regular graphs with 2- and 3-clique clustering among vertices with overall degree s + 2t = 7 and fixed clustering coefficients. Curves are the theoretical results of Eq 18.
Mixing patterns in graphs with higher-order structure

October 2022 · 52 Reads

In this paper we examine the percolation properties of higher-order networks that have non-trivial clustering and subgraph-based assortative mixing (the tendency of vertices to connect to other vertices based on subgraph joint degree). Our analytical method is based on generating functions. We also propose a Monte Carlo graph generation algorithm to draw random networks from the ensemble of graphs with fixed statistics. We use our model to understand the effect that network microstructure has, through the arrangement of clustering, on the global properties. Finally, we use an edge disjoint clique cover to represent empirical networks using our formulation, finding the resultant model offers a significant improvement over edge-based theory.


N-strain epidemic model using bond percolation

July 2022 · 6 Reads

PHYSICAL REVIEW E

In this paper we examine the emergent structures of random networks that have undergone bond percolation an arbitrary, but finite, number of times. We define two types of sequential branching processes: a competitive branching process, in which each iteration performs bond percolation on the residual graph (RG) resulting from previous generations, and a collaborative branching process, where percolation is performed on the giant connected component (GCC) instead. We investigate the behavior of these models, including the expected size of the GCC for a given generation, the critical percolation probability, and other topological properties of the resulting graph structures using the analytically exact method of generating functions. We explore this model for Erdős-Rényi and scale-free random graphs. This model can be interpreted as a seasonal N-strain model of disease spreading.
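
The two sequential processes described in this abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' method, which is analytical (generating functions); it is a simulation under stated assumptions: the function names are hypothetical, the largest connected component of a finite graph stands in for the GCC, and each round draws edge occupancy independently with the given probability.

```python
import random
from collections import defaultdict


def erdos_renyi(n, p, rng):
    """Edge list of a G(n, p) random graph on vertices 0..n-1."""
    return [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p]


def bond_percolate(edges, t, rng):
    """Keep each edge independently with occupancy probability t."""
    return [e for e in edges if rng.random() < t]


def largest_component(nodes, edges):
    """Vertex set of the largest connected component (finite-size proxy for the GCC)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, best = set(), set()
    for start in nodes:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:  # depth-first search from start
            x = stack.pop()
            for y in adj[x]:
                if y not in comp:
                    comp.add(y)
                    stack.append(y)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best


def sequential_percolation(n, edges, probs, mode, rng):
    """Run one bond percolation round per entry of probs.

    mode == "competitive":   the next round runs on the residual graph,
                             i.e. the vertices outside the component just formed.
    mode == "collaborative": the next round runs on that component itself.
    Returns the largest-component size found in each round.
    """
    nodes = set(range(n))
    sizes = []
    for t in probs:
        occupied = bond_percolate(edges, t, rng)
        gcc = largest_component(nodes, occupied)
        sizes.append(len(gcc))
        nodes = nodes - gcc if mode == "competitive" else gcc
        # Restrict the contact network to the surviving vertex set.
        edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
    return sizes


if __name__ == "__main__":
    rng = random.Random(1)
    g = erdos_renyi(500, 0.01, rng)
    print(sequential_percolation(500, g, [0.5, 0.5, 0.5], "competitive", rng))
    print(sequential_percolation(500, g, [0.5, 0.5, 0.5], "collaborative", rng))
```

In the competitive mode the per-round sizes can never sum to more than the number of vertices, since each round removes the component it found; in the collaborative mode the sizes are non-increasing, since each round is confined to the previous component.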


Degree correlations in graphs with clique clustering

April 2022 · 19 Reads · 6 Citations

PHYSICAL REVIEW E

Correlations among the degrees of vertices in random graphs often occur when clustering is present. In this paper we define a joint-degree correlation function for vertices in the giant component of clustered configuration model networks which are composed of clique subgraphs. We use this model to investigate, in detail, the organization among nearest-neighbor subgraphs for random graphs as a function of subgraph topology as well as clustering. We find an expression for the average joint degree of a neighbor in the giant component at the critical point for these networks. Finally, we introduce a novel edge-disjoint clique decomposition algorithm and investigate the correlations between the subgraphs of empirical networks.


Citations (77)


... Subsequent systematic comparisons of modern ML techniques, including various GNN architectures, generally confirm these findings. For instance, the best RMSE achieved on the "tight" (easier) SC-2 test set was 0.7 (R² 0.59), attained by a Graph Convolutional Neural Network (GCN) [25]. For the "loose" (harder) SC-2 test set the same model achieved an RMSE of 1.62 (R² 0.35). ...

Reference:

Neural Mulliken Analysis: Molecular Graphs from Density Matrices for QSPR on Raw Quantum-Chemical Data
Revisiting the Application of Machine Learning Approaches in Predicting Aqueous Solubility
  • Citing Article
  • July 2024

ACS Omega

... It addresses task-free, data-incremental scenarios, mitigates catastrophic forgetting with experience replay, and enhances inter-class separation with contrastive loss, achieving strong results on five public datasets. Schiemer et al. introduced OCL-HAR, an online continual learning approach for HAR that addresses challenges such as new class discovery and distribution shifts in streaming sensor data [27]. Their method leverages semisupervised learning, isolation forests for outlier detection, and prototype-based memory replay, achieving up to 0.23 improvements in macro F1 scores compared to state-of-the-art techniques on four public datasets. ...

Online continual learning for human activity recognition
  • Citing Article
  • June 2023

Pervasive and Mobile Computing

... This algorithm is an extension of belief propagation (BP), a heuristic approach in machine learning for estimating posterior probabilities in graphical models. BP has also been studied in statistical physics [111-129]. As stated in [130-132], BP calculates marginal probabilities for individual nodes within a factor graph. ...

Belief propagation on networks with cliques and chordless cycles
  • Citing Article
  • May 2023

PHYSICAL REVIEW E

... Bipartite networks consist of two disjoint sets of nodes and a set of edges between the nodes of different sets, which are translated into Fig. 1 A schematic view of a bipartite network consisting of seven individual nodes (solid spheres) and two group nodes (open squares) (a), and its projection (b). In the projection, two individual nodes sharing at least one adjacent group node in the bipartite networks are connected by an edge to each other generating function method for the generalized configuration model when examining the structure of GC in a random network with arbitrary clique clustering (Mann et al 2022). They observed that GC possesses negative degree correlations for a single-size clique network. ...

Degree correlations in graphs with clique clustering
  • Citing Article
  • April 2022

PHYSICAL REVIEW E

... We detail the synergy of these components via ablation studies. • We compare µDAR's macro-F1 performance against six SOTA UDA algorithms [10,20,5,13,3,15] on four public benchmark datasets [8,1,17,21] covering varying label complexities, dexterity levels (novice vs. expert). µDAR consistently outperforms baselines by ≈ 4-12% over five independent trials per dataset. ...

ContrasGAN: Unsupervised domain adaptation in Human Activity Recognition via adversarial and contrastive learning

Pervasive and Mobile Computing

... The simplest cover is to simply assume each edge is a 2-clique [12], and this approximation certainly works well in some cases. However, given the large body of knowledge for random clustered networks [17, 24-33] it is tempting to apply covers with larger and more complicated motifs in the hope to obtain more accurate models. For sparse random graphs, this technique works well and the success lies in the locally treelike nature of the factor graph of the covered network. ...

Exact formula for bond percolation on cliques
  • Citing Article
  • August 2021

PHYSICAL REVIEW E

Peter Mann · [...] · Simon Dobson

... In this context, the occupancy of edges directly corresponds to the successful transmission of epidemic among individuals, while the emergence of the giant component signifies the percolation threshold, indicating a widespread outbreak. Additionally, percolation models can assess the efficacy of immunization [17-21], providing a solid theoretical foundation for formulating scientific and reasonable prevention and control measures. However, classic percolation theory does not adequately capture the dynamic changes in the network over time. ...

Symbiotic and antagonistic disease dynamics on networks using bond percolation
  • Citing Article
  • August 2021

PHYSICAL REVIEW E

... Clustering coefficient is the ratio of the number of edges between adjacent nodes to the maximum number of edges that can exist between them, and the clustering coefficient of nodes is a number between 0 and 1. In a well-connected network, the clustering coefficient should be high and the number of connected components should be small [27] . In the co-occurrence network constructed in this study (Fig. 1), the average clustering coefficient is 0.666 and the number of connected components is 3, indicating that this network has a nodular structure and good connectivity. ...

Two-pathogen model with competition on clustered networks
  • Citing Article
  • June 2021

PHYSICAL REVIEW E