Project

Data analysis

Goal: To develop mathematical and statistical tools to extract knowledge from data.


Project log

Guillermo Restrepo
added a research item
Chemical space entails substances endowed with a notion of nearness that comes in two flavours: similarity and synthetic reachability. What is the maximum size for the chemical space? Is there an upper bound for its classes of similar substances? How many substances and reactions can it house? Can we store these features of the chemical space? Here I address these questions and show that the physical universe does not suffice to store the chemical one embodied in the chemical space. By analysing the historical evolution of the space as recorded by chemists over the centuries, I show that it has been mainly expanded by synthesis of organic compounds and unfolds at an exponential rate, doubling its number of substances every 16 years. At the turn of the 20th century it left behind an expansion period driven by reactions and entered the current era ruled by substance discovery, which often relies on a few starting materials and reaction classes. Extrapolating from these historical trends, synthesising a large set of affordable chemicals in the foreseeable future would require trebling the historically stable rate of discovery of new chemicals. Likewise, creating a database of failed reactions accounting for 25% of the known chemical space to assist the artificial intelligence expansion of the space could be afforded if the synthetic efforts of the coming five years are entirely dedicated to this task. Finally, I discuss hypergraph reaction models to estimate the future shape of the network underlying the chemical space.
Guillermo Restrepo
added 3 research items
The collection of every species reported to date constitutes the so-called Chemical Space (CS). This space currently comprises well over 30 million substances and is growing exponentially [2]. In order to characterize this ever-growing space, chemists look for similarities among substances in the CS based on the way they combine [3]. Mendeleev's work on the chemical elements, based upon his knowledge of the CS by 1869, is perhaps the most famous example of how the CS determines similarity relations [4]. From a contemporary point of view, Network Theory serves as a natural framework to identify these kinds of relational patterns in the CS [5]. Nowadays, databases such as Reaxys [6] have grown to a point where they can be taken as proxies for the whole CS, opening the possibility of analyzing it from a data-driven perspective. In this work we propose to study the similarity of chemical elements according to the compounds they form. From each compound, we deleted each element to obtain a formula that is connected to the deleted element, e.g. S_{1/2}O_{4/2}, Na_{2/1}O_{4/1} and Na_{2/4}S_{1/4} are the formulae coming from Na_2SO_4 (sodium sulfate) where Na, S and O have been deleted, respectively. This forms a bipartite graph of elements and the formulae from which they have been deleted. We build our network using 26,206,663 compounds recorded in Reaxys up to 2015. Similarity among chemical elements is constructed analogously to Social Network Analysis, where actors are declared similar whenever they are connected to the same set of other actors. The more formulae elements share, the more similar they are. We introduce a new notion of in-betweenness of elements acting as mediators in the similarity relations of others. We analyze the structural features of this network and how they are affected by node removal. We show that the network is both highly dense and redundant. Even though it is heavily centralized, similarity relations are spread across a wide range of formulae, which grants the network extraordinary structural resilience, even against directed attack. We discuss some implications of these results for chemistry.
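The element-deletion construction described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names (`deleted_formulae`, `build_bipartite`, `shared_formulae`), the dictionary representation of a compound and the toy compound list are all assumptions made for the example. Each stoichiometric count is divided by the count of the deleted element, as in the Na_2SO_4 example.

```python
from collections import defaultdict
from fractions import Fraction


def deleted_formulae(compound):
    """Yield (element, formula) pairs for one compound.

    `compound` maps element symbols to stoichiometric counts,
    e.g. {"Na": 2, "S": 1, "O": 4} for Na2SO4 (hypothetical representation).
    Deleting an element divides the remaining counts by its own count.
    Note that Fraction reduces ratios, so 4/2 is stored as 2.
    """
    for element, count in compound.items():
        formula = tuple(sorted(
            (other, Fraction(n, count))
            for other, n in compound.items()
            if other != element
        ))
        yield element, formula


def build_bipartite(compounds):
    """Map each element to the set of formulae it is connected to."""
    neighbours = defaultdict(set)
    for compound in compounds:
        for element, formula in deleted_formulae(compound):
            neighbours[element].add(formula)
    return neighbours


def shared_formulae(neighbours, a, b):
    """Number of formulae shared by elements a and b (a crude similarity)."""
    return len(neighbours[a] & neighbours[b])


# Toy data, chosen only for illustration.
compounds = [
    {"Na": 2, "S": 1, "O": 4},  # Na2SO4
    {"K": 2, "S": 1, "O": 4},   # K2SO4
    {"Na": 1, "Cl": 1},         # NaCl
    {"K": 1, "Cl": 1},          # KCl
]
network = build_bipartite(compounds)
print(shared_formulae(network, "Na", "K"))
print(shared_formulae(network, "Na", "S"))
```

In this toy data, Na and K are attached to exactly the same formulae, whereas Na and S share none, which mirrors the idea that elements sharing more formulae are more similar.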
Hypergraphs serve as models of complex networks that capture more general structures than binary relations. For graphs, a wide array of statistics has been devised to gauge different aspects of their structures. Hypergraphs lag behind in this respect. The Forman–Ricci curvature is a statistic for graphs based on Riemannian geometry, which stresses the relational character of vertices in a network by focusing on the edges rather than on the vertices. Despite many successful applications of this measure to graphs, Forman–Ricci curvature has not been introduced for hypergraphs. Here, we define the Forman–Ricci curvature for directed and undirected hypergraphs such that the curvature for graphs is recovered as a special case. It quantifies the trade-off between hyperedge (arc) size and the degree of participation of hyperedge (arc) vertices in other hyperedges (arcs). We determine upper and lower bounds for Forman–Ricci curvature both for hypergraphs in general and for graphs in particular. The measure is then applied to two large networks: the Wikipedia vote network and the metabolic network of the bacterium Escherichia coli. In the first case, the curvature is governed by the size of the hyperedges, while in the second example, it is dominated by the hyperedge degree. We found that the number of users involved in Wikipedia elections goes hand-in-hand with the participation of experienced users. The curvature values of the metabolic network allowed detecting redundant and bottleneck reactions. It is found that ADP phosphorylation is the metabolic bottleneck reaction but that the reverse reaction is not similarly central for the metabolism. Furthermore, we show the utility of the Forman–Ricci curvature for quantification of assortativity in hypergraphs and illustrate the idea by investigating three metabolic networks.
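To make the trade-off between hyperedge size and vertex participation concrete, here is a minimal sketch for undirected, unweighted hypergraphs. It assumes the curvature of a hyperedge e is the sum over its vertices of (2 - deg(v)), which reduces to the usual unweighted graph expression 4 - deg(u) - deg(v) when every hyperedge has two vertices. That formula and the function names are assumptions for illustration; the abstract does not spell out the definition, and the directed case is not covered here.

```python
from collections import Counter


def vertex_degrees(hyperedges):
    """Count in how many hyperedges each vertex participates."""
    degrees = Counter()
    for edge in hyperedges:
        for vertex in edge:
            degrees[vertex] += 1
    return degrees


def forman_ricci(hyperedges):
    """Curvature per hyperedge, assuming F(e) = sum over v in e of (2 - deg(v)).

    Large hyperedges push the value up; vertices that take part in
    many other hyperedges push it down.
    """
    degrees = vertex_degrees(hyperedges)
    return [sum(2 - degrees[v] for v in edge) for edge in hyperedges]


# A path graph a-b-c: both edges have curvature 4 - 1 - 2 = 1.
path = [{"a", "b"}, {"b", "c"}]
print(forman_ricci(path))

# A small hypergraph: one hyperedge of size 3 and one of size 2 sharing "c".
hyper = [{"a", "b", "c"}, {"c", "d"}]
print(forman_ricci(hyper))
```

Under this assumed definition, strongly negative values flag hyperedges whose vertices are heavily reused elsewhere, which is the kind of signal the abstract associates with bottleneck reactions.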
The periodic system arose from knowledge about substances, which constitute the chemical space. Despite the importance of this interplay, little is known about how the expanding space affected the system. Here we show, by analysing the space between 1800 and 1869, how the periodic system evolved until its formulation. We found that after an unstable period culminating around 1826, the system began to converge to a backbone structure, unveiled in the 1860s, which was already clearly evident in the 1840s. Hence, contrary to the belief that the "ripe moment" to formulate the system was in the 1860s, it was in the 1840s. The evolution of the system is marked by the rise of organic chemistry in the first quarter of the nineteenth century, which prompted the recognition of relationships among main group elements and obscured some of those among transition metals, which explains why the formulators of the periodic system struggled to accommodate them. We also introduced an algorithm to adjust the chemical space according to different sets of atomic weights, which allowed for estimating the resulting periodic systems of chemists using one or another set of nineteenth-century atomic weights. These weights produce orderings of the elements very similar to that of 1869, while providing different similarity relationships among the elements, therefore producing different periodic systems. By analysing these systems, from Dalton up to Mendeleev, we found that Gmelin's atomic weights of 1843 produce systems remarkably similar to that of 1869, a similarity that was reinforced by the atomic weights of the years to come.
Wilmer Leal
added 2 research items
Networks encoding symmetric binary relations between pairs of elements are mathematically represented by (undirected) graphs. Graph theory is a well-developed mathematical subject, but empirical networks are typically less regular and also often much larger than the graphs that are mathematically best understood. Several quantities have therefore been introduced to characterize the large-scale behavior or to identify the most important vertices in empirical networks. As the crucial structure of a graph is, however, given by the set of its edges rather than by its vertices, we should systematically define and evaluate quantities assigned to the edges rather than to the vertices. Curvature is a notion originally introduced in the context of smooth Riemannian manifolds to measure local or global deviation of a manifold from being Euclidean. Ricci curvature specifically, as a local measure, provides relatively broad information about the structure of positively curved manifolds. Therefore, there have been several attempts to discretize curvature notions to other settings such as cell complexes, graphs and undirected hypergraphs in order to obtain similar results. Through these discretizations, some of the analytical or topological properties of the original smooth curvatures have been transferred to these discrete spaces. For the directed hypergraph case, these curvatures were introduced recently and very little is known about their descriptive power. In this paper, we first present the results of our discretizations of Forman-Ricci and Ollivier-Ricci curvature notions; then, we show that they are powerful tools for exploring local properties of directed hypergraph motifs. To conclude, we carry out a curvature-based analysis of the metabolic network of E. coli.
The relations, rather than the elements, constitute the structure of networks. We therefore develop a systematic approach to the analysis of networks, modelled as graphs or hypergraphs, that is based on structural properties of (hyper)edges, instead of vertices. For that purpose, we utilize so-called network curvatures. These curvatures quantify the local structural properties of (hyper)edges, that is, how, and how well, they are connected to others. In the case of directed networks, they assess the input they receive and the output they produce, and relations between them. With those tools, we can investigate biological networks. As examples, we apply our methods here to protein-protein interaction, transcriptional regulatory and metabolic networks.
Wilmer Leal
added a research item
Relationships in real systems are often not binary, but of a higher order, and therefore cannot be faithfully modelled by graphs, but rather need hypergraphs. In this work, we systematically develop formal tools for analyzing the geometry and the dynamics of hypergraphs. In particular, we show that Ricci curvature concepts, inspired by the corresponding notions of Forman and Ollivier for graphs, are powerful tools for probing the local geometry of hypergraphs. In fact, these two curvature concepts complement each other in the identification of specific connectivity motifs. In order to have a baseline model with which we can compare empirical data, we introduce a random model to generate directed hypergraphs and study properties such as degree of nodes and edge curvature, using numerical simulations. We can then see how our notions of curvature can be used to identify connectivity patterns in the metabolic network of E. coli that clearly deviate from those of our random model. Specifically, by applying hypergraph shuffling to this metabolic network we show that changes in the wiring of a hypergraph can be detected by the Forman-Ricci and Ollivier-Ricci curvatures.
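As a rough illustration of what a random baseline for directed hypergraphs can look like, the sketch below draws each arc as a (tail, head) pair of vertex sets, with every vertex entering the tail and the head independently with probability p. This sampling rule, the function names and the parameters are assumptions made for the example; the abstract does not specify the authors' random model, so this should not be read as a reproduction of it.

```python
import random


def random_directed_hypergraph(n_vertices, n_arcs, p, seed=None):
    """Sample directed hyperarcs as (tail, head) pairs of vertex sets.

    Assumed rule: each vertex joins the tail, and independently the head,
    with probability p. Arcs with an empty tail or head are redrawn.
    """
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    rng = random.Random(seed)
    arcs = []
    while len(arcs) < n_arcs:
        tail = frozenset(v for v in range(n_vertices) if rng.random() < p)
        head = frozenset(v for v in range(n_vertices) if rng.random() < p)
        if tail and head:
            arcs.append((tail, head))
    return arcs


def in_out_degrees(arcs, n_vertices):
    """Out-degree counts tail memberships; in-degree counts head memberships."""
    out_deg = [0] * n_vertices
    in_deg = [0] * n_vertices
    for tail, head in arcs:
        for v in tail:
            out_deg[v] += 1
        for v in head:
            in_deg[v] += 1
    return in_deg, out_deg


arcs = random_directed_hypergraph(n_vertices=20, n_arcs=50, p=0.15, seed=1)
in_deg, out_deg = in_out_degrees(arcs, 20)
print(sum(in_deg) / 20, sum(out_deg) / 20)  # mean in- and out-degree
```

Degree and curvature statistics computed on many such samples could then serve as the reference against which an empirical network is compared.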
Wilmer Leal
added 3 research items
For more than 150 years the structure of the periodic system of the chemical elements has intensively motivated research in different areas of chemistry and physics. However, there is still no unified picture of what a periodic system is. Herein, based on the relations of order and similarity, we report a formal mathematical structure for the periodic system, which corresponds to an ordered hypergraph. It is shown that the current periodic system of chemical elements is an instance of the general structure. The definition is used to devise a tailored periodic system of polarizability of single covalent bonds, where order relationships are quantified within subsets of similar bonds and among these classes. The generalised periodic system allows envisioning periodic systems in other disciplines of science and humanities.
For more than 150 years, the structure of the periodic system of the chemical elements has intensively motivated research in different areas of chemistry and physics. However, there is still no unified picture of what a periodic system is. Herein, based on the relations of order and similarity, we report a formal mathematical structure for the periodic system, which corresponds to an ordered hypergraph. It is shown that the current periodic system of chemical elements is an instance of the general structure. The definition is used to devise a tailored periodic system of polarizability of single covalent bonds, where order relationships are quantified within subsets of similar bonds and among these classes. The generalized periodic system allows envisioning periodic systems in other disciplines of science and humanities.
Meyer and Mendeleev arrived at their periodic systems by classifying and ordering the elements known by about 1869. Order and similarity were based on knowledge of chemical compounds, which together constitute the chemical space by 1869. Despite its importance, very little is known about the size and diversity of this space and even less is known about its influence upon Meyer's and Mendeleev's periodic systems. Here we show, by analysing 11,484 substances reported in the scientific literature up to 1869 and stored in the Reaxys database, that 80% of the space was accounted for by 12 elements, oxygen and hydrogen being those with the most compounds. We found that the space included more than 2,000 combinations of elements, of which 5%, made of organogenic elements, gathered half of the substances of the space. By exploring the temporal report of compounds containing typical molecular fragments, we found that Meyer's and Mendeleev's available chemical space had a balance of organic, inorganic and organometallic compounds, which was, after 1830, drastically overpopulated by organic substances. The size and diversity of the space show that knowledge of organogenic elements sufficed to have a panoramic idea of the space. We determined similarities among the 60 elements known by 1869 taking into account the resemblance of their combinations and we found that Meyer's and Mendeleev's similarities for the chemical elements agree to a large extent with the similarities allowed by the chemical space.
Wilmer Leal
added a research item
Chemical research unveils the structure of chemical space, spanned by all chemical species, as documented in more than 200 years of scientific literature, now available in electronic databases. Very little is known, however, about the large-scale patterns of this exploration. Here we show, by analyzing millions of reactions stored in the Reaxys database, that chemists have reported new compounds in an exponential fashion from 1800 to 2015 with a stable 4.4% annual growth rate, in the long run neither affected by World Wars nor affected by the introduction of new theories. Contrary to general belief, synthesis has been the means to provide new compounds since the early 19th century, well before Wöhler's synthesis of urea. The exploration of chemical space has followed three statistically distinguishable regimes. The first one included uncertain year-to-year output of organic and inorganic compounds and ended about 1860, when structural theory gave way to a century of more regular and guided production, the organic regime. The current organometallic regime is the most regular one. Analyzing the details of the synthesis process, we found that chemists have had preferences in the selection of substrates and we identified the workings of such a selection. Regarding reaction products, the discovery of new compounds has been dominated by very few elemental compositions. We anticipate that the present work serves as a starting point for more sophisticated and detailed studies of the history of chemistry.
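A stable annual growth rate translates directly into a doubling time, which gives a feel for what 4.4% per year means. The short sketch below is plain compound-growth arithmetic and is not taken from the paper; the function names are chosen for the example.

```python
import math


def doubling_time(annual_growth_rate):
    """Years needed to double under constant compound annual growth."""
    return math.log(2) / math.log(1 + annual_growth_rate)


def growth_factor(annual_growth_rate, years):
    """Multiplicative increase after the given number of years."""
    return (1 + annual_growth_rate) ** years


rate = 0.044  # the 4.4% annual growth rate quoted above
print(round(doubling_time(rate), 1))      # about 16 years
print(round(growth_factor(rate, 16), 2))  # close to 2
```

At this rate the number of reported compounds doubles roughly every 16 years.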
Wilmer Leal
added a research item
In contrast to graph-based models for complex networks, hypergraphs are more general structures going beyond the binary relations of graphs. For graphs, statistics gauging different aspects of their structures have been devised and there is ongoing research on devising them for hypergraphs. Forman-Ricci curvature is a statistic for graphs, which is based on Riemannian geometry and stresses the relational character of vertices in a network through the analysis of edges rather than vertices. In spite of the different applications of this curvature, it has not yet been formulated for hypergraphs. Here we devise the Forman-Ricci curvature for directed and undirected hypergraphs, where the curvature for graphs is a particular case. We report its upper and lower bounds and the respective bounds for the graph case. The curvature quantifies the trade-off between hyperedge (arc) size and the degree of participation of hyperedge (arc) vertices in other hyperedges (arcs). We calculated the curvature for two large networks: the Wikipedia vote network and the Escherichia coli metabolic network. In the first case the curvature is ruled by hyperedge size, while in the second by hyperedge degree. We found that the number of users involved in Wikipedia elections goes hand-in-hand with the participation of experienced users. The curvature values of the metabolic network allowed detecting redundant and bottleneck reactions. It is found that ADP phosphorylation is the metabolic bottleneck reaction but that the reverse reaction is not that central for the metabolism.
Guillermo Restrepo
added a research item
Background: Hierarchical cluster analysis (HCA) is a widely used classificatory technique in many areas of scientific knowledge. Applications usually yield a dendrogram from an HCA run over a given data set, using a grouping algorithm and a similarity measure. However, even when such parameters are fixed, ties in proximity (i.e. two clusters equidistant from a third one) may produce several different dendrograms, having different possible clustering patterns (different classifications). This situation is usually disregarded and conclusions are based on a single result, leading to questions concerning the permanence of clusters in all the resulting dendrograms; this happens, for example, when using HCA for grouping molecular descriptors to select the least similar ones in QSAR studies. Results: Representing dendrograms in graph theoretical terms allowed us to introduce four measures of cluster frequency in a canonical way, and use them to calculate cluster frequencies over the set of all possible dendrograms, taking all ties in proximity into account. A toy example of well-separated clusters was used, as well as a set of 1666 molecular descriptors calculated for a group of molecules having hepatotoxic activity, to show how our functions may be used for studying the effect of ties in HCA. Such functions were not restricted to the tie case; the possibility of using them to derive cluster stability measurements on arbitrary sets of dendrograms having the same leaves is discussed, e.g. dendrograms from variations of HCA parameters. It was found that ties occurred frequently, some yielding tens of thousands of dendrograms, even for small data sets. Conclusions: Our approach was able to detect trends in clustering patterns by offering a simple way of measuring their frequency, which is often very low. This implies that inferences and models based on descriptor classifications (e.g. QSAR) are likely to be biased, thereby requiring an assessment of their reliability. Moreover, any classification of molecular descriptors is likely to be far from unique. Our results highlight the need for evaluating the effect of ties on clustering patterns before classification results can be used accurately.
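The effect of ties can be seen on a very small example. The sketch below enumerates every dendrogram that agglomerative clustering can produce when all tied merges are followed, and then reports the fraction of dendrograms in which each cluster appears. It is a simplified illustration under stated assumptions: complete linkage is chosen arbitrarily, a dendrogram is represented only by its set of merged clusters, the plain frequency used here is not one of the four measures introduced in the paper, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def complete_link(a, b, dist):
    """Complete-linkage distance: largest pairwise distance between clusters."""
    return max(dist[frozenset((i, j))] for i in a for j in b)


def all_dendrograms(labels, dist):
    """Enumerate dendrograms, branching whenever the closest pair is tied.

    Each dendrogram is returned as a frozenset of its merged clusters.
    Ties are detected by exact equality; floating-point data would need
    a tolerance instead.
    """
    results = set()

    def recurse(clusters, formed):
        if len(clusters) == 1:
            results.add(formed)
            return
        pairs = list(combinations(clusters, 2))
        scores = [complete_link(a, b, dist) for a, b in pairs]
        best = min(scores)
        for (a, b), score in zip(pairs, scores):
            if score == best:
                merged = a | b
                rest = [c for c in clusters if c != a and c != b]
                recurse(rest + [merged], formed | {merged})

    recurse([frozenset([x]) for x in labels], frozenset())
    return results


def cluster_frequencies(dendrograms):
    """Fraction of dendrograms that contain each cluster."""
    counts = Counter(c for d in dendrograms for c in d)
    return {c: k / len(dendrograms) for c, k in counts.items()}


# Three points on a line at positions 0, 1 and 2: d(a,b) = d(b,c) = 1.
dist = {
    frozenset(("a", "b")): 1,
    frozenset(("b", "c")): 1,
    frozenset(("a", "c")): 2,
}
trees = all_dendrograms(["a", "b", "c"], dist)
for cluster, freq in cluster_frequencies(trees).items():
    print(sorted(cluster), freq)
```

Here the single tie already yields two dendrograms, and the clusters {a, b} and {b, c} each appear in only half of them. The enumeration grows exponentially with the number of ties, so this brute-force approach is only practical for toy data.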
Guillermo Restrepo
added a project goal
To develop mathematical and statistical tools to extract knowledge from data.