Figure 1 - uploaded by Romain Azaïs
Construction of the Harris path (right) from 0 to 2n = 14 as the contour of an ordered tree (left) with n = 7 nodes.
Source publication
Tree-structured data naturally appear in various fields, particularly in biology where plants and blood vessels may be described by trees, but also in computer science because XML documents form a tree structure. This paper is devoted to the estimation of the relative scale of ordered trees that share the same layout. The theoretical study is achie...
Contexts in source publication
Context 1
... Galton-Watson tree is the genealogy tree of a population starting from one initial ancestor (the root) in which each individual gives birth to a random number of children according to the same probability distribution, independently of each other. Any ordered tree may be encoded by its Harris path, which returns the height of nodes in depth-first order (see Subsection 2.2, Algorithm 1 and Figure 1). Aldous [2, Theorem 23] stated the following asymptotic property of the Harris path H[τ_n] of a Galton-Watson tree τ_n conditioned on having n nodes, ...
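The encoding described in this context can be illustrated with a minimal Python sketch. It is not the paper's Algorithm 1, which is not reproduced here. The nested-list representation of the tree and the function name `harris_walk` are assumptions made for illustration. The convention that the root sits at height 1 is also an assumption, chosen so that a tree with n nodes gives a walk indexed from 0 to 2n, as in the caption of Figure 1.

```python
from typing import List

# Hypothetical representation: a node is the ordered list of its children,
# so a leaf is [] and the tree is identified with its root.
Tree = List["Tree"]


def harris_walk(tree: Tree) -> List[int]:
    """Heights met along a depth-first contour of an ordered tree.

    Assumed convention: the walk starts at 0, the root has height 1,
    and the walk ends at 0, giving 2n + 1 values for n nodes.
    """
    walk = [0]

    def visit(node: Tree, height: int) -> None:
        walk.append(height)              # step up onto the node
        for child in node:
            visit(child, height + 1)     # explore the child's subtree
            walk.append(height)          # step back down to the node

    visit(tree, 1)
    walk.append(0)                       # final step back to 0
    return walk


# An arbitrary ordered tree with 7 nodes (not necessarily the one in Figure 1).
example = [[], [[], []], [[]]]
print(harris_walk(example))
```

The recursion is kept for readability; a very deep tree would need an explicit stack to stay within Python's recursion limit.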
Context 2
... think that our method may be a complementary tool to this famous technique. Indeed the structure of HTML documents, such as Wikipedia articles, may be encoded by an ordered tree structure (see Figure 11). Furthermore, all the Wikipedia webpages share the same layout and thus can be differentiated by their relative scale. ...
Context 3
... Harris process is then defined as the linear interpolation of the Harris walk (see example in Figure 1). Note that, as displayed in Figure 2, the tree can be recovered from its Harris process, so the correspondence is one-to-one. ...
Context 4
... that, as displayed in Figure 2, the tree can be recovered from its Harris process, so the correspondence is one-to-one. Figure 2: The ordered tree of Figure 1 in its Harris path (left): each vertical axis represents a node of the original structure (right). A common picture that helps to see how to recover the tree from the contour is to imagine putting glue under the contour and then crushing the contour horizontally so that the inner parts of the contour which face each other are glued. ...
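The recovery of the tree from its contour can be sketched as follows. This is an illustrative decoder, not the authors' procedure. It assumes that the walk is given as a list of integer heights with the root at height 1, and that the tree is returned as nested lists of children; the name `tree_from_harris_walk` is hypothetical. Each up step opens a new child of the current node and each down step closes it, which mirrors the gluing picture above.

```python
def tree_from_harris_walk(walk):
    """Rebuild an ordered tree (nested lists of children) from its Harris walk.

    Assumed convention: walk[0] == walk[-1] == 0 and the root has height 1.
    """
    if len(walk) < 3 or walk[0] != 0 or walk[1] != 1 or walk[-1] != 0:
        raise ValueError("not a Harris walk under the assumed convention")

    root = []
    stack = [root]                       # stack[-1] is the node being visited
    # Only the steps between the first and last visit of the root matter.
    for prev, cur in zip(walk[1:-1], walk[2:-1]):
        if cur == prev + 1:              # up step: open a new child
            child = []
            stack[-1].append(child)
            stack.append(child)
        elif cur == prev - 1 and len(stack) > 1:
            stack.pop()                  # down step: close the current node
        else:
            raise ValueError("steps must be +1 or -1 and stay above the root")

    if len(stack) != 1:
        raise ValueError("walk ends before returning to the root")
    return root


walk = [0, 1, 2, 1, 2, 3, 2, 3, 2, 1, 2, 3, 2, 1, 0]
print(tree_from_harris_walk(walk))
```

Since every valid walk yields exactly one tree and every tree yields exactly one walk, the sketch is consistent with the one-to-one correspondence stated above.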
Context 5
... appears to be consistent with the theoretical tolerance intervals given by the central limit theorem. Similar results have been obtained from the Wasserstein method (see Figure 10). ...
Context 6
... is the standard markup language for creating webpages. Documents encoded in a markup language naturally present a tree structure: the area delimited by opening and closing tags represents a node of the tree; the children of this node are given by the tags directly found in this area in the order they appear (see Figure 11 for an example of HTML document and the corresponding ordered tree structure). It should be noted that the ordered tree representing an HTML document does not take into account the text between tags but only the hierarchical structure. ...
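The mapping from markup to an ordered tree can be sketched with Python's standard `html.parser` module. This is an illustration under stated assumptions, not the tooling used in the paper. The markup is assumed to be well nested, nodes are represented as `(tag, children)` tuples under an artificial `#document` root, and the names `TreeBuilder` and `html_to_tree` are hypothetical. Text between tags is ignored, as in the description above.

```python
from html.parser import HTMLParser

# HTML void elements never have a closing tag, so they are always leaves.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Collects only the tag hierarchy; text and attributes are discarded."""

    def __init__(self):
        super().__init__()
        self.root = ("#document", [])    # artificial root (assumption)
        self.stack = [self.root]         # stack[-1] is the currently open node

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)   # children keep their document order
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumes well-nested markup: only close the tag that is open.
        if tag not in VOID and len(self.stack) > 1 and self.stack[-1][0] == tag:
            self.stack.pop()


def html_to_tree(source):
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root


print(html_to_tree("<div><p>a<br>b</p><ul><li>x</li><li>y</li></ul></div>"))
```

Real webpages often contain unclosed or misnested tags; a more tolerant parser would be needed there, and this sketch makes no attempt to repair such input.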
Context 7
... no revision has been found during this period, λ_ls is equal to the estimate of the previous month, and recursively. Figure 12 displays the evolution of λ_ls over time. First, we remark two spikes (a), negative in May 2007, and (b), positive in May 2016. ...
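The rule for months without a revision amounts to carrying the last available estimate forward. A minimal sketch follows, assuming the monthly estimates are stored in a list where `None` marks a month without a revision; the function name `carry_forward` and the numerical values are hypothetical and are not data from the paper.

```python
def carry_forward(monthly):
    """Replace each missing monthly estimate by the most recent earlier one.

    Months before the first available estimate stay None.
    """
    filled = []
    last = None
    for value in monthly:
        if value is not None:
            last = value                 # a revision exists: use its estimate
        filled.append(last)              # otherwise reuse the previous month
    return filled


# Made-up values for illustration only.
print(carry_forward([None, 1.0, None, None, 1.2]))
```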
Context 8
... these spikes correspond to massive vandalism of the article on May 7 2007 (addition of 720 pointless sections with random text) and May 23 2016 (complete deletion of the article) by malicious people. Indeed, if we do not consider these two vandalized webpages in our estimation, we obtain the graph of Figure 13 (left) that has no spikes. In addition, we observe in Figure 13 (left) that the time series of λ_ls has roughly two regimes (c) and (d). ...
Context 9
... if we do not consider these two vandalized webpages in our estimation, we obtain the graph of Figure 13 (left) that has no spikes. In addition, we observe in Figure 13 (left) that the time series of λ_ls has roughly two regimes (c) and (d). The first period (c) corresponds to the "running in" required to find the adequate structure of the article. ...
Context 10
... a good structure arises, the webpage is then slowly broadened during the second regime (d). It should be remarked in Figure 13 (right) that two important modifications occur in the period (d): (e) between July 2013 and April 2014 and (f) in February 2016. The (e) period is related to major changes in the webpage (mainly addition of references and reorganization of some sections) especially following advances in this field. ...
Context 11
... perform the same methodology on the history of the Wikipedia article Chocolate 3 (see Figure 14). This article has been edited 6332 times by 3105 Wikipedians since its creation on November 13 2001 (information acquired on August 11 2016). ...
Context 12
... article has been edited 6332 times by 3105 Wikipedians since its creation on November 13 2001 (information acquired on August 11 2016). All the spikes observed on the graph of Figure 14 correspond to acts of vandalism (deletion of substantial content). For the sake of example, we highlight two major events (a) (in May 2008) and (b) (in June 2010) occurring during the "running in" period (c): (a) corresponds to important additions in the article (sections Etymology, Holydays and Manufacturers have been added), while (b) is related to the creation of the parallel article Health effects of chocolate leading to deletion of the corresponding sections in the main article. ...
Similar publications
In this paper, we describe the construction of TeKnowbase, a knowledge-base of technical concepts in computer science. Our main information sources are technical websites such as Webopedia and Techtarget as well as Wikipedia and online textbooks. We divide the knowledge-base construction problem into two parts -- the acquisition of entities and the...
Citations
... Interpreting 0 as +1 and 1 as −1, we can read this sequence as an excursion (i.e. a walk that comes back to the origin) in Z, starting at 0. This walk also draws the graph of a function, which is called the Harris path of the tree [50,51] ([50]: Aldous (1993), 'The continuum random tree III'). In the case of unordered trees, this tuple is not unique (except in pathological cases). In the example of Figure 2.3, if we swap the nodes of depth 1 to place the leaf between its two siblings (or after them), we obtain a different tuple for the tree. ...
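The reading of a binary sequence as a walk can be sketched in a few lines of Python. The function names are hypothetical, and the input is assumed to be a list of 0s and 1s. The non-negativity check is an added assumption: the quoted text only requires the walk to return to the origin, but a walk that encodes a tree should also never go below 0.

```python
from itertools import accumulate


def walk_from_bits(bits):
    """Read 0 as a +1 step and 1 as a -1 step, starting from 0."""
    if any(b not in (0, 1) for b in bits):
        raise ValueError("expected a sequence of 0s and 1s")
    steps = [1 if b == 0 else -1 for b in bits]
    return [0] + list(accumulate(steps))


def is_nonnegative_excursion(walk):
    """True if the walk starts and ends at 0 and never goes below 0."""
    return walk[0] == 0 and walk[-1] == 0 and all(h >= 0 for h in walk)


walk = walk_from_bits([0, 0, 1, 0, 1, 1])
print(walk, is_nonnegative_excursion(walk))
```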
Tree data appear naturally in many scientific domains. Their intrinsically non-Euclidean nature and the combinatorial explosion phenomenon make their analysis delicate. In this thesis, we focus on three approaches to compare trees, notably through the prism of a lossless compression technique of trees into directed acyclic graphs. First, concerning tree isomorphism, we consider an extension of the classical definition to labeled trees, which requires that trees are identical up to label rewriting. This problem is as hard as graph isomorphism, and we have developed an algorithm that drastically reduces the size of the solution search space, which is then explored with a backtracking strategy. When two trees are different, we may try to find common substructures. While this question has already been addressed for subtrees, we are interested in a broader problem, namely finding sets of subtrees appearing simultaneously. This leads us to consider forest enumeration, for which we propose a reverse search algorithm that constructs an enumeration tree whose branching factor is linear. Finally, from a list of common substructures, one can build a convolution kernel that makes it possible to tackle classification problems. We consider the subtree kernel from the literature, and build an algorithm that explicitly enumerates subtrees (unlike the original method). In particular, our approach allows us to parameterize the kernel more finely, significantly improving its classification abilities.
... treex offers converters to the standard encoding of nested brackets (see for instance (Aho, Hopcroft, & Ullman, 1974)) and L-strings as manipulated by L-Py, a simulation framework for modeling plant architectures (Boudon, Pradal, Cokelaer, Prusinkiewicz, & Godin, 2012). Numerical experiments and/or figures of recent publications (Azais, Genadot, & Henry, 2019; Azais, 2017) have been made using the current or previous versions of treex. Furthermore, ongoing academic projects on the development and implementation of supervised classification methods for tree data, the study of lineage trees, as well as investigations on plant modeling, make intensive use of structures and algorithms implemented in treex. ...
... Several approaches have been considered in the literature to deal with this kind of data: edit distances between unordered or ordered trees (see [6] and the references therein), coding processes for ordered trees [24], with a special focus on conditioned Galton-Watson trees [3,5]. One can also mention the approach developed in [29]. ...
... By virtue of the previous lemma, one can derive the following result on the quantity ∆ i x defined by (3). ...
Tree data are ubiquitous because they model a large variety of situations, e.g., the architecture of plants, the secondary structure of RNA, or the hierarchy of XML files. Nevertheless, the analysis of these non-Euclidean data is difficult per se. In this paper, we focus on the subtree kernel, a convolution kernel for tree data introduced by Vishwanathan and Smola in the early 2000s. More precisely, we investigate the influence of the weight function from a theoretical perspective and in real data applications. We establish on a two-class stochastic model that the performance of the subtree kernel is improved when the weight of leaves vanishes, which motivates the definition of a new weight function, learned from the data and not fixed by the user as is usually done. To this end, we define a unified framework for computing the subtree kernel from ordered or unordered trees, which is particularly suitable for tuning parameters. We show through two real data classification problems the great efficiency of our approach, in particular with respect to the ones considered in the literature, which also confirms the high importance of the weight function. Finally, a visualization tool of the significant features is derived.