Leen Stougie’s research while affiliated with Vrije Universiteit Amsterdam and other places


Publications (251)


$\text{GFSS}(w,k,S_k)$
The complete de Bruijn graph $G_k=(V_k,E_k)$ of order $k=4$ over $\Sigma=\{\texttt{a},\texttt{b}\}$; forbidden edges are in red
The graph G(D, E) after we have computed the source node marked s and the non-sink nodes marked s, 1, 2. All other nodes, marked with double circle, are sink nodes. The shortest path from source node to any sink node (the node marked e) is then $\texttt{bb}$, which gives the solution $x=\texttt{aab}\cdot\texttt{bb}\cdot\texttt{aba}$ ($\cdot$ denotes concatenation)
Edit distance and $\mathcal{L}_k$ distance, with $k\in[6,10]$, for each pair of strings in the Influenza dataset. The gap between both distance measures at string pair ID 400 is due to the underlying nature of the Influenza dataset. This dataset is comprised of five virus subtypes (H1N1, H2N2, H7N3, H7N9, H5N1). Sequence pairs within the same subtype or between closely related subtypes are highly similar (e.g., the pair with ID 618 comprised of two sequences of H2N2, or the pair with ID 419 comprised of one sequence in H1N1 and another in H5N1), whereas sequence pairs spanning other subtypes are not that similar (e.g., the pair with ID 26 comprised of one sequence in H1N1 and another in H2N2). Indeed, this is captured by both the edit distance and the $\mathcal{L}_k$ distance, $k\in[6,10]$, which have similar trends
NMI and ARI for varying number $|S_k|$ of forbidden patterns, for: (a, b) News and (c, d) WebKb. The NMI and ARI values, as well as $|S_k|$ and the number d of occurrences of forbidden patterns, are averages over 10 runs. On the top of each pair of bars, we plot the p-value of a t-test; $p<0.05$ implies that the difference between ETFS and SFSS is statistically significant


Missing value replacement in strings and applications
  • Article
  • Full-text available

January 2025 · 26 Reads · Data Mining and Knowledge Discovery

Chang Liu · [...] · Michelle Sweering

Missing values arise routinely in real-world sequential (string) datasets due to: (1) imprecise data measurements; (2) flexible sequence modeling, such as binding profiles of molecular sequences; or (3) the existence of confidential information in a dataset which has been deleted deliberately for privacy protection. In order to analyze such datasets, it is often important to replace each missing value, with one or more valid letters, in an efficient and effective way. Here we formalize this task as a combinatorial optimization problem: the set of constraints includes the context of the missing value (i.e., its vicinity) as well as a finite set of user-defined forbidden patterns, modeling, for instance, implausible or confidential patterns; and the objective function seeks to minimize the number of new letters we introduce. Algorithmically, our problem translates to finding shortest paths in special graphs that contain forbidden edges representing the forbidden patterns. Our work makes the following contributions: (1) we design a linear-time algorithm to solve this problem for strings over constant-sized alphabets; (2) we show how our algorithm can be effortlessly applied to fully sanitize a private string in the presence of a set of fixed-length forbidden patterns [Bernardini et al. 2021a]; (3) we propose a methodology for sanitizing and clustering a collection of private strings that utilizes our algorithm and an effective and efficiently computable distance measure; and (4) we present extensive experimental results showing that our methodology can efficiently sanitize a collection of private strings while preserving clustering quality, outperforming the state of the art and baselines. To arrive at our theoretical results, we employ techniques from formal languages and combinatorial pattern matching.
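
The shortest-path view in the abstract above can be illustrated with a small sketch. This is not the linear-time algorithm of the paper; it is a plain breadth-first search over the complete de Bruijn graph, under these assumptions: nodes are (k-1)-mers, an edge is a k-mer, the forbidden set holds k-mers, and the contexts `left` and `right` contain no forbidden pattern themselves. The function name `shortest_fill` and its parameters are hypothetical.

```python
from collections import deque

def shortest_fill(left, right, k, alphabet, forbidden):
    """Return a shortest string y such that left + y + right creates no
    forbidden k-mer across the junctions, or None if no such y exists.
    Assumes len(left) >= k - 1 and len(right) >= k - 1 (illustrative only)."""
    assert len(left) >= k - 1 and len(right) >= k - 1
    start = left[len(left) - (k - 1):]   # node reached after reading `left`
    target = right[:k - 1]               # letters that must follow the fill

    def can_append(node, letters):
        # Walk the edges spelled by `letters`; fail on any forbidden k-mer.
        for c in letters:
            if node + c in forbidden:
                return False
            node = node[1:] + c
        return True

    queue = deque([(start, "")])
    seen = {start}
    while queue:
        node, fill = queue.popleft()
        if can_append(node, target):
            return fill                  # BFS order guarantees minimal length
        for c in alphabet:
            if node + c in forbidden:
                continue                 # forbidden edge: never traversed
            nxt = node[1:] + c
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, fill + c))
    return None
```

For instance, with `k = 3`, alphabet `"ab"`, contexts `"ab"` and `"ba"`, and the single forbidden pattern `"abb"`, direct concatenation would create `abb`, so one new letter is needed. The search explores at most |Σ|^(k-1) nodes, which is only practical for small k.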



Heavy Nodes in a Small Neighborhood: Exact and Peeling Algorithms With Applications

December 2024 · 91 Reads · IEEE Transactions on Knowledge and Data Engineering

We introduce a weighted and unconstrained variant of the well-known minimum k union problem: Given a bipartite graph $\mathcal{G}(U,V,E)$ with weights for all nodes in $V$, find a set $S \subseteq V$ such that the ratio between the total weight of the nodes in S and the number of their distinct adjacent nodes in U is maximized. Our problem, which we term Heavy Nodes in a Small Neighborhood (HNSN), finds applications in marketing, team formation, and money laundering detection. For example, in the latter application, S represents bank account holders who obtain illicit money from some peers of a criminal and route it through their accounts to a target account belonging to the criminal. We prove that HNSN can be solved exactly in polynomial time via linear programming. We also develop several algorithms offering different effectiveness/efficiency trade-offs: an exact algorithm, based on node contraction, graph decomposition, and linear programming, as well as three peeling algorithms. The first peeling algorithm is a near-linear time approximation algorithm with a tight approximation ratio, the second is an iterative algorithm that converges to an optimal solution in a very small number of iterations in practice, and the third is a near-linear time greedy heuristic. In addition, we formalize a money laundering scenario involving multiple target accounts and show how our algorithms can be extended to deal with it. Our experiments on real and synthetic datasets show that our algorithms find (near-)optimal solutions, outperforming a natural baseline, and that they can detect money laundering more effectively and efficiently than two state-of-the-art methods.
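
To make the HNSN objective concrete, the sketch below evaluates the ratio for a candidate set and finds the best set by exhaustive search on a toy instance. It is an exponential-time illustration only, not the linear-programming or peeling algorithms described in the abstract. The names `hnsn_ratio` and `hnsn_brute_force`, the dictionary-based graph representation, and the choice to skip sets with an empty neighborhood are assumptions made for this example.

```python
from itertools import combinations

def hnsn_ratio(subset, adj, weight):
    """Total weight of `subset` divided by the number of distinct
    neighbours in U; None if the subset has no neighbours (assumption)."""
    neighbours = set().union(*(adj[v] for v in subset))
    if not neighbours:
        return None
    return sum(weight[v] for v in subset) / len(neighbours)

def hnsn_brute_force(adj, weight):
    """Try every non-empty subset of V and keep the one with the best ratio."""
    best_set, best_ratio = None, float("-inf")
    nodes = sorted(adj)
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            ratio = hnsn_ratio(subset, adj, weight)
            if ratio is not None and ratio > best_ratio:
                best_set, best_ratio = set(subset), ratio
    return best_set, best_ratio

# Toy instance: v1 and v2 share the single neighbour u1, v3 sees only u2.
adj = {"v1": {"u1"}, "v2": {"u1"}, "v3": {"u2"}}
weight = {"v1": 3, "v2": 2, "v3": 4}
best_set, best_ratio = hnsn_brute_force(adj, weight)
```

In this toy instance the pair {v1, v2} wins with ratio 5, because both heavy nodes share one neighbor, which is the "small neighborhood" effect the problem name refers to.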


Elastic-Degenerate String Matching with 1 Error or Mismatch

September 2024 · 21 Reads · 2 Citations · Theory of Computing Systems

An elastic-degenerate (ED) string is a sequence of n finite sets of strings of total length N, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem consists in reporting all occurrences of a pattern of length m in an ED text. The EDSM problem has recently received some attention by the combinatorial pattern matching community, culminating in an $\tilde{\mathcal{O}}(nm^{\omega-1})+\mathcal{O}(N)$-time algorithm [Bernardini et al., SIAM J. Comput. 2022], where $\omega$ denotes the matrix multiplication exponent and the $\tilde{\mathcal{O}}(\cdot)$ notation suppresses polylog factors. In the k-EDSM problem, the approximate version of EDSM, we are asked to report all pattern occurrences with at most k errors. k-EDSM can be solved in $\mathcal{O}(k^2mG+kN)$ time, under edit distance, or $\mathcal{O}(kmG+kN)$ time, under Hamming distance, where G denotes the total number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020]. Unfortunately, G is only bounded by N, and so even for $k=1$, the existing algorithms run in $\Omega(mN)$ time in the worst case. In this paper we make progress in this direction. We show that 1-EDSM can be solved in $\mathcal{O}((nm^2+N)\log m)$ or $\mathcal{O}(nm^3+N)$ time under edit distance. For the decision version of the problem, we present a faster $\mathcal{O}(nm^2\sqrt{\log m}+N\log\log m)$-time algorithm. We also show that 1-EDSM can be solved in $\mathcal{O}(nm^2+N\log m)$ time under Hamming distance. Our algorithms for edit distance rely on non-trivial reductions from 1-EDSM to special instances of classic computational geometry problems (2d rectangle stabbing or 2d range emptiness), which we show how to solve efficiently. In order to obtain an even faster algorithm for Hamming distance, we rely on employing and adapting the k-errata trees for indexing with errors [Cole et al., STOC 2004]. This is an extended version of a paper presented at LATIN 2022.
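
As a rough illustration of what the decision version of 1-EDSM under Hamming distance asks, the following sketch expands every string spelled by a tiny ED text and slides the pattern over it. This brute force is exponential in n and has nothing in common with the algorithms of the paper beyond the problem statement. The representation of an ED text as a list of sets of strings and the function names are assumptions made for this example.

```python
from itertools import product

def hamming_at_most(a, b, limit):
    """True if equal-length strings a and b differ in at most `limit` positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                return False
    return True

def ed_matches_with_one_mismatch(ed_text, pattern):
    """Decide whether `pattern` matches, with at most one mismatch, a
    substring of some string obtained by picking one string per set."""
    m = len(pattern)
    for choice in product(*ed_text):
        s = "".join(choice)
        for i in range(len(s) - m + 1):
            if hamming_at_most(s[i:i + m], pattern, 1):
                return True
    return False

# Hypothetical ED text spelling AGT, AT, ACGT and ACT.
ed_text = [{"A", "AC"}, {"G", ""}, {"T"}]
```

Here `"ACT"` matches exactly, `"CCT"` matches with one mismatch, and `"GGG"` does not match any window within one mismatch.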


Figure 6: Results (left) and running time in seconds (right) for synthetic instances with L = 100, R = 30, |T| ∈ {20, 50, 100}, Me = 0 and Ml ∈ {0, 0.2} for varying k ∈ {1, 2, 5, 10} (k is the number of leaves chosen in Step 1 of FHyNCH-MultiML: see Section 2.3.2).
Figure 7: Synthetic instance results for different values of Ml, Me, L, and R. The reference reticulation number value per instance is the network the trees were extracted from.
Running times for the large instances extracted from the Bacterial and Archaeal Genomes data set. For each instance group, we give the average running time in seconds. Dashes indicate empty instance groups.
Inferring Phylogenetic Networks from Multifurcating Trees via Cherry Picking and Machine Learning

July 2024 · 20 Reads · 2 Citations · Molecular Phylogenetics and Evolution

The Hybridization problem asks to reconcile a set of conflicting phylogenetic trees into a single phylogenetic network with the smallest possible number of reticulation nodes. This problem is computationally hard and previous solutions are limited to small and/or severely restricted data sets, for example, a set of binary trees with the same taxon set or only two non-binary trees with non-equal taxon sets. Building on our previous work on binary trees, we present FHyNCH, the first algorithmic framework to heuristically solve the Hybridization problem for large sets of multifurcating trees whose sets of taxa may differ. Our heuristics combine the cherry-picking technique, recently proposed to solve the same problem for binary trees, with two carefully designed machine-learning models. We demonstrate that our methods are practical and produce qualitatively good solutions through experiments on both synthetic and real data sets.


Connecting de Bruijn Graphs

We study the problem of making a de Bruijn graph (dBG), constructed from a collection of strings, weakly connected while minimizing the total cost of edge additions. The input graph is a dBG that can be made weakly connected by adding edges (along with extra nodes if needed) from the underlying complete dBG. The problem arises from genome reconstruction, where the dBG is constructed from a set of sequences generated from a genome sample by a sequencing experiment. Due to sequencing errors, the dBG is never Eulerian in practice and is often not even weakly connected. We show the following results for a dBG G(V,E) of order k consisting of d weakly connected components: 1. Making G weakly connected by adding a set of edges of minimal total cost is NP-hard. 2. No PTAS exists for making G weakly connected by adding a set of edges of minimal total cost (unless the unique games conjecture fails). We complement this result by showing that there does exist a polynomial-time (2-2/d)-approximation algorithm for the problem. 3. We consider a restricted version of the above problem, where we are asked to make G weakly connected by only adding directed paths between pairs of components. We show that making G weakly connected by adding d-1 such paths of minimal total cost can be done in O(k|V|α(|V|)+|E|) time, where α(.) is the inverse Ackermann function. This improves on the O(k|V|log(|V|)+|E|)-time algorithm proposed by Bernardini et al. [CPM 2022] for the same restricted problem. 4. An ILP formulation of polynomial size for making G Eulerian with minimal total cost.
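
The quantity d in the abstract above, the number of weakly connected components, can be computed with a standard union-find structure. The sketch below only counts components and does not add any edges, so it is not the connection algorithm of the paper. It assumes the common convention that nodes are (k-1)-mers and each k-mer of an input string contributes one edge; the function name is hypothetical.

```python
def de_bruijn_components(strings, k):
    """Count weakly connected components of the order-k de Bruijn graph
    built from `strings` (nodes: (k-1)-mers, edges: k-mers)."""
    parent = {}
    size = {}

    def find(x):
        # Locate the root, then compress the path behind it.
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        if size[ra] < size[rb]:
            ra, rb = rb, ra
        parent[rb] = ra              # attach the smaller tree to the larger
        size[ra] += size[rb]

    for s in strings:
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            u, v = kmer[:-1], kmer[1:]
            for node in (u, v):
                if node not in parent:
                    parent[node] = node
                    size[node] = 1
            union(u, v)              # edge direction is irrelevant for weak connectivity

    return len({find(x) for x in parent})
```

With `k = 3`, the strings `"aab"` and `"bba"` give two components ({aa, ab} and {bb, ba}); adding `"abb"` contributes the edge ab → bb and merges them into one.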


Total Completion Time Scheduling Under Scenarios

December 2023 · 26 Reads · Lecture Notes in Computer Science

Scheduling jobs with given processing times on identical parallel machines so as to minimize their total completion time is one of the most basic scheduling problems. We study interesting generalizations of this classical problem involving scenarios. In our model, a scenario is defined as a subset of a predefined and fully specified set of jobs. The aim is to find an assignment of the whole set of jobs to identical parallel machines such that the schedule, obtained for the given scenarios by simply skipping the jobs not in the scenario, optimizes a function of the total completion times over all scenarios. While the underlying scheduling problem without scenarios can be solved efficiently by a simple greedy procedure (SPT rule), scenarios, in general, make the problem NP-hard. We paint an almost complete picture of the evolving complexity landscape, drawing the line between easy and hard. One of our main algorithmic contributions relies on a deep structural result on the maximum imbalance of an optimal schedule, based on a subtle connection to Hilbert bases of a related convex cone.
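
The SPT rule mentioned above for the scenario-free problem can be sketched in a few lines: sort the jobs by processing time and always place the next job on the machine that becomes free first. The sketch covers only this classical baseline, not the scenario model studied in the paper, and the function name is an assumption.

```python
import heapq

def spt_total_completion_time(processing_times, machines):
    """Schedule jobs by shortest processing time first on identical
    parallel machines and return the sum of completion times."""
    loads = [0] * machines           # current finishing time of each machine
    heapq.heapify(loads)
    total = 0
    for p in sorted(processing_times):
        finish = heapq.heappop(loads) + p   # earliest available machine
        total += finish
        heapq.heappush(loads, finish)
    return total
```

For jobs with processing times 3, 1, 2, 4 on two machines, the completion times are 1, 2, 4 and 6, for a total of 13; on a single machine the total is 20.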



Constructing phylogenetic networks via cherry picking and machine learning

September 2023 · 51 Reads · 10 Citations · Algorithms for Molecular Biology

Background: Combining a set of phylogenetic trees into a single phylogenetic network that explains all of them is a fundamental challenge in evolutionary studies. Existing methods are computationally expensive and can either handle only small numbers of phylogenetic trees or are limited to severely restricted classes of networks. Results: In this paper, we apply the recently-introduced theoretical framework of cherry picking to design a class of efficient heuristics that are guaranteed to produce a network containing each of the input trees, for practical-size datasets consisting of binary trees. Some of the heuristics in this framework are based on the design and training of a machine learning model that captures essential information on the structure of the input trees and guides the algorithms towards better solutions. We also propose simple and fast randomised heuristics that prove to be very effective when run multiple times. Conclusions: Unlike the existing exact methods, our heuristics are applicable to datasets of practical size, and the experimental study we conducted on both simulated and real data shows that these solutions are qualitatively good, always within some small constant factor from the optimum. Moreover, our machine-learned heuristics are one of the first applications of machine learning to phylogenetics and show its promise.


Constructing Phylogenetic Networks via Cherry Picking and Machine Learning

March 2023 · 56 Reads

Combining a set of phylogenetic trees into a single phylogenetic network that explains all of them is a fundamental challenge in evolutionary studies. Existing methods are computationally expensive and can either handle only small numbers of phylogenetic trees or are limited to severely restricted classes of networks. In this paper, we apply the recently-introduced theoretical framework of cherry picking to design a class of efficient heuristics that are guaranteed to produce a network containing each of the input trees, for datasets consisting of binary trees. Some of the heuristics in this framework are based on the design and training of a machine learning model that captures essential information on the structure of the input trees and guides the algorithms towards better solutions. We also propose simple and fast randomised heuristics that prove to be very effective when run multiple times. Unlike the existing exact methods, our heuristics are applicable to datasets of practical size, and the experimental study we conducted on both simulated and real data shows that these solutions are qualitatively good, always within some small constant factor from the optimum. Moreover, our machine-learned heuristics are one of the first applications of machine learning to phylogenetics and show its promise.


Citations (65)


... Unfortunately, the cardinality c of T in the above complexities is bounded only by N, so even for k = 1, the existing algorithms run in $\Omega(mN)$ time in the worst case. In response, Bernardini et al. [4] showed many algorithms for approximate EDSM for k = 1 working in $\tilde{O}(nm^2 + N)$ time or in $O(nm^3 + N)$ time for both the Hamming and the edit distance metrics. Pissis et al. then improved these algorithms for k = 1 (for both metrics) to $O(nm^2 + N)$ time [22]. ...

Reference:

Faster ED-String Matching with $k$ Mismatches
Elastic-Degenerate String Matching with 1 Error or Mismatch

Theory of Computing Systems

... Since then, the results presented in [14] have been generalised to larger classes of rooted phylogenetic networks and to deciding if a given rooted phylogenetic network is embedded in another such network (e.g., [2,17,20]). Most recently, cherry-picking sequences have also been used in the context of computing distances between phylogenetic networks [19] and to develop practical algorithms that reconstruct phylogenetic networks from a collection of phylogenetic trees in a machine-learning framework [1]. ...

Inferring Phylogenetic Networks from Multifurcating Trees via Cherry Picking and Machine Learning

Molecular Phylogenetics and Evolution

... [16,24]). This theoretical work has, in turn, resulted in the development of practical algorithms to reconstruct rooted phylogenetic networks from a set of rooted phylogenetic trees [4,13]. In a related line of research, cherry picking operations have recently been used to define and compute distances between phylogenetic networks [22,23]. ...

Constructing phylogenetic networks via cherry picking and machine learning

Algorithms for Molecular Biology

... Pattern Matching For indeterminate strings, we point out a Boyer-Moore adaptation [60], a combination [91] of ShiftAnd and Boyer-Moore-Sunday, and a KMP-based approach [82]. For ED strings, there is a rather long line of improvements for exact pattern matching [13,34,25,66,88,14], or with one error [24]. There are indices for pattern matching on ED texts, based on the suffix tree [55], on the Burrows-Wheeler transform (BWT) [73] or on a graph for read mapping [28]. ...

Elastic-Degenerate String Matching with 1 Error
  • Citing Chapter
  • October 2022

Lecture Notes in Computer Science

... A preliminary version of this work appeared in [19]. Compared to the preliminary version, we have added the following material: (i), we defined a new non-learned heuristic based on important features and experimentally tested it (Sect. ...

Reconstructing Phylogenetic Networks via Cherry Picking and Machine Learning

... Computing distances that are based on tree rearrangement operations and related dissimilarity measures such as the minimum hybridisation number for two phylogenetic trees [4] remains an active area of research (e.g. [18,22,25]) despite the NP-hardness of the associated optimisation problems. Indeed, recent algorithmic progress facilitates computations that exactly calculate the aforementioned measures for data sets of remarkable size [26,27]. ...

A duality based 2-approximation algorithm for maximum agreement forest

Mathematical Programming

... A third way to deal with missing values is missing value replacement (a.k.a. imputation) [14,20,21,34,56,67,69,74,102,105,106,115,117]. Missing value replacement methods have the benefit that their output can be used in any task. ...

Hide and Mine in Strings: Hardness, Algorithms, and Experiments

IEEE Transactions on Knowledge and Data Engineering

... Both of these problems are different from those considered in this work since they do not require a pattern to occur as a consecutive substring. Another related line of work is differentially private pattern counting on a single string with different definitions of neighboring (and therefore different privacy guarantees) [16,31,59]. The problems of frequent string mining and frequent sequence mining have also been studied in the local model of differential privacy [14,61,62]. ...

Differentially Private String Sanitization for Frequency-Based Mining Tasks

... To understand more about how SARS-CoV-2 unfolded globally in the initial phase of the pandemic, we used a phylogenetic network as part of our research. This reticulate network of phylogeny captures biological events between taxonomical units more aptly than phylogenetic trees, which force taxa to be at the bifurcation tip (5,6). In the present study, we used 161 country-specific earliest sequences of SARS-CoV-2, causative agent of COVID-19, from GISAID repository to decode the sequence configuration in terms of haplogroup with their characteristic mutations during the restricted international travel. ...

Applicability of several rooted phylogenetic network algorithms for representing the evolutionary history of SARS-CoV-2

BMC Ecology and Evolution

... Recently, Dyer et al. [15] tried to generate simple hypergraphs with given degree sequence uniformly at random using a bijection between bipartite graphs and k-hypergraphs, which requires fewer assumptions than the configuration model for hypergraphs. Their methods allow for scenarios where $d_1 = o(\min\{\sigma^{1/2}, \sigma^{1-2/k}\})$ (see Theorem 1.6 in [15]), while they require $d_1 = O(\log n)$ for the configuration model (see Lemma 2.3 in [15]). ...

Sampling hypergraphs with given degrees
  • Citing Article
  • November 2021

Discrete Mathematics