Minos N. Garofalakis

Technical University of Crete, Chania, Crete, Greece

Publications (151) · 49.66 total impact

  • Minos Garofalakis, Rajeev Rastogi, Kyuseok Shim
    ABSTRACT: Sequential pattern mining under various constraints is a challenging data mining task. The paper provides a generic framework based on constraint programming to discover sequence patterns defined by constraints on local patterns (e.g., gap, regular expressions) or constraints on patterns involving combination of local patterns such as relevant subgroups and top-k patterns. This framework enables the user to mine in a declarative way both kinds of patterns. The solving step is done by exploiting the machinery of Constraint Programming. For complex patterns involving combination of local patterns, we improve the mining step by using dynamic CSP. Finally, we present two case studies in biomedical information extraction and stylistic analysis in linguistics.
    2014 IEEE 26th International Conference on Tools with Artificial Intelligence (ICTAI 2014); 11/2014
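As an illustrative aside (not the paper's constraint-programming solver, and with a hypothetical function name), the gap-constrained subsequence test underlying such local pattern constraints can be sketched in Python:

```python
def occurs_with_max_gap(sequence, pattern, max_gap):
    """Return True iff `pattern` occurs as a subsequence of `sequence`
    with at most `max_gap` skipped items between consecutive matches."""
    def search(start, pat_idx):
        if pat_idx == len(pattern):
            return True
        # the first symbol may match anywhere; later ones must respect the gap
        limit = len(sequence) if pat_idx == 0 else min(len(sequence), start + max_gap + 1)
        for i in range(start, limit):
            if sequence[i] == pattern[pat_idx] and search(i + 1, pat_idx + 1):
                return True
        return False
    return search(0, 0)
```

For example, the pattern a-c-d occurs in "abxxcd" with a maximum gap of 3, but not with a maximum gap of 2.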
  • Minos Garofalakis, Yannis Ioannidis
    ABSTRACT: Scheduling query execution plans is a particularly complex problem in shared-nothing parallel systems, where each site consists of a collection of local time-shared (e.g., CPU(s) or disk(s)) and space-shared (e.g., memory) resources and communicates with remote sites by message-passing. Earlier work on parallel query scheduling employs either (a) one-dimensional models of parallel task scheduling, effectively ignoring the potential benefits of resource sharing, or (b) models of globally accessible resource units, which are appropriate only for shared-memory architectures, since they cannot capture the affinity of system resources to sites. In this paper, we develop a general approach capturing the full complexity of scheduling distributed, multi-dimensional resource units for all forms of parallelism within and across queries and operators. We present a level-based list scheduling heuristic algorithm for independent query tasks (i.e., physical operator pipelines) that is provably near-optimal for given degrees of partitioned parallelism (with a worst-case performance ratio that depends on the number of time-shared and space-shared resources per site and the granularity of the clones). We also propose extensions to handle blocking constraints in logical operator (e.g., hash-join) pipelines and bushy query plans as well as on-line task arrivals (e.g., in a dynamic or multi-query execution environment). Experiments with our scheduling algorithms implemented on top of a detailed simulation model verify their effectiveness compared to existing approaches in a realistic setting. Based on our analytical and experimental results, we revisit the open problem of designing efficient cost models for parallel query optimization and propose a solution that captures all the important parameters of parallel execution.
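For intuition about list scheduling (the paper's algorithm additionally models multi-dimensional, time- and space-shared resources per site), a bare-bones LPT-style list scheduler over identical sites can be sketched as follows; this is a simplification, not the paper's level-based heuristic:

```python
import heapq

def list_schedule(task_costs, num_sites):
    """Greedy list scheduling: sort tasks by decreasing cost (LPT rule) and
    repeatedly place the next task on the currently least-loaded site.
    Returns (makespan, assignment) where assignment[t] is the site of task t."""
    heap = [(0.0, site) for site in range(num_sites)]
    heapq.heapify(heap)
    assignment = [None] * len(task_costs)
    order = sorted(range(len(task_costs)), key=lambda t: -task_costs[t])
    for t in order:
        load, site = heapq.heappop(heap)
        assignment[t] = site
        heapq.heappush(heap, (load + task_costs[t], site))
    makespan = max(load for load, _ in heap)
    return makespan, assignment
```

On costs [3, 3, 2, 2, 2] with two sites, LPT yields a makespan of 7 (the optimum is 6), illustrating why such heuristics come with worst-case performance ratios rather than exact guarantees.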
  •
    ABSTRACT: Biological networks in living organisms can be seen as the ultimate means of understanding the underlying mechanisms in complex diseases, such as oral cancer. During the last decade, many algorithms based on high-throughput genomic data have been developed to unravel the complexity of gene network construction and their progression in time. However, the small size of samples compared to the number of observed genes makes the inference of the network structure quite challenging. In this study, we propose a framework for constructing and analyzing gene networks from sparse experimental temporal data and investigate its potential in oral cancer. We use two network models based on partial correlations and kernel density estimation, in order to capture the genetic interactions. Using this network construction framework on real clinical data of the tissue and blood at different time stages, we identified common disease-related structures that may decipher the association between disease state and biological processes in oral cancer. Our study emphasizes an altered MET (hepatocyte growth factor receptor) network during oral cancer progression. In addition, we demonstrate that the functional changes of gene interactions during oral cancer progression might be particularly useful for patient categorization at the time of diagnosis and/or at follow-up periods.
    IEEE Journal of Biomedical and Health Informatics 03/2014; 18(2):562-573. · 1.98 Impact Factor
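The partial-correlation building block used in such network models can be illustrated with the classic first-order formula; this sketch is illustrative only and not the authors' pipeline:

```python
from math import sqrt

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def partial_corr(xs, ys, zs):
    """First-order partial correlation of X and Y controlling for Z:
    the residual association once Z's linear effect is removed."""
    rxy, rxz, ryz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
```

An edge is then drawn between two genes when their partial correlation (rather than their raw correlation) is significantly non-zero.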
  • Odysseas Papapetrou, Minos Garofalakis
    ABSTRACT: Distributed skyline computation is important for a wide range of application domains, from distributed and web-based systems to ISP-network monitoring and distributed databases. The problem is particularly challenging in dynamic distributed settings, where the goal is to efficiently monitor a continuous skyline query over a collection of distributed streams. All existing work relies on the assumption of a single point of reference for object attributes/dimensions, i.e., objects may be vertically or horizontally partitioned, but the accurate value of each dimension for each object is always maintained by a single site. This assumption is unrealistic for several distributed monitoring applications, where object information is fragmented over a set of distributed streams (each monitored by a different site) and needs to be aggregated (e.g., averaged) across several sites. Furthermore, it is frequently useful to define skyline dimensions through complex functions over the aggregated objects, which raises further challenges for dealing with object fragmentation. In this paper, we present the first known distributed approach for continuous fragmented skylines, namely distributed monitoring of skylines over complex functions of fragmented multi-dimensional objects. We also propose several optimizations, including a new technique based on random-walk models for adaptively determining the most efficient monitoring strategy for each object. A thorough experimental study with synthetic and real-life data sets verifies the effectiveness of our approach, demonstrating order-of-magnitude improvements in communication costs compared to the only available centralized solution.
    2014 IEEE 30th International Conference on Data Engineering (ICDE); 03/2014
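For background, the pairwise dominance test underlying any skyline computation (shown here for centralized, non-fragmented points, unlike the paper's distributed fragmented setting) is simply:

```python
def dominates(a, b):
    """a dominates b when a is no worse in every dimension and strictly
    better in at least one (smaller-is-better convention)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    """All points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For instance, among the 2-D points (1,3), (2,2), (3,1), (2,3), (3,3), the first three form the skyline and the last two are dominated.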
  •
    ABSTRACT: Many organizations today are faced with the challenge of processing and distilling information from huge and growing collections of data. Such organizations are increasingly deploying sophisticated mathematical algorithms to model the behavior of their ...
    ACM SIGMOD Record 02/2014; 42(4):61-66. · 0.96 Impact Factor
  •
    ABSTRACT: Oral cancer is characterized by multiple genetic events, such as alterations of a number of oncogenes and tumour suppressor genes. The aim of this study is to identify genes, and their functional interactions, that may play a crucial role in a specific disease state, especially during oral cancer progression. We examine gene interaction networks on blood genomic data obtained from twenty-three oral cancer patients at four different time stages. We generate the gene-gene networks from sparse experimental temporal data using two methods, Partial Correlations and Kernel Density Estimation, in order to capture genetic interactions. The network study reveals an altered MET (hepatocyte growth factor receptor) network during oral cancer progression, which is further analyzed in relation to other studies.
    Conference proceedings: ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Conference 07/2013; 2013:575-578.
  •
    ABSTRACT: We consider a stream outsourcing setting, where a data owner delegates the management of a set of disjoint data streams to an untrusted server. The owner authenticates his streams via signatures. The server processes continuous queries on the union of the streams for clients trusted by the owner. Along with the results, the server sends proofs of result correctness derived from the owner's signatures, which are easily verifiable by the clients. We design novel constructions for a collection of fundamental problems over streams represented as linear algebraic queries. In particular, our basic schemes authenticate dynamic vector sums and dot products, as well as dynamic matrix products. These techniques can be adapted for authenticating a wide range of important operations in streaming environments, including group by queries, joins, in-network aggregation, similarity matching, and event processing. All our schemes are very lightweight, and offer strong cryptographic guarantees derived from formal definitions and proofs. We experimentally confirm the practicality of our schemes.
    Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data; 06/2013
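The flavor of lightweight verification of outsourced linear algebra can be conveyed by Freivalds' classic probabilistic check for matrix products; note that this is a well-known randomized test, not the paper's cryptographic construction, which additionally provides unforgeability against a malicious server:

```python
import random

def freivalds_check(A, B, C, trials=20):
    """Probabilistic check that A @ B == C for square matrices:
    compare A(Br) with Cr for random 0/1 vectors r.  Each trial costs
    O(n^2); a wrong C escapes detection with probability <= 2**-trials."""
    n = len(A)
    def matvec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    for _ in range(trials):
        r = [random.randint(0, 1) for _ in range(n)]
        if matvec(A, matvec(B, r)) != matvec(C, r):
            return False  # a mismatch is conclusive
    return True
```

The client thus verifies an n x n product in O(n^2) work per trial instead of recomputing it in O(n^3).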
  • Amol Deshpande, Minos Garofalakis, Michael I. Jordan
    ABSTRACT: In this paper, we present an efficient way of performing stepwise selection in the class of decomposable models. The main contribution of the paper is a simple characterization of the edges that can be added to a decomposable model while keeping the resulting model decomposable, and an efficient algorithm for enumerating all such edges for a given model in essentially O(1) time per edge. We also discuss how backward selection can be performed efficiently using our data structures. We further analyze the complexity of the complete stepwise selection procedure, including the complexity of choosing which of the eligible edges to add to (or delete from) the current model, with the aim of minimizing the Kullback-Leibler distance of the resulting model from the saturated model for the data.
  •
    ABSTRACT: While traditional data-management systems focus on evaluating single, ad-hoc queries over static data sets in a centralized setting, several emerging applications require (possibly, continuous) answers to queries on dynamic data that is widely distributed and constantly updated. Furthermore, such query answers often need to discount data that is "stale", and operate solely on a sliding window of recent data arrivals (e.g., data updates occurring over the last 24 hours). Such distributed data streaming applications mandate novel algorithmic solutions that are both time- and space-efficient (to manage high-speed data streams), and also communication-efficient (to deal with physical data distribution). In this paper, we consider the problem of complex query answering over distributed, high-dimensional data streams in the sliding-window model. We introduce a novel sketching technique (termed ECM-sketch) that allows effective summarization of streaming data over both time-based and count-based sliding windows with probabilistic accuracy guarantees. Our sketch structure enables point as well as inner-product queries, and can be employed to address a broad range of problems, such as maintaining frequency statistics, finding heavy hitters, and computing quantiles in the sliding-window model. Focusing on distributed environments, we demonstrate how ECM-sketches of individual, local streams can be composed to generate a (low-error) ECM-sketch summary of the order-preserving aggregation of all streams; furthermore, we show how ECM-sketches can be exploited for continuous monitoring of sliding-window queries over distributed streams. Our extensive experimental study with two real-life data sets validates our theoretical claims and verifies the effectiveness of our techniques. To the best of our knowledge, ours is the first work to address efficient, guaranteed-error complex query answ...[truncated].
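Since the ECM-sketch builds on Count-Min-style summaries, a plain (non-windowed) Count-Min sketch may help fix intuition; the width, depth, and hashing choices below are illustrative, not the paper's parameterization:

```python
import hashlib

class CountMin:
    """Plain Count-Min sketch: `depth` hash rows of `width` counters.
    Point queries return an overestimate; the error is bounded by the
    total inserted count and shrinks with larger widths."""
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, item, row):
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._bucket(item, row)] += count

    def query(self, item):
        return min(self.table[row][self._bucket(item, row)]
                   for row in range(self.depth))
```

The ECM-sketch replaces each counter with an exponential-histogram structure so that counts decay over a sliding window; the point-query interface stays the same.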
  •
    ABSTRACT: Many modern streaming applications, such as online analysis of financial, network, sensor, and other forms of data, are inherently distributed in nature. An important query type in such application scenarios is the actuation query, where action is dictated by a trigger condition placed on the current value of a monitored function. Recent work studies the problem of tracking sophisticated (non-linear) functions in a distributed manner. The main concept behind the geometric monitoring approach proposed there is for each distributed site to monitor the function over an appropriate subset of the input domain. In the current work, we examine whether the distributed monitoring mechanism can become more efficient, in terms of the number of communicated messages, by extending the geometric monitoring framework to utilize prediction models. We initially describe a number of local estimators (predictors) that are useful for the applications we consider and that have already proven particularly useful in past work. We then demonstrate the feasibility of incorporating predictors into the geometric monitoring framework, and show that prediction-based geometric monitoring in fact generalizes the original framework. We propose a variety of prediction-based monitoring models for the distributed threshold monitoring of complex functions. Our extensive experiments with a variety of real data sets, functions, and parameter settings indicate that our approaches can provide significant communication savings, ranging from a factor of two up to three orders of magnitude, compared to the transmission cost of the original monitoring framework.
    SIGMOD; 05/2012
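The core message-suppression idea behind prediction-based monitoring can be conveyed by a toy last-value predictor; this is a drastic simplification of the paper's geometric framework, intended only to show why shared predictors save messages:

```python
def monitor_stream(values, threshold):
    """A site ships an update only when the true value drifts more than
    `threshold` away from the last-value predictor that both the site and
    the coordinator maintain; otherwise the message is suppressed."""
    sent = []
    predicted = None
    for v in values:
        if predicted is None or abs(v - predicted) > threshold:
            sent.append(v)   # synchronize: coordinator learns v
            predicted = v    # both sides now predict v
    return sent
```

On the slowly drifting stream [10, 10.5, 11, 15, 15.2] with threshold 1.0, only two of the five updates are transmitted; richer predictors (e.g., velocity-based) suppress even more.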
  •
    ABSTRACT: Biological networks are often described as probabilistic graphs in the context of gene and protein sequence analysis in molecular biology. Microarray and proteomics technologies allow the monitoring of expression levels across thousands of biological units over time. In experimental efforts, we are interested in unveiling pairwise interactions. Many graphical models have been introduced in order to discover associations from expression data analysis. However, the small size of samples compared to the number of observed genes/proteins makes the inference of the network structure quite challenging. In this study, we generate gene-protein networks from sparse experimental data using two methods, partial correlations and Kernel Density Estimation, in order to capture genetic interactions. Dynamic Gaussian analysis is used to match special characteristics to genes and proteins at different time stages, utilizing the KDE method for expressing Gaussian associations with non-linear parameters.
    Bioinformatics & Bioengineering (BIBE), 2012 IEEE 12th International Conference on; 01/2012
  •
    ABSTRACT: Methods for Approximate Query Processing (AQP) are essential for dealing with massive data. They are often the only means of providing interactive response times when exploring massive datasets, and are also needed to handle high speed data streams. These methods proceed by computing a lossy, compact synopsis of the data, and then executing the query of interest against the synopsis rather than the entire dataset. We describe basic principles and recent developments in AQP. We focus on four key synopses: random samples, histograms, wavelets, and sketches. We consider issues such as accuracy, space and time efficiency, optimality, practicality, range of applicability, error bounds on query answers, and incremental maintenance. We also discuss the trade-offs between the different synopsis types.
    Foundations and Trends in Databases 01/2012; 4(1–3):1-294.
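As a minimal example of the sampling synopses surveyed above, a uniform-sample estimator for a sum (illustrative only; the survey covers far more refined variants and error bounds) is:

```python
import random

def approximate_sum(data, sample_size, seed=42):
    """Estimate sum(data) from a uniform sample without replacement:
    scale the sample sum by N/n.  Unbiased, with error shrinking as
    O(1/sqrt(n)); a full-size sample recovers the exact sum."""
    rng = random.Random(seed)
    sample = rng.sample(data, sample_size)
    return len(data) / sample_size * sum(sample)
```

The query runs against the small sample instead of the full dataset, trading bounded error for interactive response time.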
  • Vibhor Rastogi, Nilesh Dalvi, Minos Garofalakis
    ABSTRACT: There have been several recent advancements in the Machine Learning community on the Entity Matching (EM) problem. However, their lack of scalability has prevented them from being applied in practical settings on large real-life datasets. Towards this end, we propose a principled framework to scale any generic EM algorithm. Our technique consists of running multiple instances of the EM algorithm on small neighborhoods of the data and passing messages across neighborhoods to construct a global solution. We prove formal properties of our framework and experimentally demonstrate the effectiveness of our approach in scaling EM algorithms.
    Computing Research Repository (CoRR); 03/2011.
  •
    ABSTRACT: In the database community, work on information extraction (IE) has centered on two themes: how to effectively manage IE tasks, and how to manage the uncertainties that arise in the IE process in a scalable manner. Recent work has proposed a probabilistic database (PDB) based declarative IE system that supports a leading statistical IE model, and an associated inference algorithm to answer top-k-style queries over the probabilistic IE outcome. Still, the broader problem of effectively supporting general probabilistic inference inside a PDB-based declarative IE system remains open. In this paper, we explore the in-database implementations of a wide variety of inference algorithms suited to IE, including two Markov chain Monte Carlo algorithms, the Viterbi and the sum-product algorithms. We describe the rules for choosing appropriate inference algorithms based on the model, the query and the text, considering the trade-off between accuracy and runtime. Based on these rules, we describe a hybrid approach to optimize the execution of a single probabilistic IE query to employ different inference algorithms appropriate for different records. We show that our techniques can achieve up to 10-fold speedups compared to the non-hybrid solutions proposed in the literature.
    Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12-16, 2011; 01/2011
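Since several of the inference algorithms discussed above center on Viterbi decoding, a compact standalone Viterbi implementation (the textbook dynamic program, not the in-database version) is sketched below:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence of a linear-chain model (HMM)
    via dynamic programming; V[t][s] = (best score ending in s, predecessor)."""
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p][0] * trans_p[p][s])
            V[t][s] = (V[t - 1][prev][0] * trans_p[prev][s] * emit_p[s][obs[t]], prev)
    # backtrack from the best final state
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))
```

Running it on the standard healthy/fever toy HMM recovers the familiar ["Healthy", "Healthy", "Fever"] decoding for the observations normal, cold, dizzy.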
  •
    ABSTRACT: Recent years have witnessed the emergence of novel database applications in various nontraditional domains, including location-based services, sensor networks, RFID systems, and biological and biometric databases. Traditionally, data mining has been widely used to reveal interesting patterns in the vast amounts of data generated by such applications. However, for most of these emerging domains, data is often riddled with uncertainty, arising, for instance, from inherent measurement inaccuracies, sampling and curation errors, and network latencies, or even from intentional blurring of the data (to preserve anonymity). Such forms of data uncertainty have to be handled carefully, or else the results of long and tedious data analyses could be inaccurate or even incorrect. In particular, it is important to collect and distill the knowledge of experts in developing mining and data processing methods that are uncertainty-aware. Recently, there has been active interest in the database community in treating uncertain data as a "first-class citizen," where probability and statistical information about the data are stored in the DBMS. Novel queries can be evaluated on these data to produce probabilistic results [1]. More recently, new mining algorithms that take data uncertainty into account, such as frequent pattern mining [2] and clustering [3], have also been proposed. Aggarwal and Yu [4] give a survey of the area of uncertain data querying and mining, where the key challenges include: models and structures for uncertain information in data mining and complex data analysis; association rule mining and clustering of uncertain data; machine learning on uncertain data; mining moving-object trajectories and biological data with noise; similarity matching of objects with uncertainty; and efficient mining and analysis of uncertain/probabilistic data streams.
    This special section of the IEEE Transactions on Knowledge and Data Engineering features a collection of four papers, selected from 23 submissions, representing recent advances in the mining of uncertain databases. These works present new techniques for mining patterns, clustering, and ranking on uncertain data. In applications like biological databases, graph data are often incomplete and imprecise. Mining uncertain graph data is semantically different from, and computationally more challenging than, mining exact graph data. The first paper, "Mining Frequent Subgraph Patterns from Uncertain Graph Data" by Zhaonian Zou, Jianzhong Li, Hong Gao, and Shuo Zhang, investigates the mining of frequent subgraph patterns from uncertain graph data. The authors propose an uncertain graph model and formalize the subgraph mining problem, which is NP-hard; they therefore develop an approximate, scalable algorithm. The problem of clustering large uncertain location databases is investigated in the second paper, "Clustering Uncertain Data Using Voronoi Diagrams and R-Tree Index." The authors show that the UK-means algorithm, which generalizes the k-means algorithm to handle uncertain objects, is inefficient due to a large amount of expensive expected-distance calculation. They propose pruning techniques based on Voronoi diagrams to reduce the amount of expected-distance calculation; these techniques are analytically proven to be more effective than the basic bounding-box-based technique. The authors also use an R-tree index to speed up the retrieval of the uncertain objects. The third paper, "Scalable Probabilistic Similarity Ranking in Uncertain Databases" by Thomas Bernecker, Hans-Peter Kriegel, Nikos Mamoulis, Matthias Renz, and Andreas Zuefle, studies how to rank uncertain data according to their distances to a reference object. They propose a framework that incrementally computes, for each object instance and ranking position, the probability of the object falling at that ranking position. While existing approaches compute this probability distribution using quadratic-complexity algorithms, the new algorithm requires linear time with the same memory requirements. They also show how the output of their method can be used to apply probabilistic top-k ranking for the objects according to different state-of-the-art definitions.
    IEEE Transactions on Knowledge and Data Engineering 09/2010; 22. · 1.82 Impact Factor
  •
    ABSTRACT: Recently, there has been increasing interest in extending relational query processing to include data obtained from unstructured sources. A common approach is to use stand-alone Information Extraction (IE) techniques to identify and label entities within blocks of text; the resulting entities are then imported into a standard database and processed using relational queries. This two-part approach, however, suffers from two main drawbacks. First, IE is inherently probabilistic, but traditional query processing does not properly handle probabilistic data, resulting in reduced answer quality. Second, performance inefficiencies arise due to the separation of IE from query processing. In this paper, we address these two problems by building on an in-database implementation of a leading IE model---Conditional Random Fields using the Viterbi inference algorithm. We develop two different query approaches on top of this implementation. The first uses deterministic queries over maximum-likelihood extractions, with optimizations to push the relational operators into the Viterbi algorithm. The second extends the Viterbi algorithm to produce a set of possible extraction "worlds", from which we compute top-k probabilistic query answers. We describe these approaches and explore the trade-offs of efficiency and effectiveness between them using two datasets.
    PVLDB. 09/2010; 3:1057-1067.
  •
    ABSTRACT: Thousands of scientific conferences happen every year, and each involves a laborious scientific peer review process conducted by one or more busy scientists serving as Technical/Scientific Program Committee (TPC) chair(s). The chair(s) must match submitted papers to their reviewer pool in such a way that (i) each paper is reviewed by experts in its subject matter, and (ii) no reviewer is overloaded with reviews or under-utilized. Towards this end, seasoned TPC chairs know the value of reviewer and paper profiling: summarizing the expertise/interests of each reviewer and the subject matter of each paper using judiciously chosen domain-specific keywords. An automated profiling algorithm is proposed for this purpose, which starts from generic/noisy reviewer profiles extracted using Google Scholar and derives custom conference-centric reviewer and paper profiles. Each reviewer is an expert on a few sub-topics, whereas the pool of reviewers and the conference may collectively need many more keywords for appropriate specificity. Exploiting this sparsity, we propose a sparse matrix factorization approach in lieu of classical SVD-based LSI or NMF-type approaches. We illustrate the merits of our approach using real conference data and expert scoring of the assignments by a seasoned TPC chair in the area.
    ICDMW 2010, The 10th IEEE International Conference on Data Mining Workshops, Sydney, Australia, 14 December 2010; 01/2010
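The assignment side of the TPC problem (expertise matching under load caps) can be illustrated with a greedy toy version; the paper's sparse-factorization profile construction is not reproduced here, and all names and profiles below are hypothetical:

```python
def assign_papers(papers, reviewers, cap):
    """Greedily assign each paper to the eligible reviewer with the largest
    keyword-profile overlap, subject to a per-reviewer load cap.
    `papers` and `reviewers` map names to keyword sets."""
    load = {r: 0 for r in reviewers}
    assignment = {}
    for paper, keywords in papers.items():
        eligible = [r for r in reviewers if load[r] < cap]
        # prefer larger keyword overlap; break ties toward lighter loads
        best = max(eligible, key=lambda r: (len(keywords & reviewers[r]), -load[r]))
        assignment[paper] = best
        load[best] += 1
    return assignment
```

Real assignment systems solve this as a global optimization (e.g., via matching or linear programming) rather than greedily, but the objective is the same.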
  •
    ABSTRACT: Unstructured text represents a large fraction of the world's data, and it often contains snippets of structured information (e.g., people's names and zip codes). Information Extraction (IE) techniques identify such structured information in text. In recent years, database research has pursued IE on two fronts: declarative languages and systems for managing IE tasks, and probabilistic databases for querying the output of IE. In this paper, we take the first steps to merge these two directions, without loss of statistical robustness, by implementing a state-of-the-art statistical IE model - Conditional Random Fields (CRFs) - in the setting of a Probabilistic Database that treats statistical models as first-class data objects. We show that the Viterbi algorithm for CRF inference can be specified declaratively in recursive SQL. We also show the performance benefits relative to a standalone open-source Viterbi implementation. This work opens up optimization opportunities for queries involving both inference and relational operators over IE models.
    Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long Beach, California, USA; 01/2010
  • Communications of the ACM 11/2009; 52:87-95. · 2.86 Impact Factor
  •
    ABSTRACT: There is a growing realization that modern database management systems (DBMSs) must be able to manage data that contains uncertainties that are represented in the form of probabilistic relations. Consequently, the design of each core DBMS component must be revisited in the presence of uncertain and probabilistic information. In this paper, we study how to build histogram synopses for probabilistic relations, for the purposes of enabling both DBMS-internal decisions (such as indexing and query planning), and (possibly, user-facing) approximate query processing tools. In contrast to initial work in this area, our probabilistic histograms retain the key possible-worlds semantics of probabilistic data, allowing for more accurate, yet concise, representation of the uncertainty characteristics of data and query results. We present a variety of techniques for building optimal probabilistic histograms, each one tuned to a different choice of approximation-error metric. We show that these can be incorporated into a general Dynamic Programming (DP) framework, which generalizes that used for existing histogram constructions. The end result is a histogram where each "bucket" is approximately represented by a compact probability distribution function (PDF), which can be used as the basis for query planning and approximate query answering. We present novel, polynomial-time algorithms to find optimal probabilistic histograms for a variety of PDF-error metrics (including variation distance, sum squared error, max error and EMD1). Our experimental study shows that our probabilistic histogram synopses can accurately capture the key statistical properties of uncertain data, while being much more compact to store and work with than the original uncertain relations.
    PVLDB. 08/2009; 2:526-537.
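The dynamic-programming skeleton that the paper generalizes is the classic V-optimal (SSE-minimizing) histogram DP, sketched here for deterministic values; the paper's version replaces per-bucket averages with compact per-bucket PDFs and other error metrics:

```python
def v_optimal_histogram(values, num_buckets):
    """Classic DP for an SSE-optimal histogram of deterministic values:
    opt[i][b] = minimum sum-squared error of the first i values using b buckets."""
    n = len(values)
    prefix, prefix_sq = [0.0], [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    def sse(i, j):  # SSE of a single bucket covering values[i:j]
        s = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i)

    INF = float("inf")
    opt = [[INF] * (num_buckets + 1) for _ in range(n + 1)]
    opt[0][0] = 0.0
    for i in range(1, n + 1):
        for b in range(1, min(i, num_buckets) + 1):
            opt[i][b] = min(opt[j][b - 1] + sse(j, i) for j in range(b - 1, i))
    return opt[n][num_buckets]
```

For example, two buckets summarize [1, 1, 1, 5, 5, 5] with zero error, while a single bucket over [1, 2, 3] incurs an SSE of 2.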

Publication Stats

5k Citations
49.66 Total Impact Points


  • 2009–2013
    • Technical University of Crete
      • Department of Electronic and Computer Engineering
      Chania, Crete, Greece
  • 2008
    • Mission College
      Santa Clara, California, United States
  • 2007–2008
    • University of California, Berkeley
      • Department of Electrical Engineering and Computer Sciences
      Berkeley, California, United States
    • Tsinghua University
      Beijing, China
  • 2002–2005
    • Alcatel Lucent
      Paris, Île-de-France, France
  • 2001
    • Research Center on Scientific and Technical Information
      Algiers, Algeria
  • 1996–1999
    • University of Wisconsin, Madison
      • Department of Computer Sciences
      Madison, WI, United States