Minos N. Garofalakis

Technical University of Crete, Chania, Crete, Greece

Publications (165) · 62.07 Total Impact

  • Ekaterini Ioannou · Minos Garofalakis
    ABSTRACT: Recent entity resolution approaches exhibit benefits when addressing the problem through unmerged duplicates: instances describing real-world objects are not merged based on a priori thresholds or human intervention; instead, relevant resolution information is employed for evaluating resolution decisions during query processing using “possible worlds” semantics. In this paper, we present the first known approach for efficiently handling complex analytical queries over probabilistic databases with unmerged duplicates. We propose the entity-join operator that allows expressing complex aggregation and iceberg/top-k queries over joins between tables with unmerged duplicates and other database tables. Our technical content includes a novel indexing structure for efficient access to the entity resolution information and novel techniques for the efficient evaluation of complex probabilistic queries that retrieve analytical and summarized information over a (potentially, huge) collection of possible resolution worlds. Our extensive experimental evaluation verifies the benefits of our approach.
    Full-text · Article · Aug 2015 · IEEE Transactions on Knowledge and Data Engineering
  • ABSTRACT: While traditional data management systems focus on evaluating single, ad hoc queries over static data sets in a centralized setting, several emerging applications require (possibly, continuous) answers to queries on dynamic data that is widely distributed and constantly updated. Furthermore, such query answers often need to discount data that is “stale” and operate solely on a sliding window of recent data arrivals (e.g., data updates occurring over the last 24 h). Such distributed data streaming applications mandate novel algorithmic solutions that are both time and space efficient (to manage high-speed data streams) and also communication efficient (to deal with physical data distribution). In this paper, we consider the problem of complex query answering over distributed, high-dimensional data streams in the sliding-window model. We introduce a novel sketching technique (termed ECM-sketch) that allows effective summarization of streaming data over both time-based and count-based sliding windows with probabilistic accuracy guarantees. Our sketch structure enables point, as well as inner product, queries and can be employed to address a broad range of problems, such as maintaining frequency statistics, finding heavy hitters, and computing quantiles in the sliding-window model. Focusing on distributed environments, we demonstrate how ECM-sketches of individual, local streams can be composed to generate a (low-error) ECM-sketch summary of the order-preserving merging of all streams; furthermore, we show how ECM-sketches can be exploited for continuous monitoring of sliding-window queries over distributed streams. Our extensive experimental study with two real-life data sets validates our theoretical claims and verifies the effectiveness of our techniques. To the best of our knowledge, ours is the first work to address efficient, guaranteed-error complex query answering over distributed data streams in the sliding-window model.
    No preview · Article · Jun 2015 · The VLDB Journal
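The ECM-sketch described above combines a Count-Min sketch with sliding-window counters. The toy below (all names are my own, not from the paper) keeps an exact timestamp window per cell to make the idea concrete; the actual ECM-sketch replaces each exact window with an exponential histogram, so each counter needs only logarithmic space at the cost of a bounded relative error:

```python
import hashlib
from collections import deque

class SlidingCountMin:
    """Toy sliding-window Count-Min sketch. Each cell stores raw arrival
    timestamps and expires those outside the window, so per-cell space is
    linear; the real ECM-sketch swaps this for an exponential histogram
    holding approximate counts in logarithmic space."""

    def __init__(self, width=64, depth=4, window=100):
        self.width, self.depth, self.window = width, depth, window
        self.cells = [[deque() for _ in range(width)] for _ in range(depth)]

    def _buckets(self, item):
        # One hash function per row, derived from a keyed digest.
        for row in range(self.depth):
            h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
            yield row, int.from_bytes(h.digest(), "big") % self.width

    def add(self, item, now):
        for row, col in self._buckets(item):
            cell = self.cells[row][col]
            cell.append(now)
            while cell and cell[0] <= now - self.window:
                cell.popleft()

    def estimate(self, item, now):
        # Count-Min rule: the minimum over rows bounds hash-collision error.
        return min(
            sum(1 for t in self.cells[row][col] if t > now - self.window)
            for row, col in self._buckets(item)
        )
```

Like the real sketch, this structure is mergeable: cell-wise union of two sketches with the same dimensions summarizes the merged streams.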

  • No preview · Article · Jan 2015 · Proceedings of the VLDB Endowment
  • ABSTRACT: We consider a stream outsourcing setting, where a data owner delegates the management of a set of disjoint data streams to an untrusted server. The owner authenticates his streams via signatures. The server processes continuous queries on the union of the streams for clients trusted by the owner. Along with the results, the server sends proofs of result correctness derived from the owner's signatures, which are verifiable by the clients. We design novel constructions for a collection of fundamental problems over streams represented as linear algebraic queries. In particular, our basic schemes authenticate dynamic vector sums, matrix products, and dot products. These techniques can be adapted for authenticating a wide range of important operations in streaming environments, including group-by queries, joins, in-network aggregation, similarity matching, and event processing. We also present extensions to address the case of sliding window queries, and when multiple clients are interested in different subsets of the data. These methods take advantage of a novel nonce chaining technique that we introduce, which is used to reduce the verification cost without affecting any other costs. All our schemes are lightweight and offer strong cryptographic guarantees derived from formal definitions and proofs. We experimentally confirm the practicality of our schemes in the performance-sensitive streaming setting.
    No preview · Article · Dec 2014 · ACM Transactions on Database Systems
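The vector-sum schemes above rely on digests that are linear in the stream updates. A minimal illustration of that algebraic skeleton, with a keyed pseudorandom coefficient per index (this is only the linearity idea, not the paper's actual cryptographic construction; all names here are assumptions):

```python
import hashlib
import hmac

P = 2 ** 61 - 1  # a Mersenne prime modulus (illustrative choice)

def coeff(key, i):
    # Pseudorandom per-index coefficient derived from the owner's secret key.
    d = hmac.new(key, str(i).encode(), hashlib.sha256).digest()
    return int.from_bytes(d[:8], "big") % P

def digest(key, vec):
    """Linear digest D(v) = sum_i coeff(i) * v[i] mod P. Linearity gives
    D(u + v) = D(u) + D(v), so the digest can be maintained incrementally
    under stream updates and checked against aggregated results; the
    paper's schemes build full cryptographic proofs on top of this idea."""
    return sum(coeff(key, i) * x for i, x in enumerate(vec)) % P
```

Because the digest distributes over addition, a client can verify a claimed sum of streams by comparing one digest of the result against the sum of per-stream digests.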
  • Minos Garofalakis · Rajeev Rastogi · Kyuseok Shim
    ABSTRACT: Sequential pattern mining under various constraints is a challenging data mining task. The paper provides a generic framework based on constraint programming to discover sequence patterns defined by constraints on local patterns (e.g., gap, regular expressions) or constraints on patterns involving combination of local patterns such as relevant subgroups and top-k patterns. This framework enables the user to mine in a declarative way both kinds of patterns. The solving step is done by exploiting the machinery of Constraint Programming. For complex patterns involving combination of local patterns, we improve the mining step by using dynamic CSP. Finally, we present two case studies in biomedical information extraction and stylistic analysis in linguistics.
    No preview · Conference Paper · Nov 2014
  • ABSTRACT: Many modern streaming applications, such as online analysis of financial, network, sensor, and other forms of data, are inherently distributed in nature. An important query type that is the focal point in such application scenarios regards actuation queries, where proper action is dictated based on a trigger condition placed upon the current value that a monitored function receives. Recent work [Sharfman et al. 2006, 2007b, 2008] studies the problem of (nonlinear) sophisticated function tracking in a distributed manner. The main concept behind the geometric monitoring approach proposed there is for each distributed site to perform the function monitoring over an appropriate subset of the input domain. In the current work, we examine whether the distributed monitoring mechanism can become more efficient, in terms of the number of communicated messages, by extending the geometric monitoring framework to utilize prediction models. We initially describe a number of local estimators (predictors) that are useful for the applications that we consider and which have already been shown particularly useful in past work. We then demonstrate the feasibility of incorporating predictors in the geometric monitoring framework and show that prediction-based geometric monitoring in fact generalizes the original geometric monitoring framework. We propose a large variety of different prediction-based monitoring models for the distributed threshold monitoring of complex functions. Our extensive experimentation with a variety of real datasets, functions, and parameter settings indicates that our approaches can provide significant communication savings ranging between two times and up to three orders of magnitude, compared to the transmission cost of the original monitoring framework.
    No preview · Article · May 2014 · ACM Transactions on Database Systems
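In the geometric method referenced above, each site tests whether the monitored function can cross the threshold anywhere inside a ball it is responsible for, and communicates only if it can. A minimal sketch for the special case f(x) = ||x||² (function and variable names are mine); prediction models enter by replacing the broadcast estimate with a forecast, which shrinks the balls and hence the number of transmissions:

```python
import math

def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

def safe_at_site(e, v, threshold):
    """Local geometric-monitoring test for f(x) = ||x||^2.
    e is the last globally broadcast estimate vector; v is this site's
    current drift-adjusted vector. The site is responsible for the ball
    whose diameter is the segment [e, v]; if f stays on the same side of
    the threshold as f(e) over that whole ball, the site stays silent."""
    c = [(ei + vi) / 2 for ei, vi in zip(e, v)]        # ball center
    r = norm([vi - ei for ei, vi in zip(e, v)]) / 2    # ball radius
    f_max = (norm(c) + r) ** 2                         # max of f over ball
    f_min = max(norm(c) - r, 0.0) ** 2                 # min of f over ball
    above = norm(e) ** 2 > threshold
    return (f_max > threshold) == above and (f_min > threshold) == above
```

For arbitrary nonlinear f, the max/min over the ball are bounded numerically rather than in closed form, but the monochromicity test is the same.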
  • Source
    Minos Garofalakis · Yannis Ioannidis
    ABSTRACT: Scheduling query execution plans is a particularly complex problem in shared-nothing parallel systems, where each site consists of a collection of local time-shared (e.g., CPU(s) or disk(s)) and space-shared (e.g., memory) resources and communicates with remote sites by message-passing. Earlier work on parallel query scheduling employs either (a) one-dimensional models of parallel task scheduling, effectively ignoring the potential benefits of resource sharing, or (b) models of globally accessible resource units, which are appropriate only for shared-memory architectures, since they cannot capture the affinity of system resources to sites. In this paper, we develop a general approach capturing the full complexity of scheduling distributed, multi-dimensional resource units for all forms of parallelism within and across queries and operators. We present a level-based list scheduling heuristic algorithm for independent query tasks (i.e., physical operator pipelines) that is provably near-optimal for given degrees of partitioned parallelism (with a worst-case performance ratio that depends on the number of time-shared and space-shared resources per site and the granularity of the clones). We also propose extensions to handle blocking constraints in logical operator (e.g., hash-join) pipelines and bushy query plans as well as on-line task arrivals (e.g., in a dynamic or multi-query execution environment). Experiments with our scheduling algorithms implemented on top of a detailed simulation model verify their effectiveness compared to existing approaches in a realistic setting. Based on our analytical and experimental results, we revisit the open problem of designing efficient cost models for parallel query optimization and propose a solution that captures all the important parameters of parallel execution.
    Preview · Article · Mar 2014
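The level-based list scheduling idea can be illustrated in one dimension: order tasks by decreasing work and greedily assign each to the least-loaded site. This toy (all names are mine) deliberately ignores the paper's multi-dimensional model of time-shared and space-shared resources per site:

```python
import heapq

def list_schedule(durations, num_sites):
    """Greedy list scheduling: longest task first, each assigned to the
    currently least-loaded site. Returns (task -> site map, makespan)."""
    loads = [(0.0, site) for site in range(num_sites)]
    heapq.heapify(loads)
    assignment = {}
    for task, d in sorted(enumerate(durations), key=lambda t: -t[1]):
        load, site = heapq.heappop(loads)   # least-loaded site
        assignment[task] = site
        heapq.heappush(loads, (load + d, site))
    return assignment, max(load for load, _ in loads)
```

The paper's heuristic generalizes this list-scheduling principle to clones of operator pipelines competing for several resource dimensions at once, with a provable worst-case performance ratio.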
  • Odysseas Papapetrou · Minos Garofalakis
    ABSTRACT: Distributed skyline computation is important for a wide range of application domains, from distributed and web-based systems to ISP-network monitoring and distributed databases. The problem is particularly challenging in dynamic distributed settings, where the goal is to efficiently monitor a continuous skyline query over a collection of distributed streams. All existing work relies on the assumption of a single point of reference for object attributes/dimensions, i.e., objects may be vertically or horizontally partitioned, but the accurate value of each dimension for each object is always maintained by a single site. This assumption is unrealistic for several distributed monitoring applications, where object information is fragmented over a set of distributed streams (each monitored by a different site) and needs to be aggregated (e.g., averaged) across several sites. Furthermore, it is frequently useful to define skyline dimensions through complex functions over the aggregated objects, which raises further challenges for dealing with object fragmentation. In this paper, we present the first known distributed approach for continuous fragmented skylines, namely distributed monitoring of skylines over complex functions of fragmented multi-dimensional objects. We also propose several optimizations, including a new technique based on random-walk models for adaptively determining the most efficient monitoring strategy for each object. A thorough experimental study with synthetic and real-life data sets verifies the effectiveness of our approach, demonstrating order-of-magnitude improvements in communication costs compared to the only available centralized solution.
    No preview · Conference Paper · Mar 2014
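As a point of reference for the monitoring problem above, the centralized skyline definition that the distributed protocols approximate is plain dominance filtering (a sketch under a smaller-is-better convention; function names are mine):

```python
def dominates(a, b):
    # a dominates b: at least as good in every dimension and strictly
    # better in at least one (smaller-is-better convention).
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    # A point belongs to the skyline iff no other point dominates it.
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

The fragmented setting of the paper is harder precisely because each point's coordinates are aggregates over several sites, so no single site can evaluate `dominates` locally.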
  • ABSTRACT: Biological networks in living organisms can be seen as the ultimate means of understanding the underlying mechanisms in complex diseases, such as oral cancer. During the last decade, many algorithms based on high-throughput genomic data have been developed to unravel the complexity of gene network construction and their progression in time. However, the small size of samples compared to the number of observed genes makes the inference of the network structure quite challenging. In this study, we propose a framework for constructing and analyzing gene networks from sparse experimental temporal data and investigate its potential in oral cancer. We use two network models based on partial correlations and kernel density estimation, in order to capture the genetic interactions. Using this network construction framework on real clinical data of the tissue and blood at different time stages, we identified common disease-related structures that may decipher the association between disease state and biological processes in oral cancer. Our study emphasizes an altered MET (hepatocyte growth factor receptor) network during oral cancer progression. In addition, we demonstrate that the functional changes of gene interactions during oral cancer progression might be particularly useful for patient categorization at the time of diagnosis and/or at follow-up periods.
    No preview · Article · Mar 2014 · IEEE Journal of Biomedical and Health Informatics
  • ABSTRACT: Many organizations today are faced with the challenge of processing and distilling information from huge and growing collections of data. Such organizations are increasingly deploying sophisticated mathematical algorithms to model the behavior of their ...
    Full-text · Article · Feb 2014 · ACM SIGMOD Record
  • Minos Garofalakis · Daniel Keren · Vasilis Samoladas
    ABSTRACT: Emerging large-scale monitoring applications rely on continuous tracking of complex data-analysis queries over collections of massive, physically-distributed data streams. Thus, in addition to the space- and time-efficiency requirements of conventional stream processing (at each remote monitor site), effective solutions also need to guarantee communication efficiency (over the underlying communication network). The complexity of the monitored query adds to the difficulty of the problem - this is especially true for nonlinear queries (e.g., joins), where no obvious solutions exist for distributing the monitor condition across sites. The recently proposed geometric method offers a generic methodology for splitting an arbitrary (non-linear) global threshold-monitoring task into a collection of local site constraints; still, the approach relies on maintaining the complete stream(s) at each site, thus raising serious efficiency concerns for massive data streams. In this paper, we propose novel algorithms for efficiently tracking a broad class of complex aggregate queries in such distributed-streams settings. Our tracking schemes rely on a novel combination of the geometric method with compact sketch summaries of local data streams, and maintain approximate answers with provable error guarantees, while optimizing space and processing costs at each remote site and communication cost across the network. One of our key technical insights for the effective use of the geometric method lies in exploiting a much lower-dimensional space for monitoring the sketch-based estimation query. Due to the complex, highly nonlinear nature of these estimates, efficiently monitoring the local geometric constraints poses challenging algorithmic issues for which we propose novel solutions. Experimental results on real-life data streams verify the effectiveness of our approach.
    No preview · Article · Aug 2013 · Proceedings of the VLDB Endowment
  • ABSTRACT: Oral cancer is characterized by multiple genetic events such as alterations of a number of oncogenes and tumour suppressor genes. The aim of this study is to identify genes and their functional interactions that may play a crucial role in a specific disease state, especially during oral cancer progression. We examine gene interaction networks on blood genomic data obtained from twenty-three oral cancer patients at four different time stages. We generate the gene-gene networks from sparse experimental temporal data using two methods, Partial Correlations and Kernel Density Estimation, in order to capture genetic interactions. The network study reveals an altered MET (hepatocyte growth factor receptor) network during oral cancer progression, which is further analyzed in relation to other studies.
    No preview · Article · Jul 2013 · Conference proceedings: ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Conference
  • ABSTRACT: We consider a stream outsourcing setting, where a data owner delegates the management of a set of disjoint data streams to an untrusted server. The owner authenticates his streams via signatures. The server processes continuous queries on the union of the streams for clients trusted by the owner. Along with the results, the server sends proofs of result correctness derived from the owner's signatures, which are easily verifiable by the clients. We design novel constructions for a collection of fundamental problems over streams represented as linear algebraic queries. In particular, our basic schemes authenticate dynamic vector sums and dot products, as well as dynamic matrix products. These techniques can be adapted for authenticating a wide range of important operations in streaming environments, including group by queries, joins, in-network aggregation, similarity matching, and event processing. All our schemes are very lightweight, and offer strong cryptographic guarantees derived from formal definitions and proofs. We experimentally confirm the practicality of our schemes.
    No preview · Conference Paper · Jun 2013
  • Source
    Amol Deshpande · Minos Garofalakis · Michael I. Jordan
    ABSTRACT: In this paper, we present an efficient way of performing stepwise selection in the class of decomposable models. The main contribution of the paper is a simple characterization of the edges that can be added to a decomposable model while keeping the resulting model decomposable, and an efficient algorithm for enumerating all such edges for a given model in essentially O(1) time per edge. We also discuss how backward selection can be performed efficiently using our data structures. We also analyze the complexity of the complete stepwise selection procedure, including the complexity of choosing which of the eligible edges to add to (or delete from) the current model, with the aim of minimizing the Kullback-Leibler distance of the resulting model from the saturated model for the data.
    Full-text · Article · Jan 2013
  • ABSTRACT: Biological networks are often described as probabilistic graphs in the context of gene and protein sequence analysis in molecular biology. Microarrays and proteomics technologies facilitate the monitoring of expression levels over thousands of biological units over time. Several experimental efforts have appeared aiming to unveil pairwise interactions, with many graphical models being introduced in order to discover associations from expression-data analysis. However, the small size of samples compared to the number of observed genes/proteins makes the inference of the network structure quite challenging. In this study, we generate gene-protein networks from sparse experimental temporal data using two methods, partial correlations and Kernel Density Estimation (KDE), in an attempt to capture genetic interactions. Applying the KDE method, we model the genetic associations as Gaussian approximations, while through the dynamic Gaussian analysis we aim to identify relationships between genes and proteins at different time stages. The statistical results demonstrate valid biological interactions and indicate potential new indirect relations that deserve further biological examination for validation.
    No preview · Article · Jan 2013 · International Journal of Biomedical Engineering and Technology
  • K.D. Kalantzaki · E.S. Bei · M. Garofalakis · M. Zervakis
    ABSTRACT: Biological networks are often described as probabilistic graphs in the context of gene and protein sequence analysis in molecular biology. Microarrays and proteomics technologies allow the monitoring of expression levels over thousands of biological units over time. In experimental efforts, we are interested in unveiling pairwise interactions. Many graphical models have been introduced in order to discover associations from the expression data analysis. However, the small size of samples compared to the number of observed genes/proteins makes the inference of the network structure quite challenging. In this study, we generate gene-protein networks from sparse experimental data using two methods, partial correlations and Kernel Density Estimation, in order to capture genetic interactions. Dynamic Gaussian analysis is used to match special characteristics to genes and proteins at different time stages, utilizing the KDE method for expressing Gaussian associations with non-linear parameters.
    No preview · Conference Paper · Nov 2012
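The partial-correlation construction used in the network papers above draws an edge between two units only if their correlation survives conditioning on others. A self-contained first-order version (helper names are mine; real pipelines condition on all remaining variables via the precision matrix):

```python
import math

def pearson(a, b):
    # Sample Pearson correlation coefficient.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y controlling for z:
    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)).
    In network construction, an edge x-y is kept only if this value stays
    large, i.e. the association is not explained away by z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
```
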
  • ABSTRACT: While traditional data-management systems focus on evaluating single, ad-hoc queries over static data sets in a centralized setting, several emerging applications require (possibly, continuous) answers to queries on dynamic data that is widely distributed and constantly updated. Furthermore, such query answers often need to discount data that is "stale", and operate solely on a sliding window of recent data arrivals (e.g., data updates occurring over the last 24 hours). Such distributed data streaming applications mandate novel algorithmic solutions that are both time- and space-efficient (to manage high-speed data streams), and also communication-efficient (to deal with physical data distribution). In this paper, we consider the problem of complex query answering over distributed, high-dimensional data streams in the sliding-window model. We introduce a novel sketching technique (termed ECM-sketch) that allows effective summarization of streaming data over both time-based and count-based sliding windows with probabilistic accuracy guarantees. Our sketch structure enables point as well as inner-product queries, and can be employed to address a broad range of problems, such as maintaining frequency statistics, finding heavy hitters, and computing quantiles in the sliding-window model. Focusing on distributed environments, we demonstrate how ECM-sketches of individual, local streams can be composed to generate a (low-error) ECM-sketch summary of the order-preserving aggregation of all streams; furthermore, we show how ECM-sketches can be exploited for continuous monitoring of sliding-window queries over distributed streams. Our extensive experimental study with two real-life data sets validates our theoretical claims and verifies the effectiveness of our techniques. To the best of our knowledge, ours is the first work to address efficient, guaranteed-error complex query answering over distributed data streams in the sliding-window model.
    Full-text · Article · Jun 2012 · Proceedings of the VLDB Endowment
  • ABSTRACT: Many modern streaming applications, such as online analysis of financial, network, sensor and other forms of data are inherently distributed in nature. An important query type that is the focal point in such application scenarios regards actuation queries, where proper action is dictated based on a trigger condition placed upon the current value that a monitored function receives. Recent work studies the problem of (non-linear) sophisticated function tracking in a distributed manner. The main concept behind the geometric monitoring approach proposed there, is for each distributed site to perform the function monitoring over an appropriate subset of the input domain. In the current work, we examine whether the distributed monitoring mechanism can become more efficient, in terms of the number of communicated messages, by extending the geometric monitoring framework to utilize prediction models. We initially describe a number of local estimators (predictors) that are useful for the applications that we consider and which have already been shown particularly useful in past work. We then demonstrate the feasibility of incorporating predictors in the geometric monitoring framework and show that prediction-based geometric monitoring in fact generalizes the original geometric monitoring framework. We propose a large variety of different prediction-based monitoring models for the distributed threshold monitoring of complex functions. Our extensive experimentation with a variety of real data sets, functions and parameter settings indicates that our approaches can provide significant communication savings ranging between two times and up to three orders of magnitude, compared to the transmission cost of the original monitoring framework.
    No preview · Conference Paper · May 2012
  • ABSTRACT: Full-text documents represent a large fraction of the world’s data. Although not structured per se, they often contain snippets of structured information within them: e.g., names, addresses, and document titles. Information Extraction (IE) techniques identify such structured information in text. In recent years, database research has pursued IE on two fronts: declarative languages and systems for managing IE tasks, and IE as an uncertain data source for Probabilistic Databases. It is natural to consider merging these two directions, but efforts to do so have had to compromise on the statistical robustness of IE algorithms in order to fit with early Probabilistic Database models. In this paper, we bridge the gap between these ideas by implementing a state-of-the-art statistical IE approach – Conditional Random Fields (CRFs) – in the setting of Probabilistic Databases that treat statistical models as first-class data objects. Using standard relational tables to capture CRF parameters, and inverted-file representations of text, we show that the Viterbi algorithm for CRF inference can be specified declaratively in recursive SQL, in a manner that can both choose likely segmentations, and provide detailed marginal distributions for label assignment. Given this implementation, we propose query processing optimizations that effectively combine probabilistic inference and relational operators such as selections and joins. In an experimental study with two data sets, we demonstrate the efficiency of our in-database Viterbi implementation in PostgreSQL relative to an open-source CRF library, and show the performance benefits of our optimizations.
    Preview · Article · Apr 2012
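The Viterbi recurrence that the paper expresses in recursive SQL is the standard dynamic program below. Plain Python is used purely for illustration, and the HMM-style parameterization (start, transition, and emission tables) is a simplification of the CRF factor tables the paper stores relationally:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """V[t][s] = probability of the best label sequence ending in state s
    at position t; back[t][s] remembers the argmax predecessor. The paper
    expresses exactly this recurrence declaratively in recursive SQL."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev, p = max(((r, V[t - 1][r] * trans_p[r][s]) for r in states),
                          key=lambda rp: rp[1])
            V[t][s] = p * emit_p[s][obs[t]]
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```
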
  • Graham Cormode · Minos Garofalakis · Peter J. Haas · Chris Jermaine
    ABSTRACT: Methods for Approximate Query Processing (AQP) are essential for dealing with massive data. They are often the only means of providing interactive response times when exploring massive datasets, and are also needed to handle high speed data streams. These methods proceed by computing a lossy, compact synopsis of the data, and then executing the query of interest against the synopsis rather than the entire dataset. We describe basic principles and recent developments in AQP. We focus on four key synopses: random samples, histograms, wavelets, and sketches. We consider issues such as accuracy, space and time efficiency, optimality, practicality, range of applicability, error bounds on query answers, and incremental maintenance. We also discuss the trade-offs between the different synopsis types.
    No preview · Article · Jan 2012 · Foundations and Trends in Databases
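Of the four synopsis types the survey covers, the random sample is the simplest to sketch: Vitter's Algorithm R maintains a uniform size-k sample over a stream of unknown length in one pass (function name is mine):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Algorithm R: one-pass uniform random sample of size k from a stream
    of unknown length. After i+1 items have arrived, each of them is in the
    sample with probability k / (i + 1)."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir
        else:
            j = rng.randrange(i + 1)   # keep item with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample
```

Queries are then answered against the k-item sample (with appropriate scaling) instead of the full stream, trading bounded error for constant space.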

Publication Stats

6k Citations
62.07 Total Impact Points

Institutions

  • 2008-2015
    • Technical University of Crete
      • Department of Electronic and Computer Engineering
Chania, Crete, Greece
  • 2001-2012
    • University of California, Berkeley
      • Department of Electrical Engineering and Computer Sciences
      Berkeley, California, United States
    • Research Center on Scientific and Technical Information
Algiers, Algiers, Algeria
  • 2010
    • C-Crete Technologies
      Stafford, Texas, United States
  • 2007-2009
    • Mission College
      Santa Clara, California, United States
    • Tsinghua University
Beijing, Beijing, China
    • Kent State University
      • Department of Computer Science
Kent, Ohio, United States
  • 2006
    • Indian Institute of Technology Delhi
      • Department of Computer Science and Engineering
New Delhi, NCT, India
  • 2002-2005
    • Alcatel Lucent
Paris, Île-de-France, France
  • 2000
    • Stanford University
      Stanford, California, United States
  • 1996-1999
    • University of Wisconsin, Madison
      • Department of Computer Sciences
Madison, Wisconsin, United States
  • 1998
    • Bell and Howell
      Wheeling, Illinois, United States