Minos N. Garofalakis

Technical University of Crete, Chania, Crete, Greece

Publications (147) · 36.88 Total Impact

  • ABSTRACT: Many organizations today are faced with the challenge of processing and distilling information from huge and growing collections of data. Such organizations are increasingly deploying sophisticated mathematical algorithms to model the behavior of their ...
    ACM SIGMOD Record 02/2014; 42(4):61-66. · 0.46 Impact Factor
  • ABSTRACT: Biological networks in living organisms can be seen as the ultimate means of understanding the underlying mechanisms in complex diseases, such as oral cancer. During the last decade, many algorithms based on high-throughput genomic data have been developed to unravel the complexity of gene network construction and their progression in time. However, the small size of samples compared to the number of observed genes makes the inference of the network structure quite challenging. In this study, we propose a framework for constructing and analyzing gene networks from sparse experimental temporal data and investigate its potential in oral cancer. We use two network models, based on partial correlations and kernel density estimation, in order to capture the genetic interactions. Using this network construction framework on real clinical tissue and blood data at different time stages, we identified common disease-related structures that may decipher the association between disease state and biological processes in oral cancer. Our study emphasizes an altered MET (hepatocyte growth factor receptor) network during oral cancer progression. In addition, we demonstrate that the functional changes of gene interactions during oral cancer progression might be particularly useful for patient categorization at the time of diagnosis and/or at follow-up periods.
    IEEE Journal of Biomedical and Health Informatics 01/2014; 18(2):562-573. · 1.98 Impact Factor
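    As an aside, a minimal sketch of the partial-correlation network model described in the abstract above; the shrinkage constant, threshold, and toy data are illustrative assumptions, not the paper's settings:

    ```python
    # Hedged sketch: building a gene-interaction network from temporal expression
    # data via partial correlations, one of the two models used in the paper.
    import numpy as np

    def partial_correlation_network(X, threshold=0.3, shrinkage=0.1):
        """X: (n_samples, n_genes) expression matrix; returns 0/1 adjacency."""
        R = np.corrcoef(X, rowvar=False)                  # gene-gene correlations
        R = (1 - shrinkage) * R + shrinkage * np.eye(R.shape[0])  # regularize: few samples
        P = np.linalg.pinv(R)                             # (pseudo-)inverse -> precision matrix
        d = np.sqrt(np.outer(np.diag(P), np.diag(P)))
        pcorr = -P / d                                    # partial correlations
        np.fill_diagonal(pcorr, 0.0)
        return (np.abs(pcorr) > threshold).astype(int)    # edge if |pcorr| above threshold

    # Toy usage: 8 time-stage samples, 5 genes
    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 5))
    print(partial_correlation_network(X))
    ```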
  • ABSTRACT: Oral cancer is characterized by multiple genetic events such as alterations of a number of oncogenes and tumour suppressor genes. The aim of this study is to identify genes and their functional interactions that may play a crucial role in a specific disease state, especially during oral cancer progression. We examine gene interaction networks on blood genomic data obtained from twenty-three oral cancer patients at four different time stages. We generate the gene-gene networks from sparse experimental temporal data using two methods, Partial Correlations and Kernel Density Estimation, in order to capture genetic interactions. The network study reveals an altered MET (hepatocyte growth factor receptor) network during oral cancer progression, which is further analyzed in relation to other studies.
    Conference proceedings: ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society 07/2013; 2013:575-578.
  • ABSTRACT: We consider a stream outsourcing setting, where a data owner delegates the management of a set of disjoint data streams to an untrusted server. The owner authenticates his streams via signatures. The server processes continuous queries on the union of the streams for clients trusted by the owner. Along with the results, the server sends proofs of result correctness derived from the owner's signatures, which are easily verifiable by the clients. We design novel constructions for a collection of fundamental problems over streams represented as linear algebraic queries. In particular, our basic schemes authenticate dynamic vector sums and dot products, as well as dynamic matrix products. These techniques can be adapted for authenticating a wide range of important operations in streaming environments, including group by queries, joins, in-network aggregation, similarity matching, and event processing. All our schemes are very lightweight, and offer strong cryptographic guarantees derived from formal definitions and proofs. We experimentally confirm the practicality of our schemes.
    Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data; 06/2013
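    To convey the flavor of such lightweight verification (not the paper's actual construction), here is a sketch of a linearly homomorphic hash that lets a client check a dynamic vector sum; the prime, generators, and vectors are toy assumptions:

    ```python
    # Hedged illustration: h(v) = prod_i g_i^{v_i} mod p satisfies
    # h(v + w) = h(v) * h(w) mod p, so a client holding signed hashes of the
    # individual stream updates can verify a server-reported vector sum
    # without seeing the raw data. Toy parameters, not production-grade.
    import random

    P = 2**127 - 1                     # a Mersenne prime modulus (toy choice)
    DIM = 4
    random.seed(42)
    G = [random.randrange(2, P) for _ in range(DIM)]   # public "generators"

    def hhash(v):
        out = 1
        for g, x in zip(G, v):
            out = (out * pow(g, x, P)) % P
        return out

    v1 = [3, 0, 7, 2]                  # stream 1 update (owner signs hhash(v1))
    v2 = [1, 5, 0, 4]                  # stream 2 update (owner signs hhash(v2))
    claimed_sum = [a + b for a, b in zip(v1, v2)]      # server's answer

    # Client-side check: the homomorphism verifies the sum in one comparison.
    assert hhash(claimed_sum) == (hhash(v1) * hhash(v2)) % P
    ```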
  • Amol Deshpande, Minos Garofalakis, Michael I. Jordan
    ABSTRACT: In this paper, we present an efficient way of performing stepwise selection in the class of decomposable models. The main contribution of the paper is a simple characterization of the edges that can be added to a decomposable model while keeping the resulting model decomposable, and an efficient algorithm for enumerating all such edges for a given model in essentially O(1) time per edge. We also discuss how backward selection can be performed efficiently using our data structures. We also analyze the complexity of the complete stepwise selection procedure, including the complexity of choosing which of the eligible edges to add to (or delete from) the current model, with the aim of minimizing the Kullback-Leibler distance of the resulting model from the saturated model for the data.
    01/2013;
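    A naive baseline sketch that makes the search space above concrete; the paper's contribution is enumerating these edges in essentially O(1) time per edge, which this brute-force chordality test does not attempt:

    ```python
    # Hedged baseline: a decomposable model corresponds to a chordal graph, so
    # the edges that can be added while preserving decomposability are exactly
    # those whose insertion keeps the graph chordal. Here we simply test each
    # candidate, as a slow but concrete definition of the search space.
    import itertools
    import networkx as nx

    def addable_edges_naive(G):
        """Edges whose addition keeps the chordal graph chordal."""
        assert nx.is_chordal(G)
        out = []
        for u, v in itertools.combinations(G.nodes, 2):
            if G.has_edge(u, v):
                continue
            G.add_edge(u, v)
            if nx.is_chordal(G):
                out.append((u, v))
            G.remove_edge(u, v)
        return out

    G = nx.path_graph(4)           # a chordal graph: 0-1-2-3
    print(addable_edges_naive(G))  # (0,2) and (1,3) are safe; (0,3) would
                                   # create a chordless 4-cycle
    ```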
  • ABSTRACT: Biological networks are often described as probabilistic graphs in the context of gene and protein sequence analysis in molecular biology. Microarray and proteomics technologies facilitate the monitoring of expression levels over thousands of biological units over time. Several experimental efforts have appeared aiming to unveil pairwise interactions, with many graphical models being introduced in order to discover associations from expression-data analysis. However, the small size of samples compared to the number of observed genes/proteins makes the inference of the network structure quite challenging. In this study, we generate gene-protein networks from sparse experimental temporal data using two methods, partial correlations and Kernel Density Estimation (KDE), in an attempt to capture genetic interactions. Applying the KDE method, we model the genetic associations as Gaussian approximations, while through dynamic Gaussian analysis we aim to identify relationships between genes and proteins at different time stages. The statistical results demonstrate valid biological interactions and indicate potential new indirect relations that deserve further biological examination for validation.
    Int. J. of Biomedical Engineering and Technology 01/2013; 13(4):383-409.
  • ABSTRACT: While traditional data-management systems focus on evaluating single, ad-hoc queries over static data sets in a centralized setting, several emerging applications require (possibly, continuous) answers to queries on dynamic data that is widely distributed and constantly updated. Furthermore, such query answers often need to discount data that is "stale", and operate solely on a sliding window of recent data arrivals (e.g., data updates occurring over the last 24 hours). Such distributed data streaming applications mandate novel algorithmic solutions that are both time- and space-efficient (to manage high-speed data streams), and also communication-efficient (to deal with physical data distribution). In this paper, we consider the problem of complex query answering over distributed, high-dimensional data streams in the sliding-window model. We introduce a novel sketching technique (termed ECM-sketch) that allows effective summarization of streaming data over both time-based and count-based sliding windows with probabilistic accuracy guarantees. Our sketch structure enables point as well as inner-product queries, and can be employed to address a broad range of problems, such as maintaining frequency statistics, finding heavy hitters, and computing quantiles in the sliding-window model. Focusing on distributed environments, we demonstrate how ECM-sketches of individual, local streams can be composed to generate a (low-error) ECM-sketch summary of the order-preserving aggregation of all streams; furthermore, we show how ECM-sketches can be exploited for continuous monitoring of sliding-window queries over distributed streams. Our extensive experimental study with two real-life data sets validates our theoretical claims and verifies the effectiveness of our techniques. To the best of our knowledge, ours is the first work to address efficient, guaranteed-error complex query answ...[truncated].
    06/2012;
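    For context, a sketch of the plain Count-Min structure that ECM-sketches extend; in the paper each counter becomes an exponential-histogram counter so estimates can be restricted to a sliding window. The width, depth, and hashing below are illustrative:

    ```python
    # Hedged sketch: a basic Count-Min sketch. ECM-sketches keep the same
    # hashing layout but replace each integer cell with a sliding-window
    # (exponential histogram) counter.
    import random

    class CountMin:
        def __init__(self, width=256, depth=4, seed=1):
            rng = random.Random(seed)
            self.width, self.depth = width, depth
            self.salts = [rng.randrange(1 << 30) for _ in range(depth)]
            self.table = [[0] * width for _ in range(depth)]

        def _cells(self, item):
            for row, salt in enumerate(self.salts):
                yield row, hash((salt, item)) % self.width

        def add(self, item, count=1):
            for row, col in self._cells(item):
                self.table[row][col] += count

        def estimate(self, item):          # overestimates, with bounded error
            return min(self.table[row][col] for row, col in self._cells(item))

    cm = CountMin()
    for _ in range(1000):
        cm.add("heavy")
    cm.add("light")
    print(cm.estimate("heavy"), cm.estimate("light"))
    ```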
  • ABSTRACT: Many modern streaming applications, such as online analysis of financial, network, sensor and other forms of data, are inherently distributed in nature. An important query type in such application scenarios is the actuation query, where proper action is dictated by a trigger condition placed upon the current value of a monitored function. Recent work studies the problem of sophisticated (non-linear) function tracking in a distributed manner. The main concept behind the geometric monitoring approach proposed there is for each distributed site to perform the function monitoring over an appropriate subset of the input domain. In the current work, we examine whether the distributed monitoring mechanism can become more efficient, in terms of the number of communicated messages, by extending the geometric monitoring framework to utilize prediction models. We initially describe a number of local estimators (predictors) that are useful for the applications we consider and that have already proven particularly useful in past work. We then demonstrate the feasibility of incorporating predictors into the geometric monitoring framework and show that prediction-based geometric monitoring in fact generalizes the original geometric monitoring framework. We propose a large variety of prediction-based monitoring models for the distributed threshold monitoring of complex functions. Our extensive experimentation with a variety of real data sets, functions and parameter settings indicates that our approaches can provide significant communication savings, ranging from a factor of two up to three orders of magnitude compared to the transmission cost of the original monitoring framework.
    SIGMOD; 05/2012
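    A simplified sketch of the local test at the heart of geometric monitoring; the real framework bounds the function over each site's drift ball analytically, so the sampling check, monitored function, and vectors below are illustrative assumptions only:

    ```python
    # Hedged sketch: each site monitors the ball centered midway between the
    # last communicated global estimate e and its local drift vector u; it
    # raises an alert if the monitored function f could cross the threshold
    # anywhere inside that ball. We approximate that test by sampling.
    import numpy as np

    def local_violation(f, e, u, threshold, n_samples=2000, seed=0):
        rng = np.random.default_rng(seed)
        c, r = (e + u) / 2.0, np.linalg.norm(u - e) / 2.0
        side = f(e) <= threshold                  # which side the estimate is on
        for _ in range(n_samples):
            d = rng.normal(size=e.shape)
            p = c + r * rng.uniform() * d / np.linalg.norm(d)  # point in the ball
            if (f(p) <= threshold) != side:
                return True                       # f may cross the threshold
        return False

    f = lambda x: float(x @ x)                    # example: monitor ||x||^2
    e = np.array([1.0, 0.0])                      # last communicated estimate
    u = np.array([1.8, 0.4])                      # this site's drift vector
    print(local_violation(f, e, u, threshold=2.0))
    ```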
  • ABSTRACT: Biological networks are often described as probabilistic graphs in the context of gene and protein sequence analysis in molecular biology. Microarray and proteomics technologies allow the monitoring of expression levels over thousands of biological units over time. In experimental efforts we are interested in unveiling pairwise interactions, and many graphical models have been introduced in order to discover associations from expression-data analysis. However, the small size of samples compared to the number of observed genes/proteins makes the inference of the network structure quite challenging. In this study we generate gene-protein networks from sparse experimental data using two methods, partial correlations and Kernel Density Estimation, in order to capture genetic interactions. Dynamic Gaussian analysis is used to match special characteristics to genes and proteins at different time stages, utilizing the KDE method to express Gaussian associations with non-linear parameters.
    2012 IEEE 12th International Conference on Bioinformatics & Bioengineering (BIBE); 01/2012
  • ABSTRACT: Methods for Approximate Query Processing (AQP) are essential for dealing with massive data. They are often the only means of providing interactive response times when exploring massive datasets, and are also needed to handle high speed data streams. These methods proceed by computing a lossy, compact synopsis of the data, and then executing the query of interest against the synopsis rather than the entire dataset. We describe basic principles and recent developments in AQP. We focus on four key synopses: random samples, histograms, wavelets, and sketches. We consider issues such as accuracy, space and time efficiency, optimality, practicality, range of applicability, error bounds on query answers, and incremental maintenance. We also discuss the trade-offs between the different synopsis types.
    Foundations and Trends in Databases 01/2012; 4(1–3):1-294.
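    A minimal sketch of the simplest synopsis family covered by the survey: answering a SUM query from a uniform random sample, with scale-up and a CLT-style error bound. The dataset and sampling rate are illustrative:

    ```python
    # Hedged sketch: sampling-based AQP. Estimate SUM(data) from a 1% uniform
    # sample; the estimate scales the sample sum up by N/n and carries an
    # approximate 95% confidence half-width.
    import math
    import random

    random.seed(7)
    data = [random.expovariate(0.01) for _ in range(1_000_000)]  # full dataset

    n = 10_000                                  # synopsis: a 1% uniform sample
    sample = random.sample(data, n)
    scale = len(data) / n

    mean = sum(sample) / n
    est = scale * sum(sample)                   # scale-up estimator for SUM
    s2 = sum((x - mean) ** 2 for x in sample) / (n - 1)
    ci95 = 1.96 * scale * math.sqrt(n * s2)     # ~95% confidence half-width

    print(f"estimate = {est:.0f} +/- {ci95:.0f}   truth = {sum(data):.0f}")
    ```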
  • Vibhor Rastogi, Nilesh Dalvi, Minos Garofalakis
    ABSTRACT: There have been several recent advancements in the Machine Learning community on the Entity Matching (EM) problem. However, their lack of scalability has prevented them from being applied in practical settings on large real-life datasets. Towards this end, we propose a principled framework to scale any generic EM algorithm. Our technique consists of running multiple instances of the EM algorithm on small neighborhoods of the data and passing messages across neighborhoods to construct a global solution. We prove formal properties of our framework and experimentally demonstrate the effectiveness of our approach in scaling EM algorithms.
    Computing Research Repository (CoRR) 03/2011;
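    A toy sketch of the framework's flavor: run a black-box matcher per neighborhood and stitch local matches into a global solution with union-find. The blocking rule, matcher, and records are stand-ins, and the paper's message passing between neighborhoods is richer than this:

    ```python
    # Hedged sketch: neighborhood-at-a-time entity matching with a global
    # merge step. `simple_matcher` stands in for any generic EM algorithm.
    from itertools import combinations

    def simple_matcher(records):               # stand-in local EM algorithm
        return [(a, b) for a, b in combinations(records, 2)
                if a.split()[0].lower() == b.split()[0].lower()]

    def blocks(records):                        # neighborhoods by first letter
        out = {}
        for r in records:
            out.setdefault(r[0].lower(), []).append(r)
        return out.values()

    records = ["Ann Smith", "ann smyth", "Bob Lee", "bob lee", "Cara Diaz"]
    parent = {r: r for r in records}
    def find(x):                                # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for block in blocks(records):
        for a, b in simple_matcher(block):      # local matching per neighborhood
            parent[find(a)] = find(b)           # merge into a global solution

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), []).append(r)
    print(list(clusters.values()))
    ```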
  • ABSTRACT: In the database community, work on information extraction (IE) has centered on two themes: how to effectively manage IE tasks, and how to manage the uncertainties that arise in the IE process in a scalable manner. Recent work has proposed a probabilistic database (PDB) based declarative IE system that supports a leading statistical IE model, and an associated inference algorithm to answer top-k-style queries over the probabilistic IE outcome. Still, the broader problem of effectively supporting general probabilistic inference inside a PDB-based declarative IE system remains open. In this paper, we explore the in-database implementations of a wide variety of inference algorithms suited to IE, including two Markov chain Monte Carlo algorithms, the Viterbi and the sum-product algorithms. We describe the rules for choosing appropriate inference algorithms based on the model, the query and the text, considering the trade-off between accuracy and runtime. Based on these rules, we describe a hybrid approach to optimize the execution of a single probabilistic IE query to employ different inference algorithms appropriate for different records. We show that our techniques can achieve up to 10-fold speedups compared to the non-hybrid solutions proposed in the literature.
    Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12-16, 2011; 01/2011
  • ABSTRACT: Unstructured text represents a large fraction of the world's data. It often contains snippets of structured information (e.g., people's names and zip codes). Information Extraction (IE) techniques identify such structured information in text. In recent years, database research has pursued IE on two fronts: declarative languages and systems for managing IE tasks, and probabilistic databases for querying the output of IE. In this paper, we make the first steps to merge these two directions, without loss of statistical robustness, by implementing a state-of-the-art statistical IE model - Conditional Random Fields (CRFs) - in the setting of a Probabilistic Database that treats statistical models as first-class data objects. We show that the Viterbi algorithm for CRF inference can be specified declaratively in recursive SQL. We also show the performance benefits relative to a standalone open-source Viterbi implementation. This work opens up optimization opportunities for queries involving both inference and relational operators over IE models.
    Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long Beach, California, USA; 01/2010
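    For reference, the Viterbi recurrence that the paper expresses in recursive SQL, written here in plain Python for a linear-chain CRF; the log-potential tables are toy assumptions, not a learned model:

    ```python
    # Hedged sketch: Viterbi decoding for a linear-chain model.
    import numpy as np

    def viterbi(emit, trans):
        """emit: (T, K) per-token label log-potentials; trans: (K, K)
        label-pair log-potentials. Returns the highest-scoring label path."""
        T, K = emit.shape
        score = np.zeros((T, K))
        back = np.zeros((T, K), dtype=int)
        score[0] = emit[0]
        for t in range(1, T):
            cand = score[t - 1][:, None] + trans + emit[t][None, :]  # (K, K)
            back[t] = cand.argmax(axis=0)       # best predecessor per label
            score[t] = cand.max(axis=0)
        path = [int(score[-1].argmax())]
        for t in range(T - 1, 0, -1):           # follow back-pointers
            path.append(int(back[t][path[-1]]))
        return path[::-1]

    emit = np.log(np.array([[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]]))
    trans = np.log(np.array([[0.8, 0.2], [0.3, 0.7]]))
    print(viterbi(emit, trans))   # -> [0, 0, 0] for these toy tables
    ```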
  • ABSTRACT: Recent years have witnessed the emergence of novel database applications in various nontraditional domains, including location-based services, sensor networks, RFID systems, and biological and biometric databases. Traditionally, data mining has been widely used to reveal interesting patterns in the vast amounts of data generated by such applications. However, for most of these emerging domains, data is often riddled with uncertainty, arising, for instance, from inherent measurement inaccuracies, sampling and curation errors, and network latencies, or even from intentional blurring of the data (to preserve anonymity). Such forms of data uncertainty have to be handled carefully, or else the results of long and tedious data analyses could be inaccurate or even incorrect. In particular, it is important to collect and distill the knowledge from experts in developing mining and data processing methods that are uncertainty-aware. Recently, there has been active interest in the database community in treating uncertain data as a "first-class citizen," where the probability and statistical information of data are stored in the DBMS. Novel queries can be evaluated on these data to produce probabilistic results [1]. More recently, new mining algorithms that take into account data uncertainty, such as frequent pattern mining [2] and clustering [3], have also been proposed. Aggarwal and Yu [4] give a survey of the area of uncertain data querying and mining, where the key challenges include: models and structures for uncertain information in data mining and complex data analysis; association rule mining and clustering of uncertain data; machine learning on uncertain data; mining moving-object trajectories and biological data with noise; similarity matching of objects with uncertainty; and efficient mining and analysis of uncertain/probabilistic data streams. This special section of the IEEE Transactions on Knowledge and Data Engineering features a collection of four papers, selected from 23 submissions, representing recent advances in the mining of uncertain databases. These works present new techniques for mining patterns, clustering, and ranking on uncertain data. In applications like biological databases, graph data are often incomplete and imprecise. Mining uncertain graph data is semantically different from, and computationally more challenging than, mining exact graph data. The first paper, "Mining Frequent Subgraph Patterns from Uncertain Graph Data" by Zhaonian Zou, Jianzhong Li, Hong Gao, and Shuo Zhang, investigates the mining of frequent subgraph patterns from uncertain graph data. They propose an uncertain graph model and formalize the subgraph mining problem, which is NP-hard; they therefore develop an approximate and scalable algorithm. The problem of clustering large uncertain location databases is investigated in "Clustering Uncertain Data Using Voronoi Diagrams and R-Tree Index," written by Wai-Shing Ho and his co-authors. They show that the UK-means algorithm, which generalizes the k-means algorithm to handle uncertain objects, is inefficient due to a large amount of expensive expected-distance calculation. They propose pruning techniques based on Voronoi diagrams to reduce the amount of expected-distance calculation. These techniques are analytically proven to be more effective than the basic bounding-box-based technique. The authors also use an R-tree index to speed up the retrieval of the uncertain objects.
    The third paper, "Scalable Probabilistic Similarity Ranking in Uncertain Databases" by Thomas Bernecker, Hans-Peter Kriegel, Nikos Mamoulis, Matthias Renz, and Andreas Zuefle, studies how to rank uncertain data according to their distances to a reference object. They propose a framework that incrementally computes, for each object instance and ranking position, the probability of the object falling at that ranking position. While existing approaches compute this probability distribution using quadratic-complexity algorithms, the new algorithm requires linear time with the same memory requirements. They also show how the output of their method can be used to apply probabilistic top-k ranking for the objects according to different state-of-the-art definitions.
    IEEE Transactions on Knowledge and Data Engineering 01/2010; 22. · 1.89 Impact Factor
  • ABSTRACT: Recently, there has been increasing interest in extending relational query processing to include data obtained from unstructured sources. A common approach is to use stand-alone Information Extraction (IE) techniques to identify and label entities within blocks of text; the resulting entities are then imported into a standard database and processed using relational queries. This two-part approach, however, suffers from two main drawbacks. First, IE is inherently probabilistic, but traditional query processing does not properly handle probabilistic data, resulting in reduced answer quality. Second, performance inefficiencies arise due to the separation of IE from query processing. In this paper, we address these two problems by building on an in-database implementation of a leading IE model---Conditional Random Fields using the Viterbi inference algorithm. We develop two different query approaches on top of this implementation. The first uses deterministic queries over maximum-likelihood extractions, with optimizations to push the relational operators into the Viterbi algorithm. The second extends the Viterbi algorithm to produce a set of possible extraction "worlds", from which we compute top-k probabilistic query answers. We describe these approaches and explore the trade-offs of efficiency and effectiveness between them using two datasets.
    PVLDB. 01/2010; 3:1057-1067.
  • ABSTRACT: Thousands of scientific conferences happen every year, and each involves a laborious scientific peer review process conducted by one or more busy scientists serving as Technical/Scientific Program Committee (TPC) chair(s). The chair(s) must match submitted papers to their reviewer pool in such a way that i) each paper is reviewed by experts in its subject matter, and ii) no reviewer is overloaded with reviews or under-utilized. Towards this end, seasoned TPC chairs know the value of reviewer and paper profiling: summarizing the expertise/interests of each reviewer and the subject matter of each paper using judiciously chosen domain-specific keywords. An automated profiling algorithm is proposed for this purpose, which starts from generic/noisy reviewer profiles extracted using Google Scholar and derives custom conference-centric reviewer and paper profiles. Each reviewer is an expert on a few sub-topics, whereas the pool of reviewers and the conference may collectively need many more keywords for appropriate specificity. Exploiting this sparsity, we propose a sparse matrix factorization approach in lieu of classical SVD-based LSI or NMF-type approaches. We illustrate the merits of our approach using real conference data and expert scoring of the assignments by a seasoned TPC chair in the area.
    ICDMW 2010, The 10th IEEE International Conference on Data Mining Workshops, Sydney, Australia, 14 December 2010; 01/2010
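    A small sketch in the spirit of that sparse factorization: decomposing a reviewer x keyword matrix with an L1-penalized multiplicative update. The rank, penalty weight, and synthetic data are assumptions, not the paper's method details:

    ```python
    # Hedged sketch: NMF-style factorization A ~ W H with an L1 penalty on H,
    # so each latent topic loads on only a few keywords (sparse profiles).
    import numpy as np

    def sparse_nmf(A, rank=3, lam=0.1, iters=500, seed=0):
        rng = np.random.default_rng(seed)
        m, n = A.shape
        W = rng.random((m, rank)); H = rng.random((rank, n))
        eps = 1e-9
        for _ in range(iters):
            H *= (W.T @ A) / (W.T @ W @ H + lam + eps)   # L1 term -> sparse H
            W *= (A @ H.T) / (W @ H @ H.T + eps)
        return W, H

    rng = np.random.default_rng(1)
    A = rng.random((20, 40)) * (rng.random((20, 40)) < 0.2)  # reviewer x keyword
    W, H = sparse_nmf(A)
    print("nonzero keyword loadings per topic:", (H > 1e-3).sum(axis=1))
    ```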
  • Alin Dobra, Minos N. Garofalakis, Johannes Gehrke, Rajeev Rastogi
    ABSTRACT: Randomized techniques, based on computing small “sketch” synopses for each stream, have recently been shown to be a very effective tool for approximating the result of a single SQL query over streaming data tuples. In this paper, we investigate the problems arising when data-stream sketches are used to process multiple such queries concurrently. We demonstrate that, in the presence of multiple query expressions, intelligently sharing sketches among concurrent query evaluations can result in substantial improvements in the utilization of the available sketching space and the quality of the resulting approximation error guarantees. We provide necessary and sufficient conditions for multi-query sketch sharing that guarantee the correctness of the result-estimation process. We also investigate the difficult optimization problem of determining sketch-sharing configurations that are optimal (e.g., under a certain error metric for a given amount of space). We prove that optimal sketch sharing typically gives rise to NP-hard questions, and we propose novel heuristic algorithms for finding good sketch-sharing configurations in practice. Results from our experimental study with queries from the TPC-H benchmark verify the effectiveness of our approach, clearly demonstrating the benefits of our sketch-sharing methodology.
    Inf. Syst. 01/2009; 34:209-230.
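    To ground the discussion, a basic AGMS ("tug-of-war") sketch of the kind shared across queries here: each stream keeps dot products with random +/-1 hash vectors, and the median of per-row counter products estimates a join size (the inner product of the two frequency vectors). The hash construction is simplified and the data synthetic:

    ```python
    # Hedged sketch: AGMS join-size estimation. Real AGMS sketches use
    # 4-wise independent +/-1 hashes; Python's hash() is a simplification.
    import random

    DEPTH = 101                                     # independent estimators
    SEEDS = list(range(DEPTH))

    def xi(seed, item):                             # simplified +/-1 hash
        return 1 if hash((seed, item)) & 1 else -1

    def sketch(stream):
        return [sum(xi(s, x) for x in stream) for s in SEEDS]

    def join_size_estimate(sk1, sk2):               # median of row products
        return sorted(a * b for a, b in zip(sk1, sk2))[DEPTH // 2]

    rng1, rng2 = random.Random(0), random.Random(1)
    r = [rng1.randrange(10) for _ in range(5000)]
    s = [rng2.randrange(10) for _ in range(5000)]
    exact = sum(r.count(v) * s.count(v) for v in set(r))
    print(join_size_estimate(sketch(r), sketch(s)), "vs exact", exact)
    ```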
  • ABSTRACT: There is a growing realization that modern database management systems (DBMSs) must be able to manage data that contains uncertainties that are represented in the form of probabilistic relations. Consequently, the design of each core DBMS component must be revisited in the presence of uncertain and probabilistic information. In this paper, we study how to build histogram synopses for probabilistic relations, for the purposes of enabling both DBMS-internal decisions (such as indexing and query planning), and (possibly, user-facing) approximate query processing tools. In contrast to initial work in this area, our probabilistic histograms retain the key possible-worlds semantics of probabilistic data, allowing for more accurate, yet concise, representation of the uncertainty characteristics of data and query results. We present a variety of techniques for building optimal probabilistic histograms, each one tuned to a different choice of approximation-error metric. We show that these can be incorporated into a general Dynamic Programming (DP) framework, which generalizes that used for existing histogram constructions. The end result is a histogram where each "bucket" is approximately represented by a compact probability distribution function (PDF), which can be used as the basis for query planning and approximate query answering. We present novel, polynomial-time algorithms to find optimal probabilistic histograms for a variety of PDF-error metrics (including variation distance, sum squared error, max error and EMD1). Our experimental study shows that our probabilistic histogram synopses can accurately capture the key statistical properties of uncertain data, while being much more compact to store and work with than the original uncertain relations.
    PVLDB. 01/2009; 2:526-537.
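    For intuition, the classic scalar instance of the DP framework the paper generalizes: a V-optimal histogram minimizing sum-squared error, with each bucket represented by its mean rather than by a compact PDF:

    ```python
    # Hedged sketch: V-optimal histogram via dynamic programming. dp[b][j] is
    # the minimal error of covering the first j values with b buckets.
    def v_optimal(values, n_buckets):
        n = len(values)
        pre = [0.0]; pre2 = [0.0]
        for v in values:                     # prefix sums for O(1) bucket SSE
            pre.append(pre[-1] + v); pre2.append(pre2[-1] + v * v)
        def sse(i, j):                       # SSE of values[i:j] vs. its mean
            s, s2, m = pre[j] - pre[i], pre2[j] - pre2[i], j - i
            return s2 - s * s / m
        INF = float("inf")
        dp = [[INF] * (n + 1) for _ in range(n_buckets + 1)]
        dp[0][0] = 0.0
        for b in range(1, n_buckets + 1):
            for j in range(1, n + 1):
                dp[b][j] = min(dp[b - 1][i] + sse(i, j) for i in range(b - 1, j))
        return dp[n_buckets][n]              # minimal total error

    print(v_optimal([1, 1, 2, 9, 9, 10, 5, 5], 3))  # small error with 3 buckets
    ```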
  • Commun. ACM 01/2009; 52:87-95.
  • Minos N. Garofalakis, Johannes Gehrke, Divesh Srivastava
    ABSTRACT: Efficient management of RDF data is an important prerequisite for realizing the Semantic Web vision. Performance and scalability issues are becoming increasingly pressing as Semantic Web technology is applied to real-world applications. In this paper, ...
    The VLDB Journal 01/2009; 18:383-384. · 1.40 Impact Factor

Publication Stats

5k Citations
36.88 Total Impact Points

Institutions

  • 2009–2013
    • Technical University of Crete
      • Department of Electronic and Computer Engineering
La Canée, Crete, Greece (Chania)
  • 2008
    • University of California, Berkeley
      • Department of Electrical Engineering and Computer Sciences
      Berkeley, California, United States
    • Mission College
      Santa Clara, California, United States
  • 2007
    • Tsinghua University
Beijing, China
    • University of Geneva
Geneva, Switzerland
  • 2004–2007
    • Kent State University
      • Department of Computer Science
      Kent, Ohio, United States
  • 2006
    • University of California, Santa Cruz
      • Department of Computer Science
      Santa Cruz, California, United States
  • 2002–2005
    • Alcatel Lucent
Paris, Île-de-France, France
  • 2001
    • Research Center on Scientific and Technical Information
Algiers, Algeria
  • 2000
    • AT&T Labs
      Austin, Texas, United States
  • 1996–1999
    • University of Wisconsin, Madison
      • Department of Computer Sciences
Madison, Wisconsin, United States