Hazem Elmeleegy

AT&T Labs, Austin, Texas, United States

Publications (13) · 1.4 Total impact

  • Source
    Hazem Elmeleegy, Jayant Madhavan, Alon Y. Halevy
    VLDB J. 01/2011; 20:209-226.
  • Source
    Hazem Elmeleegy, Ahmed K. Elmagarmid, Jaewoo Lee
    ABSTRACT: In this paper, we introduce U-MAP, a new system for schema mapping generation. U-MAP builds upon and extends existing schema mapping techniques. However, it mitigates some key problems in this area, which have not been previously addressed. The key tenet of U-MAP is to exploit the usage information extracted from the query logs associated with the schemas being mapped. We describe our experience in applying our proposed system to realistic datasets from the retail and life sciences domains. Our results demonstrate the effectiveness and efficiency of U-MAP compared to traditional approaches.
    Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12-16, 2011; 01/2011
  • Source
    ABSTRACT: This demo shows how usage information buried in query logs can play a central role in data integration and data exchange. More specifically, our system U-Map uses query logs to generate correspondences between the attributes of two different schemas and the complex mapping rules to transform and restructure data records from one of these schemas to another. We introduce several novel features showing the benefit of incorporating query log analysis into these key components of data integration and data exchange systems.
    Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12-16, 2011; 01/2011
  • Source
    PVLDB. 09/2010; 3:439-448.
  • ABSTRACT: In this paper, we present a new record linkage approach that uses entity behavior to decide if potentially different entities are in fact the same. An entity's behavior is extracted from a transaction log that records the actions of this entity with respect to a given data source. The core of our approach is a technique that merges the behavior of two possibly matched entities and computes the gain in recognizing behavior patterns as their matching score. The idea is that if we obtain a well-recognized behavior after the merge, then most likely the original two behaviors belong to the same entity, as the behavior becomes more complete after the merge. We present the necessary algorithms to model entities' behavior and compute a matching score for them. To improve the computational efficiency of our approach, we precede the actual matching phase with a fast candidate generation that uses a "quick and dirty" matching method. Extensive experiments on real data show that our approach can significantly enhance record linkage quality while being practical for large transaction logs.
    Cyber Center Publications. 01/2010;
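The merge-and-score idea above can be illustrated with a toy proxy: treat "recognizing behavior patterns" as compressibility of the time-ordered action sequence, and score a candidate match by how much better the merged log compresses than the two logs kept apart. This is a stand-in for the paper's actual behavior model; the (timestamp, action) event format and the zlib-based cost are illustrative assumptions.

```python
import zlib

def behavior_cost(events):
    """Proxy for pattern recognizability: compressed size of the
    time-ordered action string (smaller = more regular behavior)."""
    actions = ",".join(act for _, act in sorted(events))
    return len(zlib.compress(actions.encode()))

def merge_gain(events_a, events_b):
    """Matching score: how much more pattern-like the interleaved log
    is than the two logs separately. Higher = more likely the same
    entity. Events are (timestamp, action) pairs -- an assumed format."""
    separate = behavior_cost(events_a) + behavior_cost(events_b)
    return separate - behavior_cost(events_a + events_b)
```

The fast candidate-generation phase the abstract mentions would cheaply prune pairs first, so this score only has to be computed for the survivors.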
  • ABSTRACT: Peer-to-peer data integration - a.k.a. Peer Data Management Systems (PDMSs) - promises to extend the classical data integration approach to the Internet scale. Unfortunately, some challenges remain before realizing this promise. One of the biggest challenges is preserving the privacy of the exchanged data while passing through several intermediate peers. Another challenge is protecting the mappings used for data translation. Protecting the privacy without being unfair to any of the peers is yet a third challenge. This paper presents a novel query answering protocol in PDMSs to address these challenges. The protocol employs a technique based on noise selection and insertion to protect the query results, and a commutative encryption-based technique to protect the mappings and ensure fairness among peers. An extensive security analysis of the protocol shows that it is resilient to several possible types of attacks. We implemented the protocol within an established PDMS: the Hyperion system. We conducted an experimental study using real data from the healthcare domain. The results show that our protocol manages to achieve its privacy and fairness goals, while maintaining query processing time at the interactive level.
    Cyber Center Publications. 01/2010;
  • Source
    ABSTRACT: The increasing popularity of social networks, such as Facebook and Orkut, has raised several privacy concerns. Traditional ways of safeguarding privacy of personal information by hiding sensitive attributes are no longer adequate. Research shows that probabilistic classification techniques can effectively infer such private information. The disclosed sensitive information of friends, group affiliations and even participation in activities, such as tagging and commenting, are considered background knowledge in this process. In this paper, we present a privacy protection tool, called Privometer, that measures the amount of sensitive information leakage in a user profile and suggests self-sanitization actions to regulate the amount of leakage. In contrast to previous research, where inference techniques use publicly available profile information, we consider an augmented model where a potentially malicious application installed in the user's friend profiles can access substantially more information. In our model, merely hiding the sensitive information is not sufficient to protect the user privacy. We present an implementation of Privometer in Facebook.
    Workshops Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long Beach, California, USA; 01/2010
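The leakage measurement can be sketched with a naive-Bayes homophily model: assume each friend's disclosed value matches the user's hidden attribute with some probability p, compute the attacker's posterior, and suggest hiding whichever single disclosure lowers the attacker's best guess the most. The model, the value of p, and the single-hide suggestion rule are all illustrative assumptions, not Privometer's actual estimator.

```python
def posterior(friend_values, values, prior, p=0.7):
    """Attacker's posterior over the user's hidden attribute, assuming
    each friend's disclosed value equals the user's with probability p
    (homophily) and is uniform over the rest otherwise."""
    k = len(values)
    unnorm = {}
    for v in values:
        lik = prior[v]
        for f in friend_values:
            lik *= p if f == v else (1 - p) / (k - 1)
        unnorm[v] = lik
    z = sum(unnorm.values())
    return {v: x / z for v, x in unnorm.items()}

def leakage(friend_values, values, prior):
    """Leakage = the attacker's best-guess confidence."""
    return max(posterior(friend_values, values, prior).values())

def suggest_hide(friend_values, values, prior):
    """Self-sanitization hint: index of the single friend disclosure
    whose removal reduces leakage the most."""
    def without(i):
        return friend_values[:i] + friend_values[i + 1:]
    return min(range(len(friend_values)),
               key=lambda i: leakage(without(i), values, prior))
```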
  • Source
    ABSTRACT: Continuous "always-on" monitoring is beneficial for a number of applications, but potentially imposes a high load in terms of communication, storage and power consumption when a large number of variables need to be monitored. We introduce two new filtering techniques, swing filters and slide filters, that represent within a prescribed precision a time-varying numerical signal by a piecewise linear function, consisting of connected line segments for swing filters and (mostly) disconnected line segments for slide filters. We demonstrate the effectiveness of swing and slide filters in terms of their compression power by applying them to a real-life data set plus a variety of synthetic data sets. For nearly all combinations of signal behavior and precision requirements, the proposed techniques outperform the earlier approaches for online filtering in terms of data reduction. The slide filter, in particular, consistently dominates all other filters, with up to twofold improvement over the best of the previous techniques.
    08/2009;
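The swing-filter half of the idea fits in a few lines: keep a cone of feasible slopes from the current segment anchor, narrow it with every arriving point, and emit a connected segment the moment the cone collapses. A minimal sketch only (the slide filter's disconnected segments and the paper's exact emission rule are omitted):

```python
def swing_filter(points, eps):
    """Compress (t, v) samples into connected line segments that stay
    within +/-eps of every input point. Returns ((t0,v0),(t1,v1)) pairs."""
    it = iter(points)
    t0, v0 = next(it)
    lo, hi = float("-inf"), float("inf")  # feasible slope cone
    last = (t0, v0)
    segments = []

    def emit(t_end):
        slope = 0.0 if hi == float("inf") else (lo + hi) / 2
        v_end = v0 + slope * (t_end - t0)
        segments.append(((t0, v0), (t_end, v_end)))
        return t_end, v_end

    for t, v in it:
        dt = t - t0
        nlo, nhi = (v - eps - v0) / dt, (v + eps - v0) / dt
        if max(lo, nlo) > min(hi, nhi):
            # cone collapsed: close the segment at the previous point
            # and re-anchor there, keeping the segments connected
            t0, v0 = emit(last[0])
            dt = t - t0
            lo, hi = (v - eps - v0) / dt, (v + eps - v0) / dt
        else:
            lo, hi = max(lo, nlo), min(hi, nhi)
        last = (t, v)
    emit(last[0])
    return segments
```

A perfectly linear signal collapses to a single segment regardless of length, which is where the compression power comes from.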
  • Source
    Hazem Elmeleegy, Jayant Madhavan, Alon Y. Halevy
    ABSTRACT: A large number of web pages contain data structured in the form of "lists". Many such lists can be further split into multi-column tables, which can then be used in more semantically meaningful tasks. However, harvesting relational tables from such lists can be a challenging task. The lists are manually generated and hence need not have well-defined templates - they have inconsistent delimiters (if any) and often have missing information. We propose a novel technique for extracting tables from lists. The technique is domain-independent and operates in a fully unsupervised manner. We first use multiple sources of information to split individual lines into multiple fields, and then compare the splits across multiple lines to identify and fix incorrect splits and bad alignments. In particular, we exploit a corpus of HTML tables, also extracted from the Web, to identify likely fields and good alignments. For each extracted table, we compute an extraction score that reflects our confidence in the table's quality. We conducted an extensive experimental study using both real web lists and lists derived from tables on the Web. The experiments demonstrate the ability of our technique to extract tables with high accuracy. In addition, we applied our technique to a large sample of about 100,000 lists crawled from the Web. The analysis of the extracted tables has led us to believe that there are likely to be tens of millions of useful and queryable relational tables extractable from lists on the Web.
    PVLDB. 08/2009; 2:1078-1089.
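A drastically simplified version of the splitting step: try a few candidate delimiters, keep the one whose splits agree on a column count across the most lines, and pad ragged rows. The real system instead scores candidate splits against a corpus of Web tables and realigns fields; the fixed delimiter list here is an assumption.

```python
import re
from collections import Counter

def lists_to_table(lines, delimiters=(",", ";", "\t", "  ")):
    """Pick the delimiter whose splits give the most consistent column
    count across lines, then pad short rows to a rectangular table."""
    def split(line, d):
        return [f.strip() for f in re.split(re.escape(d) + "+", line)
                if f.strip()]

    def consistency(d):
        counts = Counter(len(split(l, d)) for l in lines)
        cols, freq = counts.most_common(1)[0]
        return (freq / len(lines), cols)  # prefer agreement, then width

    best = max(delimiters, key=consistency)
    rows = [split(l, best) for l in lines]
    width = max(len(r) for r in rows)
    return [r + [""] * (width - len(r)) for r in rows]
```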
  • Source
    ABSTRACT: Peer Data Management Systems (PDMSs) promise to extend the classical data integration approach to the Internet scale. Unfortunately, some challenges remain before realizing this promise. One of the biggest challenges is preserving the privacy of the exchanged data while passing through several intermediate peers. Another challenge is protecting the mappings used for data translation. Achieving privacy preservation without being unfair to any of the peers is yet a third challenge. This paper presents a novel query answering protocol in PDMSs to address these challenges. The protocol employs a technique based on noise selection and insertion to protect the query results, and a commutative encryption-based technique to protect the mappings and ensure fairness among peers. An extensive security analysis of the protocol shows that it is resilient to seven possible types of attacks, assuming a malicious model. We implemented the protocol within an established PDMS: the Hyperion system. We conducted an experimental study using real data from the healthcare domain. The results show that our protocol introduces a moderate communication overhead compared to its non-privacy preserving counterpart and manages to achieve fairness among the peers.
    11/2008;
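The fairness mechanism rests on commutative encryption: two peers can layer their encryptions in either order and strip them in either order, so neither has to reveal its mapping first. A toy SRA/Pohlig-Hellman-style sketch (tiny fixed prime, no padding - nothing here is deployment-grade, and the protocol's noise-insertion side is omitted entirely):

```python
# E_k(m) = m^k mod P commutes: E_a(E_b(m)) == E_b(E_a(m)).
P = 2**61 - 1  # a Mersenne prime; far too small for real security

def enc(m, key):
    return pow(m, key, P)

def dec(c, key):
    # invert the exponent modulo P-1 (key must be coprime to P-1)
    return pow(c, pow(key, -1, P - 1), P)
```

In the protocol setting, one peer encrypts its mapping entries under its key, the other re-encrypts under a second key, and equality tests on the doubly encrypted values give the same answer regardless of which peer went first.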
  • Source
    H. Elmeleegy, A. Ivan, R. Akkiraju, R. Goodwin
    ABSTRACT: Mashup editors, like Yahoo Pipes and IBM Lotus Mashup Maker, allow non-programmer end-users to "mash up" information sources and services to meet their information needs. However, with the increasing number of services, information sources, and complex operations like filtering and joining, even an easy-to-use editor is not sufficient. MashupAdvisor aims to assist mashup creators to build higher-quality mashups in less time. Based on the current state of a mashup, MashupAdvisor quietly suggests outputs (goals) that the user might want to include in the final mashup. MashupAdvisor exploits a repository of mashups to estimate the popularity of specific outputs, and makes suggestions using the conditional probability that an output will be included, given the current state of the mashup. When a suggestion is accepted, MashupAdvisor uses a semantic matching algorithm and a metric planner to modify the mashup to produce the suggested output. Our prototype was implemented on top of IBM Lotus MashupMaker and our initial results show that it is effective.
    Web Services, 2008. ICWS '08. IEEE International Conference on; 10/2008
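The suggestion ranking can be approximated straight from the abstract's description: estimate P(output | current components) by counting, over a repository of past mashups, how often an output appears in mashups that contain everything the user has placed so far. Representing each mashup as a set of component names is an assumption; the real system also performs semantic matching and planning to actually wire the accepted suggestion in.

```python
from collections import Counter

def suggest_outputs(repo, current, top_k=3):
    """Rank candidate additions by P(output | current), estimated from
    past mashups. `repo` is a list of component-name sets; `current`
    is the set of components already in the user's mashup."""
    supersets = [m for m in repo if current <= m]
    if not supersets:
        return []
    counts = Counter(x for m in supersets for x in m - current)
    n = len(supersets)
    return [(out, c / n) for out, c in counts.most_common(top_k)]
```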
  • Source
    H. Elmeleegy, M. Ouzzani, A. Elmagarmid
    ABSTRACT: Existing techniques for schema matching are classified as either schema-based, instance-based, or a combination of both. In this paper, we define a new class of techniques, called usage-based schema matching. The idea is to exploit information extracted from the query logs to find correspondences between attributes in the schemas to be matched. We propose methods to identify co-occurrence patterns between attributes in addition to other features such as their use in joins and with aggregate functions. Several scoring functions are considered to measure the similarity of the extracted features, and a genetic algorithm is employed to find the highest-score mappings between the two schemas. Our technique is suitable for matching schemas even when their attribute names are opaque. It can further be combined with existing techniques to obtain more accurate results. Our experimental study demonstrates the effectiveness of the proposed approach and the benefit of combining it with other existing approaches.
    Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on; 05/2008
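The core signal is easy to demonstrate: build, from each query log, a profile of how often attributes co-occur in the same query, then search for the attribute mapping under which the two co-occurrence structures line up best. Exhaustive permutation search below stands in for the paper's genetic algorithm (fine for toy schemas, exponential in general), and the count-overlap score is likewise a simplification of its scoring functions.

```python
from collections import Counter
from itertools import permutations

def cooccurrence(log):
    """attr -> Counter of co-occurring attrs; `log` is a list of
    attribute sets, one per query."""
    prof = {}
    for attrs in log:
        for a in attrs:
            prof.setdefault(a, Counter()).update(b for b in attrs if b != a)
    return prof

def score(p1, p2, mapping):
    """Overlap of co-occurrence counts once schema-1 attributes are
    renamed into schema 2 through `mapping`."""
    total = 0
    for a1, a2 in mapping.items():
        for b1, b2 in mapping.items():
            if a1 != b1:
                total += min(p1.get(a1, Counter())[b1],
                             p2.get(a2, Counter())[b2])
    return total

def best_mapping(log1, log2, attrs1, attrs2):
    p1, p2 = cooccurrence(log1), cooccurrence(log2)
    return max((dict(zip(attrs1, perm)) for perm in permutations(attrs2)),
               key=lambda m: score(p1, p2, m))
```

Because the score never looks at attribute names, the match still works when the names are opaque, which is the abstract's selling point.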
  • Source
    ABSTRACT: Continuous monitoring of distributed systems is part of the necessary system infrastructure for a number of applications, including detection of various anomalies such as failures or performance degradation. The LeWYS project aims at building a monitoring infrastructure to be used by system observers that will implement system-wide strategies to offer a globally coherent autonomic behavior. In this paper, we focus on J2EE application servers as the distributed system to observe, and we describe the implementation of efficient probes in Java that reify the state of various J2EE cluster components with controllable intrusiveness. We give an overview of the overall architecture of the LeWYS framework and we present the design and implementation of hardware and operating system probes for Linux and Windows, as well as a JMX probe for J2EE-specific components. We evaluate our probes with micro-benchmarks and show that it is possible to implement probes in Java with low and controllable intrusiveness.
    Stud. Inform. Univ. 01/2005; 4:31-40.