H. V. Jagadish

University of Michigan, Ann Arbor, Michigan, United States

Publications (339) · 172.63 Total Impact

  • ABSTRACT: Due to the coarse granularity of data accesses and the heavy use of latches, indices in the B-tree family are not efficient for in-memory databases, especially on today's multi-core architectures. In this paper, we present PI, a Parallel in-memory skip-list-based Index that lends itself naturally to parallel and concurrent environments, particularly those with non-uniform memory access. In PI, incoming queries are collected and disjointly distributed among multiple threads for processing, avoiding the use of latches. For each query, PI traverses the index in a breadth-first-search (BFS) manner to find the list node with the matching key, exploiting SIMD processing to speed up the search. To keep query processing latch-free, PI employs a lightweight communication protocol that enables threads to redistribute the query workload among themselves such that each list node modified as a result of query processing is accessed by exactly one thread. We conducted extensive experiments, and the results show that PI can be up to three times as fast as Masstree, a state-of-the-art B-tree-based index.
    Article · Jan 2016
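The latch-free batching idea above, disjointly distributing a batch of queries so that each key range (and hence each list node) is handled by exactly one thread, can be sketched roughly as follows. This is an illustrative approximation with hypothetical names, not PI's actual implementation:

```python
import bisect
from collections import defaultdict

def partition_batch(queries, boundaries):
    """Assign each query key to the worker owning its key range.

    `boundaries` holds the sorted upper bounds of each worker's range,
    so worker i handles keys <= boundaries[i] and the last worker takes
    the rest. Because the ranges are disjoint, no two workers ever
    touch the same list node, which is what makes latch-free
    processing possible.
    """
    batches = defaultdict(list)
    for key in queries:
        worker = bisect.bisect_left(boundaries, key)
        batches[min(worker, len(boundaries))].append(key)
    return batches
```

Each worker can then process its own batch of the skip list independently, with no locking between threads.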
  • Jinyang Gao · H. V. Jagadish · Beng Chin Ooi
    ABSTRACT: Recent years have witnessed amazing outcomes from "Big Models" trained by "Big Data". Most popular algorithms for model training are iterative. Due to the surging volumes of data, we can usually afford to process only a fraction of the training data in each iteration. Typically, the data are either uniformly sampled or sequentially accessed. In this paper, we study how the data access pattern can affect model training. We propose an Active Sampler algorithm, in which training data with more "learning value" to the model are sampled more frequently. The goal is to focus training effort on valuable instances near the classification boundaries, rather than on evident cases, noisy data or outliers. We show the correctness and optimality of Active Sampler in theory, and then develop a lightweight vectorized implementation. Active Sampler is orthogonal to most approaches optimizing the efficiency of large-scale data analytics, and can be applied to most analytics models trained by the stochastic gradient descent (SGD) algorithm. Extensive experimental evaluations demonstrate that Active Sampler can speed up the training of SVMs, feature selection and deep learning by 1.6-2.2x at comparable training quality.
    Article · Dec 2015
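A minimal sketch of the sampling idea, assuming "learning value" is approximated by each instance's current loss; the paper's precise sampling distribution and vectorized implementation are not reproduced here, and all names are hypothetical:

```python
import random

def active_sample(losses, k, floor=1e-3):
    """Draw k training indices, weighting instances near the decision
    boundary (high current loss) more heavily than 'evident' ones.

    `floor` keeps every instance reachable, so easy or noisy points
    are down-weighted rather than excluded outright.
    """
    weights = [max(loss, floor) for loss in losses]
    return random.choices(range(len(losses)), weights=weights, k=k)
```

An SGD loop would call this once per iteration to pick the mini-batch, refreshing `losses` periodically as the model improves.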
  • H. V. Jagadish · Aoying Zhou

    Article · Aug 2015 · The VLDB Journal
  • ABSTRACT: When users issue a query to a database, they have expectations about the results. If what they search for is unavailable in the database, the system will return an empty result or, worse, erroneous mismatch results. We call this problem the MisMatch problem. In this paper, we solve the MisMatch problem in the context of XML keyword search. Our solution is based on two novel concepts that we introduce: target node type and distinguishability. Target node type represents the type of node a query result intends to match, and distinguishability is used to measure the importance of the query keywords. Using these concepts, we develop a low-cost post-processing algorithm on the results of query evaluation to detect the MisMatch problem and generate helpful suggestions to users. Our approach has three noteworthy features: (1) for queries with the MisMatch problem, it generates the explanation, suggested queries and their sample results as the output to users, helping users judge whether the MisMatch problem is solved without reading all query results; (2) it is portable, as it can work with any lowest-common-ancestor-based matching semantics (for XML data without ID references) or minimal-Steiner-tree-based matching semantics (for XML data with ID references) that returns tree structures as results, and it is orthogonal to the choice of result retrieval method adopted; (3) it is lightweight in that it occupies a very small proportion of the whole query evaluation time. Extensive experiments on three real datasets verify the effectiveness, efficiency and scalability of our approach. A search engine called XClear has been built and is available at http://xclear.comp.nus.edu.sg.
    Article · Aug 2015 · The VLDB Journal
  • Li Qian · Jinyang Gao · H. V. Jagadish
    ABSTRACT: Users make choices among multi-attribute objects in a data set in a variety of domains, including used car purchase, job search and hotel room booking. Individual users sometimes have strong preferences between objects, but these preferences may not be universally shared by all users. If we can cast these preferences as derived from a quantitative user-specific preference function, then we can predict user preferences by learning their preference function, even though the preference function itself is not directly observable and may be hard to express. In this paper we study the problem of preference learning with pairwise comparisons on a set of entities with multiple attributes. We formalize the problem into two subproblems, namely preference estimation and comparison selection. We propose an innovative approach to estimate the preference, and introduce a binary search strategy to adaptively select the comparisons. We introduce the concept of an orthogonal query to support this adaptive selection, as well as a novel S-tree index to enable efficient evaluation of orthogonal queries. We integrate these components into a system for inferring user preference with adaptive pairwise comparisons. Our experiments and user study demonstrate that our adaptive system significantly outperforms the naïve random selection system on both real data and synthetic data, with either simulated or real user feedback. We also show that our preference learning approach is much more effective than existing approaches, and that our S-tree can be constructed efficiently and perform orthogonal queries at interactive speeds.
    Article · Jul 2015 · Proceedings of the VLDB Endowment
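The preference-estimation subproblem can be illustrated with a simple perceptron-style stand-in for learning a linear preference function from pairwise choices. This is a hypothetical sketch, not the paper's actual estimator, and it omits the comparison-selection and S-tree machinery entirely:

```python
def learn_preference(comparisons, dim, epochs=50, lr=0.1):
    """Fit a linear preference weight vector w from pairwise choices.

    Each comparison is (winner, loser), both attribute vectors; we
    nudge w whenever it fails to score the winner above the loser,
    so that eventually w . winner > w . loser for each pair.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for win, lose in comparisons:
            diff = [a - b for a, b in zip(win, lose)]
            if sum(wi * di for wi, di in zip(w, diff)) <= 0:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w
```

With attributes (price, rating), a user who always picks the cheaper, better-rated option should yield a negative price weight and a positive rating weight.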
  • H.V. Jagadish
    ABSTRACT: As Big Data inexorably draws attention from every segment of society, it has also suffered from many characterizations that are incorrect. This article explores a few of the more common myths about Big Data and exposes the underlying truths.
    Article · Feb 2015
  • ABSTRACT: Every few years a group of database researchers meets to discuss the state of database research, its impact on practice, and important new directions. This report summarizes the discussion and conclusions of the eighth such meeting, held October 14-15, 2013 in Irvine, California. It observes that Big Data has now become a defining challenge of our time, and that the database research community is uniquely positioned to address it, with enormous opportunities to make transformative impact. To do so, the report recommends significantly more attention to five research areas: scalable big/fast data infrastructures; coping with diversity in the data management landscape; end-to-end processing and understanding of data; cloud services; and managing the diverse roles of people in the data life cycle.
    Article · Dec 2014 · ACM SIGMOD Record
  • ABSTRACT: Distributed graph processing systems increasingly require many compute nodes to cope with the requirements imposed by contemporary graph-based Big Data applications. However, increasing the number of compute nodes increases the chance of node failures. Therefore, provisioning an efficient failure recovery strategy is critical for distributed graph processing systems. This paper proposes a novel recovery mechanism for distributed graph processing systems that parallelizes the recovery process. The key idea is to partition the part of the graph that is lost during a failure among a subset of the remaining nodes. To do so, we augment the existing checkpoint-based and log-based recovery schemes with a partitioning mechanism that is sensitive to the total computation and communication cost of the recovery process. Our implementation on top of the widely used Giraph system outperforms checkpoint-based recovery by up to 30x on a cluster of 40 compute nodes.
    Article · Dec 2014 · Proceedings of the VLDB Endowment
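The core idea of spreading the lost portion of the graph across surviving nodes can be sketched as a greedy cost-balancing assignment. This is a simplification under stated assumptions: the paper's partitioner also weighs communication cost, which is omitted here, and all names are hypothetical:

```python
import heapq

def assign_lost_partitions(lost, survivors):
    """Greedily spread the lost subgraph across surviving workers so
    that recovery runs in parallel.

    `lost` maps partition-id -> recomputation cost; each piece goes,
    largest first, to the currently least-loaded survivor.
    """
    heap = [(0.0, s) for s in survivors]
    heapq.heapify(heap)
    assignment = {}
    for pid, cost in sorted(lost.items(), key=lambda kv: -kv[1]):
        load, node = heapq.heappop(heap)
        assignment[pid] = node
        heapq.heappush(heap, (load + cost, node))
    return assignment
```

Because every survivor recomputes only its assigned slice from the last checkpoint, recovery time shrinks roughly with the number of participating nodes.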
  • Manish Singh · Arnab Nandi · H. V. Jagadish

    Dataset · Sep 2014
  • Fei Li · H. V. Jagadish
    ABSTRACT: Natural language has been the holy grail of query interface designers, but has generally been considered too hard to work with, except in limited specific circumstances. In this paper, we describe the architecture of an interactive natural language query interface for relational databases. Through a carefully limited interaction with the user, we are able to correctly interpret complex natural language queries, in a generic manner across a range of domains. By these means, a logically complex English language sentence is correctly translated into a SQL query, which may include aggregation, nesting, and various types of joins, among other things, and can be evaluated against an RDBMS. We have constructed a system, NaLIR (Natural Language Interface for Relational databases), embodying these ideas. Our experimental assessment, through user studies, demonstrates that NaLIR is good enough to be usable in practice: even naive users are able to specify quite complex ad-hoc queries.
    Article · Sep 2014 · Proceedings of the VLDB Endowment
  • ABSTRACT: Companies are increasingly moving their data processing to the cloud, for reasons of cost, scalability, and convenience, among others. However, hosting multiple applications and storage systems on the same cloud introduces resource-sharing and heterogeneous data processing challenges, due to the variety of resource usage patterns employed, the variety of data types stored, and the variety of query interfaces presented by those systems. Furthermore, real clouds are never perfectly symmetric: there are often differences between individual processors in their capabilities and connectivity. In this paper, we introduce a federation framework to manage such heterogeneous clouds. We then use this framework to discuss several challenges and their potential solutions.
    Article · Jul 2014 · IEEE Transactions on Knowledge and Data Engineering
  • ABSTRACT: Exploring the inherent technical challenges in realizing the potential of Big Data.
    Article · Jul 2014 · Communications of the ACM
  • Fei Li · Tianyin Pan · Hosagrahar V. Jagadish
    ABSTRACT: Querying data in relational databases is often challenging, since SQL requires its users to know the exact schema of the database, the roles of various entities in a query, and the precise join paths to be followed. On the other hand, keyword search is unable to express much of the desired query semantics. In this paper, we propose a query language, Schema-free SQL, which enables its users to query a relational database using whatever partial schema they know. If they know the full schema, they can write full SQL. But, to the extent they do not know the schema, Schema-free SQL is tolerant of unknown or inaccurately specified relation names and attribute names, and it also does not require information regarding which relations are involved and how they are joined. We present techniques to evaluate Schema-free SQL by first converting it to full SQL. We show experimentally that a small amount of schema information, which one can reasonably expect most users to have, is enough to get queries evaluated as if they had been completely and correctly specified.
    Article · Jun 2014
  • ABSTRACT: Schema matching is a central challenge for data integration systems. Due to the inherent uncertainty arising from the inability of schemas to fully capture the semantics of the represented data, automatic tools are often uncertain about suggested matching results. However, humans are good at understanding data represented in various forms, and crowdsourcing platforms are making the human annotation process more affordable. Thus, in this demo, we show how to utilize the crowd to find the right matching. To do so, we need to make the tasks posted on the crowdsourcing platforms simple enough to be performed by non-experts, and to minimize the number of tasks to save cost. We demonstrate CrowdMatcher, a hybrid machine-crowd system for schema matching. The machine-generated matchings are verified by correspondence correctness questions (CCQs), each of which asks the crowd to determine whether a given correspondence is correct or not. CrowdMatcher includes several original features: it integrates different matchings generated from classical schema matching tools; to minimize the cost of crowdsourcing, it automatically selects the most informative set of CCQs from the possible matchings; it is able to manage inaccurate answers provided by the workers; and the crowdsourced answers are used to improve matching results.
    Article · Jun 2014
  • Fei Li · Hosagrahar V Jagadish
    ABSTRACT: In this demo, we present NaLIR, a generic interactive natural language interface for querying relational databases. NaLIR can accept a logically complex English language sentence as query input. This query is first translated into a SQL query, which may include aggregation, nesting, and various types of joins, among other things, and then evaluated against an RDBMS. In this demonstration, we show that NaLIR, while far from being able to pass the Turing test, is perfectly usable in practice, and able to handle even quite complex queries in a variety of application domains. In addition, we also demonstrate how carefully designed interactive communication can avoid misinterpretation with minimal user burden.
    Article · Jun 2014
  • Yong Zeng · Zhifeng Bao · Tok Wang Ling · H.V. Jagadish · Guoliang Li
    ABSTRACT: When users issue a query to a database, they have expectations about the results. If what they search for is unavailable in the database, the system will return an empty result or, worse, erroneous mismatch results. We call this problem the MisMatch problem. In this paper, we solve the MisMatch problem in the context of XML keyword search. Our solution is based on two novel concepts that we introduce: target node type and distinguishability. Using these concepts, we develop a low-cost post-processing algorithm on the results of query evaluation to detect the MisMatch problem and generate helpful suggestions to users. Our approach has three noteworthy features: (1) for queries with the MisMatch problem, it generates the explanation, suggested queries and their sample results as the output to users, helping users judge whether the MisMatch problem is solved without reading all query results; (2) it is portable, as it can work with any LCA-based matching semantics and is orthogonal to the choice of result retrieval method adopted; (3) it is lightweight in that it occupies a very small proportion of the whole query evaluation time. Extensive experiments on three real datasets verify the effectiveness, efficiency and scalability of our approach. A search engine called XClear has been built and is available at http://xclear.comp.nus.edu.sg.
    Conference Paper · Mar 2014
  • Conference Paper: CrowdMatcher

    Conference Paper · Jan 2014
  • Caleb Chen Cao · Yongxin Tong · Lei Chen · H. V. Jagadish
    ABSTRACT: The benefits of crowdsourcing are well recognized today for an increasingly broad range of problems. Meanwhile, the rapid development of social media makes it possible to seek the wisdom of a crowd of targeted users. However, it is not trivial to implement a crowdsourcing platform on social media; specifically, to enlist social media users as workers, we need to address the following two challenges: 1) how to motivate users to participate in tasks, and 2) how to choose users for a task. In this paper, we present Wise Market as an effective framework for crowdsourcing on social media that motivates users to participate in a task with care and correctly aggregates their opinions on pairwise choice problems. The Wise Market consists of a set of investors, each with an associated individual confidence in his/her prediction; after the investment, only the ones whose choices agree with the whole market are granted rewards. Therefore, a social media user has to give his/her "best" answer in order to get rewards, and careless answers from sloppy users are discouraged. Under the Wise Market framework, we define an optimization problem to minimize the expected cost of paying out rewards while guaranteeing a minimum confidence level, called the Effective Market Problem (EMP). We propose exact algorithms for calculating the market confidence and the expected cost with O(n log^2 n) time cost in a Wise Market with n investors. To deal with the enormous number of users on social media, we design a Central Limit Theorem-based approximation algorithm to compute the market confidence with O(n) time cost, as well as a bounded approximation algorithm to calculate the expected cost with O(n) time cost. Finally, we have conducted extensive experiments to validate the effectiveness of the proposed algorithms on real and synthetic data.
    Conference Paper · Aug 2013
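The market-confidence quantity above, the probability that the majority of independent investors is correct, can be illustrated with an exact Poisson-binomial calculation. This O(n^2) sketch favors clarity; the paper's O(n log^2 n) algorithms and O(n) approximations are not reproduced here:

```python
def market_confidence(confidences):
    """Probability that a majority vote of independent investors is
    correct, investor i being right with probability confidences[i].

    dist[k] tracks P(exactly k investors correct so far); each new
    investor splits the mass between 'wrong' (stay at k) and
    'right' (move to k+1).
    """
    dist = [1.0]
    for p in confidences:
        new = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            new[k] += prob * (1 - p)
            new[k + 1] += prob * p
        dist = new
    majority = len(confidences) // 2 + 1
    return sum(dist[majority:])
```

For three investors who are each right with probability 0.6, the majority is right with probability 3(0.36)(0.4) + 0.216 = 0.648, so pooling modestly reliable investors already beats any one of them.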
  • Chen Jason Zhang · Lei Chen · H. V. Jagadish · Chen Caleb Cao
    ABSTRACT: Schema matching is a central challenge for data integration systems. Automated tools are often uncertain about the schema matchings they suggest, and this uncertainty is inherent, since it arises from the inability of the schema to fully capture the semantics of the represented data. Human common sense can often help. Inspired by the popularity and the success of easily accessible crowdsourcing platforms, we explore the use of crowdsourcing to reduce the uncertainty of schema matching. Since it is typical to ask simple questions on crowdsourcing platforms, we assume that each question, namely a Correspondence Correctness Question (CCQ), asks the crowd to decide whether a given correspondence should exist in the correct matching. We propose frameworks and efficient algorithms to dynamically manage the CCQs, in order to maximize the uncertainty reduction within a limited budget of questions. We develop two novel approaches, namely "Single CCQ" and "Multiple CCQ", which adaptively select, publish and manage the questions. We verified the value of our solutions with simulations and a real implementation.
    Article · Jul 2013 · Proceedings of the VLDB Endowment
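A minimal sketch of adaptive CCQ selection, using entropy as a proxy for the expected uncertainty reduction of asking about a correspondence. Names are hypothetical, and the paper's Single CCQ and Multiple CCQ strategies are considerably more elaborate than this:

```python
import math

def next_ccq(match_probs):
    """Pick the next Correspondence Correctness Question: ask about
    the correspondence whose current matching probability carries the
    most entropy (closest to 0.5), i.e. where a crowd answer removes
    the most uncertainty.
    """
    def entropy(p):
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return max(match_probs, key=lambda c: entropy(match_probs[c]))
```

After each crowd answer, the probabilities would be updated (e.g. by a Bayesian rule that accounts for worker error) and the selection repeated until the question budget is spent.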
  • Jing Zhang · H.V. Jagadish
    ABSTRACT: Many text documents today are collaboratively edited, often with multiple small changes. The problem we consider in this paper is how to find provenance for a specific part of interest in the document. A full revision history, represented as a version tree, can tell us about all updates made to the document, but most of these updates may apply to other parts of the document, and hence not be relevant to answer the provenance question at hand. In this paper, we propose the notion of a revision unit as a flexible unit to capture the necessary provenance. We demonstrate through experiments the capability of revision units to keep only relevant updates in the provenance representation, and their flexibility in adjusting to updates reflected in the version tree.
    Conference Paper · Jan 2013

Publication Stats

15k Citations
172.63 Total Impact Points

Institutions

  • 1999-2015
    • University of Michigan
      • • Division of Computer Science and Engineering
      • • College of Engineering
      • • Museum of Zoology
      • • Department of Electrical Engineering and Computer Science (EECS)
      Ann Arbor, Michigan, United States
  • 2000-2014
    • Concordia University–Ann Arbor
      Ann Arbor, Michigan, United States
  • 2007
    • Albany State University
      Georgia, United States
  • 1988-2007
    • AT&T Labs
      • Research
      Austin, Texas, United States
  • 2006
    • North Carolina State University
      Raleigh, North Carolina, United States
  • 1999-2000
    • University of Illinois, Urbana-Champaign
      • Department of Computer Science
      Urbana, Illinois, United States
  • 1998-2000
    • University of Maryland, College Park
      • Department of Computer Science
      College Park, Maryland, United States
  • 1997-1999
    • AT&T
      Dallas, Texas, United States
  • 1992
    • University of Wisconsin–Madison
      • Department of Computer Sciences
      Madison, Wisconsin, United States
    • University of Florida
      Gainesville, Florida, United States
  • 1984-1986
    • Stanford University
      • Information Systems Laboratory
      Stanford, California, United States