ABSTRACT: As Big Data inexorably draws attention from every segment of society, it has also suffered from many incorrect characterizations. This article explores a few of the more common myths about Big Data and exposes the underlying truths.
ABSTRACT: Distributed graph processing systems increasingly require many compute nodes to cope with the requirements imposed by contemporary graph-based Big Data applications. However, increasing the number of compute nodes increases the chance of node failures. Provisioning an efficient failure recovery strategy is therefore critical for distributed graph processing systems. This paper proposes a novel recovery mechanism for distributed graph processing systems that parallelizes the recovery process. The key idea is to partition the part of the graph that is lost during a failure among a subset of the remaining nodes. To do so, we augment the existing checkpoint-based and log-based recovery schemes with a partitioning mechanism that is sensitive to the total computation and communication cost of the recovery process. Our implementation on top of the widely used Giraph system outperforms checkpoint-based recovery by up to 30x on a cluster of 40 compute nodes.
Proceedings of the VLDB Endowment 12/2014; 8(4):437-448. DOI:10.14778/2735496.2735506
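The recovery idea described above, reassigning a failed node's lost portion of the graph across the survivors while weighing both computation and communication cost, can be illustrated with a minimal greedy sketch. The data layout and the simple additive cost model are assumptions for illustration, not Giraph's or the paper's actual mechanism.

```python
def assign_lost_vertices(lost, survivors, load, edges):
    """Greedy reassignment of lost vertices to surviving nodes.

    lost:      list of lost vertex ids
    survivors: list of surviving node ids
    load:      dict node -> current vertex count (computation-cost proxy)
    edges:     dict vertex -> set of (neighbor, host_node) pairs, recording
               where each neighbor currently lives (communication-cost proxy)
    """
    placement = {}
    load = dict(load)  # copy, so the caller's counts are untouched
    for v in lost:
        def cost(n):
            # Communication cost: neighbors of v that would live elsewhere.
            comm = sum(1 for _, host in edges.get(v, ()) if host != n)
            # Additive combination of load and communication (an assumption).
            return load[n] + comm
        best = min(survivors, key=cost)
        placement[v] = best
        load[best] += 1
    return placement
```

Each lost vertex is placed on the surviving node that currently minimizes the combined cost, with loads updated as vertices are assigned.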
ABSTRACT: Companies are increasingly moving their data processing to the cloud, for reasons of cost, scalability, and convenience, among others. However, hosting multiple applications and storage systems on the same cloud introduces resource sharing and heterogeneous data processing challenges due to the variety of resource usage patterns employed, the variety of data types stored, and the variety of query interfaces presented by those systems. Furthermore, real clouds are never perfectly symmetric - there often are differences between individual processors in their capabilities and connectivity. In this paper, we introduce a federation framework to manage such heterogeneous clouds. We then use this framework to discuss several challenges and their potential solutions.
IEEE Transactions on Knowledge and Data Engineering 07/2014; 26(7):1670-1678. DOI:10.1109/TKDE.2014.2326659
ABSTRACT: Schema matching is a central challenge for data integration systems. Because schemas cannot fully capture the semantics of the data they represent, automatic tools are often uncertain about the matchings they suggest. Humans, however, are good at understanding data presented in various forms, and crowdsourcing platforms are making the human annotation process increasingly affordable. In this demo, we show how to utilize the crowd to find the right matching. To do so, the tasks posted on crowdsourcing platforms must be simple enough for non-experts to perform, and the number of tasks must be kept as small as possible to contain cost. We demonstrate CrowdMatcher, a hybrid machine-crowd system for schema matching. Machine-generated matchings are verified by correspondence correctness questions (CCQs), each of which asks the crowd whether a given correspondence is correct. CrowdMatcher includes several original features: it integrates matchings generated by classical schema matching tools; to minimize crowdsourcing cost, it automatically selects the most informative set of CCQs from the possible matchings; it can manage inaccurate answers provided by the workers; and the crowdsourced answers are used to improve the matching results.
ABSTRACT: The benefits of crowdsourcing are well recognized today for an increasingly broad range of problems. Meanwhile, the rapid development of social media makes it possible to seek the wisdom of a crowd of targeted users. However, implementing a crowdsourcing platform on social media is not trivial; specifically, to recruit social media users as workers, we need to address two challenges: 1) how to motivate users to participate in tasks, and 2) how to choose users for a task. In this paper, we present Wise Market as an effective framework for crowdsourcing on social media that motivates users to participate in a task with care and correctly aggregates their opinions on pairwise choice problems. A Wise Market consists of a set of investors, each with an associated individual confidence in his/her prediction; after the investment, only the investors whose choices agree with the whole market are granted rewards. A social media user therefore has to give his/her "best" answer in order to be rewarded, and careless answers from sloppy users are discouraged. Under the Wise Market framework, we define an optimization problem, the Effective Market Problem (EMP), which minimizes the expected cost of paying out rewards while guaranteeing a minimum confidence level. We propose exact algorithms for calculating the market confidence and the expected cost in O(n log² n) time for a Wise Market with n investors. To deal with the enormous number of users on social media, we design a Central Limit Theorem-based approximation algorithm that computes the market confidence in O(n) time, as well as a bounded approximation algorithm that calculates the expected cost in O(n) time. Finally, we have conducted extensive experiments to validate the effectiveness of the proposed algorithms on real and synthetic data.
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining; 08/2013
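The Central Limit Theorem-based approximation mentioned above can be illustrated with a simplified model: if each investor independently votes correctly with probability p_i, the probability that a simple majority is correct can be approximated by a normal distribution of the vote sum. This is a sketch of the general technique under that simplifying assumption, not the paper's exact market model.

```python
import math

def market_confidence_clt(confidences):
    """Approximate P(majority of independent voters is correct) via the CLT.

    confidences: list of individual correctness probabilities p_i.
    The sum S of correct votes has mean mu = sum(p_i) and
    variance var = sum(p_i * (1 - p_i)); we approximate P(S > n/2)
    with the normal tail 1 - Phi((n/2 - mu) / sqrt(var)).
    """
    n = len(confidences)
    mu = sum(confidences)
    var = sum(p * (1 - p) for p in confidences)
    if var == 0:  # degenerate case: all votes deterministic
        return 1.0 if mu > n / 2 else 0.0
    z = (n / 2 - mu) / math.sqrt(var)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))
```

With 101 investors each correct with probability 0.7, the market confidence is far higher than any individual's, which is the aggregation effect the framework exploits.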
ABSTRACT: Schema matching is a central challenge for data integration systems. Automated tools are often uncertain about the schema matchings they suggest, and this uncertainty is inherent, since it arises from the inability of the schema to fully capture the semantics of the represented data. Human common sense can often help. Inspired by the popularity and success of easily accessible crowdsourcing platforms, we explore the use of crowdsourcing to reduce the uncertainty of schema matching. Since it is typical to ask simple questions on crowdsourcing platforms, we assume that each question, called a Correspondence Correctness Question (CCQ), asks the crowd to decide whether a given correspondence should exist in the correct matching. We propose frameworks and efficient algorithms to dynamically manage the CCQs, in order to maximize the uncertainty reduction within a limited budget of questions. We develop two novel approaches, "Single CCQ" and "Multiple CCQ", which adaptively select, publish and manage the questions. We verified the value of our solutions with simulations and a real implementation.
Proceedings of the VLDB Endowment 07/2013; 6(9):757-768. DOI:10.14778/2536360.2536374
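The Single CCQ idea, choosing the one question whose answer is expected to reduce uncertainty the most, can be sketched as an expected entropy-reduction computation over a distribution of candidate matchings. Representing matchings as sets of correspondences with probabilities is an illustrative assumption, not the paper's exact formulation.

```python
import math

def entropy(probs):
    # Shannon entropy in bits of a discrete distribution.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def best_single_ccq(matchings):
    """Pick the correspondence whose yes/no answer maximizes expected
    entropy reduction over the candidate matchings.

    matchings: dict {frozenset_of_correspondences: probability}
    Returns (correspondence, expected_entropy_reduction).
    """
    h0 = entropy(matchings.values())
    candidates = set().union(*matchings)
    best, best_gain = None, -1.0
    for c in candidates:
        p_yes = sum(p for m, p in matchings.items() if c in m)
        gain = h0
        # Expected posterior entropy, weighted over the two answers.
        for answer_p, keep in ((p_yes, True), (1 - p_yes, False)):
            if answer_p == 0:
                continue
            posterior = [p / answer_p for m, p in matchings.items()
                         if (c in m) == keep]
            gain -= answer_p * entropy(posterior)
        if gain > best_gain:
            best, best_gain = c, gain
    return best, best_gain
```

With two equally likely matchings that disagree on every correspondence, a single CCQ resolves all uncertainty, so the expected reduction equals the full initial entropy of one bit.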
ABSTRACT: Many text documents today are collaboratively edited, often with multiple small changes. The problem we consider in this paper is how to find provenance for a specific part of interest in the document. A full revision history, represented as a version tree, can tell us about all updates made to the document, but most of these updates may apply to other parts of the document, and hence not be relevant to answer the provenance question at hand. In this paper, we propose the notion of a revision unit as a flexible unit to capture the necessary provenance. We demonstrate through experiments the capability of the revision units in keeping only relevant updates in the provenance representation and the flexibility of the revision units in adjusting to updates reflected in the version tree.
Data Engineering (ICDE), 2013 IEEE 29th International Conference on; 01/2013
ABSTRACT: The promise of data-driven decision-making is now being recognized broadly, and there is growing enthusiasm for the notion of "Big Data," including the recent announcement from the White House about new funding initiatives across different agencies that target research for Big Data. While the promise of Big Data is real -- for example, it is estimated that Google alone contributed 54 billion dollars to the US economy in 2009 -- there is no clear consensus on what Big Data actually is. In fact, there have been many controversial statements about Big Data, such as "Size is the only thing that matters." In this panel we will try to explore the controversies and debunk the myths surrounding Big Data.
Proceedings of the VLDB Endowment 08/2012; 5(12):2032-2033. DOI:10.14778/2367502.2367572
ABSTRACT: Visual analytics (VA), which combines analytical techniques with advanced visualization features, is fast becoming a standard tool for extracting information from graph data. Researchers have developed many tools for this purpose, suggesting a need for formal methods to guide these tools' creation. Growing data demands on computing require redesigning VA tools to consider performance and reliability in the context of analyzing exascale datasets. Furthermore, visual analysts need a way to document their analyses for reuse and results justification. A VA graph framework encapsulated in a graph algebra helps address these needs. Its atomic operators include selection and aggregation. The framework employs a visual operator and supports dynamic attributes of data to enable scalable visual exploration of data.
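The two atomic operators named above, selection and aggregation, can be sketched on a minimal graph representation. The `Graph` class and operator signatures below are illustrative assumptions, not the authors' formal algebra.

```python
class Graph:
    def __init__(self, nodes, edges):
        self.nodes = dict(nodes)   # node id -> attribute dict
        self.edges = set(edges)    # set of (u, v) pairs

    def select(self, pred):
        # Selection: keep nodes whose attributes satisfy pred,
        # together with the edges induced between the kept nodes.
        kept = {n: a for n, a in self.nodes.items() if pred(a)}
        return Graph(kept, {(u, v) for u, v in self.edges
                            if u in kept and v in kept})

    def aggregate(self, key):
        # Aggregation: merge nodes sharing the same value of `key`
        # into super-nodes; edges collapse onto the super-nodes.
        group = {n: a[key] for n, a in self.nodes.items()}
        nodes = {g: {key: g} for g in set(group.values())}
        edges = {(group[u], group[v]) for u, v in self.edges
                 if group[u] != group[v]}
        return Graph(nodes, edges)
```

Composing the two operators (select a subgraph, then aggregate it) is the kind of scalable exploration step such an algebra is meant to express.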
ABSTRACT: Results of high throughput experiments can be challenging to interpret. Current approaches have relied on bulk processing the set of expression levels, in conjunction with easily obtained external evidence, such as co-occurrence. While such techniques can be used to reason probabilistically, they are not designed to shed light on what any individual gene, or a network of genes acting together, may be doing. Our belief is that today we have the information extraction ability and the computational power to perform more sophisticated analyses that consider the individual situation of each gene. The use of such techniques should lead to qualitatively superior results.
The specific aim of this project is to develop computational techniques to generate a small number of biologically meaningful hypotheses based on observed results from high throughput microarray experiments, gene sequences, and next-generation sequences. Through the use of relevant known biomedical knowledge, as represented in published literature and public databases, we can generate meaningful hypotheses that will aid biologists in interpreting their experimental data.
We are currently developing novel approaches that exploit the rich information encapsulated in biological pathway graphs. Our methods perform a thorough and rigorous analysis of biological pathways, using complex factors such as the topology of the pathway graph and the frequency with which genes appear in different pathways, to provide more meaningful hypotheses describing the biological phenomena captured by high throughput experiments than existing methods that consider only partial pathway information.
ABSTRACT: A relational database often yields a large set of tuples as the result of a query. Users browse this result set to find the information they require. If the result set is large, there may be many pages of data to browse. Since results comprise tuples of alphanumeric values that have few visual markers, it is hard to browse the data quickly, even if it is sorted. In this paper, we describe the design of a system for browsing relational data by scrolling through it at a high speed. Rather than showing the user a fast changing blur, the system presents the user with a small number of representative tuples. Representative tuples are selected to provide a "good impression" of the query result. We show that the information loss to the user is limited, even at high scrolling speeds, and that our algorithms can pick good representatives fast enough to provide for real-time, high-speed scrolling over large datasets.
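One simple stand-in for the representative-selection idea above (not the paper's algorithm, just an illustration of the interface) is to sample a handful of tuples evenly across the sort order of the current page:

```python
def representatives(page, k):
    """Pick k tuples spread evenly over a sorted page of query results.

    page: list of tuples, already sorted by the browsing key.
    k:    number of representatives to display (k >= 2 assumed).
    A real system would weigh value distributions, not just positions.
    """
    n = len(page)
    if n <= k:
        return list(page)
    # Evenly spaced indices from the first to the last tuple.
    idx = [i * (n - 1) // (k - 1) for i in range(k)]
    return [page[i] for i in idx]
```

At high scrolling speeds the page flies past, but the k picks always include the extremes of the visible range, giving the "good impression" the abstract describes.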
ABSTRACT: End-users increasingly find the need to perform light-weight, customized schema mapping. State-of-the-art tools provide powerful functions to generate schema mappings, but they usually require an in-depth understanding of the semantics of multiple schemas and their correspondences, and are thus not suitable for users who are technically unsophisticated or when a large number of mappings must be performed. We propose a system for sample-driven schema mapping. It automatically constructs schema mappings, in real time, from user-input sample target instances. Because the user does not have to provide any explicit attribute-level match information, she is isolated from the possibly complex structure and semantics of both the source schemas and the mappings. In addition, the user never has to master any operations specific to schema mappings: she simply types data values into a spreadsheet-style interface. As a result, the user can construct mappings with a much lower cognitive burden. In this paper we present Mweaver, a prototype sample-driven schema mapping system. It employs novel algorithms that enable the system to obtain desired mapping results while meeting interactive response performance requirements. We show the results of a user study that compares Mweaver with two state-of-the-art mapping tools across several mapping tasks, both real and synthetic. These suggest that the Mweaver system enables users to perform practical mapping tasks in about 1/5th the time needed by the state-of-the-art tools.
ABSTRACT: Numerous applications such as wireless communication and telematics need to keep track of the evolution of spatio-temporal data for a limited past. Limited retention may even be required by regulations. In general, each data entry can have its own user-specified lifetime. It is desired that expired entries are automatically removed by the system through some garbage collection mechanism. This kind of limited retention can be achieved by using sliding window semantics similar to that of stream data processing. However, due to the large volume and relatively long lifetime of data in the aforementioned applications (in contrast to real-time transient streaming data), the sliding window here needs to be maintained for data on disk rather than in memory. It is a new challenge to provide fast access to the information from the recent past and, at the same time, facilitate efficient deletion of the expired entries. In this paper, we propose a disk-based, two-layered, sliding window indexing scheme for discretely moving spatio-temporal data. Our index can support efficient processing of standard time-slice and interval queries and delete expired entries with almost no overhead. In existing historical spatio-temporal indexing techniques, deletion is either infeasible or very inefficient. Our sliding window based processing model can support both current and past entries, while many existing historical spatio-temporal indexing techniques cannot keep these two types of data together in the same index. Our experimental comparison with the best known historical index (i.e., the MV3R tree) for discretely moving spatio-temporal data shows that our index is about five times faster in terms of insertion time and comparable in terms of search performance. MV3R follows a partial persistency model, whereas our index can support very efficient deletion and update.
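The near-zero deletion overhead described above comes from expiring data at a coarse granularity rather than entry by entry. A minimal in-memory sketch of that principle (the actual index is a two-layer disk structure, and this bucket layout is an assumption made for illustration) groups entries into time buckets and drops whole buckets as they age out:

```python
from collections import OrderedDict

class SlidingWindowIndex:
    """Bucket-granularity expiration: dropping a bucket removes all of
    its expired entries in one constant-time step."""

    def __init__(self, bucket_width, lifetime):
        self.bucket_width = bucket_width
        self.lifetime = lifetime
        # bucket start time -> list of (timestamp, entry);
        # assumes inserts arrive roughly in time order.
        self.buckets = OrderedDict()

    def insert(self, t, entry):
        b = (t // self.bucket_width) * self.bucket_width
        self.buckets.setdefault(b, []).append((t, entry))

    def expire(self, now):
        # Drop whole buckets that lie entirely outside the window.
        cutoff = now - self.lifetime
        while self.buckets:
            oldest = next(iter(self.buckets))
            if oldest + self.bucket_width > cutoff:
                break
            self.buckets.popitem(last=False)

    def time_slice(self, t):
        # Standard time-slice query: all entries at exactly time t.
        b = (t // self.bucket_width) * self.bucket_width
        return [e for ts, e in self.buckets.get(b, []) if ts == t]
```

Deletion cost is independent of how many entries a bucket holds, which is the property the paper's two-layered design achieves on disk.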
ABSTRACT: Metabolomics is a rapidly evolving field that holds promise to provide insights into genotype-phenotype relationships in cancers, diabetes and other complex diseases. One of the major informatics challenges is providing tools that link metabolite data with other types of high-throughput molecular data (e.g. transcriptomics, proteomics), and incorporate prior knowledge of pathways and molecular interactions.
We describe a new, substantially redesigned version of our tool Metscape that allows users to enter experimental data for metabolites, genes and pathways and display them in the context of relevant metabolic networks. Metscape 2 uses an internal relational database that integrates data from KEGG and EHMN databases. The new version of the tool allows users to identify enriched pathways from expression profiling data, build and analyze the networks of genes and metabolites, and visualize changes in the gene/metabolite data. We demonstrate the applications of Metscape to annotate molecular pathways for human and mouse metabolites implicated in the pathogenesis of sepsis-induced acute lung injury, for the analysis of gene expression and metabolite data from pancreatic ductal adenocarcinoma, and for identification of the candidate metabolites involved in cancer and inflammation.
Metscape is part of the National Institutes of Health-supported National Center for Integrative Biomedical Informatics (NCIBI) suite of tools, freely available at http://metscape.ncibi.org. It can be downloaded from http://cytoscape.org or installed via the Cytoscape plugin manager.
Supplementary data are available at Bioinformatics online.
ABSTRACT: The National Center for Integrative and Biomedical Informatics (NCIBI) is one of the eight National Centers for Biomedical Computing (NCBCs). NCIBI supports information access and data analysis for biomedical researchers, enabling them to build computational and knowledge models of biological systems to address the Driving Biological Problems (DBPs). The NCIBI DBPs have included prostate cancer progression, organ-specific complications of type 1 and 2 diabetes, bipolar disorder, and metabolic analysis of obesity syndrome. Collaborating with these and other partners, NCIBI has developed a series of software tools for exploratory analysis, concept visualization, and literature searches, as well as core database and web services resources. Many of our training and outreach initiatives have been in collaboration with the Research Centers at Minority Institutions (RCMI), integrating NCIBI and RCMI faculty and students, culminating each year in an annual workshop. Our future directions include focusing on the TranSMART data sharing and analysis initiative.
Journal of the American Medical Informatics Association 11/2011; 19(2):166-70. DOI:10.1136/amiajnl-2011-000552
ABSTRACT: Diabetic neuropathy is a common complication of diabetes. While multiple pathways are implicated in the pathophysiology of diabetic neuropathy, there are no specific treatments and no means to predict diabetic neuropathy onset or progression. Here, we identify gene expression signatures related to diabetic neuropathy and develop computational classification models of diabetic neuropathy progression. Microarray experiments were performed on 50 samples of human sural nerves collected during a 52-week clinical trial. A series of bioinformatics analyses identified differentially expressed genes and their networks and biological pathways potentially responsible for the progression of diabetic neuropathy. We identified 532 differentially expressed genes between patient samples with progressing or non-progressing diabetic neuropathy, and found these were functionally enriched in pathways involving inflammatory responses and lipid metabolism. A literature-derived co-citation network of the differentially expressed genes revealed gene subnetworks centred on apolipoprotein E, jun, leptin, serpin peptidase inhibitor E type 1 and peroxisome proliferator-activated receptor gamma. The differentially expressed genes were used to classify a test set of patients with regard to diabetic neuropathy progression. Ridge regression models containing 14 differentially expressed genes correctly classified the progression status of 92% of patients (P < 0.001). To our knowledge, this is the first study to identify transcriptional changes associated with diabetic neuropathy progression in human sural nerve biopsies and describe their potential utility in classifying diabetic neuropathy. Our results identifying the unique gene signature of patients with progressive diabetic neuropathy will facilitate the development of new mechanism-based diagnostics and therapies.
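The classification step described above (a ridge regression model whose score is thresholded into progressing versus non-progressing) can be sketched with a closed-form ridge fit. The helper names, the toy data, and the +1/-1 label encoding are illustrative assumptions, not the study's 14-gene model.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    # Closed-form ridge regression: w = (X'X + lam*I)^{-1} X'y.
    # X: samples-by-genes expression matrix; y: +1/-1 progression labels.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def classify(X, w):
    # Threshold the ridge score at zero to recover the +1/-1 label.
    return np.sign(X @ w)
```

In the study's setting, X would hold the expression levels of the 14 selected genes per patient, with the regularization strength `lam` tuned on training data.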
ABSTRACT: Many decades of research, coupled with continuous increases in computing power, have enabled highly efficient execution of queries on large databases. In consequence, for many databases, far more time is spent by users formulating queries than by the system evaluating them. It stands to reason that, looking at the overall query experience we provide users, we should pay attention to how we can assist users in the holistic process of obtaining the information they desire from the database, and not just the constituent activity of efficiently generating a result given a complete precise query. In this paper, we examine the conventional query-result paradigm employed by databases and demonstrate challenges encountered when following this paradigm for an information seeking task. We recognize that the process of query specification itself is a major stumbling block. With current computational abilities, we are at a point where we can make use of the data in the database to aid in this process. To this end, we propose a new paradigm, guided interaction, to solve the noted challenges, by using interaction to guide the user through the query formulation, query execution and result examination processes. The user can be given advance information during query specification that can not only assist in query formulation, but may also lead to abandonment of an unproductive query direction or the satisfaction of information need even before the query specification is complete. There are significant engineering challenges to constructing the system we envision, and the technological building blocks to address these challenges exist today.
Proceedings of the VLDB Endowment 08/2011; 4(12):1466-1469.