ABSTRACT: Schema matching is a central challenge for data integration systems. Because schemas cannot fully capture the semantics of the data they represent, automatic tools are inherently uncertain about the matchings they suggest. Humans, however, are good at understanding data represented in various forms, and crowdsourcing platforms are making human annotation increasingly affordable. In this demo, we therefore show how to utilize the crowd to find the right matching. To do so, the tasks posted on the crowdsourcing platforms must be extremely simple, so that non-expert workers can perform them, and the number of tasks must be kept as small as possible to save cost. We demonstrate CrowdMatcher, a hybrid machine-crowd system for schema matching. Machine-generated matchings are verified by correspondence correctness queries (CCQs), which ask the crowd to determine whether a given correspondence is correct. CrowdMatcher includes several original features: it integrates different matchings generated by classical schema matching tools; to minimize crowdsourcing cost, it automatically selects the most informative set of CCQs from the possible matchings; it can manage inaccurate answers provided by workers; and the crowdsourced answers are used to improve the matching results.
ABSTRACT: Companies are increasingly moving their data processing to the cloud, for reasons of cost, scalability, and convenience, among others. However, hosting multiple applications and storage systems on the same cloud introduces resource sharing and heterogeneous data processing challenges due to the variety of resource usage patterns employed, the variety of data types stored, and the variety of query interfaces presented by those systems. Furthermore, real clouds are never perfectly symmetric - there often are differences between individual processors in their capabilities and connectivity. In this paper, we introduce a federation framework to manage such heterogeneous clouds. We then use this framework to discuss several challenges and their potential solutions.
IEEE Transactions on Knowledge and Data Engineering 01/2014; 26(7):1670-1678. · 1.89 Impact Factor
ABSTRACT: The benefits of crowdsourcing are well recognized today for an increasingly broad range of problems. Meanwhile, the rapid development of social media makes it possible to seek the wisdom of a crowd of targeted users. However, implementing a crowdsourcing platform on social media is not trivial; to turn social media users into workers, we need to address two challenges: 1) how to motivate users to participate in tasks, and 2) how to choose users for a task. In this paper, we present the Wise Market, an effective framework for crowdsourcing on social media that motivates users to participate in a task with care and correctly aggregates their opinions on pairwise-choice problems. A Wise Market consists of a set of investors, each with an associated individual confidence in his or her prediction; after the investment, only those whose choices agree with the market outcome are granted rewards. A social media user therefore has to give his or her "best" answer in order to be rewarded, and careless answers from sloppy users are discouraged. Under the Wise Market framework, we define an optimization problem, the Effective Market Problem (EMP): minimize the expected cost of paying out rewards while guaranteeing a minimum confidence level. We propose exact algorithms for calculating the market confidence and the expected cost in O(n log2 n) time for a Wise Market with n investors. To deal with the enormous number of users on social media, we design a Central Limit Theorem-based approximation algorithm that computes the market confidence in O(n) time, as well as a bounded approximation algorithm that calculates the expected cost in O(n) time. Finally, we conduct extensive experiments to validate the effectiveness of the proposed algorithms on real and synthetic data.
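The Central Limit Theorem-based approximation mentioned above can be illustrated with a small sketch. This is our own simplified model, not the paper's: we treat each investor as an independent voter who answers correctly with an individual probability, and approximate the chance that a strict majority is correct with a normal approximation; the actual Wise Market (weighted investments, tie handling) is richer.

```python
import math

def market_confidence_clt(probs):
    """CLT approximation of the probability that a strict majority of
    independent voters pick the correct option; probs[i] is voter i's
    probability of answering correctly. A sketch only: the paper's
    market model is more elaborate than this."""
    n = len(probs)
    mu = sum(probs)                            # mean correct-vote count
    var = sum(p * (1 - p) for p in probs)      # its variance
    k = n // 2 + 1                             # votes needed for a strict majority
    if var == 0:
        return 1.0 if mu >= k else 0.0
    # P(S >= k) via the normal CDF, with a 0.5 continuity correction
    z = (k - 0.5 - mu) / math.sqrt(var)
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))
```

The O(n) cost is visible directly: the approximation needs only the sum and variance of the individual confidences, whereas an exact majority probability requires convolving n Bernoulli distributions.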
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining; 08/2013
ABSTRACT: Schema matching is a central challenge for data integration systems. Automated tools are often uncertain about the schema matchings they suggest, and this uncertainty is inherent because it arises from the inability of the schema to fully capture the semantics of the represented data. Human common sense can often help. Inspired by the popularity and success of easily accessible crowdsourcing platforms, we explore the use of crowdsourcing to reduce the uncertainty of schema matching. Since it is typical to ask simple questions on crowdsourcing platforms, we assume that each question, called a Correspondence Correctness Question (CCQ), asks the crowd to decide whether a given correspondence should exist in the correct matching. We propose frameworks and efficient algorithms that dynamically manage the CCQs in order to maximize the uncertainty reduction within a limited budget of questions. We develop two novel approaches, "Single CCQ" and "Multiple CCQ", which adaptively select, publish and manage the questions. We verified the value of our solutions through simulations and a real implementation.
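The idea of picking the single most informative CCQ can be sketched as an entropy-reduction computation. This illustration is ours, under the assumption that uncertainty is modeled as a probability distribution over candidate matchings; the paper's exact objective and update rules may differ.

```python
import math
from itertools import chain

def entropy(probs):
    """Shannon entropy (bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def best_ccq(matchings):
    """Pick the correspondence whose yes/no answer gives the largest
    expected entropy reduction over the matching distribution.
    matchings: list of (probability, set_of_correspondences) pairs
    summing to 1. A sketch of the 'Single CCQ' selection idea."""
    h0 = entropy([p for p, _ in matchings])
    best, best_gain = None, -1.0
    for c in set(chain.from_iterable(m for _, m in matchings)):
        p_yes = sum(p for p, m in matchings if c in m)
        gain = h0
        for p_ans, contains in ((p_yes, True), (1 - p_yes, False)):
            if p_ans <= 0:
                continue
            # entropy of the distribution conditioned on this answer
            cond = [p / p_ans for p, m in matchings if (c in m) == contains]
            gain -= p_ans * entropy(cond)
        if gain > best_gain:
            best_gain, best = gain, c
    return best, best_gain
```

Intuitively, the chosen correspondence is the one whose answer splits the probability mass of the candidate matchings most evenly, so either answer eliminates substantial uncertainty.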
Proceedings of the VLDB Endowment. 07/2013; 6(9):757-768.
ABSTRACT: The promise of data-driven decision-making is now being recognized broadly, and there is growing enthusiasm for the notion of "Big Data," including the recent announcement from the White House about new funding initiatives across different agencies that target Big Data research. While the promise of Big Data is real -- for example, it is estimated that Google alone contributed 54 billion dollars to the US economy in 2009 -- there is no clear consensus on what Big Data is. In fact, there have been many controversial statements about Big Data, such as "Size is the only thing that matters." In this panel we will try to explore the controversies and debunk the myths surrounding Big Data.
Proceedings of the VLDB Endowment. 08/2012; 5(12):2032-2033.
ABSTRACT: Visual analytics (VA), which combines analytical techniques with advanced visualization features, is fast becoming a standard tool for extracting information from graph data. Researchers have developed many tools for this purpose, suggesting a need for formal methods to guide these tools' creation. Increasing data demands on computing require redesigning VA tools to consider performance and reliability in the context of analyzing exascale datasets. Furthermore, visual analysts need a way to document their analyses for reuse and results justification. A VA graph framework encapsulated in a graph algebra helps address these needs. Its atomic operators include selection and aggregation. The framework employs a visual operator and supports dynamic attributes of data to enable scalable visual exploration of data.
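The atomic selection and aggregation operators can be sketched on a toy graph representation. The data structures and function names below are our assumptions for illustration; the actual algebra described in the article is considerably richer.

```python
def select_nodes(graph, pred):
    """Selection operator: keep the nodes satisfying pred and the
    edges among them (a sketch in the spirit of the graph algebra)."""
    nodes = {n: a for n, a in graph['nodes'].items() if pred(a)}
    edges = [(u, v) for u, v in graph['edges'] if u in nodes and v in nodes]
    return {'nodes': nodes, 'edges': edges}

def aggregate_nodes(graph, key):
    """Aggregation operator: merge nodes that share key(attrs) into one
    super-node; edges collapse accordingly and self-loops are dropped."""
    group = {n: key(a) for n, a in graph['nodes'].items()}
    nodes = {g: {'members': sorted(n for n, gg in group.items() if gg == g)}
             for g in set(group.values())}
    edges = {(group[u], group[v]) for u, v in graph['edges']
             if group[u] != group[v]}
    return {'nodes': nodes, 'edges': sorted(edges)}
```

Composing such operators gives the scalable drill-down workflow the article describes: aggregate a large graph into a small summary, select a region of interest, then aggregate again at a finer grain.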
ABSTRACT: A relational database often yields a large set of tuples as the result of a query. Users browse this result set to find the information they require. If the result set is large, there may be many pages of data to browse. Since results comprise tuples of alphanumeric values that have few visual markers, it is hard to browse the data quickly, even if it is sorted. In this paper, we describe the design of a system for browsing relational data by scrolling through it at a high speed. Rather than showing the user a fast changing blur, the system presents the user with a small number of representative tuples. Representative tuples are selected to provide a "good impression" of the query result. We show that the information loss to the user is limited, even at high scrolling speeds, and that our algorithms can pick good representatives fast enough to provide for real-time, high-speed scrolling over large datasets.
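One simple way to pick tuples that "spread over" a result set is greedy farthest-point selection. This is our illustrative stand-in for the representative-selection step, assuming a user-supplied distance between tuples; the paper's notion of a "good impression" may use a different objective.

```python
def pick_representatives(tuples_, k, dist):
    """Greedily pick k representatives so each new pick is the tuple
    farthest from all representatives chosen so far (k-center style).
    dist(a, b) is any symmetric distance between two tuples."""
    if not tuples_ or k <= 0:
        return []
    reps = [tuples_[0]]                 # seed with the first tuple
    while len(reps) < min(k, len(tuples_)):
        # the tuple whose nearest representative is farthest away
        nxt = max(tuples_, key=lambda t: min(dist(t, r) for r in reps))
        reps.append(nxt)
    return reps
```

Each pick costs O(n·|reps|) distance evaluations, which is cheap enough per scrolled page to be plausible for the real-time requirement the abstract mentions.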
ABSTRACT: Numerous applications such as wireless communication and telematics need to keep track of the evolution of spatio-temporal data for a limited past. Limited retention may even be required by regulations. In general, each data entry can have its own user-specified lifetime. It is desirable that expired entries are automatically removed by the system through some garbage collection mechanism. This kind of limited retention can be achieved by using a sliding window semantics similar to that of stream data processing. However, due to the large volume and relatively long lifetime of data in the aforementioned applications (in contrast to real-time transient streaming data), the sliding window here needs to be maintained for data on disk rather than in memory. It is a new challenge to provide fast access to the information from the recent past and, at the same time, facilitate efficient deletion of the expired entries. In this paper, we propose a disk-based, two-layered, sliding-window indexing scheme for discretely moving spatio-temporal data. Our index can support efficient processing of standard time-slice and interval queries and delete expired entries with almost no overhead. In existing historical spatio-temporal indexing techniques, deletion is either infeasible or very inefficient. Our sliding-window-based processing model can support both current and past entries, while many existing historical spatio-temporal indexing techniques cannot keep these two types of data together in the same index. Our experimental comparison with the best-known historical index (the MV3R tree) for discretely moving spatio-temporal data shows that our index is about five times faster in terms of insertion time and comparable in terms of search performance. The MV3R tree follows a partial-persistency model, whereas our index can support very efficient deletion and update.
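The "deletion with almost no overhead" property typically comes from partitioning entries by time so that expiration drops whole partitions at once. The toy store below is our sketch of that bucketing idea only (it assumes roughly time-ordered inserts and keeps data in memory); the paper's on-disk, two-layered index is far more elaborate.

```python
import collections

class SlidingWindowStore:
    """Toy time-partitioned store: entries go into fixed-width time
    buckets, and expiration drops whole buckets, so deletion costs
    O(1) per bucket rather than O(1) per entry."""
    def __init__(self, bucket_width, lifetime):
        self.w, self.lifetime = bucket_width, lifetime
        self.buckets = collections.OrderedDict()  # bucket_start -> entries

    def insert(self, timestamp, entry):
        start = (timestamp // self.w) * self.w
        self.buckets.setdefault(start, []).append((timestamp, entry))

    def expire(self, now):
        # drop every bucket whose newest possible entry has expired
        while self.buckets:
            start = next(iter(self.buckets))
            if start + self.w <= now - self.lifetime:
                self.buckets.popitem(last=False)
            else:
                break

    def query(self, t0, t1):
        """Time-interval query over the surviving entries."""
        return [e for b in self.buckets.values()
                for ts, e in b if t0 <= ts <= t1]
```

The bucket-level expiration test is conservative: a bucket is dropped only when even its newest possible timestamp (`start + w`) is older than the retention horizon, so no live entry is ever discarded.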
ABSTRACT: End-users increasingly find the need to perform lightweight, customized schema mapping. State-of-the-art tools provide powerful functions to generate schema mappings, but they usually require an in-depth understanding of the semantics of multiple schemas and their correspondences, and are thus not suitable for users who are technically unsophisticated or when a large number of mappings must be performed. We propose a system for sample-driven schema mapping. It automatically constructs schema mappings, in real time, from user-input sample target instances. Because the user does not have to provide any explicit attribute-level match information, she is isolated from the possibly complex structure and semantics of both the source schemas and the mappings. In addition, the user never has to master any operations specific to schema mappings: she simply types data values into a spreadsheet-style interface. As a result, the user can construct mappings with a much lower cognitive burden. In this paper we present Mweaver, a prototype sample-driven schema mapping system. It employs novel algorithms that enable the system to obtain desired mapping results while meeting interactive response performance requirements. We show the results of a user study that compares Mweaver with two state-of-the-art mapping tools across several mapping tasks, both real and synthetic. The results suggest that Mweaver enables users to perform practical mapping tasks in about one-fifth the time needed by the state-of-the-art tools.
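The core search behind sample-driven mapping can be sketched as follows. This is a deliberately simplified illustration of our own: it only looks for single source columns that cover the sample values the user typed, whereas Mweaver itself searches for full mappings, including join paths across tables.

```python
def candidate_mappings(source_tables, sample_rows):
    """For each target column, list the source (table, column) pairs
    whose values cover every sample value the user typed.
    source_tables: {table: {column: list_of_values}}
    sample_rows: list of sample target rows (tuples)."""
    n_cols = len(sample_rows[0])
    result = []
    for i in range(n_cols):
        wanted = {row[i] for row in sample_rows}
        hits = [(t, c) for t, cols in source_tables.items()
                for c, vals in cols.items() if wanted <= set(vals)]
        result.append(hits)
    return result
```

As the user types more sample values, `wanted` grows and the candidate sets shrink, which is how a few examples can disambiguate the intended mapping without any attribute-level match input.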
ABSTRACT: Results of high-throughput experiments can be challenging to interpret. Current approaches have relied on bulk processing the set of expression levels, in conjunction with easily obtained external evidence, such as co-occurrence. While such techniques can be used to reason probabilistically, they are not designed to shed light on what any individual gene, or a network of genes acting together, may be doing. Our belief is that today we have the information extraction ability and the computational power to perform more sophisticated analyses that consider the individual situation of each gene. The use of such techniques should lead to qualitatively superior results. The specific aim of this project is to develop computational techniques to generate a small number of biologically meaningful hypotheses based on observed results from high-throughput microarray experiments, gene sequences, and next-generation sequences. Through the use of relevant known biomedical knowledge, as represented in published literature and public databases, we can generate meaningful hypotheses that will aid biologists in interpreting their experimental data. We are currently developing novel approaches that exploit the rich information encapsulated in biological pathway graphs. Our methods perform a thorough and rigorous analysis of biological pathways, using complex factors such as the topology of the pathway graph and the frequency with which genes appear on different pathways, to provide more meaningful hypotheses to describe the biological phenomena captured by high-throughput experiments, when compared to other existing methods that only consider partial information captured by biological pathways.
ABSTRACT: Metabolomics is a rapidly evolving field that holds promise to provide insights into genotype-phenotype relationships in cancers, diabetes and other complex diseases. One of the major informatics challenges is providing tools that link metabolite data with other types of high-throughput molecular data (e.g. transcriptomics, proteomics), and incorporate prior knowledge of pathways and molecular interactions.
We describe a new, substantially redesigned version of our tool Metscape that allows users to enter experimental data for metabolites, genes and pathways and display them in the context of relevant metabolic networks. Metscape 2 uses an internal relational database that integrates data from KEGG and EHMN databases. The new version of the tool allows users to identify enriched pathways from expression profiling data, build and analyze the networks of genes and metabolites, and visualize changes in the gene/metabolite data. We demonstrate the applications of Metscape to annotate molecular pathways for human and mouse metabolites implicated in the pathogenesis of sepsis-induced acute lung injury, for the analysis of gene expression and metabolite data from pancreatic ductal adenocarcinoma, and for identification of the candidate metabolites involved in cancer and inflammation.
Metscape is part of the National Institutes of Health-supported National Center for Integrative Biomedical Informatics (NCIBI) suite of tools, freely available at http://metscape.ncibi.org. It can be downloaded from http://cytoscape.org or installed via Cytoscape plugin manager.
Supplementary data are available at Bioinformatics online.
ABSTRACT: The National Center for Integrative and Biomedical Informatics (NCIBI) is one of the eight NCBCs. NCIBI supports information access and data analysis for biomedical researchers, enabling them to build computational and knowledge models of biological systems to address the Driving Biological Problems (DBPs). The NCIBI DBPs have included prostate cancer progression, organ-specific complications of type 1 and 2 diabetes, bipolar disorder, and metabolic analysis of obesity syndrome. Collaborating with these and other partners, NCIBI has developed a series of software tools for exploratory analysis, concept visualization, and literature searches, as well as core database and web services resources. Many of our training and outreach initiatives have been in collaboration with the Research Centers at Minority Institutions (RCMI), integrating NCIBI and RCMI faculty and students, culminating each year in an annual workshop. Our future directions include focusing on the TranSMART data sharing and analysis initiative.
Journal of the American Medical Informatics Association 11/2011; 19(2):166-70. · 3.57 Impact Factor
ABSTRACT: Diabetic neuropathy is a common complication of diabetes. While multiple pathways are implicated in the pathophysiology of diabetic neuropathy, there are no specific treatments and no means to predict diabetic neuropathy onset or progression. Here, we identify gene expression signatures related to diabetic neuropathy and develop computational classification models of diabetic neuropathy progression. Microarray experiments were performed on 50 samples of human sural nerves collected during a 52-week clinical trial. A series of bioinformatics analyses identified differentially expressed genes and their networks and biological pathways potentially responsible for the progression of diabetic neuropathy. We identified 532 differentially expressed genes between patient samples with progressing or non-progressing diabetic neuropathy, and found these were functionally enriched in pathways involving inflammatory responses and lipid metabolism. A literature-derived co-citation network of the differentially expressed genes revealed gene subnetworks centred on apolipoprotein E, jun, leptin, serpin peptidase inhibitor E type 1 and peroxisome proliferator-activated receptor gamma. The differentially expressed genes were used to classify a test set of patients with regard to diabetic neuropathy progression. Ridge regression models containing 14 differentially expressed genes correctly classified the progression status of 92% of patients (P < 0.001). To our knowledge, this is the first study to identify transcriptional changes associated with diabetic neuropathy progression in human sural nerve biopsies and describe their potential utility in classifying diabetic neuropathy. Our results identifying the unique gene signature of patients with progressive diabetic neuropathy will facilitate the development of new mechanism-based diagnostics and therapies.
ABSTRACT: Gene set enrichment testing has helped bridge the gap from an individual gene to a systems biology interpretation of microarray data. Although gene sets are defined a priori based on biological knowledge, current methods for gene set enrichment testing treat all genes equally. It is well known that some genes, such as those responsible for housekeeping functions, appear in many pathways, whereas other genes are more specialized and play a unique role in a single pathway. Drawing inspiration from the field of information retrieval, we have developed and present here an approach that incorporates gene appearance frequency (in KEGG pathways) into two current methods, Gene Set Enrichment Analysis (GSEA) and the logistic regression-based LRpath framework, to generate more reproducible and biologically meaningful results.
Two breast cancer microarray datasets were analyzed to identify gene sets differentially expressed between histological grade 1 and grade 3 breast cancer. The correlation of Normalized Enrichment Scores (NES) between gene sets, generated by the original GSEA and by GSEA with the appearance frequency of genes incorporated (GSEA-AF), was compared. GSEA-AF resulted in higher correlation between experiments and more overlapping top gene sets. Several cancer-related gene sets also achieved higher NES with GSEA-AF. The same datasets were analyzed by LRpath and by LRpath with the appearance frequency of genes incorporated (LRpath-AF). Two well-studied lung cancer datasets were analyzed in the same manner to demonstrate the validity of the method, and similar results were obtained.
We introduce an alternative way to integrate KEGG PATHWAY information into gene set enrichment testing. The performance of GSEA and LRpath can be enhanced by integrating the appearance frequency of genes. We conclude that, in general, gene set analysis methods that integrate information from KEGG PATHWAY perform better both statistically and biologically.
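The appearance-frequency weighting borrowed from information retrieval is closely analogous to inverse document frequency. The sketch below is our illustration of that analogy, not the exact GSEA-AF formula: genes that appear in many pathways (e.g. housekeeping genes) get low weight, pathway-specific genes get high weight.

```python
import math

def gene_weights(pathways):
    """IDF-style weights from pathway appearance frequency.
    pathways: {pathway_name: list_of_genes}. A gene in every pathway
    gets weight 1.0; rarer genes get progressively higher weight."""
    n = len(pathways)
    freq = {}
    for genes in pathways.values():
        for g in set(genes):                 # count each gene once per pathway
            freq[g] = freq.get(g, 0) + 1
    # smoothed IDF: log((1 + n) / (1 + f)) + 1
    return {g: math.log((1 + n) / (1 + f)) + 1 for g, f in freq.items()}
```

In an enrichment framework, such weights would scale each gene's contribution to the enrichment statistic, down-weighting ubiquitous genes so that pathway-specific signal dominates.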
ABSTRACT: Databases today are carefully engineered: there is an expensive and deliberate design process, after which a database schema is defined; during this design process, various possible instance examples and use cases are hypothesized and carefully analyzed; finally, the schema is ready and can be populated with data. All of this effort is a major barrier to database adoption. In this paper, we explore the possibility of organic database creation instead of the traditional engineered approach. The idea is to let the user start storing data in a database with a schema that is just enough to cover the instances at hand. We then support efficient schema evolution as new data instances arrive. By designing the database to evolve, we can sidestep the expensive front-end cost of carefully engineering the design of the database. The same set of issues applies to database querying. Today, databases expect queries to be carefully specified and to be valid with respect to the database schema. In contrast, the organic query specification model would allow users to construct queries incrementally, with little knowledge of the database. We also examine this problem in this paper.
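The "schema that is just enough to cover the instances at hand" can be made concrete with a toy in-memory table whose schema widens as records with new attributes arrive. This sketch is ours and only illustrates the evolution idea; the paper also addresses efficient physical evolution and incremental querying.

```python
class OrganicTable:
    """Toy organic table: the schema starts as the attributes of the
    first inserted record and widens automatically when later records
    introduce new attributes; old rows are backfilled with None."""
    def __init__(self):
        self.columns = []
        self.rows = []

    def insert(self, record):
        for key in record:                   # widen schema on new attributes
            if key not in self.columns:
                self.columns.append(key)
                for row in self.rows:        # backfill existing rows
                    row.append(None)
        self.rows.append([record.get(c) for c in self.columns])
```

Example: inserting `{'name': 'Ann'}` and then `{'name': 'Bob', 'age': 30}` grows the schema from one column to two without any upfront design step.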
Databases in Networked Information Systems - 7th International Workshop, DNIS 2011, Aizu-Wakamatsu, Japan, December 12-14, 2011. Proceedings; 01/2011
ABSTRACT: Reactive oxygen species (ROS) are known mediators of cellular damage in multiple diseases including diabetic complications. Despite its importance, no comprehensive database is currently available for the genes associated with ROS.
We present ROS- and diabetes-related targets (genes/proteins) collected from the biomedical literature through a text mining technology. A web-based literature mining tool, SciMiner, was applied to 1,154 biomedical papers indexed with diabetes and ROS by PubMed to identify relevant targets. Over-represented targets in the ROS-diabetes literature were obtained through comparisons against randomly selected literature. The expression levels of nine genes, selected from the top ranked ROS-diabetes set, were measured in the dorsal root ganglia (DRG) of diabetic and non-diabetic DBA/2J mice in order to evaluate the biological relevance of literature-derived targets in the pathogenesis of diabetic neuropathy.
SciMiner identified 1,026 ROS- and diabetes-related targets from the 1,154 biomedical papers (http://jdrf.neurology.med.umich.edu/ROSDiabetes/). Fifty-three targets were significantly over-represented in the ROS-diabetes literature compared to randomly selected literature. These over-represented targets included well-known members of the oxidative stress response including catalase, the NADPH oxidase family, and the superoxide dismutase family of proteins. Eight of the nine selected genes exhibited significant differential expression between diabetic and non-diabetic mice. For six genes, the direction of expression change in diabetes paralleled enhanced oxidative stress in the DRG.
Literature mining compiled ROS- and diabetes-related targets from the biomedical literature and led us to evaluate the biological relevance of selected targets in the pathogenesis of diabetic neuropathy.
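The over-representation analysis described above (53 targets significantly enriched in the ROS-diabetes literature versus randomly selected literature) is conventionally done with a one-sided hypergeometric test. The sketch below is a standard formulation of that test, not necessarily SciMiner's exact statistic.

```python
from math import comb

def overrepresentation_p(k, n, K, N):
    """One-sided hypergeometric p-value: the probability of seeing at
    least k papers mentioning a target in a sample of n papers, when
    K of the N papers in the background corpus mention it."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)
```

A small p-value (e.g. below 0.05 after multiple-testing correction) indicates the target is mentioned in the disease-specific corpus far more often than random sampling of the literature would explain.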
BMC Medical Genomics 10/2010; 3:49. · 3.91 Impact Factor
ABSTRACT: Rule-based information extraction from text is increasingly being used to populate databases and to support structured queries on unstructured text. Specification of suitable information extraction rules requires considerable skill, and standard practice is to refine rules iteratively, with substantial effort. In this paper, we show that techniques developed in the context of data provenance, to determine the lineage of a tuple in a database, can be leveraged to assist in rule refinement. Specifically, given a set of extraction rules and correct and incorrect extracted data, we have developed a technique to suggest a ranked list of rule modifications that an expert rule specifier can consider. We implemented our technique in the SystemT information extraction system developed at IBM Research - Almaden and experimentally demonstrate its effectiveness.
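The provenance idea can be sketched with a toy ranker: each extracted span carries its lineage (the rule that produced it), and candidate refinements are ranked by how many labeled false positives they would remove versus true positives they would lose. This simplification is ours: it considers only whole-rule removals, whereas the SystemT work suggests finer-grained rule modifications.

```python
def rank_rule_removals(extractions, labels):
    """Rank candidate rule removals, best first, by net benefit:
    false positives removed minus true positives lost.
    extractions: list of (span, rule_id) pairs, where rule_id is the
    lineage of the span; labels: {span: True if correct else False}."""
    score = {}
    for span, rule in extractions:
        correct = labels.get(span, False)
        fp, tp = score.setdefault(rule, [0, 0])
        score[rule] = [fp + (0 if correct else 1), tp + (1 if correct else 0)]
    # ascending in (tp_lost - fp_removed): best removals come first
    return sorted(score.items(), key=lambda kv: kv[1][1] - kv[1][0])
```

In the real system, the lineage of each tuple is recorded during rule evaluation, so this bookkeeping requires no re-running of the extractor over the corpus.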