ABSTRACT: Enterprise mashup scenarios often involve feeds derived from data created primarily for eye consumption, such as email, news, calendars, blogs, and web feeds. These data sources can test the capabilities of current data mashup products, as the attributes needed to perform join, aggregation, and other operations are often buried within unstructured feed text. Information extraction technology is a key enabler in such scenarios, using annotators to convert unstructured text into structured information that can facilitate mashup operations. Our demo presents the integration of SystemT, an information extraction system from IBM Research, with IBM's InfoSphere MashupHub. We show how to build domain-specific annotators with SystemT's declarative rule language, AQL, and how to use these annotators to combine structured and unstructured information in an enterprise mashup.
Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009; 01/2009
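The idea of extracting join attributes from unstructured feed text can be illustrated with a minimal sketch. It is not AQL and does not use SystemT or MashupHub: a plain Python regular expression stands in for an annotator, and the feed items, the phone-number pattern, and the directory table are hypothetical examples chosen only for illustration.

```python
import re

# Hypothetical annotator rule: a 7-digit phone number such as 555-1234.
PHONE = re.compile(r"\b\d{3}-\d{4}\b")


def annotate(items):
    """Turn unstructured feed text into structured (id, phone) records."""
    for item in items:
        for match in PHONE.finditer(item["text"]):
            yield {"id": item["id"], "phone": match.group()}


def join_with_directory(annotations, directory):
    """Join extracted records with a structured table on the phone attribute."""
    by_phone = {row["phone"]: row for row in directory}
    return [
        {**a, "name": by_phone[a["phone"]]["name"]}
        for a in annotations
        if a["phone"] in by_phone
    ]


# Hypothetical inputs: an unstructured feed and a structured directory.
feed = [
    {"id": 1, "text": "Call me at 555-1234 about the demo."},
    {"id": 2, "text": "No number in this message."},
]
directory = [{"phone": "555-1234", "name": "Alice"}]

result = join_with_directory(annotate(feed), directory)
```

Once the annotator has produced the `phone` attribute, the join is an ordinary structured operation, which is the role the abstract assigns to information extraction in a mashup.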
ABSTRACT: Increasingly large numbers of situational applications are being created by enterprise business users as a by-product of solving day-to-day problems. In efforts to address the demand for such applications, corporate IT is moving toward Web 2.0 architectures. In particular, the corporate intranet is evolving into a platform of readily accessible data and services where communities of business users can assemble and deploy situational applications. Damia is a web style data integration platform being developed to address the data problem presented by such applications, which often access and combine data from a variety of sources. Damia allows business users to quickly and easily create data mashups that combine data from desktop, web, and traditional IT sources into feeds that can be consumed by AJAX, and other types of web applications. This paper describes the key features and design of Damia's data integration engine, which has been packaged with Mashup Hub, an enterprise feed server currently available for download on IBM alphaWorks. Mashup Hub exposes Damia's data integration capabilities in the form of a service that allows users to create hosted data mashups.
Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008; 01/2008
ABSTRACT: Situational applications require business users to create, combine, and catalog data feeds and other enterprise data sources. Damia is a lightweight enterprise data integration engine inspired by the Web 2.0 mashup phenomenon. It consists of (1) a browser-based user-interface that allows for the specification of data mashups as data flow graphs using a set of Damia operators specified by programming-by-example principles, (2) a server with an execution engine, as well as (3) APIs for searching, debugging, executing and managing mashups. Damia provides a base data model and primitive operators based on the XQuery Infoset. A feed abstraction built on that model enables combining, filtering and transforming data feeds. This paper presents an overview of the Damia system as well as a research vision for data-intensive situational applications. A first version of Damia realizing some of the concepts described in this paper is available as a web service and for download as part of IBM’s Mashup Starter
ABSTRACT: Today's enterprise systems typically include both data-centric and document-centric applications. Data-centric applications are built on top of DBMS products which have excelled at advanced query processing and ACID transaction support for structured data. On the other hand, document-centric applications usually rely on content management system (CMS) products to perform advanced unstructured data management operations due to inherent differences in the usage patterns and required feature set (e.g., versioning, records management, etc.). We observe that a new class of hybrid applications is emerging that requires the combined set of DBMS and CMS features on structured and unstructured integrated content, due in large part to increasingly complex business requirements and the widespread adoption of XML technologies. However, today's hybrid applications are forced to fragment their business artifacts in separate DBMS and CMS repositories, and cope with accessing, augmenting, and processing the separate pieces. The lack of a unified repository model for integrated content makes the development of hybrid enterprise applications painfully difficult, and often leads to short-lived, inadequate solutions. In this paper, we explore the trends in hybrid enterprise applications and their requirements for a unified repository model. We suggest a holistic approach for the design of the new repository model covering both DBMS and CMS features under one umbrella. We discuss the integration challenges, and present our experience with a prototype that we developed in the MUSIC (Management of Unstructured and Structured Integrated Content) project.
ABSTRACT: Damia is a lightweight enterprise data integration service where line-of-business users can create and catalog high-value data feeds for consumption by situational applications. Damia is inspired by the Web 2.0 mashup phenomenon. It consists of (1) a browser-based user-interface that allows for the specification of data mashups as data flow graphs using a set of operators, (2) a server with an execution engine, as well as (3) APIs for searching, debugging, executing and managing mashups. Damia offers a framework and functionality for dynamic entity resolution, streaming and other higher-value features particularly important in the enterprise domain. Damia is currently in perpetual beta in the IBM Intranet.
Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007; 01/2007
ABSTRACT: DB2 XML is a hybrid database system that combines the relational capabilities of DB2 Universal Database™ (UDB) with comprehensive native XML support. DB2 XML augments DB2® UDB with a native XML store, XML indexes, and query processing capabilities for both XQuery and SQL/XML that are integrated with those of SQL. This paper presents the extensions made to the DB2 UDB compiler, and especially its cost-based query optimizer, to support XQuery and SQL/XML queries, using much of the same infrastructure developed for relational data queried by SQL. It describes the challenges to the relational infrastructure that supporting XQuery and SQL/XML poses and provides the rationale for the extensions that were made to the three main parts of the optimizer: the plan operators, the cardinality and cost model, and statistics collection.
IBM Systems Journal 01/2006; 45:299-320. · 1.29 Impact Factor
ABSTRACT: Progressive Optimization (POP) is a technique to make query plans robust, and minimize the need for DBA intervention, by repeatedly re-optimizing a query during runtime if the cardinalities estimated during optimization prove to be significantly incorrect. POP works by carefully calculating validity ranges for each plan operator under which the overall plan can be optimal. POP then instruments the query plan with checkpoints that validate at runtime that cardinalities do lie within validity ranges, and re-optimizes the query otherwise. In this demonstration we showcase POP implemented for a research prototype version of IBM's DB2 DBMS, using a mix of real-world and synthetic benchmark databases and workloads. For selected queries of the workload we display the query plans with validity ranges as well as the placement of the various kinds of CHECK operators using the DB2 graphical plan explain tool. We also execute the queries, showing how and where re-optimization is triggered through the CHECK operators, the new plan generated upon re-optimization, and the extent to which previously computed intermediate results are reused.
ABSTRACT: Virtually every commercial query optimizer chooses the best plan for a query using a cost model that relies heavily on accurate cardinality estimation. Cardinality estimation errors can occur due to the use of inaccurate statistics, invalid assumptions about attribute independence, parameter markers, and so on. Cardinality estimation errors may cause the optimizer to choose a sub-optimal plan. We present an approach to query processing that is extremely robust because it is able to detect and recover from cardinality estimation errors. We call this approach "progressive query optimization" (POP). POP validates cardinality estimates against actual values as measured during query execution. If there is significant disagreement between estimated and actual values, execution might be stopped and re-optimization might occur. Oscillation between optimization and execution steps can occur any number of times. A re-optimization step can exploit both the actual cardinality and partial results, computed during a previous execution step. Checkpoint operators (CHECK) validate the optimizer's cardinality estimates against actual cardinalities. Each CHECK has a condition that indicates the cardinality bounds within which a plan is valid. We compute this validity range through a novel sensitivity analysis of query plan operators. If the CHECK condition is violated, CHECK triggers re-optimization. POP has been prototyped in a leading commercial DBMS. An experimental evaluation of POP using TPC-H queries illustrates the robustness POP adds to query processing, while incurring only negligible overhead. A case study applying POP to a real-world database and workload shows the potential of POP, accelerating complex OLAP queries by almost two orders of magnitude.
Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, June 13-18, 2004; 01/2004
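The CHECK mechanism described in the abstract can be sketched as a toy loop. This is a hedged illustration under strong assumptions, not the DB2 prototype: the two-plan "optimizer", the `THRESHOLD` crossover value, and all names (`Check`, `Reoptimize`, `choose_plan`, `run_with_pop`) are hypothetical, and the validity range here is simply the interval in which the toy optimizer would keep its choice, rather than the result of the sensitivity analysis the paper describes.

```python
from dataclasses import dataclass


class Reoptimize(Exception):
    """Signals that an observed cardinality violated a CHECK condition."""

    def __init__(self, rows):
        super().__init__(f"cardinality {len(rows)} outside validity range")
        self.rows = rows  # partial result kept for reuse after re-optimization


@dataclass
class Check:
    """Checkpoint with the cardinality bounds within which the plan is valid."""
    low: float
    high: float

    def validate(self, rows):
        if not (self.low <= len(rows) <= self.high):
            raise Reoptimize(rows)
        return rows


THRESHOLD = 100  # hypothetical crossover point between the two join methods


def choose_plan(cardinality):
    """Toy optimizer: a join method plus the range in which it stays the choice."""
    if cardinality <= THRESHOLD:
        return "nested_loop", Check(0, THRESHOLD)
    return "hash_join", Check(THRESHOLD + 1, float("inf"))


def run_with_pop(scan, estimate):
    """Plan from an estimate, validate at runtime, re-plan on a CHECK violation."""
    cardinality, rows, history = estimate, None, []
    while True:
        plan, check = choose_plan(cardinality)
        history.append(plan)
        try:
            if rows is None:
                rows = scan()  # executed once; later rounds reuse the result
            check.validate(rows)
            return plan, history
        except Reoptimize as signal:
            rows = signal.rows
            cardinality = len(rows)  # re-optimize with the actual cardinality


# The estimate (500 rows) is badly wrong: the scan actually yields 10 rows.
plan, history = run_with_pop(lambda: list(range(10)), estimate=500)
```

With the wrong estimate, the first plan fails its check and the loop re-plans once using the observed cardinality; with an accurate estimate, the check passes and no re-optimization happens.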
ABSTRACT: Outerjoins and antijoins are two important classes of joins in database systems. Reordering outerjoins and antijoins with innerjoins is challenging because not all the join orders preserve the semantics of the original query. Previous work did not consider antijoins and was restricted to a limited class of queries. We consider using a conventional bottom-up optimizer to reorder different types of joins. We propose extending each join predicate's eligibility list, which contains all the tables referenced in the predicate. An extended eligibility list (EEL) includes all the tables needed by a predicate to preserve the semantics of the original query. We describe an algorithm that can set up the EELs properly in a bottom-up traversal of the original operator tree. A conventional join optimizer is then modified to check the EELs when generating sub-plans. Our approach handles antijoins and can resolve many practical issues. It is now being implemented in an upcoming release of IBM's Universal Database Server for Unix, Windows and OS/2
Proceedings of the 17th International Conference on Data Engineering, 2001; 02/2001