About
Publications: 76
Reads: 6,545
Citations: 3,089
Publications (76)
The scale and significance of graph structured data today has led to the development of graph management systems that are optimized either for graph navigation requests or graph analytic requests. We present a general purpose graph system that provides high performance concurrently for both navigation and analytic requests. In addition, it supports...
Software developers at large tech companies spend a lot of time writing code for tasks that colleagues elsewhere in the organization have already addressed. Scripts are rarely written or documented with discovery in mind, and the APIs on which they depend are frequently inconsistent, further limiting reuse. For mobile devices the App Catalog serves...
Disclosed herein are techniques for measuring or assessing the costs of executing operations across a plurality of computing systems. The cost of transferring data across at least one arrangement of computing systems is determined. The cost of executing at least one arrangement of the operations is also determined.
A method of assigning resources of a computer cluster with resource sharing according to objectives. The method includes monitoring resources of each of a plurality of cloud nodes, providing information descriptive of the cloud node resources, receiving a reservation, determining whether resources are available to satisfy the reservation and any oth...
A method of managing the execution of a workload of transactions of different transaction types on a computer system. Each transaction type may have a different resource requirement. The method may include intermittently, during execution of the workload, determining the performance of each transaction type. A determination may be made of whether i...
A computer implemented method and apparatus calculate a freshness cost for each of a plurality of information integration flow graphs and select one of the plurality of information integration flow graphs based upon the calculated freshness cost.
Database query monitoring tools collect performance metrics, such as memory and CPU usage, while a query is executing and make them available through log files or system tables. The metrics can be used to understand and diagnose query performance issues. However, analytic queries over big data present new challenges for query monitoring tools. A l...
A complex analytic data flow may perform multiple, inter-dependent tasks where each task uses a different processing engine. Such a multi-engine flow, termed a hybrid flow, may comprise subflows written in more than one programming language. However, as the number and variety of these engines grow, developing and maintaining hybrid flows at the phy...
Computer-based methods, computer-readable storage media and computer systems are provided for optimizing integration flow plans. An initial integration flow plan, one or more objectives and/or an objective function related to the one or more objectives may be received as input. A computing cost of the initial integration flow plan may be compared w...
A method for quality objective-based ETL pipeline optimization is provided. An improvement objective is obtained from user input into a computing system. The improvement objective represents a priority optimization desired by a user for improved ETL flows for an application designed to run in memory of the computing system. An ETL flow is created i...
A method of producing a representation of the progress of a process being performed on a database may be embodied in a data processing system. The method may include obtaining for each of a plurality of subprocesses included in the database process an estimated rate of using a system resource during execution of the subprocess and an estimated volu...
Embodiments include methods, apparatus, and systems for using range queries in multidimensional data in a database. One embodiment is a method that defines a query box from a search for multidimensional data in a database. The method examines an intersection between a Z-interval and the query box by decomposing the Z-interval into hyper-boxes that...
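The range-query method above builds on the Z-order (Morton) curve, under which an axis-aligned query box spans a contiguous Z-interval between its low and high corners. The patent's hyper-box decomposition of that interval is elided here; this minimal Python sketch shows only the underlying encoding plus a coarse Z-interval scan refined by an exact box test (all function names are hypothetical):

```python
def morton_encode(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single Z-value, so points
    that are close in 2-D tend to be close on the curve."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x supplies the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y supplies the odd bit positions
    return z

def range_query(points, x_lo, x_hi, y_lo, y_hi):
    """Scan the Z-interval spanned by the query box's corners, then discard
    false positives that fall in the interval but outside the box."""
    z_lo = morton_encode(x_lo, y_lo)   # minimum Z-value inside the box
    z_hi = morton_encode(x_hi, y_hi)   # maximum Z-value inside the box
    return [(x, y) for (x, y) in points
            if z_lo <= morton_encode(x, y) <= z_hi        # coarse Z-interval test
            and x_lo <= x <= x_hi and y_lo <= y <= y_hi]  # exact box test
```

Because Morton encoding is monotone in each coordinate, the box's low corner has the smallest Z-value in the box and the high corner the largest, which is what makes the coarse interval test sound.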
A search method includes the step of creating a list of candidate probe words. For each candidate probe word, the number of item descriptions that contain the candidate probe word is counted. Q probe words are chosen whose word count most equally divides the number of remaining item descriptions into q+1 subgroups. The q probe words are presented f...
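One way to read the selection step is as a search for the q-word subset whose induced partition of the item descriptions is most balanced. A brute-force Python sketch under that reading follows; the first-match grouping rule and all names are illustrative assumptions, not the patented procedure:

```python
from itertools import combinations

def choose_probe_words(docs, candidates, q):
    """Pick the q candidate words whose q+1 groups are most nearly equal
    in size. A doc joins the group of the first probe word it contains;
    docs matching no probe word form the extra (q+1)-th group."""
    def group_sizes(words):
        sizes = [0] * (len(words) + 1)
        for doc in docs:
            for i, w in enumerate(words):
                if w in doc:
                    sizes[i] += 1
                    break
            else:
                sizes[-1] += 1
        return sizes

    # Minimize the spread between the largest and smallest group.
    return min(combinations(candidates, q),
               key=lambda ws: max(group_sizes(ws)) - min(group_sizes(ws)))
```

For realistic candidate sets the exhaustive search would be replaced by the count-based heuristic the abstract hints at, but the balancing objective is the same.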
A complex analytic flow in a modern enterprise may perform multiple, logically independent, tasks where each task uses a different processing engine. We term these multi-engine flows hybrid flows. Using multiple processing engines has advantages such as rapid deployment, better performance, lower cost, and so on. However, as the number and variety...
To implement at least one functionality, a template having one or more logic components corresponding to the at least one functionality is provided. The template and a data collection are accessed to load the one or more logic components and data into a closure document. The closure document is provided to enable updating of data in the closure doc...
An apparatus and method provides automatic information integration flow optimization. The apparatus may include an input/output port connecting the information integration flow optimizer to extract-transform-load tools. The information integration flow optimizer includes a parser unit to create a tool-agnostic input file containing rich semantics,...
To remain competitive, enterprises are evolving in order to quickly respond to changing market conditions and customer needs. In this new environment, a single centralized data warehouse is no longer sufficient. Next generation business intelligence involves data flows that span multiple, diverse processing engines, that contain complex functionali...
As enterprises become more automated, real-time, and data-driven, they need to integrate new data sources and specialized processing engines. The traditional business intelligence architecture of Extract-Transform-Load (ETL) flows, followed by querying, reporting, and analytic operations, is being generalized to analytic data flows that utilize a v...
To remain competitive, enterprises are evolving their business intelligence systems to provide dynamic, near real-time views of business activities. To enable this, they deploy complex workflows of analytic data flows that access multiple storage repositories and execution engines and that span the enterprise and even outside the enterprise. We call...
Modern business intelligence systems integrate a variety of data sources using multiple data execution engines. A common example is the use of Hadoop to analyze unstructured text and merging the results with relational database queries over a data warehouse. These analytic data flows are generalizations of ETL flows. We refer to multi-engine data f...
MapReduce is a currently popular programming model to support parallel computations on large datasets. Among the several existing MapReduce implementations, Hadoop has attracted a lot of attention from both industry and research. In a Hadoop job, map and reduce tasks coordinate to produce a solution to the input problem, exhibiting precedence const...
Modern data analytic flows involve complex data computations that may span multiple execution engines and need to be optimized for a variety of objectives like performance, fault-tolerance, freshness, and so on. In this paper, we present optimization techniques and tradeoffs in terms of a real-world, cyber-physical flow that starts with raw time se...
Cloud computing has emerged as a promising environment capable of providing flexibility, scalability, elasticity, fail-over mechanisms, high availability, and other important features to applications. Compute clusters are relatively easy to create and use, but tools to effectively share cluster resources are lacking. CloudAlloc addresses this probl...
Next generation business intelligence involves data flows that span different execution engines, contain complex functionality like data/text analytics, machine learning operations, and need to be optimized against various objectives. Creating correct analytic data flows in such an environment is a challenging task and is both labor-intensive and t...
MapReduce is an important paradigm to support modern data-intensive applications. In this paper we address the challenge of modeling performance of one implementation of MapReduce called Hadoop Online Prototype (HOP), with a specific target on the intra-job pipeline parallelism. We use a hierarchical model that combines a precedence model and a que...
This paper addresses the challenge of optimizing analytic data flows for modern business intelligence (BI) applications. We first describe the changing nature of BI in today's enterprises as it has evolved from batch-based processes, in which the back-end extraction-transform-load (ETL) stage was separate from the front-end query and analytics stag...
The design and implementation of an ETL (extract-transform-load) process for a data warehouse proceeds from a conceptual model to a logical model, and then a physical model and implementation. The conceptual model conveys at a high level the data sources and targets, and the transformation steps from sources to targets. The current state of the art...
As Business Intelligence evolves from off-line strategic decision making to on-line operational decision making, the design of the back-end Extract-Transform-Load (ETL) processes is becoming even more complex. Many challenges arise in this new context, like their optimization and modeling. In this paper, we focus on the disconnection between the IT-...
As data warehousing technology gains a ubiquitous presence in business today, companies are becoming increasingly reliant upon the information contained in their data warehouses to inform their operational decisions. This information, known as business intelligence (BI), traditionally has taken the form of nightly or monthly reports and batched ana...
Extract-Transform-Load (ETL) processes play an important role in data warehousing. Typically, design work on ETL has focused on performance as the sole metric to make sure that the ETL process finishes within an allocated time window. However, other quality metrics are also important and need to be considered during ETL design. In this paper, we ad...
Ideally, a data warehouse would be able to run multiple types of queries concurrently, meeting different performance objectives for each type. However, due to the difficulty of managing mixed workloads, most commercial systems segregate distinct workload components by using strict resource partitioning and/or time multiplexing. This approach avoids...
Workload management for operational business intelligence (BI) databases is difficult. Queries vary widely in length and objectives. Resource contention is difficult to predict and to control as dynamically-arriving, long, analyst queries compete for resources with ongoing online-transaction processing (OLTP) queries and batch report queries. Curre...
We explore how to manage database workloads that contain a mixture of OLTP-like queries that run for milliseconds as well as business intelligence queries and maintenance tasks that last for hours. As data warehouses grow in size to petabytes and complex analytic queries play a greater role in day-to-day business operations, factor...
Business Intelligence (BI) refers to technologies, tools, and practices for collecting, integrating, analyzing, and presenting large volumes of information to enable better decision making. Today's BI architecture typically consists of a data warehouse (or one or more data marts), which consolidates data from several operational databases, and serv...
Business Intelligence query workloads that run against very large data warehouses contain queries whose execution times range, sometimes unpredictably, from seconds to hours. The presence of even a handful of long-running queries can significantly slow down a workload consisting of thousands of queries, creating havoc for queries that require a qui...
Business processes drive the operations of an enterprise. In the past, the focus was primarily on business process design, modeling, and automation. Recently, enterprises have realized that they can benefit tremendously from analyzing the behavior of their business processes with the objective of optimizing or improving them. In our research, we...
As business intelligence becomes increasingly essential for organizations and as it evolves from strategic to operational, the complexity of Extract-Transform-Load (ETL) processes grows. In consequence, ETL engagements have become very time consuming, labor intensive, and costly. At the same time, additional requirements besides functionali...
A common approach to providing persistent storage for RDF is to store statements in a three-column table in a relational database system. This is commonly referred to as a triple store. Each table row represents one RDF statement. For RDF graphs with frequent patterns, an alternative storage scheme is a property table. A property table comprises on...
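The two storage schemes in the abstract can be sketched concretely. Below is a minimal illustration in Python with SQLite: a three-column triple store, and a property table that pivots two frequent predicates into columns so a pattern over both becomes a single-row lookup instead of a self-join. The schema, predicates, and data are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Triple store: one row per RDF statement.
cur.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
cur.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:mbox", "alice@example.org"),
    ("ex:bob",   "foaf:name", "Bob"),
])

# Property table: one column per frequent predicate, one row per subject.
cur.execute("CREATE TABLE person (subject TEXT PRIMARY KEY, name TEXT, mbox TEXT)")
cur.execute("""
    INSERT INTO person
    SELECT s.subject, n.object, m.object
    FROM (SELECT DISTINCT subject FROM triples) AS s
    LEFT JOIN triples AS n ON n.subject = s.subject AND n.predicate = 'foaf:name'
    LEFT JOIN triples AS m ON m.subject = s.subject AND m.predicate = 'foaf:mbox'
""")

# The same pattern that needs a two-way self-join on triples is one lookup here.
row = cur.execute("SELECT name, mbox FROM person WHERE subject = 'ex:alice'").fetchone()
```

The trade-off, as the abstract notes, is that property tables only pay off for graphs with frequent, regular predicate patterns; irregular data leaves the wide table sparse.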
Enterprises commonly outsource all or part of their IT to vendors as a way to reduce the cost of IT, to accurately estimate what they spend on IT, and to improve its effectiveness. These contracts vary in complexity from the outsourcing of a world-wide IT function to smaller, country-specific, deals.
In this paper, we discuss techniques for supporting enterprise applications for business services. These applications can be quite complex, requiring support for regional variations in business logic, multi-step...
Sequential configuration is a fundamental pattern that occurs when integrating systems that span domains and levels of abstraction. This task involves not only the integration of heterogeneous autonomous information systems, but also the integration of processes and applications. Challenges include the lack of a common information model, the lack o...
The new Semantic Web recommendations for RDF, RDFS and OWL have, at their heart, the RDF graph. Jena2, a second-generation RDF toolkit, is similarly centered on the RDF graph. RDFS and OWL reasoning are seen as graph-to-graph transforms, producing graphs of virtual triples. Rich APIs are provided. The Model API includes support for other aspects of...
This paper describes the persistence subsystem of Jena2, which is intended to support large datasets: its features, the changes from Jena1, relevant details of the implementation, and performance tuning issues. Query optimization for RDF is identified as a promising area for future research.
In order to realize the vision of the Semantic Web, a semantic model for encoding content in the World Wide Web, efficient storage and retrieval of large RDF data sets is required. A common technique for storing RDF data (graphs) is to use a single relational database table, a triple store, for the graph.
To realize the vision of the Semantic Web, efficient storage and retrieval of large RDF data sets is required. A common technique for persisting RDF data (graphs) is to use a single relational database table, a triple store. But, we believe a single triple store cannot scale for large-scale applications. This paper describes storing and querying pe...
While RDF provides a powerful means to store knowledge, it can be cumbersome to represent and query collections of statements in context. To this end, we introduce a new higher-level object, the Snippet, to hold a fragment of RDF that is about a single subject and made within a particular context. Each snippet may be represented in a standard form...
In order to realize the vision of the Semantic Web, a semantic model for encoding content in the World Wide Web, efficient storage and retrieval of large RDF data sets is required. A common technique for storing RDF data (graphs) is to use a single relational database table, a triple store, for the graph. However, we believe a single triple store c...
Summary form only given. The authors focus on the performance of integrated specialized data managers. In particular, they focus on the customizations of parallel executions of data manager operators in a variety of computer configurations. This is done by specifying the glue that connects data manager operators in a way that is independent of the...
Iris is an object-oriented database management system being developed at Hewlett-Packard Laboratories [1], [3]. This videotape provides an overview of the Iris data model and a summary of our experiences in converting a computer-integrated manufacturing application to Iris. An abstract of the videotape follows. Iris is intended to meet the needs of...
The goals of the Iris database management system are to enhance database programmer productivity and to provide generalized database support for the integration of future applications. Iris is based on an object and function model. Iris objects are typed but unlike other object systems, they contain no state. Attribute values, relationships and beh...
We describe an architecture for a database system based on an object/function model. The architecture efficiently supports the evaluation of functional expressions. The goal of the architecture is to provide a database system that is powerful enough to support the definition of functions and procedures that implement the semantics of the data model...
The authors describe the extensible transaction management architecture of Papyrus and its motivation. In Papyrus, persistent data are encapsulated in data servers, which provide a collection of methods that clients may use to access and update the data. The Papyrus architecture allows data servers to specify their own concurrency-control algorithm...
JASMIN is a functionally distributed database system running on multiple microcomputers that communicate with each other by message passing. The software modules in JASMIN can be cloned and distributed across computer boundaries. One important module is the intelligent store, a page manager that includes transaction-management facilities. It provid...
The authors investigate the feasibility and efficiency of a parallel sort-merge algorithm by considering its implementation on the JASMIN prototype, a backend multiprocessor built around a fast packet bus. They describe the design and implementation of a parallel sort utility and present and analyze the results of measurements corresponding to a ra...
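The two-phase shape of such a parallel sort-merge (independent local sorts, then a merge of the sorted runs) can be sketched sequentially. In this minimal Python sketch, round-robin partitioning stands in for the distribution across processors and is an illustrative assumption:

```python
import heapq

def parallel_sort_merge(records, n_procs):
    """Phase 1: each 'processor' sorts its local partition.
    Phase 2: a single k-way merge pass combines the sorted runs."""
    runs = [sorted(records[i::n_procs]) for i in range(n_procs)]
    return list(heapq.merge(*runs))
```

On real hardware phase 1 runs concurrently, so its elapsed time shrinks with the number of processors while the final merge remains the sequential bottleneck, which is exactly the trade-off such measurements probe.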
Bubba is a parallel database machine under development at MCC. This paper describes KEV, Bubba’s Operating System kernel. After a brief overview of Bubba a set of requirements for KEV is presented. Our requirements are contrasted with the requirements for conventional, uniprocessor database systems as outlined by Gray, Stonebraker and others. The K...
The Jasmin database machine is being implemented as part of a research project in distributed processing and database management. A primary goal of the work is to demonstrate the feasibility of a practical multiprocessor database machine suitable for large database, high transaction-rate applications. Key features of Jasmin are its configurable perfo...
Algorithms for parallel processing of relational database operations are presented and analyzed in a general multiprocessor framework. To analyze alternative algorithms, the authors introduce an analysis methodology which incorporates I/O, CPU, and message costs and which can be adjusted to fit different multiprocessor architectures. Algorithms are pre...
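A toy version of such a cost model can make the methodology concrete. The sketch below charges I/O, CPU, and message costs separately for a hash-repartitioned parallel join; the formulas and parameter values are invented for illustration, not taken from the paper:

```python
def parallel_join_cost(tuples_r, tuples_s, nodes,
                       io_per_page, cpu_per_tuple, msg_per_tuple,
                       tuples_per_page=100):
    """Estimated elapsed cost of joining R and S on `nodes` processors:
    both relations are read, every tuple is sent once to its hash
    partition, and each node joins its local fragments."""
    total = tuples_r + tuples_s
    io  = total / tuples_per_page * io_per_page   # read both relations
    msg = total * msg_per_tuple                   # redistribute by join key
    cpu = total / nodes * cpu_per_tuple           # per-node build/probe work
    return io / nodes + msg + cpu                 # I/O and CPU scale with nodes
```

Tuning the three unit costs is what lets one model fit different architectures: a shared-nothing machine raises `msg_per_tuple`, while a shared-disk machine shifts weight onto `io_per_page`.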
This paper discusses a model for associative disk architectures. Simulation results of an event driven simulation based on this model are presented. The designs analyzed are Processor-per-Track (PPT), Processor-per-Bubble-cell (PPB), Processor-per-Head (PPH), and Processor-per-Disk (PPD). The effects of a number of factors, including output channel...
DIRECT is a multiprocessor database machine designed and implemented at the University of Wisconsin. This paper describes our experiences with the implementation of DIRECT. We start with a brief overview of the original machine proposal and how it differs from what was actually implemented. We then describe the structure of the DIRECT software. Thi...
A description is given of experiences with the implementation of DIRECT. A brief overview is presented of the original machine proposal and how it differs from what was actually implemented. A description is also given of the structure of the DIRECT software. This includes software on host computers that interfaces with the database machine; software...