M. Tamer Özsu

University of Waterloo | UWaterloo · David R. Cheriton School of Computer Science

PhD, Ohio State University

About

Publications: 465
Reads: 91,354
Citations: 17,267
Introduction
I am a Professor of Computer Science at the Cheriton School of Computer Science and Associate Dean of Research of the Faculty of Mathematics at the University of Waterloo. I was the Director of the Cheriton School of Computer Science from January 2007 to June 2010. My research is on data management following two threads: (1) application of database technology to non-traditional data types, and (2) distributed & parallel data management.
Additional affiliations
July 2000 - present
University of Waterloo
Position
  • Professor
July 1984 - May 2000
University of Alberta
Education
January 1980 - March 1983
The Ohio State University
Field of study
  • Computer Science

Publications (465)
Preprint
Connectivity queries, which check whether vertices belong to the same connected component, are fundamental in graph computations. Sliding window connectivity processes these queries over sliding windows, facilitating real-time streaming graph analytics. However, existing methods struggle with low-latency processing due to the significant overhead o...
Article
We study index-based processing for connectivity queries within sliding windows on streaming graphs. These queries, which determine whether two vertices belong to the same connected component, are fundamental operations in real-time graph data processing and demand high throughput and low latency. While indexing methods that leverage data structure...
Article
In designing a distributed RDF system, it is quite common to divide an RDF graph into subgraphs, called partitions, which are then distributed. Graph partitioning in general and RDF graph partitioning in particular are challenging problems. In this paper, we propose an RDF graph partitioning approach, called Minimum Motif-Cut (MMC for short) to max...
Preprint
We study index-based processing for connectivity queries within sliding windows on streaming graphs. These queries, which determine whether two vertices belong to the same connected component, are fundamental operations in real-time graph data processing and demand high throughput and low latency. While indexing methods that leverage data structure...
Article
While the success of early data-science applications is evident, the full impact of data science has yet to be realized.
Preprint
The proliferation of RDF datasets has resulted in studies focusing on optimizing SPARQL query processing. Most existing work focuses on basic graph patterns (BGPs) and ignores other vital operators in SPARQL, such as UNION and OPTIONAL. SPARQL queries with these operators, which we abbreviate as SPARQL-UO, pose serious query plan generation challen...
Preprint
There has been an increasing recognition of the value of data and of data-based decision making. As a consequence, the development of data science as a field of study has intensified in recent years. However, there is no systematic and comprehensive treatment and understanding of data science. This article describes a systematic and end-to-end fram...
Article
Bipartite graphs are rich data structures with prevalent applications and characteristic structural features. However, less is known about their growth patterns, particularly in streaming settings. Current works study the patterns of static or aggregated temporal graphs optimized for certain down-stream analytics or ignoring multipartite/non-stati...
Article
Memory disaggregation (MD) allows for scalable and elastic data center design by separating compute (CPU) from memory. With MD, compute and memory are no longer coupled into the same server box. Instead, they are connected to each other via ultra-fast networking such as RDMA. MD can bring many advantages, e.g., higher memory utilization, better ind...
Preprint
Differential computation (DC) is a highly general incremental computation/view maintenance technique that can maintain the output of an arbitrary and possibly recursive dataflow computation upon changes to its base inputs. As such, it is a promising technique for graph database management systems (GDBMS) that support continuous recursive queries ov...
Article
This column was established by Richard Snodgrass in 1998 and was continued by Ken Ross from 1999 to 2005. It celebrated one of the key aspects that makes us grow as a research community: the papers that influence us. At each issue, different members of the data management community wrote anecdotes about a paper that had a unique impact in their car...
Preprint
Memory disaggregation (MD) allows for scalable and elastic data center design by separating compute (CPU) from memory. With MD, compute and memory are no longer coupled into the same server box. Instead, they are connected to each other via ultra-fast networking such as RDMA. MD can bring many advantages, e.g., higher memory utilization, better ind...
Article
Finding from a big graph those subgraphs that satisfy certain conditions is useful in many applications such as community detection and subgraph matching. These problems have a high time complexity, but existing systems that attempt to scale them are all IO-bound in execution. We propose the first truly CPU-bound distributed framework called G-thin...
Article
We study the fundamental problem of butterfly (i.e., (2,2)-bicliques) counting in bipartite streaming graphs. Similar to triangles in unipartite graphs, enumerating butterflies is crucial in understanding the structure of bipartite graphs. This benefits many applications where studying the cohesion in graph-shaped data is of particular interest....
Article
Due to the inherent hardness of subgraph isomorphism, the performance is often a bottleneck in various real-world applications. We address this by designing an efficient subgraph isomorphism algorithm leveraging features of GPU architecture. Existing GPU-based solutions adopt a two-step output scheme, performing the same join twice in order to write...
Preprint
Bipartite graphs are rich data structures with prevalent applications and characteristic structural features. However, less is known about their growth patterns, particularly in streaming settings. Current works study the patterns of static or aggregated temporal graphs optimized for certain down-stream analytics or ignoring multipartite/non-stationary...
Article
Ensuring the success of big graph processing for the next decade and beyond.
Preprint
Efficient execution of SPARQL queries over large RDF datasets is a topic of considerable interest due to increased use of RDF to encode data. Most of this work has followed either relational or graph-based approaches. In this paper, we propose an alternative query engine, called gSmart, based on matrix algebra. This approach can potentially better...
Chapter
Until a decade ago, the database world was all SQL, distributed, sometimes replicated, and fully consistent. Then, web and cloud applications emerged that need to deal with complex big data, and NoSQL came in to address their requirements, trading consistency for scalability and availability. NewSQL has been the latest technology in the big data ma...
Preprint
We study the fundamental problem of butterfly (i.e., (2,2)-bicliques) counting in bipartite streaming graphs. Similar to triangles in unipartite graphs, enumerating butterflies is crucial in understanding the structure of bipartite graphs. This benefits many applications where studying the cohesion in graph-shaped data is of particular interest. E...
Preprint
In this paper, we study the problem of evaluating persistent queries over streaming graphs in a principled fashion. These queries need to be evaluated over unbounded and very high speed graph streams. We define a streaming graph model and a streaming graph query model incorporating navigational queries, subgraph queries and paths as first-class cit...
Preprint
Graphs are by nature unifying abstractions that can leverage interconnectedness to represent, explore, predict, and explain real- and digital-world phenomena. Although real users and consumers of graph instances and graph workloads understand these abstractions, future problems will require new abstractions and systems. What needs to happen in the...
Article
Given a user-specified minimum degree threshold γ, a γ-quasi-clique is a subgraph g = (V_g, E_g) where each vertex v ∈ V_g connects to at least a γ fraction of the other vertices (i.e., ⌈γ · (|V_g| − 1)⌉ vertices) in g. Quasi-clique is one of the most natural definitions for dense structures useful in finding communities in social networks and d...
Article
The growing popularity of dynamic applications such as social networks provides a promising way to detect valuable information in real time. These applications create high-speed data that can be easily modeled as a streaming graph. Efficient analysis over these data is of great significance. In this paper, we study the subgraph (isomorphism) search o...
Chapter
The figures included in the original version of this book have been replaced. The figures have been updated throughout the book in this version.
Article
Graph processing is becoming increasingly prevalent across many application domains. In spite of this prevalence, there is little research about how graphs are actually used in practice. We conducted an online survey aimed at understanding: (i) the types of graphs users have; (ii) the graph computations users run; (iii) the types of graph software...
Preprint
Given a user-specified minimum degree threshold γ, a γ-quasi-clique is a subgraph where each vertex connects to at least a γ fraction of the other vertices. Mining maximal quasi-cliques is notoriously expensive with the state-of-the-art algorithm scaling only to small graphs with thousands of vertices. This has hampered its popul...
Article
Multi-dimensional, large-scale, and sparse data, which can be neatly represented by sparse tensors, are increasingly used in various applications such as data analysis and machine learning. A high-performance sparse tensor-vector product (SpTV), one of the most fundamental operations of processing sparse tensors, is necessary for improving efficien...
Preprint
We study persistent query evaluation over streaming graphs, which is becoming increasingly important. We focus on navigational queries that determine if there exists a path between two entities that satisfies a user-specified constraint. We adopt the Regular Path Query (RPQ) model that specifies navigational patterns with labeled constraints. We pr...
Chapter
A typical database design is a process which starts from a set of requirements and results in the definition of a schema that defines the set of relations. The distribution design starts from this global conceptual schema (GCS) and follows two tasks: partitioning (fragmentation) and allocation.
Chapter
An important requirement of a DBMS is the ability to support data control, i.e., controlling how data is accessed using a high-level language. Data control typically includes view management, access control, and semantic integrity control.
Chapter
By hiding the low-level details about the physical organization of the data, relational database languages allow the expression of complex queries in a concise and simple manner. In particular, to construct the answer to the query, the user does not precisely specify the procedure to follow; this procedure is actually devised by a module, called a...
Chapter
The past decade has seen an explosion of “data-intensive” or “data-centric” applications where the analysis of large volumes of heterogeneous data is the basis of solving problems.
Chapter
Up to this point, we considered distributed DBMSs that are designed in a top-down fashion. In particular, Chap. 2 focuses on techniques for partitioning and allocating a database, while Chap. 4 focuses on distributed query processing over such a database.
Chapter
Many data-intensive applications require support for very large databases (e.g., hundreds of terabytes or exabytes). Supporting very large databases efficiently for either OLTP or OLAP can be addressed by combining parallel computing and distributed database management.
Chapter
The World Wide Web (“WWW” or “web” for short) has become a major repository of data and documents. Although measurements differ and change, the web has grown at a phenomenal rate. Besides its size, the web is very dynamic and changes rapidly. For all practical purposes, the web represents a very large, dynamic, and distributed data store and there...
Chapter
The concept of a transaction is used in database systems as a basic unit of consistent and reliable computing. Thus, queries are executed as transactions once their execution strategies are determined and they are translated into primitive database operations. Transactions ensure that database consistency and durability are maintained when concurre...
Chapter
As we discussed in previous chapters, distributed databases are typically replicated.
Chapter
In this chapter, we discuss the data management issues in the “modern” peer-to-peer (P2P) data management systems. We intentionally use the phrase “modern” to differentiate these from the early P2P systems that were common prior to client/server computing.
Article
The fourth edition of this classic textbook provides major updates. This edition has completely new chapters on Big Data Platforms (distributed storage systems, MapReduce, Spark, data stream processing, graph analytics) and on NoSQL, NewSQL and polystore systems. It also includes an updated web data management chapter that includes RDF and semantic...
Article
This paper revisits the classical problem of multiple query optimization in the context of federated RDF systems. We propose a heuristic query rewriting-based approach to optimize the evaluation of multiple queries. This approach can take full advantage of SPARQL 1.1 to share the common computation of multiple queries while considering the cost of...
Conference Paper
We report a systematic performance study of streaming graph partitioning algorithms. Graph partitioning plays a crucial role in overall system performance as it has a significant impact on both load balancing and inter-machine communication. The streaming model for graph partitioning has recently gained attention due to its ability to scale to very...
Preprint
Subgraph isomorphism is a well-known NP-hard problem that is widely used in many applications, such as social network analysis and query over the knowledge graph. Due to the inherent hardness, its performance is often a bottleneck in various real-world applications. Therefore, we address this by designing an efficient subgraph isomorphism algorithm...
Article
The Resource Description Framework (RDF) is a W3C standard for representing graph-structured data, and SPARQL is the standard query language for RDF. Recent advances in Information Extraction, Linked Data Management and the Semantic Web have led to a rapid increase in both the volume and the variety of RDF data that are publicly available. As busin...
Conference Paper
Many computationally expensive problems are solved by a divide-and-conquer algorithm: a problem over a big dataset can be recursively divided into independent tasks over smaller subsets of the dataset. We present a distributed general-purpose framework called T-thinker which effectively utilizes the CPU cores in a cluster by properly decomposing an...
Article
Multi-relation graphs intuitively capture the heterogeneous correlations among real-world entities by allowing multiple types of relationships to be represented as entity-connecting edges, i.e., two entities could be correlated with more than one type of relationship. This is important in various applications such as social network analysis, ecolog...
Article
In this article, the authors provide their views on whether organizations should scale up or scale out their graph computations. This question was explored in a previous installment of this column by Jimmy Lin, where he made a case for scale-up through several examples. In response, the authors discuss three cases for scale-out.
Preprint
This paper evaluates eight parallel graph processing systems: Hadoop, HaLoop, Vertica, Giraph, GraphLab (PowerGraph), Blogel, Flink Gelly, and GraphX (SPARK) over four very large datasets (Twitter, World Road Network, UK 200705, and ClueWeb) using four workloads (PageRank, WCC, SSSP and K-hop). The main objective is to perform an independent scale-...
Conference Paper
The performance of modern distributed stream processing systems is largely dependent on balanced distribution of the workload across the cluster. Input streams with large, skewed domains pose challenges to these systems, especially for stateful applications. Key splitting, where the state of a single key is partially maintained across multiple workers, is...
Conference Paper
We present Stream WatDiv, an open-source benchmark for streaming RDF data management systems. The proposed benchmark extends the existing WatDiv benchmark, and includes a streaming data generator, a query generator that can produce a diverse set of SPARQL queries, and a testbed to monitor correctness and latency. We use Stream WatDiv to evaluate...
Article
This paper evaluates eight parallel graph processing systems: Hadoop, HaLoop, Vertica, Giraph, GraphLab (PowerGraph), Blogel, Flink Gelly, and GraphX (SPARK) over four very large datasets (Twitter, World Road Network, UK 200705, and ClueWeb) using four workloads (PageRank, WCC, SSSP and K-hop). The main objective is to perform an independent scale-...
Article
The growing popularity of dynamic applications such as social networks provides a promising way to detect valuable information in real time. Efficient analysis over high-speed data from dynamic applications is of great significance. Data from these dynamic applications can be easily modeled as a streaming graph. In this paper, we study the subgraph (...
Article
Graph processing is becoming increasingly prevalent across many application domains. In spite of this prevalence, there is little research about how graphs are actually used in practice. We conducted an online survey aimed at understanding: (i) the types of graphs users have; (ii) the graph computations users run; (iii) the types of graph software...
Article
This paper proposes a general system for compute-intensive graph mining tasks that find from a big graph all subgraphs that satisfy certain requirements (e.g., graph matching and community detection). Due to the broad range of applications of such tasks, many single-threaded algorithms have been proposed. However, graphs such as online social netwo...
Article
We present ViewDF: a flexible and declarative framework for incremental maintenance of materialized views (i.e., results of continuous queries) over streaming data. The main component of the proposed framework is the View Delta Function (ViewDF), which declaratively specifies how to update a materialized view when a new batch of data arrives. We de...
Conference Paper
With the advent of online social networks, there is an increasing demand for storage and processing of graph-structured data. Social networking applications pose new challenges to data management systems due to demand for real-time querying and manipulation of the graph structure. Recently, several specialized systems for graph-structured d...
Article
The increasing size of RDF data requires efficient systems to store and query them. There have been efforts to map RDF data to a relational representation, and a number of systems exist that follow this approach. We have been investigating an alternative approach of maintaining the native graph model to represent RDF data, and utilizing graph datab...
Conference Paper
The traversal-based approach to executing queries over Linked Data on the WWW fetches data by traversing data links and, thus, is able to make use of up-to-date data from initially unknown data sources. While the downside of this approach is the delay before the query engine completes a query execution, user-perceived response time may be improved si...
Conference Paper
This panel critically examines the state of data: how its growth and ubiquity have confronted the computer science community, and particularly the database community, with new challenges. These challenges require practitioners and teachers to learn new skills and engage with other disciplines in ways they had not done before. Panelists will examine the impact...
Conference Paper
Web data management has been a topic of interest for many years, during which a number of different modelling approaches have been tried. The latest of these approaches is to use RDF (Resource Description Framework), which seems to provide a real opportunity for querying at least some of the web data systematically. RDF has been proposed by the World W...
Technical Report
The emergence of Linked Data on the WWW has spawned research interest in an online execution of declarative queries over this data. A particularly interesting approach is traversal-based query execution which fetches data by traversing data links and, thus, is able to make use of up-to-date data from initially unknown data sources. The downside of...
Article
We propose techniques for processing SPARQL queries over a large RDF graph in a distributed environment. We adopt a “partial evaluation and assembly” framework. Answering a SPARQL query Q is equivalent to finding subgraph matches of the query graph Q over RDF graph G. Based on properties of subgraph matching over a distributed graph, we introduce l...
Article
Pioneered by Google's Pregel, many distributed systems have been developed for large-scale graph analytics. These systems expose the user-friendly "think like a vertex" programming interface to users, and exhibit good horizontal scalability. However, these systems are designed for tasks where the majority of graph vertices participate in computatio...
Article
RDF is increasingly being used to encode data for the semantic web and data exchange. There have been a large number of works that address RDF data management following different approaches. In this paper we provide an overview of these works. This review considers centralized solutions (what are referred to as warehousing approaches), distributed...
Conference Paper
Recent advances in Linked Data Management and the Semantic Web have led to a rapid increase in both the quantity as well as the variety of Web applications that rely on the SPARQL interface to query RDF data. Thus, RDF data management systems are increasingly exposed to workloads that are far more diverse and dynamic than what these systems were de...
Conference Paper
In this demonstration, we present the gStore RDF triple store. gStore is based on graph encoding and subgraph match, distinct from many other systems. More importantly, it can handle, in a uniform manner, different data types (strings and numerical data) and SPARQL queries with wildcards, aggregate, range and top-k operators over dynamic RDF datase...
Article
Software models formalize the requirements, structure and behavior of a system or application. They represent essential artifacts that simplify the process of software development. Software repositories have been developed to store models in order to facilitate the reuse of know-how from software projects; however, methods for searching these model...
Article
We propose techniques for processing SPARQL queries over linked data. We follow a graph-based approach where answering a query Q is equivalent to finding its matches over a distributed RDF data graph G. We adopt a "partial evaluation and assembly" framework. Partial evaluation results of query Q over each repository (called local partial matches) are f...
Conference Paper
The Resource Description Framework (RDF) is a standard for conceptually describing data on the Web, and SPARQL is the query language for RDF. As RDF data continue to be published across heterogeneous domains and integrated at Web-scale such as in the Linked Open Data (LOD) cloud, RDF data management systems are being exposed to queries that are far...
Article
We address efficient processing of SPARQL queries over RDF datasets. The proposed techniques, incorporated into the gStore system, handle, in a uniform and scalable manner, SPARQL queries with wildcards and aggregate operators over dynamic RDF datasets. Our approach is graph based. We store RDF data as a large graph and also represent a SPARQL quer...
Article
The introduction of Google's Pregel generated much interest in the field of large-scale graph data processing, inspiring the development of Pregel-like systems such as Apache Giraph, GPS, Mizan, and GraphLab, all of which have appeared in the past two years. To gain an understanding of how Pregel-like systems perform, we conduct a study to expe...
Article
Lazy replication with snapshot isolation (SI) has emerged as a popular choice for distributed databases. However, lazy replication often requires execution of update transactions at one (master) site so that it is relatively easy for a total SI order to be determined for consistent installation of updates in the lazily replicated system. We propose...
Article
The Resource Description Framework (RDF) is a standard for conceptually describing data on the Web, and SPARQL is the query language for RDF. As RDF is becoming widely utilized, RDF data management systems are being exposed to more diverse and dynamic workloads. Existing systems are workload-oblivious, and are therefore unable to provide consistent...
Conference Paper
Traversal-based approaches to execute queries over data on the Web have recently been studied. These approaches make use of up-to-date data from initially unknown data sources and, thus, enable applications to tap the full potential of the Web. While existing work focuses primarily on implementation techniques, a principled analysis of subwebs that...
Conference Paper
The publication of Linked Open Data on the Web has gained tremendous momentum over the last six years. As a consequence, we currently witness the emergence of a new research area that focuses on an online execution of Linked Data queries; i.e., declarative queries that range over Web data that is made available using the Linked Data publishing prin...
Conference Paper
It is widely recognized that OLTP and OLAP queries have different data access patterns, processing needs and requirements. Hence, the OLTP queries and OLAP queries are typically handled by two different systems, and the data are periodically extracted from the OLTP system, transformed and loaded into the OLAP system for data analysis. With the awar...
Article
Lazy replication with snapshot isolation (SI) has emerged as a popular choice for distributed databases. However, lazy replication often requires execution of update transactions at one (master) site so that it is relatively easy for a total SI order to be determined for consistent installation of updates in the lazily replicated system. We propose...
Article
The Resource Description Framework (RDF) is a standard for conceptually describing data on the Web, and SPARQL is the query language for RDF. As RDF is becoming widely utilized, RDF data management systems are being exposed to more diverse and dynamic workloads. Existing systems are workload-oblivious, and are therefore unable to provide consistent...
Article
Existing main-memory hash join algorithms for multi-core can be classified into two camps. Hardware-oblivious hash join variants do not depend on hardware-specific parameters. Rather, they consider qualitative characteristics of modern hardware and are expected to achieve good performance on any technologically similar platform. The assumption behi...