M. Tamer Ozsu
I am a Professor of Computer Science at the David R. Cheriton School of Computer Science of the University of Waterloo. Previously, I was with the Department of Computing Science of the University of Alberta from 1984 to 2000.
My current research focuses on three areas: (a) Internet-scale data distribution, (b) multimedia data management, and (c) integration of information retrieval and database technologies mainly focusing on XML query processing and optimization.
I was the Director of the Cheriton School of Computer Science from January 2007 to June 2010. I also served as the Acting Chair of the Department of Computing Science at the University of Alberta during 1994-1995.
Research interests
-
Interests: Database Systems, Distributed Databases, Multimedia
Education
-
Jan 1980–Mar 1983: Ohio State University
Computer Science · PhD
United States of America (USA) · Columbus, OH
Other
-
Languages: English, Turkish
-
Scientific Memberships: Fellow, Association for Computing Machinery (ACM)
Fellow, Institute of Electrical and Electronics Engineers (IEEE)
Member, Sigma Xi
-
Other Interests: ACM Publications Board
Publications
-
Efficient Core Decomposition in Massive Networks
Proc. 27th Int. Conf. on Data Engineering; 01/2011
The k-core of a graph is the largest subgraph in which every vertex is connected to at least k other vertices within the subgraph. Core decomposition finds the k-core of the graph for every possible k. Past studies have shown important applications of core decomposition such as in the study of the properties of large networks (e.g., sustainability, connectivity, centrality, etc.), for solving NP-hard problems efficiently in real networks (e.g., maximum clique finding, densest subgraph approximation, etc.), and for large-scale network fingerprinting and visualization. The k-core is a well-accepted concept partly because there exists a simple and efficient algorithm for core decomposition, by recursively removing the lowest degree vertices and their incident edges. However, this algorithm requires random access to the graph and hence assumes the entire graph can be kept in main memory. Nevertheless, real-world networks such as online social networks have become exceedingly large in recent years and still keep growing at a steady rate. In this paper, we propose the first external-memory algorithm for core decomposition in massive graphs. When the memory is large enough to hold the graph, our algorithm achieves performance comparable to the in-memory algorithm. When the graph is too large to be kept in the memory, our algorithm requires only O(k_max) scans of the graph, where k_max is the largest core number of the graph. We demonstrate the efficiency of our algorithm on real networks with up to 52.9 million vertices and 1.65 billion edges.
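The simple in-memory algorithm mentioned in this abstract (repeatedly removing a lowest-degree vertex and its incident edges) can be sketched as follows. This is a minimal illustration only, not the external-memory algorithm the paper proposes; the function names `core_numbers` and `k_core` and the edge-list input format are assumptions made for the example, and the linear scan for the minimum-degree vertex is chosen for readability rather than speed.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {vertex: core number} for an undirected graph given as (u, v) pairs.

    Hypothetical helper: peels off a minimum-degree vertex at each step.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    degree = {v: len(neighbors) for v, neighbors in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Pick a vertex with the lowest degree in the remaining subgraph.
        v = min(remaining, key=degree.__getitem__)
        # The core number never decreases as peeling proceeds.
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        # Removing v also removes its incident edges.
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core


def k_core(edges, k):
    """Vertices of the k-core: those whose core number is at least k."""
    return {v for v, c in core_numbers(edges).items() if c >= k}
```

For a triangle a, b, c with an extra vertex d attached to c, `core_numbers` assigns core number 2 to a, b, and c and core number 1 to d. Because the sketch reads arbitrary adjacency sets at every step, it shows the random-access pattern that the abstract identifies as the obstacle for graphs that do not fit in main memory.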
-
Generating Efficient Execution Plans for Vertically Partitioned XML Databases
Proc. VLDB. 01/2010; 4(1):1-11.
Experience with relational systems has shown that distribution is an effective way of improving the scalability of query evaluation. In this paper, we show how distributed query evaluation can be performed in a vertically partitioned XML database system. We propose a novel technique for constructing distributed execution plans that is independent of local query evaluation strategies. We then present a number of optimizations that allow us to further improve the performance of distributed query execution. Finally, we present a response time-based cost model that allows us to pick the best execution plan for a given query and database instance. Based on an implementation of our techniques within a native XML database system, we verify that our execution plans take advantage of the parallelism in a distributed system and that our cost model is effective at identifying the most advantageous plans.
-
Popularity-aware Prefetch in P2P Range Caching
Peer-to-Peer Networking and Applications. 01/2010; 3(2):145-160.
Unstructured peer-to-peer infrastructure has been widely employed to support large-scale distributed applications. Many of these applications, such as location-based services and multimedia content distribution, require the support of range selection queries. Under the widely-adopted query shipping protocols, the cost of query processing is affected by the number of result copies or replicas in the system. Since range queries can return results that include poorly-replicated data items, the cost of these queries is usually dominated by the retrieval cost of these data items. In this work, we propose a popularity-aware prefetch-based approach that can effectively facilitate the caching of poorly-replicated data items that are potentially requested in subsequent range queries, resulting in substantial cost savings. We prove that the performance of retrieving poorly-replicated data items is guaranteed to improve under an increasing query load. Extensive experiments show that the overall range query processing cost decreases significantly under various query load settings.
-
A Framework for Testing DBMS Features
VLDB Journal. 01/2010; 19(2):203-230.
Testing a specific feature of a DBMS requires controlling the inputs and outputs of the operators in the query execution plan. However, that is practically difficult to achieve because the inputs/outputs of a query depend on the content of the test database. In this paper, we propose a framework to test DBMS features. The framework includes a database generator called QAGen so that the generated test databases are able to meet the test requirements defined on the test queries. The framework also includes a set of tools to automate test case constructions and test executions. A wide range of DBMS feature testing tasks can be facilitated by the proposed framework.
-
Dynamic Skyline Queries in Large Graphs
Proc. 15th Int. Conf. on Database Systems for Advanced Applications; 01/2010
Given a set of query points, a dynamic skyline query reports all data points that are not dominated by other data points according to the distances between data points and query points. In this paper, we study dynamic skyline queries in a large graph (DSG-query for short). Although dynamic skylines have been studied in Euclidean space [14], road network [5], and metric space [3,6], there is no previous work on dynamic skylines over large graphs. We employ a filter-and-refine framework to speed up the query processing that can answer DSG-query efficiently. We propose a novel pruning rule based on graph properties to derive the candidates for DSG-query, that are guaranteed not to introduce false negatives. In the refinement step, with a carefully-designed index structure, we compute short path distances between vertices in O(H), where H is the number of maximal hops between any two vertices. Extensive experiments demonstrate that our methods outperform existing algorithms by orders of magnitude.
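The dominance test behind a dynamic skyline can be illustrated with a small brute-force sketch. It assumes the usual convention that smaller distances are better, and it takes the distances from each data point to the query points as precomputed input; in the paper's setting these would be graph distances, and the filter-and-refine framework, pruning rule, and index structure are not reproduced here. The names `dominates` and `dynamic_skyline` are hypothetical.

```python
def dominates(a, b):
    """True if distance vector a dominates b.

    Assumed convention: a is no farther from any query point than b,
    and strictly closer to at least one.
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def dynamic_skyline(dist):
    """Return the data points that no other data point dominates.

    dist maps each data point to a tuple of its distances to the query
    points, in a fixed query-point order (precomputed by the caller).
    """
    return {
        p
        for p, dp in dist.items()
        if not any(dominates(dq, dp) for q, dq in dist.items() if q != p)
    }
```

With two query points and distance vectors p1 = (1, 4), p2 = (2, 2), p3 = (3, 3), p4 = (4, 1), the point p3 is dominated by p2, so the skyline is {p1, p2, p4}. The pairwise comparison is quadratic in the number of data points, which is the kind of cost a candidate-pruning step aims to avoid.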
-
A Partial Order Based Active Cache for Recommender Systems
Proc. 3rd ACM Conf. on Recommender Systems; 01/2009
Recommender systems aim to substantially reduce information overload by suggesting lists of similar items that users may find interesting. Caching has been a useful technique for reducing stress on limited resources and improving response time. In this paper, we propose an 'active caching' technique for recommender systems based on a partial order approach that not only benefits from popularity and temporal locality, but also exploits spatial locality. This approach allows the processing of answers to neighboring non-cached queries in addition to the reporting of cached query results. Test results for several data sets and recommendation techniques show substantial improvement in the cache hit ratio and computational costs, while achieving reasonable recall rates.
-
Efficient Method for Maximizing Bichromatic Reverse Nearest Neighbor
Proc. 35th Int. Conf. on Very Large Data Bases; 01/2009
Bichromatic reverse nearest neighbor (BRNN) has been extensively studied in spatial database literature. In this paper, we study a related problem called MaxBRNN: find an optimal region that maximizes the size of BRNNs. Such a problem has many real life applications, including the problem of finding a new server point that attracts as many customers as possible by proximity. A straightforward approach is to determine the BRNNs for all possible points, which is not feasible since there is a large (or infinite) number of possible points. To the best of our knowledge, the fastest known method has exponential time complexity on the data size. Based on some interesting properties of the problem, we come up with an efficient algorithm called MaxOverlap. Extensive experiments are conducted to show that our algorithm is many times faster than the best-known technique.
-
K-Automorphism: A General Framework For Privacy Preserving Network Publication
Proc. 35th Int. Conf. on Very Large Data Bases; 01/2009
The growing popularity of social networks has generated interesting data management and data mining problems. An important concern in the release of these data for study is their privacy, since social networks usually contain personal information. Simply removing all identifiable personal information (such as names and social security numbers) before releasing the data is insufficient. It is easy for an attacker to identify the target by performing different structural queries. In this paper we propose k-automorphism to protect against multiple structural attacks and develop an algorithm (called KM) that ensures k-automorphism. We also discuss an extension of KM to handle "dynamic" releases of the data. Extensive experiments show that the algorithm performs well in terms of the protection it provides.
-
DistanceJoin: Pattern Match Query In a Large Graph Database
Proc. 35th Int. Conf. on Very Large Data Bases; 01/2009
The growing popularity of graph databases has generated interesting data management problems, such as subgraph search, shortest-path query, reachability verification, and pattern match. Among these, a pattern match query is more flexible compared to a subgraph search and more informative compared to a shortest-path or reachability query. In this paper, we address pattern match problems over a large data graph G. Specifically, given a pattern graph (i.e., query Q), we want to find all matches (in G) that have connections similar to those in Q. In order to reduce the search space significantly, we first transform the vertices into points in a vector space via graph embedding techniques, converting a pattern match query into a distance-based multi-way join problem over the converted vector space. We also propose several pruning strategies and a join order selection method to perform join processing efficiently. Extensive experiments on both real and synthetic datasets show that our method outperforms existing ones by orders of magnitude.
-
Mining Data Streams with Periodically Changing Distributions
Proc. 18th ACM Int. Conf. on Information and Knowledge Management; 01/2009
Dynamic data streams are those whose underlying distribution changes over time. They occur in a number of application domains, and mining them is important for these applications. Coupled with the unboundedness and high arrival rates of data streams, the dynamism of the underlying distribution makes data mining challenging. In this paper, we focus on a large class of dynamic streams that exhibit periodicity in distribution changes. We propose a framework, called DMM, for mining this class of streams that includes a new change detection technique and a novel match-and-reuse approach. Once a distribution change is detected, we compare the new distribution with a set of historically observed distribution patterns and use the mining results from the past if a match is detected. Since, for two highly similar distributions, their mining results should also present high similarity, by matching and reusing existing mining results, the overall stream mining efficiency is improved while the accuracy is maintained. Our experimental results confirm this conjecture.
-
XCube: Processing XPath queries in a hypercube overlay network
Peer-to-Peer Networking and Applications. 01/2009; 2(2):128-145.
In this paper, we present the design and performance of XCube, a tag-based system for managing XML data in a hypercube overlay network. In XCube, each node in a d-dimensional hypercube is identified by a d-bit vector. A peer manages a smaller hypercube with dimension d' < d. An XML document is compactly represented as a structure summary and a content summary. The structure summary comprises a d-bit vector derived from the distinct tag names in the document and a synopsis capturing the structure of the document. The content summary consists of a bit map that summarizes the document content. The metadata of a document, i.e., owner IP, document identifier, structure summary and content summary, is indexed at its anchor peer (the peer that manages the node with matching bit vector). In addition, the structure summary is further indexed at all peers that manage nodes whose bit vectors are covered by the document's bit vector. An XPath query is processed in four phases. In phase 1, the query is routed to its anchor peer according to the bit vector of the query. In phase 2, the query is evaluated against all the synopses stored in its anchor peer and forwarded to the anchor peers of the matching synopses. In phase 3, the anchor peer of each related synopsis examines the query on the related bit maps and forwards the query to the related owner peers. Finally in phase 4, the owner peers evaluate the query on the XML documents and return answers to the querying peer. We also present a scheme that dynamically partitions the hypercube to balance the load across peers. We further exploit the partition history to remove redundant messages. We conduct a comprehensive experimental study and the results show the efficiency of XCube.
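The bit-vector idea in this abstract can be sketched in a few lines. The abstract says only that the d-bit vector is "derived from the distinct tag names"; hashing each tag to one bit position with CRC32, as done below, is an assumption made for illustration, and the names `tag_bit_vector` and `covers` are hypothetical. Synopses, content bit maps, routing, and hypercube partitioning are not modeled.

```python
import zlib


def tag_bit_vector(tags, d):
    """Map a collection of tag names to a d-bit vector, stored as an int.

    Assumption: each tag sets one bit, chosen by a CRC32 hash modulo d.
    The actual derivation used by XCube is not given in the abstract.
    """
    vec = 0
    for tag in tags:
        vec |= 1 << (zlib.crc32(tag.encode("utf-8")) % d)
    return vec


def covers(doc_vec, node_vec):
    """True if every bit set in node_vec is also set in doc_vec."""
    return (node_vec & doc_vec) == node_vec
```

Under this assumed hashing, a query whose tag names are a subset of a document's tag names always produces a bit vector covered by the document's vector, so `covers(tag_bit_vector(doc_tags, d), tag_bit_vector(query_tags, d))` holds for any d. This is consistent with the abstract's description of indexing a structure summary at every node whose bit vector the document's vector covers. Hash collisions can also make unrelated documents appear covered, which is one reason a later, more precise check against synopses and the documents themselves is still needed.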
-
Efficient Decision Tree Construction for Mining Time-Varying Data Streams
Proc. Conf. of the IBM Centre for Advanced Studies on Collaborative Research; 01/2009
-
Mining Frequent Itemsets in Time-Varying Data Streams
Proc. 18th ACM Int. Conf. on Information and Knowledge Management; 01/2009
Mining frequent itemsets in data streams is beneficial to many real-world applications but is also a challenging task since data streams are unbounded and have high arrival rates. Moreover, the distribution of data streams can change over time, which makes the task of maintaining frequent itemsets even harder. In this paper, we propose a false-negative oriented algorithm, called TWIM, that can find most of the frequent itemsets, detect distribution changes, and update the mining results accordingly. Experimental results show that our algorithm performs as well as other false-negative algorithms on data streams without distribution change, and has the ability to detect changes over time-varying data streams in real time with a high accuracy rate.
-
Creating Competitive Products
Proc. 35th Int. Conf. on Very Large Data Bases; 01/2009
The importance of dominance and skyline analysis has been well recognized in multi-criteria decision making applications. Most previous works study how to help customers find a set of "best" possible products from a pool of given products. In this paper, we identify an interesting problem, creating competitive products, which has not been studied before. Given a set of products in the existing market, we want to study how to create a set of "best" possible products such that the newly created products are not dominated by the products in the existing market. We refer to such products as competitive products. A straightforward solution is to generate a set of all possible products and check for dominance relationships. However, the whole set is quite large. In this paper, we propose a solution to generate a subset of this set effectively. An extensive performance study using both synthetic and real datasets is reported to verify its effectiveness and efficiency.
-
Multiple Materialized View Selection for XPath Query Rewriting
Proc. 24th Int. Conf. on Data Engineering; 01/2008
-
Potential-Driven Load Distribution for Distributed Data Stream Processing
Proc. 2nd Int. Workshop on Scalable Stream Processing Systems; 01/2008