Vijayshankar Raman
IBM · Centre for Advanced Studies, Rochester
About
77 Publications
18,820 Reads
4,823 Citations
Publications (77)
In a classic transactional distributed database management system (DBMS), write transactions invariably synchronize with a coordinator before final commitment. While enforcing serializability, this model has long been criticized for not satisfying the applications' availability requirements. When entering the era of Internet of Things (IoT), this p...
The rising demands of real-time analytics have emphasized the need for Hybrid Transactional and Analytical Processing (HTAP) systems, which can handle both fast transactions and analytics concurrently. Wildfire is such a large-scale HTAP system prototyped at IBM Research-Almaden, with many techniques developed in this project incorporated into the...
We demonstrate Hybrid Transactional and Analytics Processing (HTAP) on the Spark platform by the Wildfire prototype, which can ingest up to ~6 million inserts per second per node and simultaneously perform complex SQL analytics queries. Here, a simplified mobile application uses Wildfire to recommend advertising to mobile customers based upon their...
Although the DRAM for main memories of systems continues to grow exponentially according to Moore's Law and to become less expensive, we argue that memory hierarchies will always exist for many reasons, both economic and practical, and in particular due to concurrent users competing for working memory to perform joins and grouping. We present the i...
Embodiments relate to data compression using dictionary encoding. An aspect includes subdividing a table of uncompressed data into a first block and a second block of complete rows. Another aspect includes determining information about a frequency of occurrence of different values for each column of the first block. Another aspect includes selectin...
We present new hash tables for joins, and a hash join based on them, that consumes far less memory and is usually faster than recently published in-memory joins. Our hash join is not restricted to outer tables that fit wholly in memory. Key to this hash join is a new concise hash table (CHT), a linear probing hash table that has 100% fill factor, a...
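The concise hash table (CHT) idea, a linear-probing table whose occupied slots are tracked in a bitmap with the keys stored 100% densely by bitmap rank, can be sketched as follows. The exact layout (per-slot bitmap with prefix population counts, dense arrays ordered by slot index) is an illustrative assumption, not the published implementation.

```python
# Sketch of a concise hash table: a bitmap records which virtual
# linear-probing slots are occupied, and prefix popcounts map each
# occupied slot to a position in a 100%-full dense key/value array.
class ConciseHashTable:
    def __init__(self, pairs, load_factor=0.5):
        # Virtual slot space is oversized; only the bitmap pays for slack.
        self.nslots = max(1, int(len(pairs) / load_factor))
        self.bits = [0] * self.nslots
        placed = {}  # virtual slot -> (key, value)
        for key, val in pairs:
            s = hash(key) % self.nslots
            while self.bits[s]:              # linear probing
                s = (s + 1) % self.nslots
            self.bits[s] = 1
            placed[s] = (key, val)
        # Prefix population counts: rank[s] = set bits strictly before s.
        self.rank = [0] * (self.nslots + 1)
        for i, b in enumerate(self.bits):
            self.rank[i + 1] = self.rank[i] + b
        # Dense arrays at 100% fill factor, ordered by virtual slot index.
        self.keys = [placed[s][0] for s in sorted(placed)]
        self.vals = [placed[s][1] for s in sorted(placed)]

    def get(self, key):
        s = hash(key) % self.nslots
        while self.bits[s]:
            d = self.rank[s]                 # dense position of slot s
            if self.keys[d] == key:
                return self.vals[d]
            s = (s + 1) % self.nslots
        return None

cht = ConciseHashTable([(10, 'a'), (27, 'b'), (43, 'c'), (8, 'd')])
assert cht.get(27) == 'b' and cht.get(8) == 'd'
assert cht.get(99) is None
```

The bitmap costs one bit per virtual slot, so the memory overhead of empty probing slack is bits rather than full key-value entries.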
Embodiments of the present invention provide query processing for column stores by accumulating table record attributes during application of query plan operators on a table. The attributes and associated attribute values are compacted when said attribute values are to be consumed for an operation in the query plan, during the execution of the quer...
An approach is provided in which a processor receives a scan request to scan data included in a data table. The processor selects a column in the data table corresponding to the scan request and retrieves column data entries from the selected column. In addition, the processor identifies the width of the selected column and selects a scan algorithm...
Compression has historically been used to reduce the cost of storage, I/Os from that storage, and buffer pool utilization, at the expense of the CPU required to decompress data every time it is queried. However, significant additional CPU efficiencies can be achieved by deferring decompression as late in query processing as possible and performing...
A system is described for creating compact aggregation working areas for efficient grouping and aggregation using multi-core CPUs. The system implements operations including computing a running aggregate for a group within a business intelligence (BI) query, and identifying a location to store running aggregate information within an aggregation wor...
A query processing method intersects two or more unsorted lists based on a conjunction of predicates. Each list comprises a union of multiple sorted segments. The method performs lazy segment merging and an adaptive n-ary intersecting process. The lazy segment merging comprises starting with each list being a union of completely unmerged segments,...
In data centers today, servers are stationary and data flows on a hierarchical network of switches and routers. But such static server arrangements require very scalable networks, and many applications are bottlenecked by network bandwidth. In addition, server density is kept low to enable maintenance and upgrades, as well as to increase air flow....
According to one embodiment of the present invention, a method for dictionary encoding data without using three-valued logic is provided. According to one embodiment of the invention, a method includes encoding data in a database table using a dictionary, wherein the data includes values representing NULLs. A query having a predicate is received an...
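One plausible sketch of dictionary-encoding NULLs so that ordinary two-valued predicate evaluation stays correct is to reserve a code for NULL outside the range of real codes; the reserved code 0 and the order-preserving code assignment below are assumptions for illustration, not the patented scheme.

```python
# Sketch: reserve code 0 for NULL and assign order-preserving codes
# starting at 1, so a range predicate translated to a code range can
# never match a NULL row -- no three-valued logic needed.
NULL_CODE = 0

def build_dictionary(values):
    distinct = sorted({v for v in values if v is not None})
    return {v: i + 1 for i, v in enumerate(distinct)}

def encode(values, dictionary):
    return [NULL_CODE if v is None else dictionary[v] for v in values]

def range_scan(codes, dictionary, lo, hi):
    # Translate the value range to a code range; all real codes are >= 1,
    # so NULL rows (code 0) are excluded automatically.
    lo_code = min((c for v, c in dictionary.items() if v >= lo), default=1)
    hi_code = max((c for v, c in dictionary.items() if v <= hi), default=0)
    return [i for i, c in enumerate(codes) if lo_code <= c <= hi_code]

vals = ['b', None, 'a', 'c', None, 'b']
d = build_dictionary(vals)
codes = encode(vals, d)
assert range_scan(codes, d, 'a', 'b') == [0, 2, 5]   # NULL rows never match
```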
DB2 with BLU Acceleration deeply integrates innovative new techniques for defining and processing column-organized tables that speed read-mostly Business Intelligence queries by 10 to 50 times and improve compression by 3 to 10 times, compared to traditional row-organized tables, without the complexity of defining indexes or materialized views on t...
A cell-specific dictionary is applied adaptively to adequate cells, where the cell-specific dictionary subsequently optimizes the handling of frequency-partitioned multi-dimensional data. This includes improved data partitioning with super cells or adjusting resulting cells by sub-dividing very large cells and merging multiple small cells, both of...
According to one embodiment of the present invention, a method for the parallel computation of frequency histograms in joined tables is provided. The method includes reading data in a table row-by-row from a database system using a coordinator unit and distributing each read row to separate worker units. Each worker unit computes a partial frequenc...
A computer-implemented method for scan sharing across multiple cores in a business intelligence (BI) query. The method includes receiving a plurality of BI queries, storing a block of data in a first cache, scanning the block of data in the first cache against a first batch of queries on a first processor core, and scanning the block of data agains...
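The scan-sharing idea can be illustrated with a toy loop that evaluates a whole batch of query predicates against each cached block while it is resident, instead of rescanning the data once per query; the function names and data shapes here are illustrative assumptions.

```python
# Sketch of scan sharing: one pass over each block serves every query
# in the batch, amortizing the memory traffic of the scan.
def shared_scan(blocks, predicates):
    results = {qid: [] for qid in predicates}
    for block in blocks:                     # block is read once...
        for qid, pred in predicates.items(): # ...for all queries
            results[qid].extend(row for row in block if pred(row))
    return results

blocks = [[1, 4, 7], [10, 13, 16]]
out = shared_scan(blocks, {'q_even': lambda r: r % 2 == 0,
                           'q_big':  lambda r: r > 9})
assert out['q_even'] == [4, 10, 16]
assert out['q_big'] == [10, 13, 16]
```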
The Blink project's ambitious goals are to answer all Business Intelligence (BI) queries in mere seconds, regardless of the database size, with an extremely low total cost of ownership. It takes a very innovative and counter-intuitive approach to processing BI queries, one that exploits several disruptive hardware and software technology trends. Sp...
Writing parallel programs that can take advantage of non-dedicated processors is much more difficult than writing such programs for networks of dedicated processors. In a non-dedicated environment such programs must use autonomic techniques to respond to the unpredictable load fluctuations that prevail in the computational environment. In adaptive...
BLINK is a prototype of an in-memory query processor that heavily exploits the underlying CPU infrastructure. It is very sensitive to the processor's caches and instruction set. In this paper, we describe how to close two major functional gaps in BLINK, which arise from real-world workloads. The manipulation of the data maintained by BLINK re...
Computer architectures are increasingly based on multi-core CPUs and large memories. Memory bandwidth, which has not kept pace with the increasing number of cores, has become the primary processing bottleneck, replacing disk I/O as the limiting factor. To address this challenge, we provide novel algorithms for increasing the throughput of Busin...
Table scans have become more interesting recently due to greater use of ad-hoc queries and greater availability of multi-core, vector-enabled hardware. Table scan performance is limited by value representation, table layout, and processing techniques. In this paper we propose a new layout and processing technique for efficient one-pass predicate...
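Word-parallel predicate evaluation over packed codes, of the kind such layouts enable, can be sketched with a classic SWAR trick. The guard-bit layout below (codes occupy the low k-1 bits of each k-bit field, top bit kept zero) is an assumption chosen to keep the arithmetic exact, not necessarily the paper's exact format.

```python
# Sketch: test many packed field codes for equality in one word-wide
# arithmetic operation, instead of unpacking fields one at a time.
def packed_eq(word, code, k=8, nfields=8):
    # Each k-bit field holds a code in its low k-1 bits; the top (guard)
    # bit stays 0 so per-field arithmetic cannot carry across fields.
    lo = sum(1 << (k * i) for i in range(nfields))   # 0x0101..01 pattern
    hi = lo << (k - 1)                               # guard-bit mask
    mask = (1 << (k * nfields)) - 1
    v = word ^ (code * lo)       # fields become 0 exactly where equal
    # Adding 0x7F per field sets each guard bit iff the field is nonzero;
    # inverting and masking leaves a bit per matching field.
    return ~(v + (hi - lo)) & hi & mask

# Pack four 7-bit codes [3, 5, 3, 7] into one word, low field first.
word = 3 | (5 << 8) | (3 << 16) | (7 << 24)
hits = packed_eq(word, 3, k=8, nfields=4)
assert hits == 0x00800080    # guard bits of fields 0 and 2 fire
```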
A common technique for processing conjunctive queries is to first match each predicate separately using an index lookup, and then compute the intersection of the resulting row-id lists, via an AND-tree. The performance of this technique depends crucially on the order of lists in this tree: it is important to compute early the intersections that wi...
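The sensitivity to list order can be seen in a toy left-deep AND-tree. Intersecting the smallest (most selective) lists first, so the running result shrinks as quickly as possible, is only an illustrative heuristic here, not the paper's full ordering algorithm.

```python
# Sketch: left-deep AND-tree over sorted row-id lists, ordered so the
# smallest lists are intersected first.
def merge_intersect(a, b):
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def and_tree(rid_lists):
    ordered = sorted(rid_lists, key=len)   # most selective first
    result = ordered[0]
    for lst in ordered[1:]:
        result = merge_intersect(result, lst)
    return result

assert and_tree([[1, 3, 5, 7, 9], [3, 5, 9], [1, 3, 9, 11]]) == [3, 9]
```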
Query performance in current systems depends significantly on tuning: how well the query matches the available indexes, materialized views etc. Even in a well tuned system, there are always some queries that take much longer than others. This frustrates users who increasingly want consistent response times to ad hoc queries. We argue that query pro...
Storage architectures include more and more processing power to meet increasing requirements for reliability, manageability, and scalability. For example, an IBM storage server is equipped with 4 or 8 state-of-the-art processors and gigabytes of memory. This trend enables analyzing data locally inside a storage server. Processing data locally is appealin...
Adaptive query processing has been the subject of a great deal of recent work, particularly in emerging data management environments such as data integration and data streams. We provide an overview of the work in this area, identifying its common themes, laying out the space of query plans, and discussing open research problems. We discuss why ada...
Two trends are converging to make the CPU cost of a table scan a more important component of database performance. First, table scans are becoming a larger fraction of the query processing workload, and second, large memories and compression are making table scans CPU, rather than disk bandwidth, bound. Data warehouse systems have found that...
RID-List (row id list) intersection is a common strategy in query processing, used in star joins, column stores, and even search engines. To apply a conjunction of predicates on a table, a query processor does index lookups to form sorted RID-lists (or bitmap) of the rows matching each predicate, then intersects the RID-lists via an AND-tree, and f...
As the data management field has diversified to consider settings in which queries are increasingly complex, statistics are less available, or data is stored remotely, there has been an acknowledgment that the traditional optimize-then-execute paradigm is insufficient. This has led to a plethora of new techniques, generally placed under the common...
Adaptive Query Processing surveys the fundamental issues, techniques, costs, and benefits of adaptive query processing. It begins with a broad overview of the field, identifying the dimensions of adaptive techniques. It then looks at the spectrum of approaches available to adapt query execution at runtime - primarily in a non-streaming context. The...
Database Management Systems (DBMS) perform query plan selection by mathematically modeling the execution cost of candidate execution plans and choosing the cheapest query execution plan (QEP) according to that cost model. The cost model requires accurate estimates of the sizes of intermediate results of all steps in the QEP. Outdated or incomplete...
Peer-to-Peer systems have recently become a popular means to share resources. Effective search is a critical requirement in such systems, and a number of distributed search structures have been proposed in the literature. Most of these structures provide ...
We present a method to compress relations close to their entropy while still allowing efficient queries. Column values are encoded into variable length codes to exploit skew in their frequencies. The codes in each tuple are concatenated and the resulting tuplecodes are sorted and delta-coded to exploit the lack of ordering in a relation. Correlat...
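The sort-and-delta step of this scheme can be sketched as follows. The fixed-width per-column codes are a simplifying assumption (the abstract describes variable-length, frequency-based codes); what the sketch shows is why sorting the tuplecodes makes the deltas small.

```python
# Sketch: concatenate per-column codes into one integer "tuplecode",
# sort, and store deltas between consecutive codes. A relation is a
# multiset with no inherent row order, so sorting costs nothing
# semantically, and sorted neighbors are close, so deltas stay small.
def tuplecode(row, bits_per_col=8):
    code = 0
    for c in row:
        code = (code << bits_per_col) | c
    return code

def compress(rows):
    codes = sorted(tuplecode(r) for r in rows)
    return [codes[0]] + [b - a for a, b in zip(codes, codes[1:])]

def decompress(deltas):
    codes = [deltas[0]]
    for d in deltas[1:]:
        codes.append(codes[-1] + d)
    return codes

rows = [(2, 7), (1, 9), (2, 3)]
deltas = compress(rows)
assert decompress(deltas) == sorted(tuplecode(r) for r in rows)
assert all(d >= 0 for d in deltas[1:])   # sorted order keeps deltas small
```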
Federated queries are regular relational queries accessing data on one or more remote relational or non-relational data sources, possibly combining them with tables stored in the federated DBMS server. Their execution is typically divided between the federated server and the remote data sources. Outdated and incomplete statistics have a bigger...
Writing parallel programs that can take advantage of non-dedicated processors is much more difficult than writing such programs for networks of dedicated processors. In a non-dedicated environment such programs must use autonomic techniques to respond to the unpredictable load fluctuations that characterize the computational model. In the area of a...
Current federated systems deploy cost-based query optimization mechanisms; i.e., the optimizer selects a global query plan with the lowest cost to execute. Thus, cost functions influence what remote sources (i.e. equivalent data sources) to access and how federated queries are processed. In most federated systems, the underlying cost model is based...
If presented with inaccurate statistics, even the most sophisticated query optimizers make mistakes. They may wrongly estimate the output cardinality of a certain operation and thus make sub-optimal plan choices based on that cardinality. Maintaining accurate statistics is hard, both because each table may need a specifically parameterized set of s...
We present DITN, a new method of parallel querying based on dynamic outsourcing of join processing tasks to non-dedicated, heterogeneous computers. In DITN, partitioning is not the means of parallelism. Data layout decisions are taken outside the scope of the DBMS, and handled within the storage software; query processors see a "Data In The Network" imag...
A wide variety of applications require access to multiple heterogeneous, distributed data sources. By transparently integrating such diverse data sources, underlying differences in DBMSs, languages, and data models can be hidden and users can use a single data model and a single high-level query language to access the unified data through a global...
The use of inaccurate or outdated database statistics by the query optimizer in a relational DBMS often results in a poor choice of query execution plans and hence unacceptably long query processing times. Configuration and maintenance of these statistics has traditionally been a time-consuming manual operation, requiring that the database admin...
Progressive Optimization (POP) is a technique to make query plans robust, and minimize needs for database administrator (DBA) intervention by repeatedly re-optimizing a query during runtime if the cardinalities estimated during optimization prove to be significantly incorrect. POP works by carefully calculating validity ranges for each plan operato...
Progressive Optimization (POP) is a technique to make query plans robust, and minimize need for DBA intervention, by repeatedly re-optimizing a query during runtime if the cardinalities estimated during optimization prove to be significantly incorrect. POP works by carefully calculating validity ranges for each plan operator under which the overall...
We present a query architecture in which join operators are decomposed into their constituent data structures (State Modules, or SteMs), and dataflow among these SteMs is managed adaptively by an Eddy routing operator. Breaking the encapsulation of joins serves two purposes. First, it allows the Eddy to observe multiple physical operations embedded i...
In this paper we present our vision of an information infrastructure for grid computing, which is based on a service-oriented architecture. The infrastructure supports a virtualized view of the computing and data resources, is autonomic (driven by policies) in order to meet application goals for quality of service, and is compatible with the standa...
Virtually every commercial query optimizer chooses the best plan for a query using a cost model that relies heavily on accurate cardinality estimation. Cardinality estimation errors can occur due to the use of inaccurate statistics, invalid assumptions about attribute independence, parameter markers, and so on. Cardinality estimation errors may cau...
We present a query architecture in which join operators are decomposed into their constituent data structures (State Modules, or SteMs), and dataflow among these SteMs is managed adaptively by an eddy routing operator [R. Avnur et al., (2000)]. Breaking the encapsulation of joins serves two purposes. First, it allows the eddy to observe multiple ph...
SQL has emerged as an industry standard for querying relational database management systems, largely because a user need only specify what data is wanted, not the details of how to access that data. A query optimizer uses a mathematical model of query execution to determine automatically the best way to access and process any given SQL query. This...
Increasingly pervasive networks are leading towards a world where data is constantly in motion. In such a world, conventional techniques for query processing, which were developed under the assumption of a far more static and predictable computational environment, will not be sufficient. Instead, query processors based on adaptive dataflow will be...
We present a continuously adaptive, continuous query (CACQ) implementation based on the eddy query processing framework. We show that our design provides significant performance benefits over existing approaches to evaluating continuous queries, not only because of its adaptivity, but also because of the aggressive cross-query sharing of work and sp...
Traditional query processors generate full, accurate query results, either in batch or in pipelined fashion. We argue that this strict model is too rigid for exploratory queries over diverse and distributed data sources, such as sources on the Internet. Instead, we propose a looser model of querying in which a user submits a broad initial query out...
Cleaning data of errors in structure and content is important for data warehousing and integration. Current solutions for data cleaning involve many iterations of data "auditing" to find errors, and long-running transformations to fix them. Users need to endure long waits, and often write complex transformation scripts. We present Potter's Wheel, a...
We present a pipelining, dynamically tunable reorder operator for providing user control during long-running, data-intensive operations. Users can see partial results and accordingly direct the processing by specifying preferences for various data items; data of interest is prioritized for early processing. The reordering mechanism is efficient an...
The goal of the CONTROL project at Berkeley is to develop systems for interactive analysis of large data sets. We focus on systems that provide users with iteratively refining answers to requests and online control of processing, thereby tightening the loop in the data analysis process. This paper presents the database-centric subproject of CONTROL...
The World Wide Web is a huge, growing repository of information on a wide range of topics. It is also becoming important, commercially and sociologically, as a place of human interaction within different communities. In this paper we present an experimental study of the structure of the Web. We analyze link topologies of various communities, and pat...
As query engines are scaled and federated, they must cope with highly unpredictable and changeable environments. In the Telegraph project, we are attempting to architect and implement a continuously adaptive query engine suitable for global-area systems, massive parallelism, and sensor networks. To set the stage for our research, we present a surve...
Real world data often has discrepancies in structure and content. Traditional methods for "cleaning" the data involve many iterations of time-consuming "data quality" analysis to find discrepancies, and long-running transformations to fix them. This process requires users to endure long waits and often write complex transformation programs. We presen...
Data analysis is fundamentally an iterative process in which you issue a query, receive a response, formulate the next query based on the response, and repeat. You usually don't issue a single, perfectly chosen query and get the information you want from a database; indeed, the purpose of data analysis is to extract unknown information, and in most...
Interactive responses and natural, intuitive controls are important for data analysis. We are building ABC, a scalable spreadsheet for data analysis that combines exploration, grouping, and aggregation. We focus on interactive responses in place of long delays, and intuitive, directmanipulation operations in place of complex queries. ABC allows ana...
Recovery logic is one of the most complex parts of applications such as Database Management Systems (DBMSs), and is very hard to debug or maintain. We present a simple and light-weight ACID memory library for use by such applications, which want to access data in a persistent, atomic manner. It supports concurrency control and recovery in a unified...
We present a pipelining, dynamically user- controllable reorder operator, for use in data- intensive applications. Allowing the user to re- order the data delivery on the fly increases the interactivity in several contexts such as online ag- gregation and large-scale spreadsheets; it allows the user to control the processing of data by dy- namicall...
We discuss strategies for building locality preserving dictionaries (LPDs) in which all data items within a range lie together, within a space that is a small function of the number of items in the range. We describe an approach where the memory space is partitioned and items are placed in sorted order, with judiciously placed gaps between them, re...
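A minimal sketch of the layout described, items in sorted order with judiciously placed gaps, is below. Uniform gap spacing and the shift-to-nearest-gap insertion are simplifying assumptions; the abstract indicates the real scheme sizes and places gaps more carefully.

```python
# Sketch of a locality-preserving dictionary: items laid out in sorted
# order with evenly spaced empty slots, so any key range occupies a
# contiguous region about `gap` times its size, and a local insertion
# only shifts items as far as the nearest gap.
def build_lpd(items, gap=2):
    slots = []
    for item in sorted(items):
        slots.append(item)
        slots.extend([None] * (gap - 1))
    return slots

def insert(slots, item):
    # Find the first occupied slot with a larger value, back up over any
    # trailing gap, then shift the occupied run right into that gap.
    pos = 0
    while pos < len(slots) and (slots[pos] is None or slots[pos] < item):
        pos += 1
    while pos > 0 and slots[pos - 1] is None:
        pos -= 1
    carry = item
    while carry is not None:
        slots[pos], carry = carry, slots[pos]
        pos += 1
    return slots

slots = build_lpd([10, 30, 50])
insert(slots, 20)
insert(slots, 60)
dense = [s for s in slots if s is not None]
assert dense == [10, 20, 30, 50, 60]
```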
Optical networks based on wavelength division multiplexing (WDM) and wavelength routing are considered to be potential candidates for the next generation of wide area networks. One of the main issues in these networks is the development of efficient routing algorithms that require the minimum number of wavelengths. In this paper, we focus on the permu...
The CONTROL project at U.C. Berkeley has developed technologies to provide online behavior for data-intensive applications. Using new query processing algorithms, these technologies continuously improve estimates and confidence statistics. In addition, they react to user feedback, thereby giving the user control over the behavior of long-running op...
We propose RPL, a programming language where both code and data are modeled as tuples in relations, linked by relational operators. Each line of code (LOC) is represented as a tuple, and is defined to be side-effect-free. State changes occur only via tuple movement across relational operators. Modules in RPL run concurrently, but RPL tracks each...