Jignesh M. Patel

University of Wisconsin–Madison, Madison, Wisconsin, United States

Publications (102) · 61.03 Total Impact

  • Spyros Blanas, Jignesh M. Patel
    ABSTRACT: High-performance analytical data processing systems often run on servers with large amounts of main memory. A common operation in such environments is combining data from two or more sources using some "join" algorithm. The focus of this paper is on studying hash-based and sort-based equi-join algorithms when the data sets being joined fully reside in main memory. We only consider a single node setting, which is an important building block for larger high-performance distributed data processing systems. A critical contribution of this work is in pointing out that in addition to query response time, one must also consider the memory footprint of each join algorithm, as it impacts the number of concurrent queries that can be serviced. Memory footprint becomes an important deployment consideration when running analytical data processing services on hardware that is shared by other concurrent services. We also consider the impact of particular physical properties of the input and the output of each join algorithm. This information is essential for optimizing complex query pipelines with multiple joins. Our key contribution is in characterizing the properties of hash-based and sort-based equi-join algorithms, thereby allowing system implementers and query optimizers to make a more informed choice about which join algorithm to use.
    Proceedings of the 4th annual Symposium on Cloud Computing; 10/2013
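    A minimal sketch of the kind of in-memory hash equi-join the paper analyzes (illustrative only, not the paper's implementation): the build input is loaded into a hash table, whose size largely determines the join's memory footprint, while the probe input streams past it.
      from collections import defaultdict

      def hash_join(build, probe, build_key, probe_key):
          """In-memory hash equi-join returning matching (build_row, probe_row) pairs.

          The hash table holds the entire build input, so the memory footprint of
          the operator is roughly proportional to the size of the build relation.
          """
          table = defaultdict(list)
          for row in build:                      # build phase
              table[row[build_key]].append(row)
          out = []
          for row in probe:                      # probe phase streams with no extra memory
              for match in table.get(row[probe_key], []):
                  out.append((match, row))
          return out

      # hypothetical toy relations
      customers = [{"cust": 10, "name": "a"}, {"cust": 11, "name": "b"}]
      orders = [{"order_id": 1, "cust": 10}, {"order_id": 2, "cust": 11}]
      print(hash_join(customers, orders, "cust", "cust"))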
  • Yinan Li, Jignesh M. Patel
    ABSTRACT: This paper focuses on running scans in a main memory data processing system at "bare metal" speed. Essentially, this means that the system must aim to process data at or near the speed of the processor (the fastest component in most system configurations). Scans are common in main memory data processing environments, and with the state-of-the-art techniques it still takes many cycles per input tuple to apply simple predicates on a single column of a table. In this paper, we propose a technique called BitWeaving that exploits the parallelism available at the bit level in modern processors. BitWeaving operates on multiple bits of data in a single cycle, processing bits from different columns in each cycle. Thus, bits from a batch of tuples are processed in each cycle, allowing BitWeaving to drop the cycles per column to below one in some cases. BitWeaving comes in two flavors: BitWeaving/V, which looks like a columnar organization but at the bit level, and BitWeaving/H, which packs bits horizontally. In this paper we also develop the arithmetic framework that is needed to evaluate predicates using these BitWeaving organizations. Our experimental results show that both these methods produce significant performance benefits over the existing state-of-the-art methods, and in some cases yield more than an order-of-magnitude performance improvement.
    Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data; 06/2013
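    To make the bit-level parallelism concrete, here is a word-level (SIMD-within-a-register) sketch in the spirit of BitWeaving/H, not the authors' code: several narrow column codes are packed into one word with a spare delimiter bit each, and a "less than" predicate is evaluated on all of them with a few word-wide operations. The 7-bit code width and the sample values are assumptions for the example.
      CODE_BITS = 7            # width of each column code (assumed for the example)
      FIELD = CODE_BITS + 1    # one extra delimiter bit per code
      PER_WORD = 64 // FIELD   # codes packed per 64-bit word

      DELIM = 0
      for i in range(PER_WORD):
          DELIM |= 1 << (i * FIELD + CODE_BITS)   # delimiter bit of every field

      def pack(codes):
          """Pack PER_WORD codes (each < 2**CODE_BITS) into one integer word."""
          word = 0
          for i, code in enumerate(codes):
              word |= code << (i * FIELD)
          return word

      def less_than(data_word, const):
          """Delimiter bit i of the result is set iff code i < const."""
          const_word = pack([const] * PER_WORD)
          t = (data_word | DELIM) - const_word    # per field: code + 2**CODE_BITS - const
          return (~t) & DELIM                     # delimiter survives only where code < const

      codes = [5, 90, 12, 77, 3, 100, 42, 64]
      mask = less_than(pack(codes), 50)
      print([i for i in range(PER_WORD) if (mask >> (i * FIELD + CODE_BITS)) & 1])   # [0, 2, 4, 6]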
  • ABSTRACT: Data storage devices are getting "smarter." Smart flash storage devices (a.k.a. "Smart SSDs") are on the horizon; they package CPU processing and DRAM storage inside the device and make these resources available for running user programs. The focus of this paper is on exploring the opportunities and challenges associated with exploiting this functionality of Smart SSDs for relational analytic query processing. We have implemented an initial prototype of Microsoft SQL Server running on a Samsung Smart SSD. Our results demonstrate that significant performance and energy gains can be achieved by pushing selected query processing components inside the Smart SSDs. We also identify various changes that SSD device manufacturers can make to increase the benefits of using Smart SSDs for data processing applications, and suggest possible research opportunities for the database community.
    Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data; 06/2013
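    A back-of-the-envelope sketch of why operator pushdown into the device can pay off (the sizes and selectivity below are assumptions, not figures from the paper): with a selective filter evaluated inside the SSD, far fewer bytes cross the host interface.
      def bytes_moved(table_bytes, selectivity, pushdown):
          """Bytes shipped from the SSD to the host for a filtered scan (toy model)."""
          return table_bytes * (selectivity if pushdown else 1.0)

      table = 100 * 2**30   # 100 GiB table (assumption)
      sel = 0.01            # 1% of rows qualify (assumption)
      print(bytes_moved(table, sel, pushdown=False) / 2**30, "GiB moved without pushdown")
      print(bytes_moved(table, sel, pushdown=True) / 2**30, "GiB moved with pushdown")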
  • ABSTRACT: Murine models are valuable instruments in defining the pathogenesis of diabetic nephropathy (DN), but they only partially recapitulate disease manifestations of human DN, limiting their utility. To define the molecular similarities and differences between human and murine DN, we performed a cross-species comparison of glomerular transcriptional networks. Glomerular gene expression was profiled in patients with early type 2 DN and in three mouse models (streptozotocin DBA/2, C57BLKS db/db, and eNOS-deficient C57BLKS db/db mice). Species-specific transcriptional networks were generated and compared with a novel network-matching algorithm. Three shared human-mouse cross-species glomerular transcriptional networks containing 143 (Human-DBA STZ), 97 (Human-BKS db/db), and 162 (Human-BKS eNOS(-/-) db/db) gene nodes were generated. Shared nodes across all networks reflected established pathogenic mechanisms of diabetes complications, such as elements of Janus kinase (JAK)/signal transducer and activator of transcription (STAT) and vascular endothelial growth factor receptor (VEGFR) signaling pathways. In addition, novel pathways not previously associated with DN and cross-species gene nodes and pathways unique to each of the human-mouse networks were discovered. The human-mouse shared glomerular transcriptional networks will assist DN researchers in selecting mouse models most relevant to the human disease process of interest. Moreover, they will allow identification of new pathways shared between mice and humans.
    Diabetes 11/2012; · 7.90 Impact Factor
  • ABSTRACT: In this new era of "big data", traditional DBMSs are under attack from two sides. At one end of the spectrum, the use of document store NoSQL systems (e.g. MongoDB) threatens to move modern Web 2.0 applications away from traditional RDBMSs. At the other end of the spectrum, big data DSS analytics that used to be the domain of parallel RDBMSs is now under attack by another class of NoSQL data analytics systems, such as Hive on Hadoop. So, are the traditional RDBMSs, aka "big elephants", doomed as they are challenged from both ends of this "big data" spectrum? In this paper, we compare one representative NoSQL system from each end of this spectrum with SQL Server, and analyze the performance and scalability aspects of each of these approaches (NoSQL vs. SQL) on two workloads (decision support analysis and interactive data-serving) that represent the two ends of the application spectrum. We present insights from this evaluation and speculate on potential trends for the future.
    08/2012;
  • ABSTRACT: Energy is a growing component of the operational cost for many "big data" deployments, and hence has become increasingly important for practitioners of large-scale data analysis who require scale-out clusters or parallel DBMS appliances. Although a number of recent studies have investigated the energy efficiency of DBMSs, none of these studies have looked at the architectural design space of energy-efficient parallel DBMS clusters. There are many challenges to increasing the energy efficiency of a DBMS cluster, including dealing with the inherent scaling inefficiency of parallel data processing, and choosing the appropriate energy-efficient hardware. In this paper, we experimentally examine and analyze a number of key parameters related to these challenges for designing energy-efficient database clusters. We explore the cluster design space using empirical results and propose a model that considers the key bottlenecks to energy efficiency in a parallel DBMS. This paper represents a key first step in designing energy-efficient database clusters, which is increasingly important given the trend toward parallel database appliances.
    08/2012;
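    A toy illustration of the scaling inefficiency mentioned above (an Amdahl-style model with assumed numbers, not the paper's model): if speedup is sublinear while cluster power grows linearly with node count, energy per query can rise even as response time falls.
      def energy_per_query(nodes, base_runtime_s, node_power_w, parallel_fraction=0.95):
          """Joules per query under a simple Amdahl-style speedup model (illustrative)."""
          speedup = 1.0 / ((1 - parallel_fraction) + parallel_fraction / nodes)
          runtime = base_runtime_s / speedup
          return runtime * node_power_w * nodes

      for n in (1, 2, 4, 8, 16):
          print(n, "nodes:", round(energy_per_query(n, base_runtime_s=100.0, node_power_w=300.0)), "J per query")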
  • ABSTRACT: The current computing trend towards cloud-based Database-as-a-Service (DaaS) as an alternative to traditional on-site relational database management systems (RDBMSs) has largely been driven by the perceived simplicity and cost-effectiveness of migrating to a DaaS. However, customers that are attracted to these DaaS alternatives may find that the range of different services and pricing options available to them adds an unexpected level of complexity to their decision making. Cloud service pricing models are typically 'pay-as-you-go', in which the customer is charged based on resource usage such as CPU and memory utilization. Thus, customers considering different DaaS options must take into account how the performance and efficiency of the DaaS will ultimately impact their monthly bill. In this paper, we show that the current DaaS model can produce unpleasant surprises: for example, the case study that we present in this paper illustrates a scenario in which a DaaS service powered by a DBMS that has a lower hourly rate actually costs more to the end user than a DaaS service that is powered by another DBMS that charges a higher hourly rate. Thus, what we need is a method for the end user to get an accurate estimate of the true costs that will be incurred without worrying about the nuances of how the DaaS operates. One potential solution to this problem is for DaaS providers to offer a new service called Benchmark as a Service (BaaS), wherein the user provides the parameters of their workload and SLA requirements and gets a price quote.
    01/2012;
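    A small worked example of the pricing pitfall described above (hypothetical rates and runtimes, not the numbers from the case study): the service with the lower hourly rate still costs more per month because it needs more hours to finish the same workload.
      def monthly_cost(hourly_rate, hours_per_run, runs_per_month):
          return hourly_rate * hours_per_run * runs_per_month

      cost_low_rate = monthly_cost(hourly_rate=0.50, hours_per_run=10.0, runs_per_month=30)   # slower DBMS
      cost_high_rate = monthly_cost(hourly_rate=0.80, hours_per_run=4.0, runs_per_month=30)   # faster DBMS
      print(cost_low_rate, cost_high_rate)   # 150.0 vs 96.0: the "cheaper" hourly rate loses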
  • ABSTRACT: As traditional and mission-critical relational database workloads migrate to the cloud in the form of Database-as-a-Service (DaaS), there is an increasing motivation to provide performance goals in Service Level Objectives (SLOs). Providing such performance goals is challenging for DaaS providers as they must balance the performance that they can deliver to tenants and the data center's operating costs. In general, aggressively aggregating tenants on each server reduces the operating costs but degrades performance for the tenants, and vice versa. In this paper, we present a framework that takes as input the tenant workloads, their performance SLOs, and the server hardware that is available to the DaaS provider, and outputs a cost-effective recipe that specifies how much hardware to provision and how to schedule the tenants on each hardware resource. We evaluate our method and show that it produces effective solutions that can reduce the costs for the DaaS provider while meeting performance goals.
    01/2012;
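    A first-fit-decreasing sketch of the provisioning question (my own simplification, not the paper's framework): given a per-server capacity and the share of a server each tenant needs to meet its SLO, pack tenants onto as few servers as possible.
      def first_fit_decreasing(tenant_demands, server_capacity):
          """Greedy assignment of tenants to servers (illustrative only)."""
          servers = []                                   # each entry: [remaining_capacity, [tenants]]
          for tenant, demand in sorted(tenant_demands.items(), key=lambda kv: -kv[1]):
              for server in servers:
                  if server[0] >= demand:
                      server[0] -= demand
                      server[1].append(tenant)
                      break
              else:
                  servers.append([server_capacity - demand, [tenant]])
          return [server[1] for server in servers]

      demands = {"t1": 0.6, "t2": 0.5, "t3": 0.3, "t4": 0.2, "t5": 0.2}   # hypothetical server shares
      print(first_fit_decreasing(demands, server_capacity=1.0))           # [['t1', 't3'], ['t2', 't4', 't5']]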
  • ABSTRACT: A database system optimized for in-memory storage can support much higher transaction rates than current systems. However, standard concurrency control methods used today do not scale to the high transaction rates achievable by such systems. In this paper we introduce two efficient concurrency control methods specifically designed for main-memory databases. Both use multiversioning to isolate read-only transactions from updates but differ in how atomicity is ensured: one is optimistic and one is pessimistic. To avoid expensive context switching, transactions never block during normal processing, but they may have to wait before commit to ensure correct serialization ordering. We also implemented a main-memory optimized version of single-version locking. Experimental results show that while single-version locking works well when transactions are short and contention is low, performance degrades under more demanding conditions. The multiversion schemes have higher overhead but are much less sensitive to hotspots and the presence of long-running transactions.
    12/2011;
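    A minimal sketch of the multiversioning idea both methods build on (not the optimistic or pessimistic protocol from the paper): every record version carries a begin and an end timestamp, and a read-only transaction with snapshot timestamp ts sees exactly the version whose interval contains ts, so readers never block writers.
      INF = float("inf")

      class Version:
          def __init__(self, value, begin_ts, end_ts=INF):
              self.value, self.begin_ts, self.end_ts = value, begin_ts, end_ts

      def visible(versions, ts):
          """Return the record value visible to a snapshot taken at time ts."""
          for v in versions:
              if v.begin_ts <= ts < v.end_ts:
                  return v.value
          return None

      def install(versions, new_value, commit_ts):
          """Close the current version and append a new one at commit_ts."""
          if versions:
              versions[-1].end_ts = commit_ts
          versions.append(Version(new_value, commit_ts))

      history = []
      install(history, "v1", commit_ts=10)
      install(history, "v2", commit_ts=20)
      print(visible(history, 15), visible(history, 25))   # v1 v2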
  • ABSTRACT: Data center operators face a bewildering set of choices when considering how to provision resources on machines with complex I/O subsystems. Modern I/O subsystems often have a rich mix of fast, high-performing, but expensive SSDs sitting alongside cheaper but relatively slower (for random accesses) traditional hard disk drives. The data center operators need to determine how to provision the I/O resources for specific workloads so as to abide by existing Service Level Agreements (SLAs), while minimizing the total operating cost (TOC) of running the workload, where the TOC includes the amortized hardware costs and the runtime energy costs. The focus of this paper is on introducing this new problem of TOC-based storage allocation, cast in a framework that is compatible with traditional DBMS query optimization and query processing architecture. We also present a heuristic-based solution to this problem, called DOT. We have implemented DOT in PostgreSQL, and experiments using TPC-H and TPC-C demonstrate significant TOC reduction by DOT in various settings.
    The VLDB Journal 12/2011; · 1.40 Impact Factor
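    A greedy caricature of TOC-aware placement (mine, not the DOT heuristic): rank database objects by the I/O they attract per byte and fill the smaller, more expensive SSD with the hottest objects first, leaving the rest on the HDD. Object names and numbers are hypothetical.
      def place_objects(objects, ssd_capacity_gb):
          """objects: name -> (size_gb, ios_per_sec). Returns (on_ssd, on_hdd)."""
          ranked = sorted(objects.items(), key=lambda kv: kv[1][1] / kv[1][0], reverse=True)
          on_ssd, on_hdd, used = [], [], 0.0
          for name, (size, _) in ranked:
              if used + size <= ssd_capacity_gb:
                  on_ssd.append(name)
                  used += size
              else:
                  on_hdd.append(name)
          return on_ssd, on_hdd

      objs = {"idx_orders": (2, 900), "hot_table": (30, 1200), "cold_logs": (60, 50)}
      print(place_objects(objs, ssd_capacity_gb=40))   # (['idx_orders', 'hot_table'], ['cold_logs'])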
  • ABSTRACT: Users of MapReduce often run into performance problems when they scale up their workloads. Many of the problems they encounter can be overcome by applying techniques learned from over three decades of research on parallel DBMSs. However, translating these techniques to a MapReduce implementation such as Hadoop presents unique challenges that can lead to new design choices. This paper describes how column-oriented storage techniques can be incorporated in Hadoop in a way that preserves its popular programming APIs. We show that simply using binary storage formats in Hadoop can provide a 3x performance boost over the naive use of text files. We then introduce a column-oriented storage format that is compatible with the replication and scheduling constraints of Hadoop and show that it can speed up MapReduce jobs on real workloads by an order of magnitude. We also show that dealing with complex column types such as arrays, maps, and nested records, which are common in MapReduce jobs, can incur significant CPU overhead. Finally, we introduce a novel skip list column format and lazy record construction strategy that avoids deserializing unwanted records to provide an additional 1.5x performance boost. Experiments on a real intranet crawl are used to show that our column-oriented storage techniques can improve the performance of the map phase in Hadoop by as much as two orders of magnitude.
    Computing Research Repository (CoRR); 05/2011
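    A sketch of the lazy record construction idea (simplified, with made-up field names and JSON standing in for the real serialization): each row keeps its columns as raw bytes and a field is deserialized only the first time it is touched, so a job that reads one column never pays for materializing the others.
      import json

      class LazyRecord:
          """Deserializes a column value only when it is first accessed."""
          def __init__(self, raw_columns):
              self._raw = raw_columns          # column name -> serialized bytes
              self._cache = {}

          def __getitem__(self, column):
              if column not in self._cache:
                  self._cache[column] = json.loads(self._raw[column])
              return self._cache[column]

      row = LazyRecord({"url": b'"http://example.com"', "anchors": b'["a", "b"]', "body": b'"..."'})
      print(row["url"])                        # only the 'url' column is ever parsed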
  • Willis Lang, Ramakrishnan Kandhan, Jignesh M. Patel
    IEEE Data Eng. Bull. 01/2011; 34:12-23.
  • Yinan Li, Allison Terrell, Jignesh M. Patel
    ABSTRACT: Over the last decade the cost of producing genomic sequences has dropped dramatically due to the current so-called "next-gen" sequencing methods. However, these next-gen sequencing methods are critically dependent on fast and sophisticated data processing methods for aligning a set of query sequences to a reference genome using rich string matching models. The focus of this work is on the design, development and evaluation of a data processing system for this crucial "short read alignment" problem. Our system, called WHAM, employs novel hash-based indexing methods and bitwise operations for sequence alignments. It allows richer match models than existing methods and it is significantly faster than the existing state-of-the-art method. In addition, its relative speedup over the existing method is poised to grow as read sequence lengths increase in the future. The WHAM code is available at http://www.cs.wisc.edu/wham/.
    Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12-16, 2011; 01/2011
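    A toy seed-and-verify aligner in the same spirit as hash-based short-read alignment (far simpler than WHAM, and without its bitwise match models): index every k-mer of the reference, use a read's first k-mer as a seed, and verify candidate positions by counting mismatches.
      from collections import defaultdict

      def build_index(reference, k):
          index = defaultdict(list)
          for i in range(len(reference) - k + 1):
              index[reference[i:i + k]].append(i)
          return index

      def align(read, reference, index, k, max_mismatches):
          """Positions where the read matches with at most max_mismatches substitutions."""
          hits = []
          for pos in index.get(read[:k], []):
              window = reference[pos:pos + len(read)]
              if len(window) == len(read):
                  mismatches = sum(a != b for a, b in zip(read, window))
                  if mismatches <= max_mismatches:
                      hits.append(pos)
          return hits

      ref = "ACGTACGTGGTACCACGT"
      print(align("ACGTGGTA", ref, build_index(ref, k=4), k=4, max_mismatches=1))   # [4]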
  • Wen Jin, Jignesh M. Patel
    ABSTRACT: An important feature of the existing methods for ranked top-k processing is that they avoid searching all the objects in the underlying dataset and limit the number of random accesses to the data. However, the performance of these methods degrades rapidly as the number of random accesses increases. In this paper, we propose a novel and general sequential access scheme for top-k query evaluation, which outperforms existing methods. We extend this scheme to efficiently answer top-k queries in subspace and on dynamic data. We also study the "dual" form of top-k queries, called "ranking" queries, which returns the rank of a specified record/object, and propose one exact and two approximate solutions. An extensive empirical evaluation validates the robustness and efficiency of our techniques.
    Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12-16, 2011; 01/2011
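    For context, a compact version of the classic threshold-style top-k evaluation over per-attribute sorted lists (in the style of Fagin's TA; this is the kind of baseline such work improves on, not the sequential access scheme proposed in the paper): stop as soon as the current k-th best total cannot be beaten by any unseen object.
      import heapq

      def threshold_topk(scores, k):
          """scores: object -> tuple of attribute scores (larger is better)."""
          m = len(next(iter(scores.values())))
          lists = [sorted(scores.items(), key=lambda kv: kv[1][a], reverse=True) for a in range(m)]
          seen, heap = set(), []                     # min-heap of (total, object), size <= k
          for depth in range(len(scores)):
              threshold = 0.0
              for a in range(m):
                  obj, vals = lists[a][depth]
                  threshold += vals[a]               # last score seen under sorted access
                  if obj not in seen:                # "random access" for the other attributes
                      seen.add(obj)
                      heapq.heappush(heap, (sum(vals), obj))
                      if len(heap) > k:
                          heapq.heappop(heap)
              if len(heap) == k and heap[0][0] >= threshold:
                  break                              # no unseen object can enter the top k
          return sorted(heap, reverse=True)

      data = {"o1": (0.9, 0.2), "o2": (0.8, 0.7), "o3": (0.1, 0.9), "o4": (0.4, 0.3)}
      print(threshold_topk(data, k=2))               # [(1.5, 'o2'), (1.1, 'o1')]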
  • Avrilia Floratou, Sandeep Tata, Jignesh M. Patel
    ABSTRACT: Existing sequence mining algorithms mostly focus on mining for subsequences. However, a large class of applications, such as biological DNA and protein motif mining, require efficient mining of "approximate" patterns that are contiguous. The few existing algorithms that can be applied to such contiguous approximate pattern mining have drawbacks, including poor scalability, lack of guarantees in finding the pattern, and difficulty in adapting to other applications. In this paper, we present a new algorithm called FLexible and Accurate Motif DEtector (FLAME). FLAME is a flexible suffix-tree-based algorithm that can be used to find frequent patterns with a variety of definitions of motif (pattern) models. It is also accurate, as it always finds the pattern if it exists. Using both real and synthetic data sets, we demonstrate that FLAME is fast, scalable, and outperforms existing algorithms on a variety of performance metrics. In addition, based on FLAME, we also address a more general problem, named extended structured motif extraction, which allows mining frequent combinations of motifs under relaxed constraints.
    IEEE Transactions on Knowledge and Data Engineering 01/2011; 23:1154-1168. · 1.89 Impact Factor
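    A naive baseline for the core primitive (not FLAME's suffix-tree algorithm): count the sequences that contain a contiguous occurrence of a motif within a Hamming-distance budget. FLAME's contribution is searching the space of candidate motifs exhaustively and quickly under such models.
      def within_hamming(window, motif, max_mismatch):
          return sum(a != b for a, b in zip(window, motif)) <= max_mismatch

      def sequences_with_motif(sequences, motif, max_mismatch):
          """Number of sequences with an approximate, contiguous occurrence of motif."""
          count = 0
          for seq in sequences:
              for i in range(len(seq) - len(motif) + 1):
                  if within_hamming(seq[i:i + len(motif)], motif, max_mismatch):
                      count += 1
                      break
          return count

      print(sequences_with_motif(["ACGTTTGA", "ACCTTTGA", "GGGGGGGG"], "ACGTT", max_mismatch=1))   # 2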
  • ABSTRACT: Flash solid-state drives (SSDs) are changing the I/O landscape, which has largely been dominated by traditional hard disk drives (HDDs) for the last 50 years. In this paper we propose and systematically explore designs for using an SSD to improve the performance of a DBMS buffer manager. We propose three alternatives that differ mainly in the way that they deal with the dirty pages evicted from the buffer pool. We implemented these alternatives, as well as another recently proposed algorithm for this task (TAC), in SQL Server, and ran experiments using a variety of benchmarks (TPC-C, TPC-E, and TPC-H) at multiple scale factors. Our empirical evaluation shows significant performance improvements of our methods over the default HDD configuration (up to 9.4X), and up to a 6.8X speedup over TAC.
    Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12-16, 2011; 01/2011
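    A rough sketch of the shared mechanism behind the alternatives (simplified and read-only; the paper's designs differ precisely in how dirty pages are handled, which is omitted here): pages evicted from the RAM buffer pool are staged in an SSD cache, and misses check the SSD before falling back to the hard disk.
      from collections import OrderedDict

      class SsdExtendedBufferPool:
          """RAM buffer pool (LRU) backed by an SSD page cache, then the HDD."""
          def __init__(self, ram_pages, ssd_pages, hdd):
              self.ram, self.ssd = OrderedDict(), OrderedDict()
              self.ram_pages, self.ssd_pages, self.hdd = ram_pages, ssd_pages, hdd

          def read(self, page_id):
              if page_id in self.ram:
                  self.ram.move_to_end(page_id)
                  return self.ram[page_id]
              data = self.ssd.pop(page_id, None)   # SSD hit avoids the slow HDD read
              if data is None:
                  data = self.hdd[page_id]
              self._admit(page_id, data)
              return data

          def _admit(self, page_id, data):
              self.ram[page_id] = data
              if len(self.ram) > self.ram_pages:   # evict the coldest RAM page to the SSD
                  victim, vdata = self.ram.popitem(last=False)
                  self.ssd[victim] = vdata
                  if len(self.ssd) > self.ssd_pages:
                      self.ssd.popitem(last=False)

      pool = SsdExtendedBufferPool(ram_pages=2, ssd_pages=4, hdd={i: f"page-{i}" for i in range(10)})
      for pid in (0, 1, 2, 0):
          pool.read(pid)
      print(list(pool.ram), list(pool.ssd))        # [2, 0] [1]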
  • Spyros Blanas, Yinan Li, Jignesh M. Patel
    ABSTRACT: The focus of this paper is on investigating efficient hash join algorithms for modern multi-core processors in main memory environments. This paper dissects each internal phase of a typical hash join algorithm and considers different alternatives for implementing each phase, producing a family of hash join algorithms. Then, we implement these main memory algorithms on two radically different modern multi-processor systems, and carefully examine the factors that impact the performance of each method. Our analysis reveals some interesting results: a very simple hash join algorithm is very competitive with the other, more complex methods. This simple join algorithm builds a shared hash table and does not partition the input relations. Its simplicity implies that it requires fewer parameter settings, thereby making it far easier for query optimizers and execution engines to use it in practice. Furthermore, the performance of this simple algorithm improves dramatically as the skew in the input data increases, and it quickly starts to outperform all other algorithms.
    Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12-16, 2011; 01/2011
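    A structural sketch of the "no partitioning" join highlighted above (illustrative; production implementations are written in C/C++ with latches, and CPython's GIL limits real parallelism here): all build threads insert into one shared hash table guarded by per-bucket locks, and the probe phase then reads it without any locking.
      import threading
      from collections import defaultdict

      N_BUCKETS = 64
      buckets = [defaultdict(list) for _ in range(N_BUCKETS)]
      locks = [threading.Lock() for _ in range(N_BUCKETS)]

      def build_worker(rows, key):
          for row in rows:
              b = hash(row[key]) % N_BUCKETS
              with locks[b]:                        # fine-grained, per-bucket lock
                  buckets[b][row[key]].append(row)

      def probe(row, key):
          b = hash(row[key]) % N_BUCKETS
          return buckets[b].get(row[key], [])       # read-only probe needs no locking

      build_rows = [{"k": i % 10, "payload": i} for i in range(1000)]
      threads = [threading.Thread(target=build_worker, args=(build_rows[i::4], "k")) for i in range(4)]
      for t in threads: t.start()
      for t in threads: t.join()
      print(len(probe({"k": 3}, "k")))              # 100 rows share key 3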
  • W. Lang, M. Morse, J.M. Patel
    ABSTRACT: Long time-series data sets are common in many domains, especially scientific domains. Applications in these fields often require comparing trajectories using similarity measures. Existing methods perform well for short time series, but their evaluation cost rises rapidly for longer time series. In this work, we develop a new time-series similarity measure called the Dictionary Compression Score (DCS) for determining time-series similarity. We also show that this method allows us to accurately and quickly calculate similarity for both short and long time series. We use the well-known Kolmogorov complexity in information theory and the Lempel-Ziv compression framework as a basis to calculate similarity scores. We show that off-the-shelf compressors do not fare well for computing time-series similarity. To address this problem, we developed a novel dictionary-based compression technique to compute time-series similarity. We also develop heuristics to automatically identify suitable parameters for our method, thus removing the parameter tuning required by other existing methods. We have extensively compared DCS with existing similarity methods for classification. Our experimental evaluation shows that for long time-series data sets, DCS is accurate, and it is also significantly faster than existing methods.
    IEEE Transactions on Knowledge and Data Engineering 12/2010; · 1.89 Impact Factor
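    The compression-distance intuition behind the approach, shown with an off-the-shelf compressor (which, as the abstract notes, does not work well for time series and is what DCS's custom dictionary scheme replaces): sequences that share structure compress better together than unrelated ones.
      import zlib

      def compressed_size(x: bytes) -> int:
          return len(zlib.compress(x, 9))

      def ncd(x: bytes, y: bytes) -> float:
          """Normalized compression distance: smaller when x and y share structure."""
          cx, cy = compressed_size(x), compressed_size(y)
          return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)

      a = bytes(100 + 20 * ((i // 10) % 2) for i in range(500))         # square-wave-like series
      b = bytes(100 + 20 * (((i + 3) // 10) % 2) for i in range(500))   # shifted copy of a
      r = bytes((i * 239) % 256 for i in range(500))                    # unrelated sequence
      print(round(ncd(a, b), 3), "<", round(ncd(a, r), 3))              # the similar pair scores lower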
  • You Jung Kim, J. Patel
    ABSTRACT: Multidimensional point indexing plays a critical role in a variety of data-centric applications, including image retrieval, sequence matching, and moving object database search. A common choice of indexing method for these applications is the "ubiquitous" R*-tree. Choosing the right indexing method requires careful consideration of various factors such as query operations and index construction methods. In this work, we present an experimental study comparing the R*-tree and Quadtree using various criteria, including the query operations and index construction methods. Although a variety of query operations can be performed using these index structures, previous work has largely focused only on the range search operation. We go beyond this previous work and compare the performance of these index structures using k-nearest neighbor (kNN) and distance join queries. In addition, we also consider the impact of index construction methods in evaluating these index structures. Our study sheds light on how the choice of the underlying index structure affects the performance of different query operations, and shows that the method used for constructing the index and the dynamic nature of the data set have a dramatic impact on the performance of these index structures.
    IEEE Transactions on Knowledge and Data Engineering 08/2010; · 1.89 Impact Factor
  • ABSTRACT: When a database system is extended with the skyline operator, it is important to determine the most efficient way to execute a skyline query across tables with join operations. This paper describes a framework for evaluating skylines in the presence of equijoins, including: (1) the development of algorithms to answer such queries over large input tables in a non-blocking, pipelined fashion, which significantly reduces the overall query evaluation time. These algorithms are built on top of the traditional relational Nested-Loop and the Sort-Merge join algorithms, which allows easy implementation of these methods in existing relational systems; (2) a novel method for estimating the skyline selectivity of the joined table; (3) evaluation of skyline computation based on the estimation method and the proposed evaluation techniques; and (4) a systematic experimental evaluation to validate our skyline evaluation framework.
    Data Engineering (ICDE), 2010 IEEE 26th International Conference on; 04/2010
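    A minimal skyline computation over already-joined tuples (the straightforward baseline; the framework above interleaves this work with the Nested-Loop and Sort-Merge joins so that results stream out in a pipelined fashion): a tuple survives if no other tuple is at least as good in every dimension and strictly better in one. Here smaller values are better, and the sample tuples are hypothetical.
      def dominates(p, q):
          """True if p is at least as good as q everywhere and strictly better somewhere."""
          return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

      def skyline(points):
          window = []                                   # current skyline candidates
          for p in points:
              if any(dominates(q, p) for q in window):
                  continue                              # p is dominated, discard it
              window = [q for q in window if not dominates(p, q)]
              window.append(p)
          return window

      joined = [(5, 300), (3, 400), (4, 250), (6, 200), (3, 350)]        # (price, distance) pairs
      print(skyline(joined))                            # [(4, 250), (6, 200), (3, 350)]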

Publication Stats

3k Citations
61.03 Total Impact Points

Institutions

  • 1997–2013
    • University of Wisconsin–Madison
      • Department of Computer Sciences
      Madison, Wisconsin, United States
  • 2000–2009
    • University of Michigan
      • Department of Electrical Engineering and Computer Science (EECS)
      • Division of Computer Science and Engineering
      Ann Arbor, Michigan, United States
  • 2007
    • University of Michigan-Dearborn
      • Department of Computer & Information Science
      Dearborn, Michigan, United States
  • 2005
    • Concordia University–Ann Arbor
      Ann Arbor, Michigan, United States