Praveen Yedlapalli's research while affiliated with VMware and other places

Publications (15)

Article
One of the important characteristics of emerging multicores/manycores is the existence of "shared on-chip caches", through which different threads/processes can share data (help each other) or displace each other's data (hurt each other). Most current commercial multicore systems have on-chip cache hierarchies with multiple layers...
Chapter
This chapter presents an image-matching application that can take advantage of many-core architectures. Different parallelization strategies are explored that can take advantage of inter- and intra-image parallelism. The two main metrics that determine the application performance, tree-creation time and search time, were studied in the context of sc...
Article
Energy management in handheld devices is becoming a daunting task with the growing number of accelerators, increasing memory demands, and the high computing capacities required to support applications with stringent QoS needs. Current DVFS techniques that modulate power states of a single hardware component, or even recent proposals that manage multiple...
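The DVFS idea described above can be illustrated with a minimal sketch: pick the lowest CPU frequency whose projected frame time still meets a QoS deadline. The frequency list, cycle counts, and deadline below are hypothetical values for illustration, not from the paper.

```python
# Hypothetical frequency ladder in MHz (assumed, for illustration only).
FREQS = [400, 800, 1200, 1600]

def pick_freq(cycles_needed, deadline_ms):
    """Return the lowest frequency that finishes the frame within the deadline.

    A frequency of f MHz retires f * 1e3 cycles per millisecond, so the
    projected frame time is cycles_needed / (f * 1e3) ms.
    """
    for f in FREQS:  # ascending order: the lowest adequate frequency wins
        frame_ms = cycles_needed / (f * 1e3)
        if frame_ms <= deadline_ms:
            return f
    return FREQS[-1]  # deadline infeasible: saturate at the maximum frequency

# An 8M-cycle frame against a ~60 fps deadline (16.7 ms):
print(pick_freq(8_000_000, 16.7))  # → 800
```

Running lower frequencies when the deadline allows is the basic energy lever; coordinating this choice across multiple components is what the paper's approach targets.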
Article
Handheld devices are ubiquitous in today's world. With their advent, we also see a tremendous increase in device-user interactivity and real-time data processing needs. Media (audio/video/camera) and gaming use-cases are gaining substantial user attention and are defining product successes. The combination of increasing demand from these use-cases...
Article
Most prior compiler-based data-locality optimization works target cache locality exclusively; row-buffer locality in DRAM banks has received much less attention. In particular, to the best of our knowledge, there is no single compiler-based approach that can improve row-buffer locality when executing irregular applications. This p...
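Row-buffer locality can be made concrete with a toy open-page model: an access to the currently open DRAM row is a hit, while an access to a different row forces a precharge/activate. The row size and access streams below are assumed for illustration, not taken from the paper.

```python
ROW_SIZE = 2048  # assumed row-buffer size in bytes (hypothetical)

def row_buffer_hits(addresses, row_size=ROW_SIZE):
    """Count accesses that hit the currently open row (open-page policy)."""
    hits = 0
    open_row = None
    for addr in addresses:
        row = addr // row_size
        if row == open_row:
            hits += 1
        open_row = row
    return hits

# A sequential traversal keeps reusing the open row...
seq = list(range(0, 8192, 8))
# ...while a large-strided traversal closes and reopens a row on every access.
strided = [(i * 2048) % 8192 + (i // 4) * 8 for i in range(1024)]

print(row_buffer_hits(seq), row_buffer_hits(strided))  # → 1020 0
```

A compiler that restructures irregular accesses toward the first pattern converts row-buffer misses into hits, which is the locality the paper is after.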
Conference Paper
As the demand for feature-rich mobile systems such as smartphones and tablets has outpaced other computing systems and is expected to continue at a faster rate, it is projected that SoCs with tens of cores and hundreds of IPs (or accelerators) will be designed to provide an unprecedented level of features and functionality in the future. Design of such mob...
Article
As the demand for feature-rich mobile systems such as smartphones and tablets has outpaced other computing systems and is expected to continue at a faster rate, it is projected that SoCs with tens of cores and hundreds of IPs (or accelerators) will be designed to provide an unprecedented level of features and functionality in the future. Design of such mob...
Conference Paper
This paper presents a cache hierarchy-aware code mapping and scheduling strategy for multicore architectures. Our mapping strategy determines a loop iteration-to-core mapping by taking into account application data access patterns and on-chip cache hierarchy. It employs a novel concept called “core vectors” to obtain a mapping matrix which exploits...
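As a baseline for the mapping problem the paper addresses, the sketch below contrasts two generic iteration-to-core mappings. This is not the paper's "core vectors" technique, only an illustration of why the mapping matters: a blocked mapping keeps neighbouring iterations (and hence spatially local data) on the same core, while a cyclic mapping scatters them across private caches.

```python
def blocked_mapping(num_iters, num_cores):
    """Assign contiguous blocks of iterations to each core, so iterations
    that touch neighbouring data share a core (and its cache)."""
    block = (num_iters + num_cores - 1) // num_cores
    return [min(i // block, num_cores - 1) for i in range(num_iters)]

def cyclic_mapping(num_iters, num_cores):
    """Round-robin assignment: neighbouring iterations land on different
    cores, scattering spatially local data across private caches."""
    return [i % num_cores for i in range(num_iters)]

print(blocked_mapping(8, 4))  # → [0, 0, 1, 1, 2, 2, 3, 3]
print(cyclic_mapping(8, 4))   # → [0, 1, 2, 3, 0, 1, 2, 3]
```

A cache-hierarchy-aware scheme generalizes this choice by also considering which cores share which cache levels.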
Conference Paper
Both on-chip resource contention and off-chip latencies have a significant impact on memory requests in large-scale chip multiprocessors. We propose a memory-side prefetcher, which brings data on-chip from DRAM, but does not proactively push this data further to the cores/caches. Sitting close to memory, it has detailed knowledge of DRAM state and...
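The benefit of such a prefetcher can be sketched with a toy latency model. The latencies, row size, and next-row policy below are assumptions for illustration only, not the paper's design: a demand access opens its row, and when prefetching is on, the memory-side prefetcher additionally opens the next row, so later demand accesses to it are fast without anything being pushed into the cores' caches.

```python
DRAM_LAT, OPEN_LAT, ROW = 100, 30, 2048  # hypothetical cycle costs / row size

def total_latency(addresses, prefetch):
    """Sum access latencies: hits to already-open (or prefetched) rows are cheap."""
    open_rows, cycles = set(), 0
    for addr in addresses:
        row = addr // ROW
        cycles += OPEN_LAT if row in open_rows else DRAM_LAT
        open_rows.add(row)          # the row stays open after a demand access
        if prefetch:
            open_rows.add(row + 1)  # memory-side next-row prefetch
    return cycles

stream = list(range(0, 4 * ROW, 256))  # sequential sweep over four rows
print(total_latency(stream, prefetch=False), total_latency(stream, prefetch=True))
# → 1240 1030
```

The prefetcher removes the first-access penalty of each subsequent row; crucially, the data waits near memory until a core actually asks for it.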
Conference Paper
We propose a cooperation between the programmer, the compiler and the runtime system to identify, exploit and efficiently exercise the parallelism available in many pointer based applications. Our parallelization strategy, called Cooperative Parallelization, is driven by programmer directives as well as runtime information. We show that minimal inf...
Conference Paper
Elementary functions are extensively used in computer graphics, signal and image processing, and communication systems. This paper presents a special-purpose compiler that automatically generates customized look-up tables and implementations for elementary functions under user given constraints. The generated implementations include a C/C++ code th...
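A minimal example of the table-lookup idea (not the paper's generated code, and with an arbitrarily chosen table size): precompute sin at uniform sample points and answer queries by linear interpolation between neighbouring entries.

```python
import math

def build_sin_table(entries=256):
    """Precompute sin over [0, 2*pi] at uniformly spaced sample points.

    One extra entry is stored so interpolation at the last segment is safe.
    """
    step = 2 * math.pi / entries
    return [math.sin(i * step) for i in range(entries + 1)], step

def lut_sin(x, table, step):
    """Approximate sin(x) by linear interpolation between table entries."""
    x = x % (2 * math.pi)
    idx = int(x / step)
    frac = x / step - idx
    return table[idx] + frac * (table[idx + 1] - table[idx])

table, step = build_sin_table()
worst = max(abs(lut_sin(x / 100, table, step) - math.sin(x / 100))
            for x in range(628))
print(worst < 1e-3)  # → True
```

The compiler described in the paper automates this trade-off: the user's accuracy constraint determines the table size and interpolation scheme, rather than a hand-picked 256 entries.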

Citations

... At the highest level of the hierarchy, DRAM arrays are partitioned into banks that can be accessed simultaneously. This enhances parallelism as it allows serving multiple memory requests that target different banks at the same time [3][4][5][6]. At the lowest level, a row of DRAM cells is typically divided into multiple portions that can be accessed individually. ...
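Bank-level parallelism hinges on how a physical address is split into row, bank, and column fields. The layout and bit widths below are a hypothetical bank-interleaved mapping for illustration; real controllers use various (often hashed) schemes.

```python
# Assumed address layout, low bits first: | row | bank | column |
COL_BITS, BANK_BITS = 11, 3   # 2 KiB columns per row, 8 banks (hypothetical)

def decode(addr):
    """Split a physical address into (row, bank, column) fields."""
    col  = addr & ((1 << COL_BITS) - 1)
    bank = (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)
    row  = addr >> (COL_BITS + BANK_BITS)
    return row, bank, col

a, b = 0x0000, 0x0800          # these differ only in the bank bits
print(decode(a), decode(b))    # → (0, 0, 0) (0, 1, 0)
```

Because `a` and `b` map to different banks, the two requests can be serviced simultaneously; with the bank bits placed above the column bits, consecutive rows' worth of data interleave naturally across banks.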
... CPU time slice sharing is a trivial and linearly measurable mechanism for fine-grained compute resource allocation and sharing [Ding et al., 2014]. Several runtime management frameworks, middleware and operating system schedulers use the notion of CPU Utilization or time slice sharing to provide dedicated compute resources among different processes on a time-shared basis [Hindman et al., 2011]. ...
... In [28], Online Transaction Processing (OLTP) threads are pinned to NUMA islands that group different computing nodes. In [13,18,25], the data is partitioned across specific NUMA nodes, and querying threads are statically pinned to the data location. The database scheduler of the HyPer database uses a similar technique to control the dispatching of query fragments, called "morsels" [19]. ...
... The flag and signal can be determined using two data elements that are already stored in configuration registers in the VD and DC. First, since each video application injects its own requests into the VD using the driver API [31,69,103,105], the VD already keeps track of the number of concurrently-running video applications (and their requirements) in its control and status registers (CSRs). Second, each application also sends its requirements to the DC [47,64]; the number of used planes and each plane's type (e.g., video, graphics, or cursor) are available in the DC CSRs (e.g., the SR02 and GRX registers in the Intel DC [47]). ...
... We evaluate BurstLink with planar and VR video-streaming workloads [5,24], which are used in standard industrial benchmarks for battery-life [1,5,72,73,74] and academic evaluations of video-streaming optimizations [16,30,61,68,69,78,103,105,108]. In typical evaluations, it is assumed that only a single application (e.g., video streaming in our evaluation) is running on the system. ...
... Unlike LogCA, Gables considers multiple IP blocks running concurrently on a mobile SoC. GemDroid [9] is a simulation framework for evaluating SoCs; it features high-level models of IP blocks which may include simplifications. Rather than a high-level system, Mocktails targets something completely different: allowing integration of more recent, closed-source IP into a simulation infrastructure of the researcher's choice. ...
... Muddukrishna et al. [104] proposed a locality-aware task scheduling and runtime-system-assisted data distribution algorithm for OpenMP tasks on NUMA systems and many-core processors. Ding et al. [105] proposed a cache-hierarchy-aware loop-iterations-to-core mapping strategy that exploits data reuse and minimises data dependencies, resulting in improved data locality. Lifflander et al. [106] proposed locality-aware optimization at different phases of fork/join programs with optimal load balance based on Cilk, and also provide programmatic support for work-stealing schedules, which helps with user guidance on data locality. ...
... Memory-Side Prefetching techniques [66][67][68] place the hardware for data prefetching near DRAM, to save precious on-chip SRAM budget. In such approaches (e.g., [67]), prefetching is performed by a user thread running near the DRAM, and prefetched pieces of data are sent to the on-chip caches. ...
... Applications in scientific computing are often performance-limited by elementary function calls [28,33]. Such functions are common in scientific code, so designers have long studied how to accelerate them with lookup table (LUT) hardware [9,29]. ...