January 2014 · 36 Reads · 11 Citations
June 2012 · 568 Reads · 1,016 Citations
Windows Azure Storage (WAS) is a cloud storage system that provides customers the ability to store seemingly limitless amounts of data for any duration of time. WAS customers have access to their data from anywhere, at any time, and only pay for what they use and store. To provide durability for that data and to keep the cost of storage low, WAS uses erasure coding. In this paper we introduce a new set of codes for erasure coding called Local Reconstruction Codes (LRC). LRC reduces the number of erasure coding fragments that need to be read when reconstructing data fragments that are offline, while still keeping the storage overhead low. The important benefits of LRC are that it reduces the bandwidth and I/Os required for repair reads over prior codes, while still allowing a significant reduction in storage overhead. We describe how LRC is used in WAS to provide low overhead durable storage with consistently low read latencies.
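As a rough illustration of the local-reconstruction idea described above (a sketch only, not the WAS implementation: the XOR-only local parities and the group size of three are simplifying assumptions):

```python
# Illustrative sketch of the Local Reconstruction Code (LRC) idea:
# data fragments are split into local groups, each protected by a
# local parity, so a single lost fragment is rebuilt from its group
# alone instead of from the whole stripe. XOR-only parities and the
# group size are assumptions for illustration.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def encode_local_parities(data_fragments, group_size):
    """Return one XOR parity per local group of data fragments."""
    groups = [data_fragments[i:i + group_size]
              for i in range(0, len(data_fragments), group_size)]
    return [xor_blocks(g) for g in groups]

def reconstruct(lost_index, data_fragments, local_parities, group_size):
    """Rebuild a single lost data fragment from its local group only."""
    g = lost_index // group_size
    group = data_fragments[g * group_size:(g + 1) * group_size]
    survivors = [f for i, f in enumerate(group)
                 if g * group_size + i != lost_index]
    return xor_blocks(survivors + [local_parities[g]])

# Example: 6 data fragments in 2 local groups of 3; rebuilding fragment 4
# reads only 3 fragments instead of all surviving fragments in the stripe.
data = [bytes([i] * 4) for i in range(6)]
parities = encode_local_parities(data, group_size=3)
assert reconstruct(4, data, parities, group_size=3) == data[4]
```

This is what drives the bandwidth and I/O savings for repair reads: a single-fragment failure touches only its local group, while the global parities (omitted here) preserve overall fault tolerance.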
October 2011 · 1,273 Reads · 755 Citations
Windows Azure Storage (WAS) is a cloud storage system that provides customers the ability to store seemingly limitless amounts of data for any duration of time. WAS customers have access to their data from anywhere at any time and only pay for what they use and store. In WAS, data is stored durably using both local and geographic replication to facilitate disaster recovery. Currently, WAS storage comes in the form of Blobs (files), Tables (structured storage), and Queues (message delivery). In this paper, we describe the WAS architecture, global namespace, and data model, as well as its resource provisioning, load balancing, and replication systems.
August 2010 · 28 Reads
SimPoint is a technique for picking which parts of a program's execution to simulate in order to obtain an accurate picture of the complete execution. SimPoint uses data clustering algorithms from machine learning to automatically find repetitive (similar) patterns in a program's execution, and it chooses one sample to represent each unique repetitive behavior. Each sample is then simulated and weighted appropriately, so that together the results from these samples represent an accurate picture of the complete execution of the program.
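A hedged sketch of the underlying idea, assuming each fixed-length interval has already been profiled into a basic-block vector and using plain k-means as a stand-in for SimPoint's full clustering pipeline (random projection, BIC-based choice of k, and other refinements are omitted):

```python
# Sketch of SimPoint-style interval selection (assumptions: intervals are
# pre-profiled into basic-block vectors; plain k-means replaces the full
# SimPoint clustering pipeline).
import numpy as np
from sklearn.cluster import KMeans

def pick_simulation_points(bbvs, k):
    """bbvs: (num_intervals, num_basic_blocks) execution-frequency matrix.
    Returns (interval_index, weight) pairs, one representative per cluster."""
    X = bbvs / bbvs.sum(axis=1, keepdims=True)        # normalize each interval
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    points = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        # representative = interval closest to the cluster centroid
        d = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        rep = members[np.argmin(d)]
        points.append((int(rep), len(members) / len(bbvs)))  # weight by cluster size
    return points

# Each chosen interval is simulated in detail and its result is scaled by
# its weight to estimate whole-program behavior.
```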
October 2008 · 129 Reads · 13 Citations
As multiprocessors become mainstream, techniques for efficiently simulating multi-threaded workloads are needed. Multi-threaded simulation presents a new challenge: non-determinism across simulations for different architecture configurations. If the execution paths between two simulation runs of the same benchmark with the same input are too different, the simulation results cannot be used to compare the configurations. In this paper we present a simulation technique to efficiently collect simulation checkpoints for multi-threaded workloads and to compare simulation runs while addressing this non-determinism problem. We focus on user-level simulation of multi-threaded workloads for multiprocessor architectures and present an approach, based on binary instrumentation, for collecting checkpoints for simulation. Our checkpoints allow reproducible execution of the samples across different architecture configurations by controlling the sources of non-determinism during simulation. This control introduces stalls that would not naturally occur in execution, so we propose techniques that allow us to accurately compare performance across architecture configurations in the presence of these stalls.
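A toy illustration of the replay idea, under the assumption that the checkpoint records the thread interleaving as an ordered log of (thread id, instruction count) slices; forcing the recorded order is what stalls threads that would otherwise have run, which is the source of the artificial stalls mentioned above. Real checkpoints also capture memory and register state.

```python
# Toy illustration of schedule-controlled replay (assumption: the
# checkpoint stores the interleaving as (thread_id, num_instructions)
# slices; architectural state capture is omitted).
from collections import defaultdict

def replay(schedule, threads):
    """schedule: list of (thread_id, n_instructions) recorded at capture time.
    threads: dict mapping thread_id -> iterator over that thread's instructions.
    Forces the recorded interleaving, stalling the other threads meanwhile."""
    executed = defaultdict(int)
    for tid, n in schedule:
        for _ in range(n):          # only `tid` may run during this slice;
            next(threads[tid])      # all other threads are stalled
            executed[tid] += 1
    return dict(executed)

# Example with trivial "instruction streams"
threads = {0: iter(range(100)), 1: iter(range(100))}
print(replay([(0, 3), (1, 5), (0, 2)], threads))   # {0: 5, 1: 5}
```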
May 2008 · 33 Reads · 35 Citations
Phase-based tuning methodologies specialize system parameters for each phase of an application's execution. Parameters are varied during execution, as opposed to remaining fixed as in an application-based tuning methodology. Prior work and logic suggest that phase-based tuning may provide significant savings over application-based tuning. We investigate this hypothesis using a detailed cache model, tuning a highly-configurable cache on a per-phase basis rather than once per application, and find that phase-based tuning yields improvements of up to 37% in performance and 20% in energy over application-based tuning. Furthermore, we extend previous phase-based tuning of a configurable cache by significantly increasing configurability and show a 14% energy improvement compared to previous methods. In addition, we quantify the overhead imposed by cache reconfiguration.
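A minimal sketch contrasting the two tuning granularities; the configuration space, the toy cost model in evaluate(), and the phase descriptions are illustrative assumptions, not the paper's cache model:

```python
# Sketch of application-based vs. phase-based cache tuning.

CACHE_CONFIGS = [
    {"size_kb": 8,  "assoc": 1, "line": 32},
    {"size_kb": 16, "assoc": 2, "line": 32},
    {"size_kb": 32, "assoc": 4, "line": 64},
]

def evaluate(config, phase):
    """Toy energy estimate for one phase: per-access energy grows with
    capacity/associativity, miss rate shrinks as the cache covers the
    phase's working set (numbers are purely illustrative)."""
    accesses, working_set_kb = phase
    hit_energy = 0.05 * config["size_kb"] * config["assoc"]
    miss_rate = min(1.0, working_set_kb / (4.0 * config["size_kb"]))
    return accesses * (hit_energy + miss_rate * 20.0)

def tune_per_application(phases):
    # application-based tuning: a single configuration for the whole run
    return min(CACHE_CONFIGS, key=lambda c: sum(evaluate(c, p) for p in phases))

def tune_per_phase(phases):
    # phase-based tuning: one configuration per phase (reconfiguration
    # overhead is ignored here, though the paper quantifies it)
    return [min(CACHE_CONFIGS, key=lambda c: evaluate(c, p)) for p in phases]

phases = [(1_000_000, 6), (1_000_000, 100)]   # (accesses, working-set KB)
print(tune_per_application(phases))
print(tune_per_phase(phases))
```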
May 2008 · 11 Reads
ACM Transactions on Architecture and Code Optimization
November 2007 · 100 Reads · 4 Citations
Value specialization is a technique that can improve a program's performance when its code frequently operates on the same values. In this paper, speculative value specialization is applied dynamically by utilizing the trace cache hardware. We implement a small, efficient hardware profiler to identify loads that have semi-invariant runtime values. A specialization engine off the program's critical path generates highly optimized traces using these values, which reside in the trace cache. Specialized traces are dynamically verified during execution, and mis-specialization is recovered from automatically without new hardware overhead. Our simulations show that dynamic value specialization in the trace cache achieves a 17% speedup, even over a system with support for hardware value prediction. When combined with other techniques aimed at tolerating memory latencies, this technique still performs well: combined with an aggressive hardware prefetcher, it achieves 24% better performance than prefetching alone.
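A software analogy of the mechanism, assuming a profiler has already flagged a load as semi-invariant; the guard check plays the role of the dynamic verification of a specialized trace, and a mismatch falls back to the general path, mirroring the hardware recovery:

```python
# Software analogy of trace-cache value specialization (assumptions: the
# profiler has identified a semi-invariant load value; the guard check
# stands in for hardware mis-specialization recovery).

def make_specialized(general_fn, load_fn, likely_value, specialized_fn):
    """Return a function that runs a trace specialized for likely_value,
    verifying the assumption and falling back on a mismatch."""
    def run(*args):
        v = load_fn(*args)
        if v == likely_value:              # dynamic verification of the trace
            return specialized_fn(*args)   # optimized with v folded in as a constant
        return general_fn(*args)           # "recovery": re-execute the general path
    return run

# Example: a load that almost always returns 8 lets the multiply be
# strength-reduced to a shift inside the specialized trace.
table = {"scale": 8}
general     = lambda x: x * table["scale"]
specialized = lambda x: x << 3             # constant 8 folded in
scaled = make_specialized(general, lambda x: table["scale"], 8, specialized)
assert scaled(5) == 40                     # specialized trace used
table["scale"] = 3
assert scaled(5) == 15                     # guard fails, fallback path used
```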
November 2007 · 24 Reads · 33 Citations
Microprocessors can have design errors that escape the test and validation process. Rectifying these errors after the processors have shipped can be very expensive, as it may require replacing the processors and stalling shipments. In this paper, we discuss architecture support for patching design errors in processors that have already been shipped. A contribution of this paper is our analysis showing that a majority of errors can be detected by monitoring a subset of signals in the processor. We propose incorporating a programmable error detector in the processor that monitors these signals to detect errors and initiate recovery using one of the mechanisms we discuss. The proposed hardware units can be programmed using patches consisting of errata signatures, which the manufacturer develops and distributes when errors are discovered after the design phase.
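A hedged sketch of the signature-matching idea: the detector is programmed with errata patterns over a small set of monitored signals and triggers a recovery action when a pattern matches. The signal names, signature encoding, and recovery actions below are illustrative assumptions, not the paper's exact encoding:

```python
# Sketch of a programmable errata detector driven by a downloadable patch.

MONITORED = ["opcode_class", "uop_fused", "tlb_miss", "store_forward"]

ERRATA_PATCH = [
    # Each signature: required values of monitored signals -> recovery action.
    ({"opcode_class": "fp_div", "uop_fused": 1}, "flush_and_replay_microcode"),
    ({"tlb_miss": 1, "store_forward": 1},        "flush_and_replay_serialized"),
]

def check_cycle(signals, patch=ERRATA_PATCH):
    """signals: dict of monitored signal values observed this cycle.
    Returns the recovery action for the first matching signature, or None."""
    for signature, action in patch:
        if all(signals.get(name) == val for name, val in signature.items()):
            return action
    return None

print(check_cycle({"opcode_class": "fp_div", "uop_fused": 1,
                   "tlb_miss": 0, "store_forward": 0}))
# -> flush_and_replay_microcode
```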
October 2007 · 26 Reads · 21 Citations
Almost all new consumer-grade processors are capable of executing multiple programs simultaneously. Analyzing multiprogrammed workloads for multicore and SMT processors is challenging and time-consuming because there are many possible combinations of benchmarks to execute, and each combination may exhibit several different interesting behaviors. Missing particular combinations of program behaviors could hide performance problems with designs. It is thus of utmost importance to have a representative multiprogrammed workload when evaluating multithreaded processor designs. This paper presents a methodology that applies phase analysis, principal components analysis (PCA), and cluster analysis (CA) to microarchitecture-independent program characteristics in order to find important program interactions in multiprogrammed workloads. The end result is a small set of co-phases with associated weights that is representative of a multiprogrammed workload across multithreaded processor architectures. Applying our methodology to the SPEC CPU 2000 benchmark suite yields 50 distinct combinations for two-context multithreaded processor simulation that researchers and architects can use. Each combination is simulated for 50 million instructions, giving a total of 2.5 billion instructions to simulate for the SPEC CPU 2000 benchmark suite. The performance prediction error with these representative combinations is under 2.5% for absolute throughput relative to the real workload, and the combinations can be used to make relative throughput comparisons across processor architectures.
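A hedged sketch of the selection pipeline, assuming a matrix of microarchitecture-independent characteristics per candidate co-phase; the normalization, component count, and cluster count are placeholders, and plain PCA plus k-means stands in for the paper's exact statistical setup:

```python
# Sketch of co-phase selection: normalize the microarchitecture-independent
# characteristics, reduce them with PCA, cluster, and keep one representative
# co-phase per cluster with a weight proportional to cluster size.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def select_cophases(features, k, n_components=4):
    """features: (num_cophases, num_characteristics) matrix, with at least
    n_components columns. Returns (cophase_index, weight) pairs."""
    X = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-12)
    Z = PCA(n_components=n_components).fit_transform(X)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Z)
    picks = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        # representative = co-phase closest to its cluster centroid
        d = np.linalg.norm(Z[members] - km.cluster_centers_[c], axis=1)
        picks.append((int(members[np.argmin(d)]), len(members) / len(features)))
    return picks
```

Each selected co-phase is then simulated for a fixed instruction budget and its throughput is weighted by its cluster's share of the workload.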
... We use SimpleScalar 2 [8] to evaluate our solutions. We execute 500M representative instructions from SPEC2000 [9] selected using SimPoint [11]. To estimate cache dynamic and leakage energy, we use the CACTI 6 [10] tool. ...
January 2003
... While record-and-replay techniques can be used for post-mortem analysis, as they can replay a recorded execution, they also do not fit well in our scenario. Specifically, recording the fine-grained program execution [80], [5], [39], [70] often imposes significant runtime overhead. Coarse-grained record-and-replay techniques [27], [42] focus on system-level events to reduce the overhead. ...
October 2006
ACM SIGOPS Operating Systems Review
... SimPoint performs its analysis by combining the starting point of a program's simulation sample, the sample length, and the instructions per clock cycle. B. Calder (2003) states that error analysis with SimPoint shows an average error of less than 10% across a wide range of benchmarks. ...
September 2005
... This allowed us to analyse suspect behaviours, fix the HL3 specification and model, and then check that the corrected model solved the uncovered issues by replaying the trace again. That is, we automatically obtained the record-and-replay capabilities of debuggers, as in [65]. ...
May 2005
ACM SIGARCH Computer Architecture News
... A performance audit provides an understanding of all the perspectives surrounding the audited object, and it also makes the existing information about that business more accessible to all of its stakeholders, since the recommendations to be considered are based on the results found through such audits (CAGI, 2014). In short, performance audits are assessments of an organization's activities carried out to verify whether resources are being managed with due regard for economy, efficiency, effectiveness, and accountability (Khan, 1988). ...
June 2006
ACM SIGPLAN Notices
... Erasure coding is commonly used in many DSSs [3,6,14,17,28], because it provides the same fault tolerance as replication but at a significantly lower cost. Typically described as an (n, k) erasure code, it encodes k data blocks into n − k parity blocks to form a stripe of width n. ...
June 2012
... The dynamic switching between native mode and virtual mode is enabled by utilizing the debug information. One interesting instrumentation tool, BitRaker Anvil [8], was proposed by Calder et al. The target binary to be simulated is instrumented by invoking host-native annotation code. ...
January 2004
... Table 1 contains some characteristics of the thirteen C++ programs we have analyzed. These are some of the benchmarks used in [PR96, BS96, CGZ95]. The columns lines, ICFG nodes, methods, virtual calls, SCC's and Max SCC respectively show the number of lines of code, ICFG nodes, methods, dynamically dispatched call sites, nodes in the SCC-DAG, and methods in the maximum-sized SCC for each program. Table 2 contains the timings using a Sparc-20 with 352 megabytes of memory. ...
Reference: Relevant Context Inference
January 2014
... A recent study [13] investigates the use of a wide word pipelined memory that allows concurrent accesses. This is an interesting alternative to the multibanked shared memory that we have been assessing but performance comparisons are yet to be done. ...
January 2003
... Given that the CBPs of Intel [26], Apple, and Qualcomm cores have been recovered (as shown in Table 6), we implement them on an in-house ChampSim-like [11] standalone branch predictor model that aligns with a commercial processor. We simulate these CBPs using SPEC INT 2017 and Geekbench 5 benchmarks compiled for ARM64, employing the SimPoint [18] methodology. The results, depicted in Figure 9, indicate that the Firestorm CBP performs the best, narrowly outperforming the Oryon CBP by 1%, while the Skylake CBP significantly lags behind by more than 20%. ...
August 2003
ACM SIGMETRICS Performance Evaluation Review