Fig 2 - uploaded by Ahmad Yasin
Instructions per cycle (IPC) for 400.perlbench on recent x86 processors (configuration details in Table 2).

Source publication
Article
The slowdown in technology scaling puts architectural features at the forefront of the innovation in modern processors. This article presents a Metric-Guided Method (MGM) that extends Top-Down analysis with carefully selected, dynamically adapted metrics in a structured approach. Using MGM, we conduct two evaluations at the microarchitecture and th...

Contexts in source publication

Context 1
... hardware vendors introduce several changes at once in new generations [5][6][7]. While this can benefit certain workloads (as Figure 2 demonstrates), it induces inter-feature interactions. Background on modern architectures and associated challenges is provided in Section 2. ...
Context 2
... In the first scenario, we wanted to understand the IPC increase for the benchmark in Figure 2 when comparing Skylake to a previous microarchitecture (uarch) generation. ...
Context 3
... benchmarks target two ISAs: AVX (-xAVX) and AVX2 (-xCORE-AVX2). 400.perlbench in Figure 2 uses a different AVX binary with common ISA for AMD and Intel. Iso-frequency was used in that case as well. ...
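The IPC comparison these contexts refer to can be sketched as follows. This is a minimal illustration, not the authors' tooling: IPC is taken as retired instructions divided by core cycles, and the gain of one microarchitecture generation over another as the ratio of the two IPC values minus one. The function names and counter values are hypothetical placeholders, not measurements from Figure 2.

```python
def ipc(instructions_retired: int, core_cycles: int) -> float:
    """Instructions per cycle from two raw counter readings."""
    if core_cycles <= 0:
        raise ValueError("core_cycles must be positive")
    return instructions_retired / core_cycles


def relative_ipc_gain(ipc_new: float, ipc_old: float) -> float:
    """Fractional IPC improvement of a newer core over an older one."""
    if ipc_old <= 0:
        raise ValueError("ipc_old must be positive")
    return ipc_new / ipc_old - 1.0


# Hypothetical counter readings (NOT data from Figure 2): the same binary
# retires the same number of instructions on both cores.
old_ipc = ipc(instructions_retired=2_000_000, core_cycles=1_000_000)
new_ipc = ipc(instructions_retired=2_000_000, core_cycles=800_000)
gain = relative_ipc_gain(new_ipc, old_ipc)
print(f"IPC old={old_ipc:.2f} new={new_ipc:.2f} gain={gain:.0%}")
```

When the same binary runs at iso-frequency, as Context 3 describes, the instruction count and clock rate cancel out, so the IPC ratio tracks the run-time ratio directly.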

Citations

... We define the frequency scalability of a workload as the change in its performance with unit change in frequency, as in [107, 144, 145, 146]. ...
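One way to read the frequency-scalability definition in this excerpt is as the relative change in performance divided by the relative change in frequency. The sketch below assumes that normalized reading; the citing paper may define the quantity differently. The function name, parameters, and sample numbers are hypothetical.

```python
def frequency_scalability(perf_lo: float, perf_hi: float,
                          freq_lo: float, freq_hi: float) -> float:
    """Relative performance change per relative frequency change.

    1.0 means performance scales linearly with frequency (core-bound);
    values near 0 mean frequency barely helps (e.g. memory-bound).
    This normalization is an assumption, not the cited definition.
    """
    if perf_lo <= 0 or freq_lo <= 0:
        raise ValueError("baseline performance and frequency must be positive")
    if freq_hi == freq_lo:
        raise ValueError("the two frequencies must differ")
    perf_change = perf_hi / perf_lo - 1.0
    freq_change = freq_hi / freq_lo - 1.0
    return perf_change / freq_change


# Hypothetical measurement: a 50% frequency increase (2.0 -> 3.0 GHz)
# yields a 25% performance increase (score 4.0 -> 5.0).
scal = frequency_scalability(perf_lo=4.0, perf_hi=5.0, freq_lo=2.0, freq_hi=3.0)
print(f"frequency scalability = {scal:.2f}")
```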
... Note that the machine-learning method we used automatically assigns low additive-regression coefficients to less relevant performance metrics in the models, so irrelevant metrics could be automatically discarded from the models [27]. Notably, most of the metrics identified as relevant by the machine-learning engine for SF prediction on the big core are based on the TMA (Top-Down Microarchitecture Analysis) event type, which was recently introduced by Intel to aid in the fine-grained identification of application performance bottlenecks [34]. As a new feature of Intel Alder Lake processors, these TMA metrics can be monitored altogether on P-cores by using a single PMC [15]. ...
... 3) memory system. Fig. 1(a) shows the architecture used in recent Intel processors (e.g., Skylake [23,33,34], Coffee Lake [25], and Cannon Lake [26]) with a focus on CPU cores. Power Management. ...
... DarkGates' three key components are implemented within the Intel Skylake SoC [23,33,34]. ...
... These mechanisms optimize voltage guardband using hardware and/or software sensors to reduce the operating margin for energy savings. Multiple of these guardband reduction mechanisms are already applied in the Skylake processor [14,23,34,72,74]. DarkGates can be applied orthogonally to these mechanisms since it physically optimizes the system impedance by bypassing the power-gates and sharing the power delivery resources on the package. ...
Preprint
To reduce the leakage power of inactive (dark) silicon components, modern processor systems shut-off these components' power supply using low-leakage transistors, called power-gates. Unfortunately, power-gates increase the system's power-delivery impedance and voltage guardband, limiting the system's maximum attainable voltage (i.e., Vmax) and, thus, the CPU core's maximum attainable frequency (i.e., Fmax). As a result, systems that are performance constrained by the CPU frequency (i.e., Fmax-constrained), such as high-end desktops, suffer significant performance loss due to power-gates. To mitigate this performance loss, we propose DarkGates, a hybrid system architecture that increases the performance of Fmax-constrained systems while fulfilling their power efficiency requirements. DarkGates is based on three key techniques: i) bypassing on-chip power-gates using package-level resources (called bypass mode), ii) extending power management firmware to support operation either in bypass mode or normal mode, and iii) introducing deeper idle power states. We implement DarkGates on an Intel Skylake microprocessor for client devices and evaluate it using a wide variety of workloads. On a real 4-core Skylake system with integrated graphics, DarkGates improves the average performance of SPEC CPU2006 workloads across all thermal design power (TDP) levels (35W-91W) between 4.2% and 5.3%. DarkGates maintains the performance of 3DMark workloads for desktop systems with TDP greater than 45W while for a 35W-TDP (the lowest TDP) desktop it experiences only a 2% degradation. In addition, DarkGates fulfills the requirements of the ENERGY STAR and the Intel Ready Mode energy efficiency benchmarks of desktop systems.
... Our finding on Ivy Bridge vs. Broadwell corroborates the recent work of Yasin et al. [60] where SPEC benchmarks are evaluated across Ivy Bridge and Skylake (the generation after Broadwell) micro-architectures. Yasin et al. also show that the improvement on the Skylake micro-architecture, which inherits the improvements from the Broadwell microarchitecture, significantly reduces the Icache stalls. ...
Article
Micro-architectural behavior of traditional disk-based online transaction processing (OLTP) systems has been investigated extensively over the past couple of decades. Results show that traditional OLTP systems mostly under-utilize the available micro-architectural resources. In-memory OLTP systems, on the other hand, process all the data in main-memory and, therefore, can omit the buffer pool. Furthermore, they usually adopt more lightweight concurrency control mechanisms, cache-conscious data structures, and cleaner codebases since they are usually designed from scratch. Hence, we expect significant differences in micro-architectural behavior when running OLTP on platforms optimized for in-memory processing as opposed to disk-based database systems. In particular, we expect that in-memory systems exploit micro-architectural features such as instruction and data caches significantly better than disk-based systems. This paper sheds light on the micro-architectural behavior of in-memory database systems by analyzing and contrasting it to the behavior of disk-based systems when running OLTP workloads. The results show that, despite all the design changes, in-memory OLTP exhibits very similar micro-architectural behavior to disk-based OLTP: more than half of the execution time goes to memory stalls where instruction cache misses or the long-latency data misses from the last-level cache (LLC) are the dominant factors in the overall execution time. Even though ground-up designed in-memory systems can eliminate the instruction cache misses, the reduction in instruction stalls amplifies the impact of LLC data misses. As a result, only 30% of the CPU cycles are used to retire instructions, and 70% of the CPU cycles are wasted to stalls for both traditional disk-based and new generation in-memory OLTP.
Chapter
The Top-Down method makes it possible to identify bottlenecks as instructions traverse the CPU’s pipeline. Once bottlenecks are identified, incremental changes to the code can be made to mitigate the negative effects bottlenecks might have on performance. This is an iterative process that could potentially result in better use of CPU resources. It can be difficult to compare bottleneck metrics of the same program generated by different compilers running on the same system. Different compilers could potentially generate different instructions, arrange the instructions in a different order, and require a different number of cycles to execute the program. Ratios with relatively similar values could hide valuable information that could be used to identify differences in the magnitude and influence of bottlenecks. To amplify magnitude differences of bottleneck metrics, we use the cycles required to complete the program as a reference point. We can then quantify the relative difference in the effect a bottleneck has when compared with the bottleneck of the reference compiler. This study’s proposed approach is based on the Purchasing Power Parity (PPP) theory, which is used by economists to compare the purchasing power of different currencies by comparing similar products. We show that this approach can give us more information on how effective each compiler is in using the CPU’s architectural features by comparing their respective bottlenecks. For example, using conventional methods, our measurements show that for the 363.swim benchmark, the BackEnd Bound rate was 0.949 for GCC4 and 0.956 for GCC6 and GCC7. However, using the PPP normalization approach, we showed that there were differences of 55.3% for GCC6 and 54.9% for GCC7 over GCC4.