Article

Abstract

Advances in technology have brought immense changes to the design and productivity of applications written for personal computers. Placing a greater number of cores on the same chip, however, introduces its own challenges: core-to-core communication, memory performance, and cache coherence. This paper presents a detailed performance analysis of FFT, a divide-and-conquer algorithm, on multi-core architectures with internal and external networks. The architectures are defined through memory and context configurations using the Multi2Sim 3.4 simulator, and their performance is evaluated with the SPLASH-2 benchmark suite.
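For reference, the FFT kernel referred to above follows the classic divide-and-conquer recurrence. The minimal sketch below is a plain recursive radix-2 Cooley-Tukey transform in C, not the parallel SPLASH-2 kernel itself, and it assumes a power-of-two input length.

```c
/* Minimal recursive radix-2 Cooley-Tukey FFT sketch (illustrative only;
 * the SPLASH-2 FFT kernel is a different, parallel implementation).
 * Assumes n is a power of two. Compile with -lm. */
#include <complex.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define TWO_PI 6.28318530717958647692

static void fft(double complex *x, size_t n)
{
    if (n < 2)
        return;

    /* Divide: split into even- and odd-indexed halves. */
    double complex *even = malloc(n / 2 * sizeof *even);
    double complex *odd  = malloc(n / 2 * sizeof *odd);
    for (size_t i = 0; i < n / 2; i++) {
        even[i] = x[2 * i];
        odd[i]  = x[2 * i + 1];
    }

    /* Conquer: recurse on each half. */
    fft(even, n / 2);
    fft(odd, n / 2);

    /* Combine: butterfly with twiddle factors. */
    for (size_t k = 0; k < n / 2; k++) {
        double complex t = cexp(-I * TWO_PI * k / n) * odd[k];
        x[k]         = even[k] + t;
        x[k + n / 2] = even[k] - t;
    }

    free(even);
    free(odd);
}

int main(void)
{
    double complex x[8] = {1, 1, 1, 1, 0, 0, 0, 0};
    fft(x, 8);
    for (int i = 0; i < 8; i++)
        printf("%.3f%+.3fi\n", creal(x[i]), cimag(x[i]));
    return 0;
}
```

The benchmark version distributes this work across threads, which is what exposes the core-to-core communication and cache-coherence traffic the paper measures.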
Chapter
Contemporary CMOS technology offers advanced features and compatible storage systems, providing higher functional density, increased performance, and reduced power. System-on-chip (SoC) technology provides a path for continual improvement in performance, power, cost, and size at the system level, in contrast with conventional CMOS scaling. When a single processor is transformed into a multicore processor, confining the circuits to a single chip introduces many hazards. To emphasize the importance of multicore architecture, this paper provides a comprehensive survey of multicore architecture designs, constraints, and practical issues.
Conference Paper
Multi-core and heterogeneous processor systems are widely used nowadays; even mobile devices have two or more cores to improve their performance. While both technologies are widespread, it is not clear which performs better or which hardware configuration is optimal for a specific target domain. In this paper, we propose an interconnect architecture, the Multi-core Processor with Hybrid Network (MPHN). The performance of MPHN is compared with (i) a multi-core processor with an internal network and (ii) a multi-core processor with a ring network. The proposed architecture substantially reduces communication delay in multicore processors.
Conference Paper
As chip multiprocessors (CMPs) have become the mainstream processor architecture, Intel and AMD have introduced their dual-core processors to the PC market. In this paper, performance studies on an Intel Core 2 Duo, an Intel Pentium D, and an AMD Athlon 64 X2 processor are reported. According to the design specifications, key deviations exist in the critical memory hierarchy architecture among these dual-core processors. In addition to overall execution time and throughput measurements using both multiprogrammed and multi-threaded workloads, this paper provides detailed analysis of memory hierarchy performance and of performance scalability between single and dual cores. Our results indicate that for the best performance and scalability, it is important to have (1) fast cache-to-cache communication, (2) large L2 or shared capacity, (3) low L2-to-core latency, and (4) fair cache resource sharing. The three dual-core processors that we studied show the benefits of some of these factors, but not all of them. Core 2 Duo has the best performance for most of the workloads because of microarchitecture features such as its shared L2 cache. Pentium D shows the worst performance in many respects because it is a technology remap of the Pentium 4.
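The factors identified here (cache-to-cache communication, L2 capacity, L2-to-core latency) are usually exposed with small microbenchmarks. One common technique, not necessarily the one used in the cited study, is pointer chasing through a randomly permuted array so that every load depends on the previous one and prefetching cannot hide the latency. A minimal sketch, assuming an 8 MiB working set and POSIX timing:

```c
/* Pointer-chasing latency microbenchmark sketch (illustrative; not the
 * methodology of the cited study). Each load depends on the previous one,
 * so the average time per iteration approximates memory latency at the
 * chosen working-set size. Compile with -O2 on a POSIX system. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define WORKING_SET (8 * 1024 * 1024 / sizeof(size_t)) /* ~8 MiB: larger than most L2s */
#define ITERATIONS  (10 * 1000 * 1000UL)

int main(void)
{
    size_t *chain = malloc(WORKING_SET * sizeof *chain);

    /* Sattolo's algorithm: build a single-cycle random permutation so the
     * walk visits every slot in an order the prefetcher cannot predict. */
    for (size_t i = 0; i < WORKING_SET; i++)
        chain[i] = i;
    srand(1);
    for (size_t i = WORKING_SET - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = chain[i];
        chain[i] = chain[j];
        chain[j] = tmp;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (unsigned long i = 0; i < ITERATIONS; i++)
        p = chain[p];                    /* serialized, dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg latency: %.1f ns (p=%zu)\n", ns / ITERATIONS, p);
    free(chain);
    return 0;
}
```

Shrinking the working set below the L2 capacity of the part under test should drop the reported latency sharply, which is one way the cache-size effects discussed above become visible.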
Article
Most large shared-memory multiprocessors use directory protocols to keep per-processor caches coherent. Some memory references in such systems, however, suffer long latencies for misses to remotely-cached blocks. To ameliorate this latency, researchers have augmented standard coherence protocols with optimizations for specific sharing patterns, such as read-modify-write, producer-consumer, and migratory sharing. This paper seeks to replace these directed solutions with general prediction logic that monitors coherence activity and triggers appropriate coherence actions. This paper takes the first step toward using general prediction to accelerate coherence protocols by developing and evaluating the Cosmos coherence message predictor. Cosmos predicts the source and type of the next coherence message for a cache block using logic that is an extension of Yeh and Patt's two-level PAp branch predictor. For five scientific applications running on 16 processors, Cosmos has prediction accuracies of 62% to 93%. Cosmos' high prediction accuracy is a result of predictable coherence message signatures that arise from stable sharing patterns of cache blocks.
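To make the PAp analogy concrete, the sketch below shows the two-level structure in miniature: a per-block history of recent message identifiers indexes a per-block pattern table that records the message last seen after that history. The field widths, table size, and message encoding here are arbitrary illustrations, not the actual Cosmos design.

```c
/* Heavily simplified two-level (PAp-style) message predictor sketch.
 * Per cache block: a history register of the most recent <sender, type>
 * message IDs indexes a per-block pattern table whose entry stores the
 * message last observed after that history. Sizes are arbitrary. */
#include <stdint.h>
#include <stdio.h>

#define HIST       2                  /* history depth in messages */
#define MSG_BITS   6                  /* bits to encode one <sender, type> ID */
#define PT_ENTRIES (1u << (HIST * MSG_BITS))

struct block_predictor {
    uint32_t history;                 /* concatenated recent message IDs */
    uint8_t  pattern[PT_ENTRIES];     /* predicted next message per history */
};

/* Predict the next coherence message ID for this block. */
static uint8_t predict(const struct block_predictor *p)
{
    return p->pattern[p->history & (PT_ENTRIES - 1)];
}

/* After the real message arrives, train the table and shift the history. */
static void update(struct block_predictor *p, uint8_t observed_msg)
{
    p->pattern[p->history & (PT_ENTRIES - 1)] = observed_msg;
    p->history = ((p->history << MSG_BITS) | observed_msg) & (PT_ENTRIES - 1);
}

int main(void)
{
    static struct block_predictor bp = {0};
    /* Toy stream: a repeating producer-consumer-like signature 3,7,3,7,... */
    uint8_t stream[] = {3, 7, 3, 7, 3, 7, 3, 7};
    int hits = 0;
    for (unsigned i = 0; i < sizeof stream; i++) {
        if (predict(&bp) == stream[i])
            hits++;
        update(&bp, stream[i]);
    }
    printf("hits: %d / %zu\n", hits, sizeof stream);
    return 0;
}
```

Once the toy stream's signature repeats, the table predicts it perfectly, which mirrors the claim above that stable sharing patterns yield predictable message signatures.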
Conference Paper
A comprehensive set of development methods is needed for large-scale software development on multi-core systems. The paper presents a complete solution and a set of detailed analysis methods: parallel mode decomposition, data-dependence and task-dependence analysis, parallel algorithm design, choice of parallel programming patterns, coding, and performance optimization. The solution has proven valuable in guiding the development of programs for multi-core systems.
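Of the steps listed, data decomposition is the easiest to show in code: partition the iteration space across threads, give each thread private storage for its partial result, and combine after the join. A minimal pthreads sketch (array size and thread count chosen arbitrarily):

```c
/* Minimal data-decomposition sketch with pthreads: each thread sums a
 * contiguous slice of the array, and the partial sums are combined after
 * the join. Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

#define N        (1 << 20)
#define NTHREADS 4

static double data[N];

struct slice { size_t begin, end; double partial; };

static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    double acc = 0.0;
    for (size_t i = s->begin; i < s->end; i++)
        acc += data[i];
    s->partial = acc;          /* each thread writes only its own slot */
    return NULL;
}

int main(void)
{
    for (size_t i = 0; i < N; i++)
        data[i] = 1.0;

    pthread_t threads[NTHREADS];
    struct slice slices[NTHREADS];
    size_t chunk = N / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        slices[t].begin = t * chunk;
        slices[t].end   = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&threads[t], NULL, sum_slice, &slices[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(threads[t], NULL);
        total += slices[t].partial;
    }
    printf("sum = %.0f\n", total);
    return 0;
}
```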
Article
As personal computers have become more prevalent and more applications have been designed for them, end users have seen the need for faster, more capable systems. Speedup has been achieved by increasing clock speeds and, more recently, by adding multiple processing cores to the same chip. Although chip speed has increased exponentially over the years, that era is ending, and manufacturers have shifted toward multicore processing. However, increasing the number of cores on a single chip raises challenges with memory and cache coherence as well as communication between the cores. Coherence protocols and interconnection networks have resolved some issues, but until programmers learn to write parallel applications, the full benefit and efficiency of multicore processors will not be attained.

Background
The trend of increasing a processor's speed to get a boost in performance is a way of the past. Multicore processors are the new direction manufacturers are focusing on. Using multiple cores on a single chip is advantageous for raw processing power, but nothing comes for free. With additional cores, power consumption and heat dissipation become concerns and must be simulated before layout to determine the floorplan that best distributes heat across the chip without forming hot spots. Distributed and shared caches on the chip must adhere to coherence protocols to ensure that when a core reads from memory it reads the current piece of data and not a value that has since been updated by a different core. With multicore processors come issues that were previously unforeseen: How will multiple cores communicate? Should all cores be homogeneous, or are highly specialized cores more efficient? And most importantly, will programmers be able to write multithreaded code that can run across multiple cores?
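One concrete way these coherence concerns surface in application code is false sharing: two cores that repeatedly write different variables residing in the same cache line force that line to bounce between their private caches. A common remedy is to pad per-thread data out to a cache-line boundary, as in the sketch below (the 64-byte line size is an assumption):

```c
/* False-sharing illustration: per-thread counters padded to an assumed
 * 64-byte cache line so that two cores incrementing their own counters do
 * not invalidate each other's cached copy of the line. Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS  2
#define LINE_SIZE 64                 /* assumed cache-line size */
#define ITERS     100000000UL

struct padded_counter {
    unsigned long value;
    char pad[LINE_SIZE - sizeof(unsigned long)];  /* keep counters on separate lines */
};

static _Alignas(LINE_SIZE) struct padded_counter counters[NTHREADS];

static void *bump(void *arg)
{
    struct padded_counter *c = arg;
    for (unsigned long i = 0; i < ITERS; i++)
        c->value++;                  /* writes stay within this thread's own line */
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, bump, &counters[t]);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);
    printf("%lu %lu\n", counters[0].value, counters[1].value);
    return 0;
}
```

Removing the padding places both counters in one line and, on typical hardware, slows the loop noticeably because the coherence protocol must repeatedly invalidate and transfer the line between the cores.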
Article
Solaris Application Programming is a comprehensive guide to optimizing the performance of applications running in your Solaris environment. From the fundamentals of system performance to using analysis and optimization tools to their fullest, this wide-ranging resource shows developers and software architects how to get the most from Solaris systems and applications. Whether you're new to performance analysis and optimization or an experienced developer searching for the most efficient ways to solve performance issues, this practical guide gives you the background information, tips, and techniques for developing, optimizing, and debugging applications on Solaris. The text begins with a detailed overview of the components that affect system performance. This is followed by explanations of the many developer tools included with the Solaris OS and the Sun Studio compiler, and then it takes you beyond the basics with practical, real-world examples. In addition, you will learn how to use the rich set of developer tools to identify performance problems, accurately interpret output from the tools, and choose the smartest, most efficient approach to correcting specific problems and achieving maximum system performance. Coverage includes: a discussion of the chip multithreading (CMT) processors from Sun and how they change the way developers need to think about performance; a detailed introduction to the performance analysis and optimization tools included with the Solaris OS and Sun Studio compiler; practical examples for using the developer tools to their fullest, including informational tools, compilers, floating-point optimizations, libraries and linking, performance profilers, and debuggers; guidelines for interpreting tool analysis output; optimization, including hardware performance counter metrics and source code optimizations; techniques for improving application performance using multiple processes or multiple threads; and an overview of hardware and software components that affect system performance, including coverage of SPARC and x64 processors.
Article
This research set out to find the relationships between the memory system and performance in both single-core and multi-core contexts. For the single-core part, several parameters were considered to improve performance; based on this work, the study was extended to multi-core issues. Overall, the simulations produced results similar to the configurations of many current consumer CMPs. Many results were expected: a large L2 cache certainly helped performance, the optimal L1 cache size and system frequency were typical of many Intel Core 2 Duos, and beyond 2-4 cores additional cores did not help FFT performance nearly as much. Some results were more surprising, such as the low speedups that the n-core systems presented, and some experiments failed, such as those for 16-core systems and for the cache coherence tests. In all, the more than 500 recorded simulations showed that in multi-core systems communication overhead and memory latency are limiting factors in performance; finding good cache configurations helps by taking pressure off main memory and reducing communication and cache coherence latencies.
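The low n-core speedups reported here are consistent with Amdahl's law once communication overhead and memory latency are folded into the effective serial fraction: S(n) = 1 / ((1 - p) + p/n), where p is the parallelizable fraction of the work. The short program below simply tabulates that bound for a few illustrative values of p; the fractions are not measurements from this study.

```c
/* Amdahl's-law bound on speedup: S(n) = 1 / ((1 - p) + p / n), where p is
 * the parallelizable fraction of the work. The fractions below are purely
 * illustrative, not measurements from the cited study. */
#include <stdio.h>

static double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    double fractions[] = {0.50, 0.80, 0.95};
    int cores[] = {1, 2, 4, 8, 16};

    printf("%8s", "p \\ n");
    for (size_t j = 0; j < sizeof cores / sizeof *cores; j++)
        printf("%8d", cores[j]);
    printf("\n");

    for (size_t i = 0; i < sizeof fractions / sizeof *fractions; i++) {
        printf("%8.2f", fractions[i]);
        for (size_t j = 0; j < sizeof cores / sizeof *cores; j++)
            printf("%8.2f", amdahl(fractions[i], cores[j]));
        printf("\n");
    }
    return 0;
}
```

Even with 95% of the work parallelized, the bound caps 16 cores at roughly 9x, which is one simple way to see why extra cores stop paying off once serialization and communication dominate.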
Article
Computer performance has been driven largely by decreasing the size of chips while increasing the number of transistors they contain. In accordance with Moore's law, this has caused chip speeds to rise and prices to drop. This ongoing trend has driven much of the computing industry for years. Manufacturers are building chips with multiple cooler-running, more energy-efficient processing cores instead of one increasingly powerful core. The multicore chips don't necessarily run as fast as the highest performing single-core models, but they improve overall performance by handling more work in parallel. Multicores are a way to extend Moore's law so that the user gets more performance out of a piece of silicon. Chip makers AMD, IBM, Intel, and Sun are now introducing multicore chips for servers, desktops, and laptops.