Article · Literature Review

There’s plenty of room at the Top: What will drive computer performance after Moore’s law?


Abstract

From bottom to top: The doubling of the number of transistors on a chip every 2 years, a seemingly inevitable trend that has been called Moore's law, has contributed immensely to improvements in computer performance. However, silicon-based transistors cannot get much smaller than they are today, and other approaches should be explored to keep performance growing. Leiserson et al. review recent examples and argue that the most promising place to look is at the top of the computing stack, where improvements in software, algorithms, and hardware architecture can bring the much-needed boost. Science, this issue p. eaam9744


... Traditional CPU-based architectures are increasingly inadequate for handling the intensive computational demands of DNA data readout. Heterogeneous computing has emerged as a promising paradigm that integrates multiple accelerators, such as CPU, GPU, and FPGA [19,20]. It overcomes the performance bottlenecks of single-processor architectures under high workloads [21][22][23]. ...
... For each pool, a set of index sequences of 12 nucleotides was designed. The shortened Hamming code (24,19) was used to encode a 19-bit binary number into 24 bits, which was then mapped to a 12 nt base sequence. In the generated index set, the first 415,800 index sequences with homopolymer lengths of less than or equal to 3 were used, and each DNA molecule was assigned a unique index. ...
... A shortened systematic Hamming code (24,19) is implemented as the error correction scheme ( Figure 6). The received codeword vector r = c + e is stored in a register, where e represents the error vector. ...
Article
Full-text available
DNA data storage has emerged as a promising alternative to traditional storage media due to its high density and durability. However, large-scale DNA storage systems generate massive sequencing reads, posing substantial computational complexity and latency challenges for data readout. Here, we propose a novel heterogeneous computing architecture based on a field-programmable gate array (FPGA) to accelerate DNA data readout. The software component, running on a general computing platform, manages data distribution and schedules acceleration kernels. Meanwhile, the hardware acceleration kernel is deployed on an Alveo U200 data center accelerator card, executing multiple logical computing units within modules and utilizing task-level pipeline structures between modules to handle sequencing reads step by step. This heterogeneous computing acceleration system enables the efficient execution of the entire readout process for DNA data storage. We benchmark the proposed system against a CPU-based software implementation under various error rates and coverages. The results indicate that under high-error, low-coverage conditions (error rate of 1.5% and coverage of 15×), the accelerator achieves a peak speedup of up to 373.1 times, enabling the readout of 59.4 MB of stored data in just 12.40 s. Overall, the accelerator delivers a speedup of two orders of magnitude. Our proposed heterogeneous computing acceleration strategy provides an efficient solution for large-scale DNA data readout.
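The shortened systematic Hamming (24,19) code described in the excerpts above is compact enough to sketch. The following is a minimal, self-contained illustration of the general construction, a systematic generator/parity-check pair with single-error syndrome decoding; the parity-column ordering below is one valid choice, not necessarily the paper's, and the 12 nt index mapping is omitted.

```python
import numpy as np

M, K = 5, 19                       # parity bits, data bits -> a (24, 19) code
N = K + M

# Columns of A: 19 distinct nonzero 5-bit patterns of Hamming weight >= 2
# (weight-1 patterns are reserved for the parity positions of H).
vals = [v for v in range(1, 32) if bin(v).count("1") >= 2][:K]
A = np.array([[(v >> i) & 1 for v in vals] for i in range(M)])   # 5 x 19

G = np.hstack([np.eye(K, dtype=int), A.T])   # systematic generator, 19 x 24
H = np.hstack([A, np.eye(M, dtype=int)])     # parity-check matrix,   5 x 24

def encode(u):                     # u: 19 data bits -> 24-bit codeword
    return (u @ G) % 2

def decode(r):                     # r = c + e; corrects a single bit error
    s = (H @ r) % 2                # syndrome
    if s.any():                    # nonzero syndrome matches a column of H
        for j in range(N):
            if np.array_equal(H[:, j], s):
                r = r.copy()
                r[j] ^= 1          # flip the erroneous bit
                break
    return r[:K]                   # data bits of the corrected codeword

u = np.random.randint(0, 2, K)
c = encode(u)
e = np.zeros(N, dtype=int); e[7] = 1          # single-bit error vector
assert np.array_equal(decode((c + e) % 2), u)
```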
... Efficiency is critical in hyperscale data centers, with implications not only for cost but also the environment [1,2]. With the end of exponential hardware performance gains, optimizing software and algorithms is increasingly important [3]. Since data centers run a diverse range of workloads [4][5][6], data center efficiency has often focused on system-wide optimizations, affecting large numbers of workloads at the same time. ...
... Our system currently identifies and generates performance savings equivalent to over 500k normalized cores per quarter with a >99.5% production success rate.³ This paper is organized as follows: Section 2 discusses the datasets used to build ECO. Section 3 describes how we localize new opportunities in the code base. ...
... All code samples provided in this paper represent production code, but some of the variables, names, or values within code samples have been renamed. ³Success rate is the percentage of code edits that are successfully submitted in our data centers without causing rollbacks. ...
Preprint
Full-text available
With the end of Moore's Law, optimizing code for performance has become paramount for meeting ever-increasing compute demands, particularly in hyperscale data centers where even small efficiency gains translate to significant resource and energy savings. Traditionally, this process requires significant programmer effort to identify optimization opportunities, modify the code to implement the optimization, and carefully deploy and measure the optimization's impact. Despite a significant amount of work on automating program edits and promising results in small-scale settings, such performance optimizations have remained elusive in large real-world production environments, due to the scale, high degree of complexity, and reliability required. This paper introduces ECO (Efficient Code Optimizer), a system that automatically refactors source code to improve performance at scale. To achieve these performance gains, ECO searches through historical commits at scale to create a dictionary of performance anti-patterns that these commits addressed. These anti-patterns are used to search for similar patterns in a code base of billions of lines of code, pinpointing other code segments with similar potential optimization opportunities. Using a fine-tuned LLM, ECO then automatically refactors the code to generate and apply similar edits. Next, ECO verifies the transformed code, submits it for code review, and measures the impact of the optimization in production. Currently deployed on Google's hyperscale production fleet, this system has driven >25k changed lines of production code, across over 6.4k submitted commits, with a >99.5% production success rate. Over the past year, ECO has consistently resulted in significant performance savings every quarter. On average, the savings produced per quarter are equivalent to over 500k normalized CPU cores.
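ECO itself is proprietary, but the anti-pattern-dictionary step is easy to picture. The sketch below is a deliberately toy version with hypothetical dictionary entries: match source lines against patterns mined from historical optimization commits and surface the fix those commits applied. The production system does this over billions of lines of code and delegates the rewrite to a fine-tuned LLM.

```python
# Toy illustration of the anti-pattern-dictionary idea (not ECO's
# implementation): each entry pairs a code fragment with the fix that
# historical optimization commits applied. Entries are hypothetical.
ANTI_PATTERNS = [
    ("list(sorted(", "sorted() already returns a list; drop the extra copy"),
    ("in dict.keys()", "membership tests and loops work on the dict itself"),
]

def find_opportunities(source: str):
    """Flag source lines that match a known performance anti-pattern."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, hint in ANTI_PATTERNS:
            if pattern in line:
                hits.append((lineno, line.strip(), hint))
    return hits

print(find_opportunities("xs = list(sorted(items))\n"))
```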
... At present, there is considerable interest in increasing the productivity of computing equipment [1][2][3][4]. One approach to solving this problem is the use of parallel-sequential calculations [5,6], which are implemented, among other methods, using the residue number system (RNS) [7,8]. ...
... We also note that the approach using a specific Galois field (or varieties of such fields) is not fundamentally new. Modern binary computing technology is de facto built on the use of a specific field, GF(2). Consequently, the question of developing computing technology that uses the specifics of a particular type of Galois fields in a certain sense corresponds to the existing tradition. ...
... Thus, this paper substantiates the possibility of using another standard of computer technology, one which has the same computing capabilities as existing 16-bit processors but allows for serial-parallel computations using a well-defined RNS. This approach is particularly convenient for computations corresponding to convolutional neural networks due to the fulfillment of criterion (2). In addition, as will become clear, it is the use of quasi-Mersenne numbers that allows us to propose fairly simple computing devices based on standard elements developed for binary logic. ...
Article
Full-text available
It is shown that a serial-parallel processor, comparable in bit capacity to a 16-bit binary processor, can be implemented based on an algorithm built on the residue number system, a distinctive feature of which is the use of the first four quasi-Mersenne primes, i.e., prime numbers representable as 2^k + 1 (namely 3, 5, 17, and 257). Such a set of prime numbers satisfies the criterion 2·p1·p2·p3·p4 + 1 = P, where P is also a prime number. Fulfillment of this criterion ensures the possibility of convenient use of the considered RNS for calculating partial convolutions developed for the convenience of using convolutional neural networks. It is shown that the processor of the proposed type can be based on the use of a set of adders modulo a quasi-Mersenne number, each of which operates independently. A circuit of a modulo 2^k + 1 adder is proposed, which can be called a trigger circuit, since its peculiarity is the existence (at certain values of the summed quantities) of two stable states. The advantage of such a circuit, compared to known analogs, is the simplicity of the design. Possibilities for further development of the proposed approach related to the use of the digital logarithm operation, which allows reducing the operations of multiplication modulo 2^k + 1 to addition operations, are discussed.
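To make the arithmetic concrete: with the moduli 3, 5, 17, and 257 the criterion indeed holds (2·3·5·17·257 + 1 = 131071, a prime), and their product 65535 gives a dynamic range comparable to a 16-bit word. The sketch below shows the residue number system itself: addition and multiplication act on each residue channel independently, which is what lets the proposed modular adders run in parallel, and reconstruction uses the Chinese Remainder Theorem. It illustrates the number system only, not the paper's trigger-circuit adders.

```python
from math import prod

MODULI = [3, 5, 17, 257]      # the quasi-Mersenne primes 2^k + 1
RANGE = prod(MODULI)          # 65535, comparable to a 16-bit word

def to_rns(x):
    return [x % m for m in MODULI]

def rns_add(a, b):            # each channel is an independent adder
    return [(u + v) % m for u, v, m in zip(a, b, MODULI)]

def rns_mul(a, b):
    return [(u * v) % m for u, v, m in zip(a, b, MODULI)]

def from_rns(r):
    """Chinese Remainder Theorem reconstruction."""
    x = 0
    for ri, m in zip(r, MODULI):
        Mi = RANGE // m
        x += ri * Mi * pow(Mi, -1, m)   # modular inverse (Python 3.8+)
    return x % RANGE

x, y = 12345, 4321
assert from_rns(rns_mul(to_rns(x), to_rns(y))) == (x * y) % RANGE
```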
... Not all cores on a chip could be powered on simultaneously, due to limitations in heat dissipation and power delivery. This essentially rendered some transistors inactive, reducing overall efficiency [4] [7]. ...
... In contrast to homogeneous computing, which relies exclusively on Central Processing Units (CPUs) for all tasks, heterogeneous computing combines different types of processors on a single chip: CPUs, which handle general-purpose tasks and control overall system operation; Graphics Processing Units (GPUs), which perform parallel workloads such as rendering and machine learning; and specialized accelerators. Each type of processor in a heterogeneous computing architecture is optimized for specific tasks, leading to improved performance and reduced power consumption for workloads that can be parallelized across these diverse cores, while also allowing efficient communication and resource sharing between all components [7]. ...
Preprint
Full-text available
The evolution of computer architecture has led to a paradigm shift from traditional single-core processors to multi-core and domain-specific architectures that address the increasing demands of modern computational workloads. This paper provides a comprehensive study of this evolution, highlighting the challenges and key advancements in the transition from single-core to multi-core processors. It also examines state-of-the-art hardware accelerators, including Tensor Processing Units (TPUs) and their derivatives, RipTide and the Catapult fabric, and evaluates their strategies for optimizing critical performance metrics such as energy consumption, latency, and flexibility. Ultimately, this study emphasizes the role of reconfigurable systems in overcoming current architectural challenges and driving future advancements in computational efficiency.
... In the current era where Moore's Law and Dennard scaling no longer drive performance improvements, researchers have turned to architecture specialization and domain-specific programming systems for further scaling gains [24]. Data movement is now the dominant cost in execution time and energy [21], and optimizations to reduce data movement must take center stage. ...
Preprint
Full-text available
We describe LEGO, a new approach to optimizing data movement whereby code is expressed as a layout-independent computation and composed with layouts for data and computation. This code generator organization derives complex indexing expressions associated with hierarchical parallel code and data movement for GPUs. LEGO maps from layout specification to indexing expressions, and can be integrated into existing compilers and code templates. It facilitates the exploration of data layouts in combination with other optimizations. We demonstrate LEGO's integration with the MLIR and Triton compilers, and with CUDA templates. We show that LEGO is capable of deriving performance competitive with Triton, and shows broad applicability in its integration with MLIR and CUDA.
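The layout-composition idea can be illustrated with a toy model (our own sketch, not LEGO's actual API or intermediate representation): the computation is written once against logical coordinates, and a separately chosen layout function derives the physical indexing expression.

```python
# A "layout" maps a logical 2D coordinate to a physical offset; the
# computation only ever sees logical coordinates.
def row_major(rows, cols):
    return lambda i, j: i * cols + j

def tiled(rows, cols, th, tw):
    """Tiles stored contiguously; row-major across tiles and within a tile."""
    tiles_per_row = cols // tw
    def idx(i, j):
        tile = (i // th) * tiles_per_row + (j // tw)
        return tile * th * tw + (i % th) * tw + (j % tw)
    return idx

def transpose(src, dst, layout_src, layout_dst, n):
    # Layout-independent computation: swapping layouts changes the derived
    # indexing expressions, not this loop nest.
    for i in range(n):
        for j in range(n):
            dst[layout_dst(j, i)] = src[layout_src(i, j)]

n = 8
src = list(range(n * n))
dst = [0] * (n * n)
transpose(src, dst, row_major(n, n), tiled(n, n, 4, 4), n)
```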
... As society and industry depend on increasingly complex signal processing systems, solutions become more energy-intensive. This trend drives the need for optimized processing pipelines that minimize energy consumption and data bandwidth while improving performance and reliability [1]. The automotive industry is paradigmatic due to the limited energy availability in cars and the need for precise and reliable low-latency systems to achieve fully autonomous driving [2,3]. ...
Article
Full-text available
Automotive radar systems face the challenge of managing high sampling rates and large data bandwidth while complying with stringent real-time and energy efficiency requirements. Neuromorphic computing offers promising solutions because of its inherent energy efficiency and parallel processing capacity. Yet, most sensor systems, such as radars, do not produce suitable data for further neuromorphic processing of the data. This research presents a novel spiking neuron model for signal processing of frequency-modulated continuous wave (FMCW) radars that outperforms the state-of-the-art spectrum analysis algorithms in latency and data bandwidth. These spiking neural resonators are based on the resonate-and-fire neuron model and optimized to dynamically process raw radar data while simultaneously emitting an output in the form of spikes. We designed the first neuromorphic neural network consisting of these spiking neural resonators that estimates range and angle from FMCW radar data, evaluated the network on simulated automotive datasets and compared the results with a state-of-the-art pipeline for radar processing. The proposed neuron model significantly reduces the processing latency compared to traditional frequency analysis algorithms, such as the Fourier transformation (FT), which needs to sample and store entire data frames before processing. The evaluations demonstrate that these spiking neural resonators achieve state-of-the-art detection accuracy while emitting spikes simultaneously to processing and transmitting only 0.02% of the data compared to a float-32 FT. The results showcase the potential for neuromorphic signal processing for FMCW radar systems and pave the way for designing neuromorphic radar sensors.
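The underlying resonate-and-fire mechanism is compact enough to write down. The sketch below uses assumed parameter values and shows only the generic model, not the authors' optimized neuron: a damped complex oscillator tuned to one frequency integrates raw samples as they arrive and spikes when its magnitude crosses a threshold, so a bank of such neurons acts as an online spectrum analyzer that never stores a full frame.

```python
import numpy as np

def resonate_and_fire(signal, freq, fs, damping=50.0, threshold=5e-4):
    """Generic resonate-and-fire neuron tuned to `freq` (Hz): a damped
    complex oscillator integrates the raw samples and spikes (then
    resets) when its magnitude crosses the threshold."""
    dt = 1.0 / fs
    rot = np.exp((-damping + 2j * np.pi * freq) * dt)
    z, spikes = 0.0 + 0.0j, []
    for t, x in enumerate(signal):
        z = z * rot + x * dt          # subthreshold resonator dynamics
        if abs(z) > threshold:
            spikes.append(t)
            z = 0.0                   # reset after the spike
    return spikes

# A neuron tuned to a target's beat frequency responds selectively.
fs, f_beat = 1_000_000, 40_000
t = np.arange(2048) / fs
print(resonate_and_fire(np.cos(2 * np.pi * f_beat * t), f_beat, fs))
```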
... Over the past decade, photonics research has explored accelerated tensor operations, foundational to artificial intelligence (AI) and deep learning [1–4], as a path towards enhanced energy efficiency and performance [5–14]. The field is centrally motivated by finding alternative technologies to extend computational progress in a post-Moore's law and Dennard scaling era [15–19]. Despite these advances, no photonic chip has achieved the precision necessary for practical AI applications, and demonstrations have been limited to simplified benchmark tasks. ...
Article
Full-text available
Over the past decade, photonics research has explored accelerated tensor operations, foundational to artificial intelligence (AI) and deep learning [1–4], as a path towards enhanced energy efficiency and performance [5–14]. The field is centrally motivated by finding alternative technologies to extend computational progress in a post-Moore's law and Dennard scaling era [15–19]. Despite these advances, no photonic chip has achieved the precision necessary for practical AI applications, and demonstrations have been limited to simplified benchmark tasks. Here we introduce a photonic AI processor that executes advanced AI models, including ResNet [3] and BERT [20,21], along with the Atari deep reinforcement learning algorithm originally demonstrated by DeepMind [22]. This processor achieves near-electronic precision for many workloads, marking a notable entry for photonic computing into competition with established electronic AI accelerators [23] and an essential step towards developing post-transistor computing technologies.
... The bandwidth and latency of a memory system are significantly influenced by the interaction between memory access and the "3D" structure of banks, rows, and columns inherent to modern DRAM chips. Therefore, we hold that it is of utmost importance to focus on hardware architecture, algorithm development, and software performance engineering to continuously improve computer applications in this new era [1,2]. ...
Article
Full-text available
Emerging applications like deep neural networks require high off-chip memory bandwidth and low dynamic loaded Double Data Rate SDRAM (DDR) latency. However, under the stringent physical constraints of chip packages and system boards, it is extremely expensive to further increase the bandwidth and reduce the dynamic loaded latency of off-chip memory in terms of DDR devices. To address the latency issues in DDR subsystems, this paper presents a novel architecture aiming at achieving latency optimization through a use case sensitive controller. We propose a reevaluation of conventional decoupling mechanisms and quasi-static arbitration methods in the DDR scheduling architecture. The adaptive scheduling algorithms offer significant advantages in various real-world scenarios. The research methodology involves implementing a rank-level timing aware read/write turnaround arbiter and setting read/write queue thresholds and read/write turnaround settings based on observed patterns. By implementing the arbiter and dynamically adjusting these parameters, the proposed architecture aims to optimize the performance of the DDR subsystem. To validate the effectiveness of the architecture, we conduct multiple experiments. These experiments evaluate the performance of the DDR subsystem under various workloads and configurations. The results demonstrate that the adaptive scheduling algorithms have advantages in achieving DDR performance attributes for workloads and improving system performance. The experimental results provide evidence of the architecture’s effectiveness in reducing latency by around 10% to 50% in various real-world scenarios.
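The use-case-sensitive arbitration described above amounts to hysteresis on the read/write turnaround decision. A toy model of the watermark logic (our illustration; the paper tunes thresholds per workload and implements this in hardware):

```python
# Keep serving the current direction until its queue drains below a low
# watermark or the opposite queue exceeds a high watermark, amortizing
# the costly bus turnaround.
def next_direction(direction, read_q, write_q, lo=2, hi=8):
    if direction == "read" and (not read_q or len(write_q) >= hi):
        return "write"
    if direction == "write" and (len(write_q) <= lo or len(read_q) >= hi):
        return "read"
    return direction

# Reads keep priority until writes back up past the high watermark:
assert next_direction("read", ["r1", "r2"], ["w"] * 8) == "write"
assert next_direction("read", ["r1", "r2"], ["w"] * 3) == "read"
```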
... Motivated by the requirement to detect exoplanet transits in ever-increasing volumes of high cadence stellar light curves, and the apparent stalling of the trend toward increasing CPU clock speeds observed in the last few decades (Leiserson et al. 2020), we present a new transit detection algorithm that takes advantage of GPUs and the highly parallel nature of the task. Cetra has been implemented with CUDA for NVIDIA GPUs, although we're exploring the possibility of porting to other frameworks to enable processing on devices from other manufacturers. ...
Preprint
We present the Cambridge Exoplanet Transit Recovery Algorithm (CETRA), a fast and sensitive transit detection algorithm, optimised for GPUs. CETRA separates the task into a search for transit signals across linear time space, followed by a phase-folding of the former to enable a periodic signal search, using a physically motivated transit model to improve detection sensitivity. It outperforms traditional methods like Box Least Squares and Transit Least Squares in both sensitivity and speed. Tests on synthetic light curves demonstrate that CETRA can identify at least 20 per cent more low-SNR transits than Transit Least Squares in the same data, particularly those of long period planets. It is also shown to be up to a few orders of magnitude faster for high cadence light curves, enabling rapid large-scale searches. Through application of CETRA to Transiting Exoplanet Survey Satellite short cadence data, we recover the three planets in the HD 101581 system with improved significance. In particular, the transit signal of the previously unvalidated planet TOI-6276.03 is enhanced from SNR = 7.9 to SNR = 16.0, which means it may now meet the criteria for statistical validation. CETRA's speed and sensitivity make it well-suited for current and future exoplanet surveys, particularly in the search for Earth analogues. Our implementation of this algorithm uses NVIDIA's CUDA platform and requires an NVIDIA GPU; it is open-source and available from GitHub and PyPI.
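The second stage of the algorithm, folding the linear-time detections into a periodic search, is straightforward to illustrate. Below is a minimal NumPy phase-folding sketch; the actual implementation runs as CUDA kernels and scores candidates with a physically motivated transit model rather than simple binning.

```python
import numpy as np

def phase_fold(time, flux, period, n_bins=200):
    """Fold a light curve on a trial period and average flux per phase
    bin. A true periodic transit stacks up in a few adjacent bins, so
    the deepest binned dip traces the candidate signal."""
    phase = (time % period) / period
    bins = np.minimum((phase * n_bins).astype(int), n_bins - 1)
    binned = np.full(n_bins, np.nan)
    for b in range(n_bins):
        sel = bins == b
        if sel.any():
            binned[b] = flux[sel].mean()
    return binned
```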
... This step also largely determines processing efficiency, since it is at this level that algorithms are designed to solve problems. For example, naively multiplying two 4,096 × 4,096 matrices (containing 10⁹ bits of information) in the Python programming language takes 7 hours, whereas the same computation implemented with instructions specifically optimized for the CPU takes 0.41 seconds on the same machine [52]: a speedup of a factor of 60,000 obtained simply by using a better-suited algorithm and issuing more specific instructions to the compute unit. This example is particularly relevant to deep learning, where the majority of operations come down to matrix multiplications (Sections 2.2 and 2.4). ...
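The 60,000× figure quoted above is Leiserson et al.'s matrix-multiplication case study. The gap is easy to reproduce in miniature; the snippet below uses a reduced size so the interpreted version finishes quickly, and the exact ratio will vary by machine.

```python
import time
import numpy as np

n = 512   # far smaller than the 4,096 of the cited example, so the
          # interpreted version finishes in seconds rather than hours

def matmul_naive(A, B):
    """Triple-loop matrix multiply on plain Python lists."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            a = A[i][k]
            for j in range(n):
                C[i][j] += a * B[k][j]
    return C

A, B = np.random.rand(n, n), np.random.rand(n, n)

t0 = time.perf_counter()
matmul_naive(A.tolist(), B.tolist())
t1 = time.perf_counter()
A @ B                                 # BLAS-backed, cache- and SIMD-aware
t2 = time.perf_counter()
print(f"naive: {t1 - t0:.1f} s   BLAS: {t2 - t1:.4f} s")
```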
Thesis
Full-text available
This thesis explores new methods to overcome the challenges of scaling quantum computers, focusing specifically on the automatic calibration of spin qubits and quantum error corrections. The proposed solutions rely on artificial neural networks integrated on emerging electronic components known as memristors. These components leverage in-memory computing for machine learning applications, surpassing traditional processors by a factor of 100 in terms of energy efficiency. In the first phase, an automatic quantum dot calibration method is developed and experimentally validated. This approach relies on artificial neural networks to identify patterns in noisy current measurements, confining a single electron within a nanometer-scale device without human intervention. Special attention is given to quantifying the uncertainty of these networks, leading to a significant reduction in calibration errors. The results show a success rate of over 99% on an offline dataset and 95% for a real-time experiment, demonstrating the robustness of the proposed method. In the second phase, a memristor-based architecture is proposed to efficiently process syndromes generated by quantum error correction codes. This cryogenic decoder architecture meets the strict time and energy requirements necessary for integration near the qubits, though with some performance loss due to hardware imperfections. To address this limitation, a hardware-aware re-training method that accounts for these imperfections is developed, restoring the original fidelity of the decoder. This research offers concrete solutions to current issues in quantum computing while highlighting the potential of memristors for the design of low-energy artificial intelligence applications.
... Moreover, up to 68% of the CPU cycles are used by the replication operations. With Moore's law slowing [30], the prospects for significant future improvements in CPU performance are limited; therefore, to scale the performance of strongly consistent protocols, developing specialized hardware is becoming a reasonable option. ...
Preprint
Full-text available
Today's datacenter applications rely on datastores that are required to provide high availability, consistency, and performance. To achieve high availability, these datastores replicate data across several nodes. Such replication is managed through a reliable protocol designed to keep the replicas consistent using a consistency model, even in the presence of faults. For several applications, strong consistency models are favored over weaker consistency models, as the former guarantee a more intuitive behavior for clients. Furthermore, to meet the demands of high online traffic, datastores must offer high throughput and low latency. However, delivering both strong consistency and high performance simultaneously can be challenging. Reliable replication protocols typically require multiple rounds of communication over the network stack, which introduces latency and increases the load on network resources. Moreover, these protocols consume considerable CPU resources, which impacts the overall performance of applications, especially in high-throughput environments. In this work, we aim to design a hardware-accelerated system for replication protocols to address these challenges. We approach offloading the replication protocol onto SmartNICs, which are specialized network interface cards that can be programmed to implement custom logic directly on the NIC. By doing so, we aim to enhance performance while preserving strong consistency, all while saving valuable CPU cycles that can be used for applications' logic.
Article
Full-text available
The accurate interpretation of electromyography (EMG) signals is crucial for medical diagnostics and rehabilitation systems. However, the inherent presence of noise, including motion artifacts and powerline interference, significantly hampers signal clarity. This research presents a novel, cost-effective approach for advanced EMG signal denoising using an Arduino-based platform integrated with custom signal processing techniques. By leveraging optimized digital filtering algorithms, the proposed system effectively suppresses noise while preserving critical muscle activity patterns. Experimental results demonstrate a significant improvement in signal-to-noise ratio (SNR), providing clean and reliable EMG data for enhanced medical analysis and real-time biomedical applications. The simplicity, affordability, and efficiency of the system position it as a promising solution for portable healthcare and rehabilitation devices.
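The paper does not spell out its filter designs, so as one plausible ingredient, here is a minimal powerline notch filter of the kind commonly used for EMG cleanup; the sampling rate, notch frequency, and quality factor below are assumptions.

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt

fs = 1000.0           # assumed sampling rate (Hz)
f0, Q = 50.0, 30.0    # assumed powerline frequency and notch quality factor

b, a = iirnotch(f0, Q, fs=fs)

def remove_powerline(emg):
    """Zero-phase notch filtering: suppresses the 50 Hz interference
    without shifting muscle-activation onsets."""
    return filtfilt(b, a, emg)

t = np.arange(0, 2.0, 1 / fs)
noisy = np.random.randn(t.size) * 0.1 + 0.5 * np.sin(2 * np.pi * f0 * t)
clean = remove_powerline(noisy)
```

Zero-phase filtering via filtfilt preserves the timing of muscle activity, which matters for diagnostics; a causal real-time variant on a microcontroller would apply the same coefficients with a one-pass filter instead.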
... Rising throughput and latency demands from scientific algorithms mean that they can no longer rely on advancements in computer architecture or fabrication processes for uplifts in performance. Instead, they must take advantage of changes across the stack by reallocating silicon budget to custom accelerators in both FPGA and ASIC designs [17]. Resource-constrained environments, such as embedded real-time control systems and signal processing algorithms, have already adopted this approach to develop next-generation electric motors and software-defined radios. ...
Preprint
Co-developing scientific algorithms and hardware accelerators requires domain-specific knowledge and large engineering resources. This leads to a slow development pace and high project complexity, which creates a barrier to entry that is too high for the majority of developers to overcome. We are developing a reusable end-to-end compiler toolchain for the Julia language entirely built on permissively-licensed open-source projects. This unifies accelerator and algorithm development by automatically synthesising Julia source code into high-performance Verilog.
... As society and industry depend on increasingly complex signal processing systems, solutions become more energy-intensive. This trend drives the need for optimized processing pipelines that minimize energy consumption and data bandwidth while improving performance and reliability [15]. The automotive industry is paradigmatic due to the limited energy availability in cars and the need for precise and reliable low-latency systems to achieve fully autonomous driving [16], [28]. ...
Preprint
Full-text available
Automotive radar systems face the challenge of managing high sampling rates and large data bandwidth while complying with stringent real-time and energy efficiency requirements. The growing complexity of autonomous vehicles further intensifies these requirements. Neuromorphic computing offers promising solutions because of its inherent energy efficiency and parallel processing capacity. This research presents a novel spiking neuron model for signal processing of frequency-modulated continuous wave (FMCW) radars that outperforms the state-of-the-art spectrum analysis algorithms in latency and data bandwidth. These spiking neural resonators are based on the resonate-and-fire neuron model and optimized to dynamically process raw radar data while simultaneously emitting an output in the form of spikes. We designed the first neuromorphic neural network consisting of these spiking neural resonators that estimates range and angle from FMCW radar data. We evaluated the range-angle maps on simulated datasets covering multiple scenarios and compared the results with a state-of-the-art pipeline for radar processing. The proposed neuron model significantly reduces the processing latency compared to traditional frequency analysis algorithms, such as the Fourier transformation (FT), which needs to sample and store entire data frames before processing. The evaluations demonstrate that these spiking neural resonators achieve state-of-the-art detection accuracy while emitting spikes simultaneously to processing and transmitting only 0.02 % of the data compared to a float-32 FT. The results showcase the potential for neuromorphic signal processing for FMCW radar systems and pave the way for designing neuromorphic radar sensors.
... With the increasing miniaturization and integration of electronic devices, the dramatic rise in thermal density presents a giant challenge to the stability and lifespan of electronic devices. [1][2][3] Consequently, the development of efficient and novel thermal management materials has become a focal point of scientific research. Emerging 2D materials like MXene, hexagonal boron nitride (BN), etc., exhibit promising thermal conductive performance. ...
Article
Full-text available
The growing heat flow density from the miniaturization trend of electronic devices seriously challenges the heat diffusion in electronic systems. Consequently, there is an increasing demand for thermal management materials with both thermal conductivity (K) and material thickness (d) to effectively transfer devices’ heat flux. Graphene films (GFs) with high K have attracted significant attention, but achieving both high K and large d remains challenging due to graphene's intrinsic properties and fabrication limitations. Here, a novel non‐stacking strategy is proposed for fabricating monolithic thick GFs. By utilizing the ultra‐small‐sized graphene oxide slurry, introducing multi‐line shearing, and utilizing a specially designed frame, stable and highly oriented thick films are successfully produced. These thick films eliminate the interfacial defects and enable a monolithic GF with ultra‐high K over 1600 W m⁻¹ K⁻¹ (improved by 17.03%) when d exceeds 300 µm compared to the conventional multi‐layer stacking method. While the K × d value, which represents the film's heat transfer capability, increased by 21.34% to 0.544 W K⁻¹, the chip's operating temperature further decreased by 3.3 °C. The proposed strategy provides a promising solution to produce high‐performance thick GFs and represents an effective route for heat dissipation of electronic systems.
... The OpenBLAS library also contains separate implementations using RVV 0.7.1 and RVV 1.0 intrinsics. However, many codes have potential for additional optimization [33], and OpenBLAS is no exception. ...
Preprint
Full-text available
The rapid development of RISC-V instruction set architecture presents new opportunities and challenges for software developers. Is it sufficient to simply recompile high-performance software optimized for x86-64 onto RISC-V CPUs? Are current compilers capable of effectively optimizing C and C++ codes or is it necessary to use intrinsics or assembler? Can we analyze and improve performance without well-developed profiling tools? Do standard optimization techniques work? Are there specific RISC-V features that need to be considered? These and other questions require careful consideration. In this paper, we present our experience optimizing four BLAS algorithms for band matrix operations on RISC-V processors. We demonstrate how RISC-V-optimized implementations of OpenBLAS algorithms can be significantly accelerated through improved vectorization of computationally intensive loops. Experiments on Lichee Pi 4A and Banana Pi BPI-F3 devices using RVV 0.7.1 and RVV 1.0 vector instruction sets respectively, show speedups of 1.5x to 10x depending on the operation compared to the OpenBLAS baseline. In particular, the successful use of vector register grouping with RVV can lead to significant performance improvements.
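For readers unfamiliar with the kernels in question, the shape of a band-matrix loop nest looks like the reference version below, written in plain Python with BLAS-style banded storage; the paper's speedups come from rewriting such loops with RVV intrinsics, not from anything Python-level.

```python
import numpy as np

def gbmv(n, kl, ku, AB, x):
    """Reference band matrix-vector product y = A @ x, with A stored in
    BLAS-style banded layout: AB[ku + i - j, j] = A[i, j]."""
    y = np.zeros(n)
    for j in range(n):                        # the loop being vectorized
        lo, hi = max(0, j - ku), min(n, j + kl + 1)
        for i in range(lo, hi):
            y[i] += AB[ku + i - j, j] * x[j]
    return y

# Tridiagonal example: kl = ku = 1, so AB has kl + ku + 1 = 3 rows.
n, kl, ku = 5, 1, 1
A = (np.diag(np.full(n, 2.0)) + np.diag(np.full(n - 1, -1.0), 1)
     + np.diag(np.full(n - 1, -1.0), -1))
AB = np.zeros((kl + ku + 1, n))
for i in range(n):
    for j in range(max(0, i - kl), min(n, i + ku + 1)):
        AB[ku + i - j, j] = A[i, j]
x = np.arange(1.0, n + 1)
assert np.allclose(gbmv(n, kl, ku, AB, x), A @ x)
```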
... The crystalline distortion lifts the band degeneracy such that the correlation energy becomes comparable to the bandwidth and hence, the MIT is triggered. The Mott criterion for the electronic transition is given as n_c^{1/3} a_H* ≈ 0.26, and the resulting insulating gap goes beyond what is expected from CDW-induced band gaps alone. As illustrated in Figure 3b, the NCCDW phase also contains star-of-David clusters, albeit arranged in a less uniform pattern. ...
Preprint
Full-text available
Volatile threshold resistive switching and neuronal oscillations in phase-change materials, specifically those undergoing metal-to-insulator and charge density wave transitions, offer unique attributes such as fast and low-field volatile switching, tunability, and non-linear behaviors. These characteristics are particularly promising for emulating neuronal behavior and thus hold great potential for realizing energy-efficient neuromorphic computing. In this review, we summarize recent advances in the development of neuronal oscillator devices based on three archetypal electronic phase-change materials: the correlated oxide VO2, the charge density wave transition metal dichalcogenide 1T-TaS2, and the emerging phase-change chalcogenide perovskite BaTiS3. We discuss progress from the perspective of materials development, including structural phase transitions, synthesis methods, electrical properties, and device implementation. Finally, we emphasize the major challenges that must be addressed for practical applications of these phase-change materials and provide our outlook on the future research directions in this rapidly evolving field.
... As traffic simulation models continue to evolve, there is a growing demand to enhance their scale and level of detail, significantly increasing the computational requirements of such software [4]. At the same time, CPU (Central Processing Unit) clock rates have plateaued for nearly two decades, and the miniaturization in semiconductor design is approaching physical limits [5]. Realizing performance improvements to accommodate computationally expensive simulations while maintaining fast execution times requires leveraging the parallel computing capabilities of modern computer hardware. ...
Article
Full-text available
Striving for better simulation results, transport planners want to simulate larger domains with increased levels of detail. Achieving fast execution times for these complex traffic simulations requires the parallel computing power of modern hardware. This paper presents an architectural update to the MATSim traffic simulation framework, introducing a prototype that adapts the existing traffic flow model to a distributed parallel algorithm. The prototype is capable of scaling across multiple compute nodes, utilizing the parallel computing power of modern hardware. Benchmarking reveals a 119-fold improvement in execution speed over the current implementation, and a 43 times speedup when compared to single-core performance. The prototype can simulate 24 h of large-scale traffic in just 3.5 s. Based on these results, we advocate for integrating a distributed simulation approach into MATSim and outline steps for further optimizing the prototype for large-scale applications.
... Although typical CT imaging has lower resolution than histological imaging, it is nondestructive and can be performed on living and non-living samples (Bouxsein et al., 2010). Coupled with advances in high-performance computing (Leiserson et al., 2020) and availability of commercial scanners, CT is now accessible, reproducible, and relatively fast: medical CT and fine-resolution micro-CT units can collect data in minutes to several hours. Therefore, the ability to collect large volumes of CT-based data has enabled increasingly complex biomedical, ecological, and evolutionary projects related to bone architecture (e.g., Doube et al., 2009; Harbers et al., 2020; Houssaye & Botton-Divet, 2018; Mosey et al., 2017). ...
Article
Full-text available
Computed tomography (CT) enables rapid imaging of large‐scale studies of bone, but those datasets typically require manual segmentation, which is time‐consuming and prone to error. Convolutional neural networks (CNNs) offer an automated solution, achieving superior performance on image data. In this methodology‐focused paper, we used CNNs to train segmentation models from scratch on 2D and 3D patches from micro‐CT scans of otter long bones. These new models, collectively called BONe (Bone One‐shot Network), aimed to be fast and accurate, and we expected enhanced results from 3D training due to better spatial context. Contrary to expectations, 2D models performed slightly better than 3D models in labeling details such as thin trabecular bone. Although lacking in some detail, 3D models appeared to generalize better and predict smoother internal surfaces than 2D models. However, the massive computational costs of 3D models limit their scalability and practicality, leading us to recommend 2D models for bone segmentation. BONe models showed potential for broader applications with variation in performance across species and scan quality. Notably, BONe models demonstrated promising results on skull segmentation, suggesting their potential utility beyond long bones with further refinement and fine‐tuning.
... These problems, together with the growing heat generation in denser processors, complicate the continuation of Moore's law [24]. The semiconductor industry is looking toward new strategies, such as optimization through software and algorithms, that allow devices to use their hardware more efficiently. In addition, innovative architectures are emerging, such as heterogeneous computing systems (CPU, GPU, and neural processing units on a single chip) and the use of advanced materials that could replace silicon in the future. ...
Article
Full-text available
Advances in nanotechnology have brought closer the possibility of lighter cell phones with longer-lasting batteries, faster processors, and even flexible displays. This science, which manipulates matter at the nanometer scale (<100 nm), has been key to the evolution of the electronics and mobile industries. This article explores how innovations in nanotechnology have redefined mobile devices and their impact on the future of cell phones, highlighting both their benefits and their limitations. As nanotechnology advances, new possibilities emerge that transform the digital experience and could be applied to other fields.
... According to Tanenbaum and Bos (2024), operating systems can slow down as they grow, accumulating additional features that are not strictly necessary and that force the hardware to execute ever more tasks. In other words, one of the main challenges in getting the best performance out of an operating system is the growth of its own features and functionality. In addition, Leiserson et al. (2020) note that, given the slowdown in hardware evolution in recent years due to its increasing fabrication difficulty, the only way to confront these difficulties for now is to develop and implement more efficient algorithms and to look for alternatives to hardware performance growth based on the miniaturization of silicon cores. ...
... To overcome these inefficiencies, silicon photonics has recently emerged as a groundbreaking solution to the limitations of conventional electronic architectures. As electronic accelerators face inherent limitations in the post-Moore era, including high fabrication costs and diminishing performance returns [6], the transmission of data over metallic wires presents significant bandwidth and energy bottlenecks. Silicon photonics, with its ultra-high bandwidth, low latency, and energy-efficient data communication, has emerged as a promising solution. ...
Preprint
Full-text available
Generative Adversarial Networks (GANs) are at the forefront of AI innovation, driving advancements in areas such as image synthesis, medical imaging, and data augmentation. However, the unique computational operations within GANs, such as transposed convolutions and instance normalization, introduce significant inefficiencies when executed on traditional electronic accelerators, resulting in high energy consumption and suboptimal performance. To address these challenges, we introduce PhotoGAN, the first silicon-photonic accelerator designed to handle the specialized operations of GAN models. By leveraging the inherent high throughput and energy efficiency of silicon photonics, PhotoGAN offers an innovative, reconfigurable architecture capable of accelerating transposed convolutions and other GAN-specific layers. The accelerator also incorporates a sparse computation optimization technique to reduce redundant operations, improving computational efficiency. Our experimental results demonstrate that PhotoGAN achieves at least 4.4x higher GOPS and 2.18x lower energy-per-bit (EPB) compared to state-of-the-art accelerators, including GPUs and TPUs. These findings showcase PhotoGAN as a promising solution for the next generation of GAN acceleration, providing substantial gains in both performance and energy efficiency.
... The chip, the heart of electronic devices, has continually developed towards higher integration and greater miniaturization. According to the famous 'Moore's Law' [1] (the number of transistors that can be accommodated on an integrated circuit doubles approximately every 18 months), integrated circuits (ICs) have also evolved from the earlier two-dimensional structure to the current three-dimensional structure (Three-Dimensional Integrated Circuit, 3D-IC). A three-dimensional integrated circuit is a circuit made by stacking silicon wafers and interconnecting them vertically using Through-Silicon Vias (TSVs). ...
Chapter
Full-text available
As integrated circuits develop towards higher integration and miniaturization, the diameter of a single micro-interconnection structure has approached the scale of several nanometers, which means a cross-section consists of only one or a few crystals. At such a microscopic scale, some problems start to emerge. In this study, a micro-tensile test method based on the FIB system was developed and the strength of copper thin-film samples was measured. The crystallinity was quantitatively determined by applying IQ values obtained from EBSD analysis. Multiple single-crystal and bicrystal copper specimens were used for tensile testing. By establishing a curve between the strength and IQ value of these copper samples, it was found that the strength of microscopic copper samples did not change monotonically with the IQ value. A degree of crystallinity at which the mechanical properties may reach a critical point was identified, which provides a basis for controlling the reliability of TSV copper interconnection structures.
... As electronic devices continue to shrink in size [1][2][3][4][5][6][7][8][9][10], their performance is increasingly governed by nanoscale phenomena, where surface effects, charge dynamics, and localized interactions play a pivotal role [11][12][13][14]. Modeling and optimizing these devices require simulations capable of resolving dynamic processes at ultrafast timescales. ...
Article
Full-text available
Interference gating (iGate) has emerged as a groundbreaking technique for ultrafast time-resolved electron holography in transmission electron microscopy, delivering nanometer spatial and nanosecond temporal resolution with minimal technological overhead. This study employs iGate to dynamically observe the local projected electric potential within the space-charge region of a contacted transmission electron microscopy (TEM) lamella manufactured from a silicon diode during switching between unbiased and reverse-biased conditions, achieving a temporal resolution of 25 ns at a repetition rate of 3 MHz. By synchronizing the holographic acquisition with the applied voltage, this approach enables the direct visualization of time-dependent potential distributions with high precision. Complementary static and dynamic experiments reveal a remarkable correspondence between modeled and measured projected potentials, validating the method’s robustness. The observed dynamic phase progressions resolve and allow one to differentiate between localized switching dynamics and preparation-induced effects, such as charge recombination near the sample edges. These results establish iGate as a transformative tool for operando investigations of semiconductor devices, paving the way for advancing the nanoscale imaging of high-speed electronic processes.
Article
Full-text available
In the post‐Moore era, single‐atom magnets and metal‐fullerene clusters are gradually replacing conventional magnetic storage semiconductor devices due to their high‐density magnetic storage capability. However, the stability of these materials at room temperature remains a challenge. A solution to this problem is proposed by doping atomically precise gold nanoclusters (NCs) with heteroatoms to induce high‐ and low‐spin isomers, which are governed by the point group symmetry of the metallic core, using time‐dependent density functional theory (TD‐DFT) combined with the complete active space self‐consistent field (CASSCF) method. Based on the field‐dependent magnetic susceptibility and electron paramagnetic resonance, the high‐ and low‐spin isomers of M@Au8 NCs (M = Fe, Cr, Mn) are formed by core–shell electron coupling to yield stable magnetism; all show paramagnetic properties with the magnetic order remaining intact, and they are capable of stable information storage at room temperature. These computational results provide a novel research direction for the development of magnetic semiconductor switching devices.
Article
Full-text available
The continuous evolution of field-effect transistor (FET) technologies is essential to address the increasing demand for energy-efficient and high-performance electronics. This review provides a comprehensive analysis of advanced low-power FETs, focusing on semiconductor materials, architectures, fabrication techniques and applications. Emerging materials such as 2D semiconductors, IGZO (indium gallium zinc oxide), TMDs (transition metal dichalcogenides) and III–V compounds play a pivotal role in enabling innovative FET topologies like FinFETs, stacked nanosheet FETs (NSFETs), vertical NSFETs, TreeFETs and complementary FETs. These materials, with superior properties such as high-mobility channels, improved scalability and energy efficiency, are critical in overcoming the challenges posed by conventional CMOS technology node scaling. In particular, NSFETs are anticipated to substitute the state-of-the-art nanowire FET and FinFET devices due to their ability to provide better electrostatic control and tunable channel widths. This transition is expected to reshape the semiconductor technology in the years ahead. A critical aspect of integrating these architectures lies in the advanced fabrication steps such as epitaxial growth techniques, spacer-based lithography and high-k metal gate (HKMG) integration that enables precise control over device dimensions and enhancing performance. These innovations facilitate the integration of advanced architectures for diverse applications including logic circuits, memory devices including SRAM, MRAM, sensing technologies and RF applications. By incorporating these material innovations with architectural advancements, this study highlights their combined potential to address current progress and challenges, driving the future of low-power FETs and shaping sustainable high-performing modern electronics. Graphical Abstract
Article
We present the Cambridge Exoplanet Transit Recovery Algorithm (cetra), a fast and sensitive transit detection algorithm, optimized for GPUs. cetra separates the task into a search for transit signals across linear time space, followed by a phase-folding of the former to enable a periodic signal search, using a physically motivated transit model to improve detection sensitivity. It outperforms traditional methods like Box Least Squares and Transit Least Squares in both sensitivity and speed. Tests on synthetic light curves demonstrate that cetra can identify at least 20 per cent more low-SNR transits than Transit Least Squares in the same data, particularly those of long period planets. It is also shown to be up to a few orders of magnitude faster for high cadence light curves, enabling rapid large-scale searches. Through application of cetra to Transiting Exoplanet Survey Satellite short cadence data, we recover the three planets in the HD 101581 system with improved significance. In particular, the transit signal of the previously unvalidated planet TOI-6276.03 is enhanced from SNR = 7.9 to SNR = 16.0, which means it may now meet the criteria for statistical validation. cetra's speed and sensitivity make it well-suited for current and future exoplanet surveys, particularly in the search for Earth analogues. Our implementation of this algorithm uses NVIDIA's CUDA platform and requires an NVIDIA GPU; it is open-source and available from GitHub and PyPI.
Article
Full-text available
As the electronics industry advances toward enhanced performance and miniaturization, the high heat flux generated poses significant challenges for maintaining operational stability. Carbon‐based thermally conductive films, including those derived from graphite, graphene, and polyimide, have shown notable in‐plane thermal conductivity (Kin), making them increasingly valuable for electronic heat dissipation. However, their cross‐plane thermal conductivity (Kout) remains suboptimal, typically not exceeding 8 W m⁻¹ K⁻¹, which limits their overall heat transfer efficiency under elevated heat flux density. Herein, an innovative approach to fabricate aramid‐derived graphite films (AGFs) characterized by minimal defects, large grain sizes, and well‐ordered stacking through the graphitization of aramid films (AFs) is proposed. Notably, after thermal annealing at 3000 °C, the AGFs exhibit impressive bidirectional thermal conductivity, achieving a Kin of up to 1754 W m⁻¹ K⁻¹ and a Kout of 14.2 W m⁻¹ K⁻¹. High‐performance AGFs demonstrate exceptional cooling efficiency in simulated smartphone thermal management scenarios, facilitating rapid heat transfer crucial for the thermal management of high‐power semiconductor chips. This work contributes critical insights into the synthesis of high‐quality graphite films from AFs and offers guidance for the design of bidirectionally thermally conductive graphite films tailored for effective electronic thermal management.
Article
Full-text available
Free‐space optical systems are emerging as a hardware platform for high‐throughput and energy‐efficient computing. In this review, the pioneering works are first introduced to lay the foundation for the principles and architectures of systems. The modern hardware implementations of two types of optical computing systems, matrix, and vector multiplication systems and diffractive optical neural network systems, are covered from material, device, and system perspectives. Further, the system deployment to various applications is also discussed. This review serves as an introduction and guideline to the current progress of developing and utilizing free‐space optical computing systems in various domains.
Article
Full-text available
As social networks and related data processes have grown exponentially in complexity, the efficient resolution of combinatorial optimization problems has become increasingly crucial. Recent advancements in probabilistic computing approaches have demonstrated significant potential for addressing these problems more efficiently than conventional deterministic computing methods. In this study, we demonstrate a highly durable probabilistic bit (p‐bit) device utilizing two‐dimensional materials, specifically hexagonal boron nitride (h‐BN) and tin disulfide (SnS2) nanosheets. By leveraging the inherently stochastic nature of electron trapping and detrapping at the h‐BN/SnS2 interface, the device achieves durable probabilistic fluctuations over 10⁸ cycles with minimal energy consumption. To mitigate the static power consumption, we integrated an active switch in series with a p‐bit device, replacing conventional resistors. Furthermore, employing the pulse width as the control variable for probabilistic switching significantly enhances noise immunity. We demonstrate the practical application of the proposed p‐bit device in implementing invertible Boolean logic gates and subsequent integer factorization, highlighting its potential for solving complex combinatorial optimization problems and extending its applicability to real‐world scenarios such as cryptographic systems.
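The p-bit abstraction that such a device realizes has a standard mathematical form: a random ±1 output whose mean follows tanh of the input, which is what makes networks of p-bits usable for invertible logic and factorization. A two-line software model of that relation (the device produces the randomness physically, via charge trapping and detrapping):

```python
import numpy as np

rng = np.random.default_rng()

def pbit(I):
    """Probabilistic bit: a random +/-1 output whose mean is tanh(I),
    the standard p-bit input-output relation."""
    return 1 if rng.uniform(-1.0, 1.0) < np.tanh(I) else -1

# Zero input -> fair coin; large |I| -> nearly deterministic output.
samples = [pbit(0.0) for _ in range(10_000)]
print(sum(samples) / len(samples))    # close to 0
```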
Chapter
Field-Programmable Gate Arrays (FPGAs) emerged in the mid-1980s, providing a way to produce complex custom hardware at one’s desk. The concept of hardware reconfigurability that was thus created is truly fascinating: An integrated circuit can be customized with no manufacturing steps whatsoever; all that is needed is loading an appropriate configuration into the FPGA’s memory. The key principle that made this possible was the use of stored-select multiplexers for connecting prefabricated logic blocks. However, this opened two fundamental questions of reconfigurable architecture design: (1) which logic blocks should be prefabricated and (2) how should the stored-select multiplexer network be organized.
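The stored-select multiplexer principle is exactly what a lookup table (LUT) implements: configuration memory holds a truth table, and the logic inputs merely select one stored bit, so the same prefabricated block becomes any k-input function with no manufacturing step. A minimal software model:

```python
# A 4-input LUT is a stored-select multiplexer: the configuration word
# holds a 16-entry truth table; the inputs select one stored bit.
def lut4(config, a, b, c, d):
    index = (a << 3) | (b << 2) | (c << 1) | d
    return (config >> index) & 1

# Reconfigure the same block as AND or XOR by reloading 16 bits.
AND4 = 1 << 15                                   # only input 1111 -> 1
XOR4 = sum(1 << v for v in range(16) if bin(v).count("1") % 2)
assert lut4(AND4, 1, 1, 1, 1) == 1 and lut4(AND4, 0, 1, 1, 1) == 0
assert lut4(XOR4, 1, 0, 0, 0) == 1 and lut4(XOR4, 1, 1, 0, 0) == 0
```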
Article
Modulating the interface between the molecule and electrode is an effective way to enhance the spin-polarized transport properties of molecular junctions. In this study, using first-principles calculations combined with the...
Article
Full-text available
Manipulation of thermal transport in oxides is exceedingly important for their thermal based applications. The continuous pursuit of the thermal conductivity limit calls for the rise of more effective thermal manipulation approaches while it remains a long-standing challenge. In this study, a demonstration of the effective manipulation of thermal transport in a series of Ruddlesden–Popper (SrTiO3)nSrO superlattice films is reported. Structural engineering, including tailoring of interfaces and the boundary, is used, and an exceptionally low thermal conductivity of 0.65 W m⁻¹ K⁻¹ is achieved. A model derived from the Boltzmann equation is raised to elucidate thermal transport behaviors, which reveals the dominant roles of interface- and boundary-phonon scatterings in the thermal transport manipulation. In the modeling, a scattering coefficient is proposed to evaluate the effectiveness of interfaces in scattering phonons, in a quantitative measure. The growth quality of interfaces can also be reflected by this scattering coefficient. This approach holds promise to be extended to other oxide systems and will facilitate their thermal based applications, including thermal insulation and thermoelectricity.
Article
Nanostructured materials and nanolattices with high porosity can have novel optical and mechanical properties that are attractive for nanophotonic devices. One existing challenge is the integration of microstructures that can be used as waveguides or electrodes on such nanostructures without filling in the pores. This study investigates the fabrication of TiO2 microstructures on nanolattices using a stencil mask. In this approach, the nanostructures are planarized with a polymer film while the microstructures are patterned in a sequential shadow deposition step. Our results demonstrate the successful fabrication of a “dog-bone” microstructure with 400 μm length, 100 μm width, and 30–560 nm thicknesses on nanostructure with 390 and 500 nm period. The experimental results show that cracks can form in the microstructures, which can be attributed to residual stress and the thermal annealing cycle. A key finding is that the film cracks decrease as the TiO2 layer becomes thinner, highlighting an important relationship between grain size distribution and the film thickness. The mechanical stability of the underlying nanolattices also plays a key role, where interconnected architecture mitigated the crack formation when compared with isolated structures. The demonstrated fabrication process can lead to integrated waveguides and microelectrodes on nanolattices, which can find applications for next-generation photonic and electronic devices.
Preprint
We demonstrate reactively sputtered polycrystalline Al₂O₃:Er³⁺ waveguide amplifiers exhibiting external fiber-to-fiber net gain, broadband amplification, and low noise figure. With an erbium concentration of 1.5 × 10²⁰ ions/cm³, a 30 cm amplifier length, and bi-directional pumping at 1480 nm, >14 dB of external gain at 1550 nm is shown with off-chip output powers of over 56 mW measured at the output fiber, as well as sustained gain across 60 nm of bandwidth featuring a fiber-to-fiber NF of 5.6 dB at 1566 nm. Such a device is made possible using polycrystalline Al₂O₃:Er³⁺ capable of high temperature LPCVD SiO₂ cladding, and therefore low coupling and background waveguide losses down to 2.5 dB/facet and ~5 dB/m respectively due to high conformality and optical quality. This demonstration highlights the importance and advantages of using polycrystalline Al₂O₃:Er³⁺ for applications that require optical amplification on a scalable and versatile photonic waveguide platform.
Article
Full-text available
The article discusses various conditions of contemporary artificial intelligence, namely deep learning mechanisms, to emphasize its limitations, and argues for an antihumanistic view of contemporary technology. It starts by affirming the Turing test and argues that machines can in fact be intelligent, but that this intelligence must not be tied to the capitalistically hyped idea of artificial general intelligence. It then outlines the various conditions on which deep learning depends for its functioning (brute computing power, capitalist datafication, the world of contingency). These conditions reveal an epistemological schism in the field of artificial intelligence (between symbolic AI and connectionism) that could be overcome by abandoning the idea of artificial general intelligence and the competitive relation between humans and machines.
Preprint
Full-text available
Achieving optical computing with thousands of tera-operations per second per watt per square millimeter (TOPs/W/mm²) is the key to surpassing electrical computing. This realization requires a breakthrough in the design of a new optical computing architecture and nonlinear activation functions. In this work, we propose an on-chip picosecond spiking optical neural network architecture, which can be expected to achieve 2.13×10³ TOPs/mm². By leveraging the Kerr effect of silicon and the saturable absorption of graphene, we designed an all-optical nonlinear activator based on a graphene-silicon integrated photonic crystal cavity. The ultralow-threshold, high-speed, compact, and reconfigurable all-optical nonlinear activator could achieve a 4 fJ activation energy threshold, a 1.05 ps response time, and an ultrasmall size of 15 µm × 10 µm. This device provides foundation blocks for the picosecond spiking optical neural network chip to achieve 10⁶ TOPs/W/mm² level optical computing.
Article
Full-text available
The phosphoinositide family of membrane lipids play diverse and critical roles in eukaryotic molecular biology. Much of this biological activity derives from interactions of phosphoinositide lipids with integral and peripheral membrane proteins, leading to modulation of protein structure, function, and cellular distribution. Since the discovery of phosphoinositides in the 1940s, combined molecular biology, biophysical, and structural approaches have made enormous progress in untangling this vast and diverse cellular network of interactions. More recently, in silico approaches such as molecular dynamics simulations have proven to be an asset in prospectively identifying, characterising, and explaining the structural basis of these interactions, and in the best cases providing atomic-level, testable hypotheses on how such interactions control the function of a given membrane protein. This review details a number of recent seminal discoveries in phosphoinositide biology, enabled by advanced biomolecular simulation and its integration with molecular biology, biophysical, and structural biology approaches. The results of the simulation studies agree well with experimental work, and in a number of notable cases have arrived at the key conclusion several years in advance of the experimental structures.
Article
Full-text available
Energy efficiency in computation is ultimately limited by noise, with quantum limits setting the fundamental noise floor. Analog physical neural networks hold promise for improved energy efficiency compared to digital electronic neural networks. However, they are typically operated in a relatively high-power regime so that the signal-to-noise ratio (SNR) is large (>10), and the noise can be treated as a perturbation. We study optical neural networks where all layers except the last are operated in the limit that each neuron can be activated by just a single photon, and as a result the noise on neuron activations is no longer merely perturbative. We show that by using a physics-based probabilistic model of the neuron activations in training, it is possible to perform accurate machine-learning inference in spite of the extremely high shot noise (SNR ~ 1). We experimentally demonstrated MNIST handwritten-digit classification with a test accuracy of 98% using an optical neural network with a hidden layer operating in the single-photon regime; the optical energy used to perform the classification corresponds to just 0.038 photons per multiply-accumulate (MAC) operation. Our physics-aware stochastic training approach might also prove useful with non-optical ultra-low-power hardware.
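The core idea of physics-aware stochastic training can be illustrated in a few lines: during training, the forward pass samples Poisson photon counts for each activation, so the network learns to tolerate SNR ~ 1 shot noise rather than a clean signal. The sketch below is our construction, not the authors' code; the function name and the rescaling scheme are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_layer(x, W, photons_per_neuron=1.0):
    """Forward pass where each pre-activation is detected as a Poisson
    photon count, a stand-in for single-photon-regime optics."""
    z = np.maximum(x @ W, 0.0)               # nonnegative optical intensity
    scale = photons_per_neuron / (z.mean() + 1e-12)
    counts = rng.poisson(z * scale)          # shot noise dominates at ~1 photon
    return counts / scale                    # rescaled noisy activation

# Training against this stochastic forward pass (e.g., averaging gradients
# over sampled noise) is what makes inference robust at SNR ~ 1.
x = rng.random((4, 8))
W = rng.normal(size=(8, 16))
print(noisy_layer(x, W).shape)               # (4, 16)
```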
Conference Paper
Full-text available
Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU) --- deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X -- 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X -- 80X higher. Moreover, using the CPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
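The arithmetic pattern of such an 8-bit MAC array can be mimicked in a few lines of NumPy. The sketch below shows symmetric int8 quantization with a wide int32 accumulator; it is a hedged illustration of the arithmetic, not the TPU's actual datapath.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization to int8 with a float scale."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a, b):
    """Multiply in int8, accumulate in int32, dequantize at the end:
    the arithmetic pattern of an 8-bit MAC array."""
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)   # wide accumulator
    return acc * (sa * sb)

a = np.random.randn(64, 128).astype(np.float32)
b = np.random.randn(128, 32).astype(np.float32)
err = np.abs(int8_matmul(a, b) - a @ b).max()
print(f"max abs error vs float32: {err:.3f}")
```

The narrow multipliers are what let 65,536 of them fit in one matrix unit; the int32 accumulation keeps long dot products from overflowing.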
Article
Full-text available
Program autotuning has been shown to achieve better or more portable performance in a number of domains. However, autotuners themselves are rarely portable between projects, for a number of reasons: using a domain-informed search space representation is critical to achieving good results; search spaces can be intractably large and require advanced machine learning techniques; and the landscape of search spaces can vary greatly between different problems, sometimes requiring domain-specific search techniques to explore efficiently. This paper introduces OpenTuner, a new open source framework for building domain-specific multi-objective program autotuners. OpenTuner supports fully customizable configuration representations, an extensible technique representation to allow for domain-specific techniques, and an easy-to-use interface for communicating with the program to be autotuned. A key capability inside OpenTuner is the use of ensembles of disparate search techniques simultaneously; techniques that perform well will dynamically be allocated a larger proportion of tests. We demonstrate the efficacy and generality of OpenTuner by building autotuners for 7 distinct projects and 16 total benchmarks, showing speedups of up to 2.8x over the prior techniques of these projects, with little programmer effort.
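The ensemble idea can be sketched generically: run several search techniques and shift trial budget toward whichever keeps finding new bests, bandit-style. The snippet below is our simplification of that idea and deliberately does not use OpenTuner's real API; all names and the credit scheme are illustrative.

```python
import random

def tune(objective, techniques, budget=200):
    """Allocate trials among search techniques in proportion to how often
    each has produced a new best: a simplified take on technique ensembles
    (not OpenTuner's actual interface)."""
    credit = {name: 1.0 for name in techniques}
    best_cfg, best_cost = None, float("inf")
    for _ in range(budget):
        # Sample a technique with probability proportional to its credit.
        names = list(techniques)
        name = random.choices(names, [credit[n] for n in names])[0]
        cfg = techniques[name](best_cfg)
        cost = objective(cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
            credit[name] += 1.0              # reward the winning technique
    return best_cfg, best_cost

# Toy search space: pick an integer x in [0, 1024) minimizing (x - 700)^2.
objective = lambda x: (x - 700) ** 2
techniques = {
    "random": lambda best: random.randrange(1024),
    "mutate": lambda best: (best or 0) + random.randint(-32, 32),
}
print(tune(objective, techniques))
```

Early on, random sampling wins credit; once a good region is found, the local mutation technique starts earning the budget, which is the dynamic reallocation the abstract describes.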
Conference Paper
Full-text available
Microprocessor designers such as Intel and AMD implement old instruction sets in their modern processors to ensure backward compatibility with legacy code. In addition to these backward-compatibility instructions, new extensions are constantly introduced to add functionality. In this way, the size of the IA-32 ISA is growing at a fast pace, reaching almost 1300 different instructions in 2013 with the introduction of AVX2 and FMA3 by Haswell. Increasing the size of the ISA impacts both hardware and software: it requires a complex microprocessor front-end design, which demands more silicon area, consumes more energy, and calls for more hardware debugging effort; it also hinders software performance, since in IA-32 newer instructions are bigger and take up more space in the instruction cache. In this work, after analyzing x86 code from 3 different Windows versions and their contemporary applications, plus 3 Linux distributions, from 1995 to 2012, we found that up to 30 classes of instructions fall out of use over time in this software. Should modern x86 processors sacrifice efficiency to provide strict conformance with software from 30 years ago? Our results show that many old instructions may be truly retired.
Book
Full-text available
Data sets in large applications are often too massive to fit completely inside the computer's internal memory. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major performance bottleneck. In this book we discuss the state of the art in the design and analysis of external memory (or EM) algorithms and data structures, where the goal is to exploit locality in order to reduce the I/O costs. We consider a variety of EM paradigms for solving batched and online problems efficiently in external memory. For the batched problem of sorting and related problems like permuting and fast Fourier transform, the key paradigms include distribution and merging. The paradigm of disk striping offers an elegant way to use multiple disks in parallel. For sorting, however, disk striping can be nonoptimal with respect to I/O, so to gain further improvements we discuss prefetching, distribution, and merging techniques for using the disks independently. We also consider useful techniques for batched EM problems involving matrices (such as matrix multiplication and transposition), geometric data (such as finding intersections and constructing convex hulls) and graphs (such as list ranking, connected components, topological sorting, and shortest paths). In the online domain, canonical EM applications include dictionary lookup and range searching. The two important classes of indexed data structures are based upon extendible hashing and B-trees. The paradigms of filtering and bootstrapping provide a convenient means in online data structures to make effective use of the data accessed from disk. We also reexamine some of the above EM problems in slightly different settings, such as when the data items are moving, when the data items are variable-length (e.g., text strings), when the internal data representations are compressed, or when the allocated amount of internal memory can change dynamically. Programming tools and environments are available for simplifying the EM programming task. During the course of the book, we report on some experiments in the domain of spatial databases using the TPIE system (Transparent Parallel I/O programming Environment). The newly developed EM algorithms and data structures that incorporate the paradigms we discuss are significantly faster than methods currently used in practice.
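As a concrete instance of the merging paradigm for I/O-efficient sorting, here is a minimal two-phase external merge sort in Python (our illustration; run formation and file handling are simplified): sort memory-sized runs, spill each to disk, then k-way merge the runs so only one block per run is in memory at a time.

```python
import heapq
import itertools
import tempfile

def external_sort(input_iter, memory_items=1000):
    """Two-phase external merge sort: sort memory-sized runs, spill each
    to a temporary file, then k-way merge the runs with a heap."""
    runs = []
    it = iter(input_iter)
    while True:
        chunk = sorted(itertools.islice(it, memory_items))  # one run in RAM
        if not chunk:
            break
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{x}\n" for x in chunk)
        f.seek(0)
        runs.append(f)
    # Merge phase: heapq.merge streams the runs, reading them lazily.
    streams = [(int(line) for line in f) for f in runs]
    yield from heapq.merge(*streams)

print(list(external_sort([5, 3, 8, 1, 9, 2, 7], memory_items=3)))
```

With M items of memory and B items per block, this pattern achieves the classic O((n/B) log_{M/B}(n/B)) I/O bound that the book analyzes.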
Chapter
Full-text available
A general library of software components has been a long-standing dream, but it’s unlikely to work, because there’s no business model for it, it costs the client too much to understand a component, and components have conflicting world views. In spite of this discouraging conclusion, very large components do work very well, because they have lots of clients and you use only three of them. Two other approaches can make software easier to write: declarative programming and specifications with teeth. The latter guarantee something about the behaviour of a module. The enforcement can be done statically, as with a type checker, or dynamically, as with transaction processing.
Conference Paper
Full-text available
We propose a data structure to maintain a collection of vertex-disjoint trees under a sequence of two kinds of operations: a link operation that combines two trees into one by adding an edge, and a cut operation that divides one tree into two by deleting an edge. Our data structure requires O(log n) time per operation when the time is amortized over a sequence of operations. Using our data structure, we obtain new fast algorithms for the following problems: (1) Computing nearest common ancestors. (2) Solving various network flow problems including finding maximum flows, blocking flows, and acyclic flows. (3) Computing certain kinds of constrained minimum spanning trees. (4) Implementing the network simplex algorithm for the transshipment problem. Our most significant application is (2); we obtain an O(mn log n)-time algorithm to find a maximum flow in a network of n vertices and m edges, beating by a factor of log n the fastest algorithm previously known for sparse graphs.
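To fix the interface in code, here is a deliberately naive Python version (ours) in which link and cut are simple pointer updates and find_root walks to the root in O(n) time; the paper's contribution is supporting these same operations in O(log n) amortized time via splay-tree-based path decomposition.

```python
class NaiveForest:
    """The link/cut tree *interface* with a naive O(n)-per-operation
    implementation using parent pointers. Sleator-Tarjan achieve
    O(log n) amortized for the same operations."""
    def __init__(self, n):
        self.parent = [None] * n

    def find_root(self, v):
        while self.parent[v] is not None:
            v = self.parent[v]
        return v

    def link(self, v, w):
        """Make root v a child of w (v must be a root; no cycles)."""
        assert self.parent[v] is None and self.find_root(w) != v
        self.parent[v] = w

    def cut(self, v):
        """Detach v from its parent, splitting one tree into two."""
        self.parent[v] = None

f = NaiveForest(5)
f.link(1, 0); f.link(2, 1); f.link(4, 3)
print(f.find_root(2), f.find_root(4))   # 0 3
f.cut(2)
print(f.find_root(2))                   # 2
```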
Conference Paper
Full-text available
Due to their high volume, general-purpose processors, and now chip multiprocessors (CMPs), are much more cost effective than ASICs, but lag significantly in terms of performance and energy efficiency. This paper explores the sources of these performance and energy overheads in general-purpose processing systems by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system. It then explores methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. We evaluate the gains from customizations useful to broad classes of algorithms, such as SIMD units, as well as those specific to particular computation, such as customized storage and functional units. The ASIC is 500x more energy efficient than our original four-processor CMP. Broadly applicable optimizations improve performance by 10x and energy by 7x. However, the very low energy costs of actual core ops (100s fJ in 90nm) mean that over 90% of the energy used in these solutions is still "overhead". Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each sub-algorithm of H.264, we create a large, specialized functional unit that is capable of executing 100s of operations per instruction. This improves performance and energy by an additional 25x and the final customized CMP matches an ASIC solution's performance within 3x of its energy and within comparable area.
Article
Full-text available
Recently, Goldberg proposed a new approach to the maximum network flow problem. The approach yields a very simple algorithm running in O(n³) time on n-vertex networks. Incorporation of the dynamic tree data structure of Sleator and Tarjan yields a more complicated algorithm with a running time of O(nm log(n²/m)) on m-arc networks. Ahuja and Orlin developed a variant of Goldberg's algorithm that uses scaling and runs in O(nm + n² log U) time on networks with integer arc capacities bounded by U. In this paper possible improvements to the Ahuja-Orlin algorithm are explored. First, an improved running time of O(nm + n² log U / log log U) is obtained by using a nonconstant scaling factor. Second, an even better bound of O(nm + n²(log U)^(1/2)) is obtained by combining the Ahuja-Orlin algorithm with the wave algorithm of Tarjan. Third, it is shown that the use of dynamic trees in the latter algorithm reduces the running time to O(nm log((n/m)(log U)^(1/2) + 2)). This result shows that the combined use of three different techniques results in speed not obtained by using any of the techniques alone. The above bounds are all for a unit-cost random access machine. Also considered is a semilogarithmic computation model in which the bounds increase by an additive term of O(m log_n U), which is the time needed to read the input in the model.
Article
Full-text available
This article presents asymptotically optimal algorithms for rectangular matrix transpose, fast Fourier transform (FFT), and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size M and cache-line length B where M = Ω(B²), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/B). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/B)(1 + log_M n)). We also give a Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix that incurs Θ(1 + (mn + np + mp)/B + mnp/(B√M)) cache faults. We introduce an “ideal-cache” model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We offer empirical evidence that cache-oblivious algorithms perform well in practice.
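A cache-oblivious matrix transpose is short enough to show in full. Note that the recursion takes no cache-size or line-length parameter; subproblems shrink until they fit in any cache level, which is how the Θ(1 + mn/B) miss bound is achieved. The cutoff below is only a constant-factor base-case optimization, not a tuning knob. This sketch is ours, not the paper's code.

```python
import numpy as np

def co_transpose(A, B, r0, r1, c0, c1, base=16):
    """Cache-oblivious transpose of A[r0:r1, c0:c1] into B: recurse on
    the larger dimension so subproblems eventually fit in *any* cache
    level, with no hardware parameter to tune."""
    if (r1 - r0) * (c1 - c0) <= base * base:
        B[c0:c1, r0:r1] = A[r0:r1, c0:c1].T        # small block: do it directly
    elif r1 - r0 >= c1 - c0:
        mid = (r0 + r1) // 2                       # split the rows
        co_transpose(A, B, r0, mid, c0, c1, base)
        co_transpose(A, B, mid, r1, c0, c1, base)
    else:
        mid = (c0 + c1) // 2                       # split the columns
        co_transpose(A, B, r0, r1, c0, mid, base)
        co_transpose(A, B, r0, r1, mid, c1, base)

A = np.arange(12.0).reshape(3, 4)
B = np.empty((4, 3))
co_transpose(A, B, 0, 3, 0, 4)
assert (B == A.T).all()
```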
Article
Full-text available
This paper discusses modularization as a mechanism for improving the flexibility and comprehensibility of a system while allowing the shortening of its development time. The effectiveness of a "modularization" is dependent upon the criteria used in dividing the system into modules. A system design problem is presented and both a conventional and unconventional decomposition are described. It is shown that the unconventional decompositions have distinct advantages for the goals outlined. The criteria used in arriving at the decompositions are discussed. The unconventional decomposition, if implemented with the conventional assumption that a module consists of one or more subroutines, will be less efficient in most cases. An alternative approach to implementation which does not have this effect is sketched.
Article
Full-text available
Good old online backpropagation for plain multilayer perceptrons yields a very low 0.35% error rate on the MNIST handwritten digits benchmark. All we need to achieve this best result so far are many hidden layers, many neurons per layer, numerous deformed training images to avoid overfitting, and graphics cards to greatly speed up learning.
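A minimal "plain backpropagation" loop, the algorithm this result rests on, fits in a few lines of NumPy. The data below is a random stand-in for MNIST and the sizes and learning rate are illustrative; the paper's recipe scales this same algorithm up with more layers and neurons, deformed training images, and GPU speed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Plain backprop on a one-hidden-layer MLP with squared loss.
X = rng.random((256, 784))                      # stand-in for MNIST pixels
T = np.eye(10)[rng.integers(0, 10, 256)]        # one-hot targets

W1 = rng.normal(0, 0.01, (784, 128))
W2 = rng.normal(0, 0.01, (128, 10))
for step in range(100):
    H = np.tanh(X @ W1)                         # forward pass
    P = H @ W2
    G = (P - T) / len(X)                        # dLoss/dP for 0.5 * MSE
    gW2 = H.T @ G                               # backward pass
    gW1 = X.T @ ((G @ W2.T) * (1 - H ** 2))     # tanh' = 1 - tanh^2
    W1 -= 0.5 * gW1
    W2 -= 0.5 * gW2
print(float(((P - T) ** 2).mean()))             # loss shrinks over the steps
```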
Article
A data structure is proposed to maintain a collection of vertex-disjoint trees under a sequence of two kinds of operations: a link operation that combines two trees into one by adding an edge, and a cut operation that divides one tree into two by deleting an edge. Each operation requires O(log n) time. Using this data structure, new fast algorithms are obtained for the following problems: 1.(1) Computing nearest common ancestors. 2.(2) Solving various network flow problems including finding maximum flows, blocking flows, and acyclic flows. 3.(3) Computing certain kinds of constrained minimum spanning trees. 4.(4) Implementing the network simplex algorithm for minimum-cost flows. The most significant application is (2); an O(mn log n)-time algorithm is obtained to find a maximum flow in a network of n vertices and m edges, beating by a factor of log n the fastest algorithm previously known for sparse graphs.
Conference Paper
Accelerators spend significant area and effort on custom on-chip buffering. Unfortunately, these solutions are strongly tied to particular designs, hampering re-usability across other accelerators or domains. We present buffets, an efficient and composable storage idiom for the needs of accelerators that is independent of any particular design. Buffets have several distinguishing characteristics, including efficient decoupled fills and accesses with fine-grained synchronization, hierarchical composition, and efficient multi-casting. We implement buffets in RTL and show that they only add 2% control overhead over an 8KB RAM. When compared with DMA-managed double-buffered scratchpads and caches across a range of workloads, buffets improve energy-delay-product by 1.53x and 5.39x, respectively.
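As a loose software analogy (ours, not the paper's RTL), the distinguishing property of a buffet, decoupled fills and accesses with fine-grained synchronization, can be modeled as a buffer whose reader blocks until the specific element it needs has been filled, rather than waiting for a whole tile to be swapped in as with double buffering.

```python
import threading

class ToyBuffet:
    """Software analogy for a buffet's decoupled fill/access: a filler
    appends values while readers block until the index they want is
    ready -- fine-grained synchronization instead of whole-tile swaps."""
    def __init__(self, size):
        self.data = [None] * size
        self.filled = 0
        self.cv = threading.Condition()

    def fill(self, value):                   # producer side (e.g., a DMA engine)
        with self.cv:
            self.data[self.filled] = value
            self.filled += 1
            self.cv.notify_all()

    def read(self, i):                       # consumer side (the datapath)
        with self.cv:
            self.cv.wait_for(lambda: i < self.filled)
            return self.data[i]

buf = ToyBuffet(4)
threading.Thread(target=lambda: [buf.fill(v) for v in (10, 20, 30, 40)]).start()
print(buf.read(3))                           # blocks until index 3 is filled: 40
```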
Article
Innovations like domain-specific hardware, enhanced security, open instruction sets, and agile chip development will lead the way.
Article
The integrated circuit is today synonymous with the concept of technological progress. In the seven decades since the invention of the transistor at Bell Labs, relentless progress in the development of semiconductor devices — Moore’s law — has been achieved despite regular warnings from industry observers about impending limits. Here, drawing on technical and organizational archival work and oral histories, we argue that the current technological and structural challenges facing the industry are unprecedented and undermine the incentives for continued collective action in research and development, which has underpinned the past 50 years of transformational worldwide economic growth and social advance. We conclude by arguing that the lack of private incentives, due in part to a splintering of technology trajectories and short-term private profitability of many of these new splinters, creates a case for greatly increased public funding and the need for leadership beyond traditional stakeholders. This Perspective argues that the challenges facing the semiconductor industry are unprecedented and undermine incentives for continued collective action in research and development, potentially threatening world-wide economic growth; significantly increased public funding, as well as leadership beyond traditional stakeholders, is required.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
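For reference, the described five-conv/three-FC stack is easy to write down in a modern framework. The PyTorch sketch below follows the common torchvision-style rendering of this architecture; it approximates rather than reproduces the paper's exact configuration (for instance, it omits the original local response normalization and the two-GPU split).

```python
import torch.nn as nn

# AlexNet-style stack: five conv layers with interleaved max-pooling,
# then three fully-connected layers ending in a 1000-way classifier.
model = nn.Sequential(
    nn.Conv2d(3, 64, 11, stride=4, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(64, 192, 5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(192, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Flatten(),
    nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),   # dropout as in the paper
    nn.Dropout(), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),                                   # 1000-way output
)
```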
Conference Paper
This paper presents the many-core architecture, with hundreds to thousands of small cores, to deliver unprecedented compute performance in an affordable power envelope. We discuss fine grain power management, memory bandwidth, on die networks, and system resiliency for the many-core system.
Article
This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than that required for a single-processor execution. Utilizing a new graph-theoretic model of multithreaded computation, execution efficiency is quantified by three important measures: T₁ is the time required for executing the computation on 1 processor, T∞ is the time required by an infinite number of processors, and S₁ is the space required to execute the computation on 1 processor. A computation executed on P processors is time-efficient if the time is O(T₁/P + T∞), that is, it achieves linear speedup when P = O(T₁/T∞), and it is space-efficient if it uses O(S₁·P) total space, that is, the space per processor is within a constant factor of that required for a 1-processor execution. The first result derived from this model shows that there exist multithreaded computations such that no execution schedule can simultaneously achieve efficient time and efficient space. But by restricting attention to “strict” computations – those in which all arguments to a procedure must be available before the procedure can be invoked – much more positive results are obtainable. Specifically, for any strict multithreaded computation, a simple online algorithm can compute a schedule that is both time-efficient and space-efficient. Unfortunately, because the algorithm uses a global queue, the overhead of computing the schedule can be substantial. This problem is overcome by a decentralized algorithm that can compute and execute a P-processor schedule online in expected time O(T₁/P + T∞·lg P) and worst-case space O(S₁·P·lg P), including overhead costs.
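The abstract's bounds can be restated compactly; the numeric instance below is our own illustration, not taken from the paper.

```latex
% Bounds from the abstract, restated:
T_P = O\!\left(\frac{T_1}{P} + T_\infty \lg P\right) \ \text{(expected time)},
\qquad
S_P = O(S_1 \, P \lg P) \ \text{(worst-case space)}.
% Illustrative instance (ours): T_1 = 10^9 unit-time tasks and a critical
% path of T_\infty = 10^4 give near-linear speedup while
% P \ll T_1 / T_\infty = 10^5; e.g., P = 100 yields T_P \approx 10^7,
% dominated by the T_1 / P term.
```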
Conference Paper
Due to the breakdown of Dennardian scaling, the percentage of a silicon chip that can switch at full frequency is dropping exponentially with each process generation. This utilization wall forces designers to ensure that, at any point in time, large fractions of their chips are effectively dark or dim silicon, i.e., either idle or significantly underclocked. As exponentially larger fractions of a chip's transistors become dark, silicon area becomes an exponentially cheaper resource relative to power and energy consumption. This shift is driving a new class of architectural techniques that "spend" area to "buy" energy efficiency. All of these techniques seek to introduce new forms of heterogeneity into the computational stack. We envision that ultimately we will see widespread use of specialized architectures that leverage these techniques in order to attain orders-of-magnitude improvements in energy efficiency. However, many of these approaches also suffer from massive increases in complexity. As a result, we will need to look towards developing pervasively specialized architectures that insulate the hardware designer and the programmer from the underlying complexity of such systems. In this paper, I discuss four key approaches--the four horsemen--that have emerged as top contenders for thriving in the dark silicon age. Each class carries with its virtues deep-seated restrictions that requires a careful understanding of the underlying tradeoffs and benefits.
Article
It is known that in multiprocessing systems composed of many identical processing units operating in parallel, certain timing anomalies may occur; e.g., an increase in the number of processing units can cause an increase in the total length of time needed to process a fixed set of tasks. In this paper, precise bounds are derived for several anomalies of this type.
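One such anomaly can be reproduced directly with greedy list scheduling. The 9-task instance below is the classic example usually attributed to this paper (our encoding of it): task 1 precedes task 9, and task 4 precedes tasks 5 through 8. Adding a fourth machine should lengthen the schedule from 12 to 15 time units.

```python
import heapq

def list_schedule(times, preds, order, m):
    """Greedy list scheduling on m machines: whenever a machine frees,
    start the first ready task in the priority list."""
    finish, running, remaining, t = {}, [], list(order), 0.0
    while remaining or running:
        ready = [v for v in remaining
                 if all(finish.get(p, float("inf")) <= t
                        for p in preds.get(v, ()))]
        while len(running) < m and ready:
            v = ready.pop(0)
            remaining.remove(v)
            heapq.heappush(running, (t + times[v], v))
        t, v = heapq.heappop(running)        # advance to the next completion
        finish[v] = t
    return t

times = {1: 3, 2: 2, 3: 2, 4: 2, 5: 4, 6: 4, 7: 4, 8: 4, 9: 9}
preds = {9: [1], 5: [4], 6: [4], 7: [4], 8: [4]}
order = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(list_schedule(times, preds, order, m=3))   # 12.0
print(list_schedule(times, preds, order, m=4))   # 15.0: more machines, slower!
```

With three machines the long task 9 starts early; with four, the greedy list grabs the short tasks first and delays task 9, exactly the kind of anomaly the paper bounds.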
Article
The flexibility and power needed in the channel for a computer display are considered. To work efficiently, such a channel must have a sufficient number of instructions that it is best understood as a small processor rather than a powerful channel. Successive improvements to the display processor design were found to lie on a circular path: by making improvements, one returns to the original simple design, plus one new general-purpose computer for each trip around. The degree of physical separation between display and parent computer is a key factor in display processor design.
Chapter
This chapter discusses parallel algorithms for shared-memory machines. Parallel computation is rapidly becoming a dominant theme in all areas of computer science and its applications. It is estimated that, within a decade, virtually all developments in computer architecture, systems programming, computer applications and the design of algorithms will be taking place within the context of parallel computation. In preparation for this revolution, theoretical computer scientists have begun to develop a body of theory centered on parallel algorithms and parallel architectures. As there is no consensus yet on the appropriate logical organization of a massively parallel computer, and as the speed of parallel algorithms is constrained as much by limits on interprocessor communication as it is by purely computational issues, it is not surprising that a variety of abstract models of parallel computation have been pursued. Closest to the hardware level are the VLSI models, which focus on the technological limits of today's chips, in which gates and wires are packed into a small number of planar layers.
Article
Below we will give an algorithm which computes the coefficients of the product of two square matrices A and B of order n from the coefficients of A and B with less than 4.7·n^(log 7) arithmetical operations (all logarithms in this paper are for base 2, thus log 7 ≈ 2.8; the usual method requires approximately 2n³ arithmetical operations). The algorithm induces algorithms for inverting a matrix of order n, solving a system of n linear equations in n unknowns, computing a determinant of order n, etc., all requiring less than const·n^(log 7) arithmetical operations. This fact should be compared with the result of Klyuyev and Kokovkin-Shcherbak [1] that Gaussian elimination for solving a system of linear equations is optimal if one restricts oneself to operations upon rows and columns as a whole. We also note that Winograd [2] modifies the usual algorithms for matrix multiplication and inversion and for solving systems of linear equations, trading roughly half of the multiplications for additions and subtractions. It is a pleasure to thank D. Brillinger for inspiring discussions about the present subject and St. Cook and B. Parlett for encouraging me to write this paper. We define algorithms ε(m,k) which multiply matrices of order m·2^k, by induction on k: ε(m,0) is the usual algorithm for matrix multiplication (requiring m³ multiplications and m²(m−1) additions); ε(m,k) already being known, define ε(m,k+1) as follows: if A, B are matrices of order m·2^(k+1) to be multiplied, write A = (A₁₁ A₁₂; A₂₁ A₂₂), B = (B₁₁ B₁₂; B₂₁ B₂₂) […]
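The recursion described above is easy to state in code. The following NumPy sketch (ours, not the paper's notation) uses the standard seven Strassen products on half-size blocks instead of the classical eight, giving O(n^(log 7)) ≈ O(n^2.81) work, and falls back to the ordinary algorithm below a cutoff.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's recursion: 7 half-size multiplications instead of 8.
    For simplicity, n is assumed to be a power of two here."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B                        # classical base case
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A, B = np.random.rand(128, 128), np.random.rand(128, 128)
assert np.allclose(strassen(A, B, cutoff=32), A @ B)
```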
Article
This paper considers the design, fabrication, and characterization of very small MOSFET switching devices suitable for digital integrated circuits using dimensions of the order of 1 μ. Scaling relationships are presented which show how a conventional MOSFET can be reduced in size. An improved small device structure is presented that uses ion implantation to provide shallow source and drain regions and a nonuniform substrate doping profile. One-dimensional models are used to predict the substrate doping profile and the corresponding threshold voltage versus source voltage characteristic. A two-dimensional current transport model is used to predict the relative degree of short-channel effects for different device parameter combinations. Polysilicon-gate MOSFETs with channel lengths as short as 0.5 μ were fabricated, and the device characteristics measured and compared with predicted values. The performance improvement expected from using these very small devices in highly miniaturized integrated circuits is projected. [Reprinted from the IEEE Journal of Solid-State Circuits, Vol. SC-9, October 1974, pp. 256-268.]
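The scaling relationships the abstract refers to are the now-classic constant-field ("Dennard") rules. The summary below is our paraphrase of how they are usually tabulated, not quoted text from the paper.

```latex
% Constant-field scaling by a factor \kappa > 1 (our paraphrase):
\begin{aligned}
\text{device dimensions } (L,\ W,\ t_{ox}) &\;\to\; 1/\kappa \\
\text{supply voltage } V &\;\to\; V/\kappa \\
\text{doping concentration } N_a &\;\to\; \kappa N_a \\
\text{delay per circuit} &\;\to\; 1/\kappa \\
\text{power per circuit} &\;\to\; 1/\kappa^{2} \\
\text{power density} &\;\to\; \text{constant}
\end{aligned}
% Constant power density is what let each process generation add
% transistors without raising chip power -- until voltage scaling stalled.
```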
Conference Paper
A model of computation based on random access machines operating in parallel and sharing a common memory is presented. The computational power of this model is related to that of traditional models. In particular, deterministic parallel RAM's can accept in polynomial time exactly the sets accepted by polynomial tape bounded Turing machines; nondeterministic RAM's can accept in polynomial time exactly the sets accepted by nondeterministic exponential time bounded Turing machines. Similar results hold for other classes. The effect of limiting the size of the common memory is also considered.
Conference Paper
Generally believed to be a problem belonging to the compiler and architecture communities, performance optimization has rarely gained attention in mainstream software engineering research. However, due to the proliferation of large-scale object-oriented software designed to solve increasingly complex problems, performance issues stand out, preventing applications from meeting their performance requirements. Many such issues result from design principles adopted widely in the software research community, such as the idea of software reuse and design patterns. We argue that, in the modern era when Moore's dividend becomes less obvious, performance optimization is more of a software engineering problem than ever and should receive much more attention in the future. We explain why this is the case, review what has been achieved in software bloat analysis, present challenges, and provide a road map for future work.
Conference Paper
The promise of unsupervised learning methods lies in their potential to use vast amounts of unlabeled data to learn complex, highly nonlinear models with millions of free parameters. We consider two well-known unsupervised learning models, deep belief networks (DBNs) and sparse coding, that have recently been applied to a flurry of machine learning applications (Hinton & Salakhutdinov, 2006; Raina et al., 2007). Unfortunately, current learning algorithms for both models are too slow for large-scale applications, forcing researchers to focus on smaller-scale models, or to use fewer training examples. In this paper, we suggest massively parallel methods to help resolve these problems. We argue that modern graphics processors far surpass the computational capabilities of multicore CPUs, and have the potential to revolutionize the applicability of deep unsupervised learning methods. We develop general principles for massively parallelizing unsupervised learning tasks using graphics processors. We show that these principles can be applied to successfully scaling up learning algorithms for both DBNs and sparse coding. Our implementation of DBN learning is up to 70 times faster than a dual-core CPU implementation for large models. For example, we are able to reduce the time required to learn a four-layer DBN with 100 million free parameters from several weeks to around a single day. For sparse coding, we develop a simple, inherently parallel algorithm, that leads to a 5 to 15-fold speedup over previous methods.
Conference Paper
Multicore has shown significant performance and power advantages over single cores in commercial systems with 2-4 cores. Applying a corollary of Moore's Law for multicore, we expect to see 1K multicore chips within a decade. 1K multicore systems introduce significant architectural challenges. One of these is the power efficiency challenge. Today's cores consume tens of watts. Even at about one watt per core, a 1K-core chip would need to dissipate 1K watts! This paper discusses the "Kill rule for multicore" for power-efficient multicore design, an approach inspired by the "KISS rule" for RISC processor design. Kill stands for Kill If Less than Linear, and represents a design approach in which any additional area allocated to a resource within a core, such as a cache, is carefully traded off against using the area for additional cores. The Kill Rule states that we must increase resource size (for example, cache size) only if for every 1% increase in core area there is at least a 1% increase in core performance.
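The rule reduces to a one-line comparison; the sketch below (ours, with hypothetical numbers) just makes the decision criterion explicit.

```python
def kill_rule_ok(d_perf_pct, d_area_pct):
    """Kill If Less than Linear: grow a per-core resource only if each
    1% of added core area buys at least 1% more core performance."""
    return d_perf_pct >= d_area_pct

# Doubling a cache adds 1.0% core area but only 0.8% performance:
print(kill_rule_ok(d_perf_pct=0.8, d_area_pct=1.0))  # False -> spend the area on more cores
```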
Article
This issue's expert guest column is by Eric Allender, who has just taken over the Structural Complexity Column in the Bulletin of the EATCS. Regarding "Journals to Die For" (SIGACT News Complexity Theory Column 16), Joachim von zur Gathen, ...
Article
In this column we review the books listed below. In addition there is a comment by Oded Goldreich on the review of his book Foundations of Cryptography.
Article
Sublinear time algorithms represent a new paradigm in computing, where an algorithm must give some sort of an answer after inspecting only a very small portion of the input. We discuss the types of answers that one can hope to achieve in this setting.
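A canonical example of the paradigm is estimating a fraction by sampling: inspect O(log(1/δ)/ε²) random positions instead of reading the whole input, and accept an answer that is approximate with high probability. The sketch below is our illustration; the sample-size bound follows from a standard Hoeffding argument.

```python
import math
import random

def approx_fraction(seq, predicate, eps=0.05, delta=0.01, seed=0):
    """Estimate the fraction of items satisfying `predicate` to within
    +/- eps with probability >= 1 - delta, inspecting only
    O(log(1/delta)/eps^2) random positions (Hoeffding bound)."""
    rng = random.Random(seed)
    m = math.ceil(math.log(2 / delta) / (2 * eps ** 2))
    hits = sum(bool(predicate(seq[rng.randrange(len(seq))])) for _ in range(m))
    return hits / m

data = [i % 10 for i in range(10_000_000)]      # exactly 10% are zero
print(approx_fraction(data, lambda x: x == 0))  # ~0.1 from ~1060 samples
```

The sample count is independent of the input length, which is the defining feature of a sublinear-time algorithm.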
Article
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype, with a full text and hyperlink database of at least 24 million pages, is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advances in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses the question of how to build a practical large-scale system which can exploit the additional information present in hypertext. We also look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want. Keywords: World Wide Web, Search Engines, Information Retrieval, PageRank, Google
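The PageRank computation at the heart of this prototype is a power iteration on the link graph: a page's score is the stationary probability of a random surfer who follows links with probability d and teleports uniformly otherwise. The following is a minimal Python rendering of that standard formulation (ours, not Google's implementation).

```python
import numpy as np

def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank over an adjacency dict {page: [out-links]}.
    Dangling pages distribute their rank uniformly."""
    n = len(links)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        nxt = np.full(n, (1.0 - d) / n)          # teleportation mass
        for page, outs in links.items():
            share = r[page] / (len(outs) if outs else n)
            for q in (outs if outs else range(n)):
                nxt[q] += d * share              # follow-a-link mass
        r = nxt
    return r

# Toy web: 0 -> 1,2 ; 1 -> 2 ; 2 -> 0.
print(pagerank({0: [1, 2], 1: [2], 2: [0]}).round(3))
```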
Article
It is shown that arithmetic expressions with n ≥ 1 variables and constants; operations of addition, multiplication, and division; and any depth of parenthesis nesting can be evaluated in time 4 log₂ n + 10(n − 1)/p using p ≥ 1 processors which can independently perform arithmetic operations in unit time. This bound is within a constant factor of the best possible. A sharper result is given for expressions without the division operation, and the question of numerical stability is discussed.
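Evaluating the stated bound on an instance (the numbers are our illustration, not the paper's) shows how the two terms trade off:

```latex
% The bound as stated, on an instance of our choosing:
T_p \;\le\; 4\log_2 n + \frac{10(n-1)}{p}.
% For n = 1024 and p = 64:
%   T_p \le 4 \cdot 10 + 10 \cdot 1023 / 64 \approx 40 + 160 = 200,
% versus on the order of n - 1 = 1023 steps with a single processor.
```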
Article
How does a search engine company decide what ads to display with each query so as to maximize its revenue? This turns out to be a generalization of the online bipartite matching problem. We introduce the notion of a tradeoff-revealing LP and use it to derive an optimal algorithm achieving a competitive ratio of 1 − 1/e for this problem.
Article
This paper presents new algorithms for the maximum flow problem, the Hitchcock transportation problem, and the general minimum-cost flow problem. Upper bounds on the numbers of steps in these algorithms are derived, and are shown to compare favorably with upper bounds on the numbers of steps required by earlier algorithms.
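The shortest-augmenting-path rule this line of work introduced, always augment along a breadth-first-shortest s-t path so the number of augmentations is polynomially bounded, can be sketched compactly. The adjacency-matrix representation and the test graph below are our simplifications.

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp: repeatedly augment along a *shortest* s-t path in
    the residual graph (found by BFS), bounding augmentations by O(VE)."""
    n = len(cap)
    flow = [[0] * n for _ in range(n)]
    total = 0
    while True:
        parent = [-1] * n
        parent[s] = s
        q = deque([s])
        while q and parent[t] == -1:             # BFS in the residual graph
            u = q.popleft()
            for v in range(n):
                if parent[v] == -1 and cap[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if parent[t] == -1:
            return total                         # no augmenting path remains
        bottleneck, v = float("inf"), t          # bottleneck residual capacity
        while v != s:
            u = parent[v]
            bottleneck = min(bottleneck, cap[u][v] - flow[u][v])
            v = u
        v = t
        while v != s:                            # push flow along the path
            u = parent[v]
            flow[u][v] += bottleneck
            flow[v][u] -= bottleneck
            v = u
        total += bottleneck

cap = [[0, 3, 2, 0],    # 0 -> 1 (3), 0 -> 2 (2)
       [0, 0, 1, 2],    # 1 -> 2 (1), 1 -> 3 (2)
       [0, 0, 0, 2],    # 2 -> 3 (2)
       [0, 0, 0, 0]]
print(max_flow(cap, 0, 3))                       # 4
```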
Article
The success of the von Neumann model of sequential computation is attributable to the fact that it is an efficient bridge between software and hardware: high-level languages can be efficiently compiled onto this model, yet it can be efficiently implemented in hardware. The author argues that an analogous bridge between software and hardware is required for parallel computation if that is to become as widely used. This article introduces the bulk-synchronous parallel (BSP) model as a candidate for this role, and gives results quantifying its efficiency both in implementing high-level language features and algorithms, as well as in being implemented in hardware.
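The cost model that makes BSP a quantifiable bridge is usually written per superstep; the formula below is the conventional presentation, with the standard parameter names rather than text quoted from the article.

```latex
% Conventional BSP superstep cost (standard notation):
%   w_i = local computation on processor i,
%   h   = max messages sent or received by any processor (an h-relation),
%   g   = communication cost per message, l = barrier synchronization cost.
T_{\mathrm{superstep}} \;=\; \max_i w_i \;+\; g \cdot h \;+\; l
% A program's cost is the sum over its supersteps; hardware is a good
% BSP target when g and l are small relative to compute speed.
```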
Article
Over the past three decades, regular, predictable improvements in computers have been the norm, progress attributable to Moore's Law, the steady 40%-per-year increase in the number of transistors per chip unit area. Faster processors enabled software vendors to add new features. Software changed and improved as computers became more capable. To most of the world, the real dividend of Moore's Law was this improvement. Moore's Dividend was spent in many ways and places, ranging from programming languages, models, architectures, and development practices, up through software functionality. Parallelism is not a surrogate for faster processors and cannot directly step into their roles. Multicore processors will change software as profoundly as previous hardware revolutions radically altered the size and cost of computers, the software written for them, and the industry that produced and sold the hardware and software. Parallelism will drive software in new directions rather than continuing the evolutionary improvements made familiar by Moore's Dividend.