Stefanos Kaxiras
Uppsala University | UU · Department of Information Technology

IEEE Fellow

About

188 Publications
22,967 Reads
3,891 Citations
Since 2017: 55 research items, 1,267 citations
[Chart: citations per year, 2017-2023]
Additional affiliations
February 2010 - present
Uppsala University
Position
  • Professor (Full)
February 2003 - January 2010
University of Patras
Position
  • Professor (Assistant)
September 1998 - August 2003
Bell Labs
Position
  • Member of Technical Staff

Publications (188)
Conference Paper
Full-text available
A fundamental programming feature that allows Spectre to effortlessly leak the value of secrets via cache side channels is the transformation of data values into addresses. Consider for example sorting, hashing, or many other algorithms that create addresses based on data values. While we understand the mechanism that leaks data as addresses, there...
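To make the data-to-address transformation concrete, here is a minimal sketch (not taken from the paper; secret, probe_array, and STRIDE are invented names): a value read from memory selects which cache line of a probe array gets touched, so a later timing scan of that array reveals the value.
```c
#include <stddef.h>
#include <stdint.h>

#define STRIDE 512                          /* one probe slot per line group */
static uint8_t probe_array[256 * STRIDE];   /* attacker-observable array     */
static uint8_t secret[16] = "hunter2";      /* stand-in for sensitive data   */

uint8_t gadget(size_t i)
{
    /* The data value secret[i] becomes part of an address: which line of
     * probe_array is brought into the cache now encodes the secret byte,
     * even if this load only ever executes on the wrong path. */
    return probe_array[secret[i] * STRIDE];
}
```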
Article
Applications running on out-of-order cores have benefited for decades from store-to-load forwarding, which accelerates communication of store values to loads of the same thread. Although threads running on a simultaneous multithreading (SMT) core could also access the load queues (LQ) and store queues (SQ) / store buffers (SB) of other threads to allow...
Article
MicroScope and other similar microarchitectural replay attacks take advantage of the characteristics of speculative execution to trap the execution of the victim application in a loop, enabling the attacker to amplify a side-channel attack by executing it indefinitely. Due to the nature of the replay, it can be used to effectively attack software t...
Article
Full-text available
Hardware transactional memory emerged to make parallel programming more accessible. However, the performance pitfall of this technique is squashing speculatively executed instructions and re-executing them in case of aborts, ultimately resorting to serialization in case of repeated conflicts. A significant fraction of aborts occurs due to conflicts...
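As a rough illustration of the abort-and-serialize pattern described above (a sketch under assumptions, not the paper's mechanism), the following uses Intel RTM intrinsics with an invented retry limit and a simple test-and-set fallback lock; it requires a TSX-capable CPU and compiling with -mrtm.
```c
#include <immintrin.h>
#include <stdatomic.h>

static atomic_int fallback_lock = 0;            /* illustrative fallback lock */

void atomic_increment(long *counter)
{
    for (int attempt = 0; attempt < 3; attempt++) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (atomic_load(&fallback_lock))    /* subscribe to the lock so a */
                _xabort(0xff);                  /* serialized writer aborts us */
            (*counter)++;                       /* speculative work           */
            _xend();                            /* commit                     */
            return;
        }
        /* Abort: everything executed speculatively is squashed and retried. */
    }
    /* Repeated conflicts: give up and serialize, as the abstract notes. */
    while (atomic_exchange(&fallback_lock, 1))
        ;                                       /* spin until acquired        */
    (*counter)++;
    atomic_store(&fallback_lock, 0);
}
```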
Article
Speculative side-channel attacks access sensitive data and use transmitters to leak the data during wrong-path execution. Various defenses have been proposed to prevent such information leakage. However, not all speculatively executed instructions are unsafe: Recent work demonstrates that speculation invariant instructions are independent of spec...
Preprint
Full-text available
Speculative side-channel attacks access sensitive data and use transmitters to leak the data during wrong-path execution. Various defenses have been proposed to prevent such information leakage. However, not all speculatively executed instructions are unsafe: Recent work demonstrates that speculation invariant instructions are independent of specul...
Article
Achieving low load-to-use latency with low energy and storage overheads is critical for performance. Existing techniques either prefetch into the pipeline (via address prediction and validation) or provide data reuse in the pipeline (via register sharing or L0 caches). These techniques provide a range of tradeoffs between latency, reuse, and overhe...
Preprint
Full-text available
MicroScope, and microarchitectural replay attacks in general, take advantage of the characteristics of speculative execution to trap the execution of the victim application in an infinite loop, enabling the attacker to amplify a side-channel attack by executing it indefinitely. Due to the nature of the replay, it can be used to effectively attack s...
Preprint
Full-text available
Recent architectural approaches that address speculative side-channel attacks aim to prevent software from exposing the microarchitectural state changes of transient execution. The Delay-on-Miss technique is one such approach, which simply delays loads that miss in the L1 cache until they become non-speculative, resulting in no transient changes in...
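A conceptual sketch of the Delay-on-Miss decision (the struct fields below are hypothetical stand-ins for pipeline state, not a simulator API): speculative loads that would miss in the L1 are simply held back, so no transient change ever reaches the cache hierarchy.
```c
#include <stdbool.h>

typedef struct {
    bool l1_hit;        /* would this load hit in the L1 data cache?          */
    bool speculative;   /* could an older, unresolved instruction squash it?  */
} load_uop;

typedef enum { ISSUE_NOW, DELAY } action;

static action delay_on_miss(const load_uop *ld)
{
    if (ld->l1_hit)
        return ISSUE_NOW;               /* hits expose no new cache state     */
    if (ld->speculative)
        return DELAY;                   /* hold the miss until the load is    */
                                        /* guaranteed to be on the right path */
    return ISSUE_NOW;                   /* non-speculative miss may proceed   */
}
```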
Article
Since the introduction of Meltdown and Spectre, the research community has been tirelessly working on speculative side-channel attacks and on how to shield computer systems from them. To ensure that a system is protected not only from all the currently known attacks but also from future, yet to be discovered, attacks, the solutions developed need t...
Article
There exist extensive ongoing research efforts on emerging atomic-scale technologies that have the potential to become an alternative to today’s complementary metal-oxide-semiconductor technologies. A common feature among the investigated technologies is that of multi-level devices, particularly the possibility of implementing quaternary logic ga...
Conference Paper
Full-text available
Speculative execution, the base on which modern high-performance general-purpose CPUs are built, has recently been shown to enable a slew of security attacks. All these attacks are centered around a common set of behaviors: During speculative execution, the architectural state of the system is kept unmodified, until the speculation can be veri...
Conference Paper
Full-text available
Modern processors contain store-buffers to allow stores to retire under a miss, thus hiding store-miss latency. The store-buffer needs to be large (for performance) and searched on every load (for correctness), thereby making it a costly structure in both area and energy. Yet on every load, the store-buffer is probed in parallel with the L1 and TLB...
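A toy model of why every load must probe the store buffer (sizes and structure are invented for illustration, not the paper's design): the youngest older store to the same address has to forward its value, so the search runs in parallel with the L1/TLB lookup.
```c
#include <stdbool.h>
#include <stdint.h>

#define SB_ENTRIES 56

typedef struct { bool valid; uint64_t addr; uint64_t data; } sb_entry;

typedef struct {
    sb_entry entry[SB_ENTRIES];
    int head;                     /* oldest store                       */
    int tail;                     /* next insertion slot                */
} store_buffer;

/* Probed on every load: walk from youngest to oldest and forward the
 * first matching store; otherwise the load falls through to the L1. */
bool sb_forward(const store_buffer *sb, uint64_t load_addr, uint64_t *value)
{
    for (int i = 1; i <= SB_ENTRIES; i++) {
        int idx = (sb->tail - i + SB_ENTRIES) % SB_ENTRIES;
        if (sb->entry[idx].valid && sb->entry[idx].addr == load_addr) {
            *value = sb->entry[idx].data;     /* store-to-load forwarding */
            return true;
        }
        if (idx == sb->head)
            break;                            /* reached the oldest entry */
    }
    return false;
}
```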
Conference Paper
Full-text available
Speculative execution is necessary for achieving high performance on modern general-purpose CPUs but, starting with Spectre and Meltdown, it has also been proven to cause severe security flaws. In case of a misspeculation, the architectural state is restored to assure functional correctness but a multitude of microarchitectural changes (e.g., cache...
Article
Full-text available
Out-of-order execution is essential for high performance, general-purpose computation, as it can find and execute useful work instead of stalling. However, it is typically limited by the requirement of visibly sequential, atomic instruction execution—in other words, in-order instruction commit. While in-order commit has a number of advantages, such...
Conference Paper
Increasing demands for energy efficiency constrain emerging hardware. These new hardware trends challenge the established assumptions in code generation and force us to rethink existing software optimization techniques. We propose a cross-layer redesign of the way compilers and the underlying microarchitecture are built and interact, to achieve bot...
Article
Full-text available
Increasing demands for energy efficiency constrain emerging hardware. These new hardware trends challenge the established assumptions in code generation and force us to rethink existing software optimization techniques. We propose a cross-layer redesign of the way compilers and the underlying microarchitecture are built and interact, to achieve bot...
Preprint
Full-text available
We present a non-speculative solution for a coalescing store buffer in total store order (TSO) consistency. Coalescing violates TSO with respect to both conflicting loads and conflicting stores, if partial state is exposed to the memory system. Proposed solutions for coalescing in TSO resort to speculation-and-rollback or centralized arbitration t...
Article
Load reordering is important for performance. It allows a core to continue performing accesses to the memory system even when there are older, in-program-order, unperformed accesses (for example, due to long latency misses). The only known solution to allow such reordering in a strong consistency model such as total store ordering (TSO) has been to...
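A classic message-passing litmus test makes the load→load requirement concrete (thread and variable names are invented; the relaxed C11 atomics here do not themselves forbid reordering, the point is what TSO hardware such as x86 guarantees at the instruction level): if the first load observes flag == 1, the second must observe data == 1, so the outcome r1 == 1 && r2 == 0 is forbidden.
```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int data = 0, flag = 0;

static void *producer(void *arg)
{
    atomic_store_explicit(&data, 1, memory_order_relaxed);   /* older store   */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);   /* younger store */
    return arg;
}

static void *consumer(void *arg)
{
    int r1 = atomic_load_explicit(&flag, memory_order_relaxed); /* older load   */
    int r2 = atomic_load_explicit(&data, memory_order_relaxed); /* younger load */
    if (r1 == 1 && r2 == 0)
        puts("forbidden under TSO: loads appeared out of program order");
    return arg;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, producer, NULL);
    pthread_create(&t1, NULL, consumer, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```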
Article
Full-text available
Data-race-free (DRF) parallel programming is becoming the standard, as newly adopted memory models of mainstream programming languages such as C++ or Java impose data-race-freedom as a requirement. We propose compiler techniques that automatically delineate extended data-race-free (xDRF) regions, namely regions of code that provide the same guarantees as...
Article
Full-text available
Complex out-of-order (OoO) processors have been designed to overcome the restrictions of outstanding long-latency misses at the cost of increased energy consumption. Simple, limited OoO processors are a compromise in terms of energy consumption and performance, as they have fewer hardware resources to tolerate the penalties of long-latency loads. I...
Technical Report
Full-text available
Reducing the widening gap between processor and memory speed has been steering processors' design over the last decade, as memory accesses became the main performance bottleneck. Out-of-order architectures attempt to hide memory latency by dynamically reordering instructions, while in-order architectures are restricted to static instruction schedul...
Conference Paper
Full-text available
In the TSO (Total Store Order) consistency model implemented by most commercial processors, out-of-order execution of load operations within the same thread is not permitted; the order in which they appear in the code must be preserved. The so-called load→load order must therefore be guaranteed. In order to o...
Article
Full-text available
Cache coherence protocols based on self-invalidation allow simpler hardware implementation compared to traditional write-invalidation protocols, by relying on data-race-free semantics and applying self-invalidation on synchronization points. Their simplicity lies in the absence of invalidation traffic. This eliminates the need to track readers in a...
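A conceptual acquire/release sketch of the idea (the two self_* helpers are hypothetical stand-ins for the hardware actions, not a real API): coherence work happens at the synchronization points themselves, so no reader tracking or invalidation traffic is needed for the race-free accesses in between.
```c
#include <pthread.h>

/* Hypothetical stand-ins for the hardware actions at synchronization points. */
static void self_invalidate_shared(void) { /* discard possibly stale copies  */ }
static void self_downgrade_dirty(void)   { /* write dirty shared data back   */ }

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static int shared_counter;

void critical_update(void)
{
    pthread_mutex_lock(&m);
    self_invalidate_shared();     /* acquire: next reads fetch fresh data     */

    shared_counter++;             /* data-race-free accesses in between need  */
                                  /* no invalidation messages at all          */

    self_downgrade_dirty();       /* release: make the writes visible         */
    pthread_mutex_unlock(&m);
}
```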
Conference Paper
Full-text available
In Total Store Order memory consistency (TSO), loads can be speculatively reordered to improve performance. If a load-load reordering is seen by other cores, speculative loads must be squashed and re-executed. In architectures with an unordered interconnection network and directory coherence, this has been the established view for decades. We show,...
Article
In Total Store Order memory consistency (TSO), loads can be speculatively reordered to improve performance. If a load-load reordering is seen by other cores, speculative loads must be squashed and re-executed. In architectures with an unordered interconnection network and directory coherence, this has been the established view for decades. We show,...
Conference Paper
Out-of-order execution is essential for high performance, general-purpose computation, as it can find and execute useful work instead of stalling. However, it is limited by the requirement of visibly sequential, atomic instruction execution --- in other words, in-order instruction commit. While in-order commit has its advantages, such as providing p...
Article
Full-text available
Building high-performance, next-generation processors requires novel techniques to allow improved performance given today’s power- and energy-efficiency requirements. Additionally, a widening gap between processor and memory performance makes it even more difficult to improve efficiency with conventional techniques. While out-of-order architectures...
Article
Full-text available
Energy efficiency plays a significant role given the battery-lifetime constraints in embedded systems and hand-held devices. In this work we target the ARM big.LITTLE, a heterogeneous platform that is dominant in the mobile and embedded market, which allows code to run transparently on different microarchitectures with individual energy and perform...
Research
Full-text available
This is the DRAFT of the Kiloprocessor extensions to SCI, describing the GLOW and STEM extensions. Of historical interest.
Article
Full-text available
Cache coherence protocols based on self-invalidation and self-downgrade have recently seen increased popularity due to their simplicity, potential performance efficiency, and low energy consumption. However, such protocols result in memory instruction reordering, thus causing extra program behaviors that are often not intended by the programmers. W...
Conference Paper
Several recent efforts aim to simplify coherence and its associated costs (e.g., directory size, complexity) in multicores. The bulk of these efforts rely on program data-race-free (DRF) semantics to eliminate explicit invalidations and use self-invalidation instead. While such protocols are simple, they require software cooperation. This is accept...
Poster
Cache coherence protocols based on self-invalidation allow simpler hardware implementation compared to traditional write-invalidation protocols, by relying on data-race-free semantics and applying self-invalidation and self-downgrade on synchronization points. This work examines how self-invalidation and self-downgrade are performed in relation to...
Conference Paper
Cache coherence protocols using self-invalidation and self-downgrade have recently seen increased popularity due to their simplicity, potential performance efficiency, and low energy consumption. However, such protocols result in memory instruction reordering, thus causing extra program behaviors that are often not intended by the programmer. We pr...
Conference Paper
There exist extensive ongoing research efforts on emerging atomic scale technologies that have the potential to become an alternative to today's CMOS technologies. A common feature among the investigated technologies is that of multi-value devices, in particular, the possibility of implementing quaternary logic and memory. However, multi-value devi...
Article
This work proposes a novel scheme to facilitate heterogeneous systems with unified virtual memory. Research proposals implement coherence protocols for sequential consistency (SC) between CPU cores and between devices. Such mechanisms introduce severe bottlenecks in the system; therefore, we adopt the heterogeneous-race-free (HRF) memory model. T...
Conference Paper
Computer architecture design faces an era of great challenges in an attempt to simultaneously improve performance and energy efficiency. Previous hardware techniques for energy management become severely limited, and thus, compilers play an essential role in matching the software to the more restricted hardware capabilities. One promising approach...
Code
ArgoDSM is a highly scalable software distributed shared memory system. Please see our paper in HPDC 2015: Stefanos Kaxiras, David Klaftenegger, Magnus Norgren, Alberto Ros, Kostis Sagonas, "Turning Centralized Coherence and Distributed Critical-Section Execution on their Head: A New Approach for Scalable Distributed Shared Memory," HPDC 2015 (FCRC)...
Article
Full-text available
As energy efficiency has become a critical factor in the embedded-systems domain, dynamic voltage and frequency scaling (DVFS) techniques have emerged as a means to control the system's power and energy efficiency. Additionally, due to the compact design, thermal issues become prominent. State-of-the-art work promotes software decoupled access-execution...
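A small sketch of the software decoupled access-execute idea mentioned above (the loop and chunk size are illustrative, using the GCC/Clang __builtin_prefetch builtin): each chunk is split into a memory-bound access phase that prefetches and a compute-bound execute phase, which is what lets DVFS pick a low frequency for the former and a high one for the latter.
```c
#include <stddef.h>

void daxpy_dae(double *y, const double *x, double a, size_t n)
{
    enum { CHUNK = 256 };                       /* illustrative chunk size */

    for (size_t base = 0; base < n; base += CHUNK) {
        size_t end = base + CHUNK < n ? base + CHUNK : n;

        /* Access phase: mostly waiting on memory, so a low V/f point
         * wastes little performance here. */
        for (size_t i = base; i < end; i++) {
            __builtin_prefetch(&x[i], 0);       /* read  */
            __builtin_prefetch(&y[i], 1);       /* write */
        }

        /* Execute phase: data is now (mostly) cached, so a high V/f
         * point is spent on useful computation. */
        for (size_t i = base; i < end; i++)
            y[i] = a * x[i] + y[i];
    }
}
```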
Conference Paper
Full-text available
Directory-based cache coherence is the de-facto standard for scalable shared-memory multi/many-cores and significant effort is invested in reducing its overhead. However, directory area and complexity optimizations are often antithetical to each other. Novel directory-less coherence schemes have been introduced to remove the complexity and cost ass...
Article
Full-text available
Classification of data into private and shared has proven to be a catalyst for techniques to reduce coherence cost, since private data can be taken out of coherence and resources can be concentrated on providing coherence for shared data. In this article, we examine how granularity - page-level versus cache-line level - and adaptivity - going from...
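A toy sketch of page-level classification with adaptivity to sharing (the structure is invented, not the article's exact mechanism): a page starts private to its first accessor and is promoted to shared the first time another core touches it, at which point it re-enters coherence.
```c
#include <stdbool.h>

typedef struct {
    bool touched;    /* has any core accessed this page yet?      */
    bool shared;     /* promoted once a second core accessed it   */
    int  owner;      /* the first core to touch the page          */
} page_info;

/* Called on a page access by core `core`; returns whether the page must be
 * kept under coherence. Private pages can stay out of it entirely. */
bool classify_access(page_info *p, int core)
{
    if (!p->touched) {            /* first touch: private to this core  */
        p->touched = true;
        p->owner   = core;
        p->shared  = false;
    } else if (p->owner != core) {
        p->shared = true;         /* another core: promote to shared    */
    }
    return p->shared;
}
```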
Conference Paper
Full-text available
A coherent global address space in a distributed system enables shared memory programming at a much larger scale than a single multicore or a single SMP. Without dedicated hardware support at this scale, the solution is a software distributed shared memory (DSM) system. However, traditional approaches to coherence (centralized via “active” home-n...
Conference Paper
Full-text available
Cache coherence protocols based on self-invalidation allow a simpler design compared to traditional invalidation-based protocols, by relying on data-race-free (DRF) semantics and applying self-invalidation on racy synchronization points exposed to the hardware. Their simplicity lies in the absence of invalidation traffic, which eliminates the need...
Article
Please stay tuned for the actual paper. (Coming soon to HPDC/FCRC 2015)
Article
Full-text available
Cache coherence protocols based on self-invalidation allow a simpler design compared to traditional invalidation-based protocols, by relying on data-race-free (DRF) semantics and applying self-invalidation on racy synchronization points exposed to the hardware. Their simplicity lies in the absence of invalidation traffic, which eliminates the need...
Article
Driven by the motivation to expose instruction-level parallelism (ILP), microprocessor cores have evolved from simple, in-order pipelines into complex, superscalar out-of-order designs. By extracting ILP, these processors also enable parallel cache and memory operations as a useful side-effect. Today, however, the growing off-chip memory wall and c...
Conference Paper
Full-text available
Hierarchical clustered cache designs are becoming an appealing alternative for multicores. Grouping cores and their caches in clusters reduces network congestion by localizing traffic among several hierarchical levels, potentially enabling much higher scalability. While such architectures can be formed recursively by replicating a base design patte...
Conference Paper
Full-text available
Existing multi-threaded applications perform synchronization either in an explicit way, e.g., making use of the functionality provided by synchronization libraries, or in an implicit or "covert" way, e.g., using shared variables. Unfortunately, the implicit synchronization constructs are prone to errors and difficult to detect. This paper presents a...
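A toy example of the "covert" synchronization the paper targets (names invented): one thread waits for another by spinning on a plain shared flag rather than calling a synchronization library, which is exactly the kind of construct that is easy to get wrong and hard for tools to detect. The atomics here keep the example correct; the error-prone variants use ordinary variables.
```c
#include <pthread.h>
#include <stdatomic.h>

static int payload;                 /* data handed from producer to consumer */
static atomic_int ready = 0;        /* the implicit synchronization variable */

static void *producer(void *arg)
{
    payload = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);  /* covert signal */
    return arg;
}

static void *consumer(void *arg)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                           /* covert wait: no library call involved  */
    return (void *)(long)payload;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, producer, NULL);
    pthread_create(&b, NULL, consumer, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```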
Chapter
Issues of dynamic power have dominated the power-aware architecture landscape. Among these dynamic-power techniques, most methods focus on dynamic voltage and frequency scaling (DVFS). The intuition behind many of these approaches [82, 92, 93, 164, 197, 200, 201] is that if the processor and memory operate largely asynchronously from e...
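For a back-of-the-envelope feel for why DVFS pays off during memory-bound phases, the sketch below evaluates the standard dynamic-power relation P = a*C*V^2*f at two operating points (the constants are purely illustrative).
```c
#include <stdio.h>

/* Dynamic power: P = a * C * V^2 * f (activity, capacitance, voltage, freq). */
static double dyn_power(double a, double c, double v, double f)
{
    return a * c * v * v * f;
}

int main(void)
{
    double a = 0.2, c = 1.0e-9;                    /* illustrative constants */
    double p_hi = dyn_power(a, c, 1.00, 3.0e9);    /* nominal V/f point      */
    double p_lo = dyn_power(a, c, 0.80, 1.5e9);    /* scaled-down V/f point  */

    printf("nominal: %.3f W  scaled: %.3f W  (%.2fx lower)\n",
           p_hi, p_lo, p_hi / p_lo);               /* 0.600 W vs 0.192 W     */
    return 0;
}
```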