
Alexandra Fedorova
Simon Fraser University · School of Computing Science
About
90 Publications · 12,961 Reads
4,125 Citations (since 2017)
Publications (90)
Bitcoin is a top-ranked cryptocurrency that has experienced huge growth and survived numerous attacks. The protocols making up Bitcoin must therefore accommodate the growth of the network and ensure security.
Security of the Bitcoin network depends on connectivity between the nodes. Higher connectivity yields better security. In this paper we make...
Data structure splicing (DSS) refers to reorganizing data structures by merging or splitting them, reordering fields, inlining pointers, etc. DSS has been used, with demonstrated benefits, to improve spatial locality. When data fields that are accessed together are also collocated in the address space, the utilization of hardware caches improves an...
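The hot/cold field split at the heart of DSS can be sketched concretely. The record type below is purely illustrative (not taken from the paper), and the 64-byte cache line is the typical x86 size:

```python
import ctypes

# Hypothetical record: the hot fields (key, count) are touched on every
# iteration of some loop; the cold name field is only read when reporting.
class RecordUnsplit(ctypes.Structure):
    _fields_ = [("key", ctypes.c_int32),
                ("name", ctypes.c_char * 48),   # cold field interleaved
                ("count", ctypes.c_int32)]

# After splicing: hot fields live in their own densely packed array,
# cold fields are moved to a parallel structure.
class RecordHot(ctypes.Structure):
    _fields_ = [("key", ctypes.c_int32),
                ("count", ctypes.c_int32)]

CACHE_LINE = 64  # bytes, typical on x86

print(CACHE_LINE // ctypes.sizeof(RecordUnsplit))  # 1 record per line
print(CACHE_LINE // ctypes.sizeof(RecordHot))      # 8 hot records per line
```

Every cache line fetched by the hot loop now carries eight useful records instead of one, which is exactly the cache-utilization improvement the abstract describes.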
Software debugging is a time-consuming and challenging process. Supporting debugging has been a focus of the software engineering field since its inception, with numerous empirical studies, theories, and tools to support developers in this task. Performance bugs and performance debugging are a sub-genre of debugging that has received less attention....
Researchers and practitioners dedicate a lot of effort to improving spatial locality in their programs. Hardware caches rely on spatial locality for efficient operation; when it is absent, they waste memory bandwidth and cache space by fetching data that is never used before it is evicted. Improving spatial locality is difficult. For the most part,...
Heterogeneous systems that integrate a multicore CPU and a GPU on the same die are ubiquitous. On these systems, both the CPU and GPU share the same physical memory as opposed to using separate memory dies. Although integration eliminates the need to copy data between the CPU and the GPU, arranging transparent memory sharing between the two devices...
In this work, we present PolyBlaze, a scalable and configurable multicore platform for FPGA-based embedded systems and systems research. PolyBlaze is an extension of the MicroBlaze soft processor, leveraging the configurability of the MicroBlaze and bringing it into the multicore era with Linux Symmetric Multi-Processor (SMP) support. This work de...
As a central part of resource management, the OS thread scheduler must maintain the following, simple, invariant: make sure that ready threads are scheduled on available cores. As simple as it may seem, we found that this invariant is often broken in Linux. Cores may stay idle for seconds while ready threads are waiting in runqueues. In our experim...
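The invariant itself is simple enough to state as code. This toy checker is a hypothetical illustration of the condition the authors describe, not the Linux implementation:

```python
def invariant_violated(runqueues):
    """runqueues: per-core counts of ready threads (running + queued).
    The invariant is broken when some core sits idle (0 ready threads)
    while another core has more than one ready thread waiting on it."""
    has_idle_core = any(n == 0 for n in runqueues)
    has_backlog = any(n > 1 for n in runqueues)
    return has_idle_core and has_backlog

# Core 2 idles while core 0 queues three ready threads: broken.
print(invariant_violated([3, 1, 0, 1]))  # True
# Every ready thread has a core of its own: fine.
print(invariant_violated([1, 1, 1, 1]))  # False
```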
Memory access latency is hence non-uniform, because it depends on where the request originates and where it is destined to go. Such systems are referred to as nonuniform memory access (or NUMA). Current x86 NUMA systems are cache coherent (called ccNUMA), which means programs can transparently access memory on local and remote nodes wi...
One of the key decisions made by both MapReduce and HPC cluster management frameworks is the placement of jobs within a cluster. To make this decision, they consider factors like resource constraints within a node or the proximity of data to a process. However, they fail to account for the degree of collocation on the cluster's nodes. A tight proce...
Modern server-class systems are typically built as several multicore chips put together in a single system. Each chip has a local DRAM (dynamic random-access memory) module; together they are referred to as a node. Nodes are connected via a high-speed interconnect, and the system is fully coherent. This means that, transparently to the programmer,...
When designing modern embedded computing systems, most software programmers choose to use multicore processors, possibly in combination with general-purpose graphics processing units (GPGPUs) and/or hardware accelerators. They also often use an embedded Linux O/S and run multi-application workloads that may even be multi-threaded. Modern FPGAs are...
A modular dynamically re-configurable profiling core may be used to provide both operating systems and applications with detailed information about run time performance bottlenecks and may enable them to address these bottlenecks via scheduling or dynamic compilation. As a result, application software may be able to better leverage the intrinsic na...
Application virtual address space is divided into pages, each requiring a virtual-to-physical translation in the page table and the TLB. Large working sets, common among modern applications, necessitate a lot of translations, which increases memory consumption and leads to high TLB and page fault rates. To address this problem, recent hardware in...
A thread scheduling technique for assigning multiple threads on a single integrated circuit is dependent on the CPIs of the threads. The technique attempts to balance, to the extent possible, the loads among the processing cores by assigning threads of relatively long-latency (low CPIs) with threads of relatively short-latency (high CPIs) to the sa...
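The pairing idea can be sketched as a greedy sort-and-match. This is a simplified illustration of the balancing goal described in the abstract, not the claimed technique; the thread names and CPI values are made up:

```python
def pair_by_cpi(threads):
    """threads: dict mapping thread name -> measured CPI. Returns pairs
    that mix a long-latency (high-CPI, memory-bound) thread with a
    short-latency (low-CPI, compute-bound) one on each core."""
    ranked = sorted(threads, key=threads.get)  # ascending CPI
    pairs = []
    while len(ranked) >= 2:
        pairs.append((ranked.pop(0), ranked.pop()))  # lowest with highest
    return pairs

cpis = {"A": 0.8, "B": 4.0, "C": 1.2, "D": 3.1}
print(pair_by_cpi(cpis))  # [('A', 'B'), ('C', 'D')]
```

Pairing a CPU-hungry thread with a stall-prone one keeps each core's pipeline busy without both co-runners competing for the same resource.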
The disclosed embodiments provide a system that facilitates scheduling threads in a multi-threaded processor with multiple processor cores. During operation, the system executes a first thread in a processor core that is associated with a shared cache. During this execution, the system measures one or more metrics to characterize the first thread....
NUMA systems are characterized by Non-Uniform Memory Access times, where accessing data in a remote node takes longer than a local access. NUMA hardware has been built since the late 80's, and the operating systems designed for it were optimized for access locality. They co-located memory pages with the threads that accessed them, so as to avoid th...
Servers in most data centers are often underutilized due to concerns about SLA violations that may result from resource contention as server utilization increases. This low utilization means that neither the capital investment in the servers nor the power consumed is being used as effectively as it could be. In this paper, we present a novel method...
When multiple threads or processes run on a multi-core CPU they compete for shared resources, such as caches and memory controllers, and can suffer performance degradation as high as 200%. We design and evaluate a new machine learning model that estimates this degradation online, on previously unseen workloads, and without perturbing the execution....
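As a rough stand-in for such a predictor (this is not the paper's model), one can fit a least-squares line from a single contention proxy, say last-level-cache misses per kilo-instruction (MPKI), to observed slowdown; the training pairs below are illustrative only:

```python
# Hypothetical stand-in for an online degradation predictor (NOT the
# paper's model): least-squares fit from a contention proxy (MPKI) to
# percent performance degradation observed in co-run experiments.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Illustrative (made-up) co-run measurements: (MPKI, % degradation).
mpki = [1, 5, 10, 20, 40]
degradation = [3, 15, 30, 60, 120]
a, b = fit_line(mpki, degradation)
print(round(a * 30 + b))  # predicted degradation at 30 MPKI: 90
```

A real online model uses richer counter features and handles unseen workloads, but the shape of the problem, counters in, predicted degradation out, is the same.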
Performance problems in parallel programs manifest as lack of scalability. These scalability issues are often very difficult to debug. They can stem from synchronization overhead, poor thread scheduling decisions, or contention for hardware resources, such as shared caches. Traditional profiling tools attribute program cycles to different functions...
Chip multicore processors (CMPs) have emerged as the dominant architecture choice for modern computing platforms and will most likely continue to be dominant well into the foreseeable future. As with any system, CMPs offer a unique set of challenges. Chief among them is the shared resource contention that results because CMP cores are not independe...
Smartphone devices are becoming the de facto personal computing platform, rivaling the desktop, as the number of smartphone users is projected to reach 1.1 billion by 2013. Unlike the desktop, smartphones have a constrained energy budget, which is further challenged by increasingly sophisticated applications. Amongst the most popular applications o...
Contention for shared resources in High-Performance Computing (HPC) clusters occurs when jobs are concurrently executing on the same multicore node (there is contention for shared caches, memory buses, memory controllers and memory domains). The shared resource contention incurs severe degradation to workload performance and stability and hence m...
Shared state access conflicts are one of the greatest sources of error for fine grained parallelism in any domain. Notoriously hard to debug, these conflicts reduce reliability and increase development time. The standard task graph model dictates that tasks with potential conflicting accesses to shared state must be linked by a dependency, even if...
Modern computing systems increasingly consist of multiple processor cores. From cell phones to datacenters, multicore computing has become the standard. At the same time, our understanding of the performance impact that resource sharing has on these platforms is limited, which prevents these systems from being fully utilized. As the capacity of...
Simultaneous multithreading (SMT) increases CPU utilization and application performance in many circumstances, but it can be detrimental when performance is limited by application scalability or when there is significant contention for CPU resources. This paper describes an SMT-selection metric that predicts the change in application performance wh...
Heterogeneous multicore architectures promise greater energy/area efficiency than their homogeneous counterparts. This efficiency can only be realized, however, if the operating system assigns applications to appropriate cores based on their architectural properties. While several such heterogeneity-aware algorithms were proposed in the past, they...
Asymmetric multicore processors (AMPs) consist of cores with the same ISA (instruction-set architecture), but different microarchitectural features, speed, and power consumption. Because cores with more complex features and higher speed typically use more area and consume more energy relative to simpler and slower cores, we must use these cores for...
Execution time is no longer the only metric by which computational systems are judged. In fact, explicitly sacrificing raw performance in exchange for energy savings is becoming a common trend in environments ranging from large server farms attempting to minimize cooling costs to mobile devices trying to prolong battery life. Hardware designers, we...
Large, Internet based companies service user requests from multiple data centers located across the globe. These data centers often house a heterogeneous computing infrastructure and draw electricity from the local electricity market. Reducing the electricity costs of operating these data centers is a challenging problem, and in this work, we propos...
On multicore systems, contention for shared resources occurs when memory-intensive threads are co-scheduled on cores that share parts of the memory hierarchy, such as last-level caches and memory controllers. Previous work investigated how contention could be addressed via scheduling. A contention-aware scheduler separates competing threads onto se...
Thread scheduling in multi-core systems is a challenging problem because cores on a single chip usually share parts of the memory hierarchy, such as last-level caches, prefetchers and memory controllers, making threads running on different cores interfere with each other while competing for these resources. Data center service providers are interes...
Recent research has highlighted the potential benefits of single-ISA heterogeneous multicore processors over cost-equivalent homogeneous ones, and it is likely that future processors will integrate cores that have the same instruction set architecture (ISA) but offer different performance and power characteristics. To fully tap into the potential o...
Processor systems contain a limited number of hardware counters that provide some visibility for certain types of interactions, but do not support sophisticated analysis due to limited resources. By contrast, system software simulators provide multidimensional runtime data, but slow down application execution, often resulting in an inaccurate pictur...
In this paper, we argue that the modern HPC cluster environments contain several bottlenecks both within cluster multicore nodes and between them in the cluster interconnects. These bottlenecks represent resources that can be of high demand to several jobs, concurrently executing on the cluster. As such, the jobs can compete for accessing these res...
The problem of scheduling on multicore systems remains one of the hottest and most challenging topics in systems research. The introduction of non-uniform memory access (NUMA) multicore architectures further complicates this problem, as on NUMA systems the scheduler needs to consider not only the placement of threads on cores, but also the placemen...
Markov Random Fields (MRFs) are of great interest to the medical image analysis community but suffer from high computational complexity and difficulties in parameter selection. For these reasons, efforts have been made to develop more efficient algorithms for solving MRF optimization problems in order to enable reduced run-times and better interact...
Contention for shared resources on multicore processors remains an unsolved problem in existing systems despite significant research efforts dedicated to this problem in the past. Previous solutions focused primarily on hardware techniques and software page coloring to mitigate this problem. Our goal is to investigate how and to what extent content...
Multicore processors have become commonplace in both desktops and servers. A serious challenge with multicore processors is that cores share on- and off-chip resources such as caches, memory buses, and memory controllers. Competition for these shared resources between threads running on different cores can result in severe and unpredictable performanc...
In this position paper, we present our vision for the scheduling infrastructure in a many-core hypervisor - the hypervisor targeted for many-core platforms. The key objectives of our system are scalability and heterogeneity-awareness. We see these as first-order objectives, because future many-core processors will consist of thousands of cores and...
The Intel Core i7 processor code named Nehalem provides a feature named Turbo Boost which opportunistically varies the frequencies of the processor's cores. The frequency of a core is determined by core temperature, the number of active cores, the estimated power consumption, the estimated current consumption, and operating system frequency scaling...
Asymmetric multicore processors (AMP) consist of cores exposing the same instruction-set architecture (ISA) but varying in size, frequency, power consumption and performance. AMPs were shown to be more power efficient than conventional symmetric multicore processors, and it is therefore likely that future multicore systems will include cores of dif...
Asymmetric multicore processors (AMP) promise higher performance per watt than their symmetric counterparts, and it is likely that future processors will integrate a few fast out-of-order cores, coupled with a large number of simpler, slow cores, all exposing the same instruction-set architecture (ISA). It is well known that one of the most effecti...
Transactional Memory (TM) is considered one of the most promising paradigms for developing concurrent applications. TM has been shown to scale well on multiple cores when the data access pattern behaves "well," i.e., when few conflicts are induced. In contrast, data patterns with frequent write sharing, with long transactions, or when many thre...
Contention for caches, memory controllers, and interconnects can be eased by contention-aware scheduling algorithms.
On multicore systems contention for shared resources occurs when memory-intensive threads are co-scheduled on cores that share parts of the memory hierarchy, such as last-level caches and memory controllers. Previous work investigated how contention could be addressed via scheduling. A contention-aware scheduler separates competing threads onto sep...
Symmetric-ISA (instruction set architecture) asymmetric-performance multicore processors were shown to deliver higher performance per watt and area for applications with diverse architectural requirements, and so it is likely that future multicore processors will combine a few fast cores characterized by complex pipelines, high clock frequency, hig...
Video games are a performance-hungry application domain with a complexity that often rivals operating systems. These performance and complexity issues, in combination with tight development times and large teams, mean that consistent, specialized and pervasive support for parallelism is of paramount importance. The Cascade project is focused on...
Several researchers proposed an asymmetric multicore architecture (AMP) that had the potential to save a significant amount of power while delivering similar performance as conventional symmetric multicore processors. The asymmetric multicore architecture was proposed to create more power-efficient CPUs. An AMP consists of cores that use the same i...
The Intel® Core™ i7 processor code named Nehalem has a novel feature called Turbo Boost which dynamically varies the frequencies of the processor's cores. The frequency of a core is determined by core temperature, the number of active cores, the estimated power and the estimated current consumption. We perform an extensive analysis of the Tur...
The transition to multicore architectures has dramatically underscored the necessity for parallelism in software. In particular, while new gaming consoles are by and large multicore, most existing video game engines are essentially sequential and thus cannot easily take advantage of this hardware. In this paper we describe techniques derived from...
Single-ISA heterogeneous multicore architectures promise to deliver plenty of cores with varying complexity, speed and performance in the near future. Virtualization enables multiple operating systems to run concurrently as distinct, independent guest domains, thereby reducing core idle time and maximizing throughput. This paper seeks to identify a...
We present a new operating system scheduling algorithm for multicore processors. Our algorithm reduces the effects of unequal CPU cache sharing that occur on these processors and cause unfair CPU sharing, priority inversion, and inadequate CPU accounting. We describe the implementation of our algorithm in the Solaris operating system and demonstrat...
Future heterogeneous single-ISA multicore processors will have an edge in potential performance per watt over comparable homogeneous processors. To fully tap into that potential, the OS scheduler needs to be heterogeneity-aware, so it can match jobs to cores according to characteristics of both. We propose a Heterogeneity-Aware Signature-Supported...
How do we develop software to make the most of the promise that asymmetric multicore systems use a lot less energy?
In this work we describe a methodology for developing simple and robust power models using performance monitoring events for AMD Quad-core systems running OpenSolaris™. The basic idea is correlating power consumption of a benchmark program with its performance (a measure of performance monitoring events). By using applicable model selection an...
Asymmetric multicore processors (AMP) are built of cores that expose the same ISA but differ in performance, complexity, and power consumption. A typical AMP might consist of plenty of slow, small and simple cores and a handful of fast, large and complex cores. AMPs have been proposed as a more energy-efficient alternative to symmetric...
Cache affinity between a process and a processor is observed when the processor cache has accumulated some amount of the process state, i.e., data or instructions. Cache affinity is exploited by OS schedulers: they tend to reschedule processes to run on a recently used processor. On conventional (unicore) multiprocessor systems, exploitation of...
We describe a new operating system scheduling algorithm that improves performance isolation on chip multiprocessors (CMP). Poor performance isolation occurs when an application's performance is determined by the behaviour of its co-runners, i.e., other applications simultaneously running with it. This performance dependency is caused by unfair, co-...
Much recent research has focused on operating system scheduling algorithms for managing shared resource contention on chip multiprocessors (CMPs) and simultaneous multithreaded (SMT) systems. While the relevance of those algorithms is apparent for server workloads, it is less obvious for desktop workloads. As CMP/SMT processors are becoming inc...
Simultaneous multithreading (SMT) processors run multiple threads simultaneously on a single processing core. Because concurrent threads compete for the processor's shared resources, non-work-conserving scheduling, i.e., running fewer threads than the processor allows even if there are threads ready to run, can often improve performance. Neverthele...
This dissertation addresses operating system thread scheduling for chip multithreaded processors. Chip multithreaded processors are becoming mainstream thanks to their superior performance and power characteristics. Threads running concurrently on a chip multithreaded processor share the processor’s resources. Resource contention, and accordingly p...
We investigated how operating system design should be adapted for multithreaded chip multiprocessors (CMT) - a new generation of processors that exploit thread-level parallelism to mask the memory latency in modern workloads. We determined that the L2 cache is a critical shared resource on CMT and that an insufficient amount of L2 cache can undermi...
The unpredictable nature of modern workloads, characterized by frequent branches and control transfers, can result in processor pipeline utilization as low as 19%. Chip multithreading (CMT), a processor architecture combining chip multiprocessing and hardware multithreading, is designed to address this issue. Hardware vendors plan to ship CMT syste...
The Direct Access File System (DAFS) is a distributed file system built on top of direct-access transports (DAT). Direct-access transports are characterized by using remote direct memory access (RDMA) for data transfer and user-level networking. The motivation behind the DAT-enabled distributed file system architecture is the reduction of the CPU o...
The performance of high-speed network-attached storage applications is often limited by end-system overhead, caused primarily by memory copying and network protocol processing. In this paper, we examine alternative strategies for reducing overhead in such systems. We consider optimizations to remote procedure call (RPC)-based data transfer using ei...
The Direct Access File System (DAFS) is an emerging industrial standard for network-attached storage. DAFS takes advantage of new user-level network interface standards. This enables a user-level file system structure in which client-side functionality for remote data access resides in a library rather than in the kernel. This structure addresses l...
We make a case that a thread scheduler for heterogeneous multicore systems should target three objectives: optimal performance, core assignment balance and response time fairness. Performance optimization via optimal thread-to-core assignment has been explored in the past; in this paper we demonstrate the need for balanced core assignment. We show...
In this paper we argue that the scheduler, as the intermediary between hardware and software, needs to be fully data-aware. The old paradigm of envisioning tasks as amorphous blobs of 'work' to be assigned to processors is incomplete and needs to be expanded. Some techniques and projects have emerged that implicitly use this idea, but either focus on...
While soft real-time applications must run quickly enough to meet the deadline, there is usually no extra benefit from running more quickly than that. This property provides the opportunity for energy savings using Dynamic Voltage and Frequency Scaling (DVFS). In this paper, we propose the GreenRT framework that allows an application to monitor it...
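The energy-saving opportunity can be sketched numerically with the textbook DVFS model (not the GreenRT algorithm itself): pick the lowest frequency state that still meets the deadline. The cycle count, deadline, and frequency table are illustrative:

```python
def min_frequency(cycles, deadline_s, available_ghz):
    """Lowest available frequency (GHz) that finishes `cycles` of work
    within `deadline_s` seconds, or None if even the fastest state is
    too slow. Assumes work is purely frequency-bound."""
    needed_ghz = cycles / deadline_s / 1e9
    feasible = [f for f in available_ghz if f >= needed_ghz]
    return min(feasible) if feasible else None

# A frame of soft real-time work: 25 million cycles due in 20 ms.
freqs = [0.8, 1.2, 1.6, 2.0, 2.4]  # hypothetical DVFS states, GHz
print(min_frequency(25e6, 0.020, freqs))  # 1.6 GHz just meets the deadline
```

Because dynamic power scales roughly with V²f and lower frequencies permit lower voltages, finishing just in time at 1.6 GHz consumes considerably less energy than racing ahead at 2.4 GHz.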
The Cascade Parallel Processing Framework (PPF) is a user-level library that facilitates manual parallelization of complex C++ systems. In Cascade, processing duties of the system are enclosed in a Cascade Task. Tasks are linked by dependencies in a task dependency graph. The task graph is traversed at runtime by the Cascade Job Manager who assig...
Chip multithreading (CMT) combines chip multiprocessing (CMP) and hardware multithreading (MT). In order to make the most of CMT systems when they become available, we have developed the Sam CMT simulator toolkit. A Sam simulation is usable as an interactive system, running at about 100Kips on a 1.2GHz UltraSPARC III and about 200Kips on a 1.8GHz A...
In this paper we examine the use of base vector applications as a tool for classifying an application's usage of a processor's resources. We define a series of base vector applications, simple applications designed to place directed stress on a single processor resource. By co-scheduling base vector applications with a target application o...
In this paper we propose CASC, a cache-aware operating system scheduling algorithm for multithreaded chip multiprocessors (CMT). CMT is emerging as a popular architecture for server platforms, and most major hardware manufacturers plan or already have released CMT processors. It is the job of the operating system to manage the shared resources of t...