Alexandra Fedorova

Simon Fraser University · School of Computing Science

About

90 Publications
12,961 Reads
4,125 Citations
Since 2017: 9 research items, 1,475 citations
[Chart: citations per year, 2017–2023]

Publications (90)
Conference Paper
Bitcoin is a top-ranked cryptocurrency that has experienced huge growth and survived numerous attacks. The protocols making up Bitcoin must therefore accommodate the growth of the network and ensure security. Security of the Bitcoin network depends on connectivity between the nodes. Higher connectivity yields better security. In this paper we make...
Conference Paper
Data structure splicing (DSS) refers to reorganizing data structures by merging or splitting them, reordering fields, inlining pointers, etc. DSS has been used, with demonstrated benefits, to improve spatial locality. When data fields that are accessed together are also collocated in the address space, the utilization of hardware caches improves an...
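The spatial-locality benefit of splitting hot fields from cold ones can be illustrated with a little cache-line arithmetic. This is a hand-rolled sketch, not the paper's DSS algorithm; the record layout and sizes are invented for illustration.

```python
CACHE_LINE = 64  # bytes per cache line (typical x86)

def lines_touched(n_records, stride_bytes, line_bytes=CACHE_LINE):
    """Distinct cache lines fetched when scanning one field across
    n_records records laid out contiguously with the given stride."""
    return (n_records * stride_bytes + line_bytes - 1) // line_bytes

# Hypothetical 64-byte record whose traversal reads only one 8-byte hot field:
before = lines_touched(1000, stride_bytes=64)  # one whole record per line
# After splicing the hot field into its own tightly packed array:
after = lines_touched(1000, stride_bytes=8)    # eight hot fields per line
```

With these made-up sizes the spliced layout touches 8x fewer cache lines for the same traversal, which is the cache-utilization effect the summary describes.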
Conference Paper
Software debugging is a time-consuming and challenging process. Supporting debugging has been a focus of the software engineering field since its inception with numerous empirical studies, theories, and tools to support developers in this task. Performance bugs and performance debugging is a sub-genre of debugging that has received less attention....
Conference Paper
Researchers and practitioners dedicate a lot of effort to improving spatial locality in their programs. Hardware caches rely on spatial locality for efficient operation; when it is absent, they waste memory bandwidth and cache space by fetching data that is never used before it is evicted. Improving spatial locality is difficult. For the most part,...
Conference Paper
Heterogeneous systems that integrate a multicore CPU and a GPU on the same die are ubiquitous. On these systems, both the CPU and GPU share the same physical memory as opposed to using separate memory dies. Although integration eliminates the need to copy data between the CPU and the GPU, arranging transparent memory sharing between the two devices...
Article
Heterogeneous systems that integrate a multicore CPU and a GPU on the same die are ubiquitous. On these systems, both the CPU and GPU share the same physical memory as opposed to using separate memory dies. Although integration eliminates the need to copy data between the CPU and the GPU, arranging transparent memory sharing between the two devices...
Article
In this work, we present PolyBlaze, a scalable and configurable multicore platform for FPGA-based embedded systems and systems research. PolyBlaze is an extension of the MicroBlaze soft processor, leveraging the configurability of the MicroBlaze and bringing it into the multicore era with Linux Symmetric Multi- Processor (SMP) support. This work de...
Conference Paper
As a central part of resource management, the OS thread scheduler must maintain a simple invariant: ready threads are scheduled on available cores. As simple as it may seem, we found that this invariant is often broken in Linux. Cores may stay idle for seconds while ready threads are waiting in runqueues. In our experim...
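The invariant is easy to state mechanically. Below is a toy checker using a hypothetical runqueue representation (one list of ready-thread ids per core), not Linux's actual data structures:

```python
def invariant_violated(runqueues):
    """runqueues: one list of ready-thread ids per core; the first entry
    of a non-empty queue is taken to be the thread currently running.
    Returns True if some core is idle while ready threads wait elsewhere."""
    some_core_idle = any(not q for q in runqueues)
    work_waiting = any(len(q) > 1 for q in runqueues)
    return some_core_idle and work_waiting

balanced = invariant_violated([[1], [2]])     # every core busy, no waiters
broken = invariant_violated([[1, 2, 3], []])  # core 1 idle while 2 and 3 wait
```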
Article
Memory access latency on such systems is non-uniform: it depends on where the request originates and where it is destined to go. Such systems are referred to as non-uniform memory access (NUMA) systems. Current x86 NUMA systems are cache coherent (called ccNUMA), which means programs can transparently access memory on local and remote nodes wi...
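The locality argument can be made concrete with a small weighted-average model. The latency figures below are illustrative placeholders, not measurements from the article:

```python
def avg_latency_ns(local_fraction, local_ns=80, remote_ns=130):
    """Average load latency for a mix of local and remote NUMA accesses.
    The 80 ns / 130 ns figures are invented for illustration."""
    return local_fraction * local_ns + (1 - local_fraction) * remote_ns

mostly_local = avg_latency_ns(0.9)  # good page placement
half_remote = avg_latency_ns(0.5)   # poor page placement
```

Under this toy model, letting half the accesses go remote costs roughly 30% extra latency, which is why NUMA-aware placement of pages and threads matters.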
Conference Paper
One of the key decisions made by both MapReduce and HPC cluster management frameworks is the placement of jobs within a cluster. To make this decision, they consider factors like resource constraints within a node or the proximity of data to a process. However, they fail to account for the degree of collocation on the cluster's nodes. A tight proce...
Article
Modern server-class systems are typically built as several multicore chips put together in a single system. Each chip has a local DRAM (dynamic random-access memory) module; together they are referred to as a node. Nodes are connected via a high-speed interconnect, and the system is fully coherent. This means that, transparently to the programmer,...
Article
Full-text available
When designing modern embedded computing systems, most software programmers choose to use multicore processors, possibly in combination with general-purpose graphics processing units (GPGPUs) and/or hardware accelerators. They also often use an embedded Linux O/S and run multi-application workloads that may even be multi-threaded. Modern FPGAs are...
Patent
A modular dynamically re-configurable profiling core may be used to provide both operating systems and applications with detailed information about run time performance bottlenecks and may enable them to address these bottlenecks via scheduling or dynamic compilation. As a result, application software may be able to better leverage the intrinsic na...
Conference Paper
Full-text available
Application virtual address space is divided into pages, each requiring a virtual-to-physical translation in the page table and the TLB. Large working sets, common among modern applications, necessitate a lot of translations, which increases memory consumption and leads to high TLB and page fault rates. To address this problem, recent hardware in...
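The scale of the problem follows from simple arithmetic. The working-set size below is a made-up example; the page sizes are the standard x86-64 ones (4 KiB base pages, 2 MiB huge pages):

```python
KIB = 1024

def translations_needed(working_set_bytes, page_bytes):
    """Number of virtual-to-physical translations covering a working set."""
    return (working_set_bytes + page_bytes - 1) // page_bytes

working_set = 8 * 1024 * 1024 * KIB                           # 8 GiB example
base_pages = translations_needed(working_set, 4 * KIB)        # 4 KiB pages
huge_pages = translations_needed(working_set, 2 * 1024 * KIB) # 2 MiB pages
```

Each 2 MiB page replaces 512 base-page translations, which is the pressure-reduction effect the hardware support mentioned in the summary targets.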
Patent
A thread scheduling technique for assigning multiple threads on a single integrated circuit is dependent on the CPIs of the threads. The technique attempts to balance, to the extent possible, the loads among the processing cores by assigning threads of relatively long-latency (low CPIs) with threads of relatively short-latency (high CPIs) to the sa...
Patent
The disclosed embodiments provide a system that facilitates scheduling threads in a multi-threaded processor with multiple processor cores. During operation, the system executes a first thread in a processor core that is associated with a shared cache. During this execution, the system measures one or more metrics to characterize the first thread....
Conference Paper
Full-text available
NUMA systems are characterized by Non-Uniform Memory Access times, where accessing data in a remote node takes longer than a local access. NUMA hardware has been built since the late 80's, and the operating systems designed for it were optimized for access locality. They co-located memory pages with the threads that accessed them, so as to avoid th...
Conference Paper
NUMA systems are characterized by Non-Uniform Memory Access times, where accessing data in a remote node takes longer than a local access. NUMA hardware has been built since the late 80's, and the operating systems designed for it were optimized for access locality. They co-located memory pages with the threads that accessed them, so as to avoid th...
Conference Paper
Servers in most data centers are often underutilized due to concerns about SLA violations that may result from resource contention as server utilization increases. This low utilization means that neither the capital investment in the servers nor the power consumed is being used as effectively as it could be. In this paper, we present a novel method...
Conference Paper
Full-text available
When multiple threads or processes run on a multi-core CPU they compete for shared resources, such as caches and memory controllers, and can suffer performance degradation as high as 200%. We design and evaluate a new machine learning model that estimates this degradation online, on previously unseen workloads, and without perturbing the execution....
Conference Paper
Performance problems in parallel programs manifest as lack of scalability. These scalability issues are often very difficult to debug. They can stem from synchronization overhead, poor thread scheduling decisions, or contention for hardware resources, such as shared caches. Traditional profiling tools attribute program cycles to different functions...
Article
Chip multicore processors (CMPs) have emerged as the dominant architecture choice for modern computing platforms and will most likely continue to be dominant well into the foreseeable future. As with any system, CMPs offer a unique set of challenges. Chief among them is the shared resource contention that results because CMP cores are not independe...
Conference Paper
Full-text available
Smartphone devices are becoming the de facto personal computing platform, rivaling the desktop, as the number of smartphone users is projected to reach 1.1 billion by 2013. Unlike the desktop, smartphones have a constrained energy budget, which is further challenged by increasingly sophisticated applications. Amongst the most popular applications o...
Article
Contention for shared resources in High-Performance Computing (HPC) clusters occurs when jobs are concurrently executing on the same multicore node (there is contention for shared caches, memory buses, memory controllers and memory domains). The shared resource contention incurs severe degradation to workload performance and stability and hence m...
Article
Shared state access conflicts are one of the greatest sources of error for fine grained parallelism in any domain. Notoriously hard to debug, these conflicts reduce reliability and increase development time. The standard task graph model dictates that tasks with potential conflicting accesses to shared state must be linked by a dependency, even if...
Conference Paper
Modern computing systems increasingly consist of multiple processor cores. From cell phones to datacenters, multicore computing has become the standard. At the same time, our understanding of the performance impact resource sharing has on these platforms is limited, and therefore, prevents these systems from being fully utilized. As the capacity of...
Conference Paper
Full-text available
Simultaneous multithreading (SMT) increases CPU utilization and application performance in many circumstances, but it can be detrimental when performance is limited by application scalability or when there is significant contention for CPU resources. This paper describes an SMT-selection metric that predicts the change in application performance wh...
Article
Heterogeneous multicore architectures promise greater energy/area efficiency than their homogeneous counterparts. This efficiency can only be realized, however, if the operating system assigns applications to appropriate cores based on their architectural properties. While several such heterogeneity-aware algorithms were proposed in the past, they...
Article
Full-text available
Asymmetric multicore processors (AMPs) consist of cores with the same ISA (instruction-set architecture), but different microarchitectural features, speed, and power consumption. Because cores with more complex features and higher speed typically use more area and consume more energy relative to simpler and slower cores, we must use these cores for...
Article
Execution time is no longer the only metric by which computational systems are judged. In fact, explicitly sacrificing raw performance in exchange for energy savings is becoming a common trend in environments ranging from large server farms attempting to minimize cooling costs to mobile devices trying to prolong battery life. Hardware designers, we...
Conference Paper
Large, Internet based companies service user requests from multiple data centers located across the globe. These data centers often house a heterogeneous computing infrastructure and draw electricity from the local electricity market.Reducing the electricity costs of operating these data centers is a challenging problem, and in this work, we propos...
Article
Large, Internet based companies service user requests from multiple data centers located across the globe. These data centers often house a heterogeneous computing infrastructure and draw electricity from the local electricity market. Reducing the electricity costs of operating these data centers is a challenging problem, and in this work, we propo...
Article
Large, Internet based companies service user requests from multiple data centers located across the globe. These data centers often house a heterogeneous computing infrastructure and draw electricity from the local electricity market. Reducing the electricity costs of operating these data centers is a challenging problem, and in this work, we propo...
Conference Paper
On multicore systems, contention for shared resources occurs when memory-intensive threads are co-scheduled on cores that share parts of the memory hierarchy, such as last-level caches and memory controllers. Previous work investigated how contention could be addressed via scheduling. A contention-aware scheduler separates competing threads onto se...
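The separation policy can be sketched as a greedy pairing: rank threads by memory intensity and co-schedule the heaviest with the lightest on each shared cache. This is a toy illustration with invented thread names and miss rates, not the scheduler evaluated in the paper:

```python
def pair_threads(miss_rates):
    """miss_rates: thread name -> last-level-cache misses per instruction.
    Greedily pairs the lightest remaining thread with the heaviest, one
    pair per shared cache; an odd thread out would run alone."""
    ranked = sorted(miss_rates, key=miss_rates.get)  # lightest first
    pairs = []
    while len(ranked) >= 2:
        pairs.append((ranked.pop(0), ranked.pop(-1)))
    return pairs

placement = pair_threads({"A": 0.9, "B": 0.1, "C": 0.8, "D": 0.2})
```

Mixing heavy and light threads this way keeps two memory-intensive threads from competing for the same last-level cache and memory controller.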
Article
Shared state access conflicts are one of the greatest sources of error for fine grained parallelism in any domain. Notoriously hard to debug, these conflicts reduce reliability and increase development time. The standard task graph model dictates that tasks with potential conflicting accesses to shared state must be linked by a dependency, even if...
Conference Paper
Full-text available
Shared state access conflicts are one of the greatest sources of error for fine grained parallelism in any domain. Notoriously hard to debug, these conflicts reduce reliability and increase development time. The standard task graph model dictates that tasks with potential conflicting accesses to shared state must be linked by a dependency, even if...
Conference Paper
Full-text available
Thread scheduling in multi-core systems is a challenging problem because cores on a single chip usually share parts of the memory hierarchy, such as last-level caches, prefetchers and memory controllers, making threads running on different cores interfere with each other while competing for these resources. Data center service providers are interes...
Article
Recent research has highlighted the potential benefits of single-ISA heterogeneous multicore processors over cost-equivalent homogeneous ones, and it is likely that future processors will integrate cores that have the same instruction set architecture (ISA) but offer different performance and power characteristics. To fully tap into the potential o...
Conference Paper
Full-text available
Processor systems contain a limited number of hardware counters that provide some visibility for certain types of interactions, but do not support sophisticated analysis due to limited resources. By contrast, system software simulators provide multidimensional runtime data, but slowdown application execution, often resulting in an inaccurate pictur...
Conference Paper
In this paper, we argue that the modern HPC cluster environments contain several bottlenecks both within cluster multicore nodes and between them in the cluster interconnects. These bottlenecks represent resources that can be of high demand to several jobs, concurrently executing on the cluster. As such, the jobs can compete for accessing these res...
Article
The problem of scheduling on multicore systems remains one of the hottest and the most challenging topics in systems research. Introduction of non-uniform memory access (NUMA) multicore architectures further complicates this problem, as on NUMA systems the scheduler needs not only consider the placement of threads on cores, but also the placemen...
Conference Paper
Full-text available
Markov Random Fields (MRFs) are of great interest to the medical image analysis community but suffer from high computational complexity and difficulties in parameter selection. For these reasons, efforts have been made to develop more efficient algorithms for solving MRF optimization problems in order to enable reduced run-times and better interact...
Article
Contention for shared resources on multicore processors remains an unsolved problem in existing systems despite significant research efforts dedicated to this problem in the past. Previous solutions focused primarily on hardware techniques and software page coloring to mitigate this problem. Our goal is to investigate how and to what extent content...
Conference Paper
Multicore processors have become commonplace in both desktops and servers. A serious challenge with multicore processors is that cores share on- and off-chip resources such as caches, memory buses, and memory controllers. Competition for these shared resources between threads running on different cores can result in severe and unpredictable performanc...
Article
Full-text available
In this position paper, we present our vision for the scheduling infrastructure in a many-core hypervisor - the hypervisor targeted for many-core platforms. The key objectives of our system are scalability and heterogeneity-awareness. We see these as first-order objectives, because future many-core processors will consist of thousands of cores and...
Article
The Intel Core i7 processor code named Nehalem provides a feature named Turbo Boost which opportunistically varies the frequencies of the processor's cores. The frequency of a core is determined by core temperature, the number of active cores, the estimated power consumption, the estimated current consumption, and operating system frequency scaling...
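The list of limiting factors suggests a simple model: the achievable frequency is the base clock plus whatever boost headroom the constraints leave. The 2.66 GHz base and 133 MHz frequency bins below are typical of Nehalem-era parts, but the decision logic is a simplification for illustration, not Intel's actual algorithm:

```python
def turbo_frequency_ghz(active_cores, temp_ok=True, power_ok=True,
                        current_ok=True, base=2.66, max_bins=2, bin_ghz=0.133):
    """Achievable frequency: base clock plus boost bins, with fewer bins
    available as more cores are active, and no boost at all if any
    thermal, power, or current limit is hit."""
    if not (temp_ok and power_ok and current_ok):
        return base
    bins = max(0, max_bins - (active_cores - 1))
    return round(base + bins * bin_ghz, 3)

one_core = turbo_frequency_ghz(active_cores=1)   # maximum boost headroom
all_cores = turbo_frequency_ghz(active_cores=4)  # no headroom left
hot = turbo_frequency_ghz(active_cores=1, temp_ok=False)
```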
Conference Paper
Asymmetric multicore processors (AMP) consist of cores exposing the same instruction-set architecture (ISA) but varying in size, frequency, power consumption and performance. AMPs were shown to be more power efficient than conventional symmetric multicore processors, and it is therefore likely that future multicore systems will include cores of dif...
Conference Paper
Full-text available
Asymmetric multicore processors (AMP) promise higher performance per watt than their symmetric counterparts, and it is likely that future processors will integrate a few fast out-of-order cores, coupled with a large number of simpler, slow cores, all exposing the same instruction-set architecture (ISA). It is well known that one of the most effecti...
Conference Paper
Full-text available
Transactional Memory (TM) is considered as one of the most promising paradigms for developing concurrent applications. TM has been shown to scale well on multiple cores when the data access pattern behaves "well," i.e., when few conflicts are induced. In contrast, data patterns with frequent write sharing, with long transactions, or when many thre...
Conference Paper
Contention for shared resources on multicore processors remains an unsolved problem in existing systems despite significant research efforts dedicated to this problem in the past. Previous solutions focused primarily on hardware techniques and software page coloring to mitigate this problem. Our goal is to investigate how and to what extent content...
Article
Contention for caches, memory controllers, and interconnects can be eased by contention-aware scheduling algorithms.
Conference Paper
On multicore systems contention for shared resources occurs when memory-intensive threads are co-scheduled on cores that share parts of the memory hierarchy, such as last-level caches and memory controllers. Previous work investigated how contention could be addressed via scheduling. A contention-aware scheduler separates competing threads onto sep...
Conference Paper
Full-text available
Symmetric-ISA (instruction set architecture) asymmetric-performance multicore processors were shown to deliver higher performance per watt and area for applications with diverse architectural requirements, and so it is likely that future multicore processors will combine a few fast cores characterized by complex pipelines, high clock frequency, hig...
Article
Full-text available
Video games are a performance-hungry application domain with a complexity that often rivals operating systems. These performance and complexity issues, in combination with tight development times and large teams, mean that consistent, specialized and pervasive support for parallelism is of paramount importance. The Cascade project is focused on...
Article
Full-text available
Several researchers proposed an asymmetric multicore architecture (AMP) that had the potential to save a significant amount of power while delivering similar performance as conventional symmetric multicore processors. The asymmetric multicore architecture was proposed to create more power-efficient CPUs. An AMP consists of cores that use the same i...
Conference Paper
The Intel® Core™ i7 processor code named Nehalem has a novel feature called Turbo Boost which dynamically varies the frequencies of the processor's cores. The frequency of a core is determined by core temperature, the number of active cores, the estimated power and the estimated current consumption. We perform an extensive analysis of the Tur...
Conference Paper
Full-text available
The transition to multicore architectures has dramatically underscored the necessity for parallelism in software. In particular, while new gaming consoles are by and large multicore, most existing video game engines are essentially sequential and thus cannot easily take ad- vantage of this hardware. In this paper we describe techniques derived from...
Article
Single-ISA heterogeneous multicore architectures promise to deliver plenty of cores with varying complexity, speed and performance in the near future. Virtualization enables multiple operating systems to run concurrently as distinct, independent guest domains, thereby reducing core idle time and maximizing throughput. This paper seeks to identify a...
Article
We present a new operating system scheduling algorithm for multicore processors. Our algorithm reduces the effects of unequal CPU cache sharing that occur on these processors and cause unfair CPU sharing, priority inversion, and inadequate CPU accounting. We describe the implementation of our algorithm in the Solaris operating system and demonstrat...
Article
Future heterogeneous single-ISA multicore processors will have an edge in potential performance per watt over comparable homogeneous processors. To fully tap into that potential, the OS scheduler needs to be heterogeneity-aware, so it can match jobs to cores according to characteristics of both. We propose a Heterogeneity-Aware Signature-Supported...
Article
Full-text available
How do we develop software to make the most of the promise that asymmetric multicore systems use a lot less energy?
Article
Full-text available
Future heterogeneous single-ISA multicore processors will have an edge in potential performance per watt over comparable homogeneous processors. To fully tap into that potential, the OS scheduler needs to be heterogeneity-aware, so it can match jobs to cores according to characteristics of both. We propose a Heterogeneity-Aware Signature-Supported...
Article
Full-text available
In this work we describe a methodology for developing simple and robust power models using performance monitoring events for AMD Quad-core systems running OpenSolaris™. The basic idea is correlating power consumption of a benchmark program with its performance (a measure of performance monitoring events). By using applicable model selection an...
Article
Asymmetric multicore processors (AMP) are built of cores that expose the same ISA but differ in performance, complexity, and power consumption. A typical AMP might consist of plenty of slow, small and simple cores and a handful of fast, large and complex cores. AMPs have been proposed as a more energy-efficient alternative to symmetric...
Conference Paper
Cache affinity between a process and a processor is observed when the processor cache has accumulated some amount of the process state, i.e., data or instructions. Cache affinity is exploited by OS schedulers: they tend to re- schedule processes to run on a recently used processor. On conventional (uni- core) multiprocessor systems, exploitation of...
Conference Paper
We describe a new operating system scheduling algorithm that improves performance isolation on chip multiprocessors (CMP). Poor performance isolation occurs when an application's performance is determined by the behaviour of its co-runners, i.e., other applications simultaneously running with it. This performance dependency is caused by unfair, co-...
Article
Much recent research has focused on operating system scheduling algorithms for managing shared resource contention on chip multiprocessors (CMPs) and simultaneous multithreaded (SMT) systems. While the relevance of those algorithms is apparent for server workloads, it is less obvious for desktop workloads. As CMP/SMT processors are becoming inc...
Article
Simultaneous multithreading (SMT) processors run multiple threads simultaneously on a single processing core. Because concurrent threads compete for the processor's shared resources, non-work-conserving scheduling, i.e., running fewer threads than the processor allows even if there are threads ready to run, can often improve performance. Neverthele...
Article
This dissertation addresses operating system thread scheduling for chip multithreaded processors. Chip multithreaded processors are becoming mainstream thanks to their superior performance and power characteristics. Threads running concurrently on a chip multithreaded processor share the processor’s resources. Resource contention, and accordingly p...
Conference Paper
We investigated how operating system design should be adapted for multithreaded chip multiprocessors (CMT) - a new generation of processors that exploit thread-level parallelism to mask the memory latency in modern workloads. We determined that the L2 cache is a critical shared resource on CMT and that an insufficient amount of L2 cache can undermi...
Conference Paper
The unpredictable nature of modern workloads, characterized by frequent branches and control transfers, can result in processor pipeline utilization as low as 19%. Chip multithreading (CMT), a processor architecture combining chip multiprocessing and hardware multithreading, is designed to address this issue. Hardware vendors plan to ship CMT syste...
Conference Paper
The Direct Access File System (DAFS) is a distributed file system built on top of direct-access transports (DAT). Direct-access transports are characterized by using remote direct memory access (RDMA) for data transfer and user-level networking. The motivation behind the DAT-enabled distributed file system architecture is the reduction of the CPU o...
Article
The Direct Access File System (DAFS) is a distributed file system built on top of direct-access transports (DAT). Direct-access transports are characterized by using remote direct memory access (RDMA) for data transfer and user-level networking. The motivation behind the DAT-enabled distributed file system architecture is the reduction of the CPU o...
Conference Paper
Full-text available
The performance of high-speed network-attached storage applications is often limited by end-system overhead, caused primarily by memory copying and network protocol processing. In this paper, we examine alternative strategies for reducing overhead in such systems. We consider optimizations to remote procedure call (RPC)-based data transfer using ei...
Conference Paper
Full-text available
The Direct Access File System (DAFS) is an emerging industrial standard for network-attached storage. DAFS takes advantage of new user-level network interface standards. This enables a user-level file system structure in which client-side functionality for remote data access resides in a library rather than in the kernel. This structure addresses l...
Article
We make a case that a thread scheduler for heterogeneous multicore systems should target three objectives: optimal performance, core assignment balance and response time fairness. Performance optimization via optimal thread-to-core assignment has been explored in the past; in this paper we demonstrate the need for balanced core assignment. We show...
Article
Full-text available
In this paper we argue that the scheduler, as the intermediary between hardware and software, needs to be fully data-aware. The old paradigm of envisioning tasks as amorphous blobs of 'work' to be assigned to processors is incomplete and needs to be expanded. Some techniques and projects have emerged that implicitly use this idea, but either focus on...
Article
While soft real-time applications must run quickly enough to meet the deadline, there is usually no extra benefit from running more quickly than that. This property provides the opportunity for energy savings using Dynamic Voltage and Frequency Scaling (DVFS). In this paper, we propose the GreenRT framework that allows an application to monitor it...
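The core DVFS decision described above (run just fast enough to meet the deadline) can be sketched in a few lines. The frequency ladder and the remaining-work figure are hypothetical, not GreenRT's actual mechanism:

```python
def pick_frequency(freqs_hz, cycles_left, time_left_s):
    """Slowest available frequency that still meets the deadline; falls
    back to the fastest one if even that would miss it."""
    for f in sorted(freqs_hz):
        if cycles_left / f <= time_left_s:
            return f
    return max(freqs_hz)

# 0.9 billion cycles of work, 1 second to the deadline:
chosen = pick_frequency([800e6, 1.2e9, 2.0e9], cycles_left=9e8, time_left_s=1.0)
```

Running at the lowest deadline-meeting frequency saves energy because dynamic power falls with both voltage and frequency, while the job still finishes on time.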
Article
The Cascade Parallel Processing Framework (PPF) is a user-level library that facilitates manual parallelization of complex C++ systems. In Cascade, processing duties of the system are enclosed in a Cascade Task. Tasks are linked by dependencies in a task dependency graph. The task graph is traversed at runtime by the Cascade Job Manager who assig...
Article
The unpredictable nature of modern workloads, characterized by frequent branches and control transfers, can result in processor pipeline utilization as low as 19%. Chip multithreading (CMT), a processor architecture combining chip multiprocessing and hardware multithreading, is designed to address this issue. Hardware vendors plan to ship CMT syste...
Article
Chip multithreading (CMT) combines chip multiprocessing (CMP) and hardware multithreading (MT). In order to make the most of CMT systems when they become available, we have developed the Sam CMT simulator toolkit. A Sam simulation is usable as an interactive system, running at about 100Kips on a 1.2GHz UltraSPARC III and about 200Kips on a 1.8GHz A...
Article
In this paper we examine the use of base vector applications as a tool for classifying an application's usage of a processor's resources. We define a series of base vector applications, simple applications designed to place directed stress on a single processor resource. By co-scheduling base vector applications with a target application o...
Article
In this paper we propose CASC, a cache-aware operating system scheduling algorithm for multithreaded chip multiprocessors (CMT). CMT is emerging as a popular architecture for server platforms, and most major hardware manufacturers plan or already have released CMT processors. It is the job of the operating system to manage the shared resources of t...