Pedro Trancoso

University of Cyprus · Department of Computer Science

PhD

About

124
Publications
12,175
Reads
1,317
Citations
Additional affiliations
January 2002 - present
University of Cyprus
Position
  • Professor (Associate)

Publications (124)
Preprint
Full-text available
With their potential to significantly reduce traffic accidents, enhance road safety, optimize traffic flow, and decrease congestion, autonomous driving systems have become a major focus of research and development in recent years. Beyond these immediate benefits, they offer long-term advantages in promoting sustainable transportation by reducing emissions...
Preprint
Full-text available
Quantum computing has the potential to revolutionize multiple fields by solving complex problems that cannot be solved in reasonable time with current classical computers. Nevertheless, the development of quantum computers is still in its early stages and the available systems still have very limited resources. As such, currently, the most practic...
Article
Resource-efficient Convolutional Neural Networks (CNNs) are gaining more attention. These CNNs have relatively low computational and memory requirements. A common denominator among such CNNs is having more heterogeneity than traditional CNNs. This heterogeneity is present at two levels: intra-layer-type and inter-layer-type. Generic accelerators do...
Preprint
Full-text available
The VEDLIoT project aims to develop energy-efficient Deep Learning methodologies for distributed Artificial Intelligence of Things (AIoT) applications. During our project, we propose a holistic approach that focuses on optimizing algorithms while addressing safety and security challenges inherent to AIoT systems. The foundation of this approach lie...
Conference Paper
Seeking the “sweet spot” in the accuracy-efficiency trade-off is increasing the heterogeneity of state-of-the-art Convolutional Neural Networks (CNNs). Such CNN models exhibit heterogeneity at two levels: intra- and inter-layer-type. Generic accelerators do not capture these levels of heterogeneity. Consequently, researchers have proposed model-spe...
Preprint
The VEDLIoT project targets the development of energy-efficient Deep Learning for distributed AIoT applications. A holistic approach is used to optimize algorithms while also dealing with safety and security challenges. The approach is based on a modular and scalable cognitive IoT hardware platform. Using modular microserver technology enables the...
Preprint
The LEGaTO project leverages task-based programming models to provide a software ecosystem for Made-in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC, balanced with the security and resilience challenges. LEGaTO is...
Conference Paper
This paper describes Approximate Value Reconstruction (AVR), an architecture for approximate memory compression. AVR reduces the memory traffic of applications that tolerate approximations in their dataset. It thereby uses the available off-chip bandwidth more efficiently, significantly improving system performance and energy efficiency. AVR co...
Article
Heterogeneous multicores offer flexibility in the form of different core types and Dynamic Voltage and Frequency Scaling (DVFS), defining a vast configuration space. The optimal configuration choice is not always straightforward, even for single applications, and becomes a very difficult problem for dynamically changing scenarios of concurrent appl...
Article
DRAM caches have shown excellent potential in capturing the spatial and temporal data locality of applications capitalizing on advances of 3D-stacking technology; however, they are still far from their ideal performance. Besides the unavoidable DRAM access to fetch the requested data, tag access is in the critical path, adding significant latency a...
Conference Paper
LEGaTO is a three-year EU H2020 project which started in December 2017. The LEGaTO project will leverage task-based programming models to provide a software ecosystem for Made-in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged...
Conference Paper
LEGaTO is a three-year EU H2020 project which started in December 2017. The LEGaTO project will leverage task-based programming models to provide a software ecosystem for Made-in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged...
Article
Energy is increasingly becoming the major constraint in designing multicore chips. Power and performance are the main components of energy and are inversely correlated. In this paper, we study the energy optimization of multicore chips that process parallel workloads using either power or performance optimization. To do so, we propose novel machine...
Conference Paper
Full-text available
Parallelism is inherent in most problems but due to current programming models and architectures which have evolved from a sequential paradigm, the parallelism exploited is restricted. We believe that the most efficient parallel execution is achieved when applications are represented as graphs of operations and data, which can then be mapped for ex...
Conference Paper
An application may have different sensitivity to faults in different subsets of the data it uses. Some data regions may therefore be more critical than others. Capitalizing on this observation, Odd-ECC provides a mechanism to dynamically select the memory fault tolerance of each allocated page of a program on demand depending on the criticality of...
Article
SWITCHES is a task-based dataflow runtime that implements a lightweight distributed triggering system for runtime dependence resolution and uses static scheduling and compile-time assignment policies to reduce runtime overheads. Unlike other systems, the granularity of loop-tasks can be increased to favor data-locality, even when having dependences...
Conference Paper
Full-text available
Scheduling task-based parallel applications on many-core processors is becoming more challenging and has received lots of attention recently. The main challenge is to efficiently map the tasks to the underlying hardware topology using application characteristics such as the dependences between tasks, in order to satisfy the requirements. To achieve...
Technical Report
Full-text available
The increasing parallelism offered by the parallel architectures introduced by processor vendors, coupled with the need to extract more parallelism out of the applications, has led the community to examine more efficient programming and execution models. The Dataflow Multithreading model is known to be the model that can exploit the most parallelis...
Technical Report
Full-text available
The current trend in high performance processor design is to increase the number of cores so as to achieve the desired performance. While having a large number of cores on a chip seems to be feasible in terms of the hardware, the development of the software that is able to exploit that parallelism is one of the biggest challenges. In this work we propose...
Conference Paper
SWITCHES is a task-based Dataflow runtime that implements a lightweight distributed triggering system for runtime dependency resolution, and uses static scheduling and compile-time assignment policies to reduce runtime overheads. Unlike other systems, the granularity of loop tasks can be increased so as to favor data-locality, even when having dependen...
Conference Paper
As the number of cores increases in a single chip processor, several challenges arise: wire delays, contention for out-of-chip accesses, and core heterogeneity. In order to address these issues and the applications' demands, future large-scale many-core processors are expected to be organized as a collection of NUMA clusters of heterogeneous cores....
Conference Paper
The trend of increasing the number of cores in a processor will lead to certain challenges, among which the fact that more cores issue more memory requests and this in turn will increase the competition, or interference, for shared resources such as the Last-Level Cache (LLC). In this work we focus on the cache interference while executing Decision...
Article
The introduction of multi-core processors has renewed the interest in programming models which can efficiently exploit general purpose parallelism. Data-Flow is one such model which has demonstrated significant potential in the past. However, it is generally associated with functional styles of programming which do not deal well with shared mutable...
Conference Paper
Migrating computation to memory was proposed a long time ago as a way to overcome the memory bandwidth and latency bottleneck, as well as increase the computation parallelism. While the concept had been applied to several research projects it is only recently that the technological hurdles have been solved and we are able to see products arriving t...
Conference Paper
The current trend in processor design is to increase the number of cores so as to achieve a desired performance. While having a large number of cores on a chip seems to be feasible in terms of the hardware, the development of the software that is able to exploit that parallelism is one of the biggest challenges. In this paper we propose a Data-Flow ba...
Conference Paper
The design for continuous computer performance is increasingly becoming limited by the exponential increase in the power consumption. In order to improve the energy efficiency of multicore chips, we propose a novel global power management technique. The goal of the technique is to deliver the maximum performance at a fixed power budget, without sig...
Article
Processors have evolved dramatically in recent years and current multicore systems deliver very high performance. We are observing a rapid increase in the number of cores per processor, thus resulting in denser and more powerful systems. Nevertheless, this evolution will meet several challenges such as power consumption and reliability. It is expe...
Article
Full-text available
The number of computational units integrated in a single processor is rapidly increasing. This suggests that applications will require efficient and effective ways to exploit the parallelism to achieve the performance offered by large-scale multicore processors. The efficient parallelization of the applications relies on the programming and executi...
Conference Paper
Processors have evolved dramatically in recent years and current multicore systems deliver very high performance. We are observing a rapid increase in the number of cores per processor, thus resulting in denser and more powerful systems. Nevertheless, this evolution will meet several challenges such as power consumption and reliability. It is expe...
Conference Paper
Thanks to the improvements in semiconductor technologies, extreme-scale systems such as teradevices (i.e., composed of 1000 billion transistors) will enable systems with 1000+ general purpose cores per chip, probably by 2020. Three major challenges have been identified: programmability, manageable architecture design, and reliability. TERAFLUX i...
Conference Paper
Current processor trends show an increasing number of cores and a diversity of characteristics among them. Such processors offer a large potential for achieving high performance for different applications. Nevertheless, exploiting the characteristics of such processors is a challenge. In particular, considering all cores to be the same for scheduli...
Conference Paper
The massive addition of cores on a chip is adding more pressure to the accesses to main memory. In order to avoid this bottleneck, we propose the use of a simple producer-consumer model, which allows for the temporary results to be transferred directly from one task to another. These data transfer operations are performed within the chip, using on-...
Conference Paper
Decision Support System (DSS) workloads are known to be one of the most time-consuming database workloads that process large data sets. Traditionally, DSS queries have been accelerated using large-scale multiprocessors. In this work we exploit the benefits of using future many-core architectures, more specifically on-chip clustered many-core archit...
Conference Paper
Full-text available
The switch to Multi-core systems has ended the reliance on the single processor for increase in performance and moved into Parallelism. However, the exponential growth in performance of the single processor in the 80's and 90's had overshadowed the drive for efficient Parallelism and relegated it to a niche research area, mostly for High Performan...
Article
Currently, we are facing a situation where applications exhibit increasing computational demands and where a large variety of parallel processor systems are available. In this paper we focus on exploiting fine-grain parallelism for three applications with distinct characteristics: a Bioinformatics application (MrBayes), a Molecular Dynamics applica...
Conference Paper
Full-text available
Reconfigurable hardware can be used as an energy and performance efficient co-processing solution to accelerate certain types of applications. To facilitate the design of hardware accelerators we have proposed a methodology that adopts the stream-based computing model and the usage of Graphics Processing Units as prototyping platforms. In this pape...
Conference Paper
As the number of cores increases in multi-core processors, more applications execute at the same time. In this paper we present a simple and non-intrusive approach that guarantees performance isolation for High Performance Applications. This is achieved using virtualization by creating multiple virtual machines on the same processor, which can be s...
Conference Paper
Full-text available
Low-Density Parity-Check (LDPC) codes are powerful error correcting codes used today in communication standards such as DVB-S2 and WiMAX to transmit data inside noisy channels with high error probability. LDPC decoding is computationally demanding and requires irregular accesses to memory which makes it suitable for parallelization. The recent intr...
Article
Multi-core processors have renewed interest in programming models which can efficiently exploit general purpose parallelism. Data-Flow is one such model which has demonstrated significant potential in the past. However, it is generally associated with functional styles of programming which do not deal well with shared mutable state. There have been...
Article
Full-text available
Decision Support System (DSS) workloads are known to be one of the most time-consuming database workloads that process large data sets. Traditionally, DSS queries have been accelerated using large-scale multiprocessors. In this work we analyze the benefits of using future many-core architectures, more specifically on-chip clustered many-core archite...
Article
Full-text available
HPC system architectures are shifting from the traditional clusters of homogeneous nodes to clusters of heterogeneous nodes and accelerators. The future of high-performance computing (HPC) from the technologies developed today to showcase the leadership-class compute systems, the supercomputers. These machines are usually designed to achieve the hi...
Conference Paper
Processors have evolved to the now de-facto standard multi-core architecture. The continuous advances in technology allow for increased component density, thus resulting in a larger number of cores on the chip. This, in turn, places pressure on the off-chip and pin bandwidth. Large Last-Level Caches (LLC), which are shared among all cores, have bee...
Chapter
Current desktop computers are heterogeneous systems that integrate different types of processors. For example, general-purpose processors and GPUs do not only have different characteristics but also adopt diverse programming models. Despite these differences, data parallelism is exploited for both types of processors, by using application processin...
Article
Full-text available
The Cell Broadband Engine is a heterogeneous chip multiprocessor that combines a PowerPC processor core with eight single-instruction multiple-data accelerator cores and delivers high performance on many computationally intensive codes.
Conference Paper
Full-text available
We are currently faced with the situation where applications have increasing computational demands and there is a wide selection of parallel processor systems. In this paper we focus on exploiting fine-grain parallelism for a demanding Bioinformatics application - MrBayes - and its Phylogenetic Likelihood Functions (PLF) using different archi...
Conference Paper
Full-text available
The need to exploit multi-core systems for parallel processing has revived the concept of dataflow. In particular, the Dataflow Multithreading architectures have proven to be good candidates for these systems. In this work we propose an abstraction layer that enables compiling and running a program written for an abstract Dataflow Multithread...
Conference Paper
Full-text available
Decision Support System (DSS) workloads are known to be one of the most time-consuming database workloads that process large data sets. Traditionally, DSS queries have been accelerated using large-scale multiprocessors. The topic addressed in this work is to analyze the benefits of using high-performance/low-cost processors such as the GPUs and th...
Conference Paper
Full-text available
In this paper we present thread flux (TFlux), a complete system that supports the data-driven multithreading (DDM) model of execution. TFlux virtualizes any details of the underlying system therefore offering the same programming model independently of the architecture. To achieve this goal, TFlux has a runtime support that is built on top of a com...
Conference Paper
For many applications, data is collected at very large rates from various sources. Applications that produce results from this data have a requirement for very efficient processing in order to achieve timely decisions. An example of such a demanding application is one that makes decisions on stock acquisition based on the price updates that happen...
Article
The monitoring of data streams is a very important issue in many different areas. Aspects such as accuracy, the speed of response, the use of memory and the adaptability to the changing nature of data may vary in importance depending on the situation. Examples such as Web page access monitoring, approximate aggregation in relational queries or IP m...
Article
Full-text available
This paper presents the FPGA implementation of the prototype for the Data-Driven Chip-Multiprocessor (D²-CMP). In particular, we study the implementation of a Thread Synchronization Unit (TSU) on FPGA, a hardware unit that enables thread execution using dataflow-like scheduling policy on a chip multiprocessor. Threads are scheduled for execution...
Conference Paper
Historically, technology has been the main driver of computer performance. For many system generations, CMOS scaling has been leveraged to increase clock speed and build increasingly complex microarchitectures. As technology-driven performance gains ...
Conference Paper
Due to limitations in the traditional microprocessor design, such as high complexity and power, all current commercial high-end processors contain multiple cores on the same chip (multicore). This trend is expected to continue resulting in increasing number of cores on the chip. While these cores may be used to achieve higher throughput, improving...
Conference Paper
The current trend in microprocessor design is to keep the design simple while packing more processors on the same die in order to increase the performance. These chips are called chip multiprocessors (CMP) or multicores. While currently the parallelism provided by the multiple cores is used to achieve higher throughput, it is possible to split a si...
Article
Computer systems have evolved significantly in recent years leading to high-performance systems. This, however, has come at the cost of large power dissipation. As such, power-awareness has become a major factor in processor design. Therefore, it is important to have a complete understanding of the power and performance behavior of all processor...
Article
Full-text available
The increased complexity and operating frequency in current single chip microprocessors is resulting in a decrease in the performance improvements. Consequently, major manufacturers offer chip multiprocessor (CMP) architectures in order to keep up with the expected performance gains. This architecture is successfully being introduced in many market...
Article
Although the dataflow model of execution, with its obvious benefits, has been proposed for a long time, it has not yet been successfully exploited. Nevertheless, as traditional systems have recently started to reach their limits in delivering higher performance, new models of execution that use dataflow-like concepts are being studied. Among these,...
Article
Full-text available
This paper describes the Data-Driven Multithreading (DDM) model and how it may be implemented using off-the-shelf microprocessors. Data-Driven Multithreading is a nonblocking multithreading execution model that tolerates internode latency by scheduling threads for execution based on data availability. Scheduling based on data availability can be us...
Conference Paper
Full-text available
The properties of a computer system such as performance, energy, reliability, etc. are commonly evaluated by running benchmarks. However, the benchmarking process is complicated to set up and use, and running the benchmarks takes a substantial amount of time. Furthermore, when designing a computer, architects resort to simulation of the system, increa...
Conference Paper
The Data-Driven Multithreading Chip Multiprocessor (DDM-CMP) architecture has been shown to overcome the power and memory wall limitations by combining two key technologies: the use of the Data-Driven Multithreading (DDM) model of execution, and the Chip-Multiprocessor architecture. DDM is able to hide memory and synchronization latencies providing...
Article
Current high-end microprocessors achieve high performance as a result of adding more features and therefore increasing complexity. This paper makes the case for a Chip-Multiprocessor based on the Data-Driven Multithreading (DDM-CMP) execution model in order to overcome the limitations of current design trends. Data-Driven Multithreading (DDM) is a...
Article
Data-Driven Multithreading is a non-blocking multithreading model of execution that provides effective latency tolerance by allowing the computation processor to do useful work, while a long latency event is in progress. With the Data-Driven Multithreading model, a thread is scheduled for execution only if all of its inputs have been produced and plac...
Article
Full-text available
Bloom filters are not able to handle deletes and inserts on multisets over time. This is important in many situations when streamed data evolve rapidly and change patterns frequently. Counting Bloom Filters (CBF) have been proposed to overcome this limitation and allow for the dynamic evolution of Bloom filters. The only dynamic approach to a compa...
Conference Paper
Microprocessor development costs are considerably high. To minimize these costs, manufacturers produce a single design that better satisfies, on average, a wide range of applications. Nevertheless, as applications have different characteristics and users have different demands, this single design is suboptimal in many situations. As a consequence t...
Conference Paper
The increased complexity and operating frequency in current microprocessors is resulting in a decrease in the performance improvements. In order to keep up with the expected performance gains, major manufacturers have started to offer chip-multiprocessor architectures. Nevertheless, the integration of several cores on the same chip leads to increas...
Chapter
The different components of the parallel computer system are presented. The topics covered are the parallelism within the modern microprocessors, multiprocessor architectures of different scales, interconnection networks, and parallel input/output systems. The objective is to describe the different options