Nathalie Drach-Temam
Sorbonne University | UPMC · Laboratoire d'informatique de Paris 6 (LIP6)
About
44 Publications · 5,414 Reads
283 Citations
Publications (44)
Dynamic task-parallel programming models are popular on shared-memory systems, promising enhanced scalability, load balancing and locality. Yet these promises are undermined by non-uniform memory access (NUMA). We show that using NUMA-aware task and data placement, it is possible to preserve the uniform abstraction of both computing and memory reso...
Dynamic task parallelism is a popular programming model on shared-memory systems. Compared to data parallel loop-based concurrency, it promises enhanced scalability, load balancing and locality. These promises, however, are undermined by non-uniform memory access (NUMA) systems. We show that it is possible to preserve the uniform hardware abstracti...
We present a joint scheduling and memory allocation algorithm for efficient execution of task-parallel programs on non-uniform memory architecture (NUMA) systems. Task and data placement decisions are based on a static description of the memory hierarchy and on runtime information about intertask communication. Existing locality-aware scheduling st...
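The locality-first scheduling policy these abstracts describe can be illustrated with a toy sketch (all names here are hypothetical, not the papers' actual algorithm): tasks carry a preferred NUMA node, each node keeps its own queue, and a worker drains its local queue before stealing from a remote one.

```python
from collections import deque

class LocalityScheduler:
    """Toy locality-first scheduler: one task queue per NUMA node."""

    def __init__(self, num_nodes):
        self.queues = [deque() for _ in range(num_nodes)]

    def submit(self, task, preferred_node):
        self.queues[preferred_node].append(task)

    def next_task(self, node):
        """Local work first; steal from the fullest remote queue otherwise."""
        if self.queues[node]:
            return self.queues[node].popleft(), node
        victim = max(range(len(self.queues)), key=lambda n: len(self.queues[n]))
        if self.queues[victim]:
            return self.queues[victim].popleft(), victim
        return None, None

sched = LocalityScheduler(num_nodes=2)
sched.submit("t0", preferred_node=0)
sched.submit("t1", preferred_node=1)
task, origin = sched.next_task(0)    # local hit on node 0
task2, origin2 = sched.next_task(0)  # local queue empty: steal from node 1
```

A real runtime would also place the task's data on the node where it runs; this sketch only captures the scheduling side of the policy.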
To efficiently exploit the resources of new many-core architectures, integrating dozens or even hundreds of cores per chip, parallel programming models have evolved to expose massive amounts of parallelism, often in the form of fine-grained tasks. Task-parallel languages, such as OpenStream, X10, Habanero Java and C or StarSs, simplify the developm...
Embedded systems based on FPGAs (Field-Programmable Gate Arrays) must deliver more performance for new applications. However, no high-performance superscalar soft processor is available on FPGAs, because the superscalar architecture is not well suited to them. High-performance superscalar processors execute instructions out of order, and it is nece...
Complex applications, such as multimedia, telephony or cryptography, in embedded systems must provide more and more performance that can be achieved by using multiple levels of parallelism. Today, FPGA are viable alternatives for these kinds of applications. Unfortunately, the available processors on FPGA do not provide sufficient performance. This...
Data prefetching is an effective way to bridge the increasing performance gap between processor and memory. Prefetching can improve performance, but it has some side effects which may lead to no performance improvement while increasing memory pressure, or to performance degradation. Adaptive prefetching aims at reducing negative effects of prefetchin...
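The general idea of adaptive prefetching can be sketched as follows (a hedged illustration, not the paper's mechanism): a stride prefetcher tracks how many of its own prefetches turn out useful and stops issuing new ones when its accuracy drops too low, limiting pollution and memory pressure.

```python
class AdaptiveStridePrefetcher:
    """Toy stride prefetcher that throttles itself on low accuracy."""

    def __init__(self, min_accuracy=0.5):
        self.last_addr = None
        self.stride = None
        self.issued = set()   # prefetched addresses not yet referenced
        self.useful = 0
        self.total = 0
        self.min_accuracy = min_accuracy

    def accuracy(self):
        return self.useful / self.total if self.total else 1.0

    def access(self, addr):
        """Record a demand access; return the address to prefetch, or None."""
        if addr in self.issued:            # an earlier prefetch proved useful
            self.issued.discard(addr)
            self.useful += 1
        prediction = None
        if self.last_addr is not None:
            new_stride = addr - self.last_addr
            if new_stride == self.stride and self.accuracy() >= self.min_accuracy:
                prediction = addr + self.stride
                self.issued.add(prediction)
                self.total += 1
            self.stride = new_stride
        self.last_addr = addr
        return prediction

pf = AdaptiveStridePrefetcher()
prefetches = [pf.access(a) for a in (100, 104, 108, 112)]
```

On this regular trace the prefetcher needs two accesses to confirm the stride of 4, then predicts the next line on every access; an irregular trace would drive its accuracy down and silence it.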
To ensure code integrity in secure embedded processors, most previous works focus on detecting attacks without addressing recovery. This paper proposes a novel hardware recovery approach allowing the processor to resume execution after detecting an attack. The experimental results demonstrate that our scheme introduces a very...
For security issues in portable applications such as smart card, various proposed techniques can be applied to harden the ALU against fault attacks. Among others, time redundancy is a good candidate to offer a low hardware cost. The main disadvantage of this scheme is high extra time due to the recomputation. However, this impact can be considerabl...
For security issues in portable applications such as smart card, various proposed techniques can be applied to harden the ALU against fault attacks. Among others, time redundancy is a good candidate to offer a low hardware cost. The main disadvantage of this scheme is high extra time due to the recomputation. However, this impact can be much reduce...
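The time-redundancy principle behind these two abstracts can be shown in a minimal sketch (assumed names, not the papers' design): the ALU operation is executed twice and the results compared, so a transient fault that corrupts only one pass is caught as a mismatch.

```python
def redundant_alu(op, a, b, inject_fault_on_pass=None):
    """Run `op` twice and compare; raise on disagreement (fault detected)."""
    results = []
    for pass_id in (1, 2):
        r = op(a, b)
        if pass_id == inject_fault_on_pass:
            r ^= 1  # model a single-bit transient fault on this pass
        results.append(r)
    if results[0] != results[1]:
        raise RuntimeError("fault detected: recompute or recover")
    return results[0]

ok = redundant_alu(lambda x, y: x + y, 3, 4)  # both passes agree
detected = False
try:
    redundant_alu(lambda x, y: x + y, 3, 4, inject_fault_on_pass=2)
except RuntimeError:
    detected = True
```

The "high extra time" the abstracts mention is visible here: every operation costs two passes, which is exactly the overhead the proposed schemes aim to reduce.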
Static code analysis and optimization have their limitations. Indeed, a great deal of the information essential to efficient code generation is only known at run time (notably, the behavior of memory accesses and of control instructions). Iterative optimization techniques make it possible to integrate into the optimi...
Increasingly complex consumer electronics applications call for embedded processors with higher performance. Multi-cores are capable of delivering the required performance. However, many of these embedded applications must meet some form of soft real-time constraints, and program behavior on multi-cores is even harder to predict than on single-core...
The characteristics of multimedia applications when executed on general-purpose processors are not well understood. Such knowledge is extremely important in guiding the development of multimedia applications and the design of future processors. In this paper, we characterize and optimize the performance of multimedia applications on superscalar proc...
To meet the high demand for powerful embedded processors, VLIW architectures are increasingly complex (e.g., multiple clusters), and moreover, they now run increasingly sophisticated control-intensive applications. As a result, developing architecture-specific compiler optimizations is becoming both increasingly critical and complex, while time-to-...
Given the growing complexity of embedded applications, the performance gain of the associated systems is crucial and must respect cost and power-consumption constraints. The performance of the VLIW architectures used in these systems depends on statically derived instruction-level parallelism. Hence, the availability of a compil...
Techniques to reduce or tolerate large memory latencies are critical for achieving high processor performance. Hardware data prefetching is one of the most heavily studied solutions, but it is essentially applied to first-level caches where it can severely disrupt processor behavior by delaying normal cache requests, inducing cache pollution and oc...
Media processing has become one of the dominant computing workloads. In this context, SIMD instructions have been introduced in current processors to raise performance, often the main goal of microprocessor designers. Today, however, designers have become concerned with the power consumption, and in some cases low power is the main design goal (lap...
This paper presents the performance of DSP, image and 3D applications on recent general-purpose microprocessors using streaming SIMD ISA extensions (integer and floating point). The 9 benchmarks we use for this evaluation have been optimized for DLP and cache use with SIMD extensions and data prefetch. The result of these cumulated optim...
In this paper we evaluate the performance of an SMT processor used as the geometry processor for a 3D polygonal rendering engine. To evaluate this approach, we consider PMesa (a parallel version of Mesa) which parallelizes the geometry stage of the 3D pipeline. We show that SMT is suitable for 3D geometry and we characterize the execution of the ge...
We have designed an algorithm which allows the OpenGL geometry transformations to be processed on a multiprocessor system. We have integrated it into Mesa, a 3D graphics library with an API very similar to that of OpenGL. We obtain speedups of up to 1.8 on a biprocessor without any modification of the application using the library. In this paper we...
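The parallelization described above can be sketched in miniature (a toy illustration, not Mesa/PMesa code): the vertex array is split into chunks and the geometry transform is applied to each chunk in parallel, with a 2x2 matrix standing in for the full 4x4 OpenGL transform.

```python
from concurrent.futures import ThreadPoolExecutor

def transform(matrix, vertices):
    """Apply a 2x2 matrix to a list of (x, y) vertices."""
    (a, b), (c, d) = matrix
    return [(a * x + b * y, c * x + d * y) for x, y in vertices]

def parallel_transform(matrix, vertices, workers=2):
    """Split the vertex list into chunks and transform them in parallel."""
    chunk = (len(vertices) + workers - 1) // workers
    parts = [vertices[i:i + chunk] for i in range(0, len(vertices), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: transform(matrix, p), parts)
    return [v for part in results for v in part]

scale2 = ((2, 0), (0, 2))
verts = [(1, 1), (2, 3), (4, 5), (6, 7)]
out = parallel_transform(scale2, verts)
```

Vertex order is preserved because `map` yields results in submission order, which matters when the transformed vertices feed a rendering stage that expects the original ordering.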
The quality of a real-time high-end virtual reality system depends on its ability to draw millions of textured triangles in 1/60 s. The idea of using commodity PC 3D accelerators to build a parallel machine instead of custom ASICs seems more and more attractive as such chips are getting faster. If image parallelism is used, designers have the choic...
The quality of a real-time high end virtual reality system depends on its ability to draw millions of textured triangles in 1/60s. The idea of using commodity PC 3D accelerators to build a parallel machine instead of custom ASICs seems more and more attractive as such chips are getting faster. If image parallelism is used, designers have the choice...
We have designed an algorithm which allows the OpenGL geometry transformations to be processed on a multiprocessor system. We have integrated it into Mesa, a 3D graphics library with an API very similar to that of OpenGL. We obtain speedups of up to 1.8 on a biprocessor without any modification of the application using the library. In this paper we s...
A sort-last 3D parallel rendering machine distributes the triangles to draw across different processors. When building such a machine with each processor having a texture cache, texture locality is worse and performance is reduced. This article investigates two schemes to preserve this locality while keeping good load balancing: triangle sl...
As technology enables to integrate real-time good quality 3D rendering in a single chip, the classical problem of the gap between internal data bandwidth and external memories arises. The texture mapping function requires a tremendous number of texture accesses and many past implementations have been based on costly high bandwidth external memory....
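Why texel locality matters for the bandwidth problem above can be shown with a minimal direct-mapped cache simulator (all parameters assumed for illustration): sequential texel accesses hit on every texel of a block after the first, while block-strided accesses never reuse a fetched block.

```python
def hit_rate(accesses, lines=64, block=4):
    """Simulate a direct-mapped cache of `lines` blocks of `block` texels."""
    cache = [None] * lines  # tag stored per cache line
    hits = 0
    for addr in accesses:
        tag = addr // block // lines
        index = (addr // block) % lines
        if cache[index] == tag:
            hits += 1
        else:
            cache[index] = tag  # miss: fetch the block from memory
    return hits / len(accesses)

sequential = list(range(256))       # good spatial locality within blocks
strided = list(range(0, 1024, 4))   # one access per block: no reuse
good = hit_rate(sequential)
bad = hit_rate(strided)
```

With a block of 4 texels, the sequential trace misses once per block and hits three times (hit rate 0.75), while the strided trace misses on every access; the gap is the bandwidth that an on-chip texture cache recovers.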
Hardware and software cache optimizations are active fields of research that have yielded powerful but occasionally complex designs and algorithms. The purpose of this paper is to investigate the performance achieved by combining simple software and hardware optimizations. Because current caches provide little flexibility for exploiting temporal and...
Sizes of on-chip caches on current commercial microprocessors range from 16 Kbytes to 36 Kbytes. These microprocessors can be directly used in the design of a low-cost single-bus shared-memory multiprocessor without using any second-level cache. In this paper, we explore the viability of such a multi-microprocessor. Simulation results clearly est...
As the tag check may be executed in a specific pipeline stage, cache pipelining allows the same processor cycle time to be reached with a set-associative cache as with a direct-mapped cache. On a direct-mapped cache, the data or the instruction flowing out of the cache may be used in parallel with the tag check. When using a pipelined cache, such an opti...
In 1993, sizes of on-chip caches on commercial microprocessors ranged from 16 Kbytes to 36 Kbytes. These microprocessors can be directly used in the design of a low-cost single-bus shared-memory multiprocessor without using any second-level cache. In this paper, we explore the viability of such a multi-microprocessor. Simulation results...
The purpose of the semi-unified on-chip cache organization is to use the data cache (resp. instruction cache) as an on-chip second-level cache for instructions (resp. data). Thus the associativity degree of both on-chip caches is artificially increased, and the cache spaces respectively devoted to instructions and data are dynamically adjusted....
Pipelining is a major technique used in high-performance processors, but a fundamental drawback of pipelining is the time lost due to branch instructions. A new organization for implementing branch instructions is presented: the Multiple Instruction Decode Effective Execution (MIDEE) organization. All the pipeline depths may be addressed using th...
The performance of systems built around a microprocessor depends more and more on the performance of the memory hierarchy, and in particular on the caches. Indeed, in recent years the processor cycle time has decreased much faster than the access time of main memory. This trend has increased the i...
Abstract: While embedded systems increasingly adopt multicore architectures, the applications targeting them are still parallelized statically and in an application-specific way in order to improve the predictability of execution time. We propose a method, which we call supervised parallelization, to obta...
Abstract: It is difficult to predict the behavior of applications on general-purpose or high-performance embedded architectures, and therefore to optimize these applications effectively. These architectures include dynamic components (caches, branch predictors, ...) whose behavior is difficult to predict statically (before execution)...