
Dirk Schmidl
RWTH Aachen University · IT Center
Dr. rer. nat.
About
28 Publications
21,461 Reads
721 Citations
Introduction
Additional affiliations
July 2009 - June 2016
Publications (28)
A future large-scale high-performance computing (HPC) cluster will likely be power capped, since surrounding infrastructure such as power supply and cooling is constrained. For such a cluster, it may be impossible to supply thermal design power (TDP) to all components. The default power supply of current systems guarantees TDP to each computing node...
Intel’s Knights Landing processor (KNL) is the latest product in the Xeon Phi line. As a self-hosted system, it is the first commercially available many-core architecture that can run unmodified applications. This makes KNL a very interesting option for HPC centers that have to support many different applications, including community and IS...
The tasking feature enriches OpenMP with a way to express parallelism more generally than before, as it can be applied not only to loops but also to recursive algorithms, without the need for nested parallel regions. However, the performance of a tasking program is very much influenced by the task scheduling inside the OpenMP runtime. Especially on la...
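As a minimal illustration of the construct discussed here, the sketch below parallelizes a naive recursive Fibonacci computation with OpenMP tasks. It is a common textbook pattern, not code from the paper.

#include <stdio.h>
#include <omp.h>

/* Naive recursive Fibonacci, parallelized with OpenMP tasks.
 * Each recursive call becomes a task; taskwait joins the children. */
long fib(int n)
{
    long x, y;
    if (n < 2)
        return n;

    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait        /* wait for both child tasks */

    return x + y;
}

int main(void)
{
    long result;
    #pragma omp parallel
    {
        #pragma omp single      /* one thread spawns the task tree */
        result = fib(30);
    }
    printf("fib(30) = %ld\n", result);
    return 0;
}

In practice such a kernel also needs a cutoff (e.g. an if clause or a serial fallback for small n) so that task creation overhead does not dominate; exactly this kind of scheduling behavior is what the runtime influences.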
Next-generation sequencing techniques have rapidly reduced the cost of sequencing a genome, but they come with a relatively high error rate. Therefore, error correction of this data is a necessary step before assembly can take place. Since the input data is huge and error correction is compute-intensive, parallelizing this work on a modern shared-memory syst...
Modern processors contain many features to reduce the energy consumption of the chip. The gain from these features depends highly on the executed workload. In this work, we investigate the energy consumption of OpenMP applications on the new Intel processor generation, called Haswell. We start with the basic chip characteristics of the c...
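One common way to measure such energy consumption on Intel chips is the RAPL counter interface. The sketch below reads the package energy through the Linux powercap sysfs files; this is a generic illustration, not the paper's methodology. It assumes the intel_rapl driver is loaded, and note that recent kernels restrict the energy_uj file to privileged users.

#include <stdio.h>

/* Read the cumulative package-0 energy counter (microjoules)
 * from the powercap/RAPL sysfs interface; returns -1 on failure. */
static long long read_energy_uj(void)
{
    long long uj = -1;
    FILE *f = fopen(
        "/sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj", "r");
    if (f) {
        if (fscanf(f, "%lld", &uj) != 1)
            uj = -1;
        fclose(f);
    }
    return uj;
}

int main(void)
{
    long long before = read_energy_uj();
    /* ... run the workload to be measured here ... */
    long long after = read_energy_uj();
    printf("energy consumed: %lld uJ\n", after - before);
    return 0;
}

A real measurement would also handle counter wraparound and sample all packages of a multi-socket node.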
Extended Abstract can be found here:
http://sc14.supercomputing.org/sites/all/themes/sc14/files/archive/tech_poster/tech_poster_pages/post108.html
OpenMP 4.0 extended affinity support to allow pinning of threads to places. Places are an abstraction of machine locations which in many cases do not require extensive hardware knowledge by the user. For memory affinity, i.e. data initialization and migration on NUMA systems, support is still missing in OpenMP. In this work we present an extension...
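For context, the sketch below shows the standard OpenMP 4.0 thread-affinity mechanism the abstract starts from (places plus a proc_bind policy); it is not the memory-affinity extension the paper proposes, and the place-query routines it calls were only added in OpenMP 4.5.

/* Pinning threads to places with OpenMP affinity. Run e.g. with:
 *   OMP_PLACES=cores   OMP_PROC_BIND=close  ./a.out
 *   OMP_PLACES=sockets OMP_PROC_BIND=spread ./a.out */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel proc_bind(spread)
    {
        printf("thread %d runs on place %d of %d\n",
               omp_get_thread_num(),
               omp_get_place_num(),     /* OpenMP 4.5 query routine */
               omp_get_num_places());
    }
    return 0;
}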
OpenMP is one of the most widely used standards for enabling thread-level parallelism in high performance computing codes. The recently released version 4.0 of the specification introduces directives that enable application developers to offload portions of the computation to massively-parallel target devices. However, to efficiently utilize these...
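A minimal sketch of the offload directives introduced in version 4.0 follows; the map clauses control data movement between host and device memory. This is a generic vector-add illustration, not code from the paper.

#include <stdio.h>
#define N 1024

int main(void)
{
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

    /* Offload the loop to the default target device; copy a and b
     * to the device and c back to the host. */
    #pragma omp target map(to: a, b) map(from: c)
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[100] = %f\n", c[100]);   /* expect 300.0 */
    return 0;
}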
In 2008, task-based parallelism was added to OpenMP as the major update for version 3.0. Tasks provide an easy way to express dynamic parallelism in OpenMP applications. However, achieving good performance with OpenMP task-parallel programs is challenging. OpenMP runtime systems are free to schedule, interrupt and resume tasks in many diffe...
Different types of shared memory machines with large core counts exist today. Standard x86-based servers are built with up to eight sockets per machine. To obtain larger machines, some companies, like SGI or Bull, invented special interconnects to couple a number of small servers into one larger SMP; ScaleMP uses a special software layer on top of a...
The Intel Xeon Phi has been introduced as a new type of compute accelerator that is capable of executing native x86 applications. It supports programming models that are well-established in the HPC community, namely MPI and OpenMP, thus removing the necessity to refactor codes for using accelerator-specific programming paradigms. Because of its nat...
Parallel programming and performance optimization of parallel programs are not simple tasks. Various HPC and OpenMP courses as well as literature serve as introductions to this topic. Assuming the role of HPC beginners, we evaluate how far the knowledge acquired from introductory courses and literature can drive performance optimization of a conjugat...
With the task construct, the OpenMP 3.0 specification introduces an additional level of parallelism that challenges established schemes of performance profiling. First, a thread may execute a sequence of interleaved task fragments the profiling system must properly distinguish to enable correct performance analyses. Furthermore, the additional para...
The multicore era has led to a renaissance of shared memory parallel programming models. Moreover, the introduction of task-level parallelization raises the level of abstraction compared to thread-centric expression of parallelism. However, tasks might exhibit poor performance on NUMA systems if locality cannot be controlled and non-local data is a...
Version 3.0 of the OpenMP specification introduced the task construct for the explicit expression of dynamic task parallelism. Although automated load-balancing capabilities make it an attractive parallelization approach for programmers, the difficulty of integrating this new dimension of parallelism into traditional models of performance data has...
The introduction of task-level parallelization promises to raise the level of abstraction compared to thread-centric expression of parallelism. However, tasks might exhibit poor performance on NUMA systems if locality cannot be maintained. In contrast to traditional OpenMP worksharing constructs for which threads can be bound, the behavior of tasks...
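The baseline these task-locality papers contrast against is the first-touch idiom for worksharing loops, sketched below: on Linux NUMA systems, a page is placed on the node of the thread that first writes it, so initializing data with the same static schedule as the compute loop keeps accesses local. This is the common idiom, not code from the papers.

#include <stdlib.h>

#define N (1 << 24)   /* 16M doubles, 128 MB */

int main(void)
{
    double *a = malloc(N * sizeof(double));

    /* First touch: each thread initializes the pages it will use. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* The identical static schedule means every thread now computes
     * on pages resident in its local NUMA memory. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * a[i] + 1.0;

    free(a);
    return 0;
}

For tasks this guarantee breaks down, because the runtime may execute a task on any thread of the team, which is precisely the problem these papers address.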
This paper gives an overview of the Score-P performance measurement infrastructure, which is being jointly developed by leading HPC performance tools groups. It motivates the advantages of the joint undertaking from both the developer and the user perspectives, and presents the design and components of the newly developed Score-P performance m...
ScaleMP's vSMP software turns commodity InfiniBand clusters with Intel's x86 processors into large shared memory machines providing a single system image at low cost. However, codes need to be tuned to deliver good performance on these machines. TrajSearch, developed at the Institute for Combustion Technology at RWTH Aachen University, is a post-pr...
The rapidly growing number of cores on modern supercomputers imposes scalability demands not only on applications but also on the software tools needed for their development. At the same time, increasing application and system complexity makes the optimization of parallel codes more difficult, creating a need for scalable performance-analysis techn...
Today most multi-socket shared memory systems exhibit a non-uniform memory architecture (NUMA). However, programming models such as OpenMP do not provide explicit support for that. To overcome this limitation, we propose a platform-independent approach to describe the system topology and to place threads on the hardware. A distance matrix provides...
With version 3.0, the OpenMP specification introduced a task construct and with it an additional dimension of concurrency. While offering a convenient means to express task parallelism, the new construct presents a serious challenge to event-based performance analysis. Since tasking may disrupt the classic sequence of region entry and exit events,...
The novel ScaleMP vSMP architecture employs commodity x86-based servers with an InfiniBand network to assemble a large shared memory system at an attractive price point. We examine this combined hardware- and software-approach of a DSM system using both system-level kernel benchmarks as well as real-world application codes. We compare this architec...
In this work we discuss the performance problems of nested OpenMP programs concerning thread and data locality, particularly on cc-NUMA architectures. We provide a user-friendly solution and demonstrate its benefits by comparing the performance of some kernel benchmarks and some real-world applications with and without applying our affinity optimiza...
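For readers unfamiliar with the nesting mentioned here, a minimal sketch follows: an outer team of two threads, each spawning an inner team of four. This is a generic illustration, not the paper's benchmark code; the omp_set_nested routine used was later deprecated in OpenMP 5.0 in favor of omp_set_max_active_levels.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_nested(1);   /* enable nested parallelism (pre-5.0 API) */

    #pragma omp parallel num_threads(2)
    {
        int outer = omp_get_thread_num();

        /* Each outer thread becomes the master of an inner team. */
        #pragma omp parallel num_threads(4)
        printf("outer thread %d, inner thread %d\n",
               outer, omp_get_thread_num());
    }
    return 0;
}

Where the eight resulting threads end up on a cc-NUMA machine is exactly the locality question the paper addresses.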
MPI and OpenMP are the de-facto standards for distributed-memory and shared-memory parallelization, respectively. By employing a hybrid approach, that is, combining OpenMP and MPI parallelization in one program, a cluster of SMP systems can be exploited. Nevertheless, mixing programming paradigms and writing explicit message passing code might increas...
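The skeleton of such a hybrid program is sketched below: one MPI process per node with OpenMP threads inside each process. It is a generic minimal example, not code from the paper.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request thread support; FUNNELED means only the master
     * thread of each process makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Shared-memory parallelism within each MPI process. */
    #pragma omp parallel
    printf("rank %d, thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}

Launched e.g. with one rank per node and OMP_NUM_THREADS set to the core count, this maps MPI onto the network and OpenMP onto the shared memory inside each SMP node.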
The slogan of last year's International Workshop on OpenMP was "A Practical Programming Model for the Multi-Core Era", although OpenMP is still fully hardware-architecture agnostic. As a consequence, the programmer is left alone with bad performance if threads and data happen to live apart. In this work we examine the programmer's possibilities to...
Projects
Project (1)
The Performance Optimisation and Productivity Centre of Excellence in Computing Applications provides performance optimisation and productivity services for academic and industrial codes in all domains. The services are free of charge to organisations in the EU. More information at http://pop-coe.eu