Figure 7 - uploaded by Alecio P. D. Binotto
Source publication
GPUs (Graphics Processing Units) have become one of the main co-processors pushing desktops towards high-performance computing. Together with multi-core CPUs, a powerful heterogeneous execution platform is built for massive calculations. To improve application performance and explore this heterogeneity, a distribution of workload in a b...
Similar publications
The long computation times required for fluid dynamics simulations (CFD) have led the industry to look for alternatives to boost high performance computing (HPC). This paper is focused on the acceleration of fluid dynamics simulations for industrial complex configurations using modern graphics cards (GPUs) that exhibit a substantial parallel...
A recent trend in modern high-performance computing environments is the introduction of accelerators such as GPUs and Xeon Phi, i.e., specialized computing devices that are optimized for highly parallel applications and coexist with CPUs. In regular compute-intensive applications with predictable data access patterns, these devices often outperform t...
While heterogeneous architectures are increasingly popular in High Performance Computing systems, their effectiveness depends on how efficient the scheduler is at allocating workloads onto appropriate computing devices and how communication and computation can be overlapped. With different types of resources integrated into one system, the complexi...
Computational fluid dynamics (CFD) can provide detailed information of flow motion, temperature distributions and species dispersion in buildings. However, it may take hours or days, even weeks to simulate airflow in a building by using CFD on a single central processing unit (CPU) computer. Parallel computing on a multi-CPU supercomputer or comput...
Citations
... Instead, in this paper we use PU (processing unit) as the generic term for either CPU or GPU, following other researchers (e.g., [Binotto et al. 2010]). A few other equivalent terms used in the literature are CU (computing unit), CE (computing element), and PE (processing element) [Tsoi and Luk 2010]. ...
... SHOC (Scalable HeterOgeneous Computing) [Danalis et al. 2010] provides both low-level microbenchmarks (to evaluate architectural features of the system) and application kernels (to evaluate features of the system such as intranode and internode communication between PUs). In addition to a serial version, SHOC provides an embarrassingly parallel version (which executes on different PUs or nodes of a cluster, but has no communication between PUs or nodes) and a true parallel version (which measures multiple nodes, with single or multiple PUs per node, and also involves communication). The [Mistry et al. 2013b] benchmark suite provides OpenCL applications for studying the interaction of processing units in HCSs. ...
As both CPU and GPU become employed in a wide range of applications, it has been acknowledged that both of these processing units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated a significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this paper, we survey heterogeneous computing techniques (HCTs), such as workload partitioning, which enable utilizing both the CPU and GPU to improve performance and/or energy efficiency. We review heterogeneous computing approaches at the runtime, algorithm, programming, compiler, and application levels. Further, we review both discrete and fused CPU-GPU systems and discuss benchmark suites designed for evaluating heterogeneous computing systems (HCSs). We believe that this paper will provide researchers with insights into the working and application scope of HCTs and motivate them to further harness the computational power of CPUs and GPUs to achieve the goal of exascale performance.
... We can find a representation S = VΛVᵀ through eigendecomposition, which gives us an orthonormal basis V with inverse V⁻¹ = Vᵀ: the columns of V are the eigenvectors of S, and Λ is a diagonal matrix whose diagonal elements are the corresponding eigenvalues. Performing an eigendecomposition in each voxel using standard methods is hard to parallelize on a GPU [7], though. In [8] and [9] one finds methods for characterizing the curvature of a surface based on gradient information and, moreover, for obtaining the principal curvature directions. ...
We present an efficient implementation of volumetric anisotropic image diffusion on modern programmable graphics processing units (GPUs). We avoid the computational bottleneck of a time-consuming eigenvalue decomposition in ℝ³. Instead, we use a projection of the Hessian matrix along the surface normal onto the tangent plane of the local isodensity surface and solve for the remaining two tangent-space eigenvectors. We derive closed formulas to achieve this, resulting in efficient GPU code. Our most complex volumetric anisotropic diffusion gains a speedup of more than 600 compared to a CPU solution [1].
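The projection step can be sketched in standard notation (a sketch of the usual formulation, assumed here; the paper's closed formulas are derived from a setup like this): with volume intensity u, unit isodensity-surface normal n = ∇u/|∇u|, and Hessian H, the tangent-plane projector and projected Hessian are

    P = I − n nᵀ,    H_t = P H P,

and the two in-plane eigenpairs of H_t are the tangent-space eigenvectors referred to above, so no full ℝ³ eigendecomposition is needed per voxel.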
... In this work, three iterative solvers for Systems of Linear Equations (SLEs) - Jacobi, Red-Black Gauss-Seidel, and Conjugate Gradient - are used by the CFD application and represent the high-level tasks for the scheduling strategy. The solvers have different implementations for the CPU and the GPU (using shared memory and with memory coalescing), as presented in previous work [10]. It is important to mention that, although the GPU is more powerful for those kinds of data-intensive tasks, there are many scenarios in which the CPU provides better performance, e.g., when working with multiple applications and tasks with different problem-size domains (based on the amount of data to be processed, which is not known before application execution). ...
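For reference, one sweep of the simplest of the three solvers, Jacobi, can be written in a few lines; this is a minimal plain-C sketch, not the shared-memory/coalesced CPU and GPU implementations of [10]:

    #include <stddef.h>

    /* One Jacobi sweep for a dense n-by-n system A*x = b (row-major A).
     * Reads the current iterate x_old and writes the next iterate x_new.
     * Assumes nonzero diagonal entries A[i][i]. */
    static void jacobi_sweep(size_t n, const double *A, const double *b,
                             const double *x_old, double *x_new)
    {
        for (size_t i = 0; i < n; ++i) {
            double sigma = 0.0;
            for (size_t j = 0; j < n; ++j)
                if (j != i)
                    sigma += A[i * n + j] * x_old[j];
            x_new[i] = (b[i] - sigma) / A[i * n + i];
        }
    }

The sweep is repeated, swapping x_old and x_new, until the residual is small enough; the independence of the n updates is what makes the method attractive for GPUs.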
... Given a set of tasks with predefined costs for the PUs stored in the database, the first assignment phase performs a scheduling of tasks over the asymmetric PUs. In this sense, each task i = 1, ..., n has an implementation x and an execution cost c, acquired using a performance benchmark, on each PU j [10]. The allocation can then be expressed as follows: task i is not allocated to processor j when x_{i,j} = 0, and task i is allocated to processor j when x_{i,j} = 1. ...
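The snippet only defines the 0/1 variables x_{i,j}, not the selection rule. A minimal sketch of such a first-assignment phase in C, assuming a simple greedy policy where each task goes to its cheapest benchmarked PU (the actual Sm@rtConfig policy may differ), could look like:

    #include <stddef.h>

    /* Hypothetical greedy first assignment: given benchmarked costs
     * c[i][j] (row-major, n_tasks x n_pus) for task i on PU j, set
     * x[i][j] = 1 for the cheapest PU of each task and 0 elsewhere. */
    static void assign_tasks(size_t n_tasks, size_t n_pus,
                             const double *c, int *x)
    {
        for (size_t i = 0; i < n_tasks; ++i) {
            size_t best = 0;
            for (size_t j = 1; j < n_pus; ++j)
                if (c[i * n_pus + j] < c[i * n_pus + best])
                    best = j;
            for (size_t j = 0; j < n_pus; ++j)
                x[i * n_pus + j] = (j == best) ? 1 : 0;
        }
    }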
... To overcome this issue, the code implemented in OpenCL is tuned using OpenCL's clCreateProgramWithBinary() method. Such specific implementations oriented to the CPU/GPU execution platform were previously published in [10]. This way, from the framework's point of view, the exploitation of cores is reflected in the tasks' performance measurements, making it transparent for the scheduler. ...
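For illustration, loading such a pre-built kernel binary through the standard OpenCL C API might look like the sketch below (error handling trimmed; the binary bytes would come from an earlier clGetProgramInfo(CL_PROGRAM_BINARIES) dump, and this is not the framework's actual code):

    #include <CL/cl.h>

    /* Create a program from a pre-compiled binary for one device,
     * avoiding runtime source compilation as the snippet describes. */
    cl_program load_binary_program(cl_context ctx, cl_device_id dev,
                                   const unsigned char *bin, size_t bin_len)
    {
        cl_int status, err;
        cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &bin_len,
                                                    &bin, &status, &err);
        if (err != CL_SUCCESS || status != CL_SUCCESS)
            return NULL;
        /* Binaries may still need a fast device-specific build step. */
        if (clBuildProgram(prog, 1, &dev, NULL, NULL, NULL) != CL_SUCCESS) {
            clReleaseProgram(prog);
            return NULL;
        }
        return prog;
    }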
Distributing the workload upon all available Processing Units (PUs) of a high-performance heterogeneous platform (e.g., PCs composed of CPUs and GPUs) is a challenging task, since the execution cost of a task on distinct PUs is non-deterministic and affected by parameters not known a priori. This paper presents Sm@rtConfig, a context-aware runtime and tuning system based on a compromise between reducing the execution time of engineering applications and the cost of tasks' scheduling on CPU-GPU platforms. Using Model-Driven Engineering and Aspect Oriented Software Development, a high-level specification and implementation for Sm@rtConfig has been created, aiming at improving modularization and reuse in different applications. As a case study, the simulation subsystem of a CFD application has been developed using the proposed approach. The system's tasks were designed considering only their functional concerns, whereas scheduling and other non-functional concerns are handled by Sm@rtConfig aspects, improving task modularity. Although Sm@rtConfig supports multiple PUs, in this case study the tasks have been scheduled to execute on a platform composed of one CPU and one GPU. Experimental results show an overall performance gain of 21.77% in comparison to the static assignment of all tasks only to the GPU.
... In this work, three iterative solvers for SLEs (Jacobi, Red-Black Gauss-Seidel, and Conjugate Gradient) used by the CFD application represent the high-level tasks for the scheduling strategy. The solvers have different implementations for the CPU and the GPU (using shared memory and with memory coalescing), as presented in our previous work [7]. Although the GPU can be more powerful for those kinds of data-intensive tasks, there are many scenarios where the CPU provides better performance when working with multiple applications and tasks with different problem-size domains (partially based on the amount of data to be processed, which is not known before application execution). ...
... Therefore, a CPU-GPU platform dedicated to addressing these two different aspects is assumed to offer a better execution scenario than homogeneous ones. Since the focus of this paper is on the assignment method and not directly on the implementation of the solvers, we refer the reader to [9] for a mathematical overview of the solvers and to our previous work [7] for details about their implementations on a CPU-GPU platform, which is the study on which our tuning of the solvers is based. ...
... Given a set of tasks with predefined costs for the PUs stored in the database, the first assignment phase performs a scheduling of tasks over the asymmetric PUs. In this sense, each task i = 1, ..., n has an implementation x and an execution cost c, acquired using a performance benchmark, on each PU j [7]. The allocation can then be expressed as follows: task i is not allocated to processor j when x_{i,j} = 0, and task i is allocated to processor j when x_{i,j} = 1. ...
A personal computer can be considered a one-node heterogeneous cluster that simultaneously processes several application tasks. It can be composed of, for example, asymmetric CPUs and GPUs. This way, a high-performance heterogeneous platform is built on a desktop for data-intensive engineering calculations. In our perspective, workload distribution over the Processing Units (PUs) plays a key role in such systems. This issue presents challenges, since the cost of a task on a PU is non-deterministic and can be affected by parameters not known a priori. This paper presents a context-aware runtime and tuning system based on a compromise between reducing the execution time of engineering applications - due to appropriate dynamic scheduling - and the cost of computing such scheduling on a platform composed of a CPU and GPUs. Results obtained in experimental case studies are encouraging, and a performance gain of 21.77% was achieved in comparison to the static assignment of all tasks to the GPU.
... 3ds Max [14]) utilize it as another modeling method, whereas we discuss FFD in the field of object animation [4]. Despite the advancements in simulating deformable objects such as cloth [12, 15], hair [8, 17], or biological tissues [10, 11, 19] – not least because of the proliferation of multi-core CPUs and the rapid GPU development cycles [16] – it is still rather difficult, if possible at all, to integrate such techniques into interactive 3D applications like Virtual or Mixed Reality environments [17, 20]. Hence, for ease of access, in this paper we first discuss how we have integrated this deformation method into X3D [9], an open ISO standard that not only defines a 3D interchange format but can also be used as a declarative application description language. ...
In this paper we present a GPU-accelerated implementation of the well-known freeform deformation algorithm to allow for deformable objects within fully interactive virtual environments. We furthermore outline how our real-time deformation approach can be integrated into the X3D standard for more accessibility of the proposed methods. The presented technique can be used to deform complex detailed geometries without pre-processing the mesh by simply generating a lattice around the model. The local deformation is then computed for this lattice instead of the complex geometry, which can be carried out efficiently on the GPU using CUDA.
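To illustrate the lattice idea, a single-cell trilinear deformation can be sketched in plain C as follows; the paper's method uses CUDA and full Bernstein-weighted FFD over the whole lattice, so this is only a simplified sketch:

    typedef struct { float x, y, z; } vec3;

    /* Trilinear free-form deformation of one point inside a single
     * lattice cell: p holds the point's local (u,v,w) coordinates in
     * [0,1]^3, and c[8] are the cell's (possibly displaced) control
     * points, ordered c[i + 2*(j + 2*k)]. */
    static vec3 ffd_trilinear(vec3 p, const vec3 c[8])
    {
        vec3 out = {0.0f, 0.0f, 0.0f};
        for (int k = 0; k < 2; ++k)
            for (int j = 0; j < 2; ++j)
                for (int i = 0; i < 2; ++i) {
                    float w = (i ? p.x : 1.0f - p.x)
                            * (j ? p.y : 1.0f - p.y)
                            * (k ? p.z : 1.0f - p.z);
                    const vec3 *cp = &c[i + 2 * (j + 2 * k)];
                    out.x += w * cp->x;
                    out.y += w * cp->y;
                    out.z += w * cp->z;
                }
        return out;
    }

Because each vertex is deformed independently from a small set of control points, the per-vertex evaluation maps naturally onto one GPU thread per vertex.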
... In this work, three iterative solvers for SLEs (Jacobi, Red-Black Gauss-Seidel, and Conjugate Gradient) used by the CFD application represent the high-level tasks for the scheduling strategy. The solvers have different implementations for the CPU and the GPU (using shared memory and with memory coalescing), as presented in our previous work [8]. Although the GPU can be more powerful for those kinds of data-intensive tasks, there are scenarios where the CPU provides better performance when working with multiple applications and tasks with different problem-size domains (partially based on the amount of data to be processed, which is not known before application execution). ...
... The 3D model modification is analogous and performs at interactive frame rates. Since the focus of this paper is on the assignment method and not directly on the implementation of the solvers, we refer the reader to [10] for a mathematical overview of the solvers and to our previous work [8] for details about their implementations on a CPU-GPU platform. ...
This dissertation examines how immersive technologies can be effectively integrated with BIM technologies to improve processes within the construction project lifecycle. A literature review was conducted to identify research gaps in current immersive BIM literature. An industry workshop was held to identify the existing barriers and future applications of immersive BIM within the construction industry. A software package was developed to overcome interoperability issues associated with immersive BIM applications. Three immersive BIM prototypes were developed and evaluated. The first prototype combines BIM with AR technologies to superimpose BIM models on construction sites for defect management inspections. The second and third prototypes focus on using VR in the architectural phase of the construction project lifecycle to visualise building design features.
The solution of linear equation systems (LES) has numerous applications in digital signal processing. However, due to its high computational cost, it is difficult to implement in high-speed free-running (streaming) processing applications. This work presents an FPGA implementation of a digital circuit for LES solving, using the gradient descent method, capable of operating online. Such an implementation brings a new perspective on executing iterative, matrix-based algorithms in embedded processing for high-speed (tens of MHz) uninterrupted data flow. In other words, the proposed method acts as a real-time processing block (at each clock cycle, a new datum enters and another leaves), allowing the implementation of matrix calculations in a way that is transparent to continuous-flow systems. In addition to proposing an implementation architecture, this work discusses the relationship between the number of iterations, the amount of resources used, and the allowable acquisition rate. As a case study, the method is applied to a channel equalization problem.
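For reference, one steepest-descent (gradient descent) iteration for a symmetric positive-definite system A·x = b, i.e., the update such a pipeline would unroll per iteration, can be sketched in plain C; this software sketch (assuming a small fixed n) is not the FPGA architecture itself:

    #include <stddef.h>

    #define GD_MAX_N 64  /* sketch assumes n <= GD_MAX_N */

    /* One steepest-descent step: r = b - A*x; alpha = (r.r)/(r.A r);
     * x += alpha * r. A is row-major n-by-n, symmetric positive-definite. */
    static void gd_step(size_t n, const double *A, const double *b, double *x)
    {
        double r[GD_MAX_N], Ar[GD_MAX_N];
        double rr = 0.0, rAr = 0.0;
        for (size_t i = 0; i < n; ++i) {
            double ax = 0.0;
            for (size_t j = 0; j < n; ++j)
                ax += A[i * n + j] * x[j];
            r[i] = b[i] - ax;              /* residual */
        }
        for (size_t i = 0; i < n; ++i) {
            double ar = 0.0;
            for (size_t j = 0; j < n; ++j)
                ar += A[i * n + j] * r[j];
            Ar[i] = ar;
            rr  += r[i] * r[i];
            rAr += r[i] * ar;
        }
        if (rAr != 0.0) {                  /* optimal step length */
            double alpha = rr / rAr;
            for (size_t i = 0; i < n; ++i)
                x[i] += alpha * r[i];
        }
    }

The fixed iteration structure (two matrix-vector products, two dot products, one vector update) is what makes the method amenable to a fully pipelined hardware realization.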
In this paper, we propose a comparative study between two categories of wrinkle-augmentation approaches for clothing mesh simulation, namely geometric wrinkle-augmentation approaches and data-driven approaches that use wrinkle samples stored in a database. We aim to compare these categories of approaches in terms of their ability to generate and animate details on a virtual clothing mesh in motion. The comparison is carried out according to the following criteria: deformation detection, wrinkle shape parameters, and run-time performance, as well as the possibility of implementation on GPUs.