GPU applications have traditionally run on PCs or in larger-scale systems. With the introduction of the Tegra line of mobile processors, NVIDIA expanded the types of systems that can exploit the massive parallelism offered by GPU computing architectures. In this paper, we evaluate the suitability of the Tegra X1 (TX1) processor as a platform for embedded model predictive control (MPC). MPC relies on the real-time solution of a convex optimization problem to compute the control input(s) to a system. Relative to traditional control techniques such as PID, MPC is very computationally demanding. Quadratic programming (QP) algorithms for the solution of convex optimization problems generally lend themselves to parallelization. However, until the introduction of the Tegra, there had never been an off-the-shelf embedded processor that would enable a massively parallel embedded implementation.
We investigate two different gradient-based algorithms, ADMM and PQP, for solving the QP that occurs in a large class of MPC problems. The performance of these algorithms is dominated by the performance of matrix-matrix and matrix-vector products. Our work focuses on maximizing the performance of these operations for relatively small matrices of 100 to 1000 elements per dimension, which are common in the MPC implementations found in automotive and factory automation applications. Modern BLAS libraries for CPUs and GPUs are quantitatively evaluated. We create SGEMV kernels that can outperform the state-of-the-art cuBLAS by 2.3x on TX1. Different kernel fusion schemes utilizing concurrent kernel execution and zero-copy mechanisms are investigated. For ADMM, our implementation achieves a 46.6x speedup over the single-threaded CPU version and a 2.7x speedup over the optimized OpenBLAS version. For PQP, we achieve a 41.2x speedup over the single-threaded CPU version and a 4.2x speedup over the OpenBLAS version.
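
To make concrete why such solvers are dominated by matrix-vector products, the following is a minimal sketch of a multiplicative update in the style of the parallel quadratic programming (PQP) literature, for the problem of minimizing 0.5·xᵀQx + hᵀx subject to x ≥ 0. The abstract does not give the authors' exact update or kernels, so the function names, the diagonal offset `r`, and the iteration count are assumptions made for illustration only.

```python
def matvec(A, x):
    """Dense matrix-vector product, the operation that dominates each iteration."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def pqp_solve(Q, h, iters=200, r=1e-3):
    """Multiplicative-update sketch for: minimize 0.5*x'Qx + h'x subject to x >= 0.

    Q is assumed symmetric positive definite. `r` is an illustrative diagonal
    offset (an assumption of this sketch, not a value from the paper).
    """
    n = len(h)
    # Split Q and h into nonnegative parts so that Q = Qp - Qm and h = hp - hm.
    Qp = [[max(Q[i][j], 0.0) + (r if i == j else 0.0) for j in range(n)]
          for i in range(n)]
    Qm = [[max(-Q[i][j], 0.0) + (r if i == j else 0.0) for j in range(n)]
          for i in range(n)]
    hp = [max(v, 0.0) for v in h]
    hm = [max(-v, 0.0) for v in h]

    x = [1.0] * n  # strictly positive starting point
    for _ in range(iters):
        num = matvec(Qm, x)
        den = matvec(Qp, x)
        # Each element is rescaled independently of the others, which is what
        # makes the iteration easy to map onto parallel hardware.
        x = [xi * (hm[i] + num[i]) / (hp[i] + den[i]) for i, xi in enumerate(x)]
    return x


# Small usage example: the constrained minimizer is x = [2, 2].
x_opt = pqp_solve([[2.0, -1.0], [-1.0, 2.0]], [-2.0, -2.0])
```

Each iteration costs two matrix-vector products plus an elementwise rescaling, so on a GPU the work would map naturally onto SGEMV-style kernels followed by a cheap elementwise kernel.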

... Moreover, in [22], the online latency is reduced by converting the quadratic programming problem into an equivalent nonnegative least-squares problem. On the other hand, in [23][24][25], the authors implement a constrained MIMO MPC on high-resource processors running at 1.6 GHz, 2.2 GHz and 1.25 GHz, respectively. These implementations reach low latency due to the amount of resources available on their embedded devices. ...

... Although the NI myRIO can be considered a high-speed processing device compared to the microprocessors in [22,26,27], it can still be considered a low-resource processor in contrast with the devices that achieve acceptable implementations for high-speed applications [23][24][25]. ...

Embedded controllers for multivariable processes have become a powerful tool in industrial implementations. Here, Model Predictive Control offers higher performance than standard control methods. However, embedded controllers have limited computational resources, which reduces their processing capabilities. Based on the pipelining concept, this paper presents a new embedded software-based implementation for a constrained Multi-Input-Multi-Output predictive control algorithm. The main goal of this work is to improve the timing performance and the resource usage of the control algorithm. Therefore, a profiling study of the baseline algorithm is developed, and the performance bottlenecks are identified. The functionality and effectiveness of the proposed implementation are validated on the NI myRIO 1900 platform using the simulation of a jet transport aircraft during cruise flight and a tape transport system. Numerical results for the case studies show that the latency and the processor usage are substantially reduced compared with the baseline algorithm, by 4.6× and 3.17×, respectively. Thus, efficient program execution is obtained, which makes the proposed software-based implementation particularly suitable for embedded control systems.

... While GPUs offer a high degree of parallelism for optimization algorithms [17], [18], they primarily achieve their efficiency through extremely wide data paths and deep pipelining, allowing them to stream batched data from off-chip memory to process many parallel threads. However, in the case of sparse problems, the GPU resources are massively underutilized, rendering them inefficient. ...

Applications in robotics or other size-, weight-, and power-constrained (SWaP) autonomous systems at the edge often require real-time and low-energy solutions to large optimization problems. Event-based and memory-integrated neuromorphic architectures promise to solve such optimization problems with superior energy efficiency and performance compared to conventional von Neumann architectures. Here, we present a method to solve convex continuous optimization problems with quadratic cost functions and linear constraints on Intel’s scalable neuromorphic research chip Loihi 2. When applied to model predictive control (MPC) problems for the quadruped robotic platform ANYmal, this method achieves more than two orders of magnitude reduction in the combined energy-delay product (EDP) compared to the state-of-the-art solver, OSQP, on (edge) CPUs and GPUs with solution times under 10 ms for various problem sizes. These results demonstrate the benefit of non-von Neumann architectures for robotic control applications.

... Accordingly, they allow the potential to create new applications for embedded systems [13], [14]. The use of such embedded systems for MPC has been reported in recent works [15]- [17]. ...

We report numerical results on solving constrained linear-quadratic model predictive control (MPC) problems by exploiting graphics processing units (GPUs). The presented method reduces the MPC problem by eliminating the state variables and applies a condensed-space interior-point method to remove the inequality constraints in the KKT system. The final condensed matrix is positive definite and can be efficiently factorized in parallel on GPU/SIMD architectures. In addition, the size of the condensed matrix depends only on the number of controls in the problem, rendering the method particularly effective when the problem has many states but few inputs and moderate horizon length. Our numerical results for PDE-constrained problems show that the approach is an order of magnitude faster than a standard CPU implementation. We also provide an open-source Julia framework that facilitates modeling (DynamicNLPModels.jl) and solution (MadNLP.jl) of MPC problems on GPUs.
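
As a rough illustration of the state-elimination step described above, the sketch below builds the condensed Hessian for a single-input linear system x_{k+1} = A x_k + b u_k with cost Σ q‖x_k‖² + r u_k² over a horizon of N steps. The function name, the scalar weights `q` and `r`, and the single-input restriction are simplifying assumptions for this example; it is not the formulation or API of DynamicNLPModels.jl or MadNLP.jl.

```python
def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def condensed_hessian(A, b, q, r, N):
    """Return H such that the cost equals U'HU plus terms linear/constant in U.

    Assumes a single input, state weight q*I and input weight r (all
    illustrative choices). H is N x N regardless of the state dimension.
    """
    # impulse[i] = A^i b: the state response i + 1 steps after a unit input.
    impulse = [list(b)]
    for _ in range(N - 1):
        impulse.append(matvec(A, impulse[-1]))

    H = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            s = 0.0
            # u_i and u_j both influence the states x_{k+1} for k >= max(i, j).
            for k in range(max(i, j), N):
                gi, gj = impulse[k - i], impulse[k - j]
                s += q * sum(a * c for a, c in zip(gi, gj))
            H[i][j] = s + (r if i == j else 0.0)
    return H


# Double-integrator example: two states, one input, horizon of three steps.
H = condensed_hessian([[1.0, 1.0], [0.0, 1.0]], [0.0, 1.0], q=1.0, r=0.1, N=3)
```

In this example the result is a 3×3 matrix even though the system has two states, and it would stay 3×3 for any number of states, which is the property the abstract highlights for problems with many states but few inputs.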

... In particular, convex optimization algorithms have been considered a suitable solution owing to their computational efficiency and parallelizability. Gradient-based convex optimization techniques, such as the alternating direction method of multipliers (ADMM) [17][18][19], primal-dual interior point method (PD-IPM), parallel quadratic programming (PQP) [20,21], and active set method (ASM) [22][23][24], are employed. In this work, PD-IPM [24,25], which is the most commonly used technique for convex optimization, is applied. ...

This paper addresses the problem of real-time model predictive control (MPC) in the integrated guidance and control (IGC) of missile systems. When the primal-dual interior point method (PD-IPM), which is a convex optimization method, is used as an optimization solution for the MPC, the real-time performance of PD-IPM degenerates due to the elevated computation time in checking the Karush–Kuhn–Tucker (KKT) conditions in PD-IPM. This paper proposes a graphics processing unit (GPU)-based method to parallelize and accelerate PD-IPM for real-time MPC. The real-time performance of the proposed method was tested and analyzed on a widely-used embedded system. The comparison results with the conventional PD-IPM and other methods showed that the proposed method improved the real-time performance by reducing the computation time significantly.

... In recent years, a number of papers have proposed parallelizable variants of numerical optimization methods such as the interior point method [19], parallel quadratic programming [20], the alternating direction method of multipliers (ADMM) [21]-[23] and other proximal algorithms [24], [25]. In these approaches, GPUs are used to parallelize the involved algebraic operations and the solution of linear systems: the primal-dual optimality conditions in interior point algorithms and equality-constrained QPs in ADMM. ...

Scenario-based stochastic optimal control problems suffer from the curse of dimensionality as they can easily grow to six and seven figure sizes. First-order methods are suitable as they can deal with such large-scale problems, but may fail to achieve accurate solutions within a reasonable number of iterations. To achieve solutions of higher accuracy and high speed, in this paper we propose two proximal quasi-Newtonian limited-memory algorithms - MinFBE applied to the dual problem and the Newton-type alternating minimization algorithm (NAMA) - which can be massively parallelized on lockstep hardware such as graphics processing units (GPUs). We demonstrate the performance of these methods, in terms of convergence speed and parallelizability, on large-scale problems involving millions of variables.

... Additional efficiency can be achieved by using the jit (just-in-time) decorator, which produces optimized kernels through accelerated linear algebra (XLA). There are several recent attempts to implement SQP algorithms on GPU platforms (Hu et al. 2017; Yu, Goldsmith, and Di Cairano 2017) that enable increased efficiency in solving large-scale MPC problems. ...

AI and machine learning based approaches are becoming ubiquitous in almost all engineering fields. Control engineering cannot escape this trend. In this paper, we explore how AI tools can be useful in control applications. The core tool we focus on is automatic differentiation. Two immediate applications are linearization of system dynamics for local stability analysis or for state estimation using Kalman filters. We also explore other usages such as conversion of differential algebraic equations to ordinary differential equations for control design. In addition, we explore the use of machine learning models for global parameterizations of state vectors and control inputs in model predictive control applications. For each considered use case, we give examples and results.

... These parallelizations do not change the theoretical properties of the algorithm and therefore can and should be used whenever possible. This research has led to a variety of optimized QP solvers targeting CPUs [17,18], GPUs [19,20], and FPGAs [21,22]. These approaches have also been used to specifically improve the performance of a subclass of QPs that frequently arise in trajectory optimization problems on multi-threaded CPUs [23,24,25,26] and GPUs [27,28]. ...

Parallelism can be used to significantly increase the throughput of computationally expensive algorithms. With the widespread adoption of parallel computing platforms such as GPUs, it is natural to consider whether these architectures can benefit robotics researchers interested in solving trajectory optimization problems online. Differential Dynamic Programming (DDP) algorithms have been shown to achieve some of the best timing performance in robotics tasks by making use of optimized dynamics methods and CPU multi-threading. This paper aims to analyze the benefits and tradeoffs of higher degrees of parallelization using a multiple-shooting variant of DDP implemented on a GPU. We describe our implementation strategy and present results demonstrating its performance compared to an equivalent multi-threaded CPU implementation using several benchmark control tasks. Our results suggest that GPU-based solvers can offer increased per-iteration computation time and faster convergence in some cases, but in general tradeoffs exist between convergence behavior and degree of algorithm-level parallelism.

... Yu and Goldsmith [45] implemented and analyzed the performance of Parallel Quadratic Programming (PQP) and Alternating Direction Method of Multipliers (ADMM) solvers for the MPC QP; both methods are gradient based. In their work, they used the NVIDIA Jetson TX1 for single-threaded and multithreaded implementations, where the baseline for comparison was the ARM Cortex A57 in the TX1. ...

Model Predictive Control (MPC) owes its reputation to its ability to handle multiple inputs and outputs while taking constraints into account. However, this comes at the cost of high computational complexity, which limits MPC to systems with slow dynamics. This paper provides an overview of the available methods to accelerate the MPC process. Various parallel computing approaches using different technologies have been proposed to speed up the execution of MPC. Some of these approaches focus on building dedicated hardware for MPC using Field Programmable Gate Arrays (FPGAs), and others focus on parallelizing MPC computation using multi-core processors (CPUs) and many-core processors (GPUs). The focus of this survey is to review the available methods for accelerating the MPC process. A brief introduction to the theory of MPC is provided first, followed by a description of each approach. A comparison between the different methods is presented in terms of complexity and performance, followed by a valid application for each approach. Finally, the paper discusses the challenges and requirements of MPC for future applications.

Scenario-based stochastic optimal control problems suffer from the curse of dimensionality as they can easily grow to six and seven figure sizes. First-order methods are suitable as they can deal with such large-scale problems, but may perform poorly and fail to converge within a reasonable number of iterations. To achieve a fast rate of convergence and high solution speeds, in this article, we propose the use of two proximal quasi-Newtonian limited-memory algorithms, MinFBE applied to the dual problem and the Newton-type alternating minimization algorithm (NAMA), which can be massively parallelized on lockstep hardware such as graphics processing units. In particular, we use MinFBE and NAMA to solve scenario-based stochastic optimal control problems with affine dynamics, convex quadratic cost functions (with the stage cost functions being strongly convex in the control variable) and joint state-input convex constraints. We demonstrate the performance of these methods, in terms of convergence speed and parallelizability, on large-scale problems involving millions of variables.

This work introduces Tripartite Adaptive Predictive Control: a novel task parallelization scheme for the original Adaptive Predictive Control (APC) algorithm. Profiling studies of an optimized embedded implementation of the multiple-input-multiple-output (MIMO) version of APC show that the Recursion section of the algorithm greatly impacts the overall execution time, especially for larger prediction horizons and for processes with more inputs and outputs. This limits the control iteration speed of APC. Therefore, by leveraging the independent side of the adaptation and prediction goals of APC, a triple task split of the algorithm is proposed which allows the Control Law to run asynchronously with Recursion. Tripartite APC consists of an Adaptation Task which keeps prediction model coefficient matrices up-to-date, a Recursion Task which projects such coefficient matrices over the prediction horizon, and a Control Law Task which uses the latest available projected coefficient matrices to generate the control signal. This way, Tripartite APC allows up to 37x faster control iterations as exhibited by experimental performance evaluations of the scheme on an embedded system. Tripartite APC is able to control faster dynamical processes than what was originally possible in APC, while still adapting to changes in process dynamics and keeping closed-loop performance.

In the past decade, Graphics Processing Units have played an important role in the field of high-performance computing and they still advance new fields such as IoT, autonomous vehicles, and exascale computing. It is therefore important to understand how to extract performance from these processors, something that is not trivial. This survey discusses various optimization techniques found in 450 papers published in the last 14 years. We analyze the optimizations from different perspectives which shows that the various optimizations are highly interrelated, explaining the need for techniques such as auto-tuning.

In this paper, some data-based control design options that can be used to accommodate for the presence of uncertainties in continuous-state engineering systems are recalled and discussed. Focus is made on reinforcement learning, stochastic model predictive control and certification via randomized optimization. Some thoughts are also shared regarding the positioning of the control community in a data and AI-dominated period for which some suggestions and risks are highlighted.

To effectively control large-scale distributed systems online, model predictive control (MPC) has to swiftly solve the underlying high-dimensional optimization. There are multiple techniques applied to accelerate the solving process in the literature, mainly attributed to software-based algorithmic advancements and hardware-assisted computation enhancements. However, those methods focus on arithmetic accelerations and overlook the benefits of the underlying system's structure. In particular, the existing decoupled software-hardware algorithm design that naively parallelizes the arithmetic operations in hardware does not tackle hardware overheads such as CPU-GPU and thread-to-thread communications in a principled manner. Also, the advantages of parallelizable subproblem decomposition in distributed MPC are not well recognized and exploited. As a result, we have not reached the full potential of hardware acceleration for MPC. In this paper, we explore those opportunities by leveraging GPUs to parallelize the distributed and localized MPC (DLMPC) algorithm. We exploit the locality constraints embedded in the DLMPC formulation to reduce the hardware-intrinsic communication overheads. Our parallel implementation achieves up to 50x faster runtime than its CPU counterparts under various parameters. Furthermore, we find that the locality-aware GPU parallelization could halve the optimization runtime compared to the naive acceleration. Overall, our results demonstrate the performance gains brought by software-hardware co-design with the information exchange structure in mind.

In past robotics applications, Model Predictive Control (MPC) has often been limited to linear models and relatively short time horizons. In recent years however, research in optimization, optimal control, and simulation has enabled some forms of nonlinear model predictive control which find locally optimal solutions. The limiting factor for applying nonlinear MPC for robotics remains the computation necessary to solve the optimization, especially for complex systems and for long time horizons. This paper presents a new solution method which addresses computational concerns related to nonlinear MPC called nonlinear Evolutionary MPC (NEMPC), and then we compare it to several existing methods. These comparisons include simulations on torque-limited robots performing a swing-up task and demonstrate that NEMPC is able to discover complex behaviors to accomplish the task. Comparisons with state-of-the-art nonlinear MPC algorithms show that NEMPC finds high quality control solutions very quickly using a global, instead of local, optimization method. Finally, an application in hardware (a 24 state pneumatically actuated continuum soft robot) demonstrates that this method is tractable for real-time control of high degree of freedom systems.

Speech recognition is used in a wide range of applications and devices such as mobile phones, in-car entertainment systems and web-based services. Hidden Markov Models (HMMs) are one of the most popular algorithmic approaches applied in speech recognition. Training and testing an HMM is computationally intensive and time-consuming. Running multiple applications concurrently with speech recognition could overwhelm the compute resources and introduce unwanted delays in the speech processing, eventually dropping words in the process due to buffer overruns. Graphics processing units (GPUs) have become widely accepted as accelerators which offer massive amounts of parallelism. The host processor (the CPU) can offload compute-intensive portions of an application to the GPU, leaving the CPU to focus on serial tasks and scheduling operations. In this paper, we provide a parallelized Hidden Markov Model to accelerate isolated-word speech recognition. We experiment with different optimization schemes and make use of optimized GPU computing libraries to speed up the computation on GPUs. We also explore the performance benefits of using advanced GPU features for concurrent execution of multiple compute kernels. The algorithms are evaluated on multiple Nvidia GPUs using CUDA as a programming framework. Our GPU implementation achieves better performance than traditional serial and multi-threaded implementations. When considering the end-to-end performance of the application, which includes both data transfer and computation, we achieve a 9x speedup for training with the use of a GPU over a multi-threaded version optimized for a multi-core CPU. Index Terms: Speech Recognition, Hidden Markov Model, GPUs.

We investigate the infeasibility detection in the alternating direction method of multipliers (ADMM) when minimizing a convex quadratic objective subject to linear equalities and simple bounds. The ADMM formulation consists of alternating between an equality constrained quadratic program (QP) and a projection onto the bounds. We show that: (i) the sequence of iterates generated by ADMM diverges, (ii) the divergence is restricted to the component of the multipliers along the range space of the constraints and (iii) the primal iterates converge to a minimizer of the Euclidean distance between the subspace defined by equality constraints and the convex set defined by bounds. In addition, we derive the optimal value for the step size parameter in the ADMM algorithm that maximizes the rate of convergence of the primal iterates and dual iterates along the null space. In fact, the optimal step size parameter for the infeasible instances is identical to that for the feasible instances. The theoretical results allow us to specify a practical termination condition for infeasibility and the performance of such criterion is demonstrated in a model predictive control application.
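
The alternation between a QP step and a projection onto the bounds can be sketched as follows. This is a simplified illustration that omits the linear equality constraints considered in the abstract, so the QP step reduces to solving with the fixed matrix P + ρI. The function names, the penalty `rho`, and the fixed iteration count are assumptions of this sketch, and no termination or infeasibility test is included.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x


def admm_box_qp(P, q, lo, hi, rho=1.0, iters=200):
    """ADMM sketch for: minimize 0.5*x'Px + q'x subject to lo <= x <= hi."""
    n = len(q)
    M = [[P[i][j] + (rho if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Precompute inv(P + rho*I) once. M is symmetric, so the solutions for the
    # unit vectors give the rows of the inverse as well as its columns.
    Minv = [solve(M, [1.0 if k == i else 0.0 for k in range(n)]) for i in range(n)]
    z = [0.0] * n
    u = [0.0] * n
    for _ in range(iters):
        # QP step: one matrix-vector product with the precomputed inverse.
        rhs = [rho * (z[i] - u[i]) - q[i] for i in range(n)]
        x = [sum(Minv[i][j] * rhs[j] for j in range(n)) for i in range(n)]
        # Projection onto the bounds, elementwise.
        z = [min(max(x[i] + u[i], lo[i]), hi[i]) for i in range(n)]
        # Scaled dual update.
        u = [u[i] + x[i] - z[i] for i in range(n)]
    return z


# Usage example: the box-constrained minimizer is [1, 0].
z_opt = admm_box_qp([[2.0, 0.0], [0.0, 2.0]], [-4.0, 1.0], [0.0, 0.0], [1.0, 1.0])
```

Returning the projected variable `z` rather than `x` guarantees that the reported point satisfies the bounds even if the iteration is stopped early.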

We consider an approach for solving strictly convex quadratic programs (QPs) with general linear inequalities by the alternating direction method of multipliers (ADMM). In particular, we focus on the application of ADMM to the QPs of constrained Model Predictive Control (MPC). After introducing our ADMM iteration, we provide a proof of convergence closely related to the theory of maximal monotone operators. The proof relies on a general measure to monitor the rate of convergence and hence to characterize the optimal step size for the iterations. We show that the identified measure converges at a Q-linear rate while the iterates converge at a 2-step Q-linear rate. This result allows us to relax some of the existing assumptions in optimal step size selection, that currently limit the applicability to the QPs of MPC. The results are validated through a large public benchmark set of QPs of MPC for controlling a four tank process.

KBLAS is a new open source high performance library that provides optimized
kernels for a subset of Level 2 BLAS functionalities on CUDA-enabled GPUs.
Since performance of dense matrix-vector multiplication is hindered by the
overhead of memory accesses, a double-buffering optimization technique is
employed to overlap data motion with computation. After identifying a proper
set of tuning parameters, KBLAS is able to efficiently run on various GPU
architectures across different generations, avoiding the time-consuming step of
code rewriting, while still being compliant with the standard BLAS API. Another
advanced optimization technique ensures coalesced memory access when
dealing with submatrices, especially in the context of high-level dense linear
algebra algorithms. KBLAS kernels in all four precisions have been extended to
multi-GPU environments, which requires the introduction of new APIs to ease
users' experience on these challenging systems. KBLAS
outperforms existing state-of-the-art implementations on all matrix sizes,
achieves asymptotically up to 50% and 60% speedup on single-GPU and multi-GPU
systems, respectively, and validates our performance model. A subset of KBLAS
high-performance kernels has been integrated into NVIDIA's standard BLAS
implementation (cuBLAS) for wider dissemination, starting with version 6.0.

The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. Over the past six years, there has been a marked increase in the performance and capabilities of GPUs. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that substantially outpaces its CPU counterpart. The GPU's rapid increase in both programmability and capability has spawned a research community that has successfully mapped a broad range of computationally demanding, complex problems to the GPU. This effort in general-purpose computing on the GPU, also known as GPU computing, has positioned the GPU as a compelling alternative to traditional microprocessors in high-performance computer systems of the future. We describe the background, hardware, and programming model for GPU computing, summarize the state of the art in tools and techniques, and present four GPU computing successes in game physics and computational biophysics that deliver order-of-magnitude performance gains over optimized CPU applications.

The fast, tailored solution of linear predictive control problems is important, yet challenging. We present an algorithm based on the fast gradient method and the method of multipliers for model predictive control of linear, discrete-time, time-invariant systems with box constraints. The algorithm uses the augmented Lagrangian method to handle equality constraints, so it can take advantage of the sparsity of the problem. We present different schemes to update the multipliers. An example illustrates the performance of the algorithm, which is competitive with other tailored solution methods.
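
For readers unfamiliar with the fast gradient method, here is a minimal sketch of a projected fast gradient iteration for a box-constrained QP, minimizing 0.5·xᵀHx + gᵀx subject to lo ≤ x ≤ hi. It covers only the box-constrained inner problem: the augmented Lagrangian treatment of equality constraints and the multiplier update schemes from the abstract are not shown. The function name, the row-sum bound used for the step size, and the momentum sequence are assumptions of this sketch.

```python
import math


def fast_gradient_box_qp(H, g, lo, hi, iters=300):
    """Projected fast gradient sketch; H is assumed symmetric positive definite."""
    n = len(g)
    # The largest absolute row sum bounds the largest eigenvalue of a symmetric
    # matrix, so 1/L is a safe (possibly conservative) step size.
    L = max(sum(abs(v) for v in row) for row in H)
    x = [min(max(0.0, lo[i]), hi[i]) for i in range(n)]  # feasible start
    y = x[:]
    t = 1.0
    for _ in range(iters):
        grad = [sum(H[i][j] * y[j] for j in range(n)) + g[i] for i in range(n)]
        # Gradient step from the extrapolated point, then projection onto the box.
        x_new = [min(max(y[i] - grad[i] / L, lo[i]), hi[i]) for i in range(n)]
        # Momentum weights from the standard accelerated-gradient sequence.
        t_new = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
        beta = (t - 1.0) / t_new
        y = [x_new[i] + beta * (x_new[i] - x[i]) for i in range(n)]
        x, t = x_new, t_new
    return x


# Usage example: both bounds are active at the solution [0.5, 0.5].
x_opt = fast_gradient_box_qp([[2.0, 1.0], [1.0, 2.0]], [-3.0, -3.0],
                             [0.0, 0.0], [0.5, 0.5])
```

Because the projection onto a box is an elementwise clamp, each iteration again reduces to one matrix-vector product plus cheap vector operations.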

Modern GPUs are able to perform significantly more arithmetic operations than
transfers of a single word to or from global memory. Hence, many GPU kernels
are limited by memory bandwidth and cannot exploit the arithmetic power of
GPUs. However, memory locality can often be improved by kernel fusion when
a sequence of kernels is executed and some kernels in this sequence share data.
In this paper, we show how kernels performing map, reduce or their nested
combinations can be fused automatically by our source-to-source compiler. To
demonstrate the usability of the compiler, we have implemented several BLAS-1
and BLAS-2 routines and show how the performance of their sequences can be
improved by fusions.
Compared to similar sequences using CUBLAS, our compiler is able to generate
code that is up to 2.61x faster for the examples tested.

We present an Alternating Direction Method of Multipliers (ADMM) algorithm
for solving optimization problems with an l_1 regularized least-squares cost
function subject to recursive equality constraints. The considered optimization
problem has applications in control, for example in l_1 regularized MPC. The
ADMM algorithm is easy to implement, converges fast to a solution of moderate
accuracy, and enables separation of the optimization problem into sub-problems
that may be solved in parallel. We show that the most costly step of the
proposed ADMM algorithm is equivalent to solving an LQ regulator problem with
an extra linear term in the cost function, a problem that can be solved
efficiently using a Riccati recursion. We apply the ADMM algorithm to an
example of l_1 regularized MPC. The numerical examples confirm fast convergence
to moderate accuracy and a linear complexity in the MPC prediction horizon.

The goal of the LAPACK project is to design and implement a portable linear algebra library for efficient use on a variety of high-performance computers. The library is based on the widely used LINPACK and EISPACK packages for solving linear equations, eigenvalue problems, and linear least-squares problems, but extends their functionality in a number of ways. The major methodology for making the algorithms run faster is to restructure them to perform block matrix operations (e.g., matrix-matrix multiplication) in their inner loops. These block operations may be optimized to exploit the memory hierarchy of a specific architecture. The LAPACK project is also working on new algorithms that yield higher relative accuracy for a variety of linear algebra problems.

Many problems of recent interest in statistics and machine learning can be posed in the framework of convex optimization. Due to the explosion in size and complexity of modern datasets, it is increasingly important to be able to solve problems with a very large number of features or training examples. As a result, both the decentralized collection or storage of these datasets as well as accompanying distributed solution methods are either necessary or at least highly desirable. In this review, we argue that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas. The method was developed in the 1970s, with roots in the 1950s, and is equivalent or closely related to many other algorithms, such as dual decomposition, the method of multipliers, Douglas–Rachford splitting, Spingarn's method of partial inverses, Dykstra's alternating projections, Bregman iterative algorithms for ℓ1 problems, proximal methods, and others. After briefly surveying the theory and history of the algorithm, we discuss applications to a wide variety of statistical and machine learning problems of recent interest, including the lasso, sparse logistic regression, basis pursuit, covariance selection, support vector machines, and many others. We also discuss general distributed optimization, extensions to the nonconvex setting, and efficient implementation, including some details on distributed MPI and Hadoop MapReduce implementations.

This paper describes an approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units. The production of such software for machines ranging from desktop workstations to embedded processors can be a tedious and time consuming process. The work described here can help in automating much of this process. We will concentrate our efforts on the widely used linear algebra kernels called the Basic Linear Algebra Subroutines (BLAS). In particular, the work presented here is for general matrix multiply, DGEMM. However much of the technology and approach developed here can be applied to the other Level 3 BLAS and the general strategy can have an impact on basic linear algebra operations in general and may be extended to other important kernel operations.

In recent years, the mass market of mobile devices has pushed the demand for increasingly fast but cheap processors. ARM, the world leader in this sector, has developed the Cortex-A series of processors with focus on computationally intensive applications. If properly programmed, these processors are powerful enough to solve the complex optimization problems arising in MPC in real-time, while keeping the traditional low-cost and low-power consumption. This makes these processors ideal candidates for use in embedded MPC. In this paper, we investigate the floating-point capabilities of Cortex A7, A9 and A15 and show how to exploit the unique features of each processor to obtain the best performance, in the context of a novel implementation method for the linear-algebra routines used in MPC solvers. This method adapts high-performance computing techniques to the needs of embedded MPC. In particular, we investigate the performance of matrix-matrix and matrix-vector multiplications, which are the backbones of second- and first-order methods for convex optimization. Finally, we test the performance of MPC solvers implemented using these optimized linear-algebra routines.

This chapter presents the current best design and implementation practices for the acceleration of dense linear algebra (DLA) on GPUs. Examples are given with fundamental algorithms—from the matrix–matrix multiplication kernel written in CUDA to the higher level algorithms for solving linear systems, eigenvalue and SVD problems. The implementations are available through the MAGMA library—a redesign for GPUs of the popular LAPACK. To generate the extreme level of parallelism needed for the efficient use of GPUs, algorithms of interest are redesigned and then split into well-chosen computational tasks. The tasks execution is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators using either static scheduling or a light-weight runtime system. The use of light-weight runtime systems keeps scheduling overhead low, similar to static scheduling, while enabling the expression of parallelism through sequential-like code. This simplifies the development effort and allows the exploration of the unique strengths of the various hardware components.

We discuss a multiplicative update quadratic programming algorithm with applications to model predictive control for constrained linear systems. The algorithm, named PQP, is very simple to implement and thus verify, does not require projection, offers a linear rate of convergence, and can be completely parallelized. The PQP algorithm is equipped with conditions that guarantee the desired bound on suboptimality and with an acceleration step based on projection-free line search. We also show how PQP can take advantage of the parametric structure of the MPC problem, thus moving offline several calculations and avoiding large input/output dataflows. The algorithm is evaluated on two benchmark problems, where it is shown to compete with, and possibly outperform, other open source and commercial packages.
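
The multiplicative update behind PQP can be sketched in a few lines. The version below assumes the QP has already been brought to the nonnegative form: minimize (1/2) y'Qy + h'y subject to y >= 0. It splits Q and h into nonnegative parts and rescales each variable by a ratio of two nonnegative quantities, so the iterate stays nonnegative without any projection. The diagonal shift `r` (row sums of the negative part of Q), the function name `pqp_solve`, the all-ones starting point, and the iteration count are illustrative assumptions. The paper's suboptimality conditions and its line-search acceleration step are not reproduced here.

```python
def pqp_solve(Q, h, iters=200):
    """Multiplicative-update sketch for min 0.5 y'Qy + h'y subject to y >= 0."""
    n = len(h)
    # Illustrative diagonal shift (an assumption, not the paper's condition).
    r = [sum(max(-Q[i][j], 0.0) for j in range(n)) for i in range(n)]
    # Nonnegative splitting: Q = Qp - Qm and h = hp - hm.
    Qp = [[max(Q[i][j], 0.0) + (r[i] if i == j else 0.0) for j in range(n)]
          for i in range(n)]
    Qm = [[max(-Q[i][j], 0.0) + (r[i] if i == j else 0.0) for j in range(n)]
          for i in range(n)]
    hp = [max(v, 0.0) for v in h]
    hm = [max(-v, 0.0) for v in h]
    y = [1.0] * n  # strictly positive start
    for _ in range(iters):
        num = [hm[i] + sum(Qm[i][j] * y[j] for j in range(n)) for i in range(n)]
        den = [hp[i] + sum(Qp[i][j] * y[j] for j in range(n)) for i in range(n)]
        # Each component is rescaled independently of the others.
        y = [y[i] * num[i] / den[i] for i in range(n)]
    return y
```

With `Q = [[2, -1], [-1, 2]]` and `h = [-1, 2]`, the minimizer over y >= 0 is (0.5, 0), and the iterates approach it from the positive side. Every component update uses only two matrix-vector products and an elementwise multiply and divide, which is what makes the scheme attractive for fine-grained parallel hardware.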

In this paper, an iterative multiplicative algorithm is proposed for the fast solution of quadratic programming (QP) problems that arise in the real-time implementation of Model Predictive Control (MPC). The proposed algorithm, Parallel Quadratic Programming (PQP), is amenable to fine-grained parallelization. Conditions on the convergence of the PQP algorithm are given and proved. Due to its extreme simplicity, even serial implementations offer considerable speed advantages. To demonstrate, PQP is applied to several simulation examples, including a stand-alone QP problem and two MPC examples. When implemented in MATLAB using single-thread computations, numerical simulations of PQP demonstrate a 5-10x speedup compared to the MATLAB active-set based QP solver quadprog. A parallel implementation would offer a further speedup, linear in the number of parallel processors.

In this paper we review a dual fast gradient-projection approach to solving quadratic programming (QP) problems recently proposed in [Patrinos and Bemporad, 2012] that is particularly useful for embedded model predictive control (MPC) of linear systems subject to linear constraints on inputs and states. We show that the method's computational effort is comparable to that of several other QP solvers typically used in MPC. In addition, it is extremely easy to code, requires only basic and easily parallelizable arithmetic operations, and needs a number of iterations to reach a given accuracy, in terms of optimality and feasibility of the primal solution, that can be estimated quite tightly by solving an off-line mixed-integer linear programming problem.
This research was largely motivated by ongoing research activities on embedded MPC for aerospace systems carried out in collaboration with the European Space Agency.

Every mainstream processor vendor provides an optimized BLAS implementation for its CPU, as BLAS is a fundamental math library in scientific computing. The Loongson 3A CPU is a general-purpose 64-bit MIPS64 quad-core processor, developed by the Institute of Computing Technology, Chinese Academy of Sciences. To date, there has not been a sufficiently optimized BLAS on the Loongson 3A CPU. The purpose of this research is to optimize level 3 BLAS performance on the Loongson 3A CPU. We analyzed the Loongson 3A architecture and built a performance model to highlight the key point, L1 data cache misses, which is different from level 3 BLAS optimization on the mainstream x86 CPU. Therefore, we employed a variety of methods to avoid L1 cache misses in single thread optimization, including cache and register blocking, the Loongson 3A 128-bit memory accessing extension instructions, software prefetching, and single precision floating-point SIMD instructions. Furthermore, we improved parallel performance by reducing bank conflicts among multiple threads in the shared L2 cache. We created an open source BLAS project, OpenBLAS, to demonstrate the performance improvement on the Loongson 3A quad-core processor.

In this paper, we characterize and analyze an increasingly popular style of programming for the GPU called Persistent Threads (PT). We present a concise formal definition for this programming style, and discuss the difference between the traditional GPU programming style (nonPT) and PT, why PT is attractive for some high-performance usage scenarios, and when using PT may or may not be appropriate. We identify limitations of the nonPT style and four primary use cases that PT could be useful in addressing: CPU-GPU synchronization, load balancing/irregular parallelism, producer-consumer locality, and global synchronization. Through micro-kernel benchmarks we show the PT approach can achieve up to an order-of-magnitude speedup over nonPT kernels, but can also result in performance loss in many cases. We conclude by discussing the hardware and software fundamentals that will influence the development of Persistent Threads as a programming style in future systems.

Traditionally, rendezvous and proximity maneuvers have been performed using open-loop maneuver planning techniques and ad hoc error corrections. In this paper, a Model Predictive Control (MPC) approach is applied to spacecraft rendezvous and proximity maneuvering problems in the orbital plane. We demonstrate that various constraints arising in these maneuvers can be effectively handled with the MPC approach. These include constraints on thrust magnitude, constraints on spacecraft positioning within Line-of-Sight cone while approaching the docking port on a target platform, and constraints on approach velocity to match the velocity of the docking port. The two cases of a nonrotating and a rotating (tumbling) platform are treated separately, and trajectories are evaluated in terms of maneuver time and fuel consumption. For the case when the platform is not rotating and the docking port position is fixed with respect to the chosen frame, an explicit offline solution of the MPC optimization problem is shown to be possible; this explicit solution has a form of a piecewise affine control law suitable for online implementation without an on-board optimizer. In the case of a fast rotating platform, it is, however, shown that the prediction of the platform rotation is necessary to successfully accomplish the maneuvers and to reduce fuel consumption. Finally, the proposed approach is applied to debris avoidance maneuvers with the debris in the spacecraft rendezvous path. The significance of this paper is in demonstrating that Model Predictive Control can be an effective feedback control approach to satisfy various maneuver requirements, reduce fuel consumption, and provide robustness to disturbances.

Model predictive control was originally developed for chemical process control, where plants are expensive, have slow dynamics, and have a large number of inputs and outputs. Furthermore, in chemical process control each control system is usually deployed to a single plant, and hence can be specifically tuned. In recent years there has been growing interest in MPC in other industries, such as automotive, factory automation, and aerospace, where the plants have faster dynamics, fewer inputs and outputs, and reduced costs, and where each controller is deployed to a large number of plants, i.e., in large volumes. The applications in these industries also present several classes of nonlinearities. While there are several benefits to using MPC in large-volume industries, the differences in plant characteristics and in production volume targets pose several challenges to the widespread use of MPC that are still partially unsolved. In this paper we discuss the benefits of MPC in large-volume industries, using examples from automotive, aerospace, and mechatronics that involve specific nonlinearities that can be efficiently handled by MPC. We then discuss the unsolved challenges for these application domains and the related ongoing research.

Faster, cheaper, and more power efficient optimization solvers than those currently offered by general-purpose solutions are required for extending the use of model predictive control (MPC) to resource-constrained embedded platforms. We propose several custom computational architectures for different first-order optimization methods that can handle linear-quadratic MPC problems with input, input-rate, and soft state constraints. We provide analysis ensuring the reliable operation of the resulting controller under reduced precision fixed-point arithmetic. Implementation of the proposed architectures in FPGAs shows that satisfactory control performance at a sample rate beyond 1 MHz is achievable even on low-end devices, opening up new possibilities for the application of MPC on embedded systems.

Convex optimization problems arise frequently in many different fields. A comprehensive introduction to the subject, this book shows in detail how such problems can be solved numerically with great efficiency. The focus is on recognizing convex optimization problems and then finding the most appropriate technique for solving them. The text contains many worked examples and homework exercises and will appeal to students, researchers and practitioners in fields such as engineering, computer science, mathematics, statistics, finance, and economics.

GPU computing is at a tipping point, becoming more widely used in demanding consumer applications and high-performance computing. This article describes the rapid evolution of GPU architectures, from graphics processors to massively parallel many-core multiprocessors, as well as recent developments in GPU computing architectures and how the enthusiastic adoption of CPU+GPU coprocessing is accelerating parallel applications.

Linear quadratic model predictive control (MPC) with input constraints leads to an optimization problem that has to be solved at every instant in time. Although there exists computational complexity analysis for current online optimization methods dedicated to MPC, the worst case complexity bound is either hard to compute or far off from the practically observed bound. In this paper we introduce fast gradient methods that allow one to compute a priori the worst case bound required to find a solution with pre-specified accuracy. Both warm- and cold-starting techniques are analyzed and an illustrative example confirms that small, practical bounds can be obtained that together with the algorithmic and numerical simplicity of fast gradient methods allow online optimization at high rates.
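
A projected fast gradient iteration for an input-constrained QP can be sketched as follows. This is a minimal sketch of one common constant-step variant for strongly convex problems, and is not necessarily the exact scheme analyzed in the paper. The function name `fast_gradient_box_qp`, the zero-based starting point, and the requirement that the caller supplies the largest and smallest Hessian eigenvalues `L` and `mu` are assumptions made for illustration.

```python
import math


def fast_gradient_box_qp(H, f, lb, ub, L, mu, iters=200):
    """Fast gradient sketch for min 0.5 z'Hz + f'z subject to lb <= z <= ub.

    L and mu are the largest and smallest eigenvalues of H (mu > 0 assumed).
    """
    n = len(f)
    # Constant momentum coefficient for the strongly convex case.
    beta = (math.sqrt(L) - math.sqrt(mu)) / (math.sqrt(L) + math.sqrt(mu))
    z = [min(max(0.0, lb[i]), ub[i]) for i in range(n)]  # feasible start
    y = z[:]
    for _ in range(iters):
        # Gradient of the cost at the extrapolated point y.
        grad = [f[i] + sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
        # Gradient step of length 1/L followed by projection onto the box.
        z_new = [min(max(y[i] - grad[i] / L, lb[i]), ub[i]) for i in range(n)]
        # Momentum (extrapolation) step.
        y = [z_new[i] + beta * (z_new[i] - z[i]) for i in range(n)]
        z = z_new
    return z
```

With `H = [[2, -1], [-1, 2]]` (eigenvalues 1 and 3), `f = [-1, 2]`, and the box `[0, 10]` in each coordinate, the iterates approach the constrained minimizer (0.5, 0). Each iteration costs one matrix-vector product plus elementwise clipping, and the fixed iteration count mirrors the idea of certifying an iteration bound ahead of time.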

More than 15 years after model predictive control (MPC) appeared in industry as an effective means to deal with multivariable constrained control problems, a theoretical basis for this technique has started to emerge. The issues of feasibility of the on-line optimization, stability and performance are largely understood for systems described by linear models. Much progress has been made on these issues for non-linear systems but for practical applications many questions remain, including the reliability and efficiency of the on-line computation scheme. To deal with model uncertainty ‘rigorously’ an involved dynamic programming problem must be solved. The approximation techniques proposed for this purpose are largely at a conceptual stage. Among the broader research needs the following areas are identified: multivariable system identification, performance monitoring and diagnostics, non-linear state estimation, and batch system control. Many practical problems like control objective prioritization and symptom-aided diagnosis can be integrated systematically and effectively into the MPC framework by expanding the problem formulation to include integer variables yielding a mixed-integer quadratic or linear program. Efficient techniques for solving these problems are becoming available.

Idle speed control is a landmark application of feedback control in automotive vehicles that continues to be of significant interest to automotive industry practitioners, since improved idle performance and robustness translate into better fuel economy, emissions and drivability. In this paper, we develop a model predictive control (MPC) strategy for regulating the engine speed to the idle speed set-point by actuating the electronic throttle and the spark timing. The MPC controller coordinates the two actuators according to a specified cost function, while explicitly taking into account constraints on the control and requirements on the acceptable engine speed range, e.g., to avoid engine stalls. Following a process proposed here for the implementation of MPC in automotive applications, an MPC controller is obtained with excellent performance and robustness as demonstrated in actual vehicle tests. In particular, the MPC controller performs better than an existing baseline controller in the vehicle and is robust to changes in operating conditions and to different types of disturbances. It is also shown that the MPC computational complexity is well within the capability of a production electronic control unit and that the improved performance achieved by the MPC controller can translate into fuel economy improvements.

The goal of the LAPACK project is to design and implement a portable linear algebra library for efficient use on a variety of high-performance computers. The library is based on the widely used LINPACK and EISPACK packages for solving linear equations, eigenvalue problems, and linear least-squares problems, but extends their functionality in a number of ways. The major methodology for making the algorithms run faster is to restructure them to perform block matrix operations (e.g., matrix-matrix multiplication) in their inner loops. These block operations may be optimized to exploit the memory hierarchy of a specific architecture. The LAPACK project is also working on new algorithms that yield higher relative accuracy for a variety of linear algebra problems.

The goal of the LAPACK project is to design and implement a portable linear algebra library for efficient use on a variety of high-performance computers. The library is based on the widely used LINPACK and EISPACK packages for solving linear equations, eigenvalue problems, and linear least-squares problems, but extends their functionality in a number of ways. The major methodology for making the algorithms run faster is to restructure them to perform block matrix operations (e.g., matrix-matrix multiplication) in their inner loops. These block operations may be optimized to exploit the memory hierarchy of a specific architecture. The LAPACK project is also working on new divide-and-conquer algorithms for certain eigenproblems, and new algorithms that yield higher relative accuracy for a variety of linear algebra problems. 1 Introduction The University of Tennessee, the Courant Institute for Mathematical Sciences, the Numerical Algorithms Group Ltd., Rice Unive...

Multi-threaded Kernel Offloading to GPGPU Using Hyper-Q on Kepler Architecture

- F Wende
- T Steinke
- F Cordes

F. Wende, T. Steinke, and F. Cordes, "Multi-threaded Kernel
Offloading to GPGPU Using Hyper-Q on Kepler Architecture,"
Tech. Rep. ZIB-Report 14-19, Zuse Institute Berlin,
Takustrasse 7, D-14195 Berlin, Germany, Jun. 2014.

The art of performance tuning for CUDA and manycore architectures

- D Tarjan
- K Skadron
- P Micikevicius

D. Tarjan, K. Skadron, and P. Micikevicius, "The art of
performance tuning for CUDA and manycore architectures,"
Birds-of-a-feather session at SC, vol. 9, 2009.

Shuffle: Tips and Tricks

- J Demouth

J. Demouth, "Shuffle: Tips and Tricks," GPU Technology
Conference, 2013.

CUDA C/C++ Streams and Concurrency

- S Rennich

S. Rennich, "CUDA C/C++ Streams and Concurrency,"
NVIDIA GPU Computing Webinars, 2012.

MPC toolbox with GPU accelerated optimization algorithms

- N F Gade-Nielsen
- J B Jørgensen
- B Dammann

N. F. Gade-Nielsen, J. B. Jørgensen, and B. Dammann, "MPC
toolbox with GPU accelerated optimization algorithms," in
10th European Workshop on Advanced Control and Diagnosis,
2012.

NVIDIA GeForce GTX 1080 Whitepaper

NVIDIA, "NVIDIA GeForce GTX 1080 Whitepaper," 2016.

Zero Copy on Tegra K1

- Arrayfire

ArrayFire, "Zero Copy on Tegra K1." http://arrayfire.com/zero-copy-on-tegra-k1/.

Command-line Profiler

NVIDIA, "Command-line Profiler," 2016.

Accelerating numerical dense linear algebra calculations with gpus

- J Dongarra
- M Gates
- A Haidar
- J Kurzak
- P Luszczek
- S Tomov
- I Yamazaki

J. Dongarra, M. Gates, A. Haidar, J. Kurzak, P. Luszczek,
S. Tomov, and I. Yamazaki, "Accelerating numerical dense
linear algebra calculations with gpus," Numerical
Computations with GPUs, pp. 1-26, 2014.

An ADMM algorithm for solving ℓ1 regularized MPC

- M Annergren
- A Hansson
- B Wahlberg

M. Annergren, A. Hansson, and B. Wahlberg, "An ADMM
algorithm for solving ℓ1 regularized MPC," in 2012 IEEE
51st IEEE Conference on Decision and Control (CDC),
pp. 4486-4491, IEEE, 2012.