Conference Paper

Collective Knowledge: Towards R&D Sustainability

... As part of this educational initiative, we implemented an extensible, portable and technology-agnostic workflow for autotuning using the open-source Collective Knowledge framework (CK) [25,62]. Such workflows help researchers to reuse already shared applications, kernels, data sets and tools, or add their own using a common JSON API and meta-description [6]. ...
... Since there was no available open-source framework with all these features, we decided to develop such a framework, Collective Knowledge (CK) [25,62], from scratch with initial support from the EU-funded TETRACOM project [19]. CK is implemented as a small and portable Python module with a command-line front-end to assist users in converting their local objects (code and data) into searchable, reusable and shareable directory entries with user-friendly aliases, auto-generated unique IDs, a JSON API and JSON meta information [6], as described in [62,2] and conceptually shown in Figure 2b. ...
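The conversion described above can be illustrated with a minimal, self-contained sketch. This is not CK's actual implementation, and the module and entry names are hypothetical; it only shows the idea of turning local objects into directory entries with a user-friendly alias, an auto-generated unique ID and a JSON meta file that makes them searchable.

```python
import json
import tempfile
import uuid
from pathlib import Path

def add_entry(repo: Path, module: str, alias: str, meta: dict) -> str:
    """Create a searchable directory entry with a user-friendly alias,
    an auto-generated unique ID and JSON meta information."""
    uid = uuid.uuid4().hex[:16]
    entry = repo / module / alias
    entry.mkdir(parents=True, exist_ok=True)
    (entry / "meta.json").write_text(
        json.dumps({**meta, "uid": uid, "alias": alias}, indent=2))
    return uid

def search(repo: Path, module: str, tag: str) -> list:
    """Return aliases of entries under a module whose meta carries a tag."""
    hits = []
    for meta_file in sorted((repo / module).glob("*/meta.json")):
        meta = json.loads(meta_file.read_text())
        if tag in meta.get("tags", []):
            hits.append(meta["alias"])
    return hits

repo = Path(tempfile.mkdtemp())
add_entry(repo, "program", "image-corner-detection",
          {"tags": ["benchmark", "autotuning"], "language": "C"})
add_entry(repo, "dataset", "image-jpeg-0001", {"tags": ["image"]})
print(search(repo, "program", "autotuning"))  # ['image-corner-detection']
```

The point of the file-based layout is that entries stay human-readable and version-controllable, while the JSON meta gives tools a uniform way to discover and reuse them.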
Article
Developing efficient software and hardware has never been harder, whether for a tiny IoT device or an Exascale supercomputer. Apart from the ever-growing design and optimization complexity, there exist even more fundamental problems, such as the lack of interdisciplinary knowledge required for effective software/hardware co-design and a growing technology-transfer gap between academia and industry. We introduce our new educational initiative to tackle these problems by developing Collective Knowledge (CK), a unified experimental framework for computer systems research and development. We use CK to teach the community how to make their research artifacts and experimental workflows portable, reproducible, customizable and reusable while enabling sustainable R&D and facilitating technology transfer. We also demonstrate how to redesign multi-objective autotuning and machine learning as a portable and extensible CK workflow. Such workflows enable researchers to experiment with different applications, data sets and tools; crowdsource experimentation across diverse platforms; share experimental results, models and visualizations; gradually expose more design and optimization choices using a simple JSON API; and ultimately build upon each other's findings. As a first practical step, we have implemented customizable compiler autotuning, crowdsourced the optimization of diverse workloads across Raspberry Pi 3 devices, reduced execution time and code size by up to 40%, and applied machine learning to predict optimizations. We hope this approach will help teach students how to build upon each other's work to enable an efficient and self-optimizing software/hardware/model stack for emerging workloads.
... MILEPOST GCC [101,105] was the first attempt to make a practical on-the-fly machine-learning-based compiler combined with an infrastructure targeted to autotuning and crowdsourcing scenarios. It has been used in practice and revealed many issues yet to be tackled by researchers, including (1) reproducibility of empirical results collected with multiple users and (2) problems with metadata, data representation, models, and massive datasets [102,104]. ...
... These complex learners allow automated systems to efficiently perform these tasks with minimal programmer effort. Additionally, research on collaborative tuning methodologies has gained attention with the introduction of the Collective Knowledge framework (CK) [101][102][103][174]. CK is a cross-platform open research SDK developed in collaboration with academic and industrial partners to share artifacts as reusable and customizable components with unified, portable and customizable experimental workflows. ...
Article
Full-text available
Since the mid-1990s, researchers have been trying to use machine-learning based approaches to solve a number of different compiler optimization problems. These techniques primarily enhance the quality of the obtained results and, more importantly, make it feasible to tackle two main compiler optimization problems: optimization selection (choosing which optimizations to apply) and phase-ordering (choosing the order of applying optimizations). The compiler optimization space continues to grow due to the advancement of applications, increasing number of compiler optimizations, and new target architectures. Generic optimization passes in compilers cannot fully leverage newly introduced optimizations and, therefore, cannot keep up with the pace of increasing options. This survey summarizes and classifies the recent advances in using machine learning for the compiler optimization field, particularly on the two major problems of (1) selecting the best optimizations and (2) the phase-ordering of optimizations. The survey highlights the approaches taken so far, the obtained results, the fine-grain classification among different approaches and finally, the influential papers of the field.
... Finally, we refer to model when we report the performance of our model-driven CLBlast version. To automate the workflow of our framework, we used Collective Knowledge technology [18] for generating the datasets, learning the models and evaluating their performance. ...
... This aspect is particularly crucial for embedded architectures where generating the training set is expensive (e.g., it took 7 days to create po2 for the Mali GPU). We believe in a collaborative/community-driven approach for collecting and analyzing datasets, building predictive models, etc. [18]. ...
Preprint
Full-text available
Efficient high-performance libraries often expose multiple tunable parameters to provide highly optimized routines. These can range from simple loop unroll factors or vector sizes all the way to algorithmic changes, given that some implementations can be more suitable for certain devices by exploiting hardware characteristics such as local memories and vector units. Traditionally, such parameters and algorithmic choices are tuned and then hard-coded for a specific architecture and for certain characteristics of the inputs. However, emerging applications are often data-driven, so traditional approaches are not effective across the wide range of inputs and architectures used in practice. In this paper, we present a new adaptive framework for data-driven applications which uses a predictive model to select the optimal algorithmic parameters by training with synthetic and real datasets. We demonstrate the effectiveness of our approach on a BLAS library, specifically on its matrix multiplication routine. We present experimental results for two GPU architectures and show significant performance gains of up to 3x (on a high-end NVIDIA Pascal GPU) and 2.5x (on an embedded ARM Mali GPU) when compared to a traditionally optimized library.
... The runtime collects a set of metrics (see Section 3.1) that couples dynamic dependence information with source code location for precise application profiling reports (see Section 3.3); a set of heuristics that, guided by LoopAnalyzer's profiling reports, allows programmers to properly select loops for parallel speculative execution or privatization (see Section 4); and a thorough dependence analysis and discussion of loops from 45 applications of three well-known benchmarks: cBench [15], Parboil [41], and Rodinia [7]. The evaluation in Section 4.2 covered up to 180 loops that are responsible for at least 10% of CPU time. ...
... The experimental results aim to show the latent parallelism that a state-of-the-art compiler misses due to the conservative nature of its static analysis. In order to assess LoopAnalyzer's capabilities, three benchmark suites widely studied in the literature are used, namely cBench [15], Parboil [41], and Rodinia [7]. cBench is a collection of open-source sequential programs, while Parboil and Rodinia are sets of computing applications with multiple implementations for different parallel models, such as CUDA, OpenMP and OpenCL. ...
Conference Paper
Full-text available
Production compilers such as GCC, Clang, IBM XL and the Intel C Compiler employ multiple loop parallelization techniques that help in the task of parallel programming. Although very effective, these techniques are only applicable to loops that the compiler can statically determine to have no loop-carried dependences (DOALL). Because of this restriction, a plethora of Dynamic DOALL (D-DOALL) loops are outright ignored, leaving the parallelism potential of many computationally intensive applications unexplored. This paper proposes a new analysis tool based on OpenMP clauses that allows the programmer to generate detailed profiling of any given loop by identifying its loop-carried dependences and producing carefully selected execution time metrics. The paper also proposes a set of heuristics to be used in conjunction with the analysis tool metrics to properly select loops which could be parallelized through speculative execution, even in the presence of loop-carried dependences. A thorough analysis of 180 loops from 45 benchmarks of three different suites (cBench, Parboil, and Rodinia) was performed using the Intel C Compiler and the proposed approach. Experimental results using static analysis from the Intel C Compiler showed that only 7.8% of the loops are DOALL. The proposed analysis tool exposed 39.5% May DOALL (M-DOALL) loops which could eventually be parallelized using speculative execution, as exemplified by loops from the Parboil sad program which produced a speedup of 1.92x.
... Similar work for measurements in cloud computing was done by Laaber et al. [24], Iosup et al. [18], and Folkerts et al. [10]. Others investigated sources of measurement bias in experimental work [26,39]. The Collective Knowledge Framework [12] enables systematic recording of individual experimental steps, which permits independent reproduction and contribution of additional results. Another framework is DataMill [6], which randomizes selected environmental conditions to improve the generalizability of specific measurements. ...
... However, it hides all the software chaos rather than solving it, incurs some performance overhead, requires an enormous amount of space, has very poor support for embedded devices and does not help to integrate models with native environments and user data. • Collective Knowledge (CK) was introduced as a portable and modular workflow framework to address the above issues and bridge the gap between high-level ML operations and systems [16,6]. While it helps companies to automate ML benchmarking and move ML models to production [20], we also noticed two major limitations during its practical use: ...
Preprint
Full-text available
We present CodeReef - an open platform to share all the components necessary to enable cross-platform MLOps (MLSysOps), i.e. automating the deployment of ML models across diverse systems in the most efficient way. We also introduce the CodeReef solution - a way to package and share models as non-virtualized, portable, customizable and reproducible archive files. Such ML packages include JSON meta description of models with all dependencies, Python APIs, CLI actions and portable workflows necessary to automatically build, benchmark, test and customize models across diverse platforms, AI frameworks, libraries, compilers and datasets. We demonstrate several CodeReef solutions to automatically build, run and measure object detection based on SSD-Mobilenets, TensorFlow and COCO dataset from the latest MLPerf inference benchmark across a wide range of platforms from Raspberry Pi, Android phones and IoT devices to data centers. Our long-term goal is to help researchers share their new techniques as production-ready packages along with research papers to participate in collaborative and reproducible benchmarking, compare the different ML/software/hardware stacks and select the most efficient ones on a Pareto frontier using online CodeReef dashboards.
... To enable our vision, we have designed Collective Knowledge (CK), an open framework for collaborative and reproducible R&D in computer systems [4,10]. Based on a methodology originating from natural sciences, CK involves the community to learn and optimize the behavior of complex computer systems consisting of interdependent components [11]. ...
Conference Paper
We invite the community to collaboratively design and optimize convolutional neural networks to meet the performance, accuracy and cost requirements for deployment on a range of form factors -- from sensors to self-driving cars.
... To the best of our knowledge, there have been almost no attempts to expose program autotuners as web services. The only notable exception is the work on the Collective Mind framework [10] and its successor, Collective Knowledge [9]. The Collective Knowledge (CK) framework supports reproducible and collaborative research in computer systems by enabling users to create and share repositories with programs (e.g. ...
Conference Paper
Full-text available
Program autotuning is becoming an increasingly valuable tool for improving performance portability across diverse target architectures, exploring trade-offs between several criteria, or meeting quality of service requirements. Recent work on general autotuning frameworks enabled rapid development of domain-specific autotuners reusing common libraries of parameter types and search techniques. In this work we explore the use of such frameworks to develop general-purpose online services for program autotuning using the Software as a Service model. Beyond the common benefits of this model, the proposed approach opens up a number of unique opportunities, such as collecting performance data and utilizing it to improve further runs, or enabling remote online autotuning. However, the proposed autotuning-as-a-service approach also brings in several challenges, such as accessing target systems, dealing with measurement latency, and supporting execution of user-provided code. This paper presents the first step towards implementing the proposed approach and addressing these challenges. We describe an implementation of a generic autotuning service that can be used for tuning arbitrary programs on user-provided computing systems. The service is based on the OpenTuner autotuning framework and runs on the Everest platform that enables rapid development of computational web services. In contrast to OpenTuner, the service doesn't require installation of the framework, allows users to avoid writing code, and supports efficient parallel execution of measurement tasks across multiple machines. The performance of the service is evaluated by using it for tuning synthetic and real programs.
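The core loop such a service wraps can be sketched in a few lines. This is a deliberately tiny, self-contained illustration, not OpenTuner's API: the parameter space and the cost function are made up, and the space is small enough to enumerate exhaustively, whereas a real autotuner would sample a huge space with smarter search techniques.

```python
import itertools

# Hypothetical tuning space: loop unroll factor and tile size.
SPACE = {"unroll": [1, 2, 4, 8], "tile": [16, 32, 64, 128]}

def measure(cfg):
    """Stand-in for compiling and timing the target program with a given
    configuration; a real service would dispatch this to user machines."""
    return 1.0 + abs(cfg["unroll"] - 4) + abs(cfg["tile"] - 64) / 16

def tune():
    """Enumerate the (small) space and keep the fastest configuration."""
    best_cfg, best_time = None, float("inf")
    for values in itertools.product(*SPACE.values()):
        cfg = dict(zip(SPACE.keys(), values))
        t = measure(cfg)
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time

cfg, t = tune()
print(cfg, t)  # {'unroll': 4, 'tile': 64} 1.0
```

The service model described above essentially moves the `measure` step behind a web API and parallelizes it across machines, while the search loop stays the same.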
... That is why we build our competition on top of an open-source and portable workflow framework (Collective Knowledge or CK [3]) and a standard ACM artifact evaluation methodology [1] from premier ACM systems conferences (CGO, PPoPP, PACT, SuperComputing) to provide unified evaluation and a live scoreboard of submissions as demonstrated in Figure 2. ...
Article
Co-designing efficient machine learning based systems across the whole hardware/software stack to trade off speed, accuracy, energy and costs is becoming extremely complex and time consuming. Researchers often struggle to evaluate and compare different published works across rapidly evolving software frameworks, heterogeneous hardware platforms, compilers, libraries, algorithms, data sets, models, and environments. We present our community effort to develop an open co-design tournament platform with an online public scoreboard. It will gradually incorporate best research practices while providing a common way for multidisciplinary researchers to optimize and compare the quality vs. efficiency Pareto optimality of various workloads on diverse and complete hardware/software systems. We want to leverage the open-source Collective Knowledge framework and the ACM artifact evaluation methodology to validate and share the complete machine learning system implementations in a standardized, portable, and reproducible fashion. We plan to hold regular multi-objective optimization and co-design tournaments for emerging workloads such as deep learning, starting with ASPLOS'18 (ACM conference on Architectural Support for Programming Languages and Operating Systems - the premier forum for multidisciplinary systems research spanning computer architecture and hardware, programming languages and compilers, operating systems and networking) to build a public repository of the most efficient machine learning algorithms and systems which can be easily customized, reused and built upon.
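The "quality vs. efficiency Pareto optimality" mentioned above comes down to a simple dominance check over measured results. A minimal sketch follows; the metric names and data points are purely illustrative (assuming lower latency and higher accuracy are better), not results from any tournament.

```python
def pareto_front(points):
    """Return the subset of (latency, accuracy) points not dominated by
    any other point: lower latency is better, higher accuracy is better."""
    def dominates(a, b):
        # a dominates b if it is at least as good on both metrics
        # and is not the same point.
        return a[0] <= b[0] and a[1] >= b[1] and a != b
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (latency_ms, top1_accuracy) results for several ML stacks.
results = [(120, 0.76), (95, 0.74), (200, 0.77), (95, 0.70), (300, 0.75)]
print(pareto_front(results))  # [(120, 0.76), (95, 0.74), (200, 0.77)]
```

A public scoreboard of this kind would keep only the front: every submission off the front is strictly beaten by some existing system on both axes.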
... libVC itself does not provide any automatic selection of which Version should be executed. The decision of which Version is the most suitable for a given task is left to policies defined by the programmer or other autotuning frameworks such as mARGOt [13] or cTuning [14]. ...
Article
Full-text available
We present libVersioningCompiler, a C++ library designed to support the dynamic generation of multiple versions of the same compute kernel in an HPC scenario. It can be used to provide continuous optimization, code specialization based on the input data or on workload changes, or otherwise to dynamically adjust the application, without the burden of a full dynamic compiler. The library supports multiple underlying compilers but specifically targets the LLVM framework. We also provide examples of use, showing the overhead of the library, and providing guidelines for its efficient use.
... Stodden (2009a, b) proposes the Reproducible Research Standard, a licensing framework that promotes the sharing of artifacts and ensures that authors are credited for their work. Numerous public platforms have been proposed that use virtualization and containerization to facilitate the long-term persistence and replication of data and source code artifacts (Brammer et al. 2011;Austin et al. 2011;Jimenez et al. 2017;Meng et al. 2016;Timperley et al. 2018;Fursin et al. 2016). Coding conventions and best practices have also been proposed to make it easier for authors to produce high-quality artifacts that can be more easily understood, reused, and replicated by others (Li-Thiao-Té 2012;Krishnamurthi 2014;Stodden et al. 2014). ...
Conference Paper
Full-text available
Software engineering research rarely impacts practitioners in the field. A desire to facilitate transfer alone is not sufficient to bridge the gap between research and practice. Fields from medicine to education have acknowledged a similar challenge over the past 25 years. An empirical approach to the translation of research into evidence-based practice has emerged from the resulting discussion. Implementation science has fundamentally changed the way that biomedical research is conducted, and has revolutionized the daily practice of doctors, social workers, epidemiologists, early childhood educators, and more. In this talk we will explore the methods, frameworks, and practices of implementation science and their application to novel disciplines, including software engineering research. We will close by proposing some directions for future software engineering research to facilitate transfer.
... Performance engineering community: Much attention is paid to the ability to reproduce experiments with reasonable effort. Frameworks like the Collective Knowledge Framework [74] aim at systematic recording of individual experiment steps that permits independent reproduction and contribution of additional results. As unexpected effects can appear during performance evaluation due to relatively unexpected properties of the experimental platform [3], [75], environments such as DataMill [46] can randomize selected environmental conditions and thus improve the ability to generalize from particular measurements. ...
Article
Full-text available
The rapid adoption and the diversification of cloud computing technology exacerbate the importance of a sound experimental methodology for this domain. This work investigates how to measure and report performance in the cloud, and how well the cloud research community is already doing it. We propose a set of eight important methodological principles that combine best-practices from nearby fields with concepts applicable only to clouds, and with new ideas about the time-accuracy trade-off. We show how these principles are applicable using a practical use-case experiment. To this end, we analyze the ability of the newly released SPEC Cloud IaaS benchmark to follow the principles, and showcase real-world experimental studies in common cloud environments that meet the principles. Last, we report on a systematic literature review including top conferences and journals in the field, from 2012 to 2017, analyzing if the practice of reporting cloud performance measurements follows the proposed eight principles. Worryingly, this systematic survey and the subsequent two-round human reviews, reveal that few of the published studies follow the eight experimental principles. We conclude that, although these important principles are simple and basic, the cloud community is yet to adopt them broadly to deliver sound measurement of cloud environments.
... With the hurdles out of the way, a small team or even an individual can add new models. For instance, thanks to the LoadGen and a complementary workflow-automation technology (Fursin et al., 2016), one MLPerf contributor with only three employees swept more than 60 computer-vision models in the open division. ...
Preprint
Machine-learning (ML) hardware and software system demand is burgeoning. Driven by ML applications, the number of different ML inference systems has exploded. Over 100 organizations are building ML inference chips, and the systems that incorporate existing models span at least three orders of magnitude in power consumption and four orders of magnitude in performance; they range from embedded devices to data-center solutions. Fueling the hardware are a dozen or more software frameworks and libraries. The myriad combinations of ML hardware and ML software make assessing ML-system performance in an architecture-neutral, representative, and reproducible manner challenging. There is a clear need for industry-wide standard ML benchmarking and evaluation criteria. MLPerf Inference answers that call. Driven by more than 30 organizations as well as more than 200 ML engineers and practitioners, MLPerf implements a set of rules and practices to ensure comparability across systems with wildly differing architectures. In this paper, we present the method and design principles of the initial MLPerf Inference release. The first call for submissions garnered more than 600 inference-performance measurements from 14 organizations, representing over 30 systems that show a range of capabilities.
... We have grown to believe that autotuning can only be made practical by performing it in a collaborative way, while continuously sharing representative workloads and optimization knowledge. To enable our vision, we have designed and implemented Collective Knowledge (CK), an open framework for reproducible and collaborative R&D in computer systems [3]. ...
Conference Paper
Autotuning is a popular technique to ensure performance portability for important algorithms such as BLAS, FFT and DNN across the ever evolving software and hardware stack. Unfortunately, when performed on a single machine, autotuning can explore only a tiny fraction of the ever growing and non-linear optimization spaces and thus can easily miss optimal solutions. We propose to practically solve this problem with the help of the community using the open-source Collective Knowledge framework (CK). We have customized the universal multi-objective autotuning engine of CK to optimize the local work size and other parameters of OpenCL workloads across diverse inputs and devices. Optimal solutions (with speed increases of up to 20x and energy savings of up to 30% over the default configurations) are preserved in the open repository of optimization knowledge at http://cknowledge.org/repo.
... While working with these useful tools and platforms I realized that a higher-level API could help to connect them together into portable workflows with reusable artifacts that can adapt to never-ending changes in systems and environments. That is why I decided to develop the Collective Knowledge framework (CK or cKnowledge) [23,20]: a small and cross-platform Python framework that helps to convert ad-hoc research projects into a file-based database of reusable CK components [13] (code, data, models, pre-/post-processing scripts, experimental results, R&D automation actions [4], best research practices to reproduce results, and live papers) with unified Python and REST APIs, a common command-line interface, JSON meta information and JSON input/output (Figure 2). I also provided reusable APIs to automatically detect different software, models and datasets on a user's machine, or to install/cross-compile the missing ones, while supporting different operating systems (Linux, Windows, MacOS, Android) and hardware (Nvidia, Arm, Intel, AMD ...). ...
Preprint
This article provides an overview of the Collective Knowledge technology (CK or cKnowledge) that attempts to make it easier to reproduce ML&systems research, deploy ML models in production and adapt them to continuously changing data sets, models, research techniques, software and hardware. The CK concept is to decompose complex systems and ad-hoc research projects into reusable sub-components with unified APIs, CLI and JSON meta description. Such components can be connected into portable workflows using DevOps principles combined with reusable automation actions, software detection plugins, meta packages and exposed optimization parameters. CK workflows can automatically plug in different models, data and tools from different vendors while building, running and benchmarking research code in a unified way across diverse platforms and environments, performing whole system optimization, reproducing results and comparing them using public or private scoreboards on the cKnowledge.io platform. The modular CK approach was successfully validated with industrial partners to automatically co-design and optimize software, hardware and machine learning models for reproducible and efficient object detection in terms of speed, accuracy, energy, size and other characteristics. The long-term goal is to simplify and accelerate the development and deployment of ML models and systems by helping researchers and practitioners to share and reuse their knowledge, experience, best practices, artifacts and techniques using open CK APIs.
... A company's differentiation strategy is expressed as R&D cost, an investment that builds the technical capability to increase the company's potential competitiveness and that can be directly aligned with the sustainability of the firm [58][59][60]. Generally, companies with a high proportion of R&D expenditures have been noted to have superior long-term management performance. ...
Article
Full-text available
This research analyzed the moderating effects of the continental factor on the relation between the business strategies (cost advantage strategy and differentiation strategy) of the pharmaceutical industry and mergers and acquisitions (M&A) performance. A total of 1303 M&A cases were collected from the Bloomberg database between 1995 and 2016 for the sake of empirical analyses. The independent variables were represented by the cost advantage strategy and the differentiation strategy. The dependent variable was the M&A performance, which was measured as the change in ROA (return on assets). The results showed that the cost advantage strategy was advantageous when an Asian firm acquired one in either Asia or Europe. In contrast, when a European company acquired one in either Europe or Asia, M&A performance was also higher, although the cost was higher. On the other hand, the differentiation strategy was valid only when a European firm acquired one in Asia. The moderating effect of the continental factor was beneficial only in the relation between the cost advantage strategy and M&A performance. These results could help companies make decisions that maximize M&A performance based on continental factors from the perspective of sustainable international business strategy establishment.
... Stodden (2009b,a) proposes the Reproducible Research Standard, a licensing framework that promotes the sharing of artifacts and ensures that authors are credited for their work. Numerous public platforms have been proposed that use virtualisation and containerisation to facilitate the long-term persistence and replication of data and source code artifacts (Brammer et al., 2011;Austin et al., 2011;Jimenez et al., 2017;Meng et al., 2016;Timperley et al., 2018;Fursin et al., 2016). Coding conventions and best practices have also been proposed to make it easier for authors to produce high-quality artifacts that can be more easily understood, reused, and replicated by others (Li-Thiao-Té, 2012;Krishnamurthi, 2014;Stodden et al., 2014). ...
Preprint
In recent years, many software engineering researchers have begun to include artifacts alongside their research papers. Ideally, artifacts, which include tools, benchmarks, data, and more, support the dissemination of ideas, provide evidence for research claims, and serve as a starting point for future research. This often takes the form of a link in the paper pointing to a website containing these additional materials. However, in practice, artifacts suffer from a variety of issues that prevent them from fully realising that potential. To help the software engineering community realise the potential of artifacts, we seek to understand the challenges involved in the creation, sharing, and use of artifacts. To that end, we perform a mixed-methods study including a publication analysis and online survey of 153 software engineering researchers. We apply the established theory of diffusion of innovation, and draw from the field of implementation science, to make evidence-based recommendations. By analysing the perspectives of artifact creators, users, and reviewers, we identify several high-level challenges that affect the quality of artifacts including mismatched expectations between these groups, and a lack of sufficient reward for both creators and reviewers. Using diffusion of innovation as a framework, we analyse how these challenges relate to one another, and build an understanding of the factors that affect the sharing and success of artifacts. Finally, using principles from implementation science, we make evidence-based recommendations for specific sub-communities (e.g., students and postdocs, artifact evaluation committees, funding bodies, and professional organisations) to improve the quality of artifacts.
... These challenges are compounded by an ever more formidable and heterogeneous hardware landscape (Reddi et al., 2020;Fursin et al., 2016). As the hardware landscape becomes increasingly fragmented and specialized, fast and efficient code will require ever more niche and specialized skills to write (Lee et al., 2011). ...
Preprint
Full-text available
Hardware, systems and algorithms research communities have historically had different incentive structures and fluctuating motivation to engage with each other explicitly. This historical treatment is odd given that hardware and software have frequently determined which research ideas succeed (and fail). This essay introduces the term hardware lottery to describe when a research idea wins because it is suited to the available software and hardware and not because the idea is superior to alternative research directions. Examples from early computer science history illustrate how hardware lotteries can delay research progress by casting successful ideas as failures. These lessons are particularly salient given the advent of domain specialized hardware which makes it increasingly costly to stray off of the beaten path of research ideas.
... To test the effectiveness of our approach, we carried out an exhaustive experimental analysis using three different GPU architectures: an Nvidia Tesla P100, an Nvidia Titan V, and an embedded ARM Mali-T860 based on the Midgard architecture. To collect and automate the benchmarks, we used the Collective Knowledge framework [20]. For every benchmark, we used from 5 to 10 repetitions and collected the average time. ...
Preprint
Full-text available
Efficient HPC libraries often expose multiple tunable parameters, algorithmic implementations or a combination of them, to provide optimized routines. The optimal parameters and algorithmic choices may depend on input properties such as the shapes of the matrices involved in the operation. Traditionally, these parameters are manually tuned or set by auto-tuners. In emerging applications such as deep learning, this approach is not effective across the wide range of inputs and architectures used in practice. In this work, we analyze different machine learning techniques and predictive models to accelerate the convolution operator and GEMM. Moreover, we address the problem of dataset generation and we study the performance, accuracy and generalization ability of the models. Our insights allow us to improve the performance of computationally expensive deep learning primitives on high-end GPUs as well as low-power embedded GPU architectures on three different libraries. Experimental results show significant improvement in the target applications from 50% up to 300% compared to auto-tuned and highly optimized vendor-based heuristics by using simple decision tree- and MLP-based models.
... To test the effectiveness of our approach, we carried out an exhaustive experimental analysis using three different GPU architectures: an Nvidia Tesla P100, an Nvidia Titan V, and an embedded ARM Mali-T860 based on the Midgard architecture. To collect and automate the benchmarks, we used the Collective Knowledge framework [22]. For every benchmark, we used from 5 to 10 repetitions and collected the average time. ...
Article
Full-text available
Efficient HPC libraries often expose multiple tunable parameters, algorithmic implementations, or a combination of them, to provide optimized routines. The optimal parameters and algorithmic choices may depend on input properties such as the shapes of the matrices involved in the operation. Traditionally, these parameters are manually tuned or set by auto-tuners. In emerging applications such as deep learning, this approach is not effective across the wide range of inputs and architectures used in practice. In this work, we analyze different machine learning techniques and predictive models to accelerate the convolution operator and GEMM. Moreover, we address the problem of dataset generation, and we study the performance, accuracy, and generalization ability of the models. Our insights allow us to improve the performance of computationally expensive deep learning primitives on high-end GPUs as well as low-power embedded GPU architectures on three different libraries. Experimental results show significant improvement in the target applications from 50% up to 300% compared to auto-tuned and highly optimized vendor-based heuristics by using simple decision tree- and MLP-based models.
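To make the idea above concrete, here is a minimal sketch (not the paper's actual model) of what a learned selector for GEMM variants looks like in spirit: a hand-written decision-tree-style predictor that maps input shapes to an algorithmic choice, standing in for the trained decision tree a library would query instead of running an auto-tuner. The function name, thresholds and variant labels are all illustrative assumptions.

```python
# Illustrative sketch only: a hand-written, decision-tree-style predictor
# that picks a (hypothetical) GEMM variant from the input shapes, mimicking
# how a learned tree replaces an exhaustive auto-tuning search at run time.

def predict_gemm_variant(m: int, n: int, k: int) -> str:
    """Return an illustrative algorithm choice for an (m x k) by (k x n) GEMM."""
    # Small problems: launch/setup overhead dominates, so use a direct kernel.
    if m * n * k < 1 << 18:
        return "direct"
    # A very deep reduction dimension favours a split-k strategy.
    if k > 8 * max(m, n):
        return "split_k"
    # Large, roughly square problems favour a tiled/blocked kernel.
    return "tiled"

if __name__ == "__main__":
    for shape in [(32, 32, 32), (64, 64, 4096), (2048, 2048, 2048)]:
        print(shape, "->", predict_gemm_variant(*shape))
```

In the paper's setting, the hard-coded thresholds would instead be learned from benchmark data, which is exactly why dataset generation and generalization across inputs matter.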
... Instructions for model preparation, installation, and job execution were provided to the participant teams. During the preparation, tools like Popper [78] and CK [79] were introduced to help teams build and execute container-native workflows. ...
Article
Full-text available
This paper extends the work entitled "Computing planetary interior normal modes with a highly parallel polynomial filtering eigensolver" by Shi et al., originally presented at the SC18 conference. A highly parallel polynomial-filtered eigensolver was developed and exploited to calculate the planetary normal modes. The proposed method is ideally suited for computing interior eigenpairs of large-scale eigenvalue problems as it greatly enhances memory and computational efficiency. In this work, the second-order finite element method is used to further improve the accuracy, as only the first-order finite element method was deployed in the previous work. The parallel algorithm, its parallel performance up to 20k processors, and its high computational accuracy are illustrated. The previous work was successfully reproduced at the Student Cluster Competition at the SC19 conference by several participant teams using a completely different Mars-model dataset on different clusters. Both the weak and strong scaling performance achieved by the participant teams was impressive and encouraging. Their results are analyzed and reflected upon, and future directions are discussed.
... Stodden (2009a, b) proposes the Reproducible Research Standard, a licensing framework that promotes the sharing of artifacts and ensures that authors are credited for their work. Numerous public platforms have been proposed that use virtualization and containerization to facilitate the long-term persistence and replication of data and source code artifacts (Brammer et al. 2011;Austin et al. 2011;Jimenez et al. 2017;Meng et al. 2016;Timperley et al. 2018;Fursin et al. 2016). Coding conventions and best practices have also been proposed to make it easier for authors to produce high-quality artifacts that can be more easily understood, reused, and replicated by others (Li-Thiao-Té 2012;Krishnamurthi 2014;Stodden et al. 2014). ...
Article
Full-text available
In recent years, many software engineering researchers have begun to include artifacts alongside their research papers. Ideally, artifacts, including tools, benchmarks, and data, support the dissemination of ideas, provide evidence for research claims, and serve as a starting point for future research. However, in practice, artifacts suffer from a variety of issues that prevent the realization of their full potential. To help the software engineering community realize the full potential of artifacts, we seek to understand the challenges involved in the creation, sharing, and use of artifacts. To that end, we perform a mixed-methods study including a survey of artifacts in software engineering publications, and an online survey of 153 software engineering researchers. By analyzing the perspectives of artifact creators, users, and reviewers, we identify several high-level challenges that affect the quality of artifacts including mismatched expectations between these groups, and a lack of sufficient reward for both creators and reviewers. Using Diffusion of Innovations (DoI) as an analytical framework, we examine how these challenges relate to one another, and build an understanding of the factors that affect the sharing and success of artifacts. Finally, we make recommendations to improve the quality of artifacts based on our results and existing best practices.
Chapter
Production compilers such as GCC, Clang, IBM XL and the Intel C Compiler employ multiple loop parallelization techniques that help with the task of parallel programming. Although very effective, these techniques are only applicable to loops that the compiler can statically determine to have no loop-carried dependences (DOALL). Because of this restriction, a plethora of Dynamic DOALL (D-DOALL) loops are ignored outright, leaving the parallelism potential of many computationally intensive applications unexplored. This paper proposes a new analysis tool based on OpenMP clauses that allows the programmer to generate a detailed profile of any given loop by identifying its loop-carried dependences and producing carefully selected execution-time metrics. The paper also proposes a set of heuristics to be used in conjunction with the analysis tool's metrics to properly select loops that could be parallelized through speculative execution, even in the presence of loop-carried dependences. A thorough analysis of 180 loops from 45 benchmarks of three different suites (cBench, Parboil, and Rodinia) was performed using the Intel C Compiler and the proposed approach. Experimental results using static analysis from the Intel C Compiler showed that only 7.8% of the loops are DOALL. The proposed analysis tool exposed 39.5% May DOALL (M-DOALL) loops that could eventually be parallelized using speculative execution, as exemplified by loops from the Parboil sad program, which produced a speedup of 1.92x.
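The DOALL distinction at the heart of this chapter can be illustrated with two tiny loops (Python is used here only for clarity; the tool itself targets C/C++ loops and OpenMP):

```python
# Illustration of the DOALL distinction the analysis targets.

def doall_loop(a, b):
    # No loop-carried dependence: every iteration i reads and writes only
    # index i, so all iterations can run in parallel (a DOALL loop).
    return [a[i] * b[i] for i in range(len(a))]

def carried_dependence_loop(a):
    # Loop-carried dependence: iteration i reads the value produced by
    # iteration i-1 (a prefix sum), so a compiler must either prove the
    # dependence away or speculate around it before parallelizing.
    out = [a[0]]
    for i in range(1, len(a)):
        out.append(out[i - 1] + a[i])
    return out
```

A "May DOALL" loop is one whose dependences cannot be disproved statically but rarely (or never) manifest at run time, which is exactly where speculative execution pays off.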
Conference Paper
We present and define a structured digital object, called a "Tale," for the dissemination and publication of computational scientific findings in the scholarly record. The Tale emerges from the NSF funded Whole Tale project (wholetale.org) which is developing a computational environment designed to capture the entire computational pipeline associated with a scientific experiment and thereby enable computational reproducibility. A Tale allows researchers to create and package code, data and information about the workflow and computational environment necessary to support, review, and recreate the computational results reported in published research. The Tale then captures the artifacts and information needed to facilitate understanding, transparency, and execution of the Tale for review and reproducibility at the time of publication.
Chapter
With the increasing complexity of upcoming HPC systems, so-called “co-design” efforts to develop the hardware and applications in concert for these systems also become more challenging. It is currently difficult to gather information about the usage of programming model features, libraries, and data structure considerations in a quantitative way across a variety of applications, and this information is needed to prioritize development efforts in systems software and hardware optimizations. In this paper we propose CAASCADE, a system that can harvest this information in an automatic way in production HPC environments, and we show some early results from a prototype of the system based on GNU compilers and a MySQL database.
Article
Many classes of applications, both in the embedded and high performance domains, can trade off the accuracy of the computed results for computation performance. One way to achieve such a trade-off is precision tuning—that is, to modify the data types used for the computation by reducing the bit width, or by changing the representation from floating point to fixed point. We present a methodology for high-accuracy dynamic precision tuning based on the identification of input classes (i.e., classes of input datasets that benefit from similar optimizations). When a new input region is detected, the application kernels are re-compiled on the fly with the appropriate selection of parameters. In this way, we obtain a continuous optimization approach that enables the exploitation of the reduced precision computation while progressively exploring the solution space, thus reducing the time required by compilation overheads. We provide tools to support the automation of the runtime part of the solution, leaving to the user only the task of identifying the input classes. Our approach provides a significant performance boost (up to 320%) on the typical approximate computing benchmarks, without meaningfully affecting the accuracy of the result, since the error remains always below 3%.
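The accuracy/performance trade-off that precision tuning exploits can be sketched in a few lines: quantize values to fixed point with a chosen number of fractional bits and measure the resulting error, which a tuner would weigh against the speedup of integer arithmetic. The function names below are illustrative, not the paper's API.

```python
# Minimal sketch of the precision-tuning trade-off: fixed-point
# quantization at a given fractional bit width, plus the worst-case
# round-trip error over a set of values (names are illustrative).

def to_fixed(x: float, frac_bits: int) -> int:
    # Scale by 2**frac_bits and round to the nearest integer code.
    return round(x * (1 << frac_bits))

def from_fixed(q: int, frac_bits: int) -> float:
    return q / (1 << frac_bits)

def max_quant_error(values, frac_bits: int) -> float:
    # Worst-case absolute error introduced by the chosen bit width.
    return max(abs(v - from_fixed(to_fixed(v, frac_bits), frac_bits))
               for v in values)

if __name__ == "__main__":
    vals = [0.1, 0.333, 2.71828]
    for bits in (4, 8, 16):
        print(bits, "fractional bits -> max error", max_quant_error(vals, bits))
```

Fewer fractional bits mean cheaper arithmetic but larger error; the paper's contribution is choosing the width dynamically per input class rather than once, globally.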
Article
Full-text available
Real-time dense computer vision and SLAM offer great potential for a new level of scene modelling, tracking and real environmental interaction for many types of robot, but their high computational requirements mean that use on mass-market embedded platforms is challenging. Meanwhile, trends in low-cost, low-power processing are towards massive parallelism and heterogeneity, making it difficult for robotics and vision researchers to implement their algorithms in a performance-portable way. In this paper we introduce SLAMBench, a publicly available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs in performance, accuracy and energy consumption of a dense RGB-D SLAM system. SLAMBench provides a KinectFusion implementation in C++, OpenMP, OpenCL and CUDA, and harnesses the ICL-NUIM dataset of synthetic RGB-D sequences with trajectory and scene ground truth for reliable accuracy comparison of different implementations and algorithms. We present an analysis and breakdown of the constituent algorithmic elements of KinectFusion, and experimentally investigate their execution time on a variety of multicore and GPU-accelerated platforms. For a popular embedded platform, we also present an analysis of energy efficiency for different configuration alternatives.
Article
Full-text available
Empirical auto-tuning and machine learning techniques have shown high potential to improve execution time, power consumption, code size, reliability and other important metrics of various applications for more than two decades. However, they are still far from widespread production use due to the lack of native support for auto-tuning in an ever-changing and complex software and hardware stack, large and multi-dimensional optimization spaces, excessively long exploration times, and the lack of unified mechanisms for preserving and sharing optimization knowledge and research material. We present a possible collaborative approach to solving the above problems using the Collective Mind knowledge management system. In contrast with the previous cTuning framework, this modular infrastructure makes it possible to preserve and share through the Internet whole auto-tuning setups with all related artifacts and their software and hardware dependencies, not just performance data. It also allows users to gradually structure, systematize and describe all available research material, including tools, benchmarks, data sets, search strategies and machine learning models. Researchers can take advantage of shared components and data with extensible meta-descriptions to quickly and collaboratively validate and improve existing auto-tuning and benchmarking techniques or prototype new ones. The community can now gradually learn and improve the complex behavior of all existing computer systems while exposing behavior anomalies or model mispredictions to an interdisciplinary community in a reproducible way for further analysis. We present several practical, collaborative and model-driven auto-tuning scenarios. We have also released all material at http://c-mind.org/repo to set an example for collaborative and reproducible research, as well as for our new publication model in computer engineering, where experimental results are continuously shared and validated by the community.
Conference Paper
Full-text available
Cross-core application interference due to contention for shared on-chip and off-chip resources poses a significant challenge to providing application-level quality of service (QoS) guarantees on commodity multicore micro-architectures. Unexpected cross-core interference is especially problematic for latency-sensitive applications in web-service data-center domains, such as web search. The commonly used solution is simply to disallow the co-location of latency-sensitive applications and throughput-oriented batch applications on a single chip, leaving much of the processing capability of multicore micro-architectures underutilized. In this work we present a Contention-Aware Execution Runtime (CAER) environment, a lightweight runtime solution that minimizes cross-core interference due to contention while maximizing utilization. CAER leverages the performance monitoring capabilities ubiquitous in current multicore processors to infer and respond to contention, and requires no added hardware support. We present the design and implementation of the CAER environment, two separate contention-detection heuristics, and approaches to responding to contention online. We evaluate our solution using the SPEC2006 benchmark suite. Our experiments show that when allowing co-location with CAER, as opposed to disallowing co-location, we are able to increase the utilization of the multicore CPU by 58% on average. Meanwhile, CAER brings the overhead of co-location from 17% down to just 4% on average.
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
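The core HOG step described above (fine orientation binning weighted by gradient magnitude) can be sketched in a few lines. This is a toy, single-cell sketch: real HOG additionally arranges histograms into spatial cells and overlapping blocks with local contrast normalization, which the abstract identifies as essential for good results.

```python
import math

# Toy sketch of the core HOG step: bin gradient orientations, weighted by
# gradient magnitude, into a 9-bin histogram over 0-180 degrees (unsigned
# gradients). Real HOG adds spatial cells, overlapping blocks and
# contrast normalization on top of this.

def orientation_histogram(gx, gy, bins=9):
    """gx, gy: per-pixel horizontal/vertical gradients (flat sequences)."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for dx, dy in zip(gx, gy):
        mag = math.hypot(dx, dy)
        ang = math.degrees(math.atan2(dy, dx)) % 180.0  # unsigned orientation
        hist[min(int(ang / bin_width), bins - 1)] += mag
    return hist
```

Fine orientation binning (9 bins over 0-180 degrees) is one of the design choices the paper found important, alongside relatively coarse spatial binning.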
Article
Nowadays, engineers often have to develop software without even knowing which hardware it will eventually run on, across numerous mobile phones, tablets, desktops, laptops, data centers, supercomputers and cloud services. Unfortunately, optimizing compilers are no longer keeping pace with the ever-increasing complexity of computer systems and may produce severely underperforming executable code while wasting expensive resources and energy. We present our practical and collaborative solution to this problem via lightweight wrappers around any software piece for which more than one implementation or optimization choice is available. These wrappers are connected with a public Collective Mind autotuning infrastructure and repository of knowledge (c-mind.org/repo) to continuously monitor various important characteristics of these pieces (computational species) across numerous existing hardware configurations together with randomly selected optimizations. As in the natural sciences, we can now continuously track winning solutions (optimizations for given hardware) that minimize all costs of a computation (execution time, energy spent, code size, failures, memory and storage footprint, optimization time, faults, contentions, inaccuracy and so on) of a given species on a Pareto frontier, along with any unexpected behavior. The community can then collaboratively classify solutions, prune redundant ones, and correlate them with various features of the software, its inputs (data sets) and the hardware used, either manually or using powerful predictive analytics techniques. Our approach can then help create a large, realistic, diverse, representative, and continuously evolving benchmark with related optimization knowledge, gradually covering all possible software and hardware, in order to predict the best optimizations and improve compilers and hardware depending on usage scenarios and requirements.
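The "winning solutions on a Pareto frontier" idea above has a simple computational core: given candidate optimizations measured on several costs (here execution time and energy, both lower-is-better), keep only the non-dominated ones. A minimal sketch, with illustrative names and data:

```python
# Sketch of Pareto-frontier filtering over multi-objective measurements.
# Each point is a tuple of costs (e.g. execution time, energy); lower is
# better on every axis. Names and sample data are illustrative.

def pareto_frontier(points):
    """Return the points not dominated by any other point."""
    def dominates(p, q):
        # p dominates q if p is no worse on every cost and strictly
        # better on at least one.
        return (all(a <= b for a, b in zip(p, q))
                and any(a < b for a, b in zip(p, q)))
    return [p for p in points if not any(dominates(q, p) for q in points)]

if __name__ == "__main__":
    # (time_seconds, energy_joules) for four candidate optimizations.
    candidates = [(1.0, 5.0), (2.0, 2.0), (3.0, 1.0), (2.5, 2.5)]
    print(pareto_frontier(candidates))
```

Here (2.5, 2.5) is dropped because (2.0, 2.0) beats it on both costs; the three surviving points are the trade-off curve the community would track per hardware configuration.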
Article
In this report, we share our practical experience on crowdsourcing evaluation of research artifacts and reviewing of publications since 2008. We also briefly discuss encountered problems including reproducibility of experimental results and possible solutions.
Article
Docker promises the ability to package applications and their dependencies into lightweight containers that move easily between different distros, start up quickly and are isolated from each other.
Article
A virtual machine can support individual processes or a complete system depending on the abstraction level at which virtualization occurs. Some VMs support flexible hardware usage and software isolation, while others translate from one instruction set to another. Virtualizing a system or component (such as a processor, memory, or an I/O device) at a given abstraction level maps its interface and visible resources onto the interface and resources of an underlying, possibly different, real system. Consequently, the real system appears as a different virtual system or even as multiple virtual systems. Interjecting virtualizing software between abstraction layers near the HW/SW interface forms a virtual machine that allows otherwise incompatible subsystems to work together. Further, replication by virtualization enables more flexible and efficient use of hardware resources.
Realeyes image processing benchmark
  • E Hajiyev
  • R Dávid
  • L Marák
  • R Baghdadi
E. Hajiyev, R. Dávid, L. Marák, and R. Baghdadi, "Realeyes image processing benchmark." https://github.com/Realeyes/pencil-benchmarksimageproc, 2011-2015.
Live report with shared artifacts and interactive graphs
  • G Fursin
  • A Lokhmotov
G. Fursin and A. Lokhmotov, "Live report with shared artifacts and interactive graphs." http://cknowledge.org/repo/web.php?wcid=report:b0779e2a64c22907.
Collective Mind: Towards practical and collaborative auto-tuning
  • G Fursin
  • R Miceli
  • A Lokhmotov
  • M Gerndt
  • M Baboulin
  • D Malony
  • Z Chamski
  • D Novillo
  • D D Vento
The architecture of virtual machines
  • J E Smith
  • R Nair
J. E. Smith and R. Nair, "The architecture of virtual machines," Computer, vol. 38, pp. 32-38, May 2005.
Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM
  • L Nardi
  • B Bodin
  • M Z Zia
  • J Mawer
  • A Nisbet
  • P H J Kelly
  • A J Davison
  • M Luján
  • M F P O'Boyle
  • G Riley
  • N Topham
  • S Furber
L. Nardi, B. Bodin, M. Z. Zia, J. Mawer, A. Nisbet, P. H. J. Kelly, A. J. Davison, M. Luján, M. F. P. O'Boyle, G. Riley, N. Topham, and S. Furber, "Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM," in IEEE Intl. Conf. on Robotics and Automation (ICRA), May 2015. arXiv:1410.2167.