Conference Paper

Dynamic data management for continuous retraining

References
Article
Full-text available
Machine learning projects have become increasingly relevant to a wide range of real-world use cases. The success of complex neural network models depends on many factors, which creates a need for structured, machine-learning-centric project development management. Owing to the multitude of tools available for the different operational phases, responsibilities and requirements are becoming increasingly unclear. In this work, Machine Learning Operations (MLOps) technologies and tools for every part of the overall project pipeline, as well as the roles involved, are examined and clearly defined. With a focus on the interconnectivity of specific tools and a comparison against well-chosen MLOps requirements, model performance, input data, and system quality metrics are briefly discussed. By identifying aspects of machine learning that can be reused from project to project, open-source tools that help in specific parts of the pipeline, and possible combinations, an overview of tool support in MLOps is given. Deep learning has revolutionized the field of image processing, and building an automated machine learning workflow for object detection is of great interest to many organizations. For this purpose, a simple MLOps workflow for object detection on images is portrayed.
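As a concrete, hedged illustration of the open-source tool support discussed above, the following Python sketch logs the parameters, metrics, and an artifact of one training run with MLflow's tracking API; the experiment name, parameter values, and artifact are hypothetical placeholders, not taken from the cited paper.

```python
# Minimal sketch of experiment tracking with MLflow, one of the open-source
# MLOps tools such surveys compare. All names and values are hypothetical.
import random
from pathlib import Path

import mlflow

def train_one_epoch(epoch: int) -> float:
    """Stand-in for a real training step; returns a dummy validation score."""
    return 0.5 + 0.1 * epoch + 0.01 * random.random()

mlflow.set_experiment("object-detection-demo")   # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("backbone", "resnet50")     # record run configuration
    mlflow.log_param("learning_rate", 1e-3)

    for epoch in range(3):
        mlflow.log_metric("val_mAP", train_one_epoch(epoch), step=epoch)

    Path("run_notes.txt").write_text("hypothetical artifact")
    mlflow.log_artifact("run_notes.txt")         # attach files to the run
```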
Conference Paper
Full-text available
In this paper, we present a purpose-built data management system, MLdp, for all machine learning (ML) datasets. ML applications pose unique requirements different from common conventional data processing applications, including but not limited to: data lineage and provenance tracking, rich data semantics and formats, integration with diverse ML frameworks and access patterns, trial-and-error driven data exploration and evolution, rapid experimentation, reproducibility of model training, and strict compliance and privacy regulations. Current ML systems and services, often called MLaaS, focus to date on ML algorithms and offer no integrated data management system. Instead, they require users to bring and manage their own data on either blob storage or file systems. The burdens of data management tasks, such as versioning and access control, fall on the users, and not all compliance features, such as terms of use, privacy measures, and auditing, are available. MLdp offers a minimalist and flexible data model for all varieties of data, strong version management to guarantee reproducibility of ML experiments, and integration with major ML frameworks. MLdp also maintains data provenance to help users track lineage and dependencies among data versions and models in their ML pipelines. In addition to table-stakes features, such as security, availability, and scalability, MLdp's internal design choices are strongly influenced by the goal of supporting rapid ML experiment iterations, which cycle through data discovery, data exploration, feature engineering, model training, model evaluation, and back to data discovery. The contributions of this paper are: 1) to recognize the needs and call out the requirements of an ML data platform, 2) to share our experiences in building MLdp, both by adapting existing database technologies to the new problem and by devising new solutions, and 3) to call for action from our communities on future challenges.
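MLdp itself is not publicly available, so the following Python sketch only illustrates, under invented names, the kind of immutable versioning and provenance bookkeeping the abstract describes: each commit produces a content-addressed dataset version whose parent links can be walked to recover lineage.

```python
# Hypothetical sketch of version-and-lineage bookkeeping for ML datasets;
# MLdp's real API is not public, so every name here is invented.
from dataclasses import dataclass
import hashlib

@dataclass(frozen=True)
class DatasetVersion:
    name: str
    version: int
    content_hash: str        # guarantees the bytes behind a version never change
    parents: tuple = ()      # lineage: versions this one was derived from

class Catalog:
    def __init__(self):
        self._versions = {}

    def commit(self, name, payload: bytes, parents=()):
        """Register an immutable new version and record its provenance."""
        version = sum(1 for (n, _) in self._versions if n == name) + 1
        v = DatasetVersion(name, version,
                           hashlib.sha256(payload).hexdigest(), tuple(parents))
        self._versions[(name, version)] = v
        return v

    def lineage(self, v: DatasetVersion):
        """Walk parent links so an experiment can be traced back to raw data."""
        for p in v.parents:
            yield p
            yield from self.lineage(p)

catalog = Catalog()
raw = catalog.commit("tweets", b"raw rows")
clean = catalog.commit("tweets", b"filtered rows", parents=(raw,))
print([(v.name, v.version) for v in catalog.lineage(clean)])  # [('tweets', 1)]
```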
Article
Full-text available
Sentiment analysis is one of the new challenges that has appeared in automatic language processing with the advent of social networks. Taking advantage of the amount of information now available, research and industry have sought ways to automatically analyze the sentiments and opinions users express in social networks. In this paper, we place ourselves in a difficult context: sentiments that may indicate suicidal ideation. In particular, we propose to address the lack of terminological resources related to suicide with a method for constructing a vocabulary associated with suicide. For better analysis, we then investigate Weka, a data mining tool based on machine learning algorithms, to extract useful information from Twitter data collected with Twitter4J. We also propose a WordNet-based algorithm for computing the semantic similarity between tweets in the training set and tweets in the data set. Experimental results demonstrate that our method, based on machine learning algorithms and semantic sentiment analysis, can predict suicidal ideation from Twitter data. In addition, this work verifies the effectiveness of the semantic sentiment analysis, in terms of accuracy and precision, for detecting sentiments that may indicate suicidal thoughts.
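The paper's exact algorithm is not reproduced here, but the following hedged Python sketch shows the general idea of WordNet-based tweet similarity: for each pair of tokens, take the best synset-to-synset path similarity, then average. It requires the NLTK WordNet corpus, and the token lists are invented examples.

```python
# Hedged sketch of WordNet-based similarity between two tokenized tweets;
# the cited paper's actual scoring scheme may differ.
from itertools import product
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

def tweet_similarity(tokens_a, tokens_b):
    scores = []
    for w1, w2 in product(tokens_a, tokens_b):
        syns1, syns2 = wn.synsets(w1), wn.synsets(w2)
        # Best path similarity over all synset pairs; None (no path) counts as 0.
        best = max((s1.path_similarity(s2) or 0.0
                    for s1, s2 in product(syns1, syns2)), default=0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0

print(tweet_similarity(["hopeless", "alone"], ["despair", "isolated"]))
```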
Conference Paper
Full-text available
Hadoop is a very popular general-purpose framework for many different classes of data-intensive applications. However, it is poorly suited to iterative operations because of the cost of reloading data from disk at each iteration. Spark, an emerging framework designed around a global cache mechanism, can achieve better response times because in-memory access across the distributed machines of the cluster persists throughout the entire iterative process. Although Spark's time performance relative to Hadoop has been evaluated, memory consumption, another system performance criterion, has not been deeply analyzed in the literature. In this work, we conducted extensive experiments on iterative operations to compare Hadoop and Spark in terms of both time and memory cost. We found that although Spark is in general faster than Hadoop for iterative operations, it pays for this with higher memory consumption. Moreover, its speed advantage weakens when memory is insufficient to store newly created intermediate results.
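The following PySpark sketch shows the caching behaviour the comparison hinges on: persisting an RDD keeps the working set in cluster memory, so repeated passes avoid the per-iteration disk reloads that slow Hadoop down, at the price of the memory needed to hold it. Dataset size and iteration count are toy values.

```python
# Toy illustration of Spark's in-memory caching for iterative workloads.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-demo").getOrCreate()
sc = spark.sparkContext

points = sc.parallelize(range(1_000_000)).map(lambda x: (x % 10, float(x)))
points.cache()                      # keep the working set in cluster memory

total = 0.0
for _ in range(10):                 # each pass reuses the cached partitions;
    total += points.values().sum()  # no re-read from disk after the first pass

print(total)
spark.stop()
```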
Conference Paper
Full-text available
In this paper, we give an overview of the HDF5 technology suite and some of its applications. We discuss the HDF5 data model, the HDF5 software architecture, and some of its performance-enhancing capabilities.
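A small h5py sketch of that data model: groups form a directory-like hierarchy, datasets hold typed arrays, attributes carry metadata, and chunking with compression is one of the performance-enhancing capabilities mentioned. File name and contents are illustrative only.

```python
# Minimal h5py example of the HDF5 data model.
import h5py
import numpy as np

with h5py.File("experiment.h5", "w") as f:
    grp = f.create_group("sensors/lidar")            # nested groups, like directories
    dset = grp.create_dataset(
        "frames", shape=(100, 64, 64), dtype="f4",
        chunks=True, compression="gzip",             # performance features
    )
    dset[0] = np.zeros((64, 64), dtype="f4")
    dset.attrs["units"] = "metres"                   # metadata travels with the data

with h5py.File("experiment.h5", "r") as f:
    print(f["sensors/lidar/frames"].shape)           # (100, 64, 64)
```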
Article
Full-text available
Domain-specific languages (DSLs) are increasingly used today. Coping with complex language definitions, evolving them in a structured way, and ensuring that they are free of errors are the main challenges of DSL design and implementation. The use of modular language definitions and composition operators is therefore inevitable in the independent development of language components. In this article, we discuss these issues by describing a framework for the compositional development of textual DSLs and their supporting tools. We use a redundancy-free definition of a readable concrete syntax and a comprehensible abstract syntax, as both representations significantly overlap in their structure. To enhance the usability of the abstract syntax, we add concepts such as associations and inheritance to a grammar-based definition in order to build up arbitrary graphs (as known from metamodeling). Two modularity concepts, grammar inheritance and grammar embedding, are discussed. They permit compositional language definition and thus simplify the extension of languages based on existing ones. We demonstrate that compositional engineering of new languages is a useful concept when project-specific DSLs with appropriate tool support are defined.
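The framework's own grammar format is not reproduced here; the toy Python sketch below merely illustrates the two named modularity concepts: a sub-grammar inherits and extends a base grammar's productions, and a host grammar embeds another language by referencing its start symbol.

```python
# Toy illustration (not the framework's syntax) of grammar inheritance and
# grammar embedding, with grammars modelled as rule-name -> alternatives dicts.
BASE_EXPR = {
    "Expr": [["Term", "+", "Expr"], ["Term"]],
    "Term": [["NUMBER"]],
}

# Grammar inheritance: reuse the base productions, add an alternative to Term.
EXTENDED_EXPR = {**BASE_EXPR, "Term": BASE_EXPR["Term"] + [["IDENT"]]}

# Grammar embedding: a statement language embeds the expression language
# wholesale by referring to its start symbol "Expr".
STATEMENTS = {
    "Stmt": [["IDENT", "=", "Expr", ";"]],
    **EXTENDED_EXPR,
}

print(sorted(STATEMENTS))   # the composed language contains both rule sets
```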
Article
Machine learning has become an essential tool for gleaning knowledge from data and tackling a diverse set of computationally hard tasks. However, the accuracy of a machine-learned model is deeply tied to the data it is trained on. Designing and building robust processes and tools that make it easier to analyze, validate, and transform the data fed into large-scale machine learning systems poses data management challenges. Drawing on our experience developing data-centric infrastructure for a production machine learning platform at Google, we summarize some of the interesting research challenges we encountered and survey some of the relevant literature from the data management and machine learning communities. Specifically, we explore challenges in three main areas of focus: data understanding, data validation and cleaning, and data preparation. In each of these areas, we explore how different constraints are imposed on the solutions depending on where in the lifecycle of a model the problems are encountered and who encounters them.
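As a hedged sketch of the data validation focus area, the following Python checks a new batch of training rows against a lightweight expected schema before it reaches the trainer. The schema, thresholds, and column names are invented for illustration; production systems in this space are far richer.

```python
# Hypothetical batch-level schema validation for incoming training data.
import math

SCHEMA = {
    "age":   {"min": 0.0, "max": 130.0, "max_missing_frac": 0.01},
    "label": {"min": 0,   "max": 1,     "max_missing_frac": 0.0},
}

def validate(rows):
    errors = []
    for col, spec in SCHEMA.items():
        values = [r.get(col) for r in rows]
        missing = sum(v is None or (isinstance(v, float) and math.isnan(v))
                      for v in values)
        if missing / len(rows) > spec["max_missing_frac"]:
            errors.append(f"{col}: too many missing values ({missing}/{len(rows)})")
        for v in values:
            if v is not None and not (spec["min"] <= v <= spec["max"]):
                errors.append(f"{col}: value {v} outside [{spec['min']}, {spec['max']}]")
                break
    return errors

# A validation failure here would block the batch before training.
print(validate([{"age": 34.0, "label": 1}, {"age": -5.0, "label": 0}]))
```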
Article
Concept drift describes unforeseeable changes in the underlying distribution of streaming data over time. Concept drift research involves the development of methodologies and techniques for drift detection, understanding, and adaptation. Data analysis has revealed that machine learning in a concept drift environment yields poor learning results if the drift is not addressed. To help researchers identify which research topics are significant and how to apply related techniques in data analysis tasks, a high-quality, instructive review of current research developments and trends in the concept drift field is necessary. In addition, due to the rapid development of concept drift research in recent years, the methodologies of learning under concept drift have become noticeably systematic, unveiling a framework that has not previously been described in the literature. This paper reviews over 130 high-quality publications in concept drift related research areas, analyzes up-to-date developments in methodologies and techniques, and establishes a framework of learning under concept drift comprising three main components: concept drift detection, concept drift understanding, and concept drift adaptation. The paper lists and discusses 10 popular synthetic datasets and 14 publicly available benchmark datasets used for evaluating the performance of learning algorithms aimed at handling concept drift. Related research directions are also covered and discussed. By providing state-of-the-art knowledge, this survey directly supports researchers in understanding research developments in the field of learning under concept drift.
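A minimal sketch of the drift detection component in that framework: compare the error rate over a recent window against a reference window and flag drift when the gap exceeds a threshold. Real detectors such as DDM or ADWIN are more principled; the window size and threshold below are arbitrary.

```python
# Simplified windowed drift detector; not a published algorithm, just the idea.
from collections import deque
from statistics import mean

class WindowDriftDetector:
    def __init__(self, window=100, threshold=0.15):
        self.reference = deque(maxlen=window)   # error rates from the stable period
        self.recent = deque(maxlen=window)      # most recent error rates
        self.threshold = threshold

    def add(self, error: float) -> bool:
        """Feed one per-example error (0/1 or a loss); True means drift suspected."""
        if len(self.reference) < self.reference.maxlen:
            self.reference.append(error)
            return False
        self.recent.append(error)
        if len(self.recent) < self.recent.maxlen:
            return False
        return mean(self.recent) - mean(self.reference) > self.threshold

# When add() returns True, an adaptation step (e.g. retraining on recent data)
# would be triggered.
```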
Conference Paper
Component and Connector (C&C) models, with their corresponding code generators, are widely used by large automotive manufacturers to develop new software functions for embedded systems that interact with their environment; example C&C applications are engine control, remote parking pilots, and traffic sign assistance. This paper presents a complete toolchain for designing and compiling C&C models to highly optimized code running on multiple targets, including x86/x64, ARM, and WebAssembly. One contribution is a set of algebraic and threading optimizations that increase execution speed for computationally expensive tasks. A further contribution is an extensive case study with over 50 experiments, comparing the runtime speed of the generated code under different compilers and mathematical libraries. These experiments showed that programs produced by our compiler are at least two times faster than those compiled by MATLAB/Simulink for machine learning applications such as image clustering for object detection. Additionally, our compiler toolchain provides a complete model-based testing framework and plug-in points for middleware integration. We make all materials, including models and toolchains, electronically available for inspection and further research.
Article
When building a unified vision system or gradually adding new capabilities to a system, the usual assumption is that training data for all tasks is always available. However, as the number of tasks grows, storing and retraining on such data becomes infeasible. A new problem arises when we add new capabilities to a Convolutional Neural Network (CNN) but the training data for its existing capabilities are unavailable. We propose our Learning without Forgetting method, which uses only new-task data to train the network while preserving the original capabilities. Our method performs favorably compared to the commonly used feature extraction and fine-tuning adaptation techniques and performs similarly to multitask learning that uses the original-task data we assume to be unavailable. A more surprising observation is that Learning without Forgetting may be able to replace fine-tuning with similar old and new task datasets for improved new-task performance.
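A hedged PyTorch sketch of the distillation-style loss at the heart of the method: the old-task head's outputs on the new data are recorded before training, and training then balances the new-task loss against staying close to that record. The temperature and weighting below are illustrative defaults, not the paper's tuned values.

```python
# Sketch of a Learning-without-Forgetting-style objective.
import torch
import torch.nn.functional as F

def lwf_loss(new_logits, labels, old_logits, recorded_old_logits,
             T=2.0, lam=1.0):
    """new_logits/labels: the new task. old_logits: the current old-task head
    outputs on new-task inputs. recorded_old_logits: the same outputs captured
    before training started. T and lam are illustrative hyperparameters."""
    ce = F.cross_entropy(new_logits, labels)          # learn the new task
    # Distillation term keeps old-task responses close to the recorded ones.
    kd = F.kl_div(
        F.log_softmax(old_logits / T, dim=1),
        F.softmax(recorded_old_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return ce + lam * kd
```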
Article
Big data has gained much attention from researchers in healthcare, bioinformatics, and information sciences. Data production at this stage is projected to be 44 times greater than in 2009; the volume, velocity, and variety of data are increasing rapidly, and it is difficult to store, process, and visualise such huge data volumes using traditional technologies. Many organisations, such as Twitter, LinkedIn, and Facebook, use big data for different use cases in the social networking domain, and implementations of architectures for these use cases have been published worldwide. However, conceptual architectures for specific big data applications remain limited. This paper presents an application-oriented architecture for big data systems, based on a study of published big data architectures for specific use cases. It also provides an overview of state-of-the-art machine learning algorithms for processing big data in healthcare and other applications.
Conference Paper
The tutorial discusses data-management issues that arise in the context of machine learning pipelines deployed in production. Informed by our own experience with such large-scale pipelines, we focus on issues related to understanding, validating, cleaning, and enriching training data. The goal of the tutorial is to bring forth these issues, draw connections to prior work in the database literature, and outline the open research questions that are not addressed by prior art.
Conference Paper
When building a unified vision system or gradually adding new capabilities to a system, the usual assumption is that training data for all tasks is always available. However, as the number of tasks grows, storing and retraining on such data becomes infeasible. A new problem arises when we add new capabilities to a Convolutional Neural Network (CNN) but the training data for its existing capabilities are unavailable. We propose our Learning without Forgetting method, which uses only new-task data to train the network while preserving the original capabilities. Our method performs favorably compared to the commonly used feature extraction and fine-tuning adaptation techniques and performs similarly to multitask learning that uses the original-task data we assume to be unavailable. A more surprising observation is that Learning without Forgetting may be able to replace fine-tuning as standard practice for improved new-task performance.
Conference Paper
The paradigm for processing huge datasets has shifted from centralized to distributed architectures. As enterprises faced the challenge of gathering large volumes of data, they found that the data could not be processed using existing centralized solutions. Beyond time constraints, they faced issues of efficiency, performance, and elevated infrastructure cost when processing data in a centralized environment. With the help of distributed architectures, these large organizations were able to overcome the problem of extracting relevant information from a huge data dump. One of the best open-source tools on the market for harnessing distributed architecture to solve data processing problems is Apache Hadoop. Using Apache Hadoop's components, such as data clusters, map-reduce algorithms, and distributed processing, we resolve various complex location-based data problems and feed the relevant information back into the system, thereby improving the user experience.
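The following toy, single-process Python rendering of the map-reduce pattern shows the flow that Hadoop distributes across a cluster: map each record to (key, value) pairs, group by key, then reduce each group. The location-based check-in records are hypothetical.

```python
# Single-process map-reduce sketch; Hadoop runs this same flow distributed.
from collections import defaultdict

check_ins = [("berlin", 1), ("paris", 1), ("berlin", 1)]   # hypothetical records

def map_phase(record):
    city, count = record
    yield city, count                 # emit (key, value) pairs

def reduce_phase(key, values):
    return key, sum(values)           # aggregate all values for one key

groups = defaultdict(list)            # the shuffle step: group values by key
for record in check_ins:
    for key, value in map_phase(record):
        groups[key].append(value)

print([reduce_phase(k, vs) for k, vs in groups.items()])
# [('berlin', 2), ('paris', 1)]
```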
Machine learning in finance: the case of deep learning for option pricing
  • Robert Culkin
  • Sanjiv R. Das
Getting Data into Databricks. 2020.
  • Robert Ilijason
Hierarchical data format 5: HDF5. In Handbook of Open Source Tools
  • Sandeep Koranne
Accelerating the machine learning lifecycle with MLflow
  • Matei Zaharia
  • Andrew Chen
  • Aaron Davidson
  • Ali Ghodsi
  • Sue Ann Hong
  • Andy Konwinski
  • Siddharth Murching
  • Tomas Nykodym
  • Paul Ogilvie
  • Mani Parkhe
What is MLOps? In Beginning MLOps with MLFlow
  • Sridhar Alla
  • Suman Kalyan Adari