Technical Report (PDF available)

Abstract

Over the last two decades, significant efforts have been made to construct tools that facilitate and streamline the development of Machine Learning (ML) workflows composed of several pipelines. From Unix scripts to Web-based ML components and solutions that automate and orchestrate ML and Data Mining (DM) pipelines, many high-level services for the data scientist's iterative process have been tried. On the other hand, low-level services are also being investigated, such as cloud environments, container orchestration, fault-tolerance services, and so forth. Normally, scripts are produced to simplify the operation of such low-level services. Unfortunately, no existing solution puts both low- and high-level services on a single service stack. Furthermore, none of them enables the use of different existing tools during the construction of a single pipeline, i.e., they are not flexible enough to permit one tool to build the pre-processing steps, another tool to build the parameter-tuning steps, and a third tool to perform the training step of a single pipeline. To address these limitations, we present the Learning Orchestra system, a tool to construct complex workflows using different ML tools, or players, transparently, i.e., from a single interoperable API we can build rich analytical flows. The workflows can be deployed on a containerized cloud environment capable of scaling and remaining resilient. Initial experiments demonstrate that our system is a promising and innovative alternative for simplifying and streamlining the ML iterative process.
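The abstract's central claim is interoperability: one API, with each step of a single pipeline potentially delegated to a different underlying tool. The following pure-Python sketch is entirely hypothetical; it is not the Learning Orchestra API, and the step, tool, and class names are invented, but it illustrates the shape of that idea: heterogeneous backends behind one uniform pipeline interface.

```python
# Hypothetical sketch (not the actual Learning Orchestra API): each pipeline
# step is bound to a named backend tool, while the caller drives a single
# uniform interface regardless of which tool executes each step.

class Step:
    """One pipeline step, delegated to a named backend tool."""
    def __init__(self, name, tool, fn):
        self.name, self.tool, self.fn = name, tool, fn

    def run(self, data):
        return self.fn(data)

class Pipeline:
    """Runs heterogeneous steps in order behind a single interface."""
    def __init__(self, steps):
        self.steps = steps

    def run(self, data):
        for step in self.steps:
            data = step.run(data)
        return data

# One tool for pre-processing, another for tuning, a third for training,
# exactly as the abstract describes; here the backends are trivial stubs.
pipeline = Pipeline([
    Step("preprocess", "tool-A", lambda xs: [x * 2 for x in xs]),
    Step("tune", "tool-B", lambda xs: sorted(xs)),
    Step("train", "tool-C", lambda xs: sum(xs)),
])
print(pipeline.run([3, 1, 2]))  # → 12
```

In the real system the stubs would be replaced by calls into separate containerized services (Spark, Scikit-learn, TensorFlow, and so on), but the caller-facing shape stays the same.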
... The Learning Orchestra architecture is organized into layers, precisely eight, as Figure 3.1 illustrates. The AutoML extensions were made on almost all layers; thus we use an architecture similar to the one presented in (RIBEIRO, 2021). Fewer pipeline steps were implemented for AutoML (precisely, seven API services) compared with the eleven services implemented for the full set of ML API possibilities. ...
... In this work, we added PyCaret and AutoKeras container images; they also represent a processing container type, so they can be replicated over the deployed VMs and managed by Docker Swarm, just like any other processing container (Spark, Scikit-learn, TensorFlow, or MongoDB) of the previous Learning Orchestra version (RIBEIRO, 2021). The AutoML services were also coded in Python, so the language interpreter is present in the processing containers. ...
... The results presented in this section represent Learning Orchestra AutoML running PyCaret, specifically the Titanic model, using the single-VM deployment option illustrated in Figure 4.1. In the previous Learning Orchestra work (RIBEIRO, 2021), the authors performed Titanic experiments, but using Spark MLlib. We decided to evaluate an AutoML solution against an ML solution in which several classifiers are used. ...
Article
Apache Mahout is a library for scalable machine learning (ML) on distributed dataflow systems, offering various implementations of classification, clustering, dimensionality reduction and recommendation algorithms. Mahout was a pioneer in large-scale machine learning when it started in 2008 targeting MapReduce, which was the predominant abstraction for scalable computing in industry at that time. Mahout has been widely used by leading web companies and is part of several commercial cloud offerings. In recent years, Mahout migrated to a general framework enabling a mix of dataflow programming and linear algebraic computations on backends such as Apache Spark and Apache Flink. This design allows users to execute data preprocessing and model training in a single, unified dataflow system, instead of requiring a complex integration of several specialized systems. Mahout is maintained as a community-driven open source project at the Apache Software Foundation, and is available at https://mahout.apache.org.
Article
Purpose The purpose of this paper is to review the extant literature on Airbnb – one of the most significant recent innovations in the tourism sector – to assess the research progress that has been accomplished to date. Design/methodology/approach Numerous journal databases were searched, and 132 peer-reviewed journal articles from various disciplines were reviewed. Key attributes of each paper were recorded, and a content analysis was undertaken. Findings A survey of the literature found that the majority of Airbnb research has been published quite recently, often in hospitality/tourism journals, and the research has been conducted primarily by researchers in the USA/Canada and Europe. Based on the content analysis, the papers were divided into six thematic categories – Airbnb guests, Airbnb hosts, Airbnb supply and its impacts on destinations, Airbnb regulation, Airbnb’s impacts on the tourism sector and the Airbnb company. Consistent findings have begun to emerge on several important topics, including guests’ motivations and the geographical dispersion of listings. However, many research gaps remain, so numerous suggestions for future research are provided. Practical implications By reviewing a large body of literature on a fairly novel and timely topic, this research provides a concise summary of Airbnb knowledge that will assist industry practitioners as they adapt to the recent rapid emergence of Airbnb. Originality/value This is the first paper to review the extant literature specifically about Airbnb.
Chapter
The success of machine learning in a broad range of applications has led to an ever-growing demand for machine learning systems that can be used off the shelf by non-experts. To be effective in practice, such systems need to automatically choose a good algorithm and feature preprocessing steps for a new dataset at hand, and also set their respective hyperparameters. Recent work has started to tackle this automated machine learning (AutoML) problem with the help of efficient Bayesian optimization methods. Building on this, we introduce a robust new AutoML system based on the Python machine learning package scikit-learn (using 15 classifiers, 14 feature preprocessing methods, and 4 data preprocessing methods, giving rise to a structured hypothesis space with 110 hyperparameters). This system, which we dub Auto-sklearn, improves on existing AutoML methods by automatically taking into account past performance on similar datasets, and by constructing ensembles from the models evaluated during the optimization. Our system won six out of ten phases of the first ChaLearn AutoML challenge, and our comprehensive analysis on over 100 diverse datasets shows that it substantially outperforms the previous state of the art in AutoML. We also demonstrate the performance gains due to each of our contributions and derive insights into the effectiveness of the individual components of Auto-sklearn.
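The "structured hypothesis space" the Auto-sklearn abstract describes can be made concrete with a toy sketch. This is our own illustration, not Auto-sklearn's algorithm or API: the real system uses Bayesian optimization, meta-learning, and ensembling, whereas here a plain random search explores a made-up (algorithm, hyperparameter) space under a synthetic scoring function, just to show the shape of the problem.

```python
import random

# Toy AutoML search sketch. The space and scorer below are invented;
# Auto-sklearn's actual space has 15 classifiers, 18 preprocessing methods,
# and 110 hyperparameters, searched with Bayesian optimization.
SPACE = {
    "svm": {"C": [0.1, 1.0, 10.0]},
    "random_forest": {"n_estimators": [10, 100], "max_depth": [3, 10]},
}

def toy_score(algo, params):
    # Stand-in for cross-validated accuracy; deterministic for the demo.
    base = {"svm": 0.80, "random_forest": 0.85}[algo]
    return base + 0.001 * len(str(sorted(params.items())))

def random_search(n_trials, seed=0):
    """Sample configurations from SPACE and keep the best-scoring one."""
    rng = random.Random(seed)
    best_score, best_cfg = -1.0, None
    for _ in range(n_trials):
        algo = rng.choice(sorted(SPACE))
        params = {name: rng.choice(vals) for name, vals in SPACE[algo].items()}
        score = toy_score(algo, params)
        if score > best_score:
            best_score, best_cfg = score, (algo, params)
    return best_cfg, best_score

best_cfg, best_score = random_search(20)
print(best_cfg, round(best_score, 3))
```

Replacing the random sampler with a model of past trials (as Bayesian optimization does) and keeping the top-k evaluated models for an ensemble recovers, in miniature, the two contributions the abstract highlights.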
Conference Paper
Creating and maintaining a platform for reliably producing and deploying machine learning models requires careful orchestration of many components: a learner for generating models based on training data, modules for analyzing and validating both data and models, and infrastructure for serving models in production. This becomes particularly challenging when data changes over time and fresh models need to be produced continuously. Unfortunately, such orchestration is often done ad hoc using glue code and custom scripts developed by individual teams for specific use cases, leading to duplicated effort and fragile systems with high technical debt. We present TensorFlow Extended (TFX), a TensorFlow-based general-purpose machine learning platform implemented at Google. By integrating the aforementioned components into one platform, we were able to standardize the components, simplify the platform configuration, and reduce the time to production from the order of months to weeks, while providing platform stability that minimizes disruptions. We present a case study of one deployment of TFX in the Google Play app store, where the machine learning models are refreshed continuously as new data arrive. Deploying TFX led to reduced custom code, faster experiment cycles, and a 2% increase in app installs resulting from improved data and model analysis.
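The orchestration pattern the TFX abstract describes, where data validation gates training so that bad batches never produce a production model, can be sketched in a few lines of plain Python. The component names and checks here are invented for illustration; this is not the TFX API.

```python
# Illustrative sketch of a validation-gated pipeline (invented names, not
# TFX components): a schema-style check runs before the trainer, and a
# failing batch aborts the run instead of silently producing a bad model.

def validate_data(rows, expected_fields):
    """Reject batches with missing fields, mimicking a schema check."""
    return all(expected_fields <= row.keys() for row in rows)

def train(rows):
    # Stand-in trainer: the "model" is just the mean of one feature.
    values = [row["x"] for row in rows]
    return {"mean_x": sum(values) / len(values)}

def pipeline(rows):
    if not validate_data(rows, {"x", "label"}):
        raise ValueError("data validation failed; refusing to train")
    return train(rows)

model = pipeline([{"x": 1.0, "label": 0}, {"x": 3.0, "label": 1}])
print(model)  # → {'mean_x': 2.0}
```

When fresh data arrives continuously, as in the Google Play case study, re-running this gate on every batch is what keeps the refresh loop safe to automate.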
Article
Deep learning (DL) techniques have obtained remarkable achievements on various tasks, such as image recognition, object detection, and language modeling. However, building a high-quality DL system for a specific task relies heavily on human expertise, hindering its wide application. Meanwhile, automated machine learning (AutoML) is a promising solution for building a DL system without human assistance and is being extensively studied. This paper presents a comprehensive and up-to-date review of the state-of-the-art (SOTA) in AutoML. Following the DL pipeline, we introduce AutoML methods covering data preparation, feature engineering, hyperparameter optimization, and neural architecture search (NAS), with a particular focus on NAS, as it is currently a hot sub-topic of AutoML. We summarize the representative NAS algorithms' performance on the CIFAR-10 and ImageNet datasets and further discuss the following subjects of NAS methods: one/two-stage NAS, one-shot NAS, joint hyperparameter and architecture optimization, and resource-aware NAS. Finally, we discuss some open problems related to the existing AutoML methods for future research.
Chapter
Machine learning is often and rightly viewed as the use of mathematical algorithms to teach the computer to learn tasks that are computationally infeasible to program as a set of specified instructions. However, it turns out that these algorithms constitute only a small fraction of the overall learning pipeline from an engineering perspective. Building high-performance and dynamic learning models involves a number of other critical components, which actually dominate the space of concerns for delivering an end-to-end machine learning product.
Chapter
The paper presents the Language Processing Modelling Notation (LPMN), a formal language used to orchestrate a set of NLP microservices. LPMN allows modeling and running complex workflows of language and machine learning tools. The scalability of the solution was achieved through the use of message-oriented middleware. LPMN is used for developing text-mining applications with web-based interfaces and for performing research experiments that require the use of NLP and machine learning tools.
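The message-oriented orchestration idea behind an LPMN-style workflow can be sketched with Python's standard `queue` module. The service names and message shape below are our own invention, not LPMN's actual notation: each NLP "microservice" consumes a message, enriches it, and forwards it to the next service.

```python
from queue import Queue

# Minimal sketch of message-passing workflow orchestration (invented service
# names, not LPMN): a queue stands in for the message-oriented middleware
# that carries messages between independent NLP services.

def tokenize(msg):
    msg["tokens"] = msg["text"].split()
    return msg

def count_tokens(msg):
    msg["n_tokens"] = len(msg["tokens"])
    return msg

def run_workflow(services, message):
    """Pass one message through a chain of services via a shared queue."""
    inbox = Queue()
    inbox.put(message)
    for service in services:
        msg = inbox.get()
        inbox.put(service(msg))
    return inbox.get()

result = run_workflow([tokenize, count_tokens],
                      {"text": "complex workflows of language tools"})
print(result["n_tokens"])  # → 5
```

In a real deployment each service would run as a separate process consuming from its own broker queue, which is what makes the workflow horizontally scalable.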