Article

MapReduce: Simplified data processing on large clusters


... This work tries to improve the performance of MapReduce workloads. [1] Makespan and total completion time (TCT) are two key performance metrics. Makespan is defined as the time period from the start of the first job until the completion of the last job in a set of jobs. ...
... In this work, we aim to optimize these two metrics. [1] When large data sets are processed as a set of jobs, the makespan and the total completion time should both be minimized, and the job ordering should be optimized using the MapReduce algorithm. The total completion time over all datasets should likewise be optimized. ...
... A MapReduce job consists of a set of map and reduce tasks, where reduce tasks are performed after the map tasks. [1] Hadoop [3] is an open source implementation of MapReduce. MapReduce and Hadoop [3] are used to support batch processing for jobs submitted from multiple users (i.e., MapReduce workloads). ...
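The two metrics defined in these excerpts can be made concrete with a small, self-contained sketch (the job start and completion times below are hypothetical):

```python
# Hypothetical batch of jobs, each given as (start_time, completion_time).
jobs = [(0, 4), (1, 7), (2, 9)]

first_start = min(start for start, _ in jobs)

# Makespan: time from the start of the first job to the completion of the last.
makespan = max(end for _, end in jobs) - first_start

# Total completion time (TCT): sum of each job's completed time period,
# measured from the start of the first job.
total_completion_time = sum(end - first_start for _, end in jobs)

print(makespan)               # 9
print(total_completion_time)  # 4 + 7 + 9 = 20
```

Job ordering changes the individual start and completion times, which is why reordering the jobs in a workload can reduce both metrics.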
Conference Paper
Full-text available
With the continuous advancement in computer technology over the years, the quantity of data being generated is growing exponentially. Some of these data are structured, semi-structured or unstructured. This poses a great challenge when these data are to be analyzed, because conventional data processing techniques are not suited to handling such data. MapReduce is a programming model and an associated implementation for processing and generating large data sets. A MapReduce workload generally contains a set of jobs, each of which consists of multiple map tasks followed by multiple reduce tasks. This work proposes algorithms to minimize the makespan and the total completion time for an offline MapReduce workload. Our algorithms focus on job ordering optimization for a MapReduce workload, optimizing both the makespan and the total completion time. To this end, a new greedy job ordering algorithm is proposed that minimizes the makespan and the total completion time together.
... As the growth rate of single-machine computation speeds started to stagnate in recent years, parallelism seemed like an effective technique to aggregate the computation speeds of multiple machines. Since then, parallelism has become a main ingredient in compute cluster architectures [1], [2], as it incurs less processing and storage cost on any one individual server [3]. ...
... We select a likelihood model f(D | θ) for the data generation process and a prior f(θ) over the model parameters θ. We assume we have access to data D = {d^(1), d^(2), ..., d^(n)}, where d^(j) is the inter-arrival time between the j-th and (j+1)-st job arrival events observed by the load-balancer, or the service times for each server. ...
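As an illustration of this kind of Bayesian setup, the sketch below assumes exponentially distributed inter-arrival times with rate θ and a conjugate Gamma(α, β) prior; the concrete likelihood and prior used in the paper may differ.

```python
import random

def posterior_params(inter_arrivals, alpha=1.0, beta=1.0):
    """Gamma posterior for the arrival rate theta after observing n
    exponential inter-arrival times: alpha' = alpha + n, beta' = beta + sum(d)."""
    n = len(inter_arrivals)
    return alpha + n, beta + sum(inter_arrivals)

random.seed(0)
true_rate = 2.0
observed = [random.expovariate(true_rate) for _ in range(500)]

a_post, b_post = posterior_params(observed)
print(a_post / b_post)  # posterior mean of theta; close to the true rate 2.0
```

With 500 observations the posterior mean concentrates near the true rate, which is the behavior a load-balancer would exploit to estimate arrival rates online.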
... For a fixed overall number of jobs in the system and χ > 1, this objective tends to balance queue lengths; e.g., if the total number of jobs in the system is 10 and N = 2, then an allocation of [5, 5] jobs will have a much higher reward than a [9, 1] allocation. Using the variance among the current queue fillings also balances the load on the queues. ...
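To make this concrete, here is an illustrative reward of the assumed form r(q) = -Σᵢ qᵢ^χ; the paper's exact objective is not given in these excerpts, but any such function with χ > 1 rewards balanced allocations:

```python
def reward(queues, chi=2.0):
    # Penalize uneven queue fillings: for chi > 1 the penalty is convex,
    # so spreading jobs evenly across queues maximizes the reward.
    return -sum(q ** chi for q in queues)

balanced = reward([5, 5])  # -(25 + 25) = -50.0
skewed = reward([9, 1])    # -(81 + 1) = -82.0
print(balanced > skewed)   # True
```

The same total of 10 jobs yields a strictly higher reward when split evenly, matching the [5, 5] versus [9, 1] comparison in the excerpt.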
Article
Full-text available
Load balancing arises as a fundamental problem, underlying the dimensioning and operation of many computing and communication systems, such as job routing in data center clusters, multipath communication, Big Data and queueing systems. In essence, the decision-making agent maps each arriving job to one of the possibly heterogeneous servers while aiming at an optimization goal such as load balancing, low average delay or low loss rate. One main difficulty in finding optimal load balancing policies here is that the agent only partially observes the impact of its decisions, e.g., through the delayed acknowledgements of the served jobs. In this paper, we provide a partially observable (PO) model that captures the load balancing decisions in parallel buffered systems under limited information of delayed acknowledgements. We present a simulation model for this PO system to find a load balancing policy in real-time using a scalable Monte Carlo tree search algorithm. We numerically show that the resulting policy outperforms other limited information load balancing strategies such as variants of Join-the-Most-Observations and has comparable performance to full information strategies like: Join-the-Shortest-Queue, Join-the-Shortest-Queue(d) and Shortest-Expected-Delay. Finally, we show that our approach can optimise the real-time parallel processing by using network data provided by Kaggle.
... Over the past decade, Apache Hadoop [4] has become the de facto standard ''Big Data'' store and processing platform by leveraging a robust and scalable distributed file system (HDFS [8]) and an efficient parallel processing framework (MapReduce [9]). In the initial design of Hadoop 1.0, MapReduce programming model was tightly coupled with the resource management infrastructure. ...
... Mahout has recently been migrated to a general framework enabling a mix of dataflow programming and linear algebraic computations on backends such as Apache Spark. Mahout originally targeted MapReduce-based [9] large-scale machine learning. However, both MLlib and Mahout mainly focus on traditional machine learning algorithms rather than supporting deep learning application executions. ...
... The initial implementation of Apache Hadoop [4] included the MapReduce programming model [9] with the underlying Hadoop Distributed File System (HDFS) [8]. From version 2.0, Hadoop has been expanded into a multiuse data platform by breaking down its functionality into two layers: a platform layer for system-level resource management and a framework layer for application-level coordination, based on the cluster resource management system (often called a cluster-level operating system) ''YARN'' (Yet Another Resource Negotiator) [10,11]. ...
Article
Full-text available
We have designed and implemented a new data processing framework called “MeLoN” (Multi-tenant dEep Learning framework On yarN), which aims to effectively support distributed deep learning applications, a new type of data-intensive workload in the YARN-based Hadoop ecosystem. MeLoN is developed as a Hadoop YARN application so that it can transparently co-host existing deep learning applications with other data processing workflows. In this paper, we present comprehensive techniques that effectively support multiple deep learning applications in a Hadoop YARN cluster by leveraging a fine-grained GPU over-provisioning policy and a high-performance parallel file system for data staging, which can improve the overall system throughput. Through extensive experiments based on representative deep learning workloads, we demonstrate that MeLoN achieves an effective convergence of deep learning and the big data platform Hadoop by employing YARN-based resource allocation and execution mechanisms for running distributed deep learning applications. We believe that MeLoN raises many additional interesting research issues, including profiling the expected GPU memory usage of deep learning applications and supporting more complicated deep-learning-related jobs based on queuing systems, which can ultimately contribute to a new data processing framework in the YARN-based Hadoop ecosystem.
... There is a recent surge in the use of distributed computing systems such as MapReduce [1] and Spark [2] for processing nonlinear and computationally hard functions. This surge has been further intensified by recent developments relating to training large-scale machine learning algorithms such as deep neural networks with high data complexity (cf. ...
... In this section, we explain the multi-user linearly separable computation setting (cf. Fig. 1), which consists of K users, N servers, and a master node that coordinates the servers and users. ...
... In the above, we can easily see that the encoding coefficients e_{n,ℓ}, which are indeed determined by the master node, satisfy e_{n,ℓ} = 0 for all (n, ℓ) ∈ [N] × [L] with n ∉ W_ℓ. (Our setting incorporates the underlying assumption that the tasks performed at the servers substantially outweigh, in computational complexity, the basic linear operations performed at the different users; we also assume that each server node is connected to all of the users through a broadcast channel.) ...
Conference Paper
Full-text available
In this work, we investigate the problem of multiuser linearly separable function computation, where N servers help compute the desired functions (jobs) of K users. In this setting, each desired function can be written as a linear combination of up to L (generally non-linear) sub-functions. Each server computes some of the sub-tasks and communicates a linear combination of its computed outputs (files) in a single shot to some of the users; each user then linearly combines its received data in order to recover its desired function. We explore the range of the optimal computation cost by establishing a novel relationship between our problem, syndrome decoding and covering codes. The work reveals that in the limit of large N, the optimal computation cost, in the form of the maximum fraction γ of all servers that must compute any subfunction, is lower bounded as γ ≥ H_q^{-1}(log_q(L)/N), for any fixed log_q(L)/N. The result reveals the role of the computational rate log_q(L)/N, which cannot exceed what one might call the computational capacity H_q(γ) of the system.
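The linear structure described above can be illustrated with a toy, real-valued sketch (the paper works over finite fields with coded assignments, so this is only a simplified illustration; all names and values here are hypothetical):

```python
# Sub-functions f1, f2, f3 and a user demand F(D) = 2*f1(D) + 0*f2(D) + 3*f3(D).
subfunctions = [lambda d: d * d, lambda d: d + 1, lambda d: 2 * d]
demand = [2, 0, 3]

# Server n computes only the sub-functions assigned to it (its set W_n);
# its encoding coefficients for unassigned sub-functions are implicitly zero.
assignment = {0: [0], 1: [2]}

def evaluate(D):
    # Each server sends one linear combination of its computed outputs;
    # the user linearly combines (here: simply sums) what it receives.
    messages = [sum(demand[l] * subfunctions[l](D) for l in assigned)
                for assigned in assignment.values()]
    return sum(messages)

print(evaluate(4))  # 2*16 + 3*8 = 56
```

Note how f2 is never computed anywhere: the computation cost is driven by how many servers must evaluate each demanded sub-function.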
... Aggregation applications are increasingly developed and deployed in data centers, such as distributed machine learning [1,2] and MapReduce-like applications [3]. Aggregation applications typically need to aggregate massive intermediate results from different worker servers to get final results, so these applications can generate a large amount of traffic consisting of intermediate data. ...
... In data center networks, traffic from MapReduce-like applications accounts for a considerable percentage in some data centers [5]. The common point of these applications is that their running process follows the partition/aggregation pattern [3,6]. When a new request arrives at the master node, the job is assigned to a group of workers (mappers and reducers). ...
Article
Full-text available
Aggregation applications are widely deployed in data centers, such as distributed machine learning and MapReduce-like frameworks. These applications typically have large communication overhead, which brings extra delay and traffic pressure to data center networks. In-network aggregation (INA) is a new approach to accelerate aggregation tasks and reduce traffic by offloading the aggregation function onto network switches. In this paper, we concentrate on two aspects of INA. The first aspect is INA implementation methods. We summarize the key points of INA design, and classify INA methods into three categories according to the type of hardware: commodity programmable switch, middle box, and new switch architecture. The second aspect is INA algorithms. Building an aggregation tree effectively is essential to the performance of INA. Finally, we make comparisons and propose some potential challenges and opportunities for future INA research.
... Here, the k-hop neighborhood comprises the vertices that can be reached within at most k edges from the given vertex. AGL proposes a distributed pipeline to generate the k-hop neighborhood of each vertex based on message passing, and implements it with the MapReduce infrastructure [99]. The generated k-hop neighborhood information is stored in the distributed file system. ...
... It is categorized as the individual-sample-based execution of distributed mini-batch training. To speed up the sampling process, it introduces a distributed pipeline to generate k-hop neighborhoods in the spirit of message passing, which is implemented with the MapReduce infrastructure [99]. In this way, in the sampling phase, mini-batch data can be rapidly generated by collecting the k-hop neighbors of the target vertices. ...
Preprint
Graph neural networks (GNNs) have been demonstrated to be a powerful algorithmic model in broad application fields for their effectiveness in learning over graphs. To scale GNN training up for large-scale and ever-growing graphs, the most promising solution is distributed training, which distributes the workload of training across multiple computing nodes. However, the workflows, computational patterns, communication patterns, and optimization techniques of distributed GNN training are still only preliminarily understood. In this paper, we provide a comprehensive survey of distributed GNN training by investigating the various optimization techniques used in distributed GNN training. First, distributed GNN training is classified into several categories according to workflow. In addition, the computational patterns and communication patterns, as well as the optimization techniques proposed by recent work, are introduced. Second, the software frameworks and hardware platforms of distributed GNN training are also introduced for a deeper understanding. Third, distributed GNN training is compared with distributed training of deep neural networks, emphasizing the uniqueness of distributed GNN training. Finally, interesting issues and opportunities in this field are discussed.
... The owning organization has full rights over the data and the cloud center. In these types of clouds, the sender uses different encryption techniques [12] to secure the data and the receiver has a key to decrypt it. C) Hybrid clouds: these are the combination of both public and private clouds, where the public cloud data can be used by the private cloud of an organization if needed. ...
... While doing these things on the cloud we may face issues like a slow data transfer rate, delays in handover, and longer migration times. To overcome these issues we may use MapReduce [12] on the HDFS architecture for distributed data [15]. ...
Article
The cloud acts as data storage and is also used for data transfer from one cloud to another. Here, data exchange takes place among the cloud centers of organizations. Each cloud center stores a huge amount of data, which in turn is hard to store and retrieve information from. While migrating the data, issues such as a low data transfer rate, end-to-end latency and data storage problems occur. As the data is distributed among many cloud centers from a single source, the speed of migration is reduced. In distributed cloud computing it is very difficult to transfer data quickly and securely. This paper explores MapReduce within the distributed cloud architecture, where MapReduce assists at each cloud. It strengthens the data migration process with the help of HDFS. Compared to the existing cloud migration approach, the proposed approach gives more accurate results in terms of speed, time and efficiency.
... Hence, several programming models have been proposed to improve iterative computation. The programming models of GPS include MapReduce [158], Vertex-centric [19], Gather-Apply-Scatter [23], and Subgraph-centric [84]. ...
... Dean and Ghemawat [158] proposed the MapReduce (MR) programming model. It is a distributed programming framework for large-scale data computing on commodity clusters. ...
Article
Full-text available
Graphs are a tremendously suitable data representation that models the relationships of entities in many application domains, such as recommendation systems, machine learning, computational biology, social network analysis, and others. Graphs with many vertices and edges have become quite prevalent in recent years. Therefore, graph computing systems integrating various graph partitioning techniques have been envisioned as a promising paradigm to handle large-scale graph analytics in these application domains. However, scalable processing of large-scale graphs is challenging due to the high volume and inherently irregular structure of real-world graphs. Hence, industry and academia have recently been proposing graph partitioning and computing systems to process and analyze large-scale graphs efficiently. Graph partitioning and computing systems have been designed to improve scalability and reduce processing time complexity. This paper presents an overview, classification, and investigation of the most popular graph partitioning and computing systems. The various methods and approaches of graph partitioning and the diverse categories of graph computing systems are presented. Finally, we discuss the main challenges and future research directions in graph partitioning and computing systems.
... Due to ever-growing database sizes, many data mining researchers have revised the conventional mining algorithms and formulated distributed algorithms to handle big data more efficiently. The most efficient big data framework that helps in designing distributed algorithms is MapReduce [19]. In 2008, Google presented the MapReduce [19] distributed programming framework. ...
... The most efficient big data framework that helps in designing the distributed algorithms is MapReduce [19]. In 2008, Google developed MapReduce [19] distributed programming framework. It can handle the processing of big data by distributing the work into two parallel processes, namely, Mapper and Reducer. ...
Article
Full-text available
Mining high utility sequential patterns is recognized as significant research in data mining. Several methods mine sequential patterns while taking utility values into consideration. Patterns of this type can determine the order in which items were purchased, but not the time interval between them. The time interval among items is important for predicting the most useful real-world circumstances, including retail market basket data analysis, stock market fluctuations, DNA sequence analysis, and so on. There are very few algorithms for mining sequential patterns that consider both the utility and the time interval. However, they assume the same threshold for each item, maintaining the same unit profit. Moreover, with the rapid growth of data, the traditional algorithms cannot handle big data and are not scalable. To handle this problem, we propose a distributed three-phase MapReduce framework that considers multiple utilities and is suitable for handling big data. The time constraints are pushed into the algorithm instead of using pre-defined intervals. Also, the proposed upper bound minimizes the number of candidate patterns during the mining process. The approach has been tested, and the experimental results show its efficiency in terms of run time, memory utilization, and scalability.
... Despite the advantages of this model, the MapReduce programming model has some drawbacks that matter for data mining. These include inefficient handling of iterative tasks and slow speed when combining multiple data sources [10,3]. For these reasons, alternatives to the standard Hadoop MapReduce platform have appeared. ...
... In each MapReduce procedure, the original training data is divided into several parts: during data segmentation, the potential presence of small classes in the data associated with each map is determined [5,10]. "Rare" small classes are classes that cover only a few training examples in the trained classifier. ...
Article
Full-text available
Today, in the field of big data, we observe a growing trend of success in studying and applying the tasks that arise when extracting knowledge from large volumes of data. For this reason, standard data mining systems are migrating to a new functional paradigm that makes it possible to work with big data. With the MapReduce model and its various extensions, the problems of fault tolerance and scalability of algorithms can be solved successfully while keeping data quality at a good level. For many applications, various approaches used in data mining are chosen, and among them are models based on fuzzy systems. Among their advantages, their closeness to natural language should be highlighted. In addition, they use an inference model that allows them to adapt well to various scenarios, especially those with a certain degree of uncertainty. Despite the success of these types of systems, their migration to the big data environment in various learning domains is still at an early stage. This article analyzes the design of these models and gives an overview of the main existing proposals on the topic. In addition, problems related to data distribution and the parallelization of existing algorithms are discussed, as well as the relationship of the models to fuzzy data representation, and suggestions for the future design of methods based on fuzzy sets in this area are made.
... In this Chapter, we present two frameworks for scalable and parallel processing of big volumes of data, Apache Hadoop [5] and Apache Spark [11], and the most common non-relational data models for storing and querying big data. The programming framework of Apache Hadoop, MapReduce [48], was used for the implementation of the query evaluation algorithms presented in Chapter 4, Chapter 5 and Chapter 6. Apache Spark and the NoSQL document database MongoDB were used for the implementation of the query evaluation algorithm presented in Chapter 7. ...
... MapReduce [48] is a programming framework, built by Google, for processing large datasets in a distributed manner. To create a MapReduce job, the user defines two functions, map and reduce, which run on each cluster node in isolation. ...
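As a minimal single-process sketch of this programming model, the classic word-count example can be written as follows (the function names are illustrative, not part of Hadoop's actual API):

```python
from collections import defaultdict
from itertools import chain

def map_fn(document):
    # Map: emit an intermediate (word, 1) pair for every word in the split.
    return [(word, 1) for word in document.split()]

def reduce_fn(word, counts):
    # Reduce: sum all partial counts collected for one key.
    return word, sum(counts)

def run_job(documents):
    # Shuffle: group intermediate pairs by key before the reduce phase.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(d) for d in documents):
        groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(run_job(["to be or not to be", "to be"]))
# {'to': 3, 'be': 3, 'or': 1, 'not': 1}
```

In a real cluster, the map and reduce invocations run on different nodes and the shuffle moves intermediate data across the network; the framework handles partitioning, scheduling, and fault tolerance.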
Thesis
In this Ph.D. thesis, we study the problem of efficiently evaluating basic graph pattern (BGP) queries over a large amount of linked data (RDF data) in parallel and provide four approaches. In this context, we consider that the data graph has been partitioned into graph segments and the initial query Q is decomposed into a set of BGP subqueries. In the first three approaches, the widely used MapReduce framework is used for querying a large amount of linked data. In the first approach, a generic two-phase MapReduce algorithm is presented. The algorithm is based on the idea that the data graph has been arbitrarily partitioned into graph segments which are stored in different nodes of a cluster of commodity machines. To answer a user query Q, Q is also decomposed into a set of random BGP subqueries. In the first phase, the subqueries are applied to each graph segment, in isolation, and intermediate results are computed. The intermediate results are appropriately combined in the second phase to obtain the answers of the initial query Q. The proposed algorithm computes the answers to a given query correctly, independently of a) the data graph partitioning, b) the way that graph segments are stored, c) the query decomposition, and d) the algorithm used for calculating (partial) results. In the second approach, we present a method focusing on the decomposition of the query Q into a set of generalized star queries, which are queries that allow both subject-object and object-subject edges from a specific node, called the central node. It is proved that each query Q can be transformed into a set of subject-object star subqueries. The data graph has also been arbitrarily partitioned into graph segments, and a two-phase, scalable MapReduce algorithm is proposed that efficiently obtains the answer of the initial query Q by computing and appropriately combining the generalized subquery answers.
The third approach is based on the assumption that data graphs are partitioned in the distributed file system in such a way that replication of data triples between the data segments is allowed. Data triples are replicated in such a way that the answers of generalized star queries can be obtained from a single data segment. A one-and-a-half-phase, scalable MapReduce algorithm is proposed that efficiently computes the answer of the initial query Q by computing and appropriately combining the subquery answers. It is proved that, under certain conditions, the query can be answered in a single MapReduce phase. In the fourth approach, we propose an effective data model for storing RDF data in a document database using a maximum replication factor of 2 (i.e., in the worst-case scenario, the data graph will be doubled in storage size). The proposed storage model is utilized for efficiently evaluating BGP queries in a distributed manner. Each query is decomposed into a set of generalized star queries. The proposed data model ensures that no join operations over multiple datasets are required to evaluate generalized star queries. The results of the evaluation of the generalized star subqueries of a query Q are then properly combined, in order to compute the answers of the initial query Q. The proposed approach has been implemented using MongoDB and Apache Spark.
... In those cluster and data center environments, MapReduce and Hadoop are used to support batch processing for jobs submitted from multiple users (i.e., MapReduce workloads). This work tries to improve the performance of MapReduce workloads [1]. There are two key performance metrics, i.e., makespan and total completion time (TCT). ...
... In contrast, the total completion time is the sum of the completed time periods of all jobs since the start of the first job. In this work, we aim to optimize these two metrics [1]. ...
Conference Paper
Full-text available
In today's world the quantity of data being generated is growing exponentially. Some of these data are structured, semi-structured or unstructured. This poses a great challenge when these data are to be analyzed, because conventional data processing techniques are not suited to handling such data. MapReduce is a programming model and an associated implementation for processing and generating large data sets. A MapReduce workload generally contains a set of jobs, each of which consists of multiple map tasks followed by multiple reduce tasks. This work proposes algorithms to optimize the makespan and the total completion time for an offline MapReduce workload. Our algorithms focus on job ordering optimization for a MapReduce workload, optimizing both the makespan and the total completion time. Our work focuses on resolving time efficiency problems as well as the memory utilization problem. The MK_TCT_JR algorithm produced results that are up to 90% better than MK_JR. Our algorithm improves system performance in terms of makespan and total completion time.
... Hive Query Language (HQL) is translated to the execution plan of MapReduce [8] jobs to be run in parallel, (2) the Spark [2] program and machine learning workloads are transformed to a DAG workflow for execution [7], and (3) the Tez [5] framework allows for a complex DAG of tasks for processing data. ...
Preprint
Full-text available
Directed Acyclic Graph (DAG) workflows are widely used for large-scale data analytics in cluster-based distributed computing systems. The performance model for a DAG on data-parallel frameworks (e.g., MapReduce) is a research challenge because the allocation of preemptable system resources among parallel jobs may dynamically vary during execution. This resource allocation variation during execution makes it difficult to accurately estimate the execution time. In this paper, we tackle this challenge by proposing a new cost model, called Bottleneck Oriented Estimation (BOE), to estimate the allocation of preemptable resources by identifying the bottleneck to accurately predict task execution time. For a DAG workflow, we propose a state-based approach to iteratively use the resource allocation property among stages to estimate the overall execution plan. Furthermore, to handle the skewness of various jobs, we refine the model with the order statistics theory to improve estimation accuracy. Extensive experiments were performed to validate these cost models with HiBench and TPC-H workloads. The BOE model outperforms the state-of-the-art models by a factor of five for task execution time estimation. For the refined skew-aware model, the average prediction error is under 3% when estimating the execution time of 51 hybrid analytics (HiBench) and query (TPC-H) DAG workflows.
... Database systems designed to process OLAP queries are used to manage petabyte-scale data sets. For example, the Greenplum DBMS, based on MapReduce technology [1], performs deep analysis of 6.5 PB of data on a 96-node cluster at eBay. Hadoop processes 2.5 PB of data on a 610-node cluster for the popular web service Facebook. ...
Article
Full-text available
June 20, 2009. This work is devoted to the problem of efficient query processing in cluster computing systems. An original approach to the placement and replication of data on the nodes of a cluster system is presented. Based on this approach, a load balancing method is developed. A method for efficient parallel query processing in cluster systems, based on the described load balancing method, is proposed. The results of computational experiments are given and an analysis of the effectiveness of the proposed approaches is performed. 1. INTRODUCTION. Today there is a whole range of database systems that provide parallel query processing. Database systems designed to process OLAP queries are used to manage petabyte-scale data sets. For example, the Greenplum DBMS, based on MapReduce technology [1], performs deep analysis of 6.5 PB of data on a 96-node cluster at eBay. Hadoop processes 2.5 PB of data on a 610-node cluster for the popular web service Facebook. In the field of parallel OLTP query processing there are a number of commercial parallel DBMSs, among which the best known are Teradata, Oracle Exadata and DB2 Parallel Edition. Current research in this area is directed towards DBMS self-tuning [2], load balancing and the related problem of data placement [3], parallel query optimization [4] and effective use of modern multi-core processors [5, 6]. (This work was supported by the Russian Foundation for Basic Research, project 09-07-00241-a.) One of the most important tasks in parallel DBMSs is load balancing. In the classic work [7] it was shown that skew arising during query processing in shared-nothing parallel database systems can lead to almost complete degradation of system performance.
In [8], a solution to the load balancing problem for shared-nothing systems based on replication was proposed. This solution reduces the overhead of transferring data over the network during load balancing. However, this approach is applicable only in the rather narrow context of spatial databases, in the specific segment of range queries. In [3], the load balancing problem is solved by partially redistributing data before query execution begins. This approach reduces the total amount of data transferred between compute nodes during query processing, but imposes serious requirements on the speed of inter-processor communication. The present work proposes a parallel query processing method based on an original approach to database placement.
... The waiting time of task T_l is the task arrival time TA_l subtracted from the task start time TS_l, so the waiting time of T_l can be defined as in (2): WT_l = TS_l − TA_l. ...
Article
Full-text available
With the explosive growth of data volume, data centers are playing a critical role in storing and processing huge amounts of data. A traditional single data center can no longer adapt to such incredibly fast-growing data. Recently, some research has extended tasks such as data processing to geographically distributed data centers. However, given the joint consideration of task placement and data transfer, it is complex and difficult to design a proper scheduling approach with the goal of minimizing the makespan under the constraints of task dependencies, processing capability, the network, etc. Therefore, our work proposes JHTD: an efficient joint scheduling framework based on hypergraphs for task placement and data transfer across geographically distributed data centers. There are two crucial stages in JHTD. Initially, owing to how well hypergraphs model complex problems, we leverage a hypergraph-based model to establish the relationship between tasks, data files, and data centers. A hypergraph-based partition method is then developed for task placement within the first stage. In the second stage, a task reallocation scheme is devised in terms of each task-to-data dependency. Meanwhile, a data-dependency-aware transferring scheme is designed to minimize the makespan. Finally, the real-world China-VO project model is used to conduct a variety of simulation experiments. The results demonstrate that JHTD effectively optimizes the problems of task placement and data transfer across geographically distributed data centers. JHTD has been compared with three other state-of-the-art algorithms. The results demonstrate that JHTD can reduce the makespan by up to 20.6%. Also, various impacts (data transfer volume and load balancing) have been taken into account to show and discuss the effectiveness of JHTD.
... Such systems automatically handle failures by repeating failed operations and replacing defective nodes. Newer distributed dataflow frameworks like Apache Spark [2] and Apache Flink [1] cache data in memory for faster read access, unlike the older Hadoop MapReduce [23]. The user or the framework itself can choose strategies for data that could not fit into memory, e.g., spilling it to disk or recomputing the data from previous stages. ...
Preprint
Selecting appropriate computational resources for data processing jobs on large clusters is difficult, even for expert users like data engineers. Inadequate choices can result in vastly increased costs, without significantly improving performance. One crucial aspect of selecting an efficient resource configuration is avoiding memory bottlenecks. By knowing the required memory of a job in advance, the search space for an optimal resource configuration can be greatly reduced. Therefore, we present Ruya, a method for memory-aware optimization of data processing cluster configurations based on iteratively exploring a narrowed-down search space. First, we perform job profiling runs with small samples of the dataset on just a single machine to model the job's memory usage patterns. Second, we prioritize cluster configurations with a suitable amount of total memory and within this reduced search space, we iteratively search for the best cluster configuration with Bayesian optimization. This search process stops once it converges on a configuration that is believed to be optimal for the given job. In our evaluation on a dataset with 1031 Spark and Hadoop jobs, we see a reduction of search iterations to find an optimal configuration by around half, compared to the baseline.
... Available frameworks Given that these computations are resource-demanding and require processing large amounts of data that cannot easily fit on a single machine, the training phase of ML models is usually executed through special purpose frameworks that allow for highly distributed and parallelized executions. The reference programming model for batch computations is map-reduce [7], popularized by Google and the Hadoop framework [19]. The computation is organized in two phases: map and reduce. ...
Preprint
In recent years, Web services are becoming more and more intelligent (e.g., in understanding user preferences) thanks to the integration of components that rely on Machine Learning (ML). Before users can interact (inference phase) with an ML-based service (ML-Service), the underlying ML model must learn (training phase) from existing data, a process that requires long-lasting batch computations. The management of these two, diverse phases is complex and meeting time and quality requirements can hardly be done with manual approaches. This paper highlights some of the major issues in managing ML-services in both training and inference modes and presents some initial solutions that are able to meet set requirements with minimum user inputs. A preliminary evaluation demonstrates that our solutions allow these systems to become more efficient and predictable with respect to their response time and accuracy.
... There are also all-gather and all-reduce, where the output is shared by all processors. At a higher abstraction level, MapReduce (Dean and Ghemawat, 2008), a functional programming model in which a "map" function transforms each datum into a key-value pair, and a "reduce" function aggregates the results, is a popular distributed data processing model. While basic implementations are provided in base R, both the map and reduce operations are easy to parallelize. ...
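The model described here, a "map" function emitting key-value pairs and a "reduce" function aggregating the values collected per key, can be sketched in plain Python (a minimal single-process illustration of the programming model, not a distributed implementation; real frameworks parallelize both phases):

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # Map phase: transform each datum into zero or more key-value pairs.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce phase: aggregate the list of values collected for each key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# The canonical word-count example.
docs = ["the quick brown fox", "the lazy dog"]
counts = map_reduce(
    docs,
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda key, values: sum(values),
)
```

Grouping by key before reduction stands in for the shuffle step that a distributed runtime performs between the two phases.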
... Due to the scale of the evaluations, we use an internal MapReduce [34] based C++ implementation to calculate the Diarization Error Rate (DER) reported in Section 3.3. Thus the DER numbers reported in this paper may have some discrepancies with numbers computed with other libraries such as pyannote.metrics ...
Preprint
While recent research advances in speaker diarization mostly focus on improving the quality of diarization results, there is also an increasing interest in improving the efficiency of diarization systems. In this paper, we propose a multi-stage clustering strategy, that uses different clustering algorithms for input of different lengths. Specifically, a fallback clusterer is used to handle short-form inputs; a main clusterer is used to handle medium-length inputs; and a pre-clusterer is used to compress long-form inputs before they are processed by the main clusterer. Both the main clusterer and the pre-clusterer can be configured with an upper bound of the computational complexity to adapt to devices with different constraints. This multi-stage clustering strategy is critical for streaming on-device speaker diarization systems, where the budgets of CPU, memory and battery are tight.
... State-of-the-art Big Data frameworks that implement the MapReduce paradigm [13] are known to implement data locality optimizations. General Big Data architectures can thus efficiently co-locate map and reduce tasks with input data, effectively reducing the network overhead and thus increasing application throughput. ...
Preprint
Full-text available
Real-time Big Data architectures evolved into specialized layers for handling data streams' ingestion, storage, and processing over the past decade. Layered streaming architectures integrate pull-based read and push-based write RPC mechanisms implemented by stream ingestion/storage systems. In addition, stream processing engines expose source/sink interfaces, allowing them to decouple these systems easily. However, open-source streaming engines leverage workflow sources implemented through a pull-based approach, continuously issuing read RPCs towards the stream ingestion/storage, effectively competing with write RPCs. This paper proposes a unified streaming architecture that leverages push-based and/or pull-based source implementations for integrating ingestion/storage and processing engines that can reduce processing latency and increase system read and write throughput while making room for higher ingestion. We implement a novel push-based streaming source by replacing continuous pull-based RPCs with one single RPC and shared memory (storage and processing handle streaming data through pointers to shared objects). To this end, we conduct an experimental analysis of pull-based versus push-based design alternatives of the streaming source reader while considering a set of stream benchmarks and microbenchmarks and discuss the advantages of both approaches.
... As the processing of webpages was done in parallel, pages were randomly distributed into shards for processing [57]. We filter out snippets >1024 characters in length. ...
Article
Full-text available
Physicians write clinical notes with abbreviations and shorthand that are difficult to decipher. Abbreviations can be clinical jargon (writing “HIT” for “heparin induced thrombocytopenia”), ambiguous terms that require expertise to disambiguate (using “MS” for “multiple sclerosis” or “mental status”), or domain-specific vernacular (“cb” for “complicated by”). Here we train machine learning models on public web data to decode such text by replacing abbreviations with their meanings. We report a single translation model that simultaneously detects and expands thousands of abbreviations in real clinical notes with accuracies ranging from 92.1%-97.1% on multiple external test datasets. The model equals or exceeds the performance of board-certified physicians (97.6% vs 88.7% total accuracy). Our results demonstrate a general method to contextually decipher abbreviations and shorthand that is built without any privacy-compromising data.
... The CN-SSDs offer opportunities to analyze and gain insight into intermediate data to optimize the HPC applications. For instance, various HPC applications perform computation and analysis in the Map-Reduce <key, value> style [27] to obtain insights into the data generated during simulations. Therefore, by adopting the KV interface, HPC applications gain immediate access to perform queries on the intermediate data and get the specific insights required. ...
Conference Paper
High-performance computing (HPC) facilities employ a flash-based storage tier near compute nodes to absorb the high I/O demand of HPC applications during periodic system-level checkpoints. To accelerate these checkpoints, proxy-based distributed key-value stores (PD-KVS) have gained particular attention for their flexibility in supporting multiple backends and different network configurations. PD-KVS rely internally on monolithic KVS, such as LevelDB or RocksDB, to exploit the KV interface and query support. However, PD-KVS are unaware of the high redundancy factor in checkpoint data, which can reach GBs to TBs, and therefore tend to generate high write and space amplification on these storage layers. In this paper, we propose DENKV, a deduplication-extended node-local LSM-tree-based KVS. DENKV employs asynchronous partially inline dedup (APID) and aims to maintain the performance characteristics of LSM-tree-based KVS while reducing the write and space amplification problems. We implemented DENKV atop BlobDB and show that our proposed solution maintains performance while reducing write amplification by up to 2× and space amplification by 4× on average.
... Datalog combined logic programming with DBs and greatly advanced query optimization techniques, especially for recursive queries [3]. Currently, concepts from functional PLs are finding a home in the analysis of Big Data using Map/Reduce; Map and Reduce are well-known higher-order functions [4]. Firstclass support for functional motifs has also contributed to integrating PLs and DB query languages using systems like LINQ [5]. ...
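As this excerpt notes, Map and Reduce are classic higher-order functions; Python's built-in `map` and `functools.reduce` illustrate the same motifs that the MapReduce paradigm lifts to a cluster setting (a language-level illustration only):

```python
from functools import reduce

# map: apply a function to every element of a collection.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))

# reduce: fold the collection into a single aggregate value.
total = reduce(lambda acc, x: acc + x, squares, 0)
# 1 + 4 + 9 + 16 == 30
```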
Thesis
Full-text available
MultiverseJava supports sequenced semantics for timestamped values in a Java program. Programmers currently have to resort to ad hoc methods to implement sequenced semantics in Java programs; hence, a better approach is needed. We show how MultiverseJava can be implemented using a MultiverseJava-to-Java translation. The translation layer weaves support for computing with timestamped values into a Java program. This thesis describes the MultiverseJava architecture, the translation layer, semantic templates, and experiments to quantify the cost of MultiverseJava.
... However, this section will concentrate on two paradigms. The first paradigm is MapReduce [25] proposed by Google to process large-scale data. In particular, we will be looking at Apache ...
Thesis
Recent decades have seen exponential growth in data acquisition, attributed to advancements in edge device technology. Factory controllers, smart home appliances, mobile devices, medical equipment, and automotive sensors are a few examples of edge devices capable of collecting data. Traditionally, these devices were limited to data collection and transfer, while decision-making capabilities were missing. However, with advancements in microcontroller and processor technologies, edge devices can now perform complex tasks. This opens avenues for pushing the training of machine learning models to edge devices, also known as learning-at-the-edge. Furthermore, these devices operate in a distributed environment constrained by high latency, slow connectivity, privacy, and sometimes time-critical applications. Traditional distributed machine learning methods are designed to operate in a centralized manner, assuming data is stored on cloud storage. The operating environment of edge devices makes transferring data to cloud storage impractical, rendering centralized approaches unsuitable for training machine learning models on edge devices. Decentralized machine learning techniques are designed to enable learning-at-the-edge without requiring data to leave the edge device. The main principle in decentralized learning is to build consensus on a global model among distributed devices while keeping communication requirements as low as possible. The consensus-building process requires averaging local models to reach a global model agreed upon by all workers. Exact averaging schemes reach global consensus quickly but are communication-inefficient. Decentralized approaches employ in-exact averaging schemes that generally reduce communication by communicating only in the immediate neighborhood.
However, in-exact averaging introduces variance in each worker's local values, requiring extra iterations to reach a global solution. This thesis addresses the problem of learning-at-the-edge devices, which is generally referred to as decentralized machine learning or Edge Machine Learning. More specifically, we will focus on the Decentralized Parallel Stochastic Gradient Descent (DPSGD) learning algorithm, which can be formulated as a consensus-building process among distributed workers or fast linear iteration for decentralized model averaging. The consensus-building process in decentralized learning depends on the efficacy of in-exact averaging schemes, which have two main factors, i.e., convergence time and communication. Therefore, a good solution should keep communication as low as possible without sacrificing convergence time. An in-exact averaging solution consists of a connectivity structure (topology) between workers and weightage for each link. We formulate an optimization problem with the objective of finding an in-exact averaging solution that can achieve fast consensus (convergence time) among distributed workers keeping the communication cost low. Since direct optimization of the objective function is infeasible, a local search algorithm guided by the objective function is proposed. Extensive empirical evaluations on image classification tasks show that the in-exact averaging solutions constructed through the proposed method outperform state-of-the-art solutions. Next, we investigate the problem of learning in a decentralized network of edge devices, where a subset of devices are close to each other in that subset but further apart from other devices not in the subset. Closeness specifically refers to geographical proximity or fast communication links. 
We proposed a hierarchical two-layer sparse communication topology that localizes dense communication among a subgroup of workers and builds consensus through a sparse inter-subgroup communication scheme. We also provide empirical evidence that the proposed solution scales better on machine learning tasks than competing methods. Finally, we address scalability issues of a pairwise ranking algorithm that forms an important class of problems in online recommender systems. The existing solutions based on a parallel stochastic gradient descent algorithm define a static model-parameter partitioning scheme, creating an imbalance of work distribution among distributed workers. We propose a dynamic block partitioning and exchange strategy for the model parameters, resulting in work balance among distributed workers. Empirical evidence on publicly available benchmark datasets indicates that the proposed method scales better than the static block-based methods and outperforms competing state-of-the-art methods.
... Sparked by the success of large-scale distributed and parallel computing platforms such as MapReduce [DG08], Hadoop [Whi12], Dryad [IBY + 07], and Spark [ZCF + 10], there has been increasing interest in developing theoretically sound algorithms for these settings. The Massively Parallel Computation (MPC) model has emerged as the de facto standard theoretical abstractions for parallel computation in such settings. ...
Preprint
This paper presents an $O(\log\log \bar{d})$ round massively parallel algorithm for $1+\epsilon$ approximation of maximum weighted $b$-matchings, using near-linear memory per machine. Here $\bar{d}$ denotes the average degree in the graph and $\epsilon$ is an arbitrarily small positive constant. Recall that $b$-matching is the natural and well-studied generalization of the matching problem where different vertices are allowed to have multiple (and differing number of) incident edges in the matching. Concretely, each vertex $v$ is given a positive integer budget $b_v$ and it can have up to $b_v$ incident edges in the matching. Previously, there were known algorithms with round complexity $O(\log\log n)$, or $O(\log\log \Delta)$ where $\Delta$ denotes maximum degree, for $1+\epsilon$ approximation of weighted matching and for maximal matching [Czumaj et al., STOC'18, Ghaffari et al. PODC'18; Assadi et al. SODA'19; Behnezhad et al. FOCS'19; Gamlath et al. PODC'19], but these algorithms do not extend to the more general $b$-matching problem.
... Thanks to advances in communications, it is now possible to build frameworks of several computing machines able to execute tasks collaboratively at the same time, according to the divide et impera paradigm. Among these, we mention MapReduce (Dean and Ghemawat, 2008) and Spark (Zaharia et al., 2012), which are used intensively in clustering and help to obtain results from classical clustering algorithms quickly. Indeed, enhanced versions of K-means, such as PKMeans (Zhao et al., 2009) and SOKM (Zayani et al., 2016), have been proposed. ...
Thesis
Clustering reveals all its interest when the data set size increases considerably, since there is the opportunity to discover tiny but possibly high-value clusters, which cannot be detected with moderate sample sizes. However, the clustering of such high data volumes encounters computational limitations, requiring extremely high memory and computational resources. Thus, current clustering algorithms need frugal implementations, also demanded by institutions and industries to comply with today's eco-friendly policies. In this context, Gaussian model-based clustering, a popular clustering technique based on Gaussian mixtures, has required frugal adaptations to overcome these computational limitations and to achieve, even in the huge-data case, the same good performance obtained in moderate-size analyses. Such implementations are essentially based on subsampling strategies, which manage to be frugal but are expected to fail badly in the highly imbalanced cluster case. Thus, in this work, we propose a frugal technique, based on a so-called bin-marginal data compression, to perform Gaussian model-based clustering on huge and imbalanced data sets. After a preliminary analysis on simple univariate settings revealing the potential of our solution (here, based on univariate binned data), we extend our proposal to multivariate data sets, where bin-marginal data are employed to perform a drastic reduction of the data volume. Despite this extreme loss of information, we prove an identifiability property for the diagonal mixture model and we also introduce a specific EM-like algorithm associated with a composite likelihood approach guaranteeing frugality. Numerical experiments highlight that the proposed method outperforms subsampling both in controlled simulations and in various real applications where imbalanced clusters may typically appear, such as image segmentation, hazardous asteroid recognition and fraud detection.
Then, additional topics regarding model choice, the problem of local maxima and the impact of our data compression on clustering are dealt with from a purely experimental point of view. Finally, through a collaboration with a company specialized in predictive maintenance, a practical application of anomaly detection on real time series is shown, in order to extend the potential application domains of the proposal.
... The proposed 3D video signature provides high accuracy as well as strong robustness against many video transformations. The second important element of our strategy is the distributed index, which indexes high-dimensional multimedia objects [6]. The distributed index is implemented with the MapReduce framework and can therefore scale elastically to adjust the amount of computing resources while providing high accuracy. ...
Article
The illegal distribution of copyrighted objects, uploaded by visitors to online hosting sites, can result in significant loss of revenue for content creators. The systems needed to detect copies of multimedia objects require substantial time and effort to build. We propose a design for large-scale multimedia content protection systems. We focus on content-based copy detection, in which signatures are extracted from the original objects and copies are recognized by matching against them. Our system discovers illegally made copies of multimedia objects on the Internet. Our design enables rapid deployment of content protection systems, as it relies on cloud infrastructure that provides computing as well as software resources. It includes two novel components: a method for generating signatures of 3D videos, and a distributed matching engine for high-dimensional multimedia objects.
... MapReduce [42] is a framework for processing large data using a cluster that consists of multiple commodity machines. While distributed-memory algorithms are limited to moderatesized graphs, MapReduce is suitable for handling enormous graphs as it processes data in an I/ O efficient manner on a distributed file system. ...
Article
Full-text available
With a cluster of commodity hardware, how can we efficiently find all connected components of an enormous graph containing hundreds of billions of nodes and edges? The problem of finding connected components has been used in various applications such as pattern recognition, reachability indexing, graph compression, graph partitioning, and random walk. Several studies have been proposed to efficiently find connected components in various environments. Most existing single-machine and distributed-memory algorithms are limited in scalability as they have to load all data generated during the process into the main memory; they require expensive machines with vast memory capacities to handle large graphs. Several MapReduce algorithms try to handle large graphs by exploiting distributed storage but fail due to data explosion problems, which is a phenomenon that significantly increases the size of data as the computation proceeds. The latest MapReduce algorithms resolve the problem by proposing two distinguishing star-operations and executing them alternately, while the star-operations still cause massive network traffic as a star-operation is a distributed operation that connects each node to its smallest neighbor. In this paper, we unite the two star-operations into a single operation, namely UniStar, and propose UniCon, a new distributed algorithm for finding connected components in enormous graphs using UniStar. The partition-aware processing of UniStar effectively resolves the data explosion problems. We further optimize UniStar by filtering dispensable edges and exploiting a hybrid data structure. Experimental results with a cluster of 10 cheap machines each of which is equipped with Intel Xeon E3-1220 CPU (4-cores at 3.10GHz), 16GB RAM, and 2 SSDs of 1TB show that UniCon is up to 13 times faster than competitors on real-world graphs. 
UniCon succeeds in processing a tremendous graph with 129 billion edges, which is up to 4096 times larger than graphs competitors can process.
... Association rule analysis (Agapito et al (2013)), which is a workflow for association rule analysis between genome variations and clinical conditions of a group of patients; Trajectory mining (Altomare et al (2017)) for discovering patterns and rules from trajectory data of vehicles in a wide urban scenario; Political polarization (Belcastro et al (2020a)) that exploits a workflow combining multiple machine learning algorithms to estimate the polarization of social media users on political events, which are characterized by the competition of different factions or parties. In some cases, the workflow formalism has been integrated with other programming models, such as MapReduce (Dean and Ghemawat (2008)), to exploit the inherent parallelism of the application in presence of Big Data. As an example, in (Belcastro et al (2015b)) a workflow management system has been integrated with MapReduce for implementing a scalable predictor of flight delays due to weather conditions (Belcastro et al (2016)). ...
... Rail-RNA is built on the MapReduce programming model [30], which uses an economy of fundamental abstractions to promote scalable cluster computing. A problem is divided into a sequence of computation and aggregation steps. ...
Preprint
Full-text available
RNA sequencing (RNA-seq) experiments now span hundreds to thousands of samples. Current spliced alignment software is designed to analyze each sample separately. Consequently, no information is gained from analyzing multiple samples together, and it is difficult to reproduce the exact analysis without access to original computing resources. We describe Rail-RNA, a cloud-enabled spliced aligner that analyzes many samples at once. Rail-RNA eliminates redundant work across samples, making it more efficient as samples are added. For many samples, Rail-RNA is more accurate than annotation-assisted aligners. We use Rail-RNA to align 667 RNA-seq samples from the GEUVADIS project on Amazon Web Services in under 16 hours for US\$0.91 per sample. Rail-RNA produces alignments and base-resolution bigWig coverage files, ready for use with downstream packages for reproducible statistical analysis. We identify expressed regions in the GEUVADIS samples and show that both annotated and unannotated (novel) expressed regions exhibit consistent patterns of variation across populations and with respect to known confounders. Rail-RNA is open-source software available at http://rail.bio.
... Users specify the computation in terms of map and reduce constructs (typical of the functional programming paradigm), and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. The MapReduce paradigm of parallel programming provides simplicity while at the same time offering load balancing and fault tolerance [25]. With respect to the scalable implementation of Rough Sets, the work of Zhang et al. [26] provides a parallel method for computing Rough Set approximations based on the MapReduce technique in order to deal with massive data. ...
Article
Full-text available
In complex environments, decision-making processes depend more and more on gathering, processing and analysis of huge amounts of data, often produced at different velocities and in different formats by distributed sensors (human or automatic). Such streams of data also suffer from imprecision and uncertainty. On the other hand, Three-way Decision is considered a suitable approach for data analysis based on tri-partitioning the universe of discourse, i.e., exploiting the notions of acceptance, rejection and non-commitment, much as the human brain does to solve numerous problems. If the application scenario foresees the processing of data streams, the analysis task can be accomplished using the stream computing paradigm, one of the most important paradigms in Big Data. With such a paradigm, data arrives, is processed and departs in real time without needing to be temporarily serialized into a storage system. This work analyzes the implementation of the Three-Way Decision approach, based on Rough Set Theory, on a real-time data processing platform supporting streaming computing, i.e., Apache Spark.
Chapter
The tremendous popularity of web-based social media is attracting the attention of the industry to take profit from the massive availability of sentiment data, which is considered of a high value for Business Intelligence (BI). So far, BI has been mainly concerned with corporate data with little or null attention to the external world. However, for BI analysts, taking into account the Voice of the Customer (VoC) and the Voice of the Market (VoM) is crucial to put in context the results of their analyses. Recent advances in Sentiment Analysis have made possible to effectively extract and summarize sentiment data from these massive social media. As a consequence, VoC and VoM can be now listened from web-based social media (e.g., blogs, reviews forums, social networks, and so on). However, new challenges arise when attempting to integrate traditional corporate data and external sentiment data. This paper deals with these issues and proposes a novel semantic data infrastructure for BI aimed at providing new opportunities for integrating traditional and social BI. This infrastructure follows the principles of the Linked Open Data initiative.
Article
Full-text available
Several complex scientific simulations process large amounts of distributed and heterogeneous data. These simulations are commonly modeled as scientific workflows and require High Performance Computing (HPC) environments to produce results in a timely manner. Although scientists already benefit from clusters and clouds, new hardware, such as General Purpose Graphical Processing Units (GPGPUs), can be used to speed up the execution of the workflow. Clouds also provide virtual machines (VMs) with GPU capabilities that can be used, thus becoming hybrid clouds. This way, many workflows can be modeled considering programs that execute on GPUs, CPUs or both. A problem that arises is how to schedule workflows with variant activities (that can be executed on CPU, GPU or both) in this hybrid environment. Although existing workflow management systems (WfMS) can execute on GPGPUs and clouds independently, they do not provide mechanisms for scheduling workflows with variant activities in this hybrid environment. In fact, reducing the makespan and the financial cost of variant workflows in hybrid clouds may be a difficult task. In this article, we present a scheduling strategy for variant GPU-accelerated workflows in clouds, named PROFOUND, which schedules activations (atomic tasks) to a set of CPU and GPU/CPU VMs based on provenance data (historical data). PROFOUND is based on a combination of a mathematical formulation and a heuristic, and aims at minimizing not only the makespan but also the financial cost involved in the execution. To evaluate PROFOUND, we used a set of benchmark instances based on synthetic and real scenarios gathered from different workflow traces. The experiments show that PROFOUND is able to solve the referred scheduling problem.
Article
Full-text available
Edge computing is a paradigm that brings computation and data storage closer to the location where they are needed, to improve response times and save bandwidth. It applies virtualization technology that makes it easier to deploy and run a wider range of applications on edge servers and to take advantage of largely unused computational resources. This article describes the design and formalization of Hive, a distributed shared memory model that can be transparently integrated with JavaScript using a standard out-of-the-box runtime. To define such a model, a formal definition of the JavaScript language was used and extended to include modern capabilities and custom semantics. This extended model is used to prove that the distributed shared memory can operate on top of existing and unmodified web browsers, allowing the use of any computer or smartphone as part of the distributed system. The proposed model guarantees the eventual synchronization of data across the whole system and provides the possibility of stricter consistency using standard HTTP operations.
Article
Technological advancement plays a major role in this era of a digital world of growing data. Hence, there is a need to analyse the data so as to make good decisions. In the domain of data analytics, clustering is one of the significant tasks. A main difficulty in MapReduce is the clustering of massive datasets. Within a computing cluster, MapReduce, associated with parallel and distributed methods, serves as a main programming model. In this work, a MapReduce-based Firefly algorithm known as MR-FF is proposed for clustering data. It is implemented using a MapReduce model within the Hadoop framework and enhances the clustering task by reducing the sum of Euclidean distances between every data instance and its cluster centroid. The experimental results show that the proposed algorithm performs better when dealing with gigantic data while maintaining the quality of the clustering.
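The clustering objective described above, the sum of Euclidean distances between each data instance and the centroid of its cluster, can be written in a few lines (a generic sketch of the objective only, not the MR-FF implementation; points and centroids are hypothetical):

```python
import math

def clustering_cost(points, centroids):
    # Sum over all points of the Euclidean distance to the nearest centroid;
    # this is the quantity a clustering algorithm such as MR-FF minimizes.
    return sum(min(math.dist(p, c) for c in centroids) for p in points)

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
centroids = [(0.5, 0.0), (10.0, 10.0)]
cost = clustering_cost(points, centroids)  # 0.5 + 0.5 + 0.0
```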
Article
Full-text available
In the treatment of ischemic stroke, timely and efficient recanalization of occluded brain arteries can successfully salvage the ischemic brain. Thrombolysis is the first-line treatment for ischemic stroke, and machine learning models have the potential to select the patients who would benefit most from it. In this study, we identified 29 previous machine learning models, reviewed them for accuracy and feasibility, and proposed corresponding improvements. Regarding accuracy, many previous studies lacked long-term outcomes, consideration of treatment options, and advanced radiological features in their model conceptualization. Regarding interpretability, most previous studies chose restrictive models for the sake of high interpretability and did not consider processing time. In the future, model conceptualization could be improved through comprehensive neurological domain knowledge, while feasibility requires elaborate computer science algorithms that increase the interpretability of flexible models and shorten the processing time of pipelines interpreting medical images.
Article
Full-text available
XML is a semi-structured data description format that, owing to its wide adoption and growing data volumes, is also relevant as an input format for Big Data processing. This article therefore addresses the use of complex XML-based data structures as input for Big Data applications. When large, complex XML structures containing various XML types within a single file are processed with, for example, Apache Hadoop, reading the data can dominate an application's runtime. Our approach optimizes the input phases by keeping intermediate processing results in main memory, which in some cases reduces the processing effort considerably. Using a case study from the music industry, where standardized XML-based formats such as the DDEX format are used, we show experimentally that processing with our approach is significantly more efficient than the classical processing of file contents.
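The core idea above, parse once and keep the intermediate representation in memory so later passes skip the expensive input phase, can be sketched in a few lines. This is a single-machine illustration of the caching idea only, not the paper's Hadoop-based system; the sample DDEX-like structure is invented:

```python
import xml.etree.ElementTree as ET
from functools import lru_cache

@lru_cache(maxsize=None)
def parsed(xml_text):
    """Parse an XML document once; repeated calls reuse the in-memory tree."""
    return ET.fromstring(xml_text)

XML = ("<releases>"
       "<release id='1'><title>A</title></release>"
       "<release id='2'><title>B</title></release>"
       "</releases>")

# Two passes over the same input: only the first pays the parsing cost.
titles = [r.findtext("title") for r in parsed(XML)]
ids = [r.get("id") for r in parsed(XML)]
assert parsed.cache_info().hits == 1   # second pass served from memory
print(titles, ids)
```

The same principle applied inside a Hadoop input phase amortizes the cost of interpreting complex XML types across the jobs that consume them.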
Chapter
For many years, the Health Information System (HIS) has been an integral part of the day-to-day operations of healthcare organizations such as clinical centers and hospitals. An HIS collects, stores, and manages healthcare data from different sources related to patients' electronic medical records and other daily operational activities. It analyzes its database to create reports that help healthcare organizations improve patient outcomes, quality of service, and treatment cost efficiency, and that support healthcare policy decisions. Over time, HIS data has grown enormously, overloading the data analytics module, degrading overall system performance, and ultimately resulting in denial of service. This paper presents a case study of applying big data technology to improve the performance of the data analytics modules in a traditional HIS.
Article
Predictive maintenance (PdM) aims to reduce costs and thereby increase the competitive strength of enterprises. It uses sensor data together with analytics techniques to optimize the schedule of maintenance interventions. Applying such a maintenance strategy requires the cooperation of several agents and involves knowledge and skills in distinct fields, since it ranges from capturing relevant signals on the shop floor to their processing, transmission, storage, and analysis in order to extract meaningful knowledge. PdM is a broad topic, making it impossible to address all its subtopics in a single paper. With this in mind, this paper focuses on the main challenges that hinder the development of a generalized data-driven system for PdM: the existence of noisy or erroneous sensor data in real industrial environments; the need to collect, transmit, and process high volumes of data in a timely manner; and the fact that current approaches to PdM are specific to a part or piece of equipment rather than global. The paper connects three perspectives: anomaly detection, which allows the removal of noisy or erroneous data and the detection of relevant events that can improve prognostics methods; prognostics methods, which address the models used to forecast the condition of industrial equipment; and architectures, which may allow the deployment of anomaly detection and prognostics methods in real time and in different industrial scenarios. Furthermore, the latest trends, current challenges, and opportunities of each perspective are discussed throughout the paper.
Chapter
With recent advancements in information technology and the growing number of electronic devices, data is evolving at a very rapid rate, and this enormous amount of data needs to be efficiently managed and utilized. The primary concern is therefore the storage, management, and exhaustive analysis of this data. Because of its increased volume, the term "data" has given way to "Big Data". Big Data possesses certain properties, namely Volume, Variety, Velocity, Veracity, Value, and Variability, often referred to as the 6 V's of Big Data. Unlike data in a simple relational database system, Big Data comes in various formats because it is generated and collected from many different sources; it need not be structured in a tabular format, and most of the data generated these days is either semi-structured or unstructured. Traditional database systems are therefore no longer sufficient to handle it, and new methods must be adopted for such varied forms of data. This chapter gives a brief introduction to the concepts of Big Data, its components, and the various forms of data generated by different devices. It also introduces the distributed processing of Big Data, which is done using Hadoop and MapReduce.
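The distributed processing model the chapter introduces is easiest to see in the canonical word-count example. Below is a single-machine sketch of the map, shuffle, and reduce phases; in a real Hadoop job each phase runs distributed across the cluster, and the function names here are illustrative:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in document.lower().split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    """Reduce: sum the partial counts for one word."""
    return word, sum(counts)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
print(counts["the"], counts["fox"])   # 3 2
```

The key property is that every map call and every reduce call is independent, which is what lets the framework scale both phases horizontally over unstructured input.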
Chapter
Over the past four decades, operations research (OR) has played a key role in solving complex problems in travel. More recently, artificial intelligence (AI) has seen rapid adoption in travel that covers everything from robotic process automation, to cognitive insight, to cognitive engagement. This chapter discusses the role of AI in lodging and its potential to address travel complexity, solve a range of problems, and create new value propositions. This chapter also reviews the role of big data and blockchain technology in travel. Leveraging big data and blockchain is reviewed with a series of examples.
Preprint
Full-text available
Motivation: Although it circumvents the hyperparameter estimation of ordinary differential equation (ODE) based models and the complexities of many other models, the computational time complexity of a fuzzy logic regulatory model inference problem, particularly at higher orders of interaction, quickly approaches that of computationally intractable problems. This undermines the benefits inherent in the simplicity and strength of the fuzzy logic-based molecular regulatory inference approach.
Results: For a sample inference problem (molecular regulation of vorinostat resistance in the HCT116 colon cancer cell line), the "multistaged-hyperparallel" optimization approach we modeled, designed, and implemented significantly shortened the time to model inference from about 485.6 hours (20.2 days) to approximately 9.6 hours (0.4 days), compared to an optimized version of a previous implementation.
Availability: The multistaged-hyperparallel method is implemented as a plugin in the JFuzzyMachine tool, freely available at the GitHub repository locations https://github.com/paiyetan/jfuzzymachine and https://github.com/paiyetan/jfuzzymachine/releases/tag/v1.7.21. Source codes and binaries are freely available at the specified URLs.
Contact: paiyetan@gmu.edu