Article

MapReduce: Simplified data processing on large clusters


... This work tries to improve the performance of MapReduce workloads. [1] Makespan and total completion time (TCT) are two key performance metrics. Makespan is defined as the time period from the start of the first job until the completion of the last job in a set of jobs. ...
... In this work, we aim to optimize these two metrics. [1] When large data sets are processed as a set of jobs, the makespan and the total completion time should both be minimized, and the job ordering should be optimized using the MapReduce algorithm. The total completion time over all datasets should likewise be optimized. ...
... A MapReduce job consists of a set of map and reduce tasks, where reduce tasks are performed after the map tasks. [1] Hadoop [3] is an open source implementation of MapReduce. MapReduce and Hadoop [3] are used to support batch processing for jobs submitted from multiple users (i.e., MapReduce workloads). ...
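The two metrics defined in these excerpts can be made concrete with a small, self-contained sketch (the job start and completion times below are hypothetical):

```python
# Hypothetical batch of jobs, each given as (start_time, completion_time).
jobs = [(0, 4), (1, 7), (2, 9)]

first_start = min(start for start, _ in jobs)

# Makespan: time from the start of the first job to the completion of the last.
makespan = max(end for _, end in jobs) - first_start

# Total completion time (TCT): sum of each job's completed time period,
# measured from the start of the first job.
total_completion_time = sum(end - first_start for _, end in jobs)

print(makespan)               # 9
print(total_completion_time)  # 4 + 7 + 9 = 20
```

Job ordering changes the individual start and completion times, which is why reordering the jobs in a workload can reduce both metrics.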
Conference Paper
Full-text available
With the continuous advancement in computer technology over the years, the quantity of data being generated is growing exponentially. Some of these data are structured, semi-structured or unstructured. This poses a great challenge when these data are to be analyzed, because conventional data processing techniques are not suited to handling such data. MapReduce is a programming model and an associated implementation for processing and generating large data sets. A MapReduce workload generally contains a set of jobs, each of which consists of multiple map tasks followed by multiple reduce tasks. This work proposes algorithms to minimize the makespan and the total completion time for an offline MapReduce workload. Our algorithms focus on job ordering optimization for a MapReduce workload, optimizing both the makespan and the total completion time. To this end, a new greedy job ordering algorithm is proposed that minimizes the makespan and the total completion time together.
... As the growth rate of single-machine computation speeds started to stagnate in recent years, parallelism seemed like an effective technique to aggregate the computation speeds of multiple machines. Since then, parallelism has become a main ingredient in compute cluster architectures [1], [2], as it incurs less processing and storage cost on any one individual server [3]. ...
... We select a likelihood model f(D | θ) for the data generation process and a prior f(θ) over the model parameters θ. We assume we have access to data D = {d^(1), d^(2), ..., d^(n)}, where d^(j) is the inter-arrival time between the j-th and (j+1)-st job arrival events observed by the load-balancer, or the service times for each server. ...
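As an illustration of this kind of Bayesian setup, the sketch below assumes exponentially distributed inter-arrival times with rate θ and a conjugate Gamma(α, β) prior; the concrete likelihood and prior used in the paper may differ.

```python
import random

def posterior_params(inter_arrivals, alpha=1.0, beta=1.0):
    """Gamma posterior for the arrival rate theta after observing n
    exponential inter-arrival times: alpha' = alpha + n, beta' = beta + sum(d)."""
    n = len(inter_arrivals)
    return alpha + n, beta + sum(inter_arrivals)

random.seed(0)
true_rate = 2.0
observed = [random.expovariate(true_rate) for _ in range(500)]

a_post, b_post = posterior_params(observed)
print(a_post / b_post)  # posterior mean of theta; close to the true rate 2.0
```

With 500 observations the posterior mean concentrates near the true rate, which is the behavior a load-balancer would exploit to estimate arrival rates online.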
... For a fixed overall number of jobs in the system and χ > 1, this objective tends to balance queue lengths; e.g., if the total number of jobs in the system is 10 and N = 2, then an allocation of [5, 5] jobs will have a much higher reward than a [9, 1] allocation. Using the variance among the current queue fillings also balances the load on the queues. ...
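To make this concrete, here is an illustrative reward of the assumed form r(q) = -Σᵢ qᵢ^χ; the paper's exact objective is not given in these excerpts, but any such function with χ > 1 rewards balanced allocations:

```python
def reward(queues, chi=2.0):
    # Penalize uneven queue fillings: for chi > 1 the penalty is convex,
    # so spreading jobs evenly across queues maximizes the reward.
    return -sum(q ** chi for q in queues)

balanced = reward([5, 5])  # -(25 + 25) = -50.0
skewed = reward([9, 1])    # -(81 + 1) = -82.0
print(balanced > skewed)   # True
```

The same total of 10 jobs yields a strictly higher reward when split evenly, matching the [5, 5] versus [9, 1] comparison in the excerpt.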
Article
Full-text available
Load balancing arises as a fundamental problem, underlying the dimensioning and operation of many computing and communication systems, such as job routing in data center clusters, multipath communication, Big Data and queueing systems. In essence, the decision-making agent maps each arriving job to one of the possibly heterogeneous servers while aiming at an optimization goal such as load balancing, low average delay or low loss rate. One main difficulty in finding optimal load balancing policies here is that the agent only partially observes the impact of its decisions, e.g., through the delayed acknowledgements of the served jobs. In this paper, we provide a partially observable (PO) model that captures the load balancing decisions in parallel buffered systems under limited information of delayed acknowledgements. We present a simulation model for this PO system to find a load balancing policy in real-time using a scalable Monte Carlo tree search algorithm. We numerically show that the resulting policy outperforms other limited information load balancing strategies such as variants of Join-the-Most-Observations and has comparable performance to full information strategies like: Join-the-Shortest-Queue, Join-the-Shortest-Queue(d) and Shortest-Expected-Delay. Finally, we show that our approach can optimise the real-time parallel processing by using network data provided by Kaggle.
... Over the past decade, Apache Hadoop [4] has become the de facto standard ''Big Data'' store and processing platform by leveraging a robust and scalable distributed file system (HDFS [8]) and an efficient parallel processing framework (MapReduce [9]). In the initial design of Hadoop 1.0, MapReduce programming model was tightly coupled with the resource management infrastructure. ...
... Mahout has recently been migrated to a general framework enabling a mix of dataflow programming and linear algebraic computations on backends such as Apache Spark. Mahout originally targeted MapReduce-based [9] large-scale machine learning. However, both MLlib and Mahout mainly focus on traditional machine learning algorithms rather than supporting deep learning application executions. ...
... The initial implementation of Apache Hadoop [4] included the MapReduce programming model [9] with the underlying Hadoop Distributed File System (HDFS) [8]. From version 2.0, Hadoop has been expanded into a multiuse data platform by breaking down its functionality into two layers: a platform layer for system-level resource management and a framework layer for application-level coordination, based on the cluster resource management system (often called a cluster-level operating system) ''YARN'' (Yet Another Resource Negotiator) [10,11]. ...
Article
Full-text available
We have designed and implemented a new data processing framework called “MeLoN” (Multi-tenant dEep Learning framework On yarN), which aims to effectively support distributed deep learning applications, a new type of data-intensive workload in the YARN-based Hadoop ecosystem. MeLoN is developed as a Hadoop YARN application so that it can transparently co-host existing deep learning applications with other data processing workflows. In this paper, we present comprehensive techniques that effectively support multiple deep learning applications in a Hadoop YARN cluster by leveraging a fine-grained GPU over-provisioning policy and a high-performance parallel file system for data staging, which can improve the overall system throughput. Through extensive experiments based on representative deep learning workloads, we demonstrate that MeLoN achieves an effective convergence of deep learning and the big data platform Hadoop by employing YARN-based resource allocation and execution mechanisms for running distributed deep learning applications. We believe that MeLoN raises many additional interesting research issues, including profiling the expected GPU memory usage of deep learning applications and supporting more complicated deep-learning-related jobs based on queuing systems, which can ultimately contribute to a new data processing framework in the YARN-based Hadoop ecosystem.
... There is a recent surge in the use of distributed computing systems such as MapReduce [1] and Spark [2] for processing nonlinear and computationally hard functions. This surge has been further intensified by recent developments relating to training large-scale machine learning algorithms such as deep neural networks with high data complexity (cf. ...
... In this section, we explain the multi-user linearly separable computation setting (cf. Fig. 1), which consists of K users, N servers, and a master node that coordinates the servers and users. ...
... In the above, we can easily see that the encoding coefficients e_{n,ℓ}, which are indeed determined by the master node, satisfy e_{n,ℓ} = 0 for all (n, ℓ) ∈ [N] × [L] with n ∉ W_ℓ. (Our setting incorporates the underlying assumption that the tasks performed at the servers substantially outweigh, in computational complexity, the basic linear operations performed at the different users; we also assume that each server node is connected to all of the users through a broadcast channel.) ...
Conference Paper
Full-text available
In this work, we investigate the problem of multiuser linearly separable function computation, where N servers help compute the desired functions (jobs) of K users. In this setting, each desired function can be written as a linear combination of up to L (generally non-linear) sub-functions. Each server computes some of the sub-tasks and communicates a linear combination of its computed outputs (files) in a single shot to some of the users; each user then linearly combines its received data in order to recover its desired function. We explore the range of the optimal computation cost by establishing a novel relationship between our problem, syndrome decoding and covering codes. The work reveals that in the limit of large N, the optimal computation cost, in the form of the maximum fraction γ of all servers that must compute any subfunction, is lower bounded as γ ≥ H_q^{-1}(log_q(L)/N), for any fixed log_q(L)/N. The result reveals the role of the computational rate log_q(L)/N, which cannot exceed what one might call the computational capacity H_q(γ) of the system.
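The linear structure described above can be illustrated with a toy, real-valued sketch (the paper works over finite fields with coded assignments, so this is only a simplified illustration; all names and values here are hypothetical):

```python
# Sub-functions f1, f2, f3 and a user demand F(D) = 2*f1(D) + 0*f2(D) + 3*f3(D).
subfunctions = [lambda d: d * d, lambda d: d + 1, lambda d: 2 * d]
demand = [2, 0, 3]

# Server n computes only the sub-functions assigned to it (its set W_n);
# its encoding coefficients for unassigned sub-functions are implicitly zero.
assignment = {0: [0], 1: [2]}

def evaluate(D):
    # Each server sends one linear combination of its computed outputs;
    # the user linearly combines (here: simply sums) what it receives.
    messages = [sum(demand[l] * subfunctions[l](D) for l in assigned)
                for assigned in assignment.values()]
    return sum(messages)

print(evaluate(4))  # 2*16 + 3*8 = 56
```

Note how f2 is never computed anywhere: the computation cost is driven by how many servers must evaluate each demanded sub-function.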
... Aggregation applications are increasingly developed and deployed in data centers, such as distributed machine learning [1,2] and MapReduce-like applications [3]. Aggregation applications typically need to aggregate massive intermediate results from different worker servers to get final results, so these applications can generate a large amount of traffic consisting of intermediate data. ...
... In data center networks, traffic from MapReduce-like applications accounts for a considerable percentage in some data centers [5]. The common point of these applications is that their running process follows the partition/aggregation pattern [3,6]. When a new request arrives at the master node, the job is assigned to a group of workers (mappers and reducers). ...
Article
Full-text available
Aggregation applications are widely deployed in data centers, such as distributed machine learning and MapReduce-like frameworks. These applications typically have large communication overhead, which brings extra delay and traffic pressure to data center networks. In-network aggregation (INA) is a new approach to accelerate aggregation tasks and reduce traffic by offloading the aggregation function onto network switches. In this paper, we concentrate on two aspects of INA. The first aspect is INA implementation methods. We summarize the key points of INA design, and classify INA methods into three categories according to the type of hardware: commodity programmable switch, middle box, and new switch architecture. The second aspect is INA algorithms. Building an aggregation tree effectively is essential to the performance of INA. Finally, we make comparisons and propose some potential challenges and opportunities for future INA research.
... Here, the k-hop neighborhood comprises the vertices that can be reached within at most k edges from the given vertex. AGL proposes a distributed pipeline to generate the k-hop neighborhood of each vertex based on message passing, and implements it with the MapReduce infrastructure [99]. The generated k-hop neighborhood information is stored in the distributed file system. ...
... It is categorized as the individual-sample-based execution of distributed mini-batch training. To speed up the sampling process, it introduces a distributed pipeline to generate k-hop neighborhoods in the spirit of message passing, which is implemented with the MapReduce infrastructure [99]. In this way, in the sampling phase, mini-batch data can be rapidly generated by collecting the k-hop neighbors of the target vertices. ...
Preprint
Graph neural networks (GNNs) have been demonstrated to be a powerful algorithmic model in broad application fields for their effectiveness in learning over graphs. To scale GNN training up for large-scale and ever-growing graphs, the most promising solution is distributed training, which distributes the workload of training across multiple computing nodes. However, the workflows, computational patterns, communication patterns, and optimization techniques of distributed GNN training are still only preliminarily understood. In this paper, we provide a comprehensive survey of distributed GNN training by investigating the various optimization techniques used in distributed GNN training. First, distributed GNN training is classified into several categories according to workflow. In addition, the computational patterns and communication patterns, as well as the optimization techniques proposed by recent work, are introduced. Second, the software frameworks and hardware platforms of distributed GNN training are also introduced for a deeper understanding. Third, distributed GNN training is compared with distributed training of deep neural networks, emphasizing the uniqueness of distributed GNN training. Finally, interesting issues and opportunities in this field are discussed.
... The owning organization has full rights over the data and the cloud center. In these types of clouds, the sender uses different encryption techniques [12] to secure the data and the receiver has a key to decrypt it. C) Hybrid clouds: these are the combination of both public and private clouds, where the public cloud data can be used by the private cloud of an organization if needed. ...
... While doing these things on the cloud we may face issues like a slow data transfer rate, delays in handover, and longer migration times. To overcome these issues we may use MapReduce [12] on the HDFS architecture for distributed data [15]. ...
Article
The cloud acts as data storage and is also used for data transfer from one cloud to another. Here, data exchange takes place among the cloud centers of organizations. Each cloud center stores a huge amount of data, which in turn is hard to store and retrieve information from. While migrating the data, issues such as a low data transfer rate, end-to-end latency and data storage problems occur. As the data is distributed among many cloud centers from a single source, the speed of migration is reduced. In distributed cloud computing it is very difficult to transfer data quickly and securely. This paper explores MapReduce within the distributed cloud architecture, where MapReduce assists at each cloud. It strengthens the data migration process with the help of HDFS. Compared to the existing cloud migration approach, the proposed approach gives more accurate results in terms of speed, time and efficiency.
... Hence, several programming models have been proposed to improve iterative computation. The programming models of GPS include MapReduce [158], Vertex-centric [19], Gather-Apply-Scatter [23], and Subgraph-centric [84]. ...
... Dean and Ghemawat [158] proposed the MapReduce (MR) programming model. It is a distributed programming framework for large-scale data computing on commodity clusters. ...
Article
Full-text available
Graphs are a tremendously suitable data representation that models the relationships of entities in many application domains, such as recommendation systems, machine learning, computational biology, social network analysis, and others. Graphs with many vertices and edges have become quite prevalent in recent years. Therefore, graph computing systems integrating various graph partitioning techniques have been envisioned as a promising paradigm to handle large-scale graph analytics in these application domains. However, scalable processing of large-scale graphs is challenging due to the high volume and inherently irregular structure of real-world graphs. Hence, industry and academia have recently been proposing graph partitioning and computing systems to process and analyze large-scale graphs efficiently. Graph partitioning and computing systems have been designed to improve scalability and reduce processing time complexity. This paper presents an overview, classification, and investigation of the most popular graph partitioning and computing systems. The various methods and approaches of graph partitioning and the diverse categories of graph computing systems are presented. Finally, we discuss the main challenges and future research directions in graph partitioning and computing systems.
... Due to ever-growing database sizes, many data mining researchers have revised the conventional mining algorithms and formulated distributed algorithms to handle big data more efficiently. The most efficient big data framework that helps in designing distributed algorithms is MapReduce [19]. In 2008, Google presented the MapReduce [19] distributed programming framework. ...
... The most efficient big data framework that helps in designing the distributed algorithms is MapReduce [19]. In 2008, Google developed MapReduce [19] distributed programming framework. It can handle the processing of big data by distributing the work into two parallel processes, namely, Mapper and Reducer. ...
Article
Full-text available
Mining high utility sequential patterns is recognized as significant research in data mining. Several methods mine sequential patterns while taking utility values into consideration. Patterns of this type can determine the order in which items were purchased, but not the time interval between them. The time interval among items is important for predicting the most useful real-world circumstances, including retail market basket data analysis, stock market fluctuations, DNA sequence analysis, and so on. There are very few algorithms for mining sequential patterns that consider both the utility and the time interval. However, they assume the same threshold for each item, maintaining the same unit profit. Moreover, with the rapid growth of data, the traditional algorithms cannot handle big data and are not scalable. To handle this problem, we propose a distributed three-phase MapReduce framework that considers multiple utilities and is suitable for handling big data. The time constraints are pushed into the algorithm instead of using pre-defined intervals. Also, the proposed upper bound minimizes the number of candidate patterns during the mining process. The approach has been tested, and the experimental results show its efficiency in terms of run time, memory utilization, and scalability.
... Despite the advantages of this model, the MapReduce programming model has some drawbacks that matter for data mining. These include inefficient handling of iterative tasks and slow speed when combining multiple data sources [10,3]. For these reasons, alternatives to the standard Hadoop MapReduce platform have appeared. ...
... In each MapReduce procedure, the original training data is divided into several parts: during data segmentation, the potential presence of small classes in the data associated with each map is determined [5,10]. "Rare" small classes are classes that cover only a few training examples in the trained classifier. ...
Article
Full-text available
Today, in the field of big data, we observe a growing trend of success in studying and applying the tasks that arise when extracting knowledge from large volumes of data. For this reason, standard data mining systems are migrating to a new functional paradigm that makes it possible to work with big data. With the MapReduce model and its various extensions, the problems of fault tolerance and scalability of algorithms can be solved successfully while keeping data quality at a good level. For many applications, various approaches used in data mining are chosen, and among them are models based on fuzzy systems. Among their advantages, their closeness to natural language should be highlighted. In addition, they use an inference model that allows them to adapt well to various scenarios, especially those with a certain degree of uncertainty. Despite the success of these types of systems, their migration to the big data environment in various learning domains is still at an early stage. This article analyzes the design of these models and gives an overview of the main existing proposals on the topic. In addition, problems related to data distribution and the parallelization of existing algorithms are discussed, as well as the relationship of the models to fuzzy data representation, and suggestions for the future design of methods based on fuzzy sets in this area are made.
... In this Chapter, we present two frameworks for scalable and parallel processing of big volumes of data, Apache Hadoop [5] and Apache Spark [11], and the most common non-relational data models for storing and querying big data. The programming framework of Apache Hadoop, MapReduce [48], was used for the implementation of the query evaluation algorithms presented in Chapter 4, Chapter 5 and Chapter 6. Apache Spark and the NoSQL document database MongoDB were used for the implementation of the query evaluation algorithm presented in Chapter 7. ...
... MapReduce [48] is a programming framework, built by Google, for processing large datasets in a distributed manner. To create a MapReduce job, the user defines two functions, map and reduce, which run on each cluster node in isolation. ...
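As a minimal single-process sketch of this programming model, the classic word-count example can be written as follows (the function names are illustrative, not part of Hadoop's actual API):

```python
from collections import defaultdict
from itertools import chain

def map_fn(document):
    # Map: emit an intermediate (word, 1) pair for every word in the split.
    return [(word, 1) for word in document.split()]

def reduce_fn(word, counts):
    # Reduce: sum all partial counts collected for one key.
    return word, sum(counts)

def run_job(documents):
    # Shuffle: group intermediate pairs by key before the reduce phase.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(d) for d in documents):
        groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(run_job(["to be or not to be", "to be"]))
# {'to': 3, 'be': 3, 'or': 1, 'not': 1}
```

In a real cluster, the map and reduce invocations run on different nodes and the shuffle moves intermediate data across the network; the framework handles partitioning, scheduling, and fault tolerance.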
Thesis
In this Ph.D. thesis, we study the problem of efficiently evaluating basic graph pattern (BGP) queries over a large amount of linked data (RDF data) in parallel and provide four approaches. In this context, we consider that the data graph has been partitioned into graph segments and the initial query Q is decomposed into a set of BGP subqueries. In the first three approaches, the widely used MapReduce framework is used for querying a large amount of linked data. In the first approach, a generic two-phase MapReduce algorithm is presented. The algorithm is based on the idea that the data graph has been arbitrarily partitioned into graph segments which are stored in different nodes of a cluster of commodity machines. To answer a user query Q, Q is also decomposed into a set of random BGP subqueries. In the first phase, the subqueries are applied to each graph segment, in isolation, and intermediate results are computed. The intermediate results are appropriately combined in the second phase to obtain the answers of the initial query Q. The proposed algorithm computes the answers to a given query correctly, independently of a) the data graph partitioning, b) the way that graph segments are stored, c) the query decomposition, and d) the algorithm used for calculating (partial) results. In the second approach, we present a method focusing on the decomposition of the query Q into a set of generalized star queries, which are queries that allow both subject-object and object-subject edges from a specific node, called the central node. It is proved that each query Q can be transformed into a set of subject-object star subqueries. The data graph has also been arbitrarily partitioned into graph segments, and a two-phase, scalable MapReduce algorithm is proposed that efficiently obtains the answer of the initial query Q by computing and appropriately combining the generalized subquery answers.
The third approach is based on the assumption that data graphs are partitioned in the distributed file system in such a way that replication of data triples between the data segments is allowed. Data triples are replicated in such a way that the answers of generalized star queries can be obtained from a single data segment. A one-and-a-half-phase, scalable MapReduce algorithm is proposed that efficiently computes the answer of the initial query Q by computing and appropriately combining the subquery answers. It is proved that, under certain conditions, the query can be answered in a single MapReduce phase. In the fourth approach, we propose an effective data model for storing RDF data in a document database using a maximum replication factor of 2 (i.e., in the worst-case scenario, the data graph will be doubled in storage size). The proposed storage model is utilized for efficiently evaluating BGP queries in a distributed manner. Each query is decomposed into a set of generalized star queries. The proposed data model ensures that no join operations over multiple datasets are required to evaluate generalized star queries. The results of the evaluation of the generalized star subqueries of a query Q are then properly combined, in order to compute the answers of the initial query Q. The proposed approach has been implemented using MongoDB and Apache Spark.
... In those cluster and data center environments, MapReduce and Hadoop are used to support batch processing for jobs submitted from multiple users (i.e., MapReduce workloads). This work tries to improve the performance of MapReduce workloads [1]. There are two key performance metrics, i.e., makespan and total completion time (TCT). ...
... In contrast, the total completion time is the sum of the completed time periods of all jobs since the start of the first job. In this work, we aim to optimize these two metrics [1]. ...
Conference Paper
Full-text available
In today's world the quantity of data being generated is growing exponentially. Some of these data are structured, semi-structured or unstructured. This poses a great challenge when these data are to be analyzed, because conventional data processing techniques are not suited to handling such data. MapReduce is a programming model and an associated implementation for processing and generating large data sets. A MapReduce workload generally contains a set of jobs, each of which consists of multiple map tasks followed by multiple reduce tasks. This work proposes algorithms to optimize the makespan and the total completion time for an offline MapReduce workload. Our algorithms focus on job ordering optimization for a MapReduce workload, optimizing both the makespan and the total completion time. Our work focuses on resolving time efficiency problems as well as the memory utilization problem. The MK_TCT_JR algorithm produced results that are up to 90% better than MK_JR. Our algorithm improves system performance in terms of makespan and total completion time.
... Hive Query Language (HQL) is translated to the execution plan of MapReduce [8] jobs to be run in parallel, (2) the Spark [2] program and machine learning workloads are transformed to a DAG workflow for execution [7], and (3) the Tez [5] framework allows for a complex DAG of tasks for processing data. ...
Preprint
Full-text available
Directed Acyclic Graph (DAG) workflows are widely used for large-scale data analytics in cluster-based distributed computing systems. The performance model for a DAG on data-parallel frameworks (e.g., MapReduce) is a research challenge because the allocation of preemptable system resources among parallel jobs may dynamically vary during execution. This resource allocation variation during execution makes it difficult to accurately estimate the execution time. In this paper, we tackle this challenge by proposing a new cost model, called Bottleneck Oriented Estimation (BOE), to estimate the allocation of preemptable resources by identifying the bottleneck to accurately predict task execution time. For a DAG workflow, we propose a state-based approach to iteratively use the resource allocation property among stages to estimate the overall execution plan. Furthermore, to handle the skewness of various jobs, we refine the model with the order statistics theory to improve estimation accuracy. Extensive experiments were performed to validate these cost models with HiBench and TPC-H workloads. The BOE model outperforms the state-of-the-art models by a factor of five for task execution time estimation. For the refined skew-aware model, the average prediction error is under 3% when estimating the execution time of 51 hybrid analytics (HiBench) and query (TPC-H) DAG workflows.
... Database systems designed to process OLAP queries are used to manage petabyte-scale data sets. For example, the Greenplum DBMS, based on MapReduce technology [1], performs deep analysis of 6.5 PB of data on a 96-node cluster at eBay. Hadoop processes 2.5 PB of data on a 610-node cluster for the popular web service Facebook. ...
Article
Full-text available
June 20, 2009. This work is devoted to the problem of efficient query processing in cluster computing systems. An original approach to the placement and replication of data on the nodes of a cluster system is presented. Based on this approach, a load balancing method is developed. A method for efficient parallel query processing in cluster systems, based on the described load balancing method, is proposed. The results of computational experiments are given and an analysis of the effectiveness of the proposed approaches is performed. 1. INTRODUCTION. Today there is a whole range of database systems that provide parallel query processing. Database systems designed to process OLAP queries are used to manage petabyte-scale data sets. For example, the Greenplum DBMS, based on MapReduce technology [1], performs deep analysis of 6.5 PB of data on a 96-node cluster at eBay. Hadoop processes 2.5 PB of data on a 610-node cluster for the popular web service Facebook. In the field of parallel OLTP query processing there are a number of commercial parallel DBMSs, among which the best known are Teradata, Oracle Exadata and DB2 Parallel Edition. Current research in this area is directed towards DBMS self-tuning [2], load balancing and the related problem of data placement [3], parallel query optimization [4] and effective use of modern multi-core processors [5, 6]. (This work was supported by the Russian Foundation for Basic Research, project 09-07-00241-a.) One of the most important tasks in parallel DBMSs is load balancing. In the classic work [7] it was shown that skew arising during query processing in shared-nothing parallel database systems can lead to almost complete degradation of system performance.
In [8], a solution to the load balancing problem for shared-nothing systems based on replication was proposed. This solution reduces the overhead of transferring data over the network during load balancing. However, this approach is applicable only in the rather narrow context of spatial databases, in the specific segment of range queries. In [3], the load balancing problem is solved by partially redistributing data before query execution begins. This approach reduces the total amount of data transferred between compute nodes during query processing, but imposes serious requirements on the speed of inter-processor communication. The present work proposes a parallel query processing method based on an original approach to database placement.
... The waiting time of task T_l is the task arrival time TA_l subtracted from the task start time TS_l, so the waiting time of T_l can be defined as in (2): WT_l = TS_l − TA_l. ...
Article
Full-text available
With the explosive growth of data volume, data centers are playing a critical role in storing and processing huge amounts of data. A traditional single data center can no longer adapt to such incredibly fast-growing data. Recently, some research has extended tasks such as data processing to geographically distributed data centers. However, given the joint consideration of task placement and data transfer, it is complex and difficult to design a proper scheduling approach with the goal of minimizing the makespan under the constraints of task dependencies, processing capability, the network, etc. Therefore, our work proposes JHTD: an efficient joint scheduling framework based on hypergraphs for task placement and data transfer across geographically distributed data centers. There are two crucial stages in JHTD. Initially, owing to how well hypergraphs model complex problems, we leverage a hypergraph-based model to establish the relationship between tasks, data files, and data centers. A hypergraph-based partition method is then developed for task placement within the first stage. In the second stage, a task reallocation scheme is devised in terms of each task-to-data dependency. Meanwhile, a data-dependency-aware transferring scheme is designed to minimize the makespan. Finally, the real-world China-VO project model is used to conduct a variety of simulation experiments. The results demonstrate that JHTD effectively optimizes the problems of task placement and data transfer across geographically distributed data centers. JHTD has been compared with three other state-of-the-art algorithms. The results demonstrate that JHTD can reduce the makespan by up to 20.6%. Also, various impacts (data transfer volume and load balancing) have been taken into account to show and discuss the effectiveness of JHTD.
... Such systems automatically handle failures by repeating failed operations and replacing defective nodes. Newer distributed dataflow frameworks like Apache Spark [2] and Apache Flink [1] cache data in memory for faster read access, unlike the older Hadoop MapReduce [23]. The user or the framework itself can choose strategies for data that could not fit into memory, e.g., spilling it to disk or recomputing the data from previous stages. ...
Preprint
Selecting appropriate computational resources for data processing jobs on large clusters is difficult, even for expert users like data engineers. Inadequate choices can result in vastly increased costs, without significantly improving performance. One crucial aspect of selecting an efficient resource configuration is avoiding memory bottlenecks. By knowing the required memory of a job in advance, the search space for an optimal resource configuration can be greatly reduced. Therefore, we present Ruya, a method for memory-aware optimization of data processing cluster configurations based on iteratively exploring a narrowed-down search space. First, we perform job profiling runs with small samples of the dataset on just a single machine to model the job's memory usage patterns. Second, we prioritize cluster configurations with a suitable amount of total memory and within this reduced search space, we iteratively search for the best cluster configuration with Bayesian optimization. This search process stops once it converges on a configuration that is believed to be optimal for the given job. In our evaluation on a dataset with 1031 Spark and Hadoop jobs, we see a reduction of search iterations to find an optimal configuration by around half, compared to the baseline.
... Available frameworks Given that these computations are resource-demanding and require processing large amounts of data that cannot easily fit on a single machine, the training phase of ML models is usually executed through special purpose frameworks that allow for highly distributed and parallelized executions. The reference programming model for batch computations is map-reduce [7], popularized by Google and the Hadoop framework [19]. The computation is organized in two phases: map and reduce. ...
Preprint
In recent years, Web services are becoming more and more intelligent (e.g., in understanding user preferences) thanks to the integration of components that rely on Machine Learning (ML). Before users can interact (inference phase) with an ML-based service (ML-Service), the underlying ML model must learn (training phase) from existing data, a process that requires long-lasting batch computations. The management of these two, diverse phases is complex and meeting time and quality requirements can hardly be done with manual approaches. This paper highlights some of the major issues in managing ML-services in both training and inference modes and presents some initial solutions that are able to meet set requirements with minimum user inputs. A preliminary evaluation demonstrates that our solutions allow these systems to become more efficient and predictable with respect to their response time and accuracy.
... There are also all-gather and all-reduce, where the output is shared by all processors. At a higher abstraction level, MapReduce (Dean and Ghemawat, 2008), a functional programming model in which a "map" function transforms each datum into a key-value pair, and a "reduce" function aggregates the results, is a popular distributed data processing model. While basic implementations are provided in base R, both the map and reduce operations are easy to parallelize. ...
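The model described here, a "map" function emitting key-value pairs and a "reduce" function aggregating the values collected per key, can be sketched in plain Python (a minimal single-process illustration of the programming model, not a distributed implementation; real frameworks parallelize both phases):

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # Map phase: transform each datum into zero or more key-value pairs.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce phase: aggregate the list of values collected for each key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# The canonical word-count example.
docs = ["the quick brown fox", "the lazy dog"]
counts = map_reduce(
    docs,
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda key, values: sum(values),
)
```

Grouping by key before reduction stands in for the shuffle step that a distributed runtime performs between the two phases.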
... Due to the scale of the evaluations, we use an internal MapReduce [34] based C++ implementation to calculate the Diarization Error Rate (DER) reported in Section 3.3. Thus the DER numbers reported in this paper may have some discrepancies with numbers computed with other libraries such as pyannote.metrics ...
Preprint
While recent research advances in speaker diarization mostly focus on improving the quality of diarization results, there is also an increasing interest in improving the efficiency of diarization systems. In this paper, we propose a multi-stage clustering strategy, that uses different clustering algorithms for input of different lengths. Specifically, a fallback clusterer is used to handle short-form inputs; a main clusterer is used to handle medium-length inputs; and a pre-clusterer is used to compress long-form inputs before they are processed by the main clusterer. Both the main clusterer and the pre-clusterer can be configured with an upper bound of the computational complexity to adapt to devices with different constraints. This multi-stage clustering strategy is critical for streaming on-device speaker diarization systems, where the budgets of CPU, memory and battery are tight.
... State-of-the-art Big Data frameworks that implement the MapReduce paradigm [13] are known to implement data locality optimizations. General Big Data architectures can thus efficiently co-locate map and reduce tasks with input data, effectively reducing the network overhead and thus increasing application throughput. ...
Preprint
Full-text available
Real-time Big Data architectures evolved into specialized layers for handling data streams' ingestion, storage, and processing over the past decade. Layered streaming architectures integrate pull-based read and push-based write RPC mechanisms implemented by stream ingestion/storage systems. In addition, stream processing engines expose source/sink interfaces, allowing them to decouple these systems easily. However, open-source streaming engines leverage workflow sources implemented through a pull-based approach, continuously issuing read RPCs towards the stream ingestion/storage, effectively competing with write RPCs. This paper proposes a unified streaming architecture that leverages push-based and/or pull-based source implementations for integrating ingestion/storage and processing engines that can reduce processing latency and increase system read and write throughput while making room for higher ingestion. We implement a novel push-based streaming source by replacing continuous pull-based RPCs with one single RPC and shared memory (storage and processing handle streaming data through pointers to shared objects). To this end, we conduct an experimental analysis of pull-based versus push-based design alternatives of the streaming source reader while considering a set of stream benchmarks and microbenchmarks and discuss the advantages of both approaches.
... As the processing of webpages was done in parallel, pages were randomly distributed into shards for processing [57]. We filter out snippets >1024 characters in length. ...
Article
Full-text available
Physicians write clinical notes with abbreviations and shorthand that are difficult to decipher. Abbreviations can be clinical jargon (writing “HIT” for “heparin induced thrombocytopenia”), ambiguous terms that require expertise to disambiguate (using “MS” for “multiple sclerosis” or “mental status”), or domain-specific vernacular (“cb” for “complicated by”). Here we train machine learning models on public web data to decode such text by replacing abbreviations with their meanings. We report a single translation model that simultaneously detects and expands thousands of abbreviations in real clinical notes with accuracies ranging from 92.1%-97.1% on multiple external test datasets. The model equals or exceeds the performance of board-certified physicians (97.6% vs 88.7% total accuracy). Our results demonstrate a general method to contextually decipher abbreviations and shorthand that is built without any privacy-compromising data.
... The CN-SSDs offer opportunities to analyze and gain insight into intermediate data to optimize the HPC applications. For instance, various HPC applications perform computation and analysis in the Map-Reduce <key, value> style [27] to obtain insights into the data generated during simulations. Therefore, by adopting the KV interface, HPC applications gain immediate access to perform queries on the intermediate data and get the specific insights required. ...
Conference Paper
High-performance computing (HPC) facilities employ a flash-based storage tier near compute nodes to absorb the high I/O demand of HPC applications during periodic system-level checkpoints. To accelerate these checkpoints, proxy-based distributed key-value stores (PD-KVS) have gained particular attention for their flexibility in supporting multiple backends and different network configurations. PD-KVS rely internally on monolithic KVS, such as LevelDB or RocksDB, to exploit the KV interface and query support. However, PD-KVS are unaware of the high redundancy factor in checkpoint data, which can reach GBs to TBs, and therefore tend to generate high write and space amplification on these storage layers. In this paper, we propose DENKV, a deduplication-extended node-local LSM-tree-based KVS. DENKV employs asynchronous partially inline dedup (APID) and aims to maintain the performance characteristics of LSM-tree-based KVS while reducing the write and space amplification problems. We implemented DENKV atop BlobDB and show that our proposed solution maintains performance while reducing write amplification by up to 2× and space amplification by 4× on average.
... Datalog combined logic programming with DBs and greatly advanced query optimization techniques, especially for recursive queries [3]. Currently, concepts from functional PLs are finding a home in the analysis of Big Data using Map/Reduce; Map and Reduce are well-known higher-order functions [4]. Firstclass support for functional motifs has also contributed to integrating PLs and DB query languages using systems like LINQ [5]. ...
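As this excerpt notes, Map and Reduce are classic higher-order functions; Python's built-in `map` and `functools.reduce` illustrate the same motifs that the MapReduce paradigm lifts to a cluster setting (a language-level illustration only):

```python
from functools import reduce

# map: apply a function to every element of a collection.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))

# reduce: fold the collection into a single aggregate value.
total = reduce(lambda acc, x: acc + x, squares, 0)
# 1 + 4 + 9 + 16 == 30
```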
Thesis
Full-text available
MultiverseJava supports sequenced semantics for timestamped values in a Java program. Programmers currently have to resort to ad hoc methods to implement sequenced semantics in Java programs; hence, a better approach is needed. We show how MultiverseJava can be implemented using a MultiverseJava-to-Java translation. The translation layer weaves support for computing with timestamped values into a Java program. This thesis describes the MultiverseJava architecture, the translation layer, semantic templates, and experiments to quantify the cost of MultiverseJava.
... However, this section will concentrate on two paradigms. The first paradigm is MapReduce [25] proposed by Google to process large-scale data. In particular, we will be looking at Apache ...
Thesis
Recent decades have seen exponential growth in data acquisition, attributed to advancements in edge device technology. Factory controllers, smart home appliances, mobile devices, medical equipment, and automotive sensors are a few examples of edge devices capable of collecting data. Traditionally, these devices were limited to data collection and transfer, while decision-making capabilities were missing. However, with advancements in microcontroller and processor technologies, edge devices can now perform complex tasks. This opens avenues for pushing the training of machine learning models to edge devices, also known as learning-at-the-edge. Furthermore, these devices operate in a distributed environment constrained by high latency, slow connectivity, privacy, and sometimes time-critical applications. Traditional distributed machine learning methods are designed to operate in a centralized manner, assuming data is stored on cloud storage. The operating environment of edge devices makes transferring data to cloud storage impractical, rendering centralized approaches unsuitable for training machine learning models on edge devices. Decentralized machine learning techniques are designed to enable learning-at-the-edge without requiring data to leave the edge device. The main principle in decentralized learning is to build consensus on a global model among distributed devices while keeping communication requirements as low as possible. The consensus-building process requires averaging local models to reach a global model agreed upon by all workers. Exact averaging schemes reach global consensus quickly but are communication-inefficient. Decentralized approaches employ in-exact averaging schemes that generally reduce communication by communicating only in the immediate neighborhood.
However, in-exact averaging introduces variance in each worker's local values, requiring extra iterations to reach a global solution. This thesis addresses the problem of learning-at-the-edge devices, which is generally referred to as decentralized machine learning or Edge Machine Learning. More specifically, we will focus on the Decentralized Parallel Stochastic Gradient Descent (DPSGD) learning algorithm, which can be formulated as a consensus-building process among distributed workers or fast linear iteration for decentralized model averaging. The consensus-building process in decentralized learning depends on the efficacy of in-exact averaging schemes, which have two main factors, i.e., convergence time and communication. Therefore, a good solution should keep communication as low as possible without sacrificing convergence time. An in-exact averaging solution consists of a connectivity structure (topology) between workers and weightage for each link. We formulate an optimization problem with the objective of finding an in-exact averaging solution that can achieve fast consensus (convergence time) among distributed workers keeping the communication cost low. Since direct optimization of the objective function is infeasible, a local search algorithm guided by the objective function is proposed. Extensive empirical evaluations on image classification tasks show that the in-exact averaging solutions constructed through the proposed method outperform state-of-the-art solutions. Next, we investigate the problem of learning in a decentralized network of edge devices, where a subset of devices are close to each other in that subset but further apart from other devices not in the subset. Closeness specifically refers to geographical proximity or fast communication links. 
We proposed a hierarchical two-layer sparse communication topology that localizes dense communication among a subgroup of workers and builds consensus through a sparse inter-subgroup communication scheme. We also provide empirical evidence that the proposed solution scales better on machine learning tasks than competing methods. Finally, we address scalability issues of a pairwise ranking algorithm that forms an important class of problems in online recommender systems. The existing solutions based on a parallel stochastic gradient descent algorithm define a static model-parameter partitioning scheme, creating an imbalance of work distribution among distributed workers. We propose a dynamic block partitioning and exchange strategy for the model parameters, resulting in work balance among distributed workers. Empirical evidence on publicly available benchmark datasets indicates that the proposed method scales better than the static block-based methods and outperforms competing state-of-the-art methods.
... Sparked by the success of large-scale distributed and parallel computing platforms such as MapReduce [DG08], Hadoop [Whi12], Dryad [IBY + 07], and Spark [ZCF + 10], there has been increasing interest in developing theoretically sound algorithms for these settings. The Massively Parallel Computation (MPC) model has emerged as the de facto standard theoretical abstractions for parallel computation in such settings. ...
Preprint
This paper presents an $O(\log\log \bar{d})$ round massively parallel algorithm for $1+\epsilon$ approximation of maximum weighted $b$-matchings, using near-linear memory per machine. Here $\bar{d}$ denotes the average degree in the graph and $\epsilon$ is an arbitrarily small positive constant. Recall that $b$-matching is the natural and well-studied generalization of the matching problem where different vertices are allowed to have multiple (and differing number of) incident edges in the matching. Concretely, each vertex $v$ is given a positive integer budget $b_v$ and it can have up to $b_v$ incident edges in the matching. Previously, there were known algorithms with round complexity $O(\log\log n)$, or $O(\log\log \Delta)$ where $\Delta$ denotes maximum degree, for $1+\epsilon$ approximation of weighted matching and for maximal matching [Czumaj et al., STOC'18, Ghaffari et al. PODC'18; Assadi et al. SODA'19; Behnezhad et al. FOCS'19; Gamlath et al. PODC'19], but these algorithms do not extend to the more general $b$-matching problem.
... Thanks to advances in communications, it is now possible to build frameworks of several computing machines able to execute tasks collaboratively at the same time, according to the divide et impera paradigm. Among these, we mention MapReduce (Dean and Ghemawat, 2008) and Spark (Zaharia et al., 2012), which are used intensively in clustering and help to obtain results from classical clustering algorithms quickly. Indeed, enhanced versions of K-means, such as PKMeans (Zhao et al., 2009) and SOKM (Zayani et al., 2016), have been proposed. ...
Thesis
Clustering reveals all its interest when the data set size increases considerably, since there is the opportunity to discover tiny but possibly high-value clusters, which cannot be detected with moderate sample sizes. However, the clustering of such high data volumes encounters computational limitations, requiring extremely high memory and computational resources. Thus, current clustering algorithms need frugal implementations, also demanded by institutions and industries to comply with today's eco-friendly policies. In this context, Gaussian model-based clustering, a popular clustering technique based on Gaussian mixtures, has required frugal adaptations to overcome these computational limitations and to achieve, even in the huge-data case, the same good performance obtained in moderate-size analyses. Such implementations are essentially based on subsampling strategies, which manage to be frugal but are expected to fail badly in the highly imbalanced cluster case. Thus, in this work, we propose a frugal technique, based on a so-called bin-marginal data compression, to perform Gaussian model-based clustering on huge and imbalanced data sets. After a preliminary analysis on simple univariate settings revealing the potential of our solution (here, based on univariate binned data), we extend our proposal to multivariate data sets, where bin-marginal data are employed to perform a drastic reduction of the data volume. Despite this extreme loss of information, we prove an identifiability property for the diagonal mixture model and we also introduce a specific EM-like algorithm associated with a composite likelihood approach guaranteeing frugality. Numerical experiments highlight that the proposed method outperforms subsampling both in controlled simulations and in various real applications where imbalanced clusters may typically appear, such as image segmentation, hazardous asteroid recognition and fraud detection.
Then, additional topics regarding model choice, the problem of local maxima and the impact of our data compression on clustering are dealt with from a purely experimental point of view. Finally, through a collaboration with a company specialized in predictive maintenance, a practical application of anomaly detection on real time series is shown, in order to extend the potential application domains of the proposal.
... The proposed 3D video signature provides high accuracy as well as strong robustness against many video transformations. The second important element of our strategy is the distributed index, which indexes high-dimensional multimedia objects [6]. The distributed index is implemented with the MapReduce framework and can therefore scale elastically to adjust the amount of computing resources while providing high accuracy. ...
Article
The illegal distribution of copyrighted objects, uploaded by visitors to online hosting sites, can result in significant loss of revenue for content creators. The systems needed to detect copies of multimedia objects require substantial time and effort to build. We propose a design for large-scale multimedia content protection systems. We focus on content-based copy detection, in which signatures are extracted from the original objects and copies are recognized by matching against them. Our system discovers illegally made copies of multimedia objects on the Internet. Our design enables rapid deployment of content protection systems, as it relies on cloud infrastructure that provides computing as well as software resources. It includes two novel components: a method for generating signatures of 3D videos, and a distributed matching engine for high-dimensional multimedia objects.
... MapReduce [42] is a framework for processing large data using a cluster that consists of multiple commodity machines. While distributed-memory algorithms are limited to moderatesized graphs, MapReduce is suitable for handling enormous graphs as it processes data in an I/ O efficient manner on a distributed file system. ...
Article
Full-text available
With a cluster of commodity hardware, how can we efficiently find all connected components of an enormous graph containing hundreds of billions of nodes and edges? The problem of finding connected components has been used in various applications such as pattern recognition, reachability indexing, graph compression, graph partitioning, and random walk. Several studies have been proposed to efficiently find connected components in various environments. Most existing single-machine and distributed-memory algorithms are limited in scalability as they have to load all data generated during the process into the main memory; they require expensive machines with vast memory capacities to handle large graphs. Several MapReduce algorithms try to handle large graphs by exploiting distributed storage but fail due to data explosion problems, which is a phenomenon that significantly increases the size of data as the computation proceeds. The latest MapReduce algorithms resolve the problem by proposing two distinguishing star-operations and executing them alternately, while the star-operations still cause massive network traffic as a star-operation is a distributed operation that connects each node to its smallest neighbor. In this paper, we unite the two star-operations into a single operation, namely UniStar, and propose UniCon, a new distributed algorithm for finding connected components in enormous graphs using UniStar. The partition-aware processing of UniStar effectively resolves the data explosion problems. We further optimize UniStar by filtering dispensable edges and exploiting a hybrid data structure. Experimental results with a cluster of 10 cheap machines each of which is equipped with Intel Xeon E3-1220 CPU (4-cores at 3.10GHz), 16GB RAM, and 2 SSDs of 1TB show that UniCon is up to 13 times faster than competitors on real-world graphs. 
UniCon succeeds in processing a tremendous graph with 129 billion edges, which is up to 4096 times larger than graphs competitors can process.
... Association rule analysis (Agapito et al (2013)), which is a workflow for association rule analysis between genome variations and clinical conditions of a group of patients; Trajectory mining (Altomare et al (2017)) for discovering patterns and rules from trajectory data of vehicles in a wide urban scenario; Political polarization (Belcastro et al (2020a)) that exploits a workflow combining multiple machine learning algorithms to estimate the polarization of social media users on political events, which are characterized by the competition of different factions or parties. In some cases, the workflow formalism has been integrated with other programming models, such as MapReduce (Dean and Ghemawat (2008)), to exploit the inherent parallelism of the application in presence of Big Data. As an example, in (Belcastro et al (2015b)) a workflow management system has been integrated with MapReduce for implementing a scalable predictor of flight delays due to weather conditions (Belcastro et al (2016)). ...
... Rail-RNA is built on the MapReduce programming model [30], which uses an economy of fundamental abstractions to promote scalable cluster computing. A problem is divided into a sequence of computation and aggregation steps. ...
Preprint
Full-text available
RNA sequencing (RNA-seq) experiments now span hundreds to thousands of samples. Current spliced alignment software is designed to analyze each sample separately. Consequently, no information is gained from analyzing multiple samples together, and it is difficult to reproduce the exact analysis without access to original computing resources. We describe Rail-RNA, a cloud-enabled spliced aligner that analyzes many samples at once. Rail-RNA eliminates redundant work across samples, making it more efficient as samples are added. For many samples, Rail-RNA is more accurate than annotation-assisted aligners. We use Rail-RNA to align 667 RNA-seq samples from the GEUVADIS project on Amazon Web Services in under 16 hours for US\$0.91 per sample. Rail-RNA produces alignments and base-resolution bigWig coverage files, ready for use with downstream packages for reproducible statistical analysis. We identify expressed regions in the GEUVADIS samples and show that both annotated and unannotated (novel) expressed regions exhibit consistent patterns of variation across populations and with respect to known confounders. Rail-RNA is open-source software available at http://rail.bio.
... Users specify the computation in terms of map and reduce constructs (typical of the functional programming paradigm), and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. The MapReduce paradigm of parallel programming provides simplicity while at the same time offering load balancing and fault tolerance [25]. With respect to the scalable implementation of Rough Sets, the work of Zhang et al. [26] provides a parallel method for computing Rough Set approximations based on the MapReduce technique in order to deal with massive data. ...
Article
Full-text available
In complex environments, decision-making processes depend more and more on gathering, processing and analysis of huge amounts of data, often produced at different velocities and in different formats by distributed sensors (human or automatic). Such streams of data also suffer from imprecision and uncertainty. On the other hand, Three-way Decision is considered a suitable approach for data analysis based on tri-partitioning the universe of discourse, i.e., exploiting the notions of acceptance, rejection and non-commitment, much as the human brain does to solve numerous problems. If the application scenario foresees the processing of data streams, the analysis task can be accomplished using the stream computing paradigm, one of the most important paradigms in Big Data. With such a paradigm, data arrives, is processed and departs in real time without needing to be temporarily serialized into a storage system. This work analyzes the implementation of the Three-Way Decision approach, based on Rough Set Theory, on a real-time data processing platform supporting streaming computing, i.e., Apache Spark.
Chapter
The tremendous popularity of web-based social media is attracting the attention of the industry to take profit from the massive availability of sentiment data, which is considered of a high value for Business Intelligence (BI). So far, BI has been mainly concerned with corporate data with little or null attention to the external world. However, for BI analysts, taking into account the Voice of the Customer (VoC) and the Voice of the Market (VoM) is crucial to put in context the results of their analyses. Recent advances in Sentiment Analysis have made possible to effectively extract and summarize sentiment data from these massive social media. As a consequence, VoC and VoM can be now listened from web-based social media (e.g., blogs, reviews forums, social networks, and so on). However, new challenges arise when attempting to integrate traditional corporate data and external sentiment data. This paper deals with these issues and proposes a novel semantic data infrastructure for BI aimed at providing new opportunities for integrating traditional and social BI. This infrastructure follows the principles of the Linked Open Data initiative.
Article
Full-text available
Several complex scientific simulations process large amounts of distributed and heterogeneous data. These simulations are commonly modeled as scientific workflows and require High Performance Computing (HPC) environments to produce results in a timely manner. Although scientists already benefit from clusters and clouds, new hardware, such as General Purpose Graphical Processing Units (GPGPUs), can be used to speed up the execution of the workflow. Clouds also provide virtual machines (VMs) with GPU capabilities that can be used, thus becoming hybrid clouds. This way, many workflows can be modeled considering programs that execute on GPUs, CPUs or both. A problem that arises is how to schedule workflows with variant activities (that can be executed on CPU, GPU or both) in this hybrid environment. Although existing workflow management systems (WfMS) can execute on GPGPUs and clouds independently, they do not provide mechanisms for scheduling workflows with variant activities in this hybrid environment. In fact, reducing the makespan and the financial cost of variant workflows in hybrid clouds may be a difficult task. In this article, we present a scheduling strategy for variant GPU-accelerated workflows in clouds, named PROFOUND, which schedules activations (atomic tasks) to a set of CPU and GPU/CPU VMs based on provenance data (historical data). PROFOUND is based on a combination of a mathematical formulation and a heuristic, and aims at minimizing not only the makespan but also the financial cost involved in the execution. To evaluate PROFOUND, we used a set of benchmark instances based on synthetic and real scenarios gathered from different workflow traces. The experiments show that PROFOUND is able to solve the referred scheduling problem.
Article
Full-text available
Edge computing is a paradigm that brings computation and data storage closer to the location where they are needed, to improve response times and save bandwidth. It applies virtualization technology that makes it easier to deploy and run a wider range of applications on edge servers and to take advantage of largely unused computational resources. This article describes the design and formalization of Hive, a distributed shared memory model that can be transparently integrated with JavaScript using a standard out-of-the-box runtime. To define such a model, a formal definition of the JavaScript language was used and extended to include modern capabilities and custom semantics. This extended model is used to prove that the distributed shared memory can operate on top of existing and unmodified web browsers, allowing the use of any computer or smartphone as part of the distributed system. The proposed model guarantees the eventual synchronization of data across the whole system and provides the possibility of stricter consistency using standard HTTP operations.
Article
Technological advancement plays a major role in this era of a digital world of growing data. Hence, there is a need to analyse the data so as to make good decisions. In the domain of data analytics, clustering is one of the significant tasks. A main difficulty in MapReduce is the clustering of massive datasets. Within a computing cluster, MapReduce, associated with parallel and distributed methods, serves as a main programming model. In this work, a MapReduce-based Firefly algorithm known as MR-FF is proposed for clustering data. It is implemented using a MapReduce model within the Hadoop framework and enhances the clustering task by reducing the sum of Euclidean distances between every data instance and its cluster centroid. The experimental results show that the proposed algorithm performs better when dealing with gigantic data while maintaining the quality of the clustering.
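The clustering objective described above, the sum of Euclidean distances between each data instance and the centroid of its cluster, can be written in a few lines (a generic sketch of the objective only, not the MR-FF implementation; points and centroids are hypothetical):

```python
import math

def clustering_cost(points, centroids):
    # Sum over all points of the Euclidean distance to the nearest centroid;
    # this is the quantity a clustering algorithm such as MR-FF minimizes.
    return sum(min(math.dist(p, c) for c in centroids) for p in points)

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
centroids = [(0.5, 0.0), (10.0, 10.0)]
cost = clustering_cost(points, centroids)  # 0.5 + 0.5 + 0.0
```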
Article
Full-text available
In the treatment of ischemic stroke, timely and efficient recanalization of occluded brain arteries can successfully salvage the ischemic brain. Thrombolysis is the first-line treatment for ischemic stroke, and machine learning models have the potential to select the patients who would benefit most from it. In this study, we identified 29 previous machine learning models, reviewed them for accuracy and feasibility, and proposed corresponding improvements. Regarding accuracy, many previous studies lacked long-term outcomes, consideration of treatment options, and advanced radiological features in their model conceptualization. Regarding interpretability, most previous studies chose restrictive models for the sake of high interpretability and did not consider processing time. In the future, model conceptualization could be improved through comprehensive neurological domain knowledge, while feasibility requires elaborate computer science algorithms that increase the interpretability of flexible models and shorten the processing time of pipelines interpreting medical images.
Article
Full-text available
XML is a semi-structured data description format that, owing to its wide adoption and growing data volumes, is also relevant as an input format for Big Data processing. This article therefore addresses the use of complex XML-based data structures as input for Big Data applications. When large, complex XML structures containing various XML types within a single file are processed with, for example, Apache Hadoop, reading the data can dominate an application's runtime. Our approach optimizes the input phases by keeping intermediate processing results in main memory, which in some cases reduces the processing effort considerably. Using a case study from the music industry, where standardized XML-based formats such as the DDEX format are used, we show experimentally that processing with our approach is significantly more efficient than the classical processing of file contents.
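The core idea above, parse once and keep the intermediate representation in memory so later passes skip the expensive input phase, can be sketched in a few lines. This is a single-machine illustration of the caching idea only, not the paper's Hadoop-based system; the sample DDEX-like structure is invented:

```python
import xml.etree.ElementTree as ET
from functools import lru_cache

@lru_cache(maxsize=None)
def parsed(xml_text):
    """Parse an XML document once; repeated calls reuse the in-memory tree."""
    return ET.fromstring(xml_text)

XML = ("<releases>"
       "<release id='1'><title>A</title></release>"
       "<release id='2'><title>B</title></release>"
       "</releases>")

# Two passes over the same input: only the first pays the parsing cost.
titles = [r.findtext("title") for r in parsed(XML)]
ids = [r.get("id") for r in parsed(XML)]
assert parsed.cache_info().hits == 1   # second pass served from memory
print(titles, ids)
```

The same principle applied inside a Hadoop input phase amortizes the cost of interpreting complex XML types across the jobs that consume them.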
Chapter
For many years, the Health Information System (HIS) has been an integral part of the day-to-day operations of healthcare organizations such as clinical centers and hospitals. An HIS collects, stores, and manages healthcare data from different sources related to patients' electronic medical records and other daily operational activities. It analyzes its database to create reports that help healthcare organizations improve patient outcomes, quality of service, and treatment cost efficiency, and that support healthcare policy decisions. Over time, HIS data has grown enormously, overloading the data analytics module, degrading overall system performance, and ultimately resulting in denial of service. This paper presents a case study of applying big data technology to improve the performance of the data analytics modules in a traditional HIS.
Article
Predictive maintenance (PdM) aims to reduce costs and thereby increase the competitive strength of enterprises. It uses sensor data together with analytics techniques to optimize the schedule of maintenance interventions. Applying such a maintenance strategy requires the cooperation of several agents and involves knowledge and skills in distinct fields, since it ranges from capturing relevant signals on the shop floor to their processing, transmission, storage, and analysis in order to extract meaningful knowledge. PdM is a broad topic, making it impossible to address all its subtopics in a single paper. With this in mind, this paper focuses on the main challenges that hinder the development of a generalized data-driven system for PdM: the existence of noisy or erroneous sensor data in real industrial environments; the need to collect, transmit, and process high volumes of data in a timely manner; and the fact that current approaches to PdM are specific to a part or piece of equipment rather than global. The paper connects three perspectives: anomaly detection, which allows the removal of noisy or erroneous data and the detection of relevant events that can improve prognostics methods; prognostics methods, which address the models used to forecast the condition of industrial equipment; and architectures, which may allow the deployment of anomaly detection and prognostics methods in real time and in different industrial scenarios. Furthermore, the latest trends, current challenges, and opportunities of each perspective are discussed throughout the paper.
Chapter
With recent advancements in information technology and the growing number of electronic devices, data is evolving at a very rapid rate, and this enormous amount of data needs to be efficiently managed and utilized. The primary concern is therefore the storage, management, and exhaustive analysis of this data. Because of its increased volume, the term "data" has given way to "Big Data". Big Data possesses certain properties, namely Volume, Variety, Velocity, Veracity, Value, and Variability, often referred to as the 6 V's of Big Data. Unlike data in a simple relational database system, Big Data comes in various formats because it is generated and collected from many different sources; it need not be structured in a tabular format, and most of the data generated these days is either semi-structured or unstructured. Traditional database systems are therefore no longer sufficient to handle it, and new methods must be adopted for such varied forms of data. This chapter gives a brief introduction to the concepts of Big Data, its components, and the various forms of data generated by different devices. It also introduces the distributed processing of Big Data, which is done using Hadoop and MapReduce.
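The distributed processing model the chapter introduces is easiest to see in the canonical word-count example. Below is a single-machine sketch of the map, shuffle, and reduce phases; in a real Hadoop job each phase runs distributed across the cluster, and the function names here are illustrative:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in document.lower().split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    """Reduce: sum the partial counts for one word."""
    return word, sum(counts)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
print(counts["the"], counts["fox"])   # 3 2
```

The key property is that every map call and every reduce call is independent, which is what lets the framework scale both phases horizontally over unstructured input.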
Chapter
Over the past four decades, operations research (OR) has played a key role in solving complex problems in travel. More recently, artificial intelligence (AI) has seen rapid adoption in travel that covers everything from robotic process automation, to cognitive insight, to cognitive engagement. This chapter discusses the role of AI in lodging and its potential to address travel complexity, solve a range of problems, and create new value propositions. This chapter also reviews the role of big data and blockchain technology in travel. Leveraging big data and blockchain is reviewed with a series of examples.
Preprint
Full-text available
Motivation: Although it circumvents the hyperparameter estimation of ordinary differential equation (ODE) based models and the complexities of many other models, the computational time complexity of a fuzzy logic regulatory model inference problem, particularly at higher orders of interaction, quickly approaches that of computationally intractable problems. This undermines the benefits inherent in the simplicity and strength of the fuzzy logic-based molecular regulatory inference approach.
Results: For a sample inference problem (molecular regulation of vorinostat resistance in the HCT116 colon cancer cell line), the "multistaged-hyperparallel" optimization approach we modeled, designed, and implemented significantly shortened the time to model inference from about 485.6 hours (20.2 days) to approximately 9.6 hours (0.4 days), compared to an optimized version of a previous implementation.
Availability: The multistaged-hyperparallel method is implemented as a plugin in the JFuzzyMachine tool, freely available at the GitHub repository locations https://github.com/paiyetan/jfuzzymachine and https://github.com/paiyetan/jfuzzymachine/releases/tag/v1.7.21. Source codes and binaries are freely available at the specified URLs.
Contact: paiyetan@gmu.edu