Conference Paper

Map-reduce-merge: Simplified relational data processing on large clusters

Authors: Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, D. Stott Parker

Abstract

Map-Reduce is a programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of many real-world tasks such as data processing for search engines and machine learning. However, this model does not directly support processing multiple related heterogeneous datasets. While processing relational data is a common need, this limitation causes difficulties and/or inefficiency when Map-Reduce is applied on relational operations like joins. We improve Map-Reduce into a new model called Map-Reduce-Merge. It adds to Map-Reduce a Merge phase that can efficiently merge data already partitioned and sorted (or hashed) by map and reduce modules. We also demonstrate that this new model can express relational algebra operators as well as implement several join algorithms.
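As a rough illustration of the model described in the abstract, the following minimal Python sketch mimics the three-phase flow for an equi-join of two toy tables. The phase signatures, table contents, and helper names are illustrative assumptions, not the paper's actual API.

```python
# Minimal single-process sketch of a Map-Reduce-Merge style flow.
# The phase signatures and table contents are illustrative, not the paper's API.
from collections import defaultdict

employees   = [("e1", "d1", "Alice"), ("e2", "d2", "Bob")]   # (emp_id, dept_id, name)
departments = [("d1", "Sales"), ("d2", "Engineering")]       # (dept_id, dept_name)

def map_reduce(records, map_fn):
    """Map each record to (key, value), then group by key (identity reducer)."""
    groups = defaultdict(list)
    for rec in records:
        k, v = map_fn(rec)
        groups[k].append(v)
    # reduce: emit (key, value_list), as Map-Reduce-Merge's reducer does
    return dict(groups)

# Two independent map/reduce lineages, both keyed (hence partitioned) by dept_id.
emp_by_dept = map_reduce(employees,   lambda r: (r[1], r[2]))
dept_by_id  = map_reduce(departments, lambda r: (r[0], r[1]))

def merge(left, right):
    """Merge phase: combine two reduced outputs that share a partition key (equi-join)."""
    for key in left.keys() & right.keys():
        for lv in left[key]:
            for rv in right[key]:
                yield (key, lv, rv)

print(list(merge(emp_by_dept, dept_by_id)))
# e.g. [('d1', 'Alice', 'Sales'), ('d2', 'Bob', 'Engineering')] (order may vary)
```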


... It consists of a framework. The Hadoop distributed file system and Map-Reduce are receiving a lot of attention due to their high usability, as confirmed in real cases, and studies for further performance improvement are being actively conducted [6][7][9]. The Map-Reduce framework is composed of Map and Reduce functions, which are commonly used in functional programming. In the map stage, chunks (data blocks) are read and, according to the function definition, the processed data is transformed into key-value form; in the reduce stage, the results of the map stage are merged and output. ...
... Hadoop, a representative framework for processing such large amounts of data, supports distributed applications that can process large amounts of data. Hadoop is a technology that implements the Hadoop distributed file system after the Google File System (GFS) [9][16][18], the underlying storage system used at Google, and Map-Reduce, a distributed data processing framework built on top of it. Map-Reduce processes data by combining two functions, Map and Reduce. ...
Article
Full-text available
Recently, research and utilization of distributed storage and processing systems for
... Solutions to this issue have been proposed by many researchers [11][12][13]. In Section 6, we provide a survey of some of the relevant literature proposals in the field. ...
... To identify the optimized schedules for job sequences, a data transformation graph is used to represent all the possible jobs' execution paths: then, the well-known Dijkstra's shortest path algorithm is used to determine the optimized schedule. An extra MapReduce phase named "merge" is introduced in [13]. It is executed after map and reduce phases and extends the MapReduce model for heterogeneous data. ...
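For intuition, here is a hedged Python sketch of the scheduling idea summarized above: candidate jobs form a weighted data transformation graph and Dijkstra's algorithm picks the cheapest execution path. The node names, edge costs, and function name are made up for illustration.

```python
# Hypothetical sketch: choosing an optimized MapReduce job sequence with Dijkstra.
# Edges represent candidate jobs with estimated costs; all values are invented.
import heapq

graph = {
    "raw":      [("filtered", 4), ("sampled", 2)],
    "filtered": [("joined", 3)],
    "sampled":  [("joined", 6)],
    "joined":   [("report", 1)],
    "report":   [],
}

def cheapest_schedule(graph, source, target):
    """Return (total_cost, path) of the cheapest job sequence from source to target."""
    heap = [(0, source, [source])]
    seen = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == target:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in graph[node]:
            if nxt not in seen:
                heapq.heappush(heap, (cost + w, nxt, path + [nxt]))
    return float("inf"), []

print(cheapest_schedule(graph, "raw", "report"))
# (8, ['raw', 'filtered', 'joined', 'report'])
```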
Article
Full-text available
In the past twenty years, we have witnessed an unprecedented production of data worldwide that has generated a growing demand for computing resources and has stimulated the design of computing paradigms and software tools to efficiently and quickly obtain insights from such Big Data. State-of-the-art parallel computing techniques such as MapReduce guarantee high performance in scenarios where involved computing nodes are equally sized and clustered via broadband network links, and the data are co-located with the cluster of nodes. Unfortunately, the mentioned techniques have proven ineffective in geographically distributed scenarios, i.e., computing contexts where nodes and data are geographically distributed across multiple distant data centers. In the literature, researchers have proposed variants of the MapReduce paradigm that obtain awareness of the constraints imposed in those scenarios (such as the imbalance of nodes' computing power and of interconnecting links) to enforce smart task scheduling strategies. We have designed a hierarchical computing framework in which a context-aware scheduler orchestrates computing tasks that leverage the potential of the vanilla Hadoop framework within each data center taking part in the computation. In this work, after presenting the features of the developed framework, we advocate the opportunity of fragmenting the data in a smart way so that the scheduler produces a fairer distribution of the workload among the computing tasks. To prove the concept, we implemented a software prototype of the framework and ran several experiments on a small-scale testbed. Test results are discussed in the last part of the paper.
... For example, users can map and reduce one data set on the fly and read data from other datasets. The Map-Reduce-Merge model [35] has been implemented to enable the processing of multiple datasets and to overcome the additional processing requirements for conducting join operations in the MapReduce system. The structure of this model is shown in Figure 3. The main difference between this framework's processing model and the original MapReduce is that the Reduce function produces a key/value list instead of just the values. ...
... An overview of the Map-Reduce-Merge framework[35] ...
Thesis
One of the main challenges for the automobile industry in the digital age is to provide their customers with a reliable and ubiquitous level of connected services. Smart cars have been entering the market for a few years now to offer drivers and passengers safer, more comfortable, and entertaining journeys. All of this relies on designing, behind the scenes, computer systems that perform well while conserving resources. The performance of a Big Data architecture in the automotive industry relies on keeping up with the growing trend of connected vehicles and maintaining a high quality of service. The Cloud at Groupe PSA bears a particular load in ensuring a real-time data processing service for all the brand's connected vehicles: with 200k connected vehicles sold each year, the infrastructure is continuously challenged. Therefore, this thesis mainly focuses on optimizing resource allocation while considering the specifics of continuous flow processing applications and proposing a modular and fine-tuned component architecture for automotive scenarios. First, we go over a fundamental and essential process in Stream Processing Engines, the resource allocation algorithm. The central challenge of deploying streaming applications is mapping the operator graph, representing the application logic, to the available physical resources to improve its performance. We have targeted this problem by showing that the approach based on inherent data parallelism does not necessarily lead to the best performance for all applications. Second, we revisit the Big Data architecture and design an end-to-end architecture that meets today's demands of data-intensive applications. We report on the connected vehicle (CV) Big Data platform, particularly the one deployed by Groupe PSA. In particular, we present open-source technologies and products used in different platform components to collect, store, process, and, most importantly, exploit big data, and highlight why the Hadoop system is no longer the de-facto solution for Big Data. We end with a detailed assessment of the architecture while justifying the choices made during design and implementation.
... The pairwise distances between the n initial clusters are determined by computing the distance matrix. Using the values in the distance matrix D, the two nearest clusters i and j are identified [44]. A new cluster is created by combining clusters i and j, and then the distance matrix D is adjusted appropriately. ...
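The merge step described above can be illustrated with a short Python/NumPy sketch. A single-linkage update and the sample points are assumptions for illustration only.

```python
# Illustrative single merge step of agglomerative clustering: find the two nearest
# clusters in the distance matrix, merge them, and update the matrix (single linkage).
import numpy as np

points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
clusters = [[i] for i in range(len(points))]          # start: every point is a cluster
D = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
np.fill_diagonal(D, np.inf)

def merge_closest(D, clusters):
    """Merge the two nearest clusters i, j and return the updated matrix and clusters."""
    i, j = np.unravel_index(np.argmin(D), D.shape)
    i, j = min(i, j), max(i, j)
    clusters[i] = clusters[i] + clusters[j]           # combine clusters i and j
    del clusters[j]
    # single-linkage update: distance to the merged cluster is the minimum of the two rows
    D[i, :] = np.minimum(D[i, :], D[j, :])
    D[:, i] = D[i, :]
    D[i, i] = np.inf
    D = np.delete(np.delete(D, j, axis=0), j, axis=1)
    return D, clusters

D, clusters = merge_closest(D, clusters)
print(clusters)   # e.g. [[0, 1], [2], [3]]
```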
Article
Full-text available
For organizing and analyzing massive amounts of data and revealing hidden patterns and structures, clustering is a crucial approach. This paper examines unique strategies for rapid clustering, highlighting the problems and possibilities in this area. The paper includes a brief introduction to clustering, discussing various clustering algorithms, improvements in handling various data types, and appropriate evaluation metrics. It then highlights the unsupervised nature of clustering and emphasizes its importance in many different fields, including customer segmentation, market research, and anomaly detection. This review emphasizes ongoing efforts to address these issues through research and suggests exciting directions for future investigations. By examining the advancements, challenges, and future opportunities in clustering, this research aims to increase awareness of cutting-edge approaches and encourage additional innovations in this essential field of data analysis and pattern identification. It highlights the need for resilience to noise and outliers, domain knowledge integration, scalable and efficient algorithms, and interpretable clustering technologies. In addition to managing high-dimensional data, creating incremental and online clustering techniques, and investigating deep learning-based algorithms, the study suggests future research areas. Additionally featured are real-world applications from several sectors. Although clustering approaches have made a substantial contribution, more research is necessary to solve their limitations and fully realize their promise for data analysis.
... This approach ensures the participants' information is not leaked through encryption and collaboration, thus achieving data privacy protection. Additionally, there are other vertical federated learning algorithms based on homomorphic encryption, such as the method for implementing vertical federated logistic regression under the central federated learning framework proposed by Yang et al. [10]. This method incorporates the concept of homomorphic encryption, encrypting both parties' data and gradients during the training process to protect data privacy. ...
Article
Full-text available
In recent years, federated learning has become a hot research topic in the machine learning community. It aims to reduce the potential data security and privacy risks caused by the centralized training paradigm of traditional machine learning through local training and global aggregation. Although federated learning methods have been widely applied in numerous fields such as finance, healthcare, autonomous driving, and smart retail, there are still urgent issues to be addressed in the field of federated learning, including data privacy leakage, malicious node attacks, model security, and the trustworthiness of participants. By delving into and discussing federated learning, this paper aims to provide researchers and practitioners in related fields with a comprehensive understanding and the latest progress of this technology. Based on the characteristics of the data distribution of the parties involved in federated learning training, this paper categorizes existing federated learning methods into horizontal federated learning, vertical federated learning, and federated transfer learning. It also introduces representative federated learning algorithms under different types, including their design concepts, basic processes, and advantages and disadvantages. Combining different application scenarios, this paper further discusses the challenges of federated learning and looks forward to the future development direction of this topic.
... This proactive approach significantly reduces query latency and improves database system responsiveness (Sakr et al., 2011). Yang et al. (2007) discuss how ML-driven data retrieval techniques streamline complex queries and enhance user experience by providing faster access to required information. The continuous monitoring and adjustment of retrieval strategies enabled by ML ensure sustained efficiency as data volumes and usage patterns evolve. ...
Article
This study systematically reviews the integration of machine learning (ML) and artificial intelligence (AI) into SQL databases and big data analytics, highlighting significant advancements and emerging trends. Using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, a comprehensive review of 60 selected articles published between 2010 and 2023 was conducted. The findings reveal substantial improvements in query optimization through ML algorithms, which adapt dynamically to changing data patterns, reducing processing times and enhancing performance. Additionally, embedding ML models within SQL databases facilitates real-time predictive analytics, streamlining workflows, and improving the accuracy and speed of predictions. AI-driven security systems provide proactive and real-time threat detection, significantly enhancing data protection. The development of hybrid systems that combine relational and non-relational databases offers versatile and efficient data management solutions, addressing the limitations of traditional systems. This study confirms the evolving role of AI and ML in transforming data management practices and aligns with and extends previous research findings.
... Recently, research using cloud computing as a way to store and process large-scale sensor data is being conducted. In [11], the integration of the Internet and the sensor network was proposed, and in [12], a framework for combining the sensor network and the web-based application program using the cloud was presented. ...
Article
Full-text available
Recently, as the construction of a large-scale sensor network increases, a system for efficiently managing large-scale sensor data is required. In this paper, we propose a cloud-based sensor data management system with low cost, high scalability, and high efficiency. In the proposed system, sensor data is transmitted to the cloud through the cloud gateway, and abnormal situation detection and event processing are performed at this time. The sensor data sent to the cloud is stored in Hadoop HBase, a distributed column-oriented database, and processed in parallel through the MapReduce model-based query processing module. As the processed result is provided through REST-based web service, it can be linked with application programs of various platforms. Indexed Terms-sensor data management, cloud computing, hadoop, hbase, mapreduce.
... Phoenix (Ranger et al., 2007) used the MapReduce project for shared memory and cell BE architecture using MapReduce was discussed in Kruijf and Sankaralinga (2007). Yahoo extends MapReduce with merge to perform join operation (Yang et al., 2007). MRPSO (McNabb et al., 2007) utilises the Hadoop MapReduce for particle swarm optimisation. ...
... For example, in the Chicago crime dataset, the map function computes all the crimes against each day, and then the reduce function takes the day as the key and extracts the appropriate values (key: values). The MapReduce-Merge framework was developed by the authors of [89] by incorporating a merge operation into the MapReduce architecture. With the merge operation, the performance of MapReduce increased, relational algebra became expressible, and data could be processed within the cluster. ...
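A toy Python version of the crimes-per-day example mentioned above; the record layout is assumed rather than the actual Chicago crime schema.

```python
# Toy crimes-per-day aggregation: map emits (day, 1) per record, reduce sums per day.
from itertools import groupby
from operator import itemgetter

records = [("2024-01-01", "theft"), ("2024-01-01", "assault"), ("2024-01-02", "theft")]

def map_phase(records):
    for day, _crime in records:
        yield (day, 1)

def reduce_phase(pairs):
    pairs = sorted(pairs, key=itemgetter(0))          # stand-in for the shuffle/sort step
    for day, group in groupby(pairs, key=itemgetter(0)):
        yield (day, sum(count for _day, count in group))

print(list(reduce_phase(map_phase(records))))
# [('2024-01-01', 2), ('2024-01-02', 1)]
```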
Article
Full-text available
Competent software architecture plays a crucial role in the difficult task of big data processing for SQL and NoSQL databases. SQL databases were created to organize data and allow for vertical expansion. NoSQL databases, on the other hand, support horizontal scalability and can efficiently process large amounts of unstructured data. Organizational needs determine which paradigm is appropriate, yet selecting the best option is not always easy. Differences in database design are what set SQL and NoSQL databases apart. Each NoSQL database type also consistently employs a mixed-model approach. Therefore, it is challenging for cloud users to transfer their data among different cloud service providers (CSPs). There are several different paradigms being monitored by the various cloud platforms (IaaS, PaaS, SaaS, and DBaaS). The purpose of this SLR is to examine the articles that address cloud data portability and interoperability, as well as the software architectures of SQL and NoSQL databases. Numerous studies comparing the capabilities of SQL and NoSQL databases, particularly Oracle RDBMS and the NoSQL Document Database (MongoDB), in terms of scale, performance, availability, consistency, and sharding, were presented as part of the state of the art. Research indicates that NoSQL databases, with their specifically tailored structures, may be the best option for big data analytics, while SQL databases are best suited for online transaction processing (OLTP) purposes.
... In [20], the authors present a comparative study of classification algorithms based on the MapReduce model. The Map-Reduce-Merge model implemented in [21] simplifies relational data processing on large clusters. A number of techniques have been proposed to improve the performance of MapReduce jobs. ...
Article
Full-text available
The MapReduce algorithm is inspired by the map and reduce functions commonly used in functional programming. The use of this model is most beneficial when optimization of the distributed mappers in the MapReduce framework is taken into account. In standard mappers, each mapper operates independently and has no collaborative function or content relationship with other mappers. We propose a new technique to improve the performance of inter-process tasks in MapReduce functions. In the proposed method, the mappers are connected and collaborate through a shared coordinator with a distributed metadata store called DMDS. In this new structure, a parallel and co-evolutionary genetic algorithm is used to optimize and match the matrix processes simultaneously. The proposed method uses a genetic algorithm with a parallel and evolutionary executive structure in the mapping process of the mapper program to allocate resources and to transfer and store data. The co-evolutionary MapReduce mappers can simplify and optimize relational data processing in large clusters. MapReduce with a co-evolutionary mapper provides successful convergence and better performance. Our experimental evaluation shows that the collaborative technique improves performance, especially in large computations, and dramatically improves processing time across the MapReduce process. Even though the execution time in MapReduce varies with data volume, the proposed method incurs considerable processing overhead on low-volume data, whereas on high-volume data it shows a clear competitive advantage. In fact, as the data volume increases, the advantage of the proposed method becomes more considerable.
... However, one major obstacle around MapReduce is the low productivity of developing entire applications. Unlike high-level declarative languages such as SQL, MapReduce jobs are written in a low-level descriptive language, often requiring massive programming effort and leading to considerable difficulty in debugging [24]. ...
Preprint
Full-text available
Data management applications are rapidly growing applications that require more attention, especially in the big data era. Thus, it is critical to support these applications with novel and efficient algorithms that satisfy higher performance. Array database management systems are one way to support these applications by dealing with data represented in n-dimensional data structures. For instance, software like SciDB and RasDaMan can be powerful tools to achieve the required performance on large-scale problems with multidimensional data. Like their relational counterparts, these management systems support specific array query languages as the user interface. Further, as a popular programming model, MapReduce allows large-scale data analysis and has also been leveraged to facilitate query processing and used as a database engine. Nevertheless, one major obstacle is the low productivity of developing MapReduce applications. Unlike high-level declarative languages such as SQL, MapReduce jobs are written in a low-level descriptive language, often requiring massive programming effort and complicated debugging processes. This paper presents a system that supports translating array queries expressed in AQL (Array Query Language) in SciDB into MapReduce jobs. We focus on effectively translating some unique structural aggregations, including circular, grid, hierarchical, and sliding aggregations. Unlike the traditional aggregations in relational databases, these structural aggregations are designed explicitly for array manipulation. Thus, our work can be considered an array-view counterpart of some existing SQL-to-MapReduce translators like HiveQL/Hive and YSmart. We show that our translator can effectively support structural aggregations over arrays (or sub-arrays) to meet various array manipulations. Moreover, our translator can support user-defined aggregation functions with minimal effort from the user. We also show that our translator can generate optimized MapReduce code, leading to significantly better performance than short hand-written code by up to 10.84X.
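To make the notion of structural aggregation concrete, here is an illustrative Python/NumPy version of a sliding aggregation over a small array. The window size and the sum aggregate are assumptions for illustration, not SciDB's AQL semantics.

```python
# Sliding structural aggregation: aggregate each cell's neighborhood with a moving window.
import numpy as np

array = np.arange(16, dtype=float).reshape(4, 4)

def sliding_sum(a, radius=1):
    """Sum over a (2*radius+1)^2 window centered at each cell, zero-padded at the edges."""
    padded = np.pad(a, radius, mode="constant")
    out = np.zeros_like(a)
    for di in range(-radius, radius + 1):
        for dj in range(-radius, radius + 1):
            out += padded[radius + di : radius + di + a.shape[0],
                          radius + dj : radius + dj + a.shape[1]]
    return out

print(sliding_sum(array)[1, 1])   # 45.0 == sum of the 3x3 block around cell (1, 1)
```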
... The widely used Apache Hadoop framework is the best-known implementation of the MapReduce approach, alongside other approaches such as Dryad (Isard et al. 2007) and Map-Reduce-Merge (Yang et al. 2007) ...
Article
Full-text available
XML is a semi-structured data description format that, due to its wide adoption and growing data volumes, is also relevant as an input format for Big Data processing. This article therefore deals with the use of complex XML-based data structures as an input format for Big Data applications. When extensive, complex XML data structures containing different XML types in a single XML file are processed, for example with Apache Hadoop, reading in the data can dominate an application's runtime. Our approach addresses the optimization of the input phases by keeping intermediate processing results in main memory. This reduces the processing effort, in some cases considerably. Using a case study from the music industry, in which standardized XML-based formats such as the DDEX format are used, we show experimentally that processing with our approach is significantly more efficient than the classical processing of file contents.
... Map-Reduce-Merge [80] extends MapReduce with a merge function to facilitate expressing the join operation. Map-Join-Reduce [43] adds a join phase between the map and the reduce phases. ...
Preprint
The task of joining two tables is fundamental for querying databases. In this paper, we focus on the equi-join problem, where a pair of records from the two joined tables are part of the join results if equality holds between their values in the join column(s). While this is a tractable problem when the number of records in the joined tables is relatively small, it becomes very challenging as the table sizes increase, especially if hot keys (join column values with a large number of records) exist in both joined tables. This paper, an extended version of [metwally-SIGMOD-2022], proposes Adaptive-Multistage-Join (AM-Join) for scalable and fast equi-joins in distributed shared-nothing architectures. AM-Join utilizes (a) Tree-Join, a proposed novel algorithm that scales well when the joined tables share hot keys, and (b) Broadcast-Join, the known fastest when joining keys that are hot in only one table. Unlike the state-of-the-art algorithms, AM-Join (a) holistically solves the join-skew problem by achieving load balancing throughout the join execution, and (b) supports all outer-join variants without record deduplication or custom table partitioning. For the fastest AM-Join outer-join performance, we propose the Index-Broadcast-Join (IB-Join) family of algorithms for Small-Large joins, where one table fits in memory and the other can be up to orders of magnitude larger. The outer-join variants of IB-Join improve on the state-of-the-art Small-Large outer-join algorithms. The proposed algorithms can be adopted in any shared-nothing architecture. We implemented a MapReduce version using Spark. Our evaluation shows the proposed algorithms execute significantly faster and scale to more skewed and orders-of-magnitude bigger tables when compared to the state-of-the-art algorithms.
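One of the building blocks named in the abstract, the broadcast (map-side) join, can be sketched in a few lines of Python. The data, partitioning, and function name are illustrative assumptions, not the AM-Join implementation.

```python
# Broadcast (map-side) join sketch: the small table is shipped to every partition,
# and each partition joins its share of the large table locally.
small = {"d1": "Sales", "d2": "Engineering"}          # small table fits in memory

large_partitions = [                                   # large table, already partitioned
    [("e1", "d1"), ("e2", "d2")],
    [("e3", "d1"), ("e4", "d3")],
]

def broadcast_join(partition, small):
    """Join one partition of the large table against the broadcast small table."""
    for emp_id, dept_id in partition:
        if dept_id in small:                           # inner-join semantics
            yield (emp_id, dept_id, small[dept_id])

results = [row for part in large_partitions for row in broadcast_join(part, small)]
print(results)
# [('e1', 'd1', 'Sales'), ('e2', 'd2', 'Engineering'), ('e3', 'd1', 'Sales')]
```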
... For example, in the Chicago crime dataset, the map function computes all the crimes against each day, and then the reduce function uses the day as the key and extracts the required values (key: values). The authors of [89] added a merge operation to the MapReduce architecture and derived the MapReduce-Merge framework. With the merge operation, the performance of MapReduce improved, relational algebra became expressible, and data could be processed within the cluster. ...
Preprint
Full-text available
Context: The efficient processing of Big Data is a challenging task for SQL and NoSQL Databases, where competent software architecture plays a vital role. SQL Databases are designed for structuring data and supporting vertical scalability. In contrast, NoSQL Databases support horizontal scalability and can process sizeable unstructured data efficiently. One can choose the right paradigm according to the organisation's needs; however, making the correct choice can often be challenging. SQL and NoSQL Databases follow different architectures. Also, a mixed model is followed by each category of NoSQL Databases. Hence, data movement becomes difficult for cloud consumers across multiple cloud service providers (CSPs). In addition, each cloud platform (IaaS, PaaS, SaaS, and DBaaS) also follows various paradigms. Objective: This systematic literature review (SLR) aims to study the related articles associated with SQL and NoSQL Database software architectures and to tackle data portability and interoperability among various cloud platforms. The state of the art presents many performance comparison studies of SQL and NoSQL Databases by observing scaling, performance, availability, consistency, and sharding characteristics. According to the research studies, NoSQL Databases, with their purpose-designed structures, can be the right choice for big data analytics, while SQL Databases are suitable for OLTP workloads. Researchers have proposed numerous approaches associated with data movement in the cloud. Platform-based APIs are developed, which make users' data movement difficult. Therefore, data portability and interoperability issues are noticed during data movement across multiple CSPs. To minimize developer effort and interoperability issues, unified APIs are needed to make data movement relatively more accessible among various cloud platforms.
... Game telemetry datasets are large because they contain logs of individual player behavior. To analyze data of this scale, distributed approaches such as Map-Reduce-Merge (Yang et al. 2007) have been used to distribute and analyze datasets on a large cluster. In our approach, we analyze a subset of telemetry using a representative sampling of players. ...
Article
Video games are increasingly producing huge datasets available for analysis resulting from players engaging in interactive environments. These datasets enable investigation of individual player behavior at a massive scale, which can lead to reduced production costs and improved player retention. We present an approach for modeling player retention in Madden NFL 11, a commercial football game. Our approach encodes gameplay patterns of specific players as feature vectors and models player retention as a regression problem. By building an accurate model of player retention, we are able to identify which gameplay elements are most influential in maintaining active players. The outcome of our tool is recommendations which will be used to influence the design of future titles in the Madden NFL series.
... Theoretical Basis. The intelligent classroom teaching mode of College Chinese takes the "constructivism" educational theory as an important theoretical basis [13]. The core view of constructivist theory holds that the student should be placed at the center of teaching; meanwhile, the teacher, in the role of instructor and promoter, should make full use of learning environment elements such as situation, cooperation, and dialogue, give full play to students' enthusiasm, initiative, and creativity, and finally achieve the purpose of helping students effectively construct the meaning of the current learning content. ...
Article
Full-text available
In China, the Chinese subject is called “the mother of the encyclopedia.” Learning Chinese well is the basic condition for learning other subjects well. Therefore, Chinese education has always undertaken the important task of teaching the mother tongue and cultural inheritance and has a great influence on the development process of China’s education. Since the beginning of the 21st century, society has faced new opportunities and challenges, and mobile Internet, cloud computing, and the Internet of Things have all developed greatly. The increasingly mature technologies for data collection, analysis, and processing make it possible to evaluate students in teaching activities more comprehensively. In the current upsurge of big data, it is more and more widely used in the field of education, such as learning analysis, network education platforms, education information management platforms, education apps, and smart campuses. In this context, Chinese teaching needs more open courses, which requires us to organically integrate the existing information technology and Chinese teaching, and on this basis, integrate modern educational data sets and learning analysis technology into Chinese teaching and learning, to promote the improvement of the quality of language teaching and learning. The development of information technology in the intelligent era provides an opportunity for the establishment and application of an intelligent classroom. Based on a definition of the concept and connotation of a classroom equipped with smart devices, this paper constructs an intelligent and efficient teaching mode for such new classrooms. The proposed model is combined with the current situation and the development of classrooms equipped with smart devices. It also analyzes the difficulties and pain points of traditional College Chinese teaching. The research results can provide a reference for the implementation of the intelligent classroom teaching mode in College Chinese and other general courses in higher vocational colleges.
... Researchers have used Hadoop to implement a variety of parallel processing algorithms to efficiently handle geographical data (17,18). Multistage map and reduce algorithms, which generate on-demand indexes and retain persistent indexes, are examples of these techniques (19). ...
Article
Full-text available
In the current scenario, with a large amount of unstructured data, Health Informatics is gaining traction, allowing Healthcare Units to leverage and make meaningful insights for doctors and decision-makers with relevant information to scale operations and predict the future view of treatments via Information Systems Communication. Now, around the world, massive amounts of data are being collected and analyzed for better patient diagnosis and treatment, improving public health systems and assisting government agencies in designing and implementing public health policies, instilling confidence in future generations who want to use better public health systems. This article provides an overview of the HL7 FHIR Architecture, including the workflow state, linkages, and various informatics approaches used in healthcare units. The article discusses future trends and directions in Health Informatics for successful application to provide public health safety. With the advancement of technology, healthcare units face new issues that must be addressed with appropriate adoption policies and standards.
... With hardware upgrade, GPU is also very promising to improve the performance of join operations in a hybrid framework [10]. Existing join processing using the MapReduce programming model includes k-NN [11], Equi- [12], eta- [13], Similarity [14], Top-K [15], and filter-based joins [16]. To handle the problem that existing MapReducebased filtering methods require multiple MapReduce jobs to improve the join performance, the adaptive filter-based join algorithm is proposed [17]. ...
Article
Full-text available
Join operations of data sets play a crucial role in obtaining the relations of massive data in real life. Joining two data sets with MapReduce requires a proper design of the Map and Reduce stages for different scenarios. The factors affecting MapReduce join efficiency include the density of the data sets and data transmission over clusters like Hadoop. This study aims to improve the efficiency of MapReduce join algorithms on Hadoop by leveraging Shannon entropy to measure the information changes of data sets being joined in different MapReduce stages. To reduce the uncertainty of data sets in joins through the network, a novel MapReduce join algorithm with dynamic partition strategies called dynamic partition join (DPJ) is proposed. Leveraging the changes of entropy in the partitions of data sets during the Map and Reduce stages revises the logical partitions by changing the original input of the reduce tasks in the MapReduce jobs. Experimental results indicate that the entropy-based measures can measure entropy changes of join operations. Moreover, the DPJ variant methods achieved lower entropy compared with the existing joins, thereby increasing the feasibility of MapReduce join operations for different scenarios on Hadoop.
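A minimal sketch of the entropy measure discussed above, assuming Shannon entropy over the distribution of records across reduce partitions; the partition counts are invented.

```python
# Shannon entropy (in bits) of records spread across partitions:
# a more uniform distribution gives higher entropy, skew gives lower entropy.
import math

def partition_entropy(record_counts):
    """Entropy of the record distribution over partitions."""
    total = sum(record_counts)
    probs = [c / total for c in record_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

print(partition_entropy([25, 25, 25, 25]))   # 2.0   (perfectly balanced)
print(partition_entropy([97, 1, 1, 1]))      # ~0.24 (heavily skewed)
```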
... Many companies have come up with new frameworks to handle such data. Some examples are Google's MapReduce [2], Microsoft's Dryad [3], or Yahoo!'s Map-Reduce-Merge [4]. All of these frameworks differ in their designs but share common objectives of fault tolerance, parallel programming and optimization. ...
Article
Vigorous resource allocation is one of the most challenging problems in the area of cloud resource management and, in the last few years, has attracted a lot of attention from researchers. Improvised coextensive data processing has emerged as one of the best applications of Infrastructure-as-a-Service (IaaS) Cloud. Current data processing frameworks can be used for static and homogeneous cluster setups only. So, the resources that are allocated may be insufficient for larger parts of submitted tasks, increasing the processing cost and time unnecessarily. Due to the arcane nature of the cloud, only static allocation of resources is possible rather than dynamic. In this paper, we have proposed a generic coextensive data processing framework (ViCRA) whose working is based upon the Nephele architecture and which allows vigorous resource allocation for both task scheduling and realization. Different tasks of a processing job can be assigned to different virtual machines which are automatically initiated and halted during task realization. This framework has been developed in C#. The experimental results show that the framework is effective for exploiting vigorous cloud resource allocation for coextensive task scheduling and realization. [URL for download: http://www.ijascse.org/volume-4-theme-based-issue-7/Vigorous_cloud_resource_allocation.pdf]
... For this reason, a machine learning scheme is introduced. Moreover, for the sentiment or opinion categorization, the system must know the emotion of human behavior like delight, annoyance, fury, etc., (Yang et al. 2007). Opinion evaluation in NLP is used to classify human emotions with the use of machine. ...
Article
Full-text available
Nowadays, big data is ruling the entire digital world with its applications and facilities. Thus, to run online services in a better way, machine learning models are utilized; machine learning has become a trending field in big data, and the success of online services or businesses is based upon customer reviews. Almost every review carries a neutral, positive, or negative sentiment value. Manual classification of sentiment values is a difficult task, so a natural language processing (NLP) scheme processed with a machine learning strategy is used. Moreover, part-of-speech specification for different languages is difficult. To overcome this issue, the current research aims to develop a novel less error pruning-shortest description length (LEP-SDL) method for error pruning and an ant lion boosting model (ALBM) for opinion specification. Here, the Telugu news review dataset is adopted to perform sentiment analysis in NLP. Furthermore, the fitness function of the ant lion model in the boosting approach improves the accuracy and precision of opinion specification and also makes the classification process easier. Thus, to evaluate the competence of the projected model, it is compared with recent existing works in terms of accuracy, precision, etc., and achieves better results by obtaining high accuracy and precision of opinion specification.
... Various frameworks split computations into multiple phases: Map-Reduce-Merge [10,11] extends MapReduce to implement aggregations, Camdoop [9] assumes that an aggregation's output size is a specific fraction of input sizes, Astrolabe [12] collects large-scale system state and provides on-the-fly attribute aggregation, and so on. ...
Article
Efficient representation of data aggregations is a fundamental problem in modern big data applications, where network topologies and deployed routing and transport mechanisms play a fundamental role in optimizing desired objectives such as cost, latency, and others. In traditional networking, applications use TCP and UDP transports as a primary interface for implemented applications that hide the underlying network topology from end systems. On the flip side, to exploit network infrastructure in a better way, applications restore characteristics of the underlying network. In this work, we demonstrate that both specified extreme cases can be inefficient to optimize given objectives. We study the design principles of routing and transport infrastructure and identify extra information that can be used to improve implementations of compute-aggregate tasks. We build a taxonomy of compute-aggregate services unifying aggregation design principles, propose algorithms for each class, analyze them theoretically, and support our results with an extensive experimental study.
Chapter
This chapter discusses Big Data algorithms that are capable of processing large volumes of data by using either parallelization or streaming mode. We will look at the MapReduce algorithm for parallel processing of large amounts of data that provides a basis for many other algorithms and applications working with Big Data. The MapReduce programming model and its implementation in Hadoop were invented to address limitations of the traditional high-performance computing programming model such as MPI (Message Passing Interface) for processing very large datasets, such as web-scale datasets that cannot be processed on a single, even very big, computer (“MapReduce Tutorial, Apache Hadoop, Version 3.3.6, 13 June 2023,” [Online]. Available: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html). MapReduce is specifically oriented toward using commodity computer clusters. Hadoop is an Open Source implementation of MapReduce. It is the main computing platform for web data processing by web search engines such as Google, Bing, Yahoo!, LinkedIn, and others. We will look at the Apache Hadoop platform and ecosystem that provides a generic implementation of MapReduce and includes a wide range of applications, libraries and software packages to store, process and visualize Big Data. Hadoop became a de-facto standard and ultimate platform for building highly scalable Big Data applications. The Hadoop platform is provided by all big cloud providers. The chapter also provides an overview of the Hadoop Distributed File System (HDFS), Apache Hive and Apache Pig, which are important components of the Hadoop ecosystem specifically designed for storing and processing Big Data.
Article
Full-text available
Wireless charging technology (WCT) for electric vehicles (EVs) has gained significant attention as a promising alternative to traditional plug-in charging systems due to its convenience and efficiency. This paper systematically reviews 100 peer-reviewed studies on the advancements in WCT, focusing on wireless power transfer (WPT) methods such as inductive and resonant coupling, and the critical engineering challenges that limit widespread adoption. Key issues identified include power transfer efficiency, misalignment between the vehicle and charging pad, and electromagnetic interference (EMI). Additionally, this review explores the infrastructure and scalability challenges of implementing WCT in urban environments and highways, including the potential of dynamic wireless charging systems, which allow EVs to charge while in motion. Despite recent innovations, such as adaptive control systems and advanced coil designs, gaps remain in the research on long-term feasibility and standardization. This study emphasizes the need for interdisciplinary collaboration across technical, economic, and policy domains to support the large-scale commercialization of WCT.
Chapter
Full-text available
Objective: Facial anthropometric data is important for the design of respirators. Two-dimensional (2D) photogrammetry has replaced the direct anthropometric method, but the reliability and accuracy of 2D photogrammetry have not been quantified. This study aimed to assess inter-rater reliability of 2D photogrammetry and to examine the reliability and accuracy of 2D photogrammetry against direct measurement. Design: A cross-sectional study. Setting: Malaysia. Participants: A subset of 96 participants aged 18 and above. Primary and secondary outcomes: Ten facial dimensions were measured using direct measurement and 2D photogrammetry. An assessment of inter-rater reliability was performed using intra-class correlation (ICC) of the 2D images. In addition, ICC and Bland-Altman analyses were used to assess the reliability and agreement of 2D photogrammetry with direct measurement. Results: Except for head breadth and bigonial breadth, which were also found to have low inter-rater reliability, there was no significant difference in the inter-rater mean values of the 2D photogrammetry. The mean measurements derived from direct measurement and 2D photogrammetry were mostly similar. However, statistical differences were noted for two facial dimensions, i.e., bizygomatic breadth and bigonial breadth, and clinically the magnitude of difference was also significant. There were no statistical differences in respect of the remaining eight facial dimensions, where the smallest mean difference was 0.3 mm and the biggest mean difference was 1.0 mm. The ICC showed head breadth had poor reliability, whilst Bland-Altman analyses showed seven out of 10 facial dimensions using 2D photogrammetry were accurate, as compared to direct measurement. Conclusion: Only certain facial measurements can be reliably and accurately measured using 2D photogrammetry, thus it is important to conduct a reliability and validation study before the use of any measurement method in anthropometric studies. The results of this study also suggest that 2D photogrammetry can be used to supplement direct measurement for certain facial dimensions.
Article
Full-text available
The problem of ageing is emerging as a major issue in the modern age because improvements in medical science have raised people's life expectancy. As a result, the number of old people is increasing all over the world. According to a UNESCO estimate, the number of old people aged 60+ in the world is likely to go up from 350 million in 1975 to 590 million in 2005. About half of them are in the developing countries. In advanced countries like Norway, Sweden and Japan, the population of the aged above 60 is 20 per cent. However, the percentage of the aged in other advanced countries in the West is in the neighbourhood of 14 per cent. So far as India is concerned, as a result of the change in the age composition of the population over time, there has been a progressive increase in both the number and proportion of aged people. The proportion of the population aged 60 years or more has been increasing consistently over the last century, particularly after 1951. In 1901 the proportion of the population of India aged 60 or over was about 5 percent, which marginally increased to 5.4 percent in 1951, and by 2001 this share was found to have risen to about 7.4 percent. About 75% of persons of age 60 and above reside in rural areas. The status of the aged differs not only between the young and the old but also from country to country, depending on the socio-cultural background. Venkoba Rao (1979) has indicated how the prevalent cultural conditions affect or contribute to the problems of the aged. Ghosal (1962) has observed that the problems of old age tend to be multiple rather than single. Owing to this multiple nature, old age gives rise to different problems, and these problems vary between tribes and non-tribes. Keywords: Aged People, Old age, Elderly, Tribes, Tribal Community
Article
Full-text available
Big data has revolutionized science and technology, leading to the transformation of our societies. High-performance computing (HPC) provides the necessary computational power for big data analysis using artificial intelligence and related methods. Traditionally, HPC and big data had focused on different problem domains and had grown into two different ecosystems. Efforts have been underway for the last few years on bringing the best of both paradigms into HPC and big data converged architectures. Designing HPC and big data converged systems is a hard task requiring careful placement of data, analytics, and other computational tasks such that the desired performance is achieved with the least amount of resources. Energy efficiency has become the biggest hurdle in the realization of HPC, big data, and converged systems capable of delivering exascale and beyond performance. Data locality is a key parameter of HPDA system design as moving even a byte costs heavily both in time and energy with an increase in the size of the system. Time and energy performance are the most important factors for users, particularly energy, due to it being the major hurdle in high-performance system design and the increasing focus on green energy systems due to environmental sustainability. Data locality is a broad term that encapsulates different aspects including bringing computations to data, minimizing data movement by efficient exploitation of cache hierarchies, reducing intra- and inter-node communications, locality-aware process and thread mapping, and in situ and in transit data analysis. This paper provides an extensive review of cutting-edge research on data locality in HPC, big data, and converged systems. We review the literature on data locality in HPC, big data, and converged environments and discuss challenges, opportunities, and future directions. Subsequently, using the knowledge gained from this extensive review, we propose a system architecture for future HPC and big data converged systems. To the best of our knowledge, there is no such review on data locality in converged HPC and big data systems.
Article
Aggregation is common in data analytics and crucial to distilling information from large datasets, but current data analytics frameworks do not fully exploit the potential for optimization in such phases. The lack of optimization is particularly notable in current “online” approaches which store data in main memory across nodes, shifting the bottleneck away from disk I/O toward network and compute resources, thus increasing the relative performance impact of distributed aggregation phases. We present ROME, an aggregation system for use within data analytics frameworks or in isolation. ROME uses a set of novel heuristics based primarily on basic knowledge of aggregation functions combined with deployment constraints to efficiently aggregate results from computations performed on individual data subsets across nodes (e.g., merging sorted lists resulting from top-k). The user can either provide minimal information which allows our heuristics to be applied directly, or ROME can autodetect the relevant information at little cost. We integrated ROME as a subsystem into the Spark and Flink data analytics frameworks. We use real world data to experimentally demonstrate speedups up to 3x over single level aggregation overlays, up to 21% over other multi-level overlays, and 50% for iterative algorithms like gradient descent at 100 iterations.
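For illustration, a small Python example of the kind of aggregation ROME targets: merging sorted per-node top-k lists into a global top-k. The node results and k are made up; this is not ROME's implementation.

```python
# Merge sorted per-node top-k lists into a global top-k via a k-way merge.
import heapq

node_results = [                      # each node's local top-k, already sorted descending
    [9.5, 7.2, 3.1],
    [8.8, 6.0, 5.9],
    [9.9, 4.4, 2.0],
]

def global_top_k(node_results, k):
    """Merge sorted descending lists from all nodes and keep the k largest scores."""
    merged = heapq.merge(*node_results, reverse=True)
    return [score for _, score in zip(range(k), merged)]

print(global_top_k(node_results, 3))   # [9.9, 9.5, 8.8]
```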
Preprint
Full-text available
In this paper, a technology for massive data storage and computing named Hadoop is surveyed. Hadoop runs on heterogeneous computing devices like regular PCs, abstracting away the details of parallel processing so that developers can concentrate on their computational problem. A Hadoop cluster is made of two parts: HDFS and MapReduce. A Hadoop cluster uses HDFS for data management. HDFS provides storage for input and output data in MapReduce jobs and is designed with properties such as high fault tolerance, high distribution capacity, and high throughput. It is also suitable for storing terabytes of data on clusters, and it runs on flexible hardware like commodity devices.
Thesis
Full-text available
Big Data science today underpins most everyday applications, whether in work or leisure settings. After a general analysis of the world of Intelligent Buildings and the technologies that dominate it, this thesis discusses the importance of Big Data within the construction sector, analyzing its main technologies. The degree of integration is studied, providing a snapshot of the current state and analyzing possible future developments.
Book
Full-text available
There is rapid development and change in the field of computer science today. These affect all areas of life. Emerging topics in computer science are covered in this book. In the first chapter, there is a log data analysis case study that aims to explain Hadoop and MapReduce by example. In the second chapter, the aim is to provide encrypted communication on IoT devices performing edge computing, in order to make communication secure. Arduino is used as an IoT device. In the encryption process, AES encryption is used with 128-bit and 256-bit key lengths. The third chapter presents a more secure connection between the NodemCU and Blynk, using the AES algorithm. It aims to prevent a vulnerability in the connection of Google Home devices with IoT during the Blynk IoT connection phase. The next chapter presents a methodological tool based on an evaluation framework for integration of digital games into education (MEDGE), expanded by adding additional information from the students, MEDGE+. The fifth and last chapter proposes a disaster management system utilising machine learning called DT-DMS that is used to support decision-making mechanisms.
Book Chapters:
1. Understanding Hadoop and MapReduce by example: log data analysis case study (Gligor Risteski, Mihiri Chathurika, Beyza Ali, Atanas Hristov)
2. Edge Computing Security with an IoT device (Beyda Nur Kars)
3. Secure Connection between Google Home and IoT Device (Ekrem Yiğit)
4. Practical Evaluation on Serious Games in Education (Slavica Mileva Eftimova, Ana Madevska Bogdanova, Vladimir Trajkovik)
5. Digital Twin Based Disaster Management System Proposal: DT-DMS (Özgür Doğan, Oğuzhan Şahin, Enis Karaarslan)
Article
Full-text available
It isn't easy to analyze huge volumes of records; doing so requires systematic structures and technologies. Map-Reduce, a distributed parallel programming model that runs in the Hadoop environment, processes massive volumes of information. A parallel programming technique can be applied to the linear regression and support vector machine algorithms from the machine learning community to accelerate them on multicore systems for better timing efficiency.
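A hedged sketch of the parallelization idea mentioned above for linear regression: each mapper computes partial sufficient statistics over its data chunk, a reducer sums them, and the normal equations are solved once. The chunks and function names are synthetic illustrations, not the article's code.

```python
# Parallel linear regression via partial sufficient statistics (X^T X and X^T y).
import numpy as np

chunks = [
    (np.array([[1.0, 1.0], [1.0, 2.0]]), np.array([3.0, 5.0])),   # (X_chunk, y_chunk)
    (np.array([[1.0, 3.0], [1.0, 4.0]]), np.array([7.0, 9.0])),
]

def map_stats(X, y):
    """Per-chunk partial sums: (X^T X, X^T y)."""
    return X.T @ X, X.T @ y

def reduce_stats(partials):
    """Sum the partial statistics and solve the normal equations."""
    XtX = sum(p[0] for p in partials)
    Xty = sum(p[1] for p in partials)
    return np.linalg.solve(XtX, Xty)

coeffs = reduce_stats([map_stats(X, y) for X, y in chunks])
print(coeffs)   # approximately [1., 2.] for y = 1 + 2*x
```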
Chapter
Advances in the communication technologies, along with the birth of new communication paradigms leveraging on the power of the social, has fostered the production of huge amounts of data. Old-fashioned computing paradigms are unfit to handle the dimensions of the data daily produced by the countless, worldwide distributed sources of information. So far, the MapReduce has been able to keep the promise of speeding up the computation over Big Data within a cluster. This article focuses on scenarios of worldwide distributed Big Data. While stigmatizing the poor performance of the Hadoop framework when deployed in such scenarios, it proposes the definition of a Hierarchical Hadoop Framework (H2F) to cope with the issues arising when Big Data are scattered over geographically distant data centers. The article highlights the novelty introduced by the H2F with respect to other hierarchical approaches. Tests run on a software prototype are also reported to show the increase of performance that H2F is able to achieve in geographical scenarios over a plain Hadoop approach.
Article
Essentially, data mining concerns the computation of data and the identification of patterns and trends in the information so that we might decide or judge. Data mining concepts have been in use for years, but with the emergence of big data, they are even more common. In particular, the scalable mining of such large data sets is a difficult issue that has attracted several recent research efforts. A few of these recent works use the MapReduce methodology to construct data mining models across the data set. In this article, we examine current approaches to large-scale data mining and compare their output to the MapReduce model. Based on our research, a system for data mining that combines MapReduce and sampling is implemented and discussed.
Article
Large-scale datasets collected from heterogeneous sources often require a join operation to extract valuable information. MapReduce is an efficient programming model for processing large-scale data. However, it has some limitations in processing heterogeneous datasets. This is because of the large amount of redundant intermediate records that are transferred through the network. Several filtering techniques have been developed to improve the join performance, but they require multiple MapReduce jobs to process the input datasets. To address this issue, the adaptive filter-based join algorithms are presented in this paper. Specifically, three join algorithms are introduced to perform the processes of filters creation and redundant records elimination within a single MapReduce job. A cost analysis of the introduced join algorithms shows that the I/O cost is reduced compared to the state-of-the-art filter-based join algorithms. The performance of the join algorithms was evaluated in terms of the total execution time and the total amount of I/O data transferred. The experimental results show that the adaptive Bloom join, semi-adaptive intersection Bloom join, and adaptive intersection Bloom join decrease the total execution time by 30%, 25%, and 35%, respectively; and reduce the total amount of I/O data transferred by 18%, 25%, and 50%, respectively.
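The filter-based join idea can be illustrated with a simplified Python sketch: a small Bloom filter built over one dataset's join keys prunes the other dataset's records before the join. The filter sizes, datasets, and class name are illustrative, not the paper's adaptive algorithms.

```python
# Bloom-filter pre-filtering before a join: records whose key cannot exist on the
# other side are dropped early, reducing intermediate data shipped over the network.
import hashlib

class TinyBloom:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= (1 << pos)

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

left_keys  = ["k1", "k2", "k3"]
right_rows = [("k2", "a"), ("k9", "b"), ("k3", "c")]

bloom = TinyBloom()
for key in left_keys:
    bloom.add(key)

# Only rows whose key might exist on the left survive to the join phase;
# false positives are possible, false negatives are not.
candidates = [row for row in right_rows if bloom.might_contain(row[0])]
print(candidates)   # [('k2', 'a'), ('k3', 'c')]  (barring a false positive on 'k9')
```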
Conference Paper
Full-text available
The big data business ecosystem is very supportive of user needs, and the trends that provide the basis for big data are explained. An effective solution to the issue of data volume is needed to enable feasible, cost-effective, and scalable storage and processing of enormous quantities of data; big data and the cloud therefore go hand in hand, and Hadoop is a rapidly growing technology for organizations. The steps required for setting up a single-node Hadoop cluster backed by HDFS running on Ubuntu are given (steps only). We propose new techniques for better big data and Hadoop implementations using cloud computing resources, with various methods and systems aimed at common user needs, and we outline a broad scope for future research.
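The setup steps themselves are shell and configuration work, but once a single-node cluster is running it can be smoke-tested with a Hadoop Streaming job whose mapper and reducer are ordinary Python scripts reading stdin and writing tab-separated key/value lines. The sketch below is illustrative; the jar name and HDFS paths in the comment are placeholders.

```python
#!/usr/bin/env python3
# Minimal Hadoop Streaming word count, usable to smoke-test a single-node
# cluster. The same file serves as mapper and reducer; invocation is roughly
# (paths and jar location are placeholders):
#   hadoop jar hadoop-streaming.jar -input /in -output /out \
#       -mapper "python3 wc.py map" -reducer "python3 wc.py reduce" -file wc.py
import sys

def mapper():
    # Emit one "word<TAB>1" line per token read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Streaming delivers map output sorted by key, so equal words are adjacent.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```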
Conference Paper
Full-text available
Technological evolutions in Smart Intelligent Transport Systems (SITS) now allow a ubiquitous society to use SITS facilities to conduct activities that serve social needs. In this scenario, the paper presents a case study on applying SITS technologies to these activities, mainly vehicle tracking. The case study uses the concept of the Internet of Things to inspect different vehicles; it considers both the use of existing Intelligent Transport Systems infrastructure and the deployment of new infrastructure using Bus Rapid Transit (BRT). To implement the Internet of Things, the case study uses RFID (Radio Frequency Identification) technologies together with multiple sensors installed on each vehicle, identifying the vehicle when it passes through a Region of Interest (ROI) portal in the BRT and collecting the associated information when available, such as temperature, humidity, average speed, frequency modulation, doors, road lengths, and other relevant information. The major advantage of IoT is the ability to recognize or identify vehicles using the transportation system at the ROI, and to collect other information, without disrupting normal traffic flow.
Conference Paper
Full-text available
Nanotechnology is quickly becoming a ubiquitous technology with the potential to affect every aspect of modern human development. Nearly every area of human endeavor will be influenced, such as agriculture and food, communication, computers, environmental monitoring, materials, robotics, healthcare, and medical technology. More recently, computer science has become connected with nanotechnology. This technology may be launched around 2020. Fifth-generation network technology depends on nanotechnology and an all-IP network design, and 5G can surpass current networks in speed and accessibility. 5G network technology is a more innovative and engaging technology that will be useful to users and professionals in many fields. The paper covers the approach toward nanotechnology and the architecture, key advantages, and applications of 5G wireless communication technology.
Article
Full-text available
Although a search engine manages a great deal of data and responds to queries, it is not accurately described as a "database" or DBMS. We believe that it represents the first of many application-specific data systems built by the systems community that must exploit the principles of databases without necessarily using the (current) database implementations. In this paper, we present how a search engine should have been designed in hindsight. Although much of the material has not been presented before, the contribution is not in the specific design, but rather in the combination of principles from the largely independent disciplines of "systems" and "databases." Thus we present the design using the ideas and vocabulary of the database community as a model of how to design data-intensive systems. We then draw some conclusions about the application of database principles to other "out of the box" data-intensive systems.
Conference Paper
Full-text available
Dryad is a general-purpose distributed execution engine for coarse-grain data-parallel applications. A Dryad application combines computational "vertices" with communication "channels" to form a dataflow graph. Dryad runs the application by executing the vertices of this graph on a set of available computers, communicating as appropriate through files, TCP pipes, and shared-memory FIFOs. The vertices provided by the application developer are quite simple and are usually written as sequential programs with no thread creation or locking. Concurrency arises from Dryad scheduling vertices to run simultaneously on multiple computers, or on multiple CPU cores within a computer. The application can discover the size and placement of data at run time, and modify the graph as the computation progresses to make efficient use of the available resources. Dryad is designed to scale from powerful multi-core single computers, through small clusters of computers, to data centers with thousands of computers.
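The dataflow idea can be illustrated with a toy Python sketch; it is far simpler than Dryad itself (no distribution, no fault tolerance, in-memory channels only), and run_dataflow is an invented name.

```python
# Toy dataflow-graph executor in the spirit of (but much simpler than) Dryad:
# vertices are functions, edges are in-memory "channels", and a vertex runs
# once every upstream vertex has produced its output.
from graphlib import TopologicalSorter

def run_dataflow(vertices, edges, sources):
    """vertices: name -> fn(list_of_inputs) -> output; edges: (src, dst) pairs;
    sources: name -> initial data for vertices with no incoming edges."""
    deps = {v: set() for v in vertices}
    for src, dst in edges:
        deps[dst].add(src)
    results = {}
    for v in TopologicalSorter(deps).static_order():
        incoming = [results[src] for src, dst in edges if dst == v]
        results[v] = vertices[v](incoming or [sources.get(v)])
    return results

if __name__ == "__main__":
    vertices = {
        "read_a": lambda ins: ins[0],
        "read_b": lambda ins: ins[0],
        "grep":   lambda ins: [x for x in ins[0] if "data" in x],
        "merge":  lambda ins: sorted(sum(ins, [])),
    }
    edges = [("read_a", "grep"), ("grep", "merge"), ("read_b", "merge")]
    sources = {"read_a": ["big data", "logs"], "read_b": ["metadata"]}
    print(run_dataflow(vertices, edges, sources)["merge"])  # ['big data', 'metadata']
```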
Conference Paper
Full-text available
The DWS (Data Warehouse Striping) technique is a round-robin data partitioning approach especially designed for distributed data warehousing environments. In DWS the fact tables are distributed over an arbitrary number of low-cost computers and the queries are executed in parallel by all the computers, guaranteeing a nearly optimal speed up and scale up. However, the use of a large number of inexpensive nodes increases the risk of node failures that impair the computation of queries. This paper proposes an approach that provides Data Warehouse Striping with the capability of answering queries even in the presence of node failures. This approach is based on the selective replication of data over the cluster nodes, which guarantees full availability when one or more nodes fail. The proposal was evaluated using the new TPC-DS benchmark and the results show that the approach is quite effective.
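The two ideas in this abstract, round-robin striping and selective replication, can be sketched in a few lines of Python; this is an illustration only, not the DWS implementation, and the stripe/query_sum helpers are invented names.

```python
# Sketch of round-robin striping with selective replication: row i goes to
# node i mod N, and a copy goes to node (i+1) mod N, so a single node failure
# never makes any stripe unavailable.
def stripe(rows, n_nodes):
    primary = [[] for _ in range(n_nodes)]
    replica = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        primary[i % n_nodes].append(row)
        replica[(i + 1) % n_nodes].append(row)
    return primary, replica

def query_sum(primary, replica, failed):
    """Aggregate over all stripes, reading a failed node's stripe from the
    replica held by its neighbour."""
    n = len(primary)
    total = 0
    for node in range(n):
        rows = primary[node] if node not in failed else replica[(node + 1) % n]
        total += sum(rows)
    return total

if __name__ == "__main__":
    primary, replica = stripe(list(range(10)), n_nodes=4)
    print(query_sum(primary, replica, failed=set()))   # 45
    print(query_sum(primary, replica, failed={2}))     # still 45
```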
Article
Full-text available
Large-scale cluster-based Internet services often host partitioned datasets to provide incremental scalability. The aggregation of results produced from multiple partitions is a fundamental building block for the delivery of these services. This paper presents the design and implementation of a programming primitive -- Data Aggregation Call (DAC) -- to exploit partition parallelism for clusterbased Internet services. A DAC request specifies a local processing operator and a global reduction operator, and it aggregates the local processing results from participating nodes through the global reduction operator. Applications may allow a DAC request to return partial aggregation results as a tradeoff between quality and availability. Our architecture design aims at improving interactive responses with sustained throughput for typical cluster environments where platform heterogeneity and software/hardware failures are common. At the cluster level, our load-adaptive reduction tree construction algorithm balances processing and aggregation load across servers while exploiting partition parallelism. Inside each node, we employ an event-driven thread pool design that prevents slow nodes from adversely affecting system throughput under highly concurrent workload. We further devise a staged timeout scheme that eagerly prunes slow or unresponsive servers from the reduction tree to meet soft deadlines. We have used the DAC primitive to implement several applications: a search engine document retriever, a parallel protein sequence matcher, and an online parallel facial recognizer. Our experimental and simulation results validate the effectiveness of the proposed optimization techniques for (1) reducing response time, (2) improving throughput, and (3) handling server unresponsiveness ...
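A simplified sketch of this kind of aggregation primitive is given below; it is not the paper's DAC implementation, and the aggregation_call helper, the thread pool, and the deadline handling are illustrative assumptions: a local operator runs on every partition in parallel, a global reduction operator folds the available results, and a soft deadline lets the call return a partial aggregate when some partitions are slow.

```python
# Simplified sketch of a data-aggregation-call style primitive: run a local
# operator on each partition in parallel, fold the results with a global
# reduction operator, and return a (possibly partial) aggregate at a deadline.
import time
from concurrent.futures import ThreadPoolExecutor, wait
from functools import reduce

def aggregation_call(partitions, local_op, global_reduce, initial, deadline_s):
    pool = ThreadPoolExecutor(max_workers=len(partitions))
    futures = [pool.submit(local_op, p) for p in partitions]
    done, _ = wait(futures, timeout=deadline_s)   # soft deadline
    pool.shutdown(wait=False)                     # do not block on stragglers
    results = [f.result() for f in done]
    partial = len(results) < len(partitions)
    return reduce(global_reduce, results, initial), partial

if __name__ == "__main__":
    def count_matches(partition):                 # local processing operator
        if partition.get("slow"):
            time.sleep(5)                         # simulate an unresponsive node
        return sum(1 for doc in partition["docs"] if "hadoop" in doc)

    partitions = [
        {"docs": ["hadoop notes", "misc"]},
        {"docs": ["hadoop", "hadoop faq"]},
        {"docs": ["unreachable"], "slow": True},
    ]
    total, partial = aggregation_call(
        partitions, count_matches, lambda a, b: a + b, 0, deadline_s=1.0)
    print(total, "(partial)" if partial else "(complete)")   # 3 (partial)
```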
Article
Full-text available
This is a thought piece on data-intensive science requirements for databases and science centers. It argues that peta-scale datasets will be housed by science centers that provide substantial storage and processing for scientists who access the data via smart notebooks. Next-generation science instruments and simulations will generate these peta-scale datasets. The need to publish and share data and the need for generic analysis and visualization tools will finally create a convergence on common metadata standards. Database systems will be judged by their support of these metadata standards and by their ability to manage and access peta-scale datasets. The procedural stream-of-bytes-file-centric approach to data analysis is both too cumbersome and too serial for such large datasets. Non-procedural query and analysis of schematized self-describing data is both easier to use and allows much more parallelism.
Article
Google's MapReduce programming model serves for processing large data sets in a massively parallel manner. We deliver the first rigorous description of the model including its advancement as Google's domain-specific language Sawzall. To this end, we reverse-engineer the seminal papers on MapReduce and Sawzall, and we capture our findings as an executable specification. We also identify and resolve some obscurities in the informal presentation given in the seminal papers. We use typed functional programming (specifically Haskell) as a tool for design recovery and executable specification. Our development comprises three components: (i) the basic program skeleton that underlies MapReduce computations; (ii) the opportunities for parallelism in executing MapReduce computations; (iii) the fundamental characteristics of Sawzall's aggregators as an advancement of the MapReduce approach. Our development does not formalize the more implementational aspects of an actual, distributed execution of MapReduce computations.
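The basic skeleton discussed in this work can be paraphrased in a few lines of Python (a paraphrase only, not the authors' Haskell specification; map_reduce is an invented name): apply the user's map function to every input pair, group the intermediate pairs by key, then apply the user's reduce function to each group.

```python
# Paraphrase of the basic MapReduce skeleton: map every input pair, group the
# intermediate pairs by key, reduce each group. Sequential on purpose; the
# parallel, distributed execution is omitted.
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    """inputs: iterable of (k1, v1); map_fn: (k1, v1) -> [(k2, v2)];
    reduce_fn: (k2, [v2]) -> [(k3, v3)]."""
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):            # reducers see keys in sorted order
        output.extend(reduce_fn(k2, groups[k2]))
    return output

if __name__ == "__main__":
    # User-specified map and reduce functions for a word count.
    docs = [("d1", "map reduce merge"), ("d2", "map reduce")]
    word_count = map_reduce(
        docs,
        map_fn=lambda doc_id, text: [(w, 1) for w in text.split()],
        reduce_fn=lambda word, ones: [(word, sum(ones))],
    )
    print(word_count)   # [('map', 2), ('merge', 1), ('reduce', 2)]
```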
Conference Paper
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points. The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients. In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.
Conference Paper
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.
Article
Like most applications, database systems want cheap, fast hardware. Today that means commodity processors, memories, and disks. Consequently, the hardware concept of a database machine built of exotic hardware is inappropriate for current technology. On the other hand, the availability of fast microprocessors, and small inexpensive disks packaged as standard inexpensive but fast computers is an ideal platform for parallel database systems. The main topics are basic techniques for parallel database machine implementation, the trend to shared-nothing machines, parallel dataflow approach to SQL software, and future directions and research problems.
Article
This paper extends earlier research on hash-join algorithms to a multiprocessor architecture. Implementations of a number of centralized join algorithms are described and measured. Evaluation of these algorithms served to verify earlier analytical results. In addition, they demonstrate that bit vector filtering provides dramatic improvement in the performance of all algorithms including the sort merge join algorithm. Multiprocessor configurations of the centralized Grace and Hybrid hash-join algorithms are also presented. Both algorithms are shown to provide linear increases in throughput with corresponding increases in processor and disk resources.
1. Introduction. After the publication of the classic join algorithm paper in 1977 by Blasgen and Eswaran [BLAS77], the topic was virtually abandoned as a research area. Everybody "knew" that a nested-loops algorithm provided acceptable performance on small relations or large relations when a suitable index existed and that sort-merge was ...
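The two techniques named in this abstract, hash partitioning and bit-vector filtering, can be illustrated with a compact Python sketch; it is a simplification of a Grace-style hash join, not the paper's multiprocessor implementation, and grace_hash_join is an invented name.

```python
# Sketch of a Grace-style hash join with bit-vector filtering: hash-partition
# both relations on the join key, use a bit vector built from R to discard S
# tuples that cannot join, then build-and-probe within each partition pair.
def grace_hash_join(r, s, n_partitions=4, bits=64):
    # Bit-vector filter built from R's join keys (first field of each tuple).
    bitvec = 0
    for key, _ in r:
        bitvec |= 1 << (hash(key) % bits)

    r_parts = [[] for _ in range(n_partitions)]
    s_parts = [[] for _ in range(n_partitions)]
    for key, val in r:
        r_parts[hash(key) % n_partitions].append((key, val))
    for key, val in s:
        if bitvec & (1 << (hash(key) % bits)):     # drop obvious non-matches
            s_parts[hash(key) % n_partitions].append((key, val))

    out = []
    for rp, sp in zip(r_parts, s_parts):
        build = {}
        for key, val in rp:                        # build hash table on R's partition
            build.setdefault(key, []).append(val)
        for key, sval in sp:                       # probe with S's partition
            for rval in build.get(key, []):
                out.append((key, rval, sval))
    return out

if __name__ == "__main__":
    r = [(1, "r1"), (2, "r2"), (4, "r4")]
    s = [(1, "s1"), (2, "s2"), (3, "s3"), (2, "s2b")]
    print(grace_hash_join(r, s))
```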
Article
We report the performance of NOW-Sort, a collection of sorting implementations on a Network of Workstations (NOW). We find that parallel sorting on a NOW is competitive to sorting on the large-scale SMPs that have traditionally held the performance records. On a 64-node cluster, we sort 6.0 GB in just under one minute, while a 32-node cluster finishes the Datamation benchmark in 2.41 seconds. Our implementations can be applied to a variety of disk, memory, and processor configurations; we highlight salient issues for tuning each component of the system. We evaluate the use of commodity operating systems and hardware for parallel sorting. We find existing OS primitives for memory management and file access adequate. Due to aggregate communication and disk bandwidth requirements, the bottleneck of our system is the workstation I/O bus.
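The high-level structure of such a distribution sort can be sketched as follows; the sketch omits all of the disk, memory, and network tuning the paper is actually about, and sample_sort is an invented name: pick splitters from a sample, route each key to the bucket owning its range, sort buckets independently, then concatenate.

```python
# Sketch of the high-level structure of a distribution (sample) sort: choose
# splitters from a random sample, "send" each key to the node owning its key
# range, sort each node's bucket locally, and concatenate the sorted buckets.
import bisect
import random

def sample_sort(keys, n_nodes=4, sample_size=32, seed=0):
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # One splitter per node boundary, chosen from the sample.
    splitters = [sample[i * len(sample) // n_nodes] for i in range(1, n_nodes)]
    buckets = [[] for _ in range(n_nodes)]
    for k in keys:
        buckets[bisect.bisect_right(splitters, k)].append(k)  # route to node
    for b in buckets:
        b.sort()                                              # local sort
    return [k for b in buckets for k in b]

if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.randint(0, 999) for _ in range(200)]
    assert sample_sort(data) == sorted(data)
    print("sorted", len(data), "keys across 4 buckets")
```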
Wikipedia. Redundant Array of Inexpensive Nodes. http://en.wikipedia.org/wiki/Redundant_Array_of_Inexpensive_Nodes, 2006.
A. C. Arpaci-Dusseau et al. High-Performance Sorting on Networks of Workstations.