Conference Paper

Map-reduce-merge: Simplified relational data processing on large clusters

Authors: Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, D. Stott Parker

Abstract

Map-Reduce is a programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of many real-world tasks such as data processing for search engines and machine learning. However, this model does not directly support processing multiple related heterogeneous datasets. While processing relational data is a common need, this limitation causes difficulties and/or inefficiency when Map-Reduce is applied on relational operations like joins. We improve Map-Reduce into a new model called Map-Reduce-Merge. It adds to Map-Reduce a Merge phase that can efficiently merge data already partitioned and sorted (or hashed) by map and reduce modules. We also demonstrate that this new model can express relational algebra operators as well as implement several join algorithms.
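To make the data flow concrete, here is a minimal, self-contained Python sketch of the pipeline the abstract describes: map emits keyed records, reduce turns each key's values into a key/value-list pair, and a user-defined merge combines the reduced outputs of two separate lineages (shown here as a simple equi-join). The function names and toy data are illustrative assumptions, not the paper's actual interface.

```python
# Minimal sketch of the Map-Reduce-Merge data flow (illustrative only).
from collections import defaultdict

def run_map_reduce(records, map_fn, reduce_fn):
    """Plain in-memory map/reduce: map emits (k2, v2) pairs,
    reduce turns (k2, [v2]) into (k2, [v3])."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return {k2: reduce_fn(k2, vs) for k2, vs in sorted(groups.items())}

def merge(reduced_a, reduced_b):
    """Merge phase: combine two reduced, key-sorted lineages.
    Here: a simple equi-join on the shared key."""
    out = []
    for key in sorted(set(reduced_a) & set(reduced_b)):
        for va in reduced_a[key]:
            for vb in reduced_b[key]:
                out.append((key, va, vb))
    return out

# Toy example (hypothetical data): join per-department totals with names.
emp = [(1, ("dept-1", 100)), (2, ("dept-1", 50)), (3, ("dept-2", 70))]
dept = [(10, ("dept-1", "Sales")), (20, ("dept-2", "R&D"))]

emp_reduced = run_map_reduce(emp, lambda k, v: [(v[0], v[1])],
                             lambda k, vs: [sum(vs)])
dept_reduced = run_map_reduce(dept, lambda k, v: [(v[0], v[1])],
                              lambda k, vs: vs)
print(merge(emp_reduced, dept_reduced))
# [('dept-1', 150, 'Sales'), ('dept-2', 70, 'R&D')]
```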


... MapReduce has two functionalities, Map() and Reduce(). This model has been used in Google's search index, machine learning, and statistical analysis [8]. Implementations of MapReduce are highly scalable and easy to use. ...
... Then the produced key-value pairs are fed into the Reducer. After collecting all the key-value pairs from all of the map jobs, the Reducer groups the pairs into a smaller set of key-value pairs, producing the final output [7,8]. ...
... As can be seen from this result, with the unprecedented increase in generated data, traditional methods fall short of providing a solution for data analysis. This is exactly the point where the new technologies stepped in [8]. Hadoop MapReduce has a wide range of applications for Big Data analysis [3], [9], [11]. ...
Article
As the term “Big Data” becomes more popular every day, the first thing we should know and remember about it is that it does not have a single, unique definition. Basically, as one can understand from its name, Big Data means a large amount of data. Sethy, R. in his article gives the definition as “Big Data describes any massive volume of structured, semi-structured and unstructured data that are difficult to process using traditional database systems.” Research shows that data volumes are doubling every year. Although there is no single reason behind this rapid growth rate, new data sources contribute heavily to that growth. Smartphones, tablet computers, sensors, and all other devices that can be connected to the internet generate a vast amount of data. The technologies used by traditional enterprises are changing to more powerful platforms, which also plays an important role in the growth rate of the data being generated.
... Solutions to this issue have been proposed by many researchers [11][12][13]. In Section 6, we present a survey of some of the relevant literature proposals in the field. ...
... To identify the optimized schedules for job sequences, a data transformation graph is used to represent all the possible execution paths of the jobs; then, the well-known Dijkstra's shortest path algorithm is used to determine the optimized schedule. An extra MapReduce phase named "merge" is introduced in [13]. It is executed after the map and reduce phases and extends the MapReduce model to heterogeneous data. ...
Article
Full-text available
In the past twenty years, we have witnessed an unprecedented production of data worldwide that has generated a growing demand for computing resources and has stimulated the design of computing paradigms and software tools to efficiently and quickly obtain insights on such Big Data. State-of-the-art parallel computing techniques such as MapReduce guarantee high performance in scenarios where the computing nodes involved are equally sized and clustered via broadband network links, and the data are co-located with the cluster of nodes. Unfortunately, the mentioned techniques have proven ineffective in geographically distributed scenarios, i.e., computing contexts where nodes and data are geographically distributed across multiple distant data centers. In the literature, researchers have proposed variants of the MapReduce paradigm that are aware of the constraints imposed in those scenarios (such as the imbalance of the nodes' computing power and of the interconnecting links) and enforce smart task scheduling strategies. We have designed a hierarchical computing framework in which a context-aware scheduler orchestrates computing tasks that leverage the potential of the vanilla Hadoop framework within each data center taking part in the computation. In this work, after presenting the features of the developed framework, we advocate the opportunity of fragmenting the data in a smart way so that the scheduler produces a fairer distribution of the workload among the computing tasks. To prove the concept, we implemented a software prototype of the framework and ran several experiments on a small-scale testbed. Test results are discussed in the last part of the paper.
... For example, users can map and reduce one data set on the fly and read data from other datasets. The Map-Reduce-Merge model [35] has been implemented to enable the processing of multiple datasets, tackling the restriction that the MapReduce system requires additional processing to conduct join operations. The structure of this model is shown in Figure 3. The main difference between this framework's processing model and the original MapReduce is that the Reduce function produces a key/value list instead of just the values. ...
... An overview of the Map-Reduce-Merge framework [35] ...
Thesis
One of the main challenges for the automobile industry in the digital age is to provide their customers with a reliable and ubiquitous level of connected services. Smart cars have been entering the market for a few years now to offer drivers and passengers safer, more comfortable, and entertaining journeys, all by designing, behind the scenes, computer systems that perform well while conserving resources. The performance of a Big Data architecture in the automotive industry relies on keeping up with the growing trend of connected vehicles and maintaining a high quality of service. The Cloud at Groupe PSA bears a particular load in ensuring a real-time data processing service for all the brand's connected vehicles: with 200k connected vehicles sold each year, the infrastructure is continuously challenged. Therefore, this thesis mainly focuses on optimizing resource allocation while considering the specifics of continuous flow processing applications and proposing a modular and fine-tuned component architecture for automotive scenarios. First, we go over a fundamental and essential process in Stream Processing Engines, the resource allocation algorithm. The central challenge of deploying streaming applications is mapping the operator graph, representing the application logic, to the available physical resources to improve its performance. We have targeted this problem by showing that the approach based on inherent data parallelism does not necessarily lead to the best performance for all applications. Second, we revisit the Big Data architecture and design an end-to-end architecture that meets today's demands of data-intensive applications. We report on the connected vehicles' Big Data platform, particularly the one deployed by Groupe PSA. In particular, we present open-source technologies and products used in different platform components to collect, store, process, and, most importantly, exploit big data, and highlight why the Hadoop system is no longer the de facto solution for Big Data. We end with a detailed assessment of the architecture while justifying the choices made during design and implementation.
... Map-Reduce-Merge [14] can be considered an extension of the MapReduce programming model, rather than an implementation of MapReduce. The original MapReduce programming model does not directly support processing multiple related heterogeneous datasets. ...
... Map-Reduce-Merge is an improved model that can be used to express relational algebra operators and join algorithms. This improved framework introduces a new Merge phase that can join reduced outputs, and a naming and configuration scheme that extends MapReduce to process heterogeneous datasets simultaneously [14]. The Merge function is much like Map or Reduce. ...
Article
Full-text available
MapReduce is a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. It was developed at Google in 2004. In the programming model, a user specifies the computation by two functions, Map and Reduce. MapReduce, as well as its open-source implementation Hadoop, is aimed at parallelizing computation on large clusters of commodity machines. Other implementations for different environments have been introduced as well, such as Mars, which implements MapReduce for graphics processors, and Phoenix, the MapReduce implementation for shared-memory systems. This paper provides an overview of the MapReduce programming model, its various applications, and different implementations of MapReduce. GridGain is another open-source Java implementation of MapReduce. We also compare Hadoop and GridGain.
... For example, in the Chicago crime dataset, the map function counts all the crimes for each day, and then the reduce function takes the day as the key and extracts the appropriate values (key: values). The MapReduce-Merge framework was developed by the authors of [89] by adding a merge operation to the MapReduce architecture. The merge operation increased the performance of MapReduce, enabled it to compute relational algebra, and allowed it to process the data in the cluster. ...
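The crimes-per-day example in the excerpt above maps directly onto the two-function model; the following sketch (with a hypothetical record layout) shows how the map and reduce steps divide the work.

```python
# Sketch of the crimes-per-day aggregation described in the excerpt above.
# The record format is hypothetical; only the map/reduce split matters.
from itertools import groupby
from operator import itemgetter

crimes = [
    {"date": "2023-05-01", "type": "THEFT"},
    {"date": "2023-05-01", "type": "ASSAULT"},
    {"date": "2023-05-02", "type": "THEFT"},
]

def map_fn(record):
    # Emit (day, 1) for every crime record.
    yield record["date"], 1

def reduce_fn(day, counts):
    # Sum the per-day counts emitted by the mappers.
    return day, sum(counts)

pairs = sorted(kv for rec in crimes for kv in map_fn(rec))
result = [reduce_fn(day, [c for _, c in grp])
          for day, grp in groupby(pairs, key=itemgetter(0))]
print(result)  # [('2023-05-01', 2), ('2023-05-02', 1)]
```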
Article
Full-text available
The competent software architecture plays a crucial role in the difficult task of big data processing for SQL and NoSQL databases. SQL databases were created to organize data and allow for vertical scaling. NoSQL databases, on the other hand, support horizontal scalability and can efficiently process large amounts of unstructured data. Organizational needs determine which paradigm is appropriate, yet selecting the best option is not always easy. Differences in database design are what set SQL and NoSQL databases apart. Each NoSQL database type also consistently employs a mixed-model approach. Therefore, it is challenging for cloud users to transfer their data among different cloud service providers (CSPs). There are several different paradigms being followed by the various cloud platforms (IaaS, PaaS, SaaS, and DBaaS). The purpose of this SLR is to examine the articles that address cloud data portability and interoperability, as well as the software architectures of SQL and NoSQL databases. Numerous studies comparing the capabilities of SQL and NoSQL databases, particularly Oracle RDBMS and the NoSQL document database MongoDB, in terms of scale, performance, availability, consistency, and sharding, were presented as part of the state of the art. Research indicates that NoSQL databases, with their specifically tailored structures, may be the best option for big data analytics, while SQL databases are best suited for online transaction processing (OLTP) purposes.
... In [20], the authors present a comparative study of classification algorithms based on the MapReduce model. Map-Reduce-Merge, implemented in [21], simplified relational data processing on large clusters. A number of techniques have been proposed to improve the performance of MapReduce jobs. ...
Article
Full-text available
The MapReduce algorithm is inspired by the map and reduce functions commonly used in functional programming. The use of this model is more beneficial when optimization of the distributed mappers in the MapReduce framework comes into account. In standard mappers, each mapper operates independently and has no collaborative function or content relationship with other mappers. We propose a new technique to improve the performance of inter-processing tasks in MapReduce functions. In the proposed method, the mappers are connected and collaborate through a shared coordinator with a distributed metadata store called DMDS. In this new structure, a parallel and co-evolutionary genetic algorithm is used to optimize and match the matrix processes simultaneously. The proposed method uses a genetic algorithm with a parallel and evolutionary execution structure in the mapping process of the mapper program to allocate resources and to transfer and store data. The co-evolutionary MapReduce mappers can simplify and optimize relational data processing in large clusters. MapReduce with a co-evolutionary mapper provides successful convergence and better performance. Our experimental evaluation shows that the collaborative technique improves performance, especially in large computations, and dramatically improves processing time across the MapReduce process. Even though the execution time in MapReduce varies with data volume, the overhead of the proposed method is considerable for low-volume data, whereas for high-volume data the method shows a clearer competitive advantage. In fact, as the data volume increases, the advantage of the proposed method becomes more considerable.
... However, one major obstacle with MapReduce is the low productivity of developing entire applications. Unlike high-level declarative languages such as SQL, MapReduce jobs are written in a low-level descriptive language, often requiring massive programming effort and leading to considerable difficulty in debugging [24]. ...
Preprint
Full-text available
Data management applications are rapidly growing applications that require more attention, especially in the big data era. Thus, it is critical to support these applications with novel and efficient algorithms that deliver higher performance. Array database management systems are one way to support these applications by dealing with data represented in n-dimensional data structures. For instance, software like SciDB and RasDaMan can be powerful tools to achieve the required performance on large-scale problems with multidimensional data. Like their relational counterparts, these management systems support specific array query languages as the user interface. Further, as a popular programming model, MapReduce allows large-scale data analysis and has also been leveraged to facilitate query processing and used as a database engine. Nevertheless, one major obstacle is the low productivity of developing MapReduce applications. Unlike high-level declarative languages such as SQL, MapReduce jobs are written in a low-level descriptive language, often requiring massive programming effort and complicated debugging. This paper presents a system that translates array queries expressed in AQL (Array Query Language) in SciDB into MapReduce jobs. We focus on effectively translating some unique structural aggregations, including circular, grid, hierarchical, and sliding aggregations. Unlike the traditional aggregations in relational databases, these structural aggregations are designed explicitly for array manipulation. Thus, our work can be considered an array-view counterpart of some existing SQL-to-MapReduce translators like HiveQL/Hive and YSmart. We show that our translator can effectively support structural aggregations over arrays (or sub-arrays) to meet various array manipulations. Moreover, our translator can support user-defined aggregation functions with minimum effort from the user. We also show that our translator can generate optimized MapReduce code, leading to significantly better performance than short hand-written code by up to 10.84X.
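As a rough illustration of how such a structural aggregation can be phrased in map/reduce terms, the sketch below implements a grid aggregation in plain Python: the "map" step keys each cell by the grid block it falls into, and the "reduce" step aggregates each block. This is a generic illustration, not the output of the paper's AQL translator.

```python
# Grid aggregation over a 2-D array phrased as a map/reduce job.
from collections import defaultdict

def grid_aggregate(array_2d, block_rows, block_cols, agg=sum):
    groups = defaultdict(list)
    # "Map": key each cell by the block it falls into.
    for i, row in enumerate(array_2d):
        for j, value in enumerate(row):
            groups[(i // block_rows, j // block_cols)].append(value)
    # "Reduce": aggregate every block's values.
    return {block: agg(values) for block, values in sorted(groups.items())}

array_2d = [[1, 2, 3, 4],
            [5, 6, 7, 8],
            [9, 10, 11, 12],
            [13, 14, 15, 16]]
print(grid_aggregate(array_2d, 2, 2))
# {(0, 0): 14, (0, 1): 22, (1, 0): 46, (1, 1): 54}
```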
... The widely used Apache Hadoop framework is the best-known implementation of the MapReduce approach, alongside other approaches such as Dryad (Isard et al. 2007) and Map-Reduce-Merge (Yang et al. 2007). ...
Article
Full-text available
XML is a semi-structured data description format that, due to its wide adoption and growing data volumes, is also relevant as an input format for Big Data processing. This article therefore addresses the use of complex XML-based data structures as an input format for Big Data applications. When extensive, complex XML data structures with different XML types in a single XML file are processed, for example with Apache Hadoop, reading the data can dominate an application's runtime. Our approach optimizes the input phases by keeping intermediate processing results in main memory, which in some cases reduces the processing effort considerably. Using a case study from the music industry, in which standardized XML-based formats such as the DDEX format are used, we show experimentally that processing with our approach is significantly more efficient than the classical processing of file contents.
... Map-Reduce-Merge [80] extends MapReduce with a merge function to facilitate expressing the join operation. Map-Join-Reduce [43] adds a join phase between the map and the reduce phases. ...
Preprint
The task of joining two tables is fundamental for querying databases. In this paper, we focus on the equi-join problem, where a pair of records from the two joined tables is part of the join results if equality holds between their values in the join column(s). While this is a tractable problem when the number of records in the joined tables is relatively small, it becomes very challenging as the table sizes increase, especially if hot keys (join column values with a large number of records) exist in both joined tables. This paper, an extended version of [metwally-SIGMOD-2022], proposes Adaptive-Multistage-Join (AM-Join) for scalable and fast equi-joins in distributed shared-nothing architectures. AM-Join utilizes (a) Tree-Join, a proposed novel algorithm that scales well when the joined tables share hot keys, and (b) Broadcast-Join, the fastest known algorithm when joining keys that are hot in only one table. Unlike the state-of-the-art algorithms, AM-Join (a) holistically solves the join-skew problem by achieving load balancing throughout the join execution, and (b) supports all outer-join variants without record deduplication or custom table partitioning. For the fastest AM-Join outer-join performance, we propose the Index-Broadcast-Join (IB-Join) family of algorithms for Small-Large joins, where one table fits in memory and the other can be up to orders of magnitude larger. The outer-join variants of IB-Join improve on the state-of-the-art Small-Large outer-join algorithms. The proposed algorithms can be adopted in any shared-nothing architecture. We implemented a MapReduce version using Spark. Our evaluation shows the proposed algorithms execute significantly faster and scale to more skewed and orders-of-magnitude bigger tables when compared to the state-of-the-art algorithms.
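For readers unfamiliar with the broadcast-join building block mentioned above, the following generic sketch shows the idea: the small table is replicated (broadcast) to every worker and probed with a hash table while the large table is streamed, so the large table never needs to be shuffled. It illustrates the general technique only and is not the AM-Join implementation.

```python
# Generic sketch of a map-side broadcast (hash) join.
from collections import defaultdict

def broadcast_join(large_rows, small_rows, large_key, small_key):
    # Build a hash table over the small (broadcast) side once.
    index = defaultdict(list)
    for row in small_rows:
        index[row[small_key]].append(row)
    # Stream the large side; no shuffle of the large table is needed.
    for row in large_rows:
        for match in index.get(row[large_key], []):
            yield {**row, **match}

# Hypothetical tables for illustration.
orders = [{"user_id": 1, "amount": 30}, {"user_id": 2, "amount": 12}]
users = [{"user_id": 1, "country": "DE"}, {"user_id": 2, "country": "US"}]
print(list(broadcast_join(orders, users, "user_id", "user_id")))
```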
... For example, in the Chicago crime dataset, the map function counts all the crimes for each day, and then the reduce function uses the day as the key and extracts the required values (key: values). The authors of [89] added a merge operation to the MapReduce architecture and derived the MapReduce-Merge framework. The merge operation improved the performance of MapReduce, enabled it to compute relational algebra, and allowed it to process the data in the cluster. ...
Preprint
Full-text available
Context: The efficient processing of Big Data is a challenging task for SQL and NoSQL databases, where competent software architecture plays a vital role. SQL databases are designed for structuring data and supporting vertical scalability. In contrast, horizontal scalability is backed by NoSQL databases, which can process sizeable unstructured data efficiently. One can choose the right paradigm according to the organisation's needs; however, making the correct choice can often be challenging. SQL and NoSQL databases follow different architectures, and a mixed model is followed by each category of NoSQL database. Hence, data movement becomes difficult for cloud consumers across multiple cloud service providers (CSPs). In addition, each cloud platform (IaaS, PaaS, SaaS, and DBaaS) also follows various paradigms. Objective: This systematic literature review (SLR) aims to study the articles associated with SQL and NoSQL database software architectures and to tackle data portability and interoperability among various cloud platforms. The state of the art presents many performance comparison studies of SQL and NoSQL databases, observing scaling, performance, availability, consistency, and sharding characteristics. According to the research studies, NoSQL databases with purpose-designed structures can be the right choice for big data analytics, while SQL databases are suitable for OLTP databases. Researchers have proposed numerous approaches associated with data movement in the cloud. Platform-based APIs are developed, which makes users' data movement difficult. Therefore, data portability and interoperability issues are noticed during data movement across multiple CSPs. To minimize developer effort and improve interoperability, unified APIs are demanded to make data movement relatively more accessible among various cloud platforms.
... Game telemetry datasets are large because they contain logs of individual player behavior. To analyze data of this scale, distributed approaches such as Map-Reduce-Merge (Yang et al. 2007) have been used to distribute and analyze datasets on a large cluster. In our approach, we analyze a subset of telemetry using a representative sampling of players. ...
Article
Video games are increasingly producing huge datasets available for analysis resulting from players engaging in interactive environments. These datasets enable investigation of individual player behavior at a massive scale, which can lead to reduced production costs and improved player retention. We present an approach for modeling player retention in Madden NFL 11, a commercial football game. Our approach encodes gameplay patterns of specific players as feature vectors and models player retention as a regression problem. By building an accurate model of player retention, we are able to identify which gameplay elements are most influential in maintaining active players. The outcome of our tool is recommendations which will be used to influence the design of future titles in the Madden NFL series.
... Theoretical Basis. The intelligent classroom teaching mode of College Chinese takes the "constructivism" educational theory as an important theoretical basis [13]. The core view of constructivist theory holds that the student should be placed at the center of teaching; meanwhile, the teacher, in the role of instructor and facilitator, should make full use of learning-environment elements such as situation, cooperation, and dialogue, give full play to students' enthusiasm, initiative, and creativity, and finally achieve the purpose of helping students effectively construct the meaning of the current learning content. ...
Article
Full-text available
In China, the Chinese subject is called “the mother of the encyclopedia.” Learning Chinese well is the basic condition for learning other subjects well. Therefore, Chinese education has always undertaken the important task of teaching the mother tongue and cultural inheritance and has a great influence on the development of China’s education. Since the start of the 21st century, society has faced new opportunities and challenges, and mobile Internet, cloud computing, and the Internet of Things have all developed greatly. Using the increasingly mature technologies of data collection, analysis, and processing can realize a more comprehensive evaluation of students in teaching activities. In the current upsurge of big data, it is more and more widely used in the field of education, for example in learning analytics, network education platforms, education information management platforms, education apps, and smart campuses. In this context, Chinese teaching needs more open courses, which requires us to organically integrate existing information technology and Chinese teaching and, on this basis, integrate modern educational data sets and learning analysis technology into Chinese teaching and learning, to promote the improvement of the quality of language teaching and learning. The development of information technology in the intelligent era provides an opportunity for the establishment and application of the intelligent classroom. Based on a definition of the concept and connotation of a classroom equipped with smart devices, this paper constructs an intelligent and efficient new teaching mode for such classrooms. The proposed model is combined with the current situation and the development of classrooms equipped with smart devices. It also analyzes the difficulties and pain points of traditional College Chinese teaching. The research results can provide a reference for the implementation of the intelligent classroom teaching mode in College Chinese and other general courses in higher vocational colleges.
... Researchers have used Hadoop to implement a variety of parallel processing algorithms to efficiently handle geographical data (17,18). Multistage map and reduce algorithms, which generate on-demand indexes and retain persistent indexes, are examples of these techniques (19). ...
Article
Full-text available
In the current scenario, with a large amount of unstructured data, Health Informatics is gaining traction, allowing healthcare units to leverage the data and derive meaningful insights, providing doctors and decision-makers with relevant information to scale operations and predict the future course of treatments via information systems communication. Now, around the world, massive amounts of data are being collected and analyzed for better patient diagnosis and treatment, improving public health systems and assisting government agencies in designing and implementing public health policies, instilling confidence in future generations who want to use better public health systems. This article provides an overview of the HL7 FHIR architecture, including the workflow state, linkages, and various informatics approaches used in healthcare units. The article discusses future trends and directions in Health Informatics for successful application to public health safety. With the advancement of technology, healthcare units face new issues that must be addressed with appropriate adoption policies and standards.
... With hardware upgrades, GPUs are also very promising for improving the performance of join operations in a hybrid framework [10]. Existing join processing using the MapReduce programming model includes k-NN [11], equi- [12], theta- [13], similarity [14], top-k [15], and filter-based joins [16]. To handle the problem that existing MapReduce-based filtering methods require multiple MapReduce jobs to improve join performance, the adaptive filter-based join algorithm was proposed [17]. ...
Article
Full-text available
Join operations on data sets play a crucial role in obtaining the relations of massive data in real life. Joining two data sets with MapReduce requires a proper design of the Map and Reduce stages for different scenarios. The factors affecting MapReduce join efficiency include the density of the data sets and data transmission over clusters like Hadoop. This study aims to improve the efficiency of MapReduce join algorithms on Hadoop by leveraging Shannon entropy to measure the information changes of the data sets being joined in different MapReduce stages. To reduce the uncertainty of data sets in joins through the network, a novel MapReduce join algorithm with dynamic partition strategies, called dynamic partition join (DPJ), is proposed. Leveraging the changes of entropy in the partitions of the data sets during the Map and Reduce stages, DPJ revises the logical partitions by changing the original input of the reduce tasks in the MapReduce jobs. Experimental results indicate that the entropy-based measures can capture the entropy changes of join operations. Moreover, the DPJ variants achieved lower entropy compared with the existing joins, thereby increasing the feasibility of MapReduce join operations for different scenarios on Hadoop.
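As a small illustration of the entropy measure involved, the sketch below computes the Shannon entropy of a join-key distribution; a balanced key distribution yields high entropy while a skewed one yields low entropy. The data and the comparison are illustrative only and do not reproduce DPJ's actual partitioning logic.

```python
# Shannon entropy of a join-key distribution, the kind of measure
# partitioning strategies can use to compare candidate partitionings.
import math
from collections import Counter

def key_entropy(keys):
    counts = Counter(keys)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

balanced = ["a", "b", "c", "d"] * 25      # keys spread evenly
skewed = ["a"] * 97 + ["b", "c", "d"]     # one hot key dominates
print(round(key_entropy(balanced), 3))    # 2.0 (maximum for 4 keys)
print(round(key_entropy(skewed), 3))      # ~0.24 (low entropy, heavy skew)
```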
... Many companies have come up with new frameworks to handle such data. Some examples are Google's MapReduce [2], Microsoft's Dryad [3], and Yahoo!'s Map-Reduce-Merge [4]. All of these frameworks differ in their designs but share the common objectives of fault tolerance, parallel programming, and optimization. ...
Article
Vigorous resource allocation is one of the biggest challenges in the area of cloud resource management and has attracted a lot of attention from researchers in the last few years. Improvised coextensive data processing has emerged as one of the best applications of Infrastructure-as-a-Service (IaaS) clouds. Current data processing frameworks can be used for static and homogeneous cluster setups only, so the allocated resources may be insufficient for large parts of the submitted tasks and increase processing cost and time unnecessarily. Due to the arcane nature of the cloud, only static allocation of resources is possible rather than dynamic. In this paper, we propose a generic coextensive data processing framework (ViCRA), based on the Nephele architecture, that allows vigorous resource allocation for both task scheduling and realization. Different tasks of a processing job can be assigned to different virtual machines, which are automatically initiated and halted during task realization. This framework has been developed in C#. The experimental results show that the framework is effective for exploiting vigorous cloud resource allocation for coextensive task scheduling and realization. [URL for download: http://www.ijascse.org/volume-4-theme-based-issue-7/Vigorous_cloud_resource_allocation.pdf]
... For this reason, a machine learning scheme is introduced. Moreover, for sentiment or opinion categorization, the system must know the emotions of human behavior such as delight, annoyance, fury, etc. (Yang et al. 2007). Opinion evaluation in NLP is used to classify human emotions with the use of machines. ...
Article
Full-text available
Nowadays, big data is ruling the entire digital world with its applications and facilities. Thus, to run online services in a better way, machine learning models are utilized; machine learning has become a trending field in big data, and the success of online services or businesses is based upon customer reviews. Reviews contain neutral, positive, and negative sentiment values. Manual classification of sentiment values is a difficult task, so a natural language processing (NLP) scheme is used, which is processed using a machine learning strategy. Moreover, part-of-speech specification for different languages is difficult. To overcome this issue, the current research aims to develop a novel less error pruning-shortest description length (LEP-SDL) method for error pruning and an ant lion boosting model (ALBM) for opinion specification. Here, the Telugu news review dataset is adopted to process the sentiment analysis in NLP. Furthermore, the fitness function of the ant lion model in the boosting approach improves the accuracy and precision of opinion specification and makes the classification process easier. To evaluate the competence of the projected model, it is compared with recent existing works in terms of accuracy, precision, etc., and achieves better results by obtaining high accuracy and precision of opinion specification.
... Various frameworks split computations into multiple phases: Map-Reduce-Merge [10,11] extends MapReduce to implement aggregations, Camdoop [9] assumes that an aggregation's output size is a specific fraction of input sizes, Astrolabe [12] collects large-scale system state and provides on-the-fly attribute aggregation, and so on. ...
Article
Efficient representation of data aggregations is a fundamental problem in modern big data applications, where network topologies and deployed routing and transport mechanisms play a fundamental role in optimizing desired objectives such as cost, latency, and others. In traditional networking, applications use TCP and UDP transports as a primary interface for implemented applications that hide the underlying network topology from end systems. On the flip side, to exploit network infrastructure in a better way, applications restore characteristics of the underlying network. In this work, we demonstrate that both specified extreme cases can be inefficient to optimize given objectives. We study the design principles of routing and transport infrastructure and identify extra information that can be used to improve implementations of compute-aggregate tasks. We build a taxonomy of compute-aggregate services unifying aggregation design principles, propose algorithms for each class, analyze them theoretically, and support our results with an extensive experimental study.
... Spark and MapReduce implementations are either closed source, restrictively licensed, or locked into their own ecosystems, making them inaccessible to many bioinformatics labs. 4,5 Other approaches for bidding on cloud resources exist, but they neither provide implementations nor interface with a distributed batch job processing backend. [6][7][8] Our proposed tool, Aether, leverages a linear programming approach to minimize cloud compute cost while being constrained by user needs and cloud capacity, which are parameterized by the number of cores, RAM, and in-node solid-state drive space. ...
Preprint
Across biology we are seeing rapid developments in scale of data production without a corresponding increase in data analysis capabilities. Here, we present Aether ( http://aether.kosticlab.org ), an intuitive, easy-to-use, cost-effective, and scalable framework that uses linear programming (LP) to optimally bid on and deploy combinations of underutilized cloud computing resources. Our approach simultaneously minimizes the cost of data analysis while maximizing its efficiency and speed. As a test, we used Aether to de novo assemble 1572 metagenomic samples, a task it completed in merely 13 hours with cost savings of approximately 80% relative to comparable methods.
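A toy linear program in the spirit of the approach described above might look as follows: choose how many instances of each (hypothetical) type to acquire so that hourly cost is minimized while minimum core and RAM requirements are met. All instance types, prices, and capacities are invented for illustration, a real bidding system would use integer variables rather than the fractional relaxation shown here, and this is not Aether's actual model.

```python
# Toy cost-minimization LP: pick instance counts to cover core/RAM needs.
from scipy.optimize import linprog

# Per-instance hourly price (objective to minimize); values are invented.
cost = [0.10, 0.35, 0.90]            # small, medium, large
# Per-instance cores and RAM (GB); "at least" constraints are negated
# because linprog expects A_ub @ x <= b_ub.
cores = [2, 8, 32]
ram = [4, 32, 128]
need_cores, need_ram = 64, 256

A_ub = [[-c for c in cores], [-r for r in ram]]
b_ub = [-need_cores, -need_ram]

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3)
print(res.x, res.fun)  # fractional instance counts and minimum hourly cost
```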
... This is because it consists of the major learning techniques that can be used in classifying and analysing massive amounts of medical data, such as Bayesian classifiers and decision trees (Yang et al. 2007). ...
Article
Full-text available
The existence of massive datasets generated in many applications provides various opportunities and challenges. In particular, scalable mining of such large-scale datasets is a challenging issue that has attracted some recent research. In the present study, the main focus is to analyse classification techniques using the WEKA machine learning workbench. Moreover, a large-scale dataset was used. This dataset comes from the protein structure prediction field and has already been partitioned into training and test sets using the ten-fold cross-validation methodology. In this experiment, nine different methods have been tested. As a result, it became obvious that it is not practical to test more than one classifier from the tree family in the same experiment. On the other hand, using the NaiveBayes classifier with the default properties of the attribute selection filter is very time-consuming. Finally, varying the parameters of the attribute selection should be prioritized for more accurate results.
... Sqoop is mainly used for bidirectional data transfer between relational databases (such as MySQL, Oracle, etc.) and Hadoop (Dean and Ghemawat 2010; Yang et al. 2007; Abouzeid et al. 2009; Capriolo et al. 2012; Kumar et al. 2014). ...
Article
Full-text available
With the rapid development of Internet technology and the popularization of various terminals, human beings live in the era of big data. In people's daily use of the network, all kinds of data are produced at all times. At the same time, the problem of information security is increasingly prominent and the situation is more and more complex. As the increasing threat to information security has caused irreparable economic losses and seriously hinders the further development of information technology, it is urgent to combat computer crime. When users use a search engine to get information, the whole process is recorded, including the user's query log. Through the analysis of these query logs, we can indirectly learn the real needs of users and so provide a reference for the optimization and improvement of the search engine. Analyzing network user behavior requires analyzing the data that this behavior generates. Traditional manual data analysis and single-computer data transmission can no longer adequately analyze the ever-increasing data. Therefore, an effective technology is needed that can analyze huge and complex data and present the results. Based on log analysis of a voice data system, this paper constructs a user behavior analysis system for the network environment. Experimental results show that the method proposed in this paper can effectively reflect the behavior of network users and provide timely feedback.
Article
Full-text available
The problem of ageing is appearing as a major issue in the modern age because the improvement of medical science has raised people's life expectancy. As a result, the number of old people is increasing all over the world. According to a UNESCO estimate, the number of people aged 60+ in the world is likely to go up from 350 million in 1975 to 590 million in 2005. About half of them are in developing countries. In advanced countries like Norway, Sweden and Japan the population aged above 60 is 20 per cent, while the percentage of the aged in other advanced Western countries is in the neighbourhood of 14 per cent. So far as India is concerned, as a result of the change in the age composition of the population over time, there has been a progressive increase in both the number and proportion of aged people. The proportion of the population aged 60 years or more has been increasing consistently over the last century, particularly after 1951. In 1901 the proportion of the population of India aged 60 or over was about 5 percent, which marginally increased to 5.4 percent in 1951, and by 2001 this share had risen to about 7.4 percent. About 75% of persons aged 60 and above reside in rural areas. The status of the aged differs not only between the young and the old but also from country to country depending on the socio-cultural background. Venkoba Rao (1979) has indicated how prevalent cultural conditions affect or contribute to the problems of the aged. Ghosal (1962) has observed that the problems of old age tend to be multiple rather than single. This multiple nature gives rise to different kinds of problems, and these problems vary between tribes and non-tribes. Keywords: Aged People, Old age, Elderly, Tribes, Tribal Community
Article
Full-text available
Big data has revolutionized science and technology, leading to the transformation of our societies. High-performance computing (HPC) provides the necessary computational power for big data analysis using artificial intelligence methods. Traditionally, HPC and big data had focused on different problem domains and had grown into two different ecosystems. Efforts have been underway for the last few years to bring the best of both paradigms into HPC and big data converged architectures. Designing HPC and big data converged systems is a hard task requiring careful placement of data, analytics, and other computational tasks such that the desired performance is achieved with the least amount of resources. Energy efficiency has become the biggest hurdle in the realization of HPC, big data, and converged systems capable of delivering exascale and beyond performance. Data locality is a key parameter of HPDA system design, as moving even a byte costs heavily in both time and energy as the size of the system increases. Performance in terms of time and energy is the most important factor for users, particularly energy, due to it being the major hurdle in high-performance system design and the increasing focus on green energy systems for environmental sustainability. Data locality is a broad term that encapsulates different aspects, including bringing computations to data, minimizing data movement by efficient exploitation of cache hierarchies, reducing intra- and inter-node communications, locality-aware process and thread mapping, and in situ and in-transit data analysis. This paper provides an extensive review of cutting-edge research on data locality in HPC, big data, and converged systems. We review the literature on data locality in HPC, big data, and converged environments and discuss challenges, opportunities, and future directions. Subsequently, using the knowledge gained from this extensive review, we propose a system architecture for future HPC and big data converged systems. To the best of our knowledge, there is no such review on data locality in converged HPC and big data systems.
Article
Aggregation is common in data analytics and crucial to distilling information from large datasets, but current data analytics frameworks do not fully exploit the potential for optimization in such phases. The lack of optimization is particularly notable in current “online” approaches which store data in main memory across nodes, shifting the bottleneck away from disk I/O toward network and compute resources, thus increasing the relative performance impact of distributed aggregation phases. We present ROME, an aggregation system for use within data analytics frameworks or in isolation. ROME uses a set of novel heuristics based primarily on basic knowledge of aggregation functions combined with deployment constraints to efficiently aggregate results from computations performed on individual data subsets across nodes (e.g., merging sorted lists resulting from top-k). The user can either provide minimal information which allows our heuristics to be applied directly, or ROME can autodetect the relevant information at little cost. We integrated ROME as a subsystem into the Spark and Flink data analytics frameworks. We use real world data to experimentally demonstrate speedups up to 3× over single level aggregation overlays, up to 21% over other multi-level overlays, and 50% for iterative algorithms like gradient descent at 100 iterations.
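The "merging sorted lists resulting from top-k" aggregation mentioned above can be illustrated in a few lines of Python: each node contributes an already-sorted top-k list, and the aggregator lazily merges them and keeps the global top k. This is a generic illustration, not ROME's implementation.

```python
# Merge per-node sorted top-k lists and keep the global top k.
import heapq

def merge_topk(per_node_lists, k):
    # Each input list is already sorted descending; merge lazily, cut at k.
    merged = heapq.merge(*per_node_lists, reverse=True)
    return [x for _, x in zip(range(k), merged)]

node_a = [(98, "u1"), (77, "u4"), (60, "u9")]   # (score, item), sorted desc
node_b = [(95, "u2"), (80, "u3"), (55, "u7")]
print(merge_topk([node_a, node_b], k=4))
# [(98, 'u1'), (95, 'u2'), (80, 'u3'), (77, 'u4')]
```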
Preprint
Full-text available
In this paper, a technology for massive data storage and computing named Hadoop is surveyed. Hadoop runs on heterogeneous computing devices like regular PCs, abstracting away the details of parallel processing so that developers can concentrate on their computational problem. A Hadoop cluster is made of two parts: HDFS and MapReduce. A Hadoop cluster uses HDFS for data management. HDFS provides storage for input and output data in MapReduce jobs and is designed with abilities like high fault tolerance, high distribution capacity, and high throughput. It is also suitable for storing terabyte-scale data on clusters, and it runs on flexible hardware like commodity devices.
Thesis
Full-text available
Big Data science today underlies most everyday applications, whether in work or leisure environments. After a general analysis of the world of Intelligent Buildings and the technologies that dominate it, this thesis discusses the importance of Big Data within the construction sector, analysing its major technologies. It studies their degree of integration, providing a "snapshot" of the current state and analysing possible future developments.
Book
Full-text available
There is rapid development and change in the field of computer science today, affecting all areas of life. Emerging topics in computer science are covered in this book. The first chapter is a log data analysis case study that aims to explain Hadoop and MapReduce by example. The second chapter attempts to provide encrypted communication on IoT devices performing edge computing, with the aim of making communication secure. An Arduino is used as the IoT device, and AES encryption with 128-bit and 256-bit key lengths is used for the encryption process. The third chapter presents a more secure connection between the NodeMCU and Blynk using the AES algorithm; it aims to prevent a vulnerability in the connection of Google Home devices with IoT during the Blynk IoT connection phase. The next chapter presents a methodological tool based on an evaluation framework for the integration of digital games into education (MEDGE), expanded by adding additional information from the students (MEDGE+). The fifth and last chapter proposes a disaster management system utilising machine learning, called DT-DMS, that is used to support decision-making mechanisms. Book Chapters; 1. CHAPTER Understanding Hadoop and MapReduce by example: log data analysis case study Gligor Risteski, Mihiri Chathurika, Beyza Ali, Atanas Hristov 2. CHAPTER Edge Computing Security with an IoT device Beyda Nur Kars 3. CHAPTER Secure Connection between Google Home and IoT Device Ekrem Yiğit 4. CHAPTER Practical Evaluation on Serious Games in Education Slavica Mileva Eftimova, Ana Madevska Bogdanova, Vladimir Trajkovik 5. CHAPTER Digital Twin Based Disaster Management System Proposal: DT-DMS Özgür Doğan, Oğuzhan Şahin, Enis Karaarslan
Article
Full-text available
It isn't easy to analyze huge records; this requires system-based structures and technologies to process them. Map-reduce, a distributed parallel programming model that runs in the Hadoop environment, processes massive volumes of information. A parallel programming technique can be applied to the linear regression algorithm and the support vector machine algorithm from the machine learning community to accelerate them on multicore systems for better timing efficiency.
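A common way to parallelize linear regression with map and reduce, which may be what is meant above, is to have each mapper compute the partial sums X^T X and X^T y over its data split and have the reducer add them and solve the normal equations. The sketch below illustrates this generic scheme on synthetic data; it is not taken from the cited paper.

```python
# Map/reduce-style parallel linear regression via summed sufficient statistics.
import numpy as np

def map_partial(X_chunk, y_chunk):
    # Each mapper returns its partial X^T X and X^T y.
    return X_chunk.T @ X_chunk, X_chunk.T @ y_chunk

def reduce_solve(partials):
    # The reducer adds the partial sums and solves the normal equations.
    XtX = sum(p[0] for p in partials)
    Xty = sum(p[1] for p in partials)
    return np.linalg.solve(XtX, Xty)

rng = np.random.default_rng(0)
X = np.c_[np.ones(1000), rng.normal(size=(1000, 2))]
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=1000)

chunks = np.array_split(np.arange(1000), 4)          # simulate 4 mappers
partials = [map_partial(X[idx], y[idx]) for idx in chunks]
print(reduce_solve(partials))  # close to [1, 2, -3]
```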
Chapter
Advances in communication technologies, along with the birth of new communication paradigms leveraging the power of social networks, have fostered the production of huge amounts of data. Old-fashioned computing paradigms are unfit to handle the dimensions of the data daily produced by the countless, worldwide distributed sources of information. So far, MapReduce has been able to keep the promise of speeding up computation over Big Data within a cluster. This article focuses on scenarios of worldwide distributed Big Data. While highlighting the poor performance of the Hadoop framework when deployed in such scenarios, it proposes the definition of a Hierarchical Hadoop Framework (H2F) to cope with the issues arising when Big Data are scattered over geographically distant data centers. The article highlights the novelty introduced by H2F with respect to other hierarchical approaches. Tests run on a software prototype are also reported to show the increase in performance that H2F is able to achieve in geographical scenarios over a plain Hadoop approach.
Article
Essentially, data mining concerns the computation over data and the identification of patterns and trends in the information so that we can make decisions or judgments. Data mining concepts have been in use for years, but with the emergence of big data, they are even more common. In particular, the scalable mining of such large data sets is a difficult issue that has attracted several recent works. A few of these recent works use the MapReduce methodology to construct data mining models across the data set. In this article, we examine current approaches to large-scale data mining and compare their output to the MapReduce model. Based on our research, a system for data mining that combines MapReduce and sampling is implemented and discussed.
Article
Large-scale datasets collected from heterogeneous sources often require a join operation to extract valuable information. MapReduce is an efficient programming model for processing large-scale data. However, it has some limitations in processing heterogeneous datasets. This is because of the large amount of redundant intermediate records that are transferred through the network. Several filtering techniques have been developed to improve the join performance, but they require multiple MapReduce jobs to process the input datasets. To address this issue, the adaptive filter-based join algorithms are presented in this paper. Specifically, three join algorithms are introduced to perform the processes of filters creation and redundant records elimination within a single MapReduce job. A cost analysis of the introduced join algorithms shows that the I/O cost is reduced compared to the state-of-the-art filter-based join algorithms. The performance of the join algorithms was evaluated in terms of the total execution time and the total amount of I/O data transferred. The experimental results show that the adaptive Bloom join, semi-adaptive intersection Bloom join, and adaptive intersection Bloom join decrease the total execution time by 30%, 25%, and 35%, respectively; and reduce the total amount of I/O data transferred by 18%, 25%, and 50%, respectively.
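The basic idea behind the filter-based joins discussed above can be sketched as follows: build a Bloom filter over the join keys of one table and use it to discard records of the other table that cannot possibly join, before any shuffle takes place. The tiny hand-rolled filter below is for illustration only and does not reproduce the paper's adaptive algorithms.

```python
# Semi-join filtering with a minimal Bloom filter (illustrative only).
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        digest = hashlib.sha256(str(key).encode()).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[4 * i:4 * i + 4], "big")
            yield chunk % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

left_keys = [1, 2, 3]
right = [(1, "x"), (4, "y"), (2, "z"), (9, "w")]

bf = BloomFilter()
for k in left_keys:
    bf.add(k)

# Only records that *might* join survive; false positives are possible,
# false negatives are not.
survivors = [row for row in right if bf.might_contain(row[0])]
print(survivors)  # [(1, 'x'), (2, 'z')] (plus rare false positives)
```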
Conference Paper
Full-text available
The big data business ecosystem is very supportive of user needs, and the trends that provide the basis for big data are explained. An effective solution to the issue of data volume is needed in order to enable feasible study and cost-effective, scalable storage and processing of enormous quantities of data; thus big data and cloud go hand in hand, and Hadoop is a very hot and rapidly growing technology for organizations. The steps required for setting up a distributed, single-node Hadoop cluster backed by HDFS running on Ubuntu are given (steps only). We propose new techniques for better Big Data and Hadoop implementations using cloud computing resources, with various methods and systems for common needs. This is very helpful, and there is great scope for future research.
Conference Paper
Full-text available
Nowadays, technological evolutions in Smart Intelligent Transport Systems (SITS) are allowing a ubiquitous society to use SITS facilities to conduct their activities for social needs. In this scenario, the research paper presents a case study on the use of SITS technologies to conduct these activities, mainly for vehicle tracking. The case study presented in this research paper uses the concept of the Internet of Things to inspect different vehicles; the whole case study considers the use of existing Intelligent Transport Systems infrastructure, as well as the deployment of new infrastructure using Bus Rapid Transit (BRT). In order to implement the Internet of Things, the case study uses RFID (Radio Frequency Identification) technologies associated with multiple sensors installed in the vehicle to identify the vehicle when it passes through a Region of Interest (ROI) portal in Bus Rapid Transit (BRT), together with the associated information, if available, such as temperature, average velocity, humidity, average speed, frequency modulation, doors, road lengths and other relevant information. The major advantage of IoT is the ability to recognize or identify the vehicles using the transportation system in the ROI, along with other information, without disrupting the normal flow.
Conference Paper
Full-text available
Nanotechnology is quickly turning into an omnipresent technology with the potential to affect every part of present-day human development. Nearly every area of human endeavour will be influenced, for example agriculture and food, communication, computers, environmental monitoring, materials, robotics, healthcare and medical technology. More recently, computer science has become connected with nanotechnology. This technology may be launched in 2020. Fifth-generation network technology depends upon nanotechnology and an all-IP network. 5G can improve speed and accessibility. 5G network technology is a more innovative and engaging technology which will be valuable for users and specialists in various fields. The paper covers the approach to nanotechnology and the architecture, main advantages and applications of 5G wireless network communication technology.
Article
Full-text available
Technologies around the world produce and interact with geospatial data instantaneously, from mobile web applications to satellite imagery that is collected and processed across the globe daily. Big raster data allow researchers to integrate and uncover new knowledge about geospatial patterns and processes. However, we are at a critical moment, as we have an ever-growing number of big data platforms that are being co-opted to support spatial analysis. A gap in the literature is the lack of a robust assessment comparing the efficiency of raster data analysis on big data platforms. This research begins to address this issue by establishing a raster data benchmark that employs freely accessible datasets to provide a comprehensive performance evaluation and comparison of raster operations on big data platforms. The benchmark is critical for evaluating the performance of spatial operations on big data platforms. The benchmarking datasets and operations are applied to three big data platforms. We report computing times and performance bottlenecks so that GIScientists can make informed choices regarding the performance of each platform. Each platform is evaluated for five raster operations: pixel count, reclassification, raster add, focal averaging, and zonal statistics, using three different raster datasets.
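Two of the five benchmark operations named above, focal averaging and zonal statistics, can be sketched in a few lines of NumPy/SciPy on a tiny synthetic raster; the benchmark itself targets big data platforms, so this is only meant to clarify what the operations compute.

```python
# Focal averaging and zonal statistics on a small synthetic raster.
import numpy as np
from scipy.ndimage import uniform_filter

rng = np.random.default_rng(1)
raster = rng.integers(0, 255, size=(6, 6)).astype(float)
zones = np.repeat([[1, 1, 2], [1, 2, 2]], 3, axis=0).repeat(2, axis=1)

# Focal averaging: mean over a 3x3 moving window.
focal_mean = uniform_filter(raster, size=3, mode="nearest")

# Zonal statistics: mean pixel value per zone id.
zonal_mean = {int(z): raster[zones == z].mean() for z in np.unique(zones)}
print(focal_mean.shape, zonal_mean)
```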
Chapter
One of the main challenges for large-scale computer clouds dealing with massive real-time data is coping with the rate at which unprocessed data is being accumulated. Transforming big data into valuable information requires a fundamental re-think of the way in which future data management models will need to be developed on the Internet. Unlike existing relational schemes, pattern-matching approaches can analyze data in ways similar to how our brain links information. Such interactions, when implemented in voluminous data clouds, can assist in finding overarching relations in complex and highly distributed data sets. In this chapter, a different perspective on data recognition is considered. Rather than looking at conventional approaches, such as statistical computations and deterministic learning schemes, this chapter focuses on a distributed processing approach for scalable data recognition and processing.
Chapter
In this paper, we propose FastThetaJoin, an optimization technique for the θ-join operation on multi-way data streams, an essential query often used in many data-analytics tasks. The θ-join operation on multi-way data streams is notoriously difficult, as it always involves tremendous shuffle cost due to data movement between multiple operator components, which makes it hard to implement efficiently in a distributed environment. Like previous methods, FastThetaJoin tries to minimize the number of θ-joins, but it is distinct from them in how it makes partitions, deletes unnecessary data items, and performs the Cartesian product. FastThetaJoin not only effectively minimizes the number of θ-joins but also substantially improves the efficiency of its operations in a distributed environment. We implemented FastThetaJoin in the framework of Spark Streaming, characterized by its efficient bucket implementation of parameterized windows. The experimental results show that, compared with existing solutions, our proposed method can speed up θ-join processing while reducing its overhead; the specific effect of the optimization is correlated with the nature of the data streams: the greater the data difference, the more apparent the optimization effect.
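The paper's own partitioning and pruning strategy is more elaborate, but the general idea behind partition-pruned θ-joins can be sketched as follows; the helper names and the `<` predicate are illustrative assumptions, not FastThetaJoin's implementation:

```python
# Illustrative only: range-partition both inputs and skip partition pairs that
# cannot satisfy the predicate, so the Cartesian product runs only on the rest.
from bisect import bisect_right

def partition(values, boundaries):
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for v in values:
        buckets[bisect_right(boundaries, v)].append(v)
    return buckets

def theta_less_join(left, right, boundaries):
    """All (l, r) pairs with l < r, pruning bucket pairs that cannot match."""
    lbs, rbs = partition(left, boundaries), partition(right, boundaries)
    results = []
    for i, lbucket in enumerate(lbs):
        for j, rbucket in enumerate(rbs):
            if j < i:      # every value in rbucket lies below every value in lbucket,
                continue   # so the whole bucket pair is pruned without any comparison
            results.extend((l, r) for l in lbucket for r in rbucket if l < r)
    return results

print(theta_less_join([1, 5, 9], [2, 6, 10], boundaries=[4, 8]))
```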
Chapter
In recent years, so-called Infrastructure as a Service (IaaS) clouds have become increasingly popular as a flexible and inexpensive platform for ad-hoc parallel data processing. Major players in the cloud computing space like Amazon EC2 have already recognized this trend and started to create special offers which bundle their compute platform with existing software frameworks for these kinds of applications. However, the data processing frameworks currently used in these offers were designed for static, homogeneous cluster systems and do not support the features which distinguish the cloud platform. This chapter examines the characteristics of IaaS clouds with special regard to massively parallel data processing. The author highlights use cases which are currently poorly supported by existing parallel data processing frameworks and explains how a tighter integration between the processing framework and the underlying cloud system can help to lower the monetary processing cost for the cloud customer. As a proof of concept, the author presents the parallel data processing framework Nephele and compares its cost efficiency against that of the well-known Hadoop framework.
Preprint
Technologies around the world produce and interact with geospatial data instantaneously, from mobile web applications to satellite imagery that is collected and processed across the globe daily. Big raster data allow researchers to integrate and uncover new knowledge about geospatial patterns and processes. However, we are also at a critical moment, as we have an ever-growing number of big data platforms that are being co-opted to support spatial analysis. A gap in the literature is the lack of a robust framework to assess the capabilities of geospatial analysis on big data platforms. This research begins to address this issue by establishing a geospatial benchmark that employs freely accessible datasets to provide a comprehensive comparison across big data platforms. The benchmark is critical for evaluating the performance of spatial operations on big data platforms. It provides a common framework to compare existing platforms as well as evaluate new ones. The benchmark is applied to three big data platforms and reports computing times and performance bottlenecks so that GIScientists can make informed choices regarding the performance of each platform. Each platform is evaluated for five raster operations: pixel count, reclassification, raster add, focal averaging, and zonal statistics, using three different datasets.
Article
Full-text available
MapReduce was introduced to ease the task of developing big data programs and applications. However, the distributed jobs it produces are not natively composable and reusable for subsequent development, which also hampers the ability to apply optimizations to the data flow of job sequences and pipelines. The Hierarchically Distributed Data Matrix (HDM) is a functional, strongly-typed data representation for writing composable big data applications. Along with HDM, a runtime framework is provided to support the execution, coordination and management of HDM applications on distributed infrastructures. Based on the functional data-dependency graph of HDM, multiple optimizations are applied to improve the performance of executing HDM jobs. The experimental results show that these optimizations can achieve improvements of between 10% and 40% in job completion time for different types of applications when compared with the state of the art. Programming abstraction is the core of the framework; HDM is therefore presented first as a functional, strongly-typed meta-data abstraction for writing data-parallel programs.
Article
Full-text available
Although a search engine manages a great deal of data and responds to queries, it is not accurately described as a "database" or DBMS. We believe that it represents the first of many application-specific data systems built by the systems community that must exploit the principles of databases without necessarily using the (current) database implementations. In this paper, we present how a search engine should have been designed in hindsight. Although much of the material has not been presented before, the contribution is not in the specific design, but rather in the combination of principles from the largely independent disciplines of "systems" and "databases." Thus we present the design using the ideas and vocabulary of the database community as a model of how to design data-intensive systems. We then draw some conclusions about the application of database principles to other "out of the box" data-intensive systems.
Conference Paper
Full-text available
Dryad is a general-purpose distributed execution engine for coarse-grain data-parallel applications. A Dryad application combines computational "vertices" with communication "channels" to form a dataflow graph. Dryad runs the application by executing the vertices of this graph on a set of available computers, communicating as appropriate through files, TCP pipes, and shared-memory FIFOs. The vertices provided by the application developer are quite simple and are usually written as sequential programs with no thread creation or locking. Concurrency arises from Dryad scheduling vertices to run simultaneously on multiple computers, or on multiple CPU cores within a computer. The application can discover the size and placement of data at run time, and modify the graph as the computation progresses to make efficient use of the available resources. Dryad is designed to scale from powerful multi-core single computers, through small clusters of computers, to data centers with thousands of computers.
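A toy sketch of the vertex/channel dataflow idea (not Dryad's actual API; the class and method names are invented for illustration) might look like this: sequential vertex functions are connected by channels and run in dependency order.

```python
# Toy illustration (not Dryad's API) of the vertex/channel dataflow idea:
# each vertex is a sequential function, channels carry data between vertices,
# and a scheduler runs vertices in topological order.
from collections import defaultdict, deque

class DataflowGraph:
    def __init__(self):
        self.vertices = {}                 # name -> fn(list_of_inputs) -> output
        self.edges = defaultdict(list)     # channel: producer -> consumers
        self.indeg = defaultdict(int)

    def add_vertex(self, name, fn):
        self.vertices[name] = fn

    def add_channel(self, src, dst):
        self.edges[src].append(dst)
        self.indeg[dst] += 1

    def run(self, inputs):
        """inputs: name -> initial data for source vertices."""
        pending = defaultdict(list, {k: [v] for k, v in inputs.items()})
        ready = deque(v for v in self.vertices if self.indeg[v] == 0)
        remaining = dict(self.indeg)
        outputs = {}
        while ready:
            v = ready.popleft()
            outputs[v] = self.vertices[v](pending[v])
            for w in self.edges[v]:        # push the result down each channel
                pending[w].append(outputs[v])
                remaining[w] -= 1
                if remaining[w] == 0:
                    ready.append(w)
        return outputs

g = DataflowGraph()
g.add_vertex("read", lambda _: [3, 1, 2])
g.add_vertex("sort", lambda ins: sorted(ins[0]))
g.add_channel("read", "sort")
print(g.run({"read": []})["sort"])         # [1, 2, 3]
```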
Conference Paper
Full-text available
The DWS (Data Warehouse Striping) technique is a round-robin data partitioning approach especially designed for distributed data warehousing environments. In DWS the fact tables are distributed over an arbitrary number of low-cost computers and queries are executed in parallel by all the computers, guaranteeing nearly optimal speed-up and scale-up. However, the use of a large number of inexpensive nodes increases the risk of node failures that impair the computation of queries. This paper proposes an approach that provides Data Warehouse Striping with the capability of answering queries even in the presence of node failures. The approach is based on the selective replication of data over the cluster nodes, which guarantees full availability when one or more nodes fail. The proposal was evaluated using the new TPC-DS benchmark and the results show that the approach is quite effective.
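A simplified sketch of the striping-plus-selective-replication idea described above (the chained replica placement and the function names are assumptions, not the paper's exact scheme):

```python
# Illustrative only: fact rows are striped round-robin across N nodes and each
# stripe is also replicated on a neighbouring node, so an aggregate query can
# still be answered when a node fails by reading replicas of its rows.
def stripe_with_replication(rows, num_nodes):
    """Return node_id -> {'primary': [...], 'replica': [(owner, row), ...]}."""
    layout = {n: {"primary": [], "replica": []} for n in range(num_nodes)}
    for i, row in enumerate(rows):
        owner = i % num_nodes                  # round-robin striping
        backup = (owner + 1) % num_nodes       # selective (chained) replication
        layout[owner]["primary"].append(row)
        layout[backup]["replica"].append((owner, row))
    return layout

def query_sum(layout, failed=frozenset()):
    """Sum all rows across surviving nodes, using replicas for failed nodes."""
    total = 0
    for node, data in layout.items():
        if node in failed:
            continue
        total += sum(data["primary"])
        total += sum(row for owner, row in data["replica"] if owner in failed)
    return total

layout = stripe_with_replication(list(range(10)), num_nodes=3)
assert query_sum(layout) == query_sum(layout, failed={1}) == sum(range(10))
```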
Article
Full-text available
Large-scale cluster-based Internet services often host partitioned datasets to provide incremental scalability. The aggregation of results produced from multiple partitions is a fundamental building block for the delivery of these services. This paper presents the design and implementation of a programming primitive -- Data Aggregation Call (DAC) -- to exploit partition parallelism for cluster-based Internet services. A DAC request specifies a local processing operator and a global reduction operator, and it aggregates the local processing results from participating nodes through the global reduction operator. Applications may allow a DAC request to return partial aggregation results as a tradeoff between quality and availability. Our architecture design aims at improving interactive responses with sustained throughput for typical cluster environments where platform heterogeneity and software/hardware failures are common. At the cluster level, our load-adaptive reduction tree construction algorithm balances processing and aggregation load across servers while exploiting partition parallelism. Inside each node, we employ an event-driven thread pool design that prevents slow nodes from adversely affecting system throughput under highly concurrent workload. We further devise a staged timeout scheme that eagerly prunes slow or unresponsive servers from the reduction tree to meet soft deadlines. We have used the DAC primitive to implement several applications: a search engine document retriever, a parallel protein sequence matcher, and an online parallel facial recognizer. Our experimental and simulation results validate the effectiveness of the proposed optimization techniques for (1) reducing response time, (2) improving throughput, and (3) handling server unresponsiveness ...
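A hedged single-process sketch of the DAC idea, combining a local operator with a global reduction and returning a partial aggregate when some partitions miss the deadline (the `dac` function and its timeout handling are illustrative, not the paper's implementation):

```python
# Illustrative only: a local operator runs on every partition, a global
# reduction combines whatever results arrive before the deadline, and the
# caller learns how many partitions contributed to the (possibly partial) answer.
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout
from functools import reduce

def dac(partitions, local_op, global_reduce, timeout_s=1.0):
    results = []
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        futures = [pool.submit(local_op, p) for p in partitions]
        try:
            for fut in as_completed(futures, timeout=timeout_s):
                results.append(fut.result())
        except FuturesTimeout:
            pass                 # unresponsive partitions are simply left out
    return reduce(global_reduce, results), len(results), len(partitions)

# Count documents containing "cluster" across three partitions.
parts = [["cluster a", "b"], ["cluster c", "cluster d"], ["e"]]
total, answered, asked = dac(
    parts,
    local_op=lambda docs: sum("cluster" in d for d in docs),
    global_reduce=lambda x, y: x + y,
)
print(total, f"({answered}/{asked} partitions answered)")
```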
Article
Full-text available
This is a thought piece on data-intensive science requirements for databases and science centers. It argues that peta-scale datasets will be housed by science centers that provide substantial storage and processing for scientists who access the data via smart notebooks. Next-generation science instruments and simulations will generate these peta-scale datasets. The need to publish and share data and the need for generic analysis and visualization tools will finally create a convergence on common metadata standards. Database systems will be judged by their support of these metadata standards and by their ability to manage and access peta-scale datasets. The procedural stream-of-bytes-file-centric approach to data analysis is both too cumbersome and too serial for such large datasets. Non-procedural query and analysis of schematized self-describing data is both easier to use and allows much more parallelism.
Article
Google's MapReduce programming model serves for processing large data sets in a massively parallel manner. We deliver the first rigorous description of the model including its advancement as Google's domain-specific language Sawzall. To this end, we reverse-engineer the seminal papers on MapReduce and Sawzall, and we capture our findings as an executable specification. We also identify and resolve some obscurities in the informal presentation given in the seminal papers. We use typed functional programming (specifically Haskell) as a tool for design recovery and executable specification. Our development comprises three components: (i) the basic program skeleton that underlies MapReduce computations; (ii) the opportunities for parallelism in executing MapReduce computations; (iii) the fundamental characteristics of Sawzall's aggregators as an advancement of the MapReduce approach. Our development does not formalize the more implementational aspects of an actual, distributed execution of MapReduce computations.
Conference Paper
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points. The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients. In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.
Conference Paper
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.
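A minimal single-process word-count sketch of the map/reduce programming model described above; the real runtime shards these phases across a cluster and handles shuffling and failures automatically, so this is only an illustration of the user-facing interface:

```python
# Minimal single-process sketch of the map/reduce programming model (word count).
from collections import defaultdict

def map_fn(_, document):
    for word in document.split():
        yield word, 1                      # emit intermediate (key, value) pairs

def reduce_fn(word, counts):
    yield word, sum(counts)                # combine all values for one key

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)             # the "shuffle": group values by key
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return dict(kv for k, vs in groups.items() for kv in reduce_fn(k, vs))

docs = [("doc1", "map reduce merge"), ("doc2", "map reduce")]
print(run_mapreduce(docs, map_fn, reduce_fn))   # {'map': 2, 'reduce': 2, 'merge': 1}
```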
Article
Like most applications, database systems want cheap, fast hardware. Today that means commodity processors, memories, and disks. Consequently, the hardware concept of a database machine built of exotic hardware is inappropriate for current technology. On the other hand, the availability of fast microprocessors, and small inexpensive disks packaged as standard inexpensive but fast computers is an ideal platform for parallel database systems. The main topics are basic techniques for parallel database machine implementation, the trend to shared-nothing machines, parallel dataflow approach to SQL software, and future directions and research problems.
Article
This paper extends earlier research on hash-join algorithms to a multiprocessor architecture. Implementations of a number of centralized join algorithms are described and measured. Evaluation of these algorithms served to verify earlier analytical results. In addition, they demonstrate that bit vector filtering provides dramatic improvement in the performance of all algorithms including the sort merge join algorithm. Multiprocessor configurations of the centralized Grace and Hybrid hash-join algorithms are also presented. Both algorithms are shown to provide linear increases in throughput with corresponding increases in processor and disk resources. 1. Introduction After the publication of the classic join algorithm paper in 1977 by Blasgen and Eswaran [BLAS77], the topic was virtually abandoned as a research area. Everybody "knew" that a nested-loops algorithm provided acceptable performance on small relations or large relations when a suitable index existed and that sort-merge was ...
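A small in-memory sketch of a hash join with bit-vector filtering, the technique the paper finds beneficial across algorithms; this toy version is single-threaded and is not the Grace or Hybrid multiprocessor algorithm:

```python
# Illustrative only: a bit vector built during the build phase lets the probe
# phase discard non-matching tuples cheaply before touching the hash table.
NUM_BITS = 1024

def hash_join(build, probe, build_key, probe_key):
    bitvec = bytearray(NUM_BITS // 8)
    table = {}
    for row in build:                              # build phase
        k = build_key(row)
        h = hash(k) % NUM_BITS
        bitvec[h // 8] |= 1 << (h % 8)             # set the filter bit
        table.setdefault(k, []).append(row)
    out = []
    for row in probe:                              # probe phase
        k = probe_key(row)
        h = hash(k) % NUM_BITS
        if not bitvec[h // 8] & (1 << (h % 8)):    # bit-vector filter
            continue                               # cheap early rejection
        out.extend((b, row) for b in table.get(k, ()))
    return out

emp = [(1, "ann"), (2, "bo")]
dept = [(1, "db"), (3, "os")]
print(hash_join(emp, dept, build_key=lambda r: r[0], probe_key=lambda r: r[0]))
```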
Article
We report the performance of NOW-Sort, a collection of sorting implementations on a Network of Workstations (NOW). We find that parallel sorting on a NOW is competitive to sorting on the large-scale SMPs that have traditionally held the performance records. On a 64-node cluster, we sort 6.0 GB in just under one minute, while a 32-node cluster finishes the Datamation benchmark in 2.41 seconds. Our implementations can be applied to a variety of disk, memory, and processor configurations; we highlight salient issues for tuning each component of the system. We evaluate the use of commodity operating systems and hardware for parallel sorting. We find existing OS primitives for memory management and file access adequate. Due to aggregate communication and disk bandwidth requirements, the bottleneck of our system is the workstation I/O bus.
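The one-pass idea behind cluster sorts of this kind can be sketched as follows: sample splitters, range-partition keys so each node owns a disjoint key range, sort each bucket locally, and concatenate. This single-process toy only illustrates the principle and is not NOW-Sort itself:

```python
# Illustrative only: range-partition by sampled splitters, sort each "node's"
# bucket locally, and concatenate buckets in node order for a global sort.
import random
from bisect import bisect_right

def parallel_sort(keys, num_nodes):
    # Pick splitters from a sample so buckets are roughly balanced.
    sample = sorted(random.sample(keys, min(len(keys), 100 * num_nodes)))
    splitters = [sample[i * len(sample) // num_nodes] for i in range(1, num_nodes)]
    buckets = [[] for _ in range(num_nodes)]
    for k in keys:                                   # "send" each key to its node
        buckets[bisect_right(splitters, k)].append(k)
    sorted_buckets = [sorted(b) for b in buckets]    # local sort on each node
    return [k for b in sorted_buckets for k in b]    # concatenation is globally sorted

data = [random.randrange(10**6) for _ in range(10_000)]
assert parallel_sort(data, num_nodes=8) == sorted(data)
```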
Wikipedia. Redundant Array of Inexpensive Nodes. http://en.wikipedia.org/wiki/Redundant_Array_of_Inexpensive_Nodes, 2006.
A. C. Arpaci-Dusseau et al. High-Performance Sorting on Networks of Workstations.