Conference Paper

Map-reduce-merge: Simplified relational data processing on large clusters


Abstract

Map-Reduce is a programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of many real-world tasks such as data processing for search engines and machine learning. However, this model does not directly support processing multiple related heterogeneous datasets. While processing relational data is a common need, this limitation causes difficulties and/or inefficiency when Map-Reduce is applied on relational operations like joins. We improve Map-Reduce into a new model called Map-Reduce-Merge. It adds to Map-Reduce a Merge phase that can efficiently merge data already partitioned and sorted (or hashed) by map and reduce modules. We also demonstrate that this new model can express relational algebra operators as well as implement several join algorithms.
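
To make the model concrete, here is a minimal, single-process Python sketch of the three phases as the abstract describes them: map emits key/value pairs, reduce produces a key/value-list per key, and merge combines two reduced outputs from different lineages. The function names and toy data are invented for illustration and are not the paper's actual API or datasets.

```python
# A toy, single-process illustration of the Map-Reduce-Merge model described above.
# Phase roles follow the abstract's description; names and data are made up.
from collections import defaultdict

def map_emp(record):
    # map: conceptually (k1, v1) -> [(k2, v2)]; key employee rows by department id
    emp_id, dept_id, sale = record
    return [(dept_id, sale)]

def reduce_emp(dept_id, sales):
    # reduce: (k2, [v2]) -> (k2, [v3]); aggregate sales per department
    return (dept_id, [sum(sales)])

def merge_join(left, right):
    # merge: pairs up the two reduced outputs (here a simple equi-join on dept_id)
    out = []
    for dept_id, totals in sorted(left):
        for d2, names in sorted(right):
            if dept_id == d2:
                out.append((dept_id, (totals[0], names)))
    return out

def run_map_reduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for rec in records:
        for k, v in map_fn(rec):
            groups[k].append(v)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

employees = [(1, "D1", 100), (2, "D1", 250), (3, "D2", 70)]
departments = [("D1", "Sales"), ("D2", "HR")]

emp_side = run_map_reduce(employees, map_emp, reduce_emp)
dept_side = run_map_reduce(departments,
                           lambda r: [(r[0], r[1])],
                           lambda k, vs: (k, vs))
print(merge_join(emp_side, dept_side))
# [('D1', (350, ['Sales'])), ('D2', (70, ['HR']))]
```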


... MapReduce has two functionalities, Map () and Reduce (). This model has been used in Google's search index, machine learning, and statistical analysis [8]. Implementation of MapReduce is highly scalable and easy to use. ...
... Then the produced key-value pairs are fed into the Reducer. After collecting all the key-value pairs from all of the map jobs the Reducer groups the pairs into a smaller set of key-value pairs, producing the final output [7,8]. ...
... As it can be seen from this result, with the unprecedented increase in the data generated traditional methods fall short with providing a solution for data analysis. This is, exactly, the point where the new technologies stepped in [8]. Hadoop MapReduce has a wide area of applications for Big Data analysis [3], [9], [11]. ...
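
As a concrete illustration of the map-then-reduce flow sketched in the excerpts above, here is a minimal word-count example in plain Python; it only mimics the grouping a real framework performs during the shuffle and does not involve Hadoop.

```python
# Minimal word count mimicking the MapReduce flow: map emits (key, value) pairs,
# the framework groups them by key (shuffle), and reduce folds each group.
from collections import defaultdict

def map_fn(line):
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    return (word, sum(counts))

def mapreduce(lines):
    grouped = defaultdict(list)
    for line in lines:                                    # map phase
        for key, value in map_fn(line):
            grouped[key].append(value)
    return [reduce_fn(k, v) for k, v in grouped.items()]  # reduce phase

print(mapreduce(["to be or not to be"]))
# [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```
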
Article
As the term "Big Data" is becoming more and more popular every day, the first thing we should know and remember about it is that it does not have a single, unique definition. Basically, as one can understand from its name, Big Data means a big amount of data. Sethy, R., in his article, gives the definition as "Big Data describes any massive volume of structured, semi structured and unstructured data that are difficult to process using traditional database system." Research shows that data volumes are doubling every year. Although there is no single reason behind this rapid growth rate, new data sources contribute to it heavily. Smartphones, tablet computers, sensors, and all other devices that can be connected to the internet generate a vast amount of data. The technologies used by traditional enterprises are changing to more powerful platforms, which also plays an important role in the growth rate of the data that is generated.
... For example, users can map and reduce one data set on the fly and read data from other datasets. The Map-Reduce-Merge model [35] has been implemented to enable the processing of multiple datasets and to tackle the restriction of the additional processing requirements for conducting join operations in the MapReduce system. The structure of this model is shown in Figure 3. The main difference between this framework's processing model and the original MapReduce is that a key/value list is created from the Reduce function instead of just the values. ...
... An overview of the Map-Reduce-Merge framework [35] ...
Thesis
One of the main challenges for the automobile industry in the digital age is to provide their customers with a reliable and ubiquitous level of connected services. Smart cars have been entering the market for a few years now to offer drivers and passengers safer, more comfortable, and entertaining journeys. All this by designing, behind the scenes, computer systems that perform well while conserving the use of resources.The performance of a Big Data architecture in the automotive industry relies on keeping up with the growing trend of connected vehicles and maintaining a high quality of service. The Cloud at Groupe PSA has a particular load on ensuring a real-time data processing service for all the brand's connected vehicles: with 200k connected vehicles sold each year, the infrastructure is continuously challenged.Therefore, this thesis mainly focuses on optimizing resource allocation while considering the specifics of continuous flow processing applications and proposing a modular and fine-tuned component architecture for automotive scenarios.First, we go over a fundamental and essential process in Stream Processing Engines, a resource allocation algorithm. The central challenge of deploying streaming applications is mapping the operator graph, representing the application logic, to the available physical resources to improve its performance. We have targeted this problem by showing that the approach based on inherent data parallelism does not necessarily lead to all applications' best performance.Second, we revisit the Big Data architecture and design an end-to-end architecture that meets today's demands of data-intensive applications. We report on CV's Big Data platform, particularly the one deployed by Groupe PSA. In particular, we present open-source technologies and products used in different platform components to collect, store, process, and, most importantly, exploit big data and highlight why the Hadoop system is no longer the de-facto solution of Big Data. We end with a detailed assessment of the architecture while justifying the choices made during design and implementation.
... Map-Reduce-Merge [14] can be considered as an extension to the MapReduce programming model, rather than an implementation of MapReduce. Original MapReduce programming model does not directly support processing multiple related heterogeneous datasets. ...
... Map-Reduce-Merge is an improved model that can be used to express relational algebra operators and join algorithms. This improved framework introduces a new Merge phase, that can join reduced outputs, and a naming and configuring scheme, that extends MapReduce to process heterogeneous datasets simultaneously [14]. The Merge function is much like Map or Reduce. ...
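
A rough sketch of what the merge step described in the excerpt above might look like, assuming both reduced outputs arrive sorted by the join key (which the reduce phase can guarantee). This is an illustration of the idea only, not the framework's actual merger or processor interface, and it assumes unique keys on each side.

```python
# Hypothetical merge phase: a sort-merge equi-join of two reduced outputs,
# each a list of (key, values) pairs already sorted by key (unique keys assumed).
def merge(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk, rv = right[j]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:                        # keys match: emit the joined record
            out.append((lk, (lv, rv)))
            i += 1
            j += 1
    return out

reduced_a = [("D1", [350]), ("D2", [70])]        # e.g., total bonus per department
reduced_b = [("D1", ["Sales"]), ("D3", ["IT"])]  # e.g., department names
print(merge(reduced_a, reduced_b))               # [('D1', ([350], ['Sales']))]
```
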
Article
Full-text available
MapReduce is a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. It was developed at Google in 2004. In the programming model, a user specifies the computation by two functions, Map and Reduce. MapReduce, as well as its open-source implementation Hadoop, is aimed at parallelizing computation in large clusters of commodity machines. Other implementations for different environments have been introduced as well, such as Mars, which implements MapReduce for graphics processors, and Phoenix, the MapReduce implementation for shared-memory systems. This paper provides an overview of the MapReduce programming model, its various applications, and different implementations of MapReduce. GridGain is another open-source Java implementation of MapReduce. We also discuss comparisons of Hadoop and GridGain.
... Solutions to this issue have been proposed by many researchers [11][12][13]. In Section 6, we propose a survey of some of the relevant literature proposals in the field. ...
... To identify the optimized schedules for job sequences, a data transformation graph is used to represent all the possible jobs' execution paths: then, the well-known Dijkstra's shortest path algorithm is used to determine the optimized schedule. An extra MapReduce phase named "merge" is introduced in [13]. It is executed after map and reduce phases and extends the MapReduce model for heterogeneous data. ...
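
The Dijkstra-based scheduling idea mentioned in the excerpt can be sketched in a few lines: model each intermediate data state as a node, each candidate MapReduce job as a weighted edge carrying its estimated cost, and pick the cheapest path from the raw input to the desired output. The graph, costs, and state names below are invented for illustration.

```python
# Toy "data transformation graph": nodes are data states, edges are candidate
# MapReduce jobs with estimated costs; Dijkstra picks the cheapest job sequence.
import heapq

def cheapest_plan(graph, source, target):
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue
        for nxt, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt], prev[nxt] = nd, node
                heapq.heappush(heap, (nd, nxt))
    # reconstruct the chosen sequence of data states
    path, node = [target], target
    while node != source:
        node = prev[node]
        path.append(node)
    return dist[target], list(reversed(path))

graph = {
    "raw":      [("filtered", 5), ("joined", 12)],
    "filtered": [("joined", 4)],
    "joined":   [("aggregated", 3)],
}
print(cheapest_plan(graph, "raw", "aggregated"))
# (12, ['raw', 'filtered', 'joined', 'aggregated'])
```
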
Article
Full-text available
In the past twenty years, we have witnessed an unprecedented production of data worldwide that has generated a growing demand for computing resources and has stimulated the design of computing paradigms and software tools to efficiently and quickly obtain insights on such a Big Data. State-of-the-art parallel computing techniques such as the MapReduce guarantee high performance in scenarios where involved computing nodes are equally sized and clustered via broadband network links, and the data are co-located with the cluster of nodes. Unfortunately, the mentioned techniques have proven ineffective in geographically distributed scenarios, i.e., computing contexts where nodes and data are geographically distributed across multiple distant data centers. In the literature, researchers have proposed variants of the MapReduce paradigm that obtain awareness of the constraints imposed in those scenarios (such as the imbalance of nodes computing power and of interconnecting links) to enforce smart task scheduling strategies. We have designed a hierarchical computing framework in which a context-aware scheduler orchestrates computing tasks that leverage the potential of the vanilla Hadoop framework within each data center taking part in the computation. In this work, after presenting the features of the developed framework, we advocate the opportunity of fragmenting the data in a smart way so that the scheduler produces a fairer distribution of the workload among the computing tasks. To prove the concept, we implemented a software prototype of the framework and ran several experiments on a small-scale testbed. Test results are discussed in the last part of the paper.
... Map-Reduce-Merge [80] extends MapReduce with a merge function to facilitate expressing the join operation. Map-Join-Reduce [43] adds a join phase between the map and the reduce phases. ...
Preprint
The task of joining two tables is fundamental for querying databases. In this paper, we focus on the equi-join problem, where a pair of records from the two joined tables are part of the join results if equality holds between their values in the join column(s). While this is a tractable problem when the number of records in the joined tables is relatively small, it becomes very challenging as the table sizes increase, especially if hot keys (join column values with a large number of records) exist in both joined tables. This paper, an extended version of [metwally-SIGMOD-2022], proposes Adaptive-Multistage-Join (AM-Join) for scalable and fast equi-joins in distributed shared-nothing architectures. AM-Join utilizes (a) Tree-Join, a proposed novel algorithm that scales well when the joined tables share hot keys, and (b) Broadcast-Join, the known fastest when joining keys that are hot in only one table. Unlike the state-of-the-art algorithms, AM-Join (a) holistically solves the join-skew problem by achieving load balancing throughout the join execution, and (b) supports all outer-join variants without record deduplication or custom table partitioning. For the fastest AM-Join outer-join performance, we propose the Index-Broadcast-Join (IB-Join) family of algorithms for Small-Large joins, where one table fits in memory and the other can be up to orders of magnitude larger. The outer-join variants of IB-Join improves on the state-of-the-art Small-Large outer-join algorithms. The proposed algorithms can be adopted in any shared-nothing architecture. We implemented a MapReduce version using Spark. Our evaluation shows the proposed algorithms execute significantly faster and scale to more skewed and orders-of-magnitude bigger tables when compared to the state-of-the-art algorithms.
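
The Broadcast-Join building block mentioned above is, at its core, a map-side hash join: ship the smaller table to every worker, build a hash table on it, and probe with the streamed larger table. The single-process sketch below illustrates only that general idea with invented data; it is not the paper's AM-Join or IB-Join implementation.

```python
# Broadcast (map-side) hash join sketch: build a hash table on the small table,
# then stream the large table and probe it; no shuffle of the large side is needed.
from collections import defaultdict

def broadcast_join(small, large):
    index = defaultdict(list)
    for key, value in small:          # "broadcast" side, assumed to fit in memory
        index[key].append(value)
    for key, value in large:          # streamed side
        for match in index.get(key, []):
            yield (key, value, match)

small = [("D1", "Sales"), ("D2", "HR")]
large = [(f"D{i % 3 + 1}", f"emp{i}") for i in range(6)]
print(list(broadcast_join(small, large)))
# [('D1', 'emp0', 'Sales'), ('D2', 'emp1', 'HR'), ('D1', 'emp3', 'Sales'), ('D2', 'emp4', 'HR')]
```
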
... This factor prompted us to use the Hadoop platform because its HDFS file system provides the facility to store the data in a cluster, consisting of multiple nodes [17]. By map-reduce approach, Hadoop processes the computations internally at different nodes of a cluster [18]. Measuring various centralities in a social network demands an evaluation of all source shortest paths in the network. ...
Article
Social network analysis is found to be one of the emerging research directions in the field of data science. This paper mainly concerns with the identification of influential entities with the help of several centrality measures like degree, closeness, and eigenvector centrality. The computational efficiency of analyzing social networks is limited by the size and complexity of the network domain. As the size of the network grows at an exponential rate, it is quite challenging to process the massive network with the help of conventional computing resources. In this manuscript, scalability and complexity issues have been addressed to identify the influential nodes in the network by implementing the algorithm in a distributed manner. The distributed approach has been considered in computing different centrality measures like degree, closeness, and eigenvector. In this paper, the centrality measures have been computed by considering both the local and global structural information. Real-world social networks are observed to follow the power law in both centrality drift and degree distribution. In this work, nodes are ranked based on their importance for different centrality measures. The effectiveness of these algorithms is critically examined through experimentation by using six real-time network datasets.
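
Of the centralities discussed, degree centrality maps most directly onto the map/reduce pattern: each edge emits its two endpoints, and a reducer counts occurrences per node. The toy sketch below illustrates only that idea and is not the authors' distributed implementation.

```python
# Degree centrality via a map/reduce-style pass over an undirected edge list:
# each edge emits both endpoints ("map"); the counter aggregates per node ("reduce").
from collections import Counter

def degree_centrality(edges, n_nodes):
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    # normalize by (n - 1), the usual definition for undirected graphs
    return {node: deg / (n_nodes - 1) for node, deg in counts.items()}

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
print(degree_centrality(edges, n_nodes=4))
# {'a': 0.666..., 'b': 0.666..., 'c': 1.0, 'd': 0.333...}
```
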
... In this proposal, we will advance the state of the art by designing algorithms for video-analysis, text mining or case-based reasoning that are scalable and can benefit from parallel processing for the computing and data intensive tasks. This will allow the system to deal with varying workloads for large organizations and provide analysis response times [48] in line with police investigation requirements. ...
... The features of MapReduce (Yang et al. (2007), Blanas et al. (2010)) are as follows: ...
... Yang [41] improved the MapReduce framework by adding a Merge phase so that it is more efficient and easier to process data relationships among heterogeneous datasets. Also, the researchers extended the MapReduce framework to the Map-Reduce-Merge framework. ...
Article
Full-text available
This paper developed a distributed algorithm for big data analytics to address the delay in the processing of big data. To achieve the aim of this research, an inspection of organizational documents, direct observation, and collection of existing data were carried out at the National Health Insurance Scheme (NHIS) in Nigeria. The algorithm was formulated using Apriori association rule mining and was specified using an enterprise application diagram. The prototype implementation used MongoDB as the big data storage mechanism for the input. Comma Separated Values (CSV) files were used as the storage facility for the intermediate results generated during processing, and MySQL was used as the storage mechanism for the final output. Finally, Apache MapReduce was used as the big data multi-nodal processing platform and the Java programming language as the implementation technology. This prototype was able to analyze different formats of data (i.e., PDF, Excel, CSV, and images) with high volume and velocity. The results showed that the response time was 0.25 seconds and the throughput was 8865.29 records per second. The stability of the prototype was also evaluated using the confidence of the rules generated. In conclusion, this research has shown that unnecessary delays in the processing of big data were due to the lack of appropriate data analytics tools to enhance the process. This study eliminated these irregularities, which paved the way for quicker disbursement of funds to providers and other stakeholders, as well as a quicker response to requests on enrollment, update, and referral.
... Sqoop is mainly used for bidirectional data transfer between relational databases (such as my SQL, Oracle, etc.) and Hadoop (Dean and Ghemawat 2010;Yang et al. 2007;Abouzeid et al. 2009;Capriolo et al. 2012;Kumar et al. 2014). ...
Article
Full-text available
With the rapid development of Internet technology and the popularization of various terminals, human beings live in the era of big data. In people's daily use of the network, all kinds of data are produced at all times. At the same time, the problem of information security is increasingly prominent and the situation more and more complex. As the growing threat to information security has caused irreparable economic losses and seriously hinders the further development of information technology, it is urgent to combat computer crime. When users use a search engine to get information, the whole process is recorded, including the user's query log. Through the analysis of these query logs, we can indirectly infer the real needs of users and thus provide a reference for the optimization and improvement of search engines. Analyzing network user behavior requires analyzing the data that this behavior generates. Traditional manual data analysis and the single-computer data transmission mode are no longer able to analyze the ever-increasing volume of data. Therefore, an effective technology is needed that can analyze such huge and complex data and present the results. Based on the log analysis of a voice data system, this paper constructs a user behavior analysis system for the network environment. Experimental results show that the method proposed in this paper can effectively reflect the behavior of network users and provide timely feedback.
... This is because it consists of the major learning techniques that can be used in classifying and analysing massive amounts of medical data such as Bayesian classifiers and decision trees. (Yang, et al. 2007). ...
Article
Full-text available
The existence of massive datasets generated in many applications provides various opportunities and challenges. In particular, scalable mining of such large-scale datasets is a challenging issue that has attracted some recent research. In the present study, the main focus is to analyse classification techniques using the WEKA machine learning workbench. Moreover, a large-scale dataset was used. This dataset comes from the protein structure prediction field. It has already been partitioned into training and test sets using the ten-fold cross-validation methodology. In this experiment, nine different methods were tested. As a result, it became obvious that it is not practical to test more than one classifier from the tree family in the same experiment. On the other hand, using the NaiveBayes classifier with the default properties of the attribute selection filter is very time-consuming. Finally, varying the parameters of the attribute selection should be prioritized for more accurate results.
... Spark and MapReduce implementations are either closed source, restrictively licensed, or locked in their own ecosystems, making them inaccessible to many bioinformatics labs [4,5]. Other approaches for bidding on cloud resources exist, but they neither provide implementations nor interface with a distributed batch job processing backend [6][7][8]. Our proposed tool, Aether, leverages a linear programming approach to minimize cloud compute cost while being constrained by user needs and cloud capacity, which are parameterized by the number of cores, RAM, and in-node solid-state drive space. ...
Preprint
Across biology we are seeing rapid developments in scale of data production without a corresponding increase in data analysis capabilities. Here, we present Aether ( http://aether.kosticlab.org ), an intuitive, easy-to-use, cost-effective, and scalable framework that uses linear programming (LP) to optimally bid on and deploy combinations of underutilized cloud computing resources. Our approach simultaneously minimizes the cost of data analysis while maximizing its efficiency and speed. As a test, we used Aether to de novo assemble 1572 metagenomic samples, a task it completed in merely 13 hours with cost savings of approximately 80% relative to comparable methods.
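
The linear-programming formulation Aether describes (minimize cost subject to core, RAM, and disk requirements) can be illustrated with a generic LP. The instance types, prices, and requirements below are invented, and this is not Aether's actual model; it only shows the shape of such a problem using SciPy.

```python
# Illustrative LP: choose how many instances of each (hypothetical) cloud type
# to provision so total hourly cost is minimized while meeting resource needs.
from scipy.optimize import linprog

# columns: [type_A, type_B, type_C]  (made-up instance types and prices)
cost   = [0.10, 0.38, 0.85]     # $/hour per instance
cores  = [2,    8,    16]
ram_gb = [4,    32,   64]
ssd_gb = [50,   200,  400]

need_cores, need_ram, need_ssd = 64, 256, 1600

# linprog minimizes c @ x subject to A_ub @ x <= b_ub, so express
# "provide at least the requirement" as negated inequalities.
A_ub = [[-c for c in cores], [-r for r in ram_gb], [-s for s in ssd_gb]]
b_ub = [-need_cores, -need_ram, -need_ssd]

res = linprog(c=cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3)
print(res.x, res.fun)   # fractional instance counts and minimum hourly cost
```
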
... For example, in the Chicago crime dataset, the map function calculates all the crimes for each day, and then the reduce function uses the day as the key and extracts the required values (key: values). The authors of [89] added the merge operation into the MapReduce architecture and derived the Map-Reduce-Merge framework. The merge operation improved the performance of MapReduce, enabled it to express relational algebra, and allowed the data to be processed in the cluster. ...
Preprint
Full-text available
Context: The efficient processing of Big Data is a challenging task for SQL and NoSQL Databases, where competent software architecture plays a vital role. The SQL Databases are designed for structuring data and supporting vertical scalability. In contrast, horizontal scalability is backed by NoSQL Databases and can process sizeable unstructured Data efficiently. One can choose the right paradigm according to the organisation's needs; however, making the correct choice can often be challenging. The SQL and NoSQL Databases follow different architectures. Also, the mixed model is followed by each category of NoSQL Databases. Hence, data movement becomes difficult for cloud consumers across multiple cloud service providers (CSPs). In addition, each cloud platform IaaS, PaaS, SaaS, and DBaaS also monitors various paradigms. Objective: This systematic literature review (SLR) aims to study the related articles associated with SQL and NoSQL Database software architectures and tackle data portability and Interoperability among various cloud platforms. State of the art presented many performance comparison studies of SQL and NoSQL Databases by observing scaling, performance, availability, consistency and sharding characteristics. According to the research studies, NoSQL Database designed structures can be the right choice for big data analytics, while SQL Databases are suitable for OLTP Databases. The researcher proposes numerous approaches associated with data movement in the cloud. Platform-based APIs are developed, which makes users' data movement difficult. Therefore, data portability and Interoperability issues are noticed during data movement across multiple CSPs. To minimize developer efforts and Interoperability, Unified APIs are demanded to make data movement relatively more accessible among various cloud platforms.
... Various frameworks split computations into multiple phases: Map-Reduce-Merge [10,11] extends MapReduce to implement aggregations, Camdoop [9] assumes that an aggregation's output size is a specific fraction of input sizes, Astrolabe [12] collects large-scale system state and provides on-the-fly attribute aggregation, and so on. ...
Article
Efficient representation of data aggregations is a fundamental problem in modern big data applications, where network topologies and deployed routing and transport mechanisms play a fundamental role in optimizing desired objectives such as cost, latency, and others. In traditional networking, applications use TCP and UDP transports as a primary interface for implemented applications that hide the underlying network topology from end systems. On the flip side, to exploit network infrastructure in a better way, applications restore characteristics of the underlying network. In this work, we demonstrate that both specified extreme cases can be inefficient to optimize given objectives. We study the design principles of routing and transport infrastructure and identify extra information that can be used to improve implementations of compute-aggregate tasks. We build a taxonomy of compute-aggregate services unifying aggregation design principles, propose algorithms for each class, analyze them theoretically, and support our results with an extensive experimental study.
... There have been several attempts to modify the MapReduce framework to cope with the join problem. Map-Reduce-Merge (Yang et al., 2007) is the first one. It introduces a third phase to MapReduce called merge, which is invoked after the reduce phase and receives as input the reduced outputs that come from two distinguishable sources. ...
Article
Full-text available
Large-scale datasets collected from heterogeneous sources often require a join operation to extract valuable information. MapReduce is an efficient programming model for large-scale data processing. However, it has some limitations in processing heterogeneous datasets. In this study, we review the state-of-the-art strategies for joining two datasets based on an equi-join condition and provide a detail implementation for each strategy. We also present an in-depth analysis of the join strategies and discuss their feasibilities and limitations to assist the reader in selecting the most efficient strategy. Concluding, we outline interesting directions for future join processing systems.
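
A standard strategy in this space, the repartition (reduce-side) join, tags each record with its source relation, shuffles on the join key, and pairs the two groups inside the reducer. The compact single-process sketch below illustrates that general idea only and is not tied to any specific implementation from the survey.

```python
# Repartition (reduce-side) join: tag records with their source relation,
# group by join key (the "shuffle"), and cross the two buckets per key.
from collections import defaultdict

def repartition_join(left, right):
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:           # map: tag as source 0
        grouped[key][0].append(value)
    for key, value in right:          # map: tag as source 1
        grouped[key][1].append(value)
    for key, (lvals, rvals) in grouped.items():   # reduce: join per key
        for lv in lvals:
            for rv in rvals:
                yield (key, lv, rv)

orders = [("u1", "order-9"), ("u2", "order-7")]
users = [("u1", "Alice"), ("u3", "Carol")]
print(list(repartition_join(orders, users)))   # [('u1', 'order-9', 'Alice')]
```
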
... For this reason, a machine learning scheme is introduced. Moreover, for the sentiment or opinion categorization, the system must know the emotion of human behavior like delight, annoyance, fury, etc., (Yang et al. 2007). Opinion evaluation in NLP is used to classify human emotions with the use of machine. ...
Article
Full-text available
Nowadays, big data is ruling the entire digital world with its applications and facilities. Thus, to run online services in a better way, machine learning models are utilized, and machine learning has become a trending field in big data; the success of online services and businesses is based upon customer reviews. Almost every review contains a neutral, positive, or negative sentiment value. Manual classification of sentiment value is a difficult task, so a natural language processing (NLP) scheme is used, processed with a machine learning strategy. Moreover, part-of-speech specification for different languages is difficult. To overcome this issue, the current research aims to develop a novel less error pruning-shortest description length (LEP-SDL) model for error pruning and an ant lion boosting model (ALBM) for opinion specification. Here, the Telugu news review dataset is adopted to perform the sentiment analysis in NLP. Furthermore, the fitness function of the ant lion model in the boosting approach improves the accuracy and precision of opinion specification and makes the classification process easier. To evaluate the competence of the proposed model, it is compared with recent existing works in terms of accuracy, precision, etc., and achieves better results by obtaining high accuracy and precision of opinion specification.
... Researchers have used Hadoop to implement a variety of parallel processing algorithms to efficiently handle geographical data (17,18). Multistage map and reduce algorithms, which generate on-demand indexes and retain persistent indexes, are examples of these techniques (19). ...
Article
Full-text available
In the current scenario, with a large amount of unstructured data, Health Informatics is gaining traction, allowing Healthcare Units to leverage and make meaningful insights for doctors and decision-makers with relevant information to scale operations and predict the future view of treatments via Information Systems Communication. Now, around the world, massive amounts of data are being collected and analyzed for better patient diagnosis and treatment, improving public health systems and assisting government agencies in designing and implementing public health policies, instilling confidence in future generations who want to use better public health systems. This article provides an overview of the HL7 FHIR Architecture, including the workflow state, linkages, and various informatics approaches used in healthcare units. The article discusses future trends and directions in Health Informatics for successful application to provide public health safety. With the advancement of technology, healthcare units face new issues that must be addressed with appropriate adoption policies and standards.
... Theoretical Basis. The intelligent classroom teaching mode of College Chinese takes the "constructivism" educational theory as an important theoretical basis [13]. The core view of constructivism theory holds that we should place the student in the center of teaching; meanwhile, the role of the teacher, which is the instructor and the promoters, in the class should make full use of learning environment elements such as situation, cooperation, and dialogue, give full play to students' enthusiasm, initiative, and creativity, and finally achieve the purpose of making students effectively construct the meaning of the current learning content. ...
Article
Full-text available
In China, the Chinese subject is called “the mother of the encyclopedia.” Learning Chinese well is the basic condition for learning other subjects well. Therefore, Chinese education has always undertaken the important task of teaching the mother tongue and cultural inheritance and has a great influence on the development process of China’s education. Society is facing new opportunities and challenges since the 21st century, and all the mobile Internet, cloud computing, and Internet of Things achieve great development. Using the increasingly mature technology of data information collection, analysis, and processing can realize a more comprehensive evaluation of students in teaching activities. In the current upsurge of big data, it is more and more widely used in the field of education, such as learning analysis, network education platform, education information management platform, education AP, and smart campus. In this context, Chinese teaching needs more open courses, which requires us to organically integrate the existing information technology and Chinese teaching, and on this basis, integrate modern educational data sets and learning analysis technology into Chinese teaching and learning, to promote the improvement of the quality of language teaching and learning. The development of information technology in the intelligent era provides an opportunity for the establishment and application of an intelligent classroom. This paper constructs an intelligent and efficient new classroom equipped with smart devices teaching mode based on defining the concept and connotation of a classroom equipped with smart devices. The proposed model is combined with the current situation and the development of classrooms equipped with smart devices. It also analyzes the difficulties and pain points of traditional College Chinese teaching. The research results can provide a reference for the implementation of intelligent classroom teaching mode in College Chinese and other general courses in higher vocational colleges.
... With hardware upgrades, the GPU is also very promising for improving the performance of join operations in a hybrid framework [10]. Existing join processing using the MapReduce programming model includes k-NN [11], equi- [12], theta- [13], similarity [14], top-k [15], and filter-based joins [16]. To handle the problem that existing MapReduce-based filtering methods require multiple MapReduce jobs to improve the join performance, the adaptive filter-based join algorithm is proposed [17]. ...
Article
Full-text available
Join operations of data sets play a crucial role in obtaining the relations of massive data in real life. Joining two data sets with MapReduce requires a proper design of the Map and Reduce stages for different scenarios. The factors affecting MapReduce join efficiency include the density of the data sets and data transmission over clusters like Hadoop. This study aims to improve the efficiency of MapReduce join algorithms on Hadoop by leveraging Shannon entropy to measure the information changes of data sets being joined in different MapReduce stages. To reduce the uncertainty of data sets in joins through the network, a novel MapReduce join algorithm with dynamic partition strategies called dynamic partition join (DPJ) is proposed. Leveraging the changes of entropy in the partitions of data sets during the Map and Reduce stages revises the logical partitions by changing the original input of the reduce tasks in the MapReduce jobs. Experimental results indicate that the entropy-based measures can measure entropy changes of join operations. Moreover, the DPJ variant methods achieved lower entropy compared with the existing joins, thereby increasing the feasibility of MapReduce join operations for different scenarios on Hadoop.
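
The entropy measure underlying DPJ can be illustrated independently of Hadoop: treat the fraction of records landing in each partition as a probability and compute Shannon entropy, which is maximal when partitions are balanced and drops under skew. The hash-based partitioning and data below are illustrative only, not the paper's algorithm.

```python
# Shannon entropy of a partitioning: higher entropy means records are spread
# more evenly across partitions (less skew), as exploited by entropy-based joins.
import math
from collections import Counter

def partition_entropy(keys, num_partitions):
    sizes = Counter(hash(k) % num_partitions for k in keys)
    total = len(keys)
    return -sum((n / total) * math.log2(n / total) for n in sizes.values())

balanced = [f"key{i}" for i in range(1000)]
skewed = ["hot"] * 900 + [f"key{i}" for i in range(100)]
print(partition_entropy(balanced, 8))  # close to log2(8) = 3 bits
print(partition_entropy(skewed, 8))    # much lower: one partition dominates
```
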
... Many companies have come up with new frameworks to handle such data. Some examples are Google's MapReduce [2], Microsoft's Dryad [3], or Yahoo!'s Map-Reduce-Merge [4]. All of these frameworks differ in their designs but share common objectives of fault tolerance, parallel programming and optimization. ...
Article
Vigorous resource allocation is one of the biggest challenging problems in the area of cloud resource management and in the last few years, has attracted a lot of attention from researchers. The improvised coextensive data processing has emerged as one of the best applications of Infrastructure-as-a-Service (IaaS) Cloud. Current data processing frameworks can be used for static and homogenous cluster setups only. So, the resources that are allocated may be insufficient for larger parts of submitted tasks and increase the processing cost and time unnecessarily. Due to the arcane nature of the cloud, only static allocation of resources is possible rather than dynamic. In this paper, we have proposed a generic coextensive data processing framework (ViCRA) whose working is based upon Nephele architecture that allows vigorous resource allocation for both task scheduling and realization. Different tasks of a processing job can be assigned to different virtual machines which are automatically initiated and halted during the task realization. This framework has been developed in C#. The experimental results profess that the framework is effective for exploiting the vigorous cloud resource allocation for coextensive task scheduling and realization. [URL for download: http://www.ijascse.org/volume-4-theme-based-issue-7/Vigorous_cloud_resource_allocation.pdf]
Preprint
Technologies around the world produce and interact with geospatial data instantaneously, from mobile web applications to satellite imagery that is collected and processed across the globe daily. Big raster data allows researchers to integrate and uncover new knowledge about geospatial patterns and processes. However, we are also at a critical moment, as we have an ever-growing number of big data platforms that are being co-opted to support spatial analysis. A gap in the literature is the lack of a robust framework to assess the capabilities of geospatial analysis on big data platforms. This research begins to address this issue by establishing a geospatial benchmark that employs freely accessible datasets to provide a comprehensive comparison across big data platforms. The benchmark is a critical for evaluating the performance of spatial operations on big data platforms. It provides a common framework to compare existing platforms as well as evaluate new platforms. The benchmark is applied to three big data platforms and reports computing times and performance bottlenecks so that GIScientists can make informed choices regarding the performance of each platform. Each platform is evaluated for five raster operations: pixel count, reclassification, raster add, focal averaging, and zonal statistics using three different datasets.
Preprint
Full-text available
In this paper, a technology for massive data storage and computing named Hadoop is surveyed. Hadoop runs on heterogeneous computing devices like regular PCs, abstracting away the details of parallel processing so that developers can concentrate on their computational problem. A Hadoop cluster is made of two parts: HDFS and MapReduce. A Hadoop cluster uses HDFS for data management. HDFS provides storage for input and output data in MapReduce jobs and is designed with properties such as high fault tolerance, high distribution capacity, and high throughput. It is also suitable for storing terabytes of data on clusters, and it runs on flexible hardware like commodity devices.
Article
Full-text available
MapReduce has been introduced to ease the task of developing big data programs and applications. However, distributed jobs are not natively composable and reusable for subsequent development, which also hampers the ability to apply optimizations to the data flow of job sequences and pipelines. The Hierarchically Distributed Data Matrix (HDM) is a functional, strongly-typed data representation for writing composable big data applications. Along with HDM, a runtime framework is provided to support the execution, coordination, and management of HDM applications on distributed infrastructures. Based on the functional data dependency graph of HDM, multiple optimizations are applied to improve the performance of executing HDM jobs. The experimental results show that these optimizations can achieve improvements of between 10% and 40% in job completion time for different types of applications when compared with the state of the art. Programming abstraction is the core of this system; HDM is thus presented as a functional, strongly-typed meta-data abstraction for composing data-parallel programs.
Article
Aggregation is common in data analytics and crucial to distilling information from large datasets, but current data analytics frameworks do not fully exploit the potential for optimization in such phases. The lack of optimization is particularly notable in current “online” approaches which store data in main memory across nodes, shifting the bottleneck away from disk I/O toward network and compute resources, thus increasing the relative performance impact of distributed aggregation phases. We present ROME, an aggregation system for use within data analytics frameworks or in isolation. ROME uses a set of novel heuristics based primarily on basic knowledge of aggregation functions combined with deployment constraints to efficiently aggregate results from computations performed on individual data subsets across nodes (e.g., merging sorted lists resulting from top- k ). The user can either provide minimal information which allows our heuristics to be applied directly, or ROME can autodetect the relevant information at little cost. We integrated ROME as a subsystem into the Spark and Flink data analytics frameworks. We use real world data to experimentally demonstrate speedups up to 3 × over single level aggregation overlays, up to 21% over other multi-level overlays, and 50% for iterative algorithms like gradient descent at 100 iterations.
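
The kind of aggregation ROME targets, such as merging sorted per-node top-k lists, can be expressed as a k-way merge. The minimal sketch below uses only the standard library and is purely illustrative; it says nothing about ROME's overlay heuristics.

```python
# Merging sorted partial results from several workers into a global top-k,
# the canonical aggregation step mentioned above (e.g., distributed top-k).
import heapq

def global_top_k(sorted_partials, k):
    # each partial list is already sorted in descending score order
    merged = heapq.merge(*sorted_partials, key=lambda item: -item[1])
    return list(merged)[:k]

node_a = [("x", 0.95), ("y", 0.40)]
node_b = [("z", 0.80), ("w", 0.10)]
node_c = [("v", 0.60)]
print(global_top_k([node_a, node_b, node_c], k=3))
# [('x', 0.95), ('z', 0.8), ('v', 0.6)]
```
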
Chapter
In practice, it has been acknowledged that Hadoop framework is not an adequate choice for supporting interactive queries which aim of achieving a response time of milliseconds or few seconds. In addition, many programmers may be unfamiliar with the Hadoop framework and they would prefer to use SQL as a high-level declarative language to implement their jobs while delegating all of the optimization details in the execution process to the underlying engine. This chapter provides an overview of various systems that have been introduced to support the SQL flavor on top of the Hadoop-like infrastructure and provide competing and scalable performance on processing large-scale structured data.
Thesis
Full-text available
Big Data science is nowadays at the basis of most everyday applications, whether in work or recreational environments. After a general analysis of the world of Intelligent Buildings and of the technologies that dominate it, this thesis discusses the importance of Big Data within the building sector, analysing its main technologies. The degree of integration is studied, providing a "snapshot" of the current state and analysing possible future developments.
Chapter
In general, the discovery process often employs analytics techniques from a variety of genres such as time-series analysis, text analytics, statistics, and machine learning. Moreover, the process might involve the analysis of structured data from conventional transactional sources, in conjunction with the analysis of multi-structured data from other sources such as clickstreams, call detail records, application logs, or text from call center records. This chapter provides an overview of various general-purpose big data processing systems which empower its user to develop various big data processing jobs for different application domains.
Chapter
With the wide availability of data and increasing capacity of computing resources, machine learning and deep learning techniques have become very popular techniques on harnessing the power of data by achieving powerful analytical features. This chapter focuses on discussing several systems that have been developed to support computationally expensive machine learning and deep learning algorithms on top of big data processing frameworks.
Chapter
Big data analytics is currently representing a revolution that cannot be missed. It is significantly transforming and changing various aspects in our modern life including the way we live, socialize, think, work, do business, conduct research, and govern society. In this chapter, we provide an outlook for various applications to exploit big data technologies in current and future application domains. In addition, we highlight some of the open challenges which addressing them will further improve the power of big data technologies.
Chapter
Graphs are recognized as a general, natural, and flexible data-abstraction that can model complex relationships, interactions, and interdependencies between objects. Graphs have been widely used to represent datasets and encode problems across an already extensive range of application domains. The ever-increasing size of graph-structured data for these applications creates a critical need for scalable and even elastic systems that can process large amounts of it efficiently. In general, graph processing algorithms are iterative and need to traverse the graph in a certain way. This chapter focuses on discussing several systems that have been designed to tackle the problem of large-scale graph processing.
Chapter
In every second of every day, we are generating massive amounts of data. In general, stream computing is a new paradigm which has been necessitated by new data-generating scenarios, such as the ubiquity of mobile devices, location services, and sensor pervasiveness. In general, stream processing engines enable a large class of applications in which data are produced from various sources and are moved asynchronously to processing nodes. Thus, streaming applications are normally configured as continuous tasks in which their execution starts from the time of their inception till the time of their cancellation. The main focus of this chapter is to cover several systems that have been designed to provide scalable solutions for processing big data streams in addition to other set of systems that have been introduced to support the development of data pipelines between various types of big data processing jobs and systems.
Chapter
In this paper, we propose FastThetaJoin, an optimization technique for θ-join operation on multi-way data streams, which is an essential query often used in many data analytical tasks. The θ-join operation on multi-way data streams is notoriously difficult as it always involves tremendous shuffle cost due to data movements between multiple operation components, rendering it hard to be efficiently implemented in a distributed environment. As with previous methods, FastThetaJoin also tries to minimize the number of θ-joins, but it is distinct from others in terms of making partitions, deleting unnecessary data items, and performing the Cartesian product. FastThetaJoin not only effectively minimizes the number of θ-joins, but also substantially improves the efficiency of its operations in a distributed environment. We implemented FastThetaJoin in the framework of Spark Streaming, characterized by its efficient bucket implementation of parameterized windows. The experimental results show that, compared with the existing solutions, our proposed method can speed up the θ-join processing while reducing its overhead; the specific effects of the optimization is correlated to the nature of data streams–the greater the data difference is, the more apparent the optimization effect is.
Chapter
In recent years, so-called Infrastructure as a Service (IaaS) clouds have become increasingly popular as a flexible and inexpensive platform for ad-hoc parallel data processing. Major players in the cloud computing space like Amazon EC2 have already recognized this trend and started to create special offers which bundle their compute platform with existing software frameworks for these kinds of applications. However, the data processing frameworks which are currently used in these offers have been designed for static, homogeneous cluster systems and do not support the new features which distinguish the cloud platform. This chapter examines the characteristics of IaaS clouds with special regard to massively-parallel data processing. The author highlights use cases which are currently poorly supported by existing parallel data processing frameworks and explains how a tighter integration between the processing framework and the underlying cloud system can help to lower the monetary processing cost for the cloud customer. As a proof of concept, the author presents the parallel data processing framework Nephele, and compares its cost efficiency against the one of the well-known software Hadoop.
Book
Full-text available
There is rapid development and change in the field of computer science today. These affect all areas of life. Emerging topics in computer science are covered in this book. In the first chapter, there is a log data analysis case study that aims to understand Hadoop and MapReduce by example. In the second chapter, encrypted communication has been tried to be provided on IoT devices performing edge computing. In this way, it is aimed to make communication secure. Arduino is used as an IoT device. In the encryption process, AES encryption is used with 128-bit and 256-bit key lengths. The third chapter presents a more secure connection between the NodemCU and Blynk, using the AES algorithm. it is aimed to prevent a vulnerability in the connection of Google Home devices with IoT during the Blynk IoT connection phase. The next chapter presents a methodological tool based on an evaluation framework for integration of digital games into education (MEDGE), expanded by adding additional information from the students, MEDGE+. The fifth and last chapter proposes a disaster management system utilising machine learning called DT-DMS that is used to support decision-making mechanisms. Book Chapters; 1. CHAPTER Understanding Hadoop and MapReduce by example: log data analysis case study Gligor Risteski, Mihiri Chathurika, Beyza Ali, Atanas Hristov 2. CHAPTER Edge Computing Security with an IoT device Beyda Nur Kars 3. CHAPTER Secure Connection between Google Home and IoT Device Ekrem Yiğit 4. CHAPTER Practical Evaluation on Serious Games in Education Slavica Mileva Eftimova, Ana Madevska Bogdanova, Vladimir Trajkovik 5. CHAPTER Digital Twin Based Disaster Management System Proposal: DT-DMS Özgür Doğan, Oğuzhan Şahin, Enis Karaarslan
Book
There is rapid development and change in the field of computer science today. These affect all areas of life. Emerging topics in computer science are covered in this book. In the first chapter, there is a log data analysis case study that aims to understand Hadoop and MapReduce by example. In the second chapter, encrypted communication has been tried to be provided on IoT devices performing edge computing. In this way, it is aimed to make communication secure. Arduino is used as an IoT device. In the encryption process, AES encryption is used with 128-bit and 256-bit key lengths. The third chapter presents a more secure connection between the NodemCU and Blynk, using the AES algorithm. it is aimed to prevent a vulnerability in the connection of Google Home devices with IoT during the Blynk IoT connection phase. The next chapter presents a methodological tool based on an evaluation framework for integration of digital games into education (MEDGE), expanded by adding additional information from the students, MEDGE+. The fifth and last chapter proposes a disaster management system utilising machine learning called DTDMS that is used to support decision-making mechanisms.
Chapter
One of the main challenges for large-scale computer clouds dealing with massive real-time data is in coping with the rate at which unprocessed data is being accumulated. Transforming big data into valuable information requires a fundamental re-think of the way in which future data management models will need to be developed on the Internet. Unlike the existing relational schemes, pattern-matching approaches can analyze data in similar ways to which our brain links information. Such interactions when implemented in voluminous data clouds can assist in finding overarching relations in complex and highly distributed data sets. In this chapter, a different perspective of data recognition is considered. Rather than looking at conventional approaches, such as statistical computations and deterministic learning schemes, this chapter focuses on distributed processing approach for scalable data recognition and processing.
Article
Full-text available
Technologies around the world produce and interact with geospatial data instantaneously, from mobile web applications to satellite imagery that is collected and processed across the globe daily. Big raster data allow researchers to integrate and uncover new knowledge about geospatial patterns and processes. However, we are at a critical moment, as we have an ever-growing number of big data platforms that are being co-opted to support spatial analysis. A gap in the literature is the lack of a robust assessment comparing the efficiency of raster data analysis on big data platforms. This research begins to address this issue by establishing a raster data benchmark that employs freely accessible datasets to provide a comprehensive performance evaluation and comparison of raster operations on big data platforms. The benchmark is critical for evaluating the performance of spatial operations on big data platforms. The benchmarking datasets and operations are applied to three big data platforms. We report computing times and performance bottlenecks so that GIScientists can make informed choices regarding the performance of each platform. Each platform is evaluated for five raster operations: pixel count, reclassification, raster add, focal averaging, and zonal statistics using three raster different datasets.
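
Of the five benchmarked raster operations, zonal statistics is the least self-explanatory: it groups pixel values by a zone raster and aggregates each group. The small NumPy sketch below shows that operation on toy arrays; it is unrelated to the benchmark's datasets or platforms.

```python
# Zonal statistics on toy rasters: mean pixel value per zone, computed by
# accumulating per-zone sums and counts with bincount.
import numpy as np

values = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0],
                   [7.0, 8.0, 9.0]])
zones = np.array([[0, 0, 1],
                  [0, 1, 1],
                  [2, 2, 2]])

flat_zones = zones.ravel()
sums = np.bincount(flat_zones, weights=values.ravel())
counts = np.bincount(flat_zones)
print(sums / counts)   # mean per zone: [2.33..., 4.66..., 8.0]
```
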
Article
Full-text available
It is not easy to analyze huge data sets; this requires machine-based frameworks and technologies for processing. MapReduce, a distributed parallel programming model running in the Hadoop environment, processes massive volumes of data. A parallel programming technique can be applied to the linear regression and support vector machine algorithms from the machine learning community to accelerate them on multicore systems for better timing efficiency.
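
The parallelization alluded to above typically works because the linear-regression gradient is a sum over examples: each partition can compute a partial gradient independently and a reduce step adds them. The NumPy sketch below simulates the partitions in one process; it is a pattern illustration, not the authors' Hadoop code.

```python
# Data-parallel gradient descent for linear regression: each "mapper" computes
# a partial gradient on its data shard, the "reducer" sums them.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

def partial_gradient(X_shard, y_shard, w):
    # gradient of 0.5 * ||Xw - y||^2 restricted to this shard
    return X_shard.T @ (X_shard @ w - y_shard)

w = np.zeros(3)
shards = np.array_split(np.arange(len(y)), 4)      # pretend these live on 4 nodes
for _ in range(200):
    grads = [partial_gradient(X[idx], y[idx], w) for idx in shards]  # map
    w -= 0.5 * np.sum(grads, axis=0) / len(y)                        # reduce + update
print(w)   # close to [2.0, -1.0, 0.5]
```
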
Conference Paper
Full-text available
Nowadays, technological evolution in Smart Intelligent Transport Systems (SITS) is allowing a ubiquitous society to use SITS facilities to conduct activities for social needs. In this scenario, the paper presents a case study on applying SITS technologies to conduct these activities, mainly vehicle tracking. The case study uses the concept of the Internet of Things for the inspection of different vehicles; it considers the use of existing Intelligent Transport Systems infrastructure as well as the deployment of new infrastructure using Bus Rapid Transit (BRT). To implement the Internet of Things, the case study uses RFID (Radio Frequency Identification) technologies associated with multiple sensors installed in each vehicle in order to identify the vehicle when it passes through a Region of Interest (ROI) portal in the BRT, together with the associated information, if available, such as temperature, humidity, average speed, frequency modulation, doors, road lengths, and other relevant information. The major advantage of IoT is the ability to recognize or identify vehicles using the transportation system in the ROI, along with other information, without disrupting the normal flow.
Conference Paper
Full-text available
The big data business ecosystem is very supportive of user needs, and the trends that provide the basis for big data are explained. An effective solution is needed for the issue of data volume in order to enable feasible, cost-effective, and scalable storage and processing of enormous quantities of data; thus big data and the cloud go hand in hand, and Hadoop is a rapidly growing technology for organizations. The steps required for setting up a distributed, single-node Hadoop cluster backed by HDFS running on Ubuntu are given (steps only). We propose new techniques for better big data and Hadoop implementations using cloud computing resources, with various methods and systems for common needs. This is very helpful, and there is broad scope for future research.
Article
Large-scale datasets collected from heterogeneous sources often require a join operation to extract valuable information. MapReduce is an efficient programming model for processing large-scale data. However, it has some limitations in processing heterogeneous datasets. This is because of the large amount of redundant intermediate records that are transferred through the network. Several filtering techniques have been developed to improve the join performance, but they require multiple MapReduce jobs to process the input datasets. To address this issue, the adaptive filter-based join algorithms are presented in this paper. Specifically, three join algorithms are introduced to perform the processes of filters creation and redundant records elimination within a single MapReduce job. A cost analysis of the introduced join algorithms shows that the I/O cost is reduced compared to the state-of-the-art filter-based join algorithms. The performance of the join algorithms was evaluated in terms of the total execution time and the total amount of I/O data transferred. The experimental results show that the adaptive Bloom join, semi-adaptive intersection Bloom join, and adaptive intersection Bloom join decrease the total execution time by 30%, 25%, and 35%, respectively; and reduce the total amount of I/O data transferred by 18%, 25%, and 50%, respectively.
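
The filtering idea behind these Bloom-join variants is to build a compact Bloom filter over one relation's join keys and use it to discard non-matching records from the other relation before the shuffle. The self-contained toy sketch below uses a simple hash scheme for illustration; it is not the paper's adaptive algorithms.

```python
# Bloom-filter semi-join sketch: keys of the smaller relation are inserted into
# a bit array; the larger relation is filtered before any shuffle/join happens.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes, self.bits = num_bits, num_hashes, 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))

small_keys = ["u1", "u2"]
large = [("u1", "order-1"), ("u9", "order-2"), ("u2", "order-3")]

bloom = BloomFilter()
for key in small_keys:
    bloom.add(key)

# false positives are possible, missed matches are not
survivors = [rec for rec in large if bloom.might_contain(rec[0])]
print(survivors)   # [('u1', 'order-1'), ('u2', 'order-3')] (plus rare false positives)
```
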
Article
Essentially, data mining concerns computation over data and the identification of patterns and trends in the information so that we might make decisions or judgments. Data mining concepts have been in use for years, but with the emergence of big data, they are even more common. In particular, the scalable mining of such large data sets is a difficult issue that has attracted several recent studies. A few of these recent works use the MapReduce methodology to construct data mining models across the data set. In this article, we examine current approaches to large-scale data mining and compare their output to the MapReduce model. Based on our research, a system for data mining that combines MapReduce and sampling is implemented and discussed.
Chapter
Advances in communication technologies, along with the birth of new communication paradigms leveraging the power of social platforms, have fostered the production of huge amounts of data. Old-fashioned computing paradigms are unfit to handle the volumes of data produced daily by countless, worldwide-distributed sources of information. So far, MapReduce has been able to keep the promise of speeding up computation over Big Data within a cluster. This article focuses on scenarios of worldwide distributed Big Data. After highlighting the poor performance of the Hadoop framework when deployed in such scenarios, it proposes the definition of a Hierarchical Hadoop Framework (H2F) to cope with the issues arising when Big Data are scattered over geographically distant data centers. The article highlights the novelty introduced by the H2F with respect to other hierarchical approaches. Tests run on a software prototype are also reported to show the increase in performance that H2F is able to achieve in geographical scenarios over a plain Hadoop approach.
Conference Paper
Full-text available
Nanotechnology is quickly becoming a ubiquitous technology with the potential to affect every part of modern human development. Nearly every area of human endeavour will be influenced, for example agriculture and food, communication, computers, environmental monitoring, materials, robotics, healthcare, and medical technology. More recently, computer science has become closely associated with nanotechnology. This technology may be launched around 2020. Fifth-generation network technology depends on nanotechnology and an all-IP architecture, and 5G is expected to surpass current networks in speed and accessibility. 5G network technology is a more innovative and engaging technology that will be valuable to users and specialists in various fields. The paper covers the approach towards nanotechnology as well as the architecture, key advantages, and applications of 5G wireless communication technology.
Article
Full-text available
Although a search engine manages a great deal of data and responds to queries, it is not accurately described as a "database" or DBMS. We believe that it represents the first of many application-specific data systems built by the systems community that must exploit the principles of databases without necessarily using the (current) database implementations. In this paper, we present how a search engine should have been designed in hindsight. Although much of the material has not been presented before, the contribution is not in the specific design, but rather in the combination of principles from the largely independent disciplines of "systems" and "databases." Thus we present the design using the ideas and vocabulary of the database community as a model of how to design data-intensive systems. We then draw some conclusions about the application of database principles to other "out of the box" data-intensive systems.
Conference Paper
Full-text available
Dryad is a general-purpose distributed execution engine for coarse-grain data-parallel applications. A Dryad application combines computational "vertices" with communication "channels" to form a dataflow graph. Dryad runs the application by executing the vertices of this graph on a set of available computers, communicating as appropriate through files, TCP pipes, and shared-memory FIFOs. The vertices provided by the application developer are quite simple and are usually written as sequential programs with no thread creation or locking. Concurrency arises from Dryad scheduling vertices to run simultaneously on multiple computers, or on multiple CPU cores within a computer. The application can discover the size and placement of data at run time, and modify the graph as the computation progresses to make efficient use of the available resources. Dryad is designed to scale from powerful multi-core single computers, through small clusters of computers, to data centers with thousands of computers.
Conference Paper
Full-text available
The DWS (Data Warehouse Striping) technique is a round-robin data partitioning approach especially designed for distributed data warehousing environments. In DWS the fact tables are distributed over an arbitrary number of low-cost computers and the queries are executed in parallel by all the computers, guaranteeing a nearly optimal speed-up and scale-up. However, the use of a large number of inexpensive nodes increases the risk of node failures that impair the computation of queries. This paper proposes an approach that provides Data Warehouse Striping with the capability of answering queries even in the presence of node failures. This approach is based on the selective replication of data over the cluster nodes, which guarantees full availability when one or more nodes fail. The proposal was evaluated using the new TPC-DS benchmark and the results show that the approach is quite effective.
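A loose sketch of the striping-plus-replication idea follows; it is an illustration of the general technique under assumed parameters (four nodes, each row replicated to the next node in the ring, and a simple SUM query), not the paper's actual replication scheme. Rows of a fact table are assigned round-robin to the nodes, and a query can still cover every row when one node fails.

```python
NUM_NODES = 4

def stripe_with_replication(rows, num_nodes=NUM_NODES):
    """Round-robin striping; each row is also copied to the next node."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        primary = i % num_nodes
        backup = (primary + 1) % num_nodes
        nodes[primary].append(("primary", i, row))
        nodes[backup].append(("replica", i, row))
    return nodes

def parallel_sum(nodes, failed=frozenset()):
    """Answer SUM(amount) even if some nodes are down, without double counting."""
    seen = set()
    total = 0.0
    for node_id, partition in enumerate(nodes):
        if node_id in failed:
            continue
        for role, row_id, row in partition:
            if row_id not in seen:
                seen.add(row_id)
                total += row["amount"]
    return total

fact_rows = [{"sale_id": i, "amount": float(i % 10)} for i in range(100)]
nodes = stripe_with_replication(fact_rows)

print(parallel_sum(nodes))              # all nodes available
print(parallel_sum(nodes, failed={2}))  # same answer with node 2 down
```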
Article
Full-text available
Large-scale cluster-based Internet services often host partitioned datasets to provide incremental scalability. The aggregation of results produced from multiple partitions is a fundamental building block for the delivery of these services. This paper presents the design and implementation of a programming primitive -- Data Aggregation Call (DAC) -- to exploit partition parallelism for cluster-based Internet services. A DAC request specifies a local processing operator and a global reduction operator, and it aggregates the local processing results from participating nodes through the global reduction operator. Applications may allow a DAC request to return partial aggregation results as a tradeoff between quality and availability. Our architecture design aims at improving interactive responses with sustained throughput for typical cluster environments where platform heterogeneity and software/hardware failures are common. At the cluster level, our load-adaptive reduction tree construction algorithm balances processing and aggregation load across servers while exploiting partition parallelism. Inside each node, we employ an event-driven thread pool design that prevents slow nodes from adversely affecting system throughput under highly concurrent workload. We further devise a staged timeout scheme that eagerly prunes slow or unresponsive servers from the reduction tree to meet soft deadlines. We have used the DAC primitive to implement several applications: a search engine document retriever, a parallel protein sequence matcher, and an online parallel facial recognizer. Our experimental and simulation results validate the effectiveness of the proposed optimization techniques for (1) reducing response time, (2) improving throughput, and (3) handling server unresponsiveness ...
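The pairing of a local processing operator with a global reduction operator can be sketched as follows; this is a sequential toy under assumed operators and partition contents, with the thread-pool, reduction-tree, and timeout machinery omitted. Each partition applies the local operator independently, the partial results are folded with the global reduction operator, and unresponsive partitions can be pruned to return a partial aggregate.

```python
from functools import reduce

def data_aggregation_call(partitions, local_op, global_op, responsive=None):
    """Apply local_op to each partition, then fold the results with global_op.

    `responsive` optionally marks which partitions answered before the
    deadline; unresponsive ones are pruned, yielding a partial aggregate.
    """
    if responsive is None:
        responsive = [True] * len(partitions)
    partials = [local_op(p) for p, ok in zip(partitions, responsive) if ok]
    return reduce(global_op, partials)

# Example: a document retriever that scores documents per partition and
# keeps the global top-3.
partitions = [
    [("doc1", 0.9), ("doc2", 0.4)],
    [("doc3", 0.7), ("doc4", 0.8)],
    [("doc5", 0.95)],
]

def local_top3(partition):
    return sorted(partition, key=lambda d: d[1], reverse=True)[:3]

def merge_top3(a, b):
    return sorted(a + b, key=lambda d: d[1], reverse=True)[:3]

print(data_aggregation_call(partitions, local_top3, merge_top3))
# Partial answer when the third partition is unresponsive:
print(data_aggregation_call(partitions, local_top3, merge_top3,
                            responsive=[True, True, False]))
```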
Article
Full-text available
This is a thought piece on data-intensive science requirements for databases and science centers. It argues that peta-scale datasets will be housed by science centers that provide substantial storage and processing for scientists who access the data via smart notebooks. Next-generation science instruments and simulations will generate these peta-scale datasets. The need to publish and share data and the need for generic analysis and visualization tools will finally create a convergence on common metadata standards. Database systems will be judged by their support of these metadata standards and by their ability to manage and access peta-scale datasets. The procedural stream-of-bytes-file-centric approach to data analysis is both too cumbersome and too serial for such large datasets. Non-procedural query and analysis of schematized self-describing data is both easier to use and allows much more parallelism.
Article
Google's MapReduce programming model serves for processing large data sets in a massively parallel manner. We deliver the first rigorous description of the model including its advancement as Google's domain-specific language Sawzall. To this end, we reverse-engineer the seminal papers on MapReduce and Sawzall, and we capture our findings as an executable specification. We also identify and resolve some obscurities in the informal presentation given in the seminal papers. We use typed functional programming (specifically Haskell) as a tool for design recovery and executable specification. Our development comprises three components: (i) the basic program skeleton that underlies MapReduce computations; (ii) the opportunities for parallelism in executing MapReduce computations; (iii) the fundamental characteristics of Sawzall's aggregators as an advancement of the MapReduce approach. Our development does not formalize the more implementational aspects of an actual, distributed execution of MapReduce computations.
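In a similar spirit, though in Python rather than Haskell and greatly simplified, the sketch below captures the basic program skeleton: a user-supplied map function is applied to every input record, the intermediate pairs are grouped by key, and a user-supplied reduce function folds each group. The function names and the word-count example are assumptions made for illustration.

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Sequential skeleton of a MapReduce computation.

    mapper(key, value)  -> iterable of (k2, v2) pairs
    reducer(k2, values) -> iterable of output values
    """
    # Map phase.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(mapper(key, value))

    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce phase.
    return {k2: list(reducer(k2, values)) for k2, values in sorted(groups.items())}

# Word count, the canonical example.
def word_count_mapper(doc_id, text):
    for word in text.split():
        yield word, 1

def word_count_reducer(word, counts):
    yield sum(counts)

docs = [("d1", "map reduce merge"), ("d2", "map reduce")]
print(map_reduce(docs, word_count_mapper, word_count_reducer))
# {'map': [2], 'merge': [1], 'reduce': [2]}
```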
Conference Paper
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points. The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients. In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.
Conference Paper
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.
Article
Like most applications, database systems want cheap, fast hardware. Today that means commodity processors, memories, and disks. Consequently, the hardware concept of a database machine built of exotic hardware is inappropriate for current technology. On the other hand, the availability of fast microprocessors, and small inexpensive disks packaged as standard inexpensive but fast computers is an ideal platform for parallel database systems. The main topics are basic techniques for parallel database machine implementation, the trend to shared-nothing machines, parallel dataflow approach to SQL software, and future directions and research problems.
Article
This paper extends earlier research on hash-join algorithms to a multiprocessor architecture. Implementations of a number of centralized join algorithms are described and measured. Evaluation of these algorithms served to verify earlier analytical results. In addition, they demonstrate that bit vector filtering provides dramatic improvement in the performance of all algorithms, including the sort-merge join algorithm. Multiprocessor configurations of the centralized Grace and Hybrid hash-join algorithms are also presented. Both algorithms are shown to provide linear increases in throughput with corresponding increases in processor and disk resources. After the publication of the classic join algorithm paper in 1977 by Blasgen and Eswaran [BLAS77], the topic was virtually abandoned as a research area. Everybody "knew" that a nested-loops algorithm provided acceptable performance on small relations or large relations when a suitable index existed and that sort-merge was ...
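The two ingredients measured in that study, an in-memory hash join and bit-vector filtering, can be sketched in a few lines; this is a single-process illustration under assumed table contents and bit-vector size, not the multiprocessor Grace or Hybrid variants. The build relation is hashed into a table, a bit vector over the hashed join keys is produced as a by-product, and probe tuples whose key maps to an unset bit are discarded before the probe.

```python
from collections import defaultdict

BITS = 64  # size of the bit-vector filter (an assumption for the example)

def build(build_rel):
    """Build an in-memory hash table and a bit vector over the join keys."""
    table = defaultdict(list)
    bitvec = 0
    for key, payload in build_rel:
        table[key].append(payload)
        bitvec |= 1 << (hash(key) % BITS)
    return table, bitvec

def probe(probe_rel, table, bitvec):
    """Probe the hash table, skipping tuples rejected by the bit vector."""
    for key, payload in probe_rel:
        if not (bitvec >> (hash(key) % BITS)) & 1:
            continue  # key certainly absent from the build relation
        for match in table.get(key, ()):
            yield key, match, payload

departments = [(10, "engineering"), (20, "sales")]                  # build (smaller) relation
employees = [(10, "ann"), (30, "bob"), (20, "carol"), (40, "dan")]  # probe relation

table, bitvec = build(departments)
for row in probe(employees, table, bitvec):
    print(row)
# (10, 'engineering', 'ann')
# (20, 'sales', 'carol')
```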
Article
We report the performance of NOW-Sort, a collection of sorting implementations on a Network of Workstations (NOW). We find that parallel sorting on a NOW is competitive to sorting on the large-scale SMPs that have traditionally held the performance records. On a 64-node cluster, we sort 6.0 GB in just under one minute, while a 32-node cluster finishes the Datamation benchmark in 2.41 seconds. Our implementations can be applied to a variety of disk, memory, and processor configurations; we highlight salient issues for tuning each component of the system. We evaluate the use of commodity operating systems and hardware for parallel sorting. We find existing OS primitives for memory management and file access adequate. Due to aggregate communication and disk bandwidth requirements, the bottleneck of our system is the workstation I/O bus.
Redundant Array of Inexpensive Nodes
Wikipedia
Wikipedia. Redundant Array of Inexpensive Nodes. http://en.wikipedia.org/wiki/Redundant_Array_of_Inexpensive_Nodes, 2006.