Article

MapReduce: A major step backwards

Authors: David J. DeWitt and Michael Stonebraker
... The application first defines a regular Runnable class that carries out the estimation of π (lines 1-19). To parallelize its execution, lines 23-24 run a fork-join pattern over a set of tasks, each counting the number of points falling into the circle, which serves to approximate π. ...
... When data is small and the reduction operation simple, aggregating the output of the map phase directly in the storage layer is faster [12]. The DSO layer of Crucial makes it possible to implement such an approach. ...
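The fork-join estimation of π sketched in the excerpt above can be illustrated with plain java.util.concurrent primitives. This is a generic Monte Carlo sketch for intuition only, not Crucial's actual API or its DSO layer; the thread-pool size and the number of points are illustrative choices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadLocalRandom;

// Monte Carlo estimation of pi with a fork-join pattern: each task counts the
// random points that fall inside the unit circle, and the partial counts are
// aggregated ("joined") at the end.
public class PiEstimator {

    static long countInCircle(long points) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        long inside = 0;
        for (long i = 0; i < points; i++) {
            double x = rnd.nextDouble(), y = rnd.nextDouble();
            if (x * x + y * y <= 1.0) inside++;
        }
        return inside;
    }

    public static void main(String[] args) throws Exception {
        int tasks = 8;                        // illustrative degree of parallelism
        long pointsPerTask = 1_000_000L;
        ExecutorService pool = Executors.newFixedThreadPool(tasks);

        List<Future<Long>> futures = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            Callable<Long> task = () -> countInCircle(pointsPerTask);   // fork
            futures.add(pool.submit(task));
        }
        long inside = 0;
        for (Future<Long> f : futures) inside += f.get();               // join
        pool.shutdown();

        double pi = 4.0 * inside / ((double) tasks * pointsPerTask);
        System.out.println("pi ~= " + pi);
    }
}
```

In a serverless setting such as the one the excerpts describe, each task would run as a function invocation, and when the reduction is this simple the partial counts can be merged directly in the shared storage layer instead of going through a full reduce phase.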
Conference Paper
Serverless computing is an emerging paradigm that greatly simplifies the usage of cloud resources and suits many tasks well. Most notably, Function-as-a-Service (FaaS) enables programmers to develop cloud applications as individual functions that can run and scale independently. Yet, due to the disaggregation of storage and compute resources in FaaS, applications that require fine-grained support for mutable state and synchronization, such as machine learning and scientific computing, are hard to build. In this work, we present Crucial, a system to program highly-concurrent stateful applications with serverless architectures. Its programming model keeps the simplicity of FaaS and allows multi-threaded algorithms to be ported effortlessly to this new environment. Crucial is built upon the key insight that FaaS resembles concurrent programming at the scale of a data center. As a consequence, a distributed shared memory layer is the right answer to the need for fine-grained state management and coordination in serverless settings. We validate our system with the help of micro-benchmarks and various applications. In particular, we implement two common machine learning algorithms: k-means clustering and logistic regression. For both cases, Crucial obtains superior or comparable performance to an equivalent Spark cluster.
... Data is queried via user-coded functions, in the absence of a dedicated query language. Moreover, as pointed out by [Dewitt and Stonebraker 2008], HDFS is designed for single-pass data manipulation, hence, for example, the resort to chains of MapReduce jobs. ...
... ]. The drop in storage costs since the 2000s (less than one euro per gigabyte) makes it a solution capable of absorbing very large volumes of data at low cost. This type of distributed architecture has recently seen the development of massively distributed file management systems (the best known probably being Hadoop [Anderson et al. 2010]) and of new techniques for massively parallel processing (Map/Reduce [Dewitt and Stonebraker 2008]). ...
Thesis
Decision-support systems occupy a prominent place within companies and large organizations, enabling analyses dedicated to decision making. With the advent of big data, the volume of analytical data reaches critical sizes, challenging classical data warehousing approaches, whose current solutions rely mainly on R-OLAP databases. With the emergence of large Web platforms such as Google, Facebook, Twitter, Amazon, and others, solutions for managing massive data (Big Data) have been developed, known as "Not Only SQL". These new approaches constitute an interesting avenue for building multidimensional data warehouses capable of supporting very large volumes of data. Calling the R-OLAP approach into question requires revisiting the principles of multidimensional data warehouse modeling. In this manuscript, we propose processes for implementing multidimensional data warehouses with NoSQL models. We define four processes for each of the two NoSQL models considered, column-oriented and document-oriented. In addition, the NoSQL context also complicates the efficient computation of pre-aggregates that are usually set up in the R-OLAP context (the lattice). We extend our implementation processes to take into account the construction of the lattice in the two chosen models. Since it is difficult to choose a single NoSQL implementation that efficiently supports all applicable processing, we propose two translation processes: the first concerns intra-model processes, that is, rules for moving from one implementation to another implementation of the same NoSQL logical model, while the second defines the rules for transforming an implementation of one logical model into an implementation of another logical model.
... Instead of traditional SQL execution algorithms, NoSQL databases usually use the Map-Reduce (MR) model [6] for processing large amounts of data. There are many discussion papers [7][8][9] and research blogs [10] on these two technologies. Among them, I briefly present two that compare the two classes of systems in detail [7,8]. ...
... This is also possible in MR systems, but MR programmers must "implement" indexes to accelerate access to the data required for the application. I strongly recommend those comparison papers to interested readers [7][8][9][10]. Now, the question is: "Where in IoT systems do we prefer a parallel DBMS over MapReduce, or vice versa?" ...
Article
Full-text available
The Internet of Things (IoT) is a network of different types of information sources that continuously generate data and transmit it over the Internet. Sensors, radio frequency identification (RFID) devices, global positioning systems (GPS), mobile devices and Internet-enabled actuator technologies play an important role in IoT systems. IoT brings new challenges in terms of data and information management: beyond the difficulty of collecting and processing large amounts of heterogeneous data generated at very high speed, it is also not easy to retrieve and manage the information hidden in this big data. In this article, I address the main factors affecting data processing efficiency in IoT systems, in particular query processing and transaction management. There are many lessons learned from traditional database systems, distributed systems and sensor networks, but traditional solutions mostly fall short of meeting the needs of applications in a complex ecosystem like IoT. In traditional database systems, for example, query operations are usually local and their execution costs depend on resource constraints such as available processing power and memory. Traditional transaction management mechanisms, on the other hand, guarantee the ACID properties to ensure overall data integrity. It is clear that the different types of IoT applications operating on heterogeneous, continuous, real-time and geographically dispersed big data will significantly change the well-known aspects of query processing and transaction management. Context-aware querying, distributed querying, the MapReduce computation model and flexible processing models such as web-based transaction management are some of the current topics discussed in this article. With the brief but comprehensive information in this study, I aim to provide a guide for researchers working on IoT systems, especially on database systems.
... In addition to data storage, it must be possible to call and apply algorithms to these datasets. Over the past 20 years or so, parallel computing has been the most well-known technique to store and explore petabytes of data (Dean & Ghemawat, 2008;DeWitt & Stonebraker, 2008;Ghemawat, Gobioff, & Leung, 2003). ...
... Over the past few years, in addition to the benefits of MapReduce, there has also been widespread discussion of its challenges. A well-known argument presented by DeWitt and Stonebraker (2008) is that MapReduce is superficial for handling large-scale and demanding data processing. They strongly argue the need for schemas to avoid the inclusion of low-quality or "corrupt" data into the process. ...
Article
Full-text available
This paper investigates the web-based remote sensing platform Google Earth Engine (GEE) and evaluates the platform's utility for performing raster and vector manipulations on Landsat, Moderate Resolution Imaging Spectroradiometer and GlobCover (2009) imagery. We assess its capacity to conduct space–time analysis over two subregions of Singapore, namely Tuas and the Central Catchment Reserve (CCR), for the Urban and Wetlands land classes. In its current state, GEE has proven to be a powerful tool by providing access to a wide variety of imagery in one consolidated system. Furthermore, it possesses the ability to perform spatial aggregations over global-scale data at high computational speed, though supporting both spatial and temporal analysis is not an obvious task for the platform. We examine the challenges that GEE faces, which are also common to most parallel-processing, big-data architectures. The ongoing refinement of this system makes it promising for big-data analysts from diverse user groups. As a use case for exploring GEE, we analyze Singapore's land use and cover. We observe the change in Singapore's landmass through land reclamation. Also, within the region of the CCR, a large protected area, we find forest cover is not affected by anthropogenic factors, but instead is driven by the monsoon cycles affecting Southeast Asia.
... These two stages are then linked by a data shuffle. Although these systems have proved very powerful for processing very large datasets, the map-reduce API has been criticized as inflexible [31]. Additionally, since jobs are restricted to a single map and reduce phase, tools such as FlumeJava [32] are necessary for assembling pipelines of map-reduce jobs. ...
... A variety of scientific applications have been parallelized using Hadoop such as CloudBLAST [8]. Although Hadoop exposes many convenient abstractions, it is difficult to express the application with the restrictive map-reduce API [31] and Hadoop's disk based model makes iterative/pipelined tasks expensive. ...
Article
Full-text available
Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, HPC tools are used to parallelize these analyses. In this work, we investigate an alternate approach that uses Apache Spark—a modern platform for data intensive computing—to parallelize many-task applications. We implement Kira, a flexible and distributed astronomy image processing toolkit, and its Source Extractor (Kira SE) application. Using Kira SE as a case study, we examine the programming flexibility, dataflow richness, scheduling capacity and performance of Apache Spark running on the Amazon EC2 cloud. By exploiting data locality, Kira SE achieves a 4.1× speedup over an equivalent C program when analyzing a 1 TB dataset using 512 cores on the Amazon EC2 cloud. Furthermore, Kira SE on the Amazon EC2 cloud achieves a 1.8× speedup over the C program on the NERSC Edison supercomputer. A 128-core Amazon EC2 cloud deployment of Kira SE using Spark Streaming can achieve second-scale latency with a sustained throughput of 800 MB/s. Our experience with Kira demonstrates that data intensive computing platforms like Apache Spark are a performant alternative for many-task scientific applications.
... There has been a need for special-purpose data processing tools commensurate with those problems [11,12,13]. Although MapReduce is referred to as a modern way of processing large data in data-center computing [14], it has also been called a "major step backwards" in parallel data processing in comparison with DBMSs [15,16]. We summarize the defects of the MapReduce framework below, compared with DBMSs. ...
Conference Paper
Full-text available
Cloud computing comes with a new model for supplying computing infrastructure. Big Data management has been identified as one of the most important technologies for the coming years. This paper presents a comprehensive survey of different approaches to data management applications using MapReduce. The open-source framework implementing the MapReduce algorithm is Hadoop. We simulate different design examples of MapReduce stored on the cloud. This paper proposes an application of MapReduce that runs on a huge cluster of machines within the Hadoop framework. The proposed implementation methodology is highly scalable and easy to use for non-professional users. The main objective is to improve the performance of the MapReduce data management system on the basis of the Hadoop framework. Simulation results show the effectiveness of the proposed implementation methodology for MapReduce.
... The biggest issue of MapReduce is performance [16] [17] [18] [19] [20]. Since there is no schema design or data loading before data processing, many performance enhancing techniques developed by the database community cannot be applied to MapReduce directly. ...
Conference Paper
Full-text available
Nowadays, datasets grow enormously both in size and complexity. One of the key issues confronted by large-scale dataset analysis is how to adapt systems to new, unprecedented query loads. Existing systems nail down the data organization scheme once and for all at the beginning of the system design, and thus inevitably see performance degrade when user requirements change. In this paper, we propose a new paradigm, Data Vitalization, for large-scale dataset analysis. Our goal is to enable high flexibility such that the system is adaptive to complex analytical applications. Specifically, data are organized into a group of vitalized cells, each of which is a collection of data coupled with computing power. As user requirements change over time, cells evolve spontaneously to meet potential new query loads. Besides the basic functionality of Data Vitalization, we also explore an envisioned architecture of Data Vitalization, including possible approaches for query processing and data evolution, as well as its tightly-coupled mechanism for data storage and computing.
... Shuffling the map output is a costly operation in MapReduce, even if the reduce phase is short. For that reason, when data is small and the reduction operation simple, it is better to skip the reduce phase and instead aggregate the map output directly in the storage layer [30]. Crucial makes it easy to implement this approach. ...
Article
Serverless computing greatly simplifies the use of cloud resources. In particular, Function-as-a-Service (FaaS) platforms enable programmers to develop applications as individual functions that can run and scale independently. Unfortunately, applications that require fine-grained support for mutable state and synchronization, such as machine learning (ML) and scientific computing, are notoriously hard to build with this new paradigm. In this work, we aim at bridging this gap. We present Crucial, a system to program highly-parallel stateful serverless applications. Crucial retains the simplicity of serverless computing. It is built upon the key insight that FaaS resembles concurrent programming at the scale of a datacenter. Accordingly, a distributed shared memory layer is the natural answer to the need for fine-grained state management and synchronization. Crucial allows a multi-threaded code base to be ported effortlessly to serverless, where it can benefit from the scalability and pay-per-use model of FaaS platforms. We validate Crucial with the help of micro-benchmarks and by considering various stateful applications. Beyond classical parallel tasks (e.g., a Monte Carlo simulation), these applications include representative ML algorithms such as k-means and logistic regression. Our evaluation shows that Crucial obtains superior or comparable performance to Apache Spark at similar cost (18%–40% faster). We also use Crucial to port (part of) a state-of-the-art multi-threaded ML library to serverless. The ported application is up to 30% faster than with a dedicated high-end server. Finally, we attest that Crucial can rival in performance with a single-machine, multi-threaded implementation of a complex coordination problem. Overall, Crucial delivers all these benefits with less than 6% of changes in the code bases of the evaluated applications.
... Such big data processing systems are thought to be complementary to parallel databases: Stonebraker et al. [50] consider the former to excel at complex analytics and ETL, and the latter at efficiently querying large datasets. One of the inefficiencies of MapReduce-like systems comes from the fact that they are not designed to support indexing [51]. ...
Preprint
Full-text available
Abstractions such as dataframes are only as efficient as their underlying runtime system. The de-facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction which incorporates indexing capabilities to support fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library which can be integrated with minimum effort in existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution when compared to a non-indexed dataframe, incurring modest memory overhead.
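As a rough intuition for why an index helps in this setting (a toy, generic sketch; not the Indexed DataFrame API, its Spark integration, or its MVCC machinery), pairing a row collection with a hash index turns key lookups and equi-joins from full scans into probes:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A toy "indexed table": rows are kept in insertion order, and a hash index
// maps key -> row positions so lookups and joins need not scan every row.
public class IndexedTable {
    private final List<String[]> rows = new ArrayList<>();           // row = {key, payload}
    private final Map<String, List<Integer>> index = new HashMap<>();

    public void append(String key, String payload) {
        rows.add(new String[]{key, payload});
        index.computeIfAbsent(key, k -> new ArrayList<>()).add(rows.size() - 1);
    }

    public List<String[]> lookup(String key) {                        // O(1) expected, no scan
        List<String[]> result = new ArrayList<>();
        for (int pos : index.getOrDefault(key, List.of())) result.add(rows.get(pos));
        return result;
    }

    // Indexed equi-join: probe this table's index with the other table's keys.
    public List<String> join(IndexedTable other) {
        List<String> out = new ArrayList<>();
        for (String[] row : other.rows) {
            for (String[] match : lookup(row[0])) {
                out.add(row[0] + ": " + match[1] + " / " + row[1]);
            }
        }
        return out;
    }
}
```

A production system of course also needs versioned appends and distribution across executors, which is what the paper layers on top of Spark.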
... Meanwhile, the Big Data stacks of the 2000s, including the MapReduce phenomenon that gave Stonebraker and DeWitt such heartburn [DS08], are a re-realization of the Postgres idea of user-defined code hosted in a query framework. MapReduce looks very much like a combination of software engineering ideas from Postgres combined with parallelism ideas from systems like Gamma and Teradata, with some minor innovation around mid-query restart for extreme-scalability workloads. ...
Preprint
Full-text available
This is a recollection of the UC Berkeley Postgres project, which was led by Mike Stonebraker from the mid-1980's to the mid-1990's. The article was solicited for Stonebraker's Turing Award book, as one of many personal/historical recollections. As a result it focuses on Stonebraker's design ideas and leadership. But Stonebraker was never a coder, and he stayed out of the way of his development team. The Postgres codebase was the work of a team of brilliant students and the occasional university "staff programmers" who had little more experience (and only slightly more compensation) than the students. I was lucky to join that team as a student during the latter years of the project. I got helpful input on this writeup from some of the more senior students on the project, but any errors or omissions are mine. If you spot any such, please contact me and I will try to fix them.
... With the large volume of data, on the order of exabytes, it becomes almost impractical to process the data on individual machines, no matter how powerful they are. Parallel processing of data chunks on dedicated servers, such as with the MapReduce tool proposed by Google, offers advantages over conventional processing methods; however, it is still not very effective in handling large amounts of data, mainly due to scalability, latency, availability, and inefficient programming techniques, including but not limited to database management systems [90], [91]. One attractive alternative to dedicated servers is processing in cloud centers, which offer users the ability to rent computing and storage resources in a pay-as-you-go manner [92]. ...
Article
Full-text available
The world is witnessing an unprecedented growth of cyber-physical systems (CPS), which are foreseen to revolutionize our world by creating new services and applications in a variety of sectors such as environmental monitoring, mobile-health systems, intelligent transportation systems and so on. The information and communication technology (ICT) sector is experiencing significant growth in data traffic, driven by the widespread usage of smartphones, tablets and video streaming, along with the significant growth in sensor deployments that is anticipated in the near future. This is expected to dramatically increase the growth rate of raw sensed data. In this paper, we present the CPS taxonomy by providing a broad overview of data collection, storage, access, processing and analysis. Compared with other survey papers, this is the first panoramic survey on big data for CPS, where our objective is to provide a panoramic summary of different CPS aspects. Furthermore, CPS require cybersecurity to protect them against malicious attacks and unauthorized intrusion, which becomes a challenge with the enormous amount of data that is continuously being generated in the network. Thus, we also provide an overview of the different security solutions proposed for CPS big data storage, access and analytics. We also discuss big data meeting green challenges in the context of CPS.
... Types of deferred tasks for cluster computing applications are typically resource-driven. For example, in the MapReduce framework, each application has two main phases, the map and the reduce (DeWitt & Stonebraker, 2008). ...
Article
Full-text available
Hadoop is a cloud computing open-source system used in large-scale data processing. It has become a basic computing platform for many internet companies. With the Hadoop platform, users can develop cloud computing applications and then submit tasks to the platform. Hadoop has strong fault tolerance and can easily increase the number of cluster nodes, using linear expansion of the cluster size so that clusters can process larger datasets. However, Hadoop has some shortcomings, especially those exposed in actual use of the MapReduce scheduler, which call for more research on Hadoop scheduling algorithms. This survey provides an overview of the default Hadoop scheduler algorithms and the problems they have. It also compares five Hadoop framework scheduling algorithms in terms of the default scheduler algorithm to be enhanced, the proposed scheduler algorithm, the type of cluster applied (either heterogeneous or homogeneous), the methodology, and cluster classification based on performance evaluation. Finally, a new algorithm based on capacity scheduling and the use of perspective resource utilization to enhance Hadoop scheduling is proposed.
Article
Real-world data contains various kinds of errors. Before analyzing data, one usually needs to process the raw data. However, traditional data processing based on exact matching often misses a lot of valid information. To get high-quality analysis results and fit in the big data era, this thesis studies error-tolerant big data processing. As most of the data in the real world can be represented as a sequence or a set, this thesis utilizes the widely-used sequence-based and set-based similarity functions to tolerate errors in data processing and studies the approximate entity extraction, similarity join and similarity search problems. The main contributions of this thesis include: 1. This thesis proposes a unified framework to support approximate entity extraction with both sequence-based and set-based similarity functions simultaneously. The experiments show that the unified framework can improve the state-of-the-art methods by 1 to 2 orders of magnitude. 2. This thesis designs two methods, respectively, for the sequence and the set similarity joins. For the sequence similarity join, this thesis proposes to evenly partition the sequences into segments. It is guaranteed that two sequences are similar only if one sequence has a subsequence identical to a segment of the other sequence. For the set similarity join, this thesis proposes to partition all the sets into segments based on the universe. This thesis further extends the two partition-based methods to support the large-scale data processing frameworks Map-Reduce and Spark. The partition-based method won the string similarity join competition held by EDBT and beat the second place by 10 times. 3. This thesis proposes a pivotal prefix filter technique to solve the sequence similarity search problem. This thesis shows that the pivotal prefix filter has stronger pruning power and less filtering cost compared to the state-of-the-art filters.
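The partition-based filtering idea behind the sequence similarity join (the second contribution) can be sketched as follows. This is a minimal illustration of the pigeonhole-style filter, assuming a plain substring probe rather than the thesis's actual position-aware indexes: for an edit-distance threshold τ, a string is split into τ+1 disjoint segments, and any string within distance τ must contain at least one of those segments verbatim, so substring probing yields a small candidate set that is then verified.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a partition-based candidate filter for edit-distance joins.
public class SegmentFilter {

    // Split s into tau+1 nearly even, disjoint segments.
    static List<String> segments(String s, int tau) {
        List<String> segs = new ArrayList<>();
        int n = s.length(), parts = tau + 1, start = 0;
        for (int i = 0; i < parts; i++) {
            int len = n / parts + (i < n % parts ? 1 : 0);
            segs.add(s.substring(start, start + len));
            start += len;
        }
        return segs;
    }

    // Candidate test: if ed(r, s) <= tau, r must contain some segment of s verbatim,
    // because tau edits can touch at most tau of the tau+1 segments.
    static boolean isCandidate(String r, String s, int tau) {
        for (String seg : segments(s, tau)) {
            if (!seg.isEmpty() && r.contains(seg)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isCandidate("venkatesh", "venkaeesh", 2));   // true: survives the filter
        System.out.println(isCandidate("stonebraker", "mapreduce", 2)); // false: pruned
    }
}
```

Surviving candidates still need an exact edit-distance verification; the filter only prunes pairs that cannot possibly be similar.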
... In the eyes of many programmers and researchers, MapReduce seems like a magic solution for theoretically endless scaling; however, some researchers, led by Turing Award winner Michael Stonebraker, claim that MapReduce is actually a major step backwards [15]. ...
Article
The exponential growth of data in current times and the demand to gain information and knowledge from the data present new challenges for database researchers. Known database systems and algorithms are no longer capable of effectively handling such large data sets. MapReduce is a novel programming paradigm for processing distributable problems over large-scale data using a computer cluster. In this work we explore the MapReduce paradigm from three different angles. We begin by examining a well-known problem in the field of data mining: mining closed frequent itemsets over a large dataset. By harnessing the power of MapReduce, we present a novel algorithm for mining closed frequent itemsets that outperforms existing algorithms. Next, we explore one of the fundamental implications of "Big Data": the data is not known with complete certainty. A probabilistic database is a relational database with the addendum that each tuple is associated with a probability of its existence. A natural development of MapReduce is a distributed relational database management system, where relational calculus has been reduced to a combination of map and reduce functions. We take this development a step further by proposing a query optimizer over a distributed, probabilistic database. Finally, we analyze the best-known implementation of MapReduce, called Hadoop, aiming to overcome one of its major drawbacks: it does not directly support the explicit specification of data repeatedly processed throughout different jobs. Many data-mining algorithms, such as clustering and association rules, require iterative computation: the same data are processed again and again until the computation converges or a stopping condition is satisfied. We propose a modification to Hadoop such that it will support efficient access to the same data in different jobs.
... Therefore, Map-Reduce has become an important standard for large-scale data processing in many enterprises. In addition, it is used for developing new solutions for massive datasets such as relational data analytics, web analytics, machine learning, real-time analytics, and data mining [2,3]. ...
Article
These days, Big Data represents a complex and important issue for information extraction/retrieval, since its analysis requires massive computation power. In addition, the database star schema can be considered one of the more complicated data models due to its heavy use of join queries for information extraction and report generation, which demands scanning large amounts of data (tera-, peta-, zettabytes, etc.). On the other hand, HIVE is considered one of the essential and efficient Big Data SQL-based tools built on top of Hadoop as a translator from SQL queries into Map/Reduce tasks. In addition, using data indexing techniques with join queries could improve/speed up the execution of HIVE join query tasks, especially in a star schema. In this paper, the Key/Facts indexing methodology is introduced to materialize the star schema and inject a simple index into the data. Based on this, the Key/Facts indexing methodology improved SQL query execution time in HIVE without changing the HIVE framework. The TPC-H benchmark was used to evaluate the performance of the Key/Facts methodology. Experimental results show that the Key/Facts methodology outperforms traditional HIVE join execution, and that Key/Facts performance improves as the data size increases. Generally, Key/Facts can be considered one of the suitable methodologies for Big Data analysis.
... In its early days, MapReduce provoked strong doubts from the database community [29]. Comparisons and debates between DBMSs and MapReduce have appeared in a series of articles [32], [33], [55], [70], [74]. ...
Article
The volume, variety, and velocity properties of big data and the valuable information it contains have motivated the investigation of many new parallel data processing systems in addition to the approaches using traditional database management systems (DBMSs). MapReduce pioneered this paradigm change and rapidly became the primary big data processing system thanks to its simplicity, scalability, and fine-grained fault tolerance. However, compared with DBMSs, MapReduce also arouses controversy over processing efficiency, low-level abstraction, and rigid dataflow. Inspired by MapReduce, big data systems are nowadays blooming. Some of them follow MapReduce's idea, but with more flexible models for general-purpose usage. Some absorb the advantages of DBMSs with higher abstraction. There are also specific systems for certain applications, such as machine learning and stream data processing. To explore new research opportunities and assist users in selecting suitable processing systems for specific applications, this survey paper gives a high-level overview of the existing parallel data processing systems, categorized by data input as batch processing, stream processing, graph processing, and machine learning processing, and introduces representative projects in each category. As the pioneer, the original MapReduce system, as well as its active variants and extensions on dataflow, data access, parameter tuning, communication, and energy optimizations, is discussed first. System benchmarks and open issues for big data processing are also studied in this survey.
Article
MapReduce-like frameworks have achieved tremendous success for large-scale data processing in data centers. A key feature distinguishing MapReduce from previous parallel models is that it interleaves parallel and sequential computation. Past schemes on general parallel models, and especially their theoretical bounds, are therefore unlikely to apply to MapReduce directly. There are many recent studies on MapReduce job and task scheduling. These studies assume that the servers are assigned in advance. In current data centers, multiple MapReduce jobs of different importance levels run together. In this paper, we investigate a scheduling problem for MapReduce that takes server assignment into consideration as well. We formulate a MapReduce server-job organizer problem (MSJO) and show that it is NP-complete. We develop a 3-approximation algorithm and a fast heuristic design. Moreover, we further propose a novel fine-grained practical algorithm for the general MapReduce-like task scheduling problem. Finally, we evaluate our algorithms through both simulations and experiments on Amazon EC2 with a Hadoop implementation. The results confirm the superiority of our algorithms.
Preprint
In the recent past, big data opportunities have gained much momentum to enhance knowledge management in organizations. However, big data, due to its various properties like high volume, variety, and velocity, can no longer be effectively stored and analyzed with traditional data management techniques to generate values for knowledge development. Hence, new technologies and architectures are required to store and analyze this big data through advanced data analytics and in turn generate vital real-time knowledge for effective decision making by organizations. More specifically, it is necessary to have a single infrastructure which provides common knowledge management functionality and is flexible enough to handle different types of big data and big data analysis tasks. Cloud computing infrastructures capable of storing and processing large volumes of data can be used for efficient big data processing because they minimize the initial cost of the large-scale computing infrastructure demanded by big data analytics. This paper aims to explore the impact of big data analytics on knowledge management and proposes a cloud-based conceptual framework that can analyze big data in real time to facilitate enhanced decision making intended for competitive advantage. Thus, this framework will pave the way for organizations to explore the relationship between big data analytics and knowledge management, which are mostly deemed two distinct entities.
Article
Full-text available
Data security and access control have consistently been major problems in cloud computing. In the cloud computing environment they become particularly serious issues because the data may be located anywhere in the world. Cloud computing is a technique that provides a way of sharing distributed resources and services that belong to other organizations or sites. Security problems arise when cloud computing shares distributed resources over the network in an open environment. Data security and privacy protection are the two main issues of user concern about cloud computing technology, and as cloud computing develops they become ever more important. Data security and privacy protection are becoming essential for the future development of cloud computing technology in government, industry, and business. In the cloud architecture, data security and privacy protection issues concern both hardware and software. Cloud users can access various cloud services from heterogeneous client platforms (e.g., smartphones, laptops, other computers in the same or another cloud, etc.) without knowing the exact location of the services. Additionally, the cloud service user need not know the processes used to develop, manage, or maintain the services. One of the most difficult issues of cloud service adoption is to convince users to trust the security of the cloud service and transfer their sensitive data to it. Key management is the hardest part of cryptosystems to manage. In order to manage encryption keys securely, enterprises should use encryption in their cloud setting while maintaining secure off-site storage of their encryption keys; keys should not be stored in the same place as the encrypted data. This study proposes a more decentralized, reliable, lightweight key management technique for the cloud environment which provides more efficient data security and key management. The proposed technique provides better protection against data modification attacks, server colluding, and Byzantine failures.
Chapter
Cloud computing is highly praised for its high data reliability, lower cost, and nearly unlimited storage. In cloud computing projects, the MapReduce distributed computing model is prevalent. The MapReduce model is mainly divided into the Map and Reduce functions. As the mapper, the Map function is responsible for dividing tasks (such as uploaded files) into multiple small tasks executed separately; as the reducer, the Reduce function is responsible for summarizing the processing results of the multiple tasks after decomposition. It is a scalable and fault-tolerant data processing tool that can process huge volumes of data in parallel with many low-end computing nodes. This paper implements the word-count program based on the MapReduce framework and uses different dividing methods and data sizes to test the program; the common faults faced by the MapReduce framework also emerged during the experiments. The paper proposes schemes to improve the efficiency of the MapReduce framework. Finally, building an index or using a machine learning model to alleviate data skew is proposed to improve program efficiency. The application system is recommended to be a hybrid system with different modules to process different kinds of tasks.
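For orientation, the word-count program mentioned above is conventionally written against Hadoop's MapReduce API roughly as follows. This is a standard textbook-style sketch, not the chapter's specific implementation or its skew-mitigation variants; enabling the combiner is one simple way to reduce shuffle volume.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts shuffled to each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```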
Article
Full-text available
Enterprise-scale data management and analysis systems have never referred simply to some kind of "database software". Because such systems almost always face application scenarios involving large data volumes and large-scale, complex relational queries, these problems have from the very beginning been closely tied to the capability and efficiency of the underlying computing and storage resources; designers must consider the close cooperation of software and hardware in order to solve the difficulties encountered by conventional database software and general-purpose computing architectures. This article attempts to trace the historical thread of the evolution of enterprise data management and analysis systems from the 1960s to the present, reviewing the key events around database machines, data warehouses, and analytical relational databases, in order to remember the contributions of all the pioneers, to carry on an unceasing spirit of openness and innovation, and to look forward to a bright future in the new era.
Chapter
A cyber-physical system (CPS) combines physical devices (i.e., sensors) with digital (i.e., computational) components to create a system that reacts intelligently to complex changes in real-world conditions. A fundamental aspect of CPS is the analysis and evaluation of data from large, challenging, and volatile external systems, which can only gradually be transformed into usable knowledge. AI analysis, such as clustering, is used to derive valuable information and insights from the data obtained by smart objects, on the basis of which various CPS applications can make informed choices. In this paper, a data stream clustering method that accounts for cluster size and shape, based on the Multiple Species Flocking model, is proposed for the evaluation of large-scale data produced by various activities, e.g., system monitoring, health tests, and sensor networks. In the proposed method, approximate results are available on demand at any time, which makes the method particularly well suited to real-time monitoring purposes.
Article
The continuous increase in urban population puts enormous pressure on cities' limited resources, including transport, energy, water, housing, public services, and others. Hence, the need to plan and develop smart-city solutions for enhanced urban governance is becoming more evident. These solutions are motivated by innovations in Information and Communication Technology to support smart planning for the city and to facilitate enhanced services to its citizens. Important areas where smart city services can be offered include urban planning, transport planning, energy conservation, water management, waste management, environmental monitoring, public safety, healthcare, education, entertainment, and many other services. The enormous data collected from different networks and applications to facilitate the offering of smart city services therefore requires efficient data scheduling, aggregation, and processing to ensure service quality (QoS). However, existing data scheduling approaches consider scheduling and processing data only in the cloud, whereas processing in the data-collecting devices as well is essential. This paper first introduces a multi-layer network architecture comprising sensor/device networks and the cloud. The paper then introduces a Multi-layer, Priority-based, Dynamic, and Time-sensitive data processing and Scheduling approach (MPDTS) for the proposed multi-layer networks. Simulation results show that the proposed MPDTS approach achieves lower latency and data processing time than existing traditional data scheduling approaches that work only in the cloud layer.
Chapter
Huge amounts of data which can be used for computation and predictive analysis based on trends in the data are called Big Data. In recent times, there has been a notable increase in the use of big data analytics in healthcare, with medical imaging being an important aspect of it. For the purpose of handling diverse medical image data obtained from X-rays, CT scans, MRI, etc., and in order to attain better insights for diagnosis, Big Data Analytics platforms are being leveraged to a great extent. Disease surveillance can be effectively improved using Big Data Analytics. Unstructured medical image datasets can be evaluated with great efficiency to create a better discernment of the disease and the requisite prevention and curing methodologies, hence leading to much better critical decision making. Approximately 66,000 images were contained in the medical image dataset CLEF (Cross Language Evaluation Forum) in 2007, which increased to 300,000 in 2013, with images varying greatly in dimensions, resolution and modalities. In order to handle this huge amount of image data, dedicated analytics platforms are required for analyzing these big datasets in a distributed environment. Medical images reveal information about organs and the internal functioning of the body, which is required to identify tumors, diabetic retinopathy, artery stenosis, etc. Data storage, automatic extraction, and advanced analytics of this medical image data using Big Data Analytics platforms have resulted in much faster diagnosis and prediction of treatment plans in advance. Parallel programming and cloud computation have also played a significant role in overcoming the challenges of computation over huge amounts of data. Medical image processing is based on extracting features from the images and detecting patterns in the extracted data. Various tools and frameworks are used for this purpose, such as Hadoop, MapReduce, YARN, Spark, Hive, etc. Machine learning and deep learning techniques are extensively used for carrying out the required analytics. Genetic algorithms and association rule learning techniques are considerably used for this purpose.
Article
In 2009 we explored the feasibility of building a hybrid SQL data analysis system that takes the best features from two competing technologies: large-scale data processing systems (such as Google MapReduce and Apache Hadoop) and parallel database management systems (such as Greenplum and Vertica). We built a prototype, HadoopDB, and demonstrated that it can deliver the high SQL query performance and efficiency of parallel database management systems while still providing the scalability, fault tolerance, and flexibility of large-scale data processing systems. Subsequently, HadoopDB grew into a commercial product, Hadapt, whose technology was eventually acquired by Teradata. In this paper, we provide an overview of HadoopDB's original design, and its evolution during the subsequent ten years of research and development effort. We describe how the project innovated both in the research lab, and as a commercial product at Hadapt and Teradata. We then discuss the current vibrant ecosystem of software projects (most of which are open source) that continued HadoopDB's legacy of implementing a systems level integration of large-scale data processing systems and parallel database technology.
Article
MapReduce, a parallel computational model, has been widely used in processing big data in a distributed cluster. Consisting of alternate map and reduce phases, MapReduce has to shuffle the intermediate data generated by mappers to reducers. The key challenge of ensuring a balanced workload on MapReduce is to reduce partition skew among reducers without detailed distribution information on the mapped data. In this paper, we propose an incremental data allocation approach to reduce partition skew among reducers on MapReduce. The proposed approach divides mapped data into many micro-partitions and gradually gathers statistics on their sizes in the process of mapping. The micro-partitions are then incrementally allocated to reducers in multiple rounds. We propose to execute incremental allocation in two steps, micro-partition scheduling and micro-partition allocation. We propose a Markov decision process (MDP) model to optimize the problem of multiple-round micro-partition scheduling for allocation commitment. We present an optimal solution with a time complexity of O(K · N²), in which K represents the number of allocation rounds and N represents the number of micro-partitions. Alternatively, we also present a greedy but more efficient algorithm with a time complexity of O(K · N ln N). Then, we propose a min-max programming model to handle the allocation mapping between micro-partitions and reducers, and present an effective heuristic solution due to its NP-completeness. Finally, we have implemented the proposed approach on Hadoop, an open-source MapReduce platform, and empirically evaluated its performance. Our extensive experiments show that compared with the state-of-the-art approaches, the proposed approach achieves considerably better data load balance among reducers as well as overall better parallel performance.
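As a rough sketch of size-aware allocation (a generic least-loaded greedy heuristic for intuition only; the paper's approach additionally commits micro-partitions over multiple rounds with an MDP model and a min-max program), one can assign micro-partitions, largest first, to the currently least-loaded reducer:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

// Generic greedy allocation of micro-partitions to reducers (largest partition
// to the least-loaded reducer). Illustrative only, not the paper's scheduler.
public class GreedyAllocator {

    // sizes[i] = observed size of micro-partition i; returns reducer index per partition.
    static int[] allocate(long[] sizes, int reducers) {
        Integer[] order = new Integer[sizes.length];
        for (int i = 0; i < sizes.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingLong((Integer i) -> -sizes[i])); // largest first

        // Min-heap of {current load, reducer id}.
        PriorityQueue<long[]> heap = new PriorityQueue<>(Comparator.comparingLong((long[] a) -> a[0]));
        for (int r = 0; r < reducers; r++) heap.add(new long[]{0L, r});

        int[] assignment = new int[sizes.length];
        for (int p : order) {
            long[] least = heap.poll();        // least-loaded reducer so far
            assignment[p] = (int) least[1];
            least[0] += sizes[p];
            heap.add(least);
        }
        return assignment;
    }

    public static void main(String[] args) {
        long[] sizes = {70, 50, 40, 30, 20, 10};
        // Resulting reducer loads are 80, 70, 70 (the exact partition-to-reducer
        // mapping may vary with tie-breaking).
        System.out.println(Arrays.toString(allocate(sizes, 3)));
    }
}
```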
Chapter
In a world where technology has come to dominate almost every aspect of human life, the amount of data generated each minute grows at a rapid rate. The need to analyse massive volumes of data poses new challenges for researchers and specialists around the world. The MapReduce model became a center of interest due to offering an execution model that allows parallelization of tasks. Machine learning has also gained significantly more attention due to the many applications where it can be used. In this paper, we discuss the challenges of Big Data analysis and provide an overview of the MapReduce model. We conducted experiments to examine the performance of MapReduce on the example of a Random Forest algorithm, to determine its effect on the overall quality of the analysis. The paper ends with remarks on the strengths and pitfalls of using MapReduce as well as ideas on improving its potential.
Conference Paper
Works in the field of data warehousing (DW) do not address Stream Processing (SP) integration as a means to provide result freshness (i.e., results that include information that is not yet stored in the DW) and at the same time to relax the DW processing load. Previous research works focus mainly on parallelization, for instance: adding more hardware resources; parallelizing operators, queries, and storage. A well-known and well-studied approach is to use Map-Reduce to scale horizontally in order to achieve more storage and processing performance. In many contexts, high-rate data needs to be processed in small time windows without storing results (e.g., for near real-time monitoring); in other cases, the objective is to relax data warehouse usage (e.g., keeping results updated for web-page reloads). In both cases, stream processing solutions can be set to work together with the data warehouse (Map-Reduce or not) to keep results available on the fly, avoiding high query execution times and thereby leaving the DW servers more available to process other heavy tasks (e.g., data mining).
Chapter
Apache Hadoop is a software framework that allows distributed processing of large datasets across clusters of computers using simple programming constructs/models. It is designed to scale up from a single server to thousands of nodes. It is designed to detect failures at the application level rather than rely on hardware for high availability, thereby delivering a highly available service on top of a cluster of commodity hardware nodes, each of which is prone to failures [2]. While Hadoop can be run on a single machine, the true power of Hadoop is realized in its ability to scale up to thousands of computers, each with several processor cores. It also distributes large amounts of work across the clusters efficiently [1].
Conference Paper
When analysing data, the user may often want to perform a join between the input data sources. At first glance, in the Map-Reduce programming model, the developer is limited to equi-joins, as they can be easily implemented using the grouping operation. However, some techniques have been developed to support joins with non-equality conditions. In this paper, we propose an enhancement to cross-join-based algorithms, such as Strict-Even Join, by handling the equality and non-equality conditions separately.
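The idea of treating the two condition types separately can be shown with a toy in-memory sketch (illustrative only; not the paper's Strict-Even Join enhancement or its Map-Reduce implementation): the equality attribute drives the grouping, exactly as in an equi-join, while the non-equality predicate is evaluated inside each group, as a reducer would do.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy join of R(key, a) and S(key, b) under the condition R.key = S.key AND R.a < S.b.
// The equality part drives the grouping (as in a Map-Reduce equi-join);
// the non-equality part is checked inside each group (as a reducer would).
public class MixedConditionJoin {

    record Row(String key, int value) {}

    static List<int[]> join(List<Row> r, List<Row> s) {
        Map<String, List<Row>> sByKey = new HashMap<>();
        for (Row row : s) sByKey.computeIfAbsent(row.key(), k -> new ArrayList<>()).add(row);

        List<int[]> out = new ArrayList<>();
        for (Row left : r) {
            for (Row right : sByKey.getOrDefault(left.key(), List.of())) {
                if (left.value() < right.value()) {            // non-equality condition
                    out.add(new int[]{left.value(), right.value()});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> r = List.of(new Row("x", 1), new Row("x", 5), new Row("y", 2));
        List<Row> s = List.of(new Row("x", 3), new Row("y", 1));
        join(r, s).forEach(p -> System.out.println(p[0] + " < " + p[1])); // prints "1 < 3"
    }
}
```

Compared with a plain cross join followed by filtering, restricting the non-equality check to groups that already satisfy the equality condition keeps the candidate space much smaller.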
Conference Paper
In the era of Big Data, how to analyze such vast quantities of data is a challenging problem, and conducting a multi-way theta-join query is one of the most time-consuming operations. MapReduce has been mentioned most often in the massive data processing area, and some join algorithms based on it have been proposed in recent years. However, the MapReduce paradigm itself may not suit some scenarios, and multi-way theta-join seems to be one of them. Many multi-way theta-join algorithms for traditional parallel databases have been proposed over the years, but none has been described for the CMD (coordinate modulo distribution) storage method, although some equi-join algorithms have been proposed. In this paper, we propose a multi-way theta-join method based on CMD, which takes advantage of the CMD storage method. Experiments suggest that it is a valid and efficient method which achieves significant improvement compared to methods based on MapReduce.
Conference Paper
The so-called "big data" is increasingly present in several modern applications, in which massive parallel processing is the main approach to achieving acceptable performance. However, as the size of data is ever increasing, even parallelism will meet its limits unless it is combined with other powerful processing techniques. In this paper we propose to combine parallelism with rewriting, that is, reusing previous results stored in a cache in order to perform new (parallel) computations. To do this, we introduce an abstract framework based on the lattice of partitions of the data set. Our basic contributions are: (a) showing that our framework allows rewriting of parallel computations, (b) deriving the basic principles of optimal cache management, and (c) showing that, in the case of structured data, our approach can leverage both structure and semantics in the data to improve performance.