Conference Paper

Evaluating Performance of Distributed Systems With MapReduce and Network Traffic Analysis

Abstract

Testing, monitoring and evaluating distributed systems at runtime is a difficult effort, due to the dynamicity of the environment, the large amount of data exchanged between nodes, and the difficulty of reproducing an error for debugging. Application traffic analysis is one method for evaluating distributed systems, but the ability to analyze large amounts of data is a challenge. This paper proposes and evaluates the use of the MapReduce programming model for deep packet inspection of the application traffic of distributed systems, assessing the effectiveness and processing capacity of MapReduce when performing deep packet inspection of a JXTA-based distributed storage application, in order to measure performance indicators.
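As a rough illustration of the map/reduce decomposition this approach relies on, the sketch below simulates a job that extracts per-message-type indicators from captured traffic records. The record format, the indicator names, and the local single-process runner are assumptions for illustration only, not the paper's actual Hadoop-based implementation over JXTA traffic.

```python
from collections import defaultdict

def map_packet(record):
    # Hypothetical record shape: (timestamp, peer_id, message_type, payload_size).
    ts, peer, msg_type, size = record
    # Emit one key/value pair per indicator of interest.
    yield (("bytes", msg_type), size)
    yield (("messages", msg_type), 1)

def reduce_indicator(key, values):
    # Aggregate all values emitted under one indicator key.
    return key, sum(values)

def run_job(records):
    # Local stand-in for the shuffle/sort phase: group values by key,
    # then apply the reducer to each group.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_packet(record):
            groups[key].append(value)
    return dict(reduce_indicator(k, v) for k, v in groups.items())

traffic = [
    (0.0, "peerA", "query", 120),
    (0.1, "peerB", "query", 80),
    (0.2, "peerA", "response", 900),
]
stats = run_job(traffic)
# stats[("bytes", "query")] == 200, stats[("messages", "query")] == 2
```

In the real setting the map tasks run in parallel over HDFS blocks of the capture files, which is where the processing-capacity gains measured in the paper come from.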
... With the growth of link speeds and of Internet application traffic, novel approaches are needed to provide processing capacity for profiling applications through network traffic analysis [2]. In this context, MapReduce is a good option for offline analysis of massive volumes of network traces [10,18]. ...
... MapReduce can be used for offline evaluation of distributed applications through packet-level analysis [10], evaluating packets individually, and through DPI [19,18], where a whole block is processed without division into packets in order to reassemble application messages and protocols. The kind of workload submitted to MapReduce impacts its performance [16,6], requiring specific configuration to obtain optimal performance. ...
... In previous studies [19,18] we proposed an architecture based on MapReduce to perform DPI, and we evaluated its effectiveness at extracting indicators from a JXTA-based application. This paper evaluates MapReduce for packet-level analysis and DPI, characterizing the behavior of its phases, its scalability, and its speed-up under variation of input size, block size, and cluster size. ...
Conference Paper
Full-text available
The use of MapReduce for distributed data processing has been growing, with benefits for a variety of workloads. MapReduce can be used for distributed traffic analysis, although network traces present characteristics unlike the data types commonly processed with MapReduce. Motivated by the use of MapReduce for profiling application traffic, and given the lack of evaluation of MapReduce for network traffic analysis and the peculiarity of this kind of data, this paper evaluates the performance of MapReduce for packet-level analysis and DPI, analysing its scalability, speed-up, and the behavior of MapReduce phases. The experiments provide evidence for the predominant phases in this kind of job, and show the impact of input size, block size, and number of nodes on MapReduce completion time and scalability.
... The RIPE library does not consider the parallelism available when reading packet records from a distributed filesystem, which leads to performance degradation. Vieira et al. 3,4 developed a parallel packet-processing system only for JXTA-based applications. Jethoe 5 proposed a method for detecting DDoS attacks using Hadoop. ...
Article
Full-text available
Information processing is currently one of the most vital tasks. With the growth and development of information and telecommunication technologies, the volume of data transmitted over the Internet has increased. Along with the processing of large amounts of information comes the question of its protection. This paper proposes a distributed cloud-computing framework based on the Big Data approach, where both storage and computing resources can be scaled out to collect and process traffic from a large-scale network in a reasonable time.
... MapReduce [Dean and Ghemawat 2008] has become an important programming model and platform for distributed processing of massive amounts of data, used in many industrial and academic applications [Guo et al. 2012]. MapReduce can evaluate the traffic of distributed applications by collecting traffic at points of a datacenter and processing it offline, using packet-level analysis [Lee et al. 2011], which evaluates each packet individually to extract network- or transport-layer information, and/or DPI [Vieira et al. 2012b, Vieira et al. 2012a], which evaluates each block without division in order to reassemble two or more packets and obtain application-layer information. ...
Conference Paper
Full-text available
Diagnosing distributed systems is a complex task. Network traffic analysis can be used to evaluate these systems, but mechanisms are needed to deal with a growing amount of network traffic. MapReduce can be used for distributed traffic analysis, although network traces present characteristics unlike the data types commonly processed with MapReduce. Given the lack of evaluation of MapReduce for traffic analysis, this paper evaluates the performance of MapReduce for packet-level analysis and deep packet inspection, characterizing its scalability, speedup, and the behavior of its phases. The experiments show the predominant phases in this kind of job, and the impact of block, input, and cluster size on job completion time and scalability.
Thesis
Full-text available
Distributed systems have been adopted for building modern Internet services and cloud computing infrastructures, in order to obtain services with high performance, scalability, and reliability. Cloud computing SLAs require a short time to identify, diagnose, and solve problems in a cloud computing production infrastructure, in order to avoid negative impacts on the quality of service provided to its clients. Thus, the detection of error causes, and the diagnosis and reproduction of errors, are challenges that motivate efforts toward the development of less intrusive mechanisms for monitoring and debugging distributed applications at runtime. Network traffic analysis is one option for distributed systems measurement, although there are limitations in the capacity to process large amounts of network traffic in a short time, and in the scalability to process network traffic where resource demand varies. The goal of this dissertation is to analyse the processing-capacity problem of measuring distributed systems through network traffic analysis, in order to evaluate the performance of distributed systems at a data center, using commodity hardware and cloud computing services, in a minimally intrusive way. We propose a new approach based on MapReduce for deep inspection of distributed application traffic, in order to evaluate the performance of distributed systems at runtime using commodity hardware. In this dissertation we evaluated the effectiveness of MapReduce for a deep packet inspection algorithm, its processing capacity, completion time speedup, processing capacity scalability, and the behavior of MapReduce phases when applied to deep packet inspection for extracting indicators of distributed applications.
Conference Paper
Internet traffic has continued to grow at a spectacular rate over the past ten years. Understanding and managing network traffic have become important issues for network operators seeking to meet service-level agreements with their customers. In addition, the emergence of high-speed networks, such as 20 Gbps and 40 Gbps Ethernet and beyond, requires fast analysis of large volumes of network traffic, which is beyond the capabilities of a single machine. Distributed parallel processing schemes have recently been developed to analyze large quantities of traffic data. However, scalable Internet traffic analysis in real time is difficult because a large dataset requires high processing intensity. In this paper, we describe a real-time Deep Packet Inspection (DPI) system based on the MapReduce programming model. We combine a stand-alone classification engine (L7-filter) with the distributed MapReduce programming model. Our experimental results show that the MapReduce programming paradigm is a useful approach for building highly scalable real-time network traffic processing systems. We generate 20 Gbps of network traffic to validate the real-time analysis ability of the proposed system.
Article
Full-text available
The area of Internet traffic measurement has advanced enormously over the last couple of years. This was mostly due to the increase in network access speeds, the appearance of bandwidth-hungry applications, the ISPs' increased interest in precise user traffic profile information, and the enormous growth in the number of connected users. These changes greatly affected the work of Internet service providers and network administrators, who have to deal with increasing resource demands and abrupt traffic changes brought by new applications. This survey explains the main techniques and problems known in the field of IP traffic analysis and focuses on application detection. First, it separates traffic analysis into packet-based and flow-based categories and details the advantages and problems of each approach. Second, this work cites the techniques for traffic analysis available in the literature, along with the analysis performed by the authors. Relevant techniques include signature matching, sampling, and inference. Third, this work shows the trends in application classification analysis and presents important and recent references on the subject. Lastly, this survey draws the readers' attention to open research topics in the area of traffic analysis and application detection and makes some final remarks.
Conference Paper
Full-text available
Internet service providers (ISPs) have recently been relying on deep packet inspection (DPI) systems, which are the most accurate techniques for traffic identification and classification. However, building high-performance DPI systems requires an in-depth and careful computing system design due to the memory and processing power demands. DPI's accuracy mostly depends on string matching and regular-expression heuristics that go deep into packet payloads in search of networked application signatures. As ISPs' backbone link speeds and data volumes soar, commodity hardware-based DPI systems start to face performance bottlenecks (e.g., packet losses), which interferes dramatically with traffic classification accuracy. In this paper we propose a lightweight DPI (LW-DPI) system that overcomes the performance bottlenecks of traditional DPI systems without a significant decrease in accuracy. We evaluate LW-DPI's accuracy by inspecting two factors: a limited number of full-payload packets in a given flow, or a fraction of the packet payload. Our experiments were performed using more than 6 TB of packet-level data from a large ISP and show interesting trade-offs between these factors and accuracy. Most flows can be classified with only their first 7 packets or a fraction of their payload. We also show that the impact on DPI's processing time may decrease by around 75% compared to analyzing all full-payload packets in a flow.
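The partial-payload trade-off evaluated above can be sketched as follows. The signature patterns and the `classify` helper are hypothetical simplifications in the spirit of L7-filter-style rules, not the LW-DPI system's actual code.

```python
import re

# Illustrative signatures only; real DPI rule sets are far more elaborate.
SIGNATURES = {
    "http": re.compile(rb"^(GET|POST|HEAD) "),
    "ssh":  re.compile(rb"^SSH-\d\.\d"),
}

def classify(payload, max_bytes=64):
    # Inspect only a prefix of the payload, trading a little accuracy
    # for much lower processing cost, as in the evaluation above.
    prefix = payload[:max_bytes]
    for app, pattern in SIGNATURES.items():
        if pattern.search(prefix):
            return app
    return "unknown"

classify(b"GET /index.html HTTP/1.1\r\n")  # "http"
classify(b"SSH-2.0-OpenSSH_8.9\r\n")       # "ssh"
```

Shrinking `max_bytes` (or inspecting only the first few packets of a flow) is the knob behind the roughly 75% processing-time reduction reported above.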
Conference Paper
Full-text available
Internet traffic measurement and analysis have usually been performed on a high-performance server that collects and examines packet or flow traces. However, when we monitor a large volume of traffic data for detailed statistics, over a long period, or across a large-scale network, it is not easy to handle tera- or petabyte traffic data with a single server. Common ways to reduce a large volume of continuously monitored traffic data are packet sampling or flow aggregation, which result in coarse traffic statistics. As distributed parallel processing schemes have recently been developed on top of the cloud computing platform and the cluster filesystem, they can usefully be applied to analyzing big traffic data. Thus, in this paper, we propose an Internet flow analysis method based on the MapReduce software framework of the cloud computing platform for a large-scale network. From experiments with an open-source MapReduce system, Hadoop, we have verified that the MapReduce-based flow analysis method improves the flow statistics computation time by 72% compared with the popular flow data processing tool, flow-tools, on a single host. In addition, we showed that MapReduce-based programs complete the flow analysis job even in the presence of a single node failure.
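The core of such a flow-statistics job is keying each packet by its flow identifier and aggregating per key. A minimal sketch, using an in-memory aggregation in place of the actual Hadoop job, with an assumed dictionary record shape:

```python
from collections import Counter

def flow_key(pkt):
    # The classic 5-tuple that identifies a flow; in MapReduce terms
    # this is the key emitted by the map function.
    return (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"], pkt["proto"])

def aggregate_flows(packets):
    # The reduce side: sum bytes per flow key.
    bytes_per_flow = Counter()
    for pkt in packets:
        bytes_per_flow[flow_key(pkt)] += pkt["len"]
    return bytes_per_flow

pkts = [
    {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 1234, "dport": 80, "proto": 6, "len": 60},
    {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 1234, "dport": 80, "proto": 6, "len": 1500},
]
flows = aggregate_flows(pkts)
# one flow, 1560 bytes total
```

Because the per-flow sums are associative, the aggregation parallelizes cleanly across map tasks, which is what makes the MapReduce formulation above attractive for this workload.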
Conference Paper
Full-text available
Internet traffic measurement and analysis has become a significantly challenging job because large packet trace files captured on fast links cannot be easily handled on a single server with limited computing and memory resources. Hadoop is a popular open-source cloud computing platform that provides the MapReduce software programming framework and the distributed filesystem HDFS, which are useful for analyzing large data sets. Therefore, in this paper, we present a Hadoop-based packet processing tool that provides scalability for large data sets by harnessing MapReduce and HDFS. To tackle large packet trace files in Hadoop efficiently, we devised a new binary input format, called PcapInputFormat, hiding the complexity of processing binary-formatted packet data and parsing each packet record. We also designed efficient traffic analysis MapReduce job models consisting of map and reduce functions. To evaluate our tool, we compared its computation time with a well-known packet-processing tool, CoralReef, and showed that our approach is better suited to processing a large set of packet data.
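The per-record parsing that a binary input format such as PcapInputFormat must hide can be sketched with the standard libpcap file layout: a 24-byte global header followed by 16-byte record headers. This is a simplified little-endian-only reader for illustration, not the tool's actual code.

```python
import struct

# libpcap savefile layout: magic, version major/minor, thiszone,
# sigfigs, snaplen, network (24 bytes), then per-packet records.
PCAP_GLOBAL_HDR = struct.Struct("<IHHiIII")   # 24 bytes
PCAP_RECORD_HDR = struct.Struct("<IIII")      # 16 bytes: ts_sec, ts_usec, incl_len, orig_len

def iter_packets(data):
    magic = PCAP_GLOBAL_HDR.unpack_from(data, 0)[0]
    assert magic == 0xA1B2C3D4, "only little-endian microsecond pcap handled here"
    offset = PCAP_GLOBAL_HDR.size
    while offset + PCAP_RECORD_HDR.size <= len(data):
        ts_sec, ts_usec, incl_len, orig_len = PCAP_RECORD_HDR.unpack_from(data, offset)
        offset += PCAP_RECORD_HDR.size
        # Yield (timestamp, raw packet bytes) for each captured record.
        yield ts_sec + ts_usec / 1e6, data[offset:offset + incl_len]
        offset += incl_len

# Synthetic one-packet capture for demonstration.
capture = (
    PCAP_GLOBAL_HDR.pack(0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
    + PCAP_RECORD_HDR.pack(1, 500000, 4, 4)
    + b"abcd"
)
packets = list(iter_packets(capture))
# packets == [(1.5, b"abcd")]
```

The hard part PcapInputFormat additionally solves, not shown here, is locating valid record boundaries when a trace is split into arbitrary HDFS blocks, since records are variable-length and carry no sync marker.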
Conference Paper
Full-text available
Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The Amazon.com platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems. This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
Article
Project JXTA is an open-source effort to formulate and implement the core peer-to-peer (p2p) networking and collaboration protocols. A JXTA peer network is a complex overlay, constructed on top of the physical network, with its own identification scheme and routing. This paper reviews the performance of JXTA networks using benchmarking, based on the proposed performance model. The two major versions of the JXTA protocol implementations (versions 1.0 and 2.0) are surveyed and discussed.
Conference Paper
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points. The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients. In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.