Book

Hadoop: The Definitive Guide

Authors:
Tom White
... HDFS is a file system that runs on clusters of commodity hardware and is designed to store very large files using a streaming data access model [6]. HDFS is highly fault tolerant and is designed to run on low-cost hardware [7]. It also offers high-throughput data access, making it excellent for applications with huge data collections [8]. ...
... The DataNodes are in charge of serving the file system clients' read and write requests. On the NameNode's instructions, the DataNodes also create, delete, and replicate blocks [7]. When a data block is first written to a DataNode by a client, the NameNode assigns the block a unique block ID and a list of DataNodes to host copies of that block. ...
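This allocation protocol stays hidden behind the client API: a program simply opens an output stream, and the NameNode/DataNode negotiation happens underneath. A minimal sketch in Java, assuming a reachable cluster configured via fs.defaultFS and the standard Hadoop client libraries (the path and address here are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumes fs.defaultFS points at the cluster's NameNode,
        // e.g. hdfs://namenode:8020 (hypothetical address).
        try (FileSystem fs = FileSystem.get(conf);
             // create() contacts the NameNode, which assigns block IDs
             // and a pipeline of DataNodes behind the scenes.
             FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"))) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```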
Article
Full-text available
Hadoop Distributed File System (HDFS) is a file system designed to store, analyze, and reliably move enormous datasets to client applications. Data replication is used to handle fault tolerance, with every data block being copied and stored on multiple DataNodes; the HDFS thereby promotes availability and reliability. The current Hadoop implementation of HDFS performs replication in a pipelined fashion, which takes much time. A replication approach is proposed in this study as an alternative methodology for efficient replica placement. The basic idea of this procedure is that the client has two DataNodes write the block in parallel, each storing the packet.
... The master machine uses the job tracker to manage and monitor the progress of both Map and Reduce tasks. On the other side, the slaves employ task trackers to run all MapReduce tasks [18]. The following sections illustrate the Hadoop distributed file system and MapReduce. ...
Article
Full-text available
Random forest is a machine learning algorithm mainly built as a classification method that makes predictions based on decision trees. Many machine learning approaches have used random forest to perform deep analysis of different cancer diseases to understand their complex characteristics and behaviour. However, due to the massive and complex data generated by such diseases, it has become difficult to run random forest on a single machine. Therefore, advanced tools are highly required to run random forest to analyse such massive data. In this paper, the random forest algorithm using Apache Mahout and Hadoop-based software defined networking (SDN) is used to conduct prediction and analysis on large lung cancer datasets. Several experiments are conducted to evaluate the proposed system, using nine virtual nodes. Experiments show that the implementation of the random forest algorithm in the proposed work outperforms its implementation in a traditional environment with regard to execution time. A comparison between the proposed system using Hadoop-based SDN and Hadoop only is performed; results show that random forest using Hadoop-based SDN has less execution time than when using Hadoop only. Furthermore, experiments reveal that the implemented system achieved more efficiency regarding execution time, accuracy and reliability.
... In addition, it can recover from failed runs and manage different types of data. Moreover, it facilitates a shared environment, allowing the execution of multiple jobs simultaneously [15]. ...
Chapter
Full-text available
In this paper, we present the implementation of a big data environment to extract, filter, and classify data to use tools that anticipate its growth and that allow scaling resources in order for results to be used as a basis for future analysis or, as in this case study, for the generation of synthetic data. Specifically, we focus on the information available in a learning management system, concretely Moodle, where numerous interactions occur between users and the learning platform. All actions of users, mainly students, are recorded and can be used to reproduce behaviors. Then, we create a request simulator to emulate the behavioral pattern of students, using data from a database taken from an operative learning management system for training. Results were quite similar concerning student behavior, as reflected in the statistics on the operational learning management system database and the data synthetically generated.
... It consists of one NameNode and several DataNodes. The NameNode acts as the master node, while the DataNodes act as slaves to the master [16]. Also, the NameNode maps input data splits to DataNodes and maintains the metadata, while the DataNodes store the application data. ...
Article
Full-text available
Hadoop is a framework for storing and processing huge volumes of data on clusters. It uses Hadoop Distributed File System (HDFS) for storing data and uses MapReduce to process that data. MapReduce is a parallel computing framework for processing large amounts of data on clusters. Scheduling is one of the most critical aspects of MapReduce. Scheduling in MapReduce is critical because it can have a significant impact on the performance and efficiency of the overall system. The goal of scheduling is to improve performance, minimize response times, and utilize resources efficiently. A systematic study of the existing scheduling algorithms is provided in this paper. Also, we provide a new classification of such schedulers and a review of each category. In addition, scheduling algorithms have been examined in terms of their main ideas, main objectives, advantages, and disadvantages.
... For this data point, the received data set is assigned to the cluster of the closest centroid. Fitness evaluation: for evaluating fitness, we compute the Davies-Bouldin [7] index of each individual. ...
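To make the assignment step concrete, here is a minimal plain-Java sketch of mapping a point to its closest centroid; the Davies-Bouldin computation and the MapReduce wiring around it are omitted, and all names are illustrative:

```java
/** Maps a data point to the index of its closest centroid. */
final class AssignmentSketch {
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0.0;
            for (int i = 0; i < point.length; i++) {
                double diff = point[i] - centroids[c][i];
                d += diff * diff;   // squared Euclidean distance suffices for comparisons
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }
}
```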
Conference Paper
Full-text available
The clustering of Bigdata is a common task in data mining and machine learning. The goal is to group similar data points to identify patterns and relationships in the data. However, clustering large datasets can be computationally expensive and time-consuming. This is where Hadoop MapReduce comes in. Hadoop is a sophisticated framework that facilitates the distributed processing of voluminous datasets across multiple clusters of computers. MapReduce is a programming model that simplifies the processing of large datasets by breaking them down into smaller chunks and processing them in parallel across the cluster. One approach to clustering Bigdata using Hadoop MapReduce is to use a genetic algorithm. A genetic algorithm is an optimization technique that is inspired by the process of natural selection. It works by iteratively generating and evaluating candidate solutions and using the best solutions as a basis for generating the next generation of candidates. This paper introduces a technique to parallelize GA-based clustering by extending Hadoop MapReduce. An analysis of the proposed approach to evaluate performance gains to a sequential algorithm is presented. The analysis is predicated upon a substantial real-world dataset.
... The theoretical knowledge gained served as a foundation for designing and conducting the experiments. This work aims not only to broaden the understanding of the relationship between the theoretical aspects of ETL processes and their practical applications, but also to provide concrete data on the performance of SQL and HiveQL in various contexts [2,8,9,12,13]. ...
Article
Full-text available
In the era of digitization, where data is collected in ever-increasing quantities, efficient processing is required. The article analyzes the performance of SQL and HiveQL, for scenarios of varying complexity, focusing on the execution time of individual queries. The tools used in the study are also discussed. The results of the study for each language are summarized and compared, highlighting their strengths and weaknesses, as well as identifying their possible areas of application.
... The third replica is placed on the same rack as the second, but on a different node, chosen at random. Further replicas are placed on random nodes across the cluster, where the system tries to avoid placing too many replicas on the same rack [9]. ...
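A schematic rendering of this default placement policy might look like the sketch below. This is a simplification under stated assumptions (at least three replicas, a candidate node always available): the real Hadoop BlockPlacementPolicyDefault also weighs node load, free space, and per-rack replica limits, and the Node type here is purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

/** Illustrative node with a name and a rack id. */
class Node {
    final String name, rack;
    Node(String name, String rack) { this.name = name; this.rack = rack; }
}

final class PlacementSketch {
    static final Random RND = new Random();

    /** Picks a random cluster node satisfying a constraint (assumes one exists). */
    static Node pickRandom(List<Node> cluster, Predicate<Node> ok) {
        List<Node> candidates = new ArrayList<>();
        for (Node n : cluster) if (ok.test(n)) candidates.add(n);
        return candidates.get(RND.nextInt(candidates.size()));
    }

    static List<Node> chooseTargets(Node writer, List<Node> cluster, int replicas) {
        List<Node> targets = new ArrayList<>();
        targets.add(writer);                                   // 1st replica: writer's node
        Node second = pickRandom(cluster, n -> !n.rack.equals(writer.rack));
        targets.add(second);                                   // 2nd replica: another rack
        targets.add(pickRandom(cluster,                        // 3rd replica: same rack as
                n -> n.rack.equals(second.rack) && n != second));  // 2nd, different node
        while (targets.size() < replicas)                      // extras: random nodes
            targets.add(pickRandom(cluster, n -> !targets.contains(n)));
        return targets;
    }
}
```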
Article
Full-text available
The Hadoop Distributed File System (HDFS) is designed to store, analyze, and transfer large-scale data sets, and to stream them at high bandwidth to user applications. It handles fault tolerance by using data replication, where each data block is replicated and stored on multiple DataNodes; therefore, the HDFS supports reliability and availability. The data replication of the HDFS in Hadoop is implemented in a pipelined manner, which takes much time for replication. Other approaches have been proposed to improve the performance of data replication in the Hadoop HDFS. This paper provides a comprehensive and theoretical analysis of three existing HDFS replication approaches: the default pipeline approach, the parallel (Broadcast) approach and the parallel (Master/Slave) approach. The study describes the technical specification, features, and specialization of each approach along with its applications. A comparative study has been performed to evaluate the performance of these approaches using the TestDFSIO benchmark. According to the experimental results, it is found that the performance (i.e., the execution time and throughput) of the parallel (Broadcast) and parallel (Master/Slave) replication approaches outperforms the default pipelined replication. Also, it is noticed that the throughput decreases with increasing file size in all three approaches.
... Examples of high-performance systems are Teradata, HP Vertica, IBM Netezza, and Oracle Exadata (Moniruzzaman & Hossain, 2013; Ţăranu, 2015). An example of a distributed system is Hadoop (HDFS is the Hadoop Distributed File System), considered the most widely used platform for distributed data processing and storage (Sawant & Shah, 2013; Ţăranu, 2015; White, 2015); examples of platform distributions that integrate Hadoop are Hortonworks, Cloudera, AWS, Microsoft Azure, etc. ...
Book
Full-text available
Data analysis is a complex process that seeks to find useful patterns and relationships among data in order to obtain information about a specific problem and thereby make sound decisions for its solution. The data analysis techniques explored in this book are currently used in various sectors of the economy. Initially, they were employed by large companies in order to increase their financial returns. The book is based on the application of smart specialization; thus, through collaborative work, the agricultural sector is combined with technology, mathematics, statistics and computer science to optimize production processes.
... The first type aims to extend the big data frameworks to take advantage of TEEs, which will also have to handle side-channel attacks. VC3 [37] applied this strategy for modifying the Hadoop [46] system. M2R [15] targets the problem of access-pattern leakage in the shuffling phase of VC3 and proposes to use oblivious schemes for shuffling. ...
Preprint
Full-text available
Trusted Execution Environments (TEEs) are gradually adopted by major cloud providers, offering a practical option of confidential computing for users who don't fully trust public clouds. TEEs use CPU-enabled hardware features to eliminate direct breaches from compromised operating systems or hypervisors. However, recent studies have shown that side-channel attacks are still effective on TEEs. An appealing solution is to convert applications to be data oblivious to deter many side-channel attacks. While a few research prototypes on TEEs have adopted specific data oblivious operations, the general conversion approaches have never been thoroughly compared against and tested on benchmark TEE applications. These limitations make it difficult for researchers and practitioners to choose and adopt a suitable data oblivious approach for their applications. To address these issues, we conduct a comprehensive analysis of several representative conversion approaches and implement benchmark TEE applications with them. We also perform an extensive empirical study to provide insights into their performance and ease of use.
... Even though there are other open source MapReduce implementations, they are not as complete, often lacking some component of the full platform (e.g., a storage solution). Hadoop is currently a top-level project of the Apache Software Foundation, a non-profit corporation that supports a number of other well-known projects such as the Apache HTTP Server (White 2012). ...
... To deal with the sheer size of these modern embedding datasets, the typical approach is to implement algorithms in massively parallel computation systems such as MapReduce [DG04,DG08], Spark [ZCF + 10], Hadoop [Whi12], Dryad [IBY + 07] and others. The Massively Parallel Computation (MPC) model [KSV10, GSZ11, BKS17, ANOY14] is a computational model for these systems that balances accurate modeling with theoretical elegance. ...
Preprint
Full-text available
We study the classic Euclidean Minimum Spanning Tree (MST) problem in the Massively Parallel Computation (MPC) model. Given a set $X \subset \mathbb{R}^d$ of $n$ points, the goal is to produce a spanning tree for $X$ with weight within a small factor of optimal. Euclidean MST is one of the most fundamental hierarchical geometric clustering algorithms, and with the proliferation of enormous high-dimensional data sets, such as massive transformer-based embeddings, there is now a critical demand for efficient distributed algorithms to cluster such data sets. In low-dimensional space, where $d = O(1)$, Andoni, Nikolov, Onak, and Yaroslavtsev [STOC '14] gave a constant round MPC algorithm that obtains a high accuracy $(1+\epsilon)$-approximate solution. However, the situation is much more challenging for high-dimensional spaces: the best-known algorithm to obtain a constant approximation requires $O(\log n)$ rounds. Recently Chen, Jayaram, Levi, and Waingarten [STOC '22] gave a $\tilde{O}(\log n)$ approximation algorithm in a constant number of rounds based on embeddings into tree metrics. However, to date, no known algorithm achieves both a constant number of rounds and approximation. In this paper, we make strong progress on this front by giving a constant factor approximation in $\tilde{O}(\log \log n)$ rounds of the MPC model. In contrast to tree-embedding-based approaches, which necessarily must pay $\Omega(\log n)$-distortion, our algorithm is based on a new combination of graph-based distributed MST algorithms and geometric space partitions. Additionally, although the approximate MST we return can have a large depth, we show that it can be modified to obtain a $\tilde{O}(\log \log n)$-round constant factor approximation to the Euclidean Traveling Salesman Problem (TSP) in the MPC model. Previously, only a $O(\log n)$ round was known for the problem.
... High-level approaches to big data analytics such as Hadoop MapReduce [26] or Apache Spark [1] are often inspired by bulk synchronous parallelism (BSP) [25], a model of scalable parallel computing. In this context, scalable means that the number of processors of the parallel machines running BSP programs could range from a few to several tens of thousands of cores or more. ...
Preprint
BSML is a pure functional library for the multi-paradigm language OCaml. BSML embodies the principles of the Bulk Synchronous Parallel (BSP) model, a model of scalable parallel computing. We propose a formalization of BSML primitives with WhyML, the specification language of Why3 and specify and prove the correctness of most of the BSML standard library. Finally, we develop and verify the correctness of a small BSML application.
... Traditionally in MapReduce, when no heartbeat from a task tracker reaches the job tracker for a set interval, the job tracker concludes that the machine allocated to that task tracker has stopped. In such a case, the job tracker is responsible for rescheduling, on a new task tracker, the tasks that were in progress along with those that were about to start, since the information held by the failed task tracker is no longer available [49]. Spark, by contrast, in line with its lineage-based recovery, reconstructs the RDDs to redo the work. ...
Article
Full-text available
Clustering divides a set of objects into several classes, where each class is composed of similar objects. Traditional centralized clustering algorithms target objects located on the same site, since they cannot operate on distributed objects. Distributed clustering algorithms, however, can fill this gap: they extract a classification model from distributed objects even when these reside on different sites and locations. With the trend of storing data in different locations and sites, and with the vast amount of data propagating throughout the web, it seems this will be one of the prevailing fields. Even though much research and work have been done on this topic, it is still considered in its infancy because of the challenges that keep popping up, such as bandwidth limitation, transferring data to a single site, and many others. In this work, we present DG-means, a greedy algorithm that operates on distributed data sets. Three datasets (the wholesale dataset, the banknotes dataset, and the Iris dataset) are used to compare multiple distributed clustering algorithms on different metrics: runtime execution, stability, and accuracy. DG-means exhibited superior performance when compared to the other algorithms.
... In distributed computing frameworks like Apache Spark, compressed files can be read using input readers that support standard compression formats and can transparently read and decompress compressed datasets on the fly [37]. A word of caution is warranted here about non-splittable compression formats. ...
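The caveat stems from the codecs themselves: a gzip stream cannot be decompressed from an arbitrary offset, so a gzipped file cannot be split across tasks, whereas a block-oriented codec such as bzip2 can. A hedged Java sketch illustrating the difference, assuming a local Spark 3.x installation and hypothetical file paths:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class CompressionSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("compression-demo")
                .master("local[*]")
                .getOrCreate();

        // Decompressed on the fly, but gzip is not splittable:
        // the whole file is read by a single task / partition.
        Dataset<String> gz = spark.read().textFile("reads.fastq.gz");   // hypothetical path
        System.out.println("gzip partitions:  " + gz.javaRDD().getNumPartitions());

        // bzip2 is splittable, so the same data can be read in parallel.
        Dataset<String> bz = spark.read().textFile("reads.fastq.bz2");  // hypothetical path
        System.out.println("bzip2 partitions: " + bz.javaRDD().getNumPartitions());

        spark.stop();
    }
}
```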
Article
Full-text available
Some scientific studies involve huge amounts of bioinformatics data that cannot be analyzed on personal computers usually employed by researchers for day-to-day activities but rather necessitate effective computational infrastructures that can work in a distributed way. For this purpose, distributed computing systems have become useful tools to analyze large amounts of bioinformatics data and to generate relevant results on virtual environments, where software can be executed for hours or even days without affecting the personal computer or laptop of a researcher. Even if distributed computing resources have become pivotal in multiple bioinformatics laboratories, often researchers and students use them in the wrong ways, making mistakes that can cause the distributed computers to underperform or that can even generate wrong outcomes. In this context, we present here ten quick tips for the usage of Apache Spark distributed computing systems for bioinformatics analyses: ten simple guidelines that, if taken into account, can help users avoid common mistakes and can help them run their bioinformatics analyses smoothly. Even if we designed our recommendations for beginners and students, they should be followed by experts too. We think our quick tips can help anyone make use of Apache Spark distributed computing systems more efficiently and ultimately help generate better, more reliable scientific results.
... Since this computation can become expensive as the quantity of code to analyse grows, the second tool can be executed in a distributed manner using Hadoop, an Apache framework that provides the MapReduce model and supports distributed systems accessing big data [8], [14], [18]. Of course, distribution depends on the underlying infrastructure and configuration, for which further support may be needed [3], [4], [12], [13]. ...
Article
Full-text available
We have realised a Java source code analyser that first extracts code from a repository and then analyses the code to gain some knowledge. In particular, for each method of each class, the internal variables used are automatically determined. For Java classes, we find method bodies and variable declarations by using regular expressions. The tool has been developed by means of Java and MrJob. Python with the MapReduce model has been used to perform the data analysis in a distributed manner.
... The basic idea behind MapReduce is to divide the processing into two phases: the Map phase, which transforms the input data, and the Reduce phase, which aggregates the intermediate data produced by the Map phase. The Hadoop framework [9] is one of the most popular ways to implement MapReduce, and it provides a robust and scalable infrastructure for running MapReduce jobs. ...
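The canonical illustration of these two phases is word counting. The sketch below follows the standard Hadoop MapReduce (new API) structure, with the driver boilerplate omitted for brevity:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Map phase: transform each input line into (word, 1) pairs. */
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            ctx.write(word, ONE);   // emit an intermediate (word, 1) pair
        }
    }
}

/** Reduce phase: aggregate the intermediate counts for each word. */
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();   // sum all the 1s for this word
        ctx.write(word, new IntWritable(sum));
    }
}
```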
Article
Full-text available
Distributed Systems are widely used in industrial projects and scientific research. The Apache Hadoop environment, which works on the MapReduce paradigm, lost popularity because new, modern tools were developed. For example, Apache Spark is preferred in some cases since it uses RAM resources to hold intermediate calculations; therefore, it works faster and is easier to use. In order to take full advantage of it, users must think about the MapReduce concept. In this paper, a usual solution and MapReduce solution of ten problems were compared by their pseudocodes and categorized into five groups. According to these groups’ descriptions and pseudocodes, readers can get a concept of MapReduce without taking specific courses. This paper proposes a five-category classification methodology to help distributed-system users learn the MapReduce paradigm fast. The proposed methodology is illustrated with ten tasks. Furthermore, statistical analysis is carried out to test if the proposed classification methodology affects learner performance. The results of this study indicate that the proposed model outperforms the traditional approach with statistical significance, as evidenced by a p-value of less than 0.05. The policy implication is that educational institutions and organizations could adopt the proposed classification methodology to help learners and employees acquire the necessary knowledge and skills to use distributed systems effectively.
... We study the problem of exactly recovering communities of a graph from the SBM in the massively parallel computation (MPC) model [26,23,6], which is a mathematical abstraction of modern frameworks of real-world parallel computing systems like MapReduce [20], Hadoop [30], Spark [31] and Dryad [25]. In this model, there are M machines that communicate in synchronous rounds, where the local memory of each machine is limited to s words, each of O(log n) bits. ...
Preprint
Learning the community structure of a large-scale graph is a fundamental problem in machine learning, computer science and statistics. We study the problem of exactly recovering the communities in a graph generated from the Stochastic Block Model (SBM) in the Massively Parallel Computation (MPC) model. Specifically, given $kn$ vertices that are partitioned into $k$ equal-sized clusters (i.e., each has size $n$), a graph on these $kn$ vertices is randomly generated such that each pair of vertices is connected with probability~$p$ if they are in the same cluster and with probability $q$ if not, where $p > q > 0$. We give MPC algorithms for the SBM in the (very general) \emph{$s$-space MPC model}, where each machine has memory $s=\Omega(\log n)$. Under the condition that $\frac{p-q}{\sqrt{p}}\geq \tilde{\Omega}(k^{\frac12}n^{-\frac12+\frac{1}{2(r-1)}})$ for any integer $r\in [3,O(\log n)]$, our first algorithm exactly recovers all the $k$ clusters in $O(kr\log_s n)$ rounds using $\tilde{O}(m)$ total space, or in $O(r\log_s n)$ rounds using $\tilde{O}(km)$ total space. If $\frac{p-q}{\sqrt{p}}\geq \tilde{\Omega}(k^{\frac34}n^{-\frac14})$, our second algorithm achieves $O(\log_s n)$ rounds and $\tilde{O}(m)$ total space complexity. Both algorithms significantly improve upon a recent result of Cohen-Addad et al. [PODC'22], who gave algorithms that only work in the \emph{sublinear space MPC model}, where each machine has local memory~$s=O(n^{\delta})$ for some constant $\delta>0$, with a much stronger condition on $p,q,k$. Our algorithms are based on collecting the $r$-step neighborhood of each vertex and comparing the difference of some statistical information generated from the local neighborhoods for each pair of vertices. To implement the clustering algorithms in parallel, we present efficient approaches for implementing some basic graph operations in the $s$-space MPC model.
... Data processing systems implemented the dataflow model using two orthogonal execution strategies. Systems such as Hadoop [16] and Apache Spark [2] dynamically schedule operator instances over the nodes of the compute infrastructure. Communication between operators occurs by saving intermediate results on some shared storage, with operators deployed as close as possible to the input data they consume. ...
Preprint
Full-text available
Today, data analysis drives the decision-making process in virtually every human activity. This demands software platforms that offer simple programming abstractions to express data analysis tasks and that can execute them in an efficient and scalable way. State-of-the-art solutions range from low-level programming primitives, which give the developer control over communication and resource usage but require significant effort to develop and optimize new algorithms, to high-level platforms that hide most of the complexities of parallel and distributed processing, but often at the cost of reduced efficiency. To reconcile these requirements, we developed Noir, a novel distributed data processing platform written in Rust. Noir provides a high-level dataflow programming model like mainstream data processing systems. It supports static and streaming data, and it enables data transformations, grouping, aggregation, iterative computations, and time-based analytics, all while incurring low overhead. This paper presents the programming model and the implementation details of Noir. We evaluate it under heterogeneous workloads. We compare it with state-of-the-art solutions for data analysis and high-performance computing, as well as alternative research products, which offer different programming abstractions and implementation strategies. Noir programs are compact and easy to write: developers need not care about low-level concerns such as resource usage, data serialization, concurrency control, and communication. Noir consistently presents comparable or better performance than competing solutions, by a large margin in several scenarios. We conclude that Noir offers a good tradeoff between simplicity and performance, allowing developers to easily express complex data analysis tasks and achieve high performance and scalability.
... Minimizing the average computation time of jobs in big data has received serious effort from researchers. The computation time of a job in the MapReduce computation engine can be minimized in several ways, such as proper scheduling of (map/reduce) tasks, effective placement of data, and tuning of the job parameters. MapReduce as a computation paradigm is itself divided into sub-phases, where some sub-phases are dependent on others and some are independent [1]. The key issue in MapReduce computation is to minimize the job completion time [2]. ...
Chapter
Full-text available
The evolution and advancement of Information and Communication Technologies (ICT) have enabled large-scale distributed computing with a huge range of applications for a massive number of users. This has generated large volumes of data, severely burdening the processing capacity of computers as well as inflexible traditional networks. State-of-the-art methods for datacenter-level performance fixes are still found wanting in sufficiently addressing the processing, storage, and network movement of this voluminous data with proprietary protocols. In this chapter, the work focuses on addressing backend server performance through effective reducer placement, an intelligent compression policy, and handling of slower tasks, and on in-network performance boosting through effective traffic engineering, traffic classification, topology discovery, energy minimization and load balancing in datacenter-oriented applications. Hadoop, the de facto standard in distributed big data storage and processing, has been designed to store data reliably with its Hadoop Distributed File System (HDFS) and to process large datasets with its MapReduce engine. However, the processing performance of Hadoop is critically dependent on the time taken to transfer data during the shuffle generated by MapReduce. Also, during concurrent execution of tasks, slower tasks need to be properly identified and efficiently handled to improve job completion time. To overcome these limitations, three contributions have been made: (i) compression of generated map outputs at a suitable time, when all the map tasks are yet to be completed, to shift load from the network onto the CPU; (ii) placing the reducer onto the nodes where the most computation has been done, based on a pair of counters (one maintained at rack level and another at node level), to minimize run-time data copying; and (iii) placing the slower map tasks onto the nodes where the most computation has been done, with the network handled by prioritizing. Software defined networking (SDN) has been a boon for next-generation networking owing to the separation of the control plane from the data plane. It has the capability to address network requirements in a timely manner by setting flows for every to-and-fro data movement and gathering extensive network statistics at the controller to make informed decisions about the network. A core issue for the controller is traffic classification, which can substantially assist SDN controllers in efficient routing and traffic engineering decisions. This chapter presents a traffic classification scheme utilizing three classifiers, namely Feed-forward Neural Network (FFNN), Logistic Regression (LR) and Naïve Bayes, and employing Particle Swarm Optimization (PSO) for improved traffic classification with less overhead and without overlooking the key Quality of Service (QoS) criteria. Lowering energy consumption and link utilization is also important for lowering the operating cost of the network and utilizing it effectively. This issue has been addressed by formulating a multi-objective problem that simultaneously addresses the QoS constraints; since no polynomial solution exists, an evolutionary metaheuristic (Clonal Selection) based energy optimization scheme, namely Clonal Selection Based Energy Minimization (CSEM), has been devised.
The obtained results show the efficacy of the proposed traffic classification scheme and the CSEM-based solution compared with state-of-the-art techniques. SDN is a promising newer network paradigm, but security issues and the expensive capital procurement of SDN limit its full deployment; hence moving to a hybrid SDN (h-SDN) deployment is the only logical way forward. The use of both centralized and decentralized paradigms in h-SDN, with intrinsic interoperability issues, poses challenges to topology gathering by the controller for proper allocation of network resources and to traffic engineering for optimum network performance. State-of-the-art protocols for topology gathering, such as the Link Layer Discovery Protocol (LLDP) and the Broadcast Domain Discovery Protocol (BDDP), require a huge number of messages, and such schemes only gather link information of SDN devices, leaving out legacy switches' (LS) links, which results in sub-optimal performance. This chapter provides novel schemes for topology discovery that require fewer messages and gather link information of all devices in both single- and multi-controller environments (the latter used when scalability issues are prevalent in h-SDN). Traffic engineering problems in h-SDN are addressed by proper placement of SDN nodes, analyzing the key criteria of traffic details and node degree while lowering link utilization in real topologies. The results of the proposed schemes for topology discovery and SDN node placement demonstrate their merits compared with state-of-the-art protocols.
Keywords: Big data, SDN, Hybrid SDN, Link discovery, Traffic classification, Traffic engineering and energy minimization
... Novel parallel computing models, such as Google's MapReduce (Dean and Ghemawat 2008), have been proposed in recent years for a new large-data infrastructure. Apache launched Hadoop (White 2015), an open-source MapReduce software framework for distributed data management. Concurrent data access to clustered servers is supported via the Hadoop Distributed File System (HDFS). ...
Article
Full-text available
The healthcare industry is different from other industries: patient data are sensitive, their storage needs to be handled with care and in compliance with regulations, and prediction accuracy needs to be high. The fast expansion in medical image modalities and data collection leads to the generation of so-called "Big Data", which is time-consuming for medical experts to analyze. This paper provides an insight into Big Data from the aspect of its role in multiscale modelling. Special attention is paid to the workflow, starting from medical image processing all the way to the creation of personalized models and their analysis. A review of the literature regarding Big Data in healthcare is provided and two proposed solutions are described: carotid artery ultrasound image processing and 3D reconstruction, and drug testing on personalized heart models. For the carotid artery ultrasound image processing, the starting point is ultrasound images, which are segmented using the convolutional neural network U-net, while the segmented masks are further used in 3D reconstruction of the geometry. For the drug testing on a personalized heart model, a similar approach was proposed: images were used in the creation of a personalized 3D geometrical model that is used in computational modelling to determine the pressure in the left ventricle before and after drug testing. All the aforementioned methodologies are complex, include Big Data analysis and should be performed using servers or high-performance computing. Future development of Big Data applications in healthcare domains offers a lot of potential due to new data standards, rapid development of research and technology, as well as strong government incentives.
... The continuous growth of the datasets being produced by modern applications, which is known in the IT field as the Big Data phenomena, led both the industry and academia to work on innovative ways to extract valuable information from such large volumes of data. Many general-purpose systems were developed, such as Hadoop/MapReduce [7,30], Spark [34] and others. Although very successful, it was soon noticed that the programming abstractions provided by such systems are not efficient to perform analytics on large datasets modeled as graphs [19]. ...
Article
Full-text available
Much of the data being produced in large scale by modern applications represents connected entities and their relationships, which can be modeled as large graphs. In order to extract valuable information from these large datasets, several parallel and distributed graph processing engines have been proposed. These systems are designed to run in large clusters, where resources must be allocated efficiently. Aiming to handle this problem, this paper presents a performance prediction model for GPS, a popular Pregel-based graph processing framework. By leveraging a micro-partitioning technique, our system can use various partitioning algorithms that greatly reduce the execution time, compared with the simple hash partitioning that is commonly used in graph processing systems. Experimental results show that the prediction model has accuracy close to 90%, allowing it to be used in schedulers or to estimate the cost of running graph processing tasks.
... Moreover, another limitation of Hadoop is that it has to re-read the dataset after executing each job, thus incurring large disk input/output, while Spark directly passes the data along without writing to persistent storage [16,17]. In addition, the implementation of Spark [3] enables faster operations than Hadoop [18], since it allows in-memory computations (see the complete comparison in [19], where the Hadoop and Spark frameworks are compared on several Machine Learning algorithms). ...
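The in-memory advantage shows up whenever an intermediate dataset is reused, as iterative mining algorithms do. A minimal hedged sketch in Java (hypothetical input path, local master): both passes below reuse the cached RDD instead of re-reading disk, which is exactly where Spark gains over Hadoop's job-per-pass model.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CacheSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cache-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // transactions.txt is a hypothetical input with one transaction per line.
            JavaRDD<String> transactions = sc.textFile("transactions.txt").cache();

            long total = transactions.count();                               // pass 1
            long nonEmpty = transactions.filter(l -> !l.isEmpty()).count();  // pass 2
            System.out.println(nonEmpty + " / " + total + " non-empty transactions");
        }
    }
}
```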
Article
Full-text available
The large amount of data generated every day makes necessary the re-implementation of new methods capable of handling massive data efficiently. This is the case for Association Rules, an unsupervised data mining tool capable of extracting information in the form of IF-THEN patterns. Although several methods have been proposed for the extraction of frequent itemsets (the phase preceding the mining of association rules) in very large databases, the high computational cost and lack of memory remain major problems to be solved when processing large data. Therefore, the aim of this paper is threefold: (1) to review existing algorithms for frequent itemset and association rule mining, (2) to develop new efficient frequent itemset Big Data algorithms using distributed computation, as well as a new association rule mining algorithm in Spark, and (3) to compare the proposed algorithms with the existing proposals, varying the number of transactions and the number of items. For this purpose, we have used the Spark platform, which has been demonstrated to outperform existing distributed algorithmic implementations.
... Another reason to rethink the architecture of existing data lakes is the recent advancements in modern data processing systems [7,27,24,26,25]. The notion of unification in the modern big data system has resulted in developing a single entry point (tool or API) for different operations in a pipeline that were performed previously on multiple frameworks [27]. ...
Chapter
Analytics of Big Data in the absence of an accompanying framework of metadata can be quite a daunting task. While it is true that statistical algorithms can do large-scale analyses on diverse data with little support from metadata, using such methods on widely dispersed, extremely diverse, and dynamic data may not necessarily produce trustworthy findings. One such task is identifying the impact of indicators for various Sustainable Development Goals (SDGs). One of the methods to analyze impact is by developing a Bayesian network for the policymaker to make informed decisions under uncertainty. It is of key interest to policy-makers worldwide to rely on such models to decide the new policies of a state or a country (https://sdgs.un.org/2030agenda). The accuracy of the models can be improved by considering enriched data, often done by incorporating pertinent data from multiple sources. However, due to the challenges associated with the volume, variety, veracity, and structure of the data, traditional data lake systems fall short of identifying information that is syntactically diverse yet semantically connected. In this paper, we propose a Data Lake (DL) framework that targets ingesting and processing of data like any traditional DL and, in addition, is capable of performing data retrieval for applications such as Policy Support Systems (where the selection of data greatly affects the output interpretations) by using ontologies as the intermediary. We discuss the proof of concept for the proposed system and the preliminary results (IIITB Data Lake project website: http://cads.iiitb.ac.in/wordpress/) based on the data collected from the agriculture department of the Government of Karnataka (GoK).
Keywords: Big data, Ontology, Document retrieval, Data lake, Data analyses, Policy support system, Bayesian network
... As the amount of data produced every day is huge and keeps increasing, the need for efficient solutions to process huge volumes of data has risen [1]. These solutions solve the problem by parallelizing the work over different machines that belong to a cluster. ...
Preprint
Full-text available
In the big data era, the key feature that each algorithm needs to have is the possibility of efficiently running in parallel in a distributed environment. The popular Silhouette metric to evaluate the quality of a clustering, unfortunately, does not have this property and has a quadratic computational complexity with respect to the size of the input dataset. For this reason, its execution has been hindered in big data scenarios, where clustering had to be evaluated otherwise. To fill this gap, in this paper we introduce the first algorithm that computes the Silhouette metric with linear complexity and can easily execute in parallel in a distributed environment. Its implementation is freely available in the Apache Spark ML library.
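For reference, the silhouette of a point i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other points of its own cluster and b(i) is the smallest mean distance from i to the points of any other cluster. The naive computation below is the O(n^2) baseline that the paper's linear-time distributed algorithm avoids; it is an illustrative plain-Java sketch, not the Apache Spark ML implementation, and it omits the singleton-cluster convention.

```java
/** Naive O(n^2) mean silhouette baseline; illustrative only.
 *  points[i] is a feature vector, labels[i] its cluster id in [0, k). */
final class SilhouetteSketch {
    static double meanSilhouette(double[][] points, int[] labels, int k) {
        int n = points.length;
        double total = 0.0;
        for (int i = 0; i < n; i++) {
            double[] sumDist = new double[k];   // summed distance from i to each cluster
            int[] count = new int[k];           // members of each cluster (excluding i)
            for (int j = 0; j < n; j++) {
                if (i == j) continue;
                double d = 0.0;
                for (int t = 0; t < points[i].length; t++) {
                    double diff = points[i][t] - points[j][t];
                    d += diff * diff;
                }
                sumDist[labels[j]] += Math.sqrt(d);
                count[labels[j]]++;
            }
            // a(i): mean distance within i's own cluster.
            double a = sumDist[labels[i]] / Math.max(1, count[labels[i]]);
            // b(i): smallest mean distance to any other cluster.
            double b = Double.MAX_VALUE;
            for (int c = 0; c < k; c++)
                if (c != labels[i] && count[c] > 0)
                    b = Math.min(b, sumDist[c] / count[c]);
            total += (b - a) / Math.max(a, b);
        }
        return total / n;
    }
}
```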
... They chose fields like timestamp, number of bytes transferred, date, and browser version for further processing (Savitha and Vijaya 2014). Users are also recognized solely by their IP addresses, and sessions are determined using a time-oriented heuristic, i.e. session time (White 2012; Huang et al. 2013). Their updated technique builds fewer large sessions than a referrer-oriented algorithm and can handle enormous datasets. ...
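The time-oriented heuristic mentioned here is simple: requests from the same IP belong to one session until the gap between consecutive timestamps exceeds a threshold, commonly 30 minutes. A self-contained sketch under that assumption (names illustrative; the MapReduce wiring around it is omitted):

```java
import java.util.ArrayList;
import java.util.List;

/** Time-oriented sessionization for one user's (one IP's) request log. */
final class SessionSketch {
    /** Splits chronologically sorted timestamps (milliseconds) into sessions. */
    static List<List<Long>> sessionize(List<Long> sortedTimestamps, long gapMs) {
        List<List<Long>> sessions = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long previous = Long.MIN_VALUE;
        for (long ts : sortedTimestamps) {
            if (!current.isEmpty() && ts - previous > gapMs) {
                sessions.add(current);           // gap exceeded: close the session
                current = new ArrayList<>();
            }
            current.add(ts);
            previous = ts;
        }
        if (!current.isEmpty()) sessions.add(current);
        return sessions;
    }
    // Usage with the common 30-minute threshold:
    //   sessionize(timestampsForIp, 30L * 60 * 1000);
}
```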
Article
Full-text available
Data preparation is a vital step in the web usage mining process since it provides structured data for the subsequent stages. Hence, it is necessary to convert raw server logs into user sessions to generate structured data for the pattern discovery phase. In the recent decade, popular websites' server log production has risen to many terabytes or even petabytes each day. As a result, server logs pose big data issues such as storage and processing. This study focuses on the initial phases of the web usage mining process: data cleaning, user identification, and session identification. These phases are classified as data-intensive processes and deemed computation-intensive. In the last decade, MapReduce has emerged as one of the best parallel programming frameworks for data-intensive applications. An efficient MapReduce-based data pre-processing algorithm, i.e. IRPDP_HT2, is proposed in this study. Previous parallel data pre-processing algorithms either cover only partial phases or lack efficient robot detection approaches. The IRPDP_HT2 algorithm uses a variety of efficient heuristics in all three phases of data pre-processing to identify both ethical and unethical robots. The suggested IRPDP_HT2 approach is found to be effective and scalable for larger datasets after various experiments on a cluster of nodes. The effectiveness of the suggested heuristics is also examined during the session identification phase. Three variants of IRPDP_HT2, namely PDP_HT2, IPDP_HT2, and RPDP_HT2, are also developed and tested. The impact of robots' requests and internal dummy connections' requests on the session count of the IRPDP_HT2 algorithm is 45.81%, which is more than in the PDP_HT2, IPDP_HT2, and RPDP_HT2 algorithms. Further, speed-up and size-up are also analysed to demonstrate the scalability of the algorithm. In the presence of larger datasets, the algorithm's running time falls as the number of data nodes grows. The size-up of IRPDP_HT2 demonstrates that even after doubling the input data, the algorithm's running time does not grow in that ratio for a fixed number of data nodes.
Article
Full-text available
As software systems become increasingly large and complex, automated parameter tuning of software systems (PTSS) has been the focus of research and many tuning algorithms have been proposed recently. However, due to the lack of a unified platform for comparing and reproducing existing tuning algorithms, it remains a significant challenge for a user to choose an appropriate algorithm for a given software system. There are multiple reasons for this challenge, including diverse experimental conditions, lack of evaluations for different tasks, and excessive evaluation costs of tuning algorithms. In this paper, we propose an extensible and efficient benchmark, referred to as PTSSBench, which provides a unified platform for supporting a comparative study of different tuning algorithms via surrogate models and actual systems. We demonstrate the usability and efficiency of PTSSBench through comparative experiments of six state-of-the-art tuning algorithms from a holistic perspective and a task-oriented perspective. The experimental results show the necessity and effectiveness of parameter tuning for software systems and indicate that the PTSS problem remains an open problem. Moreover, PTSSBench allows extensive runs and in-depth analyses of parameter tuning algorithms, hence providing an efficient and effective way for researchers to develop new tuning algorithms and for users to choose appropriate tuning algorithms for their systems. The proposed PTSSBench benchmark together with the experimental results is made publicly available online as an open-source project.
Chapter
The research work is focused on examining the role of artificial intelligence (AI) in addressing challenges associated with money laundering in the banking sector. Money laundering is a global issue that threatens financial stability and international security, making anti-money laundering research crucial. Furthermore, just 0.2% of money laundered through the financial system is estimated to be seized. The crime is growing increasingly sophisticated and intricate, and its growing scale increases banks' vulnerability. Researchers have begun to investigate the possibility of artificial intelligence approaches in this setting. However, a thorough assessment has identified a systematic knowledge deficit: no study systematically examines and synthesizes artificial intelligence techniques for anti-money laundering efforts in the banking industry. Therefore, this chapter is focused on a systematic review of key technologies categorized into artificial intelligence or machine learning (AI or ML), natural language processing (NLP), robotic process automation (RPA), and cloud-based solutions. Various challenges concerning these techniques, such as data quality, the nature of money laundering, data volume, and data heterogeneity, are also discussed. As a result, the findings add to the total knowledge base in anti-money laundering from the banking sector's perspective. Additionally, future study directions were narrowed further based on the limitations discovered.
Chapter
The term “Big Data” is used to designate very large data sets that are more varied and have more complex structures. Its characteristics are usually related to other challenges such as data storage, processing, analysis, or security. A Health Information System (HIS) refers to a system designed to manage healthcare data. This includes systems that collect, store, manage, and transmit data related to patients or to the operational management of a hospital, clinic, testing laboratory, or any healthcare facility. Today, the health sector produces a significant amount of data. This information is voluminous and is stored in many different forms and types, and it has become difficult to secure with traditional techniques and methods. The objective of this work is to propose a health data security process in a Big Data environment. The idea is to take advantage of the performance of Apache Hadoop and its components to create a secure environment for health data. The proposed solution is based on four layers of security: the first component to be implemented in Apache Hadoop is Kerberos, the second element is Apache Ranger, then Knox, and finally Hadoop Distributed File System (HDFS) encryption.
Chapter
BSML is a pure functional library for the multi-paradigm language OCaml. BSML embodies the principles of the Bulk Synchronous Parallel (BSP) model, a model of scalable parallel computing. We propose a formalization of BSML primitives with WhyML, the specification language of Why3 and specify and prove the correctness of most of the BSML standard library. Finally, we develop and verify the correctness of a small BSML application.
Article
Full-text available
Due to social media, internet websites, and cellular networks, the world is undergoing a digital avalanche. Extensive information marks this pattern, emerging quickly and in many forms. Big data analytics filters large amounts of unprocessed data to provide more manageable data that helps parties make intelligent decisions. This research demonstrates how large geographical datasets are essential to numerous cutting-edge wireless communication technologies. We also argue that geospatial and spatio-temporal concerns matter differently in massive datasets than interpersonal issues. We present three significant geospatial information use cases with distinct architectural and analytical challenges. Next, using MapReduce computing, we present our research on developing highly available multi-processing systems for geographical information on Hadoop. Our results show that Hadoop allows for highly extensible spatial data analysis methodologies. However, designing such applications requires specialized skills, stressing the need for simpler alternatives.
Conference Paper
Full-text available
Hadoop is an open source software framework that dramatically simplifies writing distributed data-intensive applications. It provides a distributed file system, modeled after the Google File System, and a map/reduce implementation that manages distributed computation. Job scheduling is an important process in Hadoop MapReduce. Hadoop comes with three types of schedulers, namely the FIFO, Fair and Capacity schedulers. Schedulers are now a pluggable component in the Hadoop MapReduce framework. When jobs depend on an external service like a database or web service, overloading may lead to task failures. In this scenario, Hadoop needs to rerun the tasks in other slots. To address this issue, Task Tracker aware scheduling has been introduced. This scheduler enables users to configure a maximum load per Task Tracker in the job configuration itself. The algorithm will not allow a task to run and fail if the load of the Task Tracker has reached its threshold for the job. This scheduler also allows users to select the Task Trackers per job in the job configuration.
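The scheduler's core decision described here reduces to a per-tracker guard. A hedged sketch of that check follows; the types and method names are purely illustrative stand-ins, not the actual Hadoop scheduler API:

```java
/** Illustrative stand-ins; NOT the actual Hadoop scheduler API. */
interface TrackerStatus {
    int runningTasksForJob(String jobId);   // this job's tasks currently on the tracker
}

interface JobSettings {
    String id();
    int maxLoadPerTracker();                // user-configured ceiling from the job config
}

final class TaskTrackerAwareCheck {
    /** Refuses the slot when the tracker already runs this job's configured maximum,
     *  so the task is placed elsewhere instead of running and failing under load. */
    static boolean canAssign(TrackerStatus tracker, JobSettings job) {
        return tracker.runningTasksForJob(job.id()) < job.maxLoadPerTracker();
    }
}
```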
Chapter
Since its introduction in the early 2000s, the Dynamic Data-Driven Applications Systems (DDDAS) paradigm has served as a powerful concept for continuously improving the quality of both models and data embedded in complex dynamical systems. The DDDAS unifying concept enables capabilities to integrate multiple sources and scales of data, mathematical and statistical algorithms, advanced software infrastructures, and diverse applications into a dynamic feedback loop. DDDAS has not only motivated notable scientific and engineering advances on multiple fronts, but it has been also invigorated by the latest technological achievements in artificial intelligence, cloud computing, augmented reality, robotics, edge computing, Internet of Things (IoT), and Big Data. Capabilities to handle more data in a much faster and smarter fashion is paving the road for expanding automation capabilities. The purpose of this chapter is to review the fundamental components that have shaped reservoir-simulation-based optimization in the context of DDDAS. The foundations of each component will be systematically reviewed, followed by a discussion on current and future trends oriented to highlight the outstanding challenges and opportunities of reservoir management problems under the DDDAS paradigm. Moreover, this chapter should be viewed as providing pathways for establishing a synergy between renewable energy and oil and gas industry with the advent of the DDDAS method.
Article
Twitter is an online social networking site which contains a rich amount of data that can be structured, semi-structured or un-structured. In this work, a method which performs classification of tweet sentiment in Twitter is discussed. To improve scalability and efficiency, it is proposed to implement the work on the Hadoop ecosystem, a widely adopted distributed processing platform using the MapReduce parallel processing paradigm. Finally, extensive experiments will be conducted on real-world data sets, with the expectation of achieving comparable or greater accuracy than the techniques proposed in the literature.
Article
Full-text available
The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo!.
Preprint
We show the first conditionally optimal deterministic algorithm for $3$-coloring forests in the low-space massively parallel computation (MPC) model. Our algorithm runs in $O(\log \log n)$ rounds and uses optimal global space. The best previous algorithm requires $4$ colors [Ghaffari, Grunau, Jin, DISC'20] and is randomized, while our algorithm is inherently deterministic. Our main technical contribution is an $O(\log \log n)$-round algorithm to compute a partition of the forest into $O(\log n)$ ordered layers such that every node has at most two neighbors in the same or higher layers. Similar decompositions are often used in the area and we believe that this result is of independent interest. Our results also immediately yield conditionally optimal deterministic algorithms for maximal independent set and maximal matching for forests, matching the state of the art [Giliberti, Fischer, Grunau, SPAA'23]. In contrast to their solution, our algorithms are not based on derandomization, and are arguably simpler.
Article
Big Data and cloud computing integration has become a formidable strategy for businesses to unlock the potential of enormous and complicated data sets. With the scalability, flexibility, and cost-effectiveness that this combination provides, businesses are able to handle and analyse massive amounts of data in a distributed, as-needed way. But there are also issues and restrictions that need to be resolved with this integration. This overview of the literature focuses on the issues, difficulties, and potential applications of big data and cloud computing. It offers information on the advantages of this integration, such as improved data processing capabilities, increased scalability, and cost reduction. The difficulties with data migration, security, privacy, data governance, talent needs, vendor lock-in, and compliance are all discussed. Future research areas are also highlighted, such as enhanced analytics methods, edge computing integration, privacy-preserving data analysis, hybrid cloud architectures, data governance, real-time decision-making, and applications specialised to particular industries. Organisations can maximise the potential of Big Data with cloud computing and generate priceless insights to fuel innovation and informed decision-making by comprehending, tackling, and investigating these difficulties.
Article
Full-text available
With recent advancements in computational biology, high-throughput Next-Generation Sequencing (NGS) has become a de facto standard technology for gene expression studies, including DNAs, RNAs, and proteins; however, it generates several million sequences in a single run. Moreover, the raw sequencing datasets are increasing exponentially, doubling in size every 18 months, leading to a big data issue in computational biology. Meanwhile, inflammatory illnesses and boosting immune function have recently attracted a lot of attention, yet accurate recognition of Anti-Inflammatory Peptides (AIPs), which serve as therapeutic agents for inflammatory-related diseases, is time-consuming through biological processes. Similarly, precise classification of these AIPs is challenging for traditional technology and conventional machine learning algorithms. Parallel and distributed computing models and deep neural networks have become major computing platforms for the big data analytics now required in computational biology. This study proposes an efficient high-throughput anti-inflammatory peptide predictor based on a parallel deep neural network model. The model's performance is extensively evaluated with respect to measurement parameters such as accuracy, efficiency, scalability, and speedup in sequential and distributed environments. The encoded sequence data were balanced using the SMOTETomek approach, resulting in high accuracy. The parallel deep neural network demonstrated high speedup and scalability compared to traditional classification algorithms. The study's outcome could promote a parallel-based model for predicting anti-inflammatory peptides.
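A minimal sketch of the balancing-then-training step mentioned above, using the imbalanced-learn implementation of SMOTETomek followed by a small feed-forward network. The synthetic data and the scikit-learn MLP are stand-ins for the peptide features and the parallel deep model; this is an illustration, not the authors' implementation.

```python
# SMOTETomek resampling followed by a small neural network, as a stand-in
# for the paper's balanced parallel deep model. Synthetic data only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from imblearn.combine import SMOTETomek

# Imbalanced toy dataset (90% / 10%) standing in for peptide features.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Balance the training split only; the test split stays untouched.
X_bal, y_bal = SMOTETomek(random_state=0).fit_resample(X_tr, y_tr)

clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
clf.fit(X_bal, y_bal)
print("test accuracy:", clf.score(X_te, y_te))
```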
Chapter
Today's world is a digital world in which most activities have moved online. As the digital world enhances, advances, and evolves every day, new and unique innovations and developments are rolling out in the IT industry. These innovations and developments help humans overcome existing problems, improve processes, and enhance the user experience by providing specific and personalized solutions. User needs are the main driver behind all this digital change and innovation. One such important innovation, which has changed the face of the digital world, is cloud computing. Cloud computing is an internet-based service that enables on-demand network access to a shared pool of configurable computational resources (such as servers, storage, networks, and applications). As the amount of data increases day by day, the security of users' data has become a major and crucial concern in providing protected communication between users and the cloud service provider.
Conference Paper
Recent techniques have applied cluster-evolution algorithms to analyze topic transitions in social networks and have proven effective in monitoring them. However, the high rate of data production on social networks creates the need to process an ever-growing amount of data. This work proposes a more scalable strategy for analyzing topic evolution in social networks by employing a distributed solution in the data clustering stage. Experiments were carried out using data obtained from Twitter and demonstrate that the proposed solution is promising, showing considerable performance gains.
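As one possible concrete form of a distributed clustering stage, the sketch below uses Spark MLlib k-means over TF-IDF features of tweet text. Spark and the feature pipeline are assumptions chosen for illustration; the paper's actual framework and feature set are not specified here.

```python
# Hedged sketch of a distributed clustering stage for tweet text, using
# Spark MLlib k-means. The framework and TF-IDF features are illustrative
# assumptions, not necessarily the paper's setup.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline

spark = SparkSession.builder.appName("topic-clustering").getOrCreate()
tweets = spark.createDataFrame(
    [("big data on hadoop",), ("hadoop cluster tuning",), ("cat pictures",)],
    ["text"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 12),
    IDF(inputCol="tf", outputCol="features"),
    KMeans(k=2, seed=1),            # reads default "features" column
])
model = pipeline.fit(tweets)
model.transform(tweets).select("text", "prediction").show()
```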
Chapter
Healthcare institutions are complex organizations dedicated to providing care to the population. Continuous improvement has made the care provided a factor of excellence, improving people's daily lives and increasing average life expectancy. Even so, the resulting aging of the population has caused demand to grow day by day and the paradigm of medicine to shift from reaction to prevention. Often, the principle of evidence-based medicine is compromised by a lack of evidence on pathogenic mechanisms, risk prediction, resources, and effective therapeutic strategies. This is even more evident in pandemic situations. Current data management tools (centered on a single machine) do not behave ideally when processing large amounts of information. This fact, combined with a lack of sensitivity to the health domain, makes it imperative to create and implement an architecture that performs this management and processing effectively. In this sense, this paper studies the problem of knowledge construction from Big Data in health institutions. The main goal is to present an architecture that deals with the adversities of the Big Data universe when applied to health. Keywords: Big Data, Healthcare Information Systems, Real-Time Information System, System Architecture
Article
Full-text available
Solving problems of high dimensionality (and complexity) usually requires the intensive use of technologies such as parallelism, advanced computers, and new types of algorithms. MapReduce (MR) is a long-established computing paradigm in computer science that has been proposed in recent years for dealing with big data applications, though it can also be used for many other tasks. In this article, we address big optimization: solving large instances of combinatorial optimization problems by using MR as the paradigm to design solvers that run transparently on a varied number of computers collaborating to find the problem solution. We study and analyze the MR technology, focusing on Hadoop, Spark, and MPI as the middleware platforms on which to develop genetic algorithms (GAs). From this, MRGA solvers arise, using a programming paradigm different from the usual imperative transformational programming. Our objective is to confirm the expected benefits of these systems, namely file, memory, and communication management, for the resulting algorithms. We analyze our MRGA solvers from relevant points of view such as scalability, speedup, and communication versus computation time in big optimization. The results for high-dimensional datasets show that the MRGA over Hadoop outperforms the implementations in the Spark and MPI frameworks. For the smallest datasets, the execution of MRGA on MPI is always faster than the executions of the remaining MRGAs. Finally, the MRGA over Spark presents the lowest communication times. Numerical and time insights are given in our work so as to ease future comparisons of new algorithms over these three popular technologies.
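The core MRGA idea, mapping fitness evaluation over the population and reducing to select survivors, can be sketched on a single machine as below, with a process pool standing in for Hadoop, Spark, or MPI workers. The OneMax fitness function and mutation-only variation are illustrative assumptions, not the paper's operators.

```python
# Minimal single-machine sketch of the MRGA idea: "map" evaluates fitness
# of each individual in parallel, "reduce" selects survivors. A process
# pool stands in for the Hadoop/Spark/MPI workers.
import random
from multiprocessing import Pool

def fitness(individual):                 # OneMax: count of 1-bits (assumption)
    return sum(individual), individual

def step(population, pool, keep=10):
    scored = pool.map(fitness, population)                   # map phase
    survivors = [ind for _, ind in sorted(scored, reverse=True)[:keep]]  # reduce
    # Mutation-only variation to refill the population (assumption).
    children = [[bit ^ (random.random() < 0.05) for bit in random.choice(survivors)]
                for _ in range(len(population) - keep)]
    return survivors + children

if __name__ == "__main__":
    random.seed(0)
    pop = [[random.randint(0, 1) for _ in range(32)] for _ in range(50)]
    with Pool(4) as pool:
        for _ in range(20):
            pop = step(pop, pool)
    print("best fitness:", max(sum(ind) for ind in pop))
```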
Article
Full-text available
The traditional standalone computing approach struggles to process large XML data due to scalability limits, so distributed processing on cluster systems becomes an inevitable choice. Current distributed XML processing methods generally rely on existing distributed computing frameworks for general-purpose data, which have limitations such as complex configuration, inflexible working mechanisms, and difficult performance optimization in the context of XML's semi-structural features and complex queries. In addition, distributed XML queries suffer from a low level of automatic processing and lack effective integration with distributed XML parsing and indexing. In this paper we propose an integrated method for distributed processing of large XML data, called dXML. Our method supports the distributed parsing of arbitrary XML fragments and the distributed creation of indexes, and adopts efficient navigational XPath evaluation based on a relation index. Through a distributed XPath evaluation approach based on filter-upon-pre-evaluate, our method enables data locality and reduces network traffic during the distributed evaluation of complex XPath predicates. dXML integrates distributed XML parsing, index creation, and XPath querying, provides a one-stop XML processing solution, supports automatic distributed processing of large XML data, and features lightweight configuration and a flexible working mechanism. Experimental evaluation verifies the effectiveness of dXML, and comparative results show that dXML achieves better distributed query performance than both typical existing navigational and Twig-based distributed processing methods.
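The data-partitioning idea behind distributed XML querying can be illustrated with the toy sketch below, which parses and queries record-level fragments in parallel and merges the results. It mimics only fragment parallelism; dXML's relation index and filter-upon-pre-evaluate strategy are not reproduced here.

```python
# Toy illustration of fragment-parallel XPath evaluation: split a document
# into record-level fragments, query each independently, merge results.
import xml.etree.ElementTree as ET
from concurrent.futures import ProcessPoolExecutor

FRAGMENTS = [                                 # illustrative record fragments
    "<book><title>HDFS Internals</title><year>2010</year></book>",
    "<book><title>XML at Scale</title><year>2021</year></book>",
]

def query_fragment(fragment):
    """Evaluate a (limited, ElementTree-supported) XPath on one fragment."""
    root = ET.fromstring(fragment)
    return [t.text for t in root.findall("./title")]

if __name__ == "__main__":
    with ProcessPoolExecutor() as ex:
        results = [title for part in ex.map(query_fragment, FRAGMENTS)
                   for title in part]
    print(results)   # ['HDFS Internals', 'XML at Scale']
```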
Chapter
This article presents an analysis of demand and a characterization of mobility using public transportation in Montevideo, Uruguay, during the COVID-19 pandemic. An urban data-analysis approach is applied to extract useful insights from open data from different sources, including mobility of citizens, the public transportation system, and COVID cases. The proposed approach made it possible to determine the reduction in trips caused by each wave of the pandemic, the correlation between the number of trips and COVID cases, and the recovery of public transportation use. Overall, the results provide useful insights to quantify and understand the behavior of citizens in Montevideo with regard to public transportation during the COVID-19 pandemic. Keywords: Public Transportation, COVID-19 Pandemic, Urban Data Analysis, Mobility Patterns
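One of the analyses mentioned, correlating daily trips with COVID cases, reduces to a one-line computation once the two series are aligned by date; the sketch below shows it on fabricated placeholder numbers, not the Montevideo data.

```python
# Toy sketch of the trips-vs-cases correlation described above. The numbers
# are fabricated placeholders for illustration, not the Montevideo data.
import pandas as pd

df = pd.DataFrame({
    "date":  pd.date_range("2021-03-01", periods=5, freq="D"),
    "trips": [120_000, 95_000, 90_000, 70_000, 65_000],
    "cases": [400, 650, 700, 950, 1_100],
})
print(df["trips"].corr(df["cases"]))   # Pearson correlation (negative here)
```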
Article
The business world is becoming more competitive and complex day by day, and customer needs and expectations are changing rapidly. Therefore, making effective strategic marketing decisions in a short time is essential to hold a competitive advantage. Especially in the telecommunication sector, there is a lot of data to be handled every day. Analyzing all those data, extracting the most critical insights, and using those insights for strategic marketing decisions is essential to remain the market leader. Today, Big Data Analytics is used to analyze massive databases, find patterns, and extract valuable information and insights. Therefore, this research investigates strategic marketing decisions, Big Data Analytics, and how Big Data Analytics is used in the telecommunication sector to make strategic marketing decisions. Moreover, this research examines how Big Data Analytics can be applied to perform customer segmentation based on psychographic segmentation by analyzing URLs with artificial neural network techniques. This study provides suitable recommendations for telecommunication companies to apply Big Data Analytics to improve revenue, and finally it points out possible future study areas.
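As a rough illustration of the segmentation step, the sketch below represents customers by the URLs they visit and groups them into segments. k-means is plainly substituted for the paper's artificial-neural-network technique to keep the sketch short, and the URL data is an illustrative assumption.

```python
# URL-based customer segmentation sketch. k-means stands in for the paper's
# neural-network technique; the customer/URL data is a toy assumption.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

customers = {
    "c1": "news.example.com sports.example.com sports.example.com",
    "c2": "finance.example.com stocks.example.com",
    "c3": "sports.example.com football.example.com",
}

# Treat each visited URL as a token and build TF-IDF profiles per customer.
X = TfidfVectorizer(token_pattern=r"[^ ]+").fit_transform(customers.values())
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(customers, labels)))   # e.g. {'c1': 0, 'c2': 1, 'c3': 0}
```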