Book

Hadoop: The Definitive Guide

Authors:
... The user is able to shrink or expand the computing clusters to control the data volume handled and the response time [1,46]. Dealing with large amounts of data is not an easy task, especially when there is a specific goal in mind; since data arrives quickly, fast collection, sorting, and processing are vital. Apache Hadoop was created by Doug Cutting [47] for this purpose. It was later adopted, developed, and released by Yahoo [48]. ...
... Many projects were developed in a quest to either complement or replace the above parts, and not all projects are hosted by the Apache Software Foundation, which is the reason for the emergence of the term Hadoop ecosystem [47]. Hadoop V2.x is viewed as a three-layered model. ...
Preprint
Currently, the world is witnessing a mounting avalanche of data due to the increasing number of mobile network subscribers, Internet websites, and online services. This trend is continuing to develop in a quick and diverse manner in the form of big data. Big data analytics can process large amounts of raw data and extract useful, smaller-sized information, which can be used by different parties to make reliable decisions. In this paper, we conduct a survey on the role that big data analytics can play in the design of data communication networks. Integrating the latest advances that employ big data analytics with the networks' control/traffic layers might be the best way to build robust data communication networks with refined performance and intelligent features. First, the survey starts with the introduction of the big data basic concepts, framework, and characteristics. Second, we illustrate the main network design cycle employing big data analytics. This cycle represents the umbrella concept that unifies the surveyed topics. Third, there is a detailed review of the current academic and industrial efforts toward network design using big data analytics. Fourth, we identify the challenges confronting the utilization of big data analytics in network design. Finally, we highlight several future research directions. To the best of our knowledge, this is the first survey that addresses the use of big data analytics techniques for the design of a broad range of networks.
... The overlapping area should be at least as big as the required neighborhood for processing a local entity, and the produced overlapping result may require special treatment to be unified. The authors used the map phase of the Hadoop MapReduce infrastructure (White, 2012) for clustering buildings of large urban areas, and the overlapping result was unified separately afterwards. ...
... It is assumed that all processes can independently input tiles data and output results. Such an assumption can reasonably be fulfilled by using a supercomputing infrastructure with a unified file system (typically designed to efficiently support all existent physical processing cores), by maintaining the tiles and the results on a scalable distributed file system such as the Hadoop file system (White, 2012), or by using a specialized distributed spatial data organization/retrieval system (Aji et al., 2013;Hongchao and Wang, 2011). The master initializes the work by loading the tile map and assigning each slave to process a unique tile via a process tile (PT) message carrying the associated tile ID. ...
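The master-slave tile workflow described above can be sketched as follows. This is an illustrative stand-in, not the authors' implementation: the function names, the in-memory queue playing the role of the "process tile" messages, and the placeholder segmentation step are all ours.

```python
from queue import Queue, Empty
from threading import Thread

def run_master(tile_ids, num_slaves):
    """Master: load the tile map and hand each slave a unique tile ID
    (the 'process tile' message from the text)."""
    work = Queue()
    for tid in tile_ids:
        work.put(tid)
    results = []  # list.append is thread-safe in CPython

    def slave():
        while True:
            try:
                tid = work.get_nowait()
            except Empty:
                return  # no tiles left: this slave is done
            # Stand-in for segmenting tree crowns within one tile.
            results.append((tid, f"segmented-{tid}"))

    threads = [Thread(target=slave) for _ in range(num_slaves)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(results)
```

In the real system each slave is a separate process on a cluster node and the results are point-cloud segments rather than strings, but the coordination pattern is the same: the master owns the global tile map, and each tile is processed by exactly one slave.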
Preprint
Full-text available
This paper presents a distributed approach that scales up to segment tree crowns within a LiDAR point cloud representing an arbitrarily large forested area. The approach uses a single-processor tree segmentation algorithm as a building block in order to process the data, delivered in the shape of tiles, in parallel. The distributed processing is performed in a master-slave manner, in which the master maintains the global map of the tiles and coordinates the slaves that segment tree crowns within and across the boundaries of the tiles. A minimal bias in the number of detected trees was introduced by trees lying across the tile boundaries; this bias was quantified and adjusted for. Theoretical and experimental analyses of the runtime of the approach revealed a near linear speedup. The estimated number of trees categorized by crown class and the associated error margins, as well as the height distribution of the detected trees, aligned well with field estimations, verifying that the distributed approach works correctly. The approach enables providing information on individual tree locations and point cloud segments for a forest-level area in a timely manner, which can be used to create detailed remotely sensed forest inventories. Although the approach was presented for tree segmentation within LiDAR point clouds, the idea can also be generalized to scale up processing other big spatial datasets.
Highlights:
- A scalable distributed approach for tree segmentation was developed and theoretically analyzed.
- ~2 million trees in a 7440 ha forest were segmented in 2.5 hours using 192 cores.
- 2% false positive trees were identified as a result of the distributed run.
- The approach can be used to scale up processing other big spatial data.
... To verify the performance of HTECs we implemented them in C/C++ and used them in Hadoop Distributed File System (HDFS). Hadoop is an open-source software framework used for distributed storage and processing of big data sets [35]. From release 3.0.0-alpha2 ...
... The nodes were running on Linux machines equipped with Intel Xeon E5-2676 v3 processors running at 2.4 GHz. Two crucial parameters in Hadoop are split size and block size (we refer the interested reader to [35]). We experimented with different block sizes (90 MB and 360 MB), different split sizes (512 KB, 1 MB and 4 MB) and different sub-packetization levels (α = 1, 3, 6, and 9) in order to check how they affect the repair time of one lost node. ...
Preprint
Minimum-Storage Regenerating (MSR) codes have emerged as a viable alternative to Reed-Solomon (RS) codes as they minimize the repair bandwidth while still being optimal in terms of reliability and storage overhead. Although several MSR constructions exist, so far they have not been practically implemented, mainly due to the large number of I/O operations. In this paper, we analyze high-rate MDS codes that are simultaneously optimized in terms of storage, reliability, I/O operations, and repair bandwidth for single and multiple failures of the systematic nodes. The codes were recently introduced in \cite{7463553} without any specific name. Due to the resemblance between the hashtag sign \# and the procedure of the code construction, we call them in this paper \emph{HashTag Erasure Codes (HTECs)}. HTECs provide the lowest data read and data transfer, and thus the lowest repair time, for an arbitrary sub-packetization level $\alpha$, where $\alpha \leq r^{\lceil k/r \rceil}$, among all existing MDS codes for distributed storage, including MSR codes. The repair process is linear and highly parallel. Additionally, we show that HTECs are the first high-rate MDS codes that reduce the repair bandwidth for more than one failure. Practical implementations of HTECs in Hadoop release 3.0.0-alpha2 demonstrate their great potential.
... Many task scheduling algorithms have been proposed [14], [15], [16], [17], [18] to improve data locality and to shorten job turnaround time, but most of them only focus on scheduling map tasks, rather than scheduling reduce tasks. Hence, employing them in a virtual MapReduce cluster might cause a low reduce-data locality. ...
... Before submitting a MapReduce job J to process a data file D, a user needs to upload D to the distributed filesystem of a MapReduce cluster. The file D will be divided into fixed-size blocks (e.g., 64 MB in Hadoop [16]), and each block will be replicated and randomly stored in several slaves based on available storage space. The execution of J comprises three phases: map, shuffle, and reduce. ...
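The three phases named above can be illustrated with a toy word count run in a single process; the helper names are ours, and a real framework executes each phase distributed across many nodes:

```python
from collections import defaultdict

def map_phase(block):
    # Map: emit (word, 1) pairs for one input block.
    return [(w, 1) for w in block.split()]

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does when moving data from mappers to reducers.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate all values of each key.
    return {key: sum(values) for key, values in grouped.items()}

blocks = ["big data big", "data big"]  # stand-ins for fixed-size file blocks
pairs = [p for b in blocks for p in map_phase(b)]
counts = reduce_phase(shuffle_phase(pairs))
# counts == {'big': 3, 'data': 2}
```

Each block is mapped independently, which is what makes the map phase embarrassingly parallel; the shuffle is the only step that requires moving data between nodes.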
Preprint
It is cost-efficient for a tenant with a limited budget to establish a virtual MapReduce cluster by renting multiple virtual private servers (VPSs) from a VPS provider. To provide an appropriate scheduling scheme for this type of computing environment, we propose in this paper a hybrid job-driven scheduling scheme (JoSS for short) from a tenant's perspective. JoSS provides not only job level scheduling, but also map-task level scheduling and reduce-task level scheduling. JoSS classifies MapReduce jobs based on job scale and job type and designs an appropriate scheduling policy to schedule each class of jobs. The goal is to improve data locality for both map tasks and reduce tasks, avoid job starvation, and improve job execution performance. Two variations of JoSS are further introduced to separately achieve a better map-data locality and a faster task assignment. We conduct extensive experiments to evaluate and compare the two variations with current scheduling algorithms supported by Hadoop. The results show that the two variations outperform the other tested algorithms in terms of map-data locality, reduce-data locality, and network overhead without incurring significant overhead. In addition, the two variations are separately suitable for different MapReduce-workload scenarios and provide the best job performance among all tested algorithms.
... Some MapReduce runtime systems have been implemented, such as Hadoop [20], Twister [21], Phoenix [22] and Mars [23], all of which help developers parallelize traditional algorithms using the MapReduce model. For example, Apache Mahout [24] is a machine learning library that provides implementations of parallel, scalable machine learning algorithms on the Hadoop platform using MapReduce. ...
... Afterwards, Qian et al. presented a parallel attribute reduction algorithm based on MapReduce [27]. However, all of these existing parallel methods make use of the classical MapReduce framework and are implemented on the Hadoop platform [20]. ...
Preprint
The rapid growth of emerging information technologies and application patterns in modern society, e.g., Internet, Internet of Things, Cloud Computing and Tri-network Convergence, has caused the advent of the era of big data. Big data contains huge value; however, mining knowledge from big data is a tremendously challenging task because of data uncertainty and inconsistency. Attribute reduction (also known as feature selection) can not only be used as an effective preprocessing step, but also exploits data redundancy to reduce uncertainty. However, existing solutions are designed either 1) for a single machine, which means the entire data must fit in main memory and the parallelism is limited; or 2) for the Hadoop platform, which means the data have to be loaded into distributed memory frequently and processing therefore becomes inefficient. In this paper, we overcome these shortcomings for maximum efficiency and propose a unified framework for Parallel Large-scale Attribute Reduction, termed PLAR, for big data analysis. PLAR consists of three components: 1) Granular Computing (GrC)-based initialization: it converts a decision table (i.e., the original data representation) into a granularity representation, which reduces the amount of space and hence can easily be cached in distributed memory; 2) model-parallelism: it simultaneously evaluates all feature candidates and makes attribute reduction highly parallelizable; 3) data-parallelism: it computes the significance of an attribute in parallel in a MapReduce style. We implement PLAR with four representative heuristic feature selection algorithms on Spark, and evaluate them on various huge datasets, including UCI and astronomical datasets, demonstrating our method's advantages over existing solutions.
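The "significance of an attribute" that the data-parallel step computes can be illustrated with a small rough-set style dependency-degree calculation; this is a generic single-machine sketch of the idea, not the PLAR implementation, and the function and argument names are ours:

```python
from collections import defaultdict

def dependency_degree(rows, attr_idx, decision_idx):
    """Rough-set style significance: the fraction of rows whose equivalence
    class under the chosen attributes is pure w.r.t. the decision attribute."""
    decisions = defaultdict(set)   # equivalence class -> decision values seen
    sizes = defaultdict(int)       # equivalence class -> number of rows
    for row in rows:
        key = tuple(row[i] for i in attr_idx)
        decisions[key].add(row[decision_idx])
        sizes[key] += 1
    # Positive region: rows in classes with exactly one decision value.
    pos = sum(sizes[k] for k, d in decisions.items() if len(d) == 1)
    return pos / len(rows)
```

In a MapReduce-style realization, the per-row grouping is the map side and the purity count is the reduce side, and each candidate attribute subset can be evaluated independently, which is what makes the computation parallelizable.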
... IoT data analysis can be categorized into three groups (Lim et al., 2023; White, 2012). ...
... • Batch Data Storage: Hadoop HDFS: a distributed storage system designed to store big data in batch form. HDFS supports redundancy and fault tolerance by keeping copies of the data on several nodes (White, 2012). Data Warehouses: systems such as Amazon Redshift or Google BigQuery provide batch data storage and processing for large-scale analysis (Vassiliadis, 2015). ...
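The fault tolerance through replicated blocks mentioned for HDFS can be sketched in a few lines; the node names and the uniform-random placement below are illustrative only (real HDFS placement is rack-aware):

```python
import random

def place_replicas(nodes, replication=3):
    """Toy block placement: keep `replication` copies of a block on
    distinct nodes, so losing any single node loses no data."""
    return random.sample(nodes, replication)

placement = place_replicas(["n1", "n2", "n3", "n4", "n5"])
```

Because the copies land on distinct nodes, a read can be served by any surviving replica, and a failed node only reduces the replica count rather than losing the block.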
... Google invented this method, introducing Google MapReduce, but does not grant access to it. Therefore, anyone who needs access to the source code should use Hadoop [12]. Hadoop and Spark provide open-source implementations of MapReduce. ...
... As the volume of data generated by Twitter exceeded Storm's capabilities in terms of management, debugging, and scheduling, Twitter introduced Heron as a new technology in 2015. Heron is one of the latest systems for stream processing [12]; it revamps the internal structure of Storm and reduces its complexity. ...
Article
Full-text available
With the advancements in science and technology, the industrial and aviation sectors have witnessed a significant increase in data. A vast amount of data is generated and utilized continuously. It is imperative to employ data mining techniques to extract and uncover knowledge from this data. Data mining is a method that enables the extraction of valuable information and hidden relationships from datasets. However, the current aviation data presents challenges in effectively extracting knowledge due to its large volume and diverse structures. Air Traffic Management (ATM) involves handling Big data, which exceeds the capacity of conventional acquisition, matching, management, and processing within a reasonable timeframe. Aviation Big data exists in batch forms and streaming formats, necessitating the utilization of parallel hardware and software, as well as stream processing, to extract meaningful insights. Currently, the map-reduce method is the prevailing model for processing Big data in the aviation industry. This paper aims to analyze the evolving trends in aviation Big data processing methods, followed by a comprehensive investigation and discussion of data analysis techniques. We implement the map-reduce optimization of the K-Means algorithm in the Hadoop and Spark environments. The K-Means map-reduce is a crucial and widely applied clustering method. Finally, we conduct a case study to analyze and compare aviation Big data related to air traffic management in the USA using the K-Means map-reduce approach in the Hadoop and Spark environments. The analyzed dataset includes flight records. The results demonstrate the suitability of this platform for aviation Big data, considering the characteristics of the aviation dataset. Furthermore, this study presents the first application of the designed program for air traffic management.
... Cloud storage and computing play a crucial role in BDA and computational tasks, offering scalable resources and flexibility for handling large volumes of data [28]. Major cloud storage platforms used in BDA include Amazon S3, which provides scalable object storage for backup and archiving [67], BigQuery for analyzing large datasets with SQL [68], Google Drive for file storage and sharing [69], Microsoft Azure for a wide range of cloud services including data analytics [70], and Hadoop for distributed storage and processing [71]. In our case study, we utilized Google Colab [72] and Google Drive as our cloud computing platforms to predict diabetes, leveraging Python [73] for scripting and Apache Spark [74] for processing large datasets. ...
Article
Full-text available
The evolving healthcare domain necessitates an upgrade through digitization, integrating patient data, and advanced medical results. In the last couple of decades, advances in information and storage technologies in healthcare have produced vast amounts of data. The remarkable increases in data volumes, along with the enticing prospects and potential inherent in data analysis, have contributed to the concept of Big Data. There is a pressing need within the research community to analyze these large volumes of Big Data. To address this challenge, Big Data Analytics (BDA), the systematic process of examining large and complex datasets to uncover hidden patterns, correlations, and insights for informed decision-making, has emerged. It employs various methodologies and techniques to enable informed decision-making. This study delves into using Machine Learning (ML) in big data environments, explicitly utilizing the MLib library in Apache Spark to derive meaningful insights from diabetic healthcare dataset. The CDC’s Behavioral Risk Factor Surveillance System (BRFSS) was used to empirically demonstrate the advantages of integrating BDA with ML for medical decision-making in Big Data environments. The research finding highlighted the superior performance of Logistic Regression (LR) models compared to other models like Naive Bayes (NB), providing valuable insights for healthcare applications.
... The Deep Learning models were evaluated using records of ship voyages built from the global collection of live and historical AIS data supplied by the UN Global Platform (UNGP) [21]. The platform is based on Hadoop [22] and Spark [23] frameworks, where the first one was used to store data in a distributed file system and the second one was used for the data processing phase. Data is divided into small parts called 'blocks' and distributed among several nodes of a computer network. ...
Conference Paper
Full-text available
Maritime data surveillance has significantly grown in the last decade, notably through technologies like AIS (Automatic Identification System). Predicting vessel positions is crucial for various maritime applications, with a focus on trajectory forecasting. This involves anticipating vessel direction and future locations, which is vital for tasks such as search and rescue, traffic management, and pollution monitoring. Despite the abundance of AIS data available, predicting vessel paths remains challenging. AIS technologies, which use ship transponders, assist vessel traffic services (VTS) and serve as the foundation for tasks such as trajectory prediction and the classification of the arrival port given a route, namely ‘Port Classification’. AIS data includes essential vessel information, such as latitude, longitude, speed, and identity, transmitted via VHF signals. Deep learning models are considered state-of-the-art for analyzing AIS data. This study focuses on implementing a ‘Port Classifier’, evaluating several models including Conv1D, MLP and LSTM on AIS trajectories labelled through heuristic algorithms, and promising results are achieved, with Conv1D showing superiority in port classification tasks. Additionally, we conducted an Exploratory Data Analysis (EDA) to better understand the data. Our findings contribute to enhance maritime data analysis, and demonstrate potential applications for Official Statistics.
... Most data parallel frameworks are responsible for partitioning, merging, and distributing data splits across distributed hardware resources for optimal performance, load balancing and data parallelism. For example, in the Hadoop MapReduce implementation [30], each file ingested into the Hadoop Filesystem (HDFS) is partitioned by the framework into multiple 64 Megabyte chunks by default. In Spark, an in-memory data frame is partitioned into multiple subarrays, each of which is then distributed onto a Worker/Executor for processing. ...
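The fixed-size partitioning described above can be sketched as follows; the chunk size and the round-robin assignment to workers are illustrative, not the defaults of any particular framework:

```python
def partition(data: bytes, chunk_size: int):
    """Split an input into fixed-size chunks, as a data-parallel framework
    does before distributing a file across workers (the last chunk may be
    shorter than chunk_size)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

chunks = partition(b"0123456789", chunk_size=4)
# chunks == [b"0123", b"4567", b"89"]

# A simple round-robin assignment of chunks to three workers:
workers = {w: chunks[w::3] for w in range(3)}
```

Once the chunks are assigned, each worker can process its share independently, which is the source of the data parallelism and load balancing the text refers to.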
Preprint
The Data Activated Liu Graph Engine - DALiuGE - is an execution framework for processing large astronomical datasets at a scale required by the Square Kilometre Array Phase 1 (SKA1). It includes an interface for expressing complex data reduction pipelines consisting of both data sets and algorithmic components and an implementation run-time to execute such pipelines on distributed resources. By mapping the logical view of a pipeline to its physical realisation, DALiuGE separates the concerns of multiple stakeholders, allowing them to collectively optimise large-scale data processing solutions in a coherent manner. The execution in DALiuGE is data-activated, where each individual data item autonomously triggers the processing on itself. Such decentralisation also makes the execution framework very scalable and flexible, supporting pipeline sizes ranging from less than ten tasks running on a laptop to tens of millions of concurrent tasks on the second fastest supercomputer in the world. DALiuGE has been used in production for reducing interferometry data sets from the Karl E. Jansky Very Large Array and the Mingantu Ultrawide Spectral Radioheliograph; and is being developed as the execution framework prototype for the Science Data Processor (SDP) consortium of the Square Kilometre Array (SKA) telescope. This paper presents a technical overview of DALiuGE and discusses case studies from the CHILES and MUSER projects that use DALiuGE to execute production pipelines. In a companion paper, we provide in-depth analysis of DALiuGE's scalability to very large numbers of tasks on two supercomputing facilities.
... Hadoop is an open-source software framework used for distributed storage and processing of big data sets [34]. To verify the performance of HashTag codes and their locally repairable and locally regenerating variants we implemented them in C/C++ and used them in HDFS. ...
Preprint
Recently we constructed an explicit family of locally repairable and locally regenerating codes. Their existence was proven by Kamath et al., but no explicit construction was given. Our design is based on HashTag codes, which can have different sub-packetization levels. In this work we emphasize the importance of having two ways to repair a node: repair only with local parity nodes, or repair with both local and global parity nodes. We say that the repair strategy is network-traffic driven since it depends on the concrete system and code parameters: the repair bandwidth of the code, the number of I/O operations, the access time for the contacted parts, and the size of the stored file. We show the benefits of having repair duality in one practical example implemented in Hadoop. We also give algorithms for efficient repair of the global parity nodes.
... As we can see in Figure 5, all of the schedulers except Hadoop YARN do well with 60-second tasks; all of the schedulers launch jobs reasonably fast. Hadoop YARN has greater overhead for each job, including launching an application master process for each job [White (2015)]. Slurm, Grid Engine, and Mesos perform similarly with 1-, 5-, and 30-second tasks. ...
Preprint
In the rapidly expanding field of parallel processing, job schedulers are the "operating systems" of modern big data architectures and supercomputing systems. Job schedulers allocate computing resources and control the execution of processes on those resources. Historically, job schedulers were the domain of supercomputers, and job schedulers were designed to run massive, long-running computations over days and weeks. More recently, big data workloads have created a need for a new class of computations consisting of many short computations taking seconds or minutes that process enormous quantities of data. For both supercomputers and big data systems, the efficiency of the job scheduler represents a fundamental limit on the efficiency of the system. Detailed measurement and modeling of the performance of schedulers are critical for maximizing the performance of a large-scale computing system. This paper presents a detailed feature analysis of 15 supercomputing and big data schedulers. For big data workloads, the scheduler latency is the most important performance characteristic of the scheduler. A theoretical model of the latency of these schedulers is developed and used to design experiments targeted at measuring scheduler latency. Detailed benchmarking of four of the most popular schedulers (Slurm, Son of Grid Engine, Mesos, and Hadoop YARN) is conducted. The theoretical model is compared with data and demonstrates that scheduler performance can be characterized by two key parameters: the marginal latency of the scheduler $t_s$ and a nonlinear exponent $\alpha_s$. For all four schedulers, the utilization of the computing system decreases to less than 10% for computations lasting only a few seconds. Multilevel schedulers that transparently aggregate short computations can improve utilization for these short computations to more than 90% for all four of the schedulers that were tested.
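The two-parameter characterization in the abstract can be illustrated with a toy utilization model; the functional form below is our simplification for illustration only, not the paper's exact latency model:

```python
def utilization(task_seconds, num_tasks, t_s, alpha_s):
    """Fraction of time spent on useful work when running `num_tasks`
    tasks of `task_seconds` each, with total scheduling overhead
    modeled as t_s * num_tasks ** alpha_s."""
    compute = task_seconds * num_tasks
    overhead = t_s * num_tasks ** alpha_s
    return compute / (compute + overhead)
```

With plausible toy numbers, minute-long tasks keep utilization high while second-long tasks drive it down sharply, which is the qualitative trend the abstract describes; aggregating many short tasks into one scheduler submission raises `task_seconds` and so recovers high utilization.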
... To the best of our knowledge, there is no robust solution for detecting or preventing insider threats within big data infrastructures. For example, security mechanisms of popular big data systems such as Hadoop [4] and Spark [5] include third-party applications such as Kerberos [6], access control lists (ACL), log monitoring and data encryption (to some extent). But for an insider, especially a traitor, circumventing these mechanisms is not difficult [7]. ...
Preprint
In big data systems, the infrastructure is such that large amounts of data are hosted away from the users. In such a system, information security is considered a major challenge. From a customer perspective, one of the big risks in adopting big data systems is in trusting the provider, who designs and owns the infrastructure, not to access user data. Yet there does not exist much in the literature on the detection of insider attacks. In this work, we propose a new system architecture in which insider attacks can be detected by utilizing the replication of data on various nodes in the system. The proposed system uses a two-step attack detection algorithm and a secure communication protocol to analyze processes executing in the system. The first step involves the construction of control instruction sequences for each process in the system. The second step involves the matching of these instruction sequences among the replica nodes. Initial experiments on real-world Hadoop and Spark tests show that the proposed system needs to consider only 20% of the code to analyze a program and incurs 3.28% time overhead. The proposed security system can be implemented and built for any big data system due to its extrinsic workflow.
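The second, matching step of the proposed detection algorithm can be sketched as a majority comparison of control-instruction sequences across replica nodes; the data layout and function name below are hypothetical illustrations of the idea, not the paper's protocol:

```python
from collections import Counter

def divergent_replicas(sequences):
    """Flag nodes whose control-instruction sequence differs from the
    consensus sequence among replicas running the same process."""
    tally = Counter(tuple(s) for s in sequences.values())
    consensus = tally.most_common(1)[0][0]
    return sorted(n for n, s in sequences.items() if tuple(s) != consensus)
```

Because the same data is replicated on several nodes, an honest majority lets the deviating replica stand out; a node that executes extra or altered instructions produces a sequence that fails to match the consensus.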
... Hadoop is an open-source software framework used for distributed storage and processing of big data sets [16]. To verify the performance of HashTag codes and their locally repairable and locally regenerating variants we implemented them in C/C++ and used them in HDFS. ...
Preprint
We construct an explicit family of locally repairable and locally regenerating codes whose existence was proven in a recent work by Kamath et al. on codes with local regeneration, but for which no explicit construction was given. This explicit family of codes is based on HashTag codes. HashTag codes are recently defined vector codes with different vector lengths $\alpha$ (also called the sub-packetization level) that achieve the optimal repair bandwidth of MSR codes, or near-optimal repair bandwidth, depending on the sub-packetization level. We applied the technique of parity-splitting code construction. We show that the lower bound on the size of the finite field for the presented explicit code constructions can be lower than the one given in the work of Kamath et al. Finally, we discuss the importance of having two ways for node repair with locally regenerating HashTag codes: repair only with local parity nodes, or repair with both local and global parity nodes. To the best of the authors' knowledge, this is the first work where this duality in the repair process is discussed. We give a practical example and experimental results in Hadoop where we show the benefits of having this repair duality.
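The repair duality discussed above can be illustrated with a toy XOR code (deliberately not a HashTag code): a lost node can be rebuilt either from its local parity group, contacting few nodes, or from all remaining data nodes plus a global parity.

```python
def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

d = [b"\x01", b"\x02", b"\x04", b"\x08"]  # four data nodes
lp = xor_blocks(d[:2])                    # local parity over group {d0, d1}
gp = xor_blocks(d)                        # global parity over all data nodes

# Repair duality for a lost d0:
local_repair = xor_blocks([d[1], lp])               # contacts 2 nodes
global_repair = xor_blocks([d[1], d[2], d[3], gp])  # contacts 4 nodes
```

Both repairs recover the same lost block; which path to take depends on which nodes are reachable and on the network traffic each path generates, which is exactly the trade-off the abstract calls network-traffic driven.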
... Large scale Internet companies, such as Google [24] and Amazon [25], have software that scales to cyberpods, but this proprietary software is not available to the research community. Although there are software applications, such as Hadoop [26], that are available to the research community and scale across multiple racks, there is not a complete open source software stack containing all the services required to build a large scale data commons, including the infrastructure automation and management services, security services, etc. [27] that are required to operate a data commons at scale. ...
Preprint
As the amount of scientific data continues to grow at ever faster rates, the research community is increasingly in need of flexible computational infrastructure that can support the entirety of the data science lifecycle, including long-term data storage, data exploration and discovery services, and compute capabilities to support data analysis and re-analysis, as new data are added and as scientific pipelines are refined. We describe our experience developing data commons -- interoperable infrastructure that co-locates data, storage, and compute with common analysis tools -- and present several case studies. Across these case studies, several common requirements emerge, including the need for persistent digital identifier and metadata services, APIs, data portability, pay-for-compute capabilities, and data peering agreements between data commons. Though many challenges remain, including sustainability and developing appropriate standards, interoperable data commons bring us one step closer to effective Data Science as a Service for the scientific research community.
... Once all pairs belonging to one key are in the same node, it is processed in parallel. Apache Hadoop [46] [1] is the most popular open-source framework based on the MapReduce model. Apache Spark [20,40] is an open-source framework for Big Data processing built around speed, ease of use and sophisticated analytics. ...
Preprint
In any knowledge discovery process, the value of extracted knowledge is directly related to the quality of the data used. Big Data problems, generated by the massive growth in the scale of data observed in recent years, also follow the same dictate. A common problem affecting data quality is the presence of noise, particularly in classification problems, where label noise refers to the incorrect labeling of training instances and is known to be a very disruptive feature of data. However, in this Big Data era, the massive growth in the scale of data poses a challenge to traditional proposals created to tackle noise, as they have difficulties coping with such a large amount of data. New algorithms need to be proposed to treat the noise in Big Data problems, providing high-quality, clean data, also known as Smart Data. In this paper, two Big Data preprocessing approaches to remove noisy examples are proposed: a homogeneous ensemble filter and a heterogeneous ensemble filter, with special emphasis on their scalability and performance traits. The obtained results show that these proposals enable the practitioner to efficiently obtain a Smart Dataset from any Big Data classification problem.
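An ensemble filter of the kind proposed can be sketched as a majority vote over classifier predictions; the callable-classifier interface below is an illustrative simplification, not the paper's implementation (a homogeneous ensemble uses copies of one learner, a heterogeneous one mixes learners):

```python
def majority_vote_filter(instances, labels, classifiers):
    """Keep an instance only when most classifiers agree with its recorded
    label; instances voted down are treated as label noise and removed."""
    kept = []
    for x, y in zip(instances, labels):
        votes = sum(1 for clf in classifiers if clf(x) == y)
        if votes > len(classifiers) / 2:
            kept.append((x, y))
    return kept
```

For example, with three copies of a simple threshold classifier, an instance whose recorded label contradicts all three votes is filtered out as likely label noise.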
... In this paper, we use the term "MapReduce" as a placeholder for a wider range of frameworks. While some frameworks such as Hadoop's MapReduce [13] strictly adhere to the two functions "map" and "reduce", the more recent and widely used distribution frameworks provide many additional primitives -for performance reasons and to make programming more comfortable. ...
Preprint
Full-text available
Distributed programs are often formulated in popular functional frameworks like MapReduce, Spark and Thrill, but writing efficient algorithms for such frameworks is usually a non-trivial task. As the costs of running faulty algorithms at scale can be severe, it is highly desirable to verify their correctness. We propose to employ existing imperative reference implementations as specifications for MapReduce implementations. To this end, we present a novel verification approach in which equivalence between an imperative and a MapReduce implementation is established by a series of program transformations. In this paper, we present how the equivalence framework can be used to prove equivalence between an imperative implementation of the PageRank algorithm and its MapReduce variant. The eight transformation steps are presented and explained individually.
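The kind of equivalence the paper verifies can be shown on a minimal example (a simple fold rather than PageRank): an imperative loop serving as the specification, and a MapReduce-style formulation that computes the same result. The function names are ours.

```python
from functools import reduce

def imperative_sum(xs):
    # Imperative reference implementation: a loop over mutable state.
    total = 0
    for x in xs:
        total += x
    return total

def mapreduce_sum(xs):
    # MapReduce-style formulation: map emits values; reduce folds them
    # with an associative, commutative operator.
    return reduce(lambda a, b: a + b, map(lambda x: x, xs), 0)
```

Because addition is associative and commutative, the reduce can be executed in any grouping or order across workers and still agree with the sequential loop, which is the property a transformation-based equivalence proof must establish.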
... Spurred by continuous and dramatic advancements in processing power, memory, and storage, and by an unprecedented wealth of data, big data processing platforms have been developed to tackle increasingly complex data science jobs. Led by the Hadoop framework [47] and its ecosystem, big data processing systems are showing remarkable success in several business and research domains [38]. In particular, for about a decade, the Hadoop platform represented the de facto standard of the Big Data analytics world. ...
Preprint
Recently, we have been witnessing huge advancements in the scale of data we routinely generate and collect in pretty much everything we do, as well as in our ability to exploit modern technologies to process, analyze, and understand this data. The intersection of these trends is what is nowadays called Big Data Science. Cloud computing represents a practical and cost-effective solution for supporting Big Data storage and processing and for sophisticated analytics applications. We analyze in detail the building blocks of the software stack for supporting big data science as a commodity service for data scientists. We provide various insights about the latest ongoing developments and open challenges in this domain.
... If a system experiences an increased data input, additional hardware resources (data nodes) can easily be added to a cluster to handle the increased workload. For more details on the architecture of Hadoop and MapReduce, readers can consult these sources [74,75]. ...
Preprint
Context: Big Data Cybersecurity Analytics is aimed at protecting networks, computers, and data from unauthorized access by analysing security event data using big data tools and technologies. Whilst a plethora of Big Data Cybersecurity Analytic Systems have been reported in the literature, there is a lack of a systematic and comprehensive review of the literature from an architectural perspective. Objective: This paper reports a systematic review aimed at identifying the most frequently reported quality attributes and architectural tactics for Big Data Cybersecurity Analytic Systems. Method: We used Systematic Literature Review (SLR) method for reviewing 74 primary studies selected using well-defined criteria. Results: Our findings are twofold: (i) identification of 12 most frequently reported quality attributes and the justification for their significance for Big Data Cybersecurity Analytic Systems; and (ii) identification and codification of 17 architectural tactics for addressing the quality attributes that are commonly associated with Big Data Cybersecurity Analytic systems. The identified tactics include six performance tactics, four accuracy tactics, two scalability tactics, three reliability tactics, and one security and usability tactic each. Conclusion: Our findings have revealed that (a) despite the significance of interoperability, modifiability, adaptability, generality, stealthiness, and privacy assurance, these quality attributes lack explicit architectural support in the literature (b) empirical investigation is required to evaluate the impact of codified architectural tactics (c) a good deal of research effort should be invested to explore the trade-offs and dependencies among the identified tactics and (d) there is a general lack of effective collaboration between academia and industry for supporting the field of Big Data Cybersecurity Analytic Systems.
... It is characterized by a programming model that expresses distributed applications in terms of two computations, map and reduce, and by a fault-tolerant distributed file system that is optimized for moving large quantities of data. Hadoop [48] is an open-source implementation of MapReduce and has been utilized as a Cloud programming platform on top of Amazon EC2 (Elastic MapReduce) and the Yahoo Cloud Supercomputing Cluster. Other research works and commercial implementations adopting the PaaS approach are mostly focused on providing a scalable infrastructure for developing web applications. ...
Preprint
Cloud computing has penetrated the Information Technology industry deep enough to influence major companies to adopt it into their mainstream business. A strong thrust on the use of virtualization technology to realize Infrastructure-as-a-Service (IaaS) has led enterprises to leverage subscription-oriented computing capabilities of public Clouds for hosting their application services. In parallel, research in academia has been investigating transversal aspects such as security, software frameworks, quality of service, and standardization. We believe that the complete realization of the Cloud computing vision will lead to the introduction of a virtual market where Cloud brokers, on behalf of end users, are in charge of selecting and composing the services advertised by different Cloud vendors. In order to make this happen, existing solutions and technologies have to be redesigned and extended from a market-oriented perspective and integrated together, giving rise to what we term Market-Oriented Cloud Computing. In this paper, we will assess the current status of Cloud computing by providing a reference model, discuss the challenges that researchers and IT practitioners are facing and will encounter in the near future, and present the approach for solving them from the perspective of the Cloudbus toolkit, which comprises a set of technologies geared towards the realization of the Market-Oriented Cloud Computing vision. We provide experimental results demonstrating market-oriented resource provisioning and brokering within a Cloud and across multiple distributed resources. We also include an application illustrating the hosting of ECG analysis as SaaS on Amazon IaaS (EC2 and S3) services.
... Large community driven projects have developed frameworks optimized for distributed data stream processing. Map-Reduce based solutions such as Hadoop [21,22] and Spark [23] provide distributed I/O, a unified environment, and hooks for running map and reduce operations over a cloud-based network. Other frameworks such as Flink [24], Samza [25], and Storm [26] are more tailored for realtime stream processing of tasks executing a directed acyclic graph (DAG) [27] of operations as fast as possible. ...
Preprint
Scientists are drawn to synchrotrons and accelerator based light sources because of their brightness, coherence and flux. The rate of improvement in brightness and detector technology has outpaced Moore's law growth seen for computers, networks, and storage, and is enabling novel observations and discoveries with faster frame rates, larger fields of view, higher resolution, and higher dimensionality. Here we present an integrated software/algorithmic framework designed to capitalize on high throughput experiments, and describe the streamlined processing pipeline of ptychography data analysis. The pipeline provides throughput, compression, and resolution as well as rapid feedback to the microscope operators.
... Over the last decade, massive parallelism became a major paradigm in computing, and we have witnessed the deployment of a number of very successful massively parallel computation frameworks, such as MapReduce [DG04,DG08], Hadoop [Whi12], Dryad [IBY + 07], or Spark [ZCF + 10]. This paradigm and the corresponding models of computation are rather different from classical parallel algorithms models considered widely in literature, such as the PRAM model. ...
Preprint
For over a decade now we have been witnessing the success of massive parallel computation (MPC) frameworks, such as MapReduce, Hadoop, Dryad, or Spark. One of the reasons for their success is the fact that these frameworks are able to accurately capture the nature of large-scale computation. In particular, compared to the classic distributed algorithms or PRAM models, these frameworks allow for much more local computation. The fundamental question that arises in this context is though: can we leverage this additional power to obtain even faster parallel algorithms? A prominent example here is the maximum matching problem, one of the most classic graph problems. It is well known that in the PRAM model one can compute a 2-approximate maximum matching in $O(\log n)$ rounds. However, the exact complexity of this problem in the MPC framework is still far from understood. Lattanzi et al. showed that if each machine has $n^{1+\Omega(1)}$ memory, this problem can also be solved 2-approximately in a constant number of rounds. These techniques, as well as the approaches developed in the follow-up work, seem though to get stuck in a fundamental way at roughly $O(\log n)$ rounds once we enter the near-linear memory regime. It is thus entirely possible that in this regime, which captures in particular the case of sparse graph computations, the best MPC round complexity matches what one can already get in the PRAM model, without the need to take advantage of the extra local computation power. In this paper, we finally refute that perplexing possibility. That is, we break the above $O(\log n)$ round complexity bound even in the case of slightly sublinear memory per machine. In fact, our improvement here is almost exponential: we are able to deliver a $(2+\epsilon)$-approximation to maximum matching, for any fixed constant $\epsilon>0$, in $O((\log \log n)^2)$ rounds.
... Slave nodes have the Data Node and Task Tracker roles. If a slave node fails and cannot finish executing its task, the master node automatically schedules the same task to run on another slave machine [9]. ...
Preprint
In this paper we describe our work on designing a web-based, distributed data analysis system based on the popular MapReduce framework deployed on a small cloud, developed specifically for analyzing web server logs. The log analysis system consists of several cluster nodes; it splits the large log files on a distributed file system and quickly processes them using the MapReduce programming model. The cluster is created using an open-source cloud infrastructure, which allows us to easily expand the computational power by adding new nodes. This gives us the ability to automatically resize the cluster according to the data analysis requirements. We implemented MapReduce programs for basic log analysis needs such as frequency analysis, error detection, and busy-hour detection, as well as more complex analyses which require running several jobs. The system can automatically identify and analyze several web server log types such as Apache, IIS, and Squid. We use open source projects for creating the cloud infrastructure and running MapReduce jobs.
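The frequency-analysis job described above can be illustrated in a single process. This is a hypothetical sketch, not the paper's code: the mapper parses one Apache-style access-log line and emits (status code, 1), and the reducer sums the counts; the log lines and the regular expression are invented for the example:

```python
import re
from collections import Counter

# Hypothetical single-process sketch of a log frequency-analysis job:
# mapper parses an Apache common-log line and emits (status_code, 1),
# reducer sums the emitted counts per status code.

LOG_RE = re.compile(r'\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) \S+')

def mapper(line):
    m = LOG_RE.match(line)
    if m:                      # silently skip malformed lines
        yield m.group(1), 1

def reducer(pairs):
    counts = Counter()
    for status, one in pairs:
        counts[status] += one
    return dict(counts)

logs = [
    '127.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET / HTTP/1.1" 200 2326',
    '127.0.0.1 - - [10/Oct/2024:13:55:40 +0000] "GET /x HTTP/1.1" 404 209',
    '10.0.0.5 - - [10/Oct/2024:13:56:01 +0000] "POST /y HTTP/1.1" 200 51',
]
freq = reducer(p for line in logs for p in mapper(line))
# freq == {"200": 2, "404": 1}
```

On a cluster, each mapper would process one split of the log files on the distributed file system, and the framework's shuffle would route all pairs for a given status code to the same reducer.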
... To get an expected performance gain by running a job on Hadoop, map and reduce phases of a job need to be defined very carefully. For more details about the framework, its open-source implementation can be found in (White, 2015). ...
Preprint
In this paper, we propose a distributed feature extraction tool for high spatial resolution remote sensing images. The tool is based on the Apache Hadoop framework and the Hadoop Image Processing Interface. Two corner detection algorithms (Harris and Shi-Tomasi) and five feature descriptors (SIFT, SURF, FAST, BRIEF, and ORB) are considered. The robustness of the tool in the task of feature extraction from Landsat-8 imagery is evaluated in terms of horizontal scalability.
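The Harris detector named above can be sketched on a single machine with NumPy only; the tool described in the abstract distributes such detectors over Hadoop, whereas this is just the core response computation on a synthetic test image (the window size and the usual sensitivity constant k = 0.04 are conventional choices, not taken from the paper):

```python
import numpy as np

# Minimal single-machine sketch of the Harris corner response:
# build the structure tensor from image gradients, sum it over a
# small window, then score R = det(M) - k * trace(M)^2.

def box_filter(img, r=1):
    # Sum over a (2r+1) x (2r+1) window via wrapped shifts (fine for
    # features away from the image border).
    out = np.zeros_like(img)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out += np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    return out

def harris_response(img, k=0.04):
    iy, ix = np.gradient(img.astype(float))
    sxx = box_filter(ix * ix)
    syy = box_filter(iy * iy)
    sxy = box_filter(ix * iy)
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace ** 2

# A white square on a black background: the response at a corner of the
# square exceeds the response on an edge or in a flat region.
img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0
R = harris_response(img)
```

Shi-Tomasi differs only in the score: it uses the smaller eigenvalue of the same windowed structure tensor instead of the det/trace combination.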
... In this paper we focus on log collection and processing for large IT infrastructures. The distinguishing features of this use case, compared to generic data collection systems (Hadoop [2], HBase [4], Elasticsearch [5], etc. [6]), are twofold. First, logs are produced by a large number of servers and routers and the resulting amount of data is huge. ...
Preprint
Nowadays, most systems and applications produce log records that are useful for security and monitoring purposes such as debugging programming errors, checking system status, and detecting configuration problems or even attacks. To this end, a log repository becomes necessary whereby logs can be accessed and visualized in a timely manner. This paper presents Loginson, a high-performance log centralization system for large-scale log collection and processing in large IT infrastructures. Besides log collection, Loginson provides high-level analytics through a visual interface for the purpose of troubleshooting critical incidents. We note that Loginson outperforms all of the other log centralization solutions by taking full advantage of the vertical scalability, and therefore decreasing Capital Expenditure (CAPEX) and Operating Expense (OPEX) costs for deployment scenarios with a huge volume of log data.
... A resource management component performs profiling and monitoring of devices. Hyrax is another framework that provides an approach towards cloud services on Android phones, in which various benchmark algorithms for distributed sorting and searching are implemented using the Hadoop framework (White 2012; Marinelli 2009). ...
Preprint
Ad hoc networks enable network creation on the fly without the support of any predefined infrastructure. The spontaneous formation of networks in an anytime, anywhere fashion enables the development of various novel applications based on ad hoc networks. At the same time, however, ad hoc networks present several new challenges. Different research proposals have come forward to resolve these challenges. This chapter provides a survey of current issues, solutions, and research trends in wireless ad hoc networks. Even though various surveys are already available on the topic, rapid developments in recent years call for an updated account. The chapter has been organized as follows. In the first part of the chapter, various ad hoc network issues arising at different layers of the TCP/IP protocol stack are presented. An overview of research proposals to address each of these issues is also provided. The second part of the chapter investigates various emerging models of ad hoc networks, discusses their distinctive properties, and highlights various research issues arising due to these properties. We specifically provide discussion on ad hoc grids, ad hoc clouds, wireless mesh networks, and cognitive radio ad hoc networks. The chapter ends by presenting a summary of the current research on ad hoc networks, ignored research areas, and directions for further research.
... We now describe the traffic patterns used in our evaluation, the primary one being the all-to-all traffic pattern. All-to-all communications are extremely relevant as they are intrinsic to MapReduce, the preferred paradigm for data-oriented application development; see, for example, [8,29,40]. In addition, all-to-all can be considered a worst-case traffic pattern for two reasons: (a) the lack of spatial locality; and (b) the high levels of contention for the use of resources. ...
Preprint
The first dual-port server-centric datacenter network, FiConn, was introduced in 2009 and there are several others now in existence; however, the pool of topologies to choose from remains small. We propose a new generic construction, the stellar transformation, that dramatically increases the size of this pool by facilitating the transformation of well-studied topologies from interconnection networks, along with their networking properties and routing algorithms, into viable dual-port server-centric datacenter network topologies. We demonstrate that under our transformation, numerous interconnection networks yield datacenter network topologies with potentially good, and easily computable, baseline properties. We instantiate our construction so as to apply it to generalized hypercubes and obtain the datacenter networks GQ*. Our construction automatically yields routing algorithms for GQ* and we empirically compare GQ* (and its routing algorithms) with the established datacenter networks FiConn and DPillar (and their routing algorithms); this comparison is with respect to network throughput, latency, load balancing, fault-tolerance, and cost to build, and is with regard to all-to-all, many all-to-all, butterfly, and random traffic patterns. We find that GQ* outperforms both FiConn and DPillar (sometimes significantly so) and that there is substantial scope for our stellar transformation to yield new dual-port server-centric datacenter networks that are a considerable improvement on existing ones.
... A job is considered finished once all of its corresponding tasks have finished. We consider the special case of a blocking system whereby jobs cannot be forked before all of the tasks of the previous job have left the system (this mode is in particular characteristic of Hadoop, through a particular coordination service [32]). ...
Preprint
Task replication has recently been advocated as a practical solution to reduce latencies in parallel systems. In addition to several convincing empirical studies, some others provide analytical results, yet under some strong assumptions such as Poisson arrivals, exponential service times, or independent service times of the replicas themselves, which may lend themselves to some contrasting and perhaps contriving behavior. For instance, under the second assumption, an overloaded system can be stabilized by a replication factor, but can be sent back in overload through further replication. In turn, under the third assumption, strictly larger stability regions of replication systems do not necessarily imply smaller delays. Motivated by the need to dispense with such common and restricting assumptions, which may additionally cause unexpected behavior, we develop a unified and general theoretical framework to compute tight bounds on the distribution of response times in general replication systems. These results immediately lend themselves to the optimal number of replicas minimizing response time quantiles, depending on the parameters of the system (e.g., the degree of correlation amongst replicas). As a concrete application of our framework, we design a novel replication policy which can improve the stability region of classical fork-join queueing systems by $\mathcal{O}(\ln K)$ in the number of servers $K$.
... Further investigation in this direction is required, but beyond the scope of this paper. Lastly, the hierarchical algorithm has a map-reduce flavor that will lend itself well to a map reduce framework such as Apache Hadoop [34] or Apache Spark [36]. ...
Preprint
In this paper, we show that the SVD of a matrix can be constructed efficiently in a hierarchical approach. Our algorithm is proven to recover the singular values and left singular vectors if the rank of the input matrix A is known. Further, the hierarchical algorithm can be used to recover the d largest singular values and left singular vectors with bounded error. We also show that the proposed method is stable with respect to roundoff errors or corruption of the original matrix entries. Numerical experiments validate the proposed algorithms and parallel cost analysis.
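The map-reduce flavor of the hierarchical approach mentioned above can be illustrated generically: factor each column block independently (the "map"), then concatenate the weighted left factors and factor once more (the "reduce"). This is a sketch of the general idea, not the paper's exact algorithm; it relies on the identity $A A^T = \sum_i (U_i \Sigma_i)(U_i \Sigma_i)^T$, so the merge step preserves the singular values and left singular vectors of $A$:

```python
import numpy as np

# Generic two-level hierarchical SVD sketch (not the paper's exact
# algorithm): column blocks are factored independently (map), then the
# weighted left factors U_i * s_i are concatenated and factored once
# more (reduce). Since A A^T equals the sum of (U_i S_i)(U_i S_i)^T,
# the merged factorization shares A's singular values and left vectors.

def hierarchical_svd(blocks):
    partials = []
    for A_i in blocks:                       # map: local SVD per block
        U_i, s_i, _ = np.linalg.svd(A_i, full_matrices=False)
        partials.append(U_i * s_i)           # keep U_i S_i only
    B = np.hstack(partials)                  # reduce: merge and re-factor
    U, s, _ = np.linalg.svd(B, full_matrices=False)
    return U, s

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 12))
U, s = hierarchical_svd([A[:, :6], A[:, 6:]])
s_direct = np.linalg.svd(A, compute_uv=False)
```

Each "map" step only ever sees one block, which is what makes the scheme amenable to frameworks like Hadoop or Spark; deeper hierarchies just repeat the merge step over intermediate results.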
... Should the entering data not abide by the schema contract, it is either deleted or flagged for further inspection. Similar applications also abound for Google Protobuf and JSON Schema [12]. ...
Article
Ensuring data quality and consistency across data intake from several sources presents a tremendous challenge for companies running large-scale data platforms. These challenges are exacerbated by the numerous data forms (structured, semi-structured, unstructured) that come from many sources, including relational databases, IoT devices, and APIs. Data variances, incompatible schemas, and quality issues cause poor analytics and downstream decision-making. With an eye toward how schema validation and transformation techniques might help address problems with data quality and consistency across the pipeline, this paper looks at the primary challenges related to receiving multi-format data. We analyze schema validation systems, including Apache Avro, JSON Schema, and Protobuf, with real-time transformation techniques applied using Apache Kafka, Apache Spark, and AWS Glue. The limits of these techniques and future directions, such as AI-driven data validation and self-healing data pipelines, are also discussed at the end of this work.
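The validate-then-drop-or-flag intake pattern referenced in the snippet above can be sketched with the standard library alone. Real pipelines would use Avro, JSON Schema, or Protobuf for this; the schema, field names, and records below are invented purely for illustration:

```python
# Stdlib-only sketch of the "validate, then delete or flag" intake
# pattern: records failing the schema contract are quarantined for
# inspection instead of entering the clean pipeline. The schema and
# records are hypothetical.

SCHEMA = {"device_id": str, "temperature": float, "ts": int}

def validate(record, schema=SCHEMA):
    """Return a list of violations; an empty list means the record conforms."""
    errors = [f"missing field: {f}" for f in schema if f not in record]
    errors += [
        f"bad type for {f}: expected {t.__name__}"
        for f, t in schema.items()
        if f in record and not isinstance(record[f], t)
    ]
    return errors

def route(records):
    clean, quarantine = [], []
    for rec in records:
        (clean if not validate(rec) else quarantine).append(rec)
    return clean, quarantine

good = {"device_id": "a1", "temperature": 21.5, "ts": 1700000000}
bad = {"device_id": "a2", "temperature": "21.5"}   # wrong type, missing ts
clean, quarantine = route([good, bad])
# clean == [good], quarantine == [bad]
```

Schema systems like Avro add what this sketch lacks: a serialized, versioned contract shared between producer and consumer, so validation happens at (de)serialization time rather than as an ad hoc check.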
... Hadoop and Spark. Hadoop is primarily used for storing and processing large volumes of unstructured and semi-structured data, while Spark is a fast, in-memory data processing engine that is used for realtime analytics, machine learning, and big data processing [31,32]. ...
Preprint
Full-text available
Advancements in artificial intelligence, machine learning, and deep learning have catalyzed the transformation of big data analytics and management into pivotal domains for research and application. This work explores the theoretical foundations, methodological advancements, and practical implementations of these technologies, emphasizing their role in uncovering actionable insights from massive, high-dimensional datasets. The study presents a systematic overview of data preprocessing techniques, including data cleaning, normalization, integration, and dimensionality reduction, to prepare raw data for analysis. Core analytics methodologies such as classification, clustering, regression, and anomaly detection are examined, with a focus on algorithmic innovation and scalability. Furthermore, the text delves into state-of-the-art frameworks for data mining and predictive modeling, highlighting the role of neural networks, support vector machines, and ensemble methods in tackling complex analytical challenges. Special emphasis is placed on the convergence of big data with distributed computing paradigms, including cloud and edge computing, to address challenges in storage, computation, and real-time analytics. The integration of ethical considerations, including data privacy and compliance with global standards, ensures a holistic perspective on data management. Practical applications across healthcare, finance, marketing, and policy-making illustrate the real-world impact of these technologies. Through comprehensive case studies and Python-based implementations, this work equips researchers, practitioners, and data enthusiasts with the tools to navigate the complexities of modern data analytics. It bridges the gap between theory and practice, fostering the development of innovative solutions for managing and leveraging data in the era of artificial intelligence.
... Big Data Analytics is a challenge that Machine Learning has to adapt to, as Machine Learning algorithms for Big Data analytics take a long time to run. To attack these challenges, several solutions and platforms have been developed, such as the Hadoop framework [White, 2015] and the MapReduce model, so it is possible to develop algorithms for Machine Learning in a distributed context [S. et al., 2014], [S. ...
Article
In recent years, Big Data has emerged as a crucial source for data mining, encompassing a vast and complex collection of structured and unstructured data. Machine learning has become widely adopted for analyzing this data and deriving structured insights, particularly for Big Data Mining classification. To fully utilize this valuable resource, new tools and learning methods are needed to address scalability challenges, limited computation time, and storage capacity. Big Data processing and management require data-driven algorithms and statistical models, which help analyze datasets, identify patterns, and make predictions. However, class imbalance is a common challenge in Big Data mining. This paper introduces a new method called "DK-MS" to address imbalanced Big Data classification problems. DK-MS, based on Double K-Means and SMOTE, aims to reduce the volume of big datasets while preserving essential characteristics and ensuring information reliability. By employing classifiers like Logistic Regression, K-NN, Naive Bayes, and Random Forests, the DK-MS method achieves higher accuracy rates and AUC measures compared to cases without data balancing strategies. The DK-MS method achieved high accuracy rates of 91.30%, 99.93%, and 99.93%, demonstrating its significant contribution to effectively addressing imbalanced Big Data classification problems.
... Hadoop is a Java framework that allows us first to store Big Data in a distributed environment, so that we can process it in parallel. There are basically two components in Hadoop: the first is HDFS (Hadoop Distributed File System) for storage, which allows data of various formats to be stored across a cluster, and the second is MapReduce for processing [2][3][4]. ...
Article
Full-text available
The Naive Bayes classifier is a well-known machine learning algorithm that has shown its virtues in many fields. In this work, big data analysis platforms, namely Hadoop distributed computing and MapReduce programming, are used with Naive Bayes and Gaussian Naive Bayes for classification. Naive Bayes is mainly popular for classifying discrete data sets, while Gaussian Naive Bayes is used to classify data with continuous attributes. Experimental results show that a hybrid Naive Bayes and Gaussian Naive Bayes MapReduce model achieves better classification accuracy on the adult data set, which has many continuous attributes.
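Gaussian Naive Bayes training decomposes naturally over MapReduce, which is what makes pairings like the one above work. The sketch below illustrates the idea on one machine and is not the paper's code: each "mapper" emits per-class sufficient statistics (count, sum, sum of squares) for its data split, and the "reducer" adds them and recovers class means and variances; the tiny dataset is invented:

```python
import numpy as np

# How Gaussian Naive Bayes training decomposes over MapReduce (an
# illustration of the idea, not the paper's implementation): sufficient
# statistics are additive, so mappers can compute them on disjoint
# splits and a reducer merges them exactly.

def mapper(X, y):
    stats = {}
    for c in np.unique(y):
        Xc = X[y == c]
        stats[c] = (len(Xc), Xc.sum(axis=0), (Xc ** 2).sum(axis=0))
    return stats

def reducer(partials):
    merged = {}
    for stats in partials:
        for c, (n, s, sq) in stats.items():
            n0, s0, sq0 = merged.get(c, (0, 0.0, 0.0))
            merged[c] = (n0 + n, s0 + s, sq0 + sq)
    model = {}
    for c, (n, s, sq) in merged.items():
        mean = s / n
        var = sq / n - mean ** 2       # E[x^2] - E[x]^2 per feature
        model[c] = (mean, var)
    return model

X = np.array([[1.0], [3.0], [10.0], [12.0]])
y = np.array([0, 0, 1, 1])
# Two splits, as if processed by two mappers:
model = reducer([mapper(X[:2], y[:2]), mapper(X[2:], y[2:])])
# model[0]: mean 2.0, variance 1.0; model[1]: mean 11.0, variance 1.0
```

Prediction then only needs the per-class means, variances, and priors, so the trained model is small enough to broadcast to every node.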
... Hadoop and Spark are other frameworks utilized in big data processing. Apache Hadoop is an open-source framework designed for distributed storage and processing of large datasets [35]. Hadoop supports batch processing through its MapReduce programming model. ...
... The development of quantum error correction remains a critical focus, addressing the challenge of maintaining qubit coherence and accuracy in practical quantum computers. (White, 2015) have revolutionized large-scale data processing. Platforms like Tableau and Microsoft Power BI have improved data visualization and interactive analytics, making complex data more accessible and understandable for decision-makers. ...
Article
This article explores the transformative potential of quantum computing in the field of business analytics. It begins with an introduction to quantum computing, explaining its fundamental principles and recent advancements. The study highlights the limitations of current business analytics methods and demonstrates how quantum computing could address these limitations by offering enhanced data processing capabilities, advanced algorithms, and solutions to complex optimization problems. A comprehensive literature review is conducted to provide context and identify gaps in the existing research. The article then outlines a research design that incorporates both real-world and simulated data, using online datasets and quantum computing frameworks for analysis. The findings reveal significant opportunities for quantum computing to revolutionize business analytics, including improved efficiency, accuracy, and the ability to solve previously intractable problems. However, the article also addresses key challenges such as technical limitations, cost, accessibility, and integration issues. The discussion highlights emerging trends and provides strategic recommendations for businesses considering the adoption of quantum computing. The article concludes with a summary of the implications of integrating quantum computing into business analytics and reflects on future potential and challenges.
Article
Full-text available
In recent years, we have seen a colossal increase in how much data we can collect. There is data coming from sensors, devices, and other sources in many different formats. But now we have so much data that it is very hard to manage all of it. For example, Google indexed about a million web pages in 1998; by 2000, that number had already grown to 1 billion, and in 2008 it reached 1 trillion. This growth accelerates even more with social media like Facebook and Twitter, where people can create content easily, making the data even bigger. Now, with the Internet of Things (IoT), everything from coffee machines to cars is connected to the Internet, and all of these things are generating data all the time. Take, for example, your morning commute: to figure out the best way to get to work, a system needs to check traffic, weather, road works, and even your calendar, and all of this data has to be processed fast to get you to work on time. But even with all this data, knowing what to do with it is a big problem. We need better systems and algorithms to go through the data and find the useful parts. Also, with so much data around, we must make sure it is safe and not being misused. In this paper, we look at where Big Data mining stands right now and what might happen in the future. We examine the issues we face, such as system capabilities, algorithm design, and keeping data secure.
Thesis
Full-text available
In fulfillment of the requirements for obtaining a Master of Science degree in Computer Science, Faculty of Science, Port Said University.
Conference Paper
The use of a computational platform can reduce the costs and time required for developing new models of medical devices and drugs. The EU project STRATIFYHF aims to develop and clinically validate a truly innovative AI-based Decision Support System for predicting the risk of heart failure, facilitating its early diagnosis and progression prediction, that will radically change how heart failure is managed in both primary and secondary care. It is developed using state-of-the-art finite element modelling for macro-scale simulation of fluid-structure interaction, micro-scale modelling at the molecular level for drug interaction with cardiac cells, and artificial intelligence. Computational platforms such as the STRATIFYHF platform are novel medical tools for risk prediction of cardiac disease in a specific patient.
Chapter
Big data and data mining have revolutionized the marketing landscape, transforming how businesses understand and engage with their customers. In today’s digital age, every interaction with technology generates data, creating vast amounts of information that require sophisticated techniques to analyze and derive meaningful insights. Data mining is the process of extracting valuable patterns from large datasets, enabling companies to make informed decisions. This chapter outlines the key steps in data mining, including data collection, cleaning, integration, storage, analysis, pattern recognition, interpretation, and decision-making. These steps harness the power of big data to provide actionable insights, making businesses more competitive and innovative.
Chapter
We all know the formula GIGO, “garbage in, garbage out,” meaning if your data is not good from the beginning, what you get out at the other end in terms of results is going to be wishful thinking. Data quality has become a significant concern for businesses due to the rapid growth of digital innovation and larger databases. Ensuring high-quality data is crucial for marketers to make informed decisions and effectively target potential customers. Poor data quality can lead to inconsistencies, increased costs, and poor decision-making, which can be detrimental to a business. For instance, a sports shoes company expanding its online presence must rely on accurate and up-to-date customer data to engage effectively with its target audience. To address data quality issues, businesses and researchers have developed various techniques for professional and safe data handling. Data warehouses, such as those provided by Amazon Web Services (AWS), Google BigQuery, and Microsoft Azure Synapse, are essential for storing vast amounts of data efficiently. These platforms help businesses manage their data and ensure its quality for current and future use. Data quality is defined by several key dimensions: accuracy, reliability, timeliness, currency, relevance, completeness, and consistency. For example, in a sports shoes company, high-quality data would include accurate customer demographics and purchase histories, allowing the company to create targeted marketing campaigns and improve customer satisfaction. Ensuring data quality involves continuous data collection and validation to maintain its reliability and relevance.
Article
Full-text available
In the contemporary landscape of big data, efficiently processing and analyzing vast volumes of information is crucial for organizations seeking actionable insights. Apache Spark has emerged as a leading distributed computing framework that addresses these challenges with its in-memory processing capabilities and scalability. This article explores the implementation of Spark DataFrames as a pivotal tool for advanced data analysis. We delve into how DataFrames provide a higher-level abstraction over traditional RDDs (Resilient Distributed Datasets), enabling more intuitive and efficient data manipulation through a schema-based approach. By integrating SQL-like operations and supporting a wide range of data sources, Spark DataFrames simplify complex analytical tasks. The discussion includes methodologies for setting up the Spark environment, loading diverse datasets into DataFrames, and performing exploratory data analysis and transformations. Advanced techniques such as user-defined functions (UDFs), machine learning integration with MLlib, and real-time analytics using Structured Streaming are examined. Performance optimization strategies, including caching, broadcast variables, and utilizing efficient file formats like Parquet, are highlighted to demonstrate how to enhance processing speed and resource utilization. Through a practical case study, we illustrate the application of these concepts in a real-world scenario, showcasing the effectiveness of Spark DataFrames in handling large-scale data analytics. This comprehensive exploration underscores the significance of adopting Spark DataFrames for organizations aiming to leverage big data effectively, ultimately facilitating faster, more insightful decision-making processes. Keywords: Apache Spark, Spark DataFrames, Big Data Analytics, In-Memory Computation, Advanced Data Analysis.
Article
Full-text available
This article addresses the importance of HaaS (Hadoop-as-a-Service) in cloud technologies, with specific reference to its usefulness in big data mining for environmental computing applications. The term environmental computing refers to computational analysis within environmental science and management, encompassing a myriad of techniques, especially in data mining and machine learning. As is well-known, the classical MapReduce has been adapted within many applications for big data storage and information retrieval. Hadoop based tools such as Hive and Mahout are broadly accessible over the cloud and can be helpful in data warehousing and data mining over big data in various domains. In this article, we explore HaaS technologies, mainly based on Apache's Hive and Mahout for applications in environmental computing, considering publicly available data on the Web. We dwell upon interesting applications such as automated text classification for energy management, recommender systems for ecofriendly products, and decision support in urban planning. We briefly explain the classical paradigms of MapReduce, Hadoop and Hive, further delve into data mining and machine learning over the MapReduce framework, and explore techniques such as Naïve Bayes and Random Forests using Apache Mahout with respect to the targeted applications. Hence, the paradigm of Hadoop-as-a-Service, popularly referred to as HaaS, is emphasized here as per its benefits in a domain-specific context. The studies in environmental computing, as presented in this article, can be useful in other domains as well, considering similar applications. This article can thus be interesting to professionals in web technologies, cloud computing, environmental management, as well as AI and data science in general.
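The classical MapReduce paradigm referred to above can be sketched in pure Python. This is a single-process toy word count, not the distributed Hadoop implementation, but the three phases (map, shuffle, reduce) mirror what the framework runs across a cluster:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Mapper: emit one (word, 1) pair per token.
    return [(word.lower(), 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group all values emitted under the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: combine each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big analytics", "data mining"]
mapped = chain.from_iterable(map_phase(d) for d in docs)
counts = reduce_phase(shuffle(mapped))
# counts == {"big": 2, "data": 2, "analytics": 1, "mining": 1}
```

In Hadoop, the mappers and reducers run in parallel on different nodes and the shuffle moves data between them over the network; tools like Hive generate such jobs from SQL-like queries instead of requiring them to be written by hand.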
Conference Paper
Hadoop Distributed File System (HDFS) is known for its specialized strategies and policies tailored to enhance replica placement. This capability is critical for ensuring efficient and reliable access to data replicas, particularly as HDFS operates best when data are evenly distributed within the cluster. In this study, we conduct a thorough analysis of the replica balancing process in HDFS, focusing on two critical performance metrics: stability and efficiency. We evaluate these balancing aspects by contrasting them with conventional HDFS solutions and employing a novel dynamic architecture for data replica balancing. On top of that, we delve into the optimizations in data locality brought about by effective replica balancing and their benefits for data-intensive applications.
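The core idea behind replica balancing can be sketched with a toy greedy heuristic in Python. This is an illustrative simplification, not HDFS's actual balancer algorithm (which reasons about utilization percentages, racks, and bandwidth limits); the datanode names and block counts are hypothetical:

```python
def rebalance(loads, threshold=1):
    """Move one replica at a time from the most- to the least-loaded
    datanode until the spread falls within `threshold` blocks."""
    loads = dict(loads)
    moves = []
    while True:
        hi = max(loads, key=loads.get)
        lo = min(loads, key=loads.get)
        if loads[hi] - loads[lo] <= threshold:
            break  # cluster is balanced within the threshold
        loads[hi] -= 1
        loads[lo] += 1
        moves.append((hi, lo))  # record the replica migration
    return loads, moves

balanced, moves = rebalance({"dn1": 9, "dn2": 3, "dn3": 3})
# balanced == {"dn1": 5, "dn2": 5, "dn3": 5} after 4 migrations
```

Even in this toy form, the trade-off the article studies is visible: each move improves distribution (efficiency) but consumes network bandwidth and can churn if the threshold is too tight (stability).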
Article
In silico clinical trials are the future of medicine, and virtual testing and simulation are the future of medical engineering. The use of a computational platform can reduce the cost and time required to develop new models of medical devices and drugs. Computational platforms in projects such as SILICOFCM were developed using state-of-the-art finite element modelling for macro-scale simulation of fluid-structure interaction, combined with micro-scale modelling at the molecular level of drug interaction with cardiac cells. The SILICOFCM platform is used for risk prediction and optimal drug therapy of familial cardiomyopathy in a specific patient. The STRATIFYHF project aims to develop and clinically validate a truly innovative AI-based decision support system for predicting the risk of heart failure, facilitating its early diagnosis and progression prediction, which will radically change how heart failure is managed in both primary and secondary care. This rapid expansion in computer modelling, imaging modalities, and data collection leads to the generation of so-called "big data", which is time-consuming for medical experts to analyze. To obtain 3D image reconstructions, the U-net architecture was used to determine geometric parameters of the left ventricle, extracted from the echocardiographic apical and M-mode views. A micro-mechanical cellular model that includes three kinetic processes of sarcomeric protein interactions was developed. It allows simulation of drugs, which are divided into three major groups defined by the principal action of each drug. The presented results were obtained with the parametric model of the left ventricle, where pressure-volume (PV) diagrams depend on the change in Ca2+, which directly affects the ejection fraction.
The presented approach, with variation of the left ventricle (LV) geometry and simulations that include the influence of different parameters on the PV diagrams, is directly interlinked with drug effects on heart function. It covers drugs such as Entresto and Digoxin, which directly affect the cardiac PV diagrams and ejection fraction. Computational platforms such as SILICOFCM and STRATIFYHF are novel tools for risk prediction of cardiac disease in a specific patient and will certainly open a new avenue for in silico clinical trials in the future.
Chapter
Full-text available
This section explores the Internet of Things (IoT) as one of the major trends in computer science. Starting with a basic introduction to IoT and its architecture, the section discusses communication protocols, interoperability standards, and the associated security and privacy challenges. Various IoT applications that are changing how we live and work, including smart homes, healthcare systems, transportation, and agriculture, are also reviewed. Case studies of recent research highlight innovations in smart cities, precision agriculture, and IoT-based healthcare systems. The chapter closes with a reflection on the transformative potential of IoT and the challenges that must be overcome, emphasizing the importance of thoughtful technology development that takes both technical and ethical aspects into account.