Conference Paper

High productivity data processing analytics methods with applications

Abstract

The term 'big data analytics' emerged in order to engage with the ever increasing amount of scientific and engineering data using general analytics techniques that support the often more domain-specific data analysis process. It is recognized that the big data challenge can only be adequately addressed when knowledge from different fields such as data mining, machine learning algorithms, parallel processing, and data management practices is effectively combined. This paper therefore describes some of the 'smart data analytics methods' that enable high productivity data processing of large quantities of scientific data in order to enhance data analysis efficiency. The paper aims to provide new insights into how these fields can be successfully combined. Contributions of this paper include the concretization of the cross-industry standard process for data mining (CRISP-DM) process model in scientific environments using concrete machine learning algorithms (e.g. support vector machines that enable data classification) or data mining mechanisms (e.g. outlier detection in measurements). Serial and parallel approaches to specific data analysis challenges are discussed in the context of concrete earth science application datasets. Solutions also include various data visualizations that enable better insight into the corresponding data analytics and analysis process.
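
As a concrete illustration of the kind of workflow the abstract describes (outlier handling during data preparation followed by SVM-based classification), the following Python sketch uses synthetic data and arbitrary thresholds; it is not the paper's implementation, and the earth science datasets are replaced by random measurements.

```python
# Illustrative sketch only: a CRISP-DM-style "data preparation + modelling" step
# using synthetic data in place of the paper's earth science datasets.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic "measurements": two classes in a 4-dimensional feature space.
X = np.vstack([rng.normal(0.0, 1.0, (500, 4)),
               rng.normal(2.0, 1.0, (500, 4))])
y = np.repeat([0, 1], 500)

# Data preparation: flag gross outliers with a simple z-score rule
# (a stand-in for the outlier detection on measurements mentioned above).
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
mask = (z < 3.0).all(axis=1)
X_clean, y_clean = X[mask], y[mask]

# Modelling: an RBF-kernel support vector classifier.
X_tr, X_te, y_tr, y_te = train_test_split(X_clean, y_clean, test_size=0.3, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)
print(f"removed {np.sum(~mask)} outliers, test accuracy = {clf.score(X_te, y_te):.3f}")
```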

... The review shows that the focus areas of BDA implementation, from the most to the least popular topic, are as follows: BDA challenges, BDA process, data mining techniques, BDA trends and business value from BDA (Table 4): BDA Challenges [7], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22]; BDA Process [7], [9], [16], [17], [18], [19], [23], [24]; Data Mining Techniques [13], [18], [20], [24], [25], [26], [27]; BDA Trends [7], [8], [14], [15], [28]; Business Value from BDA [22], [29] ...
Article
Full-text available
A growing number of big data technologies and analytic solutions have been developed to support the requirements of big data implementation. The capability of analyzing big data becomes a critical issue in big data implementation because traditional analytics tools are no longer suitable for processing and analyzing the massive amount and different types of data. In recent years, studies of technological issues and challenges in big data adoption have been actively conducted globally. However, there is still a lack of studies on how big data implementation can derive and discover values for better decision making. The intent of this review is to investigate the capability components for Big Data Analytics (BDA) implementation towards value discovery. Based on this investigation, it was found that the capability components that may impact value discovery are formulating a big data framework that includes the enabler technology and processing, and using sufficient analytic techniques for analysing big data.
... LinkedIn, on the other hand, has improved the job industry and helped employers and job seekers to build great relationships and get good jobs in today's tough economy. Data mining and data analysis/prediction have become very popular in education (Delavari 2004; Delavari et al. 2005; Riedel et al. 2014; Bhatia and Prasad 2015; Baker 2010; Shahiri et al. 2015) in today's research community. A detailed survey by Jindal and Borah (2013) on educational data mining and research trends provides motivation and information on various tools and algorithms. ...
Article
Full-text available
The research progress presented in this paper falls under the area of data science. The authors propose an enhanced machine learning (supervised learning) framework for predicting student outcomes through stochastic probability-based mathematical constructs/models and an algorithm [Good Fit Student (GFS)], along with enhanced quantification of target variables and algorithmic metrics. Academia in today's modern world faces the problems of dropouts, low retention, poor student performance, lack of motivation, and unnecessary changes of study majors and re-admissions. The authors treat this challenge as a research problem and attempt to solve it by utilizing social-networking-based personality traits and relevant data and features to improve the predictive modeling approach. The authors recognize that admission choices are often governed by family trends, affordability, basic motivation, market trends, and natural instincts, whereas natural gifts and talents are minimally used to select academic directions. Based on a literature review, the authors identify this as a research gap and address it with a unique blend of algorithms/methods, improved modeling of performance metrics built upon cross-validation to improve fitness, and an enhanced process of feature engineering and tuning for reduced errors and optimum fitness. The authors present the latest progress of their research in this paper. The included results show the progress of the work and ongoing improvements. The authors use machine learning techniques, Microsoft SQL Server, Excel data mining, R and Python to develop and test their model. The authors provide related work and conclude with final remarks and future work.
... Owing to the practical usefulness indicated in the literature [16,21], CRISP-DM can serve as a comprehensive set of guidelines and procedures for the process of analysing employee attitudes and opinions. Applying the CRISP-DM framework allows the opinion survey to be planned, monitored and controlled. ...
Article
Full-text available
The aim of this study is to present the Cross Industry Standard Process for Data Mining (CRISP-DM) as a model for collecting and analyzing data from employee attitude and opinion research. By structuring and organizing the research process, the CRISP-DM model can improve research management and enable more efficient knowledge discovery from the collected data.
... Our motivation is driven by the needs of two concrete data-intensive science applications we introduced in earlier work [25] that require different parallel and scalable machine learning algorithms. Both raise a number of joint requirements that we investigate in this paper and that motivate a broader approach we refer to as an 'open standards-based smart data analytics framework' for parallel and scalable machine learning tasks. ...
Conference Paper
Full-text available
Many scientific datasets (e.g. earth sciences, medical sciences, etc.) increase with respect to their volume or in terms of their dimensions due to the ever increasing quality of measurement devices. This contribution specifically focuses on how these datasets can take advantage of new ‘big data’ technologies and frameworks that are often based on parallelization methods. Lessons learned with medical and earth science data applications that require parallel clustering and classification techniques such as support vector machines (SVMs) and density-based spatial clustering of applications with noise (DBSCAN) are a substantial part of the contribution. In addition, selected experiences with related ‘big data’ approaches and concrete mining techniques (e.g. dimensionality reduction, feature selection, and extraction methods) are addressed too. In order to overcome the identified challenges, we outline an architecture framework design that we implement with openly available tools in order to enable scalable and parallel machine learning applications in distributed systems.
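
A minimal serial sketch of the clustering side of such a pipeline is given below, combining PCA-based dimensionality reduction with DBSCAN on synthetic data; the parameters are illustrative, and the cited work targets parallel implementations on HPC systems.

```python
# Serial sketch of the clustering side of the pipeline: dimensionality
# reduction followed by density-based clustering. Data and parameters are
# illustrative only.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

X, _ = make_blobs(n_samples=2000, centers=4, n_features=10, random_state=1)

# Feature extraction: project the 10-dimensional points onto 3 principal components.
X_red = PCA(n_components=3).fit_transform(X)

# Density-based clustering with noise handling (label -1 marks noise points).
labels = DBSCAN(eps=1.5, min_samples=10).fit_predict(X_red)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"{n_clusters} clusters, {np.sum(labels == -1)} noise points")
```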
Conference Paper
Full-text available
The modern world is not only about software and technology; as the world advances it is becoming more data oriented and mathematical in nature. The volume of information that is brought in and processed is large and complex. Handling data size does not only involve using every single data point that is reported; this information needs to be reduced and understood according to the application at hand. Data size is one issue; the other is the knowledge or information that needs to be extracted from it in order to obtain purposeful meaning from the data. In-memory and column-oriented databases have presented viable and efficient solutions to optimize query time and column compression. In addition to storing and retrieving data, the information world has stepped up into big data, with millions of records and terabytes of data as influx every single second, and a corresponding out-flux of responses generated and required. The world is now in need of both systems and software that are efficient in storing huge data and application-layer algorithms that are efficient enough to extract meaning from the layers of topologically dependent data. This paper focuses on analyzing the column-store technique for managing mathematical and scientific big data involved in multiple markets, by using topological data meaning for analyzing and understanding the information from adaptive database systems. For efficient storage in the database, the column-oriented approach to big data analytics and query layers will be analyzed and optimized.
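
To make the column-store argument concrete, the toy sketch below contrasts a row-oriented and a column-oriented in-memory layout for a simple aggregate query; it illustrates the general idea only and is not the adaptive database system discussed in the paper.

```python
# Minimal illustration of why column-oriented storage helps analytical queries:
# an aggregate over a few attributes touches contiguous arrays instead of
# every whole record. Toy in-memory model only.
import array
import random

random.seed(42)
n = 100_000

# Row-oriented layout: one dict per record.
rows = [{"id": i, "price": random.random(), "qty": random.randrange(1, 10)} for i in range(n)]

# Column-oriented layout: one typed array per attribute.
columns = {
    "id":    array.array("i", (r["id"] for r in rows)),
    "price": array.array("d", (r["price"] for r in rows)),
    "qty":   array.array("i", (r["qty"] for r in rows)),
}

# The same aggregate expressed against both layouts.
total_row = sum(r["price"] * r["qty"] for r in rows)                       # scans whole records
total_col = sum(p * q for p, q in zip(columns["price"], columns["qty"]))   # scans two columns
print(round(total_row, 6) == round(total_col, 6))
```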
Article
Full-text available
Support Vector Machines (SVM) are powerful classification and regression tools. They have been widely studied by many scholars and applied in many kinds of practical fields. However, their compute and storage requirements increase rapidly with the number of training vectors, putting many problems of practical interest out of their reach. To apply SVM to large-scale data mining, parallel SVMs are studied and several parallel SVM methods have been proposed. Most current parallel SVM methods are based on the classical MPI model, which is not easy to use in practice, especially for large-scale data-intensive data mining problems. MapReduce is an efficient distributed computing model for processing large-scale data mining problems, and several MapReduce implementations have been developed, such as Hadoop and Twister. In this paper, a parallel SVM based on the iterative MapReduce model Twister is studied and its program flow is developed. The efficiency of the method is illustrated through the analysis of practical problems.
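
The cascade-style partition-and-merge logic that underlies many parallel SVM schemes can be sketched serially as follows; each partition's training step would correspond to a map task and the merge to a reduce step. This is an illustrative sketch under those assumptions, not the Twister-based implementation described in the paper.

```python
# Serial sketch of the partition-and-merge idea behind many parallel SVM schemes:
# train local SVMs on data partitions (the "map" step), collect their support
# vectors, and retrain a global SVM on that reduced set (the "reduce" step).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
partitions = np.array_split(np.arange(len(X)), 4)   # 4 partitions, one per worker

# "Map": train a local SVM per partition and keep only its support vectors.
sv_X, sv_y = [], []
for idx in partitions:
    local = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X[idx], y[idx])
    sv_X.append(X[idx][local.support_])
    sv_y.append(y[idx][local.support_])

# "Reduce": merge the support vectors and train the global model on the reduced set.
X_sv, y_sv = np.vstack(sv_X), np.concatenate(sv_y)
global_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_sv, y_sv)
print(f"trained on {len(X_sv)} of {len(X)} points, "
      f"training accuracy = {global_svm.score(X, y):.3f}")
```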
Article
Full-text available
A wide variety of scientific user communities have worked with data for many years and thus already have a wide variety of data infrastructures in production today. The aim of this paper is therefore not to create one new general data architecture that would fail to be adopted by each and any individual user community. Instead, this contribution aims to design a reference model with abstract entities that is able to federate existing concrete infrastructures under one umbrella. A reference model is an abstract framework for understanding significant entities and the relationships between them and thus helps to understand existing data infrastructures when comparing them in terms of functionality, services, and boundary conditions. An architecture derived from such a reference model can then be used to create a federated architecture that builds on the existing infrastructures and aligns to a major common vision. This common vision is named 'ScienceTube' as part of this contribution and determines the high-level goal that the reference model aims to support. This paper describes how a well-focused use case around data replication and its related activities in the EUDAT project aims to provide a first step towards this vision. Concrete stakeholder requirements arising from scientific end users, such as those of the European Strategy Forum on Research Infrastructure (ESFRI) projects, underpin this contribution with clear evidence that the EUDAT activities are bottom-up, thus providing real solutions to the so often merely described 'high-level big data challenges'. The federated approach followed, which takes advantage of community and data centers (with large computational resources), further describes how data replication services enable data-intensive computing on terabytes or even petabytes of data emerging from ESFRI projects.
Article
Full-text available
Companies have realized they need to hire data scientists, academic institutions are scrambling to put together data science programs, and publications are touting data science as a hot -- even "sexy" -- career choice. However, there is confusion about what exactly data science is, and this confusion could lead to disillusionment as the concept diffuses into meaningless buzz. In this paper we argue that there are good reasons why it has been hard to pin down exactly what data science is. One reason is that data science is intricately intertwined with other important concepts also of growing importance, such as big data and data-driven decision making. Another reason is the natural tendency to associate what a practitioner does with the definition of the practitioner's field; this can result in overlooking the fundamentals of the field. We believe that trying to define the boundaries of Data Science precisely right now is not of the utmost importance. We can debate the boundaries of the field in an academic setting, but in order for data science to serve business effectively, it is important (i) to understand its relationships to other important related concepts, and (ii) to begin to identify the fundamental principles underlying data science. Once we embrace (ii) we can much better understand and explain exactly what data science has to offer. Furthermore, only once we embrace (ii) should we be comfortable calling it data science. In this paper we present a perspective that addresses all these things. We close by offering as examples a partial list of fundamental principles underlying data science.
Article
Full-text available
The EUDAT project is a pan-European data initiative that started in October 2011. The project brings together a unique consortium of 25 partners – including research communities, national data and high performance computing (HPC) centres, technology providers, and funding agencies – from 13 countries. EUDAT aims to build a sustainable cross-disciplinary and cross-national data infrastructure that provides a set of shared services for accessing and preserving research data.
Article
Full-text available
Support vector machines (SVMs) appeared in the early nineties as optimal margin classifiers in the context of Vapnik's statistical learning theory. Since then SVMs have been successfully applied to real-world data analysis problems, often providing improved results compared with other techniques. The SVMs operate within the framework of regularization theory by minimizing an empirical risk in a well-posed and consistent way. A clear advantage of the support vector approach is that sparse solutions to classification and regression problems are usually obtained: only a few samples are involved in the determination of the classification or regression functions. This fact facilitates the application of SVMs to problems that involve a large amount of data, such as text processing and bioinformatics tasks. This paper is intended as an introduction to SVMs and their applications, emphasizing their key features. In addition, some algorithmic extensions and illustrative real-world applications of SVMs are shown.
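
The sparsity property highlighted above can be observed directly: after training, only a fraction of the samples are retained as support vectors. The following short sketch uses arbitrary synthetic data and hyperparameters to illustrate this.

```python
# Small illustration of SVM sparsity: only a subset of the training samples
# become support vectors. Data and hyperparameters are arbitrary.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, class_sep=2.0, random_state=3)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(f"{clf.n_support_.sum()} of {len(X)} training samples are support vectors")
```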
Chapter
Full-text available
Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can identify errors and remove their contaminating effect on the data set and as such purify the data for processing. The original outlier detection methods were arbitrary but now principled and systematic techniques are used, drawn from the full gamut of Computer Science and Statistics. In this paper, we introduce a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.
Conference Paper
Full-text available
FutureGrid provides novel computing capabilities that enable reproducible experiments while simultaneously supporting dynamic provisioning. This paper describes the FutureGrid experiment management framework to create and execute large scale scientific experiments for researchers around the globe. The experiments executed are performed by the various users of FutureGrid ranging from administrators to software developers and end users. The Experiment management framework will consist of software tools that record user and system actions to generate a reproducible set of tasks and resource configurations. Additionally, the experiment management framework can be used to share not only the experiment setup, but also performance information for the specific instantiation of the experiment. This makes it possible to compare a variety of experiment setups and analyze the impact Grid and Cloud software stacks have.
Chapter
Full-text available
After analyzing the SE process models, we have developed a joint model based on two standards to compare, process by process and activity by activity, the modus operandi in SE and DM & KD. This comparison revealed that CRISP-DM does not cover many project management-, organization- and quality-related tasks at all, or at least not thoroughly enough. This is now a must due to the complexity of the projects being developed in DM & KD these days. These projects not only involve examining huge volumes of data but also managing and organizing big interdisciplinary human teams. Consequently, we proposed a DM engineering process model that covers the above points. To do this, we made a distinction between process model, methodology and life cycle. The proposed process model includes all the activities covered by CRISP-DM, but distributed across process groups that conform to engineering standards established by a field with over 40 years' experience, i.e. software engineering. The model is not complete, as the need for the processes, tasks and/or activities set out in IEEE 1074 or ISO 12207 and not covered by CRISP-DM has been stated, but they have yet to be adapted and specified in detail. Additionally, this general outline needs to be further researched. First, the elements that CRISP-DM has been found not to cover at all or only in part would have to be specified and adapted from their SE counterparts. Second, the possible life cycle for DM would have to be examined and specified. Third, the process model specifies what to do but not how to do it. A methodology is what specifies the "how to" part. Therefore, the different methodologies that are being used for each process would need to be examined and adapted to the model. Finally, a methodology is associated with a series of tools and techniques. DM has already developed many such tools (like Clementine or the neural network techniques), but tools that are well-established in SE (e.g. configuration management techniques) are missing. It remains to be seen how they can be adapted to DM and KD processes.
Conference Paper
Full-text available
In the last years there has been a huge growth and consolidation of the Data Mining field. Some efforts seek to establish standards in the area; among these efforts are SEMMA and CRISP-DM. Both grew as industrial standards and define a set of sequential steps intended to guide the implementation of data mining applications. The question arose of whether there are substantial differences between them and the traditional KDD process. This paper aims to establish a parallel between these standards and the KDD process, as well as an understanding of the similarities between them.
Conference Paper
Full-text available
The MapReduce programming model has simplified the implementation of many data-parallel applications. The simplicity of the programming model and the quality of services provided by many implementations of MapReduce attract a lot of enthusiasm among distributed computing communities. From years of experience in applying MapReduce to various scientific applications, we identified a set of extensions to the programming model and improvements to its architecture that will expand the applicability of MapReduce to more classes of applications. In this paper, we present the programming model and the architecture of Twister, an enhanced MapReduce runtime that supports iterative MapReduce computations efficiently. We also show performance comparisons of Twister with other similar runtimes such as Hadoop and DryadLINQ for large-scale data-parallel applications.
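
The pattern of an iterative MapReduce computation can be sketched in plain Python as k-means expressed as repeated map and reduce phases; runtimes such as Twister keep static data and long-lived workers across these iterations. The sketch below shows only the programming pattern, not Twister's API.

```python
# Pure-Python sketch of an *iterative* MapReduce computation: k-means written
# as a repeated map phase (assign each point to its nearest centroid) and
# reduce phase (recompute centroids).
import random
from collections import defaultdict

random.seed(0)
points = [(random.gauss(cx, 0.5), random.gauss(cy, 0.5))
          for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(200)]

def map_phase(points, centroids):
    """Emit (centroid_index, point) pairs."""
    for p in points:
        idx = min(range(len(centroids)),
                  key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2)
        yield idx, p

def reduce_phase(pairs, centroids):
    """Group points by centroid index and average each group."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centroids = []
    for i, c in enumerate(centroids):
        pts = groups[i]
        if pts:
            new_centroids.append((sum(x for x, _ in pts) / len(pts),
                                  sum(y for _, y in pts) / len(pts)))
        else:
            new_centroids.append(c)   # keep the old centroid if nothing was assigned
    return new_centroids

centroids = random.sample(points, 3)
for _ in range(10):                   # the iterative driver loop
    centroids = reduce_phase(map_phase(points, centroids), centroids)
print([(round(x, 2), round(y, 2)) for x, y in centroids])
```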
Article
Full-text available
Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can identify errors and remove their contaminating effect on the data set and as such purify the data for processing. The original outlier detection methods were arbitrary but now principled and systematic techniques are used, drawn from the full gamut of Computer Science and Statistics. In this paper, we introduce a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.
Conference Paper
Full-text available
The problem of distance-based outlier detection is difficult to solve efficiently in very large datasets because of potential quadratic time complexity. We address this problem and develop sequential and distributed algorithms that are significantly more efficient than state-of-the-art methods while still guaranteeing the same outliers. By combining simple but effective indexing and disk block accessing techniques, we have developed a sequential algorithm iOrca that is up to an order-of-magnitude faster than the state-of-the-art. The indexing scheme is based on sorting the data points in order of increasing distance from a fixed reference point and then accessing those points based on this sorted order. To speed up the basic outlier detection technique, we develop two distributed algorithms (DOoR and iDOoR) for modern distributed multi-core clusters of machines, connected on a ring topology. The first algorithm passes data blocks from each machine around the ring, incrementally updating the nearest neighbors of the points passed. By maintaining a cutoff threshold, it is able to prune a large number of points in a distributed fashion. The second distributed algorithm extends this basic idea with the indexing scheme discussed earlier. In our experiments, both distributed algorithms exhibit significant improvements compared to the state-of-the-art distributed method [13].
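
The reference-point indexing idea can be sketched as follows: points are sorted by their distance to a fixed reference, and the triangle inequality |d(p, r) - d(q, r)| <= d(p, q) provides a cheap lower bound that allows a k-NN-based outlier scorer to stop scanning early. This is an illustrative serial version under those assumptions, not the authors' iOrca or DOoR code.

```python
# Sketch of reference-point indexing for distance-based outlier scoring:
# sort points by distance to a fixed reference point and use the triangle
# inequality as a lower bound to prune neighbour candidates.
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(2000, 3))
X[:5] += 8.0                         # plant a few obvious outliers
k = 5

ref = X[0]                           # fixed reference point
d_ref = np.linalg.norm(X - ref, axis=1)
order = np.argsort(d_ref)            # the "index": points sorted by distance to ref
X_sorted, d_sorted = X[order], d_ref[order]

def knn_distance(i):
    """k-th nearest neighbour distance of point i (in sorted order), with pruning."""
    best = np.full(k, np.inf)        # current k smallest distances, kept sorted
    lo, hi = i - 1, i + 1
    while lo >= 0 or hi < len(X_sorted):
        # lower bounds |d(p, ref) - d(q, ref)| on each side of the sorted order
        gap_lo = d_sorted[i] - d_sorted[lo] if lo >= 0 else np.inf
        gap_hi = d_sorted[hi] - d_sorted[i] if hi < len(X_sorted) else np.inf
        if min(gap_lo, gap_hi) >= best[-1]:
            break                    # no remaining candidate can beat the k-th distance
        j = lo if gap_lo <= gap_hi else hi
        d = np.linalg.norm(X_sorted[i] - X_sorted[j])
        if d < best[-1]:
            best[-1] = d
            best.sort()
        if j == lo:
            lo -= 1
        else:
            hi += 1
    return best[-1]

scores = np.array([knn_distance(i) for i in range(len(X_sorted))])
top = order[np.argsort(-scores)[:5]]  # original indices of the 5 strongest outliers
print(sorted(top))
```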
Article
Full-text available
Clouds and MapReduce have shown themselves to be a broadly useful approach to scientific computing, especially for parallel data-intensive applications. However, they have limited applicability to some areas such as data mining, because MapReduce has poor performance on problems with the iterative structure present in the linear algebra that underlies much data analysis. Such problems can be run efficiently on clusters using MPI, leading to a hybrid cloud and cluster environment. This motivates the design and implementation of an open source iterative MapReduce system, Twister. Comparisons of Amazon, Azure, and traditional Linux and Windows environments on common applications have shown encouraging performance and usability in several important non-iterative cases. These are linked to MPI applications for the final stages of the data analysis. Further, we have released the open source Twister iterative MapReduce runtime and benchmarked it against basic MapReduce (Hadoop) and MPI in information retrieval and life sciences applications. The hybrid cloud (MapReduce) and cluster (MPI) approach offers an attractive production environment, while Twister promises a uniform programming environment for many life sciences applications. We used the commercial clouds Amazon and Azure and the NSF resource FutureGrid to perform detailed comparisons and evaluations of different approaches to data-intensive computing. Several applications were developed in MPI, MapReduce and Twister in these different environments.
Article
Full-text available
In this paper, a toolbox LS-SVMlab for Matlab with implementations of a number of LS-SVM related algorithms is presented. The core of the toolbox is a performant LS-SVM training and simulation environment written in C code. The functionality for classification, function approximation and unsupervised learning problems as well as time-series prediction is explained. Extensions of LS-SVMs towards robustness, sparseness and weighted versions, as well as different techniques for tuning of hyper-parameters, are included. An implementation of a Bayesian framework is made, allowing probabilistic interpretations, automatic hyperparameter tuning and input selection. The toolbox also contains algorithms for fixed-size LS-SVMs which are suitable for handling large data sets. A recent overview of developments in the theory and algorithms of least squares support vector machines, to which this LS-SVMlab toolbox is related, is presented in [1].
Article
LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates and parameter selection are discussed in detail.
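
One of the issues the abstract mentions, parameter selection, is commonly handled by a cross-validated grid search over C and gamma. The sketch below uses scikit-learn's SVC, which is built on LIBSVM; the grid values and dataset are arbitrary.

```python
# Cross-validated grid search over the SVM hyperparameters C and gamma.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```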
Conference Paper
The steadily increasing amounts of scientific data and the analysis of 'big data' is a fundamental characteristic in the context of computational simulations that are based on numerical methods or known physical laws. This represents both an opportunity and challenge on different levels for traditional distributed computing approaches, architectures, and infrastructures. On the lowest level data-intensive computing is a challenge since CPU speed has surpassed IO capabilities of HPC resources and on the higher levels complex cross-disciplinary data sharing is envisioned via data infrastructures in order to engage in the fragmented answers to societal challenges. This paper highlights how these levels share the demand for 'high productivity processing' of 'big data' including the sharing and analysis of 'large-scale science data-sets'. The paper will describe approaches such as the high-level European data infrastructure EUDAT as well as low-level requirements arising from HPC simulations used in distributed computing. The paper aims to address the fact that big data analysis methods such as computational steering and visualization, map-reduce, R, and others are around, but a lot of research and evaluations still need to be done to achieve scientific insights with them in the context of traditional distributed computing infrastructures.
Conference Paper
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.
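
The map/reduce programming pattern can be illustrated with a single-process word-count sketch in which the user supplies only the map and reduce functions; the shuffle is a plain dictionary here and no framework API is used.

```python
# Minimal, single-process sketch of the map/reduce pattern: the user supplies
# a map function and a reduce function; a real runtime would handle
# partitioning, shuffling and fault tolerance across a cluster.
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

def map_fn(doc):
    """Emit (word, 1) for every word in a document."""
    for word in doc.split():
        yield word, 1

def reduce_fn(word, counts):
    """Sum all counts emitted for one word."""
    return word, sum(counts)

# Shuffle: group intermediate values by key.
shuffle = defaultdict(list)
for doc in documents:
    for key, value in map_fn(doc):
        shuffle[key].append(value)

results = dict(reduce_fn(k, v) for k, v in shuffle.items())
print(results)   # e.g. {'the': 3, 'quick': 2, ...}
```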
Article
The UNICORE Grid-technology provides a seamless, secure and intuitive access to distributed Grid resources. In this paper we present the recent evolution from project results to production Grids. At the beginning UNICORE was developed as a prototype software in two projects funded by the German research ministry (BMBF). Over the following years, in various European-funded projects, UNICORE evolved to a full-grown and well-tested Grid middleware system, which today is used in daily production at many supercomputing centers worldwide. Beyond this production usage, the UNICORE technology serves as a solid basis in many European and International research projects, which use existing UNICORE components to implement advanced features, high level services, and support for applications from a growing range of domains. In order to foster these ongoing developments, UNICORE is available as open source under BSD licence at SourceForge, where new releases are published on a regular basis. This paper is a review of the UNICORE achievements so far and gives a glimpse on the UNICORE roadmap.
Article
In this article we discuss our experience designing and implementing a statistical computing language. In developing this new language, we sought to combine what we felt were useful features from two existing computer languages. We feel that the new language provides advantages in the areas of portability, computational efficiency, memory management, and scoping.
Synergistic Challenges in Data-Intensive Science and Exascale Computing
  • DOE ASCAC Data Subcommittee
DOE ASCAC Data Subcommittee Report, 'Synergistic Challenges in Data-Intensive Science and Exascale Computing', 2013
PRACE: Europe's Supercomputing Research Infrastructure
  • Th. Lippert
  • Th. Eickermann
  • D. Erwin
Lippert Th., Eickermann Th., and Erwin D., 'PRACE: Europe's Supercomputing Research Infrastructure', Advances in Parallel Computing, Vol. 22, doi:10.3233/978-1-61499-041-3-7, 2012
Quickbird Satellite Sensor
  • Satellite Imaging Corporation
Quickbird Satellite Sensor, Satellite Imaging Corporation, Online: http://www.satimagingcorp.com/satellite-sensors/quickbird.html
In situ monitoring of oxygen depletion in hypoxic ecosystems of coastal and open-seas, and land-locked water bodies
  • HYPOX Project Web Page
HYPOX Project Web Page, 'In situ monitoring of oxygen depletion in hypoxic ecosystems of coastal and open-seas, and land-locked water bodies', Online: http://www.hypox.net/
NIST Big Data Public Working Group, Online: http://bigdatawg.nist.gov/home.php
Big Data Analytics (BDA) Interest Group
  • Research Data Alliance
Research Data Alliance (RDA), Big Data Analytics (BDA) Interest Group, Online: https://rd-alliance.org/internal-groups/big-data-analytics-ig.html
Riding the wave - How Europe can gain from the rising tide of scientific data
  • J. Wood
Wood J., 'Riding the wave - How Europe can gain from the rising tide of scientific data', European Union