Conference Paper · PDF Available

Predictive Analytics of Sensor Data Using Distributed Machine Learning Techniques


Abstract

This work is based on a real-life dataset collected from sensors that monitor drilling processes and equipment in an oil and gas company. The sensor data stream in at one-second intervals, which is equivalent to 86,400 rows of data per day. After studying state-of-the-art Big Data analytics tools including Mahout, RHadoop and Spark, we chose 0xdata's H2O for this particular problem because of its fast in-memory processing, strong machine learning engine, and ease of use. Accurate predictive analytics of big sensor data can be used to estimate missing values, or to replace incorrect readings due to malfunctioning sensors or broken communication channels. It can also be used to anticipate situations that support various decisions, including maintenance planning and operations.
Cross-Device Consumer Identification
Girma Kejela
Department of Electrical Engineering and Computer Science
University of Stavanger
Stavanger, Norway
Email: girma.kejela@uis.no
Chunming Rong
Department of Electrical Engineering and Computer Science
University of Stavanger
Stavanger, Norway
Email: chunming.rong@uis.no
Abstract—Nowadays, a typical household owns multiple digital devices that can be connected to the Internet. Advertising companies want to seamlessly reach the consumers behind the devices rather than the devices themselves. However, the identity of consumers becomes fragmented as they switch from one device to another. A naive attempt is to use deterministic features such as user name, telephone number, and email address; however, consumers might refrain from giving away such personal information for privacy and security reasons. The challenge in the ICDM 2015 contest is to develop an accurate probabilistic model for predicting cross-device consumer identity without using deterministic user information.
In this paper we present an accurate and scalable cross-device solution using an ensemble of Gradient Boosting Decision Trees (GBDT) and Random Forest. Our final solution ranks 9th on both the public and private leaderboard (LB) with an F0.5 score of 0.855.
Keywords-Ensemble; Xgboost; Deep Learning; GBM; Random Forest; ICDM 2015 contest
I. INTRODUCTION
The ICDM 2015 contest is sponsored by Drawbridge, a leading company in probabilistic cross-device identity solutions. The task is to identify the set of computer cookies and mobile devices that belong to the same user. We entered the contest with the intention of designing an accurate and scalable probabilistic model that enables brands to target users across devices without asking them to log in (i.e., to give away their personal information).
Our approach involves data cleansing, predicting some of the missing values, joining mobile devices with computer cookies based on the frequency with which they are seen on the same IP address, feature engineering, supervised learning, model combination/ensembling, and searching for more matches where the confidence of the model in the predicted best match is low. We used command-line tools such as 'sed' and 'cat' to clean the data. Missing values of some categorical variables (anonymous_c0 and anonymous_c1) in the device_all_basic and cookie_all_basic tables were predicted using the other available variables. Prediction and replacement of the missing values was done in their respective native tables before joining them with the other tables. For example, to predict the missing anonymous_c0 in the device_all_basic table, all the other variables except the index variables were used. The part of the data with known categories was used as the training set and the part with missing values was used as the test set for prediction. We kept 15% of the training set as a validation set, which was used to optimize the parameters of the model.
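As a rough illustration of this imputation step, the sketch below trains a classifier on the rows where a categorical variable is known and predicts it where it is missing, holding out 15% of the labelled rows for validation. It is a minimal sketch only: scikit-learn stands in for the H2O/Xgboost models actually used, and the file and column names are illustrative.

# Sketch: impute a missing categorical variable (e.g. anonymous_c0) in the
# device table by training a classifier on the rows where it is known.
# scikit-learn stands in for the models used in the paper; names are illustrative.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

devices = pd.read_csv("device_all_basic.csv")            # hypothetical file name

target = "anonymous_c0"
index_cols = ["device_id", "drawbridge_handle"]           # index variables are excluded
features = [c for c in devices.columns if c not in index_cols + [target]]

# Encode categorical predictors as integer codes for this sketch.
X = devices[features].apply(lambda s: s.astype("category").cat.codes)

known = devices[target].notna()
X_known, y_known = X[known], devices.loc[known, target]
X_missing = X[~known]

# Keep 15% of the labelled rows as a validation set for parameter tuning.
X_tr, X_val, y_tr, y_val = train_test_split(X_known, y_known, test_size=0.15,
                                            random_state=42)

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
clf.fit(X_tr, y_tr)
print("validation accuracy:", clf.score(X_val, y_val))

# Fill the missing values in the native table before joining with the other tables.
devices.loc[~known, target] = clf.predict(X_missing)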
II. PREPARING DATA FOR BINARY CLASSIFICATION
The first challenge we faced when we started to participate in this contest was to join the device, cookie, IP, and property tables and construct the training and test sets for binary classification. An attempt to join on any of the categorical or index variables resulted in a tremendously large dataset with a large proportion of false matches. In this work a step-by-step approach was used to join devices and cookies based on common IP addresses, starting with the most frequent IP on which both the device and the cookie have been seen. As we observed throughout our experiments, cellular IPs produce many more false matches than non-cellular IPs, leading to a huge dataset that is computationally expensive. Thus, we filtered out the majority of cellular IPs based on the ip_frequency variable. We also removed cookies with an unknown drawbridge_handle, as keeping them failed to improve the performance of the model.
The assumption behind joining devices and cookies on a common IP address is that if the same user owns multiple devices, they should be seen on the same IP address at least once. A device can have multiple IP matches, and the data might grow very large if we consider all possible matches. A reasonable approach to work around this is to first return the most frequent cookie seen on the same IP as the device, and then continue joining on the next most frequent cookie seen on that IP until we find matching cookies for most devices in the training data.
Figure 1. Joining the device, cookie, IP, and property tables.
[Flow diagram: device_all_basic and cookie_all_basic are joined with the IP tables (id_all_ip, ipagg_all) on IP and with id_all_property on device_id and cookie_id; the resulting device and cookie tables are then joined on common IPs according to Algorithm 1 to produce device-cookie pairs.]
Algorithm 1 Join devices (D) and cookies (C) based on IP address
1: procedure JOIN(D, C)
2:   C1 ← the 5 most frequent cookies on each IP
3:   DC1 ← D ⋈ C1            ▷ Inner join on IP
4:   C2 ← C \ C1
5:   Cellular_cookies ← C2 with cellular IP
6:   uniq_cellular ← remove repeated cookies from Cellular_cookies, keeping those with the highest ip_frequency
7:   non_Cellular ← C2 with non-cellular IP
8:   new_Cookies ← non_Cellular ∪ uniq_cellular
9:   DC2 ← D ⋈ new_Cookies    ▷ Inner join on IP
10:  DC ← DC1 ∪ DC2
11:  return DC                ▷ Return device-cookie pairs
12: end procedure
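A pandas rendering of Algorithm 1 is sketched below. It assumes the device-IP and cookie-IP tables have already been flattened into data frames in which each cookie row carries cookie_id, ip, ip_frequency, and an is_cellular flag, and each device row carries device_id and ip; these are illustrative names, not the exact column names of the contest data.

# Sketch of Algorithm 1 in pandas; column names are illustrative.
import pandas as pd

def join_devices_cookies(devices: pd.DataFrame, cookies: pd.DataFrame) -> pd.DataFrame:
    # C1: the 5 most frequent cookies on each IP.
    c1 = (cookies.sort_values("ip_frequency", ascending=False)
                 .groupby("ip", group_keys=False)
                 .head(5))

    # DC1: inner join of devices with C1 on IP.
    dc1 = devices.merge(c1, on="ip", how="inner")

    # C2: the remaining cookie-IP rows.
    c2 = cookies.loc[~cookies.index.isin(c1.index)]

    # Cellular cookies: keep one row per cookie, the one with the highest ip_frequency.
    cellular = c2[c2["is_cellular"] == 1]
    uniq_cellular = (cellular.sort_values("ip_frequency", ascending=False)
                             .drop_duplicates(subset="cookie_id"))

    # Non-cellular cookie rows are kept as they are.
    non_cellular = c2[c2["is_cellular"] == 0]
    new_cookies = pd.concat([non_cellular, uniq_cellular])

    # DC2: inner join of devices with the remaining cookies on IP.
    dc2 = devices.merge(new_cookies, on="ip", how="inner")

    # DC: union of the two joins, with duplicate device-cookie pairs removed.
    dc = (pd.concat([dc1, dc2])
            .drop_duplicates(subset=["device_id", "cookie_id"]))
    return dc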
III. FEATURE ENGINEERING
All the features in the device_all_basic, cookie_all_basic, id_all_ip, ipagg_all, and id_all_property tables have been used. We did not see any improvement in performance from including features from the property_category table, so we considered it a redundant table. Five categorical features, namely anonymous_c0, anonymous_c1, anonymous_c2, country, and property_id, are common to both devices and cookies but are not used for joining the tables. Instead of representing them twice in each device-cookie pair, we replaced them with binary features: if an instance of a variable on the cookie side matches the instance of the corresponding variable on the device side, the value of the new binary variable is 1, otherwise it is 0. After replacing each such pair of variables with a single binary variable, the number of features is reduced from 48 to 43. This also removed variables with a large number of categories (e.g., anonymous_c2 and property_id), resulting in faster training time.
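A minimal sketch of this replacement is shown below, assuming the device-cookie pair table uses the dev_/c_ prefixes described in Section III (column names are illustrative):

# Sketch: replace the five shared categorical features with binary
# "same value on both sides" indicators; names are illustrative.
import pandas as pd

SHARED = ["anonymous_c0", "anonymous_c1", "anonymous_c2", "country", "property_id"]

def add_match_features(pairs: pd.DataFrame) -> pd.DataFrame:
    for col in SHARED:
        dev_col, c_col = f"dev_{col}", f"c_{col}"
        # 1 if the device-side and cookie-side values agree, 0 otherwise.
        pairs[f"match_{col}"] = (pairs[dev_col] == pairs[c_col]).astype(int)
        # Drop the two original columns, shrinking the feature set.
        pairs = pairs.drop(columns=[dev_col, c_col])
    return pairs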
Very rare categories of categorical variables that are specific to the device or cookie tables have been removed. For instance, we removed instances of the device_os variable that occurred fewer than 2,000 times in a dataset of around 21 million data points, keeping the 48 most frequent categories. Similarly, we kept 110 categories from computer_browser_version, 6 from device_type, and 48 from computer_os_type. After excluding the index variables such as device_id, cookie_id, device_drawbridge_handle, cookie_drawbridge_handle, and IP, we were left with 38 variables. The same set of variables was used for all of the models, but we generated dummy variables from the non-binary categorical features for the Xgboost model.
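The rare-category filtering and the dummy-variable generation could look roughly as follows. This sketch collapses rare categories into a single placeholder level rather than removing them outright, and the thresholds, file name, and column names are illustrative.

# Sketch: keep only frequent categories of high-cardinality variables and
# one-hot encode the non-binary categoricals for the Xgboost model.
import pandas as pd

pairs = pd.read_csv("device_cookie_pairs.csv")   # hypothetical pair table from the join step

def keep_frequent(df: pd.DataFrame, col: str, min_count: int = 2000) -> pd.DataFrame:
    counts = df[col].value_counts()
    frequent = counts[counts >= min_count].index
    # Collapse categories seen fewer than min_count times into a single level.
    return df.assign(**{col: df[col].where(df[col].isin(frequent), "other")})

for col in ["device_os", "computer_browser_version", "computer_os_type"]:
    pairs = keep_frequent(pairs, col)

# Dummy (one-hot) variables from the non-binary categorical features,
# as used for the Xgboost model.
categorical = ["device_os", "computer_browser_version", "device_type", "computer_os_type"]
pairs_xgb = pd.get_dummies(pairs, columns=categorical)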
Table I shows the relative importance of the 10 most relevant variables as computed by the GBM model. Features with the 'dev_' prefix are device-side features and those with the 'c_' prefix are cookie-side features.
Table I
VARIABLE IMPORTANCE FOR THE 10 MOST IMPORTANT VARIABLES

Rank  Variable name              Rel. importance  Perc. importance
1     dev_ip_anonymous_c2        2578230.75       47.57%
2     c_ip_anonymous_c2          829895.06        15.31%
3     c_anonymous_5              460277.00        8.49%
4     c_idxip_anonymous_c3       378855.90        6.99%
5     computer_browser_version   210147.73        3.88%
6     device_os                  185905.98        3.43%
7     dev_idxip_anonymous_c3     146490.69        2.70%
8     computer_os_type           130295.96        2.40%
9     dev_ip_anonymous_c1        100744.43        1.86%
10    dev_anonymous_5            49326.52         0.91%
IV. SUPERVISED LEARNING
A user-related variable, drawbridge_handle, was given only for the training set, and it was used to construct a binary target variable that represents positive and negative matches. If the drawbridge_handle on the device side is the same as the drawbridge_handle on the cookie side, the device and cookie belong to the same user and the pair is a positive match; otherwise it is considered a negative match. Once the problem of matching devices and cookies that belong to a single user is transformed into a binary classification problem, we can implement state-of-the-art predictive models that separate true matches from false matches with high accuracy.
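A minimal sketch of the target construction, assuming the joined training pairs keep the handle from both sides under illustrative dev_/c_ column names:

# Sketch: build the binary target from drawbridge_handle; names are illustrative.
import pandas as pd

pairs = pd.read_csv("train_device_cookie_pairs.csv")   # hypothetical joined pair table

# Positive match: same drawbridge_handle on both sides of the pair.
pairs["target"] = (pairs["dev_drawbridge_handle"] == pairs["c_drawbridge_handle"]).astype(int)

# The index variables are excluded from the model features.
index_cols = ["device_id", "cookie_id", "dev_drawbridge_handle", "c_drawbridge_handle", "target"]
feature_cols = [c for c in pairs.columns if c not in index_cols]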
In many cases, a combination of learning models results in better predictive performance than any of the individual models. Our final result was obtained by combining Xgboost, Random Forest, and GBM. As can be seen from Table II, the best single model is Xgboost. Our test set contains 6 million data points, out of which only 61,156 devices have one or more cookies that belong to the same user as the device. The submission requires only these 61,156 devices with their true cookie matches, and we selected the final matches based on the maximum predicted probability of a positive match.
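The ensemble step reduces to averaging the per-pair probabilities of the three models and keeping, for each device, the cookie with the highest averaged probability. The sketch below assumes the individual model scores have already been written into the pair table as columns p_xgb, p_gbm, and p_rf (illustrative names):

# Sketch: average per-pair model probabilities and pick the best cookie per device.
import pandas as pd

test_pairs = pd.read_csv("test_device_cookie_pairs.csv")   # hypothetical pair table with model scores

# Simple averaging of the three models' predicted match probabilities.
test_pairs["p_match"] = test_pairs[["p_xgb", "p_gbm", "p_rf"]].mean(axis=1)

# For each device, keep the cookie with the highest averaged probability.
best = (test_pairs.sort_values("p_match", ascending=False)
                  .groupby("device_id", as_index=False)
                  .first()[["device_id", "cookie_id", "p_match"]])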
To further improve the prediction performance, we selected the device IDs in the test data whose maximum predicted probability was less than 0.4 and re-joined them with cookies based on all possible IP matches. We did this because, if the probability of the best match is low, it could mean that the true match is not included in the test set, and we wanted to search for more possible matches among the cookies that were discarded during the joining process. This created a new test set of 7 million rows from just 5,501 device IDs in the device_test_basic data. We predicted on it using the already trained GBM model and used the maximum probability to identify the best match for submission. This improved the score on the public LB from 0.852 to 0.855.
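Continuing the previous sketch, the rematching step can be outlined as below. The helper join_on_all_ips and the objects gbm_model and feature_cols are hypothetical placeholders for the wider IP join, the already trained model, and the feature list.

# Sketch: rematch devices whose best predicted match is weak (probability < 0.4).
import pandas as pd

THRESHOLD = 0.4

# Devices whose best predicted match is weak (5,501 devices in our case).
max_prob = best.set_index("device_id")["p_match"]
low_conf_devices = max_prob[max_prob < THRESHOLD].index

# Re-join these devices with cookies over all possible IP matches
# (join_on_all_ips is a hypothetical helper for the wider join).
new_pairs = join_on_all_ips(low_conf_devices)

# Score the new candidate pairs with the already trained GBM (a scikit-learn-style
# interface is assumed here) and keep the highest-probability cookie per device.
new_pairs["p_match"] = gbm_model.predict_proba(new_pairs[feature_cols])[:, 1]
rematched = (new_pairs.sort_values("p_match", ascending=False)
                      .groupby("device_id", as_index=False)
                      .first()[["device_id", "cookie_id", "p_match"]])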
Table II
LEADERBOARD SCORE (F0.5) OF INDIVIDUAL MODELS AND THEIR ENSEMBLES

Model                                       Public LB  Private LB  Remark
Deep learning                               0.8338     0.8351      Hidden: (128, 64, 64, 32), Rectifier with dropout
Xgboost                                     0.8531     0.8541      learning rate: 0.01, depth: 10, num_round: 2000
GBM                                         0.84792    0.84793     learning rate: 0.01, depth: 10, num_trees: 1500
RF                                          0.8455     0.8472      num_trees: 1000, sample_rate: 0.85
Ensemble: GBM, RF, Xgboost                  0.8525     0.8535      Combination method: averaging
Ensemble: GBM, RF, Xgboost with rematching  0.855391   0.855454    Combination method: averaging and maximum probability
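For reference, the leaderboard metric F0.5 weights precision more heavily than recall; its standard definition (not restated in the paper) is

F_0.5 = (1 + 0.5^2) * P * R / (0.5^2 * P + R),

where P is precision and R is recall over the submitted device-cookie matches.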
V. SUMMARY AND CONCLUSION
The dataset tables were pre-processed and joined based on common variable matches. We used step-by-step IP-based joining to identify 176,133 cookies, out of 180,083 possible matches, that belong to the same users as the devices given in the training all_basic data. Instead of considering all possible IP matches, which would produce a tremendously large dataset, we filtered out the less frequent cellular IPs. This left circa 2.2% of the training set without their true matches, and we expect the same proportion to have been left out of the test set, corresponding to a 2.2% error. The final error is the sum of this error and the error produced by the binary classifier models.
Supervised machine learning models have been used to separate positive matches from false matches (i.e., cookies that have been seen on the same IP as a device but do not belong to the same user). An ensemble of GBM, Random Forest, and Xgboost outperforms each of the individual models. The best single model is Xgboost, with a public LB score of 0.853.
Discarding cellular IPs when joining devices with cookies helped reduce the size of the training data, but it accounts for circa 2.2% of the final error. We tried to compensate for this error by selecting devices whose predicted best match had low probability (in this case, less than 0.4) and joining them with cookies based on all possible IP matches, creating a new test set. We used the already trained GBM to predict on this new test set and selected the predictions with maximum probability as the best matches. This improved the F0.5 score on the private LB by circa 0.002. For better performance, our solution could be combined with other solutions that identified a higher percentage of device-cookie matches but used less accurate predictive models.
REFERENCES
[1] Thomas G. Dietterich, "Ensemble Methods in Machine Learning," Oregon State University.
[2] Drawbridge Media Kit: https://gallery.mailchimp.com/dd5380a49beb13eb00838c7e2/files/Drawbridge MediaKit Jun2015 1 .pdf
[3] Contest Data Set: https://www.kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/data
[4] The code used in this contest will be released here: https://github.com/Girmak/ICDM2015
... The h2o package from H2O.ai, an open-source machine learning package, was used in R (RStudio 3.5.1) to create the models because of H2O's ease of use, compatibility with multiple languages, ability to scale to large data, and library of widely used machine learning algorithms [18]. This allows for the development of several models to assess the robustness of machine learning for a specific application. Additionally, H2O has several built-in parameters that were used to limit overfitting the models and ensure accurate results, such as the low default number of trees, limiting the number of epochs, and using the right amount of regularization. ...
... The maximum tree depth default is five and the default number of bins for the histogram to build and then split is 20 [18,20]. GLMs are extensions of traditional linear regressions and can be used for response variables that follow distributions other than just the normal distribution. H2O's GLM includes distributions such as binomial, Gaussian, Poisson, and more, and fits the model based on maximum likelihood estimation via iteratively reweighted least squares. ...
... These models integrate live sensor data with engineering models and historical records to simulate system dynamics under various conditions. By continuously synchronizing with physical systems, digital twins provide a contextualized and evolving view of component performance, degradation trends, and potential failure points [29]. ...
Article
Full-text available
The advent of the Internet of Things (IoT) has significantly transformed maintenance strategies for mechanical systems, transitioning from reactive and preventive approaches to intelligent, predictive maintenance frameworks. This paper explores the integration of IoT technologies-specifically sensor networks, edge computing, and cloud infrastructure-into mechanical system monitoring to enable real-time diagnostics and failure prediction. It outlines the evolution of maintenance strategies and highlights how embedded sensing and continuous data collection are foundational to predictive analytics. Through detailed examination of system architecture, communication protocols, and machine learning methodologies, the paper illustrates how predictive models and digital twins enhance fault detection, equipment longevity, and resource allocation. Case studies demonstrate quantifiable operational benefits, including reduced unplanned downtime and cost savings. The strategic and organizational implications are analyzed, emphasizing workforce transformation, implementation barriers, and cybersecurity considerations. Ultimately, this study presents a comprehensive framework for implementing IoT-enabled predictive maintenance and suggests future research directions centered on AI convergence, system interoperability, and sustainability in industrial operations.
... In this paper, we propose Cloud-based machine learning tools for enhanced big data applications (e.g., [34,7,21]), where the main idea is that of predicting the "next" workload occurring against the target Cloud infrastructure via an innovative ensemble-based (e.g., [47]) approach combining the effectiveness of different well-known classifiers in order to enhance the whole accuracy of the final classification, which is very relevant at now in the specific context of Big Data (e.g., [17]). So-called workload categorization problem plays a critical role in improving the efficiency and the reliability of Cloud-based big data applications (e.g., [60,62]). ...
Article
Full-text available
We propose Cloud-based machine learning tools for enhanced Big Data applications, where the main idea is that of predicting the "next" workload occurring against the target Cloud infrastructure via an innovative ensemble-based approach that combines the effectiveness of different well-known classifiers in order to enhance the whole accuracy of the final classification, which is very relevant at now in the specific context of Big Data. The so-called workload categorization problem plays a critical role in improving the efficiency and reliability of Cloud-based big data applications. Implementation-wise, our method proposes deploying Cloud entities that participate in the distributed classification approach on top of virtual machines, which represent classical "commodity" settings for Cloud-based big data applications. Given a number of known reference workloads, and an unknown workload, in this paper we deal with the problem of finding the reference workload which is most similar to the unknown one. The depicted scenario turns out to be useful in a plethora of modern information system applications. We name this problem as coarse-grained workload classification, because, instead of characterizing the unknown workload in terms of finer behaviors, such as CPU, memory, disk, or network intensive patterns, we classify the whole unknown workload as one of the (possible) reference workloads. Reference workloads represent a category of workloads that are relevant in a given applicative environment. In particular, we focus our attention on the classification problem described above in the special case represented by virtualized environments. Today, Virtual Machines (VMs) have become very popular because they offer important advantages to modern computing environments such as cloud computing or server farms. In virtualization frameworks, workload classification is very useful for accounting, security reasons, or user profiling. Hence, our research makes more sense in such environments, and it turns out to be very useful in a special context like Cloud Computing, which is emerging now. In this respect, our approach consists of running several machine learning-based classifiers of different workload models, and then deriving the best classifier produced by the Dempster-Shafer Fusion, in order to magnify the accuracy of the final classification. Experimental assessment and analysis clearly confirm the benefits derived from our classification framework. The running programs which produce unknown workloads to be classified are treated in a similar way. A fundamental aspect of this paper concerns the successful use of data fusion in workload classification. Different types of metrics are in fact fused together using the Dempster-Shafer theory of evidence combination, giving a classification accuracy of slightly less than 80%. The acquisition of data from the running process, the pre-processing algorithms, and the workload classification are described in detail. Various classical algorithms have been used for classification to classify the workloads, and the results are compared.
... With the rapid improvement in industrial operations and technology, the Internet of Things, smart gadgets, and social media, digital data has grown in volume and complexity at a rapid rate [1][2][3][4]. Big data is a collection of records with a large volume and a rate of exponential growth over time [5]. For example, the New York Stock Exchange market creates over 1 TB of new data per day, social media creates more than 500 TB, and a jet engine creates around 10 TB of data in only 30 minutes, according to the report. ...
Article
Full-text available
Artificial intelligence, specifically machine learning, has been applied in a variety of methods by the research group to transform several data sources into valuable facts and understanding, allowing for superior pattern identification skills. Machine learning algorithms on huge and complicated data sets, computationally expensive on the other hand, processing requires hardware and logical resources, such as space, CPU, and memory. As the amount of data created daily reaches quintillion bytes, A complex big data infrastructure becomes more and more relevant. Apache Spark Machine learning library (ML-lib) is a famous platform used for big data analysis, it includes several useful features for machine learning applications, involving regression, classification, and dimension reduction, as well as clustering and features extraction. In this contribution, we consider Apache Spark ML-lib as a computationally independent machine learning library, which is open-source, distributed, scalable, and platform. We have evaluated and compared several ML algorithms to analyze the platform’s qualities, compared Apache Spark ML-lib against Rapid Miner and Sklearn, which are two additional Big data and machine learning processing platforms. Logistic Classifier (LC), Decision Tree Classifier (DTc), Random Forest Classifier (RFC), and Gradient Boosted Tree Classifier (GBTC) are four machine learning algorithms that are compared across platforms. In addition, we have tested general regression methods such as Linear Regressor (LR), Decision Tree Regressor (DTR), Random Forest Regressor (RFR), and Gradient Boosted Tree Regressor (GBTR) on SUSY and Higgs datasets. Moreover, We have evaluated the unsupervised learning methods like K-means and Gaussian Mixer Models on the data set SUSY and Hepmass to determine the robustness of PySpark, in comparison with the classification and regression models. We used ”SUSY,” ”HIGGS,” ”BANK,” and ”HEPMASS” dataset from the UCI data repository. We also talk about recent developments in the research into Big Data machines and provide future research directions.
... Therefore, it is also important to underline that in machine learning, there is something called the "No Free Lunch" theorem. It states that no one algorithm works best for every problem, and it is especially relevant for supervised learning (i.e., predictive modelling) [111]. There are many factors at play, such as the size, quality, and nature of data; the available computational time; the urgency of the task; and what you want to do with the data [112]. ...
Article
Full-text available
The increasing availability of data, gathered by sensors and intelligent machines, is changing the way decisions are made in the manufacturing sector. In particular, based on predictive approach and facilitated by the nowadays growing capabilities of hardware, cloud-based solutions, and new learning approaches, maintenance can be scheduled—over cell engagement and resource monitoring—when required, for minimizing (or managing) unexpected equipment failures, improving uptime through less aggressive maintenance schedules, shortening unplanned downtime, reducing excess (direct and indirect) cost, reducing long-term damage to machines and processes, and improve safety plans. With access to increased levels of data (and over learning mechanisms), companies have the capability to conduct statistical tests using machine learning algorithms, in order to uncover root causes of problems previously unknown. This study analyses the maturity level and contributions of machine learning methods for predictive maintenance. An upward trend in publications for predictive maintenance using machine learning techniques was identified with the USA and China leading. A mapping study—steady set until early 2019 data—was employed as a formal and well-structured method to synthesize material and to report on pervasive areas of research. Type of equipment, sensors, and data are mapped to properly assist new researchers in positioning new research activities in the domain of smart maintenance. Hence, in this paper, we focus on data-driven methods for predictive maintenance (PdM) with a comprehensive survey on applications and methods until, for the sake of commenting on stable proposal, 2019 (early included). An equal repartition between evaluation and validation studies was identified, this being a symptom of an immature but growing research area. In addition, the type of contribution is mainly in the form of models and methodologies. Vibrational signal was marked as the most used data set for diagnosis in manufacturing machinery monitoring; furthermore, supervised learning is reported as the most used predictive approach (ensemble learning is growing fast). Neural networks, followed by random forests and support vector machines, were identified as the most applied methods encompassing 40% of publications, of which 67% related to deep neural network with long short-term memory predominance. Notwithstanding, there is no robust approach (no one reported optimal performance over different case tests) that works best for every problem. We finally conclude the research in this area is moving fast to gather a separate focused analysis over the last two years (whenever stable implementations will appear).
... Both works are just examples of a whole family of activities that took place even before the eve of Big Data but already pointed towards the necessity of increasing the number of variables. The acquisition of models in Big Data volumes is often done via distributed machine learning [9], [10], and commonly involving strategies like deep learning [11] or ensemble algorithms. Today, works like Gao et al. [12] yield promising results in applying autoregressive networks directly on these highdimensional raw data streams. ...
Chapter
The implementation of artificial intelligence faces different challenges of infrastructural, data-related, security-related and social scope. These aspects are discussed, reflecting on the requirements of introducing such a technology in a broader way. Machine learning, as a subfield of artificial intelligence, benefits from advances in Big Data science. One example is the λ-architecture, which can be used for the treatment of streamed process data in industrial applications. Digital twins are shown as a further tool providing object-oriented data for machine learning applications. Yet, the increasing freedom of data transfer within a plant, as propagated by Industry 4.0, poses new risks for information technology and automation systems: security of those components is one of the big challenges. Here, artificial intelligence can be seen as both a risk and a solution. A last relevant challenge is acceptance among the staff, as artificial intelligence is associated with fears. Counterstrategies for those fears are presented as a proposed guideline for real applications. Finally, current frontiers in the process industry are considered and discussed. These include the need for strengthening the use of high-dimensional data availability, increased roll-out of optimisation concepts and rigorous progress in semantic modelling of processes and process chains, in order to fully exploit the beneficial scope of artificial intelligence in industry.
Article
Full-text available
A Smart Manufacturing (SM) system should be capable of handling high volume data, processing high velocity data and manipulating high variety data. Big data analytics can enable timely and accurate insights using machine learning and predictive analytics to make better decisions. The objective of this paper is to present big data analytics modeling in the metal cutting industry. This paper includes: 1) identification of manufacturing data to be analyzed, 2) design of a functional architecture for deriving analytic models, and 3) design of an analytic model to predict a sustainability performance, especially power consumption, using the big data infrastructure. A prototype system has been developed for this proof-of-concept, using open platform solutions including MapReduce, Hadoop Distributed File System (HDFS), and a machine-learning tool. To derive a cause-effect relationship of the analytic model, STEP-NC (a standard that enables the exchange of design-to-manufacturing data, especially machining) plan data and MTConnect machine monitoring data are used for a cause factor and an effect factor, respectively.
Conference Paper
Full-text available
In this paper, we describe a novel analytics system that enables query processing and predictive analytics over streams of big aviation data. As part of an Internal Research and Development project, Boeing Research and Technology (BR&T) Advanced Air Traffic Management (AATM) built a system that makes predictions based upon descriptive patterns of massive aviation data. Boeing AATM has been receiving live Aircraft Situation Display to Industry (ASDI) data and archiving it for over two years. At the present time, there is not an easy mechanism to perform analytics on the data. The incoming ASDI data is large, compressed, and requires correlation with other flight data before it can be analyzed. The service exposes this data once it has been uncompressed, correlated, and stored in a data warehouse for further analysis using a variety of descriptive, predictive, and possibly prescriptive analytics tools. The service is being built partially in response to requests from Boeing Commercial Aviation (BCA) for analysis of capacity and flow in the US National Airspace System (NAS). The service utilizes a custom tool developed by Embry Riddle Aeronautical University (ERAU) that correlates the raw ASDI feed, IBM Warehouse with DB2 for data management, WebSphere Message Broker for real-time message brokering, SPSS Modeler for statistical analysis, and Cognos BI for front-end business intelligence (BI) visualization tools. This paper describes a scalable service architecture, implementation and value it adds to the aviation domain.
Article
Full-text available
Gradient boosting machines are a family of powerful machine-learning techniques that have shown considerable success in a wide range of practical applications. They are highly customizable to the particular needs of the application, like being learned with respect to different loss functions. This article gives a tutorial introduction into the methodology of gradient boosting methods with a strong focus on machine learning aspects of modeling. A theoretical information is complemented with descriptive examples and illustrations which cover all the stages of the gradient boosting model design. Considerations on handling the model complexity are discussed. Three practical examples of gradient boosting applications are presented and comprehensively analyzed.
Conference Paper
Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions. The original ensemble method is Bayesian averaging, but more recent algorithms include error-correcting output coding, Bagging, and boosting. This paper reviews these methods and explains why ensembles can often perform better than any single classifier. Some previous studies comparing ensemble methods are reviewed, and some new experiments are presented to uncover the reasons that Adaboost does not overfit rapidly.
Conference Paper
We investigate an approach of predicting marine sensor data in short-term using fuzzy clustering. The proposed method uses the similar pattern sequences with their weights using fuzzy clustering for prediction. Experimental results show that the proposed method helps in improving the accuracy in prediction over some existing approaches.
Article
The healthcare sector deals with large volumes of electronic data related to patient services. This article describes two novel applications that leverage big data to detect fraud, abuse, waste, and errors in health insurance claims, thus reducing recurrent losses and facilitating enhanced patient care. The results indicate that claim anomalies detected using these applications help private health insurance funds recover hidden cost overruns that aren't detectable using transaction processing systems. This article is part of a special issue on leveraging big data and business analytics.
Article
The history of oilwell drilling includes a number of innovative techniques to infer the conditions down-hole and their impact on drilling performance. Surface and down-hole sensors have been deployed to measure related physical phenomena that can be generally associated with the actual environment. Some direct measurements have been made after drilling using sensors suspended on a wire line, and this evolved into near real-time measurements using Measurement While Drilling (MWD) and Logging While Drilling (LWD) equipment that use low bandwidth pulse telemetry communications. With the advent to high speed wired drillpipe, new sensors are needed to take advantage of the improved telemetry capabilities to produce true real-time measurements. New surface sensors are needed that employ smart technologies and enhanced data processing. By combining these high-speed, high quality measurements with predictive models, set points and limits can be directly connected to the latest generation of open-architecture rig control systems to improve drilling safety and efficiency. This paper briefly outlines the shortcomings of a wide variety of existing sensor designs and identifies emerging technologies that could provide better measurements if the harsh environmental constraints can be met. The authors then call to non-industry participants to adapt sensor technologies used in other commercial applications to help automate the process of drilling for oil and gas. They discuss where the oil and gas industry needs accurate, real-time measurements of chemical, rheological and physical properties, and how these measurements could be used. They also address the need for power generation and/or storage to overcome challenges encountered today. The paper ends with a look ahead as the industry seeks new sensors, perhaps based upon a very different set of physical measurements that will improve our ability to safely produce oil and gas until alternate energy sources become commonplace.
Conference Paper
Wireless sensor networks have a vast amount of applications including environmental monitoring, military, ecology, agriculture, inventory control, robotics and health care. This paper focuses on the area of monitoring and protection of oil and gas operations using wireless sensor networks that are optimized to decrease installation, and maintenance cost, energy requirements, increase reliability and improve communication efficiency. In addition, simulation experiments using the proposed model are presented. Such models could provide new tools for research in predictive maintenance and condition-based monitoring of factory machinery in general and for “open architecture machining systems” in particular. Wireless sensing no longer needs to be relegated to locations where access is difficult or where cabling is not practical. Wireless condition monitoring systems can be cost effectively implemented in extensive applications that were historically handled by running routes with data collectors. The result would be a lower cost program with more frequent data collection, increased safety, and lower spare parts inventories. Facilities would be able to run leaner because they will have more confidence in their ability to avoid downtime.