
Abstract

Extracting knowledge and emerging patterns from sensor data is a nontrivial task. The challenges for the knowledge discovery community are expected to be immense. On one hand, dynamic data streams or events require real-time analysis methodologies and systems, while on the other hand centralized processing through high-end computing is also required for generating offline predictive insights, which in turn can facilitate real-time analysis. In addition, emerging societal problems require knowledge discovery solutions that are designed to investigate anomalies, changes, extremes and nonlinear processes, and departures from the normal. Keeping in view the requirements of the emerging field of knowledge discovery from sensor data, we took the initiative to develop a community of researchers with common interests and scientific goals, which culminated in the organization of the Sensor-KDD series of workshops in conjunction with the prestigious ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. In this report, we summarize the events of the Second ACM-SIGKDD International Workshop on Knowledge Discovery from Sensor Data (Sensor-KDD 2008).
Knowledge Discovery from Sensor Data (SensorKDD)
Olufemi A. Omitaomu
Oak Ridge National Laboratory
1, Bethel Valley Road
Oak Ridge, TN 37831, USA
omitaomuoa@ornl.gov
Ranga Raju Vatsavai
Oak Ridge National Laboratory
1, Bethel Valley Road
Oak Ridge, TN 37831, USA
vatsavairr@ornl.gov
Auroop R. Ganguly
Oak Ridge National Laboratory
1, Bethel Valley Road
Oak Ridge, TN 37831, USA
gangulyar@ornl.gov
Nitesh V. Chawla
University of Notre Dame
Department of Computer Science and
Engineering, IN, USA
nchawla@cse.nd.edu
Joao Gama
University of Porto
Rua de Ceuta
Porto, Portugal
jgama@fep.up.pt
Mohamed Medhat Gaber
Monash University
900 Dandenong Road
Melbourne, Victoria, 3145, Australia
mohamed.gaber@infotech.monash.edu.au
ABSTRACT
Wide-area sensor infrastructures, remote sensors, RFIDs, and
wireless sensor networks yield massive volumes of disparate,
dynamic, and geographically distributed data. As such sensors are
becoming ubiquitous, a set of broad requirements is beginning to
emerge across high-priority applications including adaptability to
climate change, electric grid monitoring, disaster preparedness
and management, national or homeland security, and the
management of critical infrastructures. The raw data from sensors
need to be efficiently managed and transformed to usable
information through data fusion, which in turn must be converted
to predictive insights via knowledge discovery, ultimately
facilitating automated or human-induced tactical decisions or
strategic policy based on decision sciences and decision support
systems.
Keeping in view the requirements of the emerging field of
knowledge discovery from sensor data, we took the initiative to
develop a community of researchers with common interests and
scientific goals, which culminated in the organization of the
SensorKDD series of workshops in conjunction with the
prestigious ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. In this report, we
summarize events at the Third ACM-SIGKDD International
Workshop on Knowledge Discovery from Sensor Data
(SensorKDD 2009).
1. INTRODUCTION
As wide-area sensor infrastructures, remote sensors, RFIDs, and
wireless sensor networks are becoming ubiquitous, the challenges
for the knowledge discovery community are expected to be
immense. On the one hand, dynamic data streams or events
require real-time analysis methodologies and systems, while on
the other hand centralized processing through high-end computing
is also required for generating offline predictive insights, which in
turn can facilitate real-time analysis. Online and real-time
knowledge discovery implies immediate opportunities as well as
intriguing short- and long-term challenges for practitioners and
researchers in knowledge discovery. The opportunities would be
to develop new data mining approaches and adapt traditional and
emerging knowledge discovery methodologies to the requirements
of the emerging problems. In addition, emerging societal
problems require knowledge discovery solutions that are designed
to investigate anomalies, hotspots, changes, extremes and
nonlinear processes, and departures from the normal. The theme
for the 2009 SensorKDD workshop centers on three inter-related
global priorities: climate change, energy assurance, and
infrastructural impacts. The workshop brings together researchers
from academia, government, and industry working on various
aspects of knowledge discovery from sensor data.
1.1 Motivation
The expected ubiquity of sensors in the near future, combined
with the critical roles they are expected to play in high priority
application solutions, points to an era of unprecedented growth and
opportunities. The requirements described earlier imply
immediate opportunities as well as intriguing short- and long-term
challenges for practitioners and researchers in knowledge
discovery. In addition, the knowledge discovery and data mining
(KDD) community will be called upon, again and again, to
partner with domain experts to solve critical application problems
in business and government, as well as in the domain sciences and
engineering.
The main motivation for this workshop stems from the increasing
need for a forum to exchange ideas and recent research results,
and to facilitate collaboration and dialog between academia,
government, and industrial stakeholders. Based on the positive
feedback from previous workshop attendees and our own
experiences and interactions with government agencies such
as the United States Department of Energy, United States
Department of Homeland Security, United States Department of
Defense, and involvement with numerous projects on knowledge
discovery from sensor data, we strongly believe in the
continuation of this workshop. We also believe that the ACM
SIGKDD conference is the right forum to organize this workshop
as it brings the KDD community together in this important area to
establish a much needed leadership position in research and
practice in the near term, as well as in the long term.
1.2 Previous SensorKDD Workshops
The previous two workshops, SensorKDD’07 and SensorKDD’08, held
in conjunction with the 13th and 14th ACM SIGKDD Conference on
Knowledge Discovery and Data Mining respectively, attracted
several participants as well as many high-quality papers and
presentations. The 2007 workshop was attended by more than
seventy registered participants. The workshop program included
presentations by authors of six accepted full papers and four
invited speakers. The invited speakers were Prof. Pedro Domingos
of the University of Washington, Prof. Joydeep Ghosh of the
University of Texas, Austin, Prof. Hillol Kargupta of the
University of Maryland, Baltimore County, and Dr. Brian Worley
of the Oak Ridge National Laboratory (ORNL). There were also
poster presentations by authors of six accepted short papers. The
extended versions of papers presented at the workshop were
developed into a book titled Knowledge Discovery from Sensor
Data [1], the first book published in this specific discipline.
SIGKDD Explorations Volume 11, Issue 2 Page 84
The 2008 workshop was attended by more than 60 registered
participants. There were presentations by authors of seven
accepted full papers and six accepted short papers; the workshop
program also included presentations by two invited speakers:
Prof. Jiawei Han of the University of Illinois at Urbana-
Champaign and Dr. Kendra Moore of the Defense Advanced
Research Projects Agency. The extended versions of papers
presented at the 2008 workshop are scheduled for publication as
Springer's LNCS post-proceedings in 2009.
2. SUMMARY OF THE 2009 WORKSHOP
The highlights of the SensorKDD 2009 workshop are oral
presentations of accepted papers, presentations by invited
speakers, and oral presentations by the winners of the first
SensorKDD Cup, similar in spirit to the KDD Cup, but in line
with the workshop theme.
Based on a minimum of two reviews per paper, we selected six
full papers and eight short papers. In addition to the oral
presentations of accepted papers, the workshop featured three
invited speakers: Dr. Aurelie Lozano of IBM T.J. Watson
Research Center, Mr. Alessandro Donati of the European Space
Agency, Darmstadt, Germany, and Prof. Carlos Guestrin of
Carnegie Mellon University. For the SensorKDD Cup, we
challenged contestants with intriguing problems along
with geographically distributed and dynamic data from sensors
and model simulations. The datasets were provided and/or
compiled by leveraging the Oak Ridge National Laboratory
resources. Two winners were selected to present their results at
the workshop. All papers presented at the workshop are available
in [2].
2.1 Full Research Papers
Below is a list of accepted full papers and their respective authors:
Handling Outliers and Concept Drift in Online Mass Flow Prediction in CFB Boilers by J. Bakker, M. Pechenizkiy, I. Zliobaite, A. Ivannikov, and T. Karkkainen (Best Paper Award Winner).
An Exploration of Climate Data Using Complex Networks by Karsten Steinhaeuser, Nitesh V. Chawla, and Auroop R. Ganguly (1st Best Student Paper Award Winner).
A Comparison of SNOTEL and AMSR-E Snow Water Equivalent Datasets in Western U.S. Watersheds by Cody L. Moser, Oubeidillah Aziz, Glenn A. Tootle, Venkat Lakshmi, and Greg Kerr.
EDISKCO: Energy Efficient Distributed in-Sensor-Network K-center Clustering with Outliers by Marwan Hassani, Emmanuel Muller, and Thomas Seidl.
Phenological Event Detection from Multitemporal Image Data by Ranga Raju Vatsavai.
Mining in a Mobile Environment by Sean McRoskey, James Notwell, Nitesh V. Chawla, and Christian Poellabauer (2nd Best Student Paper Award Winner).
2.2 Short Research Papers
The following is a list of accepted short papers and their
respective authors:
On the Identification of Intra-seasonal Changes in the Indian Summer Monsoon by Shivam Tripathi and Rao S. Govindaraju.
Reduction of Ground-Based Sensor Sites for Spatio-Temporal Analysis of Aerosols by Vladan Radosavljevic, Slobodan Vucetic, and Zoran Obradovic.
OcVFDT: One-class Very Fast Decision Tree for One-class Classification of Data Streams by Chen Li, Yang Zhang, and Xue Li.
A Frequent Pattern Based Framework for Event Detection in Sensor Network Stream Data by Li Wan, Jianxin Liao, and Xiaomin Zhu.
Supervised Clustering via Principal Component Analysis in a Retrieval Application by Esteban Garcia-Cuesta, Ines M. Galvan, and Antonio J. de Castro.
A Novel Measure for Validating Clustering Results Applied to Road Traffic by Yosr Naija and Kaouther Blibech Sinaoui.
SkyTree: Scalable Skyline Computation for Sensor Data by Jongwuk Lee and Seung-won Hwang.
Clustering of Power Quality Event Data Collected via Monitoring Systems Installed on the Electricity Network by Mennan Guder, Nihan Kesim Cicekli, Ozgul Salor, and Isik Cadirci.
2.3 SensorKDD’09 Cup Winners
The titles and authors of the SensorKDD’09 Cup winners are:
Change Detection in Rainfall and Temperature Patterns over India by Shivam Tripathi and Rao S. Govindaraju.
Anomaly Detection and Spatio-Temporal Analysis of Global Climate System by Mahashweta Das and Srinivasan Parthasarathy.
3. CONCLUSIONS
Extracting knowledge and emerging patterns from sensor data is a
nontrivial task. The challenges for the knowledge discovery
community are expected to be immense. As evidenced by the
participation in and quality of submissions to the previous three
SensorKDD workshops, it is clear that knowledge discovery from
sensor data (SensorKDD) is a growing area and an important
specialty (sub-area) within knowledge discovery. The
SensorKDD workshop has proven to be an attractive forum for
researchers from academia, industry, and government to exchange
ideas, initiate collaborations, and lay the foundation for the
future of this important and growing area. The workshop witnessed
lively participation from all quarters and generated interesting
discussions immediately after each presentation as well as at the
end of the workshop. All participants agreed on continued
patronage of the SensorKDD workshop. In addition to the ACM
workshop proceedings, extended papers will be published as
post-workshop proceedings in Springer's well-known “Lecture Notes
in Computer Science” series.
4. SPONSORSHIP
The SensorKDD’09 workshop was sponsored by the Geographic
Information Science and Technology (GIST) Group at Oak Ridge
National Laboratory, the Computational Sciences and Engineering
(CSE) Division at the Oak Ridge National Laboratory, and
Cooperating Objects Network of Excellence (CONET).
5. ACKNOWLEDGMENTS
We would like to thank our sponsors for their kind donations. We
would like to thank the authors of all submitted papers and
presenters. Their innovation and creativity have resulted in a strong
technical program. We are highly indebted to the program
committee members, whose reviews ensured the development of a
competitive and strong technical program. The program
committee members, listed in alphabetical order of last names, are: Adedeji
B. Badiru, Eric Auriol, Albert Bifet, Michaela Black, Jose del
Campo-Avila, Andre Carvalho, Sanjay Chawla, Diane Cook,
Alfredo Cuzzocrea, Christie Ezeife, David J. Erickson III, Yi
Fang, Francisco Ferrer, James H. Garrett, Joydeep Ghosh, Bryan
L. Gorman, Sara Graves, Ray Hickey, Forrest Hoffman, Luke
(Jun) Huan, Volkan Isler, Vandana Janeja, Yu (Cathy) Jiao, Ralf
Klinkenberg, Miroslav Kubat, Vipin Kumar, Mark Last, Chang-
Tien Lu, Elaine Parros Machado de Sousa, Sameep Mehta,
Laurent Mignet, S. Muthu Muthukrishnan, George Ostrouchov,
Rahul Ramachandran, Pedro Rodrigues, Josep Roure, Bernhard
Seeger, Cyrus Shahabi, Shashi Shekhar, Lucio Soibelman,
Alexandre Sorokine, Eduardo J. Spinosa, Karsten Steinhaeuser,
Peng Xu, Eiko Yoneki, Philip S. Yu, Nithya Vijayakumar, and
Guangzhi Qu.
We would like to thank our invited speakers, Dr. Aurélie Lozano
of IBM T.J. Watson Research Center, Mr. Alessandro Donati of
the European Space Agency, Darmstadt, Germany, and Prof. Carlos
Guestrin of Carnegie Mellon University, who, despite their busy
schedules, readily agreed to speak and delivered highly motivating
and informative talks. We would like to thank Dr. Brian Worley,
Director, Computational Sciences and Engineering Division
(CSED), Oak Ridge National Laboratory (ORNL), for his
encouragement, support, and continued patronage of the SensorKDD
workshop series, and Dr. Budhendra Bhaduri, Group Leader,
Geographic Information Science and Technology, CSED, ORNL,
for his enthusiastic support and sponsorship.
This workshop report was compiled by Dr. Olufemi A. Omitaomu
of the Computational Sciences and Engineering Division at Oak
Ridge National Laboratory. The workshop report has been co-
authored by employees of UT-Battelle, LLC, under contract DE-
AC05-00OR22725 with the U.S. Department of Energy. The
United States Government retains, and the publisher by accepting
the article for publication, acknowledges that the United States
Government retains, a non-exclusive, paid-up, irrevocable, world-
wide license to publish or reproduce the published form of this
manuscript, or allow others to do so, for United States
Government purposes.
6. REFERENCES
[1] Auroop R. Ganguly, Joao Gama, Olufemi A. Omitaomu,
Mohamed M. Gaber, and Ranga Raju Vatsavai (2009).
Knowledge Discovery from Sensor Data. New York, NY:
CRC Press, January.
[2] Olufemi A. Omitaomu, Auroop R. Ganguly, Joao Gama,
Ranga Raju Vatsavai, Nitesh V. Chawla, and Mohamed
Medhat Gaber (2009). Proceedings of the Third
International Workshop on Knowledge Discovery from
Sensor Data. Paris, France: ACM Digital Library. Available
online:
http://portal.acm.org/toc.cfm?id=1601966&coll=GUIDE&dl=GUIDE&type=proceeding&idx=SERIES939&part=series&WantType=Proceedings&title=KDD&CFID=49782169&CFTOKEN=54456709.
About the Workshop Organizers:
Dr. Olufemi A. Omitaomu is a research scientist in the
Computational Sciences and Engineering Division at the Oak
Ridge National Laboratory. He is also an adjunct assistant
professor at the University of Tennessee, Knoxville. His research
interests include streaming and real-time data mining,
infrastructure modeling and analysis, machine learning, and signal
processing. He received his Ph.D. in information engineering from
the University of Tennessee. He has published in top
peer-reviewed journals and conferences, and has co-organized and
co-chaired workshops and sessions at professional conferences,
including the ACM Workshop on Knowledge Discovery from Sensor
Data held in conjunction with ACM SIGKDD 2007 and ACM SIGKDD
2008. He previously worked as a data analyst with Mobil
Exploration and Production Company for more than five years.
Dr. Auroop R. Ganguly has been a research scientist within the
Computational Sciences and Engineering Division of the Oak
Ridge National Laboratory since 2004. His research interests are
climate change impacts, geoscience informatics, civil and
environmental engineering, computational data sciences, and
knowledge discovery. Prior to ORNL, he had more than five years
of experience in the software industry, specifically at Oracle
Corporation and a best-of-breed company subsequently acquired
by Oracle, and about a year in academia, specifically at the
University of South Florida in Tampa. He has a PhD from the
Civil and Environmental Engineering department of the
Massachusetts Institute of Technology, several years of research
experience with a group at the MIT Sloan School of Management,
experience in private consulting, and a wide range of peer-
reviewed publications spanning multiple disciplines. Currently, he
is also an adjunct professor at the University of Tennessee in
Knoxville.
Dr. Joao Gama is a researcher at LIAAD-INESC Porto LA, the
Laboratory of Artificial Intelligence and Decision Support of the
University of Porto. His main research interest is learning from
data streams. He has published several articles on change
detection, learning decision trees from data streams, and
hierarchical clustering from streams, among other topics. He has
edited special issues on data streams in Intelligent Data Analysis,
Journal of Universal Computer Science, and New Generation
Computing. He was co-chair of ECML 2005 in Porto, Portugal, and
conference chair of Discovery Science 2009 and of a series of
workshops on Knowledge Discovery in Data Streams held at ECML
2004 (Pisa, Italy), ECML 2005 (Porto, Portugal), ICML 2006
(Pittsburgh, US), ECML 2006 (Berlin, Germany), and SAC 2007
(Korea). Together with Dr. Mohamed M. Gaber, he edited the book
Learning from Data Streams: Processing Techniques in Sensor
Networks, published by Springer.
Dr. Ranga Raju Vatsavai has been conducting research in the area
of spatiotemporal databases and data mining for the past 15 years.
Before joining the Oak Ridge National Laboratory (ORNL) as a
research scientist, he worked at IBM-Research (2004-06; IIT-
Delhi campus), U of Minnesota (1999-2004; Twin-cities campus,
MN), AT&T Labs (1998; Middletown, NJ), Center for
Development of Advanced Computing (1995-98; C-DAC, U of
Pune campus, India), and National Forest Data Management
Center (1990-95; FRI Campus, Dehradun, India). He has
published over thirty peer-reviewed articles and served on
program committees of several international conferences (KDD,
ICTAI, SSTDM). He was also involved in the design and
development of several highly successful software systems
(UMN-MapServer - a world leading open source WebGIS,
*Miner - a spatiotemporal data mining workbench, EASI/PACE
classification modules, and the first parallel softcopy photogrammetry
system for IRS-1C/1D satellites). His broad research interests are
centered on spatial, spatiotemporal databases and data mining,
and computational geoinformatics; in particular he is interested in
statistical pattern recognition, semi-supervised learning, multiple
classifier systems, time series analysis and forecasting,
information retrieval, uncertainty and error handling.
Dr. Nitesh V. Chawla is an assistant professor at the University of
Notre Dame. Dr. Chawla's research interests are broadly in the
areas of data mining, machine learning, pattern recognition, and
their applications. More specifically his research has focused on
learning from massive datasets, distributed data mining/machine
learning, ensemble techniques, cost/distribution sensitive learning,
feature selection, and semi-supervised learning. His research has
also focused on the inter-disciplinary applications such as
intelligent scientific visualization, biometrics, bioinformatics,
natural language processing, and customer analytics.
Dr. Mohamed Medhat Gaber is a research fellow at Monash
University, Australia. He has published more than 60 papers.
Mohamed is the co-editor of the book Learning from Data
Streams: Processing Techniques in Sensor Networks, published
by Springer in 2007. His research interests include data stream
mining, wireless sensor networks, and context-aware computing.
Mohamed has served on the program committees of several
international and local conferences and workshops in the area of
data mining and context-aware computing. He was the co-chair of
the IEEE International Workshop on Mining Evolving and
Streaming Data held in conjunction with ICDM 2006, the
International Workshop on Knowledge Discovery from
Ubiquitous Data Streams held in conjunction with ECML/PKDD
2007, and the First and Second International Workshops on
Knowledge Discovery from Sensor Data held in conjunction with
ACM SIGKDD 2007 and 2008.

Chapters (12)

Sensor networks and pervasive computing systems intimately combine computation, communication and interactions with the physical world, thus increasing the complexity of the development effort, violating communication protocol layering, and making traditional network diagnostics and debugging less effective at catching problems. Tighter coupling between communication, computation, and interaction with the physical world is likely to be an increasing trend in emerging edge networks and pervasive systems. This paper reviews recent tools developed by the authors to understand the root causes of complex interaction bugs in edge network systems that combine computation, communication and sensing. We concern ourselves with automated failure diagnosis in the face of non-reproducible behavior, high interactive complexity, and resource constraints. Several examples are given of finding bugs in real sensor network code using the tools developed, demonstrating the efficacy of the approach.
Histograms are a common technique for density estimation and they have been widely used as a tool in exploratory data analysis. Learning histograms from static and stationary data is a well-known topic. Nevertheless, very few works discuss this problem when we have a continuous flow of data generated from dynamic environments. The scope of this paper is to detect changes from high-speed time-changing data streams. To address this problem, we construct histograms able to process examples once at the rate they arrive. The main goal of this work is to continuously maintain a histogram consistent with the current state of nature. We study strategies to detect changes in the distribution generating examples, and adapt the histogram to the most recent data by forgetting outdated data. We use the Partition Incremental Discretization algorithm that was designed to learn histograms from high-speed data streams. We present a method to detect whenever a change in the distribution generating examples occurs. The base idea consists of monitoring distributions from two different time windows: the reference window, reflecting the distribution observed in the past; and the current window, which receives the most recent data. The current window is cumulative and can have a fixed or an adaptive step depending on the distance between distributions. We compare both distributions using the Kullback-Leibler divergence, defining a threshold for the change detection decision based on the asymmetry of this measure. We evaluate our algorithm with controlled artificial data sets and compare the proposed approach with nonparametric tests. We also present results with real-world data sets from industrial and medical domains. Those results suggest that an adaptive window step exhibits a high probability of change detection and faster detection rates, with few false positive alarms.
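The two-window comparison described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' method: it uses fixed bin edges rather than the Partition Incremental Discretization algorithm, and because the abstract does not spell out how the asymmetry of the Kullback-Leibler divergence enters the threshold, the sketch simply evaluates the divergence in both directions and compares the larger value against a fixed threshold. The names `histogram`, `kl_divergence`, `change_detected`, the bin edges, and the threshold value are all assumptions made for illustration.

```python
import math


def histogram(values, edges):
    """Relative bin frequencies over fixed edges (values outside the edges are ignored)."""
    counts = [1.0] * (len(edges) - 1)  # add-one smoothing so no bin has zero mass
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1.0
                break
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def change_detected(reference, current, edges, threshold=0.1):
    """Flag a change when the larger of the two directed divergences exceeds the threshold.

    The threshold of 0.1 is a hypothetical value, not one taken from the paper.
    """
    p = histogram(reference, edges)
    q = histogram(current, edges)
    return max(kl_divergence(p, q), kl_divergence(q, p)) > threshold


edges = [0, 2, 4, 6, 8, 10]
reference_window = [i % 10 + 0.5 for i in range(100)]  # spread evenly over all bins
same_window = list(reference_window)                    # no change in distribution
shifted_window = [9.0] * 100                            # all mass moves to the last bin

print(change_detected(reference_window, same_window, edges))     # False
print(change_detected(reference_window, shifted_window, edges))  # True
```

Taking the maximum of the two directions is a deliberate simplification: a rule based only on the difference between the two directed divergences would miss changes for which the divergence happens to be symmetric.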
With the emergence of ubiquitous data mining and recent advances in mobile communications, there is a need for visualization techniques to enhance user interaction, real-time decision making, and comprehension of the results of mining algorithms. In this paper we propose a novel architecture for situation-aware adaptive visualization that applies intelligent visualization techniques to data stream mining of sensory data. The proposed architecture incorporates fuzzy logic principles for modeling and reasoning about context/situations and performs gradual adaptation of data mining and visualization parameters according to the occurring situations. A prototype of the architecture is implemented and described in the paper through a real-world scenario in the area of healthcare monitoring.
Recognizing plans of moving agents is a natural goal for many sensor systems, with applications including robotic pathfinding, traffic control, and detection of anomalous behavior. This paper considers plan recognition complicated by the absence of contextual information such as labeled plans and relevant locations. Instead, we introduce two unsupervised methods to simultaneously estimate model parameters and hidden values within a factor graph representing agent transitions over time. We evaluate our approach by applying it to goal prediction in a GPS dataset tracking 1074 ships over 5 days in the English Channel.
Intrusion detection in wireless networks has become a vital part of wireless network security systems with the widespread use of Wireless Local Area Networks (WLANs). Currently, almost all devices are Wi-Fi (Wireless Fidelity) capable and can access a WLAN. This paper proposes an Intrusion Detection System, WiFi Miner, which applies an infrequent pattern association rule mining Apriori technique to wireless network packets captured through hardware sensors for the purpose of real-time detection of intrusive or anomalous packets. Contributions of the proposed system include effectively adapting an efficient data mining association rule technique to the important problem of intrusion detection in a wireless network environment using hardware sensors, providing a solution that eliminates the need for hard-to-obtain training data in this environment, and providing an increased intrusion detection rate and a reduction of false alarms. The WiFi Miner approach is to find frequent and infrequent patterns in pre-processed wireless connection records using the infrequent pattern finding Apriori algorithm proposed in this paper. The proposed Online Apriori-Infrequent algorithm improves the join and prune steps of the traditional Apriori algorithm with a rule that avoids joining itemsets not likely to produce frequent itemsets as their results, thereby improving efficiency and run times significantly. An anomaly score is assigned to each packet (record) based on whether the record has more frequent or infrequent patterns. Connection records with positive anomaly scores have more infrequent patterns than frequent patterns and are considered anomalous packets.
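The scoring idea in the last two sentences of this abstract can be sketched in Python as follows. This is only an illustration under stated assumptions: patterns are found by brute-force counting of itemsets up to size two, which is not the Online Apriori-Infrequent algorithm, and the score is taken to be the number of infrequent patterns in a record minus the number of frequent ones, which is one plausible reading of the abstract. The function names, the record fields, and the support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def split_patterns(records, min_support, max_size=2):
    """Count all itemsets up to max_size and split them by a support threshold.

    Brute-force counting stands in for the paper's mining algorithm here.
    """
    counts = Counter()
    for record in records:
        items = sorted(set(record))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    cutoff = min_support * len(records)
    frequent = {pattern for pattern, count in counts.items() if count >= cutoff}
    infrequent = set(counts) - frequent
    return frequent, infrequent


def anomaly_score(record, frequent, infrequent):
    """Assumed score: infrequent patterns contained in the record minus frequent ones."""
    items = set(record)
    n_frequent = sum(1 for pattern in frequent if pattern <= items)
    n_infrequent = sum(1 for pattern in infrequent if pattern <= items)
    return n_infrequent - n_frequent


# Hypothetical pre-processed connection records (protocol, frame type).
records = [("tcp", "data")] * 8 + [("tcp", "deauth"), ("udp", "deauth")]
frequent, infrequent = split_patterns(records, min_support=0.5)

for record in [("tcp", "data"), ("tcp", "deauth"), ("udp", "deauth")]:
    score = anomaly_score(record, frequent, infrequent)
    print(record, score, "anomalous" if score > 0 else "normal")
```

With these toy records, the common `("tcp", "data")` record scores negative, while the two rare records score positive and would be flagged.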
Real-world sensor time series are often significantly noisier and more difficult to work with than the relatively clean data sets that tend to be used as the basis for experiments in many research papers. In this paper we report on a large case study involving statistical data mining of over 100 million measurements from 1700 freeway traffic sensors over a period of seven months in Southern California. We discuss the challenges posed by the wide variety of different sensor failures and anomalies present in the data. The volume and complexity of the data preclude the use of manual visualization or simple thresholding techniques to identify these anomalies. We describe the application of probabilistic modeling and unsupervised learning techniques to this data set and illustrate how these approaches can successfully detect underlying systematic patterns even in the presence of substantial noise and missing data. Keywords: Probabilistic modeling, MMPP, traffic, loop sensors, Poisson, Markov
The detection of outliers from spatio-temporal data is an important task due to the increasing amount of spatio-temporal data available and the need to understand and interpret it. Due to the limitations of current data mining techniques, new techniques to handle this data need to be developed. We propose a spatio-temporal outlier detection algorithm called Outstretch, which discovers the outlier movement patterns of the top-k spatial outliers over several time periods. The top-k spatial outliers are found using the Exact-Grid Top-k and Approx-Grid Top-k algorithms, which are extensions of algorithms developed by Agarwal et al. [1]. Since they use the Kulldorff spatial scan statistic, they are capable of discovering all outliers, unaffected by neighbouring regions that may contain missing values. After generating the outlier sequences, we show one way they can be interpreted, by comparing them to the phases of the El Niño Southern Oscillation (ENSO) weather phenomenon to provide a meaningful analysis of the results.
Large-scale natural disasters cause external disturbances to networking infrastructure that lead to large-scale network-service disruption. To understand the impact of natural disasters on networks, it is important to localize and analyze network-service disruption after natural disasters occur. This work studies the inference of network-service disruption caused by a real natural disaster, Hurricane Katrina. We perform inference using large-scale Internet measurements and human inputs. We use clustering and feature extraction to reduce the dimensionality of sensory measurements and apply semi-supervised learning to jointly use sensory measurements and human inputs for inference. Our inference shows that after Katrina, approximately 25% of subnets were inferred as unreachable. We find that 62% of unreachable subnets were small subnets at the edges of networks, and 49% of these unreachabilities occurred after the landfall. The majority (73%) of unreachable subnets lasted longer than four weeks, showing that Katrina caused extreme damage to networks and a slow recovery. Network-service disruption is inevitable after large-scale natural disasters occur. Thus, it is crucial to have effective inference techniques to better understand network responses and vulnerabilities to natural disasters.
Analyzing sensor data in pervasive computing applications brings unique challenges to the KDD community. The challenge is heightened when the underlying data source is dynamic and the patterns change. We introduce a new adaptive mining framework that detects patterns in sensor data, and more importantly, adapts to the changes in the underlying model. In our framework, the frequent and periodic patterns of data are first discovered by the Frequent and Periodic Pattern Miner (FPPM) algorithm; and then any changes in the discovered patterns over the lifetime of the system are discovered by the Pattern Adaptation Miner (PAM) algorithm, in order to adapt to the changing environment. This framework also captures vital context information present in pervasive computing applications, such as the startup triggers and temporal information. In this paper, we present a description of our mining framework and validate the approach using data collected in the CASAS smart home testbed.
Sensor data is usually represented by streaming time series. Current state-of-the-art systems for visualization include line plots and three-dimensional representations, which most of the time require screen resolutions that are not available in small transient mobile devices. Moreover, when data presents cyclic behaviors, such as in the electricity domain, predictive models may tend to give higher errors at certain recurrent points in time, but the human eye is not trained to notice these cycles in a long stream. In these contexts, information is usually hard to extract from visualization. New visualization techniques may help to detect recurrent faulty predictions. In this paper we inspect visualization techniques in the scope of a real-world sensor network, briefly delving into future trends in visualization on transient mobile devices. We propose a simple dense pixel display visualization system, exploiting the benefits it may offer for detecting and correcting recurrent faulty predictions. A case study is also presented, where a simple corrective strategy is studied in the context of global electrical load demand, exemplifying the utility of the new visualization method when compared with automatic detection of recurrent errors.
The detection of unusual profiles or anomalous behavioral characteristics from sensor data is especially complicated in security applications where the threat indicators may or may not be known in advance. Predictive modeling of massive volumes of historical data can yield insights on usual or baseline profiles, which in turn can be utilized to isolate unusual profiles when new data are observed in real time. Thus, an incremental anomaly detection approach is proposed. This is a two-stage approach in which the first stage processes the available historical data and develops statistics that are in turn used by the second stage in characterizing the new incoming data for real-time decisions. The first stage adopts a mixture model of probabilistic principal component analyzers to quantify each historical observation by probabilistic measures. The second stage is a chi-square based anomaly detection approach that utilizes the probabilistic measures obtained in the first stage to determine if the incoming data is an anomaly. The proposed anomaly detection approach performs satisfactorily on simulated and benchmark datasets. The approach is also illustrated in the context of detecting commercial trucks that may pose safety and security risks. It is able to consistently identify trucks with anomalous features in the scenarios investigated.
Keywords: transportation security, radioactive materials, incremental knowledge discovery, PPCA, chi-square statistics
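The two-stage idea in this abstract (learn baseline statistics from historical data, then score incoming observations against a chi-square threshold) can be illustrated with a deliberately simplified sketch. It is not the authors' method: a single two-dimensional Gaussian baseline stands in for the mixture of probabilistic principal component analyzers, and the function names `fit_baseline`, `mahalanobis_sq`, and `is_anomaly` are hypothetical. Restricting the example to two features lets the chi-square quantile be written in closed form, so only the standard library is needed.

```python
import math

def fit_baseline(history):
    """Stage 1 (simplified): mean and 2x2 sample covariance of historical 2-D points."""
    n = len(history)
    mx = sum(p[0] for p in history) / n
    my = sum(p[1] for p in history) / n
    sxx = sum((p[0] - mx) ** 2 for p in history) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in history) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in history) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance, using the explicit inverse of a 2x2 covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

def chi2_threshold_2d(alpha):
    """Upper-alpha quantile of chi-square with 2 degrees of freedom: -2 ln(alpha)."""
    return -2.0 * math.log(alpha)

def is_anomaly(point, mean, cov, alpha=0.01):
    """Stage 2 (simplified): flag a new point whose distance exceeds the threshold."""
    return mahalanobis_sq(point, mean, cov) > chi2_threshold_2d(alpha)

# Illustrative data only, not taken from the paper.
history = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mean, cov = fit_baseline(history)
print(is_anomaly((2.0, 2.0), mean, cov))    # inside the baseline spread
print(is_anomaly((10.0, 10.0), mean, cov))  # far from the baseline
```

Under a Gaussian baseline the squared Mahalanobis distance follows a chi-square distribution with as many degrees of freedom as features, which is why a chi-square quantile serves as the decision threshold; with more than two features the quantile would have to come from a statistics library.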
The focus of this paper is the discovery of spatiotemporal neighborhoods in sensor datasets where a time series of data is collected at many spatial locations. The purpose of the spatiotemporal neighborhoods is to provide regions in the data where knowledge discovery tasks, such as outlier detection, can be focused. As building blocks for the spatiotemporal neighborhoods, we have developed a method to generate spatial neighborhoods and a method to discretize temporal intervals. These methods were tested on real-life datasets including (a) sea surface temperature data from the Tropical Atmospheric Ocean Project (TAO) array in the Equatorial Pacific Ocean and (b) a highway sensor network data archive. We have found encouraging results, which are validated by real-life phenomena.
... Fig. 2 shows a generic illustration of how to interpret this curve for different fictional approaches. An efficient, high-performance technique is one with a high detection rate and a low false alarm rate, and hence a large Area Under the Curve (AUC) [13]. From Fig. 2, we can say that approach 2 achieves the best performance compared to approaches 3 and 4. Approach 1 represents the ideal case [14]. ...
Article
Wireless Sensor Networks (WSNs) are one of the main components of the Internet of Things (IoT) for gathering information and monitoring the environment in a variety of applications (medical, agricultural, manufacturing, military, etc.). However, data collected by sensors and sent to the base station may contain outliers. These outliers can occur due to the sensor nodes themselves or to the harsh environment where they are deployed. Thus, WSNs have to detect the outliers and take actions to ensure network quality of service (in terms of reliability, latency, etc.) and to avoid further degradation of application efficiency. In this paper, we propose an enhancement of our previous work to achieve better performance on outlier issues in WSNs. The enhancement tackles the clustering phase and the outlier detection phase. The classification phase remains the same as in the EODCA work, using the Inverse Distance Weighting (IDW) method. The enhancement is titled EEODCA, for Enhanced Efficient Outlier Detection and Classification Algorithm. For evaluation, we conduct a comparison study between EEODCA, EODCA, and another work from the literature for the multivariate data case. Simulation results with both synthetic and real-life datasets show that EEODCA outperforms the studied techniques in terms of several metrics, such as Detection Rate (DR), False Alarm Rate (FAR), Accuracy Rate (ACC), F1 score, and Area Under the Curve (AUC).
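The evaluation metrics named in this abstract can be made concrete with a short sketch. It assumes the usual confusion-matrix definitions (the abstract does not spell out its own formulas), treats label 1 as "outlier" and 0 as "normal", and uses a hypothetical helper name, `outlier_metrics`. AUC is left out because it needs ranked scores, not hard labels.

```python
def outlier_metrics(y_true, y_pred):
    """Return DR, FAR, ACC and F1 for binary labels (1 = outlier, 0 = normal).

    Assumes both classes occur in y_true, so no denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # outliers caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # outliers missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normals kept
    return {
        "DR": tp / (tp + fn),                 # detection rate (recall)
        "FAR": fp / (fp + tn),                # false alarm rate
        "ACC": (tp + tn) / len(pairs),        # accuracy rate
        "F1": 2 * tp / (2 * tp + fp + fn),    # harmonic mean of precision and recall
    }

# Illustrative labels only, not results from the paper.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
preds = [1, 1, 0, 1, 0, 0, 0, 0]
print(outlier_metrics(truth, preds))
```

A detector is preferable when DR is high and FAR is low at the same time; accuracy alone can look good on sensor data simply because outliers are rare.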
... Can the data science methods be carefully designed to avoid spurious generalizations, and to extract physically based patterns that can be interpreted by climate scientists? Solutions for massive data volume and complexity have already made their mark in scientific and engineering disciplines as diverse as biology, astrophysics, and Internet phenomena such as Google or Facebook (Berriman et al., 2010; Langmead et al., 2010; Yang et al., 2011) and spawned new fields of research such as sensor networks (Ganguly et al., 2009a). Climate problems increasingly demand data-driven solutions, but the relevant approaches need to consider relatively unique challenges not present, or not as predominant, in fields where data sciences have proved enormously successful thus far. ...
Article
Extreme events such as heat waves, cold spells, floods, droughts, tropical cyclones, and tornadoes have potentially devastating impacts on natural and engineered systems and human communities worldwide. Stakeholder decisions about critical infrastructures, natural resources, emergency preparedness and humanitarian aid typically need to be made at local to regional scales over seasonal to decadal planning horizons. However, credible climate change attribution and reliable projections at more localized and shorter time scales remain grand challenges. Long-standing gaps include inadequate understanding of processes such as cloud physics and ocean–land–atmosphere interactions, limitations of physics-based computer models, and the importance of intrinsic climate system variability at decadal horizons. Meanwhile, the growing size and complexity of climate data from model simulations and remote sensors increases opportunities to address these scientific gaps. This perspectives article explores the possibility that physically cognizant mining of massive climate data may lead to significant advances in generating credible predictive insights about climate extremes and in turn translating them to actionable metrics and information for adaptation and policy. Specifically, we propose that data mining techniques geared towards extremes can help tackle the grand challenges in the development of interpretable climate projections, predictability, and uncertainty assessments. To be successful, scalable methods will need to handle what has been called "big data" to tease out elusive but robust statistics of extremes and change from what is ultimately small data. Physically based relationships (where available) and conceptual understanding (where appropriate) are needed to guide methods development and interpretation of results. 
Such approaches may be especially relevant in situations where computer models may not be able to fully encapsulate current process understanding, yet the wealth of data may offer additional insights. Large-scale interdisciplinary team efforts, involving domain experts and individual researchers who span disciplines, will be necessary to address the challenge.
... Mining data streams in wireless sensor networks has many important scientific and security applications [19,20]. However, the realization of such applications is faced by two main constraints. ...
Conference Paper
Mining data streams is a critical task of current Big Data applications. Usually, data stream mining algorithms work in resource-constrained environments, which impose novel requirements such as resource awareness and adaptivity. Following this main trend, in this paper we propose a distributed data stream classification technique that has been tested on a real sensor network platform, namely, Sun SPOT. The proposed technique shows several points of research innovation, which are also confirmed by its effectiveness and efficiency assessed in our experimental campaign.
... Such systems are used to control the work of machine operators and to detect system faults. In the first case, human factors are the main source of concept drift, while in the second, the change of the system's context [29, 76]. ...
... Such systems are used to control the work of machine operators and to detect system faults. In the first case, human factors are the main source of concept drift, while in the second, the change of the system's context [136,61,161]. ...
Article
Graph partitioning is an optimization problem that deals with dividing a large geographical transportation network into subnetworks in favor of balancing the workload and minimizing the communication among them. Over the past decades, various models have been developed in such a way to satisfy a multi-objective problem such as delivery time and managerial cost. In real life, because of inevitable changes during network’s lifetime, it is vital to offer survivability and resilience in the existence of network failure and disruption. This paper proposes a partitioning technique called Hierarchical recursive progression1⁺ (HRP1⁺) that is scalable and can deal with large and complex networks. The HRP1⁺ method is an extension of the HRP1 algorithm. The approach is tested on benchmark data-sets using different disruption scenarios, including partial and complete disruptions on network edges and nodes.
Article
In recent years the analysis of data streams has received a lot of attention. This is motivated by the growing number of applications that generate huge amounts of high-speed temporal data, for example sensor networks, computer networks, and manufacturing. Data streams are usually highly evolving, so mining changes in data is a challenging task. In this paper we deal with the structural drift detection problem, where the aim is to discover and describe changes in proximity relations among multiple data streams. We introduce a new strategy whose effectiveness is shown through an application on simulated data.
Article
The article is focused on evaluating the relevance of load profiling information in electrical load forecasting, using neural networks as the forecasting methodology. Different models, with and without load profiling information, were tested and compared, and the importance of the different inputs was investigated, using the concept of partial derivatives to understand the relevance of including this type of data in the input space. The paper presents a model for day-ahead load profile prediction for an area with many consumers. The results were analyzed with a simulated load diagram (to illustrate a distribution feeder) and also with a specific output of a real 60/15 kV distribution substation that feeds a small town. The adopted methodology was successfully implemented and reduced the mean absolute percentage error by between 0.5% and 16%, depending on the nature of the competing methodology used and the forecasted day, with a major benefit regarding the treatment of special days (holidays). The results illustrate an interesting potential for the use of load profiling information in forecasting.
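The error measure quoted in this abstract, the mean absolute percentage error (MAPE), is simple enough to state in code. The sketch below uses the standard definition; the helper name `mape` and the sample load values are illustrative assumptions, not data from the article.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Assumes equal-length sequences and no zero values in `actual`.
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    ratios = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(ratios) / len(ratios)

# Hypothetical hourly loads (e.g. in MW) and two competing forecasts.
actual = [100.0, 200.0, 150.0]
without_profiles = [110.0, 180.0, 165.0]
with_profiles = [104.0, 192.0, 156.0]
print(mape(actual, without_profiles))
print(mape(actual, with_profiles))
```

Because MAPE divides by the actual load, it is comparable across feeders of different size, but it is undefined whenever an actual value is zero.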
Article
The scan statistic is commonly used to test if a one-dimensional point process is purely random, or if any clusters can be detected. Here it is simultaneously extended in three directions: (i) a spatial scan statistic for the detection of clusters in a multi-dimensional point process is proposed, (ii) the area of the scanning window is allowed to vary, and (iii) the baseline process may be any inhomogeneous Poisson process or Bernoulli process with intensity proportional to some known function. The main interest is in detecting clusters not explained by the baseline process. These methods are illustrated on an epidemiological data set, but there are other potential areas of application as well.
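A variable-window scan with a known baseline can be sketched in one dimension. The code below assumes the standard Poisson likelihood-ratio form of the statistic (the abstract itself gives no formula), takes a baseline whose expected counts sum to the observed total, and omits the Monte Carlo step normally used to judge significance. The names `poisson_llr` and `scan` are hypothetical.

```python
import math

def poisson_llr(c, mu, total):
    """Log-likelihood ratio for a window with c observed and mu expected events.

    `total` is the overall event count; only windows with an excess (c > mu) score.
    """
    if c <= mu:
        return 0.0
    rest = total - c
    inside = c * math.log(c / mu)
    outside = rest * math.log(rest / (total - mu)) if rest > 0 else 0.0
    return inside + outside

def scan(counts, expected, max_len):
    """Slide windows of length 1..max_len over the cells and keep the best score.

    Returns ((start, end), llr) with a half-open window, or (None, 0.0) if no
    window shows an excess. Assumes sum(expected) equals sum(counts).
    """
    n, total = len(counts), sum(counts)
    best_window, best_llr = None, 0.0
    for start in range(n):
        for end in range(start + 1, min(n, start + max_len) + 1):
            if end - start == n:
                continue  # the full range has nothing outside to compare with
            c = sum(counts[start:end])
            mu = sum(expected[start:end])
            llr = poisson_llr(c, mu, total)
            if llr > best_llr:
                best_window, best_llr = (start, end), llr
    return best_window, best_llr

# Illustrative counts with a homogeneous baseline (21 events over 6 cells).
counts = [1, 1, 8, 9, 1, 1]
expected = [3.5] * 6
print(scan(counts, expected, max_len=4))
```

An inhomogeneous baseline is handled by passing unequal `expected` values, for instance proportional to the population at risk in each cell, so that a window scores highly only when its count exceeds what the baseline already explains.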
Article
With the increase in the size of data sets, data mining has recently become an important research topic and is receiving substantial interest from both academia and industry. At the same time, interest in temporal databases has been increasing and a growing number of both prototype and implemented systems are using an enhanced temporal understanding to explain aspects of behavior associated with the implicit time-varying nature of the universe. This paper investigates the confluence of these two areas, surveys the work to date, and explores the issues involved and the outstanding problems in temporal data mining.
Chapter
Advanced statistical modeling and knowledge representation techniques for a newly emerging area of machine learning and probabilistic reasoning; includes introductory material, tutorials for different proposed approaches, and applications. Handling inherent uncertainty and exploiting compositional structure are fundamental to understanding and designing large-scale systems. Statistical relational learning builds on ideas from probability theory and statistics to address uncertainty while incorporating tools from logic, databases and programming languages to represent structure. In Introduction to Statistical Relational Learning, leading researchers in this emerging area of machine learning describe current formalisms, models, and algorithms that enable effective and robust reasoning about richly structured systems and data. The early chapters provide tutorials for material used in later chapters, offering introductions to representation, inference and learning in graphical models, and logic. The book then describes object-oriented approaches, including probabilistic relational models, relational Markov networks, and probabilistic entity-relationship models as well as logic-based formalisms including Bayesian logic programs, Markov logic, and stochastic logic programs. Later chapters discuss such topics as probabilistic models with unknown objects, relational dependency networks, reinforcement learning in relational domains, and information extraction. By presenting a variety of approaches, the book highlights commonalities and clarifies important differences among proposed approaches and, along the way, identifies important representational and algorithmic issues. Numerous applications are provided throughout.
Article
A gridded dataset of historical daily precipitation for South America is now available. The data are combined in a simple manner into daily 1° and 2.5° gridded fields for the period 1940-2003. The fields provided are daily precipitation totals and station counts. The counts give the number of stations that are included in each grid point for each day. The dataset is expected to help improve the understanding of precipitation variability.
Book
The problem of outliers is one of the oldest in statistics, and during the last century and a half interest in it has waxed and waned several times. Currently it is once again an active research area after some years of relative neglect, and recent work has solved a number of old problems in outlier theory, and identified new ones. The major results are, however, scattered amongst many journal articles, and for some time there has been a clear need to bring them together in one place. That was the original intention of this monograph: but during execution it became clear that the existing theory of outliers was deficient in several areas, and so the monograph also contains a number of new results and conjectures. In view of the enormous volume of literature on the outlier problem and its cousins, no attempt has been made to make the coverage exhaustive. The material is concerned almost entirely with the use of outlier tests that are known (or may reasonably be expected) to be optimal in some way. Such topics as robust estimation are largely ignored, being covered more adequately in other sources. The numerous ad hoc statistics proposed in the early work on the grounds of intuitive appeal or computational simplicity also are not discussed in any detail.
Article
An alternative model for analyzing the relationship between variables is the one known as "classification and regression trees" (CART); although the underlying idea is the same, it is preferable to analyze classification trees separately from regression trees.