Article

REMI: A framework of reusable elements for mining heterogeneous data with missing information: A Tale of Congestion in Two Smart Cities


Abstract and Figures

Applications targeting smart cities tackle common challenges; however, solutions are seldom portable from one city to another due to the heterogeneity of smart city ecosystems. A major obstacle involves the differences in the levels of available information. In this work, we present REMI, a mining framework that handles varying degrees of information availability by providing a meta-solution to missing data. The framework's core concept is the REMI layered stack architecture, offering two complementary approaches to dealing with missing information, namely data enrichment (DARE) and graceful degradation (GRADE). DARE aims at inferring missing information levels, while GRADE attempts to mine the patterns using only the existing data. We show that REMI supports reuse in multiple ways, while being fault tolerant and enabling incremental development. One may apply the architecture to different problem instantiations within the same domain, or deploy it across various domains. Furthermore, we introduce the other three components of the REMI framework backing the layered stack. To support decision making in this framework, we show a mapping of REMI into an optimization problem (OTP) that balances the trade-off between three costs: inaccuracies in inference of missing data (DARE), errors when using less information (GRADE), and gathering of additional data. Further, we provide an experimental evaluation of REMI using real-world transportation data coming from two European smart cities, namely Dublin and Warsaw.
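As a rough illustration of the decision the abstract describes, the following Python sketch compares the three costs and picks the cheapest way to cope with a missing information level; the function, the option labels, and the numeric costs are illustrative assumptions, not the paper's actual OTP formulation.

# Illustrative sketch (not the paper's formulation): choose between inferring
# the missing level (DARE), mining with less information (GRADE), or gathering
# the missing data, by comparing their expected costs.

def choose_strategy(cost_dare_error, cost_grade_error, cost_gathering):
    """Return the cheapest strategy for a given missing information level."""
    options = {
        "DARE": cost_dare_error,     # expected cost of inference inaccuracies
        "GRADE": cost_grade_error,   # expected cost of mining with less information
        "GATHER": cost_gathering,    # cost of acquiring the missing data directly
    }
    return min(options, key=options.get)

# Example: gathering is expensive and degradation is cheapest, so "GRADE" wins.
print(choose_strategy(cost_dare_error=0.4, cost_grade_error=0.25, cost_gathering=1.0))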
Journal of Intelligent Information Systems (2018) 51: 367–388
https://doi.org/10.1007/s10844-018-0524-5
Avigdor Gal · Dimitrios Gunopulos · Nikolaos Panagiotou · Nicolo Rivetti · Arik Senderovich · Nikolas Zygouras
Received: 25 July 2017 / Revised: 26 July 2018 / Accepted: 31 July 2018 / Published online: 11 August 2018
© Springer Science+Business Media, LLC, part of Springer Nature 2018
Keywords: Reusable elements · Missing information · Mining · Complex patterns · Enrichment · Graceful degradation
Avigdor Gal
avigal@technion.ac.il
Dimitrios Gunopulos
dg@di.uoa.gr
... The correlations between the travel times of nearby links and across different time slots are crucial for inferring the traffic state of a particular link [Niu et al. 2014; Zhang et al. 2016]. Online methods that determine the time required by a bus to reach a specified bus stop were proposed in [Gal et al. 2017], [Gal et al. 2018] and [Yu et al. 2011]. In [Wang et al. 2016b] the authors propose a method that estimates the travel time by identifying near-neighbor trajectories with a similar origin and destination. ...
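To make the idea in this excerpt concrete, here is a minimal Python sketch of correlation-based imputation of a link's travel time from its neighbours; the weighting scheme, the scale adjustment, and the fallback to the historical mean are assumptions for illustration, not the method of the cited works.

# Hedged sketch: estimate one link's current travel time from correlated
# neighbouring links; data layout and weighting are illustrative assumptions.
import numpy as np

def impute_travel_time(target_history, neighbour_histories, neighbour_now):
    """target_history: 1-D array of past travel times on the target link;
    neighbour_histories: dict link id -> 1-D array aligned with target_history;
    neighbour_now: dict link id -> current travel-time observation."""
    weights, estimates = [], []
    for link, history in neighbour_histories.items():
        now = neighbour_now.get(link)
        if now is None:
            continue
        r = np.corrcoef(target_history, history)[0, 1]    # historical correlation
        if np.isnan(r) or r <= 0:
            continue
        scale = target_history.mean() / history.mean()    # map neighbour scale to target
        weights.append(r)
        estimates.append(scale * now)
    if not weights:
        return float(target_history.mean())               # no usable neighbour: use history
    return float(np.average(estimates, weights=weights))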
Conference Paper
Travel time estimation is a critical task, useful to many urban applications at the individual citizen and the stakeholder level. This paper presents a novel hybrid algorithm for travel time estimation that leverages historical and sparse real-time trajectory data. Given a path and a departure time, we estimate the travel time taking into account historical information, real-time trajectory data, and the correlations among different road segments. We detect similar road segments using historical trajectories and use a latent representation to model the similarities. Our experimental evaluation demonstrates the effectiveness of our approach.
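A minimal sketch of the kind of historical/real-time blend described above, assuming a simple per-segment weighting by observation age; the blending rule and the alpha parameter are illustrative, not the paper's actual model.

# Hedged sketch: blend historical means with sparse real-time observations,
# trusting fresher observations more; all parameters are assumptions.
def estimate_path_travel_time(path, historical_mean, realtime_obs, alpha=0.7):
    """path: list of segment ids; historical_mean: segment id -> mean travel time
    for the departure time slot; realtime_obs: segment id -> (travel time,
    age in minutes) from recent trajectories, possibly missing many segments."""
    total = 0.0
    for seg in path:
        hist = historical_mean[seg]
        if seg in realtime_obs:
            obs, age = realtime_obs[seg]
            w = alpha * max(0.0, 1.0 - age / 30.0)   # fresher observations carry more weight
            total += w * obs + (1.0 - w) * hist
        else:
            total += hist                            # no live data for this segment
    return total

# Example with hypothetical segment ids and times (in seconds).
print(estimate_path_travel_time(
    ["s1", "s2"], {"s1": 80.0, "s2": 120.0}, {"s2": (150.0, 5)}))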
... Big data are relevant not only in data mining problems (e.g., [9,10]) but also in web applications and systems (e.g., [3,11]). The resulting research issues have prompted a wide and vibrant research effort devoted to devising innovative models, techniques and algorithms for effectively and efficiently supporting big data processing under a wide collection of research perspectives, ranging from big data management to information retrieval from big data sources, from big data analytics to big knowledge processing, and from big data applications to machine learning tools for big data analysis (e.g., [2,7]). This is complemented by exciting advanced topics such as privacy (e.g., [8]) and opinion mining (e.g., [1]) in social networks. ...
Conference Paper
This paper provides an overview of the workshops co-located with the 27th ACM International Conference on Information and Knowledge Management (CIKM 2018), held during October 22-26, 2018 in Turin, Italy.
Chapter
This paper gives a brief overview of the issues and challenges of the emerging topic of flexible querying and analytics for smart cities and smart societies, which is closely related to the current big data management and analytics research trend, along with a brief overview of the FQAS 2019 international conference.
Article
Full-text available
Natural and anthropogenic hazards are frequently responsible for disaster events, leading to damaged physical infrastructure, which can result in loss of electrical power for affected locations. Remotely-sensed, nighttime satellite imagery from the Suomi National Polar-orbiting Partnership (Suomi-NPP) Visible Infrared Imaging Radiometer Suite (VIIRS) Day/Night Band (DNB) can monitor power outages in disaster-affected areas through the identification of missing city lights. When combined with locally-relevant geospatial information, these observations can be used to estimate power outages, defined as geographic locations requiring manual intervention to restore power. In this study, we produced a power outage product based on Suomi-NPP VIIRS DNB observations to estimate power outages following Hurricane Sandy in 2012. This product, combined with known power outage data and ambient population estimates, was then used to predict power outages in a layered, feedforward neural network model. We believe this is the first attempt to synergistically combine such data sources to quantitatively estimate power outages. The VIIRS DNB power outage product was able to identify initial loss of light following Hurricane Sandy, as well as the gradual restoration of electrical power. The neural network model predicted power outages with reasonable spatial accuracy, achieving Pearson coefficients (r) between 0.48 and 0.58 across all folds. Our results show promise for producing a continental United States (CONUS)- or global-scale power outage monitoring network using satellite imagery and locally-relevant geospatial data.
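As a hedged illustration of the modelling setup described above, the sketch below trains a small feedforward regressor with 5-fold cross-validation and reports a Pearson coefficient per fold; the two input features and the synthetic data are placeholders, not the study's actual inputs or results.

# Hedged sketch: feedforward network on placeholder features, evaluated by
# Pearson r per fold (mirroring the per-fold evaluation described above).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))      # e.g. [nighttime-light radiance drop, ambient population]
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=500)   # synthetic outage signal

model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0))
for fold, (train, test) in enumerate(KFold(n_splits=5, shuffle=True, random_state=0).split(X)):
    model.fit(X[train], y[train])
    r = np.corrcoef(y[test], model.predict(X[test]))[0, 1]
    print(f"fold {fold}: Pearson r = {r:.2f}")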
Conference Paper
Full-text available
Alongside the increasing digitization and interconnection in almost every domain of society and business, data is growing exponentially. Worldwide Internet traffic is expected to triple by 2020 compared with 2015, with the transmitted data volume rising from 53.2 exabytes per month to 161 exabytes per month [Cisco, 2016]. Cities and communities can support the provisioning and usage of urban data and benefit from the resulting new services for monitoring, understanding, decision making, steering, and control. Providing urban data is also supported by the ongoing movement to open governmental data, but goes beyond it. Urban data can include data from public, industrial, scientific or private sources. Yet the design of urban data is still evolving, and numerous initiatives and standardization efforts on smart cities and communities lay the groundwork for the uptake and interoperability of urban data.
Conference Paper
Full-text available
Increasing attention has been paid of late to the problem of detecting and explaining "deviant" process instances, i.e. instances diverging from normal/desired outcomes (e.g., frauds, faults, SLA violations), based on log data. Current solutions make it possible to discriminate between deviant and normal instances by combining the extraction of (sequence-based) behavioral patterns with standard classifier-induction methods. However, there is no general consensus on which kinds of patterns are most suitable for such a task, while mixing multiple pattern families together produces a cumbersome, redundant representation of log data that may well confuse the learner. We propose an ensemble-learning approach to this deviance mining task, where multiple base learners are trained on different feature-based views of the given log (each obtained using a distinct family of patterns). The final model, induced through a stacking procedure, can implicitly reason over heterogeneous kinds of structural features by leveraging the predictions of the base models. To make the discovered models more effective, the approach leverages resampling techniques and exploits non-structural process data. The approach was implemented and tested on real-life logs, where it reached compelling performance with respect to state-of-the-art methods.
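A minimal sketch of stacking base learners trained on different feature views, using scikit-learn as a stand-in; the two views, their column indices, and the chosen classifiers are illustrative assumptions rather than the paper's exact setup.

# Hedged sketch: each base learner sees a different feature view of the log;
# a meta-learner is stacked on top of their predictions.
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

view_a = ColumnTransformer([("a", "passthrough", [0, 1, 2])])   # e.g. one pattern family
view_b = ColumnTransformer([("b", "passthrough", [3, 4, 5])])   # e.g. another pattern family

stack = StackingClassifier(
    estimators=[
        ("rf_view_a", make_pipeline(view_a, RandomForestClassifier(random_state=0))),
        ("lr_view_b", make_pipeline(view_b, LogisticRegression(max_iter=1000))),
    ],
    final_estimator=LogisticRegression(max_iter=1000),   # meta-learner over base predictions
    cv=5,
)

X, y = make_classification(n_samples=300, n_features=6, random_state=0)   # placeholder data
stack.fit(X, y)
print("training accuracy:", stack.score(X, y))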
Article
Full-text available
Geospatial intelligence has traditionally relied on the use of archived and unvarying data for planning and exploration purposes. In consequence, the tools and methods that are architected to provide insight and generate projections rely only on such datasets. Although this approach has proven effective in several cases, such as land-use identification and route mapping, it has severely restricted the ability of researchers to incorporate current information into their work. This approach is inadequate in scenarios requiring real-time information to act and adjust in ever-changing dynamic environments, such as evacuation and rescue missions. In this work, we propose PlanetSense, a platform for geospatial intelligence that is built to harness the existing power of archived data and add to it the dynamics of real-time streams, seamlessly integrated with sophisticated data mining algorithms and analytics tools for generating operational intelligence on the fly. The platform has four main components: (i) GeoData Cloud, a data architecture for storing and managing disparate datasets; (ii) a mechanism to harvest real-time streaming data; (iii) a data analytics framework; and (iv) presentation and visualization through a web interface and RESTful services. Using two case studies, we underpin the necessity of our platform in modeling ambient population and building occupancy at scale.
Article
Data-driven modeling usually suffers from data sparsity, especially for large-scale modeling of urban phenomena based on single-source urban-infrastructure data under fine-grained spatial-temporal contexts. To address this challenge, we motivate, design, and implement UrbanCPS, a cyber-physical system with heterogeneous model integration, based on extremely large multi-source infrastructures in the Chinese city of Shenzhen, involving 42,000 vehicles, 10 million residents, and 16 million smartcards. Taking temporal, spatial, and contextual factors into account, we formulate an optimization problem of how to optimally integrate models built on highly diverse datasets under three practical issues: heterogeneity of models, input data sparsity, and unknown ground truth. We further propose a real-world application called Speedometer, which infers real-time traffic speeds in urban areas. The evaluation results show that, compared to a state-of-the-art system, Speedometer increases the inference accuracy by 29% on average.
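As a rough illustration of integrating heterogeneous models against sparse ground truth, the sketch below learns non-negative combination weights by least squares; the three models, the numbers, and the weighting scheme are assumptions for illustration, not the UrbanCPS formulation.

# Hedged sketch: learn non-negative weights for combining heterogeneous speed
# models on the few segments where a ground-truth speed is available.
import numpy as np
from scipy.optimize import nnls

# Rows: road segments with known ground truth; columns: predictions of three
# hypothetical models (e.g. taxi-, bus-, and smartcard-based), in km/h.
P = np.array([[32.0, 30.5, 35.0],
              [18.0, 21.0, 17.5],
              [45.0, 44.0, 47.0],
              [27.0, 25.0, 29.0]])
truth = np.array([33.0, 19.0, 46.0, 27.5])

weights, _ = nnls(P, truth)          # non-negative least squares fit
weights /= weights.sum()             # optional: normalise to a convex combination
print("model weights:", weights)
print("integrated estimate for the first segment:", P[0] @ weights)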
Conference Paper
Applications such as autonomous driving or real-time route recommendations require up-to-date and accurate digital maps. However, manually creating and updating such maps is too costly to meet the rising demand. As large collections of GPS trajectories become widely available, constructing and updating maps from such collections can greatly reduce this cost. Unfortunately, due to GPS noise and varying trajectory sampling rates, inferring maps from GPS trajectories can be very challenging. In this paper, we present a framework to create up-to-date maps with rich knowledge from GPS trajectory collections. Starting from an unstructured GPS point cloud, we discover road segments using novel graph-based clustering techniques with prior knowledge of road design. Based on road segments, we develop a scale- and orientation-invariant traj-SIFT feature to localize and recognize junctions using a supervised learning framework. Maps with rich knowledge are created based on the discovered road segments and junctions. Compared to state-of-the-art methods, our approach can efficiently construct high-quality maps at city scale from large collections of GPS trajectories.
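A hedged sketch of one way to group GPS points into candidate road segments, using density-based clustering on position plus heading as a simple stand-in for the graph-based clustering described above; the encoding and parameters are illustrative assumptions.

# Hedged sketch: cluster GPS points by position and heading so that nearby
# points travelling in opposite directions end up in different clusters.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_gps_points(xy, heading_deg, eps=15.0, heading_weight=20.0):
    """xy: (n, 2) metric coordinates; heading_deg: (n,) bearings in degrees.
    Returns a cluster label per point (-1 marks noise)."""
    rad = np.radians(heading_deg)
    features = np.column_stack([
        xy,
        heading_weight * np.cos(rad),   # weight controls how strongly direction splits clusters
        heading_weight * np.sin(rad),
    ])
    return DBSCAN(eps=eps, min_samples=10).fit_predict(features)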
Article
The drive toward smart cities, alongside the increasing adoption of personal sensors, is leading to big sensor data that is so large and complex that traditional methods for utilizing it are inadequate. Although systems exist for storing and managing large-scale sensor data, the real value of such data lies in the insights it can enable. However, no current platform takes sensor data from collection through use in models to produce useful data products. This article explores key challenges and introduces the Concinnity sensor data platform. Concinnity takes sensor data from collection to final product via a cloud-based data repository and an easy-to-use workflow system. It supports rapid development of applications built on sensor data using data fusion and the integration and composition of models to form novel workflows. These key features enable value to be efficiently derived from sensor data.
Article
Urban mobility impacts urban life to a great extent. To enhance urban mobility, much research has been invested in travel time prediction: given an origin and destination, provide a passenger with an accurate estimate of how long the journey will last. In this work, we investigate a novel combination of methods from Queueing Theory and Machine Learning in the prediction process. We propose a prediction engine that, given a scheduled bus journey (route) and a 'source/destination' pair, provides an estimate of the traveling time, while considering both historical data and real-time streams of information transmitted by buses. We propose a model that uses natural segmentation of the data according to bus stops and a set of predictors, some of which use learning while others are learning-free, to compute traveling time. Our empirical evaluation, using data from the bus network in the city of Dublin, demonstrates that the snapshot principle, taken from Queueing Theory, works well yet suffers from outliers. To overcome the outlier problem, we use Machine Learning techniques as a regulator that assists in identifying outliers, and propose prediction based on historical data.
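A minimal sketch of the snapshot idea with a simple historical fallback acting as an outlier regulator; the threshold rule and the data layout are illustrative assumptions, not the paper's actual predictors.

# Hedged sketch: the snapshot predictor reuses the segment times of the most
# recent bus that traversed the remaining route; if that looks like an outlier
# relative to history, fall back to the historical estimate.
def predict_remaining_time(snapshot_times, historical_times, outlier_ratio=2.0):
    """snapshot_times: per-segment times observed for the most recent bus;
    historical_times: per-segment historical means for this time of day."""
    snapshot = sum(snapshot_times)
    historical = sum(historical_times)
    if snapshot > outlier_ratio * historical or snapshot < historical / outlier_ratio:
        return historical                      # snapshot deemed an outlier
    return snapshot

# Example with hypothetical segment times (in seconds): the snapshot is accepted.
print(predict_remaining_time([60, 75, 90], [70, 70, 85]))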