Predicting Transportation Modes of GPS
Trajectories using Feature Engineering and
Noise Removal
Mohammad Etemad¹, Amílcar Soares Júnior¹, and Stan Matwin¹,²
¹ Institute for Big Data Analytics, Dalhousie University, Halifax
² Institute for Computer Science, Polish Academy of Sciences, Warsaw
Abstract. Understanding transportation mode from GPS (Global Positioning System) traces is an essential topic in the data mobility domain. In this paper, a framework is proposed to predict transportation modes. This framework follows a sequence of five steps: (i) data preparation, where GPS points are grouped in trajectory samples; (ii) point features generation; (iii) trajectory features extraction; (iv) noise removal; (v) normalization. We show that extracting the new point features (bearing rate and the rate of change of the bearing rate), together with global and local trajectory features such as medians and percentiles, enables many classifiers to achieve high accuracy (96.5%) and f1 (96.3%) scores. We also show that the noise removal task affects the performance of all the models tested. Finally, empirical tests comparing this work against state-of-the-art transportation mode prediction strategies show that our framework is competitive and outperforms most of them.
Keywords: Feature engineering, Noise removal, Trajectory classification
1 Introduction
Research on trajectory analysis is a mature area since positioning devices are now
used to track people, vehicles, vessels, and animals. In the case of trajectory data,
the object’s movement is represented as a discrete collection of spatiotemporal
points.
A domain where trajectories are frequently analyzed is the prediction of
transportation modes from users, which is essential for cities and people to reduce
travel time and traffic congestion. Transportation mode estimation involves two
steps [11]: (i) extraction of segments of the same transportation modes; and
(ii) classification of transportation modes for each segment. For the first step,
several segmentation algorithms have been proposed in the past years and include
temporal-based [8], cost function-based [5] and semantic-based methods [7]. For
the second step, which is the focus of this work, the classification (or prediction)
of the transportation modes is performed by creating domain expert features for
supervised classification (e.g., the distance between consecutive points, velocities,
acceleration, and bearing).
arXiv:1802.10164v1 [cs.OH] 27 Feb 2018
We classify the research in transportation modes prediction regarding the
type of features in two branches: (i) domain expert features; and (ii) learned
features. From raw GPS data points (e.g., latitude, longitude and time) it is
possible to calculate many attributes regarding the moving object’s movement.
Examples include distance traveled between points, estimated speed, bearing,
acceleration, etc. For segments of trajectories, it is possible to extract mean,
median, minimum, maximum, standard deviations, etc., of point-wise features.
These are examples of domain expert features employed to predict transportation
modes. Examples of works that apply domain expert features include [6,11].
In this work, we also explore the effects of noise removal in the prediction
of transportation modes. Dealing with noise in trajectories is essential because
GPS recorder devices are not accurate in the moving object’s positioning due to
many reasons like satellite geometry, signal blockage, atmospheric conditions,
and receiver design features/quality. By removing GPS noise, it is expected
that the derived features from the trajectories are more likely to represent the
standard pattern of a transportation mode.
Noise-perturbed GPS data degrades the quality of the domain expert features; for example, distance traveled, speed, and acceleration are all susceptible to errors. It is important to point out that these errors may distort the distributions of values, and therefore statistics like the mean, in trajectory segments of transportation modes. This uncertainty in the data can lead a classifier to create models that are not able to accurately predict a transportation mode from a trajectory. Thus, the works in transportation mode prediction can be classified regarding the (i) presence or (ii) absence of noise removal strategies. An example of work in transportation mode prediction that does not deal with noise removal is [11]. In others, like [10,4,1,2,9], noise is removed. This paper applies domain expert features and noise removal to predict transportation modes. Its contributions are as follows: (i) we introduce new point and trajectory features; (ii) we propose a framework composed of five steps for transportation mode prediction; (iii) we compare the proposed approach with state-of-the-art strategies and show that our results are competitive.
2 A framework for transportation mode prediction
In this section, we present the sequence of steps used in this work to predict
transportation modes (Figure 1). This framework has five steps and is described
in detail below.
In this work, we define a trajectory as a sequence of GPS points that belong to the same transportation mode. In the first step (step 1), we group the raw GPS points by user id, day, and transportation mode to create trajectory samples. We discard trajectory samples with fewer than 10 GPS points, because such short samples yield low-quality trajectories that may harm our model.
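Step 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the record layout `(user_id, timestamp, lat, lon, mode)` and the names `group_into_trajectories` and `MIN_POINTS` are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from datetime import datetime

MIN_POINTS = 10  # samples with fewer points are discarded (threshold from the text)

def group_into_trajectories(records, min_points=MIN_POINTS):
    """Group labeled GPS records by (user id, day, transportation mode).

    `records` is an iterable of (user_id, timestamp, lat, lon, mode) tuples,
    a hypothetical layout assumed for this sketch.
    """
    groups = defaultdict(list)
    for user_id, ts, lat, lon, mode in records:
        groups[(user_id, ts.date(), mode)].append((ts, lat, lon))

    trajectories = {}
    for key, points in groups.items():
        if len(points) >= min_points:
            trajectories[key] = sorted(points)  # order points by timestamp
    return trajectories
```

Each key of the returned dictionary identifies one trajectory sample; groups below the length threshold are dropped before any feature is computed.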
In this work, we calculate some point features (step 2) that were used previously in the literature [11]: distance, speed, acceleration, jerk [1], and bearing.
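A minimal sketch of how these five point features could be derived from consecutive GPS points is given below. The text does not state the exact formulas, so the haversine distance, the initial-bearing formula, the Earth radius constant, and the convention of setting the first value of each series to 0.0 are all assumptions of this example; the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres (assumed constant)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees within [0, 360)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlmb)
    return math.degrees(math.atan2(y, x)) % 360.0

def point_features(points):
    """points: list of (t_seconds, lat, lon) tuples sorted by time.

    Returns a dict of per-point lists; index 0 of every list stays 0.0.
    """
    n = len(points)
    names = ("distance", "speed", "acceleration", "jerk", "bearing")
    feats = {name: [0.0] * n for name in names}
    for i in range(1, n):
        t0, lat0, lon0 = points[i - 1]
        t1, lat1, lon1 = points[i]
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        feats["distance"][i] = haversine_m(lat0, lon0, lat1, lon1)
        feats["bearing"][i] = bearing_deg(lat0, lon0, lat1, lon1)
        feats["speed"][i] = feats["distance"][i] / dt
        feats["acceleration"][i] = (feats["speed"][i] - feats["speed"][i - 1]) / dt
        feats["jerk"][i] = (feats["acceleration"][i] - feats["acceleration"][i - 1]) / dt
    return feats
```

Speed, acceleration, and jerk are successive finite differences over time, so each one needs one more preceding point than the last before it carries a meaningful value.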
Two new features are introduced in this work, named bearing rate, and the
rate of bearing rate. They are detailed as follows. The bearing rate was computed
Fig. 1. The steps of the proposed framework to predict transportation modes
using Eq. 1, where $B_i$ and $B_{i+1}$ are the bearing values at points $i$ and $i+1$, and $\Delta t$ is the time difference.
$$B_{rate(i+1)} = \frac{B_{i+1} - B_i}{\Delta t} \qquad (1)$$
Some moving objects change their bearing more often than others, depending on how straightforward the route they commute along is. This behavior can be captured by using the rate of the bearing rate. This feature is calculated using Eq. 2.
$$B_{rrate(i+1)} = \frac{B_{rate(i+1)} - B_{rate(i)}}{\Delta t} \qquad (2)$$
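Eqs. 1 and 2 are the same finite difference applied twice, which the following sketch makes explicit. It follows the equations literally and therefore does not handle angle wrap-around (for example, a change from 350 to 10 degrees), which the text does not discuss; the function names and the choice of 0.0 for the first entry are assumptions of this example.

```python
def rate_of_change(values, times):
    """Finite difference (v[i] - v[i-1]) / (t[i] - t[i-1]); first entry is 0.0."""
    rates = [0.0] * len(values)
    for i in range(1, len(values)):
        dt = times[i] - times[i - 1]
        if dt > 0:  # leave 0.0 when timestamps repeat
            rates[i] = (values[i] - values[i - 1]) / dt
    return rates

def bearing_features(bearings, times):
    """Return (bearing rate, rate of bearing rate) for one trajectory."""
    b_rate = rate_of_change(bearings, times)   # Eq. 1
    b_rrate = rate_of_change(b_rate, times)    # Eq. 2
    return b_rate, b_rrate
```

For bearings of 0, 10, 30, and 30 degrees sampled every 2 seconds, the bearing rate is 0, 5, 10, 0 degrees per second and the rate of bearing rate is 0, 2.5, 2.5, -5.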
After calculating all the point features for each trajectory, we extract some statistical attributes referred to as trajectory features (step 3). Trajectory features are divided into two different types: (i) global trajectory features, which summarize information regarding the whole trajectory in a single value; and (ii) local trajectory features, which describe a local part of the trajectory. In this work, we extracted global features like the Minimum, Maximum, Mean, Median, and Standard Deviation values of each trajectory point feature to feed our classifier. The local trajectory features extracted in this work were the percentiles of every point feature. Five different percentiles were extracted (10, 25, 50, 75, and 90) and were used in the models tested in this work. In summary, we compute 70 trajectory features (10 statistical measures, five global and five local, calculated for each of the 7 point features) for each transportation mode example.
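The feature count can be checked with a short sketch that turns the per-point series of one trajectory into the 70-value feature vector. The percentile interpolation rule (linear interpolation between closest ranks) and the use of the population standard deviation are assumptions, since the text does not specify them; the function names are hypothetical.

```python
import statistics

PERCENTILES = (10, 25, 50, 75, 90)  # local features listed in the text

def percentile(sorted_vals, q):
    """q-th percentile with linear interpolation between closest ranks."""
    if len(sorted_vals) == 1:
        return float(sorted_vals[0])
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def trajectory_features(point_feats):
    """point_feats: dict mapping point-feature name -> list of per-point values."""
    out = {}
    for name, values in point_feats.items():
        vals = sorted(values)
        # Five global features.
        out[f"{name}_min"] = vals[0]
        out[f"{name}_max"] = vals[-1]
        out[f"{name}_mean"] = statistics.fmean(vals)
        out[f"{name}_median"] = statistics.median(vals)
        out[f"{name}_std"] = statistics.pstdev(vals)  # population std (assumed)
        # Five local features.
        for q in PERCENTILES:
            out[f"{name}_p{q}"] = percentile(vals, q)
    return out
```

Passing a dictionary with the seven point features (distance, speed, acceleration, jerk, bearing, bearing rate, and rate of bearing rate) yields 7 × 10 = 70 entries.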
In step 4, the framework deals with noise in the data. In this work, we used a simple method called median filter to create a mask. The method is described in Algorithm 1 (threshold = 3), and it removes noise based on the speed_mean attribute (i.e., the average speed of a trajectory), since a human can classify the transportation mode mostly by knowing the mean speed of a trajectory.
Finally, we normalized the features (step 5) using the Min-Max normalization method, since this method preserves the relationships between the values while transforming the features to the same range, which improves the quality of the classification process [3].
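Min-Max normalization of a single feature column can be sketched as below. The target range [0, 1] and the handling of constant columns are assumptions of this example; the text only says that features are transformed to the same range.

```python
def min_max_normalize(column):
    """Scale a list of values to [0, 1]; a constant column maps to all 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # avoid division by zero
    return [(v - lo) / (hi - lo) for v in column]
```

Because the mapping is linear, the ordering and relative spacing of the original values are preserved, which is the property the text appeals to.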
Data: Speed mean of trajectories
Result: mask vector to remove the noisy trajectories
difference ← |speed_mean_trajectory − median(speed_mean)|;
median_difference ← median(difference);
if median_difference == 0 then
    indicator ← 0;
else
    indicator ← difference / median_difference;
end
return indicator > threshold;
Algorithm 1: Mask the noisy samples to remove from the dataset using the median
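Algorithm 1 translates almost line for line into Python. The sketch below is one possible rendering, not the authors' implementation; the function name `noise_mask` and the list-based interface are assumptions, while the default threshold of 3 comes from the text.

```python
import statistics

def noise_mask(speed_means, threshold=3.0):
    """Return one boolean per trajectory; True marks a noisy sample to remove."""
    med = statistics.median(speed_means)
    difference = [abs(s - med) for s in speed_means]
    median_difference = statistics.median(difference)
    if median_difference == 0:
        # Indicator is 0 for every sample, so nothing exceeds the threshold.
        return [False] * len(speed_means)
    return [d / median_difference > threshold for d in difference]
```

A trajectory is flagged when its mean speed deviates from the median mean speed by more than `threshold` times the median deviation, so a single extreme sample does not shift the reference point the way it would with a mean-based rule.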
3 Experiments
In this section, we detail the experiments performed in this work to validate our framework. The data used in this work is the GeoLife GPS dataset, which was collected by Microsoft Research Asia from April 2007 to October 2011 [11]. The dataset has 5,504,363 records labeled with eleven transportation modes: taxi (4.41%); car (9.40%); train (10.19%); subway (5.68%); walk (29.35%); airplane (0.16%); boat (0.06%); bike (17.34%); run (0.03%); motorcycle (0.006%); and bus (23.33%).
In the literature, we observed different sub-selections of these classes for
evaluating transportation mode prediction strategies; therefore, we decided to
select different target subsets for comparing our result with other papers.
To evaluate the performance of the classifiers in this work, we used the Accuracy and the F1 measure. In all our experiments, we used a 10-fold cross-validation strategy and computed a paired t-test to verify whether the differences in the means were statistically significant. We executed our framework with different classifiers: Decision Tree (DT) (with a maximum depth of five), Random Forest (RF) (with 50 tree estimators), Neural Network (NN), Naive Bayes (NB), and Quadratic Discriminant Analysis (QDA). In all cases, the random forest surpassed all the other classifiers in both accuracy and f1.
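The paired t-test on per-fold scores can be sketched as follows. The text does not say which implementation or significance level was used, so this is only an illustration of the statistic itself; the function name is hypothetical.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired per-fold scores of two classifiers.

    t = mean(d) / (stdev(d) / sqrt(n)), where d are the per-fold differences.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all fold differences are identical; t is undefined")
    return statistics.fmean(diffs) / (sd / math.sqrt(n))
```

With 10 folds the statistic has 9 degrees of freedom. At a two-sided 5% level, a level assumed here rather than taken from the text, the difference would be called significant when the absolute value of t exceeds roughly 2.262.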
Subsequently, we compared the RF using all the steps of our framework against the results of five papers. It is important to point out that all these papers reported their accuracy values on the GeoLife dataset. Table 1 shows a side-by-side comparison between some related works and the results of our framework. Our work does not surpass the accuracy of Jiang et al. [4] but outperforms all the others. It is important to highlight that the complexity and high training time of the RNN model used in their work may not be worth the difference in accuracy of about 1.4 percentage points.
Finally, we evaluated the effects of the noise removal performed by our framework. As a baseline, we trained the classifiers on the data with noise and compared them against the same classifiers trained on the data without noise (clean). Table 2 shows the mean of the f1 values obtained by 10-fold cross-validation for the different groups of classes. We can observe in Table 2 that, for all classifiers and all subgroups of classes, noise removal yields performance gains ranging from 2.56 (Decision Tree, using the classes of [2]) to 28.15 (QDA, using the classes of [11]) in f1.

Table 1. Comparison of accuracy and f1 measure of the proposed model against related works

| Reference: classes used in the experiments | Related work acc | Proposed model acc | Proposed model f1 |
|---|---|---|---|
| Dabiri et al. [1]: walk, bike, bus, driving, and train | 84.8% | 93.35% | 93.22% |
| Jiang et al. [4]: bike, car, walk, and bus | 97.9% | 96.45% | 96.31% |
| Xiao et al. [9]: walk, bus&taxi, bike, car, subway, and train | 90.77% | 93.19% | 92.81% |
| Zheng et al. [11]: walk, driving, bus, and bike | 76.2% | 93.61% | 93.51% |
| Endo et al. [2]: walk, car, taxi, bike, subway, bus, and train | 83.2% | 90.20% | 89.95% |
Table 2. F1 measures of the classifiers for different class groups.

| Reference | DT with noise | DT clean | RF with noise | RF clean | NN with noise | NN clean | NB with noise | NB clean | QDA with noise | QDA clean |
|---|---|---|---|---|---|---|---|---|---|---|
| Dabiri et al. [1] | 85.56 | 92.31 | 88.07 | 93.22 | 85.18 | 89.87 | 63.30 | 82.91 | 54.76 | 79.83 |
| Jiang et al. [4] | 88.26 | 95.47 | 91.56 | 96.31 | 88.63 | 94.11 | 65.68 | 85.19 | 54.70 | 82.55 |
| Xiao et al. [9] | 84.38 | 89.79 | 88.75 | 92.81 | 82.93 | 89.01 | 51.40 | 70.03 | 47.81 | 71.45 |
| Zheng et al. [11] | 85.62 | 91.92 | 88.72 | 93.51 | 85.76 | 91.33 | 64.61 | 84.22 | 51.33 | 79.48 |
| Endo et al. [2] | 79.53 | 82.09 | 85.57 | 89.95 | 79.33 | 85.70 | 57.31 | 72.68 | 49.13 | 72.30 |
Finally, Table 3 shows the mean of the accuracy values obtained by 10-fold cross-validation. For all classifiers and all subgroups of classes, performance gains ranging from 3.36 (Decision Tree, using the classes of [2]) to 29.04 (QDA, using the classes of [4]) in accuracy were observed. The results presented in this section indicate that dealing with noise in transportation mode prediction is an important topic, and that omitting this step from the classification task decreases the performance of the classifiers.
Table 3. Accuracy of the classifiers for different class groups.

| Class group | DT with noise | DT clean | RF with noise | RF clean | NN with noise | NN clean | NB with noise | NB clean | QDA with noise | QDA clean |
|---|---|---|---|---|---|---|---|---|---|---|
| Dabiri et al. [1] | 85.54 | 92.36 | 88.47 | 93.35 | 85.54 | 90.13 | 63.56 | 83.28 | 53.65 | 79.76 |
| Jiang et al. [4] | 88.41 | 95.54 | 91.91 | 96.45 | 88.80 | 94.21 | 63.70 | 84.31 | 53.03 | 82.07 |
| Xiao et al. [9] | 85.01 | 89.96 | 89.33 | 93.19 | 83.61 | 89.43 | 51.96 | 69.90 | 46.59 | 70.99 |
| Zheng et al. [11] | 85.77 | 92.13 | 89.09 | 93.61 | 86.10 | 91.45 | 64.36 | 84.53 | 50.85 | 79.50 |
| Endo et al. [2] | 80.25 | 83.61 | 86.36 | 90.20 | 80.27 | 86.28 | 56.66 | 73.27 | 47.92 | 71.60 |
4 Conclusions and Future Works
In this work, we propose a framework for transportation mode prediction using feature engineering and noise removal. The results showed that the newly engineered features (e.g., bearing rate and rate of bearing rate) and the application of a noise removal technique improve the performance of all tested classifiers. We intend to extend this work in two directions: (i) test and evaluate different noise removal techniques like wavelet-based, MCMC, and fast Fourier-based denoising methods, and (ii) investigate the performance of trajectory segmentation algorithms and include this step in our framework.
Acknowledgments The authors would like to thank NSERC (Natural Sciences
and Engineering Research Council of Canada) for financial support.
References
1. Sina Dabiri and Kevin Heaslip. Inferring transportation modes from GPS trajectories using a convolutional neural network. Transportation Research Part C: Emerging Technologies, 86:360–371, 2018.
2. Yuki Endo, Hiroyuki Toda, Kyosuke Nishida, and Akihisa Kawanobe. Deep feature extraction from trajectories for transportation mode estimation. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 54–66. Springer, 2016.
3. Jiawei Han, Jian Pei, and Micheline Kamber. Data mining: concepts and techniques. Elsevier, 2011.
4. X. Jiang, E. N. Souza, A. Pesaranghader, B. Hu, D. L. Silver, and S. Matwin. TrajectoryNet: An embedded GPS trajectory representation for point-based classification using recurrent neural networks. arXiv preprint arXiv:1705.02636, 2017.
5. A. Soares Júnior, B. N. Moreno, V. C. Times, S. Matwin, and L. A. F. C. GRASP-UTS: an algorithm for unsupervised trajectory segmentation. Int. J. of Geographical Information Science, 29(1):46–68, 2015.
6. Miao Lin and Wen-Jing Hsu. Mining GPS data for mobility patterns: A survey. Pervasive and Mobile Computing, 12:1–16, 2014.
7. S. Spaccapietra, C. Parent, M. L. Damiani, J. A. Macedo, F. Porto, and C. Vangenot. A conceptual view on trajectories. Data and Knowledge Engineering, 65(1):126–146, 2008.
8. Leon Stenneth, Ouri Wolfson, Philip S. Yu, and Bo Xu. Transportation mode detection using mobile phones and GIS information. In Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS '11, pages 54–63, New York, NY, USA, 2011. ACM.
9. Z. Xiao, Y. Wang, K. Fu, and Fan Wu. Identifying different transportation modes from trajectory data using tree-based ensemble classifiers. ISPRS, 6(2):57, 2017.
10. G. Yanyun, Z. Fang, C. Shaomeng, and L. Haiyong. A convolutional neural networks based transportation mode identification algorithm. In 2017 International Conf. on Indoor Positioning and Indoor Navigation (IPIN), pages 1–7, Sept 2017.
11. Yu Zheng, Quannan Li, Yukun Chen, Xing Xie, and Wei-Ying Ma. Understanding mobility based on GPS data. In Proceedings of the 10th international conference on Ubiquitous computing, pages 312–321. ACM, 2008.
... As for the field of transportation mode detection, the research community has provided several extensive works based on machine learning techniques. Etemad et al. [47] provide a framework for the prediction of transportation mode based on GPS data only. The key contribution of the authors work is to propose trajectory point features generation and trajectory segments feature extraction which comprise bearing rate, the change rate of the bearing rate and the global and local trajectory features. ...
... Furthermore, and after distinguishing between stop and move segments, the move segments are labeled by transportation means (e.g., metro, bus, car, etc.), which is represented by the transportation detection box in Figure 3.5. We take advantage of the work of Etemad et al. [47], which we have already discussed in Section 3.2.1, to detect the transportation mode and include the results in the post-processing layer. In the following section, we present our proposed algorithm for Grid Density-Based Stop Detection (GDSD). ...
Thesis
Full-text available
Air quality is one of the major risk factors in human health. Mobile Crowd Sensing (MCS), which is a new paradigm based on the emerging connected micro-sensor technology, offers the opportunity of the assessment of personal exposure to air pollution anywhere and anytime. This leads to the continuous generation of geolocated data series, which results in a big data volume. Such data is deemed to be a mine of information for various analysis, and a unique opportunity of knowledge discovery about pollution exposure. However, achieving this analysis is far from straightforward. In fact, there is a gap to fill between the raw sensor data series and usable information: raw data is highly uneven, noisy, and incomplete. The major challenge addressed by this thesis is to fill this gap by providing a holistic approach for data analytics and mining in the context of MCS. We establish an end-to-end analytics pipeline, which encompasses data preprocessing, their enrichment with contextual information, as well as data modeling and storage. We implemented this pipeline while ensuring its automatized deployment. The proposed approaches have been applied to real-world datasets collected within the Polluscope project.
... As for the field of feature extraction, the research community has provided several extensive works based on machine learning techniques. Etemad et al. [18] provide a framework for the prediction of transportation mode based on GPS data only. The key contribution of the authors' work is to propose trajectory point features generation and trajectory segments feature extraction which comprise bearing rate, the change rate of the bearing rate, and the global and local trajectory features. ...
... We further discuss these rules in Section 6.1. Furthermore, and after distinguishing between stop and move segments, the move segments are labeled by transportation means (e.g., metro, bus, car, etc.), which is represented by the transportation detection box in Fig. 2. We take advantage of the work of Etemad et al. [18], which we have already discussed in the related work section, to detect the transportation mode and include the results in the post-processing layer. In the following section, we present our proposed algorithm for Grid Density-Based Stop Detection (GDSD). ...
Article
Full-text available
With the rapid advancements of sensor technologies and mobile computing, Mobile Crowd Sensing (MCS) has emerged as a new paradigm to collect massive-scale rich trajectory data. Nomadic sensors empower people and objects with the capability of reporting and sharing observations on their state, their behavior and/or their surrounding environments. Processing and mining multi-source sensor data in MCS raise several challenges due to their multi-dimensional nature where the measured parameters (i.e., dimensions) may differ in terms of quality, variability, and time scale. We consider the context of air quality MCS and focus on the task of mining the micro-environment from the MCS data. Relating the measures to their micro-environment is crucial to interpret them and analyse the participant’s exposure properly. In this paper, we focus on the problem of investigating the feasibility of recognizing the human’s micro-environment in an environmental MCS scenario. We propose a novel approach for learning and predicting the micro-environment of users from their trajectories enriched with environmental data represented as multidimensional time series plus GPS tracks. We put forward a multi-view learning approach that we adapt to our context, and implement it along with other time series classification approaches. We extend the proposed approach to a hybrid method that employs trajectory segmentation to bring the best of both methods. We optimise the proposed approaches either by analysing the exact geolocation (which is privacy invasive), or simply applying some a priori rules (which is privacy friendly). The experimental results, applied to real MCS data, not only confirm the power of MCS and air quality (AQ) data in characterizing the micro-environment, but also show a moderate impact of the integration of mobility data in this recognition. 
Furthermore, and during the training phase, multi-view learning shows similar performance as the reference deep learning algorithm, without requiring specific hardware. However, during the application of models on new data, the deep learning algorithm fails to outperform our proposed models.
... A trajectory is the tracing of an object through physical space, and is now a first-class object in spatial data analysis due to the ease of creating these objects via GPS trackers. Also, their analysis has been at the forefront of geo-spatial research in clustering [5,6,18,7,51], similarity measures [3,12,32,2], classification [16,25,30,31,41,42,48,55,24,27,43,37,28,17,29,36], and transportation mode detection [14,15,21,23,47,52,54,22,45,44]. In this paper we do not factor in the absolute time these traces are made, but mostly treat the trajectories as geometric objects in the plane. ...
... The focus of [25] is on mode-of-flight identification from very short subsets of trajectories. Another line of work is on inferring transportation modes [14,15,21,23,47,52,54,22,45,44] which we directly compare against in Section 3.6. Yet, the core trajectory classification task factors into many important challenges, and is destined to have an ever-expanding role as spatial data analysis increases in automation. ...
Preprint
Full-text available
We provide the first comprehensive study on how to classify trajectories using only their spatial representations, measured on 5 real-world data sets. Our comparison considers 20 distinct classifiers arising either as a KNN classifier of a popular distance, or as a more general type of classifier using a vectorized representation of each trajectory. We additionally develop new methods for how to vectorize trajectories via a data-driven method to select the associated landmarks, and these methods prove among the most effective in our study. These vectorized approaches are simple and efficient to use, and also provide state-of-the-art accuracy on an established transportation mode classification task. In all, this study sets the standard for how to classify trajectories, including introducing new simple techniques to achieve these results, and sets a rigorous standard for the inevitable future study on this topic.
... Such applications are in fact permeate various fields of study ( [6]). One for example may mention, recognition of ship types and their activities ( [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], and [17]), transportation modes ( [18], [19], [20], [21], [22], [23], [24], [25], and [26]), recognition of animals and their behaviour ( [5], [27], [28], [29], [30] [10], and [21]), labeling hurricanes ( [31] and [21]), recognition of flying objects ( [32] and [33]), gesture recognition ( [34]), and labeling abnormal movements ( [35]). ...
... Research on deriving transport modes from GPS data relies mostly on supervised learning methods that use a large training data set of labelled trajectories to learn the nuances between the different modes of transport automatically (eg, [10,13,6,3]). Since we did not have access to an extensive training data set or time to collect one, we had to settle for a simpler approach using heuristics. ...
Chapter
Full-text available
Contact tracing applications generally rely on Bluetooth data. This type of data works well to determine whether a contact occurred (smartphones were close to each other) but cannot offer the contextual information GPS data can offer. Did the contact happen on a bus? In a building? And of which type? Are some places recurrent contact locations? By answering such questions, GPS data can help develop more accurate and better-informed contact tracing applications. This chapter describes the ideas and approaches implemented for GPS data within the Smittestopp contact tracing application.We will present the pipeline used and the contribution of GPS data for contextual information, using inferred transport modes and surrounding POIs, showcasing the opportunities in the use of GPS information. Finally,we discuss ethical and privacy considerations, as well as some lessons learned.
Chapter
Full-text available
Manual contact tracing has been a key component in controlling the outbreak of the COVID-19 pandemic. The identification and isolation of close contacts of confirmed cases have successfully interrupted transmission chains and reduced the disease spread. Even though manual contact tracing has been widely used, its practice has shown that it is slow and cannot be scaled up once the epidemic grows beyond the early phase. In this case, digital contact tracing can play a significant role in controlling the pandemic. In this chapter, based on our experience and lessons learned from the Smittestopp project, we discuss the main prerequisites for the efficient implementation and validation of digital contact tracing in a population. Specifically, we discuss how to translate a close contact defined for manual tracing to proximity events discovered by a phone, that is, how to define a meaningful risk score and validate the digital contact tracing. We discuss challenges related to each step and provide solutions to some of them, even though questions still remain.
Chapter
Full-text available
An efficient backend solution is of great importance for any large-scale system, and Smittestopp is no exception. The Smittestopp backend comprises various components for user and device registration, mobile app data ingestion, database and cloud operations, and web interface support. This chapter describes our journey from a vague idea to a deployed system. We provide an overview of the system internals and design iterations and discuss the challenges that we faced during the development process, along with the lessons learned. The Smittestopp backend handled around 1.5 million registered devices and provided various insights and analyses before being discontinued a few months after its launch.
Chapter
Full-text available
Bluetooth data is used as the main method for contact tracing with Smittestopp. When two active devices are within Bluetooth range, they will record the ID of the paired device, along with information about the received signal quality. In this chapter, we describe how this method is implemented in Smittestopp, and how Bluetooth data is processed and analysed, to determine if an encounter between two users should be considered a qualified contact with a risk of contamination.We show that distance estimation based on Bluetooth signals is challenging due to differences between devices, lack of information on transmit power and varying environmental factors. Based on this experience, we propose a simple rule for identifying contacts based on received signal strength combined with information about the operating system type.
Chapter
Full-text available
An important secondary purpose of the Smittestopp development was to provide aggregated data sets describing mobility and social interactions in Norway’s population. The data were to be used to monitor the effect of government regulations and recommendations, provide input to advanced computational models to predict the pandemic’s spread, and provide input to fundamental epidemiology research. In this chapter we describe the challenges and technical solutions of Smittestopp’s data aggregation, as well as preliminary results from the time period when the app was active.We first give a detailed overview of the requirements, specifying the types of data to be collected and the level of spatial and temporal aggregation. We then proceed to describe the concepts for anonymization via :-anonymity and Y-differential privacy (Y-DP ), and the technical solutions for collecting and aggregating data from the database. In particular, we present details of how GPS- and Bluetooth events were mapped to geographical regions and points of interest, and the solutions employed for efficient data retrieval and processing. The preliminary results demonstrate how the recorded GPS- and Bluetooth events match with expected temporal and spatial variations in mobility and social interactions, and indicate the usefulness of the aggregated data as a tool for pandemic monitoring and research. One of the main criticisms of Smittestopp concerns the centralized storage of individuals’ movements, even if such data were used and presented only at an aggregated and anonymized level. In this chapter, we also outline a completely different approach, where the GPS data do not leave the user’s phone but are, instead, pre-processed to a much higher level of privacy before being dispatched to a server-side data aggregation algorithm. 
This approach, which would make the app significantly less intrusive, is made possible by recent advances in determining close contacts from Bluetooth data, either by a revised Smittestopp algorithm or by means of the Google/Apple Exposure Notification framework.
Article
Full-text available
Understanding and discovering knowledge from GPS (Global Positioning System) traces of human activities is an essential topic in mobility-based urban computing. We propose TrajectoryNet-a neural network architecture for point-based trajectory classification to infer real world human transportation modes from GPS traces. To overcome the challenge of capturing the underlying latent factors in the low-dimensional and heterogeneous feature space imposed by GPS data, we develop a novel representation that embeds the original feature space into another space that can be understood as a form of basis expansion. We also enrich the feature space via segment-based information and use Maxout activations to improve the predictive power of Recurrent Neural Networks (RNNs). We achieve over 98% classification accuracy when detecting four types of transportation modes, outperforming existing models without additional sensory data or location-based prior knowledge.
Article
Recognition of transportation modes can be used in different applications including human behavior research, transport management and traffic control. Previous work on transportation mode recognition has often relied on using multiple sensors or matching Geographic Information System (GIS) information, which is not possible in many cases. In this paper, an approach based on ensemble learning is proposed to infer hybrid transportation modes using only Global Positioning System (GPS) data. First, in order to distinguish between different transportation modes, we used a statistical method to generate global features and extract several local features from sub-trajectories after trajectory segmentation, before these features were combined in the classification stage. Second, to obtain a better performance, we used tree-based ensemble models (Random Forest, Gradient Boosting Decision Tree, and XGBoost) instead of traditional methods (K-Nearest Neighbor, Decision Tree, and Support Vector Machines) to classify the different transportation modes. The experimental results on the latter have shown the efficacy of our proposed approach. Among them, the XGBoost model produced the best performance with a classification accuracy of 90.77% obtained on the GEOLIFE dataset, and we used a tree-based ensemble method to ensure accurate feature selection to reduce the model complexity.
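To make the idea of statistical global features concrete, the following sketch derives point-to-point speeds from a GPS track with the haversine formula and summarizes them with a mean, a median, a percentile and a maximum. The chosen statistics, the feature names, and the `(lat, lon, unix_seconds)` point format are assumptions for illustration and are not the exact feature set of the cited work.

```python
import math
import statistics


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points in degrees."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


def speeds(points):
    """points: list of (lat, lon, unix_seconds), sorted by time."""
    out = []
    for (la1, lo1, t1), (la2, lo2, t2) in zip(points, points[1:]):
        dt = t2 - t1
        if dt > 0:  # skip duplicate timestamps
            out.append(haversine_m(la1, lo1, la2, lo2) / dt)
    return out


def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def global_features(points):
    """Summary statistics of speed over a whole trajectory."""
    v = speeds(points)
    return {
        "speed_mean": statistics.fmean(v),
        "speed_median": statistics.median(v),
        "speed_p90": percentile(v, 90),
        "speed_max": max(v),
    }


# Illustrative track: four points, ten seconds apart, moving east.
track = [(39.90, 116.400, 0), (39.90, 116.401, 10),
         (39.90, 116.403, 20), (39.90, 116.404, 30)]
features = global_features(track)
```

The same `global_features` function can be applied to each sub-trajectory after segmentation to obtain local features, which can then be concatenated with the global ones before classification.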
Article
An important problem in the knowledge discovery of trajectories is segmentation in subparts (subtrajectories). Existing algorithms for trajectory segmentation generally use explicit criteria to create segments. In this article, we propose segmenting trajectories using a novel, unsupervised approach, in which no explicit criteria are predetermined. To achieve this, we apply the Minimum Description Length (MDL) principle, which can measure homogeneity in the trajectory data by computing the similarities between landmarks (i.e. representative points of the trajectory) and the points in their neighborhood. Based on the homogeneity measurements, we propose an algorithm named Greedy Randomized Adaptive Search Procedure for Unsupervised Trajectory Segmentation (GRASP-UTS), which is a meta-heuristic that builds segments by modifying the number and positions of landmarks. We perform experiments with GRASP-UTS in two real-world datasets, using segment purity and coverage metrics to evaluate its efficiency. Experimental results demonstrate that GRASP-UTS correctly segmented sample trajectories without predetermined criteria, by computing similarities between landmarks and other trajectory points.
Article
Analysis of trajectory data is the key to a growing number of applications aiming at global understanding and management of complex phenomena that involve moving objects (e.g. worldwide courier distribution, city traffic management, bird migration monitoring). Current DBMS support for such data is limited to the ability to store and query raw movement (i.e. the spatio-temporal position of an object). This paper explores how conceptual modeling could provide applications with direct support of trajectories (i.e. movement data that is structured into countable semantic units) as a first class concept. A specific concern is to allow enriching trajectories with semantic annotations allowing users to attach semantic data to specific parts of the trajectory. Building on a preliminary requirement analysis and an application example, the paper proposes two modeling approaches, one based on a design pattern, the other based on dedicated data types, and illustrates their differences in terms of implementation in an extended-relational context.
Conference Paper
The transportation mode such as walking, cycling or on a train denotes an important characteristic of the mobile user's context. In this paper, we propose an approach to inferring a user's mode of transportation based on the GPS sensor on her mobile device and knowledge of the underlying transportation network. The transportation network information considered includes real time bus locations, spatial rail and spatial bus stop information. We identify and derive the relevant features related to transportation network information to improve classification effectiveness. This approach can achieve over 93.5% accuracy for inferring various transportation modes including: car, bus, aboveground train, walking, bike, and stationary. Our approach improves the accuracy of detection by 17% in comparison with the GPS only approach, and 9% in comparison with GPS with GIS models. The proposed approach is the first to distinguish between motorized transportation modes such as bus, car and aboveground train with such high accuracy. Additionally, if a user is travelling by bus, we provide further information about which particular bus the user is riding. Five different inference models including Bayesian Net, Decision Tree, Random Forest, Naïve Bayesian and Multilayer Perceptron, are tested in the experiments. The final classification system is deployed and available to the public.
Book
This is the third edition of the premier professional reference on the subject of data mining, expanding and updating the previous market-leading edition. This was the first (and is still the best and most popular) of its kind. It combines sound theory with truly practical applications to prepare students for real-world challenges in data mining. Like the first and second editions, Data Mining: Concepts and Techniques, 3rd Edition equips professionals with a sound understanding of data mining principles and teaches proven methods for knowledge discovery in large corporate databases. The first and second editions also established themselves as the market leader for courses in data mining, data analytics, and knowledge discovery. Revisions incorporate input from instructors, changes in the field, and new and important topics such as data warehouse and data cube technology, mining stream data, mining social networks, and mining spatial, multimedia and other complex data. This book begins with a conceptual introduction followed by a comprehensive and state-of-the-art coverage of concepts and techniques. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. Wherever possible, the authors raise and answer questions of utility, feasibility, optimization, and scalability. -- A comprehensive, practical look at the concepts and techniques you need to get the most out of real business data. -- Updates that incorporate input from readers, changes in the field, and more material on statistics and machine learning. -- Scores of algorithms and implementation examples, all in easily understood pseudo-code and suitable for use in real-world, large-scale data mining projects. -- Complete classroom support for instructors as well as bonus content available at the companion website.
A comprehensive and practical look at the concepts and techniques you need in the area of data mining and knowledge discovery.
Conference Paper
This paper addresses the problem of feature extraction for estimating users’ transportation modes from their movement trajectories. Previous studies have adopted supervised learning approaches and used engineers’ skills to find effective features for accurate estimation. However, such hand-crafted features cannot always work well because human behaviors are diverse and trajectories include noise due to measurement error. To compensate for the shortcomings of hand-crafted features, we propose a method that automatically extracts additional features using a deep neural network (DNN). So that a DNN can easily handle input trajectories, our method converts a raw trajectory data structure into an image data structure while maintaining effective spatio-temporal information. A classification model is constructed in a supervised manner using both the deep features and the hand-crafted features. We demonstrate the effectiveness of the proposed method through several experiments on two real datasets, including accuracy comparisons with previous methods and feature visualization.
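One plausible way to turn a trajectory into an image while keeping some spatio-temporal information is to rasterize it onto a grid whose cell values record the time spent in each cell. The sketch below does this under assumptions of its own: the grid covers the trajectory's bounding box, each time gap is attributed to the cell of the gap's starting point, and the function name and point format are hypothetical. The abstract does not specify the encoding actually used.

```python
def trajectory_to_image(points, size=8):
    """points: list of (lat, lon, unix_seconds), sorted by time.

    Returns a size x size grid (list of lists) where each cell holds the
    number of seconds spent in it, attributing each time gap to the cell
    of the gap's starting point.
    """
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    lat_min, lon_min = min(lats), min(lons)
    # Avoid division by zero for degenerate (stationary) trajectories.
    lat_span = (max(lats) - lat_min) or 1.0
    lon_span = (max(lons) - lon_min) or 1.0

    def cell(lat, lon):
        # Clamp so points on the upper edge fall into the last row/column.
        row = min(int((lat - lat_min) / lat_span * size), size - 1)
        col = min(int((lon - lon_min) / lon_span * size), size - 1)
        return row, col

    image = [[0.0] * size for _ in range(size)]
    for (lat, lon, t1), (_, _, t2) in zip(points, points[1:]):
        row, col = cell(lat, lon)
        image[row][col] += t2 - t1
    return image


# Illustrative trajectory: 10 s near the origin, then 30 s further east.
demo = [(0.0, 0.0, 0), (0.0, 0.5, 10), (1.0, 1.0, 40)]
img = trajectory_to_image(demo, size=2)
```

A fixed-size grid like this can be fed to a convolutional network regardless of how many GPS points the original trajectory contains.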
Article
With the help of various positioning tools, individuals’ mobility behaviors are being continuously captured from mobile phones, wireless networking devices and GPS appliances. These mobility data serve as an important foundation for understanding individuals’ mobility behaviors. For instance, recent studies show that, despite the dissimilarity in the mobility areas covered by individuals, there is high regularity in human mobility behaviors, suggesting that most individuals follow a simple and reproducible pattern. This survey paper reviews relevant results on uncovering mobility patterns from GPS datasets. Specifically, it covers the results about inferring locations of significance for prediction of future moves, detecting modes of transport, mining trajectory patterns and recognizing location-based activities. The survey provides a general perspective for studies on the issues of individuals’ mobility by reviewing the methods and algorithms in detail and comparing the existing results on the same issues. Several new and emergent issues concerning individuals’ mobility are proposed for further research.