Classification of Travel Modes from Cellular Network Data Using Machine Learning Algorithms

Leo Tišljarić1, Dominik Cvetek1, Valentin Vareškić2, Martin Gregurić1
1Faculty of Transport and Traffic Sciences, University of Zagreb
Vukelićeva 4, HR-10000 Zagreb, Croatia
2iOLAP d.o.o.
Prolaz Marije Krucifikse Kozulić 1, HR-51000, Rijeka, Croatia
ltisljaric@fpz.unizg.hr
Abstract: Data availability in recent years has grown exponentially, allowing researchers in the transport sector to harness valuable information regarding traffic flows. In that sense, cellular network data represents valuable traffic information when dealing with spatially large areas due to its property of collecting route data using distant mobile base stations. This property enables the automatic collection of origin-destination data, which is traditionally collected using field or online questionnaires. This paper aims to present the possibility of using origin-destination data extracted from a cellular network dataset to classify travel modes. A case study was performed on the dataset collected in the City of Rijeka, Croatia. The dataset is evaluated on five machine learning algorithms, which resulted in Random forest as the highest performing algorithm with an accuracy score of 99.93%.
Keywords: Travel mode classification; Machine learning; Cellular network data; Origin-destination matrices
I. INTRODUCTION
The deployment of various traffic sensors, combined with the increased development of data processing techniques and computing power, has resulted in exponential growth of available traffic datasets in the recent decade. Consequently, many data-driven road traffic research topics have emerged, such as traffic data fusion models [1], [2], traffic state estimation and prediction [3], [4], and traffic control [5].
In this paper, a methodology for the classification of Origin-Destination (OD) data extracted from cellular network data using machine learning algorithms is presented. The paper's primary goal is to compare the performances of the most used machine learning algorithms to identify and propose the best-performing ones.
Authors in [6] compared two classification algorithms for travel mode classification, Random Forest (RF) and K-Nearest Neighbors (KNN). RF showed better accuracy for travel mode classification applied to Global Navigation Satellite System (GNSS) historical data and data streaming. Zhang et al. [7] analyzed the navigational behavior of users using GNSS traces. The goal of the analysis is to enable the construction of a more fine-tuned optimal route. Authors in [8] discuss a new travel mode classification approach using a Convolutional Neural Network (CNN) on accelerometer data. Authors in [9] analyze the potential and limitations of using cellular network data for traffic analysis. Kalatian and Shafahi [10] calculate travel times as the primary attributes for clustering. Trips between the same origin and destination zones are combined in a cluster. KNN is used to calculate clusters representing particular travel modes: walking, public transportation, or driving a private car. Authors in [11] compare three geometry-based mode classification methods and three supervised methods to classify trips extracted from the cellular network.
The contributions of this paper are as follows: (i) a methodology for processing OD-based datasets for machine learning algorithms, (ii) an evaluation of machine learning algorithms for the classification of travel modes using an OD dataset, and (iii) an application of the proposed methodology and evaluation to a real-life dataset from the City of Rijeka, Croatia.
The rest of the paper is organized as follows. Section II presents the methodology: it describes the usage of OD matrices, the dataset, the preprocessing steps, and the feature selection. Then, every machine learning algorithm used is described with its advantages and disadvantages. Section III presents the results of the evaluation on five machine learning algorithms. The conclusion and future work directions are given in Section IV.
II. METHODOLOGY
A. Origin-destination matrix
Cells in the OD matrix represent the number of trip records in the observed time period, where each trip is realized as unimodal. Fig. 1 shows the OD matrix used, with trips recorded in the City of Rijeka, Croatia. Data was collected and aggregated for an average working day in a year, with data from the holiday season excluded. The data contains records divided into 48 sectors covering the wide city area shown in Fig. 2.
The OD matrix is mostly used as an input for traffic simulations, where the matrix is used for generating the trips that represent the initial traffic model of the observed area [12]. The matrix can also be used as an input for traffic state estimation, anomaly detection, or prediction models because it highlights the areas of increased traffic activity [13].
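As a minimal sketch of how such a matrix can be assembled, the Python snippet below counts trips per (start sector, end sector) pair. The trip tuples and the `build_od_matrix` helper are hypothetical and only illustrate the counting step; they are not taken from the published code.

```python
from collections import Counter


def build_od_matrix(trips, n_sectors):
    """Return an n_sectors x n_sectors matrix of trip counts.

    `trips` is an iterable of (start_sector, end_sector) pairs; sector ids
    are assumed to run from 0 to n_sectors - 1.
    """
    counts = Counter(trips)
    return [[counts[(origin, dest)] for dest in range(n_sectors)]
            for origin in range(n_sectors)]


# Hypothetical trips between three sectors (the case study uses 48).
trips = [(0, 1), (0, 1), (1, 2), (2, 0), (0, 1), (1, 1)]
od = build_od_matrix(trips, n_sectors=3)
for row in od:
    print(row)
```

Rows index the origin sector and columns the destination sector, so the sum over all cells equals the number of trip records.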
63rd International Symposium ELMAR-2021, 13-15 September 2021, Zadar, Croatia
2021 International Symposium ELMAR | 978-1-6654-4437-8/21/$31.00 ©2021 IEEE | DOI: 10.1109/ELMAR52657.2021.9550817
Attribute      | Description
Interval       | Time interval when the trip started.
Duration       | Duration of the trip [s].
Start sector   | Id of the trip starting sector.
End sector     | Id of the trip ending sector.
Air distance   | Euclidean distance between start and end of the trip [m].
Air speed      | Obtained speed using Euclidean distance [m/s].
Road distance  | Real road distance between start and end of the trip [m].
Road speed     | Obtained speed using road distance [m/s].
Mode           | Travel mode used for the trip.
TABLE I. Attributes of the dataset
B. Data
The dataset contains around 500,000 OD records extracted from the cellular network, which represent trip data. Every trip record is described with the nine attributes presented in Table I. The attribute 'Interval' represents the time interval in which the trip is recorded, with possible values of 00:00-06:00, 06:00-09:00, 09:00-14:00, 14:00-18:00, or 18:00-00:00 h. 'Duration' represents the total trip duration in seconds. 'Air distance' and 'Air speed' represent the computed Euclidean distance and the speed calculated using the Euclidean distance and the difference in time from the trip start to end. 'Road distance' and 'Road speed' represent the computed actual distance and speed of vehicles when traveling using road infrastructure. The 'Mode' attribute represents the travel mode used for completing the trip, where the recorded travel modes are 'car', 'public transport', and 'walking'.
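The derived attributes can be illustrated with a short sketch. The snippet below maps a trip start hour to one of the five intervals and computes the air and road speeds as distance divided by duration. The helper names, the example trip, and the half-open handling of interval boundaries are assumptions made for illustration.

```python
# Interval bounds in hours; boundaries are treated as half-open [lo, hi),
# which is an assumption, since the paper does not state the convention.
INTERVALS = [
    (0, 6, "00:00-06:00"),
    (6, 9, "06:00-09:00"),
    (9, 14, "09:00-14:00"),
    (14, 18, "14:00-18:00"),
    (18, 24, "18:00-00:00"),
]


def interval_label(start_hour):
    """Return the interval label for a trip start time given in hours."""
    for lo, hi, label in INTERVALS:
        if lo <= start_hour < hi:
            return label
    raise ValueError(f"start_hour out of range: {start_hour}")


def speed(distance_m, duration_s):
    """Average speed in m/s from a distance in meters and a duration in seconds."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return distance_m / duration_s


# Hypothetical trip record.
trip = {"start_hour": 7.5, "duration_s": 600,
        "air_distance_m": 3000, "road_distance_m": 4200}

air_speed = speed(trip["air_distance_m"], trip["duration_s"])
road_speed = speed(trip["road_distance_m"], trip["duration_s"])
print(interval_label(trip["start_hour"]), air_speed, road_speed)
```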
C. Data preprocessing
Distributions of the observed attributes are shown in the diagonal images in Fig. 3. Due to the highly skewed distributions, we used the adjusted boxplot method to remove anomalies because it makes no parametric assumptions and uses the medcouple as a robust skewness estimator [14]. The anomaly detection resulted in excluding trips with a duration longer than 9,000 s (2.5 h).
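A rough sketch of the adjusted boxplot idea is given below. It uses a naive O(n^2) medcouple that simply skips pairs tied at the median (a simplification of the full definition) and the exponential fence factors commonly given for the adjusted boxplot. The duration values are made up, and the function names are hypothetical; the published code may implement this step differently.

```python
import math
import statistics


def medcouple(values):
    """Naive medcouple: median of the kernel over all pairs (xi >= med >= xj).

    Pairs with xi == xj (ties at the median) are skipped, which is a
    simplification of the full definition.
    """
    xs = sorted(values)
    med = statistics.median(xs)
    upper = [x for x in xs if x >= med]
    lower = [x for x in xs if x <= med]
    kernel = [((xi - med) - (med - xj)) / (xi - xj)
              for xi in upper for xj in lower if xi != xj]
    return statistics.median(kernel)


def adjusted_fences(values):
    """Lower and upper fences of the skewness-adjusted boxplot."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    mc = medcouple(values)
    if mc >= 0:  # right-skewed: widen the upper fence, tighten the lower
        low = q1 - 1.5 * math.exp(-4 * mc) * iqr
        high = q3 + 1.5 * math.exp(3 * mc) * iqr
    else:        # left-skewed: the mirrored case
        low = q1 - 1.5 * math.exp(-3 * mc) * iqr
        high = q3 + 1.5 * math.exp(4 * mc) * iqr
    return low, high


# Made-up, right-skewed trip durations in seconds.
durations = [300, 420, 480, 600, 660, 780, 900, 1200, 1500, 2400, 3600, 20000]
low, high = adjusted_fences(durations)
kept = [d for d in durations if low <= d <= high]
print(low, high, kept)
```

Because the medcouple is positive for right-skewed data, the upper fence moves further out than in a standard boxplot, so long but plausible trips are retained while extreme durations are flagged.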
Figure 1. OD matrix for recorded trips in the City of Rijeka and its surroundings
Figure 2. Sectors (spatial zones) for the City of Rijeka
After the anomaly detection, the feature scaling step is conducted. To mitigate the influence of different units and large differences in attribute values, all attributes are scaled to the [0,1] range.
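Scaling to the [0,1] range is typically done with min-max scaling, x' = (x - min) / (max - min). The sketch below applies it to one hypothetical attribute column; whether the published code uses this exact formula or a library scaler is an assumption.

```python
def min_max_scale(column):
    """Scale a list of numbers linearly to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


# Hypothetical 'Duration' values in seconds.
durations = [600, 1800, 3000]
scaled = min_max_scale(durations)
print(scaled)
```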
After the data preprocessing, 504,419 trip records were used for further analysis. The most dominant travel mode was 'car' with 398,729 records, followed by 'public transport' with 83,060 records, and 'walking' with 22,630 records.
D. Feature selection
After examining all dataset attributes presented in Table I, four features were selected for further analysis and the learning process: 'Duration', 'Air speed', 'Road distance', and 'Road speed'.
When considering the correlation plots in Fig. 3, it can be observed that the attributes 'Road distance' and 'Air distance' provide redundant information, so one of them can be removed. We removed the 'Air distance' attribute from the learning process because road distance is a more informative attribute for traffic data analysis.
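One way to quantify the redundancy visible in a correlation plot is the Pearson correlation coefficient. The sketch below computes it for two hypothetical distance columns; the sample values and the 0.95 threshold are illustrative assumptions, not values reported in the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical distances in meters for five trips.
air_distance = [1000, 2500, 4000, 5500, 8000]
road_distance = [1400, 3300, 5100, 7400, 10300]

r = pearson(air_distance, road_distance)
redundant = abs(r) > 0.95  # illustrative threshold
print(r, redundant)
```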
E. Machine learning algorithms
This section presents the five machine learning algorithms for the classification of travel modes used in this research, with their corresponding advantages and disadvantages. We analyze the following algorithms: (i) Decision Tree (DT), (ii) K-Nearest Neighbors, (iii) Logistic Regression (LR), (iv) Naive Bayes (NB), and (v) Random Forest.
1) K-Nearest Neighbors: Machine learning algorithm that solves classification and regression problems. KNN is a non-parametric classification method, which classifies data points based on a similarity measure. The algorithm calculates the distance between data points using the preferred distance metric and adds the distance to an ordered collection. Then, it sorts the collection from smallest to largest and picks the first K entries from the sorted collection [15]. The main disadvantage of KNN is that it becomes significantly slower with increasing data volume, which makes it an impractical choice in environments where rapid forecasting is required. The main advantages of KNN for classification are that it is very simple to apply and robust in terms of search space; for example, classes do not have to be linearly separable.
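The steps described above can be sketched in a few lines of Python. The snippet computes Euclidean distances to all training points, sorts them, takes the K nearest, and returns the majority label. The training points, the feature choice, and K = 3 are hypothetical; the experiments in this paper rely on Scikit-Learn rather than on this toy implementation.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point, paired with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical scaled features: (duration, road speed).
train_X = [(0.90, 0.10), (0.80, 0.15), (0.30, 0.80),
           (0.25, 0.90), (0.50, 0.40), (0.55, 0.45)]
train_y = ["walking", "walking", "car",
           "car", "public transport", "public transport"]

print(knn_predict(train_X, train_y, (0.28, 0.85)))
print(knn_predict(train_X, train_y, (0.85, 0.12)))
```

Every prediction scans the whole training set, which illustrates why the method slows down as the data volume grows.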
2) Decision Tree: Machine learning algorithm that solves both classification and regression tasks. A DT can be used to visually represent the decision-making process using a tree-like model of decisions. The main advantages of a DT are that it does not require scaling and normalization, and it is very intuitive and easy to explain. The main disadvantages are that training often takes longer because the calculations can become far more complex than for other algorithms, and that small changes in the data can lead to a large change in the structure of the optimal decision tree [16].
3) Random Forest: Machine learning algorithm that can solve both classification and regression problems. RF builds multiple decision trees and merges them together to get more accurate and stable predictions [17]. One of the benefits of using RF is its ability to handle large datasets with higher dimensionality, and it is an effective method for estimating missing data. The main limitation is that a large number of trees can make the algorithm too slow and ineffective for real-time predictions.
4) Logistic regression: Classification algorithm for problems in which the output variable can take only discrete values for a given set of inputs; LR models the data using the sigmoid function. The advantage is that LR is easy to implement and interpret, and very efficient to train. The disadvantage is that it struggles to capture complex relationships, and on high-dimensional datasets the model can overfit the training set. Non-linear problems cannot be solved with LR because it has a linear decision boundary [18].
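To make the role of the sigmoid function concrete, the sketch below shows the binary case: a linear combination of the features is passed through the sigmoid to obtain a probability. The weights, bias, and feature values are hypothetical. For a three-class problem such as the one in this paper, libraries usually extend this idea with a one-vs-rest or multinomial formulation.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of the positive class for one sample (binary LR)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Hypothetical model with two features.
weights, bias = [2.0, -1.0], 0.0
print(predict_proba(weights, bias, [0.5, 1.0]))  # on the decision boundary
print(predict_proba(weights, bias, [1.0, 0.0]))  # on the positive side
```

The decision boundary is the set of points where the linear term equals zero, which is why the classifier cannot separate classes that are not linearly separable.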
Figure 3. Distributions and correlations between attributes
Figure 4. Total accuracy for machine learning algorithms
5) Naive Bayes: NB is a classification technique based on Bayes' theorem with an assumption of independence among predictors. It is used to discriminate between different objects based on specific features. The main advantage of NB is that it requires a small amount of training data. When the assumption of independent predictors holds true, the classifier performs better than other models. The main limitation of NB is the assumption of independent predictors. In real-life scenarios, the predictors are dependent, and it is almost impossible to obtain a set of predictors that are entirely independent [19].
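Written out, the independence assumption lets the class posterior factor into per-feature terms. The notation below (class $c$, features $x_1, \dots, x_n$) is introduced here for illustration and is not used elsewhere in the paper:

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

With correlated features such as speed and distance, the product counts the same evidence more than once, which is the limitation noted above.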
III. RESULTS
This section presents the evaluation of the cellular network dataset on five machine learning algorithms. For each algorithm, total accuracy is presented, alongside precision, recall, F-1 score, and the corresponding confusion matrices. The input dataset is divided into training and test sets using a ratio of 70% for training and 30% for testing. The experiments are done using the Python programming language with the Scikit-Learn package [20]. The code used for this research is publicly available in a GitHub repository [21].
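A minimal sketch of such an evaluation loop with Scikit-Learn is shown below. It uses synthetic data in place of the cellular network dataset, and all hyperparameters, the Gaussian variant of Naive Bayes, the stratified split, and the random seeds are assumptions; the settings actually used are those in the published repository [21].

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four scaled features and three travel modes.
rng = np.random.default_rng(0)
n_per_class = 200
centers = np.array([[0.2] * 4, [0.5] * 4, [0.8] * 4])
X = np.vstack([rng.normal(c, 0.05, size=(n_per_class, 4)) for c in centers])
y = np.repeat(["walking", "public transport", "car"], n_per_class)

# 70% of the samples for training, 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "LR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

The synthetic classes are well separated, so the accuracies printed here say nothing about the relative performance of the algorithms on the real dataset.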
The total accuracy of the algorithms is shown in Fig. 4. It can be observed that RF achieved the best result with a total accuracy of 99.35%. KNN and DT also achieved high accuracy scores, while LR and NB did not achieve sufficient scores.
Figure 5. Precision, recall and F1 score values for machine learning algorithms
Fig. 5 shows the precision, recall, and F-1 values for every algorithm. We report the precision calculated as $TP/(TP+FP)$, the recall as $TP/(TP+FN)$, and F-1 as the harmonic mean of precision and recall, where $TP$ stands for true positive, $FP$ for false positive, and $FN$ for false negative values.
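As a worked example of these definitions, the sketch below computes the three metrics for one class from hypothetical counts of true positives, false positives, and false negatives. The counts are made up and do not correspond to any result in this paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F-1 for one class from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for a single class.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(p, r, f1)
```

For a multi-class problem the counts are taken per class, treating that class as positive and all others as negative.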
Generally, confusion matrices, also called error matrices, show the number of samples for each combination of true and predicted class labels. Fig. 6 shows confusion matrices with normalized values for all observed algorithms. The confusion matrices also confirm RF as the best performing algorithm with well-separated classes. The KNN and DT algorithms achieved good separation between classes, NB could not separate the 'car' and 'public transport' classes, while LR could not separate the 'car' from the 'public transport' class or the 'walking' from the 'public transport' class.
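The sketch below builds a confusion matrix from true and predicted labels and normalizes each row so that it sums to one. Normalizing over the true labels (rows) is an assumption, since the text does not state which normalization Fig. 6 uses, and the label sequences are hypothetical.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[predicted_label]] += 1
    return matrix


def normalize_rows(matrix):
    """Divide every row by its sum; empty rows stay all zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


labels = ["car", "public transport", "walking"]
y_true = ["car"] * 4 + ["public transport"] * 2 + ["walking"] * 2
y_pred = ["car", "car", "car", "public transport",
          "car", "public transport", "walking", "walking"]

cm = confusion_matrix(y_true, y_pred, labels)
cm_norm = normalize_rows(cm)
for row in cm_norm:
    print(row)
```

With row normalization, the diagonal entries equal the per-class recall, so off-diagonal mass shows directly which classes are confused with each other.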
IV. CONCLUSION
This paper presents a methodology for processing the OD dataset extracted from cellular network mobile records. The paper's main goal was to evaluate the dataset on the most used machine learning algorithms for classification tasks and report the results by comparing their performances. Based on the results, the following conclusions can be drawn: (i) the best performing algorithm was RF, with a total accuracy of 99.93%, (ii) the KNN and DT algorithms can be used for this purpose because of their high accuracy rates and good separation of the classes, and (iii) LR and NB are not well suited for the classification task on this dataset.
Future work directions based on this dataset include automatic feature extraction from OD matrices. As the OD matrix is presented as a heatmap, it can be used as an input for training a CNN to estimate the traffic states, and the results of the CNN can be validated using the methodology proposed in this paper.
ACKNOWLEDGMENT
This research has been supported by the University of Zagreb, Student Centre as part of the project “Znanstveno-istraživačke aktivnosti studentske istraživačke skupine SIS-DVA” and the European Regional Development Fund under the grant KK.01.1.1.01.0009 (DATACROSS). Data used for this research is provided by Ericsson Nikola Tesla Ltd. through the collaboration with the Laboratory for Data Science in Traffic and Logistics at the Faculty of Transport and Traffic Sciences, University of Zagreb.
REFERENCES
[1] D. Cvetek, M. Muštra, N. Jelušić, and L. Tišljarić, “A survey of methods and technologies for congestion estimation based on multisource data fusion,” Applied Sciences, vol. 11, no. 5, 2021.
[2] D. Cvetek, I. Horenec, M. Muštra, and N. Jelušić, “Analysis of correlation between dwell time measured using bluetooth detector and occupancy,” in 2019 International Symposium ELMAR, pp. 31–34, 2019.
[3] L. Tišljarić, T. Carić, B. Abramović, and T. Fratrović, “Traffic state estimation and classification on citywide scale using speed transition matrices,” Sustainability, vol. 12, no. 18, 2020.
Figure 6. Confusion matrices for machine learning algorithms: (a) Decision tree, (b) K-nearest neighbors, (c) Logistic regression, (d) Naive Bayes, and (e) Random forest
[4] T. Erdelić, T. Carić, M. Erdelić, L. Tišljarić, A. Turković, and N. Jelušić, “Estimating congestion zones and travel time indexes based on the floating car data,” Computers, Environment and Urban Systems, vol. 87, p. 101604, 2021.
[5] F. Vrbanić, E. Ivanjko, K. Kušić, and D. Čakija, “Variable speed limit and ramp metering for mixed traffic flows: A review and open questions,” Applied Sciences, vol. 11, no. 6, 2021.
[6] M. Erdelić, T. Carić, E. Ivanjko, and N. Jelušić, “Classification of travel modes using streaming gnss data,” Transportation Research Procedia, vol. 40, pp. 209–216, 2019. TRANSCOM 2019 13th International Scientific Conference on Sustainable, Modern and Safe Transport.
[7] L. Zhang, S. Dalyot, and M. Sester, Travel-Mode Classification for Optimizing Vehicular Travel Route Planning, pp. 277–295. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013.
[8] H. Wang, G. Liu, J. Duan, and L. Zhang, “Detecting transportation modes using deep neural network,” IEICE Transactions on Information and Systems, vol. E100.D, no. 5, pp. 1132–1135, 2017.
[9] N. Breyer, D. Gundlegård, and C. Rydergren, “Cellpath routing and route traffic flow estimation based on cellular network data,” Journal of Urban Technology, vol. 25, no. 2, pp. 85–104, 2018.
[10] A. Kalatian and Y. Shafahi, “Travel mode detection exploiting cellular network data,” MATEC Web Conf., vol. 81, p. 03008, 2016.
[11] N. Breyer, D. Gundlegård, and C. Rydergren, “Travel mode classification of intercity trips using cellular network data,” Transportation Research Procedia, vol. 52, pp. 211–218, 2021. 23rd EURO Working Group on Transportation Meeting, EWGT 2020, 16-18 September 2020, Paphos, Cyprus.
[12] L. Novačko, L. Šimunović, and D. Krasić, “Estimation of origin-destination trip matrices for small cities,” Promet - Traffic&Transportation, vol. 26, pp. 419–428, Oct. 2014.
[13] H. Fanaee-T and J. Gama, “Event detection from traffic tensors: A hybrid model,” Neurocomputing, vol. 203, pp. 22–33, 2016.
[14] M. Hubert and E. Vandervieren, “An adjusted boxplot for skewed distributions,” Computational Statistics & Data Analysis, vol. 52, no. 12, pp. 5186–5201, 2008.
[15] S. Oh, Y.-J. Byon, and H. Yeo, “Improvement of search strategy with k-nearest neighbors approach for traffic state prediction,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 4, pp. 1146–1156, 2016.
[16] D. Tong, Y. R. Qu, and V. K. Prasanna, “Accelerating decision tree based traffic classification on fpga and multicore platforms,” IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 11, pp. 3046–3059, 2017.
[17] N. Dogru and A. Subasi, “Traffic accident detection using random forest classifier,” in 2018 15th Learning and Technology Conference (L&T), pp. 40–45, 2018.
[18] S. Agarwal, P. Kachroo, and E. Regentova, “A hybrid model using logistic regression and wavelet transformation to detect traffic incidents,” IATSS Research, vol. 40, no. 1, pp. 56–63, 2016.
[19] J. Zhang, C. Chen, Y. Xiang, W. Zhou, and Y. Xiang, “Internet traffic classification by aggregating correlated naive bayes predictions,” IEEE Transactions on Information Forensics and Security, vol. 8, no. 1, pp. 5–15, 2013.
[20] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay, “Scikit-learn: Machine learning in python,” Journal of Machine Learning Research, vol. 12, no. 85, pp. 2825–2830, 2011.
[21] L. Tišljarić, D. Cvetek, and V. Vareškić, “Transport Mode Classification.” https://github.com/tisljaricleo/transport-mode-classification, 2021.