
SynthCAT: Synthesizing Cellular Association Traces with Fusion of Model-Based and Data-Driven Approaches

The scarcity of publicly available cellular association traces hinders user location-based research and various data-driven services, highlighting the importance of data synthesis in this field. In this paper, we investigate the cellular association trace synthesis (CATS) problem, aiming to generate diverse and realistic cellular association traces based on road segment-based trajectories and corresponding departure times. To substantiate our research, we first gather substantial data, including road segment-based trajectories, base station (BS) distribution, and ground truths of cellular association traces. We then perform systematic data analysis to reveal technical challenges such as disparity in geographic spaces, complex and dynamic BS handover, and poor performance of single-dimension approaches. To address these challenges, we propose SynthCAT, a novel scheme that fuses model-based and data-driven approaches. Specifically, SynthCAT includes: i) A model-based coarse-grained cellular association trace generation component, encompassing GPS reference generation, weighted historical average time generation, Bayesian decision, and time mapping modules. This component establishes a unified GPS space to map road and BS spaces, generates initial time information, synthesizes coarse-grained spatial cellular association traces by following explicit BS handover rules, and maps the corresponding arrival time for each trace point; ii) A fine-grained cellular association trace generation component, which combines model-based and data-driven approaches. This employs a two-stage Autoencoder Generative Adversarial Network (AEGAN) to refine cellular association traces based on the coarse-grained ones. Extensive field experiments validate the efficacy of SynthCAT in terms of trace similarity to ground truths and its efficiency in supporting practical downstream applications.
FENG LYU,Central South University, China
JIE ZHANG,Central South University, China
HUALI LU,Central South University, China
HUAQING WU,University of Calgary, Canada
FAN WU,Central South University, China
YONGMIN ZHANG,Central South University, China
YAOXUE ZHANG,Tsinghua University, China
CCS Concepts: • Networks → Mobile networks; • Computing methodologies → Modeling and simulation; Model verification and validation.
Additional Key Words and Phrases: Synthesizing cellular association traces, Model-based and data-driven fusion, Autoencoder
Generative Adversarial Network, Downstream applications support
Corresponding author: Huali Lu.
Authors’ addresses: Feng Lyu, Central South University, Changsha, China, fenglyu@csu.edu.cn; Jie Zhang, Central South University, Changsha,
China, jie_zhang@csu.edu.cn; Huali Lu, Central South University, Changsha, China, huali_lu@csu.edu.cn; Huaqing Wu, University of Calgary,
Calgary, Canada, huaqing.wu1@ucalgary.ca; Fan Wu, Central South University, Changsha, China, wfwufan@csu.edu.cn; Yongmin Zhang,
Central South University, Changsha, China, zhangyongmin@csu.edu.cn; Yaoxue Zhang, Tsinghua University, Beijing, China, zhangyx@tsinghua.edu.cn.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy
otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
permissions@acm.org.
©2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM 2474-9567/2024/12-ART161
https://doi.org/10.1145/3699730
Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., Vol. 8, No. 4, Article 161. Publication date: December 2024.
ACM Reference Format:
Feng Lyu, Jie Zhang, Huali Lu, Huaqing Wu, Fan Wu, Yongmin Zhang, and Yaoxue Zhang. 2024. SynthCAT: Synthesizing
Cellular Association Traces with Fusion of Model-Based and Data-Driven Approaches. Proc. ACM Interact. Mob. Wearable
Ubiquitous Technol. 8, 4, Article 161 (December 2024), 24 pages. https://doi.org/10.1145/3699730
1 INTRODUCTION
Cellular networks are integral to our daily lives, ensuring seamless wireless connections for mobile users [1]. According to Ericsson’s reports [2], there will be a substantial rise in cellular IoT connections, from 1.7 billion in 2020 to 5.9 billion in 2026. This explosive growth will generate an abundance of cellular association traces, presenting immense potential for scientific research and diverse location-based services, including user traveling behavior profiling [3–5], human mobility modeling [6–9], outdoor localization [10, 11], map matching [12–14], etc.
However, practical issues hinder their widespread usage, including the lack of publicly available data, prohibitive
manual collection cost, and unavailability of network operator data for direct utilization. The synthesis of realistic
and diverse cellular association traces becomes critical to bridge the gap, which can enhance existing datasets,
facilitate training of deep learning models, synthesize target cellular association traces in any city without
requiring additional measurements, and protect user privacy. Trace synthesis is thus a potent way to expedite research,
development, and implementation processes across various applications.
However, synthesizing realistic and diverse cellular association traces presents a formidable technical challenge, marked by three distinct hurdles. Firstly, the disparity in geographic spaces between road segment-based trajectories and base station (BS) handover-based cellular association traces stands as a critical obstacle. Overcoming this challenge necessitates an accurate resolution of the mapping issue from the road space to the BS space. Secondly, in areas with multiple BSs, the complexity and dynamism of BS handover processes increase due to considerations such as signal coverage and load balancing. While real cellular networks follow a well-defined BS handover process governed by received signal levels, designing a judicious handover rule that incorporates signal characteristics like strength, coverage, and direction to emulate the operator’s black-box handover strategy is indispensable for effective synthesis of cellular association traces. Lastly, when it comes to synthesizing moving traces, relying solely on traditional model-based or data-driven approaches often encounters limitations, leading to trace inaccuracies caused by underlying assumptions or unmet real-world training data demands. To overcome this limitation, an innovative approach that synergistically combines the strengths of both techniques becomes essential. A model-based approach can capture the fundamental essence of underlying data patterns using explicit mathematical or statistical models with minimal data requirements. Simultaneously, a data-driven approach excels in unveiling hidden patterns and insights in complex and dynamic environments.
In the field of synthesizing mobility data, existing studies can be broadly categorized into road trajectory synthesis [15–17] and GPS trajectory synthesis [18–20]. The first category focuses on synthesizing routes which represent the movement of objects along specific road segments, typically consisting of spatial coordinates or unique road identifiers. For instance, Jiang et al. [15] proposed a two-stage GAN-based approach to generate continuous road trajectories. On the other hand, GPS trajectory synthesis involves synthesizing trajectories based on GPS location data, providing a more continuous and detailed representation of an object’s movement. For instance, Sun et al. [18] investigate the task of GPS trajectory synthesis while ensuring privacy preservation and data utility. However, these studies mainly target mobility data derived from vehicle tracking systems or GPS locating devices, the traces of which are limited in terms of user scales and spatio-temporal coverage. They often overlook data collected from cellular networks, which actually offer strengths in terms of coverage, user penetration, granularity, and cost-effectiveness. Consequently, the research on synthesizing cellular association traces is currently an unexplored area. Moreover, current data synthesis techniques are typically either purely model-based [21–24] or data-driven [15, 25–28], both of which have inherent drawbacks. Particularly, model-based approaches may achieve limited performance with simplistic assumptions, while data-driven approaches often require substantial real-world training data that may not be readily available.
In this paper, we initiate the investigation of synthesizing cellular association traces in accordance with road segment-based trajectories and corresponding departure times. Initially, we define the cellular association trace synthesis (CATS) problem and describe the data acquisition process, encompassing road segment-based trajectories, BS distribution, and ground truths of cellular association traces. Subsequently, we conduct a systematic data analysis, revealing technical challenges in terms of disparity in geographic spaces, complex and dynamic BS handover, and poor performance of single-dimension approaches. Drawing inspiration from the data-driven insights, we propose an innovative cellular association trace synthesis scheme, named SynthCAT, which integrates model-based and data-driven approaches in a coarse-grained to fine-grained trace synthesis process. In SynthCAT, we first propose a model-based coarse-grained cellular association trace generation method that: i) generates GPS reference points to address the disparity between road and BS spaces; ii) generates initial time information for GPS reference points by modeling the relationship between average travel speed, time, and area; iii) maps timestamped GPS-based trajectories to BS handover-based trajectories using a Bayesian decision model with explicit BS switching rules and time mapping. We then propose a model-based and data-driven fusion module, utilizing a two-stage Autoencoder Generative Adversarial Network (AEGAN) to synthesize fine-grained cellular association traces based on the coarse-grained ones. The first stage generates a plausible grid representation of the trajectory matrix, while the second stage focuses on mapping fine-grained BS IDs with cellular association trace points within the grids and calculating the final arrival time for each trace point. Extensive experiments are conducted to demonstrate the efficacy of SynthCAT, illustrating unequivocally that it significantly outperforms state-of-the-art baselines. The main contributions are summarized as follows.
• This study presents an innovative investigation of the CATS problem, aiming to synthesize realistic and diverse cellular association traces aligned with road segment-based trajectories and corresponding departure times. By bridging a gap in existing literature, which predominantly focuses on road and GPS trajectories, our work facilitates downstream applications that benefit from extensive synthesized traces without privacy concerns.
• We acquire substantial data to support this study, including road segment-based trajectories, BS distribution, and ground truths of cellular association traces. Then, we conduct systematic analytics and reveal several important technical challenges, including disparity in geographic spaces, complex and dynamic BS handover, and poor performance of single-dimension approaches.
• We propose an original cellular association trace synthesis scheme, called SynthCAT, which incorporates both a model-based coarse-grained cellular association trace generation component and a fine-grained cellular association trace generation component, fusing model-based and data-driven approaches. This integrated scheme collectively synthesizes traces that closely resemble ground truths and efficiently support downstream applications. Extensive field experiments corroborate its efficacy.
The remainder of this paper is organized as follows. We commence with a detailed exposition of the problem definition and data acquisition in Section 2. Subsequently, in Section 3, we conduct empirical data analytics to reveal data-driven challenges. Section 4 delves into the design of SynthCAT. Extensive experiments are presented in Section 5. Following that, Section 6 discusses the limitations of this work, and Section 7 provides a comprehensive review of related works. Finally, we conclude the paper in Section 8.
Fig. 1. Illustration of cellular association trace synthesis (CATS) problem. (Figure elements: road map with nodes and road segments; road segment-based trajectory; departure time; BS distribution; ground truth of cellular trace; a synthetic example of trace.)
2 PROBLEM DEFINITION AND DATA ACQUISITION
2.1 Problem Definition
When a user with a SIM card moves along geographic road segments, the cellular network operator passively collects association traces (i.e., logs of when and which BS the user associates with) according to the road segment-based trajectories and corresponding departure times. These cellular association traces offer significant advantages such as full user penetration, complete area coverage, and continuous time availability. They effectively capture users’ mobile behaviors and enable various location-based services. However, realizing their full potential is challenged by the limited availability of open-source cellular association data, primarily due to privacy concerns and commercial considerations. To bridge this gap, our study focuses on synthesizing these traces based on road segment-based trajectories, corresponding departure times, and the underlying BS distribution.
Figure 1 illustrates the cellular association trace synthesis problem. Our goal is to synthesize cellular association traces that align with road segment-based trajectories within a predefined road map, where the BS distribution is known in advance, along with the corresponding departure times. The synthesized traces aim to resemble the ground truths in both spatial and temporal dimensions. We formulate the problem as follows.
Denition 2.1 (Road Segment-Based Trajectory). A road segment-based trajectory is a sequence of road
segments generated when moving. It is represented by a sequence of chronologically ordered points
𝑅=
{𝑟1, 𝑟2, . . . , 𝑟𝑛}, where 𝑟𝑖represents a road segment ID.
Denition 2.2 (Cellular Association Trace). A cellular association trace is a sequence of timestamped BSs
accessed by a user on the move, passively collect by the network operator. It is denoted by T={(𝑏𝑠1, 𝑡 1),(𝑏𝑠2,
𝑡2), . . . , (𝑏𝑠𝑛, 𝑡𝑛)}
, where
𝑏𝑠𝑖
denotes the
𝑖
-th associated BS, and
𝑡𝑖
represents the arrival time at
𝑏𝑠𝑖
. Each BS entry
includes information about the ID 𝑖𝑑, geographic location 𝑙𝑜𝑐 =(𝑙𝑜 𝑛, 𝑙𝑎𝑡 ), and signal coverage radius 𝑟𝑎𝑑.
Denition 2.3 (Cellular Association Trace Synthesis (CATS) Problem). Given a set of road segment-based
trajectories
{𝑅1, 𝑅2, . . . , 𝑅𝑛}
, along with their corresponding departure times, a complete BS distribution, and a
small set of cellular association traces
{T
1,T
2, . . . , T
𝑚}
for model training, the CATS problem aims to synthesize
a new large dataset of cellular association traces
ˆ
T
1,ˆ
T
2, . . . , ˆ
T
𝑀(𝑀𝑚)
. Here,
𝑅𝑖
,
T
𝑖
, and
ˆ
T
𝑖
denote the
𝑖
-th
road segment-based trajectory,
𝑖
-th original cellular association trace, and the
𝑖
-th newly synthesized cellular
association trace, respectively.
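As a reading aid, the objects in Definitions 2.1 to 2.3 can be written down as simple data types. The sketch below is only one possible encoding: the class and alias names (`BaseStation`, `RoadTrajectory`, `CellularTrace`, `is_valid_trace`) are hypothetical, and the units (metres for `rad`, seconds for time) are assumptions that the definitions do not fix.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BaseStation:
    """One BS entry as in Definition 2.2: ID, location, coverage radius."""
    id: str
    lon: float
    lat: float
    rad: float  # signal coverage radius; metres are an assumption


# Definition 2.1: chronologically ordered road segment IDs.
RoadTrajectory = List[int]

# Definition 2.2: chronologically ordered (BS, arrival time) pairs.
# Arrival times are plain floats here (e.g. seconds); the unit is an assumption.
CellularTrace = List[Tuple[BaseStation, float]]


def is_valid_trace(trace: CellularTrace) -> bool:
    """Check that arrival times never decrease along the trace."""
    times = [t for _, t in trace]
    return all(a <= b for a, b in zip(times, times[1:]))
```

In this encoding, the CATS problem takes a list of `RoadTrajectory` values with departure times, a list of `BaseStation` entries, and a few `CellularTrace` samples, and returns many new `CellularTrace` values.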
Fig. 2. Cellular association trace collection platform and campaigns. (Panels: data collection platform; data collection in car; data collection in subway.)
Denition 2.4 (Map Matching). Map matching is the process of projecting a cellular association trace
T
onto the road topology to get a matched road-segment trajectory
𝑅
, which serves as the related downstream
application to verify the eectiveness of the proposed SynthCAT in this paper.
2.2 Data Acquisition
2.2.1 Road Segment-Based Trajectories. In this study, we investigate the CATS problem within three 10 km × 10 km city areas, encompassing two urban areas and one suburban area. To substantiate our research, we leverage the data from OpenStreetMap (OSM) [29], a publicly accessible open-source map service with physical geographical information. Firstly, we collect a road map comprising 8,993 road segments with various types of roads, such as motorways, primary roads, and footways. The road map is represented as a directed graph $graph = (V, E)$, where $V$ denotes nodes representing intersections or terminal points, and $E$ represents road segments connecting these nodes. Each node $v_i$ consists of a unique node ID $vid_i$ and its corresponding geographic location $(lon_i, lat_i)$ denoting longitude and latitude, respectively. Each edge $e_i^j$ between nodes $v_i$ and $v_j$ contains several GPS positions forming a geometry property. Within the road map, we generate 4,000 road segment-based trajectories from OSM using a path planning algorithm. Notably, to enhance the robustness of the study, we collect different types of routes with differentiated BS densities and driving conditions. Specifically, we randomly select two nodes, $v_i$ and $v_j$, from $V$ as the origin and destination, and then employ the path planning algorithm to generate a route $R$ between $v_i$ and $v_j$. This dataset serves as valuable input for the CATS problem, enabling us to explore and develop efficient approaches for cellular association trace synthesis.
2.2.2 Base Station Distribution. To synthesize cellular association traces, we require the full distribution of BSs in the targeted area. Initially, we obtain the BS distribution data, denoted by $BS = \{bs_i\}$, from an open-source database [30], containing the distribution data of BSs in one country. To validate the feasibility of using this open-source data, we compared it with the BS distribution data obtained from a cooperative network operator, which, however, cannot be publicly shared due to privacy and commercial concerns. The comparison verifies that the BS distribution in the open-source dataset closely aligns with the real BS distribution. Although there is a slight deviation in the geographic location (within 20 meters), it is acceptable and does not affect the design and performance of the cellular association trace synthesis scheme. Considering the limitations of using the operator’s data, we adopt the distribution of BSs from the open-source database, which contains 5,770 BSs within the targeted areas, to substantiate this research.
Fig. 3. Technical challenges. (Panels: (a) road space; (b) BS space; (c) examples of BS handover, showing two cellular association traces along one road segment-based trajectory; (d) uncertainty of BS handover, CDF of Jaccard distance; (e) poor performance, precision and recall for model-driven, data-driven, and real data.)
2.2.3 Ground Truths of Cellular Association Traces. To facilitate our research and achieve sufficient training and testing data, we develop a comprehensive cellular association trace collection platform and conduct extensive data collection campaigns, as illustrated in Fig. 2. This platform comprises a real-time Android-based data collection App, along with 20 mobile phones. To ensure broad compatibility across all cellular networks, we focus on collecting three universally available fields during BS association, i.e., operator, cid, and time. With the active involvement of consenting volunteers, we collect substantial ground truths of cellular association traces along the road segment-based trajectories generated earlier. Our goal is to establish a one-to-many correspondence between road segment-based trajectories and cellular association traces, providing ground truths for model training. Ultimately, we successfully amass a total of 4,000 cellular association traces, involving interactions with 2,882 BSs within the targeted areas, and totaling an overall time length of 210.9 hours.
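To make the record format concrete, the sketch below turns a log of (operator, cid, time) records into an association trace. It rests on an assumption not stated in the text: that the App may log the same serving BS several times in a row, so consecutive repeats are collapsed and only the first arrival time is kept. The function name `to_association_trace` is hypothetical.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, int, float]  # (operator, cid, time)


def to_association_trace(records: Iterable[Record]) -> List[Record]:
    """Sort records by time and keep one entry per stay at a BS."""
    trace: List[Record] = []
    for operator, cid, time in sorted(records, key=lambda r: r[2]):
        if trace and trace[-1][:2] == (operator, cid):
            continue  # still attached to the same BS: keep the first arrival time
        trace.append((operator, cid, time))
    return trace
```

A BS that is revisited later in the trip appears again in the output, because only consecutive repeats are merged.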
3 DATA-DRIVEN CHALLENGES
3.1 Disparity in Geographic Spaces
Synthesizing cellular association traces in accordance with road segment-based trajectories faces the challenge of disparate geographic spaces, primarily characterized by heterogeneous distribution and representation mismatch of road segments and BSs. Firstly, road segments exhibit a continuous and interconnected distribution, whereas BSs are scattered and irregularly distributed. Secondly, road segment-based trajectories are typically represented using spatial coordinates or unique road identifiers, while cellular association traces are represented based on BSs. Figures 3(a) and 3(b) exemplify this difference in representing three routes traveled by the same user. Although they convey the same information, they are represented in distinct ways. To overcome these challenges, it is necessary to establish the mapping between road and BS spaces, associating road segments with specific BSs. However, this mapping is not always a one-to-one case. A single road segment may be served by multiple BSs, and conversely, a single BS may cover multiple road segments. Thus, addressing the disparity in road and BS spaces becomes critical in synthesizing accurate and realistic cellular association traces.
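The many-to-many relationship between road segments and BSs can be illustrated with a small coverage check. This is not the mapping method of SynthCAT; it is a simplified sketch that assumes circular coverage of radius `rad` (in metres) around each BS and uses the haversine great-circle distance. The function names and the dictionary layouts are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Set, Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in metres between two (lon, lat) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def coverage_maps(
    segment_points: Dict[int, List[Tuple[float, float]]],
    stations: Dict[str, Tuple[float, float, float]],
) -> Tuple[Dict[int, Set[str]], Dict[str, Set[int]]]:
    """Map each segment to the BSs covering any of its points, and vice versa.

    segment_points: segment ID -> [(lon, lat), ...]
    stations: BS ID -> (lon, lat, coverage radius in metres)
    """
    seg_to_bs: Dict[int, Set[str]] = defaultdict(set)
    bs_to_seg: Dict[str, Set[int]] = defaultdict(set)
    for seg_id, points in segment_points.items():
        for lon, lat in points:
            for bs_id, (bs_lon, bs_lat, rad) in stations.items():
                if haversine_m(lon, lat, bs_lon, bs_lat) <= rad:
                    seg_to_bs[seg_id].add(bs_id)
                    bs_to_seg[bs_id].add(seg_id)
    return dict(seg_to_bs), dict(bs_to_seg)
```

Even in a toy layout with two segments and two BSs, one segment can fall under both BSs while one BS covers both segments, which is exactly why a direct segment-to-BS lookup table is insufficient.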
3.2 Complex and Dynamic BS Handover
The intricate nature of BS connections and handovers introduces considerable uncertainty to cellular association traces, influenced by several factors. The presence of multiple BSs within an area, the limited coverage range of BSs, and varying signal strengths in different areas contribute to the complexity of BS handovers. The continuous movement of users and dynamic network conditions further compound the uncertainty, leading to diverse association sets even for users traveling the same route in the same vehicle, as depicted in Fig. 3(c). To quantify the uniqueness of cellular association traces, Fig. 3(d) illustrates cumulative distribution functions (CDF) of the Jaccard distance, measuring dissimilarity in associated BS sets. The CDF results reveal that, even along the same road, no two traces are identical (which would give a Jaccard distance of 0), and some pairs are completely different (Jaccard distance of 1), highlighting the distinctiveness of each trace. These inherent complexities and dynamics in BS handover present formidable challenges for cellular association trace synthesis.
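For reference, the Jaccard distance between two associated BS sets $A$ and $B$ is $1 - |A \cap B| / |A \cup B|$. A minimal sketch follows; the function name and the convention of returning 0 for two empty sets are assumptions.

```python
from typing import Hashable, Iterable


def jaccard_distance(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B| for the BS ID sets of two traces."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty traces are treated as identical (assumed convention)
    return 1.0 - len(set_a & set_b) / len(union)
```

Two traces that share two of four distinct BSs have a distance of 0.5; identical BS sets give 0 and disjoint sets give 1.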
3.3 Poor Performance of Single-Dimension Approaches
To overcome the limitations posed by insufficient available data, it is crucial to develop advanced techniques for cellular association trace synthesis. While some existing methods aim to generate large-scale synthetic data, they often fall short in terms of data quality due to their simplistic nature and reliance on a single-dimension design paradigm. For instance, model-based approaches depend solely on assumptions and simplified models, failing to accurately capture the intricate dynamics of cellular association traces. Similarly, data-driven approaches learn features through a black-box methodology, neglecting explicit mathematical features and being constrained by limited, biased, and non-diverse training data. The suboptimal performance of single-dimension approaches becomes apparent in downstream applications, as illustrated in Fig. 3(e) with map matching as an example. Two representative algorithms, a model-based approach (Bayes) and a data-driven approach (trajGANs)¹, were selected to assess the impact of their synthesized data on a typical map matching task. The observed results indicate that the data synthesized by both model-based and data-driven models does not significantly improve map matching performance. Thus, there is a pressing need to develop advanced techniques that synergistically leverage multidimensional approaches for cellular association trace synthesis.
4 DESIGN OF SYNTHCAT
4.1 Design Overview
Figure 4 presents an overview of the SynthCAT design, comprising two major components: model-based coarse-grained cellular association trace generation, and fine-grained cellular association trace generation via fusion of model-based and data-driven approaches. The first component synthesizes a coarse-grained cellular association trace for a given road segment-based trajectory and departure time, utilizing three modules, i.e., GPS reference generation, weighted historical average time generation, and Bayesian decision. These modules: i) generate GPS reference points to unify different road and BS spaces into a single GPS space; ii) calculate arrival times for each GPS point based on weighted historical average travel speeds; iii) synthesize a coarse-grained cellular association trace using the generated timestamped GPS-based trajectory and BS distribution via Bayesian decision. The second component refines the synthesized traces by integrating model-based and data-driven approaches using real data in two stages: first, refining the traces with grid representation to enhance realism, and second, synthesizing diverse traces by mapping grids to BSs and their corresponding arrival times. The following subsections provide a detailed explanation of each component, highlighting their functionalities and contributions.
¹ Detailed setup can be found in the performance evaluation section.
Fig. 4. Design overview of SynthCAT. (Figure elements: model-based coarse-grained cellular association trace generation, with road map, road segment-based trajectory, GPS reference generation, GPS-based spatial trajectory, weighted historical average time generation, timestamped GPS-based trajectory, BS distribution, Bayesian decision, and coarse-grained cellular association trace; fine-grained cellular association trace generation via fusion of model-based and data-driven approaches, with gridding, grid-based representation of coarse-grained and real cellular association traces, generator with encoder, attention, and decoder, discriminator, stage one, stage two, grid-based spatial trace, grid-based temporal trace, and fine-grained cellular association traces.)
4.2 Model-Based Coarse-Grained Cellular Association Trace Generation
4.2.1 GPS Reference Generation. Data analysis verifies that a one-to-one correspondence between road segments and BSs is lacking. To establish the mapping between road and BS spaces, the challenge of representing this one-to-many relationship needs to be addressed. Upon close examination of road segment-based trajectories and cellular association traces, it is found that they intersect in the GPS space. Both the starting and ending points of road segments and the location of BSs are represented by GPS coordinates. However, relying solely on sparse GPS points is insufficient for accurately mapping the two spaces. To tackle this issue, we propose to use intermediate GPS reference points to associate road segments with specific BSs.
To generate GPS reference points, we propose three processing methods, i.e., linear interpolation, corner sparsification, and denseness rearrangement. The choice of method depends on two thresholds, i.e., the sampling distance of GPS points $gap$, and the corner angle $\theta$, which is used to identify whether three GPS points form a turning corner. Initially, we obtain an initial set of GPS points ($X$) from the road segment-based trajectory, denoted by $(lon_i, lat_i)$ for each point $x_i$. Next, we calculate the distance $d$ between $x_i$ and $x_{i+1}$. If $d > gap$, we interpolate between $x_i$ and $x_{i+1}$ using linear interpolation, denoted by
$$lat = \frac{lon_{i+1} - lon}{lon_{i+1} - lon_i}\, lat_i + \frac{lon - lon_i}{lon_{i+1} - lon_i}\, lat_{i+1}, \tag{1}$$
$$gap = \sqrt{(lat - lat_i)^2 + (lon - lon_i)^2}. \tag{2}$$
Combining Eq. (1) and Eq. (2), the coordinates $(lon, lat)$ of the interpolated GPS point can be obtained. If $d < gap$, we calculate the angle $\alpha$ between $\overrightarrow{x_{i+1} x_i}$ and $\overrightarrow{x_{i+1} x_{i+2}}$. If $\alpha < \theta$, indicating a turning corner, we perform corner sparsification by removing point $x_{i+1}$. If $\alpha > \theta$, representing a dense area of GPS points, we conduct denseness rearrangement by deleting point $x_{i+1}$ and inserting a new point $x_{i+1}$ at a consistent interval. This process is repeated until all points in $X$ have been processed. Finally, we generate a uniform and dense GPS-based spatial trajectory for each road segment-based trajectory, which serves as location reference points for space mapping.
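Two building blocks of this procedure, linear interpolation and the corner angle test, can be sketched as below. This is a simplified illustration under stated assumptions, not the full three-method procedure: coordinates are treated as planar, as in Eq. (2); interpolated points are spaced evenly with spacing at most `gap` rather than exactly `gap`; and corner sparsification and denseness rearrangement are not reproduced. The function names `densify` and `corner_angle` are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (lon, lat), treated as planar coordinates


def densify(points: List[Point], gap: float) -> List[Point]:
    """Insert linearly interpolated points so consecutive spacing is at most gap."""
    if not points:
        return []
    out = [points[0]]
    for (lon0, lat0), (lon1, lat1) in zip(points, points[1:]):
        d = math.hypot(lon1 - lon0, lat1 - lat0)
        n = max(1, math.ceil(d / gap))  # number of sub-intervals on this leg
        for k in range(1, n + 1):
            f = k / n  # fraction of the way from x_i to x_{i+1}
            out.append((lon0 + f * (lon1 - lon0), lat0 + f * (lat1 - lat0)))
    return out


def corner_angle(prev: Point, mid: Point, nxt: Point) -> float:
    """Angle in degrees at mid between the vectors mid->prev and mid->nxt."""
    ax, ay = prev[0] - mid[0], prev[1] - mid[1]
    bx, by = nxt[0] - mid[0], nxt[1] - mid[1]
    norm = math.hypot(ax, ay) * math.hypot(bx, by)
    if norm == 0.0:
        return 180.0  # coincident points: treated as a straight line (assumption)
    cos_a = max(-1.0, min(1.0, (ax * bx + ay * by) / norm))
    return math.degrees(math.acos(cos_a))
```

With this angle convention, three collinear points give 180 degrees and a right-angle turn gives 90 degrees, so a small angle relative to $\theta$ signals a turning corner, matching the $\alpha < \theta$ test in the text.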
Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., Vol. 8, No. 4, Article 161. Publication date: December 2024.
SynthCAT: Synthesizing Cellular Association Traces with Fusion of Model-Based and Data-Driven Approaches 161:9
[Figure 5: three panels plotting average travel speed (km/h) against hour of day. (a) Average travel speed over two days. (b) Average travel speed in various areas. (c) Modeling of average travel speed, showing observed values and fitted results.]
Fig. 5. Description of average travel speed collected from Amap API.
4.2.2 Weighted Historical Average Time Generation. Once a GPS-based spatial trajectory is generated, the next step is to determine the corresponding travel time for each GPS trajectory point. Because the GPS-based spatial trajectory is uniform and dense, travel time can be calculated directly from travel distance and speed. Therefore, obtaining accurate travel speeds for different trajectories is crucial. To address this, we propose a weighted historical average method.
Initially, we collect average travel speed data over a continuous two-week period using the Amap API², an open-source platform providing real-time traffic information. This platform offers real-time traffic conditions within any specified area, including congestion levels, the proportion of congested road sections, and average travel speeds, with data sampled at one-hour intervals. Figure 5 illustrates the data collected. Our data analysis reveals two key observations. First, the average travel speed exhibits a clear periodicity, with a 24-hour cycle. Second, the average travel speeds in different areas show a linear relationship. Based on these findings, we model the average travel speed as a function of time and area, expressed as $speed = \omega f(t)$, where $f(t)$ represents the city’s average travel speed over one day, with time $t$ ranging from 0 to 23, and $\omega$ is a weight for different areas. Extensive fitting trials indicate that a double sine function provides the best fit, denoted by

$$f(t) = \alpha_1 \sin(\alpha_2 t + \alpha_3) + \alpha_4 \sin(\alpha_5 t + \alpha_6) + \alpha_7, \tag{3}$$

where $(\alpha_1, \alpha_2, \alpha_3, \alpha_4, \alpha_5, \alpha_6, \alpha_7)$ are fitting parameters. After optimization, these parameters are determined as $(11.5, 0.1, 3, 4.5, 0.67, -1.23, 38.76)$, with $\omega$ ranging from 0.8 to 1.1, higher in more developed areas. Thus, the average travel speed function is

$$speed = \omega\,(11.5 \sin(0.1t + 3) + 4.5 \sin(0.67t - 1.23) + 38.76), \quad \omega \in (0.8, 1.1). \tag{4}$$

Figure 5(c) shows the fitting results of Eq. (4), demonstrating that the fitted curve closely matches the observed values, with homogeneous shapes. Finally, we calculate the travel time for each pair of adjacent GPS points based on their distance and corresponding travel speed. The arrival time for each GPS point is then determined by adding the travel time to the arrival time of the previous point.
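As a rough illustration of how Eq. (4) turns distances into arrival times, the following Python sketch evaluates the fitted speed function and accumulates travel time along a trajectory. The function names, the use of hours as the time unit, and the choice to evaluate the speed at the arrival hour of the previous point are assumptions made for this example, not details given in the paper.

```python
import math

def avg_speed_kmh(t_hour, omega=1.0):
    """Eq. (4): area-weighted average travel speed (km/h) at hour t."""
    f = (11.5 * math.sin(0.1 * t_hour + 3)
         + 4.5 * math.sin(0.67 * t_hour - 1.23)
         + 38.76)
    return omega * f

def arrival_times(distances_km, depart_hour, omega=1.0):
    """Accumulate arrival times (in hours) along a trajectory.

    distances_km[k] is the distance between GPS points k and k+1.
    Assumption: the speed for a hop is evaluated at the arrival hour
    of the hop's starting point (wrapped to a 24-hour cycle).
    """
    times = [depart_hour]
    for d in distances_km:
        v = avg_speed_kmh(times[-1] % 24, omega)
        times.append(times[-1] + d / v)
    return times
```

With the fitted parameters, the function stays within roughly 22 to 41 km/h over the 24 hourly samples, which is consistent with the 20 to 45 km/h axis range of Fig. 5.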
4.2.3 Bayes Decision. To synthesize cellular association traces from timestamped GPS-based trajectories, we propose a GPS-to-BS mapping algorithm based on the Bayesian decision model. This method matches each GPS point with the optimal BS and maps the corresponding time to the BS, thereby generating coarse-grained cellular association traces.
Firstly, we determine the possible BS candidate set $B = \{b_1, b_2, ..., b_n\}$ for a given GPS reference $x$ from the BS distribution $BS = \{bs_i\}$, based on a distance constraint $dis$. If $Distance(bs_i, x) < dis$, then $bs_i$ is a possible BS candidate for $x$, where $Distance()$ calculates the spherical distance between two points. Since $dis$ is an empirical
² https://report.amap.com/detail.do?city=430100
value, it might be larger than the BS coverage radius, meaning $x$ is not necessarily within each BS candidate’s coverage. Therefore, we assess the likelihood of $x$ being covered by each candidate $b_i$ as follows

$$p(b_i) = \begin{cases} \dfrac{(rad_i - d_i)^2}{rad_i^2}, & d_i < rad_i,\ i \in (1, 2, \cdots, n), \\ 0, & \text{otherwise}, \end{cases} \tag{5}$$

where $rad_i$ represents the coverage radius of BS candidate $b_i$, and $d_i = Distance(b_i, x)$.
The conditional probability of $x$ undergoing a handover from $b_{i-1}$ to $b_i$ is denoted by $p(x|b_i)$, $i \in (1, 2, \cdots, n)$. The overall probability of $x$ experiencing a handover across all BS coverage areas is

$$P(X) = p(x|b_1)p(b_1) + \cdots + p(x|b_n)p(b_n). \tag{6}$$

According to Bayes’ theorem, the probability of $x$ switching to each BS is calculated as

$$p(b_i|x) = \frac{p(X_i)}{P(X)} = \frac{p(x|b_i)p(b_i)}{p(x|b_1)p(b_1) + \cdots + p(x|b_n)p(b_n)}, \quad i \in (1, 2, \cdots, n). \tag{7}$$

The decision function (DF) is then obtained as

$$DF = \max \{p(b_1|x), \cdots, p(b_n|x)\}. \tag{8}$$

The BS that attains the maximum in DF is the one to which the GPS reference $x$ is mapped. In this paper, the parameter is designed using maximum a posteriori estimation. It is assumed that $p(x|b_i)$ follows an independent exponential distribution $\lambda e^{-\lambda d_i}$. Considering the independence of the location of $x$ within the BS coverage area, the likelihood function of $\lambda$ is established as

$$L = \prod_{i=1}^{n} \lambda e^{-\lambda d_i}. \tag{9}$$

Taking the logarithm, $\ln L = n \ln \lambda - \lambda \sum_{i=1}^{n} d_i$, and setting its derivative with respect to $\lambda$ to zero gives

$$\lambda = \frac{n}{\sum_{i=1}^{n} d_i} = \frac{1}{\bar{D}}, \tag{10}$$

where $\bar{D}$ represents the mean distance between $x$ and all BS candidates $b_i$. Through this optimal BS selection process, each GPS reference is matched with the optimal BS.
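The selection rule in Eqs. (5)-(10) can be sketched in a few lines of Python. This is an illustrative reading rather than the authors' code: the function name `select_bs`, the tuple layout of the candidates, the fallback value of $\lambda$ when all distances are zero, and returning `None` when no candidate covers $x$ are all assumptions of the example.

```python
import math

def select_bs(candidates):
    """Pick the candidate BS with the largest posterior p(b_i | x).

    `candidates` is a list of (bs_id, d_i, rad_i): the distance from the
    GPS reference x to the BS and the BS coverage radius, in the same unit.
    Returns None when no candidate covers x (assumed behaviour).
    """
    if not candidates:
        return None
    # Eq. (10): lambda = n / sum(d_i) = 1 / mean distance.
    total = sum(d for _, d, _ in candidates)
    lam = len(candidates) / total if total > 0 else 1.0  # assumed fallback
    scores = {}
    for bs_id, d, rad in candidates:
        prior = ((rad - d) ** 2) / rad ** 2 if d < rad else 0.0  # Eq. (5)
        likelihood = lam * math.exp(-lam * d)                     # p(x | b_i)
        scores[bs_id] = likelihood * prior                        # numerator of Eq. (7)
    evidence = sum(scores.values())                               # Eq. (6)
    if evidence == 0:
        return None
    posteriors = {b: s / evidence for b, s in scores.items()}     # Eq. (7)
    return max(posteriors, key=posteriors.get)                    # Eq. (8)
```

Because both the prior in Eq. (5) and the exponential likelihood decrease with distance, a nearby BS with a comparable radius is always preferred; the posterior only changes the ranking when coverage radii differ substantially.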
Secondly, we generate temporal information by mapping timestamps from GPS points to their corresponding
BSs. When multiple GPS points are mapped to the same BS due to variations in sampling intervals and coverage
areas, we assign the timestamp of the first GPS point as the representative timestamp for that BS. This process
ultimately yields coarse-grained cellular association traces that approximate the original road segment-based
trajectories and their corresponding departure times.
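The time-mapping rule can be sketched as follows. The function name is hypothetical, and the sketch assumes that only consecutive GPS points mapped to the same BS are merged, so a BS revisited later in the trip produces a new trace entry; the paper does not state how revisits are handled.

```python
def collapse_to_bs_trace(matched):
    """Merge consecutive GPS points that map to the same BS.

    `matched` is a list of (bs_id, timestamp) pairs, one per GPS reference,
    in travel order. Each run of identical BS IDs keeps the timestamp of
    its first GPS point.
    """
    trace = []
    for bs_id, ts in matched:
        if not trace or trace[-1][0] != bs_id:
            trace.append((bs_id, ts))
    return trace
```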
4.3 Fine-Grained Cellular Association Trace Generation via Fusion of Model-Based and Data-Driven Approaches
The model-based approach captures only explicit features, resulting in coarse-grained and less realistic cellular
association traces. To address this limitation, we further design a two-stage AEGAN component that combines
model-based and data-driven approaches for trajectory refinement. As shown in Fig. 6, this fusion component
includes a novel AEGAN to generate plausible grid-based trajectories and a spatial selection and time mapping
scheme to determine BS IDs and map final arrival times.
[Figure 6: two-stage pipeline. Stage One (Synthesizing Grid-Based Traces): the coarse-grained and real cellular association traces are converted to grid-based representations; the generator (embedding layer and GRU encoder, attention, and GRU decoder with FC layers) produces a fake grid-based trace, and the discriminator (embedding layer, GRU, FC, sigmoid) separates real from fake traces. Stage Two (Synthesizing Cellular Association Traces): spatial selection and time mapping produce the fine-grained cellular association traces.]
Fig. 6. Architecture of two-stage AEGAN.
4.3.1 Stage One: Synthesizing Grid-Based Traces. As shown in Fig. 6, the first stage consists of three modules: the grid-based representation for converting the cellular association trace into a grid-based representation, the generator for approximating the distribution of real trace samples, and the discriminator for distinguishing between real and synthetic trace samples.
Grid-Based Representation. Our AEGAN model refines synthesized trajectories by taking coarse-grained cellular association traces as input. To create a simplified representation that is robust against location noise and enhances the generalization of trajectories, we project these traces into grid-based representations. First, we divide the targeted area into equally-sized grid cells with a given side length, $\epsilon$, in both latitude and longitude, resulting in a total of $l_r \times l_c$ cells. Next, given a cellular association trace $\mathcal{T} = \{(bs_1, t_1), (bs_2, t_2), ..., (bs_n, t_n)\}$, we discretize it by mapping each BS point to a grid cell based on its location, and then convert it into a grid-based representation $\mathcal{C}$. Here, $\mathcal{C} \in \mathbb{R}^{N_t \times 2}$ contains both spatial and temporal dimensions, where $N_t$ represents the maximum length of the traces. For each entry $(bs_k, t_k)$ of the raw cellular association trace $\mathcal{T}$, which falls into the cell $(i, j)$, $\mathcal{C}_{k1} = i \cdot l_c + j$ and $\mathcal{C}_{k2} = t_k - t_{k-1}$, where $t_k$ is the Unix timestamp and $\mathcal{C}_{k0} = 0$. Real cellular association traces undergo similar processing to achieve grid-based representation, ensuring consistency with the coarse-grained cellular association traces.
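A minimal Python sketch of this discretization is given below. The function and parameter names are hypothetical, and several details are assumptions: rows are indexed by latitude and columns by longitude, the first time difference is set to 0, and padding to the maximum length $N_t$ is omitted.

```python
def to_grid_representation(trace, bs_locations, lat_min, lon_min, eps, n_cols):
    """Convert [(bs_id, unix_ts), ...] into rows [cell_index, delta_t].

    bs_locations maps bs_id -> (lat, lon); eps is the cell side length in
    degrees; n_cols is the number of grid columns (l_c in the text).
    """
    rows = []
    prev_ts = None
    for bs_id, ts in trace:
        lat, lon = bs_locations[bs_id]
        i = int((lat - lat_min) // eps)   # row index (assumed: latitude)
        j = int((lon - lon_min) // eps)   # column index (assumed: longitude)
        delta = 0 if prev_ts is None else ts - prev_ts  # t_k - t_{k-1}
        rows.append([i * n_cols + j, delta])            # C_k1, C_k2
        prev_ts = ts
    return rows
```

Flattening the cell $(i, j)$ into the single index $i \cdot l_c + j$ is what lets the embedding layer treat each grid cell as one token.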
Generator ($\mathcal{G}$). As illustrated in Fig. 6, our model’s generator comprises three key sub-modules, i.e., an encoder, an attention layer, and a decoder. The encoder consists of an embedding layer, a bidirectional gated recurrent unit (Bi-GRU) layer, and an output layer, while the decoder includes a GRU layer, a fully connected (FC) layer, and an argmax layer.
A grid-based representation $\mathcal{C} = \{\boldsymbol{c}_1, \boldsymbol{c}_2, ..., \boldsymbol{c}_{N_t}\}$ is first fed into the embedding layer to learn a dense vector representation $\{\boldsymbol{g}_1, \boldsymbol{g}_2, ..., \boldsymbol{g}_{N_t}\}$, instead of using a one-hot representation. Subsequently, the Bi-GRU layer
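extracts features from the entire trajectory and transforms the input vector sequence into a sequence of hidden states $\{\boldsymbol{h}_1, \boldsymbol{h}_2, ..., \boldsymbol{h}_{N_t}\}$, denoted as

$$\{\boldsymbol{h}_1, \boldsymbol{h}_2, ..., \boldsymbol{h}_{N_t}\} = f(\{\boldsymbol{g}_1, \boldsymbol{g}_2, ..., \boldsymbol{g}_{N_t}\}), \tag{11}$$

where $f()$ represents the function learned by the Bi-GRU layer, and $\{\boldsymbol{h}_1, \boldsymbol{h}_2, ..., \boldsymbol{h}_{N_t}\}$ are the encoder outputs.
To capture long-distance interdependencies and emphasize important features, we incorporate an attention layer into the decoder. Specifically, we first feed a Start Of Sequence (SOS) token to the GRU layer to initiate the decoding process. At the $i$-th step of decoding, the current decoder hidden state is $\boldsymbol{s}_{i-1}$, i.e., the output from the $(i-1)$-th GRU. We then apply an attention layer to search for the most relevant representation vectors by computing the similarity between $\boldsymbol{s}_{i-1}$ and all encoder hidden states $\{\boldsymbol{h}_1, \boldsymbol{h}_2, ..., \boldsymbol{h}_{N_t}\}$ to generate a context vector $\boldsymbol{cx}_i$ using the following formulas

$$\begin{aligned} corr_j &= F(\boldsymbol{s}_{i-1}, \boldsymbol{h}_j), && \forall j,\ 1 \le j \le N_t, \\ q_j &= \exp(corr_j) \Big/ \sum_{j'=1}^{N_t} \exp(corr_{j'}), && \forall j,\ 1 \le j \le N_t, \\ \boldsymbol{cx}_i &= \sum_{j=1}^{N_t} q_j \cdot \boldsymbol{h}_j, \end{aligned} \tag{12}$$

where $F()$ denotes the attention function, $corr_j$ represents the score measuring the correlation between the current hidden state $\boldsymbol{s}_{i-1}$ and an encoding hidden state $\boldsymbol{h}_j$, and $q_j$ measures the importance of $corr_j$. The context vector $\boldsymbol{cx}_i$ and the output $\boldsymbol{y}_{i-1}$ from the $(i-1)$-th step are concatenated and fed into the GRU with the current hidden state $\boldsymbol{s}_{i-1}$ to generate the new hidden state $\boldsymbol{s}_i$ and output $\boldsymbol{y}_i$,

$$\begin{aligned} \boldsymbol{s}_i &= \mathrm{GRU}(\boldsymbol{s}_{i-1}, \boldsymbol{y}_{i-1} \oplus \boldsymbol{cx}_i), \\ \boldsymbol{y}_i &= \mathrm{FC}(\boldsymbol{s}_i \oplus \boldsymbol{cx}_i), \end{aligned} \tag{13}$$

where $\oplus$ denotes concatenation and the FC layer generates a $(|\mathcal{G}| + 1)$-dimensional vector $\boldsymbol{y}_i$, with $|\mathcal{G}| = l_r \times l_c$. We then feed $\boldsymbol{y}_i$ into the argmax layer to transform its dimension from $|\mathcal{G}| + 1$ to 2, with the first dimension representing the spatial feature and the second dimension representing the temporal feature, corresponding to the input $\boldsymbol{c}_i$. Finally, upon generating an End Of Sequence (EOS) token, the decoder completes the conversion process, producing a grid-based trajectory $Y \in \mathbb{R}^{N_t \times 2}$.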
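To make the attention step of Eq. (12) concrete, the following plain-Python sketch computes the weights $q_j$ and the context vector for one decoding step. The paper does not specify the attention function $F$; a dot product is assumed here purely for illustration, which requires the decoder state and encoder states to share a dimension. The function name is hypothetical.

```python
import math

def attention_context(s_prev, encoder_states):
    """Return (weights q, context vector cx) for one decoding step.

    Assumption: F(s, h) is the dot product of s and h.
    """
    # Correlation scores corr_j between the decoder state and each h_j.
    corr = [sum(a * b for a, b in zip(s_prev, h)) for h in encoder_states]
    # Softmax over the scores; subtracting the max avoids overflow.
    m = max(corr)
    exps = [math.exp(c - m) for c in corr]
    z = sum(exps)
    q = [e / z for e in exps]
    # Context vector: weighted sum of the encoder hidden states.
    dim = len(encoder_states[0])
    cx = [sum(q[j] * encoder_states[j][k] for j in range(len(q)))
          for k in range(dim)]
    return q, cx
```

In a PyTorch implementation the same computation would typically be batched with tensor operations; the loop form is used here only to mirror the three lines of Eq. (12).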
Discriminator ($\mathcal{D}$). The discriminator is designed to determine whether the input trajectory is a ground-truth trajectory or a generated trajectory. For stabilization and acceleration, the design of the discriminator is relatively simple, consisting of an embedding layer, a GRU layer, an FC layer, and a sigmoid function. Specifically, the embedding layer transforms either the generator output $Y$ or the grid-represented ground-truth trajectory $\mathcal{C}_{real}$ into a dense vector. The GRU layer sequentially processes trajectory points and generates a sequence of outputs. The FC layer then processes the entire output of the input trajectory, and the result is passed through a sigmoid activation function $\sigma()$ to obtain the discriminator’s output $\mathcal{D}_{out}$, which represents the probability of classifying the input as a true cellular association trace.
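Model Training. When training the AEGAN model, the traditional binary cross-entropy loss function [31] is insufficient to constrain the optimization problem addressed in our work. Therefore, we carefully design a new loss function, consisting of two parts, i.e., $Loss(\mathcal{G})$ and $Loss(\mathcal{D})$, to jointly optimize our proposed model. The objective function can be characterized as follows

$$\begin{aligned} \min_{\{\psi_{\mathcal{D}}\}} Loss(\mathcal{D}) &= \min_{\{\psi_{\mathcal{D}}\}} \mathcal{L}_{\mathcal{D}}, \\ \min_{\{\psi_{\mathcal{G}}\}} Loss(\mathcal{G}) &= \min_{\{\psi_{\mathcal{G}}\}} \mathcal{L}_{\mathcal{G}} + \mathcal{L}_{spatio} + \mathcal{L}_{temporal} + \mathcal{L}_{seq}, \end{aligned} \tag{14}$$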
with

$$\begin{aligned} \mathcal{L}_{\mathcal{D}} &= \mathbb{E}_{\mathcal{C}_{real} \sim P_{data}}[\log(1 - \mathcal{D}(\mathcal{C}_{real}))] + \mathbb{E}_{\mathcal{C} \sim P_G}[\log(0 - \mathcal{D}(\mathcal{G}(\mathcal{C})))], \\ \mathcal{L}_{\mathcal{G}} &= \mathbb{E}_{\mathcal{C} \sim P_G}[\log(1 - \mathcal{D}(\mathcal{G}(\mathcal{C})))], \\ \mathcal{L}_{spatio} &= \mathbb{E}\left[\sum_{i=1}^{|\mathcal{G}|} (m_i - \hat{m}_i)^2\right], \\ \mathcal{L}_{temporal} &= \mathbb{E}\left[\frac{1}{N_t} \sum_{i=1}^{N_t} \sum_{j=|\mathcal{G}|+1}^{|\mathcal{G}|+1} (Tr_{ij} - \hat{Tr}_{ij})^2\right], \\ \mathcal{L}_{seq} &= \mathbb{E}\left[-\sum_{i=1}^{N_t} \sum_{j=1}^{|\mathcal{G}|+1} Tr_{ij} \log\left(\frac{e^{\hat{Tr}_{ij}}}{\sum_{q=1}^{|\mathcal{G}|+1} e^{\hat{Tr}_{iq}}}\right)\right], \end{aligned} \tag{15}$$

where $\psi_{\mathcal{G}}$ and $\psi_{\mathcal{D}}$ are the parameters of the generator $\mathcal{G}$ and discriminator $\mathcal{D}$, respectively. $\mathcal{L}_{\mathcal{G}}$ and $\mathcal{L}_{\mathcal{D}}$ represent the adversarial loss for the generator and discriminator, respectively. $\mathbb{E}[\cdot]$ denotes the expected value of the distribution function, $P_{data}$ represents the distribution of the real samples, and $P_G$ represents the distribution of the generated samples. $\mathcal{L}_{spatio}$ represents the spatial similarity loss, $\mathcal{L}_{temporal}$ represents the temporal similarity loss, and $\mathcal{L}_{seq}$ represents the overall sequence similarity loss.
Specifically, we first transform the output of generator $\mathcal{G}$ into an $N_t \times (|\mathcal{G}| + 1)$ matrix, where $N_t$ denotes the maximum length of the trajectories. The first $|\mathcal{G}|$ dimensions form a one-hot vector representing the spatial features of the trajectory, while the $(|\mathcal{G}| + 1)$-th dimension represents the temporal feature of the trajectory. $\mathcal{L}_{spatio}$ calculates the sum of squared errors between the grid spatial matrix $m$ of the true trajectory and $\hat{m}$ of the corresponding generated trajectory, where $m_i = 1$ if this grid belongs to this trajectory and $m_i = 0$ otherwise. $\mathcal{L}_{temporal}$ calculates the mean squared error of the temporal features between the real trajectory $Tr$ and the corresponding generated trajectory $\hat{Tr}$. $\mathcal{L}_{seq}$ calculates the sequence similarity of the real trajectory $Tr$ and the corresponding generated trajectory $\hat{Tr}$. During model training, the adversarial generator and discriminator are jointly optimized using alternating stochastic gradient descent, with parameter updates performed by the Adam optimizer. Our well-designed loss function enables the generator to efficiently synthesize high-quality data while enhancing the discriminator’s ability to differentiate between real and fake samples.
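The three similarity terms can be illustrated for a single trajectory pair with the plain-Python sketch below. It is a simplified reading, not the training code: the expectation over samples is omitted, the function names are hypothetical, and the sequence term is written as a softmax cross-entropy over whatever row vectors are passed in. An actual PyTorch implementation would use differentiable tensor operations.

```python
import math

def spatial_loss(m_true, m_gen):
    """Sum of squared errors between grid occupancy vectors (0/1 per cell)."""
    return sum((a - b) ** 2 for a, b in zip(m_true, m_gen))

def temporal_loss(t_true, t_gen):
    """Mean squared error over the temporal column of the two trajectories."""
    return sum((a - b) ** 2 for a, b in zip(t_true, t_gen)) / len(t_true)

def sequence_loss(tr_true, tr_gen):
    """Cross-entropy between true rows and the softmax of generated rows."""
    loss = 0.0
    for row_true, row_gen in zip(tr_true, tr_gen):
        # log of the softmax denominator, computed stably.
        m = max(row_gen)
        log_z = m + math.log(sum(math.exp(v - m) for v in row_gen))
        loss -= sum(t * (g - log_z) for t, g in zip(row_true, row_gen))
    return loss
```

For example, a generated row with equal scores for two cells gives a sequence loss of $\ln 2$ against a one-hot target, the usual cross-entropy value for a uniform two-way guess.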
4.3.2 Stage Two: Synthesizing Cellular Association Traces. After the joint optimization of the adversarial generator
and discriminator in the first stage, we obtain fine-grained grid-based cellular association traces based on the
coarse-grained traces. In the second stage, our objective is to select the appropriate BS within each grid and map
the arrival time for each selected BS to generate the fine-grained cellular association traces. We experimentally
determine the most suitable grid size in the first stage, as partition size affects the number of BSs inside each grid
and consequently influences the quality of the synthesized traces. The grid size is chosen such that each grid
contains only one or two BSs, ensuring that the BSs within each grid can effectively characterize the grid. To
ensure diversity in the synthesized cellular association traces, we adopt a spatial selection and time mapping
scheme. A random BS is chosen within each grid, and its arrival time is calculated by adding the departure time
to the cumulative travel time to the grid. This approach generates multiple fine-grained cellular association
traces from each grid-based trace, each with similar features but different BS compositions, enhancing the overall
diversity of the synthetic trajectories. Moreover, the second stage does not involve model training, allowing for
rapid and efficient synthesis of fine-grained cellular traces.
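The spatial selection and time mapping scheme can be sketched as follows. The function name, the dictionary layout of BSs per cell, and the choice to skip cells that contain no BS are assumptions of this example; the grid trace is taken to hold the cell index and the time difference to the previous point.

```python
import random

def synthesize_fine_trace(grid_trace, bs_in_cell, depart_ts, rng=None):
    """Turn a grid-based trace into a fine-grained cellular association trace.

    grid_trace: [(cell_index, delta_t), ...]
    bs_in_cell: {cell_index: [bs_id, ...]}
    Returns [(bs_id, arrival_ts), ...]; cells with no BS are skipped
    (assumed behaviour).
    """
    rng = rng or random.Random()
    out, elapsed = [], 0
    for cell, delta in grid_trace:
        elapsed += delta  # cumulative travel time to this cell
        options = bs_in_cell.get(cell)
        if options:
            # Random choice among the cell's BSs provides trace diversity.
            out.append((rng.choice(options), depart_ts + elapsed))
    return out
```

Calling the function repeatedly on the same grid trace yields traces with identical timing but possibly different BS IDs, which is the source of diversity described above.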
Table 1. Category description of employed baselines.
First Category | Second Category | Baseline
Spatial Trajectory Generation | Data Augmentation Model | RS
Spatial Trajectory Generation | Data Synthesis Model (Model-Based) | Bayes
Temporal Trajectory Generation | Non-Machine Learning-Based Model | TEMP
Temporal Trajectory Generation | Machine Learning-Based Model | DOT
Spatio-Temporal Trajectory Generation | Data Synthesis Model (Data-Driven) | trajGANs
Spatio-Temporal Trajectory Generation | Data Synthesis Model (Data-Driven) | AEGAN
5 PERFORMANCE EVALUATION
5.1 Evaluation Methodology
Experiment Setup. In this work, we collect substantial road segment-based trajectories and corresponding
cellular association traces for performance evaluation, encompassing a total of 4,000 pairs of routes and cellular
traces covering a distance of 6,204 km. Of the collected cellular traces, 80% are used for model training while
the remaining 20% are reserved for testing. We implement SynthCAT using Python and PyTorch, and run it on a
server equipped with 4 CPUs, each containing 192 Intel(R) Xeon(R) Platinum 8260 processor running at 2.40 GHz,
along with a graphics processing unit card (NVIDIA Titan X) to accelerate the training.
Baselines. To comprehensively assess the performance of SynthCAT, we employ six reasonable baselines across three broad categories and five subcategories, as described in Table 1. Since no existing approach is specifically designed for synthesizing cellular association traces, we adapt these baselines based on their underlying principles to align with the CATS problem.
• Random Substitution (RS): A low-complexity yet effective method, RS involves replacing a few random BSs in a cellular association trace with nearby BSs.
• Bayesian Decision Model (Bayes) [32]: A model-based approach that relies on a predefined Bayesian decision model to synthesize traces.
• Temporally Weighted Neighbors (TEMP) [33]: A representative historical trajectory-based method that estimates travel time by averaging the travel time of historical trajectories with similar origins, destinations, and departure times.
• Diffusion-based Origin-destination Travel Time Estimation (DOT) [34]: A state-of-the-art framework combining a conditioned pixelated trajectory denoiser and a masked vision transformer model to accurately and explainably infer and estimate travel times from historical trajectories.
• trajGANs [35]: A data-driven approach widely adopted for generating trajectories using GANs.
• AEGAN [36]: A data-driven approach that combines autoencoder architecture, Bi-GRU, and attention mechanisms with GAN to synthesize traces.
Metrics. To comprehensively assess performance, we employ eight evaluation metrics across two categories: data fidelity and utility. The first six metrics evaluate data fidelity, with the first three for the spatial dimension and the last three for the temporal dimension. The remaining two metrics assess data utility.
• Jensen-Shannon Divergence (JSD) [37]: An aggregated-level metric that measures the discrepancy between the distributions of synthesized traces and real traces. It is calculated by $JS(p \| q) = \frac{1}{2} D_{KL}\left(p \,\Big\|\, \frac{p+q}{2}\right) + \frac{1}{2} D_{KL}\left(q \,\Big\|\, \frac{p+q}{2}\right)$, where $p$ and $q$ are two distributions, and $D_{KL}(u \| w) = E[\log(u_i) - \log(w_i)] = \sum_i u_i \log \frac{u_i}{w_i}$ is the relative entropy of $u$ with respect to $w$.
• Hausdorff Distance [38]: An individual-level metric that measures the spatial dissimilarity between synthetic cellular association traces and real ones by calculating the distance between two point sets.
[Figure 7: three surface plots over sample range (200 to 1000) and sample ratio (0.2 to 1). (a) RS vs. SynthCAT in Hausdorff distance. (b) RS vs. SynthCAT in Jaccard index. (c) RS vs. SynthCAT in JSD (scale 10^-4).]
Fig. 7. Performance comparison with data augmentation model.
[Figure 8: three CDF plots comparing Bayes, trajGANs, AEGAN, and SynthCAT. (a) CDF of Hausdorff distance. (b) CDF of Jaccard index. (c) CDF of JSD (scale 10^-4).]
Fig. 8. Performance comparison with data synthesis model.
• Jaccard Index [39]: Another individual-level metric quantifying spatial similarity between real and synthetic traces. It is calculated by $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, where $A$ and $B$ are any two sets to be compared.
• Root Mean Squared Error (RMSE): Refers to the standard deviation of the differences between the raw data values and generated ones, calculated by $\sqrt{\frac{1}{N} \sum_{i=1}^{N} (t_i - \hat{t}_i)^2}$, where $N$ is the total number of traces, and $t_i$ and $\hat{t}_i$ denote the raw and generated travel times.
• Mean Absolute Error (MAE): Refers to the average absolute error, calculated by $\frac{1}{N} \sum_{i=1}^{N} |t_i - \hat{t}_i|$.
• Mean Absolute Percentage Error (MAPE): Refers to the mean absolute percentage error, calculated by $\frac{1}{N} \sum_{i=1}^{N} \left|\frac{t_i - \hat{t}_i}{t_i}\right|$.
• Precision: Refers to the ratio of the total length of the correctly matched route to the total length of the matched route. Mathematically, $Precision = \frac{|Y_m \cap Y_g|}{|Y_m|}$, where $Y_m$ and $Y_g$ denote the map-matched route and ground truth route, respectively.
• Recall: Represents the ratio of the total length of the correctly matched route to the total length of the route in ground truth, i.e., $Recall = \frac{|Y_m \cap Y_g|}{|Y_g|}$.
5.2 Data Fidelity Evaluation
5.2.1 Comparison of Spatial Dimension. We evaluate data fidelity in the spatial dimension in two aspects as follows.
Comparison with Data Augmentation Model. We first conduct a comparative analysis of the data fidelity
between our SynthCAT and the data augmentation model RS. Considering the sensitivity of RS to sampling
rates and ranges in the random replacement strategy, we test it across 10 sampling rates (10% to 100%) and 10
[Figure 9: two bar charts. (a) Comparison with non-machine learning-based method: RMSE (s), MAE (s), and MAPE (s) of TEMP for search ranges of 100 m, 200 m, 300 m, 500 m, 1000 m, and 2000 m, with annotated values 6.70, 2.23, and 0.88. (b) Comparison with machine learning-based method: RMSE (min), MAE (min), and MAPE (min) for trajGANs, AEGAN, DOT, and SynthCAT.]
Fig. 9. Performance comparison of temporal dimension.
sampling ranges (100 to 1000 meters). The results, shown in Fig. 7(a)-7(c), evaluate the performance of both algorithms in terms of Hausdorff distance, Jaccard index, and JSD under varying sampling conditions. Smaller Hausdorff distance and JSD, along with a larger Jaccard index, indicate superior data fidelity. Key findings from Fig. 7 reveal that RS exhibits lower data fidelity with larger sampling rates and ranges. In comparison, SynthCAT consistently outperforms RS at both aggregate and individual levels, prevailing in 99% of cases. For instance, concerning the Hausdorff distance metric, SynthCAT surpasses RS unless the sampling range is under 200 and the sampling rate is below 0.2. The advantages become more prominent with an increase in sampling rate and sampling range. Compared to the data augmentation algorithm, our SynthCAT achieves excellent performance, nearly comparable to its best.
Comparison with Data Synthesis Model. We proceed to compare the data fidelity of our SynthCAT with other data synthesis models. Figure 8(a)-8(c) display the CDFs of Hausdorff distance, Jaccard index, and JSD achieved by different strategies. Notably, SynthCAT consistently outperforms other baselines across all metrics. For example, in terms of the Hausdorff distance metric, SynthCAT achieves scores below 1.0 in roughly 80% of cases, whereas other baselines all have scores exceeding 1.0. This substantial performance improvement further underscores the efficacy of our proposed SynthCAT. Additionally, although the limited data scale causes the data-driven schemes to perform worse than the model-based method, our proposed fusion approach still outperforms both types of approaches, indicating the advantage of our SynthCAT, i.e., its ability to achieve superior results with less training data.
5.2.2 Comparison of Temporal Dimension. We then evaluate data fidelity in the temporal dimension as follows.
Comparison with Non-Machine Learning-Based Model. We first compare the performance of data fidelity in the temporal dimension between our SynthCAT and the non-machine learning-based model TEMP. TEMP predicts travel time based solely on origin, destination, and departure time without generating spatial trajectories. Consequently, we feed it the spatial cellular association trace synthesized by our SynthCAT to generate temporal dimension information for each trace point. This allows us to perform a point-by-point comparison of the generated temporal information. Given TEMP’s sensitivity to the search range of neighbors, we test it across 6 search ranges. The comparison results are presented in Fig. 9(a), where the red line denotes the scores of our SynthCAT. From Fig. 9(a), it is evident that SynthCAT consistently outperforms TEMP across all metrics by a significant margin. For example, in terms of MAPE, SynthCAT achieves a score of 0.88, whereas the best score achieved by TEMP is 1.41 when the search range is 100 meters, reflecting a 37.6% improvement. Additionally, TEMP’s performance remains relatively unchanged when the search range varies from 100 meters to 2000 meters. This may be attributed to the high sparsity of the cellular association trace and the complex, dynamic nature of BS handovers, which hinder stable parameter performance. Therefore, it is evident that generating the temporal information of cellular association traces cannot be adequately achieved based merely on the average travel time of historical trajectories.
[Figure 10: five panels, each overlaying the ground truth with the traces synthesized by one method. (a) Real vs. synthesized traces (SynthCAT). (b) Real vs. synthesized traces (RS). (c) Real vs. synthesized traces (Bayes). (d) Real vs. synthesized traces (TrajGANs). (e) Real vs. synthesized traces (AEGAN).]
Fig. 10. Visualization of cellular association trace synthesis.
Comparison with Machine Learning-Based Model. We then evaluate the data fidelity of our SynthCAT against several machine learning-based models. Due to the inconsistencies in length and the inability of some comparison baselines to generate spatial trajectories, a point-by-point comparison of time information is not feasible. Therefore, we assess the performance of each algorithm by comparing the overall travel time of the entire trajectory. The results, as illustrated in Fig. 9(b), reveal two main findings. First, our SynthCAT consistently outperforms the other baselines, even surpassing the current state-of-the-art travel time estimation model, i.e., DOT. Second, trajGANs and AEGAN exhibit poor performance due to limited training data. In contrast, our SynthCAT demonstrates significantly better performance owing to the fusion of model-based and data-driven approaches, thereby proving the superiority of our proposed fusion framework.
5.2.3 Visualization. To illustrate the effectiveness of our proposed SynthCAT, we present visualizations of synthesized cellular association traces generated by various methods. As shown in Fig. 10, traces synthesized by SynthCAT and RS closely resemble the ground truth, showcasing a high degree of realism. In contrast, traces generated by other methods exhibit lower quality, featuring noticeable deviations from the ground truths. This observation underscores the superior data fidelity achieved by SynthCAT.
5.2.4 Comparison of Model Generalization. Data synthesis models provide a significant advantage in generating realistic cellular association traces for road-segment trajectories unseen during training. To evaluate their performance in new areas, we re-divided the data, using an urban area as the training set and a new urban area and a suburban area as the test sets. This allowed us to assess the models’ generalization to both similar and dissimilar areas. The results, presented in Figures 11 and 12, reveal distinct performance patterns. Our SynthCAT model stands out in both spatial and temporal dimensions across similar and dissimilar areas compared to baseline models. Notably, Bayes, trajGANs, and AEGAN demonstrate consistent performance with minimal differences between similar and dissimilar areas, showing a slight performance boost in suburban environments due to their simpler context. In contrast, SynthCAT surpasses other baselines in similar areas, while its performance aligns with other algorithms in dissimilar areas. Two key conclusions can be drawn: SynthCAT effectively generalizes
[Figure 11: bar charts of Hausdorff distance, Jaccard index, and JSD for Bayes, trajGANs, AEGAN, and SynthCAT. (a) Spatial performance of generalization to new area (suburban). (b) Spatial performance of generalization to new area (urban).]
Fig. 11. Performance comparison of model generalization in spatial dimension.
[Figure 12: bar charts of RMSE (min), MAE (min), and MAPE (min) for trajGANs, AEGAN, DOT, and SynthCAT. (a) Temporal performance of generalization to new area (suburban). (b) Temporal performance of generalization to new area (urban).]
Fig. 12. Performance comparison of model generalization in temporal dimension.
[Figure 13: precision and recall of the map-matching task. (a) Performance of map matching vs. synthesized data scale (magnification times from 0 to 1000). (b) Map matching performances with 100× synthesized data for RS, Bayes, TrajGANs, AEGAN, and SynthCAT.]
Fig. 13. Performance comparison of data utility in map matching task.
to unknown areas, particularly excelling in similar areas, and the contextual information of the unknown area
significantly influences model performance. In the temporal dimension, the performance of each baseline mirrors
that in the spatial dimension, while SynthCAT exhibits superior performance in dissimilar suburban areas. This
improvement is attributed to lower congestion in suburban areas, enabling the model-based component to
perform better with less training data. Overall, the generalization performance of SynthCAT surpasses that of
other baseline models in both spatial and temporal dimensions.
Table 2. Experimental results of ablation experiment.
Model | Hausdorff | Jaccard | JSD | RMSE (min) | MAE (min) | MAPE (min)
-data-driven method | 1.5616 | 0.1250 | 2.17e-4 | 2.0857 | 1.1988 | 0.7640
-model-based method | 1.7576 | 0.0112 | 4.62e-4 | 2.3690 | 2.0670 | 0.9935
-L_spatio | 0.7960 | 0.7439 | 6.57e-5 | 1.1404 | 0.9800 | 1.0297
-L_temporal | 0.7830 | 0.7494 | 6.46e-5 | 1.8354 | 1.1733 | 0.7308
-L_seq | 1.3593 | 0.2924 | 5.34e-4 | 1.1166 | 0.9602 | 0.9750
-L_spatio & L_temporal & L_seq | 1.5271 | 0.0199 | 2.01e-4 | 1.9748 | 1.1769 | 0.7532
SynthCAT | 0.7455 | 0.7687 | 6.06e-5 | 1.0743 | 0.9205 | 0.6841
5.3 Data Utility Evaluation
In addition to the data fidelity evaluation, we explore the practical value of synthesized cellular association
traces in a map-matching application, where the synthesized data supplements the scarce training data.
For map matching, we employ an Encoder-Decoder architecture with Bi-GRU and an attention mechanism.
First, we evaluate how the scale of synthesized data affects map-matching accuracy. Real-world data is
divided into 80% for training and 20% for inference. We then vary the amount of synthesized data added to the
training set from 0 to 1000 times the real data; the results are shown in Fig. 13(a). The matching precision
rises with larger synthesized data scales, reaching a 28% improvement when the synthesized data is 100 times
the real data. However, there is a performance limit as the data scale keeps increasing, because the model
structure also influences outcomes. These findings highlight the practicality of our synthesized data in
real-world applications. Second, we compare the data utility of different models by synthesizing data at 100 times
the real data scale for map-matching model training; the results are shown in Fig. 13(b). Testing the trained
models shows that SynthCAT outperforms the other methods. Notably, Bayes, trajGANs, and AEGAN perform
less effectively, while RS is only slightly inferior to SynthCAT. RS benefits from its parameters, which use sampling
rates and ranges that closely mimic real data. However, its high privacy risk makes it less suitable for real-world
deployment by network operators.
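The scale experiment described above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `split_real` and `build_training_set`, the list-based trace representation, and the fixed seed are assumptions, and the magnification is assumed to be relative to the real training split. Only the 80/20 split and the magnification factors (read from the axis of Fig. 13(a)) come from the text.

```python
import random

# Magnification factors read from the x-axis of Fig. 13(a).
MAGNIFICATIONS = [0, 5, 10, 30, 50, 70, 90, 100, 500, 1000]


def split_real(real_traces, train_ratio=0.8, seed=0):
    """Shuffle the real traces and split them into training and test parts."""
    rng = random.Random(seed)
    shuffled = list(real_traces)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def build_training_set(real_train, synth_pool, magnification):
    """Return the real training traces plus `magnification` times as many
    synthesized traces (assumption: scale is relative to the real training split)."""
    needed = magnification * len(real_train)
    if needed > len(synth_pool):
        raise ValueError("synthetic pool too small for requested magnification")
    return list(real_train) + list(synth_pool[:needed])
```

A map-matching model would then be trained once per entry of `MAGNIFICATIONS` on the output of `build_training_set` and evaluated on the held-out real test split, so that precision and recall are always measured on real data only.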
5.4 Ablation Experiment
Eectiveness of Devised Models. In SynthCAT, we synthesize cellular association traces by integrating model-
based and data-driven approaches. To explore their respective contributions, we remove the model-based and
data-driven components separately to examine the nal performance. Table 2shows the metric results of removing
corresponding components. We can observe that without data-driven or model-based method, the performance
of SynthCAT can degrade signicantly, demonstrating that a single-dimensional approach is not sucient for the
CATS problem. In addition, we can observe that the results of SynthCAT without model-based method are worse
than that of SynthCAT without data-driven method. It happens due to the limited training data scale, which
largely demonstrates the importance and urgency of solving the CATS problem.
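For readers who want to reproduce the kind of numbers reported in Table 2, the six metrics can be computed along the following lines. The sketch assumes textbook definitions (set-based Jaccard over base station IDs, symmetric Hausdorff distance over 2-D points, base-2 Jensen-Shannon divergence over discrete distributions, and MAPE as a fraction); the paper's exact definitions, units, and the way the JSD distributions are built may differ, and all function names are illustrative.

```python
import math


def rmse(pred, true):
    """Root mean squared error between predicted and true arrival times."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


def mae(pred, true):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def mape(pred, true):
    """Mean absolute percentage error as a fraction; true values must be non-zero."""
    return sum(abs(p - t) / abs(t) for p, t in zip(pred, true)) / len(true)


def jaccard(trace_a, trace_b):
    """Jaccard index of the sets of base station IDs visited by two traces."""
    a, b = set(trace_a), set(trace_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two non-empty point sequences."""
    def directed(src, dst):
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(points_a, points_b), directed(points_b, points_a))


def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(x, y):
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Under these definitions, lower is better for Hausdorff, JSD, RMSE, MAE, and MAPE, while a higher Jaccard index indicates closer agreement with the ground truth.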
Eectiveness of
L𝑠𝑝𝑎𝑡 𝑖𝑜
,
L𝑡𝑒𝑚𝑝𝑜𝑟𝑎𝑙
, and
L𝑠𝑒𝑞
.We then validate the eectiveness of our devised loss function,
including individual spatial similarity loss
L𝑠𝑝𝑎𝑡 𝑖𝑜
, individual temporal similarity loss
L𝑡𝑒𝑚𝑝𝑜𝑟𝑎𝑙
, individual
sequence similarity loss
L𝑠𝑒𝑞
, and the combination of
L𝑠𝑝𝑎𝑡 𝑖𝑜
,
L𝑡𝑒𝑚𝑝𝑜𝑟𝑎𝑙
, and
L𝑠𝑒𝑞
. From the results shown in
Table 2, we can make the following two major statements. First, the contributions of
L𝑠𝑝𝑎𝑡𝑖𝑜
,
L𝑡𝑒𝑚𝑝𝑜𝑟𝑎𝑙
, and
L𝑠𝑒𝑞
are all integral to the performance improvement. Second,
L𝑠𝑒𝑞
plays the dominant role in achieving the
superior performance in SynthCAT.
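The relative effect of each loss term can be checked with a short calculation over the Table 2 values. The snippet below is only a worked example: the dictionary layout and the function name `relative_change` are illustrative, and the numbers are copied from Table 2.

```python
# Metric values copied from Table 2. Lower is better except for Jaccard.
FULL = {"Hausdorff": 0.7455, "Jaccard": 0.7687, "JSD": 6.06e-5,
        "RMSE": 1.0743, "MAE": 0.9205, "MAPE": 0.6841}

ABLATED = {
    "L_spatio": {"Hausdorff": 0.7960, "Jaccard": 0.7439, "JSD": 6.57e-5,
                 "RMSE": 1.1404, "MAE": 0.9800, "MAPE": 1.0297},
    "L_temporal": {"Hausdorff": 0.7830, "Jaccard": 0.7494, "JSD": 6.46e-5,
                   "RMSE": 1.8354, "MAE": 1.1733, "MAPE": 0.7308},
    "L_seq": {"Hausdorff": 1.3593, "Jaccard": 0.2924, "JSD": 5.34e-4,
              "RMSE": 1.1166, "MAE": 0.9602, "MAPE": 0.9750},
}


def relative_change(ablated, full):
    """Relative change of each metric when a loss term is removed."""
    return {name: (ablated[name] - full[name]) / full[name] for name in full}


for loss_name, metrics in ABLATED.items():
    changes = relative_change(metrics, FULL)
    summary = ", ".join(f"{m}: {c:+.1%}" for m, c in changes.items())
    print(f"without {loss_name}: {summary}")
```

With these values, removing $\mathcal{L}_{seq}$ raises the Hausdorff distance by about 82% and lowers the Jaccard index by about 62%, whereas removing $\mathcal{L}_{temporal}$ mainly shows up in the temporal metrics (RMSE rises by about 71%).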
6 DISCUSSIONS
6.1 Lessons Learned
Integration of Model-Based and Data-Driven Approaches. The model-based design paradigm excels in
explicitly expressing data patterns through mathematical or statistical models with minimal training data
requirements, albeit constrained by limitations in accuracy and flexibility. On the other hand, the data-driven
design paradigm offers the advantages of adapting to complex and dynamic environments and mining hidden features
with good accuracy and flexibility, but often requires substantial training data. Integrating both approaches
harnesses their respective strengths and overcomes their individual shortcomings, adapting well to limited training
data while maintaining high accuracy and flexibility. This integration is the distinctive advantage of our
scheme compared to others, allowing us to achieve optimal results even with constrained self-collected data.
Trade-o between Data Fidelity and Utility. In general, a positive correlation exists between data delity
and utility, implying that higher data delity enhances downstream application utility. However, our experiments
reveal a nuanced perspective within our scenario. The inherent uncertainty of ST cellular association traces
challenges the conventional notion. Excessive pursuit of data delity does not necessarily lead to a signicant
improvement in downstream application utility. Instead, it may restrict the volume of synthetic data and further
diminish the utility of downstream applications. Thus, achieving an optimal trade-o between delity and utility
becomes imperative, considering the diverse characteristics of downstream applications.
6.2 Limitations and Future Works
Model Generalization. Generally, the generalization of our proposed SynthCAT should be evaluated across
various aspects, including different models of mobile devices, transportation modes, road conditions, and
geographic areas. However, in Section 5, we focus solely on assessing the model's generalization ability to new
and distinct areas due to the following constraints. First, the data currently utilized in our study are collected
manually, a process that is both time-consuming and labor-intensive. Conducting a comprehensive evaluation of
the model's generalization requires a large volume of data collected under diverse conditions and environments,
which is not feasible within a short time frame. At present, the collected data are insufficient for a thorough
evaluation. Second, we regard generalization to new areas as more critical than other aspects. Variations
in equipment, transportation modes, and road conditions within the same area primarily affect the temporal
dimension of the model's generation, with minimal impact on the spatial dimension. In contrast, generalization
to new areas influences data generation in both temporal and spatial dimensions. For these reasons,
our current evaluation focuses solely on the model's generalization to new areas. In future work, we plan to
collaborate with network operators to fully evaluate our model using the extensive volume of privacy-free
cellular association traces they provide. This collaboration will also facilitate further optimization of our model
to enhance its generalization and real-world deployment performance.
Context Features. We recognize the importance of incorporating urban data, such as road networks, land use, and
socio-demographic information, to enhance the robustness of cellular association trace synthesis. The comparison
of model generalization in Section 5 indirectly highlights this point. However, directly introducing external
context information to improve model generalization can make the model overly complex. The diversity and
relevance of input data impose strict requirements on data integrity, limiting the practical application of the model.
Hence, the cellular association trace synthesis scheme should strike a balance between scene detail and model
complexity. In future work, we aim to explore how urban context information can enhance model generalization
without adding unnecessary complexity.
6.3 Ethical Considerations and Data Release
We recognize the signicant privacy concerns associated with using cellular association traces for research. To
address these concerns, we have implemented stringent measures to protect the privacy of mobile users in both
self-collected data campaigns and potential future deployments with network operators. Firstly, in self-collected
data eorts, we have adopted several privacy-preserving approaches: i) All phone numbers of participating
volunteers have been encrypted to prevent any association with individual user identiers. ii) Our custom
data collection App only collects data when actively open, and all data collection activities comply with strict
security and privacy policies. iii) During data processing and model design, individual or personal data is carefully
excluded. Secondly, when our proposed scheme is deployed by a network operator in the future, it will only
utilize encrypted trajectory data, ensuring user privacy and security, with a condentiality agreement in place.
i) User IDs will be hashed into global identiers, making it impossible to trace back to individual users. ii) The
trajectory data used will consist of signaling data from BS switching rather than GPS data, which is inherently
sparse and contains signicant location noise, thus preventing user tracking and protecting user privacy. iii) Data
processing will occur on the network operator’s servers, with access restricted to authorized researchers.
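As an illustration of the ID hashing step mentioned above, the following sketch derives a fixed-length pseudonym from a raw user ID. The text only states that user IDs will be hashed; the choice of a keyed hash (HMAC-SHA256), the function name `pseudonymize_user_id`, and the operator-held `secret_key` are assumptions made for this example and do not describe the operators' actual procedure.

```python
import hashlib
import hmac


def pseudonymize_user_id(user_id: str, secret_key: bytes) -> str:
    """Map a raw user ID to a 64-character hexadecimal pseudonym.

    Assumption: a keyed hash is used so that the mapping cannot be rebuilt
    by hashing candidate IDs without access to the key.
    """
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()
```

The same ID always maps to the same pseudonym under a given key, so traces of one user remain linkable for synthesis, while different keys yield unrelated pseudonyms.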
Regarding data sharing, we acknowledge the potential benefits of releasing informative cellular association
traces to advance research in the mobile computing community. As part of our final release, we will provide
the implementation code for our proposed approach, including the Android-based data collection app, the
self-collected real data, the synthesized data, and SynthCAT.
7 RELATED WORK
7.1 Services Empowered by Cellular Mobility Data
In the literature, numerous research works have focused on service design based on cellular mobility
data, e.g., urban planning and intelligent transportation [40–42], location-based services [14, 43, 44], smart cities
[45–48], emergency response and disaster management [49–51], etc. For instance, based on large-scale cellular
data, Schläpfer et al. [40] presented a study of the universal patterns observed in human mobility.
Shen et al. [14] proposed a map matching algorithm for cellular mobility data. Fang et al. [44]
presented an approach called CellSense to enhance the recovery of human mobility patterns using cellular data.
In addition, the authors of [49, 50] explored the potential of cellular mobility data as a valuable source of
information for informing public health actions during the COVID-19 pandemic. However, although cellular
mobility data can empower extensive data-driven services from different aspects and advance scientific research
by providing modeling clues, the scarcity of real data remains the dominant obstacle to progress, motivating
our research in this paper.
7.2 Mobility Data Synthesis
In the realm of mobility data synthesis, existing studies can be broadly categorized into road trajectory synthesis
[15–17] and GPS trajectory synthesis [18–20]. The first category focuses on generating synthetic trajectories
representing object movements along specific road segments or links. For example, Jiang et al. [15] proposed a
two-stage GAN-based approach for generating continuous road trajectories. On the other hand, GPS trajectory
synthesis aims to create synthetic GPS trajectories that imitate real-world object or individual movement patterns.
For instance, Sun et al. [18] investigated the task of GPS trajectory synthesis while ensuring privacy preservation
and data utility. However, these existing works predominantly focus on synthesizing mobility data derived from
GPS or vehicle tracking systems, and often overlook the potential of synthesizing mobility data collected from
cellular networks. Cellular networks offer complementary advantages, such as wider coverage, higher user penetration,
fine-grained granularity, and cost-effectiveness, making them valuable sources of mobility data. Moreover, the
conventional techniques used in data synthesis tend to fall into pure model-based methods [21–24] or data-driven
methods [15, 25–28]. Pure model-based methods usually have limited performance and applicability in complex
real-world scenarios, while data-driven methods rely heavily on the scale and diversity of training data, which is
hard to achieve in many practical scenarios. As a result, both types of methods have inherent drawbacks that
hinder their widespread use and effectiveness in mobility data synthesis. Overall, there is a significant research
gap in synthesizing cellular association traces, and innovative algorithms are required to fill
this gap in the field of mobility data synthesis.
8 CONCLUSION
In this paper, we have proposed SynthCAT, an innovative scheme for synthesizing realistic and diverse cellular
association traces in accordance with road segment-based trajectories and corresponding departure times. By
integrating model-based and data-driven approaches, SynthCAT effectively maps road space to BS space, generates
initial time information, and synthesizes cellular association traces in a coarse-to-fine manner. Particularly, by
synergistically leveraging the strengths of both techniques, our approach not only captures the fundamental
essence of underlying association rules through explicit mathematical models but also extracts implicit patterns
and insights using data-driven methods. Extensive experiments demonstrate that SynthCAT outperforms the
state-of-the-art baselines in terms of data fidelity of synthesized cellular association traces and data utility in
supporting practical downstream applications.
ACKNOWLEDGMENTS
This work was supported in part by the National Key Research and Development Program of China under
Grant 2022YFF0604504, in part by the National Natural Science Foundation of China under Grants 62422216,
62320106006, 62372472, 62341201, and 62172445, in part by the 111 Project under Grant B18059, in part by the
Hunan Provincial Natural Science Foundation of China under Grant 2024JJ4068, and in part by the Natural
Sciences and Engineering Research Council of Canada under Grant RGPIN-2023-03759.
REFERENCES
[1] Zhou Qin, Fang Cao, Yu Yang, Shuai Wang, Yunhuai Liu, Chang Tan, and Desheng Zhang. CellPred: A Behavior-aware Scheme for Cellular Data Usage Prediction. Proc. ACM UbiComp, 4(1):1–24, 2020.
[2] Ericsson AB. Ericsson mobility report. November 2020.
[3] Xiaobo Zhou, Shuxin Ge, Tie Qiu, Keqiu Li, and Mohammed Atiquzzaman. Energy-Efficient Service Migration for Multi-User Heterogeneous Dense Cellular Networks. IEEE Trans. Mobile Comput., 22(2):890–905, 2021.
[4] Huaming Yang, Zhongzhou Xia, Jersy Shin, Jingyu Hua, Yunlong Mao, and Sheng Zhong. A Comprehensive Study of Trajectory Forgery and Detection in Location-Based Services. IEEE Trans. Mobile Comput., 2023.
[5] Sijing Duan, Dan Wang, Ju Ren, Feng Lyu, Ye Zhang, Huaqing Wu, and Xuemin Shen. Distributed Artificial Intelligence Empowered by End-Edge-Cloud Computing: A Survey. IEEE Commun. Surv. Tutor., 2022.
[6] Yiwei Song, Dongzhe Jiang, Yunhuai Liu, Zhou Qin, Chang Tan, and Desheng Zhang. HERMAS: A Human Mobility Embedding Framework with Large-scale Cellular Signaling Data. Proc. ACM UbiComp, 5(3):1–21, 2021.
[7] Yige Zhang, Weixiong Rao, Kun Zhang, Lei Chen, and Ding Chen. Outdoor Position Recovery from Heterogeneous Telco Cellular Data. IEEE Trans. Knowl. Data. Eng., 2022.
[8] Junjun Si, Jin Yang, Yang Xiang, Hanqiu Wang, Li Li, Rongqing Zhang, Bo Tu, and Xiangqun Chen. TrajBERT: BERT-Based Trajectory Recovery with Spatial-Temporal Refinement for Implicit Sparse Trajectories. IEEE Trans. Mobile Comput., 2023.
[9] Qianru Wang, Bin Guo, Lu Cheng, and Zhiwen Yu. sUrban: Stable Prediction for Unseen Urban Data from Location-based Sensors. Proc. ACM UbiComp, 7(3):1–20, 2023.
[10] Çağkan Yapar, Ron Levie, Gitta Kutyniok, and Giuseppe Caire. Real-Time Outdoor Localization Using Radio Maps: A Deep Learning Approach. IEEE Trans. Wireless Commun., 11(12):9703–9717, 2023.
[11] Leonardo L de Oliveira, Gabriel H Eisenkraemer, Everton A Carara, Joao B Martins, and Jose Monteiro. Mobile Localization Techniques for Wireless Sensor Networks: Survey and Recommendations. ACM Trans. Sens. Netw., 19(2):1–39, 2023.
[12] Joachim Gudmundsson, Martin P Seybold, and Sampson Wong. Map matching queries on realistic input graphs under the Fréchet distance. In Proc. ACM-SIAM SODA, pages 1464–1492, 2023.
[13] Huimin Ren, Sijie Ruan, Yanhua Li, Jie Bao, Chuishi Meng, Ruiyuan Li, and Yu Zheng. MTrajRec: Map-Constrained Trajectory Recovery via Seq2Seq Multi-task Learning. In Proc. ACM SIGKDD, pages 1410–1419, 2021.
[14] Zhihao Shen, Wan Du, Xi Zhao, and Jianhua Zou. DMM: Fast Map Matching for Cellular Data. In Proc. ACM MobiCom, pages 1–14, 2020.
[15] Wenjun Jiang, Wayne Xin Zhao, Jingyuan Wang, and Jiawei Jiang. Continuous Trajectory Generation Based on Two-Stage GAN. In Proc. AAAI, 2023.
[16] Qingyan Zhu, Yize Chen, Hao Wang, Zhenyu Zeng, and Hao Liu. A Knowledge-Enhanced Framework for Imitative Transportation Trajectory Generation. In Proc. IEEE ICDM, pages 823–832, 2022.
[17] Xinyue Sun, Qingqing Ye, Haibo Hu, Yuandong Wang, Kai Huang, Tianyu Wo, and Jie Xu. Synthesizing Realistic Trajectory Data With Differential Privacy. IEEE Trans. Intell. Transport. Syst., 2023.
[18] Xinyue Sun, Qingqing Ye, Haibo Hu, Jiawei Duan, Qiao Xue, Tianyu Wo, and Jie Xu. PUTS: Privacy-Preserving and Utility-Enhancing Framework for Trajectory Synthesization. IEEE Trans. Knowl. Data. Eng., 2023.
[19] Huandong Wang, Changzheng Gao, Yuchen Wu, Depeng Jin, Lina Yao, and Yong Li. PateGail: A Privacy-Preserving Mobility Trajectory Generator with Imitation Learning. In Proc. AAAI, volume 37, pages 14539–14547, 2023.
[20] Pingfu Chao, Wen Hua, Rui Mao, Jiajie Xu, and Xiaofang Zhou. A Survey and Quantitative Study on Map Inference Algorithms From GPS Trajectories. IEEE Trans. Knowl. Data. Eng., 34(1):15–28, 2020.
[21] Shan Jiang, Yingxiang Yang, Siddharth Gupta, Daniele Veneziano, Shounak Athavale, and Marta C González. The TimeGeo modeling framework for urban mobility without travel surveys. Proc. Natl. Acad. Sci., 113(37):E5370–E5378, 2016.
[22] Shaojie Qiao, Dayong Shen, Xiaoteng Wang, Nan Han, and William Zhu. A Self-Adaptive Parameter Selection Trajectory Prediction Approach via Hidden Markov Models. IEEE Trans. Intell. Transport. Syst., 16(1):284–296, 2014.
[23] Fengli Xu, Zhen Tu, Yong Li, Pengyu Zhang, Xiaoming Fu, and Depeng Jin. Trajectory Recovery From Ash: User Privacy Is NOT Preserved in Aggregated Mobility Data. In Proc. ACM WWW, pages 1241–1250, 2017.
[24] Yuntao Du, Yujia Hu, Zhikun Zhang, Ziquan Fang, Lu Chen, Baihua Zheng, and Yunjun Gao. LDPTrace: Locally Differentially Private Trajectory Synthesis. In Proc. VLDB Endow, 2023.
[25] Daniel Glake, Fabian Panse, Ula Lenfers, Thomas Clemen, and Norbert Ritter. Spatio-temporal Trajectory Learning using Simulation Systems. In Proc. ACM CIKM, pages 592–602, 2022.
[26] Yuan Yuan, Huandong Wang, Jingtao Ding, Depeng Jin, and Yong Li. Learning to Simulate Daily Activities via Modeling Dynamic Human Needs. In Proc. ACM WWW, pages 906–916, 2023.
[27] Yuan Yuan, Jingtao Ding, Huandong Wang, Depeng Jin, and Yong Li. Activity Trajectory Generation via Modeling Spatiotemporal Dynamics. In Proc. ACM SIGKDD, pages 4752–4762, 2022.
[28] Yongheng Deng, Feng Lyu, Ju Ren, Yi-Chao Chen, Peng Yang, Yuezhi Zhou, and Yaoxue Zhang. FAIR: Quality-Aware Federated Learning with Precise User Incentive and Model Aggregation. In Proc. IEEE INFOCOM, pages 1–10, 2021.
[29] OpenStreetMap.
[30] LBS Database.
[31] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Proc. NIPS, volume 27, 2014.
[32] Lu Gan, Youngji Kim, Jessy W Grizzle, Jeffrey M Walls, Ayoung Kim, Ryan M Eustice, and Maani Ghaffari. Multitask Learning for Scalable and Dense Multilayer Bayesian Map Inference. IEEE Trans. Robot., 39(1):699–717, 2022.
[33] Hongjian Wang, Xianfeng Tang, Yu-Hsuan Kuo, Daniel Kifer, and Zhenhui Li. A Simple Baseline for Travel Time Estimation using Large-scale Trip Data. ACM Trans. Intell. Syst. Technol., 10(2):1–22, 2019.
[34] Yan Lin, Huaiyu Wan, Jilin Hu, Shengnan Guo, Bin Yang, Youfang Lin, and Christian S Jensen. Origin-Destination Travel Time Oracle for Map-based Services. In Proc. SIGMOD, 2023.
[35] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks. In Proc. IEEE CVPR, pages 2255–2264, 2018.
[36] Meihui Shi, Derong Shen, Yue Kou, Tiezheng Nie, and Ge Yu. Multi-task Generative Adversarial Network for Missing Mobility Data Imputation. In Proc. ACM CIKM, pages 4480–4484, 2022.
[37] Bent Fuglede and Flemming Topsoe. Jensen-Shannon Divergence and Hilbert space embedding. In Proc. IEEE ISIT, page 31, 2004.
[38] Dong Xie, Feifei Li, and Jeff M Phillips. Distributed Trajectory Similarity Search. In Proc. VLDB Endow, pages 1478–1489, 2017.
[39] Luciano da F Costa. Further Generalizations of the Jaccard Index. arXiv preprint arXiv:2110.09619, 2021.
[40] Markus Schläpfer, Lei Dong, Kevin O’Keeffe, Paolo Santi, Michael Szell, Hadrien Salat, Samuel Anklesaria, Mohammad Vazifeh, Carlo Ratti, and Geoffrey B West. The universal visitation law of human mobility. Nature, 593(7860):522–527, 2021.
[41] Huan Zhou, Tong Wu, Haijun Zhang, and Jie Wu. Incentive-Driven Deep Reinforcement Learning for Content Caching and D2D Offloading. IEEE J. Sel. Areas Commun., 39(8):2445–2460, 2021.
[42] Yi Zhao, Zimu Zhou, Xu Wang, Tongtong Liu, Yunhao Liu, and Zheng Yang. CellTradeMap: Delineating Trade Areas for Urban Commercial Districts with Cellular Networks. In Proc. IEEE INFOCOM, pages 937–945, 2019.
[43] Fan Wu, Ju Ren, Feng Lyu, Peng Yang, Yongmin Zhang, Deyu Zhang, and Yaoxue Zhang. Boosting Internet Card Cellular Business via User Portraits: A Case of Churn Prediction. In Proc. IEEE INFOCOM, pages 640–649, 2022.
[44] Zhihan Fang, Yu Yang, Guang Yang, Yikuan Xian, Fan Zhang, and Desheng Zhang. CellSense: Human Mobility Recovery via Cellular Network Data Enhancement. Proc. ACM UbiComp, 5(3):1–22, 2021.
[45] Mojtaba Vaezi, Amin Azari, Saeed R Khosravirad, Mahyar Shirvanimoghaddam, M Mahdi Azari, Danai Chasaki, and Petar Popovski. Cellular, Wide-Area, and Non-Terrestrial IoT: A Survey on 5G Advances and the Road toward 6G. IEEE Commun. Surv. Tutor., 24(2):1117–1174, 2022.
[46] Zezu Liang, Yuan Liu, Tat-Ming Lok, and Kaibin Huang. Multi-Cell Mobile Edge Computing: Joint Service Migration and Resource Allocation. IEEE Trans. Wireless Commun., 20(9):5898–5912, 2021.
[47] Amit Sheoran, Sonia Fahmy, Matthew Osinski, Chunyi Peng, Bruno Ribeiro, and Jia Wang. Experience: Towards Automated Customer Issue Resolution in Cellular Networks. In Proc. ACM MobiCom, pages 1–13, 2020.
[48] Huali Lu, Feng Lyu, Ju Ren, Jiadi Yu, Fan Wu, Yaoxue Zhang, and Xuemin Sherman Shen. CODE: Compact IoT Data Collection with Precise Matrix Sampling and Efficient Inference. In Proc. IEEE ICDCS, pages 743–753, 2022.
[49] Nuria Oliver, Bruno Lepri, Harald Sterly, Renaud Lambiotte, Sébastien Deletaille, Marco De Nadai, Emmanuel Letouzé, Albert Ali Salah, Richard Benjamins, Ciro Cattuto, et al. Mobile phone data for informing public health actions across the COVID-19 pandemic life cycle. Science advances, 6(23):eabc0764, 2020.
[50] Kyra H Grantz, Hannah R Meredith, Derek AT Cummings, et al. The use of mobile phone data to inform analysis of COVID-19 pandemic epidemiology. Nature communications, 11(1):4961, 2020.
[51] Tra Huong Thi Le, Nguyen H Tran, Yan Kyaw Tun, Minh NH Nguyen, Shashi Raj Pandey, Zhu Han, and Choong Seon Hong. An Incentive Mechanism for Federated Learning in Wireless Cellular Networks: An Auction Approach. IEEE Trans. Wireless Commun., 20(8):4874–4887, 2021.
Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., Vol. 8, No. 4, Article 161. Publication date: December 2024.
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
In the realm of human mobility data analysis, a multitude of constraints result in the publication of sparse, non-uniform implicit trajectories without explicit location information, such as coordinates. Researchers have dedicated substantial efforts towards trajectory recovery, aiming to densify trajectories and gain a more comprehensive understanding of human mobility. However, existing trajectory recovery methods focus on explicit trajectories, and require extensive historical data to capture users' mobility patterns. Nevertheless, implicit trajectories are usually more sparse than explicit trajectories. Addressing these challenges, we propose TrajBERT, an innovative BERT-based trajectory recovery method with spatial-temporal refinement. TrajBERT employs the Transformer encoder to learn mobility patterns bi-directionally and enhances the predictions by cross-stage temporal refinement. Subsequently, we design an output layer with global spatial refinement with a novel spatial-temporal aware loss function. To evaluate the performance of TrajBERT, we conduct a series of experiments on real-world datasets. Remarkably,TrajBERT yields at least 8.2% performance improvement compared to the state-of-the-art trajectory recovery approachs. Furthermore, TrajBERT successfully mitigates the cold start problem commonly experienced with new users lacking historical trajectories. It also shows superior robustness when faced with extremely sparse trajectories, thus demonstrating its potential as a practical tool in the field of human mobility analysis.
Article
Map matching is a common preprocessing step for analysing vehicle trajectories. In the theory community, the most popular approach for map matching is to compute a path on the road network that is the most spatially similar to the trajectory, where spatial similarity is measured using the Fréchet distance. A shortcoming of existing map matching algorithms under the Fréchet distance is that every time a trajectory is matched, the entire road network needs to be reprocessed from scratch. An open problem is whether one can preprocess the road network into a data structure, so that map matching queries can be answered in sublinear time. In this paper, we investigate map matching queries under the Fréchet distance. We provide a negative result for geometric planar graphs. We show that, unless SETH fails, there is no data structure that can be constructed in polynomial time that answers map matching queries in O (( pq ) 1 − δ ) query time for any δ > 0, where p and q are the complexities of the geometric planar graph and the query trajectory, respectively. We provide a positive result for realistic input graphs, which we regard as the main result of this paper. We show that for c -packed graphs, one can construct a data structure of O~(cp)\tilde{O}(cp) size that can answer (1 + ε)-approximate map matching queries in O~(c4qlog4p)\tilde{O}(c^4 q \log ^4 p) time, where O~()\tilde{O}(\cdot) hides lower-order factors and dependence on ε.
Article
Given an origin (O), a destination (D), and a departure time (T), an Origin-Destination (OD) travel time oracle~(ODT-Oracle) returns an estimate of the time it takes to travel from O to D when departing at T. ODT-Oracles serve important purposes in map-based services. To enable the construction of such oracles, we provide a travel-time estimation (TTE) solution that leverages historical trajectories to estimate time-varying travel times for OD pairs. The problem is complicated by the fact that multiple historical trajectories with different travel times may connect an OD pair, while trajectories may vary from one another. To solve the problem, it is crucial to remove outlier trajectories when doing travel time estimation for future queries. We propose a novel, two-stage framework called Diffusion-based Origin-destination Travel Time Estimation (DOT), that solves the problem. First, DOT employs a conditioned Pixelated Trajectories (PiT) denoiser that enables building a diffusion-based PiT inference process by learning correlations between OD pairs and historical trajectories. Specifically, given an OD pair and a departure time, we aim to infer a PiT. Next, DOT encompasses a Masked Vision Transformer~(MViT) that effectively and efficiently estimates a travel time based on the inferred PiT. We report on extensive experiments on two real-world datasets that offer evidence that DOT is capable of outperforming baseline methods in terms of accuracy, scalability, and explainability.
Article
Recent machine learning research on smart cities has achieved great success in predicting future trends, under the key assumption that the test data follows the same distribution of the training data. The rapid urbanization, however, makes this assumption challenging to hold in practice. Because new data is emerging from new environments (e.g., an emerging city or region), which may follow different distributions from data in existing environments. Different from transfer-learning methods accessing target data during training, we often do not have any prior knowledge about the new environment. Therefore, it is critical to explore a predictive model that can be effectively adapted to unseen new environments. This work aims to address this Out-of-Distribution (OOD) challenge for sustainable cities. We propose to identify two kinds of features that are useful for OOD prediction in each environment: (1) the environment-invariant features to capture the shared commonalities for predictions across different environments; and (2) the environment-aware features to characterize the unique information of each environment. Take bike riding as an example. The bike demands of different cities often follow the same pattern that they significantly increase during the rush hour on workdays. Meanwhile, there are also some local patterns in each city because of different cultures and citizens' travel preferences. We introduce a principled framework -- sUrban -- that consists of an environment-invariant optimization module for learning invariant representation and an environment-aware optimization module for learning environment-aware representation. Evaluation on real-world datasets from various urban application domains corroborates the generalizability of sUrban. This work opens up new avenues to smart city development.
Article
Generating human mobility trajectories is of great importance to solve the lack of large-scale trajectory data in numerous applications, which is caused by privacy concerns. However, existing mobility trajectory generation methods still require real-world human trajectories centrally collected as the training data, where there exists an inescapable risk of privacy leakage. To overcome this limitation, in this paper, we propose PateGail, a privacy-preserving imitation learning model to generate mobility trajectories, which utilizes the powerful generative adversary imitation learning model to simulate the decision-making process of humans. Further, in order to protect user privacy, we train this model collectively based on decentralized mobility data stored in user devices, where personal discriminators are trained locally to distinguish and reward the real and generated human trajectories. In the training process, only the generated trajectories and their rewards obtained based on personal discriminators are shared between the server and devices, whose privacy is further preserved by our proposed perturbation mechanisms with theoretical proof to satisfy differential privacy. Further, to better model the human decision-making process, we propose a novel aggregation mechanism of the rewards obtained from personal discriminators. We theoretically prove that under the reward obtained based on the aggregation mechanism, our proposed model maximizes the lower bound of the discounted total rewards of users. Extensive experiments show that the trajectories generated by our model are able to resemble real-world trajectories in terms of five key statistical metrics, outperforming state-of-the-art algorithms by over 48.03%. Furthermore, we demonstrate that the synthetic trajectories are able to efficiently support practical applications, including mobility prediction and location recommendation.
Article
Simulating human mobility and generating large-scale trajectories are of great use in many real-world applications, such as urban planning, epidemic spreading analysis, and geographic privacy protection. Although many previous works have studied the problem of trajectory generation, the continuity of the generated trajectories has been neglected, which makes these methods useless for practical urban simulation scenarios. To solve this problem, we propose a novel two-stage generative adversarial framework, namely TS-TrajGen, to generate continuous trajectories on the road network; it efficiently integrates prior domain knowledge of human mobility with the model-free learning paradigm. Specifically, we build the generator under the human mobility hypothesis of the A* algorithm to learn human mobility behavior. For the discriminator, we combine the sequential reward with the mobility yaw reward to enhance the effectiveness of the generator. Finally, we propose a novel two-stage generation process to overcome the weak point of the existing stochastic generation process. Extensive experiments on two real-world datasets and two case studies demonstrate that our framework yields significant improvements over the state-of-the-art methods.
Article
Vehicle trajectory data is essential for traffic management and location-based services. However, publishing real-life trajectory data has been challenging because vehicle trajectories contain users' sensitive information. Differential privacy addresses such problems by publishing a synthetic version of the input dataset, but existing works always assume the real-world data is absolutely accurate. This assumption no longer holds in trajectory data because it typically contains errors due to inaccurate positioning services, which leads to poor performance of data synthesized by such trajectories. Even worse, existing works may generate unrealistic trajectories due to their coarse data synthesis methods, resulting in low practical utility or even inability to handle complex tasks. In this paper, we propose a Privacy-preserving and Utility-enhancing framework for Trajectory Synthesization (PUTS). Our framework mitigates the impact of data errors in trajectories on differential privacy mechanisms, by exploiting map-matching techniques and real-world road network structure. In PUTS, a two-layer approach from path to trajectory synthesis is proposed to not only guarantee the reality of synthetic trajectories, but also scale up PUTS in real-world applications. Extensive experiments on real-world datasets show that PUTS significantly outperforms existing methods in terms of utility in a range of real-world applications.
Article
Trajectory data has the potential to greatly benefit a wide range of real-world applications, such as tracking the spread of disease through people's movement patterns and providing personalized location-based services based on travel preferences. However, privacy concerns and data protection regulations have limited the extent to which this data is shared and utilized. To overcome this challenge, local differential privacy provides a solution by allowing people to share a perturbed version of their data, ensuring privacy as only the data owners have access to the original information. Despite its potential, existing point-based perturbation mechanisms are not suitable for real-world scenarios due to poor utility, dependence on external knowledge, high computational overhead, and vulnerability to attacks. To address these limitations, we introduce LDPTrace, a novel locally differentially private trajectory synthesis framework. Our framework takes into account three crucial patterns inferred from users' trajectories in the local setting, allowing us to synthesize trajectories that closely resemble real ones with minimal computational cost. Additionally, we present a new method for selecting a proper grid granularity without compromising privacy. Our extensive experiments, using real-world as well as synthetic data and various utility metrics and attacks, demonstrate the efficacy and efficiency of LDPTrace.
Article
Global Navigation Satellite Systems typically perform poorly in urban environments, where the likelihood of line-of-sight conditions between devices and satellites is low. Therefore, alternative location methods are required to achieve good accuracy. We present LocUNet: A convolutional, end-to-end trained neural network (NN) for the localization task, which is able to estimate the position of a user from the received signal strength (RSS) of a small number of Base Stations (BS). Using estimations of pathloss radio maps of the BSs and the RSS measurements of the users to be localized, LocUNet can localize users with state-of-the-art accuracy and enjoys high robustness to inaccuracies in the estimations of radio maps. The proposed method does not require generating RSS fingerprints of each specific area where the localization task is performed and is suitable for real-time applications. Moreover, two novel datasets that allow for numerical evaluations of RSS and ToA methods in realistic urban environments are presented and made publicly available for the research community. By using these datasets, we also provide a fair comparison of state-of-the-art RSS and ToA-based methods in the dense urban scenario and show numerically that LocUNet outperforms all the compared methods.
Article
Many mobile apps access users' trajectories to provide critical services (e.g., trip tracking). Unfortunately, in such apps, malicious users may upload fake trajectories to cheat providers for illegal benefits. Few works in the literature study trajectory forgery problems in depth. In this paper, we first take the perspective of attackers and consider how they would fabricate vivid trajectories when confronting a strict provider. In particular, we use the technique of adversarial examples in deep learning to propose a trajectory forgery method, which produces fake trajectories satisfying two conditions: (1) having motion characteristics indistinguishable from those of real ones, and (2) matching reasonable walking, cycling, or driving routes when projected onto the map. Our experiments show that these fake trajectories can hardly be detected by mainstream trajectory service providers, even when the providers are equipped with machine learning-based approaches. Therefore, we further present dedicated countermeasures that validate the plausibility of reported received signal strength indicator (RSSI) data of scanned WiFi APs in commercial areas and scanned Cellular APs in rural areas, respectively. They can deal well with the most challenging replay scenario, which can hardly be handled by existing radio-based location verification methods. We conduct extensive real-world experiments covering walking, cycling, and driving scenarios to demonstrate the high detection accuracy of both methods.