SynthCAT: Synthesizing Cellular Association Traces with Fusion of
Model-Based and Data-Driven Approaches
FENG LYU, Central South University, China
JIE ZHANG, Central South University, China
HUALI LU∗, Central South University, China
HUAQING WU, University of Calgary, Canada
FAN WU, Central South University, China
YONGMIN ZHANG, Central South University, China
YAOXUE ZHANG, Tsinghua University, China
The scarcity of publicly available cellular association traces hinders user location-based research and various data-driven
services, highlighting the importance of data synthesis in this field. In this paper, we investigate the cellular association trace
synthesis (CATS) problem, aiming to generate diverse and realistic cellular association traces based on road segment-based
trajectories and corresponding departure times. To substantiate our research, we first gather substantial data, including road
segment-based trajectories, base station (BS) distribution, and ground truths of cellular association traces. We then perform
systematic data analysis to reveal technical challenges such as disparity in geographic spaces, complex and dynamic BS
handover, and poor performance of single-dimension approaches. To address these challenges, we propose SynthCAT, a novel
scheme that fuses model-based and data-driven approaches. Specifically, SynthCAT includes: i) A model-based coarse-grained
cellular association trace generation component, encompassing GPS reference generation, weighted historical average time
generation, Bayesian decision, and time mapping modules. This component establishes a unified GPS space to map road and BS
spaces, generates initial time information, synthesizes coarse-grained spatial cellular association traces by following explicit
BS handover rules, and maps the corresponding arrival time for each trace point; ii) A fine-grained cellular association trace
generation component, which combines model-based and data-driven approaches. This employs a two-stage Autoencoder
Generative Adversarial Network (AEGAN) to refine cellular association traces based on the coarse-grained ones. Extensive
field experiments validate the efficacy of SynthCAT in terms of trace similarity to ground truths and its efficiency in supporting
practical downstream applications.
CCS Concepts: • Networks → Mobile networks; • Computing methodologies → Modeling and simulation; Model
verification and validation.
Additional Key Words and Phrases: Synthesizing cellular association traces, Model-based and data-driven fusion, Autoencoder
Generative Adversarial Network, Downstream applications support
∗Corresponding author: Huali Lu.
Authors’ addresses: Feng Lyu, Central South University, Changsha, China, fenglyu@csu.edu.cn; Jie Zhang, Central South University, Changsha,
China, jie_zhang@csu.edu.cn; Huali Lu, Central South University, Changsha, China, huali_lu@csu.edu.cn; Huaqing Wu, University of Calgary,
Calgary, Canada, huaqing.wu1@ucalgary.ca; Fan Wu, Central South University, Changsha, China, wfwufan@csu.edu.cn; Yongmin Zhang,
Central South University, Changsha, China, zhangyongmin@csu.edu.cn; Yaoxue Zhang, Tsinghua University, Beijing, China,
zhangyx@tsinghua.edu.cn.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy
otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
permissions@acm.org.
©2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM 2474-9567/2024/12-ART161
https://doi.org/10.1145/3699730
Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., Vol. 8, No. 4, Article 161. Publication date: December 2024.
ACM Reference Format:
Feng Lyu, Jie Zhang, Huali Lu, Huaqing Wu, Fan Wu, Yongmin Zhang, and Yaoxue Zhang. 2024. SynthCAT: Synthesizing
Cellular Association Traces with Fusion of Model-Based and Data-Driven Approaches. Proc. ACM Interact. Mob. Wearable
Ubiquitous Technol. 8, 4, Article 161 (December 2024), 24 pages. https://doi.org/10.1145/3699730
1 INTRODUCTION
Cellular networks are integral to our daily lives, ensuring seamless wireless connections for mobile users [1].
According to Ericsson’s reports [2], there will be a substantial rise in cellular IoT connections, from 1.7 billion
in 2020 to 5.9 billion in 2026. This explosive growth will generate an abundance of cellular association traces,
presenting immense potential for scientific research and diverse location-based services, including user traveling
behavior profiling [3–5], human mobility modeling [6–9], outdoor localization [10, 11], map matching [12–14], etc.
However, practical issues hinder their widespread usage, including the lack of publicly available data, prohibitive
manual collection cost, and unavailability of network operator data for direct utilization. The synthesis of realistic
and diverse cellular association traces becomes critical to bridge the gap, which can enhance existing datasets,
facilitate training of deep learning models, synthesize target cellular association traces in any city without
requiring additional measurements, and protect user privacy. That is a potent approach to expedite research,
development, and implementation processes across various applications.
However, synthesizing realistic and diverse cellular association traces presents a formidable technical challenge,
marked by three distinct hurdles. Firstly, the disparity in geographic spaces between road segment-based
trajectories and base station (BS) handover-based cellular association traces stands as a critical obstacle. Overcoming
this challenge necessitates an accurate resolution of the mapping issue from the road space to the BS space.
Secondly, in areas with multiple BSs, the complexity and dynamism of BS handover processes increase due to
considerations such as signal coverage and load balancing. While real cellular networks follow a well-defined BS
handover process governed by received signal levels, designing a judicious handover rule that incorporates signal
characteristics like strength, coverage, and direction to emulate the operator’s black-box handover strategy is
indispensable for effective synthesis of cellular association traces. Lastly, when it comes to synthesizing moving
traces, relying solely on traditional model-based or data-driven approaches often encounters limitations, leading
to trace inaccuracies caused by underlying assumptions or unmet real-world training data demands. To overcome
this limitation, an innovative approach that synergistically combines the strengths of both techniques becomes
essential. A model-based approach can capture the fundamental essence of underlying data patterns using explicit
mathematical or statistical models with minimal data requirements. Simultaneously, a data-driven approach
excels in unveiling hidden patterns and insights in complex and dynamic environments.
In the field of synthesizing mobility data, existing studies can be broadly categorized into road trajectory
synthesis [15–17] and GPS trajectory synthesis [18–20]. The first category focuses on synthesizing routes which
represent the movement of objects along specific road segments, typically consisting of spatial coordinates or
unique road identifiers. For instance, Jiang et al. [15] proposed a two-stage GAN-based approach to generate
continuous road trajectories. On the other hand, GPS trajectory synthesis involves synthesizing trajectories based
on GPS location data, providing a more continuous and detailed representation of an object’s movement. For
instance, Sun et al. [18] investigate the task of GPS trajectory synthesis while ensuring privacy preservation
and data utility. However, these studies mainly target mobility data derived from vehicle tracking systems or
GPS locating devices, the traces of which are limited in terms of user scales and spatio-temporal coverage. They
often overlook data collected from cellular networks, which actually offer strengths in terms of coverage, user
penetration, granularity, and cost-effectiveness. Consequently, the research on synthesizing cellular association
traces is currently an unexplored area. Moreover, current data synthesis techniques are typically either purely
model-based [21–24] or data-driven [15, 25–28], both of which have inherent drawbacks. Particularly, model-based
approaches may achieve limited performance with simplistic assumptions, while data-driven approaches
often require substantial real-world training data that may not be readily available.
In this paper, we initiate the investigation of synthesizing cellular association traces in accordance with
road segment-based trajectories and corresponding departure times. Initially, we define the cellular association
trace synthesis (CATS) problem and describe the data acquisition process, encompassing road segment-based
trajectories, BS distribution, and ground truths of cellular association traces. Subsequently, we conduct a systematic
data analysis, revealing technical challenges in terms of disparity in geographic spaces, complex and dynamic
BS handover, and poor performance of single-dimension approaches. Drawing inspiration from the data-driven
insights, we propose an innovative cellular association trace synthesis scheme, named SynthCAT, which integrates
model-based and data-driven approaches in a coarse-grained to fine-grained trace synthesis process. In SynthCAT,
we first propose a model-based coarse-grained cellular association trace generation method that: i) generates
GPS reference points to address the disparity between road and BS spaces; ii) generates initial time information
for GPS reference points by modeling the relationship between average travel speed, time, and area; iii) maps
timestamped GPS-based trajectories to BS handover-based trajectories using a Bayesian decision model with
explicit BS switching rules and time mapping. We then propose a model-based and data-driven fusion module,
utilizing a two-stage Autoencoder Generative Adversarial Network (AEGAN) to synthesize fine-grained cellular
association traces based on the coarse-grained ones. The first stage generates a plausible grid representation of
the trajectory matrix, while the second stage focuses on mapping fine-grained BS IDs with cellular association
trace points within the grids and calculating the final arrival time for each trace point. Extensive experiments are
conducted to demonstrate the efficacy of SynthCAT, illustrating unequivocally that it significantly outperforms
state-of-the-art baselines. The main contributions are summarized as follows.
• This study presents an innovative investigation of the CATS problem, aiming to synthesize realistic and
diverse cellular association traces aligned with road segment-based trajectories and corresponding departure
times. By bridging a gap in existing literature, which predominantly focuses on road and GPS trajectories,
our work facilitates downstream applications that benefit from extensive synthesized traces without privacy
concerns.
• We acquire substantial data to support this study, including road segment-based trajectories, BS distribution,
and ground truths of cellular association traces. Then, we conduct a systematic analysis and reveal several
important technical challenges, including disparity in geographic spaces, complex and dynamic BS handover,
and poor performance of single-dimension approaches.
• We propose an original cellular association trace synthesis scheme, called SynthCAT, which incorporates
both a model-based coarse-grained cellular association trace generation component and a fine-grained
cellular association trace generation component, fusing model-based and data-driven approaches. This
integrated scheme collectively synthesizes traces that closely resemble ground truths and efficiently support
downstream applications. Extensive field experiments corroborate its efficacy.
The remainder of this paper is organized as follows. We commence with a detailed exposition of the problem
definition and data acquisition in Section 2. Subsequently, in Section 3, we conduct an empirical data analysis
to reveal data-driven challenges. Section 4 delves into the design of SynthCAT. Extensive experiments are
presented in Section 5. Following that, Section 6 discusses the limitations of this work, and Section 7 provides a
comprehensive review of related works. Finally, we conclude the paper in Section 8.
Fig. 1. Illustration of cellular association trace synthesis (CATS) problem. [Figure elements: road map (nodes and road
segments), road segment-based trajectory with departure time, BS distribution, ground truth of cellular trace, and a
synthetic example of trace.]
2 PROBLEM DEFINITION AND DATA ACQUISITION
2.1 Problem Definition
When a user with a SIM card moves along geographic road segments, the cellular network operator passively
collects association traces (i.e., logs of when and which BS the user associates with) according to the road segment-
based trajectories and corresponding departure times. These cellular association traces offer significant advantages
such as full user penetration, complete area coverage, and continuous time availability. They effectively capture
users’ mobile behaviors and enable various location-based services. However, realizing their full potential is
challenged by the limited availability of open-source cellular association data, primarily due to privacy concerns
and commercial considerations. To bridge this gap, our study focuses on synthesizing these traces based on road
segment-based trajectories, corresponding departure times, and the underlying BS distribution.
Figure 1 illustrates the cellular association trace synthesis problem. Our goal is to synthesize cellular association
traces that align with road segment-based trajectories within a predefined road map, where the BS distribution is
known in advance, along with the corresponding departure times. The synthesized traces aim to resemble the
ground truths in both spatial and temporal dimensions. We formulate the problem as follows.
Definition 2.1 (Road Segment-Based Trajectory). A road segment-based trajectory is a sequence of road
segments generated when moving. It is represented by a sequence of chronologically ordered points
$R = \{r_1, r_2, \ldots, r_n\}$, where $r_i$ represents a road segment ID.
Definition 2.2 (Cellular Association Trace). A cellular association trace is a sequence of timestamped BSs
accessed by a user on the move, passively collected by the network operator. It is denoted by
$\mathcal{T} = \{(bs_1, t_1), (bs_2, t_2), \ldots, (bs_n, t_n)\}$, where $bs_i$ denotes the $i$-th associated BS, and $t_i$
represents the arrival time at $bs_i$. Each BS entry includes information about the ID $id$, geographic location
$loc = (lon, lat)$, and signal coverage radius $rad$.
Definition 2.3 (Cellular Association Trace Synthesis (CATS) Problem). Given a set of road segment-based
trajectories $\{R_1, R_2, \ldots, R_n\}$, along with their corresponding departure times, a complete BS distribution, and a
small set of cellular association traces $\{\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_m\}$ for model training, the CATS problem aims to synthesize
a new large dataset of cellular association traces $\hat{\mathcal{T}}_1, \hat{\mathcal{T}}_2, \ldots, \hat{\mathcal{T}}_M$ ($M \gg m$). Here, $R_i$, $\mathcal{T}_i$, and $\hat{\mathcal{T}}_i$
denote the $i$-th road segment-based trajectory, the $i$-th original cellular association trace, and the $i$-th newly
synthesized cellular association trace, respectively.
Fig. 2. Cellular association trace collection platform and campaigns. [Panels: data collection platform; data collection
in car; data collection in subway.]
Definition 2.4 (Map Matching). Map matching is the process of projecting a cellular association trace $\mathcal{T}$
onto the road topology to get a matched road-segment trajectory $R$, which serves as the related downstream
application to verify the effectiveness of the proposed SynthCAT in this paper.
2.2 Data Acquisition
2.2.1 Road Segment-Based Trajectories. In this study, we investigate the CATS problem within three 10 km × 10 km
city areas, encompassing two urban areas and one suburban area. To substantiate our research, we leverage the
data from OpenStreetMap (OSM) [29], a publicly accessible open-source map service with physical geographical
information. Firstly, we collect a road map comprising 8,993 road segments with various types of roads, such
as motorways, primary roads, and footways. The road map is represented as a directed graph $graph = (V, E)$,
where $V$ denotes nodes representing intersections or terminal points, and $E$ represents road segments connecting
these nodes. Each node $v_i$ consists of a unique node ID $vid_i$ and its corresponding geographic location
$(lon_i, lat_i)$ denoting longitude and latitude, respectively. Each edge $e_i^j$ between nodes $v_i$ and $v_j$ contains
several GPS positions forming a geometry property. Within the road map, we generate 4,000 road segment-based
trajectories from OSM using a path planning algorithm. Notably, to enhance the robustness of the study, we collect
different types of routes with differentiated BS densities and driving conditions. Specifically, we randomly select
two nodes, $v_i$ and $v_j$, from $V$ as the origin and destination, and then employ the path planning algorithm to
generate a route $R$ between $v_i$ and $v_j$. This dataset serves as valuable input for the CATS problem, enabling us
to explore and develop efficient approaches for cellular association trace synthesis.
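The route generation step described above can be sketched as follows. The text does not name the path planning algorithm, so this minimal Python sketch assumes Dijkstra's shortest path on a small directed graph; `shortest_route`, `sample_trajectory`, and the adjacency-list format are hypothetical names chosen only for illustration.

```python
import heapq
import random


def shortest_route(edges, origin, destination):
    """Dijkstra over a directed road graph (assumed planner, not the paper's).

    edges maps node -> list of (neighbor, segment_id, length_m).
    Returns the road segment IDs from origin to destination,
    or None if the destination cannot be reached.
    """
    best = {origin: 0.0}
    prev = {}  # node -> (previous node, segment used to reach node)
    heap = [(0.0, origin)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == destination:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, seg_id, length in edges.get(node, []):
            cand = dist + length
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                prev[nxt] = (node, seg_id)
                heapq.heappush(heap, (cand, nxt))
    if destination not in best:
        return None
    route, node = [], destination
    while node != origin:
        node, seg_id = prev[node]
        route.append(seg_id)
    return route[::-1]


def sample_trajectory(edges, rng):
    """Pick a random origin/destination pair and plan a route between them."""
    origin, destination = rng.sample(sorted(edges), 2)
    return shortest_route(edges, origin, destination)


# Toy graph: three nodes, four directed road segments.
toy_edges = {
    "a": [("b", "r1", 100.0), ("c", "r2", 500.0)],
    "b": [("c", "r3", 100.0)],
    "c": [("a", "r4", 300.0)],
}
route = shortest_route(toy_edges, "a", "c")
random_route = sample_trajectory(toy_edges, random.Random(0))
```

On the toy graph, the planned route from `a` to `c` goes through `b` because the two short segments are cheaper than the direct one. A real setup would build `edges` from the OSM road map rather than by hand.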
2.2.2 Base Station Distribution. To synthesize cellular association traces, we require the full distribution of BSs
in the targeted area. Initially, we obtain the BS distribution data, denoted by $BS = \{bs_i\}$, from an open-source
database [30], containing the distribution data of BSs in one country. To validate the feasibility of using this
open-source data, we compared it with the BS distribution data obtained from a cooperative network operator,
which, however, cannot be publicly shared due to privacy and commercial concerns. The comparison verifies that
the BS distribution in the open-source dataset closely aligns with the real BS distribution. Although there is a
slight deviation in the geographic location (within 20 meters), it is acceptable and does not affect the design and
performance of the cellular association trace synthesis scheme. Considering the limitations of using operator’s
data, we adopt the distribution of BSs from the open-source database, which contains 5,770 BSs within the
targeted areas, to substantiate this research.
Fig. 3. Technical challenges. [Panels: (a) road space; (b) BS space; (c) examples of BS handover, showing two cellular
association traces along one road segment-based trajectory; (d) uncertainty of BS handover, plotted as the CDF of the
Jaccard distance; (e) poor performance, plotted as precision and recall for model-driven, data-driven, and real data.]
2.2.3 Ground Truths of Cellular Association Traces. To facilitate our research and achieve sufficient training and
testing data, we develop a comprehensive cellular association trace collection platform and conduct extensive data
collection campaigns, as illustrated in Fig. 2. This platform comprises a real-time Android-based data collection
App, along with 20 mobile phones. To ensure broad compatibility across all cellular networks, we focus on
collecting three universally available fields during BS association, i.e., operator, cid, and time. With the active
involvement of consenting volunteers, we collect substantial ground truths of cellular association traces along
the road segment-based trajectories generated earlier. Our goal is to establish a one-to-many correspondence
between road segment-based trajectories and cellular association traces, providing ground truths for model
training. Ultimately, we successfully amass a total of 4,000 cellular association traces, involving interactions with
2,882 BSs within the targeted areas, and totaling an overall time length of 210.9 hours.
3 DATA-DRIVEN CHALLENGES
3.1 Disparity in Geographic Spaces
Synthesizing cellular association traces in accordance with road segment-based trajectories faces the challenge of
disparate geographic spaces, primarily characterized by heterogeneous distribution and representation mismatch
of road segments and BSs. Firstly, road segments exhibit a continuous and interconnected distribution, whereas
BSs are scattered and irregularly distributed. Secondly, road segment-based trajectories are typically represented
using spatial coordinates or unique road identifiers, while cellular association traces are represented based on BSs.
Figures 3(a) and 3(b) exemplify this difference in representing three routes traveled by the same user. Although
they convey the same information, they are represented in distinct ways. To overcome these challenges, it is
necessary to establish the mapping between road and BS spaces, associating road segments with specific BSs.
However, this mapping is not always a one-to-one case. A single road segment may be served by multiple BSs,
and conversely, a single BS may cover multiple road segments. Thus, addressing the disparity in road and BS
spaces becomes critical in synthesizing accurate and realistic cellular association traces.
3.2 Complex and Dynamic BS Handover
The intricate nature of BS connections and handovers introduces considerable uncertainty to cellular association
traces, influenced by several factors. The presence of multiple BSs within an area, the limited coverage range of
BSs, and varying signal strengths in different areas contribute to the complexity of BS handovers. The continuous
movement of users and dynamic network conditions further compound the uncertainty, leading to diverse
association sets even for users traveling the same route in the same vehicle, as depicted in Fig. 3(c). To quantify
the uniqueness of cellular association traces, Fig. 3(d) illustrates the cumulative distribution function (CDF) of the
Jaccard distance, which measures dissimilarity in associated BS sets. The CDF results reveal that, even along the same
road, no pair of traces is identical (Jaccard distance of 0), and some pairs may be completely different (Jaccard distance
of 1), highlighting the distinctiveness of each trace. These inherent complexities and dynamics in BS handover
present formidable challenges for cellular association trace synthesis.
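The Jaccard distance used in Fig. 3(d) compares the sets of BSs that two traces associate with. A minimal Python sketch is given below, assuming each trace is a list of (BS ID, arrival time) pairs; the function names and the sample BS IDs are illustrative only and are not taken from the dataset.

```python
from itertools import combinations


def jaccard_distance(trace_a, trace_b):
    """1 - |A ∩ B| / |A ∪ B| over the sets of associated BS IDs."""
    set_a = {bs for bs, _ in trace_a}
    set_b = {bs for bs, _ in trace_b}
    union = set_a | set_b
    if not union:
        return 0.0  # two empty traces are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


def pairwise_distances(traces):
    """Jaccard distance for every pair of traces collected on one route."""
    return [jaccard_distance(a, b) for a, b in combinations(traces, 2)]


# Two hypothetical traces along the same route: (bs_id, arrival time in s).
trace_1 = [(101, 0), (102, 30), (103, 75)]
trace_2 = [(101, 0), (104, 40), (103, 80)]
distance = jaccard_distance(trace_1, trace_2)  # 2 shared BSs out of 4 in total
```

A distance of 0 means both traces visit exactly the same BS set, and a distance of 1 means they share no BS at all. Sorting the output of `pairwise_distances` gives the values needed to plot an empirical CDF.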
3.3 Poor Performance of Single-Dimension Approaches
To overcome the limitations posed by insufficient available data, it is crucial to develop advanced techniques for
cellular association trace synthesis. While some existing methods aim to generate large-scale synthetic data, they
often fall short in terms of data quality due to their simplistic nature and reliance on a single-dimension design
paradigm. For instance, model-based approaches depend solely on assumptions and simplified models, failing to
accurately capture the intricate dynamics of cellular association traces. Similarly, data-driven approaches learn
features through a black-box methodology, neglecting explicit mathematical features and being constrained by
limited, biased, and non-diverse training data. The suboptimal performance of single-dimension approaches
becomes apparent in downstream applications, as illustrated in Fig. 3(e) with map matching as an example.
Two representative algorithms, a model-based approach (Bayes) and a data-driven approach (trajGANs)¹, were
selected to assess the impact of their synthesized data on a typical map matching task. The observed results
indicate that the data synthesized by both model-based and data-driven models does not significantly improve
map matching performance. Thus, there is a pressing need to develop advanced techniques that synergistically
leverage multidimensional approaches for cellular association trace synthesis.
4 DESIGN OF SYNTHCAT
4.1 Design Overview
Figure 4 presents an overview of the SynthCAT design, comprising two major components: model-based coarse-
grained cellular association trace generation, and fine-grained cellular association trace generation via fusion of
model-based and data-driven approaches. The first component synthesizes a coarse-grained cellular association
trace for a given road segment-based trajectory and departure time, utilizing three modules, i.e., GPS reference
generation, weighted historical average time generation, and Bayesian decision. These modules: i) generate GPS
reference points to unify different road and BS spaces into a single GPS space; ii) calculate arrival times for each
GPS point based on weighted historical average travel speeds; iii) synthesize a coarse-grained cellular association
trace using the generated timestamped GPS-based trajectory and BS distribution via Bayesian decision. The second
component refines the synthesized traces by integrating model-based and data-driven approaches using real data
in two stages: first, refining the traces with grid representation to enhance realism, and second, synthesizing
diverse traces by mapping grids to BSs and their corresponding arrival times. The following subsections provide
a detailed explanation of each component, highlighting their functionalities and contributions.
¹ Detailed setup can be found in the performance evaluation section.
Fig. 4. Design overview of SynthCAT. [Diagram: in the model-based coarse-grained cellular association trace generation
component, the road map, a road segment-based trajectory, and the BS distribution pass through ① GPS reference
generation, ② weighted historical average time generation, and ③ Bayesian decision to yield a GPS-based spatial
trajectory, a timestamped GPS-based trajectory, and a coarse-grained cellular association trace. In the fine-grained
component, the coarse-grained and real cellular association traces are gridded into grid-based representations and fed
to a generator (encoder, attention, decoder) and a discriminator in stage one; stage two combines the grid-based spatial
and temporal traces into fine-grained cellular association traces.]
4.2 Model-Based Coarse-Grained Cellular Association Trace Generation
4.2.1 GPS Reference Generation. Data analysis verifies that a one-to-one correspondence between road
segments and BSs is lacking. To establish the mapping between road and BS spaces, the challenge of representing
a one-to-many relationship needs to be addressed. Upon close examination of road segment-based trajectories and
cellular association traces, it is found that they intersect in the GPS space. Both the starting and ending points of
road segments and the location of BSs are represented by GPS coordinates. However, relying solely on sparse
GPS points is insufficient for accurately mapping the two spaces. To tackle this issue, we propose to use
intermediate GPS reference points to associate road segments with specific BSs.
To generate GPS reference points, we propose three processing methods, i.e., linear interpolation, corner
sparsification, and denseness rearrangement. The choice of method depends on two thresholds, i.e., the sampling
distance of GPS points $gap$, and the corner angle $\theta$, which is used to identify whether three GPS points form
a turning corner. Initially, we obtain an initial set of GPS points ($X$) from the road segment-based trajectory,
denoted by $(lon_i, lat_i)$ for each point $x_i$. Next, we calculate the distance $d$ between $x_i$ and $x_{i+1}$. If $d > gap$, we
interpolate between $x_i$ and $x_{i+1}$ using linear interpolation, denoted by

$$lat = \frac{lon_{i+1} - lon}{lon_{i+1} - lon_i}\, lat_i + \frac{lon - lon_i}{lon_{i+1} - lon_i}\, lat_{i+1}, \tag{1}$$

$$gap = \sqrt{(lat - lat_i)^2 + (lon - lon_i)^2}. \tag{2}$$

Combining Eq. (1) and Eq. (2), the coordinates $(lon, lat)$ of the interpolated GPS point can be obtained. If $d < gap$,
we calculate the angle $\alpha$ between $\overrightarrow{x_{i+1}x_i}$ and $\overrightarrow{x_{i+1}x_{i+2}}$. If $\alpha < \theta$, indicating a turning corner, we perform corner
sparsification by removing point $x_{i+1}$. If $\alpha > \theta$, representing a dense area of GPS points, we conduct denseness
rearrangement by deleting point $x_{i+1}$ and inserting a new point $x_{i+1}$ at a consistent interval. This process is
repeated until all points in $X$ have been processed. Finally, we generate a uniform and dense GPS-based spatial
trajectory for each road segment-based trajectory, which serves as location reference points for space mapping.
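The linear interpolation branch of this procedure can be illustrated with a short Python sketch. It is a simplification under stated assumptions: coordinates are treated as planar (for example, already projected to meters), only the $d > gap$ case is handled, and the corner sparsification and denseness rearrangement branches are omitted. The function name `interpolate_gps` is hypothetical.

```python
import math


def interpolate_gps(points, gap):
    """Insert reference points so that consecutive points are at most `gap` apart.

    points: list of (x, y) planar coordinates along a trajectory.
    Starting from the current point, a new point is placed at distance
    `gap` along the straight segment toward the next original point,
    and this repeats until the remaining distance is no larger than `gap`.
    """
    if not points:
        return []
    out = [points[0]]
    for nxt in points[1:]:
        cur = out[-1]
        d = math.hypot(nxt[0] - cur[0], nxt[1] - cur[1])
        while d > gap:
            w = gap / d  # fraction of the remaining segment to advance
            cur = (cur[0] + w * (nxt[0] - cur[0]),
                   cur[1] + w * (nxt[1] - cur[1]))
            out.append(cur)
            d = math.hypot(nxt[0] - cur[0], nxt[1] - cur[1])
        out.append(nxt)
    return out


# A 10-unit segment with gap = 4 gains two intermediate points (near x = 4 and x = 8).
dense = interpolate_gps([(0.0, 0.0), (10.0, 0.0)], 4.0)
```

Moving a fixed distance along the straight segment is the parametric form of solving Eq. (1) and Eq. (2) together for the new point. For real longitude and latitude values, a projection or a spherical distance would be needed instead of `math.hypot`.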
Fig. 5. Description of average travel speed collected from Amap API. [Panels, each plotting speed (km/h) against hour:
(a) average travel speed over two days; (b) average travel speed in various areas (overall, Area1, Area2, Area3);
(c) modeling of average travel speed (observed values and fitted results).]
4.2.2 Weighted Historical Average Time Generation. Once a GPS-based spatial trajectory is generated, the next
step is to determine the corresponding travel time for each GPS trajectory point. Travel time can be intuitively
calculated based on travel distance and speed, considering the uniform and dense characteristics of the GPS-based
spatial trajectory. Therefore, obtaining accurate travel speeds for different trajectories is crucial. To address
this, we propose a weighted historical average method.
Initially, we collect average travel speed data over a continuous two-week period using the Amap API
2
, an
open-source platform providing real-time trac information. This platform oers real-time trac conditions
within any specied area, including congestion levels, the proportion of congested road sections, and average
travel speeds, with data sampled at one-hour intervals. Figure 5illustrates the data collected. Our data analysis
reveals two key observations. First, the average travel speed exhibits a clear periodicity, with a 24-hour cycle.
Second, the average travel speeds in dierent areas show a linear relationship. Based on these ndings, we model
the average travel speed as a function of time and area, expressed as
𝑠𝑝𝑒𝑒𝑑 =𝜔 𝑓 (𝑡)
, where
𝑓(𝑡)
represents the
city’s average travel speed over one day, with time
𝑡
ranging from 0 to 23, and
𝜔
is a weight for dierent areas.
Extensive fitting trials indicate that a double sine function provides the best fit, denoted by
$$f(t) = \alpha_1 \sin(\alpha_2 t + \alpha_3) + \alpha_4 \sin(\alpha_5 t + \alpha_6) + \alpha_7, \tag{3}$$
where $(\alpha_1, \alpha_2, \alpha_3, \alpha_4, \alpha_5, \alpha_6, \alpha_7)$ are fitting parameters. After optimization, these parameters are determined as $(11.5, 0.1, 3, 4.5, 0.67, -1.23, 38.76)$, with $\omega$ ranging from 0.8 to 1.1, higher in more developed areas. Thus, the average travel speed function is
$$speed = \omega \left(11.5 \sin(0.1t + 3) + 4.5 \sin(0.67t - 1.23) + 38.76\right), \quad \omega \in (0.8, 1.1). \tag{4}$$
Figure 5(c) shows the fitting results of Eq. 4, demonstrating that our fitting results closely match the observed values, with homogeneous shapes. Finally, we calculate the travel time for each pair of adjacent GPS points based on their distance and corresponding travel speed. The arrival time for each GPS point is then determined by adding the travel time to the arrival time of the previous point.
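The time-generation step can be illustrated with a minimal Python sketch. It evaluates Eq. (4) with the reported parameters and accumulates arrival times along a GPS trajectory. The function names, the haversine distance, and the use of a fractional hour of day for $t$ are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import math

# Fitted parameters (alpha_1 ... alpha_7) as reported for Eq. (3).
ALPHA = (11.5, 0.1, 3.0, 4.5, 0.67, -1.23, 38.76)

def city_speed_kmh(t_hour):
    """City-wide average speed f(t) from Eq. (3), t in hours of the day."""
    a1, a2, a3, a4, a5, a6, a7 = ALPHA
    return a1 * math.sin(a2 * t_hour + a3) + a4 * math.sin(a5 * t_hour + a6) + a7

def area_speed_kmh(t_hour, omega=1.0):
    """Area-specific speed omega * f(t) from Eq. (4)."""
    if not 0.8 <= omega <= 1.1:
        raise ValueError("omega is expected to lie between 0.8 and 1.1")
    return omega * city_speed_kmh(t_hour)

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def arrival_times(points, departure_s, omega=1.0):
    """Arrival time (seconds since midnight) at each GPS point.

    Assumption: the speed of a hop is evaluated at the arrival time of
    the hop's starting point, expressed as a fractional hour of day.
    """
    times = [float(departure_s)]
    for prev, cur in zip(points, points[1:]):
        hour = (times[-1] / 3600.0) % 24
        speed = area_speed_kmh(hour, omega)
        times.append(times[-1] + haversine_km(prev, cur) / speed * 3600.0)
    return times
```

For example, `arrival_times([(28.20, 112.90), (28.21, 112.90)], 8 * 3600)` returns the departure time followed by the estimated arrival time at the second point.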
4.2.3 Bayes Decision. To synthesize cellular association traces from timestamped GPS-based trajectories, we
propose a GPS-to-BS mapping algorithm based on the Bayesian decision model. This method matches each GPS
point with the optimal BS and maps the corresponding time to the BS, thereby generating coarse-grained cellular
association traces.
Firstly, we determine the possible BS candidate set $B = \{b_1, b_2, ..., b_n\}$ for a given GPS reference $x$ from the BS distribution $BS = \{bs_i\}$, based on a distance constraint $dis$. If $Distance(bs_i, x) < dis$, then $bs_i$ is a possible BS candidate for $x$, where $Distance()$ calculates the spherical distance between two points. Since $dis$ is an empirical value, it might be larger than the BS coverage radius, meaning $x$ is not necessarily within each BS candidate's coverage. Therefore, we assess the likelihood of $x$ being covered by each candidate $b_i$ as follows
$$p(b_i) = \begin{cases} \dfrac{(rad_i - d_i)^2}{rad_i^2}, & d_i < rad_i, \ i \in (1, 2, \cdots, n), \\ 0, & \text{otherwise}, \end{cases} \tag{5}$$
where $rad_i$ represents the coverage radius of BS candidate $b_i$, and $d_i = Distance(b_i, x)$.
$^2$ https://report.amap.com/detail.do?city=430100
Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., Vol. 8, No. 4, Article 161. Publication date: December 2024.
The conditional probability of $x$ undergoing a handover from $b_{i-1}$ to $b_i$ is denoted by $p(x|b_i), i \in (1, 2, \cdots, n)$. The overall probability of $x$ experiencing a handover across all BS coverage areas is
$$P(X) = p(x|b_1)p(b_1) + \cdots + p(x|b_n)p(b_n). \tag{6}$$
According to Bayes' theorem, the probability of $x$ switching to each BS is calculated as
$$p(b_i|x) = \frac{p(X_i)}{P(X)} = \frac{p(x|b_i)p(b_i)}{p(x|b_1)p(b_1) + \cdots + p(x|b_n)p(b_n)}, \quad i \in (1, 2, \cdots, n). \tag{7}$$
The decision function (DF) is then obtained as
$$DF = \max \{p(b_1|x), \cdots, p(b_n|x)\}. \tag{8}$$
The maximum value of DF corresponds to the BS that the GPS reference $x$ should choose for mapping. In this paper, the parameter is designed using maximum a posteriori estimation. It is assumed that $p(x|b_i)$ follows an independent exponential distribution $\lambda e^{-\lambda d_i}$. Considering the independence of the location of $x$ within the BS coverage area, the likelihood function of $\lambda$ is established as
$$L = \prod_{i=1}^{n} \lambda e^{-\lambda d_i}. \tag{9}$$
Maximizing Eq. (9) with respect to $\lambda$, i.e., setting $\partial \ln L / \partial \lambda = n/\lambda - \sum_{i=1}^{n} d_i = 0$, gives
$$\lambda = \frac{n}{\sum_{i=1}^{n} d_i} = \frac{1}{\bar{D}}, \tag{10}$$
where $\bar{D}$ represents the mean distance between $x$ and all candidate BSs $b_i$. Through this optimal BS selection process, each GPS reference is matched with the optimal BS.
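As a minimal sketch of Eqs. (5)-(10), the following Python function selects a BS for one GPS reference from the candidate distances $d_i$ and coverage radii $rad_i$. The function names and the handling of the case where no candidate covers $x$ (returning `None`) are assumptions for illustration.

```python
import math

def coverage_prior(d, rad):
    """Eq. (5): likelihood that x is covered by a candidate at distance d."""
    return ((rad - d) ** 2) / (rad ** 2) if d < rad else 0.0

def select_bs(distances, radii):
    """Return (index of the selected BS, posterior list) for one GPS reference.

    distances[i] is d_i and radii[i] is rad_i for candidate b_i.
    Assumption: distances are positive; if no candidate covers x,
    (None, zeros) is returned.
    """
    n = len(distances)
    total = sum(distances)
    if n == 0 or total <= 0:
        raise ValueError("need at least one candidate with positive distance")
    lam = n / total                                            # Eq. (10)
    likelihood = [lam * math.exp(-lam * d) for d in distances] # p(x | b_i)
    prior = [coverage_prior(d, r) for d, r in zip(distances, radii)]
    joint = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(joint)                                      # Eq. (6)
    if evidence == 0:
        return None, [0.0] * n
    posterior = [j / evidence for j in joint]                  # Eq. (7)
    best = max(range(n), key=posterior.__getitem__)            # Eq. (8)
    return best, posterior
```

With distances `[100, 300, 600]` and radii `[500, 500, 500]`, the third candidate receives zero posterior because $x$ lies outside its coverage radius, and the nearest candidate is selected.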
Secondly, we generate temporal information by mapping timestamps from GPS points to their corresponding BSs. When multiple GPS points are mapped to the same BS due to variations in sampling intervals and coverage areas, we assign the timestamp of the first GPS point as the representative timestamp for that BS. This process ultimately yields coarse-grained cellular association traces that approximate the original road segment-based trajectories and their corresponding departure times.
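The time mapping rule can be sketched in a few lines of Python. The sketch assumes that "multiple GPS points mapped to the same BS" refers to consecutive points, so a BS revisited later in the trip starts a new trace entry; the function name is hypothetical.

```python
def collapse_to_bs_trace(bs_ids, timestamps):
    """Collapse per-GPS-point BS assignments into a coarse-grained trace.

    Consecutive GPS points mapped to the same BS are merged, and the
    timestamp of the first point in each run is kept.
    """
    trace = []
    for bs, t in zip(bs_ids, timestamps):
        if not trace or trace[-1][0] != bs:
            trace.append((bs, t))
    return trace
```

For instance, the assignments `["a", "a", "b", "b", "b", "a"]` with timestamps `[0, 5, 9, 12, 20, 31]` collapse to three trace points, one per run of identical BSs.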
4.3 Fine-Grained Cellular Association Trace Generation via Fusion of Model-Based and Data-Driven Approaches
The model-based approach captures only explicit features, resulting in coarse-grained and less realistic cellular association traces. To address this limitation, we further design a two-stage AEGAN component that combines model-based and data-driven approaches for trajectory refinement. As shown in Fig. 6, this fusion component includes a novel AEGAN to generate plausible grid-based trajectories and a spatial selection and time mapping scheme to determine BS IDs and map final arrival times.
[Figure 6: architecture diagram. Stage One (Synthesizing Grid-Based Traces): the coarse-grained cellular association trace and the real cellular association trace are each converted into a grid-based representation with spatial and temporal rows; the generator (embedding layer, GRU-based encoder, attention, GRU-based decoder with FC layers) produces a fake grid-based trace, and the discriminator (embedding layer, GRU, FC, sigmoid) classifies traces as real or fake. Stage Two (Synthesizing Cellular Association Traces): spatial selection and time mapping convert the grid-based trace into fine-grained cellular association traces.]
Fig. 6. Architecture of two-stage AEGAN.
4.3.1 Stage One: Synthesizing Grid-Based Traces. As shown in Fig. 6, the first stage consists of three modules: the grid-based representation for converting the cellular association trace into a grid-based representation, the generator for approximating the distribution of real trace samples, and the discriminator for distinguishing between real and synthetic trace samples.
Grid-Based Representation. Our AEGAN model refines synthesized trajectories by taking coarse-grained cellular association traces as input. To create a simplified representation that is robust against location noise and enhances the generalization of trajectories, we project these traces into grid-based representations. First, we divide the targeted area into equally-sized grid cells with a given side length, $\epsilon$, in both latitude and longitude, resulting in a total of $l_r \times l_c$ cells. Next, given a cellular association trace $\mathcal{T} = \{(bs_1, t_1), (bs_2, t_2), \ldots, (bs_n, t_n)\}$, we discretize it by mapping each BS point to a grid cell based on its location, and then convert it into a grid-based representation $\mathcal{C}$. Here, $\mathcal{C} \in \mathbb{R}^{N_t \times 2}$ contains both spatial and temporal dimensions, where $N_t$ represents the maximum length of the traces. For each entry $(bs_k, t_k)$ of the raw cellular association trace $\mathcal{T}$, which falls into the cell $(i, j)$, $\mathcal{C}_{k1} = i l_c + j$, and $\mathcal{C}_{k2} = t_k - t_{k-1}$, where $t_k$ is the Unix timestamp and $\mathcal{C}_{k0} = 0$. Real cellular association traces undergo similar processing to achieve grid-based representation, ensuring consistency with the coarse-grained cellular association traces.
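The projection onto grid cells can be sketched as follows. This is an illustrative reading of the description above, not the paper's code: the function name, the area origin `(lat_min, lon_min)`, and the BS location lookup are hypothetical, the first time difference is set to 0 (matching the leading 0 in the temporal row of Fig. 6), and padding to the maximum length $N_t$ is omitted.

```python
def to_grid_representation(trace, bs_locations, lat_min, lon_min, eps, l_c):
    """Convert a trace [(bs_id, unix_ts), ...] into rows [cell_id, delta_t].

    bs_locations maps bs_id -> (lat, lon); eps is the grid side length in
    degrees; l_c is the number of grid columns. A BS in row i and column j
    is encoded as the cell index i * l_c + j.
    """
    rows = []
    prev_t = None
    for bs_id, t in trace:
        lat, lon = bs_locations[bs_id]
        i = int((lat - lat_min) // eps)   # grid row
        j = int((lon - lon_min) // eps)   # grid column
        delta_t = 0 if prev_t is None else t - prev_t
        rows.append([i * l_c + j, delta_t])
        prev_t = t
    return rows
```

Encoding each BS by its cell index rather than its ID makes the representation insensitive to which of several nearby BSs was actually associated, which is the robustness to location noise mentioned above.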
Generator ($\mathcal{G}$). As illustrated in Fig. 6, our model's generator comprises three key sub-modules, i.e., an encoder, an attention layer, and a decoder. The encoder consists of an embedding layer, a bidirectional gated recurrent unit (Bi-GRU) layer, and an output layer, while the decoder includes a GRU layer, a fully connected (FC) layer, and an argmax layer.
A grid-based representation $\mathcal{C} = \{\boldsymbol{c}_1, \boldsymbol{c}_2, \ldots, \boldsymbol{c}_{N_t}\}$ is first fed into the embedding layer to learn a dense vector representation $\{\boldsymbol{g}_1, \boldsymbol{g}_2, \ldots, \boldsymbol{g}_{N_t}\}$, instead of using a one-hot representation. Subsequently, the Bi-GRU layer extracts features from the entire trajectory and transforms the input vector sequence into a sequence of hidden states $\{\boldsymbol{h}_1, \boldsymbol{h}_2, \ldots, \boldsymbol{h}_{N_t}\}$, denoted as
$$\{\boldsymbol{h}_1, \boldsymbol{h}_2, \ldots, \boldsymbol{h}_{N_t}\} = f\left(\{\boldsymbol{g}_1, \boldsymbol{g}_2, \ldots, \boldsymbol{g}_{N_t}\}\right), \tag{11}$$
where $f()$ represents the function learned by the Bi-GRU layer, and $\boldsymbol{h}_1, \boldsymbol{h}_2, \ldots, \boldsymbol{h}_{N_t}$ are the encoder outputs.
To capture long-distance interdependencies and emphasize important features, we incorporate an attention layer into the decoder. Specifically, we first feed a Start Of Sequence (SOS) token to the GRU layer to initiate the decoding process. At the $i$-th step of decoding, the current decoder hidden state is $\boldsymbol{s}_{i-1}$, i.e., the output from the $(i-1)$-th GRU. We then apply an attention layer to search for the most relevant representation vectors by computing the similarity between $\boldsymbol{s}_{i-1}$ and all encoder hidden states $\boldsymbol{h}_1, \boldsymbol{h}_2, \ldots, \boldsymbol{h}_{N_t}$ to generate a context vector $\boldsymbol{cx}_i$ using the following formulas
$$\begin{aligned} corr_j &= F(\boldsymbol{s}_{i-1}, \boldsymbol{h}_j), \quad \forall j, \ 1 \le j \le N_t, \\ q_j &= \exp(corr_j) \Big/ \sum_{j=1}^{N_t} \exp(corr_j), \quad \forall j, \ 1 \le j \le N_t, \\ \boldsymbol{cx}_i &= \sum_{j=1}^{N_t} q_j \cdot \boldsymbol{h}_j, \end{aligned} \tag{12}$$
where $F()$ denotes the attention function, $corr_j$ represents the score measuring the correlation between the current hidden state $\boldsymbol{s}_{i-1}$ and an encoding hidden state $\boldsymbol{h}_j$, and $q_j$ measures the importance of $corr_j$. The context vector $\boldsymbol{cx}_i$ and the output $\boldsymbol{y}_{i-1}$ from the $(i-1)$-th step are concatenated and fed into the GRU with the current hidden state $\boldsymbol{s}_{i-1}$ to generate the new hidden state $\boldsymbol{s}_i$ and output $\boldsymbol{y}_i$,
$$\begin{aligned} \boldsymbol{s}_i &= \mathrm{GRU}(\boldsymbol{s}_{i-1}, \boldsymbol{y}_{i-1} \oplus \boldsymbol{cx}_i), \\ \boldsymbol{y}_i &= \mathrm{FC}(\boldsymbol{s}_i \oplus \boldsymbol{cx}_i), \end{aligned} \tag{13}$$
where the FC layer generates a $(|\mathcal{G}| + 1)$-dimensional vector $\boldsymbol{y}_i$, with $|\mathcal{G}| = l_r \times l_c$. We then feed $\boldsymbol{y}_i$ into the argmax layer to transform its dimension from $|\mathcal{G}| + 1$ to 2, with the first dimension representing the spatial feature and the second dimension representing the temporal feature, corresponding to the input $\boldsymbol{c}_i$. Finally, upon generating an End Of Sequence (EOS) token, the decoder completes the conversion process, producing a grid-based trajectory $Y \in \mathbb{R}^{N_t \times 2}$.
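The attention computation in Eq. (12) can be sketched in plain Python for a single decoding step. The paper does not specify the attention function $F()$; a dot product is assumed here purely for illustration, and the function name is hypothetical.

```python
import math

def attention_context(s_prev, encoder_states):
    """One decoding step of Eq. (12).

    s_prev is the decoder hidden state s_{i-1}; encoder_states is the list
    of encoder hidden states h_1..h_Nt (all vectors of equal length).
    Returns the attention weights q and the context vector cx_i.
    Assumption: F(s, h) is the dot product of s and h.
    """
    scores = [sum(a * b for a, b in zip(s_prev, h)) for h in encoder_states]
    top = max(scores)                                  # for numerical stability
    exps = [math.exp(c - top) for c in scores]
    norm = sum(exps)
    q = [e / norm for e in exps]                       # softmax over corr_j
    dim = len(encoder_states[0])
    cx = [sum(q[j] * encoder_states[j][k] for j in range(len(q)))
          for k in range(dim)]                         # weighted sum of h_j
    return q, cx
```

In a full decoder, `cx` would be concatenated with the previous output and passed to the GRU, as in Eq. (13).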
Discriminator ($\mathcal{D}$). The discriminator is designed to determine whether the input trajectory is a ground-truth trajectory or a generated trajectory. To stabilize and accelerate training, the design of the discriminator is relatively simple, consisting of an embedding layer, a GRU layer, an FC layer, and a sigmoid function. Specifically, the embedding layer transforms either the generator output $Y$ or the grid-represented ground-truth trajectory $\mathcal{C}_{real}$ into a dense vector. The GRU layer sequentially processes trajectory points and generates a sequence of outputs. The FC layer then processes the entire output of the input trajectory, and the result is passed through a sigmoid activation function $\sigma()$ to obtain the discriminator's output $\mathcal{D}_{out}$, which represents the probability of classifying the input as a true cellular association trace.
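Model Training. When training the AEGAN model, the traditional binary cross-entropy loss function [31] is insufficient to constrain the optimization problem addressed in our work. Therefore, we carefully design a new loss function, consisting of two parts, i.e., $Loss(\mathcal{G})$ and $Loss(\mathcal{D})$, to jointly optimize our proposed model. The objective function can be characterized as follows
$$\begin{aligned} \min_{\{\psi_{\mathcal{D}}\}} Loss(\mathcal{D}) &= \min_{\{\psi_{\mathcal{D}}\}} \mathcal{L}_{\mathcal{D}}, \\ \min_{\{\psi_{\mathcal{G}}\}} Loss(\mathcal{G}) &= \min_{\{\psi_{\mathcal{G}}\}} \mathcal{L}_{\mathcal{G}} + \mathcal{L}_{spatio} + \mathcal{L}_{temporal} + \mathcal{L}_{seq}, \end{aligned} \tag{14}$$
with
$$\begin{aligned} \mathcal{L}_{\mathcal{D}} &= \mathbb{E}_{\mathcal{C}_{real} \sim P_{data}}\left[\log\left(1 - \mathcal{D}(\mathcal{C}_{real})\right)\right] + \mathbb{E}_{\mathcal{C} \sim P_G}\left[\log\left(0 - \mathcal{D}(\mathcal{G}(\mathcal{C}))\right)\right], \\ \mathcal{L}_{\mathcal{G}} &= \mathbb{E}_{\mathcal{C} \sim P_G}\left[\log\left(1 - \mathcal{D}(\mathcal{G}(\mathcal{C}))\right)\right], \\ \mathcal{L}_{spatio} &= \mathbb{E}\left[\sum_{i=1}^{|\mathcal{G}|} \left(m_i - m'_i\right)^2\right], \\ \mathcal{L}_{temporal} &= \mathbb{E}\left[\frac{1}{N_t} \sum_{i=1}^{N_t} \sum_{j=|\mathcal{G}|+1}^{|\mathcal{G}|+1} \left(Tr_{ij} - Tr'_{ij}\right)^2\right], \\ \mathcal{L}_{seq} &= -\mathbb{E}\left[\sum_{i=1}^{N_t} \sum_{j=1}^{|\mathcal{G}|+1} Tr_{ij} \log\left(\frac{e^{Tr'_{ij}}}{\sum_{q=1}^{|\mathcal{G}|+1} e^{Tr'_{iq}}}\right)\right], \end{aligned} \tag{15}$$
where $\psi_{\mathcal{G}}$ and $\psi_{\mathcal{D}}$ are the parameters of the generator $\mathcal{G}$ and discriminator $\mathcal{D}$, respectively. $\mathcal{L}_{\mathcal{G}}$ and $\mathcal{L}_{\mathcal{D}}$ represent the adversarial loss for the generator and discriminator, respectively. $\mathbb{E}[]$ denotes the expected value of the distribution function, $P_{data}$ represents the distribution of the real samples, and $P_G$ represents the distribution of the generated samples. $\mathcal{L}_{spatio}$ represents the spatial similarity loss, $\mathcal{L}_{temporal}$ represents the temporal similarity loss, and $\mathcal{L}_{seq}$ represents the overall sequence similarity loss.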
Specifically, we first transform the output of generator $\mathcal{G}$ into an $N_t \times (|\mathcal{G}| + 1)$ matrix, where $N_t$ denotes the maximum length of the trajectories. The first $|\mathcal{G}|$ dimensions form a one-hot vector representing the spatial features of the trajectory, while the $(|\mathcal{G}| + 1)$-th dimension represents the temporal feature of the trajectory. $\mathcal{L}_{spatio}$ calculates the sum of squared errors between the grid spatial matrix $m$ of the true trajectory and $m'$ of the corresponding generated trajectory, where $m_i = 1$ if the $i$-th grid cell belongs to the trajectory and $m_i = 0$ otherwise. $\mathcal{L}_{temporal}$ calculates the mean squared error of the temporal features between the real trajectory $Tr$ and the corresponding generated trajectory $Tr'$. $\mathcal{L}_{seq}$ calculates the sequence similarity of the real trajectory $Tr$ and the corresponding generated trajectory $Tr'$. During model training, the adversarial generator and discriminator are jointly optimized using alternating stochastic gradient descent, with parameter updates computed by the Adam optimizer. Our well-designed loss function enables the generator to efficiently synthesize high-quality data while enhancing the discriminator's ability to differentiate between real and fake samples.
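The three similarity terms of Eq. (15) can be sketched for a single real/generated trajectory pair, leaving out the expectation over samples. The sketch assumes each trajectory is an $N_t \times (|\mathcal{G}|+1)$ list of rows whose first $|\mathcal{G}|$ entries are spatial scores (one-hot for real data) and whose last entry is the temporal feature. Deriving the visited-cell vector $m'$ from the generated rows by argmax is an assumption, as are all function names; padded rows are not treated specially.

```python
import math

def visited_cells(mat, n_cells):
    """Binary vector m: m[i] = 1 if grid cell i is used by any row."""
    m = [0] * n_cells
    for row in mat:
        spatial = row[:n_cells]
        m[max(range(n_cells), key=spatial.__getitem__)] = 1
    return m

def spatial_loss(tr, tr_gen, n_cells):
    """Sum of squared differences between visited-cell vectors."""
    m, m_gen = visited_cells(tr, n_cells), visited_cells(tr_gen, n_cells)
    return sum((a - b) ** 2 for a, b in zip(m, m_gen))

def temporal_loss(tr, tr_gen):
    """Mean squared error over the temporal (last) column."""
    return sum((r[-1] - g[-1]) ** 2 for r, g in zip(tr, tr_gen)) / len(tr)

def sequence_loss(tr, tr_gen):
    """Cross-entropy between real rows and the softmax of generated rows."""
    total = 0.0
    for r, g in zip(tr, tr_gen):
        top = max(g)
        log_norm = top + math.log(sum(math.exp(v - top) for v in g))
        total -= sum(r_ij * (g_ij - log_norm) for r_ij, g_ij in zip(r, g))
    return total
```

In a training loop these terms would be added to the adversarial generator loss, as in Eq. (14); here they are plain functions so that each term can be inspected on small examples.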
4.3.2 Stage Two: Synthesizing Cellular Association Traces. After the joint optimization of the adversarial generator and discriminator in the first stage, we obtain fine-grained grid-based cellular association traces based on the coarse-grained traces. In the second stage, our objective is to select the appropriate BS within each grid and map the arrival time for each selected BS to generate the fine-grained cellular association traces. We experimentally determine the most suitable grid size in the first stage, as partition size affects the number of BSs inside each grid and consequently influences the quality of the synthesized traces. The grid size is chosen such that each grid contains only one or two BSs, ensuring that the BSs within each grid can effectively characterize the grid. To ensure diversity in the synthesized cellular association traces, we adopt a spatial selection and time mapping scheme. A random BS is chosen within each grid, and its arrival time is calculated by adding the departure time to the cumulative travel time to the grid. This approach generates multiple fine-grained cellular association traces from each grid-based trace, each with similar features but different BS compositions, enhancing the overall diversity of the synthetic trajectories. Moreover, the second stage does not involve model training, allowing for rapid and efficient synthesis of fine-grained cellular traces.
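The spatial selection and time mapping scheme can be sketched as below. The mapping from grid cells to the BSs they contain (`bs_in_cell`), the function name, and the choice to skip cells that contain no BS are assumptions made for this illustration.

```python
import random

def grid_trace_to_bs_trace(grid_trace, bs_in_cell, departure_time, rng=None):
    """Turn a grid-based trace [(cell_id, delta_t), ...] into [(bs_id, time), ...].

    A BS is drawn uniformly at random from each cell, and its arrival time
    is the departure time plus the cumulative travel time up to that cell.
    Assumption: cells without any BS are skipped.
    """
    rng = rng or random.Random()
    elapsed = 0
    trace = []
    for cell, delta_t in grid_trace:
        elapsed += delta_t
        candidates = bs_in_cell.get(cell)
        if not candidates:
            continue
        trace.append((rng.choice(candidates), departure_time + elapsed))
    return trace
```

Calling the function repeatedly on the same grid-based trace yields traces with the same timing but possibly different BS compositions, which is the source of diversity described above. Passing a seeded `random.Random` makes a run reproducible.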
Table 1. Category description of employed baselines.
First Category                        | Second Category                    | Baseline
Spatial Trajectory Generation         | Data Augmentation Model            | RS
                                      | Data Synthesis Model (Model-Based) | Bayes
Temporal Trajectory Generation        | Non-Machine Learning-Based Model   | TEMP
                                      | Machine Learning-Based Model       | DOT
Spatio-Temporal Trajectory Generation | Data Synthesis Model (Data-Driven) | trajGANs
                                      |                                    | AEGAN
5 PERFORMANCE EVALUATION
5.1 Evaluation Methodology
Experiment Setup. In this work, we collect substantial road segment-based trajectories and corresponding cellular association traces for performance evaluation, encompassing a total of 4,000 pairs of routes and cellular traces covering a distance of 6,204 km. For the collected cellular traces, 80% of them are used for model training while the remaining 20% are reserved for testing. We implement SynthCAT using Python and PyTorch, and run it on a server equipped with 4 CPUs, each containing 192 Intel(R) Xeon(R) Platinum 8260 processors running at 2.40 GHz, along with a graphics processing unit card (NVIDIA Titan X) to accelerate the training.
Baselines. To comprehensively assess the performance of SynthCAT, we employ six reasonable baselines across three broad categories and five subcategories, as described in Table 1. Since no existing approach is specifically designed for synthesizing cellular association traces, we adapt these baselines based on their underlying principles to align with the CATS problem.
• Random Substitution (RS): A low-complexity yet effective method, RS involves replacing a few random BSs in a cellular association trace with nearby BSs.
• Bayesian Decision Model (Bayes) [32]: A model-based approach that relies on a predefined Bayesian decision model to synthesize traces.
• Temporally Weighted Neighbors (TEMP) [33]: A representative historical trajectory-based method that estimates travel time by averaging the travel time of historical trajectories with similar origins, destinations, and departure times.
• Diffusion-based Origin-destination Travel Time Estimation (DOT) [34]: A state-of-the-art framework combining a conditioned pixelated trajectory denoiser and a masked vision transformer model to accurately and explainably infer and estimate travel times from historical trajectories.
• trajGANs [35]: A data-driven approach widely adopted for generating trajectories using GANs.
• AEGAN [36]: A data-driven approach that combines autoencoder architecture, Bi-GRU, and attention mechanisms with GAN to synthesize traces.
Metrics. To comprehensively assess performance, we employ eight evaluation metrics across two categories: data fidelity and utility. The first six metrics evaluate data fidelity, with the first three for the spatial dimension and the last three for the temporal dimension. The remaining two metrics assess data utility.
• Jensen-Shannon Divergence (JSD) [37]: An aggregated-level metric that measures the discrepancy between the distributions of synthesized traces and real traces. It is calculated by $JS(p \| q) = \frac{1}{2} D_{KL}\left(p \,\|\, \frac{p+q}{2}\right) + \frac{1}{2} D_{KL}\left(q \,\|\, \frac{p+q}{2}\right)$, where $p$ and $q$ are two distributions, and $D_{KL}(u \| w) = E[\log(u_i) - \log(w_i)] = \sum_i u_i \log \frac{u_i}{w_i}$ is the relative entropy of $u$ with respect to $w$.
• Hausdorff Distance [38]: An individual-level metric that measures the spatial dissimilarity between synthetic cellular association traces and real ones by calculating the distance between two point sets.
[Figure 7: three surface plots over sample range (200 to 1000) and sample ratio (0.2 to 1). (a) RS vs. SynthCAT in Hausdorff distance; (b) RS vs. SynthCAT in Jaccard index; (c) RS vs. SynthCAT in JSD (scale $10^{-4}$).]
Fig. 7. Performance comparison with data augmentation model.
[Figure 8: CDF curves for Bayes, trajGANs, AEGAN, and SynthCAT. (a) CDF of Hausdorff Distance; (b) CDF of Jaccard Index; (c) CDF of JSD (scale $10^{-4}$).]
Fig. 8. Performance comparison with data synthesis model.
• Jaccard Index [39]: Another individual-level metric quantifying spatial similarity between real and synthetic traces. It is calculated by $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, where $A$ and $B$ are any two sets to be compared.
• Root Mean Squared Error (RMSE): Refers to the standard deviation of the differences between the raw data values and generated ones, calculated by $\sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(t_i - \hat{t}_i\right)^2}$, where $N$ is the total number of traces, and $t_i$ and $\hat{t}_i$ denote the raw and generated travel times.
• Mean Absolute Error (MAE): Refers to the average absolute error, calculated by $\frac{1}{N} \sum_{i=1}^{N} \left|t_i - \hat{t}_i\right|$.
• Mean Absolute Percentage Error (MAPE): Refers to the mean absolute percentage error, calculated by $\frac{1}{N} \sum_{i=1}^{N} \left|\frac{t_i - \hat{t}_i}{t_i}\right|$.
• Precision: Refers to the ratio of the total length of the correctly matched route to the total length of the matched route. Mathematically, $Precision = \frac{|Y_m \cap Y_g|}{|Y_m|}$, where $Y_m$ and $Y_g$ denote the map-matched route and ground truth route, respectively.
• Recall: Represents the ratio of the total length of the correctly matched route to the total length of the route in ground truth, i.e., $Recall = \frac{|Y_m \cap Y_g|}{|Y_g|}$.
5.2 Data Fidelity Evaluation
5.2.1 Comparison of Spatial Dimension. We evaluate data fidelity in the spatial dimension in two aspects as follows.
Comparison with Data Augmentation Model. We first conduct a comparative analysis of the data fidelity between our SynthCAT and the data augmentation model RS. Considering the sensitivity of RS to sampling rates and ranges in the random replacement strategy, we test it across 10 sampling rates (10% to 100%) and 10 sampling ranges (100 to 1000 meters). The results, shown in Fig. 7(a)-7(c), evaluate the performance of both algorithms based on indicators such as Hausdorff distance, Jaccard index, and JSD under varying sampling conditions. Smaller Hausdorff distance and JSD, along with a larger Jaccard index, indicate superior data fidelity. Key findings from Fig. 7 reveal that RS exhibits lower data fidelity with larger sampling rates and ranges. In comparison, SynthCAT consistently outperforms RS at both aggregate and individual levels, prevailing in 99% of cases. For instance, concerning the Hausdorff distance metric, SynthCAT surpasses RS unless the sampling range is under 200 meters and the sampling rate is below 0.2. The advantages become more prominent with an increase in sampling rate and sampling range. Compared to the data augmentation algorithm, our SynthCAT achieves excellent performance, nearly comparable to the best performance of RS.
[Figure 9: (a) Comparison with non-machine learning-based method: RMSE (s), MAE (s), and MAPE (s) of TEMP for search ranges of 100 m, 200 m, 300 m, 500 m, 1000 m, and 2000 m, with annotated values 6.70, 2.23, and 0.88; (b) Comparison with machine learning-based method: RMSE (min), MAE (min), and MAPE (min) for trajGANs, AEGAN, DOT, and SynthCAT.]
Fig. 9. Performance comparison of temporal dimension.
Comparison with Data Synthesis Model. We proceed to compare the data fidelity of our SynthCAT with other data synthesis models. Figure 8(a)-8(c) display the CDFs of Hausdorff distance, Jaccard index, and JSD achieved by different strategies. Notably, SynthCAT consistently outperforms other baselines across all metrics. For example, in terms of the Hausdorff distance metric, SynthCAT achieves scores below 1.0 in roughly 80% of cases, whereas other baselines all have scores exceeding 1.0. This substantial performance improvement further underscores the efficacy of our proposed SynthCAT. Additionally, despite the limited data scale causing the data-driven schemes to perform worse than the model-based method, our proposed fusion approach still outperforms both types of approaches, indicating the advantages of our SynthCAT, i.e., its ability to achieve superior results with less training data and enhanced performance.
5.2.2 Comparison of Temporal Dimension. We then evaluate data fidelity in the temporal dimension as follows.
Comparison with Non-Machine Learning-Based Model. We first compare the performance of data fidelity in the temporal dimension between our SynthCAT and the non-machine learning-based model TEMP. TEMP predicts travel time based solely on origin, destination, and departure time without generating spatial trajectories. Consequently, we feed it the spatial cellular association trace synthesized by our SynthCAT to generate temporal dimension information for each trace point. This allows us to perform a point-by-point comparison of the generated temporal information. Given TEMP's sensitivity to the search range of neighbors, we test it across 6 search ranges. The comparison results are presented in Fig. 9(a), where the red line denotes the scores of our SynthCAT. From Fig. 9(a), it is evident that SynthCAT consistently outperforms TEMP across all metrics by a significant margin. For example, in terms of MAPE, SynthCAT achieves a score of 0.88, whereas the best score achieved by TEMP is 1.41 when the search range is 100 meters, reflecting a 37.6% improvement. Additionally, TEMP's performance remains relatively unchanged when the search range varies from 100 meters to 2000 meters. This may be attributed to the high sparsity of the cellular association trace and the complex, dynamic nature of BS handovers, which hinder stable parameter performance. Therefore, it is evident that generating the temporal information of cellular association traces cannot be adequately achieved based merely on the average travel time of historical trajectories.
[Figure 10: five map panels, each overlaying the ground truth trace with the trace synthesized by one method. (a) Real vs. synthesized traces (SynthCAT); (b) Real vs. synthesized traces (RS); (c) Real vs. synthesized traces (Bayes); (d) Real vs. synthesized traces (TrajGANs); (e) Real vs. synthesized traces (AEGAN).]
Fig. 10. Visualization of cellular association trace synthesis.
Comparison with Machine Learning-Based Model. We then evaluate the data fidelity of our SynthCAT by comparing it with several machine learning-based models. Due to the inconsistencies in length and the inability of some comparison baselines to generate spatial trajectories, a point-by-point comparison of time information is not feasible. Therefore, we assess the performance of each algorithm by comparing the overall travel time of the entire trajectory. The results, as illustrated in Fig. 9(b), reveal two main findings. First, our SynthCAT consistently outperforms the other baselines, even surpassing the current state-of-the-art travel time estimation model, i.e., DOT. Second, trajGANs and AEGAN exhibit poor performance due to limited training data. In contrast, our SynthCAT demonstrates significantly better performance owing to the fusion of model-based and data-driven approaches, thereby proving the superiority of our proposed fusion framework.
5.2.3 Visualization. To illustrate the effectiveness of our proposed SynthCAT more clearly, we present visualizations of synthesized cellular association traces generated by various methods. As shown in Fig. 10, traces synthesized by SynthCAT and RS closely resemble the ground truth, showcasing a high degree of realism. In contrast, traces generated by other methods exhibit lower quality, featuring noticeable deviations from the ground truths. This observation underscores the superior data fidelity achieved by SynthCAT.
5.2.4 Comparison of Model Generalization. Data synthesis models provide a significant advantage in generating realistic cellular association traces for road-segment trajectories unseen during training. To evaluate their performance in new areas, we re-divided the data, using an urban area as the training set and a new urban area and a suburban area as the test sets. This allowed us to assess the models' generalization to both similar and dissimilar areas. The results, presented in Figures 11 and 12, reveal distinct performance patterns. Our SynthCAT model stands out in both spatial and temporal dimensions across similar and dissimilar areas compared to baseline models. Notably, Bayes, trajGANs, and AEGAN demonstrate consistent performance with minimal differences between similar and dissimilar areas, showing a slight performance boost in suburban environments due to their simpler context. In contrast, SynthCAT surpasses other baselines in similar areas, while its performance aligns with other algorithms in dissimilar areas. Two key conclusions can be drawn: SynthCAT effectively generalizes to unknown areas, particularly excelling in similar areas, and the contextual information of the unknown area significantly influences model performance. In the temporal dimension, the performance of each baseline mirrors that in the spatial dimension, while SynthCAT exhibits superior performance in dissimilar suburban areas. This improvement is attributed to lower congestion in suburban areas, enabling the model-based component to perform better with less training data. Overall, the generalization performance of SynthCAT surpasses that of other baseline models in both spatial and temporal dimensions.
[Figure 11: bar charts of Hausdorff distance, Jaccard index, and JSD for Bayes, trajGANs, AEGAN, and SynthCAT. (a) Spatial performance of generalization to new area (suburban); (b) Spatial performance of generalization to new area (urban).]
Fig. 11. Performance comparison of model generalization in spatial dimension.
[Figure 12: bar charts of RMSE (min), MAE (min), and MAPE (min) for trajGANs, AEGAN, DOT, and SynthCAT. (a) Temporal performance of generalization to new area (suburban); (b) Temporal performance of generalization to new area (urban).]
Fig. 12. Performance comparison of model generalization in temporal dimension.
[Figure 13: precision and recall of the map-matching task. (a) Performance of map matching vs. synthesized data scale (magnification times from 0 to 1000); (b) Map matching performances with 100× synthesized data (RS, Bayes, TrajGANs, AEGAN, SynthCAT).]
Fig. 13. Performance comparison of data utility in map matching task.
Table 2. Experimental results of ablation experiment.
Model                                                                   | Hausdorff | Jaccard | JSD     | RMSE (min) | MAE (min) | MAPE (min)
-data-driven method                                                     | 1.5616    | 0.1250  | 2.17e-4 | 2.0857     | 1.1988    | 0.7640
-model-based method                                                     | 1.7576    | 0.0112  | 4.62e-4 | 2.3690     | 2.0670    | 0.9935
-$\mathcal{L}_{spatio}$                                                 | 0.7960    | 0.7439  | 6.57e-5 | 1.1404     | 0.9800    | 1.0297
-$\mathcal{L}_{temporal}$                                               | 0.7830    | 0.7494  | 6.46e-5 | 1.8354     | 1.1733    | 0.7308
-$\mathcal{L}_{seq}$                                                    | 1.3593    | 0.2924  | 5.34e-4 | 1.1166     | 0.9602    | 0.9750
-$\mathcal{L}_{spatio}$ & $\mathcal{L}_{temporal}$ & $\mathcal{L}_{seq}$ | 1.5271    | 0.0199  | 2.01e-4 | 1.9748     | 1.1769    | 0.7532
SynthCAT                                                                | 0.7455    | 0.7687  | 6.06e-5 | 1.0743     | 0.9205    | 0.6841
5.3 Data Utility Evaluation
In addition to the data fidelity evaluation, we explore the practical value of synthesized cellular association
traces in a map-matching application, where the synthesized data supplements the scarce training
data. For map matching, we employ an Encoder-Decoder architecture with Bi-GRU and an attention mechanism.
First, we evaluate how the scale of synthesized data affects map-matching accuracy. Real-world data is
divided into 80% for training and 20% for inference. We then vary the amount of synthesized data added to the
training set from 0 to 1000 times the real data; the results are shown in Fig. 13(a).
The matching precision rises as the synthesized data scale grows, reaching a 28% improvement
when the synthesized data is 100 times the real data. However, the gain is bounded at larger data scales,
since the model structure also influences the outcome. These findings highlight the practicality of our synthesized data in
real-world applications. Second, we compare the data utility of different models by synthesizing data at 100 times
the real data scale for map-matching model training; the results are shown in Fig. 13(b). Testing the trained
models shows that SynthCAT outperforms the other methods. Notably, Bayes, trajGANs, and AEGAN perform
less effectively, while RS is only slightly inferior to SynthCAT. RS benefits from its parameters, which use sampling
rates and ranges that closely mimic real data. However, its high privacy risk makes it less suitable for real-world
deployment by network operators.
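The scale experiment described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the helper names `split_real` and `build_training_set`, the fixed random seed, and the choice to measure the scale relative to the real training split are all assumptions made for the example.

```python
import random


def split_real(real_traces, train_frac=0.8, seed=0):
    """Shuffle the real traces and split them into training and test parts."""
    rng = random.Random(seed)
    shuffled = list(real_traces)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


def build_training_set(real_train, synth_pool, scale, seed=0):
    """Return the real training traces plus `scale` times as many synthesized ones.

    Assumption: `scale` is relative to the size of the real training split.
    """
    n_synth = int(round(scale * len(real_train)))
    if n_synth > len(synth_pool):
        raise ValueError("not enough synthesized traces for this scale")
    rng = random.Random(seed)
    return list(real_train) + rng.sample(list(synth_pool), n_synth)


if __name__ == "__main__":
    # Placeholder trace identifiers; real traces would be sequences of BS records.
    real = [f"real_{i}" for i in range(10)]
    synth = [f"synth_{i}" for i in range(1000)]
    train, test = split_real(real)
    for scale in (0, 1, 10, 100):
        mixed = build_training_set(train, synth, scale)
        print(f"scale={scale}: {len(mixed)} training traces, {len(test)} test traces")
```

Only the training split is augmented; the held-out real traces stay untouched, so the reported precision always refers to real data.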
5.4 Ablation Experiment
Effectiveness of Devised Models. In SynthCAT, we synthesize cellular association traces by integrating model-based
and data-driven approaches. To explore their respective contributions, we remove the model-based and
data-driven components separately and examine the final performance. Table 2 shows the metric results after removing
the corresponding components. Without either the data-driven or the model-based method, the performance
of SynthCAT degrades significantly, demonstrating that a single-dimensional approach is not sufficient for the
CATS problem. In addition, the results of SynthCAT without the model-based method are worse
than those of SynthCAT without the data-driven method. This is due to the limited training data scale, which
underscores the importance and urgency of solving the CATS problem.
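For reference, metrics of the kind reported in Table 2 can be computed along the following lines. The sketch uses textbook definitions (symmetric Hausdorff distance over planar points, set-based Jaccard index, Jensen-Shannon divergence with the natural logarithm, and the usual RMSE, MAE, and MAPE). The exact variants, units, and preprocessing used in the paper may differ, and all function names here are hypothetical.

```python
import math


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sequences."""
    def directed(p, q):
        return max(min(math.dist(x, y) for y in q) for x in p)
    return max(directed(a, b), directed(b, a))


def jaccard(a, b):
    """Jaccard index of two collections (e.g., visited BS IDs), compared as sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def jsd(p, q):
    """Jensen-Shannon divergence of two discrete distributions of equal length."""
    def kl(u, v):
        return sum(ui * math.log(ui / vi) for ui, vi in zip(u, v) if ui > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def rmse(y, y_hat):
    """Root mean squared error, e.g., over arrival times in minutes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def mape(y, y_hat):
    """Mean absolute percentage error, returned as a fraction (true values must be non-zero)."""
    return sum(abs(a - b) / abs(a) for a, b in zip(y, y_hat)) / len(y)
```

Lower values are better for every metric except the Jaccard index, where a higher value indicates a larger overlap between the real and synthesized traces.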
Effectiveness of $\mathcal{L}_{spatio}$, $\mathcal{L}_{temporal}$, and $\mathcal{L}_{seq}$. We then validate the effectiveness of our devised loss function,
including the individual spatial similarity loss $\mathcal{L}_{spatio}$, the individual temporal similarity loss $\mathcal{L}_{temporal}$, the individual
sequence similarity loss $\mathcal{L}_{seq}$, and the combination of $\mathcal{L}_{spatio}$, $\mathcal{L}_{temporal}$, and $\mathcal{L}_{seq}$. From the results shown in
Table 2, we can make the following two major statements. First, the contributions of $\mathcal{L}_{spatio}$, $\mathcal{L}_{temporal}$, and $\mathcal{L}_{seq}$
are all integral to the performance improvement. Second, $\mathcal{L}_{seq}$ plays the dominant role in achieving the
superior performance of SynthCAT.
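The size of each loss term's contribution can be read off Table 2 as a relative change against the full SynthCAT. The following sketch performs this arithmetic for three of the reported metrics; the numbers are copied from Table 2, while the dictionary layout and function names are assumptions made for illustration.

```python
# Values copied from Table 2 (RMSE in minutes).
FULL = {"hausdorff": 0.7455, "jaccard": 0.7687, "rmse": 1.0743}
ABLATED = {
    "spatio": {"hausdorff": 0.7960, "jaccard": 0.7439, "rmse": 1.1404},
    "temporal": {"hausdorff": 0.7830, "jaccard": 0.7494, "rmse": 1.8354},
    "seq": {"hausdorff": 1.3593, "jaccard": 0.2924, "rmse": 1.1166},
}


def relative_change(ablated, full):
    """Relative change of a metric when one loss term is removed."""
    return (ablated - full) / full


def report():
    """Relative change of every metric for every removed loss term."""
    return {
        name: {m: relative_change(vals[m], FULL[m]) for m in FULL}
        for name, vals in ABLATED.items()
    }


if __name__ == "__main__":
    for name, changes in report().items():
        row = ", ".join(f"{m}: {c:+.1%}" for m, c in changes.items())
        print(f"without L_{name}: {row}")
```

With the Table 2 values, removing $\mathcal{L}_{seq}$ produces by far the largest change in the Hausdorff distance and the Jaccard index, whereas removing $\mathcal{L}_{temporal}$ produces the largest change in RMSE.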
161:20 •Lyu et al.
6 DISCUSSIONS
6.1 Lessons Learned
Integration of Model-Based and Data-Driven Approaches. The model-based design paradigm excels in
explicitly expressing data patterns through mathematical or statistical models with minimal training data
requirements, but it is constrained in accuracy and flexibility. The data-driven
design paradigm, on the other hand, adapts to complex and dynamic environments and mines hidden features
with good accuracy and flexibility, but it often requires substantial training data. Integrating both approaches
harnesses their respective strengths and overcomes their individual shortcomings, adapting well to limited training
data while maintaining high accuracy and flexibility. This integration is the distinctive advantage of our
scheme over others, allowing us to achieve strong results even with constrained self-collected data.
Trade-off between Data Fidelity and Utility. In general, a positive correlation exists between data fidelity
and utility, implying that higher data fidelity enhances downstream application utility. However, our experiments
reveal a more nuanced picture in our scenario. The inherent uncertainty of ST cellular association traces
challenges this conventional notion: an excessive pursuit of data fidelity does not necessarily lead to a significant
improvement in downstream application utility. Instead, it may restrict the volume of synthetic data and thereby
diminish the utility for downstream applications. Thus, achieving an appropriate trade-off between fidelity and utility
becomes imperative, considering the diverse characteristics of downstream applications.
6.2 Limitations and Future Works
Model Generalization. Ideally, the generalization of SynthCAT should be evaluated across
various aspects, including different models of mobile devices, transportation modes, road conditions, and
geographic areas. However, in Section 5, we focus solely on assessing the model’s generalization ability to new
and distinct areas, for two reasons. First, the data currently used in our study are collected
manually, a process that is both time-consuming and labor-intensive. A comprehensive evaluation of
the model’s generalization requires a large volume of data collected under diverse conditions and environments,
which is not feasible within a short time frame. At present, the collected data are insufficient for a thorough
evaluation. Second, we regard generalization to new areas as more critical than the other aspects. Variations
in equipment, transportation modes, and road conditions within the same area primarily affect the temporal
dimension generation of the model, with minimal impact on the spatial dimension. In contrast, generalization
to new areas influences data generation in both temporal and spatial dimensions. For these reasons,
our current evaluation focuses solely on the model’s generalization to new areas. In future work, we plan to
collaborate with network operators to fully evaluate our model using the extensive volume of privacy-free
cellular association traces they provide. This collaboration will also facilitate further optimization of our model
to enhance its generalization and real-world deployment performance.
Context Features. We recognize the importance of incorporating urban data, such as road networks, land use, and
socio-demographic information, to enhance the robustness of cellular association trace synthesis. The comparison
of model generalization in Section 5 indirectly highlights this point. However, directly introducing external
context information to improve model generalization can make the model overly complex. The diversity and
relevance of input data impose strict requirements on data integrity, limiting the practical application of the model.
Hence, a cellular association trace synthesis scheme should strike a balance between scene detail and model
complexity. In future work, we aim to explore how urban context information can enhance model generalization
without adding unnecessary complexity.
6.3 Ethical Considerations and Data Release
We recognize the significant privacy concerns associated with using cellular association traces for research. To
address these concerns, we have implemented stringent measures to protect the privacy of mobile users in both
self-collected data campaigns and potential future deployments with network operators. First, in our self-collected
data efforts, we have adopted several privacy-preserving approaches: i) All phone numbers of participating
volunteers have been encrypted to prevent any association with individual user identifiers. ii) Our custom
data collection app only collects data when actively open, and all data collection activities comply with strict
security and privacy policies. iii) During data processing and model design, individual or personal data is carefully
excluded. Second, when our proposed scheme is deployed by a network operator in the future, it will only
utilize encrypted trajectory data under a confidentiality agreement, ensuring user privacy and security:
i) User IDs will be hashed into global identifiers, making it impossible to trace back to individual users. ii) The
trajectory data used will consist of signaling data from BS switching rather than GPS data; such signaling data is
inherently sparse and contains significant location noise, thus preventing user tracking and protecting user privacy. iii) Data
processing will occur on the network operator’s servers, with access restricted to authorized researchers.
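As an illustration of hashing user IDs into pseudonymous identifiers, the following sketch uses a keyed hash (HMAC-SHA256) from the Python standard library. The text above does not specify the hashing scheme, so the keyed construction, the function name `pseudonymize`, and the example key and ID are assumptions made for this example only.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Map a user ID to a fixed-length pseudonymous identifier.

    Assumption: the secret key is held only by the data owner (e.g., the
    network operator), so the mapping cannot be recomputed by others.
    """
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


if __name__ == "__main__":
    key = b"example-key-kept-by-the-operator"  # hypothetical key
    print(pseudonymize("user-0001", key))  # hypothetical user ID
```

A keyed hash is chosen here because identifiers drawn from a small space, such as phone numbers, can be recovered from an unkeyed hash by enumerating all candidates.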
Regarding data sharing, we acknowledge the potential benefits of releasing informative cellular association
traces to advance research in the mobile computing community. As part of our final release, we will provide
the implementation code for our proposed approach, including the Android-based data collection app, the
self-collected real data, the synthesized data, and SynthCAT.
7 RELATED WORK
7.1 Services Empowered by Cellular Mobility Data
In the literature, numerous research works have focused on service design based on cellular mobility
data, e.g., urban planning and intelligent transportation [40–42], location-based services [14, 43, 44], smart cities
[45–48], and emergency response and disaster management [49–51]. For instance, based on large-scale cellular
data, Schläpfer et al. [40] presented a study of the universal patterns observed in human mobility.
Shen et al. [14] proposed a map matching algorithm for cellular mobility data. Fang et al. [44]
presented an approach called CellSense to enhance the recovery of human mobility patterns using cellular data.
In addition, the authors of [49, 50] explored the potential of cellular mobility data as a valuable source of
information for informing public health actions during the COVID-19 pandemic. However, although cellular
mobility data can empower extensive data-driven services from different aspects and advance scientific research
by providing modeling clues, the scarcity of real data remains the dominant obstacle to progress, motivating
our research in this paper.
7.2 Mobility Data Synthesis
In the realm of mobility data synthesis, existing studies can be broadly categorized into road trajectory synthesis
[15–17] and GPS trajectory synthesis [18–20]. The first category focuses on generating synthetic trajectories
representing object movements along specific road segments or links. For example, Jiang et al. [15] proposed a
two-stage GAN-based approach for generating continuous road trajectories. GPS trajectory
synthesis, on the other hand, aims to create synthetic GPS trajectories that imitate real-world object or individual
movement patterns. For instance, Sun et al. [18] investigated GPS trajectory synthesis while ensuring privacy preservation
and data utility. However, these existing works predominantly focus on synthesizing mobility data derived from
GPS or vehicle tracking systems, and often overlook the potential of synthesizing mobility data collected from
cellular networks. Cellular networks offer complementary advantages, such as wider coverage, higher user penetration,
fine-grained granularity, and cost-effectiveness, making them valuable sources of mobility data. Moreover, the
conventional techniques used in data synthesis tend to be either pure model-based methods [21–24] or data-driven
methods [15, 25–28]. Pure model-based methods usually have limited performance and applicability in complex
real-world scenarios, while data-driven methods rely heavily on the scale and diversity of training data, which is
hard to achieve in many practical scenarios. As a result, both types of methods have inherent drawbacks that
hinder their widespread use and effectiveness in mobility data synthesis. Overall, synthesizing cellular association
traces remains a significant research gap in the field of mobility data synthesis, and innovative algorithms are
required to fill it.
8 CONCLUSION
In this paper, we have proposed SynthCAT, an innovative scheme for synthesizing realistic and diverse cellular
association traces from road segment-based trajectories and corresponding departure times. By
integrating model-based and data-driven approaches, SynthCAT effectively maps road space to BS space, generates
initial time information, and synthesizes cellular association traces in a coarse-to-fine manner. In particular, by
leveraging the strengths of both techniques, our approach not only captures the
underlying association rules through explicit mathematical models but also extracts implicit patterns
using data-driven methods. Extensive experiments demonstrate that SynthCAT outperforms the
state-of-the-art baselines in terms of data fidelity of synthesized cellular association traces and data utility in
supporting practical downstream applications.
ACKNOWLEDGMENTS
This work was supported in part by the National Key Research and Development Program of China under
Grant 2022YFF0604504, in part by the National Natural Science Foundation of China under Grants 62422216,
62320106006, 62372472, 62341201, and 62172445, in part by the 111 Project under Grant B18059, in part by the
Hunan Provincial Natural Science Foundation of China under Grant 2024JJ4068, and in part by the Natural
Sciences and Engineering Research Council of Canada under Grant RGPIN-2023-03759.
REFERENCES
[1] Zhou Qin, Fang Cao, Yu Yang, Shuai Wang, Yunhuai Liu, Chang Tan, and Desheng Zhang. CellPred: A Behavior-aware Scheme for
Cellular Data Usage Prediction. Proc. ACM UbiComp, 4(1):1–24, 2020.
[2] Ericsson AB. Ericsson mobility report. November 2020.
[3] Xiaobo Zhou, Shuxin Ge, Tie Qiu, Keqiu Li, and Mohammed Atiquzzaman. Energy-Efficient Service Migration for Multi-User
Heterogeneous Dense Cellular Networks. IEEE Trans. Mobile Comput., 22(2):890–905, 2021.
[4] Huaming Yang, Zhongzhou Xia, Jersy Shin, Jingyu Hua, Yunlong Mao, and Sheng Zhong. A Comprehensive Study of Trajectory Forgery
and Detection in Location-Based Services. IEEE Trans. Mobile Comput., 2023.
[5] Sijing Duan, Dan Wang, Ju Ren, Feng Lyu, Ye Zhang, Huaqing Wu, and Xuemin Shen. Distributed Artificial Intelligence Empowered by
End-Edge-Cloud Computing: A Survey. IEEE Commun. Surv. Tutor., 2022.
[6] Yiwei Song, Dongzhe Jiang, Yunhuai Liu, Zhou Qin, Chang Tan, and Desheng Zhang. HERMAS: A Human Mobility Embedding
Framework with Large-scale Cellular Signaling Data. Proc. ACM UbiComp, 5(3):1–21, 2021.
[7] Yige Zhang, Weixiong Rao, Kun Zhang, Lei Chen, and Ding Chen. Outdoor Position Recovery from Heterogeneous Telco Cellular Data.
IEEE Trans. Knowl. Data. Eng., 2022.
[8] Junjun Si, Jin Yang, Yang Xiang, Hanqiu Wang, Li Li, Rongqing Zhang, Bo Tu, and Xiangqun Chen. TrajBERT: BERT-Based Trajectory
Recovery with Spatial-Temporal Refinement for Implicit Sparse Trajectories. IEEE Trans. Mobile Comput., 2023.
[9] Qianru Wang, Bin Guo, Lu Cheng, and Zhiwen Yu. sUrban: Stable Prediction for Unseen Urban Data from Location-based Sensors. Proc.
ACM UbiComp, 7(3):1–20, 2023.
[10] Çağkan Yapar, Ron Levie, Gitta Kutyniok, and Giuseppe Caire. Real-Time Outdoor Localization Using Radio Maps: A Deep Learning
Approach. IEEE Trans. Wireless Commun., 11(12):9703–9717, 2023.
[11] Leonardo L de Oliveira, Gabriel H Eisenkraemer, Everton A Carara, Joao B Martins, and Jose Monteiro. Mobile Localization
Techniques for Wireless Sensor Networks: Survey and Recommendations. ACM Trans. Sens. Netw., 19(2):1–39, 2023.
[12] Joachim Gudmundsson, Martin P Seybold, and Sampson Wong. Map matching queries on realistic input graphs under the Fréchet
distance. In Proc. ACM-SIAM SODA, pages 1464–1492, 2023.
[13] Huimin Ren, Sijie Ruan, Yanhua Li, Jie Bao, Chuishi Meng, Ruiyuan Li, and Yu Zheng. MTrajRec: Map-Constrained Trajectory Recovery
via Seq2Seq Multi-task Learning. In Proc. ACM SIGKDD, pages 1410–1419, 2021.
[14] Zhihao Shen, Wan Du, Xi Zhao, and Jianhua Zou. DMM: Fast Map Matching for Cellular Data. In Proc. ACM MobiCom, pages 1–14, 2020.
[15] Wenjun Jiang, Wayne Xin Zhao, Jingyuan Wang, and Jiawei Jiang. Continuous Trajectory Generation Based on Two-Stage GAN. In
Proc. AAAI, 2023.
[16] Qingyan Zhu, Yize Chen, Hao Wang, Zhenyu Zeng, and Hao Liu. A Knowledge-Enhanced Framework for Imitative Transportation
Trajectory Generation. In Proc. IEEE ICDM, pages 823–832, 2022.
[17] Xinyue Sun, Qingqing Ye, Haibo Hu, Yuandong Wang, Kai Huang, Tianyu Wo, and Jie Xu. Synthesizing Realistic Trajectory Data With
Differential Privacy. IEEE Trans. Intell. Transport. Syst., 2023.
[18] Xinyue Sun, Qingqing Ye, Haibo Hu, Jiawei Duan, Qiao Xue, Tianyu Wo, and Jie Xu. PUTS: Privacy-Preserving and Utility-Enhancing
Framework for Trajectory Synthesization. IEEE Trans. Knowl. Data. Eng., 2023.
[19] Huandong Wang, Changzheng Gao, Yuchen Wu, Depeng Jin, Lina Yao, and Yong Li. PateGail: A Privacy-Preserving Mobility Trajectory
Generator with Imitation Learning. In Proc. AAAI, volume 37, pages 14539–14547, 2023.
[20] Pingfu Chao, Wen Hua, Rui Mao, Jiajie Xu, and Xiaofang Zhou. A Survey and Quantitative Study on Map Inference Algorithms From
GPS Trajectories. IEEE Trans. Knowl. Data. Eng., 34(1):15–28, 2020.
[21] Shan Jiang, Yingxiang Yang, Siddharth Gupta, Daniele Veneziano, Shounak Athavale, and Marta C González. The TimeGeo modeling
framework for urban mobility without travel surveys. Proc. Natl. Acad. Sci., 113(37):E5370–E5378, 2016.
[22] Shaojie Qiao, Dayong Shen, Xiaoteng Wang, Nan Han, and William Zhu. A Self-Adaptive Parameter Selection Trajectory Prediction
Approach via Hidden Markov Models. IEEE Trans. Intell. Transport. Syst., 16(1):284–296, 2014.
[23] Fengli Xu, Zhen Tu, Yong Li, Pengyu Zhang, Xiaoming Fu, and Depeng Jin. Trajectory Recovery From Ash: User Privacy Is NOT
Preserved in Aggregated Mobility Data. In Proc. ACM WWW, pages 1241–1250, 2017.
[24] Yuntao Du, Yujia Hu, Zhikun Zhang, Ziquan Fang, Lu Chen, Baihua Zheng, and Yunjun Gao. LDPTrace: Locally Differentially Private
Trajectory Synthesis. In Proc. VLDB Endow, 2023.
[25] Daniel Glake, Fabian Panse, Ula Lenfers, Thomas Clemen, and Norbert Ritter. Spatio-temporal Trajectory Learning using Simulation
Systems. In Proc. ACM CIKM, pages 592–602, 2022.
[26] Yuan Yuan, Huandong Wang, Jingtao Ding, Depeng Jin, and Yong Li. Learning to Simulate Daily Activities via Modeling Dynamic
Human Needs. In Proc. ACM WWW, pages 906–916, 2023.
[27] Yuan Yuan, Jingtao Ding, Huandong Wang, Depeng Jin, and Yong Li. Activity Trajectory Generation via Modeling Spatiotemporal
Dynamics. In Proc. ACM SIGKDD, pages 4752–4762, 2022.
[28] Yongheng Deng, Feng Lyu, Ju Ren, Yi-Chao Chen, Peng Yang, Yuezhi Zhou, and Yaoxue Zhang. FAIR: Quality-Aware Federated Learning
with Precise User Incentive and Model Aggregation. In Proc. IEEE INFOCOM, pages 1–10, 2021.
[29] OpenStreetMap.
[30] LBS Database.
[31] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio.
Generative Adversarial Nets. In Proc. NIPS, volume 27, 2014.
[32] Lu Gan, Youngji Kim, Jessy W Grizzle, Jeffrey M Walls, Ayoung Kim, Ryan M Eustice, and Maani Ghaffari. Multitask Learning for
Scalable and Dense Multilayer Bayesian Map Inference. IEEE Trans. Robot., 39(1):699–717, 2022.
[33] Hongjian Wang, Xianfeng Tang, Yu-Hsuan Kuo, Daniel Kifer, and Zhenhui Li. A Simple Baseline for Travel Time Estimation using
Large-scale Trip Data. ACM Trans. Intell. Syst. Technol., 10(2):1–22, 2019.
[34] Yan Lin, Huaiyu Wan, Jilin Hu, Shengnan Guo, Bin Yang, Youfang Lin, and Christian S Jensen. Origin-Destination Travel Time Oracle
for Map-based Services. In Proc. SIGMOD, 2023.
[35] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social GAN: Socially Acceptable Trajectories with
Generative Adversarial Networks. In Proc. IEEE CVPR, pages 2255–2264, 2018.
[36] Meihui Shi, Derong Shen, Yue Kou, Tiezheng Nie, and Ge Yu. Multi-task Generative Adversarial Network for Missing Mobility Data
Imputation. In Proc. ACM CIKM, pages 4480–4484, 2022.
[37] Bent Fuglede and Flemming Topsoe. Jensen-Shannon Divergence and Hilbert space embedding. In Proc. IEEE ISIT, page 31, 2004.
[38] Dong Xie, Feifei Li, and Jeff M Phillips. Distributed Trajectory Similarity Search. In Proc. VLDB Endow, pages 1478–1489, 2017.
[39] Luciano da F Costa. Further Generalizations of the Jaccard Index. arXiv preprint arXiv:2110.09619, 2021.
[40] Markus Schläpfer, Lei Dong, Kevin O’Keeffe, Paolo Santi, Michael Szell, Hadrien Salat, Samuel Anklesaria, Mohammad Vazifeh, Carlo
Ratti, and Geoffrey B West. The universal visitation law of human mobility. Nature, 593(7860):522–527, 2021.
[41] Huan Zhou, Tong Wu, Haijun Zhang, and Jie Wu. Incentive-Driven Deep Reinforcement Learning for Content Caching and D2D
Offloading. IEEE J. Sel. Areas Commun., 39(8):2445–2460, 2021.
[42] Yi Zhao, Zimu Zhou, Xu Wang, Tongtong Liu, Yunhao Liu, and Zheng Yang. CellTradeMap: Delineating Trade Areas for Urban
Commercial Districts with Cellular Networks. In Proc. IEEE INFOCOM, pages 937–945, 2019.
[43] Fan Wu, Ju Ren, Feng Lyu, Peng Yang, Yongmin Zhang, Deyu Zhang, and Yaoxue Zhang. Boosting Internet Card Cellular Business via
User Portraits: A Case of Churn Prediction. In Proc. IEEE INFOCOM, pages 640–649, 2022.
[44] Zhihan Fang, Yu Yang, Guang Yang, Yikuan Xian, Fan Zhang, and Desheng Zhang. CellSense: Human Mobility Recovery via Cellular
Network Data Enhancement. Proc. ACM UbiComp, 5(3):1–22, 2021.
[45] Mojtaba Vaezi, Amin Azari, Saeed R Khosravirad, Mahyar Shirvanimoghaddam, M Mahdi Azari, Danai Chasaki, and Petar Popovski.
Cellular, Wide-Area, and Non-Terrestrial IoT: A Survey on 5G Advances and the Road toward 6G. IEEE Commun. Surv. Tutor.,
24(2):1117–1174, 2022.
[46] Zezu Liang, Yuan Liu, Tat-Ming Lok, and Kaibin Huang. Multi-Cell Mobile Edge Computing: Joint Service Migration and Resource
Allocation. IEEE Trans. Wireless Commun., 20(9):5898–5912, 2021.
[47] Amit Sheoran, Sonia Fahmy, Matthew Osinski, Chunyi Peng, Bruno Ribeiro, and Jia Wang. Experience: Towards Automated Customer
Issue Resolution in Cellular Networks. In Proc. ACM MobiCom, pages 1–13, 2020.
[48] Huali Lu, Feng Lyu, Ju Ren, Jiadi Yu, Fan Wu, Yaoxue Zhang, and Xuemin Sherman Shen. CODE: Compact IoT Data Collection with
Precise Matrix Sampling and Efficient Inference. In Proc. IEEE ICDCS, pages 743–753, 2022.
[49] Nuria Oliver, Bruno Lepri, Harald Sterly, Renaud Lambiotte, Sébastien Deletaille, Marco De Nadai, Emmanuel Letouzé, Albert Ali Salah,
Richard Benjamins, Ciro Cattuto, et al. Mobile phone data for informing public health actions across the COVID-19 pandemic life cycle.
Science advances, 6(23):eabc0764, 2020.
[50] Kyra H Grantz, Hannah R Meredith, Derek AT Cummings, et al. The use of mobile phone data to inform analysis of COVID-19 pandemic
epidemiology. Nature communications, 11(1):4961, 2020.
[51] Tra Huong Thi Le, Nguyen H Tran, Yan Kyaw Tun, Minh NH Nguyen, Shashi Raj Pandey, Zhu Han, and Choong Seon Hong. An
Incentive Mechanism for Federated Learning in Wireless Cellular Networks: An Auction Approach. IEEE Trans. Wireless Commun.,
20(8):4874–4887, 2021.