Guoren Wang’s research while affiliated with Beijing Institute of Technology and other places


Publications (595)


OpenGU: A Comprehensive Benchmark for Graph Unlearning
  • Preprint

January 2025 · 2 Reads

Bowen Fan · Yuming Ai · [...] · Guoren Wang

Graph Machine Learning is essential for understanding and analyzing relational data. However, privacy-sensitive applications demand the ability to efficiently remove sensitive information from trained graph neural networks (GNNs), avoiding the unnecessary time and space overhead caused by retraining models from scratch. To address this issue, Graph Unlearning (GU) has emerged as a critical solution, with the potential to support dynamic graph updates in data management systems and enable scalable unlearning in distributed data systems while ensuring privacy compliance. Unlike machine unlearning in computer vision or other fields, GU faces unique difficulties due to the non-Euclidean nature of graph data and the recursive message-passing mechanism of GNNs. Additionally, the diversity of downstream tasks and the complexity of unlearning requests further amplify these challenges. Despite the proliferation of diverse GU strategies, the absence of a benchmark providing fair comparisons for GU, and the limited flexibility in combining downstream tasks and unlearning requests, have yielded inconsistencies in evaluations, hindering the development of this domain. To fill this gap, we present OpenGU, the first GU benchmark, where 16 SOTA GU algorithms and 37 multi-domain datasets are integrated, enabling various downstream tasks with 13 GNN backbones when responding to flexible unlearning requests. Based on this unified benchmark framework, we are able to provide a comprehensive and fair evaluation for GU. Through extensive experimentation, we have drawn 8 crucial conclusions about existing GU methods, while also gaining valuable insights into their limitations, shedding light on potential avenues for future research.


Figure 1: An example graph
Figure 2: A running example of our proposed NPivoter
Figure 5: Effect of the cost estimator
Figure 7: The impact of graph density (ms)
Datasets. d_U and d_V are the average degrees for U and V, respectively.


Fast Biclique Counting on Bipartite Graphs: A Node Pivot-based Approach
  • Preprint
  • File available

December 2024 · 8 Reads

Counting the number of (p, q)-bicliques (complete bipartite subgraphs) in a bipartite graph is a fundamental problem which plays a crucial role in numerous bipartite graph analysis applications. However, existing algorithms for counting (p, q)-bicliques often face significant computational challenges, particularly on large real-world networks. In this paper, we propose a general biclique counting framework, called NPivoter, based on a novel concept of node-pivot. We show that previous methods can be viewed as specific implementations of this general framework. More importantly, we propose a novel implementation of NPivoter based on a carefully designed minimum non-neighbor candidate partition strategy. We prove that our new implementation of NPivoter has lower worst-case time complexity than the state-of-the-art methods. Beyond basic biclique counting, a nice feature of NPivoter is that it also supports local counting (computing bicliques per node) and range counting (simultaneously counting bicliques within a size range). Extensive experiments on 12 real-world large datasets demonstrate that our proposed NPivoter substantially outperforms state-of-the-art algorithms by up to two orders of magnitude.
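
To make the counting problem in this abstract concrete, the following is a minimal brute-force sketch of (p, q)-biclique counting. It is not the node-pivot method from the paper; it only illustrates the definition. The function name `count_bicliques` and the adjacency format (a dict from left-side nodes to sets of right-side neighbors) are assumptions made for this example.

```python
from itertools import combinations
from math import comb


def count_bicliques(adj, p, q):
    """Count (p, q)-bicliques by brute force (illustrative baseline only).

    adj maps each left-side node to the set of its right-side neighbors
    (a hypothetical input format chosen for this sketch). Requires p >= 1.
    """
    total = 0
    # Pick every group of p left nodes and find the right nodes they all share.
    for left_group in combinations(adj, p):
        common = set.intersection(*(adj[u] for u in left_group))
        # Any q of the shared right nodes complete a (p, q)-biclique.
        total += comb(len(common), q)
    return total


# Small example: K_{2,3} plus one extra left node attached to a single right node.
graph = {
    "u1": {"v1", "v2", "v3"},
    "u2": {"v1", "v2", "v3"},
    "u3": {"v1"},
}
print(count_bicliques(graph, 2, 2))
```

The running time grows with the number of p-subsets of the left side, which is the kind of cost that motivates faster counting frameworks on large graphs.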


Multivariate Time Series Cleaning under Speed Constraints

December 2024 · 1 Read

Proceedings of the ACM on Management of Data

Errors are common in time series due to unreliable sensor measurements. Existing methods focus on univariate data but do not utilize the correlation between dimensions. Cleaning each dimension separately may lead to a less accurate result, as some errors can only be identified in the multivariate case. We also point out that the widely used minimum change principle is not always the best choice. Instead, we try to change the smallest number of data points to avoid a significant change in the data distribution. In this paper, we propose MTCSC, a constraint-based method for cleaning multivariate time series. We formalize the repair problem, propose a linear-time method to employ online computing, and improve it by exploiting data trends. We also support adaptive speed constraint capturing. We analyze the properties of our proposals and compare them with SOTA methods in terms of effectiveness, efficiency versus error rates, data sizes, and applications such as classification. Experiments on real datasets show that MTCSC can have higher repair accuracy with less time consumption. Interestingly, it can be effective even when there are only weak or no correlations between the dimensions.
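
The claim that some errors only become visible in the multivariate case can be illustrated with a small sketch. It assumes the speed between consecutive points is the Euclidean distance divided by the time gap, which is an assumption of this example rather than a detail stated in the abstract. The function `speed_violations` and the parameter `s_max` are hypothetical names, and the code only detects violations; it does not perform the repair that MTCSC does.

```python
import math


def speed_violations(times, points, s_max):
    """Return indices i where the move from point i-1 to point i is too fast.

    Speed is taken as Euclidean distance over elapsed time (an assumption of
    this sketch). times must be strictly increasing.
    """
    violations = []
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        dist = math.dist(points[i - 1], points[i])
        if dist > s_max * dt:
            violations.append(i)
    return violations


times = [0, 1, 2]
points = [(0.0, 0.0), (0.8, 0.8), (1.0, 1.0)]

# Checked jointly, the first step covers about 1.13 units in one time unit.
print(speed_violations(times, points, s_max=1.0))

# Checked one dimension at a time, every step stays within the limit.
first_dim = [(p[0],) for p in points]
print(speed_violations(times, first_dim, s_max=1.0))
```

In this toy series, each dimension taken alone respects the limit, while the joint movement does not, which is the situation the abstract describes.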


The statistical information of Citation Network and Wikipage Datasets (.Imp indicates the underlying implications for nodes, edges, and classes in datasets; Node.C denotes the Node Classification Task; Edge.E denotes the Edge Existence Task)
The statistical information of E-Commerce Datasets, including Rating Network and Review Network
The statistical information of Social Network Datasets, including Online Community datasets and Co-authorship Network
The statistical information of other datasets for node-level and link-level tasks
The statistical information of datasets for Graph-level Tasks. Avg Nodes and Avg Edges denote the average number of nodes and edges in subgraphs, respectively
Graph Learning in the Era of LLMs: A Survey from the Perspective of Data, Models, and Tasks

December 2024 · 15 Reads

With the increasing prevalence of cross-domain Text-Attributed Graph (TAG) Data (e.g., citation networks, recommendation systems, social networks, and ai4science), the integration of Graph Neural Networks (GNNs) and Large Language Models (LLMs) into a unified Model architecture (e.g., LLM as enhancer, LLM as collaborators, LLM as predictor) has emerged as a promising technological paradigm. The core of this new graph learning paradigm lies in the synergistic combination of GNNs' ability to capture complex structural relationships and LLMs' proficiency in understanding informative contexts from the rich textual descriptions of graphs. Therefore, we can leverage graph description texts with rich semantic context to fundamentally enhance Data quality, thereby improving the representational capacity of model-centric approaches in line with data-centric machine learning principles. By leveraging the strengths of these distinct neural network architectures, this integrated approach addresses a wide range of TAG-based Task (e.g., graph learning, graph reasoning, and graph question answering), particularly in complex industrial scenarios (e.g., supervised, few-shot, and zero-shot settings). In other words, we can treat text as a medium to enable cross-domain generalization of graph learning Model, allowing a single graph model to effectively handle the diversity of downstream graph-based Task across different data domains. This work serves as a foundational reference for researchers and practitioners looking to advance graph learning methodologies in the rapidly evolving landscape of LLM. We consistently maintain the related open-source materials at https://github.com/xkLi-Allen/Awesome-GNN-in-LLMs-Papers.


Scaling Up Graph Propagation Computation on Large Graphs: A Local Chebyshev Approximation Approach

December 2024 · 3 Reads

Graph propagation (GP) computation plays a crucial role in graph data analysis, supporting various applications such as graph node similarity queries, graph node ranking, graph clustering, and graph neural networks. Existing methods, mainly relying on power iteration or push computation frameworks, often face challenges with slow convergence rates when applied to large-scale graphs. To address this issue, we propose a novel and powerful approach that accelerates power iteration and push methods using Chebyshev polynomials. Specifically, we first present a novel Chebyshev expansion formula for general GP functions, offering a new perspective on GP computation and achieving accelerated convergence. Building on these theoretical insights, we develop a novel Chebyshev power iteration method (\ltwocheb) and a novel Chebyshev push method (\chebpush). Our \ltwocheb method demonstrates an approximate acceleration of $O(\sqrt{N})$ compared to existing power iteration techniques for both personalized PageRank and heat kernel PageRank computations, which are well-studied GP problems. For \chebpush, we propose an innovative subset Chebyshev recurrence technique, enabling the design of a push-style local algorithm with provable error guarantee and reduced time complexity compared to existing push methods. We conduct extensive experiments using 5 large real-world datasets to evaluate our proposed algorithms, demonstrating their superior efficiency compared to state-of-the-art approaches.
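
For orientation, here is a minimal sketch of the plain power iteration baseline for personalized PageRank that the abstract says existing methods rely on. It is not the Chebyshev-accelerated method proposed in the paper. The function name, the adjacency-dict input, and the parameter values (`alpha`, `iters`) are assumptions of this example, and the code assumes every node has at least one out-neighbor.

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=50):
    """Plain power iteration for personalized PageRank (baseline sketch).

    adj maps each node to a list of out-neighbors; every node is assumed to
    have at least one neighbor. alpha is the restart probability.
    """
    scores = {u: 0.0 for u in adj}
    scores[source] = 1.0
    for _ in range(iters):
        nxt = {u: 0.0 for u in adj}
        nxt[source] += alpha  # restart mass returns to the source node
        for u, neighbors in adj.items():
            share = (1.0 - alpha) * scores[u] / len(neighbors)
            for v in neighbors:
                nxt[v] += share  # spread the remaining mass evenly
        scores = nxt
    return scores


triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
ppr = personalized_pagerank(triangle, source=0)
print(ppr)
```

Each pass touches every edge once, so the cost of this baseline is the number of iterations times the graph size, which is why the convergence rate matters on large graphs.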


ITPNet: Towards Instantaneous Trajectory Prediction for Autonomous Driving

December 2024 · 5 Reads

Trajectory prediction of agents is crucial for the safety of autonomous vehicles, yet previous approaches usually rely on sufficiently long observed trajectories to predict the future trajectories of the agents. However, in real-world scenarios, it is not realistic to collect adequate observed locations for moving agents, leading to the collapse of most prediction models. For instance, when a moving car suddenly appears very close to an autonomous vehicle because of an obstruction, the autonomous vehicle must quickly and accurately predict the future trajectories of the car from a limited number of observed trajectory locations. In light of this, we focus on investigating the task of instantaneous trajectory prediction, i.e., only two observed locations are available during inference. To this end, we propose a general and plug-and-play instantaneous trajectory prediction approach, called ITPNet. Specifically, we propose a backward forecasting mechanism to reversely predict the latent feature representations of unobserved historical trajectories of the agent based on its two observed locations and then leverage them as complementary information for future trajectory prediction. Meanwhile, due to the inevitable existence of noise and redundancy in the predicted latent feature representations, we further devise a Noise Redundancy Reduction Former, aiming to filter out noise and redundancy from unobserved trajectories and integrate the filtered features and observed features into a compact query for future trajectory predictions. In essence, ITPNet can be naturally compatible with existing trajectory prediction models, enabling them to gracefully handle the case of instantaneous trajectory prediction. Extensive experiments on the Argoverse and nuScenes datasets demonstrate that ITPNet outperforms the baselines and is effective with different trajectory prediction models.


CardOOD: Robust Query-driven Cardinality Estimation under Out-of-Distribution

December 2024 · 3 Reads

Query-driven learned estimators are accurate, flexible, and lightweight alternatives to traditional estimators in query optimization. However, existing query-driven approaches struggle with the Out-of-distribution (OOD) problem, where the test workload distribution differs from the training workload, leading to performance degradation. In this paper, we present CardOOD, a general learning framework designed to construct robust query-driven cardinality estimators that are resilient against the OOD problem. Our framework focuses on offline training algorithms that develop one-off models from a static workload, suitable for model initialization and periodic retraining. In CardOOD, we extend classical transfer/robust learning techniques to train query-driven cardinality estimators, and the algorithms fall into three categories: representation learning, data manipulation, and new learning strategies. As these learning techniques are originally evaluated in computer vision tasks, we also propose a new learning algorithm that exploits the property of cardinality estimation. This algorithm, lying in the category of new learning strategy, models the partial order constraint of cardinalities by a self-supervised learning task. Comprehensive experimental studies demonstrate the efficacy of the algorithms of CardOOD in mitigating the OOD problem to varying extents. We further integrate CardOOD into PostgreSQL, showcasing its practical utility in query optimization.


Figure 4. An example to illustrate the MDGAN.
Figure 5. t-SNE visualization of features of different domains produced by DREAM, MMD, MisStyle and SelfReg on PACS modelset.
Table V. Model attribute classification accuracy (%) on S of PACS modelset. Red and blue indicate the best and second-best performance, respectively. DREAM* is the result of the domain shift scenario, trained with only five classes (except dog and elephant), while the black-box model is trained on all seven classes in the PACS dataset.
DREAM: Domain-agnostic Reverse Engineering Attributes of Black-box Model

December 2024 · 4 Reads

Deep learning models are usually black boxes when deployed on machine learning platforms. Prior works have shown that the attributes (e.g., the number of convolutional layers) of a target black-box model can be exposed through a sequence of queries. There is a crucial limitation: these works assume the training dataset of the target model is known beforehand and leverage this dataset for the model attribute attack. However, it is difficult to access the training dataset of the target black-box model in reality. Therefore, whether the attributes of a target black-box model could still be revealed in this case is doubtful. In this paper, we investigate a new problem of black-box reverse engineering, without requiring the availability of the target model's training dataset. We put forward a general and principled framework, DREAM, by casting this problem as out-of-distribution (OOD) generalization. In this way, we can learn a domain-agnostic meta-model to infer the attributes of the target black-box model with unknown training data. This makes our method one of the kinds that can gracefully apply to an arbitrary domain for model attribute reverse engineering with strong generalization ability. Extensive experimental results demonstrate the superiority of our proposed method over the baselines.


DREAM: Domain-agnostic Reverse Engineering Attributes of Black-box Model

December 2024 · 5 Reads

IEEE Transactions on Knowledge and Data Engineering

Deep learning models are usually black boxes when deployed on machine learning platforms. Prior works have shown that the attributes (e.g., the number of convolutional layers) of a target black-box model can be exposed through a sequence of queries. There is a crucial limitation: these works assume the training dataset of the target model is known beforehand and leverage this dataset for the model attribute attack. However, it is difficult to access the training dataset of the target black-box model in reality. Therefore, whether the attributes of a target black-box model could still be revealed in this case is doubtful. In this paper, we investigate a new problem of black-box reverse engineering, without requiring the availability of the target model's training dataset. We put forward a general and principled framework, DREAM, by casting this problem as out-of-distribution (OOD) generalization. In this way, we can learn a domain-agnostic metamodel to infer the attributes of the target black-box model with unknown training data. This makes our method one of the kinds that can gracefully apply to an arbitrary domain for model attribute reverse engineering with strong generalization ability. Extensive experimental results demonstrate the superiority of our proposed method over the baselines.



Citations (25)


... Driving environments often contain significant irrelevant or noisy information that does not contribute to planning. To address this, we apply the information bottleneck principle [18] to distill only the information relevant to decision-making. This approach ensures that the model prioritizes critical features for navigation, effectively minimizing the influence of extraneous data. ...

Reference:

FASIONAD: FAst and Slow FusION Thinking Systems for Human-Like Autonomous Driving with Adaptive Feedback
On the Road to Portability: Compressing End-to-End Motion Planner for Autonomous Driving
  • Citing Conference Paper
  • June 2024

... Cohesive subgraphs can be defined using different cohesion measures. Among these variants, ℓ-truss [9, 14-17, 19] is one of the most commonly used ones. An ℓ-truss of a graph G is a maximal connected subgraph of G in which every edge is contained in at least ℓ triangles. ...

I/O Efficient Max-Truss Computation in Large Static and Dynamic Graphs
  • Citing Conference Paper
  • May 2024
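
The ℓ-truss definition quoted in the snippet above can be sketched with a simple edge-peeling loop. This is an in-memory illustration under the snippet's definition (every remaining edge lies in at least ℓ triangles), not the I/O-efficient algorithm of the cited paper. The function name `truss_edges` is hypothetical, nodes are assumed to be comparable values such as integers, and the result is not split into connected components.

```python
def truss_edges(edges, ell):
    """Return the edges that survive peeling for the given ell.

    Repeatedly removes any edge contained in fewer than ell triangles of the
    remaining graph. Nodes are assumed to be comparable (e.g. integers).
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    changed = True
    while changed:
        changed = False
        for u in list(adj):
            for v in list(adj[u]):
                # The triangles through (u, v) are the common neighbors of u and v.
                if u < v and len(adj[u] & adj[v]) < ell:
                    adj[u].discard(v)
                    adj[v].discard(u)
                    changed = True
    return {(u, v) for u in adj for v in adj[u] if u < v}


# A 4-clique on nodes 0..3 with one pendant edge (3, 4).
k4_plus_tail = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
print(sorted(truss_edges(k4_plus_tail, 2)))
```

Some papers index trusses so that a k-truss requires k - 2 triangles per edge; the sketch follows the ℓ convention of the quoted snippet instead.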

... In the context of database systems, graphs provide a powerful means to represent and analyze relational data, enabling insights into structured information. To effectively capture the rich and interconnected information inherent in graph data, GNNs [24,48,63] have emerged as transformative tools, achieving remarkable success in optimizing database query performance [4,6,22,59], enhancing data management [2,3], and supporting other fields such as social networks [42,72], recommendation systems [9,31,62], biological networks [27,55,94] and data mining [35,57]. ...

Breaking the Entanglement of Homophily and Heterophily in Semi-supervised Node Classification
  • Citing Conference Paper
  • May 2024

... FedGTA [41] encodes topology into smoothing confidence and graph moments to improve model aggregation. Moreover, other studies [42]-[45] also achieve satisfactory results on this challenge. To address missing edges, FedSage+ [26] integrates node representations, topology, and labels across subgraphs while training a neighbor generator to restore missing links, achieving subgraph-FL with robust and complete graph information. ...

AdaFGL: A New Paradigm for Federated Node Classification with Topology Heterogeneity
  • Citing Conference Paper
  • May 2024

... • (L2) Data-Unaware Model Structure: The existing mainstream federated fine-tuning approaches supporting heterogeneous models primarily rely on manually predefined model architectures composed of shared and personalized sub-modules [20]-[22], [24], [34], [35]. However, the design of heterogeneous sub-modules usually only takes into account local resource constraints [35] or the characteristics of local data [20]-[22], [24], [34], failing to comprehensively consider the cross-client data characteristics, which may result in suboptimal performance of federated fine-tuning given the nature of FL that the data among clients are usually statistically heterogeneous [36], [37]. ...

Feed: Towards Personalization-Effective Federated Learning
  • Citing Conference Paper
  • May 2024

... This lack of training data can lead to model overfitting, where the system performs well on known data but struggles when faced with new or diverse conditions. [602]-[606] • Impact: Without sufficient training data, ML models may lack the accuracy needed to reliably predict gene expression outcomes or metabolic responses in novel environments, reducing the system's overall effectiveness in practical applications. • Potential Solutions: To mitigate this challenge, researchers can generate synthetic datasets using in silico simulations of microbial behaviors, which can help train models even in the absence of real-world data. ...

Mitigating Data Scarcity in Supervised Machine Learning Through Reinforcement Learning Guided Data Generation
  • Citing Conference Paper
  • May 2024

... In various real-world applications, bipartite networks are ideally suited for representing the interaction between two different types of entities [1][2][3]. Many graph data mining tasks, such as relevant grammar learning, data compression, role mining based on role access control, process operation scheduling can be modelled as a problem on mining maximal bicliques [4,5]. However, there exists a more interesting subgraph called maximum butterfly generator which can derive more butterflies (or bicliques). ...

Maximal Biclique Enumeration: A Prefix Tree Based Approach
  • Citing Conference Paper
  • May 2024

... Following the completion of tasks, the sensing platform facilitates payments to the users. When juxtaposed with conventional sensing methodologies, MCS offers distinct advantages including heightened deployment flexibility, cost-efficiency, extensive coverage capabilities, and scalability [16], [17]. ...

Indoor Periodic Fingerprint Collections by Vehicular Crowdsensing via Primal-Dual Multi-Agent Deep Reinforcement Learning
  • Citing Article
  • October 2024

IEEE Journal on Selected Areas in Communications

... HGNNs generate node representations through recursive aggregation of messages from diverse types of neighboring nodes across various edge types. By adhering to this heterogeneity-aware message-passing paradigm, HGNNs adeptly model the complex relationships, diverse entity types, and heterogeneous semantics present in heterogeneous graphs, achieving state-of-the-art performance in various downstream tasks, such as node classification [13], node clustering [14], community search [15], and link prediction [16]. [17].) ...

Self-Training GNN-based Community Search in Large Attributed Heterogeneous Information Networks
  • Citing Conference Paper
  • May 2024

... This results in the design of more robust models capable of capturing the asymmetric relationships between node pairs. Additionally, the inclusion of directed information allows for the development of advanced graph encoding mechanisms, such as dual encoding [48]-[50] and magnetic Laplacian encoding [51]-[53], which significantly enhance the model's ability to represent graph structures and improve its performance. This advantage positions directed GNNs as a practical alternative within the broader landscape of graph learning techniques, enabling their use across diverse domains. ...

LightDiC: A Simple Yet Effective Approach for Large-Scale Digraph Representation Learning
  • Citing Article
  • May 2024

Proceedings of the VLDB Endowment