Wenzhong Tang’s research while affiliated with Beihang University and other places


Publications (39)


Figure 1. Analysis and comparison of the weights of the UNet and S2AgScUNet models. According to the HSTR theory [22], the alpha value indicates the quality of training for each layer, with values between 2 and 6 representing well-trained layers. A value greater than 6 suggests overfitting, while a value below 2 implies underfitting. Two components of the UNet model have alpha values exceeding 6, indicating a high risk of overfitting.
Figure 2. Comparison of inference time and performance. The horizontal axis represents the inference time (seconds), while the vertical axis shows performance metrics (IOU and Dice). LightUNet significantly outperforms other models in inference time while achieving performance comparable to or better than other models.
Figure 4. Structure of the Gated Attention module. X_S denotes the input from the skip connection, representing features from the corresponding encoder layer, while X_D represents the features from the decoder layer. The Conv1 × 1 blocks are convolution layers with a filter size of 1 × 1, used to refine the features. ReLU is the Rectified Linear Unit activation function, introducing non-linearity into the network, and Sigmoid is the activation function that generates the weight map for gating.
Figure 5. Structure of Decoder Blocks in S2AgScUNet. Each Decoder Block consists of two basic blocks, which include convolution, BatchNorm, and ReLU layers, aiming to reconstruct spatial details for segmentation.
Figure 6. Knowledge distillation workflow based on S2ScAgUNet and LightUNet. The training process consists of two parts: L_KD, representing the knowledge distillation loss, and L_CE, representing the cross-entropy loss. The S2ScAgUNet serves as the teacher model, distilling knowledge into LightUNet to reduce parameters while maintaining high performance.
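The gating described in the Figure 4 caption can be illustrated at a single pixel. The following is a minimal sketch, assuming an additive attention gate in the style of Attention U-Net; the function names, weight shapes, and the exact order of operations are illustrative assumptions, not the authors' implementation. A 1 × 1 convolution at one pixel is simply a linear map over channels, so plain lists are enough here.

```python
import math


def conv1x1(x, weights):
    """Apply a 1x1 convolution at one pixel: a linear map over channels.

    x is a list of C_in channel values; weights is a C_out x C_in matrix.
    """
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def gated_attention_pixel(x_s, x_d, w_s, w_d, w_psi):
    """Gate the skip-connection features x_s using decoder features x_d.

    Assumed (hypothetical) structure: project both inputs with 1x1
    convolutions, add them, apply ReLU, reduce to one channel with another
    1x1 convolution, and squash with Sigmoid to get a gate in (0, 1).
    """
    projected_s = conv1x1(x_s, w_s)
    projected_d = conv1x1(x_d, w_d)
    hidden = relu([a + b for a, b in zip(projected_s, projected_d)])
    gate = sigmoid(conv1x1(hidden, [w_psi])[0])
    gated = [gate * v for v in x_s]
    return gated, gate
```

In a full network the same computation runs at every spatial location, so the gate values together form the weight map mentioned in the caption.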


Adapting SAM2 Model from Natural Images for Tooth Segmentation in Dental Panoramic X-Ray Images
  • Article
  • Full-text available

December 2024 · 21 Reads

Entropy

Zifeng Li · Wenzhong Tang · Shijun Gao · [...]

Dental panoramic X-ray imaging, due to its high cost-effectiveness and low radiation dose, has become a widely used diagnostic tool in dentistry. Accurate tooth segmentation is crucial for lesion analysis and treatment planning, helping dentists to quickly and precisely assess the condition of teeth. However, dental X-ray images often suffer from noise, low contrast, and overlapping anatomical structures, coupled with limited available datasets, leading traditional deep learning models to experience overfitting, which affects generalization ability. In addition, high-precision deep models typically require significant computational resources for inference, making deployment in real-world applications challenging. To address these challenges, this paper proposes a tooth segmentation method based on the pre-trained SAM2 model. We employ adapter modules to fine-tune the SAM2 model and introduce ScConv modules and gated attention mechanisms to enhance the model’s semantic understanding and multi-scale feature extraction capabilities for medical images. In terms of efficiency, we utilize knowledge distillation, using the fine-tuned SAM2 model as the teacher model for distilling knowledge to a smaller model named LightUNet. Experimental results on the UFBA-UESC dataset show that, in terms of performance, our model significantly outperforms the traditional UNet model in multiple metrics such as IoU, effectively improving segmentation accuracy and model robustness, particularly with limited sample datasets. In terms of efficiency, LightUNet achieves comparable performance to UNet, but with only 1.6% of its parameters and 24.0% of the inference time, demonstrating its feasibility for deployment on edge devices.
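The teacher-student training mentioned in the abstract combines a distillation term with a cross-entropy term. The sketch below assumes the common temperature-scaled formulation (KL divergence between softened teacher and student distributions, weighted against the hard-label cross-entropy); the weighting `alpha`, the `temperature` value, and all function names are illustrative assumptions and may differ from what the paper uses. For segmentation, the same computation would be applied per pixel.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Hard-label cross-entropy (the L_CE term)."""
    return -math.log(softmax(student_logits)[label])


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Distillation term (the L_KD term): KL(teacher || student) on softened outputs.

    The T^2 factor is the usual rescaling that keeps gradient magnitudes
    comparable across temperatures (an assumption, not taken from the paper).
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return temperature ** 2 * kl


def total_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Weighted sum of the distillation and cross-entropy terms."""
    return (alpha * kd_loss(student_logits, teacher_logits, temperature)
            + (1 - alpha) * cross_entropy(student_logits, label))
```

With `alpha=0` the student is trained on labels alone; raising `alpha` shifts weight toward matching the teacher's softened predictions.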



(a) The existing method only evaluates noisy samples from distribution constraints based on similarity scores across modalities. (b) Our framework, in contrast, utilizes the k nearest neighbors to detect noisy correspondences between modalities.
The framework of our method. The pipeline consists of three modules. (a) Feature extractor: the base network is the CLIP model, which utilizes 12 transformer layers. (b) Neighbor-based noisy detector: based on the neighbor relations of the two modalities, it progressively evaluates the confidence of each pair correspondence in every epoch. (c) Multi-view triplet loss function: we adopt multi-level memory banks to store identity-level text features, which are updated with a momentum strategy to handle the problem of data imbalance. Then, we utilize both sample-level and identity-level features to form a multi-view triplet loss, capturing the discriminative features of pedestrians.
Performance comparisons between RDE and CNCM on RSTPReid with varying noise ratios
Cross-modality neighbor constraints based unbalanced multi-view text–image re-identification

Multimedia Systems

Text-to-image Person Re-Identification (TIReID) is an identity retrieval task between visual and textual modalities. Previous research focuses on learning rich and diverse modality-shared semantic features and achieves excellent performance. However, existing methods still have several notable limitations: (1) Noisy influence: Due to the difficulty of cross-modality annotation and the uncertain quality of crowdsourced labels, incorrect text–image pairs inevitably introduce noisy labels. (2) Sample imbalance: Datasets collected from real-world sources often have an unbalanced distribution of samples across categories, which results in inconsistent parameter update progress during the training phase. To address these issues, we propose a two-stage training pipeline for TIReID learning with noisy correspondence. Firstly, we employ a Noisy Correspondence Detector based on heterogeneous relation retrieval, estimating confidence weights for each sample pair. Secondly, we design a multi-view triplet loss function, which leverages sample-level features to interact with global class centers, addressing sample imbalance and facilitating a smoother distribution in feature space. Finally, we utilize these clean samples to train the model through a progressive learning process. Extensive experiments on RSTPReid, CUHK-PEDES, and ICFG-PEDES demonstrate the effectiveness of our method against the state-of-the-art TIReID methods.
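One way to picture neighbor-based detection of noisy text–image pairs is to compare each sample's nearest neighbors in the two modalities. The sketch below is an illustration under stated assumptions, not the paper's detector: it scores pair `i` by the fraction of its `k` nearest neighbors (by cosine similarity) that are shared between the image feature space and the text feature space. The function names and the overlap-based score are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def knn_indices(features, i, k):
    """Indices of the k samples most similar to sample i (excluding i)."""
    scored = [(cosine(features[i], f), j)
              for j, f in enumerate(features) if j != i]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return {j for _, j in scored[:k]}


def neighbor_confidence(image_feats, text_feats, k):
    """Confidence in [0, 1] per pair: neighbor overlap across modalities.

    A correctly matched pair tends to have similar neighborhoods in both
    spaces; a mismatched caption pulls the text neighborhood elsewhere.
    """
    return [
        len(knn_indices(image_feats, i, k) & knn_indices(text_feats, i, k)) / k
        for i in range(len(image_feats))
    ]
```

Pairs with low scores could then be down-weighted or excluded during training, which is the role the abstract assigns to the estimated confidence weights.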



EMFSA: Emoji-based multifeature fusion sentiment analysis

September 2024 · 33 Reads

Short texts on social platforms often suffer from insufficient emotional semantic expressions, sparse features, and polysemy. To enhance the accuracy achieved by sentiment analysis for short texts, this paper proposes an emoji-based multifeature fusion sentiment analysis model (EMFSA). The model mines the sentiments of emojis, topics, and text features. Initially, a pretraining method for feature extraction is employed to enhance the semantic expressions of emotions in text by extracting contextual semantic information from emojis. Following this, a sentiment- and emoji-masked language model is designed to prioritize the masking of emojis and words with implicit sentiments, focusing on learning the emotional semantics contained in text. Additionally, we propose a multifeature fusion method based on a cross-attention mechanism that determines the importance of each word in a text from a topic perspective. Next, this method is integrated with the original semantic information of emojis and the enhanced text features, attaining improved sentiment representation accuracy for short texts. Comparative experiments conducted with the state-of-the-art baseline methods on three public datasets demonstrate that the proposed model achieves accuracy improvements of 2.3%, 10.9%, and 2.7%, respectively, validating its effectiveness.



Figure 1. Illustration of frequency priors in deepfake detection. (a): Source image. (b): Data frequency domain analysis. (c): Relative log amplitudes of Fourier transformed feature maps of ResNet50. (d): Relative log amplitudes of Fourier transformed feature maps of MkfaNet. (b) reveals the uniformity of the frequency distribution in real faces and the concentration of high-frequency anomalies in forged faces. (c) shows that ResNet50 has a relatively low logarithmic amplitude in the high-frequency region, indicating its insufficiency in capturing high-frequency details. (d) demonstrates that MkfaNet has a higher amplitude in the high-frequency region with broader coverage, highlighting its advantages in handling high-frequency details and identifying forgery features.
Figure 4. Visualization of latent embedding of detectors with t-SNE [60] on FF++ (c23) according to DeepfakeBench [68].
Figure 5. Grad-CAM activation maps [53] of fake and real images in the validation set of FFDI-2024 as cross-domain evaluation. The naive detector with different backbones is compared with ours. For fake images, classical CNNs like ResNet-50 show robust but coarse localization of human faces, while modern architectures like Swin-T can activate some semantic features. Our MkfaNet not only exhibits precise localization of discriminative organs but also tells the difference between fake and real faces.
Multiple Contexts and Frequencies Aggregation Network for Deepfake Detection

August 2024 · 72 Reads

Deepfake detection faces increasing challenges as the rapid growth of generative models produces massive and diverse Deepfake technologies. Recent advances rely on introducing heuristic features from spatial or frequency domains rather than modeling general forgery features within backbones. To address this issue, we turn to the backbone design with two intuitive priors from spatial and frequency detectors, i.e., learning robust spatial attributes and frequency distributions that are discriminative for real and fake samples. To this end, we propose an efficient network for face forgery detection named MkfaNet, which consists of two core modules. For spatial contexts, we design a Multi-Kernel Aggregator that adaptively selects organ features extracted by multiple convolutions for modeling subtle facial differences between real and fake faces. For the frequency components, we propose a Multi-Frequency Aggregator to process different bands of frequency components by adaptively reweighting high-frequency and low-frequency features. Comprehensive experiments on seven popular deepfake detection benchmarks demonstrate that our proposed MkfaNet variants achieve superior performance in both within-domain and cross-domain evaluations with impressive efficiency of parameter usage.


Figure 3. Qualitative results across Biofors, RSIID, SciSp-C, and SciSp-H. From left to right, each column represents spliced images, ground truths (GTs), and the predictions of competing methods, respectively.
Figure 8. Different splicing approaches for constructing SciSp-H. (A-E) Illustration of five splicing approaches for constructing SciSp-H, including (A) vertical splicing, (B) horizontal splicing, (C) free splicing, (D) vertical removal and splicing, and (E) horizontal removal and splicing.
Figure 12. Framework of our UEMA. "DC" is short for a depth-wise convolution layer.
Details of datasets for scientific image splicing detection
Pixel-level (i.e., F1 and MCC) and image-level (i.e., AUC and Acc) cross-testing results (%)
Exposing image splicing traces in scientific publications via uncertainty-guided refinement

August 2024 · 16 Reads

Patterns

Recently, a surge in image manipulations in scientific publications has led to numerous retractions, highlighting the importance of image integrity. Although forensic detectors for image duplication and synthesis have been researched, the detection of image splicing in scientific publications remains largely unexplored. Splicing detection is more challenging than duplication detection due to the lack of reference images and more difficult than synthesis detection because of the presence of smaller tampered-with areas. Moreover, disruptive factors in scientific images, such as artifacts, abnormal patterns, and noise, present misleading features that resemble splicing traces, rendering this task difficult. In addition, the scarcity of high-quality datasets of spliced scientific images has limited advancements. Therefore, we propose the uncertainty-guided refinement network (URN) to mitigate these disruptive factors. We also construct a dataset for image splicing detection (SciSp) with 1,290 spliced images, built by collecting and manually splicing images. Comprehensive experiments demonstrate the URN’s superior splicing detection performance.


Distribution-Guided Hierarchical Calibration Contrastive Network for Unsupervised Person Re-Identification

August 2024 · 6 Reads · 4 Citations

IEEE Transactions on Circuits and Systems for Video Technology

The person re-identification task aims to retrieve the same identity under different cameras. The main difficulties of the task lie in the collection of a large amount of annotated data and the diversity of pedestrians. Therefore, learning a robust and discriminative feature representation from unlabeled data is the key to this task. Pseudo-label-based methods have shown significant effectiveness in the field by generating pseudo labels from unlabeled data instead of using ground-truth labels. However, existing research typically suffers from two limitations: 1) The extracted features are insufficient to reflect the subtle local semantics; 2) The pseudo labels generated by clustering methods cannot avoid introducing noise, which seriously degrades the discriminative features. In this paper, to address the above problems, we propose a Distribution-Guided Hierarchical Calibration Contrastive Network (DHCCN) to better exploit local clues and hierarchical representation, which considers cross-granularity consistency and reduces the noise of pseudo labels through the calibrated feature distribution. A Hierarchical Feature Extractor is employed to capture the multi-granularity response of each image and to fuse both the global salience and the local subtle texture information of a pedestrian to generate the hierarchical feature. In addition, to reduce the error of the pseudo labels, we introduce a Feature Distribution Corrector to calibrate noisy features of low-confidence samples evaluated by a Gaussian Mixture Model. Finally, we integrate a cross-granularity consistency constraint based on the difference between the global and local features, which helps generate more accurate feature embeddings and improves the robustness of the model. As a result, we achieve performance close to that of the supervised person re-identification task by narrowing the gap between the pseudo and ground-truth labels.
Experiments on four standard benchmarks demonstrate the effectiveness of our method against the state-of-the-art unsupervised re-identification methods. The code is available at https://github.com/Li-Yongxi/2023-DHCCN.



Citations (24)


... For medical image privacy protection [90][91][92], Wen et al. [90] constructed a watermark-cycle consistency network for secure sharing of medical images in telemedicine, converting secret images into other meaningful images for transmission to reduce suspicions from attackers. This network is unique in that only the paired CycleGAN can recover the correct medical image. ...

Reference:

Image Privacy Protection: A Survey
HideMIA: Hidden Wavelet Mining for Privacy-Enhancing Medical Image Analysis
  • Citing Conference Paper
  • October 2024

... In this type of system, decisions are based on several modalities (e.g., visual information, NIR information, depth information or even temporal information from videos), which model different aspects of the physical environment. Utilizing multiple modalities can enhance robustness against various types of spoof attacks in real-world scenarios (Lin et al. [2024]). ...

Suppress and Rebalance: Towards Generalized Multi-Modal Face Anti-Spoofing
  • Citing Conference Paper
  • June 2024

... For accurately segmenting tumors in breast ultrasound images that suffer from low contrast, speckle noise, and blurred boundaries, the EH-former is proposed. This model employs region-wise curriculum learning (CL) to dynamically focus on challenging regional features, and includes uncertainty estimation and an Adaptive Easy-Hard region Separator (AdaSep) to significantly enhance segmentation accuracy and domain generalization [34]. MF-Net introduces a transformer-based auxiliary bi-encoder that models long-range dependencies to address the limitations of other methods in extracting local details and integrating diverse global semantic information [35]. ...

EH-former: Regional easy-hard-aware transformer for breast lesion segmentation in ultrasound images
  • Citing Article
  • September 2024

Information Fusion

... [12] Through heuristic learning, the strategy actively selects the samples that are most helpful to the model to be labeled by human experts, and adds the labeled instances into the training set; iterative training was used to improve the generalization performance of the classifier. [13] With the exponential increase of all kinds of data in the information age, the problem of data marking has been paid more and more attention by the academic and industrial circles, significant advances have been made in theory and algorithms, and have been widely used in image processing, speech recognition etc. [14], [15] This article mainly studies how to apply active learning technology in relationship extraction tasks. The meaning is that under the condition of small-scale labeled corpus, it can effectively utilize the potential information in large-scale unlabeled corpus to learn and select the most effective part of the corpus for manual annotation. ...

Research on joint model relation extraction method based on entity mapping

... 17 The analysis of intrinsic statistics to explore traces of image forgeries has also progressed. [18][19][20] For more generalized detection, many methods employ noise-sensitive filters 5,[21][22][23][24] or self-supervised learning 25,26 to suppress semantic information and analyze noise inconsistencies in the images. More recently, Hifi 27 expanded this task and proposed fine-grained image forgery detection with hierarchical labels. ...

CDS-Net: Cooperative dual-stream network for image manipulation detection
  • Citing Article
  • November 2023

Pattern Recognition Letters

... He et al. [38] introduced a hybrid CNN-transformer network (HCTNet) consisting of transformer encoder blocks (TEBlocks) in the encoder and a spatial-wise cross attention (SCA) module in the decoder to enhance breast lesion segmentation in BUS ultrasound images. Their application of the HCT network highlighted the importance of local features due to a unique computer kernel, though this focus led to difficulties in evaluating tumorlike shadows and speckle noise. ...

A deep supervised transformer U‐shaped full‐resolution residual network for the segmentation of breast ultrasound image

... Yu et al. [34] employ a transformer utilizing TCN for anomaly detection. Zeng et al. [35] also proposed an adversarial transformer with fused probability for anomaly detection. In the proposed model, the anomaly score is based on reconstruction error plus anomaly probability, which determines the probability of the current time stamp being anomalous. ...

Multivariate time series anomaly detection with adversarial transformer architecture in the Internet of Things
  • Citing Article
  • March 2023

Future Generation Computer Systems

... The authors in [18] proposed a Keepalived-based high-availability dynamic web cluster that increases the resilience and robustness of a particular government system server against disruptions and pressures. The authors in [19] proposed an airport security system using Keepalive to achieve high availability of its services. The studies mentioned above have demonstrated the superiority of Keepalived technology from a system (or service) perspective. ...

Design and Implementation of Airport Security System Based on IoT Data Cloud Platform
  • Citing Conference Paper
  • September 2022

... Consequently, approaches such as EMT-Net [29], IML-ViT [13], and MVSS-Net [17], which utilize edge supervision, exhibit superior localization due to the model's ability for precise manipulation localization. ...

Image Manipulation Detection by Multiple Tampering Traces and Edge Artifact Enhancement
  • Citing Article
  • September 2022

Pattern Recognition

... these techniques, BUS is particularly advantageous due to its use of non-ionizing radiation, fast imaging, portability, and low cost. [3][4][5][6] BUS images provide a clear view of the breast tissue and help determine the size and shape of breast tumors for screening and diagnosis. However, accurately determining the size and shape of breast tumors from BUS images can be challenging and time-consuming, as it requires manual segmentation by operators or experts. ...

Deep-Learning-Based Ultrasound Sound-Speed Tomography Reconstruction with Tikhonov Pseudo-Inverse Priori
  • Citing Article
  • July 2022

Ultrasound in Medicine & Biology