Youdong Ding’s research while affiliated with Shanghai University and other places


Publications (85)


SqSFill: Joint spatial and spectral learning for high-fidelity image inpainting
  • Article

May 2025 · Neurocomputing

Zihao Zhang · Feifan Cai · Qin Zhou · Youdong Ding

Figures:
  • Comparison of model size and performance with state-of-the-art methods on the Vimeo90K [8] dataset. Our method achieves an ideal balance between performance and parameter count.
  • Overall architecture of the proposed method: (a) model architecture; (b) Transformer residual block in the parallel spatio-temporal attention Transformer layer; (c) parallel spatio-temporal attention; (d) temporal attention dimension transformation.
  • Overall structure of the sub-networks: (a) context extraction network; (b) multi-scale prediction frame synthesis network; (c) synthesis block.
  • Visual comparison with state-of-the-art (SOTA) methods on the Vimeo90K [8] test set. The rectangular boxes mark the compared areas; GT is the ground truth.
  • Visual comparison with other SOTA methods on SNU-FILM [36]. The rectangular boxes mark the compared areas.

Parallel Spatio-Temporal Attention Transformer for Video Frame Interpolation
  • Article
  • Full-text available

May 2024 · 39 Reads

Traditional video frame interpolation methods based on deep convolutional neural networks face challenges in handling large motions. Their performance is limited by the fact that convolutional operations cannot directly integrate the rich temporal and spatial information of inter-frame pixels, and these methods rely heavily on additional inputs such as optical flow to model motion. To address this issue, we develop a novel framework for video frame interpolation that uses a Transformer to efficiently model the long-range similarity of inter-frame pixels. Furthermore, to effectively aggregate spatio-temporal features, we design a novel attention mechanism divided into temporal attention and spatial attention. Specifically, spatial attention aggregates intra-frame information, integrating both the attention and convolution paradigms through a simple mapping approach, while temporal attention models the similarity of pixels along the timeline. This design processes the two types of information in parallel without extra computational cost, aggregating information across the space-time dimensions. In addition, we introduce a context extraction network and a multi-scale prediction frame synthesis network to further optimize the performance of the Transformer. We extensively compare our method with state-of-the-art methods, both quantitatively and qualitatively, on various benchmark datasets. On the Vimeo90K and UCF101 datasets, our model improves PSNR over UPR-Net-large by 0.09 dB and 0.01 dB, respectively. On the Vimeo90K dataset, our model outperforms FLAVR by 0.07 dB with only 40.56% of its parameters. The qualitative results show that, for complex and large-motion scenes, our method generates sharper and more realistic edges and details.
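
As an illustration of the parallel spatio-temporal attention idea described in the abstract, here is a minimal PyTorch-style sketch: temporal attention lets each pixel attend to the same spatial location across frames, while a simple convolutional mapping stands in for the intra-frame (spatial) aggregation, and the two branches are computed in parallel and fused. The module names, the 1x1-convolution mapping, and the fusion layer are hypothetical simplifications, not the authors' released implementation.

```python
# Hypothetical sketch of parallel spatio-temporal attention (not the paper's code).
import torch
import torch.nn as nn


class ParallelSpatioTemporalAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        # Temporal branch: attention over the frame (time) axis per spatial location.
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Spatial branch: a simple 1x1-conv mapping standing in for the
        # attention-plus-convolution aggregation described in the abstract.
        self.spatial_map = nn.Conv2d(dim, dim, kernel_size=1)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width)
        b, t, c, h, w = x.shape

        # Temporal attention: each pixel attends to the same pixel in other frames.
        xt = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        xt, _ = self.temporal_attn(xt, xt, xt)
        xt = xt.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

        # Spatial mapping: aggregate intra-frame information per frame.
        xs = self.spatial_map(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)

        # Fuse the two branches that were computed in parallel.
        out = torch.cat([xt, xs], dim=2).permute(0, 1, 3, 4, 2)
        out = self.fuse(out).permute(0, 1, 4, 2, 3)
        return out
```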


Luminance domain-guided low-light image enhancement

April 2024 · 59 Reads · 5 Citations · Neural Computing and Applications

Images captured under low-light conditions often suffer from low contrast, high noise, and uneven brightness caused by nightlight, backlight, and shadow, which makes them difficult to use as high-quality inputs for visual tasks. Existing low-light enhancement methods tend to increase overall image brightness, which can overexpose normal-light areas after enhancement. To solve this problem, this paper proposes an Uneven Dark Vision Network (UDVN) that consists of two sub-networks. The Luminance Domain Network (LDN) uses Direction-aware Spatial Context (DSC) and a Feature Enhancement Module (FEM) to segment the differently lit regions of the image and output a luminance domain mask. Guided by this mask, the Light Enhancement Network (LEN) uses the Cross-Domain Transformation Residual block (CDTR) to adaptively illuminate different regions with different amounts of light. We also introduce a new region loss function that constrains the LEN to better enhance the quality of each lighting region. In addition, we construct a new synthetic low-light dataset (UDL) that is larger, more diverse, and includes the uneven lighting conditions found in the real world. Extensive experiments on several benchmark datasets demonstrate that our proposed method is highly competitive with state-of-the-art (SOTA) methods. Specifically, it outperforms other methods in light recovery and detail preservation when processing uneven low-light images. The UDL dataset is publicly available at: https://github.com/YuhangLi-li/UDVN.
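
The two-sub-network design can be pictured with a minimal sketch, assuming PyTorch: a stand-in LDN predicts a soft luminance-domain mask, a stand-in LEN enhances the image conditioned on that mask, and a region loss weights the reconstruction error separately for dark and normal-light regions. All module internals and helper names below are illustrative placeholders, not the released UDVN code.

```python
# Hypothetical sketch of mask-guided uneven low-light enhancement (not the UDVN code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LuminanceDomainNet(nn.Module):
    """Stand-in for the LDN: predicts a soft mask of dark vs. normal-light regions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)  # (B, 1, H, W), close to 1 where the image is dark


class LightEnhancementNet(nn.Module):
    """Stand-in for the LEN: enhances the image conditioned on the luminance mask."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x, mask):
        # Concatenate the mask as an extra channel so enhancement is region-aware.
        return self.net(torch.cat([x, mask], dim=1))


def region_loss(pred, target, mask):
    """Penalize reconstruction error separately for dark and normal-light regions."""
    dark = F.l1_loss(pred * mask, target * mask)
    bright = F.l1_loss(pred * (1 - mask), target * (1 - mask))
    return dark + bright
```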


An Efficient Multi-Scale Attention Feature Fusion Network for 4K Video Frame Interpolation

March 2024 · 11 Reads · 2 Citations

Video frame interpolation aims to generate intermediate frames in a video to showcase finer details. However, most methods are trained and tested only on low-resolution datasets, and 4K video frame interpolation remains under-explored. This limitation makes it challenging to handle high-frame-rate video processing in real-world scenarios. In this paper, we propose a 120 fps 4K video dataset, named UHD4K120FPS, which contains large motions. We also propose a novel framework for the 4K video frame interpolation task based on a multi-scale pyramid network structure. We introduce self-attention to capture long-range dependencies and self-similarities in pixel space, which overcomes the limitations of convolutional operations. To reduce computational cost, we use a simple mapping-based approach to make self-attention lightweight while still allowing content-aware aggregation weights. Through extensive quantitative and qualitative experiments, we demonstrate the excellent performance of our proposed model on the UHD4K120FPS dataset and illustrate the effectiveness of our method for 4K video frame interpolation. In addition, we evaluate the robustness of the model on low-resolution benchmark datasets.
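
The "simple mapping-based approach" to lightweight, content-aware attention might look roughly like the following sketch, assuming PyTorch: instead of computing a full query-key similarity matrix, a 1x1 convolution maps each pixel's features directly to aggregation weights over a small local window. The window size, module names, and layer choices are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of mapping-based lightweight attention (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MappedLocalAttention(nn.Module):
    def __init__(self, dim: int, window: int = 3):
        super().__init__()
        self.window = window
        # Map features directly to per-pixel aggregation weights (one per
        # neighbour in the window), avoiding the quadratic QK^T product.
        self.to_weights = nn.Conv2d(dim, window * window, kernel_size=1)
        self.to_values = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        k = self.window
        weights = self.to_weights(x).softmax(dim=1)          # (B, k*k, H, W)
        values = self.to_values(x)
        # Unfold local neighbourhoods of the values: (B, C, k*k, H*W).
        patches = F.unfold(values, k, padding=k // 2).reshape(b, c, k * k, h * w)
        weights = weights.reshape(b, 1, k * k, h * w)
        # Content-aware weighted aggregation over each local window.
        out = (patches * weights).sum(dim=2).reshape(b, c, h, w)
        return out
```

Because the weights come from a fixed-cost mapping rather than pairwise dot products, the cost of this sketch grows linearly with the number of pixels, which is the property that makes such a scheme attractive at 4K resolution.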


Citations (43)


... Liu et al. [8] introduced RAUNA; the structure comprises a decomposition network (DecNet) influenced by algorithmic unrolling and adjustment networks that take into account both global and local brightness. Li et al. [44] introduced UDVN, an enhancement algorithm that effectively handles images with uneven low-light conditions. This algorithm concentrates on the light and shadow details within the image for low-light enhancement tasks, achieving competitive performance in enhancing low-light images. ...

Reference:

Multi-stage residual network with two fold attention mechanisms for low-light image enhancement
Luminance domain-guided low-light image enhancement

Neural Computing and Applications

... Lu Zou et al. [83] conceptualize 2D poses as graphs, redefining 3D estimation as a graph regression problem, where GCNs infer latent structural relationships within the human body. Bing Yu et al. [84] develop a Perceptual U-shaped Graph Convolutional Network (M-UGCN) using a U-shaped network with map-aware local enhancement, extending the receptive field and intensifying local node interactions across multiple scales to improve 2D-to-3D estimation. Building on this, Hua et al. [85] combine 2D pose estimates from dual views with triangulation to produce an initial 3D pose, subsequently refining it through a Cross-view U-shape Graph Convolutional Network (CV-UGCN) under weak supervision, applicable to any preceding 2D method. ...

Graph U-Shaped Network with Mapping-Aware Local Enhancement for Single-Frame 3D Human Pose Estimation

... Although SSL pretext tasks can be designed and employed for many different types of data (e.g., time series [16], text [17], video [18] [19], audio [20], point clouds [21], or even multimodal data [22] [23]), this article focuses on image analysis for computer vision applications. Moreover, it focuses on generic SSL methods and not ones explicitly designed for specific tasks (e.g., for multi-view clustering [24], product attribute recognition [25], etc.). The remainder of this paper is organized as follows: Section 2 briefly presents the most common categories of pretext tasks for visual SSL. ...

SC²-Net: Self-supervised learning for multi-view complementarity representation and consistency fusion network
  • Citing Article
  • August 2023

Neurocomputing

... As the ground-truth version of the video signals in the new domain is not available in several real-life situations, the use of unsupervised learning techniques for carrying out the task of video-to-video translation is inevitable. Deep neural networks provide state-of-the-art performance in various computer vision tasks [3][4][5][6][7][8][9], in view of their end-to-end learning capability between the input and output data domains. In view of these explanations, the use of deep unsupervised learning-based schemes seems to be a legitimate choice for designing a high-performance video-to-video translation system. ...

Multi-orientation depthwise extraction for stereo image super-resolution

Signal Image and Video Processing

... • SGRNet [185] is a two-stage network, which first employs a generator to create a shadow mask by merging foreground and background, and then predicts shadow parameters and fills the shadow area, producing an image with realistic shadows. • Liu et al. [186] enhance shadow generation in image compositing with multi-scale feature enhancement and multi-level feature fusion. This approach improves mask prediction accuracy and minimizes information loss in shadow parameter prediction, leading to enhanced shadow shapes and ranges. ...

Shadow Generation for Composite Image with Multi-level Feature Fusion
  • Citing Conference Paper
  • March 2023

... The results show that the proposed model produces sharper results closer to the ground truth, with fewer blurring effects and artifacts. Finally, Li et al. [208] (SRAGAN) design a complex GAN with local and global channel and spatial attention modules in both the generator and the discriminator network to capture short- as well as long-range dependencies between pixels. Several experiments proved the superiority of the proposed model, especially at higher scaling factors. ...

Single-image Super-resolution Based on Generative Adversarial Network with Dual Attention Mechanism
  • Citing Conference Paper
  • December 2022

... Chu et al. [11] proposed a two-stage network that first extracts the subtitle mask using a mask extraction network, then feeds the predicted mask and video frame into the generator to remove subtitles. Tu et al. [33] proposed a lightweight mask extraction network that uses gated convolutions to generate unsubtitled videos. Although BVDNet [10] has small model parameters, it does not perform well in removing subtitles. ...

Deep Video Decaptioning via Subtitle Mask Prediction and Inpainting
  • Citing Conference Paper
  • December 2022

... Despite data limitations, recent research has explored different SLLIE [17] approaches. The end-to-end learning-based methods [2,9,11,15,22,26,29,30,39,43,53] aim to directly map low-light images to well-lit counterparts, often integrating Convolutional Neural Networks (CNNs). Several recent methods leverage transformers [7,41,52] to enhance the receptive field of vanilla convolutions for learning spatial information, inspired by advances in image restoration [48,57,58]. ...

LDNet: low-light image enhancement with joint lighting and denoising

Machine Vision and Applications