
Pan Gao
Trinity College Dublin | TCD · Department of Computer Science
Doctor of Philosophy
About
68 Publications
6,341 Reads
288 Citations
Introduction
I am currently an associate professor at Nanjing University of Aeronautics and Astronautics, Nanjing, China. I was a postdoctoral research fellow at Trinity College Dublin, working on the processing, compression, transmission, and evaluation of volumetric video.
Additional affiliations
January 2013 - September 2016
Publications (68)
Automated radiology report generation has the potential to improve radiology reporting and alleviate the workload of radiologists. However, the medical report generation task poses unique challenges due to the limited availability of medical data and the presence of data bias. To maximize the utility of available data and reduce data bias, we propo...
The worldwide commercialization of fifth-generation (5G) wireless networks is pushing toward the deployment of immersive, high-quality VR-based telepresence systems. In these systems, 3D objects are generally digitized and represented as point clouds. However, realistically reconstructed 3D point clouds generally contain thousands up to millions of poin...
GAN inversion aims to invert given images into the corresponding latent codes of Generative Adversarial Networks (GANs), especially StyleGAN, which has a disentangled latent space that allows attribute-based image manipulation at the latent level. As most inversion methods build upon Convolutional Neural Networks (CNNs), we transfer a hierarchical vi...
Video frame interpolation has been actively studied with the development of convolutional neural networks. However, due to the intrinsic limitations of kernel weight sharing in convolution, the interpolated frame generated by it may lose details. In contrast, the attention mechanism in Transformer can better distinguish the contribution of each pix...
Image quality assessment is a fundamental problem in the field of image processing, and due to the lack of reference images in most practical scenarios, no-reference image quality assessment (NR-IQA) has gained increasing attention recently. With the development of deep learning technology, many deep neural network-based NR-IQA methods have been d...
Discovering inter-point connections for efficient high-dimensional feature extraction from point coordinates is a key challenge in point cloud processing. Most existing methods focus on designing efficient local feature extractors while ignoring global connections, or vice versa. In this paper, we design a new Inductive Bias-aided Transformer (IBT) me...
Vehicle exhaust is the main source of air pollution with the rapid increase of fuel vehicles. Automatic smoky vehicle detection in videos is a superior solution to traditional expensive remote sensing with ultraviolet-infrared light devices for environmental protection agencies. However, it is challenging to distinguish vehicle smoke from shadow an...
Thanks to its ability to provide an immersive and interactive experience, the uptake of 360-degree image content has been rapidly growing in consumer and industrial applications. Compared to planar 2D images, saliency prediction for 360-degree images is more challenging due to their high resolutions and spherical viewing ranges. Currently, most h...
Problems such as equipment defects or limited viewpoints cause captured point clouds to be incomplete. Therefore, recovering the complete point clouds from the partial ones plays a vital role in many practical tasks, and one of the keys lies in the prediction of the missing part. In this paper, we propose a novel point cloud completion app...
Existing point cloud learning methods aggregate features from neighbouring points by constructing a graph in the spatial domain, so the features of each point are updated from spatially fixed neighbours throughout the layers. In this paper, we propose a dynamic feature aggregation (DFA) method that can transfer information by constructin...
Predicting salient regions in images requires the capture of contextual information in the scene. Conventional saliency models typically use the encoder-decoder architecture and multi-scale feature fusion for modeling contextual features, which, however, incur a high computational cost and a large number of model parameters. In this paper, we address the saliency pr...
Many point cloud completion methods typically rely on two steps: coarse generation and 2D Grid deformed fine output. However, in the fine generation, the expansion range (2D Grid Scale) required by each point cloud sample may be vastly different. For example, if the expansion range for a vessel shape is applied to a table shape, the final output ma...
Semantic segmentation usually benefits from global contexts, fine localisation information, multi-scale features, etc. To advance Transformer-based segmenters with these aspects, we present a simple yet powerful semantic segmentation architecture, termed IncepFormer. IncepFormer has two critical contributions, as follows. First, it introduces a...
Jing Xu, Wentao Shi, Pan Gao, [...], Qizhu Li
While a large number of recent works on semantic segmentation focus on designing and incorporating a transformer-based encoder, much less attention and vigor have been devoted to transformer-based decoders. For such a task whose hallmark quest is pixel-accurate prediction, we argue that the decoder stage is just as crucial as that of the encoder in...
As one of the main representation formats of the 3D real world, and well suited to virtual reality and augmented reality applications, point clouds have gained a lot of popularity. In order to reduce the huge amount of data, a considerable amount of research on point cloud compression has been done. However, given a target bit rate, how to proper...
Since the ultimate consumers and judges of most video applications are human subjects, there has been growing interest in incorporating characteristics of the human visual system (HVS) in video coding for the further development of coding technology, called perceptual video coding (PVC). Although there have been numerous PVC methods reported in th...
It is widely believed that Transformers perform better in semantic segmentation than convolutional neural networks. Nevertheless, the original Vision Transformer may lack the inductive biases of local neighborhoods and has a high time complexity. Recently, Swin Transformer set a new record in various vision tasks by using hierarchical arch...
The point cloud is a crucial representation of 3D content and has been widely used in many areas such as virtual reality, mixed reality, autonomous driving, etc. As the number of points in the data grows, efficiently compressing point clouds becomes a challenging problem. In this paper, we propose a set of significant improvements to pa...
For environmental protection agencies, automatic smoky vehicle detection in videos is a superior solution to traditional, expensive remote sensing with ultraviolet-infrared light devices. However, it is challenging to distinguish vehicle smoke from shadow and wet regions caused by rear vehicles or cluttered roads, and this could be worse due to lim...
The video frame interpolation task has recently become increasingly prevalent in the computer vision field. At present, a number of deep learning-based studies have achieved great success. Most of them are based on optical flow information, an interpolation kernel, or a combination of these two methods. However, these methods have ignored...
The massive amount of data usage for light field (LF) information poses grand challenges for efficient compression designs. There have been several LF video compression methods focusing on exploring efficient prediction structures reported in the literature. However, the number of possible prediction structures is infinite, and these methods fail t...
Motion estimation and motion compensation are indispensable parts of inter prediction in video coding. Since the motion vector of objects is mostly in fractional pixel units, original reference pictures may not accurately provide a suitable reference for motion compensation. In this paper, we propose a deep reference picture generator which can cre...
In geometry-based point cloud compression, the geometry information is typically compressed using octree coding. In octree coding, the size of the blocks in the voxelized point clouds, i.e., the number of voxels contained in a block, determines whether the geometry coding is lossless or lossy, and the degree of geometry compression in lossy coding....
The ever-increasing number of 3D applications makes point cloud compression unprecedentedly important and needed. In this paper, we propose a patch-based compression process using deep learning, focusing on the lossy point cloud geometry compression. Unlike existing point cloud compression networks, which apply feature extraction and reconstruction on the...
We present a coding and quality evaluation method for Six Degrees of Freedom (6DoF) video content. Firstly, the 6DoF video content is generated using an affordable capturing setup, and represented as a 3-D mesh, a commonly used 6DoF format. To overcome the difficulty in exploiting the temporal redundancy in mesh with different connectivity at every...
Recent studies have shown that neural network (NN) based image classifiers are highly vulnerable to adversarial examples, which poses a threat to security-sensitive image recognition tasks. Prior work has shown that JPEG compression can combat the drop in classification accuracy on adversarial examples to some extent. But, as the compression ratio i...
Depth image-based rendering paves the way to the success of 3-D video. However, one issue that remains in 3-D video is how to fill the disocclusion areas. To this end, the Gaussian mixture model (GMM) is commonly employed to generate the background, and then to fill the holes. Nevertheless, GMM usually has poor performance for sequences with big for...
In this paper, we investigate cross-scene video foreground segmentation via supervised and unsupervised model communication. Traditional unsupervised background subtraction methods often face the challenging problem of updating the statistical background model online. In contrast, supervised foreground segmentation methods, such as those that are...
Recently, the study on object detection in aerial images has made tremendous progress in the community of computer vision. However, most state-of-the-art methods tend to develop elaborate attention mechanisms for the space-time feature calibrations with high computational complexity, while surprisingly ignoring the importance of feature calibration...
Omnidirectional video, also known as 360-degree video, has become increasingly popular nowadays due to its ability to provide immersive and interactive visual experiences. However, the ultra-high resolution and the spherical observation space brought by the large spherical viewing range make omnidirectional video distinctly different from traditio...
Volumetric video (VV) pipelines have reached a high level of maturity, creating interest in using such content in interactive visualisation scenarios. VV allows real world content to be captured and represented as 3D models, which can be viewed from any chosen viewpoint and direction. Thus, VV is ideal to be used in augmented reality (AR) or virtual reali...
Virtual view generation is of great significance in free viewpoint video (FVV) as it can avoid the need to transmit a large volume of video data. An important issue in generating virtual views is how to fill the holes caused by occlusion. Using the Gaussian mixture model (GMM) to generate the background reference image is a commonly used hole-fill...
Omnidirectional video, also known as 360-degree video, offers an immersive visual experience by providing viewers with an ability to look in all directions within a scene. The quality assessment for omnidirectional video is still a quite difficult task compared to 2D video. As the temporal changes of spatial distortions can considerably influence h...
For high compression efficiency, 3-D video coding usually employs a multimode methodology to exploit the dependencies between multiple views as well as between texture and depth. However, different coding modes will exhibit different error propagation behaviour when the compressed 3-D video bit stream is transmitted over packet-switched networ...
In depth map coding, rate-distortion optimization for those pixels that will cause occlusion in view synthesis is a rather challenging task, since the synthesis distortion estimation is complicated by the warping competition and the occlusion order can be easily changed by the adopted optimization strategy. In this paper, an efficient depth map cod...
Volumetric video is becoming easier to capture and display with recent technical developments in acquisition and display technologies. Using point clouds is a popular way to represent volumetric video for augmented or virtual reality applications. This representation, however, requires a large number of points to achieve a high quality of...
A novel joint source and channel coding scheme tailored to 3-D video is proposed in this paper to minimize the end-to-end view synthesis distortion within a given total bit rate for both texture and depth as well as a maximum tolerable distortion constraint for texture. Firstly, we formulate a joint texture and depth coding mode selection strategy...
The optimization of occlusion-inducing depth pixels in depth map coding has received little attention in the literature, since their associated texture pixels are occluded in the synthesized view and their effect on the synthesized view is considered negligible. However, the occlusion-inducing depth pixels still need to consume the bits to be trans...
This paper addresses the problem of error-resilient source coding for 3-D video transmitted over packet-loss networks. The approach jointly optimizes the texture coding mode and the depth coding mode for each macroblock in the reference views. Firstly, a distortion model is developed to capture the effect of the texture distortion and depth distort...
View synthesis prediction (VSP) is a crucial coding tool for improving compression efficiency in the next generation three-dimensional (3-D) video systems. However, VSP is susceptible to catastrophic error propagation when multi-view video plus depth (MVD) data are transmitted over lossy networks. This paper aims at accurately modeling the transmis...
Video transmission over packet-switched networks usually suffers from packet losses. The use of the prediction loop in video coding will cause these errors to propagate to subsequent frames, and thus significantly impacts the received video quality. With the increasing number of cameras to capture the scene, robustly delivering multi-view video...
View synthesis prediction (VSP) is an important tool for enhancing the coding efficiency in the next-generation three-dimensional (3-D) video systems. However, VSP will lead to prediction position errors when the depth maps are corrupted by packet losses during transmission. In order to mitigate the prediction position errors, a novel disparity vec...
In this paper, a new rate-distortion (R-D) optimized error-resilient scheme is proposed to improve error resilience for multi-view video transmission over lossy networks. Based on a study of the characteristics of multi-view video coding and the propagation behavior of channel errors, a recursive model to estimate the end-to-end distortion is fir...
With the rapid development of electronic and communications technology, multiview video services are becoming more and more feasible in practice. However, robustly delivering many views over error-prone channels is a rather challenging task. Due to the recent development of distributed video coding, this paper proposes two alternative error-resilie...
In this paper, a rate-distortion optimized coding mode switching scheme is proposed to improve error resilience for multi-view video plus depth (MVD) based 3-D video transmission over lossy networks. First, we derive a new end-to-end distortion model for MVD-based 3-D video transmission. As compared with the previous MVD-based video distortion mode...
In this paper, a Wyner-Ziv (WZ) coding based error-resilient scheme is proposed for multi-view video transmission over error-prone channels. At the encoder, the key frames of the odd views are protected by WZ encoding to generate the auxiliary bit-stream alongside the multi-view video coded bit-stream. At the decoder, error-concealed multi-view dec...
Robustly delivering multi-view 3-D video over error-prone channels is a rather challenging task. In this paper, a rate-distortion optimized algorithm is proposed to improve error resilience for multi-view video transmission. Firstly, a recursive model to estimate the end-to-end distortion is developed for multi-view video coding, in which the channe...
Multiview video coding (MVC) has employed motion estimation and disparity estimation to achieve the highest coding efficiency. When the compressed MVC video bit-stream is transmitted through error-prone networks, transmission errors will propagate along the intra-view and inter-view directions. Based on the characteristics of MVC, we propose a new end-to...
In the joint mode of MVC (Multiview Video Coding), to fully exploit both temporal and inter-view correlation, motion estimation (ME) and disparity estimation (DE) have been employed to achieve the highest coding efficiency. However, this incurs enormous computational complexity. An adaptive ME and DE algorithm is proposed to reduce the complexity in...