
Hongxun Yao - Harbin Institute of Technology
Publications: 333
Reads: 59,734
Citations: 9,295
Publications (333)
Image-based virtual try-on aims to transfer an in-shop clothing image to a person image. Most existing methods adopt a single global deformation to perform clothing warping directly, which lacks fine-grained modeling of in-shop clothing and leads to distorted clothing appearance. In addition, existing methods usually fail to generate limb details w...
In recent years, Multimodal Large Language Models (MLLMs) have demonstrated remarkable advancements in tasks such as visual question answering, visual understanding, and reasoning. However, this impressive progress relies on vast amounts of data collected from the internet, raising significant concerns about privacy and security. To address these i...
Reasoning is an essential capacity for large language models (LLMs) to address complex tasks, where the identification of process errors is vital for improving this ability. Recently, process-level reward models (PRMs) were proposed to provide step-wise rewards that facilitate reinforcement learning and data production during training and guide LLM...
Successful video deblurring relies on effectively using sharp pixels from other frames to recover the blurry pixels of the current frame. However, mainstream methods only use estimated optical flows to align and fuse features from adjacent frames without considering the pixel-wise blur levels, leading to the introduction of blurry pixels from adjac...
Visual emotion recognition (VER), which aims at understanding humans' emotional reactions toward different visual stimuli, has attracted increasing attention. Given the subjective and ambiguous characteristics of emotion, annotating a reliable large-scale dataset is hard. For reducing reliance on data labeling, domain adaptation offers an alternati...
The effectiveness of supervised ML heavily depends on having a large, accurate, and diverse annotated dataset, which poses a challenge in applying ML for yield prediction. To address this issue, we developed a self-training random forest algorithm capable of automatically expanding the annotated dataset. Specifically, we trained a random forest reg...
Semantic scene completion (SSC) aims to simultaneously perform scene completion (SC) and predict semantic categories of a 3D scene from a single depth and/or RGB image. Most existing SSC methods struggle to handle complex regions with multiple objects close to each other, especially for objects with reflective or dark surfaces. This primarily stems...
Learning to recognize novel visual classes from few samples is challenging but promising. Previous studies have shown that few-shot models tend to overfit and generalize poorly because they estimate a biased distribution from only a few samples. In addition, in agriculture-specific domains, there are more serious research...
Although stereo image restoration has been extensively studied, most existing work focuses on restoring stereo images with limited horizontal parallax due to the binocular symmetry constraint. Stereo images with unlimited parallax (e.g., large ranges and asymmetrical types) are more challenging in real-world applications and have rarely been explor...
Existing low-light video enhancement methods are dominated by Convolutional Neural Networks (CNNs) that are trained in a supervised manner. Due to the difficulty of collecting paired dynamic low/normal-light videos in real-world scenes, they are usually trained on synthetic, static, and uniform motion videos, which undermines their generalization to...
Artistic image synthesis is receiving increasing engagement in the multimedia community because of the development and improvement of generative adversarial networks. Digital art synthesis methods perform uncontrolled manipulation in complicated landscape scenarios because of the domain diversity of the paintings. To solve this problem, this paper...
We present a framework for artwork image synthesis from unsupervised segmentation maps input and style images. The output has style consistency with style images and the semantic structure from the corresponding segmentation label. Existing methods of transferring semantic labels to painting images require large amounts of manual segmentation pairs...
Human instance matting aims to estimate an alpha matte for each human instance in an image, which is extremely challenging and has rarely been studied so far. Despite some efforts to use instance segmentation to generate a trimap for each instance and apply trimap-based matting methods, the resulting alpha mattes are often inaccurate due to inaccur...
The key success factor of the video deblurring methods is to compensate for the blurry pixels of the mid-frame with the sharp pixels of the adjacent video frames. Therefore, mainstream methods align the adjacent frames based on the estimated optical flows and fuse the alignment frames for restoration. However, these methods sometimes generate unsat...
Inferring the complete 3D shape of an object from an RGB image has shown impressive results. However, existing methods rely primarily on recognizing the most similar 3D model from the training set to solve the problem. These methods suffer from poor generalization and may lead to low-quality reconstructions for unseen objects. Nowadays, stereo came...
Current image editing algorithms prevailingly involve specific processing intensities. After the network training phase, the definite mapping is carried on testing images, which leads to under- or over-editing in certain cases. However, for many real problems, having access to diverse intensities of processing of the output is preferable. The hypot...
In recent years, supervised hashing has been validated to greatly boost the performance of image retrieval. However, the label-hungry property requires massive label collection, making it intractable in practical scenarios. To liberate the model training procedure from laborious manual annotations, some unsupervised methods are proposed. However, t...
View-based 3D model classification and retrieval are increasingly important in various fields. High classification accuracy and retrieval precision are urgently needed in the related applications. However, these two topics are always considered separately and very few works give an in-depth analysis of their relations. We would like to argue that a...
Online hashing for streaming data has attracted increasing attention recently. However, most existing algorithms focus on batch inputs and instance-balanced optimization, which is limited in the single datum input case and does not match the dynamic training in online hashing. Furthermore, constantly updating the online model with new-coming sample...
Recently, several Space-Time Memory based networks have shown that the object cues (e.g. video frames as well as the segmented object masks) from the past frames are useful for segmenting objects in the current frame. However, these methods exploit the information from the memory by global-to-global matching between the current and past frames, whi...
Sketch recognition remains a significant challenge due to the limited training data and the substantial intra-class variance of freehand sketches for the same object. Conventional methods for this task often rely on the availability of the temporal order of sketch strokes, additional cues acquired from different modalities and supervised augmentati...
Recovering the 3D shape of an object from single or multiple images with deep neural networks has been attracting increasing attention in the past few years. Mainstream works (e.g. 3D-R2N2) use recurrent neural networks (RNNs) to sequentially fuse feature maps of input images. However, RNN-based approaches are unable to produce consistent reconstru...
Estimating the complete 3D point cloud from an incomplete one is a key problem in many vision and robotics applications. Mainstream methods (e.g., PCN and TopNet) use Multi-layer Perceptrons (MLPs) to directly process point clouds, which may cause the loss of details because the structure and context of point clouds are not fully considered. To so...
Deep hashing methods have been proved to be effective and efficient for large-scale Web media search. The success of these data-driven methods largely depends on collecting sufficient labeled data, which is usually a crucial limitation in practical cases. The current solutions to this issue utilize Generative Adversarial Network (GAN) to augment da...
Scene text recognition with arbitrary shape is very challenging due to large variations in text shapes, fonts, colors, backgrounds, etc. Most state-of-the-art algorithms rectify the input image into the normalized image, then treat the recognition as a sequence prediction task. The bottleneck of such methods is the rectification, which will cause e...
In recent years, hashing methods have been proved to be effective and efficient for large-scale Web media search. However, the existing general hashing methods have limited discriminative power for describing fine-grained objects that share similar overall appearance but have a subtle difference. To solve this problem, we for the first time introdu...
With the breakthroughs in general action understanding, it has become an inevitable trend to analyze the actions in finer granularity. However, related research has been largely hindered by the lack of fine-grained datasets and the difficulty of capturing subtle differences between fine-grained actions that are highly similar overall. In this pa...
Body language is one of the most common ways of expressing human emotion. In this article, we make the first attempt to generate an action video with a specific emotion from a single person image. The goal of the emotion-based action generation task (EBAG) is to generate action videos expressing a specific type of emotion given a single reference i...
Deep neural networks (DNNs) are vulnerable to adversarial examples. Generally speaking, adversarial examples are created by adding a small-magnitude perturbation to input samples, which hardly affects human observers’ decisions but leads to misclassification by a well-trained model. Most existing iterative adversarial attack methods suf...
Multiple Object Tracking (MOT) meets great challenges in videos captured by Unmanned Aerial Vehicles (UAVs). Different from traditional videos, due to high altitude and abrupt motion changes of UAVs, the sizes of target objects in UAVs videos are usually very small and the appearance information of target objects is unreliable. The motion analysis...
Inferring the 3D shape of an object from an RGB image has shown impressive results. However, existing methods rely primarily on recognizing the most similar 3D model from the training set to solve the problem. These methods suffer from poor generalization and may lead to low-quality reconstructions for unseen objects. Nowadays, stereo cameras are p...
In this paper, we propose a novel deep framework for part-level semantic parsing of freehand sketches, which makes three main contributions that are experimentally shown to have substantial practical merit. First, we introduce a new idea named homogeneous transformation to address the problem of domain adaptation. For the task of sketch parsing, th...
Recovering the 3D representation of an object from single-view or multi-view RGB images by deep neural networks has attracted increasing attention in the past few years. Several mainstream works (e.g., 3D-R2N2) use recurrent neural networks (RNNs) to fuse multiple feature maps extracted from input images sequentially. However, when given the same s...
Recognition of general actions has witnessed great success in recent years. However, the existing general action representations cannot work well to recognize fine-grained actions, which usually share high similarities in both appearance and motion pattern. To solve this problem, we introduce the visual attention mechanism into the proposed descrip...
In recent years, deep hashing methods have proved effective because they employ convolutional neural networks to learn features and hashing codes simultaneously. However, these methods are mostly supervised. In real-world applications, annotating a large number of images is time-consuming and laborious. In this paper, we propo...
DenseNet features dense connections between layers. Such an architecture is elegant but memory-hungry and time-consuming. In this paper, we explore the relation between density of connections and performance of DenseNet (Huang et al., in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017). We find that some...
Nighttime images captured in scenes with low or non-uniform illumination suffer from loss of visibility and contain various noise and objectionable artifacts. When we amplify the brightness, the noise and artifacts are amplified as well. Hence, we propose a nighttime image enhancement approach based on image decomposition....
3D object retrieval has been a hot research topic in recent years. Within this field, view-based approaches are attracting increasing attention due to the flexibility of the data representation and their reported state-of-the-art performance. One of the most important issues related to view-based 3D object retrieval is how to learn e...
Image classification is a prominent topic and a challenging task in the field of remote sensing. Recently, many classification methods have been proposed for satellite images, specifically frameworks based on spectral-spatial feature extraction techniques. In this paper, a feature extraction strategy of multispectral data is taken into ac...
In this paper, we propose a novel deep model for character recognition under unbalanced distributions by employing a focal-loss-based connectionist temporal classification (CTC) function. Previous works utilize traditional CTC to compute prediction losses. However, some datasets may consist of extremely unbalanced samples, such as Chinese. In other words, b...
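As a rough illustration of what a focal-loss-weighted CTC objective can look like, the sketch below computes the CTC sequence probability p with the standard forward recursion and then applies the usual focal weighting, -alpha * (1 - p)^gamma * log(p). This is only a minimal sketch under stated assumptions: the truncated abstract does not give the exact formulation, so the weighting form, the function names `ctc_probability` and `focal_ctc_loss`, and the default `alpha` and `gamma` values are all hypothetical and not taken from the paper.

```python
import math


def ctc_probability(probs, target, blank=0):
    """Probability of `target` under CTC, given per-frame class probabilities.

    probs  : list of T lists; probs[t][k] is the probability of class k at frame t
    target : list of label indices (without blanks)
    """
    # Extended label sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    # Forward variables for the first frame
    alpha = [0.0] * size
    alpha[0] = probs[0][blank]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    # Standard CTC forward recursion over the remaining frames
    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid paths end in the last label or the trailing blank
    p = alpha[-1]
    if size > 1:
        p += alpha[-2]
    return p


def focal_ctc_loss(probs, target, alpha=0.25, gamma=2.0, blank=0):
    """Focal weighting applied to the CTC probability (assumed form)."""
    p = ctc_probability(probs, target, blank)
    p = min(max(p, 1e-12), 1.0)  # guard against log(0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)
```

With `gamma=0` and `alpha=1` the function reduces to the ordinary CTC negative log-likelihood; larger `gamma` values shrink the loss of sequences that are already predicted with high probability, which is the usual motivation for focal weighting on unbalanced data.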
Hand-crafted and learning-based features are two main types of video representations in the field of video understanding. How to integrate their merits to design good descriptors has been the research hotspot recently. Motivated by TDD (Wang et al. 2015), we combine trajectory pooling method and 3D ConvNets (Tran et al. 2015) and put forward a nove...
Recent research shows that auto-encoders are suitable for modeling variations that change smoothly. In this paper, we attempt to utilize auto-encoders to recognize partially occluded digit images with gradual recovery. We propose a new variation of the auto-encoder, namely the “generalized auto-encoder”, and construct stacked generalized auto-encoders (SGAE...
It is challenging to segment fine-grained objects due to appearance variations and background clutter. Most existing segmentation methods can hardly separate small parts of the instance from its background with sufficient accuracy. However, such small parts usually contain important semantic information, which is crucial in fine-grained categoriza...
Representation of videos is essential since it conveys an understanding of video content and enables many higher-level tasks to be tackled efficiently. However, it is challenging to propose a rational representation for complex event videos, as most video information is either noisy or redundant. In this work, we propose a compact event representat...
Affective social multimedia computing is an emergent research topic for both affective computing and multimedia research communities. Social multimedia is fundamentally changing how we communicate, interact, and collaborate with other people in our daily lives. Social multimedia contains much affective information. Effective extraction of affective...
Convolutional neural networks (CNNs) have been applied to visual tracking with demonstrated success in recent years. However, the performance of CNN-based trackers can be further improved, because the predicted upright bounding box cannot tightly enclose the target due to factors such as deformations and rotations. Besides, many existing CNN-based...
Classifying remote sensing images with high spectral and spatial resolution has become an important topic and a challenging task in the computer vision and remote sensing (RS) fields because of their huge dimensionality and computational complexity. Recently, many studies have already demonstrated the efficiency of employing spatial information where a combi...
3D reconstruction has been attracting increasing attention in the past few years. With the surge of deep neural networks, the performance of 3D reconstruction has been improved significantly. However, the voxel reconstructed by extant approaches usually contains lots of noise and leads to heavy computation. In this paper, we define a new voxel repr...
Effective feature representation is crucial to view-based 3D object retrieval (V3OR). Most previous works employed hand-crafted features to represent the views of each object. Although deep learning based methods have shown excellent performance in many vision tasks, it remains hard to achieve excellent performance for unsupervised 3D object retrieval. I...
We propose to perform object discovery and cosegmentation in noisy datasets using CNN features. We use an object discovery framework which assumes that common object patterns are sparse with respect to transformations across images. The key issue is then how to take advantage of the interrelations among images. Since an image normally matches...
Various mobile devices with high-quality cameras are very popular in human daily life. Appropriate directions about the standing postures can greatly improve the user experience while taking photos. In this paper, we propose a method to recommend custom model-like standing style based on model sketches. We first translate the real images of splendi...
Hand-crafted and learning-based features are two main types of video representations in the field of video understanding. How to combine their merits to design good descriptors has been the research hotspot recently. Following the idea of TDD [1], in this paper, we investigate if the trajectory pooling method is suitable to 3D ConvNets [2]. Specifi...
Nowadays, agriculture is developing very fast. Corn yield is an important indicator in agriculture, which makes automatic weed removal a necessary and urgent task. There are many challenges in distinguishing corn from weeds; the biggest one is the similarity in both color and shape between corn and...
Fine-grained visual categorization (FGVC) is a challenging vision problem because of the similar appearance of object classes. It is important to note that the human visual recognition system generally focuses on specific parts to distinguish easily confused classes, which is also the breakthrough point for FGVC. In this paper, we will introduce the...
Fine-grained classification is challenging since sub-categories have small inter-class variance and large intra-class variance. The task of flower classification can be achieved through highlighting the discriminative parts. Most traditional methods trained Convolutional Neural Networks (CNN) to handle the variations of pose, color and rotation...
Convolutional Neural Networks (CNNs) have been applied to visual tracking with demonstrated success in recent years. Most CNN-based trackers utilize hierarchical features extracted from a certain layer to represent the target. However, features from a certain layer are not always effective for distinguishing the target object from the backgrounds e...
Existing methods for flower classification are usually focused on segmentation of the foreground, followed by extraction of features. After extracting the features from the foreground, global pooling is performed for final classification. Although this pipeline can be applied to many recognition tasks, these approaches have not explored st...
Computationally modelling the affective content of images has been extensively studied recently because of its wide applications in entertainment, advertisement, and education. Significant progress has been made on designing discriminative features to bridge the affective gap. Assuming that viewers can reach a consensus on the emotion of images, mo...
Event detection, which targets the detection of complex events among numerous videos, has attracted growing interest recently. Previous approaches suffered from huge computation costs in multiple feature extraction and classification process. Lately, a discriminative CNN video representation method for event detection is proposed to obtain promisin...
Histograms are commonly used in feature design. However, most existing histogram-based descriptors ignore the distribution of points within each bin. Motivated by VLAD, we introduce the local aggregation strategy into the design of hand-crafted features to address this issue, and put forward several locally aggregated...
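To make the motivation concrete, the sketch below contrasts a plain count histogram with a VLAD-style variant that accumulates, per bin, the residuals between each point and the bin center. It is only an illustration of the general idea on 1-D values, assuming fixed bin edges and bin midpoints as centers; the function names `plain_histogram` and `locally_aggregated_histogram` are hypothetical, and the descriptors actually proposed in the paper are not recoverable from the truncated abstract.

```python
from bisect import bisect_right


def _bin_index(value, edges):
    """Index of the bin containing `value`, clamped to the valid range."""
    idx = bisect_right(edges, value) - 1
    return min(max(idx, 0), len(edges) - 2)


def plain_histogram(values, edges):
    """Count of points per bin; ignores where points fall inside a bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        counts[_bin_index(v, edges)] += 1
    return counts


def locally_aggregated_histogram(values, edges):
    """Per-bin sum of residuals (value - bin center), in the spirit of VLAD."""
    centers = [(edges[i] + edges[i + 1]) / 2.0 for i in range(len(edges) - 1)]
    sums = [0.0] * (len(edges) - 1)
    for v in values:
        i = _bin_index(v, edges)
        sums[i] += v - centers[i]
    return sums


edges = [0.0, 0.5, 1.0]
low = [0.1, 0.2]    # both points near the left edge of bin 0
high = [0.4, 0.45]  # both points near the right edge of bin 0

# The count histograms are identical, but the aggregated residuals differ in sign.
same_counts = plain_histogram(low, edges) == plain_histogram(high, edges)
residual_low = locally_aggregated_histogram(low, edges)[0]
residual_high = locally_aggregated_histogram(high, edges)[0]
```

The two sample sets fall into the same bin and are indistinguishable to the count histogram, while the residual sums separate them, which is the within-bin information the abstract says ordinary histograms discard.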
Image matching remains an important and challenging problem in computer vision, especially for the dense correspondence estimation between images with high category-level similarity. The effectiveness of image matching largely depends on the advance of image descriptors. Inspired by the success of Convolutional Neural Network (CNN), we propose a hie...
Since ancient times, free-hand sketches have been widely used as an effective and convenient intermediate means to express human thoughts and highly diverse objects in reality. In recent years, many researchers have recognized the significance of sketches and gradually focused on sketch-related problems, such as sketch-based image retrieval...
We present a simple yet effective approach for human action recognition. Most of the existing solutions based on multi-class action classification aim to assign a class label for the input video. However, the variety and complexity of real-life videos make it very challenging to achieve high classification accuracy. To address this problem, we prop...
Recent developments in the field of computer vision have led to a renewed interest in sketch-related research. Considerable solid evidence has emerged revealing the significance of sketches. However, there have been few in-depth discussions of sketch-based action analysis so far. In this paper, we propose an approach to discover the mo...
Structure-texture image decomposition aims to interpret an image as the superposition of a structural component and a textural component, which is a very challenging problem, yet opens the door to many applications once solved successfully. The number of zero crossings in derivatives is utilized as a type of coarseness measure to perform structure-...
Understanding a scene provided by Very High Resolution (VHR) satellite imagery has become a more and more challenging problem. In this paper, we propose a new method for scene classification based on different pre-trained Deep Features Learning Models (DFLMs). DFLMs are applied simultaneously to extract deep features from the VHR image scene, and t...
The rapid development of remote sensing technology allows us to get images with high and very high resolution (VHR). VHR imagery scene classification has become an important and challenging problem. In this paper, we introduce a framework for VHR scene understanding. First, the pretrained visual geometry group network (VGG-Net) model is proposed as...
View-based 3D object retrieval techniques have become increasingly important in various fields, and lots of ingenious studies have promoted the development of retrieval performance from different aspects. In this paper, we focus on the 2D projective views that represent the 3D objects and propose a boosting approach by evaluating the discriminative...
In this paper, we propose a unified framework for anomaly detection and localization in crowded scenes. For each video frame, we extract the spatio-temporal sparse features of 3D blocks and generate the saliency map using a block-based center-surround difference operator. Two sparse coding strategies including off-line long-term sparse representati...
Identification of characters in TV series and movies is an important and challenging problem. Actor identification results are important information for many higher level multimedia analysis tasks, such as semantic indexing and retrieval, interaction analysis and video summarization. Compared with previous works on actor identification that mainly...
General natural image deblurring methods do not work well for document images. We exploit a two-tone prior to steer the intermediate latent image towards a piece-wise constant two-tone intermediate image. This prior is helpful for the process of kernel estimation to overcome undesirable local minima, and it is not too restrictive to deblur text ima...
Recent years have witnessed great progress in image deblurring. However, as an important application case, the deblurring of face images has not been well studied. Most existing face deblurring methods rely on exemplar set construction and candidate matching, which not only cost much computation time but also are vulnerable to possible complex or e...
Deep convolutional neural networks have demonstrated breakthrough accuracies for image classification. A series of feature extractors learned from CNN have been used in other computer vision tasks. However, CNN features of different layers aim to encode different-level information. High-layer features care more about semantic information but less d...
Dance is a unique and meaningful type of human expression, composed of abundant and various action elements. However, existing methods based on associated texts and spatial visual features have difficulty in capturing the highly articulated motion patterns. To overcome this limitation, we propose to take advantage of the intrinsic motion informatio...
Images can convey rich semantics and induce various emotions to viewers. Most existing works on affective image analysis focused on predicting the dominant emotions for the majority of viewers. However, such dominant emotion is often insufficient in real-world applications, as the emotions that are induced by an image are highly subjective and diffe...
Previous works on image emotion analysis mainly focused on predicting the dominant emotion category or the average dimension values of an image for affective image classification and regression. However, this is often insufficient in various real-world applications, as the emotions that are evoked in viewers by an image are highly subjective and dif...
Although saliency prediction in crowds has recently been recognized as an essential task for video analysis, it has not yet been comprehensively explored. The challenges lie in that eye fixations in crowded scenes are inherently "distinct" and "multi-modal", which differs from those in regular scenes. To this end, the existing saliency prediction schemes...
Images can convey rich semantics and induce various emotions to viewers. Most existing works on affective image analysis focused on predicting the dominant emotions for the majority of viewers. However, such dominant emotion is often insufficient in real-world applications, as the emotions that are induced by an image are highly subjective and diff...