Hong-yuan Mark Liao

  • Ph.D. Northwestern University
  • Director at Academia Sinica

About

Publications: 284
Reads: 120,432
Citations: 23,971
Introduction
My current research is related to Multimedia Information Processing and Machine Learning.
Current institution
Academia Sinica
Current position
  • Director
Additional affiliations
August 2014 - July 2016
Sun Yat-sen University
Position
  • Honorary Chair Professor
Description
  • Joint appointment
August 2013 - July 2015
National Cheng Kung University
Position
  • Professor (Joint Appointment)
Description
  • Joint Appointment
September 2012 - June 2024
Academia Sinica
Position
  • Distinguished Research Fellow

Publications

Preprint
Full-text available
Identifying and localizing objects within images is a fundamental challenge, and numerous efforts have been made to enhance model accuracy by experimenting with diverse architectures and refining training strategies. Nevertheless, a prevalent limitation in existing models is overemphasizing the current input while ignoring the information from the...
Preprint
This is a comprehensive review of the YOLO series of systems. Different from previous literature surveys, this review article re-examines the characteristics of the YOLO series from the latest technical point of view. At the same time, we also analyzed how the YOLO series continued to influence and promote real-time computer vision-related research...
Preprint
Full-text available
We propose a post-processor, called NeighborTrack, that leverages neighbor information of the tracking target to validate and improve single-object tracking (SOT) results. It requires no additional data or retraining. Instead, it uses the confidence score predicted by the backbone SOT network to automatically derive neighbor information and then us...
Preprint
Full-text available
Designing a high-efficiency and high-quality expressive network architecture has always been the most important research topic in the field of deep learning. Most of today's network design strategies focus on how to integrate features extracted from different layers, and how to design computing units to effectively extract these features, thereby e...
Preprint
Full-text available
The paper presents a new method, SearchTrack, for multiple object tracking and segmentation (MOTS). To address the association problem between detected objects, SearchTrack proposes object-customized search and motion-aware features. By maintaining a Kalman filter for each object, we encode the predicted motion into the motion-aware feature, which...
Preprint
Full-text available
YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 160 FPS and has the highest accuracy 56.8% AP among all known real-time object detectors with 30 FPS or higher on GPU V100. YOLOv7-E6 object detector (56 FPS V100, 55.9% AP) outperforms both transformer-based detector SWIN-L Cascade-Mask R-CNN (9.2 FPS...
Article
Purpose: Retinopathy screening via digital imaging is promising for early detection and timely treatment, and tracking retinopathic abnormality over time can help to reveal the risk of disease progression. We developed an innovative physician-oriented artificial intelligence-facilitating diagnosis aid system for retinal diseases for screening multi...
Preprint
Full-text available
Learning to capture human motion is essential to 3D human pose and shape estimation from monocular video. However, the existing methods mainly rely on recurrent or convolutional operation to model such temporal information, which limits the ability to capture non-local context relations of human motion. To address this problem, we propose a motion...
Article
Full-text available
In this paper, we propose an end-to-end key-player-based group activity recognition network specially applied to the identification of basketball offensive tactics in limited data scenarios. Our previous studies show that basketball tactics can be better recognized via key player detection with multiple instance learning (MIL) using the support vec...
Preprint
Full-text available
People "understand" the world via vision, hearing, touch, and also past experience. Human experience can be learned through normal learning (we call it explicit knowledge), or subconsciously (we call it implicit knowledge). These experiences learned through normal learning or subconsciously will be encoded and stored in the brain. Using the...
Preprint
Full-text available
We show that the YOLOv4 object detection neural network based on the CSP approach scales both up and down and is applicable to small and large networks while maintaining optimal speed and accuracy. We propose a network scaling approach that modifies not only the depth, width, and resolution, but also the structure of the network. YOLOv4-large model achiev...
Article
Full-text available
An experienced director usually switches among different types of shots to make visual storytelling more touching. When filming a musical performance, appropriate switching shots can produce some special effects, such as enhancing the expression of emotion or heating up the atmosphere. However, while the visual storytelling technique is often used...
Preprint
The goal of this work is to provide a viable solution based on reinforcement learning for traffic signal control problems. Although the state-of-the-art reinforcement learning approaches have yielded great success in a variety of domains, directly applying them to alleviate traffic congestion can be challenging, considering the requirement of high sa...
Preprint
Full-text available
There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale d...
Preprint
Full-text available
State-of-the-art (SoTA) models have improved the accuracy of object detection with a large margin via a FP (feature pyramid). FP is a top-down aggregation to collect semantically strong features to improve scale invariance in both two-stage and one-stage detectors. However, this top-down pathway cannot preserve accurate object positions due to the...
Preprint
Full-text available
Neural networks have enabled state-of-the-art approaches to achieve incredible results on computer vision tasks such as object detection. However, such success greatly relies on costly computation resources, which hinders people with cheap devices from appreciating the advanced technology. In this paper, we propose Cross Stage Partial Network (CSPN...
Conference Paper
Mobile devices such as smart phones are ubiquitously being used to take photos and videos, thus increasing the importance of image deblurring. This study introduces a novel deep learning approach that can automatically and progressively achieve the task via adversarial blurred region mining and refining (adversarial BRMR). Starting with a collabora...
Article
Varying types of shots is a fundamental element in the language of film, commonly used by a visual storytelling director. The technique is often used in creating professional recordings of a live concert, but meanwhile may not be appropriately applied in audience recordings of the same event. Such variations could cause the task of classifying shot...
Article
Object proposal generation methods have been widely applied to many computer vision tasks. However, existing object proposal generation methods often suffer from the problems of motion blur, low contrast, deformation, etc., when they are applied to video related tasks. In this paper, we propose an effective and highly accurate target-specific objec...
Article
Full-text available
Despite recent progress, computational visual aesthetics is still challenging. Image cropping, which refers to the removal of unwanted scene areas, is an important step to improve the aesthetic quality of an image. However, it is challenging to evaluate whether cropping leads to aesthetically pleasing results because the assessment is typically subj...
Preprint
Despite recent progress, computational visual aesthetics is still challenging. Image cropping, which refers to the removal of unwanted scene areas, is an important step to improve the aesthetic quality of an image. However, it is challenging to evaluate whether cropping leads to aesthetically pleasing results because the assessment is typically subj...
Article
Person re-identification (person re-ID) aims at matching target person(s) grabbed from different and non-overlapping camera views. It plays an important role in public safety and has applications in various tasks such as human retrieval, human tracking, and activity analysis. In this paper, we propose a new network architecture called Hierarchical...
Conference Paper
Full-text available
An automated process that can suggest a soundtrack to a user-generated video (UGV) and make the UGV a music-compliant professional-like video is challenging but desirable. To this end, this paper presents an automatic music video (MV) generation system that conducts soundtrack recommendation and video editing simultaneously. Given a long UGV, it is...
Article
In recent years, visual saliency detection has become a popular research topic. It can provide useful prior knowledge for high-level vision tasks, such as object detection and image classification. In this paper, a graph-based superpixel-wise similarity called “homology similarity” is proposed, which describes how likely two superpixels belong to t...
Conference Paper
Multimedia content creation and manipulation have attracted great attention in recent years due to the popularity of mobile image and video capturing devices. In daily life, the most common subjects appearing in the captured content are people. Hence, to create images and videos by manipulating appearances or motions of the human character insid...
Conference Paper
Full-text available
Video-based group behavior analysis is drawing attention to its rich applications in sports, military, surveillance and biological observations. The recent advances in tracking techniques, based on either computer vision methodology or hardware sensors, further provide the opportunity of better solving this challenging task. Focusing specifically o...
Article
We present a general framework and working system for predicting likely affective responses of the viewers in the social media environment after an image is posted online. Our approach emphasizes a mid-level concept representation, in which the intended affects of the image publisher are characterized by a large pool of visual concepts (termed PACs) det...
Conference Paper
DroidExec is a novel root exploit recognition method that reduces the influence of wide variability, which arises from the varied properties of Android applications and usually affects the Android malware detection rate. In Android, a specific malware family (e.g., root exploit malware), and thus its implementation may be influenced by the campaign it is serving, an...
Article
Multimedia content creation and manipulation have garnered attention in recent days due to the desires of personalization. As a content producing application, we propose a novel idea that requires the fusion of video and audio intelligence. The system is composed of at least three core techniques: 1) the capability to process the video sequence to...
Article
We introduce a technique of calibrating camera motions in basketball videos. Our method particularly transforms player positions to standard basketball court coordinates and enables applications such as tactical analysis and semantic basketball video retrieval. To achieve a robust calibration, we reconstruct the panoramic basketball court from a vi...
Article
The recent advances in imaging devices have opened the opportunity of better solving the tasks of video content analysis and understanding. Next-generation cameras, such as the depth or binocular cameras, capture diverse information, and complement the conventional 2D RGB cameras. Thus, investigating the yielded multi-modal videos generally facilit...
Article
In this paper, we present a clustering approach, MK-SOM, that carries out cluster-dependent feature selection, and partitions images with multiple feature representations into clusters. This work is motivated by the observations that human visual systems (HVS) can receive various kinds of visual cues for interpreting the world. Images identified by...
Conference Paper
Full-text available
Aiming at accurate action video retrieval, we first present an approach that can infer the implicit skeleton structure for a query action, given as an RGB video, and then propose to expand this query with the inferred skeleton to improve retrieval performance. It is inspired by the observation that skeleton structures can compactly and eff...
Article
We present a novel two-pass framework for counting the number of people in an environment where multiple cameras provide different views of the subjects. By exploiting the complementary information captured by the cameras, we can transfer knowledge between the cameras to address the difficulties of people counting and improve the performance. The c...
Article
This work aims to develop a system for predicting age progression in children's faces from a small exemplar-image set, which is a critical task to assist in the search for missing children. The proposed method consists of a facial component extraction module, a facial component distance measurement module, and a face synthesis module. It is develop...
Conference Paper
The recent advances in RGB-D cameras have allowed us to better solve increasingly complex computer vision tasks. However, modern RGB-D cameras are still restricted by the short effective distances. The limitation may make RGB-D cameras not online accessible in practice, and degrade their applicability. We propose an alternative scenario to address...
Conference Paper
The recent advances in imaging devices have opened the opportunity of better solving computer vision tasks. The next-generation cameras, such as the depth or binocular cameras, capture diverse information, and complement the conventional 2D RGB cameras. Thus, investigating the yielded multi-modal images generally facilitates the accomplishment of r...
Conference Paper
Full-text available
Visual sentiment analysis is getting increasing attention because of the rapidly growing amount of images in online social interactions and several emerging applications such as online propaganda and advertisement. Recent studies have shown promising progress in analyzing visual affect concepts intended by the media content publisher. In contrast,...
Article
We propose a human motion extrapolation algorithm that synthesizes new motions of a human object in a still image from a given reference motion sequence. The algorithm is implemented in two major steps: contour manifold construction and object motion synthesis. Contour manifold construction searches for low-dimensional manifolds that represent the...
Article
Facial attributes have been shown to be effective for mining specific persons and profiling human activities in large-scale media such as surveillance videos or photo-sharing services. For comprehensive analysis, a large number of facial attributes is required. Generally, each attribute detector is obtained by supervised learning via the use of large training d...
Conference Paper
Most learning-based approaches to face detection suffer from the problem of performance degradation on faces that are not covered by training data. However, including all variations of faces in training is practically infeasible due to the scalability restriction of machine learning algorithms and expensive manual labeling. In this work, we focus o...
Article
Tensor completion, which is a high-order extension of matrix completion, has generated a great deal of research interest in recent years. Given a tensor with incomplete entries, existing methods use either factorization or completion schemes to recover the missing parts. However, as the number of missing entries increases, factorization schemes may...
Conference Paper
Full-text available
In this paper, we propose a novel framework to automatically perform player tracking and identification for sport videos filmed by a single pan-tilt-zoom camera from the court view. The proposed scheme is separated into three parts. The first part is to detect players by a deformable part model. The second part is to recognize jersey numbers by gra...
Article
Image forgery is becoming more prevalent in our daily lives due to advances in computers and image-editing software. As forgers develop more sophisticated forgeries, researchers must keep up to design more advanced ways of detecting these forgeries. Copy-move forgery is one type of image forgery where one region of an image is copied to another reg...
Conference Paper
The recent deployment of very large-scale camera networks consisting of fixed/moving surveillance cameras and vehicle video recorders has opened a novel field in object tracking. The major goal is to detect and track each vehicle within a large area, which can be applied to video forensics. For example, a suspected vehicle can be automatic...
Conference Paper
In this paper, we will investigate a more challenging vehicle matching problem. The problem is formulated as invariant image feature matching among opposite viewpoints of cameras, i.e. complementary object matching. For example, a front vehicle object may be given as a query to retrieve a rear vehicle object of the same vehicle. To solve the comple...
Article
In this paper, we present an automatic foreground object detection method for videos captured by freely moving cameras. While we focus on extracting a single foreground object of interest throughout a video sequence, our approach requires neither training data nor user interaction. Based on the SIFT correspondence across video frame...
Article
Recently, Ma et al. proposed an efficient error propagation-free discrete cosine transform-based (DCT-based) data hiding algorithm that embeds data in H.264/AVC intra frames. In their algorithm, only 46% of the 4 × 4 luma blocks can be used to embed hidden bits. In this paper, we propose an improved error propagation-free DCT-based perturbation sch...
Article
In this paper, we present an efficient RDH algorithm based on a new gradient-based edge direction prediction (GEDP) scheme. Since the proposed GEDP scheme can generate more accurate prediction results, the prediction errors tend to form a sharper Laplacian distribution. Therefore, the proposed algorithm can guarantee larger embedding capacity and p...
Conference Paper
We present a framework to count the number of people in an environment where multiple cameras with different angles of view are available. We consider the visual cues captured by each camera as a knowledge source, and carry out cross-camera knowledge transfer to alleviate the difficulties of people counting, such as partial occlusions, low-quality...
Conference Paper
An increasing number of users are contributing large numbers of group photos (e.g., of family, classmates, colleagues, etc.) to social media for the purpose of photo sharing and social communication. There is a strong need for automatically understanding the group types (e.g., family vs. classmates) for recommendation services (e.g., recommen...
Conference Paper
We aim to resolve the difficulties of action recognition arising from the large intra-class variations. These unfavorable variations make it infeasible to represent one action instance by other ones of the same action. We hence propose to extract both instance-specific and class-consistent features to facilitate action recognition. Specifically, th...
Conference Paper
This paper proposes a learning-based approach to increase the temporal resolutions of human motion sequences. Given a set of high resolution motion sequences, our idea is first to learn the motion tendency from this learning dataset and then synthesize new postures for the low-resolution sequence according to the learned motion tendency. We summari...
Article
In this paper, we present a novel edge sensing-based demosaicing algorithm for digital time delay and integration (DTDI) mosaic images, which are captured by DTDI line-scan cameras and suitable for industrial print inspection. We propose to use Sobel- and interpolation-based masks to extract more accurate gradient information in the color differenc...
Conference Paper
Sports video analysis has attracted great attention in recent years. In the past decade, numerous sports video indexing approaches have been proposed at different semantic levels. In this paper, an individual level sports video indexing (ILSVI) scheme is proposed. The individual level refers to the indexing of a sports video on a player basis, i.e....
Article
A reversible data hiding algorithm which uses prediction errors in the color difference domain for mosaic images with the Bayer color filter array (CFA) is proposed. Furthermore, the proposed algorithm can be extended to deal with the digital time delay and integration (DTDI) mosaic images and Lukac and Plataniotis (LP) mosaic images. Experimental...
Conference Paper
With the urgent demand in information security, biometric feature-based verification systems have been extensively explored in many application domains. However, the efficacy of existing biometric-based systems is unsatisfactory and there are still a lot of difficult problems to be solved. Among many existing biometric features, palmprint has been...
Conference Paper
This work aims to develop a system for predicting age progression in children's faces. Age progression prediction in children's faces is critical to assist in searching for missing children. An integral module including feature extraction, distance measurement, and face synthesis is devised in this paper to predict faces at different ages. In the proposed met...
Article
Full-text available
We propose a human object inpainting scheme that divides the process into three steps: 1) human posture synthesis; 2) graphical model construction; and 3) posture sequence estimation. Human posture synthesis is used to enrich the number of postures in the database, after which all the postures are used to build a graphical model that can estimate t...
Conference Paper
In this paper, we propose a new framework to synthesize human motions based on only one single posture given in the input image. To generate visually pleasing motion sequences, the proposed framework consists of two key techniques. One is motion retrieval, which retrieves reference motions from a human motion database on a low-dimensional motion ma...
Conference Paper
Full-text available
Leveraging community-contributed data (e.g., blogs, GPS logs, and geo-tagged photos) for travel recommendation is an active research topic, since there are rich contexts and trip activities in such explosively growing data. In this work, we focus on personalized travel recommendation by leveraging the freely available community-contributed photo...
Article
We introduce a novel approach to cross camera people counting that can adapt itself to a new environment without the need of manual inspection. The proposed counting model is composed of a pair of collaborative Gaussian processes (GP), which are respectively designed to count people by taking the visible and occluded parts into account. While the f...
Conference Paper
The objective of this research is to design a new JPEG-based compression scheme which simultaneously considers the security issue. Our method starts by dividing an image into non-overlapping blocks of size 8×8. Among these blocks, some are used as reference blocks and the rest are used as query blocks. A query block is the combination of the resid...
Article
Full-text available
Video inpainting is an important video enhancement technique used to facilitate the repair or editing of digital videos. It has been employed worldwide to transform cultural artifacts such as vintage videos/films into digital formats. However, the quality of such videos is usually very poor, and they often contain unstable luminance and damaged content....
Conference Paper
Full-text available
Most Web videos are captured in uncontrolled environments (e.g. videos captured by freely-moving cameras with low resolution); this makes automatic video annotation very difficult. To address this problem, we present a robust moving foreground object detection method followed by the integration of features collected from heterogeneous domains. We a...
Article
Full-text available
This paper presents a novel framework for object completion in a video. To complete an occluded object, our method first samples a 3-D volume of the video into directional spatio-temporal slices, and performs patch-based image inpainting to complete the partially damaged object trajectories in the 2-D slices. The completed slices are then combined...
Article
Digital time delay and integration (DTDI) mosaic video sequences captured by high-speed DTDI line-scan cameras are commonly used in industrial print inspection and high-speed capture applications. To reduce the memory requirement for saving these video sequences, it is necessary to compress them. In this paper, we present an efficient chroma subsam...
Conference Paper
Facial attributes such as gender, race, age, hair style, etc., carry rich information for locating designated persons and profiling the communities from image/video collections (e.g., surveillance videos or photo albums). For plentiful facial attributes in photos and videos, collecting costly manual annotations for training detectors is time-consum...
Conference Paper
Storytelling and narrative creation are very popular research issues in the field of interactive media design. In this paper, we propose a framework for generating video narrative from existing videos in which the user only needs to be involved in two steps: (1) select background video and avatars; (2) set up the movement and trajectory of avatars. To generat...
Book
This two-volume proceedings constitutes the refereed papers of the 17th International Multimedia Modeling Conference, MMM 2011, held in Taipei, Taiwan, in January 2011. The 51 revised regular papers, 25 special session papers, 21 poster session papers, and 3 demo session papers, were carefully reviewed and selected from 450 submissions. The papers...
Conference Paper
Full-text available
This paper proposes a two-step prototype-face-based scheme of hallucinating the high-resolution detail of a low-resolution input face image. The proposed scheme is mainly composed of two steps: the global estimation step and the local facial-parts refinement step. In the global estimation step, the initial high-resolution face image is hallucinated...
