
Hong-yuan Mark Liao
- Ph.D. Northwestern University
- Director at Academia Sinica
About
284 Publications
120,432 Reads
23,971 Citations
Introduction
My current research focuses on multimedia information processing and machine learning.
Additional affiliations
August 2014 - July 2016
August 2013 - July 2015
September 2012 - June 2024
Publications (284)
Identifying and localizing objects within images is a fundamental challenge, and numerous efforts have been made to enhance model accuracy by experimenting with diverse architectures and refining training strategies. Nevertheless, a prevalent limitation in existing models is overemphasizing the current input while ignoring the information from the...
This is a comprehensive review of the YOLO series of systems. Different from previous literature surveys, this review article re-examines the characteristics of the YOLO series from the latest technical point of view. At the same time, we also analyzed how the YOLO series continued to influence and promote real-time computer vision-related research...
We propose a post-processor, called NeighborTrack, that leverages neighbor information of the tracking target to validate and improve single-object tracking (SOT) results. It requires no additional data or retraining. Instead, it uses the confidence score predicted by the backbone SOT network to automatically derive neighbor information and then us...
Designing a high-efficiency and high-quality expressive network architecture has always been the most important research topic in the field of deep learning. Most of today's network design strategies focus on how to integrate features extracted from different layers, and how to design computing units to effectively extract these features, thereby e...
The paper presents a new method, SearchTrack, for multiple object tracking and segmentation (MOTS). To address the association problem between detected objects, SearchTrack proposes object-customized search and motion-aware features. By maintaining a Kalman filter for each object, we encode the predicted motion into the motion-aware feature, which...
YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 160 FPS and has the highest accuracy 56.8% AP among all known real-time object detectors with 30 FPS or higher on GPU V100. YOLOv7-E6 object detector (56 FPS V100, 55.9% AP) outperforms both transformer-based detector SWIN-L Cascade-Mask R-CNN (9.2 FPS...
Purpose: Retinopathy screening via digital imaging is promising for early detection and timely treatment, and tracking retinopathic abnormality over time can help to reveal the risk of disease progression. We developed an innovative physician-oriented artificial intelligence-facilitating diagnosis aid system for retinal diseases for screening multi...
Learning to capture human motion is essential to 3D human pose and shape estimation from monocular video. However, the existing methods mainly rely on recurrent or convolutional operation to model such temporal information, which limits the ability to capture non-local context relations of human motion. To address this problem, we propose a motion...
In this paper, we propose an end-to-end key-player-based group activity recognition network specially applied to the identification of basketball offensive tactics in limited data scenarios. Our previous studies show that basketball tactics can be better recognized via key player detection with multiple instance learning (MIL) using the support vec...
People “understand” the world via vision, hearing, touch, and past experience. Human experience can be learned through normal learning (we call it explicit knowledge), or subconsciously (we call it implicit knowledge). These experiences learned through normal learning or subconsciously will be encoded and stored in the brain. Using the...
We show that the YOLOv4 object detection neural network based on the CSP approach scales both up and down and is applicable to small and large networks while maintaining optimal speed and accuracy. We propose a network scaling approach that modifies not only the depth, width, and resolution, but also the structure of the network. YOLOv4-large model achiev...
An experienced director usually switches among different types of shots to make visual storytelling more touching. When filming a musical performance, appropriate switching shots can produce some special effects, such as enhancing the expression of emotion or heating up the atmosphere. However, while the visual storytelling technique is often used...
The goal of this work is to provide a viable solution based on reinforcement learning for traffic signal control problems. Although the state-of-the-art reinforcement learning approaches have yielded great success in a variety of domains, directly applying it to alleviate traffic congestion can be challenging, considering the requirement of high sa...
There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale d...
State-of-the-art (SoTA) models have improved the accuracy of object detection by a large margin via a feature pyramid (FP). FP is a top-down aggregation to collect semantically strong features to improve scale invariance in both two-stage and one-stage detectors. However, this top-down pathway cannot preserve accurate object positions due to the...
Neural networks have enabled state-of-the-art approaches to achieve incredible results on computer vision tasks such as object detection. However, such success greatly relies on costly computation resources, which hinders people with cheap devices from appreciating the advanced technology. In this paper, we propose Cross Stage Partial Network (CSPN...
Mobile devices such as smart phones are ubiquitously being used to take photos and videos, thus increasing the importance of image deblurring. This study introduces a novel deep learning approach that can automatically and progressively achieve the task via adversarial blurred region mining and refining (adversarial BRMR). Starting with a collabora...
Varying types of shots is a fundamental element in the language of film, commonly used by a visual storytelling director. The technique is often used in creating professional recordings of a live concert, but meanwhile may not be appropriately applied in audience recordings of the same event. Such variations could cause the task of classifying shot...
Object proposal generation methods have been widely applied to many computer vision tasks. However, existing object proposal generation methods often suffer from the problems of motion blur, low contrast, deformation, etc., when they are applied to video related tasks. In this paper, we propose an effective and highly accurate target-specific objec...
Despite recent progress, computational visual aesthetic is still challenging. Image cropping, which refers to the removal of unwanted scene areas, is an important step to improve the aesthetic quality of an image. However, it is challenging to evaluate whether cropping leads to aesthetically pleasing results because the assessment is typically subj...
Person re-identification (person re-ID) aims at matching target person(s) captured from different, non-overlapping camera views. It plays an important role in public safety and has applications in various tasks such as human retrieval, human tracking, and activity analysis. In this paper, we propose a new network architecture called Hierarchical...
An automated process that can suggest a soundtrack to a user-generated video (UGV) and make the UGV a music-compliant professional-like video is challenging but desirable. To this end, this paper presents an automatic music video (MV) generation system that conducts soundtrack recommendation and video editing simultaneously. Given a long UGV, it is...
In recent years, visual saliency detection has become a popular research topic. It can provide useful prior knowledge for high-level vision tasks, such as object detection and image classification. In this paper, a graph-based superpixel-wise similarity called “homology similarity” is proposed, which describes how likely two superpixels belong to t...
Multimedia content creation and manipulation have attracted great attention in recent years due to the popularity of mobile image and video capturing devices. In our daily life, the most common subjects appearing in the captured content are people. Hence, to create images and videos by manipulating appearances or motions of the human character insid...
Video-based group behavior analysis is drawing attention to its rich applications in sports, military, surveillance and biological observations. The recent advances in tracking techniques, based on either computer vision methodology or hardware sensors, further provide the opportunity of better solving this challenging task. Focusing specifically o...
We present a general framework and working system for predicting likely affective responses of the viewers in the social media environment after an image is posted online. Our approach emphasizes a mid-level concept representation, in which intended affects of the image publisher are characterized by a large pool of visual concepts (termed PACs) det...
DroidExec is a novel root exploit recognition method that reduces the influence of wide variability, which usually affects the Android malware detection rate because of the various properties of Android applications. In Android, a specific malware family (e.g., root exploit malware), and thus its implementation may be influenced by the campaign it is serving, an...
Multimedia content creation and manipulation have garnered attention in recent days due to the desires of personalization. As a content producing application, we propose a novel idea that requires the fusion of video and audio intelligence. The system is composed of at least three core techniques: 1) the capability to process the video sequence to...
We introduce a technique of calibrating camera motions in basketball videos. Our method particularly transforms player positions to standard basketball court coordinates and enables applications such as tactical analysis and semantic basketball video retrieval. To achieve a robust calibration, we reconstruct the panoramic basketball court from a vi...
The recent advances in imaging devices have opened the opportunity of better solving the tasks of video content analysis and understanding. Next-generation cameras, such as the depth or binocular cameras, capture diverse information, and complement the conventional 2D RGB cameras. Thus, investigating the yielded multi-modal videos generally facilit...
In this paper, we present a clustering approach, MK-SOM, that carries out cluster-dependent feature selection, and partitions images with multiple feature representations into clusters. This work is motivated by the observations that human visual systems (HVS) can receive various kinds of visual cues for interpreting the world. Images identified by...
Aiming at accurate action video retrieval, we first present an approach that can infer the implicit skeleton structure for a query action, an RGB video, and then propose to expand this query with the inferred skeleton to improve retrieval performance. It is inspired by the observation that skeleton structures can compactly and eff...
We present a novel two-pass framework for counting the number of people in an environment where multiple cameras provide different views of the subjects. By exploiting the complementary information captured by the cameras, we can transfer knowledge between the cameras to address the difficulties of people counting and improve the performance. The c...
This work aims to develop a system for predicting age progression in children's faces from a small exemplar-image set, which is a critical task to assist in the search for missing children. The proposed method consists of a facial component extraction module, a facial component distance measurement module, and a face synthesis module. It is develop...
The recent advances in RGB-D cameras have allowed us to better solve increasingly complex computer vision tasks. However, modern RGB-D cameras are still restricted by the short effective distances. The limitation may make RGB-D cameras not online accessible in practice, and degrade their applicability. We propose an alternative scenario to address...
The recent advances in imaging devices have opened the opportunity of better solving computer vision tasks. The next-generation cameras, such as the depth or binocular cameras, capture diverse information, and complement the conventional 2D RGB cameras. Thus, investigating the yielded multi-modal images generally facilitates the accomplishment of r...
Visual sentiment analysis is getting increasing attention because of the rapidly growing amount of images in online social interactions and several emerging applications such as online propaganda and advertisement. Recent studies have shown promising progress in analyzing visual affect concepts intended by the media content publisher. In contrast,...
We propose a human motion extrapolation algorithm that synthesizes new motions of a human object in a still image from a given reference motion sequence. The algorithm is implemented in two major steps: contour manifold construction and object motion synthesis. Contour manifold construction searches for low-dimensional manifolds that represent the...
Facial attributes are shown effective for mining specific persons and profiling human activities in large-scale media such as surveillance videos or photo-sharing services. For comprehensive analysis, a rich number of facial attributes is required. Generally, each attribute detector is obtained by supervised learning via the use of large training d...
Most learning-based approaches to face detection suffer from the problem of performance degradation on faces that are not covered by training data. However, including all variations of faces in training is practically infeasible due to the scalability restriction of machine learning algorithms and expensive manual labeling. In this work, we focus o...
Tensor completion, which is a high-order extension of matrix completion, has generated a great deal of research interest in recent years. Given a tensor with incomplete entries, existing methods use either factorization or completion schemes to recover the missing parts. However, as the number of missing entries increases, factorization schemes may...
In this paper, we propose a novel framework to automatically perform player tracking and identification for sport videos filmed by a single pan-tilt-zoom camera from the court view. The proposed scheme is separated into three parts. The first part is to detect players by a deformable part model. The second part is to recognize jersey numbers by gra...
Image forgery is becoming more prevalent in our daily lives due to advances in computers and image-editing software. As forgers develop more sophisticated forgeries, researchers must keep up to design more advanced ways of detecting these forgeries. Copy-move forgery is one type of image forgery where one region of an image is copied to another reg...
The recent deployment of very large-scale camera networks consisting of fixed/moving surveillance cameras and vehicle video recorders, has led to a novel field in object tracking problem. The major goal is to detect and track each vehicle within a large area, which can be applied to video forensics. For example, a suspected vehicle can be automatic...
In this paper, we will investigate a more challenging vehicle matching problem. The problem is formulated as invariant image feature matching among opposite viewpoints of cameras, i.e. complementary object matching. For example, a front vehicle object may be given as a query to retrieve a rear vehicle object of the same vehicle. To solve the comple...
In this paper, we present an automatic foreground object detection method for videos captured by freely moving cameras. While we focus on extracting a single foreground object of interest throughout a video sequence, our approach requires neither training data nor user interaction. Based on the SIFT correspondence across video frame...
Recently, Ma et al. proposed an efficient error propagation-free discrete cosine transform-based (DCT-based) data hiding algorithm that embeds data in H.264/AVC intra frames. In their algorithm, only 46% of the 4 × 4 luma blocks can be used to embed hidden bits. In this paper, we propose an improved error propagation-free DCT-based perturbation sch...
In this paper, we present an efficient RDH algorithm based on a new gradient-based edge direction prediction (GEDP) scheme. Since the proposed GEDP scheme can generate more accurate prediction results, the prediction errors tend to form a sharper Laplacian distribution. Therefore, the proposed algorithm can guarantee larger embedding capacity and p...
We present a framework to count the number of people in an environment where multiple cameras with different angles of view are available. We consider the visual cues captured by each camera as a knowledge source, and carry out cross-camera knowledge transfer to alleviate the difficulties of people counting, such as partial occlusions, low-quality...
An increasing number of users are contributing large numbers of group photos (e.g., for family, classmates, colleagues, etc.) on social media for the purpose of photo sharing and social communication. There is a strong need for automatically understanding the group types (e.g., family vs. classmates) for recommendation services (e.g., recommen...
We aim to resolve the difficulties of action recognition arising from the large intra-class variations. These unfavorable variations make it infeasible to represent one action instance by other ones of the same action. We hence propose to extract both instance-specific and class-consistent features to facilitate action recognition. Specifically, th...
This paper proposes a learning-based approach to increase the temporal resolutions of human motion sequences. Given a set of high resolution motion sequences, our idea is first to learn the motion tendency from this learning dataset and then synthesize new postures for the low-resolution sequence according to the learned motion tendency. We summari...
In this paper, we present a novel edge sensing-based demosaicing algorithm for digital time delay and integration (DTDI) mosaic images, which are captured by DTDI line-scan cameras and suitable for industrial print inspection. We propose to use Sobel- and interpolation-based masks to extract more accurate gradient information in the color differenc...
Sports video analysis has attracted great attention in recent years. In the past decade, numerous sports video indexing approaches have been proposed at different semantic levels. In this paper, an individual level sports video indexing (ILSVI) scheme is proposed. The individual level refers to the indexing of a sports video on a player basis, i.e....
A reversible data hiding algorithm which uses prediction errors in the color difference domain for mosaic images with the Bayer color filter array (CFA) is proposed. Furthermore, the proposed algorithm can be extended to deal with the digital time delay and integration (DTDI) mosaic images and Lukac and Plataniotis (LP) mosaic images. Experimental...
With the urgent demand in information security, biometric feature-based verification systems have been extensively explored in many application domains. However, the efficacy of existing biometric-based systems is unsatisfactory and there are still a lot of difficult problems to be solved. Among many existing biometric features, palmprint has been...
This work aims to develop a system for predicting age progression in children's faces. Age progression prediction in children's faces is critical to assist in the search for missing children. An integral module including feature extraction, distance measurement, and face synthesis is devised in this paper to predict faces at different ages. In the proposed met...
We propose a human object inpainting scheme that divides the process into three steps: 1) human posture synthesis; 2) graphical model construction; and 3) posture sequence estimation. Human posture synthesis is used to enrich the number of postures in the database, after which all the postures are used to build a graphical model that can estimate t...
In this paper, we propose a new framework to synthesize human motions based on only one single posture given in the input image. To generate visually pleasing motion sequences, the proposed framework consists of two key techniques. One is motion retrieval, which retrieves reference motions from a human motion database on a low-dimensional motion ma...
Leveraging community-contributed data (e.g., blogs, GPS logs, and geo-tagged photos) for travel recommendation is an active research area, since such explosively growing data contain rich contexts and trip activities. In this work, we focus on personalized travel recommendation by leveraging the freely available community-contributed photo...
We introduce a novel approach to cross camera people counting that can adapt itself to a new environment without the need of manual inspection. The proposed counting model is composed of a pair of collaborative Gaussian processes (GP), which are respectively designed to count people by taking the visible and occluded parts into account. While the f...
The objective of this research is to design a new JPEG-based compression scheme which simultaneously considers the security issue. Our method starts from dividing image into non-overlapping blocks with size 8×8. Among these blocks, some are used as reference blocks and the rest are used as query blocks. A query block is the combination of the resid...
Video inpainting is an important video enhancement technique used to facilitate the repair or editing of digital videos. It has been employed worldwide to transform cultural artifacts such as vintage videos/films into digital formats. However, the quality of such videos is usually very poor, and they often contain unstable luminance and damaged content....
Most Web videos are captured in uncontrolled environments (e.g. videos captured by freely-moving cameras with low resolution); this makes automatic video annotation very difficult. To address this problem, we present a robust moving foreground object detection method followed by the integration of features collected from heterogeneous domains. We a...
This paper presents a novel framework for object completion in a video. To complete an occluded object, our method first samples a 3-D volume of the video into directional spatio-temporal slices, and performs patch-based image inpainting to complete the partially damaged object trajectories in the 2-D slices. The completed slices are then combined...
Digital time delay and integration (DTDI) mosaic video sequences captured by high-speed DTDI line-scan cameras are commonly used in industrial print inspection and high-speed capture applications. To reduce the memory requirement for saving these video sequences, it is necessary to compress them. In this paper, we present an efficient chroma subsam...
Facial attributes such as gender, race, age, hair style, etc., carry rich information for locating designated persons and profiling the communities from image/video collections (e.g., surveillance videos or photo albums). For plentiful facial attributes in photos and videos, collecting costly manual annotations for training detectors is time-consum...
Storytelling and narrative creation are very popular research issues in the field of interactive media design. In this paper, we propose a framework for generating video narratives from existing videos in which the user only needs to be involved in two steps: (1) select the background video and avatars; (2) set up the movement and trajectory of the avatars. To generat...
This two-volume proceedings constitutes the refereed papers of the 17th International Multimedia Modeling Conference, MMM 2011, held in Taipei, Taiwan, in January 2011. The 51 revised regular papers, 25 special session papers, 21 poster session papers, and 3 demo session papers, were carefully reviewed and selected from 450 submissions. The papers...
This paper proposes a two-step prototype-face-based scheme of hallucinating the high-resolution detail of a low-resolution input face image. The proposed scheme is mainly composed of two steps: the global estimation step and the local facial-parts refinement step. In the global estimation step, the initial high-resolution face image is hallucinated...