Yong Man Ro's research while affiliated with Daejeon Institute Science and Technology and other places

Publications (427)

Preprint
This paper focuses on designing a noise-robust end-to-end Audio-Visual Speech Recognition (AVSR) system. To this end, we propose a Visual Context-driven Audio Feature Enhancement module (V-CAFE) to enhance the noisy input audio speech with the help of audio-visual correspondence. The proposed V-CAFE is designed to capture the transition of lip movement...
Article
Recognizing speech from silent lip movement, which is called lip reading, is a challenging task due to 1) the inherent information insufficiency of lip movement to fully represent the speech, and 2) the existence of homophenes that have similar lip movement with different pronunciations. In this paper, we try to alleviate the aforementioned two cha...
Article
Recently, automated surveillance cameras can switch between a visible sensor and a thermal sensor for all-day operation. However, existing single-modal pedestrian detectors mainly focus on detecting pedestrians in only one specific modality (i.e., visible or thermal), so they cannot cope with other modal inputs. In addition, recent multispectral pedestrian...
Article
The challenge of talking face generation from speech lies in aligning information from two different modalities, audio and video, such that the mouth region corresponds to the input audio. Previous methods either exploit audio-visual representation learning or leverage intermediate structural information such as landmarks and 3D models. However, they struggle to...
Preprint
The goal of this work is to reconstruct speech from a silent talking face video. Recent studies have shown impressive performance on synthesizing speech from silent talking face videos. However, they have not explicitly considered the varying identity characteristics of different speakers, which pose a challenge in video-to-speech synthesis, an...
Preprint
Object detection has attracted great attention in the computer vision area and has emerged as an indispensable component in many vision systems. In the era of deep learning, many high-performance object detection networks have been proposed. Although these detection networks show high performance, they are vulnerable to adversarial patch attacks. C...
Preprint
Adversarial examples provoke weak reliability and potential security issues in deep neural networks. Although adversarial training has been widely studied to improve adversarial robustness, it works in an over-parameterized regime and requires high computations and large memory budgets. To bridge adversarial robustness and model compression, we pro...
Preprint
Adversarial examples, generated by carefully crafted perturbation, have attracted considerable attention in research fields. Recent works have argued that the existence of the robust and non-robust features is a primary cause of the adversarial examples, and investigated their internal interactions in the feature space. In this paper, we propose a...
Preprint
In this paper, we introduce a novel audio-visual multi-modal bridging framework that can utilize both audio and visual information, even with uni-modal inputs. We exploit a memory network that stores source (i.e., visual) and target (i.e., audio) modal representations, where source modal representation is what we are given, and target modal represe...
Preprint
In this paper, we propose a novel lip-to-speech generative adversarial network, Visual Context Attentional GAN (VCA-GAN), which can jointly model local and global lip movements during speech synthesis. Specifically, the proposed VCA-GAN synthesizes the speech from local lip visual features by finding a mapping function of viseme-to-phoneme, while g...
Preprint
Recognizing speech from silent lip movement, which is called lip reading, is a challenging task due to 1) the inherent information insufficiency of lip movement to fully represent the speech, and 2) the existence of homophenes that have similar lip movement with different pronunciations. In this paper, we try to alleviate the aforementioned two cha...
Article
Along with the outstanding performance of deep neural networks (DNNs), considerable research efforts have been devoted to finding ways to understand the decisions of DNN structures. In the computer vision domain, visualizing the attribution map is one of the most intuitive and understandable ways to achieve human-level interpretation. Among the...
Article
The goal of this work is to reconstruct speech from silent video, in both speaker dependent and speaker independent ways. Unlike previous works that have been mostly restricted to a speaker dependent setting, we propose Visual Voice memory to restore essential auditory information to generate proper speech from different speakers and even unseen sp...
Article
Visual Speech Recognition (VSR) is a task that recognizes speech from external appearances of the face (i.e., lips) into text. Since the information from the visual lip movements is not sufficient to fully represent the speech, VSR is considered as one of the challenging problems. One possible way to resolve this problem is additionally utilizing a...
Article
Recently, VR sickness assessment for VR videos has been in high demand in industry and research fields to address VR viewing safety issues. In particular, it is difficult to evaluate the VR sickness of individuals due to individual differences. To achieve this challenging goal, we focus on deep feature fusion of sickness-related information. In this paper, we p...
Article
Multispectral pedestrian detection has received great attention in recent years as multispectral modalities (i.e., color and thermal) can provide complementary visual information. However, there are major inherent issues in multispectral pedestrian detection. First, the cameras of the two modalities have different fields of view (FoVs), so that imag...
Article
Although the hyperspectral image (HSI) classification has adopted deep neural networks (DNNs) and shown remarkable performances, there is a lack of studies of the adversarial vulnerability for the HSI classifications. In this paper, we propose a novel HSI classification framework robust to adversarial attacks. To this end, we focus on the unique sp...
Preprint
We address the black-box issue of VR sickness assessment (VRSA) by evaluating the level of physical symptoms of VR sickness. For VR content inducing a similar VR sickness level, the physical symptoms can vary depending on the characteristics of the content. Most existing VRSA methods have focused on assessing the overall VR sickness score. To...
Preprint
Depth adjustment aims to enhance the visual experience of stereoscopic 3D (S3D) images, which is accompanied by improved visual comfort and depth perception. For a human expert, the depth adjustment procedure is a sequence of iterative decision making. The human expert iteratively adjusts the depth until he is satisfied with both levels of visu...
Preprint
Our work addresses long-term motion context issues for predicting future frames. To predict the future precisely, it is required to capture which long-term motion context (e.g., walking or running) the input motion (e.g., leg movement) belongs to. The bottlenecks arising when dealing with the long-term motion context are: (i) how to predict the lon...
Chapter
With the development of deep neural networks, multispectral pedestrian detection has received great attention by exploiting complementary properties of multiple modalities (e.g., color-visible and thermal modalities). Previous works usually rely on network prediction scores in combining complementary modal information. However, it is widely...
Chapter
This paper presents a new version of the Interactive VIdeo Search Tool (IVIST), a video retrieval tool, for participation in the Video Browser Showdown (VBS) 2021. In the previous IVIST (VBS 2020), there were core functions to search for videos practically, such as object detection, scene-text recognition, and dominant-color finding. Including...
Article
Recently, a wide range of research on object detection has shown breakthrough performance. However, in a challenging environment, such as occlusion and small object cases, object detectors still produce inaccurate or erroneous predictions. To effectively cope with such conditions, most of the existing methods have suggested loss functions to guide...
Conference Paper
Recently, cybersickness assessment for VR content is required to deal with viewing safety issues. Assessing physical symptoms of individual viewers is challenging but important to provide detailed and personalized guides for viewing safety. In this paper, we propose a novel symptom-aware cybersickness assessment network (SACA Net) that quantifies p...
Preprint
Recent advances in facial expression synthesis have shown promising results using diverse expression representations including facial action units. Facial action units for an elaborate facial expression synthesis need to be intuitively represented for human comprehension, not a numeric categorization of facial action units. To address this issue, w...
Article
Facial recognition for surveillance applications still remains challenging in uncontrolled environments, especially with the appearances of masks/veils and different ethnicities effects. Multimodal facial biometrics recognition becomes one of the major studies to overcome such scenarios. However, to cooperate with multimodal facial biometrics, many...
Preprint
To combat adversarial attacks, the autoencoder structure is widely used to perform denoising, which is regarded as gradient masking. In this paper, we revisit the role of autoencoders in adversarial settings. Through comprehensive experimental results and analysis, this paper presents the inherent property of adversarial robustness in the au...
Preprint
Recent studies have shown that ensemble approaches can not only improve accuracy but also estimate model uncertainty in deep learning. However, they require a large number of parameters as the number of ensemble models increases for better prediction and uncertainty estimation. To address this issue, a generic and efficient segmentation fram...
Preprint
Deep neural networks have made substantial achievements in several computer vision areas, but they are vulnerable and are often fooled by adversarial examples that humans cannot recognize. This is an important issue for security or medical applications. In this paper, we propose an ensemble model training framework with random layer samp...
Preprint
The success of multimodal data fusion in deep learning appears to be attributed to the use of complementary information between multiple input data. Compared to their predictive performance, relatively less attention has been devoted to the robustness of multimodal fusion models. In this paper, we investigated whether the current multimodal fusion...
Article
Video frame interpolation has increasingly attracted attention in computer vision and video processing fields. When motion patterns in a video are complex, large and non-linear (exceptional motion), the generated intermediate frame is blurred and likely to have large artifacts. In this paper, we propose a novel video frame interpolation considering...
Article
Satellite image prediction is important in weather nowcasting. In this article, we propose a novel multichannel satellite image prediction network (MCSIP Net) for predicting satellite images. The proposed MCSIP Net consists of three parts: the satellite image predictor, the spatio-temporal 3-D discriminators, and the domain knowledge critic...
Article
Encoding the facial expression dynamics is efficient in classifying and recognizing facial expressions. Most facial dynamics-based methods assume that a sequence is temporally segmented before prediction. This requires the prediction to wait until a full sequence is available, resulting in prediction delay. To reduce the prediction delay and enable...
Article
Long short-term memory (LSTM) is a type of recurrent neural network that is efficient for encoding spatio-temporal features in dynamic sequences. Recent work has shown that the LSTM retains information related to the mode of variation in the input dynamic sequence, which reduces the discriminability of the encoded features. To encode features robus...
Chapter
This paper presents a new video retrieval tool, Interactive VIdeo Search Tool (IVIST), which participates in the 2020 Video Browser Showdown (VBS). As a video retrieval tool, IVIST is equipped with proper and high-performing functionalities such as object detection, dominant-color finding, scene-text recognition and text-image retrieval. These func...
Chapter
This paper introduces a video retrieval tool for the 2020 Video Browser Showdown (VBS). The tool enhances the user’s video browsing experience by ensuring full use of video analysis database constructed prior to the Showdown. Deep learning based object detection, scene text detection, scene color detection, audio classification and relation detecti...
Chapter
Human facial expression plays a key role in understanding social behavior. Many deep learning approaches present facial emotion recognition and automatic image captioning considering human sentiments. However, most current deep learning models for facial expression analysis do not contain comprehensive, detailed information of a single...
Chapter
The ambiguity of the decision-making process has been pointed out as the main obstacle to practically applying the deep learning-based method in spite of its outstanding performance. Interpretability can guarantee the confidence of the deep learning system; therefore, it is particularly important in the medical field. In this study, a novel deep net...
Article
Abnormal event detection is an important task in video surveillance systems. In this paper, we propose a novel bidirectional multi-scale aggregation networks (BMAN) for abnormal event detection. The proposed BMAN learns spatiotemporal patterns of normal events to detect deviations from the learned normal patterns as abnormalities. The BMAN consists...
Chapter
Generating realistic breast masses is a highly important task because large databases of annotated breast masses are scarcely available. In this study, a novel realistic breast mass generation framework using the characteristics of the breast mass (i.e., BIRADS category) has been devised. For that purpose, the visual-semantic BIRADS descripti...
Article
Although there is an abundance of current research on facial recognition, it still faces significant challenges that are related to variations in factors such as aging, poses, occlusions, resolution, and appearances. In this paper, we propose a Multi-feature Deep Learning Network (MDLN) architecture that uses modalities from the facial and periocul...
Article
Spatio-temporal feature encoding is essential for encoding the dynamics in video sequences. Recurrent neural networks, particularly long short-term memory (LSTM) units, have been popular as an efficient tool for encoding spatio-temporal features in sequences. In this work, we investigate the effect of mode variations on the encoded spatio-temporal...
Preprint
Realistic image synthesis aims to generate an image that is perceptually indistinguishable from an actual image. Generating realistic-looking images with large variations (e.g., large spatial deformations and large pose change), however, is very challenging. Handling large variations as well as preserving appearance needs to be taken into account in t...
Preprint
The ambiguity of the decision-making process has been pointed out as the main obstacle to applying the deep learning-based method in a practical way in spite of its outstanding performance. Interpretability could guarantee the confidence of the deep learning system; therefore, it is particularly important in the medical field. In this study, a novel dee...
Article
Purpose: Transvaginal ultrasound imaging provides useful information for diagnosing endometrial pathologies and reproductive health. Endometrium segmentation in transvaginal ultrasound (TVUS) images is very challenging due to ambiguous boundaries and heterogeneous textures. In this study, we developed a new segmentation framework which provides ro...
Article
Object detection has received significant interest in the research field of computer vision and is widely used in human-centric applications. The occlusion problem is a frequent obstacle that degrades detection quality. In this paper, we propose a novel object detection framework targeting robust object detection in occlusion. The proposed deep learni...
Article
In this paper, we propose a novel deep learning-based virtual reality image quality assessment method that automatically predicts the visual quality of an omnidirectional image. In order to assess the visual quality in viewing the omnidirectional image, we propose deep networks consisting of VR quality score predictor and human perception guider. Th...
Article
Facial landmark detection plays an important role in face analysis tasks. Moreover, because it is used as a prerequisite in many face-related applications, simplicity as well as effectiveness is essential in facial landmark detection. In this paper, we propose an effective facial landmark detection network and associated learning framework with t...
Chapter
This paper deals with a method for generating realistic labeled masses. Recently, there have been many attempts to apply deep learning to various bio-image computing fields including computer-aided detection and diagnosis. In order to train a deep network model that behaves well in bio-image computing fields, a lot of labeled data is required. Howev...