Yong Man Ro's research while affiliated with Daejeon Institute Science and Technology and other places
What is this page?
This page lists the scientific contributions of an author, who either does not have a ResearchGate profile, or has not yet added these contributions to their profile.
It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.
If you're a ResearchGate member, you can follow this page to keep up with this author's work.
If you are this author, and you don't want us to display this page anymore, please let us know.
It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.
If you're a ResearchGate member, you can follow this page to keep up with this author's work.
If you are this author, and you don't want us to display this page anymore, please let us know.
Publications (427)
This paper focuses on designing a noise-robust end-to-end Audio-Visual Speech Recognition (AVSR) system. To this end, we propose Visual Context-driven Audio Feature Enhancement module (V-CAFE) to enhance the input noisy audio speech with a help of audio-visual correspondence. The proposed V-CAFE is designed to capture the transition of lip movement...
Recognizing speech from silent lip movement, which is called lip reading, is a challenging task due to 1) the inherent information insufficiency of lip movement to fully represent the speech, and 2) the existence of homophenes that have similar lip movement with different pronunciations. In this paper, we try to alleviate the aforementioned two cha...
Recently, automated surveillance cameras can change a visible sensor and a thermal sensor for all-day operation. However, existing single-modal pedestrian detectors mainly focus on detecting pedestrians in only one specific modality (i.e., visible or thermal), so they cannot cope with other modal inputs. In addition, recent multispectral pedestrian...
The challenge of talking face generation from speech lies in aligning two different modal information, audio and video, such that the mouth region corresponds to input audio. Previous methods either exploit audio-visual representation learning or leverage intermediate structural information such as landmarks and 3D models. However, they struggle to...
The goal of this work is to reconstruct speech from a silent talking face video. Recent studies have shown impressive performance on synthesizing speech from silent talking face videos. However, they have not explicitly considered on varying identity characteristics of different speakers, which place a challenge in the video-to-speech synthesis, an...
Object detection has attracted great attention in the computer vision area and has emerged as an indispensable component in many vision systems. In the era of deep learning, many high-performance object detection networks have been proposed. Although these detection networks show high performance, they are vulnerable to adversarial patch attacks. C...
Adversarial examples provoke weak reliability and potential security issues in deep neural networks. Although adversarial training has been widely studied to improve adversarial robustness, it works in an over-parameterized regime and requires high computations and large memory budgets. To bridge adversarial robustness and model compression, we pro...
Adversarial examples, generated by carefully crafted perturbation, have attracted considerable attention in research fields. Recent works have argued that the existence of the robust and non-robust features is a primary cause of the adversarial examples, and investigated their internal interactions in the feature space. In this paper, we propose a...
In this paper, we introduce a novel audio-visual multi-modal bridging framework that can utilize both audio and visual information, even with uni-modal inputs. We exploit a memory network that stores source (i.e., visual) and target (i.e., audio) modal representations, where source modal representation is what we are given, and target modal represe...
In this paper, we propose a novel lip-to-speech generative adversarial network, Visual Context Attentional GAN (VCA-GAN), which can jointly model local and global lip movements during speech synthesis. Specifically, the proposed VCA-GAN synthesizes the speech from local lip visual features by finding a mapping function of viseme-to-phoneme, while g...
Recognizing speech from silent lip movement, which is called lip reading, is a challenging task due to 1) the inherent information insufficiency of lip movement to fully represent the speech, and 2) the existence of homophenes that have similar lip movement with different pronunciations. In this paper, we try to alleviate the aforementioned two cha...
Along with the outstanding performance of the deep neural networks (DNNs), considerable research efforts have been devoted to finding ways to understand the decision of DNNs structures. In the computer vision domain, visualizing the attribution map is one of the most intuitive and understandable ways to achieve human-level interpretation. Among the...
The goal of this work is to reconstruct speech from silent video, in both speaker dependent and speaker independent ways. Unlike previous works that have been mostly restricted to a speaker dependent setting, we propose Visual Voice memory to restore essential auditory information to generate proper speech from different speakers and even unseen sp...
Visual Speech Recognition (VSR) is a task that recognizes speech from external appearances of the face (i.e., lips) into text. Since the information from the visual lip movements is not sufficient to fully represent the speech, VSR is considered as one of the challenging problems. One possible way to resolve this problem is additionally utilizing a...
Recently, VR sickness assessment for VR videos is highly demanded in industry and research fields to address VR viewing safety issues. Especially, it is difficult to evaluate VR sickness of individuals due to individual differences. To achieve the challenging goal, we focus on deep feature fusion of sickness-related information. In this paper, we p...
Multispectral pedestrian detection has received great attention in recent years as multispectral modalities (i.e. color and thermal) can provide complementary visual information. However, there are major inherent issues in multispectral pedestrian detection. First, the cameras of the two modalities have different field-of-views (FoVs), so that imag...
Although the hyperspectral image (HSI) classification has adopted deep neural networks (DNNs) and shown remarkable performances, there is a lack of studies of the adversarial vulnerability for the HSI classifications. In this paper, we propose a novel HSI classification framework robust to adversarial attacks. To this end, we focus on the unique sp...
We address the black-box issue of VR sickness assessment (VRSA) by evaluating the level of physical symptoms of VR sickness. For the VR contents inducing the similar VR sickness level, the physical symptoms can vary depending on the characteristics of the contents. Most of existing VRSA methods focused on assessing the overall VR sickness score. To...
Depth adjustment aims to enhance the visual experience of stereoscopic 3D (S3D) images, which accompanied with improving visual comfort and depth perception. For a human expert, the depth adjustment procedure is a sequence of iterative decision making. The human expert iteratively adjusts the depth until he is satisfied with the both levels of visu...
Our work addresses long-term motion context issues for predicting future frames. To predict the future precisely, it is required to capture which long-term motion context (e.g., walking or running) the input motion (e.g., leg movement) belongs to. The bottlenecks arising when dealing with the long-term motion context are: (i) how to predict the lon...
With the development of deep neural networks, multispectral pedestrian detection has been received a great attention by exploiting complementary properties of multiple modalities (e.g., color-visible and thermal modalities). Previous works usually rely on network prediction scores in combining complementary modal information. However, it is widely...
This paper presents a new version of the Interactive VIdeo Search Tool (IVIST), a video retrieval tool, for the participation of the Video Browser Showdown (VBS) 2021. In the previous IVIST (VBS 2020), there were core functions to search for videos practically, such as object detection, scene-text recognition, and dominant-color finding. Including...
Recently, a wide range of research on object detection has shown breakthrough performance. However, in a challenging environment, such as occlusion and small object cases, object detectors still produce inaccurate or erroneous predictions. To effectively cope with such conditions, most of the existing methods have suggested loss functions to guide...
Recently, cybersickness assessment for VR content is required to deal with viewing safety issues. Assessing physical symptoms of individual viewers is challenging but important to provide detailed and personalized guides for viewing safety. In this paper, we propose a novel symptom-aware cybersickness assessment network (SACA Net) that quantifies p...
Recent advances in facial expression synthesis have shown promising results using diverse expression representations including facial action units. Facial action units for an elaborate facial expression synthesis need to be intuitively represented for human comprehension, not a numeric categorization of facial action units. To address this issue, w...
Facial recognition for surveillance applications still remains challenging in uncontrolled environments, especially with the appearances of masks/veils and different ethnicities effects. Multimodal facial biometrics recognition becomes one of the major studies to overcome such scenarios. However, to cooperate with multimodal facial biometrics, many...
To combat against adversarial attacks, autoencoder structure is widely used to perform denoising which is regarded as gradient masking. In this paper, we revisit the role of autoencoders in adversarial settings. Through the comprehensive experimental results and analysis, this paper presents the inherent property of adversarial robustness in the au...
Recent studies have shown that ensemble approaches could not only improve accuracy and but also estimate model uncertainty in deep learning. However, it requires a large number of parameters according to the increase of ensemble models for better prediction and uncertainty estimation. To address this issue, a generic and efficient segmentation fram...
Deep neural networks have achieved substantial achievements in several computer vision areas, but have vulnerabilities that are often fooled by adversarial examples that are not recognized by humans. This is an important issue for security or medical applications. In this paper, we propose an ensemble model training framework with random layer samp...
The success of multimodal data fusion in deep learning appears to be attributed to the use of complementary in-formation between multiple input data. Compared to their predictive performance, relatively less attention has been devoted to the robustness of multimodal fusion models. In this paper, we investigated whether the current multimodal fusion...
Video frame interpolation has increasingly attracted attention in computer vision and video processing fields. When motion patterns in a video are complex, large and non-linear (exceptional motion), the generated intermediate frame is blurred and likely to have large artifacts. In this paper, we propose a novel video frame interpolation considering...
Satellite image prediction is important in weather nowcasting. In this article, we propose a novel multichannel satellite image prediction network (MCSIP Net) for predicting satellite images. The proposed MCSIP Net consists of three parts such as the satellite image predictor, the spatio-temporal 3-D discriminators, and the domain knowledge critic...
Encoding the facial expression dynamics is efficient in classifying and recognizing facial expressions. Most facial dynamics-based methods assume that a sequence is temporally segmented before prediction. This requires the prediction to wait until a full sequence is available, resulting in prediction delay. To reduce the prediction delay and enable...
Long short-term memory (LSTM) is a type of recurrent neural networks that is efficient for encoding spatio-temporal features in dynamic sequences. Recent work has shown that the LSTM retains information related to the mode of variation in the input dynamic sequence which reduces the discriminability of the encoded features. To encode features robus...
This paper presents a new video retrieval tool, Interactive VIdeo Search Tool (IVIST), which participates in the 2020 Video Browser Showdown (VBS). As a video retrieval tool, IVIST is equipped with proper and high-performing functionalities such as object detection, dominant-color finding, scene-text recognition and text-image retrieval. These func...
This paper introduces a video retrieval tool for the 2020 Video Browser Showdown (VBS). The tool enhances the user’s video browsing experience by ensuring full use of video analysis database constructed prior to the Showdown. Deep learning based object detection, scene text detection, scene color detection, audio classification and relation detecti...
Human facial expression plays the key role in the understanding of the social behavior. Many deep learning approaches present facial emotion recognition and automatic image captioning considering human sentiments. However, most current deep learning models for facial expression analysis do not contain comprehensive, detailed information of a single...
The ambiguity of the decision-making process has been pointed out as the main obstacle to practically applying the deep learning-based method in spite of its outstanding performance. Interpretability can guarantee the confidence of the deep learning system, therefore it is particularly important in the medical field. In this study, a novel deep net...
Abnormal event detection is an important task in video surveillance systems. In this paper, we propose a novel bidirectional multi-scale aggregation networks (BMAN) for abnormal event detection. The proposed BMAN learns spatiotemporal patterns of normal events to detect deviations from the learned normal patterns as abnormalities. The BMAN consists...
Generating realistic breast masses is a highly important task because the large-size database of annotated breast masses is scarcely available. In this study, a novel realistic breast mass generation framework using the characteristics of the breast mass (i.e. BIRADS category) has been devised. For that purpose, the visual-semantic BIRADS descripti...
Although there is an abundance of current research on facial recognition, it still faces significant challenges that are related to variations in factors such as aging, poses, occlusions, resolution, and appearances. In this paper, we propose a Multi-feature Deep Learning Network (MDLN) architecture that uses modalities from the facial and periocul...
Spatio-temporal feature encoding is essential for encoding the dynamics in video sequences. Recurrent neural networks, particularly long short-term memory (LSTM) units, have been popular as an efficient tool for encoding spatio-temporal features in sequences. In this work, we investigate the effect of mode variations on the encoded spatio-temporal...
Realistic image synthesis is to generate an image that is perceptually indistinguishable from an actual image. Generating realistic looking images with large variations (e.g., large spatial deformations and large pose change), however, is very challenging. Handing large variations as well as preserving appearance needs to be taken into account in t...
The ambiguity of the decision-making process has been pointed out as the main obstacle to applying the deep learning-based method in a practical way in spite of its outstanding performance. Interpretability could guarantee the confidence of deep learning system, therefore it is particularly important in the medical field. In this study, a novel dee...
Purpose:
Transvaginal ultrasound imaging provides useful information for diagnosing endometrial pathologies and reproductive health. Endometrium segmentation in transvaginal ultrasound (TVUS) images is very challenging due to ambiguous boundaries and heterogeneous textures. In this study, we developed a new segmentation framework which provides ro...
Object detection has received significant interest in the research field of computer vision and widely used in human-centric applications. The occlusion problem is a frequent obstacle that degrades detection quality. In this paper, we propose a novel object detection framework targeting robust object detection in occlusion. The proposed deep learni...
In this paper, we propose a novel deep learningbased virtual reality image quality assessment method that automatically predicts the visual quality of an omnidirectional image. In order to assess the visual quality in viewing the omnidirectional image, we propose deep networks consisting of VR quality score predictor and human perception guider. Th...
Facial landmark detection plays an important role in face analysis tasks. Moreover, it is used as a prerequisite in many facial related applications, the simplicity as well as effectiveness is essential in the facial landmark detection. In this paper, we propose an effective facial landmark detection network and associated learning framework with t...
This paper deals with a method for generating realistic labeled masses. Recently, there have been many attempts to apply deep learning to various bio-image computing fields including computer-aided detection and diagnosis. In order to learn deep network model to be well-behaved in bio-image computing fields, a lot of labeled data is required. Howev...