Kiyoharu Aizawa

The University of Tokyo | Todai · Department of Information and Communication Engineering

Ph.D.

About

763
Publications
125,400
Reads
11,848
Citations
Additional affiliations
April 1988 - present
The University of Tokyo
Position
  • Professor (Full)

Publications (763)
Preprint
Accelerating research on Large Multimodal Models (LMMs) in non-English languages is crucial for enhancing user experiences across broader populations. In this paper, we introduce JMMMU (Japanese MMMU), the first large-scale Japanese benchmark designed to evaluate LMMs on expert-level tasks based on the Japanese cultural context. To facilitate compr...
Preprint
Research on food image understanding using recipe data has been a long-standing focus due to the diversity and complexity of the data. Moreover, food is inextricably linked to people's lives, making it a vital research area for practical applications such as dietary management. Recent advancements in Multimodal Large Language Models (MLLMs) have de...
Article
In this work, we investigate facial anonymization techniques in 360° videos and assess their influence on the perceived realism, anonymization effect, and presence of participants. In comparison to traditional footage, 360° videos can convey engaging, immersive experiences that accurately represent the atmosphere of real-world locations. As the ent...
Preprint
Based on recent advanced diffusion models, Text-to-image (T2I) generation models have demonstrated their capabilities in generating diverse and high-quality images. However, leveraging their potential for real-world content creation, particularly in providing users with precise control over the image generation result, poses a significant challenge...
Preprint
In this work, we investigate facial anonymization techniques in 360° videos and assess their influence on the perceived realism, anonymization effect, and presence of participants. In comparison to traditional footage, 360° videos can convey engaging, immersive experiences that accurately represent the atmosphere of real-world locations....
Preprint
Detecting out-of-distribution (OOD) samples is crucial for ensuring the safety of machine learning systems and has shaped the field of OOD detection. Meanwhile, several other problems are closely related to OOD detection, including anomaly detection (AD), novelty detection (ND), open set recognition (OSR), and outlier detection (OD). To unify these...
Preprint
Manga is a popular medium that combines stylized drawings and text to convey stories. As manga panels differ from natural images, computational systems traditionally had to be designed specifically for manga. Recently, the adaptive nature of modern large multimodal models (LMMs) shows possibilities for more general approaches. To provide an analysi...
Article
While deep image compression performs better than traditional codecs like JPEG on natural images, it faces a challenge as a learning-based approach: compression performance drastically decreases for out-of-domain images. To investigate this problem, we introduce a novel task that we call universal deep image compression, which involves compressing...
Article
We propose a novel classification problem setting where Undesirable Classes (UCs) are defined for each class. A UC is a class into which samples of the given class should specifically not be misclassified. To address this setting, we propose a framework to reduce the probabilities for UCs while increasing the probability for a correct class.
Article
Full-text available
Comics are a bimodal form of art involving a mixture of text and images. Since comics require a combination of various cognitive processes to comprehend their contents, the analysis of human comic reading behavior sheds light on how humans process such bimodal forms of media. In this paper, we particularly focus on the viewing times of each comic p...
Article
Transformers are now widely used in natural language and image processing and have achieved excellent results. Benefiting from the self-attention operation and global interaction, Transformers have demonstrated more powerful spatiotemporal modeling capabilities than traditional convolutional and recurrent neural networks. However, research...
Chapter
Comic panel detection is the task of identifying panel regions from a given comic image. Many comic datasets provide the borders of the panel lines as their panel region annotations, expressed in formats such as bounding boxes. However, since such panel annotations are usually not aware of the contents of the panel, they do not capture objects that e...
Article
We propose Quality Enhancement via a Side bitstream Network (QESN) technique for lossy image compression. The proposed QESN utilizes the network architecture of deep image compression to produce a bitstream for enhancing the quality of conventional compression. We also present a loss function that directly optimizes the Bjontegaard delta bit rate (...
Preprint
Unsupervised domain adaptation (UDA) has proven to be very effective in transferring knowledge obtained from a source domain with labeled data to a target domain with unlabeled data. Owing to the lack of labeled data in the target domain and the possible presence of unknown classes, open-set domain adaptation (ODA) has emerged as a potential soluti...
Article
360° cameras have gained popularity over the last few years. In this paper, we propose two fundamental techniques, Field-of-View IoU (FoV-IoU) and 360Augmentation, for object detection in 360° images. Although most object detection neural networks designed for perspective images are applicable to 360° images in equirectangular projection (ERP) format...
Preprint
Full-text available
The expanding market for e-comics has spurred interest in the development of automated methods to analyze comics. For further understanding of comics, an automated approach is needed to link text in comics to characters speaking the words. Comics speaker detection research has practical applications, such as automatic character assignment for audio...
Preprint
We present a novel vision-language prompt learning approach for few-shot out-of-distribution (OOD) detection. Few-shot OOD detection aims to detect OOD images from classes that are unseen during training using only a few labeled in-distribution (ID) images. While prompt learning methods such as CoOp have shown effectiveness and efficiency in few-sh...
Preprint
Diffusion models have the ability to generate high quality images by denoising pure Gaussian noise images. While previous research has primarily focused on improving the control of image generation through adjusting the denoising process, we propose a novel direction of manipulating the initial noise to control the generated image. Through experime...
Preprint
Removing out-of-distribution (OOD) images from noisy images scraped from the Internet is an important preprocessing for constructing datasets, which can be addressed by zero-shot OOD detection with vision language foundation models (CLIP). The existing zero-shot OOD detection setting does not consider the realistic case where an image has both in-d...
Preprint
In deep image compression, uniform quantization is applied to latent representations obtained by using an auto-encoder architecture for reducing bits and entropy coding. Quantization is a problem encountered in the end-to-end training of deep image compression. Quantization's gradient is zero, and it cannot backpropagate meaningful gradients. Many...
Article
Full-text available
In deep image compression, uniform quantization is applied to latent representations obtained by using an auto-encoder architecture for reducing bits and entropy coding. Quantization is a problem encountered in the end-to-end training of deep image compression. Quantization’s gradient is zero, and it cannot backpropagate meaningful gradients. Many...
Article
Unsupervised domain adaptation (UDA) is extremely effective for transferring knowledge from a label-rich source domain to a label-scarce target domain. Because the target domain is unlabeled and may contain additional novel classes, open-set domain adaptation (ODA) has been suggested as a possible solution to detect these novel classes in the train...
Preprint
In recent years, the performance of novel view synthesis using perspective images has dramatically improved with the advent of neural radiance fields (NeRF). This study proposes two novel techniques that effectively build NeRF for 360° omnidirectional images. Due to the characteristics of a 360° image of ERP format that has...
Preprint
Diverse image completion, the problem of generating various ways of filling incomplete regions (i.e., holes) of an image, has achieved remarkable success. However, managing input images with large holes is still a challenging problem due to the corruption of semantically important structures. In this paper, we tackle this problem by incorporating explicit...
Preprint
Deep image compression performs better than conventional codecs, such as JPEG, on natural images. However, deep image compression is learning-based and encounters a problem: the compression performance deteriorates significantly for out-of-domain images. In this study, we highlight this problem and address a novel task: universal deep image compres...
Preprint
Rotation is frequently listed as a candidate for data augmentation in contrastive learning but seldom provides satisfactory improvements. We argue that this is because the rotated image is always treated as either positive or negative. The semantics of an image can be rotation-invariant or rotation-variant, so whether the rotated image is treated a...
Chapter
Recognizing irregular texts has been a challenging topic in text recognition. To encourage research on this topic, we provide a novel comic onomatopoeia dataset (COO), which consists of onomatopoeia texts in Japanese comics. COO has many arbitrary texts, such as extremely curved, partially shrunk texts, or arbitrarily placed texts. Furthermore, som...
Article
Image quality assessment (IQA) is a fundamental metric for image processing tasks (e.g., compression). With full-reference IQAs, traditional IQAs, such as PSNR and SSIM, have been used. Recently, IQAs based on deep neural networks (deep IQAs), such as LPIPS and DISTS, have also been used. It is known that image scaling is inconsistent among deep IQ...
Preprint
Full-text available
360° images are informative: they contain omnidirectional visual information around the camera. However, the area that a 360° image covers is much larger than the human field of view, so important information in different view directions is easily overlooked. To tackle this issue, we propose a method for predicting the optimal s...
Preprint
Image quality assessment (IQA) is a fundamental metric for image processing tasks (e.g., compression). With full-reference IQAs, traditional IQAs, such as PSNR and SSIM, have been used. Recently, IQAs based on deep neural networks (deep IQAs), such as LPIPS and DISTS, have also been used. It is known that image scaling is inconsistent among deep IQ...
Preprint
Full-text available
Recognizing irregular texts has been a challenging topic in text recognition. To encourage research on this topic, we provide a novel comic onomatopoeia dataset (COO), which consists of onomatopoeia texts in Japanese comics. COO has many arbitrary texts, such as extremely curved, partially shrunk texts, or arbitrarily placed texts. Furthermore, som...
Article
Unsupervised domain adaptation (UDA) has been highly successful in transferring knowledge acquired from a label-rich source domain to a label-scarce target domain. Open-set domain adaptation (open-set DA) and universal domain adaptation (UniDA) have been proposed as solutions to the problem concerning the presence of additional novel categories in...
Preprint
Designing fonts for Chinese characters is highly labor-intensive and time-consuming. Although the latest methods successfully generate vector fonts for the English alphabet, and despite the high demand for automatic font generation, Chinese vector font generation has remained an unsolved problem owing to the complex shapes and large number of characters. This study addres...
Preprint
Movie-Map, an interactive first-person-view map that engages the user in a simulated walking experience, comprises short 360° video segments separated by traffic intersections that are seamlessly connected according to the viewer's direction of travel. However, in wide urban-scale areas with numerous intersecting roads, manual intersection seg...
Preprint
Full-text available
360° images have become widely available over the last few years. This paper proposes a new technique for single 360° image depth prediction under open environments. Depth prediction from a single 360° image is not easy for two reasons. One is the limitation of supervision datasets: the currently available dataset is limited to indoor scene...
Preprint
360° cameras have gained popularity over the last few years. In this paper, we propose two fundamental techniques, Field-of-View IoU (FoV-IoU) and 360Augmentation, for object detection in 360° images. Although most object detection neural networks designed for perspective images are applicable to 360° images in equirectangular p...
Article
This paper introduces our work on a Movie Map, which will enable users to explore a given city area using 360° videos. Visual exploration of a city is always needed. Nowadays, we are familiar with Google Street View (GSV), which is an interactive visual map. Despite the wide use of GSV, it provides sparse images of streets, which often confuses users...
Article
Full-text available
360° images are informative: they contain omnidirectional visual information around the camera. However, the area that a 360° image covers is much larger than the human field of view, so important information in different view directions is easily overlooked. To tackle this issue, we propose a method for predicting the optimal set of Regio...
Preprint
Full-text available
Supervised training of object detectors requires well-annotated large-scale datasets, whose production is costly. Therefore, some efforts have been made to obtain annotations in economical ways, such as crowdsourcing. However, datasets obtained by these methods tend to contain noisy annotations such as inaccurate bounding boxes and incorrect class...
Article
Well-annotated datasets are crucial to the training of object detectors. However, producing finely annotated datasets for object detection tasks is extremely labor-intensive, so crowdsourcing is often used to create them, and the resulting datasets tend to contain incorrect annotations such as inaccurate localization boundi...
Conference Paper
Full-text available
The coronavirus pandemic is forcing people all over the world to stay "at home." Now that we hardly ever go out, many of us have come to realize how the food we eat affects our bodies. What can we do to know our food better and control it better? To give us a clue, we are trying to build a World Food Atlas (WFA) that collects all the knowledge about food in t...
Article
In this paper, we present a novel portrait impression estimation method using nine pairs of semantic impression words: bitter-majestic, clear-pure, elegant-mysterious, gorgeous-mature, modern-intellectual, natural-mild, sporty-agile, sweet-sunny, and vivid-dynamic. In the first part of the study, we analyzed the relationship between the facial feat...
Preprint
Full-text available
This paper reports on the NTIRE 2021 challenge on perceptual image quality assessment (IQA), held in conjunction with the New Trends in Image Restoration and Enhancement (NTIRE) workshop at CVPR 2021. As a new type of image processing technology, perceptual image processing algorithms based on Generative Adversarial Networks (GAN) have pro...
Preprint
Positive-unlabeled learning refers to the process of training a binary classifier using only positive and unlabeled data. Although unlabeled data can contain positive data, all unlabeled data are regarded as negative data in existing positive-unlabeled learning methods, which results in diminished performance. We provide a new perspective on thi...
Preprint
Full-text available
Scene text recognition (STR) task has a common practice: All state-of-the-art STR models are trained on large synthetic data. In contrast to this practice, training STR models only on fewer real labels (STR with fewer labels) is important when we have to train STR models without synthetic data: for handwritten or artistic texts that are difficult t...
Article
Full-text available
Computational models of saliency estimation have been studied in a wide range of research fields, including visual perception, image processing, computer vision, multimedia, and their intersections. However, most of them seek to simulate scene viewing by adults only, and the impact of observer’s age has rarely been considered. In this paper, we qua...
Preprint
We propose a new Movie Map system, with an interface for exploring cities. The system consists of four stages: acquisition, analysis, management, and interaction. In the acquisition stage, omnidirectional videos are taken along streets in target areas. Frames of the video are localized on the map, intersections are detected, and videos are segmente...
Preprint
Designing fonts for languages with a large number of characters, such as Japanese and Chinese, is an extremely labor-intensive and time-consuming task. In this study, we addressed the problem of automatically generating Japanese typographic fonts from only a few font samples, where the synthesized glyphs are expected to have coherent characteristic...
Preprint
We propose a new optimization framework for aleatoric uncertainty estimation in regression problems. Existing methods can quantify the error in the target estimation, but they tend to underestimate it. To obtain the predictive uncertainty inherent in an observation, we propose a new separable formulation for the estimation of a signal and of its un...
Chapter
Semi-supervised learning (SSL) has been proposed to leverage unlabeled data for training powerful models when only limited labeled data is available. While existing SSL methods assume that samples in the labeled and unlabeled data share the classes of their samples, we address a more complex novel scenario named open-set SSL, where out-of-distribut...
Preprint
There are five features to consider when using generative adversarial networks to apply makeup to photos of the human face. These features include (1) facial components, (2) interactive color adjustments, (3) makeup variations, (4) robustness to poses and expressions, and (5) the use of multiple reference images. Several related works have been pro...
Preprint
Semi-supervised learning (SSL) has been proposed to leverage unlabeled data for training powerful models when only limited labeled data is available. While existing SSL methods assume that samples in the labeled and unlabeled data share the classes of their samples, we address a more complex novel scenario named open-set SSL, where out-of-distribut...
Preprint
Deep image compression systems mainly contain four components: encoder, quantizer, entropy model, and decoder. To optimize these four components, a joint rate-distortion framework was proposed, and many deep neural network-based methods achieved great success in image compression. However, almost all convolutional neural network-based methods treat...
