Chenhui Chu’s research while affiliated with Kyoto University and other places

Publications (44)


Learning More May Not Be Better: Knowledge Transferability in Vision-and-Language Tasks
  • Article
  • Full-text available

November 2024

Journal of Imaging

Tianwei Chen · Noa Garcia · Mayu Otani · [...]

Is learning more knowledge always better for vision-and-language models? In this paper, we study knowledge transferability in multi-modal tasks. The current tendency in machine learning is to assume that by joining multiple datasets from different tasks, their overall performance improves. However, we show that not all knowledge transfers well or has a positive impact on related tasks, even when they share a common goal. We conducted an exhaustive analysis based on hundreds of cross-experiments on twelve vision-and-language tasks categorized into four groups. While tasks in the same group are prone to improve each other, results show that this is not always the case. In addition, other factors, such as dataset size or the pre-training stage, may have a great impact on how well the knowledge is transferred.
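
The abstract describes a cross-experiment protocol: train each task independently, fine-tune every resulting model on every other task, and compare against the single-task baseline. The sketch below illustrates that loop only in outline; the task names and the train/fine_tune/evaluate helpers are placeholders, not the paper's implementation.

```python
# Illustrative outline of the cross-task transferability protocol (placeholders only).
import itertools
import random

# Task groups mirror the paper's four categories; the individual task names are illustrative.
TASKS = {
    "VQA": ["VQA v2", "GQA", "VizWiz"],
    "IR":  ["Flickr30k retrieval", "COCO retrieval"],
    "RE":  ["RefCOCO", "RefCOCO+", "RefCOCOg"],
    "MV":  ["NLVR2", "SNLI-VE"],
}

def train(task):
    """Placeholder: train a single-task model from scratch."""
    return {"source": task, "params": random.random()}

def fine_tune(model, target_task):
    """Placeholder: fine-tune a source-task model on a target task."""
    return {"source": model["source"], "target": target_task}

def evaluate(model, task):
    """Placeholder: return a score on the given task."""
    return random.uniform(0.0, 1.0)

all_tasks = [t for group in TASKS.values() for t in group]

# Step 1: independent single-task baselines.
baselines = {t: evaluate(train(t), t) for t in all_tasks}

# Steps 2-3: fine-tune every source model on every other target and record the
# gain (positive or negative) over the from-scratch baseline.
transfer_gain = {}
for source, target in itertools.permutations(all_tasks, 2):
    transferred = fine_tune(train(source), target)
    transfer_gain[(source, target)] = evaluate(transferred, target) - baselines[target]
```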

Figures and tables:
  • Figure 2. Examples of synthetic samples generated using our proposed data augmentation techniques for VQA.
  • Figure 4. Answer overlap: comparison of prediction agreement between our language-only model, BAN, and VisualBERT. Bar graphs show proportions of identical or differing answers.
  • Figure 5. Qualitative comparison: red boxes indicate object detection results by Faster R-CNN with confidence scores over 0.5. Highlighted words in the descriptions correspond to key details relevant to the answers.
  • Examples of data augmentation for language when applied to questions.
  • Results of data augmentation techniques on the VQA-CP v2 test set, showing the impact of different data augmentation methods on model accuracy for Yes/No, Number, and Other question types. D indicates techniques applied to image descriptions, while Q indicates application to questions. The Gap column highlights the improvement in accuracy compared to the baseline, where no synthetic data were used.

A Picture May Be Worth a Hundred Words for Visual Question Answering

October 2024 · 7 Reads · 1 Citation

Electronics

How far can textual representations go in understanding images? In image understanding, effective representations are essential. Deep visual features from object recognition models currently dominate various tasks, especially Visual Question Answering (VQA). However, these conventional features often struggle to capture image details in ways that match human understanding, and their decision processes lack interpretability. Meanwhile, recent progress in language models suggests that descriptive text could offer a viable alternative. This paper investigates the use of descriptive text as an alternative to deep visual features in VQA. We propose to process description–question pairs rather than visual features, using a language-only Transformer model. We also explore data augmentation strategies to enhance training set diversity and mitigate statistical bias. Extensive evaluation shows that textual representations of roughly a hundred words can compete with deep visual features on both the VQA 2.0 and VQA-CP v2 datasets. Our qualitative experiments further reveal that these textual representations enable clearer investigation of VQA model decision processes, thereby improving interpretability.
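
To make the proposed input format concrete, the sketch below treats VQA as answer classification over a description–question text pair with an off-the-shelf text-only Transformer. The checkpoint, the answer-vocabulary size, and the example texts are assumptions for illustration, not the paper's exact model.

```python
# Minimal sketch: a language-only Transformer reads (image description, question)
# and classifies over a fixed answer vocabulary. Checkpoint and sizes are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_ANSWERS = 3000  # assumed answer-vocabulary size, a common VQA design choice

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=NUM_ANSWERS
)

description = "Two children are playing soccer on a grass field near a red fence."
question = "What sport are the children playing?"

# The textual description stands in for deep visual features: both inputs are plain text.
inputs = tokenizer(description, question, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_answer_id = int(logits.argmax(dim=-1))  # index into the answer vocabulary
```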


Corpus Construction for Historical Newspapers: A Case Study on Public Meeting Corpus Construction Using OCR Error Correction

September 2022 · 80 Reads · 6 Citations

SN Computer Science

Large text corpora are indispensable for natural language processing. However, in fields such as literature and the humanities, many documents of interest exist only as scanned images and have not been converted to text data. Optical character recognition (OCR) is a technology for converting scanned document images into text data. However, OCR often misrecognizes characters due to the low quality of the scanned images, which is a crucial factor degrading the quality of constructed text corpora. This paper addresses corpus construction for historical newspapers. We present a corpus construction method based on a pipeline of image processing, OCR, and filtering. To improve quality, we further propose integrating OCR error correction. To this end, we manually construct an OCR error correction dataset in the historical newspaper domain, propose methods to improve a neural OCR error correction model, and compare various OCR error correction models. We evaluate our corpus construction method on the accuracy of extracting articles on a specific topic to construct a historical newspaper corpus. As a result, our method improves the article extraction F score by 1.7% via OCR error correction compared to previous work. This verifies the effectiveness of OCR error correction for corpus construction.
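
The pipeline in the abstract (image processing, OCR, error correction, filtering) can be pictured with the short sketch below. The helper functions and the use of pytesseract are illustrative stand-ins; the paper's actual image processing, neural OCR error-correction model, and article-extraction filter are more involved.

```python
# Hedged sketch of a corpus-construction pipeline:
# image preprocessing -> OCR -> OCR error correction -> topic filtering.
from PIL import Image, ImageOps
import pytesseract

def preprocess(path):
    """Basic cleanup; the paper's pipeline also handles newspaper column layout."""
    image = Image.open(path).convert("L")   # grayscale
    return ImageOps.autocontrast(image)

def correct_ocr_errors(text):
    """Placeholder for the neural OCR error-correction model."""
    return text

def is_public_meeting_article(text):
    """Keyword filter standing in for the topic-based article extraction."""
    return "public meeting" in text.lower()

def build_corpus(image_paths):
    corpus = []
    for path in image_paths:
        raw_text = pytesseract.image_to_string(preprocess(path))
        corrected = correct_ocr_errors(raw_text)
        if is_public_meeting_article(corrected):
            corpus.append(corrected)
    return corpus
```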


Figures and tables:
  • Figure 1: We explore the transferability among 12 vision-and-language tasks in 4 different groups: visual question answering (VQA), image retrieval (IR), referring expression (RE), and multi-modal verification (MV). Here, we illustrate the transferability among 5 tasks. Different tasks have different effects (positive or negative) on the other tasks.
  • Figure 2: Analysis of transferability relationships between tasks. In Step 1, we train 12 vision-and-language tasks independently. In Step 2, we use the models from Step 1 and fine-tune them on each of the other tasks. In Step 3, we form a transferability relation table for the 12 vision-and-language tasks in four groups: visual question answering (VQA), image retrieval (IR), multi-modal verification (MV), and referring expressions (RE).
  • Figure 3: Box plots of the 12 tasks trained with 10 random seeds showing a big gap between the best and the worst scores.
  • Results of the direct model m_t on the 12 tasks.

Learning More May Not Be Better: Knowledge Transferability in Vision and Language Tasks

August 2022 · 23 Reads

Is more data always better for training vision-and-language models? We study knowledge transferability in multi-modal tasks. The current tendency in machine learning is to assume that by joining multiple datasets from different tasks, overall performance will improve. However, we show that not all knowledge transfers well or has a positive impact on related tasks, even when they share a common goal. We conduct an exhaustive analysis based on hundreds of cross-experiments on 12 vision-and-language tasks categorized into 4 groups. Whereas tasks in the same group are prone to improve each other, results show that this is not always the case. Other factors, such as dataset size or pre-training stage, also have a great impact on how well the knowledge is transferred.


Figures and tables:
  • Overview of the public meeting article extraction method (small columns in articles are shown as blue lines in the “trimming” sub-figure, and OCR errors are shown in blue fonts in the “OCR” sub-figure)
  • Example of a public meeting article. Information corresponding to each question is shown in the red boxes (information corresponding to question number 1 is shown in box q1, and so on)
  • Example of annotation information from an extracted public meeting article
  • Illustration of fine-tuning ALBERT on our public meeting information extraction task

Information Extraction from Public Meeting Articles

SN Computer Science

Public meeting articles are key to understanding the history of public opinion and the public sphere in Australia. Information extraction from public meeting articles can yield new insights into Australian history. In this paper, we create an information extraction dataset in the public meeting domain. For 1258 public meeting articles, we manually annotate the date and time, place, purpose, people who requested the meeting, people who convened the meeting, and people who were convened. We further present an information extraction system, which formulates information extraction from public meeting articles as a machine reading comprehension task. Experiments indicate that our system achieves an F1 score of 74.98% for information extraction from public meeting articles.
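
Formulating the extraction as machine reading comprehension means each annotated field becomes a question answered from the article text, as sketched below. The generic QA pipeline and the example article are placeholders; the paper fine-tunes ALBERT on its own annotated data.

```python
# Minimal sketch: information extraction as extractive question answering.
from transformers import pipeline

qa = pipeline("question-answering")  # stand-in for the fine-tuned ALBERT model

article = (
    "A public meeting will be held at the Town Hall on Friday, 12 March, "
    "at 8 p.m., convened by the Mayor to discuss the new railway line."
)

# One question per annotated field (date and time, place, purpose, convener, ...).
questions = {
    "date_time": "When will the meeting be held?",
    "place": "Where will the meeting be held?",
    "purpose": "What is the purpose of the meeting?",
    "convener": "Who convened the meeting?",
}

extracted = {field: qa(question=q, context=article)["answer"]
             for field, q in questions.items()}
```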


Region-Attentive Multimodal Neural Machine Translation

January 2022 · 32 Reads · 27 Citations

Neurocomputing

We propose a multimodal neural machine translation (MNMT) method with semantic image regions called region-attentive multimodal neural machine translation (RA-NMT). Existing studies on MNMT have mainly focused on employing global visual features or equally sized grid local visual features extracted by convolutional neural networks (CNNs) to improve translation performance. However, they neglect the semantic information captured inside the visual features. This study utilizes semantic image regions extracted by object detection for MNMT and integrates visual and textual features using two modality-dependent attention mechanisms. The proposed method was implemented and verified on two neural machine translation (NMT) architectures: the recurrent neural network (RNN) and the self-attention network (SAN). Experimental results on different language pairs of the Multi30k dataset show that our proposed method improves over baselines and outperforms most of the state-of-the-art MNMT methods. Further analysis demonstrates that the proposed method achieves better translation performance because of its better use of visual features.
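
The core idea, attending over object-detection region features during decoding and merging the resulting visual context with the textual context, can be sketched as below. The dimensions, the single dot-product attention, and the gating fusion are assumptions for illustration; they are not the paper's exact two modality-dependent attention mechanisms.

```python
# Schematic sketch: attention over detected region features plus gated fusion
# with the textual context (illustrative, not the RA-NMT formulation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAttentionFusion(nn.Module):
    def __init__(self, hidden_dim, region_dim):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.gate = nn.Linear(2 * hidden_dim, 1)

    def forward(self, decoder_state, text_context, region_feats):
        # decoder_state, text_context: (batch, hidden); region_feats: (batch, regions, region_dim)
        regions = self.region_proj(region_feats)                             # (B, R, H)
        scores = torch.bmm(regions, decoder_state.unsqueeze(2)).squeeze(2)   # (B, R)
        attn = F.softmax(scores, dim=1)
        visual_context = torch.bmm(attn.unsqueeze(1), regions).squeeze(1)    # (B, H)
        # Gate how much visual information enters the final context vector.
        g = torch.sigmoid(self.gate(torch.cat([text_context, visual_context], dim=1)))
        return g * visual_context + (1 - g) * text_context

fusion = RegionAttentionFusion(hidden_dim=512, region_dim=2048)
context = fusion(torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 36, 2048))
```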


The semantic typology of visually grounded paraphrases

December 2021 · 20 Reads · 2 Citations

Computer Vision and Image Understanding

Visually grounded paraphrases (VGPs) are different phrasal expressions describing the same visual concept in an image. Previous studies treat VGP identification as a binary classification task, which ignores the various phenomena behind VGPs (i.e., different linguistic interpretations of the same visual concept), such as linguistic paraphrases and VGPs from different aspects. In this paper, we propose a semantic typology for VGPs, aiming to elucidate VGP phenomena and deepen our understanding of how humans interpret vision with language. We construct a large VGP dataset that annotates the class to which each VGP pair belongs according to our typology. In addition, we present a classification model that fuses language and visual features for VGP classification on our dataset. Experiments indicate that joint language and vision representation learning is important for VGP classification. We further demonstrate that our VGP typology can boost the performance of visually grounded textual entailment.
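
One simple way to picture the fusion model is a classifier over concatenated phrase and region features for the two members of a VGP pair, as in the sketch below. The feature dimensions, the number of typology classes, and the concatenation-based fusion are assumptions, not the paper's exact architecture.

```python
# Illustrative sketch: classify a VGP pair into a typology class from fused
# language and visual features (dimensions and class count are assumptions).
import torch
import torch.nn as nn

class VGPClassifier(nn.Module):
    def __init__(self, text_dim=300, visual_dim=2048, hidden=512, num_classes=5):
        super().__init__()
        self.fuse = nn.Sequential(
            # phrase embeddings for both expressions plus their region features
            nn.Linear(2 * text_dim + 2 * visual_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, phrase1, phrase2, region1, region2):
        return self.fuse(torch.cat([phrase1, phrase2, region1, region2], dim=1))

model = VGPClassifier()
logits = model(torch.randn(4, 300), torch.randn(4, 300),
               torch.randn(4, 2048), torch.randn(4, 2048))
```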


Transferring Domain-Agnostic Knowledge in Video Question Answering

October 2021 · 16 Reads

Video question answering (VideoQA) is designed to answer a given question based on a relevant video clip. The currently available large-scale datasets have made it possible to formulate VideoQA as the joint understanding of visual and language information. However, this training procedure is costly and still falls short of human performance. In this paper, we investigate a transfer learning method that introduces domain-agnostic and domain-specific knowledge. First, we develop a novel transfer learning framework, which fine-tunes the pre-trained model by applying domain-agnostic knowledge as the medium. Second, we construct a new VideoQA dataset with 21,412 human-generated question-answer samples for comparable transfer of knowledge. Our experiments show that: (i) domain-agnostic knowledge is transferable and (ii) our proposed transfer learning framework can boost VideoQA performance effectively.
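
The transfer framework can be read as a two-stage fine-tuning schedule: adapt the pre-trained model on domain-agnostic question-answer data first, then on the target domain-specific VideoQA data. The sketch below shows only that ordering; the function bodies and dataset names are placeholders.

```python
# Outline of the two-stage transfer schedule (placeholders only).
def load_pretrained_videoqa_model():
    return {"weights": "pretrained"}  # e.g., a pre-trained vision-and-language model

def fine_tune(model, dataset_name):
    """Placeholder for a full fine-tuning loop on the named dataset."""
    return {"weights": model["weights"] + "+" + dataset_name}

model = load_pretrained_videoqa_model()
model = fine_tune(model, "domain_agnostic_qa")      # the "medium" stage
model = fine_tune(model, "target_domain_videoqa")   # the domain-specific target
```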




Citations (21)


... For example, the methods of constructing corpora described in [1][2][3][4] require a large amount of natural text data based on which the corpus is generated. This requirement significantly limits the possibility of their use in developing information systems because the initial data must be stored somewhere additionally. ...

Reference:

DICTIONARY-BASED DETERMINISTIC METHOD OF GENERATION OF TEXT CORPORA
Corpus Construction for Historical Newspapers: A Case Study on Public Meeting Corpus Construction Using OCR Error Correction

SN Computer Science

... (i) Traditional Multimodal Machine Translation models (MMT), including Soul-Mix (Cheng et al., 2024), RG-MMT-EDC (Tayir and Li, 2024), WRA-guided (Zhao et al., 2022), Imagination (Elliott and Kádár, 2017) and ImagiT (Long et al., 2021). These MMT baselines take the source language sentence as textual input while utilizing the image as visual input. ...

Word-Region Alignment-Guided Multimodal Neural Machine Translation

IEEE/ACM Transactions on Audio Speech and Language Processing

... Moreover, counterfactual causal inference offers a framework to enhance [1,36] and explain [9,10] models in counterfactual scenarios. However, the majority of these counterfactual-related works are tailored for classification tasks, such as image classification [1,9,47], representations learning [36,50], or visual question answering [11,17,23], rather than for generation tasks. Classification tasks exhibit a deterministic correspondence between input and output, whereas, in the generation process, the counterfactual image and preceding generated tokens collectively influence the subsequent token generation, creating an effect propagation. ...

Visual Question Answering with Textual Representations for Images
  • Citing Conference Paper
  • October 2021

... The authors evaluated the performance of proposed methods on the WikiArt dataset for artwork classification by style, artist, and timeframe and multi-label fine art categorization. El Vaigh et al. [53] worked on a multi-label classification problem, they trained a model using a graph convolutional network, relying on the relationships between entities of the knowledge graph. The evaluation was done on the SemArt and Buddha statues datasets. ...

GCNBoost: Artwork Classification by Label Propagation through a Knowledge Graph
  • Citing Conference Paper
  • August 2021

... Furthermore, a few other issues, such as a reliability crisis, underfitting, and inadequacy of data, have limited the use of ML in cephalometry (Asiri et al., , Tandon et al., 2020, Palanivel et al., 2021, Tanikawa et al., 2021. This meta-analysis had several limitations. ...

Machine/Deep Learning for Performing Orthodontic Diagnoses and Treatment Planning
  • Citing Chapter
  • July 2021

... Hirota et al. [6] believe that it is difficult for traditional deep visual features to capture all the details in the image as humans do. At the same time, with the recent progress of natural language models, the authors propose to replace the "image-question" pair with the "description-question" pair as input and feed them into a language-only Transformer model. ...

A Picture May Be Worth a Hundred Words for Visual Question Answering

... Humorous headlines [Horvitz et al. 2024; Hossain et al. 2019, 2020b], puns [Miller et al. 2017], jokes [Meaney et al. 2021; Weller and Seppi 2019; Zhang et al. 2019b; Zhong et al. 2023], [Hasan et al. 2019; Mihalcea and Strapparava 2005]. Methods: humorous content [Arora et al. 2022; Kayatani et al. 2021; Peyrard et al. 2021; Ravi et al. 2024; Xie et al. 2023b], [Amin and Burghardt 2020; Annamoradnejad and Zoghi 2020; Bertero and Fung 2016; Chen and Soo 2018; Liu et al. 2018b; Ziser et al. 2020], [Radev et al. 2016; Raskin and Attardo 1994; Stock and ...

The Laughing Machine: Predicting Humor in Video
  • Citing Conference Paper
  • January 2021

... The WRIME (version 1) dataset [6] consists of 43,200 tweets written by 80 individuals, each annotated with 4-point intensity ratings for Plutchik's eight primary emotions by the author and three anonymous annotators. These intensity data were all obtained through a crowdsourcing service. ...

WRIME: A New Dataset for Emotional Intensity Estimation with Subjective and Objective Annotations
  • Citing Conference Paper
  • January 2021

... To overcome this limitation, supervised approaches [8][9][10][11][12][13][14][15][16][17][18]36] have shown significant advancements, leveraging summaries crafted by human-annotated labels. Li et al. [8] presented a summarization framework for both edited and raw videos. ...

A Comparative Study of Language Transformers for Video Question Answering
  • Citing Article
  • March 2021

Neurocomputing