Kaiyuan Sun’s research while affiliated with The University of Sydney and other places


Publications (2)


REXUP: I REason, I EXtract, I UPdate with Structured Compositional Reasoning for Visual Question Answering

Chapter · November 2020 · 37 Reads · 5 Citations · Lecture Notes in Computer Science

Siwen Luo · Kaiyuan Sun · Josiah Poon

Visual Question Answering (VQA) is a challenging multi-modal task that requires not only the semantic understanding of images and questions, but also the sound perception of a step-by-step reasoning process that would lead to the correct answer. So far, most successful attempts in VQA have focused on only one aspect: either the interaction of visual pixel features of images and word features of questions, or the reasoning process of answering a question about an image with simple objects. In this paper, we propose a deep reasoning VQA model (REXUP: REason, EXtract, and UPdate) with explicit visual structure-aware textual information, which works well in capturing the step-by-step reasoning process and detecting complex object relationships in photo-realistic images. REXUP consists of two branches, image object-oriented and scene graph-oriented, which work jointly with super-diagonal fusion compositional attention networks. We evaluate REXUP on the benchmark GQA dataset and conduct extensive ablation studies to explore the reasons behind REXUP's effectiveness. Our best model significantly outperforms the previous state-of-the-art, delivering 92.7% on the validation set and 73.1% on the test-dev set. Our code is available at: https://github.com/usydnlp/REXUP/.
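The two-branch, fuse-then-update scheme described in the abstract can be loosely sketched as follows. This is an illustrative toy with made-up dimensions, random weights, and a simplified element-wise fusion standing in for super-diagonal tensor fusion; it is not the authors' implementation (see the linked repository for that), and all function and variable names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(q, v, Wq, Wv):
    """Simplified bilinear-style fusion: project question and visual
    features, then combine element-wise (a crude stand-in for
    super-diagonal tensor fusion)."""
    return np.tanh(Wq @ q) * np.tanh(Wv @ v)

def gated_update(memory, fused, Wg):
    """Gated memory update: interpolate between the old memory state
    and the newly fused information, one reasoning step at a time."""
    g = 1.0 / (1.0 + np.exp(-(Wg @ fused)))  # sigmoid gate per dimension
    return g * fused + (1.0 - g) * memory

d = 8
q = rng.standard_normal(d)       # question encoding
v_obj = rng.standard_normal(d)   # image object-oriented branch feature
v_sg = rng.standard_normal(d)    # scene graph-oriented branch feature
Wq, Wv, Wg = (rng.standard_normal((d, d)) for _ in range(3))

# The two branches contribute to a shared memory via gated updates.
mem = np.zeros(d)
for branch_feat in (v_obj, v_sg):
    mem = gated_update(mem, fuse(q, branch_feat, Wq, Wv), Wg)

print(mem.shape)  # → (8,)
```

The gate lets each reasoning step decide, per dimension, how much newly fused evidence to write into memory versus how much prior state to keep, which is the intuition behind the step-by-step "reason, extract, update" loop.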


REXUP: I REason, I EXtract, I UPdate with Structured Compositional Reasoning for Visual Question Answering

July 2020 · 33 Reads

Visual question answering (VQA) is a challenging multi-modal task that requires not only the semantic understanding of both images and questions, but also the sound perception of a step-by-step reasoning process that would lead to the correct answer. So far, most successful attempts in VQA have focused on only one aspect: either the interaction of visual pixel features of images and word features of questions, or the reasoning process of answering a question about an image with simple objects. In this paper, we propose a deep reasoning VQA model with explicit visual structure-aware textual information, which works well in capturing the step-by-step reasoning process and detecting complex object relationships in photo-realistic images. The REXUP network consists of two branches, image object-oriented and scene graph-oriented, which work jointly with a super-diagonal fusion compositional attention network. We quantitatively and qualitatively evaluate REXUP on the GQA dataset and conduct extensive ablation studies to explore the reasons behind REXUP's effectiveness. Our best model significantly outperforms the previous state-of-the-art, delivering 92.7% on the validation set and 73.1% on the test-dev set.

Citations (1)


... Recently, scene graphs have also been used in VQA. For example, ref. [27] processed the scene graphs and image features simultaneously through two parallel branches of recurrent memory networks to improve the model's reasoning ability over objects' relationships. Ref. [28] proposed to use a probabilistic scene graph of images as the state machine, where questions were transformed into instructions to perform the reasoning process. ...

Reference: SceneGATE: Scene-Graph Based Co-Attention Networks for Text Visual Question Answering

Cited work: REXUP: I REason, I EXtract, I UPdate with Structured Compositional Reasoning for Visual Question Answering · Chapter · November 2020 · Lecture Notes in Computer Science