Relation-Aware Image Captioning for Explainable Visual Question Answering

Ching Shan Tseng, Ying Jia Lin, Hung Yu Kao

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

Recent studies that leverage object detection models for Visual Question Answering (VQA) ignore the correlations and interactions between multiple objects. In addition, previous VQA models are black boxes to humans, which means it is difficult to explain why a model returns a correct or wrong answer. To address these problems, we propose a new model structure that incorporates image captioning into the VQA task. Our model constructs a relation graph according to the relative positions between region pairs and then produces relation-aware visual features with a relation encoder. To make the predictions explainable, we introduce an image captioning module and conduct multi-task training. At the same time, the generated captions are injected into the predictor to assist cross-modal understanding. Experiments show that our model generates meaningful answers and explanations according to the questions and images. Moreover, the relation encoder and the caption-attended predictor yield improvements across different question types.
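The abstract describes two components: a spatial relation graph built from the relative positions of region pairs, and a relation encoder that turns region features into relation-aware visual features. Below is a minimal illustrative sketch in PyTorch of what such a pipeline could look like. The directional bucketing scheme, the names `relative_position_buckets` and `SpatialRelationEncoder`, and the single-head attention are assumptions made for illustration only, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): build a discrete spatial
# relation graph from region boxes, then apply relation-biased attention.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def relative_position_buckets(boxes: torch.Tensor, num_buckets: int = 8) -> torch.Tensor:
    """Assign each ordered region pair (i, j) a discrete spatial relation label
    by quantizing the angle between the two box centers (an assumed scheme).

    boxes: (N, 4) tensor of [x1, y1, x2, y2] coordinates.
    Returns an (N, N) long tensor of relation labels in [0, num_buckets).
    """
    cx = (boxes[:, 0] + boxes[:, 2]) / 2            # (N,) box center x
    cy = (boxes[:, 1] + boxes[:, 3]) / 2            # (N,) box center y
    dx = cx.unsqueeze(0) - cx.unsqueeze(1)          # (N, N) pairwise x offsets
    dy = cy.unsqueeze(0) - cy.unsqueeze(1)          # (N, N) pairwise y offsets
    angle = torch.atan2(dy, dx)                     # (N, N), range [-pi, pi]
    buckets = ((angle + math.pi) / (2 * math.pi) * num_buckets).long()
    return buckets.clamp_(0, num_buckets - 1)       # clamp the angle = pi edge case


class SpatialRelationEncoder(nn.Module):
    """Single-head attention over regions with a learned bias per spatial
    relation type, producing relation-aware visual features."""

    def __init__(self, dim: int, num_relations: int = 8):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.rel_bias = nn.Embedding(num_relations, 1)  # one scalar bias per relation

    def forward(self, feats: torch.Tensor, rel_labels: torch.Tensor) -> torch.Tensor:
        # feats: (N, dim) region features; rel_labels: (N, N) relation ids.
        scores = self.q(feats) @ self.k(feats).t() / feats.size(-1) ** 0.5
        scores = scores + self.rel_bias(rel_labels).squeeze(-1)  # relation-aware bias
        attn = F.softmax(scores, dim=-1)
        return feats + attn @ self.v(feats)  # residual, relation-aware features


# Hypothetical usage: 36 detected regions with 2048-d features, as is common
# for Faster R-CNN bottom-up features.
boxes = torch.rand(36, 4) * 224
feats = torch.rand(36, 2048)
encoder = SpatialRelationEncoder(dim=2048)
relation_aware = encoder(feats, relative_position_buckets(boxes))
```

In the paper's full model, such relation-aware features would additionally feed both the captioning module and the caption-attended answer predictor trained jointly under the multi-task objective; that wiring is omitted here.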

Original language: English
Title of host publication: Proceedings - 2022 International Conference on Technologies and Applications of Artificial Intelligence, TAAI 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 149-154
Number of pages: 6
ISBN (Electronic): 9798350399509
DOIs
State: Published - 2022
Externally published: Yes
Event: 27th International Conference on Technologies and Applications of Artificial Intelligence, TAAI 2022 - Tainan, Taiwan
Duration: 01/12/2022 - 03/12/2022

Publication series

Name: Proceedings - 2022 International Conference on Technologies and Applications of Artificial Intelligence, TAAI 2022

Conference

Conference: 27th International Conference on Technologies and Applications of Artificial Intelligence, TAAI 2022
Country/Territory: Taiwan
City: Tainan
Period: 01/12/22 - 03/12/22

Bibliographical note

Publisher Copyright:
© 2022 IEEE.

Keywords

  • cross-modality learning
  • explainable VQA
  • image captioning
  • multi-task learning
  • visual question answering
