Abstract
Recent studies leveraging object detection models for Visual Question Answering (VQA) overlook the correlations and interactions between multiple objects. In addition, previous VQA models are black boxes to human users, making it difficult to explain why a model returns a correct or wrong answer. To address these problems, we propose a new model structure that incorporates image captioning into the VQA task. Our model constructs a relation graph from the relative positions between region pairs and then produces relation-aware visual features with a relation encoder. To make the predictions explainable, we introduce an image captioning module and conduct multi-task training. Meanwhile, the generated captions are injected into the predictor to assist cross-modal understanding. Experiments show that our model generates meaningful answers and explanations conditioned on the questions and images. Moreover, the relation encoder and the caption-attended predictor improve performance across different question types.
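The abstract's relation-graph construction can be illustrated with a small sketch. The relation categories, thresholds, and function names below are illustrative assumptions, not the paper's exact scheme: each ordered pair of detected regions is assigned a coarse spatial label from the relative positions of their bounding-box centers.

```python
# Hypothetical sketch of a spatial relation graph over detected regions.
# The label set and the center-based heuristic are assumptions for
# illustration; the paper's actual relation definitions may differ.

def box_center(box):
    """Center (cx, cy) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def spatial_relation(box_a, box_b):
    """Coarse relative-position label from box_a to box_b."""
    (ax, ay), (bx, by) = box_center(box_a), box_center(box_b)
    dx, dy = bx - ax, by - ay
    # Dominant axis decides the label (image y grows downward).
    if abs(dx) >= abs(dy):
        return "right-of" if dx > 0 else "left-of"
    return "below" if dy > 0 else "above"

def build_relation_graph(boxes):
    """Directed edges (i, j, label) for every ordered region pair."""
    edges = []
    for i, bi in enumerate(boxes):
        for j, bj in enumerate(boxes):
            if i != j:
                edges.append((i, j, spatial_relation(bi, bj)))
    return edges

# Two regions side by side: region 1 lies to the right of region 0.
boxes = [(0, 0, 10, 10), (20, 0, 30, 10)]
print(build_relation_graph(boxes))  # [(0, 1, 'right-of'), (1, 0, 'left-of')]
```

A relation encoder would then embed these edge labels alongside the regions' visual features so that attention between a region pair is conditioned on their spatial relation.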
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2022 International Conference on Technologies and Applications of Artificial Intelligence, TAAI 2022 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 149-154 |
| Number of pages | 6 |
| ISBN (Electronic) | 9798350399509 |
| DOIs | |
| State | Published - 2022 |
| Externally published | Yes |
| Event | 27th International Conference on Technologies and Applications of Artificial Intelligence, TAAI 2022 - Tainan, Taiwan. Duration: 01/12/2022 → 03/12/2022 |
Publication series
| Name | Proceedings - 2022 International Conference on Technologies and Applications of Artificial Intelligence, TAAI 2022 |
|---|
Conference
| Conference | 27th International Conference on Technologies and Applications of Artificial Intelligence, TAAI 2022 |
|---|---|
| Country/Territory | Taiwan |
| City | Tainan |
| Period | 01/12/22 → 03/12/22 |
Bibliographical note
Publisher Copyright: © 2022 IEEE.
Keywords
- cross-modality learning
- explainable VQA
- image captioning
- multi-task learning
- visual question answering