MMToM-QA: Multimodal Theory of Mind Question Answering

Chuanyang Jin, Yutong Wu, Jing Cao, Jiannan Xiang, Yen Ling Kuo, Zhiting Hu, Tomer D. Ullman, Antonio Torralba, Joshua B. Tenenbaum, Tianmin Shu

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

1 Scopus citations

Abstract

Theory of Mind (ToM), the ability to understand people's mental states, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets - either video or text. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person's mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data. To address this, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person's activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and utilizes language models for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that large language models and large multimodal models still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising results, by leveraging the power of both model-based mental inference and language models.

Original languageEnglish
Title of host publicationLong Papers
EditorsLun-Wei Ku, Andre F. T. Martins, Vivek Srikumar
PublisherAssociation for Computational Linguistics (ACL)
Pages16077-16102
Number of pages26
ISBN (Electronic)9798891760943
StatePublished - 2024
Externally publishedYes
Event62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Bangkok, Thailand
Duration: 11 08 202416 08 2024

Publication series

NameProceedings of the Annual Meeting of the Association for Computational Linguistics
Volume1
ISSN (Print)0736-587X

Conference

Conference62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
Country/TerritoryThailand
CityBangkok
Period11/08/2416/08/24

Bibliographical note

Publisher Copyright:
© 2024 Association for Computational Linguistics.

Fingerprint

Dive into the research topics of 'MMToM-QA: Multimodal Theory of Mind Question Answering'. Together they form a unique fingerprint.

Cite this