|
Persistent Identifier
|
doi:10.21979/N9/SMR703 |
|
Publication Date
|
2024-10-08 |
|
Title
| FunQA: Towards Surprising Video Comprehension |
|
Author
| Xie, Binzhu (Beijing University of Posts and Telecommunications, Beijing, China)
Zhang, Sicheng (Beijing University of Posts and Telecommunications, Beijing, China)
Zhou, Zitang (Beijing University of Posts and Telecommunications, Beijing, China)
Li, Bo (Nanyang Technological University)
Zhang, Yuanhan (Nanyang Technological University)
Hessel, Jack (The Allen Institute for AI, WA, USA)
Yang, Jingkang (Nanyang Technological University)
Liu, Ziwei (Nanyang Technological University) |
|
Point of Contact
|
Yang Jingkang (Nanyang Technological University) |
|
Description
| Surprising videos, e.g., funny clips, creative performances, or visual illusions, attract significant attention. Enjoyment of these videos is not simply a response to visual stimuli; rather, it hinges on the human capacity to understand (and appreciate) the commonsense violations depicted in them. We introduce FunQA, a challenging video question answering (QA) dataset specifically designed to evaluate and enhance the depth of video reasoning based on counter-intuitive and fun videos. Unlike most video QA benchmarks, which focus on less surprising contexts, e.g., cooking or instructional videos, FunQA covers three previously unexplored types of surprising videos: 1) HumorQA, 2) CreativeQA, and 3) MagicQA. For each subset, we establish rigorous QA tasks designed to assess the model's capability in counter-intuitive timestamp localization, detailed video description, and reasoning around counter-intuitiveness. We also pose higher-level tasks, such as attributing a fitting and vivid title to the video and scoring the video's creativity. In total, the FunQA benchmark consists of 312K free-text QA pairs derived from 4.3K video clips, spanning a total of 24 video hours. Extensive experiments with existing VideoQA models reveal significant performance gaps on the FunQA videos across spatial-temporal reasoning, visual-centered reasoning, and free-text generation. |
|
Subject
| Computer and Information Science |
|
Keyword
| Video Understanding |
|
Related Publication
| Xie, B., Zhang, S., Zhou, Z., Li, B., Zhang, Y., Hessel, J., ... & Liu, Z. (2023). FunQA: Towards surprising video comprehension. arXiv preprint arXiv:2306.14899. arXiv: 2306.14899v2 https://arxiv.org/abs/2306.14899
Xie, B., Zhang, S., Zhou, Z., Li, B., Zhang, Y., Hessel, J., ... & Liu, Z. (2024, September). FunQA: Towards surprising video comprehension. In European Conference on Computer Vision (pp. 39-57). Cham: Springer Nature Switzerland. doi: 10.1007/978-3-031-73232-4_3 https://link.springer.com/chapter/10.1007/978-3-031-73232-4_3
Xie, B., Zhang, S., Zhou, Z., Li, B., Zhang, Y., Hessel, J., ... & Liu, Z. (2024, September). FunQA: Towards surprising video comprehension. In European Conference on Computer Vision (pp. 39-57). Cham: Springer Nature Switzerland. handle: 10356/201847 https://hdl.handle.net/10356/201847 |
|
Funding Information
| Ministry of Education (MOE): MOE AcRF Tier 2 (MOE-T2EP20221-0012)
Nanyang Technological University: NTU NAP
RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative |
|
Depositor
| Yang Jingkang |
|
Deposit Date
| 2024-09-26 |
|
Data Type
| Video Question Answering |
|
Software
| OpenAI GPT-4 |
|
Related Dataset
| GitHub: Link |