Dec 10, 2025
Li, Miaoyu; Chao, Qin; Li, Boyang, 2025, "Replication Data for: Two Causally Related Needles in a Video Haystack", https://doi.org/10.21979/N9/WCSXMT, DR-NTU (Data), V1
Causal2Needles is a benchmark dataset and evaluation toolkit designed to assess the capabilities of both proprietary and open-source multimodal large language models in long-video understanding. It features a large number of "2-needle" questions, where the model must locate and r...
Dec 10, 2025
Chinchure, Aditya; Ravi, Sahithya; Ng, Raymond; Shwartz, Vered; Li, Boyang; Sigal, Leonid, 2025, "Replication Data for: Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events", https://doi.org/10.21979/N9/HOAFUL, DR-NTU (Data), V1
BlackSwanSuite is a benchmark for evaluating VLMs’ ability to reason about unexpected events through abductive and defeasible tasks. The tasks either artificially limit the amount of visual information provided to models while questioning them about hidden unexpected events, or p...
Dec 10, 2025
Zhang, Wenyu; Ng, Wei En; Ma, Lixin; Wang, Yuwen; Zhao, Junqi; Koenecke, Allison; Li, Boyang; Wang, Lu, 2025, "Replication Data for: SPHERE: A Hierarchical Evaluation on Spatial Perception and Reasoning for Vision-Language Models", https://doi.org/10.21979/N9/HI9OFD, DR-NTU (Data), V2
SPHERE (Spatial Perception and Hierarchical Evaluation of Reasoning) is a hierarchical evaluation framework built on a new human-annotated dataset of 2,285 question–answer pairs. It systematically probes models across increasing levels of complexity, from fundamental skills to mu...
Dec 10, 2025
Tiong, Anthony Meng Huat; Zhao, Junqi; Li, Boyang; Li, Junnan; Hoi, Steven C.H.; Xiong, Caiming, 2025, "Replication Data for: What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases", https://doi.org/10.21979/N9/SL0VV1, DR-NTU (Data), V1
The OLIVE dataset is a highly diverse, human-corrected multi-modal collection designed to simulate the variety and idiosyncrasies of user queries that vision-language models (VLMs) face in real-world scenarios. It supports the training and evaluation of VLMs in conditions that more cl...
