1 to 4 of 4 Results
Dec 10, 2025
Li, Miaoyu; Chao, Qin; Li, Boyang, 2025, "Replication Data for: Two Causally Related Needles in a Video Haystack", https://doi.org/10.21979/N9/WCSXMT, DR-NTU (Data), V1
Causal2Needles is a benchmark dataset and evaluation toolkit designed to assess the capabilities of both proprietary and open-source multimodal large language models in long-video understanding. It features a large number of "2-needle" questions, where the model must locate and r...
Dec 10, 2025
Chinchure, Aditya; Ravi, Sahithya; Ng, Raymond; Shwartz, Vered; Li, Boyang; Sigal, Leonid, 2025, "Replication Data for: Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events", https://doi.org/10.21979/N9/HOAFUL, DR-NTU (Data), V1
BlackSwanSuite is a benchmark for evaluating VLMs’ ability to reason about unexpected events through abductive and defeasible tasks. The tasks either artificially limit the amount of visual information provided to models while questioning them about hidden unexpected events, or p...
Dec 10, 2025
Zhang, Wenyu; Ng, Wei En; Ma, Lixin; Wang, Yuwen; Zhao, Junqi; Koenecke, Allison; Li, Boyang; Wang, Lu, 2025, "Replication Data for: SPHERE: A Hierarchical Evaluation on Spatial Perception and Reasoning for Vision-Language Models", https://doi.org/10.21979/N9/HI9OFD, DR-NTU (Data), V2
SPHERE (Spatial Perception and Hierarchical Evaluation of Reasoning) is a hierarchical evaluation framework built on a new human-annotated dataset of 2,285 question–answer pairs. It systematically probes models across increasing levels of complexity, from fundamental skills to mu...
Dec 10, 2025
Tiong, Anthony Meng Huat; Zhao, Junqi; Li, Boyang; Li, Junnan; Hoi, Steven C.H.; Xiong, Caiming, 2025, "Replication Data for: What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases", https://doi.org/10.21979/N9/SL0VV1, DR-NTU (Data), V1
The OLIVE dataset is a highly diverse, human-corrected multi-modal collection designed to simulate the variety and idiosyncrasies of user queries vision-language models (VLMs) face in real-world scenarios. It supports the training and evaluation of VLMs in conditions that more cl...