A Comprehensive Benchmark for Video-And-Language Understanding Evaluation.
With both Video Frames and Subtitles/ASR
Diverse video content from YouTube, TV Episodes, and Movie Clips
11 datasets over 3 tasks: Retrieval, Question Answering, and Captioning.
To track advances in Video-and-Language research.
The Video-And-Language Understanding Evaluation (VALUE) benchmark is a collection of resources for training, evaluating, and analyzing systems that understand both video and subtitles. VALUE consists of 11 datasets spanning 3 tasks: retrieval, question answering, and captioning.
The format of the VALUE benchmark is model-agnostic: any system capable of processing multi-channel video (video frames + subtitles) paired with natural language sentences and producing the corresponding predictions is eligible to participate. The ultimate goal of VALUE is to drive research toward general and robust video-and-language understanding systems.
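As a minimal sketch of what "model-agnostic" means here, the snippet below defines a hypothetical interface covering the three task types: a participating system would only need to consume video frames plus a subtitle/ASR transcript alongside a text input and emit a prediction. All names (`VideoExample`, `ValuePredictor`, and the method signatures) are illustrative assumptions, not part of the official VALUE starter code.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence


@dataclass
class VideoExample:
    """One multi-channel clip: sampled video frames plus subtitle/ASR text."""
    frames: Sequence[bytes]  # encoded frames (e.g., JPEG bytes) sampled from the clip
    subtitle: str            # subtitle or ASR transcript aligned with the clip


class ValuePredictor(Protocol):
    """Hypothetical model-agnostic interface for the three VALUE task types."""

    def retrieve(self, query: str, candidates: List[VideoExample]) -> List[float]:
        """Score each candidate clip against a text query (retrieval)."""
        ...

    def answer(self, question: str, clip: VideoExample) -> str:
        """Answer a natural-language question about a clip (question answering)."""
        ...

    def caption(self, clip: VideoExample) -> str:
        """Generate a natural-language description of a clip (captioning)."""
        ...
```

Any concrete system implementing these three entry points, regardless of its internal architecture, could in principle produce the predictions the benchmark evaluates.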
@InProceedings{li2021value,
  title     = {VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation},
  author    = {Li, Linjie and Lei, Jie and Gan, Zhe and Yu, Licheng and Chen, Yen-Chun and Pillai, Rohit and Cheng, Yu and Zhou, Luowei and Wang, Xin Eric and Wang, William Yang and others},
  booktitle = {35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks},
  year      = {2021}
}