A Comprehensive Benchmark for Video-And-Language Understanding Evaluation.


Multi-channel Video

With both Video Frames and Subtitle/ASR

Diverse Video Domain

Diverse video content from YouTube, TV Episodes and Movie Clips

Various Datasets over Representative Tasks

11 datasets over 3 tasks: Retrieval, Question Answering and Captioning.


To track the advances in Video-and-Language research.

What is VALUE?

The Video-And-Language Understanding Evaluation (VALUE) benchmark is a collection of resources for training, evaluating, and analyzing systems for understanding both video and subtitles. VALUE consists of:

  • A benchmark of 11 video and language tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, video genres, degrees of difficulty and task types
  • A public leaderboard for tracking performance on the benchmark

The format of the VALUE benchmark is model-agnostic: any system capable of processing multi-channel video (video+subtitle) + natural language sentence pairs and producing corresponding predictions is eligible to participate. The ultimate goal of VALUE is to drive research in the development of general and robust video+language understanding systems.
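Because participation only requires honoring the input/output contract above, a submission can be sketched as a single prediction function. The names below (`ValueExample`, `predict`, the field names, and the example data) are illustrative assumptions, not part of the official VALUE submission format:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ValueExample:
    """One multi-channel input: sampled video frames plus subtitle/ASR text.
    All field names here are hypothetical, chosen for illustration only."""
    video_id: str
    frame_paths: List[str]  # paths to sampled video frames
    subtitle: str           # subtitle/ASR transcript for the clip
    query: str              # natural-language sentence (question, caption, or retrieval query)

def predict(example: ValueExample) -> str:
    """Trivial placeholder 'model': returns the first subtitle sentence.
    Any real system (retrieval, QA, captioning) only needs to map this
    multi-channel input to a prediction string or score."""
    return example.subtitle.split(".")[0] if example.subtitle else ""

example = ValueExample(
    video_id="clip_0001",
    frame_paths=["frames/clip_0001/000.jpg"],
    subtitle="The chef dices an onion. Then she heats the pan.",
    query="What does the chef cut?",
)
print(predict(example))  # -> The chef dices an onion
```

A real entry would replace the body of `predict` with an actual video+language model; the point is that the benchmark imposes no constraint on the model's internals, only on the input/output interface.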


Please cite our paper as below if you use the VALUE benchmark or starter code.

@inproceedings{li2021value,
  title={VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation},
  author={Li, Linjie and Lei, Jie and Gan, Zhe and Yu, Licheng and Chen, Yen-Chun and Pillai, Rohit
          and Cheng, Yu and Zhou, Luowei and Wang, Xin Eric and Wang, William Yang and others},
  booktitle={35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks},
  year={2021}
}
We sincerely thank all dataset contributors to the VALUE benchmark. Please also cite the following datasets if you use the VALUE benchmark.


Have any questions or suggestions? Feel free to reach out to us!