A Comprehensive Benchmark for Video-And-Language Understanding Evaluation.


Multi-channel Video

With both Video Frames and Subtitle/ASR

Diverse Video Domain

Diverse video content from YouTube, TV Episodes and Movie Clips

Various Datasets over Representative Tasks

11 datasets over 3 tasks: Retrieval, Question Answering and Captioning.


To track the advances in Video-and-Language research.

What is VALUE?

The Video-And-Language Understanding Evaluation (VALUE) benchmark is a collection of resources for training, evaluating, and analyzing systems for understanding both video and subtitles. VALUE consists of:

  • A benchmark of 11 video-and-language tasks built on established existing datasets, selected to cover a diverse range of dataset sizes, video genres, degrees of difficulty, and task types
  • A public leaderboard for tracking performance on the benchmark

The format of the VALUE benchmark is model-agnostic: any system capable of processing multi-channel video (video+subtitle) paired with natural language sentences and producing corresponding predictions is eligible to participate. The ultimate goal of VALUE is to drive research in the development of general and robust video+language understanding systems.
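To make the model-agnostic contract concrete, here is a minimal, purely illustrative sketch (not the official VALUE API; all names below are hypothetical) of the input/output shape a participating system needs to support for the three task types:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical illustration only: a VALUE-style system just maps a
# (multi-channel video, text) pair to a task-specific prediction.
@dataclass
class MultiChannelVideo:
    frames: List[str]      # e.g. paths/features of sampled video frames
    subtitles: List[str]   # time-aligned subtitle or ASR segments

def predict(video: MultiChannelVideo, text: str, task: str) -> str:
    """Toy stand-in model: returns a placeholder prediction per task type."""
    if task == "retrieval":
        return "ranked_video_ids"   # rank candidate videos for the query
    if task == "qa":
        return "answer_choice"      # select/generate an answer
    if task == "captioning":
        return "generated_caption"  # produce a description of the clip
    raise ValueError(f"unknown task: {task}")
```

Any internal architecture works, as long as the system consumes both channels (frames and subtitles) and emits predictions in the format each task's dataset expects.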


Have any questions or suggestions? Feel free to submit an issue to our GitHub repository!