A Comprehensive Benchmark for Video-And-Language Understanding Evaluation.
With Both Video Frames and Subtitles/ASR Transcripts
Diverse video content from YouTube, TV Episodes and Movie Clips
11 datasets over 3 tasks: Retrieval, Question Answering and Captioning.
To Track the Advances in Video-and-Language Research.
The Video-And-Language Understanding Evaluation (VALUE) benchmark is a collection of resources for training, evaluating, and analyzing systems that understand both video and subtitles. VALUE consists of 11 datasets spanning 3 tasks: retrieval, question answering, and captioning.
The format of the VALUE benchmark is model-agnostic: any system capable of processing multi-channel video (video + subtitle) paired with natural language sentences and producing the corresponding predictions is eligible to participate. The ultimate goal of VALUE is to drive research toward general and robust video+language understanding systems.