Leaderboard

If you would like to report your results here, please follow the instructions on the VALUE website or GitHub repository. All results must have a submission entry on CodaLab.

The VALUE leaderboard compiles results from task-agnostic models that can be applied to all three types of tasks. For task-specific models that work on a single type of task, please use the tabs below to navigate to the corresponding leaderboard.

The models are ranked by Mean-Rank, the average of a model's ranks over the 11 tasks. We break ties using Meta-Ave, the average of a model's scores across the 11 tasks. AveR, accuracy, and CIDEr-D are used as the evaluation metrics for the retrieval, QA, and captioning tasks, respectively.
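
For clarity, a minimal sketch of this ranking rule is shown below. It is illustrative only (not the official VALUE evaluation code) and assumes per-task scores are available as a nested `scores[model][task]` mapping, where higher is better for every task.

```python
# Illustrative sketch of the ranking scheme described above, not the official
# VALUE evaluation code. Assumes scores[model][task] holds the per-task score
# (AveR, accuracy, or CIDEr-D), where higher is better.

def rank_models(scores):
    models = list(scores)
    tasks = sorted({t for per_task in scores.values() for t in per_task})

    # Rank of each model on each task (1 = best on that task).
    per_task_rank = {m: {} for m in models}
    for task in tasks:
        ordered = sorted(models,
                         key=lambda m: scores[m].get(task, float("-inf")),
                         reverse=True)
        for r, m in enumerate(ordered, start=1):
            per_task_rank[m][task] = r

    def mean_rank(m):   # average rank over all tasks (lower is better)
        return sum(per_task_rank[m].values()) / len(tasks)

    def meta_ave(m):    # average score over all tasks (used as tie-breaker)
        return sum(scores[m].get(t, 0.0) for t in tasks) / len(tasks)

    # Order by Mean-Rank; break ties with the higher Meta-Ave.
    return sorted(models, key=lambda m: (mean_rank(m), -meta_ave(m)))
```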

Rank | Model | Mean-Rank | Meta-Ave | TVR | How2R | YC2R | VATEX-EN-R | TVQA | How2QA | VIOLIN | VLEP | TVC | YC2C | VATEX-EN-C
- | Human (VALUE baseline, 06/07/2021) | - | - | - | - | - | - | 89.41 | 90.32 | 91.39 | 90.50 | 62.89 | - | 62.66

For text-to-video retrieval tasks, we report AveR for each task. AveR is the average of R@K (K = 1, 5, 10). The models are ranked by the Mean-Rank, the average of model ranks over 4 retrieval tasks. We break ties using the average of AveRs across 4 tasks.
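
As a rough illustration (again, not the official evaluation scripts), AveR for a single retrieval task can be computed as sketched below, assuming `ranks` holds, for each query, the rank at which the ground-truth video is retrieved.

```python
# Illustrative computation of AveR for one retrieval task (not the official
# VALUE evaluation code). `ranks` holds, for each query, the rank at which
# the ground-truth video was retrieved (1 = retrieved first).

def recall_at_k(ranks, k):
    """R@K: percentage of queries whose ground truth is ranked within the top K."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)

def ave_r(ranks):
    """AveR: the average of R@1, R@5, and R@10."""
    return sum(recall_at_k(ranks, k) for k in (1, 5, 10)) / 3.0
```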

Rank | Model | Mean-Rank | Ave-Score | TVR | How2R | YC2R | VATEX-EN-R

For video question answering tasks, we report accuracy on each task. The models are ranked by the Mean-Rank, the average of model ranks over 4 QA tasks. We break ties using the average of accuracies across 4 tasks.

Rank | Model | Mean-Rank | Ave-Score | TVQA | How2QA | VIOLIN | VLEP
- | Human (VALUE baseline, 06/07/2021) | - | 90.41 | 89.41 | 90.32 | 91.39 | 90.50

For video captioning tasks, we report CIDEr-D on each task. The models are ranked by Mean-Rank, the average of model ranks over the 3 captioning tasks. We break ties using the average of CIDEr-D scores across the 3 tasks.

Rank | Model | Mean-Rank | Ave-Score | TVC | YC2C | VATEX-EN-C
- | Human (VALUE baseline, 06/07/2021) | - | - | 62.89 | - | 62.66