If you would like to report your results here, please follow the instructions on the VALUE website and GitHub repository. All results must have a submission entry on CodaLab.
The VALUE leaderboard compiles results from task-agnostic models that can be applied to all three task types. For task-specific models that work on a single type of task, please use the tabs below to navigate to the corresponding leaderboard.
The models are ranked by Mean-Rank, the average of a model's ranks over the 11 tasks. We break ties using Meta-Ave, the average of a model's scores across the 11 tasks. AveR, accuracy, and CIDEr are used as the evaluation metrics for the Retrieval, QA, and Captioning tasks, respectively.
Rank | Model | Mean-Rank | Meta-Ave | TVR | How2R | YC2R | VATEX-EN-R | TVQA | How2QA | VIOLIN | VLEP | TVC | YC2C | VATEX-EN-C |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
\- | Human (VALUE baseline, 06/07/2021) | - | - | - | - | - | - | 89.41 | 90.32 | 91.39 | 90.50 | 62.89 | - | 62.66 |
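As a rough, unofficial sketch of this ranking rule (this is not the benchmark's evaluation code; the model names and scores below are made up for illustration), each submission is ranked on every task, the per-task ranks are averaged into Mean-Rank, and Meta-Ave breaks ties:

```python
from statistics import mean

# The 11 VALUE task columns from the table above.
TASKS = ["TVR", "How2R", "YC2R", "VATEX-EN-R",
         "TVQA", "How2QA", "VIOLIN", "VLEP",
         "TVC", "YC2C", "VATEX-EN-C"]

# Hypothetical per-task scores for three submissions (higher is better
# for AveR, accuracy, and CIDEr alike).
scores = {
    "model_a": {t: 50.0 for t in TASKS},
    "model_b": {t: 55.0 for t in TASKS},
    "model_c": {t: 45.0 for t in TASKS},
}

def mean_rank(scores):
    """Average of a model's per-task ranks (rank 1 = best score on that task)."""
    ranks = {m: [] for m in scores}
    for task in TASKS:
        ordered = sorted(scores, key=lambda m: scores[m][task], reverse=True)
        for position, model in enumerate(ordered, start=1):
            ranks[model].append(position)
    return {m: mean(r) for m, r in ranks.items()}

def meta_ave(scores):
    """Average of a model's scores across all tasks (used to break Mean-Rank ties)."""
    return {m: mean(s.values()) for m, s in scores.items()}

mr, ma = mean_rank(scores), meta_ave(scores)
# Lower Mean-Rank is better, higher Meta-Ave is better, hence the negated tie-breaker.
leaderboard = sorted(scores, key=lambda m: (mr[m], -ma[m]))
print(leaderboard)  # ['model_b', 'model_a', 'model_c']
```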
For text-to-video retrieval tasks, we report AveR for each task. AveR is the average of R@K (K = 1, 5, 10). The models are ranked by Mean-Rank, the average of model ranks over the 4 retrieval tasks. We break ties using Ave-Score, the average of AveR across the 4 tasks.
Rank | Model | Mean-Rank | Ave-Score | TVR | How2R | YC2R | VATEX-EN-R |
---|---|---|---|---|---|---|---|
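A minimal sketch of the AveR metric, assuming each query's retrieval result is summarized by the 1-based rank of its ground-truth video (the `gt_ranks` values below are illustrative, not benchmark data):

```python
def recall_at_k(gt_ranks, k):
    """Fraction of queries (in %) whose ground-truth video is ranked within the top k."""
    return 100.0 * sum(rank <= k for rank in gt_ranks) / len(gt_ranks)

def ave_r(gt_ranks, ks=(1, 5, 10)):
    """AveR: the average of R@1, R@5, and R@10."""
    return sum(recall_at_k(gt_ranks, k) for k in ks) / len(ks)

# Example: 1-based ranks of the ground-truth video for five hypothetical queries.
gt_ranks = [1, 3, 7, 12, 2]
print(ave_r(gt_ranks))  # averages R@1 = 20.0, R@5 = 60.0, R@10 = 80.0 -> 53.33...
```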
For video question answering tasks, we report accuracy on each task. The models are ranked by Mean-Rank, the average of model ranks over the 4 QA tasks. We break ties using Ave-Score, the average of accuracies across the 4 tasks.
Rank | Model | Mean-Rank | Ave-Score | TVQA | How2QA | VIOLIN | VLEP |
---|---|---|---|---|---|---|---|
\- | Human (VALUE baseline, 06/07/2021) | - | 90.41 | 89.41 | 90.32 | 91.39 | 90.50 |
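For reference, the Ave-Score reported for human performance is simply the mean of the four accuracies: (89.41 + 90.32 + 91.39 + 90.50) / 4 ≈ 90.41.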