FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality Assessment
Jinglin Xu* Yongming Rao* Xumin Yu Guangyi Chen Jie Zhou Jiwen Lu
Department of Automation, Tsinghua University, China
Beijing National Research Center for Information Science and Technology, China
[Paper] [Code & Dataset]
Figure 1. An overview of the FineDiving dataset and the procedure-aware action quality assessment approach. FineDiving is a fine-grained sports video dataset with detailed annotations on action procedures. These annotations make it possible to build an action quality assessment approach with better interpretability by constructing temporal segmentation attention between query and exemplar.
Abstract
Most existing action quality assessment methods rely on the deep features of an entire video to predict the score, which is less reliable due to the nontransparent inference process and poor interpretability. We argue that understanding both high-level semantics and internal temporal structures of actions in competitive sports videos is the key to making predictions accurate and interpretable. Towards this goal, we construct a new fine-grained dataset, named FineDiving, developed on diverse diving events with detailed annotations on action procedures. We also propose a procedure-aware approach for action quality assessment, learned by a new Temporal Segmentation Attention module. Specifically, we propose to parse pairwise query and exemplar action instances into consecutive steps with diverse semantic and temporal correspondences. We propose procedure-aware cross-attention to learn embeddings between query and exemplar steps and discover their semantic, spatial, and temporal correspondences; these embeddings then serve fine-grained contrastive regression to derive a reliable scoring mechanism. Extensive experiments demonstrate that our approach achieves substantial improvements over state-of-the-art methods with better interpretability.
Dataset
Figure 2. Two-level semantic structure.
Figure 3. Two-level temporal structure.
Action type indicates an action routine described by a dive number. A sub-action type is a component of an action type: each combination of the presented sub-action types produces an action type, and different action types can share the same sub-action type. In Figure 2, the green branch denotes different kinds of take-offs; the purple, yellow, and red branches represent somersaults in the three positions (i.e., straight, pike, and tuck) in the air, respectively, where each branch contains different numbers of somersault turns; the orange branch indicates different numbers of twist turns in the air, combined with somersaults; the light blue branch denotes entering the water. In Figure 3, the action-level labels describe the temporal boundaries of valid action routines, while the step-level labels provide the starting frames of consecutive steps in the procedure.
Example dive numbers: 109C, 205B, 407C, 5152B, 5154B, 6243D
109C: Forward 4½ Somersaults Tuck
Sub-actions: Forward, 4.5 Soms.Tuck, Entry
6142D: Armstand Forward 2 Soms. 1 Twist Free
Sub-actions: Arm.Forward, 1 Twist, 2 Soms.Pike, Entry
5152B: Forward 2½ Somersaults 1 Twist Pike
Sub-actions: Forward, 2.5 Soms.Pike, 1 Twist, 2.5 Soms.Pike, Entry
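The relation between a dive number and its action type, as in the 109C, 6142D, and 5152B examples above, can be illustrated with a small parser. This is a minimal sketch that assumes the standard competition dive-numbering convention (group digit, half-somersault count, half-twist count, position letter), which this page does not spell out. The names `Dive` and `parse_dive_number` are hypothetical and are not part of the FineDiving code release.

```python
from dataclasses import dataclass

# Assumed standard diving conventions (not defined on this page).
GROUPS = {1: "Forward", 2: "Back", 3: "Reverse", 4: "Inward"}
POSITIONS = {"A": "Straight", "B": "Pike", "C": "Tuck", "D": "Free"}


@dataclass
class Dive:
    take_off: str       # e.g. "Forward" or "Armstand Forward"
    somersaults: float  # number of somersaults, in steps of 0.5
    twists: float       # number of twists, in steps of 0.5
    position: str       # "Straight", "Pike", "Tuck", or "Free"


def parse_dive_number(code: str) -> Dive:
    """Decode a dive number such as '109C' into its components."""
    digits, letter = code[:-1], code[-1].upper()
    if not digits.isdigit() or letter not in POSITIONS:
        raise ValueError(f"not a dive number: {code!r}")
    d = [int(ch) for ch in digits]

    if d[0] in GROUPS and len(d) == 3:
        # group, flying flag (ignored in this sketch), half-somersaults
        direction, half_soms, half_twists = d[0], d[2], 0
        prefix = ""
    elif d[0] == 5 and len(d) == 4:
        # twisting group: 5, direction, half-somersaults, half-twists
        direction, half_soms, half_twists = d[1], d[2], d[3]
        prefix = ""
    elif d[0] == 6 and len(d) in (3, 4):
        # armstand group: 6, direction, half-somersaults[, half-twists]
        direction, half_soms = d[1], d[2]
        half_twists = d[3] if len(d) == 4 else 0
        prefix = "Armstand "
    else:
        raise ValueError(f"unsupported dive number: {code!r}")

    if direction not in GROUPS:
        raise ValueError(f"unknown direction digit in {code!r}")
    return Dive(prefix + GROUPS[direction], half_soms / 2,
                half_twists / 2, POSITIONS[letter])
```

For instance, `parse_dive_number("5152B")` decodes to a forward take-off with 2.5 somersaults and 1 twist in the pike position, which matches the action type listed for 5152B.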
Approach
Figure 3. The architecture of the proposed procedure-aware action quality assessment. Given a pair of query and exemplar videos, we extract spatial-temporal visual features with I3D and propose a Temporal Segmentation Attention (TSA) module that assesses action quality by successively performing procedure segmentation, procedure-aware cross-attention learning, and fine-grained contrastive regression. The temporal segmentation attention is supervised by step transition labels and action score labels, which guide the model to focus on exemplar regions that are consistent with the query step and to quantify their differences to predict reliable action scores.
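The idea of a query step attending to consistent exemplar regions can be illustrated with plain scaled dot-product cross-attention. This is only a sketch under stated assumptions: it omits the learned projections, multi-head structure, and I3D features that a real model would use, and the function names and toy feature vectors are hypothetical rather than taken from the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_steps, exemplar_steps):
    """Each query step feature attends over all exemplar step features.

    query_steps, exemplar_steps: lists of equal-length feature vectors.
    Returns (attended_features, attention_weights), one row per query step.
    """
    dim = len(query_steps[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in query_steps:
        # Similarity between this query step and every exemplar step.
        scores = [scale * sum(a * b for a, b in zip(q, k))
                  for k in exemplar_steps]
        w = softmax(scores)
        # Weighted sum of exemplar features (exemplar acts as key and value).
        attended = [sum(wi * k[j] for wi, k in zip(w, exemplar_steps))
                    for j in range(dim)]
        outputs.append(attended)
        weights.append(w)
    return outputs, weights


# Toy example: two steps per video, two-dimensional features.
query = [[1.0, 0.0], [0.0, 1.0]]
exemplar = [[2.0, 0.0], [0.0, 2.0]]
attended, attn = cross_attention(query, exemplar)
```

In this toy example each query step puts most of its attention weight on the exemplar step with the matching feature direction, which is the kind of pattern the attention visualization is meant to expose.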
Results
Table 2. Performance comparison with existing AQA methods on FineDiving. (w/o DN) indicates that a method selects exemplars randomly, while (w/ DN) indicates that it uses dive numbers to select exemplars; / indicates the methods without segmentation.
Table 4. Effects of the number of exemplars for voting.
Table 3. Ablation study on FineDiving. / indicates the methods without segmentation and $\checkmark$ denotes the method using the ground-truth step transition labels.
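The multi-exemplar voting studied in Table 4 can be sketched as follows. The sketch assumes that contrastive regression predicts a relative score between the query and each exemplar, and that voting averages the resulting per-exemplar estimates; both the averaging rule and the name `vote_score` are assumptions made for illustration, not details confirmed on this page.

```python
def vote_score(exemplar_scores, predicted_deltas):
    """Combine per-exemplar estimates of the query score by averaging.

    exemplar_scores: known scores of the selected exemplars.
    predicted_deltas: predicted (query - exemplar) score differences,
        one per exemplar (assumed output of contrastive regression).
    """
    if not exemplar_scores or len(exemplar_scores) != len(predicted_deltas):
        raise ValueError("need one predicted delta per exemplar")
    # Each exemplar yields its own estimate of the query score.
    estimates = [s + d for s, d in zip(exemplar_scores, predicted_deltas)]
    return sum(estimates) / len(estimates)
```

With more exemplars, an error in any single predicted difference has less influence on the averaged estimate, which is one plausible reason to vary the number of exemplars.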
Figure 6. Visualization of the procedure-aware cross-attention between pairwise query and exemplar procedures. Our approach can focus on the exemplar regions that are consistent with the query step, which makes the quantification of step-wise quality differences reliable. The presented query and exemplar pairs contain the same action and sub-action types.
Acknowledgement
The template of this webpage is borrowed from DenseCLIP.