Figure 1. An overview of the FineDiving dataset and the procedure-aware action quality assessment approach. FineDiving is a fine-grained sports video dataset with detailed annotations on action procedures. These annotations make it possible to build a more interpretable action quality assessment approach by constructing temporal segmentation attention between query and exemplar.
Most existing action quality assessment methods predict the score from the deep features of an entire video, which is less reliable because the inference process is opaque and hard to interpret. We argue that understanding both the high-level semantics and the internal temporal structures of actions in competitive sports videos is the key to making predictions accurate and interpretable. Towards this goal, we construct a new fine-grained dataset, named FineDiving, built on diverse diving events with detailed annotations on action procedures. We also propose a procedure-aware approach for action quality assessment, learned by a new Temporal Segmentation Attention module. Specifically, we parse paired query and exemplar action instances into consecutive steps with diverse semantic and temporal correspondences. Procedure-aware cross-attention then learns embeddings between query and exemplar steps to discover their semantic, spatial, and temporal correspondences, and these embeddings feed a fine-grained contrastive regression that yields a reliable scoring mechanism. Extensive experiments demonstrate that our approach achieves substantial improvements over state-of-the-art methods with better interpretability.
Figure 2. Two-level semantic structure.
Figure 3. Two-level temporal structure.
Action type indicates an action routine described by a dive number. Sub-action type is a component of action type: each combination of the presented sub-action types can produce an action type, and different action types can share the same sub-action type. In Figure 2, the green branch denotes different kinds of take-offs; the purple, yellow, and red branches represent somersaults in the three in-air positions (i.e., straight, pike, and tuck), respectively, where each branch contains different numbers of somersault turns; the orange branch indicates different numbers of twist turns in the air, combined with somersaults; the light blue branch denotes entering the water. In Figure 3, the action-level labels describe the temporal boundaries of valid action routines, while the step-level labels provide the starting frames of consecutive steps in the procedure.
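The two-level temporal labels described above can be illustrated with a minimal sketch. It assumes the action-level label is given as a pair of frame indices and the step-level labels as a strictly increasing list of starting frames; the function name `split_into_steps` and the half-open interval convention are illustrative choices, not part of the dataset's released tooling.

```python
from typing import List, Tuple


def split_into_steps(
    action_start: int, action_end: int, step_starts: List[int]
) -> List[Tuple[int, int]]:
    """Turn step starting frames into half-open [start, end) frame intervals.

    Hypothetical helper: each step runs from its own starting frame up to the
    starting frame of the next step; the last step ends at the action boundary.
    """
    if not step_starts:
        raise ValueError("at least one step start is required")
    if sorted(set(step_starts)) != list(step_starts):
        raise ValueError("step starts must be strictly increasing")
    if step_starts[0] < action_start or step_starts[-1] >= action_end:
        raise ValueError("step starts must lie inside the action boundaries")
    # The end of each step is the start of the following one.
    ends = list(step_starts[1:]) + [action_end]
    return list(zip(step_starts, ends))


# Example with made-up frame indices: an action spanning frames 10..100
# that is annotated with three consecutive steps.
steps = split_into_steps(10, 100, [10, 35, 80])
```

With these made-up indices, `steps` holds three consecutive intervals that together cover the whole action routine.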
[Figure 2 panel titles: Action Type Examples; Sub-action Type Examples. Action types shown: Forward 4½ Somersaults Tuck; Armstand Forward 2 Soms. 1 Twist Free; Forward 2½ Somersaults 1 Twist Pike.]
Figure 3. The architecture of the proposed procedure-aware action quality assessment. Given a pair of query and exemplar videos, we extract spatial-temporal visual features with I3D and propose a Temporal Segmentation Attention (TSA) module that assesses action quality by successively performing procedure segmentation, procedure-aware cross-attention learning, and fine-grained contrastive regression. The temporal segmentation attention is supervised by step transition labels and action score labels, which guide the model to focus on exemplar regions that are consistent with the query step and to quantify their differences in order to predict reliable action scores.
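The cross-attention stage in the caption above can be sketched in its simplest form. The sketch assumes plain scaled dot-product attention in which each feature of a query step attends over the features of the corresponding exemplar step; it omits the learned projections, multiple heads, and normalization layers a trained model would use, and the names `softmax` and `cross_attention` are illustrative rather than taken from the authors' code.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_step: List[Vector], exemplar_step: List[Vector]) -> List[Vector]:
    """For each query-step feature, return an attention-weighted mix of exemplar features.

    Simplifying assumption: keys and values are the raw exemplar features
    (identity projections), so this shows only the attention arithmetic.
    """
    dim = len(query_step[0])
    attended = []
    for q in query_step:
        # Similarity between this query feature and every exemplar feature.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in exemplar_step
        ]
        weights = softmax(scores)
        # Weighted sum of exemplar features, one output dimension at a time.
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, exemplar_step)) for i in range(dim)]
        )
    return attended


# Toy features: one query feature attending over two exemplar features.
mixed = cross_attention([[0.0, 0.0]], [[1.0, 3.0], [3.0, 5.0]])
```

In this toy case the query feature is all zeros, so both exemplar features receive equal weight and the output is their mean. In a trained model the weights would instead concentrate on exemplar regions that match the query step.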
Table 2. Performance comparison with existing AQA methods on FineDiving. (w/o DN) indicates that the method selects exemplars randomly, (w/ DN) indicates that it uses dive numbers to select exemplars, and / indicates methods without segmentation.
Table 4. Effects of the number of exemplars for voting.
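As a rough illustration of voting over several exemplars, the sketch below assumes that each exemplar contributes one candidate score, namely its known score plus the relative score predicted by contrastive regression, and that the candidates are averaged. The caption does not state the voting rule, so the averaging and the function name `vote_score` are assumptions made for illustration.

```python
from typing import List, Tuple


def vote_score(exemplar_predictions: List[Tuple[float, float]]) -> float:
    """Average the candidate scores obtained from several exemplars.

    Each item is (exemplar_score, predicted_relative_score). Assumed rule:
    candidate = exemplar_score + predicted_relative_score, final = mean.
    """
    if not exemplar_predictions:
        raise ValueError("at least one exemplar is required")
    candidates = [score + delta for score, delta in exemplar_predictions]
    return sum(candidates) / len(candidates)


# Made-up numbers: three exemplars with their predicted relative scores.
final_score = vote_score([(80.0, 2.0), (70.0, 10.0), (90.0, -5.0)])
```

Under this assumed rule, using more exemplars averages out the error of any single relative prediction, which is one plausible reason for the number of exemplars to affect results.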
Table 3. Ablation study on FineDiving. / indicates methods without segmentation, and $\checkmark$ denotes methods that use the ground-truth step transition labels.