FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality Assessment


Jinglin Xu* Yongming Rao* Xumin Yu Guangyi Chen Jie Zhou Jiwen Lu

Department of Automation, Tsinghua University, China

Beijing National Research Center for Information Science and Technology, China

[Paper] [Code & Dataset]

Figure 1. An overview of the FineDiving dataset and the procedure-aware action quality assessment approach. FineDiving is a fine-grained sports video dataset with detailed annotations on action procedures. These annotations make it possible to build an action quality assessment approach with better interpretability by constructing temporal segmentation attention between the query and the exemplar.

Abstract

Most existing action quality assessment methods rely on the deep features of an entire video to predict the score, which is less reliable due to the nontransparent inference process and poor interpretability. We argue that understanding both high-level semantics and internal temporal structures of actions in competitive sports videos is the key to making predictions accurate and interpretable. Towards this goal, we construct a new fine-grained dataset, named FineDiving, developed on diverse diving events with detailed annotations on action procedures. We also propose a procedure-aware approach for action quality assessment, learned by a new Temporal Segmentation Attention module. Specifically, we propose to parse pairwise query and exemplar action instances into consecutive steps with diverse semantic and temporal correspondences. The procedure-aware cross-attention is proposed to learn embeddings between query and exemplar steps to discover their semantic, spatial, and temporal correspondences, and further serve for fine-grained contrastive regression to derive a reliable scoring mechanism. Extensive experiments demonstrate our approach achieves substantial improvements over state-of-the-art methods with better interpretability.

Dataset

Figure 2. Two-level semantic structure.

Figure 3. Two-level temporal structure.

Action type indicates an action routine described by a dive number. A sub-action type is a component of an action type: each combination of the presented sub-action types can produce an action type, and different action types can share the same sub-action type. In Figure 2, the green branch denotes different kinds of take-offs; the purple, yellow, and red branches represent somersaults in the three in-air positions (i.e., straight, pike, and tuck), respectively, where each branch contains different numbers of somersault turns; the orange branch indicates different numbers of twist turns in the air, combined with somersaults; the light blue branch denotes entering the water. In Figure 3, the action-level labels describe the temporal boundaries of valid action routines, while the step-level labels provide the starting frames of consecutive steps in the procedure.
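As a minimal sketch of how the step-level labels could be used, the Python snippet below splits the frames of one action instance into consecutive steps. The function name `split_into_steps` and the convention that the last step runs to the end of the action-level clip are assumptions made for illustration; they are not taken from the FineDiving annotation format.

```python
def split_into_steps(frames, step_starts):
    """Split one action instance into consecutive steps.

    frames      : sequence of frames (or per-frame features) of a valid action routine
    step_starts : sorted starting indices of each step, relative to `frames`
                  (assumption: the first step starts at index 0 and the last
                  step ends where the action-level clip ends)
    """
    if not step_starts or list(step_starts) != sorted(step_starts):
        raise ValueError("step_starts must be a non-empty sorted list")
    # Each step ends where the next one starts; the last ends at the clip end.
    bounds = list(step_starts) + [len(frames)]
    return [list(frames[s:e]) for s, e in zip(bounds[:-1], bounds[1:])]


# Hypothetical 10-frame clip with three steps (e.g. take-off, somersault, entry).
steps = split_into_steps(list(range(10)), [0, 3, 7])
```

Here `steps` holds three lists of frame indices, one per step, which is the granularity at which the query and the exemplar are later compared.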

Action Type Examples

109C

205B

407C

5152B

5154B

6243D

Sub-action Type Examples

109C: Forward 4½ Somersaults Tuck
Sub-action types: Forward, 4.5 Soms.Tuck, Entry

6142D: Armstand Forward 2 Soms. 1 Twist Free
Sub-action types: Arm.Forward, 1 Twist, 2 Soms.Pike, Entry

5152B: Forward 2½ Somersaults 1 Twist Pike
Sub-action types: Forward, 2.5 Soms.Pike, 1 Twist, 2.5 Soms.Pike, Entry
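The relation between a dive number and its description in the examples above can be sketched in Python. The snippet assumes the standard competitive diving numbering (first digit for the group, the remaining digits counting half-somersaults and half-twists, and a trailing letter for the body position). The function `parse_dive_number` is hypothetical: it decodes the dive number itself and does not reproduce FineDiving's sub-action label vocabulary.

```python
DIRECTIONS = {"1": "Forward", "2": "Back", "3": "Reverse", "4": "Inward"}
POSITIONS = {"A": "Straight", "B": "Pike", "C": "Tuck", "D": "Free"}


def parse_dive_number(code):
    """Decode a dive number such as '109C' or '5152B' (illustrative sketch)."""
    digits, letter = code[:-1], code[-1].upper()
    if not digits.isdigit() or letter not in POSITIONS:
        raise ValueError(f"not a dive number: {code!r}")
    first = digits[0]
    if first in DIRECTIONS and len(digits) >= 3:
        # Groups 1-4: group digit, flying flag, number of half-somersaults.
        direction, half_soms, half_twists = DIRECTIONS[first], int(digits[2:]), 0
    elif (first == "5" and len(digits) == 4) or (first == "6" and len(digits) in (3, 4)):
        # Twisting (5) and armstand (6): direction, half-somersaults, half-twists.
        if digits[1] not in DIRECTIONS:
            raise ValueError(f"unknown direction in {code!r}")
        direction = DIRECTIONS[digits[1]]
        half_soms = int(digits[2])
        half_twists = int(digits[3]) if len(digits) == 4 else 0
    else:
        raise ValueError(f"unsupported dive number: {code!r}")
    return {
        "armstand": first == "6",
        "direction": direction,
        "somersaults": half_soms / 2,
        "twists": half_twists / 2,
        "position": POSITIONS[letter],
    }


for code in ("109C", "6142D", "5152B"):
    print(code, parse_dive_number(code))
```

Under these assumptions, 109C decodes to a forward dive with 4.5 somersaults in the tuck position, and 5152B to a forward dive with 2.5 somersaults and 1 twist in the pike position, which agrees with the descriptions listed above.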

Approach

Figure 3. The architecture of the proposed procedure-aware action quality assessment approach. Given a pair of query and exemplar videos, we extract spatial-temporal visual features with I3D and propose a Temporal Segmentation Attention (TSA) module that assesses action quality by successively performing procedure segmentation, procedure-aware cross-attention learning, and fine-grained contrastive regression. TSA is supervised by step transition labels and action score labels, which guide the model to focus on exemplar regions that are consistent with the query step and to quantify their differences to predict reliable action scores.
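To make the last two stages more concrete, the following Python sketch shows step-wise cross-attention followed by contrastive regression on toy feature vectors. It is a simplified illustration, not the TSA implementation: the attention is plain scaled dot-product attention without learned projections, `toy_relative_score` is a hypothetical stand-in for the learned regression head, and averaging the per-step relative scores is an assumption.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_step, exemplar_step):
    """For each query frame feature, attend over the exemplar frames of the same step."""
    dim = len(query_step[0])
    attended = []
    for q in query_step:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in exemplar_step]
        weights = softmax(logits)
        attended.append([sum(w * k[d] for w, k in zip(weights, exemplar_step))
                         for d in range(dim)])
    return attended


def toy_relative_score(query_step, attended_step):
    # Hypothetical regression head: mean feature difference between query and exemplar.
    diffs = [qd - ad for q, a in zip(query_step, attended_step) for qd, ad in zip(q, a)]
    return sum(diffs) / len(diffs)


def predict_score(query_steps, exemplar_steps, exemplar_score, regressor=toy_relative_score):
    """Contrastive regression: exemplar score plus the mean per-step relative score."""
    relative = [regressor(q, cross_attention(q, e))
                for q, e in zip(query_steps, exemplar_steps)]
    return exemplar_score + sum(relative) / len(relative)


# Two steps with one 2-d frame feature each; query and exemplar are identical,
# so the predicted score equals the exemplar score.
steps = [[[1.0, 0.0]], [[0.0, 1.0]]]
print(predict_score(steps, steps, exemplar_score=80.0))
```

The point of the sketch is the data flow: the query is compared with the exemplar step by step, and the score is predicted relative to the known exemplar score rather than regressed directly from the whole video.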

Results

Table 2. Performance comparison with existing AQA methods on FineDiving. (w/o DN) indicates that a method selects exemplars randomly, while (w/ DN) indicates that it uses dive numbers to select exemplars; / indicates methods without segmentation.
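The two exemplar selection settings in Table 2 can be sketched in Python as follows. The record layout (a dict with `dive_number` and `score` keys), the function name `select_exemplars`, and the fallback to the whole training set when no dive number matches are all assumptions made for illustration.

```python
import random


def select_exemplars(query_dive_number, training_set, use_dive_number=True, k=1, seed=0):
    """Pick k exemplars for a query video.

    use_dive_number=True  -> (w/ DN): sample among training videos with the same dive number
    use_dive_number=False -> (w/o DN): sample from the whole training set
    """
    rng = random.Random(seed)
    pool = list(training_set)
    if use_dive_number:
        same = [item for item in training_set if item["dive_number"] == query_dive_number]
        # Assumption: fall back to random selection if the dive number is unseen.
        pool = same or pool
    return rng.sample(pool, min(k, len(pool)))


# Hypothetical training records.
train = [
    {"dive_number": "109C", "score": 80.0},
    {"dive_number": "205B", "score": 60.0},
    {"dive_number": "109C", "score": 75.0},
]
print(select_exemplars("109C", train, use_dive_number=True, k=2))
```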

Table 4. Effects of the number of exemplars for voting.
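A minimal Python sketch of voting with several exemplars is given below. It assumes that each exemplar yields one prediction (its own score plus a predicted relative score) and that the predictions are combined by a simple average; the function name `vote` and the averaging rule are illustrative assumptions.

```python
def vote(exemplar_scores, relative_scores):
    """Average the per-exemplar predictions for one query video."""
    if not exemplar_scores or len(exemplar_scores) != len(relative_scores):
        raise ValueError("need one relative score per exemplar")
    predictions = [s + r for s, r in zip(exemplar_scores, relative_scores)]
    return sum(predictions) / len(predictions)


# Two hypothetical exemplars: 80.0 - 5.0 = 75.0 and 70.0 + 7.0 = 77.0.
print(vote([80.0, 70.0], [-5.0, 7.0]))  # 76.0
```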

Table 3. Ablation study on FineDiving. / indicates methods without segmentation, and ✓ denotes the method using the ground-truth step transition labels.

Figure 6. Visualization of the procedure-aware cross-attention between pairwise query and exemplar procedures. Our approach focuses on the exemplar regions that are consistent with the query step, which makes the quantification of step-wise quality differences reliable. The presented query and exemplar pair shares the same action and sub-action types.

Acknowledgement

The template of this webpage is borrowed from DenseCLIP.