Automated Scoring for TikTok Dance Challenges

Shawn Tng
May 20, 2021

This is a continuation of my previous post, where Google’s MediaPipe was used for multi-person pose estimation. From each video frame, we are able to extract body landmarks, or keypoints, from the captured image. These are useful features that represent the dancer’s body posture at each recorded frame. When we collect this information across the entire dance video, we obtain a form of time series data. The idea here is to explore a method to evaluate the posture of a dancer against that of a coach.

The motivation here is to explore automated scoring for TikTok dance challenges. TikTok is a 60-second short-video app filled with memes, comedy, dancing, and talent. In a TikTok dance challenge, the originator performs a dance to a certain audio track, and TikTok users imitate the moves in their own videos.

In the video example, we have popular Taiwanese singer Jolin Tsai (right) doing a dance for her new song release, Stars Align. The guy in yellow is trying to copy her dance moves in his own video.

In this post, I shall explore the idea of using dynamic time warping as a score component for dance evaluation. The caveat is that we assume an almost full view of the person; known challenges such as occlusion and camera motion will not be discussed here.

Dynamic Time Warping

In time series analysis, dynamic time warping (DTW) is an algorithm for measuring the similarity between two temporal sequences that are usually not synchronized.

Source: Wiki Commons

With DTW, elastic alignment between points of two signals produces a better, more intuitive similarity measure. DTW calculates an optimal match between two given sequences, warping them along the time dimension to determine similarity independent of variations in timing. DTW also produces a warping path, which enables alignment between the two signals, as seen above. See here for more information about DTW. In this post, we use DTW to discover pairs of matching frames, so that the posture coordinates from each frame of the learner’s video can be compared against the corresponding frame of the coach’s video.
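To make the warping concrete, below is a minimal, illustrative implementation of classic DTW (quadratic time, not the library used later in this post) on two out-of-phase sine waves:

import numpy as np

def dtw(x, y):
    # Classic DTW between two 1-D sequences: returns the accumulated
    # distance and the warping path as (i, j) index pairs.
    n, m = len(x), len(y)
    dist = np.abs(np.subtract.outer(x, y))   # (a) pairwise distance matrix
    cost = np.full((n + 1, m + 1), np.inf)   # (b) accumulated cost matrix
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    path, (i, j) = [], (n, m)                # (c) backtrack the optimal path
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)],
                   key=lambda p: cost[p])
    path.append((0, 0))
    return cost[n, m], path[::-1]

t = np.linspace(0, 2 * np.pi, 40)
distance, path = dtw(np.sin(t), np.sin(t + 0.5))  # two out-of-phase signals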

Dataset

For illustration, I will be using the two videos below.

Video 1 — Learner (Left) and Coach (Right)
Video 2 — Learner (Left) and Coach (Right)

Both videos have been cropped such that the starting dance move and the last dance move in both videos match. This is an important criterion for DTW to work well.

For the ground truth, the learner in Video 2 is the better dancer. For the curious folks, he is Wayne Huang, a Taiwanese actor, singer, dancer and host.

Multi-Person Pose Estimation

Both videos are processed with 2D pose estimation (see my previous post) to extract normalized pose landmarks for each of the two persons in each video.

Video 1 (With Pose Estimation) — Learner 1 (Left) and Coach (Right)
Video 2 (With Pose Estimation) — Learner 2 (Left) and Coach (Right)

Here, we are using the full-body landmark model in Google’s MediaPipe Pose, which predicts the locations of 33 pose landmarks. You may also consider the upper-body version for scenarios where the lower body is mostly out of view; it predicts only the first 25 pose landmarks.

Pose Landmark Model

Source: https://google.github.io/mediapipe/solutions/pose.html

Each landmark consists of the following (Source: here); a short extraction sketch follows the list:

  • x and y: Landmark coordinates normalized to [0.0, 1.0] by the image width and height respectively.
  • z: Represents the landmark depth with the depth at the midpoint of hips being the origin, and the smaller the value the closer the landmark is to the camera. The magnitude of z uses roughly the same scale as x. [This is not used here as we do not have depth information]
  • visibility: A value in [0.0, 1.0] indicating the likelihood of the landmark being visible (present and not occluded) in the image. [To simplify the explanation, we will ignore this attribute]
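As a concrete example, below is a minimal sketch of how these normalized landmarks could be collected per frame with the MediaPipe Python API. The video path is a placeholder; note that MediaPipe Pose tracks a single person, so the multi-person handling follows the approach in my previous post.

import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

cap = cv2.VideoCapture("dance.mp4")  # placeholder path
frames_xy = []  # one list of 33 normalized (x, y) pairs per frame

with mp_pose.Pose(static_image_mode=False) as pose:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV reads frames as BGR
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks:
            frames_xy.append([(lm.x, lm.y)
                              for lm in results.pose_landmarks.landmark])
cap.release()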

Feature Representation

Now, before I go further into how the keypoints will be used, let’s take a look at the DTW library that I have used: fastdtw. The original DTW algorithm consists of mainly three parts: (a) compute the distance matrix, (b) compute the accumulated cost matrix, and (c) search for the optimal path. Hence, it takes quite a bit of time to compute. fastdtw is a Python implementation of FastDTW, which approximates the DTW algorithm; according to its website, it provides optimal or near-optimal alignments with O(N) time and memory complexity. Below is a code sample of fastdtw:

import numpy as np
from scipy.spatial.distance import euclidean

from fastdtw import fastdtw

# Two short 2-D sequences of different lengths
x = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]])
y = np.array([[2, 2], [3, 3], [4, 4]])

distance, path = fastdtw(x, y, dist=euclidean)
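Here, distance is the accumulated cost of the best alignment, and path is a list of (i, j) index pairs matching x[i] to y[j]. For this toy input, the middle elements of x should align exactly with y, leaving [1, 1] and [5, 5] to absorb a residual distance of about 2.83 (2√2).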

Suppose we now want to replace x and y in the code sample above with pose descriptors to help us with frame matching.

However, fastdtw expects each time step to be a single flat feature vector, so there is a limitation on how we can model the pose descriptor.

Back to our data.

Idea 1: For each frame, we have 33 sets of positional data, one per body part. One idea is to generate 33 separate DTW warp paths, one per landmark. E.g. x(nose) = np.array([[x1,y1],[x2,y2],…])

Idea 2: For each frame, calculate the pairwise distance and angle for desired adjacent pairs of body parts. Then join all the angles and distances for the desired parts into a single array per frame. E.g. x = np.array([[angle1,distance1,angle2,distance2,…], [angle1,distance1,angle2,distance2,…],…]). See the illustration below, followed by a short sketch:

Using pairwise distance and angles instead of positional data
connecting all the dots
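A descriptor along the lines of Idea 2 could be sketched as below, assuming frame_xy is a (33, 2) NumPy array of normalized landmarks; PAIRS is a hypothetical selection of adjacent joints, not the exact set used in this post:

import numpy as np

# Hypothetical adjacent joint pairs (MediaPipe indices):
# 11-13 / 13-15: left shoulder-elbow / elbow-wrist
# 12-14 / 14-16: right shoulder-elbow / elbow-wrist
PAIRS = [(11, 13), (13, 15), (12, 14), (14, 16)]

def angles_and_distances(frame_xy):
    # frame_xy: (33, 2) array of normalized landmarks for one frame
    feats = []
    for a, b in PAIRS:
        v = frame_xy[b] - frame_xy[a]
        feats.append(np.arctan2(v[1], v[0]))  # orientation of the segment
        feats.append(np.linalg.norm(v))       # length of the segment
    return np.array(feats)  # flat [angle1, distance1, angle2, distance2, ...]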

Idea 3: For each frame, join all the positional data for the desired parts into a single array. E.g. x = np.array([[x1,y1,x2,y2,…],…]). A minimal sketch follows:
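Below is a minimal sketch of Idea 3 (not my exact code), assuming the landmark sequences are stored as (num_frames, 33, 2) arrays; the random arrays are stand-ins for the real extracted landmarks:

import numpy as np
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw

# Stand-ins for the extracted landmark sequences of each video
rng = np.random.default_rng(0)
learner_xy = rng.random((120, 33, 2))  # (num_frames, 33 landmarks, x/y)
coach_xy = rng.random((150, 33, 2))

# Idea 3: flatten each frame's landmarks into one 66-dimensional vector
learner_flat = learner_xy.reshape(len(learner_xy), -1)
coach_flat = coach_xy.reshape(len(coach_xy), -1)

distance, path = fastdtw(learner_flat, coach_flat, dist=euclidean)
# distance: overall similarity score (smaller = closer to the coach)
# path: (learner_frame, coach_frame) index pairs of matched frames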

Results

Idea 1 didn’t work well: the 33 resulting DTW alignments are disparate, and further thought is needed on how to combine them into a single score.

Idea 2 didn’t work well either, even when angles or distances were used alone.

Idea 3 was the one finally implemented, as it worked for me: it was able to discern the difference between a good dancer and a poor one. You may consider using the points and method relevant to your own use case.

Learner 2 is a better dancer!

A smaller DTW distance implies higher similarity. The lower distance for Learner 2 is validated against the ground truth.

In addition, the DTW warp paths are used to determine the matched frames. See the images where the postures of coach and learner in matched frames are superimposed on each other.

Learner 1’s matched frames with Coach
Learner 2’s matched frames with Coach
6 core angles to distinguish dance movement. See Source

The coordinates of matched frames could be used to compute the six core angles for further evaluation. Using an angle or distance threshold, we could effectively evaluate the learner’s dance moves against those of the coach!
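As an illustration, one such check could look like the sketch below, where the angle at a joint is computed from three landmarks of a matched frame pair and compared against a tolerance; the 15-degree threshold and the right-elbow triplet (MediaPipe indices 12, 14, 16) are assumed values for illustration, not choices from this post:

import numpy as np

def joint_angle(a, b, c):
    # Angle at joint b (degrees) between segments b->a and b->c
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

ANGLE_TOLERANCE = 15.0  # degrees; an assumed threshold

def elbow_matches(learner_xy, coach_xy):
    # learner_xy, coach_xy: (33, 2) landmark arrays of one matched frame pair
    learner = joint_angle(learner_xy[12], learner_xy[14], learner_xy[16])
    coach = joint_angle(coach_xy[12], coach_xy[14], coach_xy[16])
    return abs(learner - coach) <= ANGLE_TOLERANCE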

Thanks for reading, I hope you enjoyed the post. Feel free to share if you have other ideas about automated scoring. =)
