Multi-Person Pose Estimation with Mediapipe

Shawn Tng
Mar 20, 2021

Human pose estimation from video is used in applications such as sign language recognition and full-body gesture control. It is also used to classify movement sequences in physical activities such as yoga, exercise and dance, where detecting body landmarks makes it possible to quantify movement.

Are you with the crew?

A variety of pose estimation software is available, such as OpenPose, MediaPipe and PoseNet. While OpenPose and PoseNet support real-time multi-person pose estimation, MediaPipe supports only single-person pose estimation.

Single Pose Estimation

Current state-of-the-art approaches rely primarily on powerful desktop environments for inference, whereas MediaPipe’s method achieves real-time performance on most modern mobile phones, desktops and the web (JavaScript). In this article, I will share how to adapt MediaPipe for multi-person pose estimation.


I decided to use a TikTok video for this experiment. The dataset can be found here. In the video, celebrities Fengze and Shou perform a TikTok dance to the latter’s song “Colorful”.

Something about Mediapipe’s Pose Estimation


The MediaPipe Pose Landmark feature extracts 33 landmark keypoints, as shown above. The output is a list of pose landmarks, and each landmark consists of x and y coordinates normalized to [0.0, 1.0] by the image width and height respectively.
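To make the normalization concrete, here is a minimal sketch of converting a normalized landmark into pixel coordinates. The helper name `to_pixel` is hypothetical and not part of MediaPipe; it only assumes that x is scaled by the image width and y by the image height, as described above.

```python
def to_pixel(x_norm, y_norm, image_width, image_height):
    """Convert normalized landmark coordinates to integer pixel coordinates.

    Hypothetical helper, not a MediaPipe function. The result is clamped to
    the image because a landmark near the border can fall slightly outside
    the [0.0, 1.0] range.
    """
    x_px = min(max(int(x_norm * image_width), 0), image_width - 1)
    y_px = min(max(int(y_norm * image_height), 0), image_height - 1)
    return x_px, y_px
```

For example, with `lm = results.pose_landmarks.landmark[mp_pose.PoseLandmark.NOSE]`, calling `to_pixel(lm.x, lm.y, image_width, image_height)` gives the nose position in pixels.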

If we intend to compare the poses of two people in a photo, e.g. in a dance like the one in this video, one feasible approach is to determine the bounding box of each person, crop the box, and then run single-person pose estimation on the cropped image.
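One detail of this crop-then-estimate approach is that the landmarks come back normalized to the crop, not to the full frame. The sketch below shows one way to map them back; `crop_to_frame` is a hypothetical helper, and the box layout (x_min, y_min, box_width, box_height) in frame pixels is an assumption.

```python
def crop_to_frame(x_norm, y_norm, box):
    """Map a landmark normalized to a crop back to full-frame pixel coordinates.

    Hypothetical helper. box is assumed to be
    (x_min, y_min, box_width, box_height) of the crop, in frame pixels.
    """
    x_min, y_min, box_width, box_height = box
    return x_min + x_norm * box_width, y_min + y_norm * box_height
```

With both people mapped into the same frame coordinates, their landmarks can be compared or drawn on the original image directly.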

Sample Code

import cv2
import mediapipe as mp

mp_drawing = mp.solutions.drawing_utils
mp_pose = mp.solutions.pose

# Paths of the images to process (placeholder; fill in your own files).
file_list = []

# For static images:
with mp_pose.Pose(
        static_image_mode=True, min_detection_confidence=0.5) as pose:
    for idx, file in enumerate(file_list):
        image = cv2.imread(file)
        image_height, image_width, _ = image.shape
        # Convert the BGR image to RGB before processing.
        results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

        # Skip images in which no pose was detected.
        if not results.pose_landmarks:
            continue
        nose = results.pose_landmarks.landmark[mp_pose.PoseLandmark.NOSE]
        print(
            f'Nose coordinates: ('
            f'{nose.x * image_width}, '
            f'{nose.y * image_height})'
        )
        # Draw pose landmarks on the image.
        annotated_image = image.copy()
        # Use mp_pose.UPPER_BODY_POSE_CONNECTIONS for drawing below when
        # upper_body_only is set to True (MediaPipe releases that support it).
        mp_drawing.draw_landmarks(
            annotated_image, results.pose_landmarks, mp_pose.POSE_CONNECTIONS)
        cv2.imwrite('/tmp/annotated_image' + str(idx) + '.png', annotated_image)

# For webcam input:
cap = cv2.VideoCapture(0)
with mp_pose.Pose(
        min_detection_confidence=0.5,
        min_tracking_confidence=0.5) as pose:
    while cap.isOpened():
        success, image = cap.read()
        if not success:
            print("Ignoring empty camera frame.")
            # If loading a video, use 'break' instead of 'continue'.
            continue

        # Flip the image horizontally for a later selfie-view display, and convert
        # the BGR image to RGB.
        image = cv2.cvtColor(cv2.flip(image, 1), cv2.COLOR_BGR2RGB)
        # To improve performance, optionally mark the image as not writeable to
        # pass by reference.
        image.flags.writeable = False
        results = pose.process(image)

        # Draw the pose annotation on the image.
        image.flags.writeable = True
        image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
        mp_drawing.draw_landmarks(
            image, results.pose_landmarks, mp_pose.POSE_CONNECTIONS)
        cv2.imshow('MediaPipe Pose', image)
        # Stop when the Esc key is pressed.
        if cv2.waitKey(5) & 0xFF == 27:
            break
cap.release()

Cropping using YOLO

You Only Look Once (YOLO) is a state-of-the-art, real-time object detection system. On a Pascal Titan X it processes images at 30 FPS and has a mAP of 57.9% on COCO test-dev. Because YOLO is both fast and accurate, it can be used to detect each person object in the video above.

For each person object detected in a video frame, we execute pose estimation.
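As a rough sketch of the cropping step, the helper below clips a detected box to the frame before slicing, since a detector can return boxes that extend past the image border. `clamp_box` is a hypothetical name, and the (x_min, y_min, box_width, box_height) box layout is an assumption about the detector output.

```python
def clamp_box(box, frame_width, frame_height):
    """Clip an (x_min, y_min, box_width, box_height) box to the frame.

    Hypothetical helper. Returns (x0, y0, x1, y1) slice bounds, or None if
    the box lies entirely outside the frame.
    """
    x_min, y_min, box_width, box_height = box
    x0 = max(0, int(x_min))
    y0 = max(0, int(y_min))
    x1 = min(frame_width, int(x_min + box_width))
    y1 = min(frame_height, int(y_min + box_height))
    if x1 <= x0 or y1 <= y0:
        return None
    return x0, y0, x1, y1
```

The bounds can then be used as `crop = frame[y0:y1, x0:x1]`, and the crop passed to `pose.process(cv2.cvtColor(crop, cv2.COLOR_BGR2RGB))` in the same way as a full image.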

Multi Person Pose Estimation

Yay! Are we done yet?

Not quite. Looking at subsequent frames, the pose estimation seems to break down on some frames before recovering later.

A deeper look into MediaPipe’s pose detection shows that it explicitly predicts two additional virtual keypoints that firmly describe the human body center, rotation and scale as a circle. Inspired by Leonardo’s Vitruvian man, it predicts the midpoint of a person’s hips, the radius of a circle circumscribing the whole person, and the incline angle of the line connecting the shoulder and hip midpoints.
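The geometry behind these quantities can be illustrated with a few lines of code. Note that the detector predicts them directly; the sketch below only approximates them from (x, y) pixel landmarks. The function name, the choice of measuring the incline from the vertical image axis, and the use of the farthest landmark as the radius are all assumptions made for illustration.

```python
import math


def alignment_circle(left_hip, right_hip, left_shoulder, right_shoulder, points):
    """Approximate hip midpoint, enclosing radius and incline angle.

    Illustrative only, not how BlazePose computes its virtual keypoints.
    All arguments are (x, y) pixel coordinates; points holds every landmark.
    """
    hip_mid = ((left_hip[0] + right_hip[0]) / 2,
               (left_hip[1] + right_hip[1]) / 2)
    shoulder_mid = ((left_shoulder[0] + right_shoulder[0]) / 2,
                    (left_shoulder[1] + right_shoulder[1]) / 2)
    # Radius of a circle centred on the hips that contains every landmark.
    radius = max(math.hypot(x - hip_mid[0], y - hip_mid[1]) for x, y in points)
    # Incline of the hip-to-shoulder line, in degrees from vertical.
    # Image y grows downward, so the vertical component is flipped.
    dx = shoulder_mid[0] - hip_mid[0]
    dy = hip_mid[1] - shoulder_mid[1]
    incline = math.degrees(math.atan2(dx, dy))
    return hip_mid, radius, incline
```

An upright person gives an incline near 0 degrees, while a person leaning to the right of the image gives a positive angle.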

Vitruvian man aligned via two virtual keypoints predicted by BlazePose detector in addition to the face bounding box. Source:

In a separate experiment, it was demonstrated that there is indeed some “initial calibration” at the start before subsequent pose estimations become accurate. This implies that each detected person needs his/her own instance of the pose estimator (see below).


If we are able to detect and somehow identify a person, it is possible to correct the above problem. Basically if the person “exists”, use his/her pose estimator. Otherwise, “create” a new person with a new instance of pose estimator.

So now the burden is on YOLO to consistently detect the same person for every frame. Can YOLO do it?

Every person object detected was labelled for this experiment. As seen above, there seems to be a “swapping” issue: the first object detected in one frame may not be the first object detected in the previous frame. So we need a way to “identify” a person.


Two global variables are created. pose_estimator keeps track of the pose estimators, and pose_estimator_dim holds the corresponding dimensional data (object boundary) last collected for each. Each person object detected by YOLO now has its bounding box checked (using a distance measure) against these global variables. We can assume that each pair of pose estimator and dimensions represents a person.

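# Global variables: one pose estimator per tracked person, and the bounding
# box last seen for that person (same index in both lists).
pose_estimator = []
pose_estimator_dim = []

# For each object detected in the current frame.
# box is the detected object's boundary; compareDist is the distance helper
# (not shown). The branch conditions and min_tracking_confidence were missing
# from this listing and are ASSUMPTIONS; num_detected is an assumed name for
# the number of person objects YOLO found in this frame.
box = [x_min, y_min, box_width, box_height]
selected_pose_idx = 0
if len(pose_estimator) == 0:
    # No person tracked yet: create the first pose estimator.
    pose = mp_pose.Pose(min_detection_confidence=0.6,
                        min_tracking_confidence=0.6)
    pose_estimator.append(pose)
    pose_estimator_dim.append(box)
    selected_pose_idx = len(pose_estimator) - 1
elif num_detected > len(pose_estimator):
    # More people in the frame than estimators: match an existing one or
    # create a new one.
    thresholdForNew = 100
    prev_high_score = 0
    selected_pose_idx_high = 0
    prev_low_score = 1000000000
    selected_pose_idx_low = 0
    for pose_idx, dim in enumerate(pose_estimator_dim):
        score = compareDist(dim, box)
        if score > prev_high_score:
            selected_pose_idx_high = pose_idx
            prev_high_score = score
        if score < prev_low_score:
            selected_pose_idx_low = pose_idx
            prev_low_score = score
    if prev_high_score > thresholdForNew:
        # The largest distance to a stored box exceeds the threshold:
        # treat the detection as a new person.
        pose = mp_pose.Pose(min_detection_confidence=0.6,
                            min_tracking_confidence=0.6)
        pose_estimator.append(pose)
        pose_estimator_dim.append(box)
        selected_pose_idx = len(pose_estimator) - 1
    else:
        selected_pose_idx = selected_pose_idx_low
    pose_estimator_dim[selected_pose_idx] = box
else:
    # Every person already has an estimator: reuse the nearest one.
    prev_score = 1000000000
    for pose_idx, dim in enumerate(pose_estimator_dim):
        score = compareDist(dim, box)
        if score < prev_score:
            selected_pose_idx = pose_idx
            prev_score = score
    pose_estimator_dim[selected_pose_idx] = box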
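The listing above calls compareDist without showing it. One possible definition, offered purely as an assumption and not as the helper actually used in this experiment, is the Euclidean distance between the centres of two boxes given as [x_min, y_min, box_width, box_height]:

```python
import math


def compareDist(dim_a, dim_b):
    """Distance in pixels between the centres of two bounding boxes.

    Assumed implementation for illustration. Each box is
    [x_min, y_min, box_width, box_height].
    """
    ax = dim_a[0] + dim_a[2] / 2
    ay = dim_a[1] + dim_a[3] / 2
    bx = dim_b[0] + dim_b[2] / 2
    by = dim_b[1] + dim_b[3] / 2
    return math.hypot(ax - bx, ay - by)
```

A centre-based distance tolerates changes in box size as a dancer stretches or crouches, but it can confuse two people who cross paths. An overlap measure such as intersection over union would be an alternative.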

Final Output


We have achieved multi-person pose estimation with MediaPipe in this implementation. Note that additional “features” such as face similarity could be added to help identify each person. I hope you find this useful for your work or studies. Stay tuned for the next post, which will demonstrate how to further use the landmarks obtained.