DirectMHP: Direct 2D Multi-Person Head Pose Estimation with Full-range Angles
arXiv 2023

Demo results on test videos 000002_mpiinew_test and 000003_mpiinew_test.

Abstract


Existing head pose estimation (HPE) methods mainly focus on a single person with a pre-detected frontal head, which limits their applications in real complex scenarios with multiple persons. We argue that these single-person HPE methods are fragile and inefficient for Multi-Person Head Pose Estimation (MPHPE), since they rely on a separately trained face detector that cannot generalize well to full viewpoints, especially for heads with invisible face areas. In this paper, we focus on the full-range MPHPE problem and propose a simple, direct, end-to-end baseline named DirectMHP. Due to the lack of datasets applicable to full-range MPHPE, we first construct two benchmarks by extracting ground-truth labels for head detection and head orientation from the public datasets AGORA and CMU Panoptic. They are rather challenging, containing many truncated, occluded, tiny, and unevenly illuminated human heads. We then design a novel end-to-end trainable one-stage network architecture that jointly regresses the locations and orientations of multiple heads to address the MPHPE problem. Specifically, we regard pose as an auxiliary attribute of the head and append it to the traditional object predictions. This flexible design accepts arbitrary pose representations, such as Euler angles. We then jointly optimize the two tasks by sharing features and applying appropriate multi-task losses. In this way, our method can implicitly benefit from more surroundings to improve HPE accuracy while maintaining head detection performance. We present comprehensive comparisons with state-of-the-art single-person HPE methods on public benchmarks, as well as superior baseline results on our constructed MPHPE datasets.
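As a rough illustration of this design (a minimal sketch, not the released code), the snippet below extends a YOLOv5-style anchor-based detection head with three extra channels for the Euler angles and couples the two tasks through a weighted sum of losses. The class name `DetectWithPose`, the tanh normalization, and the loss weighting are all assumptions for illustration.

```python
import torch
import torch.nn as nn


class DetectWithPose(nn.Module):
    """YOLOv5-style anchor head extended with three Euler-angle channels (sketch)."""

    def __init__(self, num_classes=1, num_anchors=3, in_channels=256):
        super().__init__()
        # Per anchor: 4 box coords + 1 objectness + num_classes + 3 Euler angles.
        self.no = 4 + 1 + num_classes + 3
        self.conv = nn.Conv2d(in_channels, num_anchors * self.no, kernel_size=1)

    def forward(self, x):
        bs, _, h, w = x.shape
        out = self.conv(x).view(bs, -1, self.no, h, w)
        box, obj, cls, pose = out.split((4, 1, self.no - 8, 3), dim=2)
        # Squash raw pose logits to (-1, 1); scale to degrees downstream,
        # e.g. yaw * 180 for a full-range yaw in [-180, 180].
        return box, obj, cls, torch.tanh(pose)


def joint_loss(det_loss, pred_pose, gt_pose, pose_weight=0.1):
    """Detection loss plus a weighted MSE pose term on matched anchors (assumed weighting)."""
    return det_loss + pose_weight * nn.functional.mse_loss(pred_pose, gt_pose)
```

Because the pose channels share the same feature map as the detection outputs, the pose prediction can draw on context well beyond a cropped face region, which is the stated advantage of the end-to-end design.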

Datasets: Single-Person HPE vs. Multi-Person HPE



Example images from the two constructed challenging datasets, AGORA-HPE (top row) and CMU-HPE (middle row), and from two widely used single-person HPE datasets, 300W-LP & AFLW2000 (bottom left) and BIWI (bottom right).

More Illustrations of CMU-HPE and AGORA-HPE



Snapshots of the 31 views from the sequence 170307_dance5 at sampling time T = 24 seconds. Frames are cropped for ease of presentation.



Examples of challenging head samples from AGORA-HPE (first row) and CMU-HPE (second row). The third row shows the corresponding 3D head model indicators.



Pose label distributions of the three Euler angles in the AGORA-HPE, CMU-HPE, 300W-LP & AFLW2000, and BIWI datasets.
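Both benchmarks obtain these labels by converting head orientations from the source datasets into Euler angles. As a hedged sketch of such a conversion (the datasets' own tooling may differ), the function below decomposes a 3x3 head rotation matrix R = Rz(yaw) Ry(pitch) Rx(roll) into degrees; the axis order, sign conventions, and function name are illustrative assumptions.

```python
import numpy as np


def rotation_matrix_to_euler(R):
    """Decompose R = Rz(yaw) @ Ry(pitch) @ Rx(roll) into degrees (assumed convention)."""
    # sy = |cos(pitch)|; near zero means the decomposition degenerates.
    sy = np.hypot(R[0, 0], R[1, 0])
    if sy > 1e-6:
        yaw = np.degrees(np.arctan2(R[1, 0], R[0, 0]))    # full range (-180, 180]
        pitch = np.degrees(np.arctan2(-R[2, 0], sy))      # constrained to [-90, 90]
        roll = np.degrees(np.arctan2(R[2, 1], R[2, 2]))   # full range (-180, 180]
    else:
        # Gimbal lock (pitch = +/-90): yaw and roll are coupled, so by
        # convention fold the coupled angle into roll and set yaw to zero.
        yaw = 0.0
        pitch = np.degrees(np.arctan2(-R[2, 0], sy))
        roll = np.degrees(np.arctan2(-R[1, 2], R[1, 1]))
    return pitch, yaw, roll
```

Under this convention, yaw and roll remain full-range, which is what distinguishes heads seen from behind; e.g. `rotation_matrix_to_euler(np.eye(3))` returns `(0.0, 0.0, 0.0)`.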

More Visualization Results of DirectMHP


Visualizations on in-the-wild images from the COCO val set. As shown in the figures, even in challenging cases, especially head samples with invisible faces, our method generally avoids significant head orientation errors. This is largely supported by our end-to-end design, which can exploit the relation of each head to its whole human body in the original image.

Citation

Acknowledgements

We acknowledge the efforts of the authors of the YOLOv5 project and of both public datasets, AGORA and the CMU Panoptic Studio Dataset, which made our construction of AGORA-HPE and CMU-HPE possible. This work was supported by NSFC (No. 62176155, 62207014) and the Shanghai Municipal Science and Technology Major Project, China, under grant No. 2021SHZDZX0102.

The website template was borrowed from Jon Barron and Zip-NeRF.