Enhanced 3D Human Pose Estimation from Videos by Using Attention-Based Neural Network with Dilated Convolutions

Research output: Contribution to journal > Article > peer-review

35 Scopus citations

Abstract

The attention mechanism provides a sequential prediction framework for learning spatial models with enhanced implicit temporal consistency. In this work, we show a systematic design (from 2D to 3D) for how conventional networks and other forms of constraints can be incorporated into the attention framework for learning long-range dependencies for the task of pose estimation. The contribution of this paper is to provide a systematic approach for designing and training attention-based models for end-to-end pose estimation, with the flexibility and scalability to take arbitrary video sequences as input. We achieve this by adapting the temporal receptive field via a multi-scale structure of dilated convolutions. In addition, the proposed architecture can be easily adapted to a causal model, enabling real-time performance. Any off-the-shelf 2D pose estimation systems, e.g. Our method achieves state-of-the-art performance and outperforms existing methods by reducing the mean per joint position error to 33.4 mm on the Human3.6M dataset. Our code is available at https://github.com/lrxjason/Attention3DHumanPose
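Two quantities in the abstract can be illustrated with a little arithmetic: the temporal receptive field of a stack of dilated convolutions, and the mean per joint position error (MPJPE). The following is a minimal sketch, not the authors' implementation. The function names and the layer configuration in the usage note are hypothetical, and MPJPE is assumed to follow its usual definition as the average Euclidean distance between predicted and ground-truth joints.

```python
import math
from typing import Sequence, Tuple

Joint = Sequence[float]  # (x, y, z) coordinates, e.g. in millimetres


def receptive_field(kernel_sizes: Sequence[int], dilations: Sequence[int]) -> int:
    """Number of input frames seen by one output frame of a stack of
    stride-1 dilated 1D convolutions (hypothetical helper)."""
    if len(kernel_sizes) != len(dilations):
        raise ValueError("need exactly one dilation per layer")
    # Each layer widens the field by (kernel_size - 1) * dilation frames.
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))


def context_split(field: int, causal: bool) -> Tuple[int, int]:
    """Return (past_frames, future_frames) around the target frame.

    A causal model looks only at past frames; a symmetric model
    centres the receptive field on the target frame.
    """
    if causal:
        return field - 1, 0
    future = (field - 1) // 2
    return field - 1 - future, future


def mpjpe(predicted: Sequence[Joint], target: Sequence[Joint]) -> float:
    """Mean per joint position error: average Euclidean distance
    between corresponding predicted and ground-truth joints."""
    if not predicted or len(predicted) != len(target):
        raise ValueError("joint lists must be non-empty and of equal length")
    distances = [
        math.sqrt(sum((p - t) ** 2 for p, t in zip(pj, tj)))
        for pj, tj in zip(predicted, target)
    ]
    return sum(distances) / len(distances)
```

As an illustration only, five layers with kernel size 3 and dilations 1, 3, 9, 27, 81 (a configuration assumed here, not taken from the paper) give `receptive_field([3] * 5, [1, 3, 9, 27, 81]) == 243` frames. A causal variant of that stack would use the 242 preceding frames and no future frames, which is what allows frame-by-frame processing of a live stream.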

Original language: English
Pages (from-to): 1596-1615
Number of pages: 20
Journal: International Journal of Computer Vision
Volume: 129
Issue number: 5
State: Published - May 2021

Bibliographical note

Publisher Copyright:
© 2021, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.

Funding

This work is partially supported by the National Endowment for the Humanities under Grant No. AKA-260488-18 and the National Science Foundation (NSF) under Grant No. 1910844.

Funders                                                      Funder number
National Science Foundation Arctic Social Science Program    1910844
National Endowment for the Humanities                        AKA-260488-18

    Keywords

    • 3D human pose
    • Attention
    • Monocular capture
    • Motion reconstruction
    • Multi-scale dilation
    • Performance-driven retargeting

    ASJC Scopus subject areas

    • Software
    • Computer Vision and Pattern Recognition
    • Artificial Intelligence
