Abstract
The attention mechanism provides a sequential prediction framework for learning spatial models with enhanced implicit temporal consistency. In this work, we show a systematic design (from 2D to 3D) for how conventional networks and other forms of constraints can be incorporated into the attention framework to learn long-range dependencies for the task of pose estimation. The contribution of this paper is a systematic approach to designing and training attention-based models for end-to-end pose estimation, with the flexibility and scalability to take arbitrary video sequences as input. We achieve this by adapting the temporal receptive field via a multi-scale structure of dilated convolutions. In addition, the proposed architecture can be easily adapted to a causal model, enabling real-time performance. Any off-the-shelf 2D pose estimation systems, e.g. Our method achieves state-of-the-art performance and outperforms existing methods by reducing the mean per joint position error to 33.4 mm on the Human3.6M dataset. Our code is available at https://github.com/lrxjason/Attention3DHumanPose
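Three quantities named in the abstract can be made concrete with a short sketch: the temporal receptive field of a stack of dilated convolutions, the left-only padding a causal variant would need, and the mean per joint position error. This is an illustration under stated assumptions, not the authors' implementation. The function names, the kernel size of 3, and the dilation rates are hypothetical choices made for the example, and the record above does not state the paper's actual configuration.

```python
import math


def receptive_field(kernel_size, dilations):
    """Input frames seen by one output of a stack of stride-1 dilated 1D convolutions."""
    # Each layer widens the temporal window by (kernel_size - 1) * dilation frames.
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_left_padding(kernel_size, dilation):
    """Left-only padding so a layer's output at frame t depends only on frames <= t."""
    return (kernel_size - 1) * dilation


def mpjpe(predicted, target):
    """Mean Euclidean distance between matching 3D joints, in the unit of the inputs."""
    if not predicted or len(predicted) != len(target):
        raise ValueError("predicted and target must be non-empty and of equal length")
    return sum(math.dist(p, t) for p, t in zip(predicted, target)) / len(predicted)


# Assumed, illustrative configuration: kernel size 3 with dilations 1, 3, 9, 27.
frames_seen = receptive_field(3, [1, 3, 9, 27])

# Toy two-joint pose; coordinates are made up for the example.
error = mpjpe([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
              [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)])
```

With these assumed settings, `frames_seen` is 81 and `error` is 3.5. The receptive field grows with the sum of the dilation rates, so growing the rates geometrically covers a long window with few layers.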
| Original language | English |
|---|---|
| Pages (from-to) | 1596-1615 |
| Number of pages | 20 |
| Journal | International Journal of Computer Vision |
| Volume | 129 |
| Issue number | 5 |
| DOI | |
| Publication status | Published - May 2021 |
Bibliographical note
Publisher Copyright: © 2021, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
Funding
This work is partially supported by the National Endowment for the Humanities under Grant No. AKA-260488-18 and the National Science Foundation (NSF) under Grant No. 1910844.
| Funders | Funder number |
|---|---|
| National Science Foundation Arctic Social Science Program | 1910844 |
| National Endowment for the Humanities | AKA-260488-18 |
ASJC Scopus subject areas
- Software
- Computer Vision and Pattern Recognition
- Artificial Intelligence
Fingerprint
Dive into the research topics of 'Enhanced 3D Human Pose Estimation from Videos by Using Attention-Based Neural Network with Dilated Convolutions'. Together they form a unique fingerprint.