Abstract
Despite recent advances, scene text recognition remains a challenging problem due to the significant variability, irregularity, and distortion in text appearance and localization. Attention-based methods have become the mainstream due to their superior vocabulary learning and observation ability. Nonetheless, they are susceptible to attention drift, which can lead to word recognition errors. Most works focus on correcting attention drift during decoding but completely ignore the errors accumulated during the encoding process. In this paper, we propose a novel scheme, called Attention Guidance by Cross-Domain Supervision Signals for Scene Text Recognition (ACDS-STR), which can mitigate attention drift at the feature encoding stage. At the heart of the proposed scheme is the cross-domain attention guidance and feature encoding fusion module (CAFM), which uses the core areas of characters to recursively guide attention learning during encoding. With precise attention information sourced from CAFM, we propose a non-attention-based adaptive transformation decoder (ATD) to guarantee decoding performance and improve decoding speed. In the training stage, we fuse manual guidance and subjective learning to learn the core areas of characters, which notably improves the recognition performance of the model. Experiments conducted on public benchmarks demonstrate state-of-the-art performance. The source code will be available at https://github.com/xuefanfu/ACDS-STR.
| Original language | English |
|---|---|
| Pages (from-to) | 717-728 |
| Number of pages | 12 |
| Journal | IEEE Transactions on Image Processing |
| Volume | 34 |
| DOIs | |
| State | Published - 2025 |
Bibliographical note
Publisher Copyright: © 1992-2012 IEEE.
Funding
This work was supported in part by the National Key R&D Program of China (Grant No. 2023YFE0208800) and the Joint Project for Smart Computing of Shandong Natural Science Foundation (Grant No. ZR2023LZH015). (Corresponding author: Jiande Sun)

Fanfu Xue, Jiande Sun, and Lei Zhu are with the School of Information Science and Engineering, Shandong Normal University, Jinan 250014, China (e-mail: [email protected], [email protected], [email protected]).

Yaqi Xue is with the College of Intelligence and Information Engineering, Shandong University of Traditional Chinese Medicine, Jinan 250355, China (e-mail: [email protected]).

Qiang Wu is with the School of Information Science and Engineering, Shandong University, Qingdao 266237, China (e-mail: [email protected]).

Xiaojun Chang is with the Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia (e-mail: [email protected]).

Sen-ching Cheung is with the Department of Electrical and Computer Engineering, University of Kentucky, Lexington, KY, USA (e-mail: [email protected]).

Fig. 1. Some examples of attention drift on text images: text images (left) and attention maps (right).
| Funders | Funder number |
|---|---|
| National Key Basic Research and Development Program of China | 2023YFE0208800 |
| Natural Science Foundation of Shandong Province | ZR2023LZH015 |
Keywords
- Scene text recognition
- attention drift
- attention guidance
- feature fusion
- vision transformer
ASJC Scopus subject areas
- Software
- Computer Graphics and Computer-Aided Design