Attention Guidance by Cross-Domain Supervision Signals for Scene Text Recognition

Fanfu Xue, Jiande Sun, Yaqi Xue, Qiang Wu, Lei Zhu, Xiaojun Chang, Sen Ching Cheung

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Despite recent advances, scene text recognition remains a challenging problem due to the significant variability, irregularity, and distortion in text appearance and localization. Attention-based methods have become the mainstream owing to their superior vocabulary learning and observation abilities. Nonetheless, they are susceptible to attention drift, which can lead to word recognition errors. Most works focus on correcting attention drift during decoding but completely ignore the error accumulated during the encoding process. In this paper, we propose a novel scheme, called Attention Guidance by Cross-Domain Supervision Signals for Scene Text Recognition (ACDS-STR), which mitigates attention drift at the feature encoding stage. At the heart of the proposed scheme is the cross-domain attention guidance and feature encoding fusion module (CAFM), which uses the core areas of characters to recursively guide attention learning during encoding. With precise attention information sourced from CAFM, we propose a non-attention-based adaptive transformation decoder (ATD) that guarantees decoding performance and improves decoding speed. In the training stage, we fuse manual guidance and subjective learning to learn the core areas of characters, which notably augments the recognition performance of the model. Experiments on public benchmarks show state-of-the-art performance. The source code will be available at https://github.com/xuefanfu/ACDS-STR.

Original language: English
Pages (from-to): 717-728
Number of pages: 12
Journal: IEEE Transactions on Image Processing
Volume: 34
DOIs
State: Published - 2025

Bibliographical note

Publisher Copyright:
© 1992-2012 IEEE.

Funding

This work was supported in part by the National Key R&D Program of China (Grant No. 2023YFE0208800) and in part by the Joint Project for Smart Computing of Shandong Natural Science Foundation (Grant No. ZR2023LZH015). (Corresponding author: Jiande Sun.)

Fanfu Xue, Jiande Sun, and Lei Zhu are with the School of Information Science and Engineering, Shandong Normal University, Jinan 250014, China (e-mail: [email protected]; [email protected]; [email protected]).
Yaqi Xue is with the College of Intelligence and Information Engineering, Shandong University of Traditional Chinese Medicine, Jinan 250355, China (e-mail: [email protected]).
Qiang Wu is with the School of Information Science and Engineering, Shandong University, Qingdao 266237, China (e-mail: [email protected]).
Xiaojun Chang is with the Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia (e-mail: [email protected]).
Sen-ching Cheung is with the Department of Electrical and Computer Engineering, University of Kentucky, Lexington, KY, USA (e-mail: [email protected]).

Fig. 1. Some examples of attention drift on text images: text images (left) and attention maps (right).

Funders and funder numbers:

• National Key Basic Research and Development Program of China (2023YFE0208800)
• Natural Science Foundation of Shandong Province (ZR2023LZH015)

Keywords

• Scene text recognition
• attention drift
• attention guidance
• feature fusion
• vision transformer

ASJC Scopus subject areas

• Software
• Computer Graphics and Computer-Aided Design
