Mapping Natural Language Intents to User Interfaces through Vision-Language Models

Halima Abukadah, Moghis Fereidouni, A. B. Siddique

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Efficiently navigating through mobile applications to accomplish specific tasks can be time-consuming and challenging, particularly for users who are unfamiliar with the app or faced with intricate menu structures. Simplifying access to a particular screen is a shared user priority, especially for individuals with diverse needs, including those with specific accessibility requirements. This underscores the need for innovative solutions that streamline the navigation process. This work addresses the challenge of mapping natural language intents to user interfaces, with a specific focus on mobile applications. The primary objective is to provide users with a more intuitive and efficient method for accessing desired screens in mobile applications by expressing their intentions in natural language. Existing approaches to this task have relied heavily on qualitative human studies for performance evaluation. Moreover, widely used pre-trained vision-language models, such as Contrastive Language-Image Pretraining (CLIP), struggle to generalize effectively to the unique visual characteristics of user interfaces. Acknowledging these limitations, we introduce a novel approach that harnesses the power of pre-trained vision-language models. Specifically, we investigate whether fine-tuning pre-trained vision-language models on mobile screens can address the challenges posed by the intricate nature of mobile application interfaces. Our approach utilizes state-of-the-art pre-trained text and image encoders and employs a supervised fine-tuning process in which the pre-trained models are adapted to the specific needs of mobile screen interactions. Moreover, a shared embedding space aligns the embeddings of the text and image modalities, fostering a cohesive understanding between natural language intents and the visual features of user interface elements. We conduct extensive experimental evaluations using the Screen2Word dataset. Through systematic analysis and established metrics, we examine the models' ability to accurately map diverse linguistic intents to specific user interfaces. Our analysis demonstrates that fine-tuning yields substantial improvements over the zero-shot performance of the pre-trained vision-language models.
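To make the approach described above concrete, the sketch below shows one way to fine-tune a CLIP-style model so that intent and screenshot embeddings share a common space, and then to rank candidate screens for a given intent. This is a minimal illustration, not the authors' implementation: the backbone name, the (intent text, screenshot path) data format, and all hyperparameters are assumptions; it relies on the Hugging Face `transformers` CLIP API and its built-in symmetric contrastive (InfoNCE) loss over the in-batch text-image similarity matrix.

```python
# Minimal sketch of contrastive fine-tuning of a pre-trained vision-language model
# on (intent, UI screenshot) pairs. Backbone, data format, and hyperparameters are
# assumptions for illustration only.
import torch
from torch.utils.data import DataLoader
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model_name = "openai/clip-vit-base-patch32"   # assumed backbone; the paper's choice may differ
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

def collate(batch):
    # batch: list of (intent_text, screenshot_path) tuples -- hypothetical data format
    texts = [t for t, _ in batch]
    images = [Image.open(p).convert("RGB") for _, p in batch]
    return processor(text=texts, images=images, return_tensors="pt",
                     padding=True, truncation=True)

def finetune(pairs, epochs=3, lr=1e-5, batch_size=32, device="cuda"):
    """Supervised fine-tuning that aligns intent and screen embeddings in a shared space."""
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(pairs, batch_size=batch_size, shuffle=True, collate_fn=collate)
    for _ in range(epochs):
        for inputs in loader:
            inputs = {k: v.to(device) for k, v in inputs.items()}
            # return_loss=True applies CLIP's symmetric contrastive loss over the
            # in-batch text-image similarity matrix.
            outputs = model(**inputs, return_loss=True)
            outputs.loss.backward()
            optimizer.step()
            optimizer.zero_grad()

@torch.no_grad()
def rank_screens(intent, screenshot_paths, device="cuda"):
    """Rank candidate UI screens by their similarity to a natural language intent."""
    model.to(device).eval()
    images = [Image.open(p).convert("RGB") for p in screenshot_paths]
    inputs = processor(text=[intent], images=images, return_tensors="pt",
                       padding=True, truncation=True).to(device)
    logits = model(**inputs).logits_per_text  # shape (1, num_screens): similarity scores
    return logits.squeeze(0).argsort(descending=True).tolist()
```

In this framing, retrieval of the desired screen reduces to nearest-neighbor search in the shared embedding space, which is why in-batch contrastive fine-tuning on mobile screenshots can improve over the zero-shot CLIP baseline reported in the abstract.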

Original language: English
Title of host publication: Proceedings - 18th IEEE International Conference on Semantic Computing, ICSC 2024
Pages: 237-244
Number of pages: 8
ISBN (Electronic): 9798350385359
DOIs
State: Published - 2024
Event: 18th IEEE International Conference on Semantic Computing, ICSC 2024 - Hybrid, Laguna Hills, United States
Duration: Feb 5 2024 - Feb 7 2024

Publication series

Name: Proceedings - IEEE International Conference on Semantic Computing, ICSC
ISSN (Print): 2325-6516
ISSN (Electronic): 2472-9671

Conference

Conference: 18th IEEE International Conference on Semantic Computing, ICSC 2024
Country/Territory: United States
City: Hybrid, Laguna Hills
Period: 2/5/24 - 2/7/24

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

Keywords

  • Natural language processing
  • pre-trained vision-language models
  • user interface navigation

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Networks and Communications
  • Human-Computer Interaction
  • Information Systems and Management
