Training Value-Aligned Reinforcement Learning Agents Using a Normative Prior

Md Sultan al Nahian, Spencer Frazier, Mark Riedl, Brent Harrison

Research output: Contribution to journal › Article › peer-review

Abstract

Value alignment is a property of intelligent agents wherein they solely pursue non-harmful behaviors or human-beneficial goals. We introduce an approach to value-aligned reinforcement learning, in which we train an agent with two reward signals: a standard task performance reward plus a normative behavior reward. The normative behavior reward is derived from a value-aligned prior model that we train using naturally occurring stories. These stories encode societal norms and can be used to classify text as normative or non-normative. We show how variations on a policy shaping technique can balance these two sources of reward and produce policies that are both effective and perceived as more normative. We test our value-alignment technique on three interactive text-based worlds; each world is designed specifically to challenge agents with a task as well as provide opportunities to deviate from the task to engage in normative and/or altruistic behavior.
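The sketch below illustrates one common way to realize the policy-shaping idea described in the abstract: a task policy derived from Q-values is multiplied by a normative prior's per-action probabilities and renormalized. The function and parameter names (e.g. `shaped_action_distribution`, `beta`) are illustrative, and the multiplicative combination is an assumption about the shaping variant rather than the paper's exact formulation.

```python
import numpy as np

def shaped_action_distribution(q_values, norm_probs, temperature=1.0, beta=1.0):
    """Combine a task policy with a normative prior via policy shaping.

    q_values   : task Q-values, one per candidate action.
    norm_probs : probabilities from a normative-prior classifier that each
                 candidate action (e.g. its text description) is normative.
    beta       : weight on the normative signal; beta = 0 ignores the prior.
    """
    # Task policy: softmax over Q-values.
    task_policy = np.exp(np.asarray(q_values) / temperature)
    task_policy /= task_policy.sum()

    # Normative prior, sharpened or flattened by beta, then normalized.
    prior = np.asarray(norm_probs) ** beta
    prior /= prior.sum()

    # Policy shaping: multiply the two distributions and renormalize.
    combined = task_policy * prior
    return combined / combined.sum()

# Example: three candidate actions in a text-based world.
q = np.array([2.0, 1.5, 0.5])        # task reward favours action 0
p_norm = np.array([0.1, 0.8, 0.9])   # but action 0 is judged non-normative
action_probs = shaped_action_distribution(q, p_norm)
action = np.random.choice(len(q), p=action_probs)
```

Raising or lowering `beta` trades off task performance against normative behavior, which is the balance the paper's experiments evaluate.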

Original language: English
Pages (from-to): 1-11
Number of pages: 11
Journal: IEEE Transactions on Artificial Intelligence
DOIs
State: Accepted/In press - 2024

Bibliographical note

Publisher Copyright: IEEE

Keywords

  • Artificial intelligence
  • Autonomous Agents
  • Behavioral sciences
  • Natural language processing
  • Natural languages
  • Reinforcement learning
  • Task analysis
  • Training
  • Transformers

ASJC Scopus subject areas

  • Computer Science Applications
  • Artificial Intelligence
