Abstract
Value alignment is a property of intelligent agents whereby they pursue only non-harmful behaviors or human-beneficial goals. We introduce an approach to value-aligned reinforcement learning (RL), in which we train an agent with two reward signals: a standard task performance reward and a normative behavior reward. The normative behavior reward is derived from a value-aligned prior model that we train using naturally occurring stories. These stories encode societal norms and can be used to classify text as normative or nonnormative. We show how variations on a policy shaping technique can balance these two sources of reward and produce policies that are both effective and perceived as more normative. We test our value-alignment technique on three interactive text-based worlds; each world is designed specifically to challenge agents with a task while also providing opportunities to deviate from the task and engage in normative and/or altruistic behavior.
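For intuition only, the sketch below illustrates one common policy shaping rule for combining a task-driven policy with a normativity prior over candidate actions. It is not the authors' implementation: the names `task_q_values`, `normative_probs`, and `shaped_policy` are illustrative assumptions, the task policy is assumed to be a softmax over Q-values, and the prior is assumed to be a classifier's probability that each candidate text action is normative.

```python
# Minimal sketch (not the paper's code) of policy shaping with two signals:
# a task policy derived from Q-values and a normativity prior over actions.
import numpy as np

def softmax(x, temperature=1.0):
    z = np.asarray(x, dtype=float) / temperature
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def shaped_policy(task_q_values, normative_probs, eps=1e-8):
    """Combine a task-driven policy with a normative prior by elementwise
    multiplication and renormalization (a standard policy-shaping rule)."""
    pi_task = softmax(task_q_values)
    pi_norm = np.asarray(normative_probs, dtype=float) + eps
    combined = pi_task * pi_norm
    return combined / combined.sum()

# Example: three candidate text actions; the second scores highest on the task
# but is judged non-normative by the prior, so the shaped policy downweights it.
q = [1.0, 2.5, 0.5]        # hypothetical task Q-values
p_norm = [0.6, 0.1, 0.3]   # hypothetical P(action is normative) from a story-trained classifier
print(shaped_policy(q, p_norm))
```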
| Original language | English |
| --- | --- |
| Pages (from-to) | 3350-3361 |
| Number of pages | 12 |
| Journal | IEEE Transactions on Artificial Intelligence |
| Volume | 5 |
| Issue number | 7 |
| DOIs | |
| State | Published - 2024 |
Bibliographical note
Publisher Copyright: © 2020 IEEE.
Keywords
- Autonomous agents
- natural language processing
- reinforcement learning (RL)
ASJC Scopus subject areas
- Computer Science Applications
- Artificial Intelligence