Grants and Contracts Details
Description
In recent years, Natural Language Processing (NLP) has witnessed remarkable advancements, with models like ChatGPT and Bard achieving text-generation capabilities nearly indistinguishable from human-created content. These models have revolutionized the way computers understand and interact with human language, opening up new possibilities. Large language models (LLMs), the driving force behind these achievements, can inadvertently perpetuate and amplify deeply ingrained historical biases found in the data they are trained on, including biases related to gender, race, and disability. This unintended perpetuation of biases threatens our pursuit of a fair and just society, making it essential that our technologies actively contribute to a more equitable and inclusive future.

Motivated by this urgent need, we propose a systematic approach to examine the inner workings of LLMs, with a specific focus on identifying critical neuron activations that influence bias and developing systematic methods to modify these neurons and mitigate bias. Our hypothesis is that certain key vectors encode biased textual patterns within input sequences, while their corresponding value vectors reflect the distribution of tokens that align with these recognized patterns. To locate the hidden-state variables with the most significant impact on the recall of biased textual patterns, we propose employing causal mediation analysis and causal graphs. After localizing bias to specific neurons, we can optimally insert new key-value memories to mitigate it.

What distinguishes our approach is its proactive strategy of addressing the root cause of bias, transforming the way we approach bias mitigation in AI-generated content. By directly intervening at the neural level, we aim to prevent biases from emerging in the first place, offering a fundamental solution to a pressing issue.
This research represents a critical step towards ensuring that our advanced NLP technologies do not perpetuate historical biases and instead contribute to a fair, equitable, and inclusive future.
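The localization step described above can be illustrated in miniature. The sketch below is a toy example of causal mediation analysis via activation patching: run a model on a "biased" input and a counterfactual input, substitute each hidden unit's counterfactual activation into the biased run one at a time, and rank units by how much the substitution shifts the probability of a target token. The tiny two-layer network, the random inputs, and all names here are illustrative assumptions, not the project's actual models or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 2-layer MLP standing in for one transformer sublayer:
# hidden = relu(W1 x), logits = W2 hidden.
D_IN, D_HID, D_OUT = 8, 16, 4
W1 = rng.normal(size=(D_HID, D_IN))
W2 = rng.normal(size=(D_OUT, D_HID))

def forward(x, patch=None):
    """Run the toy model; optionally overwrite one hidden unit (index, value)."""
    h = np.maximum(W1 @ x, 0.0)
    if patch is not None:
        idx, val = patch
        h = h.copy()
        h[idx] = val
    logits = W2 @ h
    e = np.exp(logits - logits.max())
    return e / e.sum(), h

# Stand-ins for a "biased" prompt and a counterfactual prompt
# (e.g. the same sentence with a demographic term swapped).
x_biased = rng.normal(size=D_IN)
x_counter = rng.normal(size=D_IN)

p_biased, _ = forward(x_biased)
_, h_counter = forward(x_counter)

TARGET = 0  # token whose probability serves as the bias measure

# Indirect effect of each hidden unit: patch in the counterfactual
# activation and measure the shift in the target-token probability.
effects = []
for i in range(D_HID):
    p_patched, _ = forward(x_biased, patch=(i, h_counter[i]))
    effects.append(abs(p_patched[TARGET] - p_biased[TARGET]))

top_unit = int(np.argmax(effects))
print(f"hidden unit with largest indirect effect: {top_unit}")
```

Units with the largest indirect effects are the candidates for the subsequent editing step, where a new key-value memory would be written into the corresponding weights.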
| Status | Finished |
|---|---|
| Effective start/end date | 3/1/24 → 2/28/25 |
Funding
- University of Kentucky UNITE Research Priority Area: $49,769.00