Description
LLMs can be enhanced to become "agents" that autonomously plan and act in the real world (e.g., write and execute code, browse the web). This increased autonomy, together with the ability to learn throughout their lifetimes, brings new safety challenges. For example, goals given in natural language can be underspecified, leading to unintended negative side effects. Goal-directedness might also incentivize undesirable behaviors such as deception or power-seeking, and it makes robust oversight very difficult.