My working model for unintended capabilities improvement focuses more on second- and third-order effects: people see the promise of more capabilities and invest more money (e.g. founding startups, launching AI-based products), which increases the need for competitive advantage, which pushes people to search for yet more capabilities (loosely defined). There is also the direct improvement to the inner research loop, but this is less immediate for most work.
Under my model, basically any technical work that someone could look at and think “Yes, this will help me (or my proxies) build more powerful AI in the next 10 years” is a risk, and it comes down to risk mitigation (because being paralyzed by fear of making the wrong move is a losing strategy):
One way to be safe is to take a very comms- and policy-flavored approach, helping highlight already-known problems with AI, like Adversarial Policies Beat Superhuman Go AIs and Discovering Language Model Behaviors with Model-Written Evaluations.
Another way is if the work just doesn't apply to existing models, e.g. most MIRI work.