A potentially high-impact differential technological development area

Specifically, the area is boosting LLMs and non-RL AI relative to RL AI.

I am making this post to signal-boost an observation by porby, which could be lossily compressed as: “Instrumental goals/instrumental convergence is at best a debatable assumption for LLMs and non-RL AI, and may not apply to them at all.”

Expressed in this post and in a comment below:

https://www.lesswrong.com/posts/EBKJq2gkhvdMg5nTQ/instrumentality-makes-agents-agenty

https://www.lesswrong.com/posts/rmfjo4Wmtgq8qa2B7/?commentId=GKhn2ktBuxjNhmaWB

I want to point this out because LW/EA analyses of AI risk that conclude it is a big problem (especially the more pessimistic ones) generally require instrumental convergence/instrumental goals for any AI risk that isn’t an AI ethics or AI misuse concern.

So instrumental convergence being false, or at least having a much higher probability of being false for LLMs/non-RL AI, is a very big deal, because concern over AI risk that isn’t a standard misuse/ethics story pretty much relies on it being right.

The basic reason, again lossily compressed, for anyone who wants it: unlike in RL, the path from input to output involves few intermediate steps, so there is far less incentive for instrumental goals/instrumental convergence. Instrumental goals are much less useful when there are only a few steps between input and output.
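A rough way to see the intuition (my own sketch of the standard objectives, not porby’s formalization): next-token prediction scores each output immediately, in a single step, whereas an RL return is summed over a long trajectory of actions, which is exactly the setting where intermediate, instrumental steps can pay off.

```latex
% Next-token prediction: each token is scored in one step, with no planning horizon
% built into the objective.
L_{\text{LM}}(\theta) = -\,\mathbb{E}_{x}\left[\sum_{t} \log p_\theta(x_t \mid x_{<t})\right]

% RL: reward accumulates over a whole trajectory, so early actions that acquire
% resources or preserve options (instrumental subgoals) can raise later reward.
J_{\text{RL}}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\right]
```

The point of the contrast is only that the second objective rewards multi-step strategies in a way the first does not; it is an illustration of the argument, not a proof.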

Given this new reality, I have identified a pretty high-impact area for making AI progress safer:

Help OpenAI/Microsoft and Anthropic win the race to AGI relative to DeepMind, or at least have non-RL AI win the race to AGI relative to RL AI.

The reason I’m suggesting this as a way to make AI safer is that non-RL AI, especially SOTA LLMs, has far less incentive to form instrumental goals than RL AI. And if you can avoid instrumental convergence, you have a much easier alignment problem: you don’t have to solve the adversarial cases of AI alignment, only the non-adversarial ones, and in general, solving a problem in the non-adversarial case is radically easier than solving it in the adversarial case.

This also makes OpenAI’s alignment plan work a lot better, particularly the part where they want to build an AI that solves the AI alignment problem. The plan has been criticized before, in a comment and in posts linked below, such as Akash’s post on OpenAI’s alignment plan. But in light of the probable incorrectness of instrumental convergence, the plan looks a lot better than before: it’s much easier to align an LLM alignment/safety researcher than standard RL AI, because there is little incentive for power-seeking where we don’t want it, so almost every AI risk story that isn’t misuse or ethics falls apart.

In particular, since the base AI is easy to align, it’s much easier to recursively align successors of LLMs.

The OpenAI alignment plan and criticism of it are linked below:

https://openai.com/blog/our-approach-to-alignment-research

https://www.lesswrong.com/posts/tD9zEiHfkvakpnNam/?commentId=MWQpJrTsYzFoQiPoE

https://www.lesswrong.com/posts/FBG7AghvvP7fPYzkx/my-thoughts-on-openai-s-alignment-plan-1

Implications of this new finding

I’d say the most important implication of the probable non-existence of instrumental convergence in LLMs is that a lot of the standard LW framing of the AI alignment problem is basically wrong: we aren’t dealing with an adversarial problem, but with a problem that normal science can probably solve given more resources.

Equivalently, Eliezer Yudkowsky’s and others’ emphasis on the security mindset is fundamentally oversold and the wrong frame for the problem.

It also implies that a fairly radical downward shift in p(doom) is warranted for a lot of people, or equivalently that we should be much more optimistic about the AI alignment problem being solved.

The people who are skeptical of AI doom stories or optimistic about AI progress are basically right, or at least a lot more right than the pessimists, even if for the wrong reasons.

AI governance should probably be careful about restricting AI, given that at least one technological path to AGI is far safer than the others.

OpenAI was fundamentally right in its claim to be a safe AGI company compared to every other company. In particular, people working on non-RL AI are making safer AI, at least compared to other paths.

And finally, while the AI alignment problem still matters, I’d say it probably matters a lot less than it used to. In particular, I suspect this update is big enough to downgrade the problem to the point that it matters far less than other problems, so LWers should be less fixated on AI existential risk compared to other problems.