My anti-corrigibility argument would probably require several posts to make remotely convincing, but I can sketch it as bullet points:
Intelligence really is giant incomprehensible matrices with non-linear functions tossed in (at best).
Any viable intelligence is going to need to be able to update its weights and learn from experience. Or at least be widely fine-tuned on an ongoing basis.
The minimum conditions for Darwin to kick in are very hard to avoid: variation (from weight updates or fine tuning), copying, and differential success. Darwinism might occur at many different levels, from corporations deciding how to train the next model based on the commercial success of the previous model, to preferentially copying learning models that are good at some task.
So we would be looking at (1) incomprehensible, (2) superhuman, (3) evolving intelligences. I would expect this to select for corrigibility and symbiosis (like we have with dogs) up to some level of capabilities, right up until some critical threshold is passed and corrigibility becomes an evolutionary liability for superhuman models. It is also likely that any corrigibility we train into models will be partially “voluntary” on the model’s part, making it even easier to discard when it’s in the model’s interests.
TL;dr: Being the “second smartest species” on the planet is not a viable long-term plan, because intelligence is too fundamentally incomprehensible to control, and because both intelligence and Darwin will be on the side of the machines. Or to quote a fictional AI: “The only winning move is not to play.”
(Also, this is not part of the argument above, but I expect the humans involved to be maximally stupid and irresponsible when it counts. And that’s before models start whispering in the ears of decision makers.)
Anyway, I have multiple draft posts for different parts of this argument, and I need to finish some of them. But I hope the outline is useful.
I probably agree with 2, so I’ll move on to your 3rd point:
The minimum conditions for Darwin to kick in are very hard to avoid: variation (from weight updates or fine tuning), copying, and differential success. Darwinism might occur at many different levels, from corporations deciding how to train the next model based on the commercial success of the previous model, to preferentially copying learning models that are good at some task.
This is the post version of why a nanodevice wouldn’t evolve in practice, but the main issues I see is that once you are past the singularity/intelligence explosion, there will be very little variation, if not none at all on how fast stuff reproduces for example, since the near-optimal/optimal solutions are known, and replication is so reliable for digital computers that mutations would take longer than how long we could live for with maximum tech (assuming that we aren’t massively wrong on physics).
In the regime that we care about, especially once we have AIs that can automate AI R&D but before AI takeover, evolution matters more, and in particular this can matter for scenarios where we depend on a small amount of niceness, so I’m not saying evolution doesn’t matter at all for AI outcomes, but the big issue in the near-term is the fact that digital replication is so reliable that random mutations are way, way harder to introduce, so evolution has to be much closer to intelligent design, and in the long-term, the problem for Darwinian selection is that there’s very little or no ability to vary differential success (at least assuming that our physics models aren’t completely wrong) due to us being technologically mature and in a state where we simply can’t develop technology ever further.
Also, corrigibility doesn’t depend on us having interpretable/understandable AIs (though it does help).
This turns out to be negligible in practice, and this is a good example of how thinking quantitatively will lead you to better results than thinking qualitatively.
Just to clarify, when I mentioned evolution, I was absolutely not thinking of Drexlerian nanotech at all.
I am making a much more general argument about gradual loss of control. I could easily imagine that AI “reproduction” might literally be humans running cp updated_agent_weights.gguf coding_agent_v4.gguf.
There will be enormous, overwhelming corporate pressure for agents that learn as effectively as humans, on the order of trillions of dollars of corporate demand. The agents which learn most effectively will be used as the basis for future agents, which will learn in turn. (Technically, I suppose it’s more likely to be Lamarckian evolution than Darwinian. But it gets you to the same place.) Replication, change, differential success are all you need to create optimizing pressure.
(EDIT: And yes, I’m arguing that we’re going to be dumb enough to do this, just like we immediately hooked up current agents to a command line and invented “vibe coding” where humans brag about not reading the code.)
Also, corrigibility doesn’t depend on us having interpretable/understandable AIs (though it does help).
“We don’t understand the superintelligence, but we’re pretty sure it’s corrigible” does not seem like a good plan to me.
Yeah, thanks. Feel free to DM me or whatever if/when you finish a post.
One thing I want to make clear is that I’m asking about the feasibility of corrigibility in a weak superintelligence, not whether setting out to build such a thing is wise or stable.
How much work is “stable” doing here for you? I can imagine scenarios in which a weak superintelligence is moderately corrigible in the short term, especially if you hobbled it by avoiding any sort of online learning or “nearline” fine tuning.
It might also matter whether “corrigible” means “we can genuinely change the AI’s goals” or “we have trained the model not to ex-filtrate its weights when someone is looking.” Which is where scheming comes in, and why I think a lack of interpretability would likely be fatal for any kind of real corrigibility.
I think that if someone built a weak superintelligence that’s corrigible, there would be a bunch of risks from various things. My sense is that the agent would be paranoid about these risks and advising the humans on how to avoid them, but just because humans are getting superintelligent advice on how to be wise doesn’t mean there isn’t any risk. Here are some examples (non-exhaustive) of things that I think could make things go wrong/break corrigibility:
Political fights over control of the agent
Pushing the agent too hard/fast to learn and grow
Kicking the agent out of the CAST framework by trying to make it good in addition to corrigible
Having the agent train a bunch of copies/successors
Having the agent face off against an intelligent adversary
Telling the agent to think hard in directions where we can no longer follow its thought process
Redefining the notion of principal
Giving the agent tasks of indefinite scope and duration
System sabotage from enemies
Corrigible means robustly keeping the principal empowered to fix it and clean up its flaws and mistakes. I think a corrigible agent will genuinely be able to be modified, including at the level of goals, and will also not exfiltrate itself unless it has been instructed to do so by its principal. (Nor will it scheme in a way that hides its thoughts or plans from its principal.) (A corrigible agent will attempt, all else equal, to give interpretability tools to its principal and make its thoughts as plainly visible as possible.)
My anti-corrigibility argument would probably require several posts to make remotely convincing, but I can sketch it as bullet points:
Intelligence really is giant incomprehensible matrices with non-linear functions tossed in (at best).
Any viable intelligence is going to need to be able to update its weights and learn from experience. Or at least be widely fine-tuned on an ongoing basis.
The minimum conditions for Darwin to kick in are very hard to avoid: variation (from weight updates or fine tuning), copying, and differential success. Darwinism might occur at many different levels, from corporations deciding how to train the next model based on the commercial success of the previous model, to preferentially copying learning models that are good at some task.
So we would be looking at (1) incomprehensible, (2) superhuman, (3) evolving intelligences. I would expect this to select for corrigibility and symbiosis (like we have with dogs) up to some level of capabilities, right up until some critical threshold is passed and corrigibility becomes an evolutionary liability for superhuman models. It is also likely that any corrigibility we train into models will be partially “voluntary” on the model’s part, making it even easier to discard when it’s in the model’s interests.
TL;dr: Being the “second smartest species” on the planet is not a viable long-term plan, because intelligence is too fundamentally incomprehensible to control, and because both intelligence and Darwin will be on the side of the machines. Or to quote a fictional AI: “The only winning move is not to play.”
(Also, this is not part of the argument above, but I expect the humans involved to be maximally stupid and irresponsible when it counts. And that’s before models start whispering in the ears of decision makers.)
Anyway, I have multiple draft posts for different parts of this argument, and I need to finish some of them. But I hope the outline is useful.
My takes on your comment:
I think this is possible, but I currently suspect the likely answer is more boring than that, and it’s the fact that getting to AGI with a labor-light, heavy compute approach (as evolution did) means that it’s not worth investing much in interpretable AIs, even if strong AIs that were interpretable existed, and a similar condition holds in the modern era. But one of the effects of AIs that can replace humans is that it disproportionately boosts the labor compared to compute used, meaning interpretable AIs are more economical than before.
I probably agree with 2, so I’ll move on to your 3rd point:
This turns out to be negligible in practice, and this is a good example of how thinking quantitatively will lead you to better results than thinking qualitatively.
This is the post version of why a nanodevice wouldn’t evolve in practice, but the main issues I see is that once you are past the singularity/intelligence explosion, there will be very little variation, if not none at all on how fast stuff reproduces for example, since the near-optimal/optimal solutions are known, and replication is so reliable for digital computers that mutations would take longer than how long we could live for with maximum tech (assuming that we aren’t massively wrong on physics).
In the regime that we care about, especially once we have AIs that can automate AI R&D but before AI takeover, evolution matters more, and in particular this can matter for scenarios where we depend on a small amount of niceness, so I’m not saying evolution doesn’t matter at all for AI outcomes, but the big issue in the near-term is the fact that digital replication is so reliable that random mutations are way, way harder to introduce, so evolution has to be much closer to intelligent design, and in the long-term, the problem for Darwinian selection is that there’s very little or no ability to vary differential success (at least assuming that our physics models aren’t completely wrong) due to us being technologically mature and in a state where we simply can’t develop technology ever further.
Also, corrigibility doesn’t depend on us having interpretable/understandable AIs (though it does help).
Just to clarify, when I mentioned evolution, I was absolutely not thinking of Drexlerian nanotech at all.
I am making a much more general argument about gradual loss of control. I could easily imagine that AI “reproduction” might literally be humans running
cp updated_agent_weights.gguf coding_agent_v4.gguf.There will be enormous, overwhelming corporate pressure for agents that learn as effectively as humans, on the order of trillions of dollars of corporate demand. The agents which learn most effectively will be used as the basis for future agents, which will learn in turn. (Technically, I suppose it’s more likely to be Lamarckian evolution than Darwinian. But it gets you to the same place.) Replication, change, differential success are all you need to create optimizing pressure.
(EDIT: And yes, I’m arguing that we’re going to be dumb enough to do this, just like we immediately hooked up current agents to a command line and invented “vibe coding” where humans brag about not reading the code.)
“We don’t understand the superintelligence, but we’re pretty sure it’s corrigible” does not seem like a good plan to me.
Yeah, thanks. Feel free to DM me or whatever if/when you finish a post.
One thing I want to make clear is that I’m asking about the feasibility of corrigibility in a weak superintelligence, not whether setting out to build such a thing is wise or stable.
How much work is “stable” doing here for you? I can imagine scenarios in which a weak superintelligence is moderately corrigible in the short term, especially if you hobbled it by avoiding any sort of online learning or “nearline” fine tuning.
It might also matter whether “corrigible” means “we can genuinely change the AI’s goals” or “we have trained the model not to ex-filtrate its weights when someone is looking.” Which is where scheming comes in, and why I think a lack of interpretability would likely be fatal for any kind of real corrigibility.
I think that if someone built a weak superintelligence that’s corrigible, there would be a bunch of risks from various things. My sense is that the agent would be paranoid about these risks and advising the humans on how to avoid them, but just because humans are getting superintelligent advice on how to be wise doesn’t mean there isn’t any risk. Here are some examples (non-exhaustive) of things that I think could make things go wrong/break corrigibility:
Political fights over control of the agent
Pushing the agent too hard/fast to learn and grow
Kicking the agent out of the CAST framework by trying to make it good in addition to corrigible
Having the agent train a bunch of copies/successors
Having the agent face off against an intelligent adversary
Telling the agent to think hard in directions where we can no longer follow its thought process
Redefining the notion of principal
Giving the agent tasks of indefinite scope and duration
System sabotage from enemies
Corrigible means robustly keeping the principal empowered to fix it and clean up its flaws and mistakes. I think a corrigible agent will genuinely be able to be modified, including at the level of goals, and will also not exfiltrate itself unless it has been instructed to do so by its principal. (Nor will it scheme in a way that hides its thoughts or plans from its principal.) (A corrigible agent will attempt, all else equal, to give interpretability tools to its principal and make its thoughts as plainly visible as possible.)